Cdec Development Guide
cdec is hosted on github.com. Developers wishing to contribute should follow roughly these steps:
- Start by forking the official cdec repository, and develop against your own fork.
- Changes are committed to the official repository by issuing pull requests.
- Before issuing a pull request, make sure you have merged in the most recent version of the official repository.
- Here are some instructions for building cdec from source.
- cdec uses a derivative of the Google C++ coding style. Read it!
- cdec's differences from the official Google style:
  - Use of the Boost libraries is encouraged
  - Feel free to go over 80 characters per line, but not horribly over
  - Commented-out code is permitted
  - Don't write -inl.h files. If the header definition becomes difficult to read, put the inline function implementations at the end of the file.
  - Virtual functions should be protected and called from (public) non-virtual functions (the template method pattern)
  - Do not write inline implementations of virtual destructors or virtual functions
- Really, go read the style guide. All of it.
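The two rules about virtual functions are easier to see in code. The following is only a sketch: Scorer, LengthScorer, Score, and ScoreImpl are hypothetical names invented for illustration and are not part of cdec. The header and implementation parts are shown in one listing for brevity.

```cpp
#include <cassert>
#include <vector>

// ---- would live in a header (hypothetical class, not part of cdec) ----
class Scorer {
 public:
  virtual ~Scorer();  // declared only; defined out of line below

  // Public, non-virtual entry point. Behavior shared by every subclass
  // (here, the empty-input check) lives in exactly one place.
  double Score(const std::vector<int>& words) const {
    if (words.empty()) return 0.0;
    return ScoreImpl(words);  // the customizable step
  }

 protected:
  // Subclasses override this; outside callers cannot invoke it directly.
  virtual double ScoreImpl(const std::vector<int>& words) const = 0;
};

class LengthScorer : public Scorer {
 protected:
  double ScoreImpl(const std::vector<int>& words) const override;
};

// ---- would live in the corresponding .cc file ----
// Virtual destructor and virtual function bodies are not written inline.
Scorer::~Scorer() {}

double LengthScorer::ScoreImpl(const std::vector<int>& words) const {
  return static_cast<double>(words.size());
}
```

Callers only ever see Score(), so the base class can later add checks or logging around ScoreImpl() without touching any subclass.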
A word on types
I dislike opaquely named types like Sentence: although they tell the developer what something means, they give no information about how to use the API. To access the first word, does one write s.GetFirstWord(), or s[0], or s.words(), or what? In cdec, use the following philosophy: the variable name should tell the developer what the variable (and by extension, the type) means, and the type name should give clues about how to access it. Therefore, I favor types like vector<WordID> for "sentences", since the vector<T> interface is common knowledge and well defined in the context of "sentences".
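As a small illustration of this philosophy, here is a sketch in which the parameter name carries the meaning ("sentence") and the type carries the interface. The typedef for WordID is an assumption made so the snippet compiles on its own, and FirstWord and CountWord are hypothetical helpers, not cdec functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: WordID is an integer id for a vocabulary entry. This
// typedef is a stand-in so the sketch is self-contained.
typedef int WordID;

// Hypothetical helper: the name "sentence" says what the value means,
// and vector<WordID> says how to use it (front(), [], begin(), end()).
WordID FirstWord(const std::vector<WordID>& sentence) {
  assert(!sentence.empty());
  return sentence.front();
}

// Hypothetical helper: standard algorithms work with no wrapper API.
std::size_t CountWord(const std::vector<WordID>& sentence, WordID w) {
  return static_cast<std::size_t>(
      std::count(sentence.begin(), sentence.end(), w));
}
```

Nobody has to look up how to iterate over, index, or measure a "sentence", because every C++ developer already knows vector<T>.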
As a corollary to this, when it is necessary to define wholly new types (for example, prob_t, which represents values by their logarithms), their interfaces should be "standard":
- Implement standard operators such as +, -, *, and ==, where applicable
- Use begin() and end() to access iterators
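To make the idea of a "standard" interface concrete, here is a minimal sketch of a log-domain value type. LogValue is a hypothetical class written for this example; it is not cdec's prob_t, and it assumes strictly positive inputs. The point is that client code uses ordinary multiplication, addition, and comparison and never has to know about the logarithm inside.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical type, NOT cdec's prob_t. Stores a positive value as its
// natural logarithm so that long products do not underflow.
class LogValue {
 public:
  LogValue() : log_(0.0) {}                           // represents 1.0
  explicit LogValue(double p) : log_(std::log(p)) {}  // assumes p > 0

  double value() const { return std::exp(log_); }
  double log_value() const { return log_; }

  // Multiplying values means adding logs.
  LogValue& operator*=(const LogValue& o) {
    log_ += o.log_;
    return *this;
  }

  // Adding values: log(a + b) = hi + log1p(exp(lo - hi)).
  LogValue& operator+=(const LogValue& o) {
    const double hi = log_ > o.log_ ? log_ : o.log_;
    const double lo = log_ > o.log_ ? o.log_ : log_;
    log_ = hi + std::log1p(std::exp(lo - hi));
    return *this;
  }

 private:
  double log_;
};

inline LogValue operator*(LogValue a, const LogValue& b) { return a *= b; }
inline LogValue operator+(LogValue a, const LogValue& b) { return a += b; }
inline bool operator==(const LogValue& a, const LogValue& b) {
  return a.log_value() == b.log_value();
}
inline bool operator<(const LogValue& a, const LogValue& b) {
  return a.log_value() < b.log_value();  // log is monotonic
}
```

With these operators in place, an expression such as a * b + c reads exactly as it would for double.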
- Don't over-engineer things: it's better to refactor later than to make things overly abstract up front
- Make something abstract if you know for sure there will be more than two of them
- Duplicated code is bad, but code with unintuitive hierarchies / call graphs is worse
- Do refactor
- Use object hierarchies and polymorphism sparingly
- Do use templates (generics) in places where speed matters
- Minimize #includes in header files
- Minimize heap memory allocations (calls to new, either explicit or implicit) in code that will be called often
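Two of the points above, templates where speed matters and few heap allocations in hot code, can be sketched as follows. All names here (SumScores, SquareScorer, PrefixSums) are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Static dispatch: ScoreFn is any type callable as double(int). The call
// is resolved at compile time, so it can be inlined in the inner loop,
// unlike a call through a virtual function.
template <class ScoreFn>
double SumScores(const std::vector<int>& items, const ScoreFn& score) {
  double total = 0.0;
  for (std::size_t i = 0; i < items.size(); ++i) total += score(items[i]);
  return total;
}

struct SquareScorer {
  double operator()(int x) const { return static_cast<double>(x) * x; }
};

// Writes running totals into a caller-owned buffer. clear() keeps the
// buffer's capacity, so calling this repeatedly with inputs of similar
// size should rarely need to touch the heap after the first call.
void PrefixSums(const std::vector<int>& items, std::vector<double>* out) {
  out->clear();
  out->reserve(items.size());
  double running = 0.0;
  for (std::size_t i = 0; i < items.size(); ++i) {
    running += items[i];
    out->push_back(running);
  }
}
```

The alternative to PrefixSums, returning a freshly constructed vector on every call, is often fine; reusing a buffer is only worth the less convenient signature in code that is called very often.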
Unit tests and system tests
- Unit tests are small binaries that exercise small amounts of code in a very controlled way
- System tests run cdec with particular inputs to verify that the decoder works as a whole
- Unit tests (mostly) require the Google testing framework to run
- System tests have no extra dependencies: if you have a cdec binary, you can run the system tests
- Write both kinds and run them often!
- Every model (translation model or rescoring model) should be exercised by at least one system test.
- Run system tests with the following command:
- To add a new system test:
  - Create a new directory under tests/system_tests, named after the test you wish to run
  - In this directory, create a configuration file, a feature weights file, some sample input, and any supporting data files that are necessary (refer to an existing test for the formats)
  - Also create the expected outputs (refer to an existing test for the formats)
If you're new to C++, debugging segmentation faults (SEGVs) can be a bit challenging. Here are some tips.
- Use gdb (the GNU debugger) to help pinpoint the location of the crash
  - The where command shows you where the program crashed
  - print var can be used to inspect the value of the variable var
- Turning off compiler optimization (grep for the string O2 in the generated Makefile, change it to O0, and rebuild) can make inspecting the program in the debugger easier.
- Heap corruption is a particularly nasty problem to track down. It happens when you write to unallocated memory or to memory that has already been freed. However, crashes often don't occur until much later (for example, the next time memory is allocated), making the source of the problem hard to identify. Fortunately, there are some very good memory debuggers that can help you identify the locus of the original problem. I've had very good luck using Electric Fence.
Dependencies on third-party software
- In general, introducing dependencies on other software packages (i.e., beyond SRILM and Boost) should be avoided.
- If some dependency is necessary, it should be configured using autoconf and, if at all possible, made optional.