Google Summer of Code 2012 Project Page
- Python interface to decoder
- Improved Decoder interface
- Functionality to set weights and grammars, and to handle multipass decoding and two-phase synchronous decoding
- New translation back ends
- Tree-to-string decoder that takes a source parse tree (forest?) as input and uses the linear time top-down tree transducer algorithm described in Huang et al. (AMTA 2006) and subsequent papers
- "Hypergraph MBR" implementation. Algorithm description paper
- Come up with an "online decoding" solution
- sa-extractor/ contains Python/Cython code from Adam Lopez (EMNLP 2007) that generates grammars on-the-fly from a parallel corpus
- Integration with a project workflow system (recommended: Scala-based ducttape)
Code quality, efficiency
- Refactor top-level binary so that it also creates a Decoder object with a listener and implements its UI that way
- When representing weight vectors, use weight_t instead of double.
Refactor mteval scoring components
Assume that the sufficient statistics of all scores decompose linearly across the corpus (i.e., for BLEU: n-gram matches, hypothesis length, reference length; for TER: substitutions, matches, etc.) and use a single scoring object that is basically just a (key, value) map. Done!
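A minimal sketch of this idea, written in Python rather than cdec's C++: if a metric's sufficient statistics add up across segments, one map-like object can hold them for any metric, and the metric-specific part reduces to a function from the aggregated map to a score. The class `SuffStats`, the key names, and the simplified score (clipped unigram precision times a brevity penalty) are illustrative assumptions, not cdec's actual mteval code.

```python
import math
from collections import Counter


class SuffStats(Counter):
    """Hypothetical (key, value) map of additive sufficient statistics."""

    def __add__(self, other):
        # Unlike Counter.__add__, keep zero-valued keys and the subclass type.
        total = SuffStats(self)
        total.update(other)  # Counter.update adds values key by key
        return total


def segment_stats(hyp, ref):
    """Collect unigram statistics for one hypothesis/reference pair."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return SuffStats(match1=matches, hyp_len=len(hyp), ref_len=len(ref))


def toy_bleu1(stats):
    """Toy score (not full BLEU): unigram precision times brevity penalty."""
    if stats["hyp_len"] == 0:
        return 0.0
    precision = stats["match1"] / stats["hyp_len"]
    bp = min(1.0, math.exp(1.0 - stats["ref_len"] / stats["hyp_len"]))
    return bp * precision


corpus = [
    ("the cat sat".split(), "the cat sat down".split()),
    ("a dog".split(), "a dog".split()),
]
# Corpus-level statistics are just the sum of the per-segment maps.
total = sum((segment_stats(h, r) for h, r in corpus), SuffStats())
score = toy_bleu1(total)
```

Because aggregation is plain addition over keys, a different metric (TER, for example) would only need its own `segment_stats` and scoring function; the container stays the same.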
- Edges currently duplicate the feature values from their rules. It would be better if edges held only the feature values specific to that edge, and rule features were stored only on the rule object.
- Come up with a universal hypergraph representation format (get rid of slow, crappy JSON format)
- Should have a finite state parser generated using flex (or similar)
- Should be similar to the SLF lattice format (identical?)
- Should coordinate with moses and joshua people
- Use a profiler! cdec is fast but hasn't been profiled. (The open-source Google CPU profiler is excellent for C++.)
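The edge/rule feature split proposed in the list above could look roughly like the following. This is a Python sketch with hypothetical `Rule` and `Edge` classes and made-up feature names and rule text, not cdec's C++ data structures: each rule stores its features once, each edge stores only edge-specific features, and the combined vector is built on demand.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule object; its features are stored once and shared."""
    text: str
    features: tuple = ()  # ((name, value), ...) pairs


@dataclass
class Edge:
    """Hypothetical edge that references its rule instead of copying features."""
    rule: Rule
    edge_features: dict = field(default_factory=dict)

    def feature_values(self):
        """Combine rule and edge-specific features only when needed."""
        combined = dict(self.rule.features)
        for name, value in self.edge_features.items():
            combined[name] = combined.get(name, 0.0) + value
        return combined


def dot(weights, features):
    """Model score of a sparse feature vector under a weight vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Illustrative rule and feature names (assumptions, not taken from cdec).
rule = Rule(text="[X] ||| maison ||| house", features=(("PhraseModel_0", -0.5),))
edges = [
    Edge(rule, {"LanguageModel": -1.0}),
    Edge(rule, {"LanguageModel": -2.0}),
]
weights = {"PhraseModel_0": 1.0, "LanguageModel": 0.5}
scores = [dot(weights, e.feature_values()) for e in edges]
```

Both edges point at the same `Rule` instance, so the rule's features exist once in memory no matter how many edges use the rule. The cost is a small merge step whenever an edge's full feature vector is needed.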