Main Page
cdec is a decoder, aligner, and learning framework for statistical machine translation and other structured prediction models. It is written by Chris Dyer and many volunteers and collaborators.
- Instructions for downloading and building cdec.
- Have questions? Join our mailing list! Follow us on Twitter!
Why cdec?
cdec is a mature software platform for research on and development of translation models and algorithms. Its architecture (described here) was developed with machine learning and algorithmic research use-cases in mind. It is designed to run efficiently in environments ranging from limited-resource machines (single processor, limited memory) to very large clusters (using MPI and MapReduce).
Citation
If you make use of this software, please cite the following:
- C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of ACL, July, 2010. [bibtex]
Acknowledgements
cdec development has been and continues to be supported by:
- The GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-2-001
- The U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533
- The National Science Foundation through grants IIS-0844507, IIS-0915187, IIS-0713402, and IIS-0915327
- The Computational Linguistics and Information Processing Lab, University of Maryland
- Noah's ARK Research Group, Language Technologies Institute, Carnegie Mellon University
Cdec sample grammar and test set
User's guide
- How to run MERT, PRO, or Rampion
- How to run Suffix Array Grammar Extraction
- Evaluating your translations
Best practices
- Instead of building one large grammar, use a per-sentence grammar for each sentence you want to translate.
What it supports
- It's written in C++ for speed and a small memory footprint
- Multiple translation formalisms
- SCFG translation. FSA (lattice) input can be translated with an SCFG (cf. Chiang (CL 2007), Dyer et al. (ACL 2008))
- Forest translation. CFG (forest) input can be translated with an FST (cf. Dyer & Resnik (NAACL 2010))
- Lexical translation / Word alignment. Latent variable CRF variant of lexical translation models (cf. Dyer et al. (ACL 2011))
- Word segmentation lattices. Supervised compound word segmentation / lattice generation with semi-CRFs (cf. Dyer (NAACL 2009))
- Phrase based translation. (Koehn et al. (2003); experimental)
- Generic sequence tagger (POS, NER, etc.) with Markovian sequence models
- Supports models with millions of features
- Cube pruning integration of non-local features (Language models, etc.). See Feature functions for rescoring an SCFG translation forest
- Multi-pass scoring, enabling costly feature functions to be computed only over promising translation hypotheses
- Various decoding decision rules
- Max-derivation decoding using dynamic programming (Viterbi algorithm)
- Max-translation decoding using sampling
- Max-translation decoding using a beam search approximation (cf. Blunsom et al. (ACL 2008))
- Simple and powerful feature function interface (no pre-translation rule cost estimation required, single object can be used to detect multiple features)
- Hypergraph library supporting generic semiring computations (cf. Li and Eisner (EMNLP 2009))
- Semirings in cdec
- Viterbi / k-best algorithms handled separately. Cdec k-best framework
- "Hypergraph MERT" implementation using semirings (MERT is a semiring!)
- "Forced" decoding
- Provide a source (lattice/forest/sentence) and a reference translation (sentence/lattice), compute all derivations that reach the target.
- Alignment functionality
- Show a Pharaoh-style alignment grid aligning a source sentence to a reference translation (sentence).
- MAP alignment
- Viterbi alignment
- k-best Viterbi alignment
- Language model notes
- Latent variable CRF training for any translation backend (e.g. Blunsom (ACL/EMNLP 2008))
- Using forced decoding (i.e., given an e,f pair), the decoder can efficiently compute <math>\mathcal{L}=-\log \sum_d p(e,d|f;\lambda)</math> and <math>\frac{\partial\mathcal{L}}{\partial\lambda_k}</math>
- Generic function optimization code included (uses LBFGS or Rprop), supports spherical Gaussian priors
- Online optimizer with L1 penalty also included
- Persistence of translation forest to disk (using JSON)
- Search space visualization with GraphViz
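To make the "generic semiring computations" item in the list above more concrete, here is a minimal sketch in Python. It is not cdec's C++ hypergraph API: the names `Semiring`, `inside`, `VITERBI`, `LOG`, and the toy hypergraph are all assumptions made up for this illustration. The point is that one inside recursion gives the max-derivation score under a max-plus semiring and the log of the summed derivation weights (the kind of quantity needed for the latent variable objective) under a log-sum semiring.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Semiring:
    """A semiring over floats: identities plus the two operations."""
    zero: float
    one: float
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


# Both semirings work on log-weights, so "times" is addition.
VITERBI = Semiring(-math.inf, 0.0, max, lambda a, b: a + b)
LOG = Semiring(-math.inf, 0.0, log_add, lambda a, b: a + b)

# A hyperedge is (head node, tail nodes, log-weight).
Edge = Tuple[int, Tuple[int, ...], float]


def inside(num_nodes: int, edges: List[Edge], sr: Semiring) -> List[float]:
    """Inside score of every node; nodes must be numbered so tails precede heads."""
    incoming = [[] for _ in range(num_nodes)]
    for head, tails, weight in edges:
        incoming[head].append((tails, weight))
    score = [sr.zero] * num_nodes
    for node in range(num_nodes):
        total = sr.zero
        for tails, weight in incoming[node]:
            prod = weight
            for t in tails:
                prod = sr.times(prod, score[t])
            total = sr.plus(total, prod)
        score[node] = total
    return score


# Toy hypergraph (hypothetical): nodes 0 and 1 are leaves, node 2 is the goal.
EDGES: List[Edge] = [
    (0, (), math.log(0.6)),
    (0, (), math.log(0.4)),
    (1, (), math.log(0.5)),
    (2, (0, 1), 0.0),
    (2, (0,), math.log(0.1)),
]

best = math.exp(inside(3, EDGES, VITERBI)[2])   # weight of the best derivation
total = math.exp(inside(3, EDGES, LOG)[2])      # summed weight of all derivations
print(round(best, 6), round(total, 6))
```

The goal node has four derivations with weights 0.3, 0.2, 0.06, and 0.04, so the two calls differ only in the semiring passed in: the first keeps the maximum, the second the sum.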
Encoding and input format
- UTF-8 encoding should be used with cdec
- Input may be wrapped in <seg> tags. Read about SGML input markup.
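As a small illustration of the two points above, the following Python sketch writes sentences to a UTF-8 file with each one wrapped in a <seg> tag. The helper names `wrap_segments` and `write_input` are hypothetical, and the `id` attribute is an assumption about the markup; check the SGML input markup page for the attributes cdec actually accepts.

```python
from typing import Iterable, Iterator


def wrap_segments(sentences: Iterable[str]) -> Iterator[str]:
    """Yield one <seg>-wrapped line per sentence (the id attribute is an assumption)."""
    for i, sentence in enumerate(sentences):
        yield '<seg id="{}">{}</seg>'.format(i, sentence.strip())


def write_input(path: str, sentences: Iterable[str]) -> None:
    """Write the wrapped sentences, one per line, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        for line in wrap_segments(sentences):
            out.write(line + "\n")


for line in wrap_segments(["澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。"]):
    print(line)
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.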
Developers
- Cdec development is spreading! Want to contribute? Here are some project ideas.
- Instructions for cdec developers.
Limitations
- cdec provides only limited support for extracting translation grammars from parallel data. The freely available Joshua and Moses provide much more feature-rich grammar extractors.
- cdec has no experiment management pipeline of its own. Loony bin provides support for cdec.
- cdec is designed for batch, not online operation.
- Projects that should be done
Examples
- Run cdec with no parameters to get the command line/configuration file options and their default values:
$ ./cdec
... shows all options ...
- cdec -L shows all feature functions that can be applied to score a derivation
$ ./cdec -L
cdec v1.0 (c) 2009-2011 by Chris Dyer
Available feature functions (specify with -F):
KLanguageModel
WordPenalty
...
- The following example decodes a test sentence (-i test_data/australia.in) with a 3-gram LM (-F "KLanguageModel ...") and a word penalty feature (-F WordPenalty), in addition to the rule features, which come from the grammar (-g test_data/australia.scfg.gz) and are always included. It produces a unique (-r) 10-best list (-k 10) using a cube-pruning limit of 500 items (-K 500). Feature weights are given in a weights file (-w test_data/australia.weights), and the decoder is instructed to create "pass through" rules that permit a source word to be translated as itself (-P).
$ ./cdec -i test_data/australia.in \
-g test_data/australia.scfg.gz \
-w test_data/australia.weights \
-F "KLanguageModel -o 3 ./test_data/c2e.3gram.lm.gz" \
-F WordPenalty \
-K 500 -k 10 -r -P
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg
33737 rules read.
Reading weights from test_data/australia.weights
Adding feature: LanguageModel (with config parameters '-o 3 ./test_data/c2e.3gram.lm.gz')
Reading 3-gram LM from ./test_data/c2e.3gram.lm.gz
Feature name: LanguageModel
Adding feature: WordPenalty (no config parameters)
Feature name: WordPenalty
Reading input from test_data/australia.in
INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。
Goal category: [S]
...........
-LM forest (nodes/edges): 78/253341
-LM forest (paths): 4.86506e+28
-LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
-LM Viterbi: -12.7893
Rescoring forest (cube pruning, pop_limit = 500)
..............................................................................
Best path: -17.7349 -17.7349
+LM forest (nodes/edges): 1405/6300
+LM forest (paths): 9.53093e+07
+LM Viterbi: australia is one of the few countries have diplomatic relations with north korea .
1 ||| australia is one of the few countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo... ||| -17.7349
1 ||| australia is one of the small number of countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo...||| -19.2465
1 ||| australia is one of the few countries have diplomatic relations with the north . ||| PhraseModel_0=5.20941;PhraseMo... ||| -19.5312
1 ||| australia is one of the handful of countries have diplomatic relations with north korea . ||| PhraseModel_0=5.08447;PhraseMo... ||| -19.6519
1 ||| australia is one of the few countries to have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -19.9282
1 ||| australia have diplomatic relations with north korea is one of the few countries . ||| PhraseModel_0=4.77471;PhraseMo... ||| -19.9828
1 ||| australia is one of the few countries have diplomatic relations with pyongyang . ||| PhraseModel_0=5.20941;PhraseMod... ||| -20.0654
1 ||| australia is one of the few countries with diplomatic relations with north korea . ||| PhraseModel_0=7.0099;PhraseMo... ||| -20.0917
1 ||| australia is one of the few countries with north korea have diplomatic relations . ||| PhraseModel_0=5.14991;PhraseMo... ||| -20.4772
1 ||| australia is one of few countries have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -20.5605
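The k-best lines above appear to have four fields separated by `|||`: a sentence identifier, the translation, a `;`-separated list of `name=value` features, and the model score. Assuming that layout, a minimal Python sketch for reading such lines might look like this. `KBestEntry` and `parse_kbest_line` are hypothetical names, and the sample line uses invented feature names and values because the feature strings in the listing are truncated.

```python
from typing import Dict, NamedTuple


class KBestEntry(NamedTuple):
    sent_id: str
    translation: str
    features: Dict[str, float]
    score: float


def parse_kbest_line(line: str) -> KBestEntry:
    """Split one 'id ||| translation ||| features ||| score' line into its parts."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got {}".format(len(fields)))
    sent_id, translation, feature_text, score_text = fields
    features: Dict[str, float] = {}
    for item in feature_text.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        name, _, value = item.partition("=")
        features[name] = float(value)
    return KBestEntry(sent_id, translation, features, float(score_text))


# Invented sample line, for illustration only.
entry = parse_kbest_line("0 ||| a b c ||| FeatA=1.5;FeatB=-2 ||| -0.5")
print(entry.translation, entry.features, entry.score)
```

Keeping the features as a dictionary makes it easy to rescore an entry against a different weight vector.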
- To see the output of "forced" decoding, give the decoder both the source and the target as input, separated by |||, in the following form:
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
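When many sentence pairs need to be prepared in this form, a small helper can build the lines. The following Python sketch is an illustration only, and `forced_decoding_line` is a hypothetical name. It assumes that tokens are separated by single spaces and that neither side may itself contain the `|||` separator.

```python
def forced_decoding_line(source: str, reference: str) -> str:
    """Join a source sentence and its reference as 'source ||| reference'."""
    for label, text in (("source", source), ("reference", reference)):
        if "|||" in text:
            raise ValueError("{} must not contain '|||'".format(label))
    # Collapse runs of whitespace so tokens are separated by single spaces.
    src = " ".join(source.split())
    ref = " ".join(reference.split())
    return "{} ||| {}".format(src, ref)


print(forced_decoding_line("澳洲 是 。", "australia is ."))
```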
- Here's a sample configuration file.
$ cat cdec.ini
feature_function=LanguageModel /fs/clip-dissertation/cdyer/ftrans/iwslt/en.3gram.lm.gz -o 3
feature_function=WordPenalty
formalism=scfg
grammar=/fs/clip-dissertation/cdyer/ftrans/iwslt/cdec-vest-baseline/grammar.gz
add_pass_through_rules=true
- Running the decoder in alignment mode (-a):
$ ./cdec -g test_data/australia.scfg.gz -w test_data/australia.weights -a
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg.gz
33737 rules read.
Reading weights from test_data/australia.weights
Reading input from STDIN
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
Goal category: [S]
...........
-LM forest (nodes/edges): 78/253341
-LM forest (paths): 4.86506e+28
-LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
-LM Viterbi: -12.7893
Goal category: [CAT_77]
...............
Constr. forest (nodes/edges): 116/374
Constr. forest (paths): 24148
Constr. VitTree: (CAT_77 (CAT_76 (CAT_73 (CAT_13 (CAT_12 australia is)) (CAT_66 one of (CAT_60 the (CAT_20 few countries) \
that (CAT_36 have diplomatic relations (CAT_15 with north korea))))) (CAT_11 .)))
 012345678901234
0*..............0
1.*.............1
2...........*...2
3............**.3
4........*......4
5.........**....5
6....*..........6
7.....*.........7
8......*........8
9..**...........9
0..............*0
 012345678901234
0-0 1-1 11-2 12-3 13-3 8-4 9-5 10-5 4-6 5-7 6-8 2-9 3-9 14-10
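The last output line and the grid carry the same information. Judging from this example alone, each `a-b` link puts a star in column `a` of row `b`, with rows corresponding to the 11 source tokens and columns to the 15 reference tokens. Under that reading, which is an inference from the output above and not a documented format, the grid can be rebuilt from the link list with the following Python sketch. `parse_links` and `render_grid` are hypothetical names.

```python
from typing import List, Tuple


def parse_links(text: str) -> List[Tuple[int, int]]:
    """Parse 'a-b' tokens into (column, row) pairs."""
    links = []
    for token in text.split():
        col, row = token.split("-")
        links.append((int(col), int(row)))
    return links


def render_grid(links: List[Tuple[int, int]], n_cols: int, n_rows: int) -> str:
    """Draw a star grid with last-digit rulers on all four sides."""
    cells = set(links)
    ruler = " " + "".join(str(c % 10) for c in range(n_cols))
    lines = [ruler]
    for r in range(n_rows):
        label = str(r % 10)
        body = "".join("*" if (c, r) in cells else "." for c in range(n_cols))
        lines.append(label + body + label)
    lines.append(ruler)
    return "\n".join(lines)


LINKS = parse_links("0-0 1-1 11-2 12-3 13-3 8-4 9-5 10-5 4-6 5-7 6-8 2-9 3-9 14-10")
print(render_grid(LINKS, n_cols=15, n_rows=11))
```

The rulers print only the last digit of each index, which is why the final row is labeled 0 and not 10.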
