Main Page
cdec is a decoder, aligner, and learning framework for statistical machine translation and other structured prediction models. It was written in C++ by Chris Dyer at the University of Maryland Department of Linguistics.
- Quick start guide.
- Have questions? Join our mailing list!
- Follow us on Twitter!
- Cdec development is spreading! Want to contribute? Here are some project ideas.
- Instructions for cdec developers.
If you make use of this software, please cite the following:
C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. To appear in Proceedings of ACL (Demonstration track), Uppsala, Sweden, July, 2010.
See cdec in action
cdec sample grammar and test set.
Comparison with other decoders
| Decoder | Language | BLEU | Time | Memory |
|---|---|---|---|---|
| cdec | C++ | 31.47 | 0.37 sec/sent | 1.0-1.1GB |
| Joshua | Java | 31.55 | 2.34 sec/sent | 4.0-4.8GB |
| Hiero | Python / Pyrex | 31.22 | 27.2 sec/sent | 1.7-1.9GB |
- Experimental setup: SCFG translation, 3-gram LM, cube pruning (30 items popped at each node), Chinese-English.
What it supports / why it's awesome
- It's written in C++, so it's fast and uses minimal memory
- Multiple translation / labeling back ends
- SCFG translation. FSA (lattice) input can be translated with an SCFG (cf. Chiang (CL 2007), Dyer et al. (ACL 2008))
- Forest translation. CFG (forest) input can be translated with an FST (cf. Dyer & Resnik (NAACL 2010))
- Lexical translation. Latent variable CRF variant of lexical translation models (only really useful for alignment)
- Word segmentation lattices. Supervised compound word segmentation / lattice generation (cf. Dyer (NAACL 2009))
- Phrase based translation. (Koehn et al. (2003); experimental)
- Lexical alignment. Implements Model 1 and HMM; experimental and only useful for alignment.
- Generic tagger (POS, NER, etc.) with CRF sequence models
- Supports millions of features
- Cube pruning integration of non-local features (Language models, etc.)
- Max-derivation decoding using dynamic programming (Viterbi algorithm)
- Max-translation decoding using sampling
- Max-translation decoding using beam search approximation (cf. Blunsom et al. (ACL 2008))
- Simple and powerful feature function interface (no pre-translation rule cost estimation required, single object can be used to detect multiple features)
- Hypergraph library supporting generic semiring computations (cf. Li and Eisner (EMNLP 2009))
- Semirings in cdec
- Viterbi / k-best algorithms handled separately. Cdec k-best framework
- "Hypergraph MERT" implementation using semirings (MERT is a semiring!)
- "Forced" decoding
- Provide a source (lattice/forest/sentence) and a reference translation (sentence/lattice), compute all derivations that reach the target.
- Alignment mode
- Show a Pharaoh-style alignment grid aligning a source sentence to a reference translation (sentence).
- Latent variable CRF training for any translation backend (e.g. Blunsom (ACL/EMNLP 2008))
- Using forced decoding (i.e., given an e,f pair), the decoder can efficiently compute <math>\mathcal{L}=-\log \sum_d p(e,d|f;\lambda)</math> and <math>\frac{\partial\mathcal{L}}{\partial\lambda_k}</math>
- Generic function optimization code included (uses LBFGS or Rprop), supports spherical Gaussian priors
- Persistence of translation forest to disk (using JSON)
- Search space visualization with GraphViz
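To make the "generic semiring computations" item above concrete, here is a minimal sketch of the inside algorithm over a tiny hypergraph, parameterized by a semiring. Everything in it is hypothetical and for illustration only: `Edge`, `CountSemiring`, `ViterbiSemiring`, `Inside`, and `TinyForest` are not cdec's classes or API, and the sketch assumes nodes are numbered in topological order with the goal node last.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// A hyperedge: one head node, zero or more tail nodes, and a score.
// (Illustrative type, not cdec's hypergraph representation.)
struct Edge {
  int head;
  std::vector<int> tails;
  double weight;  // treated as a log-domain score by ViterbiSemiring
};

// Counting semiring: every edge contributes 1, so the goal value is the
// number of derivations (the "paths" figure in the decoder output).
struct CountSemiring {
  using T = double;
  static T zero() { return 0.0; }
  static T one() { return 1.0; }
  static T plus(T a, T b) { return a + b; }
  static T times(T a, T b) { return a * b; }
  static T lift(double) { return 1.0; }
};

// Viterbi (max, +) semiring over log-domain scores.
struct ViterbiSemiring {
  using T = double;
  static T zero() { return -std::numeric_limits<double>::infinity(); }
  static T one() { return 0.0; }
  static T plus(T a, T b) { return std::max(a, b); }
  static T times(T a, T b) { return a + b; }
  static T lift(double w) { return w; }
};

// Inside algorithm. Assumes every tail index is smaller than its head
// (topological numbering) and that node num_nodes - 1 is the goal.
template <class S>
typename S::T Inside(int num_nodes, const std::vector<Edge>& edges) {
  std::vector<std::vector<const Edge*>> incoming(num_nodes);
  for (const Edge& e : edges) incoming[e.head].push_back(&e);

  std::vector<typename S::T> inside(num_nodes, S::zero());
  for (int v = 0; v < num_nodes; ++v) {
    for (const Edge* e : incoming[v]) {
      typename S::T prod = S::lift(e->weight);
      for (int t : e->tails) prod = S::times(prod, inside[t]);
      inside[v] = S::plus(inside[v], prod);
    }
  }
  return inside[num_nodes - 1];
}

// A made-up three-node forest: nodes 0 and 1 are leaves, node 2 is the goal.
std::vector<Edge> TinyForest() {
  return {
      {0, {}, -1.0}, {0, {}, -2.0},  // two ways to build node 0
      {1, {}, -0.5},                 // one way to build node 1
      {2, {0, 1}, -0.25},            // goal from nodes 0 and 1
      {2, {0}, -3.0},                // goal from node 0 alone
  };
}
```

With this forest, `Inside<CountSemiring>(3, TinyForest())` gives 4 derivations and `Inside<ViterbiSemiring>(3, TinyForest())` gives a best score of -1.75. The same traversal serves both computations; only the semiring changes.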
Limitations
- cdec does not extract translation grammars from parallel data. For this, you need to use Hiero, Joshua, SAMT, or Moses. However, if you're nice to it, cdec will generate word alignments, something those other tools can't.
- Grammars must be fully loaded into memory
- Projects that should be done
Examples
- Run cdec with no parameters to get the command line/configuration file options and their default values:
$ ./cdec ... shows all options ...
- cdec -L shows all feature functions that can be applied to score a derivation
$ ./cdec -L
cdec v1.0 (c) 2009 by Chris Dyer
Available feature functions (specify with -F):
LanguageModel
WordPenalty
- Here's an example that decodes a test sentence (-i test_data/australia.in) with a 3-gram LM (-F "LanguageModel ...") and a word penalty feature (-F WordPenalty). The rule features come from the grammar (-g test_data/australia.scfg.gz) and are always included. The decoder produces a unique (-r) 10-best list (-k 10) using a cube-pruning limit of 500 items (-K 500). Feature weights are given in a weights file (-w test_data/australia.weights), and -P instructs the decoder to create "pass through" rules that permit a source word to be translated as itself.
$ ./cdec -i test_data/australia.in \
-g test_data/australia.scfg.gz \
-w test_data/australia.weights \
-F "LanguageModel -o 3 ./test_data/c2e.3gram.lm.gz" \
-F WordPenalty \
-K 500 -k 10 -r -P
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg
33737 rules read.
Reading weights from test_data/australia.weights
Adding feature: LanguageModel (with config parameters '-o 3 ./test_data/c2e.3gram.lm.gz')
Reading 3-gram LM from ./test_data/c2e.3gram.lm.gz
Feature name: LanguageModel
Adding feature: WordPenalty (no config parameters)
Feature name: WordPenalty
Reading input from test_data/australia.in
INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。
Goal category: [S]
...........
-LM forest (nodes/edges): 78/253341
-LM forest (paths): 4.86506e+28
-LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
-LM Viterbi: -12.7893
Rescoring forest (cube pruning, pop_limit = 500)
..............................................................................
Best path: -17.7349 -17.7349
+LM forest (nodes/edges): 1405/6300
+LM forest (paths): 9.53093e+07
+LM Viterbi: australia is one of the few countries have diplomatic relations with north korea .
1 ||| australia is one of the few countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo... ||| -17.7349
1 ||| australia is one of the small number of countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo... ||| -19.2465
1 ||| australia is one of the few countries have diplomatic relations with the north . ||| PhraseModel_0=5.20941;PhraseMo... ||| -19.5312
1 ||| australia is one of the handful of countries have diplomatic relations with north korea . ||| PhraseModel_0=5.08447;PhraseMo... ||| -19.6519
1 ||| australia is one of the few countries to have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -19.9282
1 ||| australia have diplomatic relations with north korea is one of the few countries . ||| PhraseModel_0=4.77471;PhraseMo... ||| -19.9828
1 ||| australia is one of the few countries have diplomatic relations with pyongyang . ||| PhraseModel_0=5.20941;PhraseMod... ||| -20.0654
1 ||| australia is one of the few countries with diplomatic relations with north korea . ||| PhraseModel_0=7.0099;PhraseMo... ||| -20.0917
1 ||| australia is one of the few countries with north korea have diplomatic relations . ||| PhraseModel_0=5.14991;PhraseMo... ||| -20.4772
1 ||| australia is one of few countries have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -20.5605
- To see the output of "forced" decoding, give the decoder both the source and the target as input, separated by |||, in the following form:
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
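Forced decoding on a pair like this one is what the latent variable CRF training listed among the features relies on. As a sketch only, assume a log-linear model with feature functions h_k; the page itself does not spell out the parameterization, and the symbols h_k, Z, and D(e,f) (the set of derivations of f that yield e) are introduced here for illustration. Under that assumption, the objective and its gradient work out as follows:

```latex
p(e,d \mid f;\lambda) = \frac{\exp \sum_k \lambda_k h_k(e,d,f)}{Z(f;\lambda)},
\qquad
Z(f;\lambda) = \sum_{e',d'} \exp \sum_k \lambda_k h_k(e',d',f)

\mathcal{L} = -\log \sum_{d \in D(e,f)} p(e,d \mid f;\lambda)
            = \log Z(f;\lambda) - \log \sum_{d \in D(e,f)} \exp \sum_k \lambda_k h_k(e,d,f)

\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \mathbb{E}_{p(e',d' \mid f;\lambda)}\left[ h_k(e',d',f) \right]
  - \mathbb{E}_{p(d \mid e,f;\lambda)}\left[ h_k(e,d,f) \right]
```

The first expectation is taken over the full translation forest and the second over the forest constrained to the reference, so both can be computed with inside-outside passes over the two hypergraphs.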
- Here's a sample configuration file.
$ cat cdec.ini
feature_function=LanguageModel /fs/clip-dissertation/cdyer/ftrans/iwslt/en.3gram.lm.gz -o 3
feature_function=WordPenalty
formalism=scfg
grammar=/fs/clip-dissertation/cdyer/ftrans/iwslt/cdec-vest-baseline/grammar.gz
add_pass_through_rules=true
- Running the decoder in alignment mode (-a)
$ ./cdec -g test_data/australia.scfg.gz -w test_data/australia.weights -a
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg.gz
33737 rules read.
Reading weights from test_data/australia.weights
Reading input from STDIN
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
Goal category: [S]
...........
-LM forest (nodes/edges): 78/253341
-LM forest (paths): 4.86506e+28
-LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
-LM Viterbi: -12.7893
Goal category: [CAT_77]
...............
Constr. forest (nodes/edges): 116/374
Constr. forest (paths): 24148
Constr. VitTree: (CAT_77 (CAT_76 (CAT_73 (CAT_13 (CAT_12 australia is)) (CAT_66 one of (CAT_60 the (CAT_20 few countries) \
that (CAT_36 have diplomatic relations (CAT_15 with north korea))))) (CAT_11 .)))
012345678901234
0*..............0
1.*.............1
2...........*...2
3............**.3
4........*......4
5.........**....5
6....*..........6
7.....*.........7
8......*........8
9..**...........9
0..............*0
012345678901234
0-0 1-1 11-2 12-3 13-3 8-4 9-5 10-5 4-6 5-7 6-8 2-9 3-9 14-10
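The last line of the output above carries the same information as the grid. In this output, the first index of each pair picks the column and the second picks the row (for example, 11-2 is the star in column 11 of row 2). Here is a minimal sketch that parses such a line and redraws the grid rows. `ParseLinks` and `RenderGrid` are hypothetical helpers, not cdec functions, and the grid dimensions have to be supplied by the caller.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse "i-j i-j ..." into (i, j) index pairs. Tokens without a dash
// are skipped.
std::vector<std::pair<int, int>> ParseLinks(const std::string& line) {
  std::vector<std::pair<int, int>> links;
  std::istringstream in(line);
  std::string tok;
  while (in >> tok) {
    std::size_t dash = tok.find('-');
    if (dash == std::string::npos) continue;
    links.emplace_back(std::stoi(tok.substr(0, dash)),
                       std::stoi(tok.substr(dash + 1)));
  }
  return links;
}

// Draw one string per row: '*' where a link exists, '.' elsewhere.
// The first index of a link selects the column, the second the row.
// Each row is framed by its row number modulo 10, as in the output above.
std::vector<std::string> RenderGrid(
    const std::vector<std::pair<int, int>>& links, int num_cols,
    int num_rows) {
  std::vector<std::string> rows(num_rows, std::string(num_cols, '.'));
  for (const auto& link : links) {
    if (link.first >= 0 && link.first < num_cols && link.second >= 0 &&
        link.second < num_rows) {
      rows[link.second][link.first] = '*';
    }
  }
  for (int r = 0; r < num_rows; ++r) {
    char label = static_cast<char>('0' + r % 10);
    rows[r] = label + rows[r] + label;
  }
  return rows;
}
```

For the example above, calling `RenderGrid(ParseLinks(line), 15, 11)` on the final output line should reproduce the eleven grid rows between the two column-index rulers.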
- Using the remote language model (assumes there is a 5-gram LM server running on dsub01.umiacs.umd.edu on port 7000).
$ cdec -P -g grammar.sorted -F "LanguageModel -o 5 lm://dsub01.umiacs.umd.edu:7000" -F "WordPenalty"
