Main Page

Revision as of 00:07, 4 February 2012 by Admin

cdec is a decoder, aligner, and learning framework for statistical machine translation and other structured prediction models. It is written by Chris Dyer and many volunteers and collaborators.

Why cdec?

cdec is a mature software platform for research on and development of translation models and algorithms. Its architecture (described here) was developed with machine learning and algorithmic research use cases in mind. It is designed to run efficiently in environments ranging from limited-resource machines (single processor, limited memory) to very large clusters (using MPI and MapReduce).

Citation

If you make use of this software, please cite the following:

Acknowledgements

cdec development has been and continues to be supported by:

Cdec sample grammar and test set

User's guide

Best practices

What it supports

  • Written in C++ for speed and low memory use
  • Multiple translation formalisms
  • Supports models with millions of features
  • Cube pruning integration of non-local features (language models, etc.). See Feature functions for rescoring an SCFG translation forest
  • Multi-pass scoring, so that costly feature functions are computed only over promising translation hypotheses
  • Various decoding decision rules
    • Max-derivation decoding using dynamic programming (Viterbi algorithm)
    • Max-translation decoding using sampling
    • Max-translation decoding using a beam search approximation (cf. Blunsom et al. (ACL 2008))
  • Simple and powerful feature function interface (no pre-translation rule cost estimation required, single object can be used to detect multiple features)
  • Hypergraph library supporting generic semiring computations (cf. Li and Eisner (EMNLP 2009))
  • "Forced" decoding
    • Provide a source (lattice/forest/sentence) and a reference translation (sentence/lattice), compute all derivations that reach the target.
  • Alignment functionality
    • Show a Pharaoh-style alignment grid aligning a source sentence to a reference translation (sentence).
    • MAP alignment
    • Viterbi alignment
    • k-best Viterbi alignment
  • Language model notes
  • Latent variable CRF training for any translation backend (e.g. Blunsom (ACL/EMNLP 2008))
    • Using forced decoding (i.e., given an e,f pair), the decoder can efficiently compute <math>\mathcal{L}=-\log \sum_d p(e,d|f;\lambda)</math> and <math>\frac{\partial\mathcal{L}}{\partial\lambda_k}</math>
    • Generic function optimization code included (uses LBFGS or Rprop), supports spherical Gaussian priors
    • Online optimizer with L1 penalty also included
  • Persistence of translation forest to disk (using JSON)
  • Search space visualization with Graphviz
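
To make the idea of generic semiring computations over a hypergraph concrete, here is a minimal Python sketch. It is not cdec's C++ hypergraph library: the `Semiring` and `Edge` types, the `inside` function, and the toy forest are all hypothetical names and data chosen for illustration. The same inside recursion is run with three semirings: a Viterbi semiring (score of the best derivation), a log-sum semiring (log of the summed derivation scores, the kind of sum that appears in the training objective above), and a counting semiring (number of derivations).

```python
import math
from collections import namedtuple

# A semiring bundles two operations with their identity elements.
Semiring = namedtuple("Semiring", "plus times zero one")
# A hyperedge derives `head` from the nodes in `tails` with a log-domain score.
Edge = namedtuple("Edge", "head tails score")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


# Log-domain scores: "times" is addition in all score-based semirings.
VITERBI = Semiring(max, lambda a, b: a + b, -math.inf, 0.0)
LOG_SUM = Semiring(log_add, lambda a, b: a + b, -math.inf, 0.0)
COUNT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)


def inside(num_nodes, edges, semiring, weight):
    """Inside value of the last node.

    Assumes nodes are numbered in topological order, so every tail
    of an edge has a smaller index than the edge's head.
    """
    incoming = [[] for _ in range(num_nodes)]
    for edge in edges:
        incoming[edge.head].append(edge)
    value = [semiring.zero] * num_nodes
    for node in range(num_nodes):
        total = semiring.zero
        for edge in incoming[node]:
            product = semiring.times(semiring.one, weight(edge))
            for tail in edge.tails:
                product = semiring.times(product, value[tail])
            total = semiring.plus(total, product)
        value[node] = total
    return value[-1]


# Hypothetical toy forest: nodes 0 and 1 are leaves, node 2 is the goal.
EDGES = [
    Edge(0, (), -1.0),
    Edge(0, (), -2.0),
    Edge(1, (), -0.5),
    Edge(2, (0, 1), -0.1),
    Edge(2, (1,), -3.0),
]

best = inside(3, EDGES, VITERBI, lambda e: e.score)
log_total = inside(3, EDGES, LOG_SUM, lambda e: e.score)
num_paths = inside(3, EDGES, COUNT, lambda e: 1)
print(best, log_total, num_paths)
```

Only the semiring argument changes between the three calls, which is the point of the abstraction: the traversal is written once and the quantity computed depends on the operations plugged in.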

Encoding and input format

Developers

Limitations

  • cdec provides only limited support for extracting translation grammars from parallel data. The freely available Joshua and Moses toolkits provide much more feature-rich grammar extractors.
  • There is no experiment management pipeline, although LoonyBin provides support for cdec.
  • cdec is designed for batch, not online operation.
  • Projects that should be done

Examples

  • Run cdec with no parameters to get the command line/configuration file options and their default values:
$ ./cdec
... shows all options ...
  • cdec -L shows all feature functions that can be applied to score a derivation
$ ./cdec -L
cdec v1.0 (c) 2009-2011 by Chris Dyer
Available feature functions (specify with -F):
  KLanguageModel
  WordPenalty
  ...
  • Here's an example that decodes a test sentence (-i test_data/australia.in) with a 3-gram language model (-F "KLanguageModel ...") and a word penalty feature (-F WordPenalty). The rule features come from the grammar (-g test_data/australia.scfg.gz) and are always included. The decoder produces a unique (-r) 10-best list (-k 10) using a cube pruning pop limit of 500 items (-K 500). Feature weights are read from a weights file (-w test_data/australia.weights), and -P tells the decoder to create "pass through" rules that allow a source word to be translated as itself.
$ ./cdec -i test_data/australia.in \
         -g test_data/australia.scfg.gz \
         -w test_data/australia.weights \
         -F "KLanguageModel -o 3 ./test_data/c2e.3gram.lm.gz" \
         -F WordPenalty \
         -K 500 -k 10 -r -P
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg
  33737 rules read.
Reading weights from test_data/australia.weights
Adding feature: LanguageModel (with config parameters '-o 3 ./test_data/c2e.3gram.lm.gz')
Reading 3-gram LM from ./test_data/c2e.3gram.lm.gz
  Feature name: LanguageModel
Adding feature: WordPenalty (no config parameters)
  Feature name: WordPenalty
Reading input from test_data/australia.in

INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。
  Goal category: [S]
    ...........
  -LM forest (nodes/edges): 78/253341
  -LM forest       (paths): 4.86506e+28
  -LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
  -LM Viterbi: -12.7893
  Rescoring forest (cube pruning, pop_limit = 500)
    ..............................................................................
  Best path: -17.7349	-17.7349
  +LM forest (nodes/edges): 1405/6300
  +LM forest       (paths): 9.53093e+07
  +LM Viterbi: australia is one of the few countries have diplomatic relations with north korea .

1 ||| australia is one of the few countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo... ||| -17.7349
1 ||| australia is one of the small number of countries have diplomatic relations with north korea . ||| PhraseModel_0=4.60735;PhraseMo...||| -19.2465
1 ||| australia is one of the few countries have diplomatic relations with the north . ||| PhraseModel_0=5.20941;PhraseMo... ||| -19.5312
1 ||| australia is one of the handful of countries have diplomatic relations with north korea . ||| PhraseModel_0=5.08447;PhraseMo... ||| -19.6519
1 ||| australia is one of the few countries to have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -19.9282
1 ||| australia have diplomatic relations with north korea is one of the few countries . ||| PhraseModel_0=4.77471;PhraseMo... ||| -19.9828
1 ||| australia is one of the few countries have diplomatic relations with pyongyang . ||| PhraseModel_0=5.20941;PhraseMod... ||| -20.0654
1 ||| australia is one of the few countries with diplomatic relations with north korea . ||| PhraseModel_0=7.0099;PhraseMo... ||| -20.0917
1 ||| australia is one of the few countries with north korea have diplomatic relations . ||| PhraseModel_0=5.14991;PhraseMo... ||| -20.4772
1 ||| australia is one of few countries have diplomatic relations with north korea . ||| PhraseModel_0=5.96908;PhraseMo... ||| -20.5605
  • To see the output of "forced" decoding, give the decoder both the source and the target as input, separated by |||, in the following form:
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
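
When preparing many such source/reference pairs, the input lines can be generated programmatically. This is a small Python sketch under the assumption that both sides are already tokenized; the helper names `make_forced_input` and `split_forced_input` are hypothetical and are not part of cdec.

```python
SEPARATOR = " ||| "


def make_forced_input(source_tokens, reference_tokens):
    """Join tokenized source and reference into one input line."""
    for token in list(source_tokens) + list(reference_tokens):
        # Tokens must be non-empty, free of whitespace, and must not
        # contain the field delimiter itself.
        if not token or "|||" in token or any(ch.isspace() for ch in token):
            raise ValueError("invalid token: %r" % token)
    return " ".join(source_tokens) + SEPARATOR + " ".join(reference_tokens)


def split_forced_input(line):
    """Inverse of make_forced_input: return (source_tokens, reference_tokens)."""
    source, sep, reference = line.partition("|||")
    if not sep:
        raise ValueError("no ||| delimiter found")
    return source.split(), reference.split()


source = "澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。".split()
reference = ("australia is one of the few countries that have "
             "diplomatic relations with north korea .").split()
line = make_forced_input(source, reference)
print(line)
```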
  • Options can also be given in a configuration file, for example:
$ cat cdec.ini
feature_function=LanguageModel /fs/clip-dissertation/cdyer/ftrans/iwslt/en.3gram.lm.gz -o 3
feature_function=WordPenalty
formalism=scfg
grammar=/fs/clip-dissertation/cdyer/ftrans/iwslt/cdec-vest-baseline/grammar.gz
add_pass_through_rules=true
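
The configuration file above is a list of key=value lines in which a key such as feature_function may appear more than once. The Python sketch below reads such a file into a dictionary of lists so that repeated keys are preserved in order. It is an illustration only, not cdec's own option parser; the function name `parse_config` is hypothetical, and skipping lines that start with `#` is an assumption rather than documented behavior.

```python
from collections import defaultdict


def parse_config(text):
    """Parse key=value lines into {key: [value, ...]}, keeping repeats."""
    options = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and "#" comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % raw_line)
        options[key.strip()].append(value.strip())
    return dict(options)


CONFIG_TEXT = """\
feature_function=LanguageModel /fs/clip-dissertation/cdyer/ftrans/iwslt/en.3gram.lm.gz -o 3
feature_function=WordPenalty
formalism=scfg
grammar=/fs/clip-dissertation/cdyer/ftrans/iwslt/cdec-vest-baseline/grammar.gz
add_pass_through_rules=true
"""

options = parse_config(CONFIG_TEXT)
for key, values in options.items():
    print(key, values)
```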
  • Running the decoder in alignment mode (-a):
$ ./cdec -g test_data/australia.scfg.gz -w test_data/australia.weights -a
cdec v1.0 (c) 2009 by Chris Dyer
Reading SCFG grammar from test_data/australia.scfg.gz
  33737 rules read.
Reading weights from test_data/australia.weights
Reading input from STDIN
澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .

INPUT: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 ||| australia is one of the few countries that have diplomatic relations with north korea .
  Goal category: [S]
    ...........
  -LM forest (nodes/edges): 78/253341
  -LM forest       (paths): 4.86506e+28
  -LM Viterbi: australia is have diplomatic relations with north korea one of the few countries .
  -LM Viterbi: -12.7893
  Goal category: [CAT_77]
    ...............
  Constr. forest (nodes/edges): 116/374
  Constr. forest       (paths): 24148
  Constr. VitTree: (CAT_77 (CAT_76 (CAT_73 (CAT_13 (CAT_12 australia is)) (CAT_66 one of (CAT_60 the (CAT_20 few countries) \
            that (CAT_36 have diplomatic relations (CAT_15 with north korea))))) (CAT_11 .)))
 012345678901234
0*..............0
1.*.............1
2...........*...2
3............**.3
4........*......4
5.........**....5
6....*..........6
7.....*.........7
8......*........8
9..**...........9
0..............*0
 012345678901234

0-0 1-1 11-2 12-3 13-3 8-4 9-5 10-5 4-6 5-7 6-8 2-9 3-9 14-10
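
The relationship between the alignment pairs and the grid above can be reproduced with a few lines of Python. In this transcript the first index of each pair runs up to 14 (the 15 target words) and the second up to 10 (the 11 source words), so the sketch reads each pair as column-row; that reading is inferred from this one example and is an assumption about the output format in general. The function names are hypothetical.

```python
def parse_pairs(text):
    """Turn '0-0 1-1 11-2' into [(0, 0), (1, 1), (11, 2)]."""
    pairs = []
    for item in text.split():
        first, second = item.split("-")
        pairs.append((int(first), int(second)))
    return pairs


def render_grid(pairs, num_rows, num_cols):
    """Draw a grid with one row per source word and one column per target word."""
    # Assumption: each pair is (column, row), as inferred from the transcript.
    marked = {(row, col) for col, row in pairs}
    header = " " + "".join(str(col % 10) for col in range(num_cols))
    lines = [header]
    for row in range(num_rows):
        cells = "".join(
            "*" if (row, col) in marked else "." for col in range(num_cols)
        )
        lines.append("%d%s%d" % (row % 10, cells, row % 10))
    lines.append(header)
    return "\n".join(lines)


ALIGNMENT = "0-0 1-1 11-2 12-3 13-3 8-4 9-5 10-5 4-6 5-7 6-8 2-9 3-9 14-10"
grid = render_grid(parse_pairs(ALIGNMENT), num_rows=11, num_cols=15)
print(grid)
```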