Language model notes

From cdec Decoder


cdec supports rescoring translation forests with n-gram target language models. In general, since doing this exactly is very expensive (albeit polynomial), it is advisable to use cube pruning for the rescoring intersection strategy.

LanguageModel vs. KLanguageModel

The feature function called LanguageModel, which uses SRI's language model library, is deprecated in favor of KLanguageModel, which is based on KenLM. At a future date, the references to SRILM will be removed (for the sake of keeping cdec as small as possible).

Dealing with beginning and ending sentence markers (<s> and </s>)

Language models typically include two special symbols that mark the positions "before" and "after" a sentence, <s> and </s>. cdec (specifically the KLanguageModel feature function) can handle them in one of two ways.

  1. Translation forests do not contain them, and they are added implicitly at the beginning and end of every path. This is the default.
  2. Translation forests do contain them (and therefore your input is probably assumed to contain them explicitly). To enable this, make sure your translation forests explicitly generate these symbols, and add the -x option to your KLanguageModel configuration. If <s> or </s> occurs sentence-internally in some derivation, a penalty of -100 is added to the probability of the rest of the derivation.
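
The two modes can be illustrated with a small Python sketch. This is a toy model of the behavior described above, not cdec's implementation: the function name boundary_penalty and the explicit_markers flag are hypothetical, and applying the -100 penalty once per misplaced marker is an assumption (the text only says that a penalty of -100 is added).

```python
# Toy illustration only; names are hypothetical and not part of cdec.
SENTENCE_INTERNAL_PENALTY = -100.0  # value quoted in the text above


def boundary_penalty(tokens, explicit_markers):
    """Return (tokens as the LM would see them, total penalty).

    explicit_markers=False models the default mode, where the forest
    contains no <s>/</s> and they are added implicitly around the path.
    explicit_markers=True models the -x mode, where the forest is
    expected to generate the markers itself.
    """
    if not explicit_markers:
        # Default mode: wrap the yield with the sentence markers.
        return ["<s>"] + list(tokens) + ["</s>"], 0.0

    penalty = 0.0
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        # Assumption: one penalty per marker found in a
        # sentence-internal position.
        if tok == "<s>" and i != 0:
            penalty += SENTENCE_INTERNAL_PENALTY
        elif tok == "</s>" and i != last:
            penalty += SENTENCE_INTERNAL_PENALTY
    return list(tokens), penalty
```

For example, boundary_penalty(["<s>", "a", "</s>", "b", "</s>"], True) reports a penalty for the sentence-internal </s>, while a well-formed yield such as ["<s>", "a", "</s>"] incurs none.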

Why choose one or the other? Including edge contexts in translation rules seems to improve quality, and cube pruning can use more aggressive state recombination: once an <s> symbol has been seen, the LM has a full context with which to score, rather than waiting for the customary n-1 words to have been seen.
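
The recombination argument can be made concrete with a minimal sketch. It assumes the usual picture of n-gram scoring over partial translations, in which the first n-1 words of a fragment lack their left context and cannot be fully scored until the fragment is combined with material to its left. The function deferred_words is hypothetical and does not reflect cdec's actual state representation.

```python
# Toy illustration only; not cdec's LM state implementation.
def deferred_words(tokens, order):
    """Count left-edge words of a partial translation that still
    lack full n-gram context, for an LM of the given order."""
    if tokens and tokens[0] == "<s>":
        # Nothing can precede <s>, so every following word already
        # has its complete history and nothing needs to be deferred.
        return 0
    # Otherwise up to n-1 words at the left edge are waiting for
    # context that a later combination step will supply.
    return min(len(tokens), order - 1)
```

With a trigram model, the fragment ["the", "cat", "sat"] leaves two words waiting for context, whereas ["<s>", "the", "cat"] leaves none, so hypotheses beginning with <s> carry less left-edge state and can be recombined more readily.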

Why was the interface to SRILM removed?

SRI's language model library is widely used in MT, and was formerly supported by cdec. While it is great software, the newer KenLM had a number of advantages (in particular, I was able to distribute it together with cdec, simplifying software distribution) and no disadvantages. Therefore, to minimize the amount of redundant and unused code, I removed the library.
