cdec includes a word aligner that can use arbitrary, overlapping features and does not require manual word alignments during training. The aligner is described in the following paper:


Parallel corpus file format and file.source-target naming convention

The parallel corpus that is to be aligned should be represented in a single file where each line is of the form:

sentence in source language ||| sentence in target language

The script CDEC/word-aligner/paste-parallel-files.pl can be used to create a properly formatted parallel file from multiple parallel files.
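For illustration, the line format above can be produced from two line-aligned files with a few lines of Python. This is only a minimal sketch of the format, not a description of how paste-parallel-files.pl works; the function name `paste_parallel` is hypothetical, and stripping surrounding whitespace from each sentence is an assumption.

```python
from itertools import zip_longest


def paste_parallel(source_lines, target_lines):
    """Yield 'source ||| target' lines from two line-aligned inputs.

    Hypothetical helper for illustration; not part of cdec.
    """
    for src, tgt in zip_longest(source_lines, target_lines):
        # zip_longest pads the shorter input with None, which signals
        # that the two inputs are not line-aligned.
        if src is None or tgt is None:
            raise ValueError("source and target have different numbers of lines")
        # Assumption: surrounding whitespace and newlines are not significant.
        yield f"{src.strip()} ||| {tgt.strip()}"


if __name__ == "__main__":
    french = ["le chat dort\n", "bonjour\n"]
    english = ["the cat sleeps\n", "hello\n"]
    for line in paste_parallel(french, english):
        print(line)
```

Any iterable of lines works as input, so open file objects can be passed directly in place of the lists.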

The word aligner assumes that the input filename has an extension that encodes the source and target languages as two-letter ISO 639-1 language codes, with the source listed first and the target second, separated by a hyphen. For example, the following are acceptable input filenames:

  • corpus.fr-en
  • btec.en-zh
  • europarl.de-nl
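The naming convention can be checked before running the aligner. The sketch below is an assumption about what a reasonable check looks like, not the test that aligner.pl itself performs; the function name `language_pair` is hypothetical, and the pattern only verifies that the extension consists of two lowercase two-letter codes, not that they are valid ISO 639-1 codes.

```python
import re

# Assumption: the extension is two lowercase letters, a hyphen, and two
# more lowercase letters, at the very end of the filename.
_PAIR_EXTENSION = re.compile(r"\.([a-z]{2})-([a-z]{2})$")


def language_pair(filename):
    """Return (source, target) language codes taken from the extension.

    Hypothetical helper for illustration; not part of cdec.
    """
    match = _PAIR_EXTENSION.search(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not end in .source-target")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for name in ["corpus.fr-en", "btec.en-zh", "europarl.de-nl"]:
        source, target = language_pair(name)
        print(f"{name}: source={source} target={target}")
```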

Running the word aligner

To align a corpus:

  1. CDEC/word-aligner/aligner.pl corpus.source-target
  2. cd talign
  3. make
