Lexical translation / Word alignment
cdec includes a word aligner that can use arbitrary, overlapping features and does not require manual word alignments during training. The aligner is described in the following paper:
- C. Dyer, J. Clark, A. Lavie, N. A. Smith. 2011. "Unsupervised Word Alignment with Arbitrary Features" in Proc. of ACL.
Prerequisites
- Download and compile mkcls from the Giza++/mkcls repository on Google Code.
Parallel corpus file format and file.source-target naming convention
The parallel corpus to be aligned must be stored in a single file in which each line has the form:
sentence in source language ||| sentence in target language
The script CDEC/word-aligner/paste-parallel-files.pl can be used to create a properly formatted parallel file from multiple parallel files.
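As an illustration of the line format only, here is a minimal Python sketch that joins two sentence-aligned lists of lines into `source ||| target` lines. It is not the implementation of paste-parallel-files.pl; the function name `paste_parallel` and the choice to reject inputs of unequal length are assumptions made for this example.

```python
from typing import Iterable, List

# Separator between the source and target sentence on each line.
SEPARATOR = " ||| "


def paste_parallel(source_lines: Iterable[str], target_lines: Iterable[str]) -> List[str]:
    """Return one 'source ||| target' line per sentence pair (hypothetical helper)."""
    src = [line.strip() for line in source_lines]
    tgt = [line.strip() for line in target_lines]
    # Assumption for this sketch: both sides must have the same number of sentences.
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} source vs {len(tgt)} target")
    return [s + SEPARATOR + t for s, t in zip(src, tgt)]
```

The two inputs can be any iterables of lines, for example two open file objects, and the returned lines can be written out one per line to a file such as corpus.fr-en.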
The word aligner assumes that the input file has an extension that encodes the source and target languages as two-letter ISO 639-1 language codes, with the source first and the target second, separated by a hyphen. For example, the following are acceptable input filenames:
- corpus.fr-en
- btec.en-zh
- europarl.de-nl
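The naming convention can be checked before running the aligner. The following is a small Python sketch, not part of cdec; the helper name `language_pair` is hypothetical, and the check covers only the shape of the extension (two lowercase letters, a hyphen, two lowercase letters), not whether the codes are registered in ISO 639-1.

```python
import re
from typing import Tuple

# Matches an extension such as ".fr-en" at the end of the filename.
_EXTENSION = re.compile(r"\.([a-z]{2})-([a-z]{2})$")


def language_pair(filename: str) -> Tuple[str, str]:
    """Return (source, target) codes parsed from the filename extension."""
    match = _EXTENSION.search(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not end in .source-target")
    return match.group(1), match.group(2)
```

For instance, `language_pair("corpus.fr-en")` yields the pair `("fr", "en")`, while a name such as `corpus.txt` raises `ValueError`.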
Running the word aligner
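To align a corpus, run the following commands in order, where corpus.source-target is the input file named as described above: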
- CDEC/word-aligner/aligner.pl corpus.source-target
- cd talign
- make
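When the alignment is driven from a script, the same three steps can be wrapped as follows. This is a sketch under stated assumptions: `cdec_root` stands for the CDEC directory used in the paths above, the function names are hypothetical, and the talign directory is assumed to be relative to the current working directory, as the `cd talign` step suggests.

```python
import subprocess
from typing import List, Optional, Tuple

# Each step is (argument list, working directory); None means the current directory.
Step = Tuple[List[str], Optional[str]]


def alignment_steps(cdec_root: str, corpus: str) -> List[Step]:
    """Build the commands from the list above without running them."""
    aligner = f"{cdec_root}/word-aligner/aligner.pl"
    return [
        ([aligner, corpus], None),  # CDEC/word-aligner/aligner.pl corpus.source-target
        (["make"], "talign"),       # cd talign, then make
    ]


def run_alignment(cdec_root: str, corpus: str) -> None:
    """Run each step in order, stopping at the first command that fails."""
    for argv, cwd in alignment_steps(cdec_root, corpus):
        subprocess.run(argv, cwd=cwd, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before anything is run.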
