Suffix Array Grammar Extraction

cdec contains a suffix array grammar extractor implemented by Adam Lopez that can be used to efficiently extract SCFG grammars from very large corpora. Although this code was originally designed for online operation, it is now most commonly used to extract test-set-specific grammars or per-sentence grammars in advance of large (or repetitive) batch decoding jobs.

Building the suffix array extractor

The grammar extractor is found in the sa-extractor/ subdirectory of the main project. To build it, you will need:

  1. Cython (0.14.1 or newer)
  2. Python 2.7

Ensure that the sa-extractor/Makefile points to the proper location of Python and Cython, and then run make from within that directory.
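
Before running make, it can help to confirm that the installed Cython meets the minimum version listed above. The following is a minimal sketch, not part of cdec: the helper names parse_version, version_at_least and check_cython are hypothetical, and it assumes that the Python interpreter running the script is the same one the Makefile points to.

```python
import re


def parse_version(text):
    """Turn a dotted version such as '0.14.1' into a tuple of ints: (0, 14, 1).

    Parsing stops at the first component that does not start with a digit,
    so a suffix such as 'a1' in '0.29a1' is ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(found, required):
    """Return True if the version string `found` is >= `required`."""
    found_parts = parse_version(found)
    required_parts = parse_version(required)
    # Pad the shorter tuple with zeros so '0.14' compares as '0.14.0'.
    width = max(len(found_parts), len(required_parts))
    found_parts += (0,) * (width - len(found_parts))
    required_parts += (0,) * (width - len(required_parts))
    return found_parts >= required_parts


def check_cython(required="0.14.1"):
    """Return True if Cython is importable and at least `required`."""
    try:
        import Cython
    except ImportError:
        return False
    return version_at_least(Cython.__version__, required)


if __name__ == "__main__":
    print("Cython OK" if check_cython() else "Cython missing or too old")
```

The comparison is done on integer tuples rather than on raw strings, because a plain string comparison would rank "0.9" above "0.14".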

Extracting Grammars

There are two main operations of the grammar extractor:

  1. extraction of per-sentence grammars (default),
  2. extraction of complete test-set grammars.

These are selected with the per_sentence_grammar option in the extract.ini configuration files.
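
When switching between the two modes for different jobs, the option can be changed by a script instead of by hand. The sketch below is an illustration under stated assumptions, not a description of the actual extract.ini format: it assumes the file holds simple "name = value" lines and that the option takes the values True and False. The helper set_option and the sample option max_len are hypothetical.

```python
import re


def set_option(config_text, name, value):
    """Return config_text with `name = value` set.

    Assumes simple 'name = value' lines (an assumption about extract.ini).
    The first existing assignment to `name` is rewritten in place; if there
    is none, a new line is appended at the end.
    """
    pattern = re.compile(
        r"^([ \t]*)" + re.escape(name) + r"[ \t]*=.*$", re.MULTILINE
    )
    new_line = "{} = {}".format(name, value)
    if pattern.search(config_text):
        # Keep the original indentation captured in group 1.
        return pattern.sub(lambda m: m.group(1) + new_line, config_text, count=1)
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"


if __name__ == "__main__":
    # Hypothetical file contents; only per_sentence_grammar comes from the text above.
    sample = "max_len = 5\nper_sentence_grammar = True\n"
    print(set_option(sample, "per_sentence_grammar", "False"))
```

The function works on the file contents as a string, so the caller decides whether to overwrite extract.ini or to write a separate copy for each mode.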
