Synchronous context-free grammar

From cdec Decoder

Synchronous context-free grammars (SCFGs) generalize the concept of a context-free grammar to two languages.

cdec SCFG rule format

SCFGs can be written as a collection of rules in a text file, one rule per line. Each rule consists of several fields, some of them optional, separated by three vertical pipes (|||). The rule format is compatible with the one used by the Joshua and Hiero toolkits, but also supports a number of extensions. Note that this format is merely a representation of an SCFG: some parsing algorithms may place additional restrictions on the form of the grammar (such as the number and orientation of non-terminals, the number of non-terminal categories, and whether insertion / deletion rules are supported).

The following are well-formed rules:

[NP] ||| la [NN] [JJ] ||| the [2] [1] ||| TransLogProb=-2.146 SomeFeature=1.2
[NP] ||| la [NN,1] [JJ,2] ||| the [2] [1] ||| TransLogProb=-2.146 ||| 0-0
[NP] ||| la [NN,1] [JJ,2] ||| the [JJ,2] [NN,1] ||| 1.23 -2.6423 1.0 0.0 0.55219
[S] ||| [NP] [VP] ||| [1] [2] ||| TopFeature=1
[X] ||| [X,1] de [X,2] a [Y,3] ||| [2] 's [1] [3] ||| 0.9842 0.7279
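
As a rough illustration of the field layout, the following Python sketch splits one rule line on the ||| separator. The function name split_rule is hypothetical and is not part of cdec; requiring at least three fields (LHS, source RHS, target RHS) is an assumption made for this example rather than something the format description states.

```python
def split_rule(line):
    """Split one rule line into its |||-separated fields, trimming whitespace.

    Hypothetical helper for illustration only; not part of cdec.
    """
    fields = [field.strip() for field in line.split("|||")]
    # Assumption: a usable rule has at least an LHS, a source RHS and a target RHS.
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields, got %d" % len(fields))
    return fields


rule = "[NP] ||| la [NN,1] [JJ,2] ||| the [2] [1] ||| TransLogProb=-2.146 ||| 0-0"
lhs, source_rhs, target_rhs, features, alignment = split_rule(rule)
```

Later sketches on this page operate on the individual fields such a split would produce.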

Field 1 - LHS

The first field is the left-hand side (LHS) of the rule. It is a single non-terminal symbol enclosed in square brackets. Non-terminal symbols must begin with a capital letter, optionally followed by further capital letters, digits, or any of the characters _ : = / \ +.

The following are examples of well-formed LHS's:

[X]
[NP]
[VP_H=V]
[VP\NP]

The following are invalid LHS's:

[Noun]             (lowercase letters are forbidden)
(X)                (square brackets missing)
[NP,VP]            (commas are forbidden)
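
The LHS rule above can be approximated with a regular expression. This is a sketch derived from the description on this page, not the pattern cdec itself uses; the names LHS_PATTERN and is_valid_lhs are hypothetical.

```python
import re

# One capital letter, then any number of capital letters, digits,
# or the characters _ : = / \ +, all inside square brackets.
LHS_PATTERN = re.compile(r"\[[A-Z][A-Z0-9_:=/\\+]*\]")


def is_valid_lhs(text):
    """Return True if text looks like a well-formed LHS (illustrative check)."""
    return LHS_PATTERN.fullmatch(text) is not None


for candidate in ["[X]", "[NP]", "[VP_H=V]", r"[VP\NP]", "[Noun]", "(X)", "[NP,VP]"]:
    print(candidate, is_valid_lhs(candidate))
```

Under this pattern the four well-formed examples are accepted and the three invalid ones are rejected.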

Field 2 - source RHS

The second field is the source language right-hand side (RHS). In the source language field, each non-terminal symbol must be indicated by its type. Non-terminals may optionally carry an explicit index, but the indices must appear in order, starting at 1.

The following are examples of well-formed source RHS's:

je mehr [X] um so [X]
je mehr [X,1] um so [X,2]
je mehr [VP,1] um so [COMP,2]
ein kleines haus
[X] [X] [Y] [X]
la [NN,1] [JJ,2]

The following are invalid source RHS's:

la [NN,2] [JJ,1]                (source indices not in order)
la [1] [2]                      (types not given: will be interpreted as terminal symbols)
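
The source-side constraints can be sketched as follows. The code reads "in order" as meaning that an explicit index must equal the non-terminal's 1-based position among the non-terminals of the RHS; that reading, and the names SOURCE_NT and source_nonterminals, are assumptions made for this example.

```python
import re

# A source non-terminal: [TYPE] or [TYPE,index]. Bracketed tokens without a
# type, such as [1], do not match and are therefore treated as terminals.
SOURCE_NT = re.compile(r"\[([A-Z][A-Z0-9_:=/\\+]*)(?:,([0-9]+))?\]")


def source_nonterminals(rhs):
    """Return the non-terminal types of a source RHS in order of appearance.

    Raises ValueError if an explicit index does not match the position
    (assumed interpretation of "indices must appear in order").
    """
    types = []
    for token in rhs.split():
        match = SOURCE_NT.fullmatch(token)
        if match is None:
            continue  # terminal symbol
        position = len(types) + 1
        explicit = match.group(2)
        if explicit is not None and int(explicit) != position:
            raise ValueError("index %s out of order in %r" % (explicit, rhs))
        types.append(match.group(1))
    return types


print(source_nonterminals("je mehr [VP,1] um so [COMP,2]"))
print(source_nonterminals("la [1] [2]"))
```

With this sketch, la [NN,2] [JJ,1] raises an error, while la [1] [2] yields no non-terminals at all because the untyped brackets are read as terminals.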

Field 3 - target RHS

The third field is the target language right-hand side (RHS). In the target language field, non-terminal symbols may be indexed purely by position; if a type is given explicitly, it must match the type of the corresponding source non-terminal. The number of non-terminals must be identical in the source and target, and each source non-terminal position must be referenced exactly once on the target side.

The following are examples of well-formed source-target pairs of RHS's:

je mehr [X] um so [X]           ||| the more [X,1] the [X,2]
je mehr [X,1] um so [X,2]       ||| the [2] the [1]
je mehr [VP,1] um so [COMP,2]   ||| [COMP,2] [VP,1]
ein kleines haus                ||| a small house
[X] [X] [Y] [X]                 ||| [2] [4] [1] [3]
la [NN,1] [JJ,2]                ||| the [2] [1]
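
The target-side constraints can be checked along these lines. The sketch takes the source non-terminal types as a plain list (for example ["NN", "JJ"]) and recognizes target non-terminals of the form [index] or [TYPE,index]; the names TARGET_NT and target_order are hypothetical, and the code is not taken from cdec.

```python
import re

# A target non-terminal: [index] or [TYPE,index].
TARGET_NT = re.compile(r"\[(?:([A-Z][A-Z0-9_:=/\\+]*),)?([0-9]+)\]")


def target_order(source_types, target_rhs):
    """Return the source positions in the order the target references them.

    Raises ValueError if an index is out of range, an explicit type does not
    match the source, or some source position is not referenced exactly once.
    """
    order = []
    for token in target_rhs.split():
        match = TARGET_NT.fullmatch(token)
        if match is None:
            continue  # terminal symbol
        nt_type, index = match.group(1), int(match.group(2))
        if not 1 <= index <= len(source_types):
            raise ValueError("index %d has no source non-terminal" % index)
        if nt_type is not None and nt_type != source_types[index - 1]:
            raise ValueError("type %s does not match source" % nt_type)
        order.append(index)
    # Every source position must be used exactly once.
    if sorted(order) != list(range(1, len(source_types) + 1)):
        raise ValueError("each source non-terminal must be referenced exactly once")
    return order


print(target_order(["X", "X", "Y", "X"], "[2] [4] [1] [3]"))
print(target_order(["VP", "COMP"], "[COMP,2] [VP,1]"))
```

The returned list records the reordering, so the pair la [NN,1] [JJ,2] and the [2] [1] gives the order 2, 1.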

Field 4 - feature values

Feature values can be specified either as name-value pairs or as a list of values. The name-value pair format is preferred for models that have many sparse features, while the list format exists to preserve compatibility with MT systems that typically have only a small number of dense features. If the list format is used, the features are assigned the default names PhraseModel_0, PhraseModel_1, PhraseModel_2, etc.
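
Both feature formats can be read into a single name-to-value mapping, as in the sketch below. The function name parse_features is hypothetical; how cdec treats a field that mixes named and unnamed values is not described here, so numbering unnamed values separately from named ones is an assumption of this example.

```python
def parse_features(field):
    """Parse a feature field into a dict mapping feature names to floats.

    Tokens of the form name=value keep their name; bare values receive the
    default names PhraseModel_0, PhraseModel_1, ... in order of appearance.
    """
    features = {}
    dense_count = 0
    for token in field.split():
        # Split on the last "=" so that negative values such as -2.146 survive.
        name, separator, value = token.rpartition("=")
        if separator:
            features[name] = float(value)
        else:
            features["PhraseModel_%d" % dense_count] = float(token)
            dense_count += 1
    return features


print(parse_features("TransLogProb=-2.146 SomeFeature=1.2"))
print(parse_features("0.9842 0.7279"))
```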

Field 5 (optional) - alignment