Synchronous context-free grammar
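Synchronous context-free grammars (SCFGs) generalize the concept of a context-free grammar to two languages: each rule rewrites a single non-terminal into a pair of right-hand sides, one in the source language and one in the target language.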
cdec SCFG rule format
An SCFG can be written as a text file containing one rule per line. Each rule consists of several fields, some of them optional, separated by three vertical pipes (|||). The rule format is compatible with the one used by the Joshua and Hiero toolkits, but also supports a number of extensions. Note that this format is merely a representation of an SCFG: some parsing algorithms may place additional restrictions on the form of the grammar (such as the number and orientation of non-terminals, the number of non-terminal categories, and whether insertion / deletion rules are supported).
The following are well-formed rules:
[NP] ||| la [NN] [JJ] ||| the   ||| TransLogProb=-2.146 SomeFeature=1.2
[NP] ||| la [NN,1] [JJ,2] ||| the   ||| TransLogProb=-2.146 ||| 0-0
[NP] ||| la [NN,1] [JJ,2] ||| the [JJ,2] [NN,1] ||| 1.23 -2.6423 1.0 0.0 0.55219
[S] ||| [NP] [VP] |||   ||| TopFeature=1
[X] ||| [X,1] de [X,2] a [Y,3] |||  's   ||| 0.9842 0.7279
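As a rough illustration of the field layout, the following Python sketch splits one rule line on the ||| separator. The function name split_rule is hypothetical (it is not part of cdec), and the requirement of at least three fields is an assumption made for this sketch; the text above only says that some fields are optional.

```python
def split_rule(line):
    """Split one rule line into its |||-separated fields.

    Hypothetical helper, not cdec's own parser. Requiring at least
    three fields (LHS, source RHS, target RHS) is an assumption.
    """
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        raise ValueError("expected at least LHS, source RHS and target RHS")
    return fields


rule = "[NP] ||| la [NN,1] [JJ,2] ||| the [JJ,2] [NN,1] ||| 1.23 -2.6423 1.0 0.0 0.55219"
lhs, source_rhs, target_rhs, features = split_rule(rule)
print(lhs)         # [NP]
print(source_rhs)  # la [NN,1] [JJ,2]
print(target_rhs)  # the [JJ,2] [NN,1]
print(features)    # 1.23 -2.6423 1.0 0.0 0.55219
```

Any fields after the fourth (such as the 0-0 field in the second rule above) are returned unchanged at the end of the list.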
Field 1 - LHS
The first field is the left-hand side (LHS) of the rule. It is a single non-terminal symbol enclosed in square brackets. A non-terminal symbol must begin with a capital letter, which may be followed by further capital letters, digits, or any of the characters _ : = / \ +.
The following are examples of well-formed LHS's:
[X]
[NP]
[VP_H=V]
[VP\NP]
The following are invalid LHS's:
[Noun] (lowercase letters are forbidden)
(X) (square brackets missing)
[NP,VP] (commas are forbidden)
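The LHS naming rule can be approximated with a regular expression. This is a minimal sketch of the rule as stated above, not cdec's actual validation code; the names LHS_PATTERN and is_valid_lhs are hypothetical.

```python
import re

# A capital letter, then any mix of capital letters, digits and _ : = / \ +,
# all enclosed in square brackets. Hypothetical pattern derived from the
# description above.
LHS_PATTERN = re.compile(r"\[[A-Z][A-Z0-9_:=/\\+]*\]")


def is_valid_lhs(field):
    """Return True if the field is a single well-formed non-terminal."""
    return LHS_PATTERN.fullmatch(field) is not None


for candidate in ["[X]", "[NP]", "[VP_H=V]", r"[VP\NP]", "[Noun]", "(X)", "[NP,VP]"]:
    print(candidate, is_valid_lhs(candidate))
```

The first four candidates are accepted and the last three are rejected, matching the examples listed above.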
Field 2 - source RHS
The second field is the source language right-hand side (RHS). In the source language field, each non-terminal symbol must be given with its type and may optionally carry an explicit index. Indices start at 1, and indexed non-terminals must appear in increasing index order.
The following are examples of well-formed source RHS's:
je mehr [X] um so [X]
je mehr [X,1] um so [X,2]
je mehr [VP,1] um so [COMP,2]
ein kleines haus
[X] [X] [Y] [X]
la [NN,1] [JJ,2]
The following are invalid source RHS's:
la [NN,2] [JJ,1] (source indices not in order)
la   (types not given: will be interpreted as terminal symbols)
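The source-side constraints can be sketched as a small checker. The function name source_nonterminal_types is hypothetical, and the check that an explicit index must equal the non-terminal's position (1 for the first, 2 for the second, and so on) is this sketch's reading of "in order, starting at 1", not a documented cdec behaviour.

```python
import re

# [TYPE] or [TYPE,k]; the type follows the LHS naming rule.
SRC_NT = re.compile(r"\[([A-Z][A-Z0-9_:=/\\+]*)(?:,(\d+))?\]")


def source_nonterminal_types(source_rhs):
    """Return the non-terminal types of a source RHS, in order.

    Tokens that do not look like [TYPE] or [TYPE,k] are treated as
    terminal symbols. Raises ValueError if an explicit index does not
    match the non-terminal's position (assumed reading of the rule).
    """
    types = []
    for token in source_rhs.split():
        match = SRC_NT.fullmatch(token)
        if match is None:
            continue  # terminal symbol
        types.append(match.group(1))
        index = match.group(2)
        if index is not None and int(index) != len(types):
            raise ValueError(f"source index out of order in {token!r}")
    return types


print(source_nonterminal_types("je mehr [VP,1] um so [COMP,2]"))  # ['VP', 'COMP']
print(source_nonterminal_types("ein kleines haus"))               # []
```

Calling it on la [NN,2] [JJ,1] raises a ValueError, since the first non-terminal carries index 2.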
Field 3 - target RHS
The third field is the target language right-hand side (RHS). In the target language field, non-terminal symbols may be indexed purely by position; if a type is given explicitly, it must match the type of the corresponding source non-terminal. The number of non-terminals must be identical in the source and target, and each source non-terminal position must be referenced exactly once on the target side.
The following are examples of well-formed source-target pairs of RHS's:
je mehr [X] um so [X] ||| the more [X,1] the [X,2]
je mehr [X,1] um so [X,2] ||| the  the 
je mehr [VP,1] um so [COMP,2] ||| [COMP,2] [VP,1]
ein kleines haus ||| a small house
[X] [X] [Y] [X] |||     
la [NN,1] [JJ,2] ||| the  
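The target-side constraints (each source position referenced exactly once, and explicit types agreeing with the source) can be checked along the following lines. This is a sketch under stated assumptions: check_target is a hypothetical helper, and the bare [k] token used for a purely positional reference is an assumed notation for illustration.

```python
import re

# [k] (assumed positional notation) or [TYPE,k].
TGT_NT = re.compile(r"\[(?:([A-Z][A-Z0-9_:=/\\+]*),)?(\d+)\]")


def check_target(source_types, target_rhs):
    """Return the source positions in the order the target references them.

    source_types lists the source non-terminal types in order.
    Raises ValueError if a reference is out of range, has a type that
    disagrees with the source, or if some source position is not
    referenced exactly once.
    """
    referenced = []
    for token in target_rhs.split():
        match = TGT_NT.fullmatch(token)
        if match is None:
            continue  # terminal symbol
        nt_type, index = match.group(1), int(match.group(2))
        if not 1 <= index <= len(source_types):
            raise ValueError(f"{token!r} refers to a missing source position")
        if nt_type is not None and nt_type != source_types[index - 1]:
            raise ValueError(f"type of {token!r} does not match the source")
        referenced.append(index)
    if sorted(referenced) != list(range(1, len(source_types) + 1)):
        raise ValueError("each source position must be referenced exactly once")
    return referenced


print(check_target(["VP", "COMP"], "[COMP,2] [VP,1]"))      # [2, 1]
print(check_target(["X", "X"], "the more [X,1] the [X,2]"))  # [1, 2]
print(check_target(["NN", "JJ"], "the [2] [1]"))             # [2, 1]
```

The returned list makes the reordering explicit: [2, 1] means the target side swaps the two source non-terminals.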
Field 4 - feature values
Feature values can be specified either as name-value pairs or as a plain list of values. The name-value format is preferred for models with many sparse features, while the list format preserves compatibility with MT systems that typically have only a small number of dense features. If the list format is used, the features are assigned the default names PhraseModel_0, PhraseModel_1, PhraseModel_2, etc.
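Both feature formats can be read into a single name-to-value mapping, as in the sketch below. The helper parse_features is hypothetical; accepting a mix of named and unnamed values within one field is an assumption of this sketch, since the text only describes the two formats separately.

```python
def parse_features(field):
    """Parse a feature field into a dict mapping feature name to value.

    Named features look like Name=value. Unnamed values are assigned
    the default names PhraseModel_0, PhraseModel_1, ... in order.
    """
    features = {}
    dense_count = 0
    for token in field.split():
        name, separator, value = token.rpartition("=")
        if not separator:
            # No '=' present: a dense value that gets a default name.
            name = f"PhraseModel_{dense_count}"
            dense_count += 1
        features[name] = float(value)
    return features


print(parse_features("TransLogProb=-2.146 SomeFeature=1.2"))
# {'TransLogProb': -2.146, 'SomeFeature': 1.2}
print(parse_features("0.9842 0.7279"))
# {'PhraseModel_0': 0.9842, 'PhraseModel_1': 0.7279}
```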