MaltParser

User guide

The user guide consists of the following sections:

Start using MaltParser
Controlling MaltParser
Configuration
Input and output format
Parsing algorithms
Feature model
Prediction strategy
Parsing phrase structure
References

Start using MaltParser

This section contains a short guide to get familiar with MaltParser. We start by running MaltParser without any arguments by typing the following at the command line prompt (it is important that you are in the malt-1.2 directory):

prompt> java -jar malt.jar
This command will display the following output:
-----------------------------------------------------------------------------
                          MaltParser 1.2
-----------------------------------------------------------------------------
         MALT (Models and Algorithms for Language Technology) Group
             School of Mathematics and Systems Engineering (MSI)
                        Vaxjo University, Sweden
-----------------------------------------------------------------------------

Usage:
   java -jar malt.jar -f  
   java -jar malt.jar -h for more help and options

help                  (  -h) : Show options
-----------------------------------------------------------------------------
option_file           (  -f) : Path to option file
-----------------------------------------------------------------------------
verbosity             (  -v) : Verbosity level
  debug      - Logging of debugging messages
  error      - Logging of error events
  fatal      - Logging of very severe error events
  info       - Logging of informational messages
  off        - Logging turned off
  warn       - Logging of harmful situations
-----------------------------------------------------------------------------

Documentation: docs/index.html
Here you can see the basic usage and options. To get all available options:
prompt> java -jar malt.jar -h
All of these options are also described in a short and a full version of the option documentation.

Train a parsing model

Now we are ready to train our first parsing model. In the directory examples/data there are two data files talbanken05_train.conll and talbanken05_test.conll, which contain very small portions of the Swedish treebank Talbanken05. The example data sets are formatted according to the CoNLL data format. Note that these data sets are very small and that you need more training data to create a useful parsing model.

To train a default parsing model with MaltParser type the following at the command line prompt:

prompt> java -jar malt.jar -c test -i examples/data/talbanken05_train.conll -m learn
This line tells MaltParser to create a parsing model named test.mco (also known as a Single Malt configuration file) from the data in the file examples/data/talbanken05_train.conll. The parsing model gets its name from the configuration name, which is specified by the option flag -c without the file suffix .mco. The configuration name is a name of your own choice. The option flag -i tells the parser where to find the input data. The last option flag -m specifies the processing mode learn (as opposed to parse), since in this case we want to induce a model by using the default learning method (LIBSVM).

MaltParser outputs the following information:

-----------------------------------------------------------------------------
                          MaltParser 1.2
-----------------------------------------------------------------------------
         MALT (Models and Algorithms for Language Technology) Group
             School of Mathematics and Systems Engineering (MSI)
                        Vaxjo University, Sweden
-----------------------------------------------------------------------------

Started: Tue Dec 09 20:22:33 CET 2008
Initialize the parsing algorithm...
Initialize the guide model...
.                     1       0s              1MB
.                    10       0s              1MB
                     32       0s              1MB
Creating LIBSVM model odm0.libsvm.mod
Learning time: 00:00:02 (2547 ms)
Finished: Tue Dec 09 20:22:36 CET 2008
Most of the logging information is self-explanatory: it tells you that the parser was started at a certain time and date and that it read sentences from a specified file containing 32 sentences. It continues with information about the learning models that are created, in this case only one LIBSVM model. It then saves the symbol table and all options (which cannot be changed later during parsing) and stores everything in a configuration file named test.mco. Finally, the parser informs you about the learning time.

Parse data with your parsing model

We have now created a parsing model that we can use for parsing new sentences from the same language. It is important that unparsed sentences are formatted according to the format that was used during training (except that the output columns for head and dependency relation are missing). In this case tokens are represented by the first six columns of the CoNLL data format. To parse type the following:

prompt> java -jar malt.jar -c test -i examples/data/talbanken05_test.conll -o out.conll -m parse
where -c test is the name of the configuration (the prefix file name of test.mco), -i examples/data/talbanken05_test.conll tells the parser where to find the input data, -o out.conll is the output file name, and finally -m parse specifies that the parser should be executed in parsing mode.
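
The two commands above can also be assembled from a script. The following is a minimal sketch, assuming Python is available alongside Java; the helper name malt_command is hypothetical, and only the flags -c, -m, -i and -o shown in this section are used.

```python
def malt_command(config, mode, infile=None, outfile=None, jar="malt.jar"):
    """Build the argument list for one MaltParser run (flags -c, -m, -i, -o)."""
    cmd = ["java", "-jar", jar, "-c", config, "-m", mode]
    if infile is not None:
        cmd += ["-i", infile]
    if outfile is not None:
        cmd += ["-o", outfile]
    return cmd

# Training run: creates test.mco from the training data.
learn = malt_command("test", "learn",
                     infile="examples/data/talbanken05_train.conll")

# Parsing run: reuses test.mco and writes the parsed sentences to out.conll.
parse = malt_command("test", "parse",
                     infile="examples/data/talbanken05_test.conll",
                     outfile="out.conll")

print(" ".join(learn))
print(" ".join(parse))
```

To execute the runs, each list could be passed to subprocess.run(cmd, check=True) from the malt-1.2 directory, with the learn run completed before the parse run.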

Controlling MaltParser

MaltParser can be controlled by specifying values for a range of different options. The values for these options can be specified in different ways:

Method: Command-line option flag
Description: Uses the option flag with a dash (-) before the option flag and a blank between the option flag and the value.
Example: -c test

Method: Command-line option group and option name
Description: Uses both the option group name and the option name to specify the option, with two dashes (--) before the option group name and one dash (-) to separate the option group name and the option name. The equality sign (=) is used for separating the option and the value.
Example: --config-name=test

Method: Command-line option name
Description: A shorter version of the command-line option group and option name form; it can only be used when the option name is unambiguous.
Example: --name=test

Method: Option file
Description: The option settings are specified in an option file, formatted in XML. To tell MaltParser to read the option file, the option flag -f is used. Note that command-line option settings override the settings in the option file if options are specified twice.
Example:
<?xml version="1.0" encoding="UTF-8"?>
<experiment>
  <optioncontainer>
    <optiongroup groupname="config">
      <option name="name" value="test"/>
    </optiongroup>
  </optioncontainer>
</experiment>
All options are described in a short and a full version of the option documentation.

Option file

An option file is useful when you have many options that differ from the default value, as is often the case when you are training a parsing model. The option file should have the following XML format:

Element          Description
experiment       All other elements must be enclosed by an experiment element.
optioncontainer  It is possible to have one or more option containers, but MaltParser 1.2 only uses the first option container. Later releases may make use of multiple option containers, for instance, to build ensemble systems.
optiongroup      There can be one or more option group elements within an option container. The attribute groupname specifies the option group name (see description of all available options).
option           An option group can consist of one or more options. The element option has two attributes: name, which corresponds to an option name, and value, which is the value of the option. Please consult the description of all available options to see all legal option names and values.

Here is an example (examples/optionexample.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment>
	<optioncontainer>
		<optiongroup groupname="config">
			<option name="name" value="example1"/>
			<option name="flowchart" value="learn"/>
		</optiongroup>
		<optiongroup groupname="singlemalt">
			<option name="parsing_algorithm" value="nivrestandard"/>
		</optiongroup>
		<optiongroup groupname="input">
			<option name="infile" value="examples/data/talbanken05_train.conll"/>
		</optiongroup>
		<optiongroup groupname="nivre">
			<option name="root_handling" value="strict"/>
		</optiongroup>
		<optiongroup groupname="libsvm">
			<option name="libsvm_options" value="-s_0_-t_1_-d_2_-g_0.2_-c_1.0_-r_0.4_-e_0.1"/>
		</optiongroup>
		<optiongroup groupname="guide">
			<option name="data_split_column" value="POSTAG"/>
			<option name="data_split_structure" value="Input[0]"/>
			<option name="data_split_threshold" value="100"/>
		</optiongroup>
	</optioncontainer>
</experiment>

To run MaltParser with the above option file type:

prompt> java -jar malt.jar -f examples/optionexample.xml
This command will create a configuration file example1.mco based on the settings in the option file. It is possible to override the options by command-line options, for example:
prompt> java -jar malt.jar -f examples/optionexample.xml -a nivreeager
which will create a configuration based on the same settings, except that the parsing algorithm is now nivreeager instead of nivrestandard. If you want to create a configuration that has the same settings as the option file using only command-line options, you need to type:
prompt> java -jar malt.jar -c example1 -m learn 
                           -i examples/data/talbanken05_train.conll -a nivrestandard 
                           -r strict -lso -s_0_-t_1_-d_2_-g_0.2_-c_1.0_-r_0.4_-e_0.1 
                           -d POSTAG -s Input[0] -T 100
To parse using one of the three configurations you simply type:
prompt> java -jar malt.jar -c example1 -m parse 
                           -i examples/data/talbanken05_test.conll -o out1.conll
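
An option file in the format described in this section can also be generated programmatically. The sketch below is only an illustration using Python's standard xml.etree.ElementTree module; the function name build_option_file is hypothetical, and the group and option names are taken from the example file above.

```python
import xml.etree.ElementTree as ET

def build_option_file(groups):
    """Return option-file XML for a dict {group name: {option name: value}}."""
    experiment = ET.Element("experiment")
    # MaltParser 1.2 only uses the first option container, so one is enough.
    container = ET.SubElement(experiment, "optioncontainer")
    for group_name, options in groups.items():
        group = ET.SubElement(container, "optiongroup", groupname=group_name)
        for name, value in options.items():
            ET.SubElement(group, "option", name=name, value=value)
    body = ET.tostring(experiment, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

xml_text = build_option_file({
    "config": {"name": "example1", "flowchart": "learn"},
    "singlemalt": {"parsing_algorithm": "nivrestandard"},
    "input": {"infile": "examples/data/talbanken05_train.conll"},
})
print(xml_text)
```

The resulting text can be saved to a file and passed to MaltParser with the option flag -f.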

Configuration

The purpose of the configuration is to gather information about all settings and files into one file. During learning, the configuration is created and stored in a configuration file with the file suffix .mco. This configuration file can later be reused whenever the trained model is used to parse new data. Potentially there can be several types of configuration, but MaltParser 1.2 only knows one type: the Single Malt configuration (singlemalt).

Flow chart

MaltParser has seven pre-defined flow charts that describe what tasks MaltParser should perform. These seven flow charts are:

Name     Description
learn    Creates a Single Malt configuration and induces a parsing model from input data.
parse    Parses sentences using a Single Malt configuration.
info     Prints information about a configuration.
unpack   Unpacks a configuration into a directory with the same name.
proj     Creates a configuration and projectivizes input data without inducing a parsing model.
deproj   Deprojectivizes input data using a configuration.
convert  A simple data format converter.

A Single Malt configuration creates a parsing model based on one set of option values. The learn and parse modes are explained above in Train a parsing model and Parse data with your parsing model; four of the other modes (info, unpack, proj and deproj) are described below using the same example.

Get configuration information

Sometimes it is useful to get information about a configuration, for instance, to know which settings have been used when creating the configuration. To get this information you type:

prompt> java -jar malt.jar -c test -m info
This will output a lot of information about the configuration:
CONFIGURATION
Configuration name:   test
Configuration type:   singlemalt
Created:              Tue Dec 09 20:22:33 CET 2008

SYSTEM
Operating system architecture: x86
Operating system name:         Windows XP
JRE vendor name:               Sun Microsystems Inc.
JRE version number:            1.6.0_07

MALTPARSER
Version:                       1.2
Build date:                    December 9 2008

SETTINGS
config
  workingdir (  -w)                     user.dir
  name (  -c)                           test
  logging ( -cl)                        info
  flowchart (  -m)                      learn
  types ( -tt)
  type (  -t)                           singlemalt
  logfile (-lfi)                        stdout
  url (  -u)
covington
  allow_root ( -cr)                     true
  allow_shift ( -cs)                    false
graph
  max_sentence_length (-gsl)            256
  head_rules (-ghr)
  root_label (-grl)                     ROOT
guide
  decision_settings (-gds)              T.TRANS+A.DEPREL
  kbest_type ( -kt)                     rank
  data_split_structure (  -s)
  learner (  -l)                        libsvm
  kbest (  -k)                          -1
  features (  -F)
  classitem_separator (-gcs)            ~
  data_split_column (  -d)
  data_split_threshold (  -T)           50
input
  infile (  -i)                         examples/data/talbanken05_train.conll
  reader ( -ir)                         tab
  charset ( -ic)                        UTF-8
  reader_options (-iro)
  format ( -if)                         /appdata/dataformat/conllx.xml
libsvm
  libsvm_external (-lsx)
  save_instance_files (-lsi)            false
  libsvm_options (-lso)
  verbosity (-lsv)                      silent
malt0.4
  depset (-mcd)
  behavior (-mcb)                       false
  posset (-mcp)
  cposset (-mcc)
nivre
  root_handling (  -r)                  normal
  post_processing (-npp)                false
output
  charset ( -oc)                        UTF-8
  writer_options (-owo)
  format ( -of)                         /appdata/dataformat/conllx.xml
  writer ( -ow)                         tab
  outfile (  -o)
pproj
  covered_root (-pcr)                   none
  marking_strategy ( -pp)               none
  lifting_order (-plo)                  shortest
singlemalt
  parsing_algorithm (  -a)              nivreeager
  null_value ( -nv)                     one
  guide_model ( -gm)                    single
  diagnostics ( -di)                    false
  diafile (-dif)                        stdout
  mode ( -sm)                           parse

DEPENDENCIES
--guide-features (  -F)                 /appdata/features/NivreEager.xml

FEATURE MODEL
MAIN
0       InputColumn(POSTAG, Stack[0])
1       InputColumn(POSTAG, Input[0])
2       InputColumn(POSTAG, Input[1])
3       InputColumn(POSTAG, Input[2])
4       InputColumn(POSTAG, Input[3])
5       InputColumn(POSTAG, Stack[1])
6       OutputColumn(DEPREL, Stack[0])
7       OutputColumn(DEPREL, ldep(Stack[0]))
8       OutputColumn(DEPREL, rdep(Stack[0]))
9       OutputColumn(DEPREL, ldep(Input[0]))
10      InputColumn(FORM, Stack[0])
11      InputColumn(FORM, Input[0])
12      InputColumn(FORM, Input[1])
13      InputColumn(FORM, head(Stack[0]))

LIBSVM INTERFACE
  LIBSVM version: 2.86
  SVM-param string:
LIBSVM SETTINGS
  SVM type      : C_SVC (0)
  Kernel        : POLY (1)
  Degree        : 2
  Gamma         : 0.2
  Coef0         : 0.0
  Cache Size    : 100.0 MB
  C             : 1.0
  Eps           : 1.0
  Shrinking     : 1
  Probability   : 0
  #Weight       : 0
The information is grouped into different categories:
Category             Description
CONFIGURATION        The name and type of the configuration and the date when it was created.
SYSTEM               Information about the system that was used when creating the configuration, such as processor, operating system and version of Java Runtime Environment (JRE).
MALTPARSER           Version of MaltParser and when it was built.
SETTINGS             All option settings divided into several categories.
DEPENDENCIES         In some cases the parser self-corrects when an illegal combination of options is specified or some option is missing. In the example above the feature specification file is not specified and the parser uses the default feature specification file for the Nivre arc-eager parsing algorithm.
FEATURE MODEL        Outputs the content of the feature specification file.
<LEARNER> INTERFACE  Information about the interface to the learner, in this case LIBSVM.
<LEARNER> SETTINGS   All settings of specific learner options, in this case LIBSVM.

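The SETTINGS category of the info output has a regular layout (an option group name on its own line, followed by indented lines of the form name (flag) value), so it can be read into a dictionary. The following sketch assumes exactly the layout shown above, including the leading spaces on option lines; the function name parse_settings is hypothetical, and the sample text is a shortened copy of the output.

```python
import re

# An option line looks like: "  name (  -c)            test"
OPTION_LINE = re.compile(r"^\s+(\S+) \(\s*(-\w+)\)\s*(.*)$")

def parse_settings(info_text):
    """Return {group: {option: {"flag": ..., "value": ...}}} for the SETTINGS block."""
    settings, group, in_settings = {}, None, False
    for line in info_text.splitlines():
        if line.strip() == "SETTINGS":
            in_settings = True
            continue
        if not in_settings:
            continue
        if not line.strip():          # a blank line ends the SETTINGS block
            break
        match = OPTION_LINE.match(line)
        if match:
            name, flag, value = match.groups()
            settings.setdefault(group, {})[name] = {"flag": flag, "value": value.strip()}
        else:                         # an unindented line starts a new option group
            group = line.strip()
            settings.setdefault(group, {})
    return settings

sample = """SETTINGS
config
  name (  -c)                           test
  types ( -tt)
singlemalt
  parsing_algorithm (  -a)              nivreeager

DEPENDENCIES
--guide-features (  -F)                 /appdata/features/NivreEager.xml
"""

settings = parse_settings(sample)
print(settings["singlemalt"]["parsing_algorithm"]["value"])
```
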
Unpack a configuration

It is possible to unpack the configuration file test.mco by typing:

prompt> java -jar malt.jar -c test -m unpack
This command will create a new directory test containing the following files:
File                  Description
libsvm.mod            The LIBSVM model that is used for predicting the next parsing action.
savedoptions.sop      All option settings that cannot be changed during parsing.
symboltables.sym      All distinct symbols in the training data, divided into different columns. For example, the column POSTAG in the CoNLL format has its own symbol table with all distinct values occurring in the training data.
test_singlemalt.info  Information about the configuration (same as described above).

Projectivize input data

It is possible to projectivize an input file, with or without involving parsing.

All non-projective arcs in the input file are replaced by projective arcs by applying a lifting operation. The lifts are encoded in the dependency labels of the lifted arcs. The encoding scheme can be varied using the flag -pp (marking_strategy), and there are currently five of them: none, baseline, head, path and head+path. (See Nivre & Nilsson (2005) for more details concerning the encoding schemes.) A dependency file can be projectivized using the head encoding by typing:

prompt> java -jar malt.jar -c pproj -m proj
                           -i examples/data/talbanken05_test.conll 
                           -o projectivized.conll
                           -pp head

There is one additional option for the projectivization called covered_root, which is mainly used for handling dangling punctuation. Depending on the treebank, a punctuation token located in the middle of a sentence can attach directly to the root, which entails that all arcs crossing the head arc of the punctuation token are non-projective. This, in turn, results in lots of (unnecessary) lifts, and can be avoided by using the covered_root flag -pcr. This option has four values: none, left, right and head. For the last three values, tokens like dangling punctuation are then attached to one of the tokens connected by the shortest arc covering the token, either the leftmost (left), rightmost (right), or head (head) token of the covering arc. This will prevent all the unnecessary lifts.

The projectivization and deprojectivization (below), including the encoding schemes, are known as pseudo-projective transformations and are described in more detail in Nivre & Nilsson (2005). The only difference compared to Nivre & Nilsson is that it is the most deeply nested non-projective arc that is lifted first, not the shortest one. Lifting the most deeply nested arc first is likely to result in fewer lifts when two or more non-projective arcs interact. In practice, however, this will probably have little impact on the parsing accuracy.
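
To make the notions of a non-projective arc and a lifting operation concrete, here is a small sketch. It is not MaltParser's implementation: it assumes a well-formed tree given as a list of head indices (index 0 is the artificial root), it lifts the first non-projective arc it finds rather than the most deeply nested one, and it does not encode the lifts in the dependency labels. All function names are hypothetical.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` is reached from `node` by following head links."""
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node]
    return ancestor == 0

def nonprojective_arcs(heads):
    """Arcs (head, dependent) that span a token not dominated by the head."""
    arcs = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            arcs.append((head, dep))
    return arcs

def projectivize(heads):
    """Lift non-projective arcs one step at a time until none remain."""
    heads = list(heads)
    lifts = 0
    while True:
        arcs = nonprojective_arcs(heads)
        if not arcs:
            return heads, lifts
        head, dep = arcs[0]
        heads[dep] = heads[head]   # lift: re-attach the dependent to its head's head
        lifts += 1

# heads[i] is the head of token i; heads[0] is a placeholder for the root.
tree = [0, 3, 0, 2, 2]             # the arc 3 -> 1 crosses token 2
print(nonprojective_arcs(tree))
print(projectivize(tree))
```

Each lift moves a dependent one step closer to the root, and arcs from the root are always projective in a tree, so the loop terminates.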

Deprojectivize input data

MaltParser can also be used to deprojectivize a projective file containing pseudo-projective encoding, with or without involving parsing, where it is assumed that the configuration pproj contains the same encoding scheme as during projectivization. It could look like this:
prompt> java -jar malt.jar -c pproj -m deproj
                           -i projectivized.conll
                           -o deprojectivized.conll
The file deprojectivized.conll will contain the deprojectivized data. Note that it is only the encoding schemes head, path and head+path that actively try to recover the non-projective arcs.

Input and output format

The format and encoding of the input and output data are controlled by the format, reader, writer and charset options in the input and output option groups. The data format specification files for the CoNLL format, the Malt-TAB format and a simplified version of the Negra format are already included in the MaltParser jar-file (malt.jar) in the appdata/dataformat directory. The CoNLL data format specification file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="conllx">
	<column name="ID" category="INPUT" type="ECHO"/>
	<column name="FORM" category="INPUT" type="STRING"/>
	<column name="LEMMA" category="INPUT" type="STRING"/>
	<column name="CPOSTAG" category="INPUT" type="STRING"/>
	<column name="POSTAG" category="INPUT" type="STRING"/>
	<column name="FEATS" category="INPUT" type="STRING"/>
	<column name="HEAD" category="HEAD" type="INTEGER"/>
	<column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
	<column name="PHEAD" category="HEAD" type="IGNORE" default="_"/>
	<column name="PDEPREL" category="DEPENDENCY_EDGE_LABEL" type="IGNORE" default="_"/>
</dataformat>

A data format specification file has two types of XML elements. First, there is the dataformat element with the attribute name, which gives the data format a name. The dataformat element encloses one or more column elements, which contain information about individual columns. The column elements have four attributes:

Attribute  Description
name       The column name. Note that the column name can be used by an option and within a feature model specification as an identifier of the column.
category   The column category, one of the following:
    INPUT                        Input data in both learning and parsing mode, such as part-of-speech tags or word forms.
    DEPENDENCY_EDGE_LABEL        Denotes that the column contains a dependency label. If the parser is to learn to produce labeled dependency graphs, these labels must be present in learning mode.
    OUTPUT                       Same as DEPENDENCY_EDGE_LABEL; used by MaltParser versions 1.0-1.1.
    PHRASE_STRUCTURE_EDGE_LABEL  Denotes that the column contains a phrase structure edge label.
    PHRASE_STRUCTURE_NODE_LABEL  Denotes that the column contains a phrase category label.
    SECONDARY_EDGE_LABEL         Denotes that the column contains a secondary edge label.
    HEAD                         The head column defines the unlabeled structure of a dependency graph and is also output data of the parser in parsing mode.
type       Defines the data type of the column and/or its treatment during learning and parsing:
    STRING   The column value will be stored as a string value in a symbol table.
    INTEGER  The column value will be stored as an integer value.
    BOOLEAN  The column value will be stored as a boolean value.
    ECHO     The column value will be stored as an integer value, but cannot be used in the definition of features.
    IGNORE   The column value will be ignored and therefore will not be present in the output file.
default    The default output for columns that have the column type IGNORE.

It is possible to define your own input/output format and then supply the data format specification file with the format option.
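
Because a data format specification is plain XML, its columns can be inspected with a few lines of code before it is passed to MaltParser. The sketch below uses Python's standard xml.etree.ElementTree module on an abbreviated copy of the CoNLL specification shown above; the function name read_columns and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the CoNLL data format specification.
SPEC = """<dataformat name="conllx">
    <column name="ID" category="INPUT" type="ECHO"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
    <column name="PHEAD" category="HEAD" type="IGNORE" default="_"/>
</dataformat>"""

def read_columns(spec_xml):
    """Return the format name and one attribute dict per column element."""
    root = ET.fromstring(spec_xml)
    return root.get("name"), [dict(col.attrib) for col in root.findall("column")]

format_name, columns = read_columns(SPEC)

# Columns whose values are stored, i.e. neither ECHO nor IGNORE.
stored = [c["name"] for c in columns if c["type"] not in ("ECHO", "IGNORE")]

# IGNORE columns and the default value written for them in the output.
ignored = {c["name"]: c.get("default") for c in columns if c["type"] == "IGNORE"}

print(format_name, stored, ignored)
```
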

Currently, MaltParser only supports tab-separated data files, which means that a sentence in a data file in the CoNLL data format could look like this:

1	Den	_	PO	PO	DP	2	SS	_	_
2	blir	_	V	BV	PS	0	ROOT	_	_
3	gemensam	_	AJ	AJ	_	2	SP	_	_
4	för	_	PR	PR	_	2	OA	_	_
5	alla	_	PO	PO	TP	6	DT	_	_
6	inkomsttagare	_	N	NN	HS	4	PA	_	_
7	oavsett	_	PR	PR	_	2	AA	_	_
8	civilstånd	_	N	NN	SS	7	PA	_	_
9	.	_	P	IP	_	2	IP	_	_

Finally, the character encoding can be specified with the charset option and this option is used by MaltParser to define the java class Charset.
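
A tab-separated sentence like the example above can be read into one dictionary per token by pairing each field with its column name. This is a minimal sketch, assuming the ten CoNLL columns from the specification above and the CoNLL convention that sentences are separated by a blank line; the function name read_sentences is hypothetical.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def read_sentences(lines):
    """Yield each sentence as a list of {column name: value} dictionaries."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:                      # last sentence without a trailing blank line
        yield sentence

data = [
    "1\tDen\t_\tPO\tPO\tDP\t2\tSS\t_\t_\n",
    "2\tblir\t_\tV\tBV\tPS\t0\tROOT\t_\t_\n",
    "\n",
]
sentences = list(read_sentences(data))
for token in sentences[0]:
    print(token["ID"], token["FORM"], token["HEAD"], token["DEPREL"])
```

When reading from a file, the charset used for the data (UTF-8 by default according to the settings above) would be passed as the encoding argument of open().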

Parsing algorithms

Any deterministic parsing algorithm compatible with the MaltParser architecture can be implemented in the MaltParser package. MaltParser 1.2 contains two families of parsing algorithms: Nivre and Covington, both with two members.

Nivre

Nivre's algorithm (Nivre 2003, Nivre 2004) is a linear-time algorithm limited to projective dependency structures. It can be run in arc-eager (-a nivreeager) or arc-standard (-a nivrestandard) mode. In addition, the root handling option can be used to change the algorithm's behavior with respect to root tokens, i.e., tokens of the input sentence that are not dependent on another token.

Nivre's algorithm uses two data structures, which are addressed in the feature model as Stack and Input.
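
The arc-eager mode can be illustrated with the textbook formulation of its four transitions (SHIFT, REDUCE, LEFT-ARC and RIGHT-ARC) over a stack and an input list. The sketch below is an illustration under that assumption; MaltParser's own implementation, and in particular its root handling, may differ in detail. The function name arc_eager is hypothetical, and the transition sequence is supplied by hand rather than predicted.

```python
def arc_eager(n_tokens, transitions):
    """Apply arc-eager transitions to tokens 1..n_tokens; node 0 is the root."""
    stack, inp = [0], list(range(1, n_tokens + 1))
    heads, labels = {}, {}
    for name, label in transitions:
        if name == "SHIFT":            # move the next input token onto the stack
            stack.append(inp.pop(0))
        elif name == "REDUCE":         # pop a stack top that already has a head
            assert stack[-1] in heads, "REDUCE needs a stack top with a head"
            stack.pop()
        elif name == "LEFT-ARC":       # next input token becomes head of the stack top
            dep = stack.pop()
            assert dep != 0 and dep not in heads, "LEFT-ARC needs a headless token"
            heads[dep], labels[dep] = inp[0], label
        elif name == "RIGHT-ARC":      # stack top becomes head of the next input token
            dep = inp.pop(0)
            heads[dep], labels[dep] = stack[-1], label
            stack.append(dep)
        else:
            raise ValueError(f"unknown transition: {name}")
    return heads, labels

# "Den blir gemensam": tokens 1..3 with heads 2, 0, 2.
heads, labels = arc_eager(3, [
    ("SHIFT", None),
    ("LEFT-ARC", "SS"),
    ("RIGHT-ARC", "ROOT"),
    ("RIGHT-ARC", "SP"),
])
print(heads, labels)
```

Every transition either consumes an input token or pops the stack, which is why the number of transitions is linear in the sentence length.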

Covington

Covington's algorithm (Covington 2001) is a quadratic-time algorithm for unrestricted dependency structures, which proceeds by trying to link each new token to each preceding token. It can be run in a projective (-a covproj) mode, where the linking operation is restricted to projective dependency structures, or in a non-projective (-a covnonproj) mode, allowing non-projective (but acyclic) dependency structures. In addition, there are two options, allow shift and allow root, that control the behavior of Covington's algorithm.

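Covington's algorithm uses four data structures, which are addressed in the feature model as Left, Right, LeftContext and RightContext (the projective mode only makes Left and Right available).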
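
The pairwise linking loop that makes Covington's algorithm quadratic can be sketched as follows. This is a simplified illustration and not MaltParser's implementation: the decision function is a stand-in for the trained guide, the action names left-head and right-head are hypothetical, only the single-head constraint is enforced, and the projectivity and acyclicity checks are left out.

```python
def covington(n_tokens, decide):
    """Try to link each new token to each preceding token (non-projective mode)."""
    heads = {}
    for right in range(2, n_tokens + 1):
        for left in range(right - 1, 0, -1):
            action = decide(left, right)
            if action == "left-head" and right not in heads:
                heads[right] = left      # arc left -> right
            elif action == "right-head" and left not in heads:
                heads[left] = right      # arc right -> left
    for token in range(1, n_tokens + 1):
        heads.setdefault(token, 0)       # unattached tokens go to the root
    return heads

def gold_oracle(gold):
    """Decision function that reproduces a known tree {dependent: head}."""
    def decide(left, right):
        if gold.get(right) == left:
            return "left-head"
        if gold.get(left) == right:
            return "right-head"
        return None
    return decide

gold = {1: 3, 2: 0, 3: 2, 4: 2}          # contains the non-projective arc 3 -> 1
parsed = covington(4, gold_oracle(gold))
print(parsed)
```

For n tokens the loop considers n(n-1)/2 token pairs, which is the source of the quadratic running time.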

Feature model

MaltParser uses history-based feature models for predicting the next action in the deterministic derivation of a dependency structure, which means that it uses features of the partially built dependency structure together with features of the (tagged) input string. Features that make use of the partially built dependency structure correspond to the OUTPUT category of the data format, for example DEPREL in the CoNLL data format, and features of the input string correspond to the INPUT category of the data format, for example CPOSTAG and FORM.

The feature model specification must be specified in an XML file according to the format below or in a text file formatted according to the specification given by the MaltParser 0.x user guide. The latter specification format should be saved in a text file where the file name must end with the file suffix .par. Below you can see an example of the new XML format (Nivre arc-eager default feature model):

<?xml version="1.0" encoding="UTF-8"?>
<featuremodels>
	<featuremodel name="nivreeager">
		<feature>InputColumn(POSTAG, Stack[0])</feature>
		<feature>InputColumn(POSTAG, Input[0])</feature>
		<feature>InputColumn(POSTAG, Input[1])</feature>
		<feature>InputColumn(POSTAG, Input[2])</feature>
		<feature>InputColumn(POSTAG, Input[3])</feature>
		<feature>InputColumn(POSTAG, Stack[1])</feature>
		<feature>OutputColumn(DEPREL, Stack[0])</feature>
		<feature>OutputColumn(DEPREL, ldep(Stack[0]))</feature>
		<feature>OutputColumn(DEPREL, rdep(Stack[0]))</feature>
		<feature>OutputColumn(DEPREL, ldep(Input[0]))</feature>
		<feature>InputColumn(FORM, Stack[0])</feature>
		<feature>InputColumn(FORM, Input[0])</feature>
		<feature>InputColumn(FORM, Input[1])</feature>
		<feature>InputColumn(FORM, head(Stack[0]))</feature>
	</featuremodel>
</featuremodels>

Each feature is defined using a functional notation with three types of functions:

Type                  Description
Feature function      A feature function returns a particular attribute of a token or graph node and takes two arguments: a column name and an address function. There are two feature functions available:
    InputColumn   The column name must correspond to an input column in the data format and the address function must return a token node in the input string. (If the address function is undefined, a null-value is returned.)
    OutputColumn  The column name must correspond to an output column in the data format and the address function must return a graph node in the dependency graph. (If the address function is undefined, a null-value is returned.)
Address function      There are two types of address functions: parsing algorithm specific functions and dependency graph functions. The parsing algorithm specific functions have the form Data-structure[i], where Data-structure is a data structure used by a specific parsing algorithm and i is an offset from the start position in this data structure. The following data structures are available for different parsing algorithms:
    Nivre arc-eager           Stack, Input
    Nivre arc-standard        Stack, Input
    Covington projective      Left, Right
    Covington non-projective  Left, Right, LeftContext, RightContext
    The dependency graph address functions take a graph node as argument and navigate from this graph node to another graph node (if possible). There are seven dependency graph address functions:
    head  Returns the head of the graph node if defined; otherwise, a null-value.
    ldep  Returns the leftmost (left) dependent of the graph node if defined; otherwise, a null-value.
    rdep  Returns the rightmost (right) dependent of the graph node if defined; otherwise, a null-value.
    lsib  Returns the next left (same-side) sibling of the graph node if defined; otherwise, a null-value.
    rsib  Returns the next right (same-side) sibling of the graph node if defined; otherwise, a null-value.
    pred  Returns the predecessor of the graph node in the linear order of the input string if defined; otherwise, a null-value.
    succ  Returns the successor of the graph node in the linear order of the input string if defined; otherwise, a null-value.
Feature map function  Maps a feature value onto a new set of values and takes as arguments a feature specification and one or more arguments that control the mapping. There are five feature map functions:
    Split   Splits the feature value into a set of feature values. In addition to a feature specification it takes a delimiter (regular expression) as an argument. The example below shows how the value of the FEATS column in the CoNLL data format is split into a set of values using the delimiter |:
        Split(InputColumn(FEATS, Input[0]),\|)
    Suffix  Extracts the suffix of a feature value (only InputColumn) with a suffix length n. By convention, if n = 0, the entire feature value is included; otherwise only the n last characters are included in the feature value. The following specification defines a feature whose value is the four-character suffix of the word form (FORM) of the next input token:
        Suffix(InputColumn(FORM, Input[0]), 4)
    Prefix  Extracts the prefix of a feature value with a prefix length n. By convention, if n = 0, the entire feature value is included; otherwise only the n first characters are included in the feature value. The following specification defines a feature whose value is the four-character prefix of the word form (FORM) of the next input token:
        Prefix(InputColumn(FORM, Input[0]), 4)
    Merge   Merges two feature values into one feature value. The following specification defines a feature in which the part-of-speech of the top token of the stack and that of the next input token are merged into one feature value:
        Merge(InputColumn(POSTAG, Stack[0]), InputColumn(POSTAG, Input[0]))
    Merge3  Merges three feature values into one feature value. The following specification defines a feature in which the parts-of-speech of the next three input tokens are merged into one feature value:
        Merge3(InputColumn(POSTAG, Input[0]), InputColumn(POSTAG, Input[1]), InputColumn(POSTAG, Input[2]))
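
The functional notation can be read as ordinary function composition. The sketch below mimics the address function Data-structure[i], the feature function InputColumn and the feature map functions Split, Suffix and Prefix in Python. It only illustrates the semantics described above and is not MaltParser code: the parser state, the helper names and the FEATS value HS|XX are made up for the example, and Python's None stands in for the null-value.

```python
import re

# Hypothetical parser state: Stack[0] is the top of the stack,
# Input[0] is the next input token.
tokens = {
    1: {"FORM": "alla", "POSTAG": "PO", "FEATS": "TP"},
    2: {"FORM": "inkomsttagare", "POSTAG": "NN", "FEATS": "HS|XX"},
}
stack, input_ = [1], [2]

def address(structure, i):
    """Data-structure[i]: the node at offset i, or None if undefined."""
    return structure[i] if 0 <= i < len(structure) else None

def input_column(column, node):
    """InputColumn: the column value of the node, or None (the null-value)."""
    return None if node is None else tokens[node][column]

def split(value, delimiter):
    """Split: a set of values, using a regular expression as delimiter."""
    return set(re.split(delimiter, value))

def suffix(value, n):
    """Suffix: the n last characters, or the entire value if n = 0."""
    return value if n == 0 else value[-n:]

def prefix(value, n):
    """Prefix: the n first characters, or the entire value if n = 0."""
    return value if n == 0 else value[:n]

# InputColumn(POSTAG, Stack[0])
print(input_column("POSTAG", address(stack, 0)))
# Split(InputColumn(FEATS, Input[0]),\|)
print(split(input_column("FEATS", address(input_, 0)), r"\|"))
# Suffix(InputColumn(FORM, Input[0]), 4) and Prefix(InputColumn(FORM, Input[0]), 4)
print(suffix(input_column("FORM", address(input_, 0)), 4),
      prefix(input_column("FORM", address(input_, 0)), 4))
```
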

MaltParser is equipped with a default feature model specification for each parsing algorithm and it automatically identifies the corresponding feature model specification. It is possible to define your own feature model specification using the description above and using the --guide-features option to specify the feature model specification file.

Prediction strategy

From version 1.1 of MaltParser it is possible to choose different prediction strategies. Previously, MaltParser (version 1.0.4 and earlier) combined the prediction of the transition with the prediction of the arc label into one complex prediction with one feature model. With MaltParser 1.1 and later versions it is possible to divide the prediction of the parser action into several predictions. For example with the Nivre arc-eager algorithm, it is possible to first predict the transition; if the transition is SHIFT or REDUCE the nondeterminism is resolved, but if the predicted transition is RIGHT-ARC or LEFT-ARC the parser continues to predict the arc label. This prediction strategy enables the system to have three different feature models: one for predicting the transition and two for predicting the arc label (RIGHT-ARC and LEFT-ARC).

To control the prediction strategy, the --guide-decision_settings option is used with the following notation:

Notation	Name	Description
T.TRANS+A.DEPREL	Combined prediction	Combines the prediction of the transition (T.TRANS) and the arc label (A.DEPREL). This is the default setting of MaltParser 1.1 and was the only setting available in previous versions of MaltParser.
T.TRANS,A.DEPREL	Sequential prediction	First predicts the transition (T.TRANS) and continues to predict the arc label (A.DEPREL) if the transition requires an arc label.
T.TRANS#A.DEPREL	Branching prediction	First predicts the transition (T.TRANS). If the transition does not require an arc label, the nondeterminism is resolved; if it does, the parser continues to predict the arc label. For a left arc transition it predicts the arc label using the corresponding left arc model, and for a right arc transition it uses the right arc model.
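The control flow of the three strategies can be sketched in Python as follows. This is only an illustration of the decision order described in the table, not MaltParser's implementation: the function names, the toy classifiers, the feature keys and the labels OBJ and SUBJ are all hypothetical.

```python
ARC_TRANSITIONS = {"LEFT-ARC", "RIGHT-ARC"}


def predict_combined(model, features):
    """T.TRANS+A.DEPREL: one model predicts transition and label together."""
    return model(features)


def predict_sequential(trans_model, label_model, features):
    """T.TRANS,A.DEPREL: transition first, then one shared label model."""
    transition = trans_model(features)
    label = label_model(features) if transition in ARC_TRANSITIONS else None
    return transition, label


def predict_branching(trans_model, label_models, features):
    """T.TRANS#A.DEPREL: transition first, then a label model per arc type."""
    transition = trans_model(features)
    if transition not in ARC_TRANSITIONS:
        return transition, None
    return transition, label_models[transition](features)


# Toy stand-ins for trained classifiers (hypothetical behavior).
def trans_model(features):
    return "RIGHT-ARC" if features["POSTAG(Input[0])"] == "NN" else "SHIFT"


def combined_model(features):
    return ("RIGHT-ARC", "OBJ")


def shared_label_model(features):
    return "OBJ"


label_models = {
    "LEFT-ARC": lambda features: "SUBJ",
    "RIGHT-ARC": lambda features: "OBJ",
}

features = {"POSTAG(Stack[0])": "VAFIN", "POSTAG(Input[0])": "NN"}
print(predict_combined(combined_model, features))
print(predict_sequential(trans_model, shared_label_model, features))
print(predict_branching(trans_model, label_models, features))
```

In the sequential and branching variants the label models are consulted only when the predicted transition creates an arc, which is what allows each model to have its own feature model.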
To differentiate the feature model when using sequential prediction you can specify two submodels for T.TRANS and A.DEPREL. Here is a truncated example:
<?xml version="1.0" encoding="UTF-8"?>
<featuremodels>
	<featuremodel name="sequential">
		<submodel name="T.TRANS">
			<feature>InputColumn(POSTAG, Stack[0])</feature>
			<feature>InputColumn(POSTAG, Input[0])</feature>
			<feature>InputColumn(POSTAG, Input[1])</feature>
			...
		</submodel>
		<submodel name="A.DEPREL">
			<feature>InputColumn(POSTAG, Stack[0])</feature>
			<feature>InputColumn(POSTAG, Input[0])</feature>
			<feature>InputColumn(POSTAG, Input[1])</feature>
			...
			<feature>InputColumn(FORM,ldep(Input[0]))</feature>
			<feature>InputColumn(FORM,rdep(Stack[0]))</feature>
		</submodel>
	</featuremodel>
</featuremodels>
When using branching prediction it is possible to use three submodels (T.TRANS, RA.A.DEPREL and LA.A.DEPREL), where RA denotes the right arc model and LA the left arc model:
<?xml version="1.0" encoding="UTF-8"?>
<featuremodels>
	<featuremodel name="branching">
		<submodel name="T.TRANS">
			<feature>InputColumn(POSTAG, Stack[0])</feature>
			<feature>InputColumn(POSTAG, Input[0])</feature>
			<feature>InputColumn(POSTAG, Input[1])</feature>
			...
		</submodel>
		<submodel name="RA.A.DEPREL">
			<feature>InputColumn(POSTAG, Stack[0])</feature>
			<feature>InputColumn(POSTAG, Input[0])</feature>
			<feature>InputColumn(POSTAG, Input[1])</feature>
			...
			<feature>InputColumn(FORM,ldep(Input[0]))</feature>
			<feature>InputColumn(FORM,rdep(Stack[0]))</feature>
		</submodel>
		<submodel name="LA.A.DEPREL">
			<feature>InputColumn(POSTAG, Stack[0])</feature>
			<feature>InputColumn(POSTAG, Input[0])</feature>
			<feature>InputColumn(POSTAG, Input[1])</feature>
			...
			<feature>InputColumn(FORM,ldep(Input[0]))</feature>
			<feature>InputColumn(FORM,rdep(Stack[0]))</feature>
		</submodel>
	</featuremodel>
</featuremodels>
If the feature specification file does not contain any submodels then the parser uses the same feature model for all submodels.
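The fallback rule can be illustrated with a small Python sketch that looks up the features for a named submodel and, when the specification has no submodels at all, returns the shared feature list for every name. The function features_for, the two sample specifications and the assumption that a specification without submodels lists its feature elements directly under featuremodel are illustrative only; how MaltParser behaves when some but not all submodels are present is not covered by this guide, so the sketch simply raises an error in that case.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification without submodels (layout is an assumption).
FLAT_SPEC = """<featuremodels>
  <featuremodel name="shared">
    <feature>InputColumn(POSTAG, Stack[0])</feature>
    <feature>InputColumn(POSTAG, Input[0])</feature>
  </featuremodel>
</featuremodels>"""

# Shortened version of the sequential example.
SPLIT_SPEC = """<featuremodels>
  <featuremodel name="sequential">
    <submodel name="T.TRANS">
      <feature>InputColumn(POSTAG, Stack[0])</feature>
    </submodel>
    <submodel name="A.DEPREL">
      <feature>InputColumn(POSTAG, Stack[0])</feature>
      <feature>InputColumn(FORM,ldep(Input[0]))</feature>
    </submodel>
  </featuremodel>
</featuremodels>"""


def features_for(spec_xml, submodel_name):
    """Return the feature strings that apply to the named submodel."""
    model = ET.fromstring(spec_xml).find("featuremodel")
    submodels = model.findall("submodel")
    if not submodels:
        # No submodels: the same feature model serves every prediction.
        return [feature.text for feature in model.findall("feature")]
    for submodel in submodels:
        if submodel.get("name") == submodel_name:
            return [feature.text for feature in submodel.findall("feature")]
    raise LookupError(f"no submodel named {submodel_name!r}")


print(features_for(FLAT_SPEC, "A.DEPREL"))
print(features_for(SPLIT_SPEC, "A.DEPREL"))
```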

Phrase structure parsing

MaltParser 1.1 and later versions can be turned into a phrase structure parser that recovers both continuous and discontinuous phrases with both phrase labels and grammatical functions. The parser induces a parser model from treebank data by automatically transforming the phrase structure representations into dependency representations with complex arc labels, which makes it possible to recover the phrase structure with both phrase labels and grammatical functions (see Hall (2008), Hall and Nivre (2008a) and Hall and Nivre (2008b) for more details). Each edge label in the dependency graph is a quadruple consisting of four sublabels (DEPREL, HEADREL, PHRASE, ATTACH); the meaning of each sublabel is described in the papers cited above.

There are three different readers and writers for phrase structure parsing:

Reader/Writer	Data format	Description
negra	negraps and negrads	Reading and writing phrase structures inspired by the NEGRA export format
tiger	tigerps and tigerds	Reading and writing phrase structures inspired by TigerXML

The following example is taken from the TIGER Treebank (Version 2.1) in NEGRA export format:

%% word			lemma			tag	morph		edge	parent	secedge comment
#BOS 1 0 1098266456 1 %% @SB2AV@
``			--			$(	--		--	0
Ross			Ross			NE	Nom.Sg.Masc	PNC	500
Perot			Perot			NE	Nom.Sg.Masc	PNC	500
wäre			sein			VAFIN	3.Sg.Past.Subj	HD	502
vielleicht		vielleicht		ADV	--		MO	502
ein			ein			ART	Nom.Sg.Masc	NK	501
prächtiger		prächtig		ADJA	Pos.Nom.Sg.Masc	NK	501
Diktator		Diktator		NN	Nom.Sg.Masc	NK	501
''			--			$(	--		--	0
#500			--			PN	--		SB	502
#501			--			NP	--		PD	502
#502			--			S	--		--	0
#EOS 1
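To make the column layout of the example concrete, here is a minimal Python sketch that reads one sentence in this format into a list of terminals and a table of nonterminals. It is not MaltParser's reader: the function name and the dictionary keys are made up for illustration, the sketch assumes that no field contains whitespace, and it simply skips the header line, the #BOS and #EOS lines and any columns after the parent column.

```python
# The sample sentence; real files are tab-separated, but splitting on any
# whitespace handles both tabs and spaces.
SAMPLE = """\
%% word lemma tag morph edge parent secedge comment
#BOS 1 0 1098266456 1 %% @SB2AV@
`` -- $( -- -- 0
Ross Ross NE Nom.Sg.Masc PNC 500
Perot Perot NE Nom.Sg.Masc PNC 500
wäre sein VAFIN 3.Sg.Past.Subj HD 502
vielleicht vielleicht ADV -- MO 502
ein ein ART Nom.Sg.Masc NK 501
prächtiger prächtig ADJA Pos.Nom.Sg.Masc NK 501
Diktator Diktator NN Nom.Sg.Masc NK 501
'' -- $( -- -- 0
#500 -- PN -- SB 502
#501 -- NP -- PD 502
#502 -- S -- -- 0
#EOS 1
"""


def read_negra_sentence(text):
    """Return (terminals, nonterminals) for one export-format sentence."""
    terminals, nonterminals = [], {}
    for line in text.splitlines():
        # Skip blank lines, the header and the sentence delimiters.
        if not line.strip() or line.startswith(("%%", "#BOS", "#EOS")):
            continue
        # Keep the first six columns; secondary edges and comments are dropped.
        word, lemma, tag, morph, edge, parent = line.split()[:6]
        if word.startswith("#") and word[1:].isdigit():
            nonterminals[int(word[1:])] = {
                "tag": tag, "edge": edge, "parent": int(parent)}
        else:
            terminals.append({
                "word": word, "lemma": lemma, "tag": tag,
                "morph": morph, "edge": edge, "parent": int(parent)})
    return terminals, nonterminals


terminals, nonterminals = read_negra_sentence(SAMPLE)
print(len(terminals), sorted(nonterminals))
```

Each terminal and nonterminal carries the identifier of its parent phrase, with 0 standing for the root, so the phrase structure can be rebuilt from the parent column alone.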

MaltParser ignores the header of the file, the information about secondary edges and the information after "#BOS 1", but the second column must include the lemma or the dummy symbol "--". Given that you have training data in the file train.negra formatted as above and a feature specification file, type the following at the command line prompt:

prompt> java -jar malt.jar -c testps -i train.negra -if negraps -ir negra -ic ISO8859-1 -m learn \
			-gds T.TRANS#A.DEPREL,A.HEADREL,A.PHRASE,A.ATTACH -grl DEPREL=--,HEADREL=*,PHRASE=VROOT,ATTACH=0  \
			-F examples/covnonproj_ps.xml -a covnonproj -d POS -s Right[0] -T 1000

This command will create testps.mco containing a parser model for parsing phrase structure. MaltParser will transform the phrase structure into a dependency graph by using a very simple head-finding rule. It will perform a left-to-right search to find the leftmost lexical child. If no lexical child can be found, the head-child of the phrase will be the leftmost phrase child and the lexical head will be the lexical child of the head child recursively.
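The simple head-finding rule described above can be sketched in Python as follows. The Node class, the function names and the wrapper phrase X are hypothetical, and the sketch assumes that the children of each phrase are already listed in left-to-right order; it illustrates the rule as described here and is not taken from MaltParser's source code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                       # word form or phrase label
    children: list = field(default_factory=list)

    @property
    def is_lexical(self):
        # In this sketch a node without children is a word.
        return not self.children


def head_child(phrase):
    """Leftmost lexical child, or the leftmost phrase child if there is none."""
    for child in phrase.children:
        if child.is_lexical:
            return child
    return phrase.children[0]


def lexical_head(node):
    """Follow head children downwards until a word is reached."""
    while not node.is_lexical:
        node = head_child(node)
    return node


pn = Node("PN", [Node("Ross"), Node("Perot")])
np = Node("NP", [Node("ein"), Node("prächtiger"), Node("Diktator")])
s = Node("S", [pn, Node("wäre"), Node("vielleicht"), np])
wrapper = Node("X", [pn, np])        # hypothetical phrase without a lexical child

print(lexical_head(s).label)
print(lexical_head(wrapper).label)
```

For the phrase S the search stops at the first word among its children, while for the hypothetical phrase X it falls back to the leftmost phrase child and continues inside it.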

The options -if negraps -ir negra inform MaltParser to use the NEGRA export format. The prediction strategy -gds T.TRANS#A.DEPREL,A.HEADREL,A.PHRASE,A.ATTACH tells the parser to first predict the transition T.TRANS and, if it is a left or right arc transition, to continue predicting the sublabels A.DEPREL, A.HEADREL, A.PHRASE and A.ATTACH in that order. There is a default root label for each sublabel: -grl DEPREL=--,HEADREL=*,PHRASE=VROOT,ATTACH=0. The prediction strategy allows the feature specification file to have nine submodels (one transition model and two models for each of the four sublabels). You can start optimizing the feature model from the file examples/covnonproj_ps.xml. We use the Covington non-projective parsing algorithm because it is capable of parsing non-projective dependency graphs (a discontinuous phrase structure results in a non-projective dependency graph). If you train a parser model on the TIGER Treebank (Version 2.1), make sure that you also specify the correct character encoding, ISO8859-1 (the default is UTF-8). To parse, type the following:

prompt> java -jar malt.jar -c testps -i testps.tab -o out.negra -if negrads -ir tab -ic ISO8859-1 -of negraps -ow negra -oc ISO8859-1 -m parse

The input file must contain four columns: WORD, LEMMA, POS, MORPH. A test file can look like this:

``      --      $(      --
Ross    Ross    NE      Nom.Sg.Masc
Perot   Perot   NE      Nom.Sg.Masc
wäre    sein    VAFIN   3.Sg.Past.Subj
vielleicht      vielleicht      ADV     --
ein     ein     ART     Nom.Sg.Masc
prächtiger      prächtig        ADJA    Pos.Nom.Sg.Masc
Diktator        Diktator        NN      Nom.Sg.Masc
''      --      $(      --
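Before parsing, it can be useful to check that a test file really has four columns per token and is stored in the expected character encoding. The following Python sketch does this for raw file contents; the function name is made up, and treating blank lines as sentence separators is an assumption that this guide does not confirm.

```python
EXPECTED_COLUMNS = ("WORD", "LEMMA", "POS", "MORPH")


def read_test_tokens(raw_bytes, encoding="iso-8859-1"):
    """Decode raw file contents and return one dict per token line."""
    tokens = []
    text = raw_bytes.decode(encoding)
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines only separate sentences
        fields = line.split()
        if len(fields) != len(EXPECTED_COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(EXPECTED_COLUMNS)} columns, "
                f"found {len(fields)}")
        tokens.append(dict(zip(EXPECTED_COLUMNS, fields)))
    return tokens


# Two token lines encoded the way the TIGER example expects.
sample = ("Ross\tRoss\tNE\tNom.Sg.Masc\n"
          "wäre\tsein\tVAFIN\t3.Sg.Past.Subj\n").encode("iso-8859-1")
for token in read_test_tokens(sample):
    print(token["WORD"], token["POS"])
```

Decoding the same bytes as UTF-8 fails on the word wäre, which is the kind of mismatch the -ic and -oc options are meant to prevent.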

Head-finding rules

It is possible to define your own head-finding rules in a file. The following head-finding rules can be used for the TIGER Treebank:

CAT:AA  r       r[LABEL:HD]
CAT:AP  r       r[LABEL:HD]
CAT:AVP r       r[LABEL:HD CAT:AVP]
CAT:CAC l       l[LABEL:CJ]
CAT:CAP l       l[LABEL:CJ]
CAT:CAVP        l       l[LABEL:CJ]
CAT:CH  l       *
CAT:CNP l       l[LABEL:CJ]
CAT:CO  l       l[LABEL:CJ]
CAT:CPP l       l[LABEL:CJ]
CAT:CS  l       l[LABEL:CJ]
CAT:CVP l       l[LABEL:CJ]
CAT:CCP l       l[LABEL:CJ]
CAT:CVZ l       l[LABEL:CJ]
CAT:DL  l       l[LABEL:DH]
CAT:ISU l       *
CAT:NM  r       *
CAT:NP  r       r[LABEL:NK]
CAT:PN  l       *
CAT:PP  r       r[LABEL:NK]
CAT:S   r       r[LABEL:HD]
CAT:VROOT       l       *
CAT:VP  r       r[LABEL:HD]

The file contains several head-finding rules (one per row). The first column states the phrase label of the parent nonterminal and the second column specifies the default direction (l = left-to-right search and r = right-to-left search). The third column is a priority list of children. For example the first row CAT:AA r r[LABEL:HD] indicates that the parser should first perform a right-to-left search for an outgoing edge with a label HD if the parent nonterminal is labeled AA. If no child with incoming edge label HD can be found, then use the default direction r to search for the rightmost lexical child. If no lexical child can be found, then take the rightmost nonterminal child. Another example is CAT:AVP r r[LABEL:HD CAT:AVP], which first searches for an outgoing edge label HD if the parent nonterminal is labeled AVP. If this label cannot be found, then the search continues for a nonterminal child labeled AVP. Some of the head-finding rules have the sign * in the third column, which indicates that there is no priority list for the nonterminal. The --graph-head_rules option (-ghr flag) specifies the URL or the path to a file that contains a list of head rules.
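The search order described above can be sketched in Python as follows. The sketch parses one rule line into its three columns and then picks a head child by trying the priority list first and the default direction afterwards. The Child class, the function names and the handling of several bracketed groups in the third column are assumptions made for illustration; MaltParser's own rule parser may differ in details.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    edge: str        # incoming edge label, for example "HD"
    cat: str = ""    # phrase label for nonterminals, empty for words

    @property
    def is_lexical(self):
        return not self.cat


def parse_rule(line):
    """Split a rule line into (parent label, default direction, priorities)."""
    parent, default_dir, priority_text = line.split(None, 2)
    priorities = []
    if priority_text.strip() != "*":
        for direction, body in re.findall(r"([lr])\[([^\]]*)\]", priority_text):
            for spec in body.split():
                key, value = spec.split(":", 1)
                priorities.append((direction, key, value))
    return parent.split(":", 1)[1], default_dir, priorities


def ordered(children, direction):
    """Children in search order: as given for l, reversed for r."""
    return list(children) if direction == "l" else list(reversed(children))


def find_head(children, default_dir, priorities):
    # 1. Try each entry of the priority list in turn.
    for direction, key, value in priorities:
        for child in ordered(children, direction):
            if (child.edge if key == "LABEL" else child.cat) == value:
                return child
    # 2. Fall back to the first lexical child in the default direction.
    candidates = ordered(children, default_dir)
    for child in candidates:
        if child.is_lexical:
            return child
    # 3. Otherwise take the first nonterminal child in that direction.
    return candidates[0]


s_children = [Child("PN", "SB", "PN"), Child("wäre", "HD"),
              Child("vielleicht", "MO"), Child("NP", "PD", "NP")]
_, default_dir, priorities = parse_rule("CAT:S   r       r[LABEL:HD]")
print(find_head(s_children, default_dir, priorities).name)
```

Applied to the example sentence, the rule for S selects the child with the edge label HD, whereas a rule with * in the third column skips the first step and goes straight to the default direction.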

References