The user guide consists of the following sections:
This section contains a short guide to get familiar with MaltParser. We start by running MaltParser without any arguments by typing the following at the command line prompt (it is important that you are in the maltparser-1.9.2 directory):
prompt> java -jar maltparser-1.9.2.jar
This command will display the following output:

    -----------------------------------------------------------------------------
    MaltParser 1.9.2
    -----------------------------------------------------------------------------
    MALT (Models and Algorithms for Language Technology) Group
    Vaxjo University and Uppsala University
    Sweden
    -----------------------------------------------------------------------------

    Usage:
       java -jar maltparser-1.9.2.jar -f <path to option file> <options>
       java -jar maltparser-1.9.2.jar -h for more help and options

    help                  (  -h) : Show options
    -----------------------------------------------------------------------------
    option_file           (  -f) : Path to option file
    -----------------------------------------------------------------------------
    verbosity            *(  -v) : Verbosity level
      debug - Logging of debugging messages
      error - Logging of error events
      fatal - Logging of very severe error events
      info  - Logging of informational messages
      off   - Logging turned off
      warn  - Logging of harmful situations
    -----------------------------------------------------------------------------

    Documentation: docs/index.html
Here you can see the basic usage and options. To get all available options:
prompt> java -jar maltparser-1.9.2.jar -h
All these options are also described in a short documentation and in a full documentation.
Now we are ready to train our first parsing model. In the directory examples/data there are two data files talbanken05_train.conll and talbanken05_test.conll, which contain very small portions of the Swedish treebank Talbanken05. The example data sets are formatted according to the CoNLL-X data format. Note that these data sets are very small and that you need more training data to create a useful parsing model.
To train a default parsing model with MaltParser type the following at the command line prompt:
prompt> java -jar maltparser-1.9.2.jar -c test -i examples/data/talbanken05_train.conll -m learn
This line tells MaltParser to create a parsing model named test.mco (also known as a Single Malt configuration file) from the data in the file examples/data/talbanken05_train.conll. The parsing model gets its name from the configuration name, which is specified by the option flag -c without the file suffix .mco. The configuration name is a name of your own choice. The option flag -i tells the parser where to find the input data. The last option flag -m specifies the processing mode learn (as opposed to parse), since in this case we want to induce a model by using the default learning method (LIBLINEAR, as the log output below confirms).
MaltParser outputs the following information:

    -----------------------------------------------------------------------------
    MaltParser 1.9.2
    -----------------------------------------------------------------------------
    MALT (Models and Algorithms for Language Technology) Group
    Vaxjo University and Uppsala University
    Sweden
    -----------------------------------------------------------------------------
    Started: Fri May 02 23:45:18 CEST 2014
      Transition system    : Arc-Eager
      Parser configuration : Nivre with allow_root=true, allow_reduce=false and enforce_tree=false
      Oracle               : Arc-Eager
      Data Format          : conllx.xml
    .     1    0s    5MB
    .    10    0s    6MB
         32    0s    8MB
    Creating Liblinear model odm0.liblinear.moo
    - Read all training instances.
    - Train a parser model using LibLinear.
    - Optimize the memory usage
    - Save the Liblinear model odm0.liblinear.moo
    Learning time: 00:00:01 (1290 ms)
    Finished: Fri May 02 23:45:19 CEST 2014
Most of the logging information is self-explanatory: it tells you that the parser was started at a certain time and date and that it read sentences from a specified file containing 32 sentences. It continues with information about the learning models that are created, in this case only one LIBLINEAR model. It then saves the symbol table and all options (which cannot be changed later during parsing) and stores everything in a configuration file named test.mco. Finally, the parser informs you about the learning time.
We have now created a parsing model that we can use for parsing new sentences from the same language. It is important that unparsed sentences are formatted according to the format that was used during training (except that the output columns for head and dependency relation are missing). In this case tokens are represented by the first six columns of the CoNLL-X data format. To parse type the following:
prompt> java -jar maltparser-1.9.2.jar -c test -i examples/data/talbanken05_test.conll -o out.conll -m parse
where -c test is the name of the configuration (the prefix file name of test.mco), -i examples/data/talbanken05_test.conll tells the parser where to find the input data, -o out.conll is the output file name, and finally -m parse specifies that the parser should be executed in parsing mode.
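When the learn and parse steps above are scripted, the two command lines can be assembled programmatically. The following is a minimal Python sketch, assuming the jar file is in the current working directory; the helper names `malt_command` and `run` are hypothetical, and only the flags -c, -i, -o and -m shown above are used.

```python
import subprocess

JAR = "maltparser-1.9.2.jar"  # assumption: located in the current working directory


def malt_command(config, mode, infile, outfile=None, jar=JAR):
    """Build the argument list for one MaltParser run (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-c", config, "-i", infile]
    if outfile is not None:
        cmd += ["-o", outfile]
    cmd += ["-m", mode]
    return cmd


def run(cmd):
    """Execute the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


train = malt_command("test", "learn", "examples/data/talbanken05_train.conll")
parse = malt_command("test", "parse", "examples/data/talbanken05_test.conll",
                     outfile="out.conll")
print(" ".join(train))
print(" ".join(parse))
# To actually train and parse, call run(train) followed by run(parse).
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.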
MaltParser can be controlled by specifying values for a range of different options. The values for these options can be specified in different ways:
Method | Description | Example |
---|---|---|
Command-line option flag | Uses the option flag with a dash (-) before the option flag and a blank between the option flag and the value | -c test |
Command-line option group and option name | Uses both the option group name and the option name to specify the option, with two dashes (--) before the option group name and one dash (-) to separate the option group name and the option name. The equality sign (=) is used for separating the option and the value. | --config-name=test |
Command-line option name | Is a shorter version of Command-line option group and option name and can only be used when the option name is unambiguous. | --name=test |
Option file | The option settings are specified in an option file, formatted in XML. To tell MaltParser to read the option file the option flag -f is used. Note that command line option settings override the settings in the option file if options are specified twice. | See the option file below this table |

    <?xml version="1.0" encoding="UTF-8"?>
    <experiment>
      <optioncontainer>
        <optiongroup groupname="config">
          <option name="name" value="test"/>
        </optiongroup>
      </optioncontainer>
    </experiment>

All options are described in a short documentation and a full documentation.
An option file is useful when you have many options that differ from the default value, as is often the case when you are training a parsing model.
The option file should have the following XML format:
Element | Description |
---|---|
experiment | All other elements must be enclosed by an experiment element. |
optioncontainer | It is possible to have one or more option containers, but MaltParser 1.9.2 only uses the first option container. Later releases may make use of multiple option containers, for instance, to build ensemble systems. |
optiongroup | There can be one or more option group elements within an option container. The attribute groupname specifies the option group name (see description of all available options). |
option | An option group can consist of one or more options. The element option has two attributes: name, which corresponds to an option name, and value, which is the value of the option. Please consult the description of all available options to see all legal option names and values. |
Here is an example (examples/optionexample.xml):

    <?xml version="1.0" encoding="UTF-8"?>
    <experiment>
      <optioncontainer>
        <optiongroup groupname="config">
          <option name="name" value="example1"/>
          <option name="flowchart" value="learn"/>
        </optiongroup>
        <optiongroup groupname="singlemalt">
          <option name="parsing_algorithm" value="nivrestandard"/>
        </optiongroup>
        <optiongroup groupname="input">
          <option name="infile" value="examples/data/talbanken05_train.conll"/>
        </optiongroup>
        <optiongroup groupname="nivre">
          <option name="allow_root" value="false"/>
          <option name="allow_reduce" value="false"/>
        </optiongroup>
      </optioncontainer>
    </experiment>

To run MaltParser with the above option file type:
prompt> java -jar maltparser-1.9.2.jar -f examples/optionexample.xml
This command will create a configuration file example1.mco based on the settings in the option file. It is possible to override the options by command-line options, for example:
prompt> java -jar maltparser-1.9.2.jar -f examples/optionexample.xml -a nivreeager
which will create a configuration based on the same settings, except that the parsing algorithm is now nivreeager instead of nivrestandard. If you want to create a configuration that has the same settings as the option file using only command-line options, you need to type:
prompt> java -jar maltparser-1.9.2.jar -c example1 -m learn -i examples/data/talbanken05_train.conll -a nivrestandard -ne false -nr false
To parse using one of the three configurations you simply type:
prompt> java -jar maltparser-1.9.2.jar -c example1 -m parse -i examples/data/talbanken05_test.conll -o out1.conll
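When many experiments are run, an option file in the format described above can also be generated rather than written by hand. The following is a minimal Python sketch that reproduces the structure of examples/optionexample.xml; the function name `build_option_file` is hypothetical, and the element and attribute names are the ones listed in the table above.

```python
import xml.etree.ElementTree as ET


def build_option_file(groups):
    """Return an option file as an XML string.

    `groups` maps an option group name to a dict of option name -> value.
    Only one optioncontainer is written, since only the first one is used.
    """
    experiment = ET.Element("experiment")
    container = ET.SubElement(experiment, "optioncontainer")
    for group_name, options in groups.items():
        group = ET.SubElement(container, "optiongroup", groupname=group_name)
        for name, value in options.items():
            ET.SubElement(group, "option", name=name, value=value)
    body = ET.tostring(experiment, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_text = build_option_file({
    "config": {"name": "example1", "flowchart": "learn"},
    "singlemalt": {"parsing_algorithm": "nivrestandard"},
    "input": {"infile": "examples/data/talbanken05_train.conll"},
    "nivre": {"allow_root": "false", "allow_reduce": "false"},
})
print(xml_text)
```

The resulting string can be written to a file with UTF-8 encoding and passed to MaltParser with the -f flag.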
The purpose of the configuration is to gather information about all settings and files into one file. During learning, the configuration is created and stored in a configuration file with the file suffix .mco. This configuration file can later be reused whenever the trained model is used to parse new data. Potentially there can be several types of configuration, but MaltParser 1.9.2 only knows one type: the Single Malt configuration (singlemalt).
MaltParser has seven pre-defined flow charts that describe what tasks MaltParser should perform. These seven flow charts are:
Name | Description |
---|---|
learn | Creates a Single Malt configuration and induces a parsing model from input data. |
parse | Parses sentences using a Single Malt configuration. |
info | Prints information about a configuration. |
unpack | Unpacks a configuration into a directory with the same name. |
proj | Creates a configuration and projectivizes input data without inducing a parsing model. |
deproj | Deprojectivizes input data using a configuration. |
convert | A simple data format converter |
A Single Malt configuration creates a parsing model based on one set of option values. The learn and parse modes are explained above in Train a parsing model and Parse data with your parsing model; the info, unpack, proj and deproj modes are described below using the same example.
Sometimes it is useful to get information about a configuration, for instance, to know which settings have been used when creating the configuration. To get this information you type:
prompt> java -jar maltparser-1.9.2.jar -c test -m info
This will output a lot of information about the configuration:

    -----------------------------------------------------------------------------
    MaltParser 1.9.2
    -----------------------------------------------------------------------------
    MALT (Models and Algorithms for Language Technology) Group
    Vaxjo University and Uppsala University
    Sweden
    -----------------------------------------------------------------------------
    Started: Fri May 02 23:49:37 CEST 2014

    CONFIGURATION
    Configuration name: test
    Configuration type: singlemalt
    Created: Fri May 02 23:45:18 CEST 2014

    SYSTEM
    Operating system architecture: amd64
    Operating system name: Linux
    JRE vendor name: Oracle Corporation
    JRE version number: 1.7.0_55

    MALTPARSER
    Version: 1.8
    Build date: May 2 2014

    SETTINGS
    2planar
      reduceonswitch (-2pr) false
    config
      workingdir (  -w) user.dir
      name (  -c) test
      logging ( -cl) info
      flowchart (  -m) learn
      type (  -t) singlemalt
      logfile (-lfi) stdout
      url (  -u)
    covington
      allow_root ( -cr) true
      allow_shift ( -cs) false
    graph
      max_sentence_length (-gsl) 256
      head_rules (-ghr)
      root_label (-grl) ROOT
    guide
      decision_settings (-gds) T.TRANS+A.DEPREL
      kbest_type ( -kt) rank
      data_split_structure (  -s)
      learner (  -l) liblinear
      kbest (  -k) -1
      features (  -F)
      classitem_separator (-gcs) ~
      data_split_column (  -d)
      data_split_threshold (  -T) 50
    input
      infile (  -i) examples/data/talbanken05_train.conll
      reader ( -ir) tab
      iterations ( -it) 1
      charset ( -ic) UTF-8
      reader_options (-iro)
      format ( -if) /appdata/dataformat/conllx.xml
    lib
      save_instance_files ( -li) false
      external ( -lx)
      verbosity ( -lv) silent
      options ( -lo)
    multiplanar
      planar_root_handling (-prh) normal
    nivre
      allow_root ( -nr) true
      enforce_tree ( -nt) false
      allow_reduce ( -ne) false
    output
      charset ( -oc) UTF-8
      writer_options (-owo)
      format ( -of)
      writer ( -ow) tab
      outfile (  -o)
    planar
      no_covered_roots (-pcov) false
      connectedness (-pcon) none
      acyclicity (-pacy) true
    pproj
      covered_root (-pcr) none
      marking_strategy ( -pp) none
      lifting_order (-plo) shortest
    singlemalt
      parsing_algorithm (  -a) nivreeager
      null_value ( -nv) one
      guide_model ( -gm) single
      propagation ( -fp)
      diagnostics ( -di) false
      use_partial_tree ( -up) false
      diafile (-dif) stdout
      mode ( -sm) parse

    DEPENDENCIES
    --guide-features (  -F) NivreEager.xml

    FEATURE MODEL
    MAIN
    InputColumn(FORM,Input[0])
    InputColumn(FORM,Input[1])
    InputColumn(FORM,Stack[0])
    InputColumn(FORM,head(Stack[0]))
    InputColumn(POSTAG,Input[0])
    InputColumn(POSTAG,Input[1])
    InputColumn(POSTAG,Input[2])
    InputColumn(POSTAG,Input[3])
    InputColumn(POSTAG,Stack[0])
    InputColumn(POSTAG,Stack[1])
    Merge(InputColumn(POSTAG,Input[0]),OutputColumn(DEPREL,ldep(Input[0])))
    Merge(InputColumn(POSTAG,Stack[0]),InputColumn(POSTAG,Input[0]))
    Merge(InputColumn(POSTAG,Stack[0]),OutputColumn(DEPREL,Stack[0]))
    Merge3(InputColumn(POSTAG,Input[0]),InputColumn(POSTAG,Input[1]),InputColumn(POSTAG,Input[2]))
    Merge3(InputColumn(POSTAG,Input[1]),InputColumn(POSTAG,Input[2]),InputColumn(POSTAG,Input[3]))
    Merge3(InputColumn(POSTAG,Stack[0]),InputColumn(POSTAG,Input[0]),InputColumn(POSTAG,Input[1]))
    Merge3(InputColumn(POSTAG,Stack[0]),OutputColumn(DEPREL,ldep(Stack[0])),OutputColumn(DEPREL,rdep(Stack[0])))
    Merge3(InputColumn(POSTAG,Stack[1]),InputColumn(POSTAG,Stack[0]),InputColumn(POSTAG,Input[0]))
    OutputColumn(DEPREL,Stack[0])
    OutputColumn(DEPREL,ldep(Input[0]))
    OutputColumn(DEPREL,ldep(Stack[0]))
    OutputColumn(DEPREL,rdep(Stack[0]))

    liblinear INTERFACE

    Finished: Fri May 02 23:49:37 CEST 2014

The information is grouped into different categories:
Category | Description |
---|---|
CONFIGURATION | The name and type of the configuration and the date when it was created. |
SYSTEM | Information about the system that was used when creating the configuration, such as processor, operating system and version of Java Runtime Environment (JRE). |
MALTPARSER | Version of MaltParser and when it was built. |
SETTINGS | All option settings divided into several categories. |
DEPENDENCIES | In some cases the parser self-corrects when an illegal combination of options is specified or some option is missing. In the example above the feature specification file is not specified and the parser uses the default feature specification file for the Nivre arc-eager parsing algorithm. |
FEATURE MODEL | Outputs the content of the feature specification file. |
<LEARNER> INTERFACE | Information about the interface to the learner, in this case LIBLINEAR. |
<LEARNER> SETTINGS | All settings of specific learner options, in this case LIBLINEAR. |
It is possible to unpack the configuration file test.mco by typing:
prompt> java -jar maltparser-1.9.2.jar -c test -m unpack
This command will create a new directory test containing the following files:
File | Description |
---|---|
conllx.xml | XML document describing the data format. |
NivreEager.xml | XML document containing the feature model specification. |
odm0.liblinear.moo | The LIBLINEAR model that is used for predicting the next parsing action. (With the LIBSVM learner the corresponding files are odm0.libsvm.moo and odm0.libsvm.map.) |
savedoptions.sop | All option settings that cannot be changed during parsing. |
symboltables.sym | All distinct symbols in the training data, divided into different columns. For example, the column POSTAG in the CoNLL-X format has its own symbol table with all distinct values occurring in the training data. |
test_singlemalt.info | Information about the configuration (same as described above). |
It is possible to projectivize an input file, with or without involving parsing.
All non-projective arcs in the input file are replaced by projective arcs by applying a lifting operation. The lifts are encoded in the dependency labels of the lifted arcs. The encoding scheme can be varied using the flag -pp (marking_strategy), and there are currently five of them: none, baseline, head, path and head+path. (See Nivre & Nilsson (2005) for more details concerning the encoding schemes.) A dependency file can be projectivized using the head encoding by typing:
prompt> java -jar maltparser-1.9.2.jar -c pproj -m proj -i examples/data/talbanken05_test.conll -o projectivized.conll -pp head
There is one additional option for the projectivization called covered_root, which is mainly used for handling dangling punctuation. Depending on the treebank, a punctuation token located in the middle of a sentence can attach directly to the root, which entails that all arcs crossing the head arc of the punctuation token are non-projective. This, in turn, results in lots of (unnecessary) lifts, and can be avoided by using the covered_root flag -pcr. This option has four values: none, left, right and head. For the last three values, tokens like dangling punctuation are then attached to one of the tokens connected by the shortest arc covering the token, either the leftmost (left), rightmost (right), or head (head) token of the covering arc. This will prevent all the unnecessary lifts.
The projectivization and deprojectivization (below), including the encoding schemes, are known as pseudo-projective transformations and are described in more detail in Nivre & Nilsson (2005). The only difference compared to Nivre & Nilsson is that it is the most deeply nested non-projective arc that is lifted first, not the shortest one.
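To make the notion of a non-projective arc concrete, the following Python sketch lists the arcs that projectivization would have to lift. It uses the common definition that an arc is projective when every token between the head and the dependent is dominated by the head; this is an illustration only, not MaltParser's implementation, and the function names are hypothetical.

```python
def dominated_by(heads, node, ancestor):
    """True if `ancestor` lies on the head path from `node` to the root (0)."""
    for _ in range(len(heads)):          # bounded walk, guards against cycles
        node = heads[node - 1]
        if node == ancestor:
            return True
        if node == 0:
            return False
    return False


def non_projective_arcs(heads):
    """Return (head, dependent) pairs of non-projective arcs.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    arcs = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((head, dependent))
        for between in range(low + 1, high):
            if not dominated_by(heads, between, head):
                arcs.append((head, dependent))
                break
    return arcs


# Token 2 depends on token 4, but token 3 (between them) is not dominated by 4.
print(non_projective_arcs([3, 4, 0, 3]))
```

A file for which this list is empty for every sentence is already projective, so projectivizing it leaves the arcs unchanged.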
MaltParser can also be used to deprojectivize a projective file containing pseudo-projective encoding, with or without involving parsing, where it is assumed that the configuration pproj contains the same encoding scheme as during projectivization. It could look like this:
prompt> java -jar maltparser-1.9.2.jar -c pproj -m deproj -i projectivized.conll -o deprojectivized.conll
The file deprojectivized.conll will contain the deprojectivized data. Note that it is only the encoding schemes head, path and head+path that actively try to recover the non-projective arcs.
The format and encoding of the input and output data are controlled by the format, reader, writer and charset options in the input and output option groups. The CoNLL-X, CoNLL-U and Malt-TAB data format specification files are already included in the MaltParser jar-file (maltparser-1.9.2.jar) in the appdata/dataformat directory. The CoNLL-X data format specification file looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <dataformat name="conllx">
      <column name="ID" category="INPUT" type="INTEGER"/>
      <column name="FORM" category="INPUT" type="STRING"/>
      <column name="LEMMA" category="INPUT" type="STRING"/>
      <column name="CPOSTAG" category="INPUT" type="STRING"/>
      <column name="POSTAG" category="INPUT" type="STRING"/>
      <column name="FEATS" category="INPUT" type="STRING"/>
      <column name="HEAD" category="HEAD" type="INTEGER"/>
      <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
      <column name="PHEAD" category="IGNORE" type="INTEGER" default="_"/>
      <column name="PDEPREL" category="IGNORE" type="STRING" default="_"/>
    </dataformat>

A data format specification file has two types of XML elements. First, there is the dataformat element with the attribute name, which gives the data format a name. The dataformat element encloses one or more column elements, which contain information about individual columns. The column elements have the following attributes:
Attribute | Description |
---|---|
name | The column name. Note that the column name can be used by an option and within a feature model specification as an identifier of the column. |
category | The column category. The CoNLL-X specification above uses the categories INPUT, HEAD, DEPENDENCY_EDGE_LABEL and IGNORE. |
type | Defines the data type of the column and/or its treatment during learning and parsing. The CoNLL-X specification above uses the types INTEGER and STRING. |
default | The default output for columns that have the column category IGNORE. |
It is possible to define your own input/output format and then supply the data format specification file with the format option.
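Before supplying a custom specification, it can help to check which columns it declares. The following Python sketch reads a data format specification with the standard library XML parser and groups the column names by category; the function name `read_columns` is hypothetical and the specification text is the CoNLL-X example from above.

```python
import xml.etree.ElementTree as ET

CONLLX_SPEC = """<dataformat name="conllx">
  <column name="ID" category="INPUT" type="INTEGER"/>
  <column name="FORM" category="INPUT" type="STRING"/>
  <column name="LEMMA" category="INPUT" type="STRING"/>
  <column name="CPOSTAG" category="INPUT" type="STRING"/>
  <column name="POSTAG" category="INPUT" type="STRING"/>
  <column name="FEATS" category="INPUT" type="STRING"/>
  <column name="HEAD" category="HEAD" type="INTEGER"/>
  <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
  <column name="PHEAD" category="IGNORE" type="INTEGER" default="_"/>
  <column name="PDEPREL" category="IGNORE" type="STRING" default="_"/>
</dataformat>"""


def read_columns(spec_xml):
    """Return (name, category, type) for each column, in file order."""
    root = ET.fromstring(spec_xml)
    return [(col.get("name"), col.get("category"), col.get("type"))
            for col in root.findall("column")]


columns = read_columns(CONLLX_SPEC)
input_columns = [name for name, category, _ in columns if category == "INPUT"]
print(input_columns)
```

The six INPUT columns found this way are the ones an unparsed input file has to provide.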
Currently, MaltParser only supports tab-separated data files, which means that a sentence in a data file in the CoNLL-X data format could look like this:

    1	Den	_	PO	PO	DP	2	SS	_	_
    2	blir	_	V	BV	PS	0	ROOT	_	_
    3	gemensam	_	AJ	AJ	_	2	SP	_	_
    4	för	_	PR	PR	_	2	OA	_	_
    5	alla	_	PO	PO	TP	6	DT	_	_
    6	inkomsttagare	_	N	NN	HS	4	PA	_	_
    7	oavsett	_	PR	PR	_	2	AA	_	_
    8	civilstånd	_	N	NN	SS	7	PA	_	_
    9	.	_	P	IP	_	2	IP	_	_

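A tab-separated file of this kind is straightforward to read in a script, for instance to inspect parser output. The following Python sketch assumes the ten CoNLL-X columns listed in the data format specification and the CoNLL-X convention that sentences are separated by blank lines; the function name `read_sentences` is hypothetical.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
           "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(lines):
    """Yield each sentence as a list of dicts keyed by column name."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():             # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:                         # file may not end with a blank line
        yield sentence


sample = [
    "1\tDen\t_\tPO\tPO\tDP\t2\tSS\t_\t_\n",
    "2\tblir\t_\tV\tBV\tPS\t0\tROOT\t_\t_\n",
    "\n",
]
for sentence in read_sentences(sample):
    print([(token["FORM"], token["HEAD"], token["DEPREL"]) for token in sentence])
```

To read a real file, pass an open file object, for example `open("out.conll", encoding="utf-8")`, in place of `sample`.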
Finally, the character encoding can be specified with the charset option, which MaltParser uses to select the corresponding Java Charset.
Any deterministic parsing algorithm compatible with the MaltParser architecture can be implemented in the MaltParser package. MaltParser 1.9.2 contains three families of parsing algorithms: Nivre, Covington and Stack.
Nivre's algorithm (Nivre 2003, Nivre 2004) is a linear-time algorithm limited to projective dependency structures. It can be run in arc-eager (-a nivreeager) or arc-standard (-a nivrestandard) mode. In addition, the allow_root option decides whether the parser will start parsing with an artificial root token on the stack (true) or with an empty stack (false), and the allow_reduce option decides whether the reduce transition is allowed even if the token on top of the stack does not have a head (true) or whether only attached tokens can be reduced (false). The enforce_tree option can be used with the arc-eager version to make sure that the output parse is a tree.
Nivre's algorithm uses two data structures:
NB: Please note that the allow_root and allow_reduce options replace the older root_handling option from version 1.7. In order to replicate the behavior of older versions, use the following settings:
The new default behavior is allow_root = true and allow_reduce = false.
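The arc-eager mode under the default settings can be illustrated with a small simulation. The following Python sketch is not MaltParser's implementation: it is an unlabeled arc-eager transition system with a simple static oracle, assuming an artificial root on the stack (allow_root = true), reduction of attached tokens only (allow_reduce = false) and a projective gold tree. The function name `arc_eager_parse` is hypothetical.

```python
def arc_eager_parse(gold_heads):
    """Replay the arc-eager transitions that rebuild a projective tree.

    gold_heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Returns (transitions, predicted_heads).
    """
    n = len(gold_heads)
    gold = {i + 1: head for i, head in enumerate(gold_heads)}
    stack, buffer = [0], list(range(1, n + 1))   # root token starts on the stack
    heads, transitions = {}, []
    while buffer:
        top, front = stack[-1], buffer[0]
        if top != 0 and top not in heads and gold[top] == front:
            heads[top] = front                   # LEFT-ARC: front is head of top
            stack.pop()
            transitions.append("LEFT-ARC")
        elif gold[front] == top:
            heads[front] = top                   # RIGHT-ARC: top is head of front
            stack.append(buffer.pop(0))
            transitions.append("RIGHT-ARC")
        elif top in heads and any(gold[front] == k or gold.get(k) == front
                                  for k in stack[:-1]):
            stack.pop()                          # REDUCE: top is already attached
            transitions.append("REDUCE")
        else:
            stack.append(buffer.pop(0))          # SHIFT
            transitions.append("SHIFT")
    # Tokens that never received a head are reported with head 0 here.
    return transitions, [heads.get(i, 0) for i in range(1, n + 1)]


gold = [2, 0, 2, 2, 6, 4, 2, 7, 2]   # HEAD column of a nine-token sentence
transitions, predicted = arc_eager_parse(gold)
print(transitions)
print(predicted == gold)
```

Each token is pushed onto the stack once and popped at most once, which is why the number of transitions grows linearly with sentence length.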
Covington's algorithm (Covington 2001) is a quadratic-time algorithm for unrestricted dependency structures, which proceeds by trying to link each new token to each preceding token. It can be run in a projective (-a covproj) mode, where the linking operation is restricted to projective dependency structures, or in a non-projective (-a covnonproj) mode, allowing non-projective (but acyclic) dependency structures. In addition, there are two options, allow_shift and allow_root, that control the behavior of Covington's algorithm.
Covington's algorithm uses four data structures:
The Stack algorithms are similar to Nivre's algorithm in that they use a stack and a buffer but differ in that they add arcs between the two top nodes on the stack (rather than the top node on the stack and the first node in the buffer) and that they guarantee that the output is a tree without post-processing. The Projective (-a stackproj) Stack algorithm uses essentially the same transitions as the arc-standard version of Nivre's algorithm and is limited to projective dependency trees. The Eager (-a stackeager) and Lazy (-a stacklazy) Stack algorithms in addition make use of a swap transition, which makes it possible to derive arbitrary non-projective dependency trees. The Eager algorithm applies the swap transition as soon as possible, while the Lazy algorithm postpones swapping as long as possible. The Stack algorithms are described in Nivre (2009) and Nivre, Kuhlmann and Hall (2009).
The Stack algorithms use three data structures:
Note that it is only the swap transition that can move nodes from Stack back to the buffer, which means that for the Projective Stack algorithm Input will always be empty and Lookahead will always contain all the nodes in the buffer.
The Planar algorithm (Gómez-Rodríguez and Nivre, 2010) is a linear-time algorithm limited to planar dependency structures, the set of structures that do not contain any crossing links. It works in a similar way to Nivre's algorithm in arc-eager mode, but with more fine-grained transitions. The connectedness, acyclicity and no_covered_roots options can be used to configure which additional constraints, apart from planarity, will be imposed on the target set of dependency graphs.
Just like Nivre's algorithm, the Planar algorithm uses two data structures:
The 2-Planar algorithm (Gómez-Rodríguez and Nivre, 2010) is a linear-time algorithm that can be used to parse 2-planar dependency structures, i.e., those whose links may be coloured with two colours in such a way that no two same-coloured links cross. The 2-planar algorithm uses two stacks, one of which is the active stack at a given time while the other is the inactive stack. Input words are always pushed into both stacks at the same time, but then the algorithm behaves like the Planar parser working with only one stack (the active stack), until a Switch transition is executed: this transition switches the stacks around, making the previously inactive stack active and vice versa.
The reduceonswitch option can be used to change the specific behavior of Switch transitions, while the planar_root_handling option can be employed to change the algorithm's behavior with respect to root tokens.
The 2-Planar algorithm uses three data structures:
MaltParser uses history-based feature models for predicting the next action in the deterministic derivation of a dependency structure, which means that it uses features of the partially built dependency structure together with features of the (tagged) input string. Features that make use of the partially built dependency structure correspond to the OUTPUT category of the data format, for example DEPREL in the CoNLL-X data format, and features of the input string correspond to the INPUT category of the data format, for example CPOSTAG and FORM.
The feature model specification must be specified in an XML file according to the format below or in a text file formatted according to the specification given by the MaltParser 0.x user guide. The latter specification format should be saved in a text file where the file name must end with the file suffix .par.
Below you can see an example of the new XML format (Nivre arc-eager default feature model):

    <?xml version="1.0" encoding="UTF-8"?>
    <featuremodels>
      <featuremodel name="nivreeager">
        <feature>InputColumn(POSTAG, Stack[0])</feature>
        <feature>InputColumn(POSTAG, Input[0])</feature>
        <feature>InputColumn(POSTAG, Input[1])</feature>
        <feature>InputColumn(POSTAG, Input[2])</feature>
        <feature>InputColumn(POSTAG, Input[3])</feature>
        <feature>InputColumn(POSTAG, Stack[1])</feature>
        <feature>OutputColumn(DEPREL, Stack[0])</feature>
        <feature>OutputColumn(DEPREL, ldep(Stack[0]))</feature>
        <feature>OutputColumn(DEPREL, rdep(Stack[0]))</feature>
        <feature>OutputColumn(DEPREL, ldep(Input[0]))</feature>
        <feature>InputColumn(FORM, Stack[0])</feature>
        <feature>InputColumn(FORM, Input[0])</feature>
        <feature>InputColumn(FORM, Input[1])</feature>
        <feature>InputColumn(FORM, head(Stack[0]))</feature>
      </featuremodel>
    </featuremodels>

Each feature is defined using a functional notation with three types of functions:
Type | Description |
---|---|
Address function | There are two types of address functions: parsing algorithm specific functions and dependency graph functions. The parsing algorithm specific functions have the form Data-structure[i], where Data-structure is a data structure used by a specific parsing algorithm and i is an offset from the start position in this data structure, as in Stack[0] and Input[1] in the feature model above. The dependency graph address functions take a graph node as argument and navigate from this graph node to another graph node (if possible). There are seven dependency graph address functions, among them head, ldep and rdep, which are used in the feature model above. |
Feature function | A feature function takes at least one address function as input and returns a feature value defined in terms of the input arguments. There are seven feature functions available, among them InputColumn and OutputColumn, which are used in the feature model above. |
Feature map function | Maps a feature value onto a new set of values and takes as arguments a feature specification and one or more arguments that control the mapping. There is one feature map function. |
MaltParser is equipped with a default feature model specification for each parsing algorithm and it automatically identifies the corresponding feature model specification. It is possible to define your own feature model specification using the description above and using the --guide-features option to specify the feature model specification file.
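How an address function and a feature function combine can be illustrated with a toy evaluator. The following Python sketch handles only specifications of the form InputColumn(COLUMN, Stack[i]) or InputColumn(COLUMN, Input[i]); it assumes that offset 0 denotes the top of the stack and the next input token respectively, and that out-of-range addresses yield no value. All function names are hypothetical and none of this is MaltParser's own code.

```python
import re

ADDRESS = re.compile(r"^(Stack|Input)\[(\d+)\]$")
FEATURE = re.compile(r"^InputColumn\(\s*(\w+)\s*,\s*(\w+\[\d+\])\s*\)$")


def resolve(address, stack, buffer):
    """Map an address such as 'Stack[0]' to a token id, or None if out of range."""
    match = ADDRESS.match(address.strip())
    if match is None:
        raise ValueError("unsupported address function: " + address)
    name, offset = match.group(1), int(match.group(2))
    if name == "Stack":
        # Assumption: offset 0 is the top of the stack (the last list element).
        return stack[-1 - offset] if offset < len(stack) else None
    return buffer[offset] if offset < len(buffer) else None


def evaluate(spec, tokens, stack, buffer):
    """Evaluate one InputColumn feature against a parser state."""
    match = FEATURE.match(spec.strip())
    if match is None:
        raise ValueError("unsupported feature specification: " + spec)
    column, address = match.groups()
    node = resolve(address, stack, buffer)
    return None if node is None else tokens[node][column]


tokens = {
    1: {"FORM": "Den", "POSTAG": "PO"},
    2: {"FORM": "blir", "POSTAG": "BV"},
    3: {"FORM": "gemensam", "POSTAG": "AJ"},
}
stack, buffer = [1], [2, 3]
for spec in ["InputColumn(POSTAG, Stack[0])", "InputColumn(FORM, Input[1])",
             "InputColumn(POSTAG, Stack[1])"]:
    print(spec, "->", evaluate(spec, tokens, stack, buffer))
```

Dependency graph address functions such as head or ldep would add one more lookup step between resolving the address and reading the column value.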
MaltParser can be used with different learning algorithms to induce classifiers from training data. From version 1.3 there are two built-in learners: LIBSVM and LIBLINEAR.
LIBSVM (Chang and Lin 2001) is a machine learning package for support vector machines with different kernels. Information about different options can be found on the LIBSVM web site.
LIBLINEAR (Fan et al. 2008) is a machine learning package for linear classifiers. Information about different options can be found on the LIBLINEAR web site.
From version 1.1 of MaltParser it is possible to choose different prediction strategies. Previously, MaltParser (version 1.0.4 and earlier) combined the prediction of the transition with the prediction of the arc label into one complex prediction with one feature model. With MaltParser 1.1 and later versions it is possible to divide the prediction of the parser action into several predictions. For example with the Nivre arc-eager algorithm, it is possible to first predict the transition; if the transition is SHIFT or REDUCE the nondeterminism is resolved, but if the predicted transition is RIGHT-ARC or LEFT-ARC the parser continues to predict the arc label. This prediction strategy enables the system to have three different feature models: one for predicting the transition and two for predicting the arc label (RIGHT-ARC and LEFT-ARC).
To control the prediction strategy the --guide-decision_settings option is used with the following notation:
Notation | Name | Description |
---|---|---|
T.TRANS+A.DEPREL | Combined prediction | Combines the prediction of the transition (T.TRANS) and the arc label (A.DEPREL). This is the default setting of MaltParser 1.1 and was the only setting available for previous versions of MaltParser. |
T.TRANS,A.DEPREL | Sequential prediction | First predicts the transition (T.TRANS) and continues to predict the arc label (A.DEPREL) if the transition requires an arc label. |
T.TRANS#A.DEPREL | Branching prediction | First predicts the transition (T.TRANS) and if the transition does not require any arc label then the nondeterminism is resolved, but if the predicted transition requires an arc label then the parser continues to predict the arc label. If the transition is a left arc transition it predicts the arc label using the corresonding model for left arc transition and if it is a right arc transition it uses the right arc model. |
<?xml version="1.0" encoding="UTF-8"?> <featuremodels> <featuremodel name="sequential"> <submodel name="T.TRANS"> <feature>InputColumn(POSTAG, Stack[0])</feature> <feature>InputColumn(POSTAG, Input[0])</feature> <feature>InputColumn(POSTAG, Input[1])</feature> ... </submodel> <submodel name="A.DEPREL"> <feature>InputColumn(POSTAG, Stack[0])</feature> <feature>InputColumn(POSTAG, Input[0])</feature> <feature>InputColumn(POSTAG, Input[1])</feature> ... <feature>InputColumn(FORM,ldep(Input[0]))</feature> <feature>InputColumn(FORM,rdep(Stack[0]))</feature> </submodel> </featuremodel> </featuremodels>When using branching prediction it is possible to use three submodels (T.TRANS, RA.A.DEPREL and LA.A.DEPREL), where RA denotes the right arc model and LA the left arc model:
<?xml version="1.0" encoding="UTF-8"?>
<featuremodels>
  <featuremodel name="branched">
    <submodel name="T.TRANS">
      <feature>InputColumn(POSTAG, Stack[0])</feature>
      <feature>InputColumn(POSTAG, Input[0])</feature>
      <feature>InputColumn(POSTAG, Input[1])</feature>
      ...
    </submodel>
    <submodel name="RA.A.DEPREL">
      <feature>InputColumn(POSTAG, Stack[0])</feature>
      <feature>InputColumn(POSTAG, Input[0])</feature>
      <feature>InputColumn(POSTAG, Input[1])</feature>
      ...
      <feature>InputColumn(FORM,ldep(Input[0]))</feature>
      <feature>InputColumn(FORM,rdep(Stack[0]))</feature>
    </submodel>
    <submodel name="LA.A.DEPREL">
      <feature>InputColumn(POSTAG, Stack[0])</feature>
      <feature>InputColumn(POSTAG, Input[0])</feature>
      <feature>InputColumn(POSTAG, Input[1])</feature>
      ...
      <feature>InputColumn(FORM,ldep(Input[0]))</feature>
      <feature>InputColumn(FORM,rdep(Stack[0]))</feature>
    </submodel>
  </featuremodel>
</featuremodels>

If the feature specification file does not contain any submodels, the parser uses the same feature model for all submodels.
Since MaltParser 1.4 it is possible to parse with partial trees, i.e., sentences may be input with a partial dependency structure, a subgraph of a complete dependency tree. To parse with partial trees you need to do the following:
The two data columns should look like this:
<column name="PARTHEAD" category="INPUT" type="INTEGER"/>
<column name="PARTDEPREL" category="INPUT" type="STRING"/>
Note: To benefit from the partial dependency structure, the parser model should also be trained on partial trees. Moreover, since arcs can only be added between roots of the partial tree, the partial tree should satisfy the following constraint: if an arc (i, j) is included, then the subtree rooted at j in the complete tree must also be included.
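The constraint in the note above can be checked mechanically when the complete tree is known, for example when preparing training data. The following Java sketch is one way to do it under stated assumptions: tokens are numbered from 1, head 0 denotes the artificial root, and a token without a given arc is marked with -1. The class name, the -1 marker, and the array layout are choices made for this example and are not taken from MaltParser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of MaltParser.
class PartialTreeCheck {
    /** Assumed marker for "no arc given for this token" in the partial tree. */
    static final int NO_ARC = -1;

    /**
     * Returns the tokens that make the partial tree invalid.
     * Index 0 of both arrays is unused; completeHead[d] is the head of token d
     * in the complete tree, partHead[d] is its head in the partial tree or NO_ARC.
     */
    static List<Integer> violations(int[] completeHead, int[] partHead) {
        if (completeHead.length != partHead.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        List<Integer> bad = new ArrayList<>();
        for (int d = 1; d < completeHead.length; d++) {
            if (partHead[d] != NO_ARC) {
                // A given arc must agree with the complete tree (subgraph property).
                if (partHead[d] != completeHead[d]) {
                    bad.add(d);
                }
                continue;
            }
            // Token d has no arc. If the arc into its head h is given, then the
            // subtree rooted at h is incomplete, which violates the constraint.
            int h = completeHead[d];
            if (h != 0 && partHead[h] != NO_ARC) {
                bad.add(d);
            }
        }
        return bad;
    }
}
```

Checking each missing arc against its immediate head is sufficient: if a missing arc lay deeper inside an included subtree, some token on the path up to the subtree root would fail this local test.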
Since MaltParser 1.4 it is possible to propagate column values towards the root of the dependency graph when a labeled transition is performed. The propagation is managed by a propagation specification file formatted in XML with the following attributes:
Attribute | Description |
---|---|
FROM | The data column from which the values are copied. |
TO | The data column to which the values are copied. This data column should not exist in the data format and the values are interpreted as sets. When a new value is copied to this column, the result is the set union of the new value and the old value. The atomic values (set members) are separated by the vertical bar sign (\|). |
FOR | A subset of values that can be copied (other values will not be copied). If empty then all values will be copied. |
OVER | A subset of dependency labels that allow propagation when a labeled transition is performed. If empty then all dependency labels allow propagation. |
Below you can see an example of a propagation specification file:
<?xml version="1.0" encoding="UTF-8"?>
<propagations>
  <propagation name="coordination">
    <from>POSTAG</from>
    <to>CJ-POSTAG</to>
    <for></for>
    <over>CJ</over>
  </propagation>
  <propagation name="valency">
    <from>DEPREL</from>
    <to>VALENCY</to>
    <for>EO|ES|FO|FS|IO|OA|OO|OP|SP|SS|VO|VS</for>
    <over></over>
  </propagation>
</propagations>
The top half specifies that POSTAG values should be copied to the CJ-POSTAG field of the head, whenever an arc with the label CJ (for conjunct) is created. Assuming an analysis of coordination where the coordinating conjunction is the head of coordinate structure, this will have the effect of propagating information about the POSTAG values of the conjuncts to the head of the coordinate structure.
The bottom half specifies that DEPREL values should be copied to the VALENCY field of the head, whenever an arc labeled by one of the labels listed in the FOR parameter is created. Provided that these labels denote valency-bound functions, this will have the effect of propagating information about satisfaction of valency constraints to the head.
New columns introduced in the TO attribute of a propagation specification can be referenced in feature specifications using the function InputTable, e.g. <feature>InputTable(CJ-POSTAG, Stack[0])</feature>.
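To make the FOR, OVER and set-union rules described above concrete, here is a simplified Java sketch of a single propagation step. It models only the behaviour stated in the attribute table and is not MaltParser's implementation; the class and method names are hypothetical, and values are assumed to be plain strings whose set members are joined with the | sign.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of one propagation step; not MaltParser code.
class PropagationSketch {

    /** Splits a |-separated value into its set members (empty input gives an empty set). */
    static Set<String> parseSet(String value) {
        Set<String> members = new LinkedHashSet<>();
        if (value != null && !value.isEmpty()) {
            for (String member : value.split("\\|")) {
                members.add(member);
            }
        }
        return members;
    }

    /**
     * Returns the new TO value of the head after an arc with label arcLabel
     * has been created from the head to a dependent whose FROM value is depFromValue.
     */
    static String propagate(String headToValue, String depFromValue,
                            String arcLabel, String forSpec, String overSpec) {
        Set<String> allowedValues = parseSet(forSpec);
        Set<String> allowedLabels = parseSet(overSpec);
        // An empty FOR or OVER specification allows everything.
        boolean valueOk = allowedValues.isEmpty() || allowedValues.contains(depFromValue);
        boolean labelOk = allowedLabels.isEmpty() || allowedLabels.contains(arcLabel);
        if (!valueOk || !labelOk) {
            return headToValue; // nothing is copied
        }
        // Set union of the old TO value and the newly copied value.
        Set<String> merged = parseSet(headToValue);
        merged.add(depFromValue);
        return String.join("|", merged);
    }
}
```

With the "coordination" rule from the example file, two conjuncts tagged NN and VB attached with the label CJ would leave NN|VB in the CJ-POSTAG field of the head, while an arc with any other label leaves the field unchanged.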
MaltParser 1.1 and MaltParser 1.2 can be turned into a phrase structure parser that recovers both continuous and discontinuous phrases with both phrase labels and grammatical functions. The parser induces a parser model from treebank data by automatically transforming the phrase structure representations into dependency representations with complex arc labels, which makes it possible to recover the phrase structure with both phrase labels and grammatical functions (See Hall (2008), Hall and Nivre (2008a) and Hall and Nivre (2008b) for more details).
Note: The implementation of phrase structure parsing has been removed in later releases of MaltParser. Please download MaltParser 1.2 and read the offline user guide of this version to parse phrase structure with MaltParser 1.2.
Since MaltParser 1.8 there is a new interface to MaltParser, located in org.maltparser.concurrent, which contains the following classes:
This interface can only be used at parsing time and is intended to be usable in a multi-threaded environment. If you need an interface to MaltParser during training of parser models, you have to use the old interface (see below).
The concurrent interface uses a more lightweight parser that is intended to support almost all features. One known exception is feature propagation, which is not supported by the lightweight parser.
To compile the examples in srcex/org/maltparser/examples:
cd examples/apiexamples
javac -d classes -cp ../../maltparser-1.9.2.jar:. srcex/org/maltparser/examples/*.java
To run the examples you first need to create a Swedish parser model swemalt-mini.mco by using MaltParser:
java -jar ../../maltparser-1.9.2.jar -w output -c swemalt-mini -i ../data/swemalt-mini/gold/sv-stb-dep-mini-train.conll -a stacklazy -m learn \
  -l liblinear -llo -s_4_-c_0.1 -d CPOSTAG -s Stack[0] -T 1000 -F ../data/swemalt-mini/swedish-swap.xml
Note that swemalt-mini.mco is not the same as swemalt.mco you find on http://www.maltparser.org. swemalt-mini.mco is trained on approximately 5% of the training data and will not perform as well as the pre-trained model found on http://www.maltparser.org.
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ParseSentence1
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ParseSentence2
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ParseSentence3
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ConcurrentExample1
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ConcurrentExample2
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.ConcurrentExample3
Before MaltParser 1.8 there was another interface to MaltParser. Note that this interface can only be used in a single-threaded environment and does not use the lightweight parser.
There are two ways to call the MaltParserService:
For more information about how to use MaltParserService, please see the examples provided in the directory examples/apiexamples/srcex/org/maltparser/examples/old
To compile the old examples (srcex/org/maltparser/examples/old) used by MaltParser 1.7.2 and earlier versions:
javac -d classes -cp ../../maltparser-1.9.2.jar:. srcex/org/maltparser/examples/old/*.java
To run the old examples:
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.ReadWriteCoNLL ../data/talbanken05_test.conll out.conll ../../appdata/dataformat/conllx.xml UTF-8
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.CreateDependencyGraph
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.CreatePhraseStructureGraph
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.TrainingExperiment
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.ParsingExperiment
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.ParseSentence1
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.ParseSentence2
java -cp classes:../../maltparser-1.9.2.jar org.maltparser.examples.old.ParseSentence3
Other programs can invoke MaltParser in various ways, but the easiest way is to use the org.maltparser.MaltParserService class.
Since version 1.7, MaltParser is also available via the official Maven repository.
<dependency>
  <groupId>org.maltparser</groupId>
  <artifactId>maltparser</artifactId>
  <version>1.9.2</version>
</dependency>
MaltParser is a fairly complex system with many parameters that need to be optimized. Simply using the system out of the box with default settings is therefore likely to result in suboptimal performance.
Please read "A Quick Guide to MaltParser Optimization" for hints and tips on how to optimize MaltParser.