This is a short guide to get familiar with MaltParser. We start by running MaltParser without any arguments, typing the following at the command line prompt (it is important that you are in the malt-1.0.2 directory):
prompt> java -jar malt.jar

This command will display the following output:
    -----------------------------------------------------------------------------
    MaltParser 1.0.2
    -----------------------------------------------------------------------------
    MALT (Models and Algorithms for Language Technology) Group
    School of Mathematics and Systems Engineering (MSI) Vaxjo University, Sweden
    -----------------------------------------------------------------------------
    Usage:
    java -jar malt.jar -f
    java -jar malt.jar -h for more help

    help ( -h) : Show help
    -----------------------------------------------------------------------------
    option_file ( -f) : Path to the option file
    -----------------------------------------------------------------------------
    verbosity ( -v) : Verbosity level
      debug - Logging of debugging messages
      error - Logging of error events
      fatal - Logging of very severe error events
      info - Logging of informational messages
      off - Logging turned off
      warn - Logging of harmful situations
    -----------------------------------------------------------------------------

Here you can see the basic usage and options. To get all available options:
prompt> java -jar malt.jar -h

All of these options are also documented in a short version and in a fully described version.
Now we are ready to train our first parsing model. In the directory examples/data there are two treebank data files, talbanken05_train.conll and talbanken05_test.conll, which contain a very small portion of the treebank Talbanken05. The example data are formatted according to the CoNLL data format. Note that this data set is very small and that you need more training data to create a useful parsing model.
To train a default parsing model with MaltParser write the following into the command line prompt:
prompt> java -jar malt.jar -c test -i examples/data/talbanken05_train.conll -m learn

This line tells MaltParser to create a parsing model named test.mco from the treebank in the file examples/data/talbanken05_train.conll. The parsing model gets its name from the configuration name, which is specified by the option flag -c without the file suffix .mco. The configuration name is a name of your own choice. The option flag -i tells the parser where to find the input data. The last option flag, -m, specifies the parsing mode learn: in this case we want to induce a model using the default learning method (LIBSVM). MaltParser outputs the following information:
    Started: Sat Oct 13 15:03:52 CEST 2007
    Initialize the parsing algorithm...
    Reading sentences from 'examples/data/talbanken05_train.conll':1
    Number of sentences: 32
    Creating all models
    Creating LIBSVM model libsvm.mod
    Saving the symbol table...
    Saving the configuration specific options...
    Creates configuration file 'd:\exp\malt1.0\install\malt-1.0.2\test.mco' ...
    Finished: Sat Oct 13 15:03:54 CEST 2007
    Learning time: 00:00:02 (2234 ms)

Most of the logging information is self-explanatory: it tells you that the parser started at a certain date and time and that it read sentences from the specified file, which contains 32 sentences. It continues with information about which learning models are created; in this case only one LIBSVM model is created. Finally, it saves the symbol table and all options that cannot be changed during parsing, and stores everything in a configuration file named test.mco. At the end, the parser reports the learning time.
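The training step can be wrapped in a small shell function that also confirms the configuration file was written. This is only a sketch: the function name train_model is hypothetical, and it assumes that malt.jar is in the current directory and that the .mco file lands in the working directory, as the log above suggests.

```shell
#!/bin/sh
# Hypothetical helper: train a model, then check that NAME.mco appeared.
# Assumes malt.jar is in the current directory, as in this guide.
train_model() {
    name=$1        # configuration name, without the .mco suffix
    treebank=$2    # training data in CoNLL format

    if [ ! -f "$treebank" ]; then
        echo "training file not found: $treebank" >&2
        return 1
    fi

    # Same flags as the command shown in the guide.
    java -jar malt.jar -c "$name" -i "$treebank" -m learn || return 1

    # Assumption: the configuration file is written to the working directory.
    if [ ! -f "$name.mco" ]; then
        echo "expected $name.mco was not created" >&2
        return 1
    fi
    echo "created $name.mco"
}
```

With the example data, the call would be `train_model test examples/data/talbanken05_train.conll`.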
Now that we have a parsing model, we can use it to parse new sentences from the same language. It is important that the unparsed sentences are formatted according to the same data format and carry all input information, such as part-of-speech tags; in this case the sentences contain the first six columns of the CoNLL data format. To parse, type the following:
prompt> java -jar malt.jar -c test -i examples/data/talbanken05_test.conll -o out.conll -m parse

where -c test is the name of the configuration (the prefix file name of test.mco), -i examples/data/talbanken05_test.conll tells the parser where to find the input data, -o out.conll is the output file name, and finally -m parse specifies that the parser should be executed in parsing mode.
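If your own data is fully annotated, one way to produce input limited to the first six columns is to cut the remaining columns away. The sketch below assumes tab-separated CoNLL-X files; the function name strip_annotation and the file names in the comments are placeholders, not part of MaltParser.

```shell
#!/bin/sh
# Sketch: reduce a fully annotated CoNLL file to the six input columns.
# Assumes tab-separated CoNLL-X columns in the order
# ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL.
strip_annotation() {
    # cut keeps fields 1-6 and passes the blank lines that separate
    # sentences through unchanged, since they contain no tab delimiter.
    cut -f1-6 "$1"
}

# Example use (file names are placeholders):
#   strip_annotation gold.conll > input.conll
#   java -jar malt.jar -c test -i input.conll -o out.conll -m parse
```

Keeping the blank lines matters because they mark sentence boundaries in the CoNLL format.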
Sometimes it is useful to get information about a configuration, for instance which settings were used when creating the parsing model. To get this information, type:
prompt> java -jar malt.jar -c test -m info

This will output a lot of information about the configuration:
    CONFIGURATION
    Configuration name: test
    Configuration type: singlemalt
    Created: Sat Oct 13 15:03:52 CEST 2007

    SYSTEM
    Operating system architecture: x86
    Operating system name: Windows XP
    JRE vendor name: Sun Microsystems Inc.
    JRE version number: 1.6.0_03

    MALTPARSER
    Version: 1.0.2
    Build date: October 13 2007

    SETTINGS
    config
      workingdir ( -w) user.dir
      name ( -c) test
      logging ( -cl) info
      type ( -t) singlemalt
      logfile (-lfi) stdout
      url ( -u)
    covington
      allow_root ( -cr) true
      allow_shift ( -cs) false
    graph
      max_sentence_length (-gsl) 256
      root_label (-grl) ROOT
    guide
      data_split_structure ( -s)
      learner ( -l) libsvm
      kbest ( -k) -1
      features ( -F)
      classitem_separator (-gcs) _
      prediction_strategy (-gps) combined
      data_split_column ( -d)
      data_split_threshold ( -T) 50
    input
      infile ( -i) examples/data/talbanken05_train.conll
      reader ( -ir) tab
      charset ( -ic) UTF-8
      format ( -if) /appdata/dataformat/conllx.xml
    libsvm
      libsvm_exclude_null (-lse) no
      libsvm_options (-lso)
      libsvm_exclude_columns (-lsc)
    nivre
      root_handling ( -r) normal
      post_processing (-npp) false
    output
      charset ( -oc) UTF-8
      format ( -of) /appdata/dataformat/conllx.xml
      writer ( -ow) tab
      outfile ( -o)
    pproj
      covered_root (-pcr) none
      marking_strategy ( -pp) none
    singlemalt
      parsing_algorithm ( -a) nivreeager
      behavior (-mcb) malt1.0
      symbol_table (-mct)
      special_symbols (-mcs) /appdata/specialsymbols/malt1.0.xml
      mode ( -m) learn

    DEPENDENCIES
    --guide-features ( -F) /appdata/features/NivreEager.par

    FEATURE MODEL
    InputColumn(POSTAG, Stack[0])
    InputColumn(POSTAG, Input[0])
    InputColumn(POSTAG, Input[1])
    InputColumn(POSTAG, Input[2])
    InputColumn(POSTAG, Input[3])
    InputColumn(POSTAG, Stack[1])
    OutputColumn(DEPREL, Stack[0])
    OutputColumn(DEPREL, ldep(Stack[0]))
    OutputColumn(DEPREL, rdep(Stack[0]))
    OutputColumn(DEPREL, ldep(Input[0]))
    InputColumn(FORM, Stack[0])
    InputColumn(FORM, Input[0])
    InputColumn(FORM, Input[1])
    InputColumn(FORM, head(Stack[0]))

    LIBSVM INTERFACE
    LIBSVM version: 2.84
    SVM-param string:
    Null-value handling: INCLUDE_NULL_VALUES

    LIBSVM SETTINGS
    SVM type : C_SVC (0)
    Kernel : POLY (1)
    Degree : 2
    Gamma : 0.2
    Coef0 : 0.0
    Cache Size : 40.0 MB
    C : 0.5
    Eps : 1.0
    Shrinking : 1
    Probability : 0
    #Weight : 0

The output is divided into several types of information:
Information type | Description |
---|---|
CONFIGURATION | The name and the type of configuration and when it was created. |
SYSTEM | Information about the system that was used when creating the configuration, such as the processor, the operating system and the version of the Java Runtime Environment (JRE). |
MALTPARSER | Version of MaltParser and when it was built. |
SETTINGS | All option settings divided into several categories. |
DEPENDENCIES | In some cases the parser corrects itself when an illegal combination of options is specified or an option is missing. In the example above, the feature specification file is not specified, so the parser uses the default feature specification file for the Nivre arc-eager parsing algorithm. |
FEATURE MODEL | Outputs the content of the feature specification file. |
<LEARNER> INTERFACE | Information about and settings of the interface to the learner; in the example above, LIBSVM is used. |
<LEARNER> SETTINGS | All settings of the learner-specific options; in the example above, LIBSVM is used. |
It is possible to unpack the configuration file test.mco by typing:
prompt> java -jar malt.jar -c test -m unpack

This command will create a new directory test containing the following files:
File | Description |
---|---|
libsvm.mod | The LIBSVM model that is used for predicting the next parsing action. |
savedoptions.sop | Contains all option settings that cannot be changed when parsing. |
symboltables.sym | Contains all distinct values of the training data, divided into different columns. For example, the column POSTAG in the CoNLL format has its own symbol table with all distinct values occurring in the training data. |
test_singlemalt.info | Information about the configuration; contains the same information as described above. |
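After unpacking, a quick check that the listed files are present can catch an incomplete or mistyped configuration name. The sketch below is illustrative only: the function name check_unpacked is hypothetical, and the pattern NAME_singlemalt.info for the info file is an assumption generalised from the single file name test_singlemalt.info in the table above.

```shell
#!/bin/sh
# Hypothetical helper: verify that an unpacked configuration directory
# contains the files listed in the table above.
check_unpacked() {
    name=$1    # configuration name; the directory is expected to be ./NAME
    # Assumption: the info file is named NAME_singlemalt.info.
    for f in libsvm.mod savedoptions.sop symboltables.sym "${name}_singlemalt.info"; do
        if [ ! -f "$name/$f" ]; then
            echo "missing: $name/$f" >&2
            return 1
        fi
    done
    echo "all expected files present in $name/"
}
```

For the configuration in this guide, the call would be `check_unpacked test`, run from the directory that contains the unpacked test directory.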
MaltParser is equipped with different ways to specify the option settings:
Method | Description | Example |
---|---|---|
Command-line option flag | Uses the option flag, preceded by one minus sign - and followed by a blank and the value. | -c test |
Command-line option group and option name | Uses both the option group name and the option name to specify the option. It always begins with two minus signs -- before the option group name, with one minus sign - separating the option group name from the option name. The equal sign = separates the option from the value. | --config-name=test |
Command-line option name | A shorter version of the option group and option name form; it can only be used when the option name is unambiguous. | --name=test |
Option file | The option settings are specified in an option file, formatted in XML. To read the option file, the option flag -f is used. Note that command-line option settings override the settings in the option file if an option is specified in both. | <?xml version="1.0" encoding="UTF-8"?> <experiment> <optioncontainer> <optiongroup groupname="config"> <optionvalue name="name" value="test"/> </optiongroup> </optioncontainer> </experiment> |
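As an illustration of the option file method, the training command from earlier in this guide could be expressed as an option file. This is a sketch under assumptions: the file name train_options.xml is a placeholder, and the group and option names (config/name, input/infile, singlemalt/mode) are taken from the SETTINGS listing printed by -m info rather than from a confirmed schema.

```shell
#!/bin/sh
# Write an option file intended to mirror:
#   -c test -i examples/data/talbanken05_train.conll -m learn
# Group and option names follow the SETTINGS listing shown by -m info.
cat > train_options.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<experiment>
  <optioncontainer>
    <optiongroup groupname="config">
      <optionvalue name="name" value="test"/>
    </optiongroup>
    <optiongroup groupname="singlemalt">
      <optionvalue name="mode" value="learn"/>
    </optiongroup>
    <optiongroup groupname="input">
      <optionvalue name="infile" value="examples/data/talbanken05_train.conll"/>
    </optiongroup>
  </optioncontainer>
</experiment>
EOF

# Then run MaltParser with the option file (not executed by this script):
#   java -jar malt.jar -f train_options.xml
```

Because command-line settings take precedence over the option file, adding a flag such as -c with a different value to that command should change the configuration name without editing the file.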