emCons - Constituency parser

About the tool

What is it good for? What does it do?

Constituency parsing reveals which phrases the words of a sentence form when combined with each other, and how those phrases build up the whole sentence.

What is the input?

The input is text that has been tokenised and morphologically disambiguated. The words of the sentence (the input tokens) are arranged in a parse tree: every token is assigned an appropriate tag.

What is the output?

The output is a parse tree over the words of a sentence, showing the phrases that the words form and the syntactic relations between those phrases.

An example:

Az exkatonát kórházba szállították, ahol két műtétet is végrehajtottak rajta.

Az az DET Definite=Def|PronType=Art (ROOT(CP(NP*
exkatonát exkatona NOUN Case=Acc|Number=Sing *)
kórházba kórház PROPN Case=Ill|Number=Sing (NP*)
szállították szállít VERB Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act (V_(V0*))
, , PUNCT _ *
ahol ahol ADV PronType=Rel (ADVP*)
két két NUM Case=Nom|NumType=Card|Number=Sing (NP*
műtétet műtét NOUN Case=Acc|Number=Sing *)
is is CONJ _ (C0*)
végrehajtottak végrehajt VERB Definite=Ind|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act (V_(V0*))
rajta rajta PRON Case=Sup|Number=Sing|Person=3|PronType=Prs (NP*)
. . PUNCT _ *))
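
The last column of the example can be read as a bracketed tree in which each `*` marks the place of the token on that row. The following is a minimal sketch of that reading, not part of the tool itself; it assumes the columns are separated by whitespace and that the bracket fragment is always the last column. The names `build_tree` and `render` are hypothetical helpers introduced only for illustration.

```python
# Sketch: rebuild a bracketed tree from the last column of the example.
# Assumptions: whitespace-separated columns, bracket fragment in the last
# column, and "*" standing for the token of that row.

SAMPLE = """\
Az az DET Definite=Def|PronType=Art (ROOT(CP(NP*
exkatonát exkatona NOUN Case=Acc|Number=Sing *)
kórházba kórház PROPN Case=Ill|Number=Sing (NP*)
szállították szállít VERB Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act (V_(V0*))
, , PUNCT _ *
ahol ahol ADV PronType=Rel (ADVP*)
két két NUM Case=Nom|NumType=Card|Number=Sing (NP*
műtétet műtét NOUN Case=Acc|Number=Sing *)
is is CONJ _ (C0*)
végrehajtottak végrehajt VERB Definite=Ind|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act (V_(V0*))
rajta rajta PRON Case=Sup|Number=Sing|Person=3|PronType=Prs (NP*)
. . PUNCT _ *))
"""


def build_tree(lines):
    """Return the top-level nodes as nested lists: [label, child, ...]."""
    top = []
    stack = [top]  # stack[-1] is the node currently being filled
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        fields = line.split()
        form, fragment = fields[0], fields[-1]
        opening, _, closing = fragment.partition("*")
        # Every "(LABEL" before the "*" opens a new phrase.
        for label in opening.split("(")[1:]:
            node = [label]
            stack[-1].append(node)
            stack.append(node)
        # The token itself is a leaf of the innermost open phrase.
        stack[-1].append(form)
        # Every ")" after the "*" closes the innermost open phrase.
        for _ in range(closing.count(")")):
            if len(stack) == 1:
                raise ValueError("too many closing brackets")
            stack.pop()
    if len(stack) != 1:
        raise ValueError("unclosed brackets in parse column")
    return top


def render(node):
    """Render a nested-list node as a one-line bracketed string."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(render(c) for c in children) + ")"


if __name__ == "__main__":
    for tree in build_tree(SAMPLE.splitlines()):
        print(render(tree))
```

For the example sentence this prints a single tree: `(ROOT (CP (NP Az exkatonát) (NP kórházba) (V_ (V0 szállították)) , (ADVP ahol) (NP két műtétet) (C0 is) (V_ (V0 végrehajtottak)) (NP rajta) .))`.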

For developers:

Source: http://rgai.inf.u-szeged.hu/magyarlanc
Source code: Java
Input: The output of the POS tagger: one token per row, with separate columns for the word form, its lemma and its morphological analysis; sentences are separated by an empty line.
Output: One token per row, with separate columns for the word form, lemma, morphological analysis and syntactic parse.
Execution: java -Xmx2G -jar magyarlanc-3.0.jar -mode constparse -input in.txt -output out.txt
Licence: The database is licensed under the Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA) licence; the program that converts the primary source of the database is licensed under the GNU General Public License (GPL v3).
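
As an illustration of the output format described above, the following minimal sketch reads an output file such as out.txt sentence by sentence. It is not part of magyarlanc. It assumes five whitespace-separated columns as in the example output (word form, lemma, part of speech, morphological features, parse fragment) and UTF-8 encoding; the names `Token`, `read_sentences` and `read_file` are hypothetical.

```python
from collections import namedtuple

# Assumed column layout, taken from the example output in this document.
Token = namedtuple("Token", "form lemma pos feats parse")


def read_sentences(lines):
    """Yield one list of Token per sentence; an empty line ends a sentence."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))
    if sentence:
        yield sentence  # last sentence when the file has no trailing empty line


def read_file(path):
    """Read a whole output file (UTF-8 is an assumption) into a list of sentences."""
    with open(path, encoding="utf-8") as handle:
        return list(read_sentences(handle))


if __name__ == "__main__":
    # out.txt is the file name used in the Execution line above.
    for tokens in read_file("out.txt"):
        print(" ".join(token.form for token in tokens))
```

Keeping the reader separate from any tree handling means the same function can be reused on the POS tagger output if its column count is adjusted.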