e-magyar.hu

emCons - Constituency parser

About the tool

What is it good for? What does it do?

Constituency parsing reveals which phrases the words of a sentence form when combined with each other, and how those phrases build up the whole sentence.

What is the input?

The input is text that has been tokenised and morphologically disambiguated. The words of the sentence (the input tokens) are arranged in a parse tree, and every token is assigned an appropriate tag.

What is the output?

The output is a parse tree over the words of a sentence, showing the phrases that can be built from them and the syntactic relations between those phrases.

An example:

Az exkatonát kórházba szállították, ahol két műtétet is végrehajtottak rajta.

Az	az	DET	Definite=Def|PronType=Art	(ROOT(CP(NP*
exkatonát	exkatona	NOUN	Case=Acc|Number=Sing	*)
kórházba	kórház	PROPN	Case=Ill|Number=Sing	(NP*)
szállították	szállít	VERB	Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	(V_(V0*))
,	,	PUNCT	_	*
ahol	ahol	ADV	PronType=Rel	(ADVP*)
két	két	NUM	Case=Nom|NumType=Card|Number=Sing	(NP*
műtétet	műtét	NOUN	Case=Acc|Number=Sing	*)
is	is	CONJ	_	(C0*)
végrehajtottak	végrehajt	VERB	Definite=Ind|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin|Voice=Act	(V_(V0*))
rajta	rajta	PRON	Case=Sup|Number=Sing|Person=3|PronType=Prs	(NP*)
.	.	PUNCT	_	*))
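
The last column of the example can be read as a bracketed tree split across rows. The following is a minimal sketch of how one might reassemble it, assuming (from the example only, not from documented behaviour) that `(LABEL` opens a phrase, `*` stands for the token on that row, and `)` closes the most recently opened phrase. The names `build_tree`, `render` and `EXAMPLE_ROWS` are made up for this illustration and are not part of magyarlanc.

```python
import re

# One piece of a parse fragment: "(LABEL", "*" or ")".
PIECE = re.compile(r"\(([^()*]+)|(\*)|(\))")

def build_tree(rows):
    """Turn (word form, parse fragment) pairs of one sentence into nested
    (label, children) tuples. Returns the list of top-level nodes."""
    top = ("TOP", [])          # artificial holder for the top-level nodes
    stack = [top]
    for form, fragment in rows:
        for label, star, _close in PIECE.findall(fragment):
            if label:          # "(LABEL" opens a new phrase
                node = (label, [])
                stack[-1][1].append(node)
                stack.append(node)
            elif star:         # "*" is the slot of the current token
                stack[-1][1].append(form)
            else:              # ")" closes the innermost open phrase
                if len(stack) == 1:
                    raise ValueError("closing bracket without an open phrase")
                stack.pop()
    if len(stack) != 1:
        raise ValueError("phrase left open at end of sentence")
    return top[1]

def render(node):
    """Render a node as a one-line bracketed string."""
    if isinstance(node, str):
        return node
    label, children = node
    return "(" + label + " " + " ".join(render(c) for c in children) + ")"

# Word forms and parse fragments copied from the example above.
EXAMPLE_ROWS = [
    ("Az", "(ROOT(CP(NP*"), ("exkatonát", "*)"), ("kórházba", "(NP*)"),
    ("szállították", "(V_(V0*))"), (",", "*"), ("ahol", "(ADVP*)"),
    ("két", "(NP*"), ("műtétet", "*)"), ("is", "(C0*)"),
    ("végrehajtottak", "(V_(V0*))"), ("rajta", "(NP*)"), (".", "*))"),
]

if __name__ == "__main__":
    print(render(build_tree(EXAMPLE_ROWS)[0]))
```

Under these assumptions the example sentence yields a single `ROOT` node containing one `CP`, whose children are the phrases and punctuation tokens in sentence order. Using an explicit stack also makes unbalanced brackets easy to detect.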

For developers:

Source	http://rgai.inf.u-szeged.hu/magyarlanc
Source code	Java
Input	The output of the POS tagger: one token per row, with separate columns for the word form, its lemma and its morphological analysis; sentences are separated by an empty line.
Output	One token per row, with separate columns for the word form, lemma, morphological analysis and syntactic parse; sentences are separated by an empty line.
Execution	java -Xmx2G -jar magyarlanc-3.0.jar -mode constparse -input in.txt -output out.txt
Licence	The database is licensed under the Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA) licence. The GNU General Public License (GPL v3) covers the primary source of the database.
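
The output layout described above (one token per row, tab-separated columns, an empty line between sentences) can be read back with a few lines of code. This is a rough sketch rather than an official reader: the function name `read_sentences` and the `SAMPLE` string are hypothetical, and the sample is a toy fragment assembled from rows of the example, not real parser output.

```python
import io

def read_sentences(lines):
    """Yield each sentence as a list of rows; each row is a list of columns.

    Assumes tab-separated columns and an empty line between sentences.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():       # an empty line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:                   # last sentence without a trailing empty line
        yield sentence

# Toy input: two blocks separated by an empty line (illustration only).
SAMPLE = (
    "ahol\tahol\tADV\tPronType=Rel\t(ADVP*)\n"
    "is\tis\tCONJ\t_\t(C0*)\n"
    "\n"
    ",\t,\tPUNCT\t_\t*\n"
)

if __name__ == "__main__":
    for number, sentence in enumerate(read_sentences(io.StringIO(SAMPLE)), 1):
        for columns in sentence:
            print(number, columns[0], columns[-1])
```

To read a real result file, an open file object can be passed instead of the `io.StringIO` wrapper, for example `open("out.txt", encoding="utf-8")`; the UTF-8 encoding is an assumption here.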