emChunk - NP Chunker

About the tool

What is it good for? What does it do?

emChunk can run in two modes: i) it identifies the maximal noun phrases (NPs) in the text, or ii) it identifies phrases of every kind in the text.

What is the input?

Texts that have already been processed by the earlier modules of the toolchain, i.e.: i) they have been segmented into words and sentences, and ii) every word has been assigned its full morphological analysis. The NP Chunker module needs both pieces of information to work effectively.

What is the output?

The module assigns a tag to every token of a text that has been segmented into words and sentences. The kind of tag depends on which of the two modes of analysis is used.
In the first mode the tag indicates i) whether the word is part of a maximal noun phrase (NP), and if so, ii) whether the NP has a single component or several. If it has several, the tag also indicates iii) whether the given word is an initial, medial or final component of the NP.
In the second mode the tag indicates i) whether the given word is part of any phrase, and if so, ii) what kind of phrase it is part of, iii) whether the phrase has a single component or several, and if it has several, iv) whether the word is an initial, medial or final component of the phrase.
The output keeps the analyses of the previous processing levels and adds the tags of the chunker module.
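
The tag structure described above can be sketched in Python as follows. This is only an illustration, not code from emChunk or HunTag3: the names ChunkTag and parse_chunk_tag are hypothetical, and the tag spellings (B, I, E and 1 followed by a hyphen and the phrase type, or a bare O) are taken from the examples on this page.

```python
from typing import NamedTuple, Optional


class ChunkTag(NamedTuple):
    """One chunker tag split into its two parts (hypothetical helper type)."""
    position: str           # "B" initial, "I" medial, "E" final, "1" single, "O" outside
    phrase: Optional[str]   # phrase type such as "NP" or "ADVP"; None for "O"


def parse_chunk_tag(tag: str) -> ChunkTag:
    """Split a tag like 'B-NP' or '1-ADVP' into position and phrase type.

    Assumption: tags look exactly like the ones in the examples on this page.
    """
    if tag == "O":
        return ChunkTag("O", None)
    position, sep, phrase = tag.partition("-")
    if not sep or position not in ("B", "I", "E", "1") or not phrase:
        raise ValueError(f"unexpected chunk tag: {tag!r}")
    return ChunkTag(position, phrase)


print(parse_chunk_tag("B-NP"))    # ChunkTag(position='B', phrase='NP')
print(parse_chunk_tag("1-ADVP"))  # ChunkTag(position='1', phrase='ADVP')
print(parse_chunk_tag("O"))       # ChunkTag(position='O', phrase=None)
```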

An example:

In the first mode we look for maximal NPs in the text, i.e. NPs that are not part of any higher-level NP.
The example sentence below contains two maximal NPs and two tokens tagged 'O', which do not belong to any NP. 'B' marks the token at the beginning of a phrase (initial element), 'I' marks medial / internal elements, and 'E' marks the final element.

A szállásunk egy Balaton melletti kis üdülőfaluban, Zamárdiban volt.

A B-NP
szállásunk E-NP
egy B-NP
Balaton I-NP
melletti I-NP
kis I-NP
üdülőfaluban I-NP
, I-NP
Zamárdiban E-NP
volt O
. O
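
To show how the tags delimit phrases, here is a minimal Python sketch that groups tagged tokens back into chunks. It is not part of emChunk or HunTag3: the function name group_chunks is hypothetical, and the input is assumed to be a list of (token, tag) pairs using the tag spellings shown on this page, including the 1 prefix for single-token phrases.

```python
from typing import List, Tuple


def group_chunks(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Collect (phrase type, tokens) pairs from one tagged sentence.

    Raises ValueError if the tag sequence is not well formed.
    """
    chunks = []
    open_type = None   # type of the phrase currently being built
    open_tokens = []   # tokens collected for that phrase so far
    for token, tag in tagged:
        position, _, phrase = tag.partition("-")
        if position in ("O", "1", "B"):
            if open_tokens:
                raise ValueError(f"phrase not closed before {token!r}")
            if position == "1":      # single-token phrase
                chunks.append((phrase, [token]))
            elif position == "B":    # start of a multi-token phrase
                open_type, open_tokens = phrase, [token]
        elif position in ("I", "E") and open_tokens and phrase == open_type:
            open_tokens.append(token)
            if position == "E":      # phrase is complete
                chunks.append((open_type, open_tokens))
                open_type, open_tokens = None, []
        else:
            raise ValueError(f"unexpected tag {tag!r} on {token!r}")
    if open_tokens:
        raise ValueError("phrase not closed at end of sentence")
    return chunks


example = [
    ("A", "B-NP"), ("szállásunk", "E-NP"), ("egy", "B-NP"), ("Balaton", "I-NP"),
    ("melletti", "I-NP"), ("kis", "I-NP"), ("üdülőfaluban", "I-NP"), (",", "I-NP"),
    ("Zamárdiban", "E-NP"), ("volt", "O"), (".", "O"),
]
for phrase, tokens in group_chunks(example):
    print(phrase, " ".join(tokens))
# NP A szállásunk
# NP egy Balaton melletti kis üdülőfaluban , Zamárdiban
```

The sketch is deliberately strict: it rejects a phrase that is opened but never closed, which makes tagging errors visible instead of silently dropping tokens.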

In the second mode we identify every kind of phrase in the sentence.
In the sentence below the tags mark an NP with two components, an NP with a single component, and two single-component ADVPs (adverbial phrases). The prefix '1' marks a phrase that consists of a single token:

Az osztály már csütörtökön teljesen fel volt villanyozva.

Az B-NP
osztály E-NP
már 1-ADVP
csütörtökön 1-NP
teljesen 1-ADVP
fel O
volt O
villanyozva O
. O

For developers

Source: https://github.com/ppke-nlpg/HunTag3
Source code: Python 3
Input format: Plain text file in UTF-8 character encoding, one word per row, sentence boundaries marked by an empty line. The first column contains the word, followed by its annotation tags in tab-separated columns.
Output format: Plain text file in UTF-8 character encoding, one word per row, sentence boundaries marked by an empty line. The first column contains the word, followed by its annotation tags in tab-separated columns; the last columns contain the Chunker tags.
Execution: See the README file: https://github.com/ppke-nlpg/HunTag3
Licence: GNU Lesser General Public License v3.0
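
The column format described under "Input format" and "Output format" can be read with a few lines of Python. The sketch below is an illustration only and does not use any HunTag3 code; read_sentences is a hypothetical name, and treating the very last column as the chunk tag is an assumption based on the description above. The sample has only two columns for brevity, whereas real files carry further annotation columns in between.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of rows; each row is a list of columns."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        # Last sentence when the input does not end with an empty line.
        yield sentence


sample = "A\tB-NP\nszállásunk\tE-NP\n\nvolt\tO\n"
for sentence in read_sentences(sample.splitlines()):
    # Assumption: first column is the word, last column is the chunk tag.
    print([(cols[0], cols[-1]) for cols in sentence])
# [('A', 'B-NP'), ('szállásunk', 'E-NP')]
# [('volt', 'O')]
```

To read a real file, an open file object can be passed instead of the sample lines, for example open(path, encoding="utf-8"), where path is whatever file the chunker produced.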