emMorph - Morphological analyser

About the tool

What does it do?

The task of the morphological tagger in the toolchain is to assign all possible morphological and morphosyntactic analyses to each word of the input text. It determines every analysis that could apply to a given word form irrespective of its context; for example, vár is analysed both as a noun (Engl.: castle) and as a verb (Engl.: wait). The tool lemmatises the word form, determines the main POS categories, analyses the endings, and marks possible morpheme boundaries, including the boundaries of compounds.
The tagger integrates the knowledge of other similar tools that have so far been available for Hungarian. According to its developers, it is the most accurate tool of its kind and relies on the widest lexicon. It is freely accessible and can be customised to specific NLP requirements and to language varieties. It is based on a computational linguistic model (so-called finite-state technology) chosen for its fast runtime.

What is the input?

Since the system presented here is a toolchain whose analysis steps build on one another, the input of the morphological tagger is a list of word forms, one per line.
The preceding processing step (determining sentence and word boundaries) is carried out by the tokeniser (emToken), while the following step (choosing the right morphological analysis from those offered by emMorph) is carried out by a disambiguation algorithm, emTag.

What is the output?

The output of the tagger is the full set of morpheme sequences, with their respective analyses, that could make up the character string in question according to the rules of Hungarian. This often amounts to a large number of possible analyses, most of which a speaker would not even be aware of. Most of these analyses can then be filtered out according to the needs of the higher-level task that uses the morphological tagger.

An example of the tagging

The example below illustrates two phenomena: on the one hand, an ambiguity that is hard for a speaker to detect; on the other hand, a word form that is stored as a single unit in the lexicon of the tagger but can be decomposed into further components. For most applications, the analysis ['fejetlenség' + accusative] (which is also the most obvious alternative for speakers) is sufficient. Using emLem, we do arrive at this analysis, but the remaining semantic ambiguity (whether fej is interpreted as a noun or as a verb) cannot be resolved at the level of morphological analysis or part-of-speech disambiguation.

fejetlenséget
1. fej[/N]etlen[_Abe/Adj]ség[_Nz_Abstr/N]et[Acc]
2. fej[/V]etlen[_NegPtcp/Adj]ség[_Nz_Abstr/N]et[Acc]
3. fej~etlen[/Adj]ség[_Nz_Abstr/N]et[Acc]
4. fej~etlen~ség[/N]et[Acc]

[/N] noun
[/V] verb
[/Adj] adjective
[_Abe/Adj] derivational suffix: negative adjectivizer (its result: adjective)
[Acc] accusative
[_Nz_Abstr/N] derivational suffix: nominalizer (its result: noun)
[_NegPtcp/Adj] derivational suffix: negative passive participle (its result: adjective)
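
The bracketed notation in the example can be taken apart mechanically. The following is a minimal Python sketch, not part of emMorph itself: the function names split_analysis and surface_form are hypothetical, and it assumes that every analysis has the simplified shape shown above (a morph followed by one bracketed code) and that ~ marks a boundary inside a stem stored as one unit in the lexicon, as the example suggests.

```python
import re

# One segment = surface text of a morph followed by exactly one bracketed code.
# Assumption: analyses look like the simplified strings in the example above.
SEGMENT = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]")


def split_analysis(analysis):
    """Return a list of (morph, code) pairs for one analysis string."""
    return [(m.group(1), m.group(2)) for m in SEGMENT.finditer(analysis)]


def surface_form(pairs):
    """Rebuild the word form by joining the morphs and dropping '~' marks."""
    return "".join(morph.replace("~", "") for morph, _ in pairs)


analyses = [
    "fej[/N]etlen[_Abe/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej[/V]etlen[_NegPtcp/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej~etlen[/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej~etlen~ség[/N]et[Acc]",
]

for analysis in analyses:
    pairs = split_analysis(analysis)
    # The first pair holds the stem and its category code.
    stem, stem_code = pairs[0]
    print(surface_form(pairs), stem, stem_code, len(pairs))
```

Comparing the first pair of each analysis makes the noun/verb ambiguity of fej explicit, and the number of pairs shows how far each analysis decomposes the word.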


The full list of morphological codes


For developers:

Source: https://github.com/dlt-rilmta/emMorph
Source language: The tagger is essentially a finite-state transducer which, based on an inventory of lemmata, an inventory of endings and a morphophonological description (a grammar), transforms the surface word form (a character string) into another character string made up of morphemes and morphological codes. The database of the tagger contains the description of the linguistic data in a specific format. The transducer is compiled and the tagger is run with the Helsinki Finite-State Transducer toolkit (HFST), implemented in C++. The lexicon that HFST can interpret (in lexc format) is generated from the primary source of the morphology by programs implemented in Perl.
Input: Text in Unicode encoding, one word per line.
Output: Each analysis of an input word is on a separate line; the analyses of different input words are separated by an empty line. The format of each line is: input word [tab] analysis [tab] weight. In the present implementation, the weight is set to 1 if an analysis exists, and to infinite (inf) if no analysis is available.
Execution: The tagger can be run with the lookup programs of the HFST toolkit:
hfst-lookup --cascade=composition hu.hfst
hfst-lookup --pipe-mode=input --cascade=composition hu.hfstol <intext >outtext
Licence: The database is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA) licence. The code responsible for the conversion of the primary source of the database is licensed under the GNU General Public License (GPL v3).
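
The input and output formats described above can be handled from a script. The following is a minimal Python sketch under several assumptions: hfst-lookup is installed and on the PATH, the transducer file is called hu.hfstol and lies in the working directory, and the output follows the three-column format described above. The function names run_lookup and parse_lookup_output are hypothetical; the command-line options are the ones quoted in the Execution entry.

```python
import math
import subprocess


def parse_lookup_output(text):
    """Turn lookup output into a list of (word, [analysis, ...]) entries.

    Lines are expected as: input word <tab> analysis <tab> weight.
    An empty line closes the group of analyses for one input word.
    Lines with an infinite weight (no analysis available) are dropped.
    """
    entries = []
    word, analyses = None, []
    for line in text.splitlines():
        if not line.strip():
            if word is not None:
                entries.append((word, analyses))
            word, analyses = None, []
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not a three-column line
        word, analysis, weight = parts
        # Assumption: the weight is a plain number or "inf"; a decimal
        # comma, if the locale produces one, is normalised to a point.
        if not math.isinf(float(weight.replace(",", "."))):
            analyses.append(analysis)
    if word is not None:
        entries.append((word, analyses))
    return entries


def run_lookup(words, transducer="hu.hfstol"):
    """Feed one word per line to hfst-lookup and parse what comes back."""
    cmd = ["hfst-lookup", "--pipe-mode=input", "--cascade=composition", transducer]
    proc = subprocess.run(
        cmd,
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return parse_lookup_output(proc.stdout)
```

Calling run_lookup(["vár", "fejetlenséget"]) would then return one entry per input word, with an empty list of analyses for words the tagger cannot analyse.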