emLem - Lemmatiser

About the tool

What is it good for? What does it do?

The morphological analyser offers a detailed analysis of each word form down to the level of morphemes: beyond identifying endings, it decomposes each word into its components and identifies derivational suffixes. Most modules that consume the output of morphological analysis do not require such a deep analysis; identifying the lemma of a word form together with its part of speech and its inflections (Hungarian: jel / rag) is sufficient. Based on the output of the morphological analyser, the lemmatiser computes the lemma of each word form and its original part of speech (in the case of a derived word), identifies the inflectional categories, and returns these instead of (or alongside) the detailed analysis.
The morphological analyser often produces several analyses of differing granularity for the same word form, all with the same lemma, part of speech and inflectional categories. The reason is that the lemma register of the analyser contains many morphologically complex lexical units (compounds, derivations and others), while the analyser can also generate most of these forms productively. After lemmatisation these real or apparent ambiguities disappear. Accordingly, the output of the lemmatiser can support tasks such as POS disambiguation, named entity recognition (NER) or syntactic parsing.

What is the input?

The input of the lemmatiser is the output of the morphological analyser. Lemmatisation requires the surface form of each morph making up the word, the lexical form of stems and derivational suffixes, and the corresponding morphosyntactic tags.

What is the output?

Based on the morphological analysis, the lemmatiser returns the lemma, the POS tag and a simplified analysis containing the inflectional categories.

An example:

The output of the morphological analysis:

fejetlenséget
1. fej[/N]etlen[_Abe/Adj]ség[_Nz_Abstr/N]et[Acc]
2. fej[/V]etlen[_NegPtcp/Adj]ség[_Nz_Abstr/N]et[Acc]
3. fej~etlen[/Adj]ség[_Nz_Abstr/N]et[Acc]
4. fej~etlen~ség[/N]et[Acc]

[/N] noun
[/V] verb
[/Adj] adjective
[_Abe/Adj] Derivational suffix: negative adjectivising suffix (its result: adjective)
[_NegPtcp/Adj] Derivational suffix: negative participle suffix (its result: adjective)
[_Nz_Abstr/N] Derivational suffix: nominalising suffix (its result: noun)
[Acc] accusative case

From the analyses above, the lemmatiser produces the single simplified analysis below:
fejetlenség[/N][Acc]

The remaining semantic ambiguity (whether fej is interpreted as a noun or as a verb) cannot be resolved at the level of morphological analysis or part-of-speech disambiguation.
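
The reduction shown in this example can be illustrated with a minimal Python sketch. It is not the lemmatiser's actual algorithm: the function name `simplify` is hypothetical, the sketch works only on the surface strings printed above (the real tool also uses the lexical forms of stems and derivational suffixes), and it assumes that every tag containing "/" carries a part of speech and that "~" only marks an internal boundary inside a lexicalised unit.

```python
import re

# One morph = surface text (possibly containing "~" boundaries) followed by one bracketed tag.
MORPH_RE = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]")

def simplify(analysis: str) -> str:
    """Reduce a detailed analysis string to lemma + POS tag + inflection tags.

    Assumption of this sketch: tags containing "/" (stems and derivational
    suffixes) carry a part of speech; all tags after the last such tag
    are inflectional.
    """
    morphs = MORPH_RE.findall(analysis)
    # Index of the last morph whose tag carries a part of speech.
    last_pos = max(i for i, (_, tag) in enumerate(morphs) if "/" in tag)
    # The lemma is everything up to and including that morph, boundaries removed.
    lemma = "".join(surface for surface, _ in morphs[: last_pos + 1]).replace("~", "")
    pos = "[/" + morphs[last_pos][1].split("/")[-1] + "]"
    inflections = "".join("[" + tag + "]" for _, tag in morphs[last_pos + 1:])
    return lemma + pos + inflections

# The four analyses of "fejetlenséget" listed above.
ANALYSES = [
    "fej[/N]etlen[_Abe/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej[/V]etlen[_NegPtcp/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej~etlen[/Adj]ség[_Nz_Abstr/N]et[Acc]",
    "fej~etlen~ség[/N]et[Acc]",
]

if __name__ == "__main__":
    # All four detailed analyses collapse into one simplified analysis.
    print({simplify(a) for a in ANALYSES})  # {'fejetlenség[/N][Acc]'}
```

Collecting the results in a set makes the point of the example visible: four detailed analyses yield a single lemma, POS tag and inflection combination.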


For developers

Source: https://github.com/dlt-rilmta/hunlp-GATE/tree/master/Lang_Hungarian/resources/hfst
Additionally, the hfst-lookup program for the given platform needs to be downloaded from the website http://apertium.projectjj.com.
Source code: Originally written in C++ and later ported to Java. It calls the hfst-lookup program of the Helsinki Finite-State Transducer (HFST) toolkit and generates its output from the analyses returned by hfst-lookup.
Input format: Text in Unicode encoding, one word per line.
Output format: The analyses of each input word are listed one per line, and the groups of analyses belonging to different input words are separated by an empty line. The format of an analysis is: input word [tab] detailed analysis [tab] lemma [tab] part-of-speech and inflection tags.
Execution: java -jar hfst-wrapper.jar
Licence: GNU Lesser General Public License (LGPL v3)
Others: A Java Runtime Environment is required to run the lemmatiser. The lemmatiser uses the hfst-lookup program included in the HFST toolkit, so this program and the binary lexicon of the analyser need to be placed next to the lemmatiser. The configuration of the lemmatiser is stored in the hfst-wrapper.props file.
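
The output format described above can be read back with a short Python sketch like the one below. The names `Analysis` and `parse_lemmatiser_output` are hypothetical, and the sample text is hand-built from the fejetlenséget example and the documented field order; it is not captured output of the tool. The sketch assumes that each non-empty line holds exactly four tab-separated fields and that an empty line closes the group of analyses belonging to one input word.

```python
from typing import List, NamedTuple

class Analysis(NamedTuple):
    word: str      # input word form
    detailed: str  # detailed morphological analysis
    lemma: str     # lemma computed by the lemmatiser
    tags: str      # part-of-speech and inflection tags

def parse_lemmatiser_output(text: str) -> List[List[Analysis]]:
    """Return one list of Analysis records per input word."""
    entries: List[List[Analysis]] = []
    current: List[Analysis] = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line ends the analyses of the current word.
            if current:
                entries.append(current)
                current = []
            continue
        word, detailed, lemma, tags = line.split("\t")
        current.append(Analysis(word, detailed, lemma, tags))
    if current:
        entries.append(current)
    return entries

# Hand-built sample following the documented format (not real tool output).
SAMPLE = "\n".join(
    "fejetlenséget\t" + detailed + "\tfejetlenség\t[/N][Acc]"
    for detailed in (
        "fej[/N]etlen[_Abe/Adj]ség[_Nz_Abstr/N]et[Acc]",
        "fej[/V]etlen[_NegPtcp/Adj]ség[_Nz_Abstr/N]et[Acc]",
        "fej~etlen[/Adj]ség[_Nz_Abstr/N]et[Acc]",
        "fej~etlen~ség[/N]et[Acc]",
    )
) + "\n"

if __name__ == "__main__":
    for analyses in parse_lemmatiser_output(SAMPLE):
        # Distinct (lemma, tags) pairs are what downstream modules usually need.
        print(analyses[0].word, {(a.lemma, a.tags) for a in analyses})
```

Grouping by empty line keeps all analyses of a word together, so a downstream module can deduplicate on the lemma and tag columns without touching the detailed analysis.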