emLam - Language model

About the tool

What is it good for? What does it do?

The main task of a language model is to support other NLP tools. It judges how well a sentence, or even a single word, fits the rules of Hungarian, that is, how native an utterance sounds. This is useful in speech recognition, for example, where the model helps choose the most probable of several alternatives (e.g. „a hosszú béke” 'the long peace' or „a hosszú béka” 'the long frog'). Text search engines use similar models to list search expressions resembling the one typed in. Language models can also be used to generate text.

What is the input?

To find out how closely a text of our own resembles those in the Hungarian National Corpus (HNC; Magyar Nemzeti Szövegtár), we simply type in our sentences or paragraphs.

What is the output?

The default output is the probability of our text.

An example:

If, according to the model, the probability is 1 : 1,000,000, then on average our sentence is expected to occur verbatim once in every million sentences of the HNC. In generating mode the model can also produce text, although the result should not be expected to be particularly coherent.
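The arithmetic behind this reading can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the sketch assumes the probability is reported either directly or as a base-10 logarithm.

```python
import math


def expected_occurrences(probability: float, n_sentences: int) -> float:
    """Expected number of verbatim occurrences among n_sentences sentences."""
    return probability * n_sentences


def one_in(log10_probability: float) -> float:
    """Turn a base-10 log probability into the N of 'one in N'."""
    return 10.0 ** (-log10_probability)


# The example from the text: a probability of 1 : 1,000,000.
p = 1 / 1_000_000
print(expected_occurrences(p, 1_000_000))  # about one occurrence per million sentences
print(math.log10(p))                       # the same probability on a log10 scale
print(one_in(-6.0))                        # and back to "one in N"
```

Working on a logarithmic scale is the usual choice because the probability of a longer text quickly becomes too small to read or store comfortably as a plain number.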

Demo

Start typing something in the text field below. Every time you press the space key a word list will appear offering possible ways to continue the text typed in so far. You can choose from among the words offered by clicking on them, or you may continue typing.
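The idea behind such suggestions can be sketched with a toy bigram counter in Python. This is not the demo's implementation: the real model is a 5-gram model trained on a large corpus, whereas the three-sentence corpus and the function names below are invented purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for previous, current in zip(tokens, tokens[1:]):
            following[previous][current] += 1
    return following


def suggest(following, text, k=3):
    """Return up to k most frequent continuations of the last word typed."""
    tokens = text.split()
    if not tokens:
        return []
    candidates = following.get(tokens[-1], Counter())
    return [word for word, _ in candidates.most_common(k)]


# Invented toy corpus; a real model would be trained on millions of sentences.
toy_corpus = [
    "a hosszú béke ideje",
    "a hosszú béke után",
    "a hosszú út",
]
model = train_bigrams(toy_corpus)
print(suggest(model, "a hosszú"))  # ['béke', 'út']
```

A higher-order model works the same way in principle, except that it conditions on up to four preceding tokens instead of one.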

For developers

Source: a "de-glutinised" 5-gram model (suffixes are treated as separate tokens)
Source code
Input: One sentence per line, with tokens separated by spaces (emToken may be used for tokenisation). The version above treats both lemmas (sometimes together with their suffixes) and endings as separate tokens.
Output: The probability of our text, optionally broken down by sentence and by word.
Execution: Using the 'ngram' program of the SRILM toolkit:
ngram -order 5 -lm lemmad_u50_krs.lm5.gz -ppl <text file>
Parameters:
-order 5: use 5-grams (currently the highest order available);
-lm lemmad_u50_krs.lm5.gz: use the "de-glutinised" model above, which was trained on words occurring more than 50 times;
-ppl <text file>: the text file to be scored.
Licence: open (CC BY)
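The workflow above could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions, not part of the tool. The helper names are hypothetical, the file name input.txt is a placeholder, and the parser assumes that the summary printed by 'ngram -ppl' contains a line of the form "logprob= ... ppl= ... ppl1= ...". The numbers in the sample string are illustrative only and are not real model output. The example sentences are plain words; real input would need the same "de-glutinised" tokenisation as the model.

```python
import re
import subprocess


def format_input(tokenised_sentences):
    """One sentence per line, tokens separated by single spaces."""
    return "\n".join(" ".join(tokens) for tokens in tokenised_sentences) + "\n"


def build_ngram_command(model_path, text_path, order=5):
    """Assemble the command line shown in the Execution field."""
    return ["ngram", "-order", str(order), "-lm", model_path, "-ppl", text_path]


def run_ngram(model_path, text_path, order=5):
    """Run SRILM's ngram (assumed to be on PATH) and return its standard output."""
    result = subprocess.run(
        build_ngram_command(model_path, text_path, order),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Assumed layout of the summary line; adjust if your SRILM build prints it differently.
SUMMARY_RE = re.compile(r"logprob=\s*(\S+)\s+ppl=\s*(\S+)\s+ppl1=\s*(\S+)")


def parse_summary(output):
    """Extract the log10 probability and the two perplexity values."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no 'logprob= ... ppl= ... ppl1= ...' line found")
    logprob, ppl, ppl1 = (float(value) for value in match.groups())
    return {"logprob": logprob, "ppl": ppl, "ppl1": ppl1}


# Illustrative sample only (made-up numbers), used to exercise the parser.
sample = (
    "file input.txt: 2 sentences, 10 words, 0 OOVs\n"
    "0 zeroprobs, logprob= -24 ppl= 100 ppl1= 251.189\n"
)
print(format_input([["a", "hosszú", "béke"], ["a", "hosszú", "béka"]]), end="")
print(build_ngram_command("lemmad_u50_krs.lm5.gz", "input.txt"))
print(parse_summary(sample))
```

With SRILM installed and the model downloaded, calling parse_summary(run_ngram("lemmad_u50_krs.lm5.gz", "input.txt")) would score a real file. Keeping command construction, execution, and parsing in separate functions makes it possible to test the parsing without having SRILM available.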