emNer - Named Entity Recogniser (NER Tagger)

About the tool

What is it for? What does it do?

The automatic Named Entity Recogniser emNer identifies proper names in running text and assigns each of them to one of the predetermined categories (person names, organisation names, place names or other).

What is the input?

Texts that have already been processed by the toolchain, i.e. i) they have been segmented into words and sentences, and ii) every word has been assigned its full morphological analysis. The NER tagger module needs this information to be effective.

What is the output?

The module assigns a tag to every token in a text that has been segmented into words and sentences, indicating i) whether the given word is a proper noun, and if so, ii) which subclass it belongs to, iii) whether the name consists of a single element or several, and if the latter, iv) whether the given word is in initial, medial or final position in the Named Entity.
The output keeps the analyses of the previous processing levels and adds the tags of the NER Tagger module.

An example:

Every token in the example sentence is tagged with one of the tags below. 0: not a proper noun; B-PER: initial element of a multiword person name; E-PER: final element of a multiword person name; B-ORG: initial element of a multiword organisation name; E-ORG: final element of a multiword organisation name; 1-ORG: a single-word organisation name.

[...] közölte Wolf László, az OTP Bank vezérigazgató-helyettese az MTI érdeklődésére.

közölte 0
Wolf B-PER
László E-PER
, 0
az 0
OTP B-ORG
Bank E-ORG
vezérigazgató-helyettese 0
az 0
MTI 1-ORG
érdeklődésére 0
. 0
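
The per-token tags above can be merged back into whole names. The following is a minimal sketch in Python, not part of emNer itself: the function name extract_entities is hypothetical, and the medial tag prefix "I-" is an assumption, since the example only shows the B-, E- and 1- prefixes.

```python
from typing import List, Tuple

# The tagged example sentence from above as (token, tag) pairs.
EXAMPLE = [
    ("közölte", "0"), ("Wolf", "B-PER"), ("László", "E-PER"), (",", "0"),
    ("az", "0"), ("OTP", "B-ORG"), ("Bank", "E-ORG"),
    ("vezérigazgató-helyettese", "0"), ("az", "0"), ("MTI", "1-ORG"),
    ("érdeklődésére", "0"), (".", "0"),
]


def extract_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, tag) pairs into (entity text, category) pairs."""
    entities = []
    current = []  # tokens of the multiword name currently being collected
    for token, tag in tagged:
        if tag == "0":  # not part of a proper name
            current = []
            continue
        position, _, category = tag.partition("-")
        if position == "1":  # single-word name
            entities.append((token, category))
            current = []
        elif position == "B":  # initial element of a multiword name
            current = [token]
        elif position == "I":  # medial element ("I-" prefix is assumed)
            current.append(token)
        elif position == "E":  # final element closes the name
            current.append(token)
            entities.append((" ".join(current), category))
            current = []
    return entities


entities = extract_entities(EXAMPLE)
print(entities)
```

For the example sentence this yields one person name (Wolf László) and two organisation names (OTP Bank and MTI).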

For developers

Source https://github.com/ppke-nlpg/HunTag3
Source language Python 3
Input Plain text file in UTF-8 character encoding, one token per line, sentence boundaries marked by an empty line; the first column contains the word, followed by its annotation tags in further columns separated by tabs.
Output The same as the input, with the last column containing the NER tags.
Execution See the README file.
Licence GNU Lesser General Public License v3.0
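
The column format described under Input and Output can be read with a few lines of Python. This is only a sketch under stated assumptions: the helper read_sentences is hypothetical and not part of HunTag3, the file is assumed to have no header row, and the middle column value "ANNOT" is a placeholder standing in for the real annotation columns.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of rows; each row is a list of column values."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():  # an empty line marks a sentence boundary
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))  # columns are separated by tabs
    if sentence:  # last sentence when the file does not end with an empty line
        yield sentence


# Placeholder data: word, one stand-in annotation column, NER tag.
sample = (
    "Wolf\tANNOT\tB-PER\n"
    "László\tANNOT\tE-PER\n"
    "\n"
    "MTI\tANNOT\t1-ORG\n"
)

sentences = list(read_sentences(sample.splitlines()))
words = [[row[0] for row in sentence] for sentence in sentences]      # first column
ner_tags = [[row[-1] for row in sentence] for sentence in sentences]  # last column
print(words)
print(ner_tags)
```

For a real file, the same function can be fed an open file object, e.g. one opened with open(path, encoding="utf-8"), since iterating over it yields lines.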