Processing chain, integration
GATE integration
We have integrated the modules that make up e-magyar.hu into the GATE language processing framework. One advantage of GATE, which is implemented in Java, is that it provides a convenient way of integrating any number of language processing tools (Processing Resources) into one system. Another advantage is its uniform annotation model, which enables the modules to communicate with each other.
At the beginning of processing, every position between characters in the text is indexed with a number (the so-called offset), and from then on every annotation is expressed by a pair of offsets indicating where the annotation begins and ends. Information is stored either in the annotation itself (e.g. Token) or in the attributes of the annotation (e.g. the word stem as an attribute of the Token). This way the different annotations do not interfere with each other; they may even overlap. This is a useful solution: every module reads only the annotations relevant to it, and writes its output to an existing or a newly created annotation. For example, the tokeniser creates Token and SpaceToken units, corresponding to words and spaces; the morphological analyser fetches only the list of Tokens, runs the morphological analysis on these and ignores the SpaceTokens. The modules can be parameterised with respect to which annotations they work with, which makes the system even more flexible.
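The offset-based model described above can be illustrated with a minimal, standalone sketch. It does not use the GATE API: the `Annotation` class, the whitespace-only `tokenise` function and the `add_toy_lemmas` stand-in (which merely lowercases the token) are hypothetical names invented for this illustration, and the real tokeniser and morphological analyser are considerably more sophisticated.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str    # annotation type, e.g. "Token" or "SpaceToken"
    start: int   # offset where the annotation begins
    end: int     # offset where the annotation ends
    features: dict = field(default_factory=dict)  # attributes, e.g. a lemma


def tokenise(text):
    """Create Token / SpaceToken annotations for runs of non-space / space characters."""
    annotations = []
    for match in re.finditer(r"\s+|\S+", text):
        kind = "SpaceToken" if match.group().isspace() else "Token"
        annotations.append(Annotation(kind, match.start(), match.end()))
    return annotations


def add_toy_lemmas(text, annotations, wanted_type="Token"):
    """Stand-in for a morphological analyser: it reads only `wanted_type`
    annotations and stores its (toy) result in their attributes."""
    for ann in annotations:
        if ann.type == wanted_type:
            ann.features["lemma"] = text[ann.start:ann.end].lower()


text = "Az alma piros"
annotations = tokenise(text)
add_toy_lemmas(text, annotations)
for ann in annotations:
    print(ann.type, ann.start, ann.end, ann.features)
```

The `wanted_type` parameter mirrors the idea that a module can be told which annotations to work with; SpaceToken annotations are left untouched.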
Our task, then, is to make every module handle both its input and its output according to the GATE annotation model. In addition, any relation between independent annotations must be specified explicitly. An obvious example is the relation between proper names and the tokens that constitute them. Such relations were implemented during the integration.
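As an illustration of making such a relation explicit, the sketch below records on a proper-name annotation the identifiers of the tokens it covers. This is only one possible approach under stated assumptions: the `Annotation` class, the `link_tokens` function, the `NE` type name and the `token_ids` attribute are hypothetical, and the sketch does not show how the integration itself stores the relation.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    id: int
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


def link_tokens(entity, annotations):
    """Record on `entity` the ids of the Token annotations lying inside its span."""
    entity.features["token_ids"] = [
        ann.id
        for ann in annotations
        if ann.type == "Token" and entity.start <= ann.start and ann.end <= entity.end
    ]


# "Kovács János alszik": three tokens and one name annotation over the first two.
tokens = [
    Annotation(0, "Token", 0, 6),
    Annotation(1, "Token", 7, 12),
    Annotation(2, "Token", 13, 19),
]
name = Annotation(3, "NE", 0, 12)
link_tokens(name, tokens)
print(name.features["token_ids"])  # [0, 1]
```

Without such a link, the name and its tokens would be related only implicitly, through their offsets.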
Modules in the processing chain
The e-magyar.hu toolchain has the following modules integrated in GATE: emToken segments a text into sentences and tokens; emMorph carries out morphological analysis and determines the possible word stems; emTag disambiguates, i.e. chooses the valid morphological analysis and lemma from the possible ones. emDep and emCons carry out syntactic parsing, followed by an additional tool that connects verbs with their separable prefixes and returns the prefixed verb stem. Finally, emChunk determines noun phrases, while emNer identifies proper names. These latter two tools add an IOB annotation to a given attribute, which an additional tool then transforms into an independent annotation for more convenient further processing.
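The conversion of IOB tags into independent annotations can be sketched as follows. This is a minimal standalone illustration rather than the tool used in the toolchain: the function name `iob_to_spans`, the input layout (a list of `(start, end, tag)` triples) and the example labels `PER` and `LOC` are assumptions made for the example.

```python
def iob_to_spans(tokens):
    """Turn per-token IOB tags into (label, start, end) spans.

    `tokens` is a list of (start, end, tag) triples, where tag looks like
    "B-NP", "I-NP" or "O". A stray "I-" tag with no open span of the same
    label is treated as the beginning of a new span.
    """
    spans = []
    current = None  # [label, start, end] of the span being built
    for start, end, tag in tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = end  # extend the open span to this token
            continue
        if current is not None:
            spans.append(tuple(current))  # close the open span
            current = None
        if prefix in ("B", "I"):
            current = [label, start, end]  # open a new span
    if current is not None:
        spans.append(tuple(current))
    return spans


# "Kovács János Budapesten alszik" with illustrative tags
tagged = [(0, 6, "B-PER"), (7, 12, "I-PER"), (13, 23, "B-LOC"), (24, 30, "O")]
print(iob_to_spans(tagged))  # [('PER', 0, 12), ('LOC', 13, 23)]
```

Each resulting span carries its own pair of offsets, so it can be stored as a separate annotation rather than as an attribute of the tokens.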
Installation
The processing chain can be used from the graphical interface of GATE (GATE Developer), and can also be run from the command line with the help of GATE Embedded. For use through the graphical interface, install GATE itself and then use the simple plugin installation mechanism of GATE Developer. This downloads the Lang_Hungarian plugin (containing the entire toolchain) from the GATE Plugin repository that we have made public, and integrates the toolchain into the system. (further details)
For use from the command line, independently of the graphical interface, install GATE and clone the Lang_Hungarian GitHub repository. The elements missing from the GitHub repository must then be acquired, which is done automatically. After these steps the system is ready to be used. (further details)
Use in GATE Developer
After installing the Lang_Hungarian GATE plugin, which contains the e-magyar.hu toolchain, carry out the following steps:
- Load the processing tools: right-click on Processing Resources in the left panel and choose the required tools.
- Create a new Corpus Pipeline in the Applications section of the left panel.
- Click on the newly created Corpus Pipeline and put together the processing chain by arranging the chosen tools in the list on the right side, in the required order. Put a Document Reset PR at the top of the list; it resets the document to its default state before each run. It can be loaded from the ANNIE plugin, which is always available.
- Create a Language Resource in the left panel: a new GATE Document, which will contain the text to be processed.
- Create a corpus from the text: right-click on the newly created GATE Document and choose New Corpus with this Document.
- Click on the Corpus Pipeline, specify the newly created corpus in the Corpus field in the middle of the screen, then click on the Run this Application button.
The results can be viewed by clicking on the newly created GATE Document and switching on the Annotation Sets and the Annotation List. Placing the mouse over a unit makes its annotation visible.
For further details and possibilities of GATE, its documentation should be consulted.
For developers
Source | GATE: https://gate.ac.uk/download/; Lang_Hungarian GATE plugin (containing the Hungarian processing toolchain): available, together with the gate-server application, in the GitHub repository https://github.com/dlt-rilmta/hunlp-GATE |
Source code | Primarily Java. Tools written in Java were integrated in GATE directly, while modules written in other languages (such as Python or C++) were integrated via their binaries or their own interpreter. |
Input | For the web page and the gate-server: plain text (txt). In GATE Developer the system can handle several formats (txt, html, xml, doc, xls, docx, xlsx...), automatically extracting their textual content. For HTML and XML files the original markup is preserved, and the additional information is handled independently of it. |
Output | GATE XML format. On the website the analysed material can also be downloaded in .tsv format. |
Execution | Installation guide and further information on the GATE Developer. Installation guide and further information on the GATE Embedded. |