monellimankeimontenegro

PRESEEA - A Castilian language corpus

A language corpus is a database of annotated language data. Annotated in the sense that linguists or related professionals have added information to raw data collected from speakers. The raw data comes from native speakers, who provide either written or audio samples. Professionals then label specific phrases, tenses, the use of pronouns, verbs, articles or other semantic or morphological information in that raw material. Thanks to these labels, the data can later be filtered according to the needs of a sociolinguistic analysis. In the case of PRESEEA, even the speakers are categorized: by age, by level of education and, just as important, by the geographical place where the data was collected. This makes it possible to establish a connection between sociology and linguistics. It is really amazing to find such relationships between topics that at first sight seem unrelated, boring or even unnecessary to the untrained eye. Other interesting corpora in this area are, for example, CREA and COSER.
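To make the idea of annotated, filterable data a bit more concrete, here is a minimal sketch in Python. It is not PRESEEA's actual data model: the class names, the field names, the label strings and the three sample sentences are all invented for illustration. It only shows how speaker metadata (age, education, place) and labels can sit next to the raw text so that filtering becomes trivial.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Speaker:
    # Hypothetical speaker metadata, mirroring the categories named in the text.
    age: int
    education: str  # e.g. "primary", "secondary", "university"
    city: str


@dataclass(frozen=True)
class Sample:
    # One piece of raw data plus the labels added by an annotator.
    speaker: Speaker
    text: str
    labels: Tuple[str, ...]


def filter_samples(
    samples: List[Sample],
    city: Optional[str] = None,
    education: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
    label: Optional[str] = None,
) -> List[Sample]:
    """Return the samples that match every filter that was given."""
    result = []
    for sample in samples:
        speaker = sample.speaker
        if city is not None and speaker.city != city:
            continue
        if education is not None and speaker.education != education:
            continue
        if min_age is not None and speaker.age < min_age:
            continue
        if max_age is not None and speaker.age > max_age:
            continue
        if label is not None and label not in sample.labels:
            continue
        result.append(sample)
    return result


# Invented example data, not taken from any real corpus.
SAMPLES = [
    Sample(Speaker(24, "university", "Madrid"), "¿Tú vienes mañana?", ("pronoun:informal",)),
    Sample(Speaker(41, "secondary", "Madrid"), "Usted tiene razón.", ("pronoun:formal",)),
    Sample(Speaker(67, "primary", "Sevilla"), "¿Usted quiere café?", ("pronoun:formal",)),
]

madrid = filter_samples(SAMPLES, city="Madrid")
older_formal = filter_samples(SAMPLES, min_age=55, label="pronoun:formal")
```

Here `madrid` holds the two samples recorded in Madrid, and `older_formal` holds the single formal sample from a speaker aged 55 or older. A real corpus would of course carry far richer labels than one string per sample.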

The term corpus has become quite popular lately with the rise of ML (Machine Learning) in NLP (Natural Language Processing). But corpora existed long before ML, and the term is not exclusive to programmers and IT. So yet another bridge is built, from sociologists and linguists to programmers. A remark attributed to Frederick Jelinek, that each NLP project got better once the linguists were excluded, seems to say the opposite. It may hold true to some degree, but it may also be related to the fact that, for example, an RNN solving an NLP problem is more or less brute forcing a language learning process. In the medium term, pure tech may well solve the issue by just throwing more and more data and hardware into the pot. But a final solution to NLP will sooner or later include other professionals, at least to some degree. While this is purely a statement of opinion, a clean data engineering process is the absolute foundation of any scientific domain. Therefore, PRESEEA offers a great database on top of which to build a small Python package that automates some analysis.

Why a Python package, and what is a Python package? What this analysis should do is grab some data from PRESEEA and condense it to a state in which a sociolinguistic question can be answered. For example, getting all samples with all usages of informal and formal pronouns, compared across different ages of speakers, to possibly find changes in language usage. This can be done in two ways. The first would be to apply the filters in the UI (User Interface), fire the request with the button and then write down the results on a piece of paper or in an Excel sheet. If the sociolinguistic question covers a broader context, further requests with different filters can be fired and added to the paper or sheet as well. When this is finished, a conclusion can be drawn from the statistics collected. The other approach would be to use the UI as a kind of API (Application Programming Interface) and to automate both the requests and the compilation of the statistics. The whole bunch of code, with its functions, classes and definitions, could be put in a central place from which it can be fetched, so that the wheel is not reinvented each time such an analysis has to be carried out. By structuring the code in a specific way, in other words by packaging it, it can be published on the pip server (PyPI, the Python Package Index). When needed, it can then be installed via the local pip client. One could argue that simply putting it on an SCM (Source Code Management) server like GitHub or GitLab is enough. That holds true, and pip can in fact install a package directly from a Git repository (with a `git+https://...` URL) as well as from PyPI, which stores the uploaded package archives itself. Either way, installing via pip is simpler for the user than fetching the code by hand, and that is the advantage.

Regarding the second question: programming languages often have things like libraries, modules, packages, frameworks and so on. At least for Python, a package is a bunch of code that can be used after installing it via pip. Pip is Python's package manager, a small program that runs on the system. When called, it connects to the pip server and downloads some Python files to a place where the Python interpreter can find them later on. This is what is called the installation of a Python package.
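The automated approach described above, compiling statistics on formal and informal pronouns per age group, could look roughly like the following sketch. It deliberately leaves out the step of fetching data from PRESEEA, because the request format of that site is not covered here. The records are invented, the function names are hypothetical, the age brackets are an assumption, and the pronoun lists are a strong simplification (for instance, plural forms are ignored).

```python
import re
import unicodedata
from collections import Counter, defaultdict

# Simplified pronoun lists; an assumption for illustration only.
INFORMAL = {"tú", "vos"}
FORMAL = {"usted"}


def age_group(age):
    """Map an age to a coarse bracket (the brackets are an assumption)."""
    if age < 35:
        return "under 35"
    if age < 55:
        return "35-54"
    return "55+"


def pronoun_counts_by_age(records):
    """Count formal and informal pronouns per age group.

    `records` is an iterable of (age, text) pairs.
    """
    counts = defaultdict(Counter)
    for age, text in records:
        group = age_group(age)
        # Normalize so that accented characters compare reliably.
        normalized = unicodedata.normalize("NFC", text).lower()
        for token in re.findall(r"\w+", normalized):
            if token in INFORMAL:
                counts[group]["informal"] += 1
            elif token in FORMAL:
                counts[group]["formal"] += 1
    return {group: dict(counter) for group, counter in counts.items()}


# Invented records standing in for data that would come from the corpus.
RECORDS = [
    (24, "¿Tú vienes mañana? Tú sabes que sí."),
    (41, "Usted tiene razón, pero tú no."),
    (67, "¿Usted quiere café? Usted primero."),
]

counts = pronoun_counts_by_age(RECORDS)
print(counts)
```

Once such functions live in a package, the same few lines can be reused for the next question (another city, another education level) instead of being rewritten or, worse, replaced by paper and pencil.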

To do so, the code for the package has to be hosted somewhere. In this case it is GitHub, and pip fetches the package from that specific GitHub repository.
