The lemmatisation of the verbal lexicon of Old English on a relational database. Preterite-present, contracted, anomalous and strong VII verbs

ABSTRACT (SPANISH VERSION)
This thesis contributes to the linguistic analysis of Old English by means of corpus-based lexical databases. Although lemmatisation is considered one of the tasks necessary for compiling dictionaries, no lemmatised corpus of Old English is available. Moreover, for this historical period of English, which shows numerous morphological variants and lacks an orthographic standard, a lemmatised corpus is indispensable. The aim of this thesis is therefore to lemmatise a part of the derived verbal lexicon of Old English, which combines aspects of morphology, lexicography and corpus analysis. The scope is restricted to the most morphologically complex verbal classes of Old English, irregular verbs and reduplicative verbs, which include the preterite-present, anomalous, contracted and class VII strong verbs. This requires, first, the selection and handling of the sources of data and of the sources used to verify the results and, second, the formulation and sequencing of the steps of the lemmatisation tasks. This work also raises the question of automation in the lemmatisation process, on which little literature has been found. The methodology combines automatic searches in the lemmatiser Norna with manual revision of the results against the available lexicographical sources. The lemmatiser is based on the 2004 version of the corpus of The Dictionary of Old English (DOE), which contains approximately three thousand texts and three million words. The lexicographical sources consulted are, on the one hand, the database The Grid (Nerthus Project) and, on the other, the dictionaries of Old English, including the DOE, Bosworth and Toller, Hall-Meritt, and Sweet. Two different approaches to lemmatisation have been adopted in this research.
The class VII strong verbs have been lemmatised by applying a search algorithm based on the main forms of the verb (Metola Rodríguez 2015). This algorithm has been built from the roots, the inflections and the preverbal elements of the strong verbs of Old English. The verbs derived from the preterite-present, contracted and anomalous verbs, by contrast, have been searched for on the basis of their simple forms. In conclusion, this thesis offers an inventory of lemmas and inflectional forms of the verbs analysed. From the point of view of applicability, this work presents different procedures of automatic and manual lemmatisation that can be applied to the fields of lexicography and corpus linguistics.

ABSTRACT (ENGLISH VERSION)
This thesis contributes to research on the linguistic analysis of Old English with corpus-based lexical databases. Although lemmatisation is generally accepted as one of the necessary tasks of dictionary making, no lemmatised corpus of Old English is available. Because Old English presents numerous morphological variants and lacks a written standard, a lemmatised corpus is necessary. Thus, the aim of this thesis is to lemmatise a part of the derived verbal lexicon of Old English, combining aspects of Morphology, Lexicography and Corpus Analysis. The scope is restricted to the most morphologically complex verbal classes of Old English, namely irregular verbs and reduplicative verbs, which comprise preterite-present, anomalous, contracted and strong VII verbs. This aim requires, firstly, the selection and management of the sources of data and of the sources used to verify the results; and secondly, the design and sequencing of the steps of the lemmatisation tasks. This research also raises the issue of automating the lemmatisation of Old English verbs, on which little previous literature has been found.
The methodology comprises automatic searches on the lemmatiser Norna and the manual revision of the hits against the available lexicographical sources. The lemmatiser is based on the 2004 version of The Dictionary of Old English Corpus (DOE), which contains approximately three thousand texts and three million words. The lexicographical sources checked are, in the first place, the database The Grid (Nerthus Project), and secondly, the Old English dictionaries, including the DOE, Bosworth and Toller, Hall-Meritt, and Sweet. Two different approaches to lemmatisation have been taken in this research. On the one hand, the class VII strong verbs are lemmatised by means of a search algorithm that is based on the main forms of the verbs (Metola Rodríguez 2015). The search algorithm is built on the basis of the roots, the set of inflections and the preverbal items of the strong verbs of Old English. On the other hand, the derived preterite-present, anomalous and contracted verbs are searched for by means of their simplexes. In conclusion, this thesis offers an inventory of inflectional forms and lemmas of the verbs under analysis. On the applied side, this work presents different procedures of automatic and manual lemmatisation that can be applied to the fields of Lexicography and Corpus Linguistics.
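
The form-based search described above for the class VII strong verbs can be illustrated with a minimal Python sketch. It is not the algorithm of Metola Rodríguez (2015) and does not reproduce Norna or The Grid; it only assumes that candidate forms can be generated as preverb + root + inflection and then matched against corpus tokens. The function names, the short lists of preverbs and endings, and the sample tokens are all hypothetical and chosen for illustration, with healdan 'to hold' as the only simplex.

```python
# Hypothetical, hand-picked inputs for illustration only; they are not
# taken from Norna, The Grid or the DOE corpus.
PREVERBS = ["", "be", "for", "ge", "on"]  # "" stands for the simplex itself
ROOTS = {
    "healdan": {"present": "heald", "preterite": "heold", "participle": "heald"},
}
ENDINGS = {
    "present": ["an", "e", "est", "eþ", "aþ"],
    "preterite": ["", "e", "on"],
    "participle": ["en"],
}


def candidate_forms(simplex):
    """Map every generated form (preverb + root + ending) to its lemma."""
    forms = {}
    for preverb in PREVERBS:
        lemma = preverb + simplex
        for tense, root in ROOTS[simplex].items():
            for ending in ENDINGS[tense]:
                forms[preverb + root + ending] = lemma
    return forms


def lemmatise(tokens, simplex):
    """Group the tokens that match a generated form under their lemma."""
    forms = candidate_forms(simplex)
    hits = {}
    for token in tokens:
        form = token.lower()
        lemma = forms.get(form)
        if lemma is not None:
            hits.setdefault(lemma, set()).add(form)
    return hits


# Made-up sample of tokens, standing in for a corpus word list.
sample_tokens = ["Beheold", "geheoldon", "forhealdan", "healdeþ", "cyning"]
result = lemmatise(sample_tokens, "healdan")
for lemma in sorted(result):
    print(lemma, sorted(result[lemma]))
```

In a sketch of this kind the automatic step only proposes lemma assignments. Each hit would still need the manual revision against the lexicographical sources that the methodology describes, since spelling variants and homographs are not handled here.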