Friday, October 18, 2013

TEI-conform XML Annotation of a Digital Dictionary of Surnames in Germany

http://digilab2.let.uniroma1.it/teiconf2013/program/posters/abstracts-posters#C125

by Horn, Franziska; Denzer, Sandra

In this paper we focus on XML markup for the Digital Dictionary of Surnames in Germany (Digitales Familiennamenwörterbuch Deutschlands, DFD). The dictionary aims to explain the etymology and the meaning of the surnames occurring in Germany. We discuss the possibilities and constraints that arise when the TEI module “Dictionaries” is used to edit a specialized dictionary such as the DFD. This includes situating the new project within the landscape of electronic dictionaries.

Our evaluation of the appropriateness of the proposed guidelines is intended as a contribution to the efforts of the TEI: the consortium regards its specifications as a dynamic and ongoing development. The efforts concerning lexical resources, which started with the digitization of printed dictionaries, are documented and discussed in various publications (e.g. Ide/Véronis/Warwick-Armstrong/Calzolari 1992; Ide/Le Maitre/Véronis 1994; Ide/Kilgarriff/Romary 2000). The module “Dictionaries” contains widely accepted proposals for digitizing printed dictionaries, but born-digital projects are becoming increasingly common (Budin/Majewski/Mörth 2012). For a more fine-grained encoding of these resources, proposals for customizing the module “Dictionaries” exist (e.g. Budin/Majewski/Mörth 2012). This paper focuses on the usefulness of the guidelines for a dynamic and specialized online dictionary without customized TEI extensions. Nevertheless, our investigation points out possible extensions which may increase the acceptance and application of the TEI in other, similar projects.

First, we introduce the Digital Dictionary of Surnames in Germany (2012-2036), a new and ongoing collaboration between the Academy of Sciences and Literature in Mainz and Technische Universität Darmstadt. Work on the DFD started in 2012. The project is based on data from the German telecommunications company Deutsche Telekom AG and on preliminary studies of the German Surname Atlas (Deutscher Familiennamenatlas, DFA). The dictionary is to be integrated into an online onomastics portal named namenforschung.net, which can be seen as a gateway to various projects and information related to the field of name studies.

The intention of the DFD is to record the entire inventory of surnames occurring in Germany, including foreign ones. Accordingly, each entry comprises several features, for instance frequency, meaning and etymology, historical examples, variants and the distribution of the surname. The short introduction also places the DFD within a typology of dictionaries (Kühn 1989; Hausmann 1989). We then focus on the annotation of the DFD data according to the TEI Guidelines, as these form a de facto standard for the encoding of electronic texts (Jannidis 2009). Following the proposals means providing possibilities for data exchange and further exploration (Ide/Sperberg-McQueen 1995). Both aspects are particularly important considering the long duration of the project. The encoding scheme of the DFD is mainly based on the TEI module “Dictionaries”. Furthermore, components of the modules “Core” and “Names, Dates, People, and Places” are used. The main reason for considering the latter module is the close connection of surnames to geographical features, for example settlements or rivers. TEI extensions that customize existing tags and annotation hierarchies according to specific needs are set aside in order to provide a higher level of data interchangeability, for instance with other TEI- and XML-based onomastic projects such as the Digitales Ortsnamenbuch Online (DONBO), a digital dictionary of place names (Buchner/Winner 2011).

To evaluate the appropriateness of the TEI Guidelines for our project, we compare them with the requirements of annotating the microstructure of the DFD entries. The intention of the TEI is to offer exact as well as flexible annotation schemes (Ide/Sperberg-McQueen 1995). Therefore, relevant criteria for the evaluation are the completeness of the tagset and the flexibility in arranging elements and attributes. Furthermore, the analysis discusses the comprehensibility of possible annotations in terms of descriptive and direct denotations.

In general, the TEI Guidelines – the tagset and the arrangement of its elements – can be used to represent the structure of the entries as well as the features of the DFD adequately. The applicability is, however, influenced by several aspects we want to discuss in greater detail.

First, we discuss the completeness of the tagset. It would be useful to have elements within the module “Dictionaries” for encoding frequency and geographical distribution. The frequency of a surname is of interest to dictionary users, especially to name bearers. Beyond the DFD, options to encode frequencies also seem important for other lexical resources, such as dedicated frequency dictionaries or the frequency information in learner’s dictionaries. Elements to annotate the geographical distribution are needed because the distribution in and outside of Germany serves as a means to support or verify the given sense-related information (Schmuck/Dräger 2008; Nübling/Kunze 2006). These tags also seem to be of interest for parallel developments of national surname dictionaries, for example in Austria (FamOs), as well as for other types of dictionaries, for instance variety dictionaries.

In our encoding scheme, the missing tags are replaced by more indirect combinations of tags and attributes, for example <usg type="token"> to encode the frequency or <usg type="german_distribution"> to annotate the distribution.
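As a rough illustration of how a consuming application could read these indirect encodings, the following Python sketch parses a hypothetical entry fragment. Only the two <usg> types are taken from the scheme described above; the surrounding <entry>, <form> and <orth> structure, the omission of the TEI namespace, and the placeholder values are assumptions made for this example and are not actual DFD data.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry fragment. Only the two <usg> types come from the text;
# the wrapper elements and all values are placeholders, not DFD data.
# The TEI namespace is omitted to keep the sketch short.
ENTRY = """
<entry>
  <form type="lemma"><orth>Bäcker</orth></form>
  <usg type="token">PLACEHOLDER_COUNT</usg>
  <usg type="german_distribution">PLACEHOLDER_REGION</usg>
</entry>
"""

def usage_info(entry_xml):
    """Return a dict mapping each <usg> @type to its text content."""
    entry = ET.fromstring(entry_xml.strip())
    return {usg.get("type"): (usg.text or "").strip() for usg in entry.iter("usg")}

info = usage_info(ENTRY)
print(info["token"])
print(info["german_distribution"])
```

The sketch shows the cost of the indirect solution: a reader of the data has to know the project-specific @type values in order to find the frequency and the distribution.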

Furthermore, it would be helpful to have more possibilities to specify a sense. In the presentation of surnames in the DFD, a sense is linked with a category, which can be understood as the type of motivation for the given name. An example is the category “occupation” for the surname Bäcker (‘baker’). For our purposes it is a drawback that the attribute @type is not allowed on the element <sense>. We use the less concise attribute @value as an alternative.
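A minimal sketch of how the category could be read from @value is given below. The fragment is hypothetical: the value name “occupation”, the <def> wording and the wrapper elements are illustrative assumptions, and the TEI namespace is again omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: @value carries the motivation category because
# @type is not available on <sense> in the scheme described in the text.
# The value name "occupation" and the <def> wording are illustrative.
ENTRY = """
<entry>
  <form type="lemma"><orth>Bäcker</orth></form>
  <sense value="occupation"><def>baker</def></sense>
</entry>
"""

def senses_by_category(entry_xml):
    """Group the definition texts of an entry by the category in @value."""
    groups = {}
    for sense in ET.fromstring(entry_xml.strip()).iter("sense"):
        category = sense.get("value", "uncategorized")
        groups.setdefault(category, []).append(sense.findtext("def", default=""))
    return groups

print(senses_by_category(ENTRY))  # {'occupation': ['baker']}
```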

A further example of missing options for explicit markup relates to the sense part. In the DFD, senses are ordered according to their certainty. We use the attribute @expand with the values “primary”, “uncommon”, “uncertain” and “obsolete” to differentiate them. However, the TEI Guidelines define this attribute as giving an expanded form of the information (TEI Consortium P5 2012). The slightly different usage in the DFD annotation scheme results from the lack of suitable alternatives and from the denotative meaning of the expression “to expand”.

Furthermore, it would be helpful to have elements within the module “Names, Dates, People, and Places” which encode not only settlements, place names and geographical names in general but also more precise features such as hydronyms or agronyms. Currently, these features are tagged as follows in our articles: <geogName type="hydronym"/>.

Another aspect is the ambiguous usage of one element in several contexts. An example is the tag <surname>, which can be used to encode the surname in general as well as to annotate the last name of the author of a cited publication.
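The certainty values carried by @expand could be used for ordering as sketched below. The ranking is an assumption: it simply follows the order in which the four values are listed above. The fragment and its definition texts are placeholders, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Assumed ranking: the order in which the four values are listed in the text.
CERTAINTY_RANK = {"primary": 0, "uncommon": 1, "uncertain": 2, "obsolete": 3}

# Hypothetical fragment with placeholder definitions, deliberately unordered.
ENTRY = """
<entry>
  <sense expand="uncertain"><def>placeholder sense B</def></sense>
  <sense expand="primary"><def>placeholder sense A</def></sense>
</entry>
"""

def ordered_senses(entry_xml):
    """Return (certainty, definition) pairs, most certain first."""
    senses = ET.fromstring(entry_xml.strip()).findall("sense")
    pairs = [(s.get("expand"), s.findtext("def", default="")) for s in senses]
    # Unknown or missing values sort last.
    return sorted(pairs, key=lambda p: CERTAINTY_RANK.get(p[0], len(CERTAINTY_RANK)))

for certainty, definition in ordered_senses(ENTRY):
    print(certainty, definition)
```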

The appropriateness of the module “Dictionaries” for encoding the DFD is diminished by restrictions concerning the arrangement of elements. The element <bibl> for annotating bibliographic references is not allowed on the entry or sense level. Within the project Wörterbuchnetz, the restriction concerning the <sense> element is circumvented by embedding the element <bibl> within the element <title> or <cit> (Hildenbrandt 2011). The encoding scheme of the DFD uses the element <cit> as a TEI-conform parent element. For example: <cit> <bibl> <author> <surname>Gottschald</surname> </author> <date when="2006"/> <biblScope type="pp">5</biblScope> </bibl> </cit>
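To show what the <cit> wrapper means for anyone reading the data, the following sketch extracts the reference from the example above. The enclosing <sense> element is an assumption added for the illustration, and the TEI namespace is omitted. The path author/surname also shows how the citation context separates an author's last name from a surname treated as a dictionary headword.

```python
import xml.etree.ElementTree as ET

# The <cit> example from the text, wrapped in a hypothetical <sense>.
SENSE = """
<sense>
  <cit>
    <bibl>
      <author><surname>Gottschald</surname></author>
      <date when="2006"/>
      <biblScope type="pp">5</biblScope>
    </bibl>
  </cit>
</sense>
"""

def references(sense_xml):
    """Collect (author surname, year, pages) for each <bibl> nested in a <cit>."""
    sense = ET.fromstring(sense_xml.strip())
    refs = []
    for bibl in sense.findall("./cit/bibl"):
        date = bibl.find("date")
        refs.append((
            bibl.findtext("author/surname"),
            date.get("when") if date is not None else None,
            bibl.findtext("biblScope[@type='pp']"),
        ))
    return refs

print(references(SENSE))  # [('Gottschald', '2006', '5')]
```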

The risk of these flexible solutions is that similar projects might handle similar situations by choosing different TEI-conform markup strategies or customizations through TEI extensions, which limits the possibilities for interchange.

As a result, we find that some aspects are not covered by the TEI modules “Dictionaries” and “Names, Dates, People, and Places” as adequately as would be needed to realize the intended function of a new dictionary of surnames in Germany. An extension of the tagset might include elements for the frequency and the distribution. A further proposal refers to the element <bibl>, which should be allowed in more contexts. The aim of the TEI Guidelines, which is to provide an expressive and explicit tagset, is not completely fulfilled for the DFD: the indirect denotations and the extensive use of attributes adversely affect readability for the human lexicographers working on the XML. These are among the reasons for developing a working environment that uses the author view of the XML editor oXygen instead of the source view.

Our observations may give impetus for slight extensions of the TEI, leading to a more comprehensive, comprehensible and flexible annotation scheme for general dictionaries as well as a more adequate annotation scheme for specialized dictionaries. An appropriate and thorough encoding can be seen as the basis for a wide range of application scenarios for the DFD.

Bibliography

Austrian Academy of Sciences (ed.) (n.d.) Familiennamen Österreichs (FamOs). http://hw.oeaw.ac.at/famos (accessed June 30, 2013).

Buchner, S./Winner, M. (2011). Digitales Ortsnamenbuch (DONBO). Neue Perspektiven der Namenforschung. In Ziegler, A./Windberger-Heidenkummer, E. (eds.): Methoden der Namenforschung. Methodologie, Methodik und Praxis. Berlin: Akademie Verlag, pp. 183-198.

Budin, G./Majewski, S./Mörth, K. (2012). Creating Lexical Resources in TEI P5. A Schema for Multi-purpose Digital Dictionaries. In Journal of the Text Encoding Initiative. 3. November 2012, Online since 15 October 2012. URL: http://jtei.revues.org/522; DOI: 10.4000/jtei.522. (accessed June 30, 2013).

Hausmann, F. J. (1989). Wörterbuchtypologie. In Hausmann, F. J./Reichmann, O./Wiegand, H. E./Zgusta, L. (eds.): Wörterbücher: Ein internationales Handbuch zur Lexikographie. Berlin/New York: de Gruyter, pp. 968-980.

Hildenbrandt, V. (2011). TEI-basierte Modellierung von Retrodigitalisaten (am Beispiel des Trierer Wörterbuchnetzes). In Klosa, A./Müller-Spitzer, C. (eds.): Datenmodellierung für Internetwörterbücher. 1. Arbeitsbericht des wissenschaftlichen Netzwerks “Internetlexikografie”. Mannheim: Institut für Deutsche Sprache, pp. 21-35.

Ide, N./Kilgarriff, A./Romary, L. (2000). A Formal Model of Dictionary Structure and Content. In Proceedings of Euralex 2000. Stuttgart, 113-126.

Ide, N./Le Maitre, J./Véronis, J. (1994). Outline of a Model for Lexical Databases. In Zampolli, A./Calzolari, N./Palmer, M. (eds.): Current Issues in Computational Linguistics. Pisa: Giardini Editori, pp. 283-320.

Ide, N./Sperberg-McQueen, M. (1995). The TEI. History, Goals, and Future. In Computers and the Humanities 29, 5-15.

Ide, N./Véronis, J./Warwick-Armstrong, S./Calzolari, N. (1992). Principles for encoding machine readable dictionaries. In Tommola, H./Varantola, K./Salmi-Tolonen, T./Schopp, Y. (eds.): EURALEX ’92. Proceedings I-II. Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere, Finland. Tampere: Tampereen Yliopisto, pp. 239-246.

Jannidis, F. (2009). TEI in a Crystal Ball. In Literary and Linguistic Computing. 24(3), 253-265.

Kühn, P. (1989). Typologie der Wörterbücher nach Benutzungsmöglichkeiten. In Hausmann, F. J./Reichmann, O./Wiegand, H. E./Zgusta, L. (eds.): Wörterbücher: Ein internationales Handbuch zur Lexikographie. Berlin/New York: de Gruyter, pp. 111-127.

Nübling, D./Kunze, K. (2006). New Perspectives on Müller, Meyer, Schmidt: Computer-based Surname Geography and the German Surname Atlas Project. In Studia Anthroponymica Scandinavica. Tidskrift för nordisk personnamnsforskning 24, 53-85.

Schmuck, M./Dräger, K. (2008). The German Surname Atlas Project. Computer-Based Surname Geography. In Proceedings of the 23rd International Congress of Onomastic Sciences. Toronto, 319-336.

TEI Consortium (eds.). Guidelines for Electronic Text Encoding and Interchange. 17th January 2013. http://www.tei-c.org/P5/ (accessed June 30, 2013).

Trier Center for Digital Humanities (ed.) (n.d.) Wörterbuchnetz. http://woerterbuchnetz.de/ (accessed June 30, 2013).