QUICKSTART: Guide to Lemmatizing a Text

0. What happens before you begin

Someone has processed a plain text file (*.txt) of your text using Bridge/Tools.

This can be done via the web Lemmatizer or using the Bridge/Tools Scripts (available to the public on GitHub/Git-Classical/Bridge).

An alpha version of Bridge Tools is now available, which allows for the lemmatization of Latin and Greek texts.

1. Getting to Know the Lemmatization Sheet

 You now have a lemmatization spreadsheet that can be worked on in Excel, Google Docs, or any major spreadsheet program. By default, unambiguous words (i.e. non-homonyms) will have been lemmatized by the lemmatization program, allowing you to focus on the real work that requires human judgment.

Figure 1. The TEXT Sheet in the Lemmatization Spreadsheet

The Columns

CHECK: This cell checks the entry in TITLE (Column B) to see if it is already in The Bridge Dictionary. When you enter a valid TITLE in Column B, this cell will fill with a matching entry when you SAVE the file. #N/A will appear if the TITLE is not valid. Please note that recently generated sheets will leave this column blank (as it is now superfluous in the process).

TITLE: Where you will lemmatize the text by adding the UNIQUE ID that matches the inflected form in TEXT (Column C) with the word in the DICTIONARY.  Blank cells (which need TITLEs) should appear yellow.

TEXT: your text runs down the sheet in this column. NOTE: Latin words that end in -N or -QUE might have been split into multiple rows. If so, lemmatize the actual form and delete the superfluous row. For example, if “relinque” has been split into rows with “relin” and “que”, recombine them and lemmatize the proper form. But also note that enclitics (-que ~ et) should be lemmatized separately; so if, conversely, you have a TEXT form of iustamque, please split that into two rows, lemmatize each, and add an intermediate entry for the Running Count (e.g., 1014, 1014.5).

LOCATION: the book, chapter, poem, line, or section in which the word appears. This will be automatically created by the program that generated the spreadsheet but should be checked.

RUNNINGCOUNT: allows you to sort the spreadsheet back to text order. If you add any words, be sure to add a value between those in the rows above and below.
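
The intermediate-value rule for split enclitics can be sketched in Python. This is only an illustration, assuming the sheet has been exported as a list of (RUNNINGCOUNT, TEXT) pairs in text order; the helper names midpoint and split_enclitic are hypothetical and are not part of Bridge/Tools. The neighbouring words in the sample are invented.

```python
def midpoint(before, after):
    """Return a RUNNINGCOUNT value that sorts between two neighbouring rows."""
    return (before + after) / 2


def split_enclitic(rows, index, enclitic="que"):
    """Split the TEXT form at rows[index] into a stem row and an enclitic row.

    rows is a list of (running_count, text) tuples in text order
    (an assumed layout, not the spreadsheet's actual format).
    """
    count, form = rows[index]
    if not form.lower().endswith(enclitic) or len(form) <= len(enclitic):
        return rows  # nothing to split
    # The next row's count bounds the new value; use count + 1 on the last row.
    next_count = rows[index + 1][0] if index + 1 < len(rows) else count + 1
    stem, tail = form[: -len(enclitic)], form[-len(enclitic):]
    new_rows = [(count, stem), (midpoint(count, next_count), tail)]
    return rows[:index] + new_rows + rows[index + 1:]


rows = [(1013, "fidem"), (1014, "iustamque"), (1015, "causam")]
for count, form in split_enclitic(rows, 1):
    print(count, form)
```

Sorting on the first element of each pair still restores text order after the split, which is the purpose of the intermediate value. The sketch splits any form ending in the enclitic, so a word like “relinque” still needs human judgment.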

SECTION: The lemmatizer will parse the sentences in your text and create a running count for them.

DISPLAYLEMMA: the principal parts of the word; this will automatically fill for valid TITLEs when you save the file. DO NOT MODIFY THIS ENTRY.

SHORTDEF: a succinct definition of the word; this will automatically fill for valid TITLEs when you save the file. DO NOT MODIFY THIS ENTRY.

LONGDEF: a more expansive definition of the word; this will automatically fill for valid TITLEs when you save the file. DO NOT MODIFY THIS ENTRY.

LOCALDEF: [optional] if you are adding custom definitions for your text, they will display here; this will automatically fill for valid TITLEs when you save the file. DO NOT MODIFY THIS ENTRY in the “TEXT” page. Instead, you will create new definitions after you have lemmatized the text and formatted it into a Glossary.

PROBLEM: Add notes if you detect any problems with the TEXT, DISPLAYLEMMA, SHORTDEF, or LONGDEF of a word.

The Sheets

If you look at the bottom of the spreadsheet you will see three sheets: TEXT (the sheet you’ll be working in), DICTIONARY (a local copy of the Bridge Dictionary), and QUICKSTART (which contains a link to this page).

Figure 2. The sheets in the Lemmatization Spreadsheet

2. Lemmatize your passage by identifying each word and adding new vocabulary to the DICTIONARY.

Read through every entry, manually checking those words that have been lemmatized and lemmatizing the ambiguous forms that were not auto-lemmatized.
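
If you export the TEXT sheet to CSV, the rows that still need a TITLE can be listed with a few lines of Python. This is a minimal sketch, assuming the export keeps the column headers described above; the function name rows_needing_titles and the sample rows (including their TITLEs) are invented for illustration.

```python
import csv
import io

# Invented sample rows; a real export would be read from a file instead.
SAMPLE = """CHECK,TITLE,TEXT,LOCATION,RUNNINGCOUNT
,ARMA,arma,1.1,1
,,virum,1.1,2
,,cano,1.1,3
"""


def rows_needing_titles(csv_text):
    """Return (RUNNINGCOUNT, LOCATION, TEXT) for every row whose TITLE is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["RUNNINGCOUNT"], row["LOCATION"], row["TEXT"])
        for row in reader
        if not row["TITLE"].strip()
    ]


for count, location, form in rows_needing_titles(SAMPLE):
    print(count, location, form)
```

The list only shows where work remains; auto-lemmatized rows still need to be checked by eye.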

BEFORE YOU BEGIN WATCH THIS SHORT (SILENT) VIDEO TO SEE LEMMATIZATION IN ACTION

Lemmatizing requires you to add the correct TITLE to the TITLE Column (B). A TITLE is either a Known Lemma or a New Lemma. First we will discuss Known Lemmas, then how to handle New Lemmas.

At the start you’ll need to find the correct TITLEs in the DICTIONARY sheet. When you start typing in a cell in Column B, possibilities will be suggested if that TITLE already appears in the TEXT sheet. After a little while you’ll have a sense of what form a TITLE may take, and the process can move quite quickly.

Note that TITLES follow a standard orthography and format:

    • TITLES are always ALL-CAPS
    • U’s are always V’s; J’s are I’s; e.g., the TITLE for “abjuro” is ABIVRO.
    • homonyms are distinguished by /1, /2, etc. These are usually ranked in a rational order (nouns, adjectives, numbers, pronouns, verbs, adverbs, prepositions, other), but unless you are absolutely certain about the TITLE, please verify it by looking at the DISPLAY LEMMA and DEFs (after you save your file, these will populate automatically).
    • There are a few other suffixes to distinguish homonyms: e.g., /N for proper names; /A for proper adjectives.
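
The orthography rules above can be expressed as a short Python sketch. It assumes plain Latin forms without macrons, and its format check covers only the suffixes named in this guide (/1, /2, ..., /N, /A), so it may reject other valid suffixes. The names to_title and looks_like_title are hypothetical helpers, not part of Bridge/Tools.

```python
import re


def to_title(form, suffix=None):
    """Convert a dictionary form to TITLE orthography: ALL-CAPS, U -> V, J -> I."""
    title = form.upper().replace("U", "V").replace("J", "I")
    if suffix is not None:
        title = f"{title}/{suffix}"  # e.g. a homonym number, N, or A
    return title


# Capital letters, optionally followed by /<number>, /N, or /A (assumed format).
TITLE_PATTERN = re.compile(r"^[A-Z]+(/(\d+|N|A))?$")


def looks_like_title(title):
    """Check that a string follows the TITLE conventions listed above."""
    if "U" in title or "J" in title:
        return False
    return bool(TITLE_PATTERN.match(title))


print(to_title("abjuro"))
print(to_title("lego", 2))
print(looks_like_title("ABJURO"))
```

A check like this catches only formatting slips; it cannot tell you whether /1 or /2 is the right homonym, which still has to be verified against the DISPLAY LEMMA and definitions.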

General principle: lemmatize to the most general form that will be accessible to a novice reader.

For example, say you encounter the word legente. It could be used substantively to indicate a reader (and is, a handful of times, in common ancient texts), but even if you were lemmatizing one of those moments, it would be better for developing the reader’s lexical competency to lemmatize to the verb (LEGO/2).

From this general principle, a few general practices:

* participles, supines, etc. to their verbs
* substantives to their adjectives
* rare orthographies to the more typical form (if the rare form is somewhat common, we can add a note to the DISPLAYLEMMA; indicate this in PROBLEM)

For more details about Bridge lemmatization principles, please visit this page.

You can find complete instructions for lemmatizing your text here.

ADDING NEW LEMMAS: if a word is not in the DICTIONARY, first check and triple check that it is not in the DICTIONARY. Consider different spellings; try searching for a principal part (without macrons). If you are certain that the word is not in the DICTIONARY, then add it at the bottom of DICTIONARY. Preface the TITLE with three hashtags (###). Fill in the DISPLAY LEMMA, SHORTDEF, and LONGDEF. Don’t worry about the other columns; they will be generated automatically or must be added by the Project Director. By adding the ###, you will guarantee that your new entry will be vetted and added to the master Bridge DICTIONARY used by the Bridge Program.

E.g., suppose the proper name “Bevis” appears in your text but there is no BEVIS/N entry in the DICTIONARY. At the bottom of the DICTIONARY sheet, add:

Figure 3. Adding a new TITLE

In the TEXT sheet, be sure to refer to the new entry with its ### prefix (e.g., ###BEVIS/N).
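
A quick consistency check between the two sheets might look like the following Python sketch. It assumes that the TITLE columns of TEXT and DICTIONARY have been read into plain lists and that the TEXT sheet uses the same ###-prefixed TITLE as the new DICTIONARY row; both function names and the sample lists are hypothetical.

```python
def unknown_titles(text_titles, dictionary_titles):
    """Return TITLEs used in TEXT that have no matching row in DICTIONARY."""
    known = set(dictionary_titles)
    return sorted({t for t in text_titles if t and t not in known})


def new_lemmas(dictionary_titles):
    """Return the DICTIONARY TITLEs flagged with ### for vetting."""
    return [t for t in dictionary_titles if t.startswith("###")]


# Illustrative data only: one new lemma added at the bottom of DICTIONARY.
dictionary = ["ABIVRO", "LEGO/2", "###BEVIS/N"]
text = ["LEGO/2", "###BEVIS/N", "BEVIS/N", ""]

print(unknown_titles(text, dictionary))
print(new_lemmas(dictionary))
```

Here BEVIS/N is reported as unknown because the TEXT row dropped the ### prefix, while blank TITLE cells are skipped rather than reported.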

ADDING DATA TO BLANK TITLES: if a word appears in the DICTIONARY but without any other information (i.e. the TITLE is there, but it lacks dictionary entries and definitions), you can add the DISPLAY LEMMA (with macrons), SHORTDEF, and LONGDEF. I’ll be able to harvest these. But note: if information is already present, your changes to it cannot be harvested. Instead, make a note in the PROBLEM Column of DICTIONARY (Column N).

E.g. if the proper name “Bretus” appeared in your text and BRETVS/N appears in the DICTIONARY but without any other information, you would add the dictionary entry and definition in that row.

If you need to add dictionary entries for Latin texts, the fastest and most accurate way to do so is to copy them from LaNe*, which is available on Logeion.

Figure 4. Logeion

*  LaNe = Woordenboek Latijn/Nederlands, 6th revised edition 2014, a Latin-Dutch translation dictionary, originally based on Pons Globalwörterbuch Lateinisch-Deutsch (Klett) but with full coverage of all entries also contained in the Oxford Latin Dictionary.  It is the current gold standard for Latin vowel quantities.

ADDING CUSTOM DEFINITIONS [Optional]: if you are adding custom definitions for your text, be sure that the LOCALDEF for each word is the best definition for the word. Modify these as needed.

Logeion is also a great place to find/copy definitions (but be thoughtful about this; make sure that you include definitions relevant to your text).

Note that your spreadsheet may have some additional columns that do not appear in the sheets described in these instructions. These contain morphological information and other coding that you can ignore.

3. Submit your lemmatization sheet

When you have finished lemmatizing your text, submit it to The Bridge. We will harvest the new information in your local DICTIONARY and add your text information to The Bridge!
