
Word tokens

Lexical tokenization

A string of Attic Greek can be split into lexical tokens, each composed of one or more alphabetic characters, editorial characters, and the elision character; punctuation and tokenizing white space are excluded. This is equivalent to splitting the string on the specified tokenizing characters and removing all specified punctuation characters. The resulting tokens are suitable for morphological analysis.

Examples: tokenization

The string EDOXSEN TEI BOLEI includes tokenizing white space. We can tokenize this string into:

Token
EDOXSEN
TEI
BOLEI

The string TAUTA D'ENAI includes white space and an elision marker. We can tokenize this into the following, keeping the elision marker attached to the elided word:

Token
TAUTA
D'
ENAI

The string TAUTA:D'ENAI includes a punctuation mark and an elision marker. We can tokenize this identically to the preceding example:

Token
TAUTA
D'
ENAI
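
The three examples above can be reproduced with a short Python sketch. This is an illustration only, not the library's implementation: the function name tokenize and the contents of PUNCTUATION are assumptions, since only the ":" punctuation mark and the "'" elision marker are attested in the examples.

```python
import re

# Assumed character classes. Only ":" (punctuation) and "'" (elision) appear
# in the examples above; the other punctuation marks are hypothetical.
PUNCTUATION = ":.,;"
ELISION = "'"

# A token is a run of characters that are neither white space, punctuation,
# nor the elision marker, optionally closed by a single elision marker.
_TOKEN = re.compile(
    rf"[^\s{re.escape(PUNCTUATION + ELISION)}]+{re.escape(ELISION)}?"
)


def tokenize(text: str) -> list[str]:
    """Return the lexical tokens of text, dropping punctuation and white space."""
    return _TOKEN.findall(text)


for sample in ("EDOXSEN TEI BOLEI", "TAUTA D'ENAI", "TAUTA:D'ENAI"):
    print(sample, "->", tokenize(sample))
```

Because the elision marker is allowed only at the end of a match, D' and ENAI come out as separate tokens whether or not white space or punctuation stands between them.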

Syllabification

Lexical tokens can be split into ordered lists of syllables, where each syllable is a valid, non-empty Greek string.

Examples: syllabification

The token EDOXSEN can be divided into three syllables:

Syllable
E
DO
XSEN

The token ENAI can be divided into two syllables:

Syllable
E
NAI

In the token E_NAI, the quantity marker is carried through:

Syllable
E_
NAI
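
A deliberately simplified Python sketch is enough to reproduce these three divisions. It is not the library's algorithm: the function name syllabify, the vowel inventory (A, E, I, O, U) and the list of diphthongs are assumptions made for illustration. The sketch attaches every consonant to the vowel that follows it, which suffices for the examples above but does not model clusters that are split across syllables, such as doubled consonants.

```python
import re

# Assumed inventory for the ASCII mapping (not confirmed by the examples).
# A nucleus is a diphthong or a single vowel, optionally followed by the
# quantity marker "_".
_NUCLEUS = r"(?:AI|EI|OI|UI|AU|EU|OU|[AEIOU])_?"

# A syllable is every non-vowel character before a nucleus, plus the nucleus.
_SYLLABLE = re.compile(rf"[^AEIOU]*{_NUCLEUS}")


def syllabify(token: str) -> list[str]:
    """Split a token into syllables, one vowel nucleus per syllable."""
    matches = list(_SYLLABLE.finditer(token))
    if not matches:
        # No vowel at all (for example the elided token D'): keep it whole.
        return [token]
    syllables = [m.group() for m in matches]
    # Consonants after the last nucleus close the final syllable.
    syllables[-1] += token[matches[-1].end():]
    return syllables


for sample in ("EDOXSEN", "ENAI", "E_NAI"):
    print(sample, "->", syllabify(sample))
```

Joining the returned syllables always gives back the original token, so quantity markers such as the one in E_NAI are carried through unchanged.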

Accentuation

Accent characters can be added automatically to word tokens when vowel quantity is explicitly indicated and the recessive or persistent accent pattern requires no further morphological information.

Examples: accentuation

Adding recessive accentuation to EDOXSEN produces E/DOXSEN in the ASCII mapping, or ἔδοχσεν in the Greek Unicode mapping.

Adding persistent accentuation on the penult to NIKE_S produces NI/KE_S in the ASCII mapping, or νίκε_ς in the Greek Unicode mapping.
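
The general rules of Greek accentuation behind these two results can be sketched in Python. This is a hedged illustration rather than the library's code, and several details are assumptions: the function names, the use of "=" for the circumflex (borrowed from Beta Code; only "/" is attested above), the placement of the accent directly after the vowel letter and before any "_", and the diphthong list. The sketch takes an already syllabified token and covers only tokens of two or more syllables.

```python
VOWELS = "AEIOU"
DIPHTHONGS = {"AI", "EI", "OI", "UI", "AU", "EU", "OU"}  # assumed inventory
QUANTITY = "_"
ACUTE = "/"       # attested in the examples above
CIRCUMFLEX = "="  # assumption: Beta Code convention, not attested here


def is_long(syllable: str, final: bool = False) -> bool:
    """Long if the vowel is marked with "_" or the nucleus is a diphthong."""
    if final and syllable.endswith(("AI", "OI")):
        # Word-final AI and OI normally count as short for accentuation.
        return False
    nucleus = "".join(ch for ch in syllable if ch in VOWELS)
    return QUANTITY in syllable or nucleus in DIPHTHONGS


def _mark(syllable: str, accent: str) -> str:
    """Insert the accent directly after the last vowel letter (assumed)."""
    pos = max(i for i, ch in enumerate(syllable) if ch in VOWELS) + 1
    return syllable[:pos] + accent + syllable[pos:]


def add_accent(syllables: list[str], pattern: str = "recessive") -> str:
    """Accent a syllabified token; pattern is "recessive" or "penult"."""
    if pattern not in ("recessive", "penult"):
        raise ValueError(f"unknown accent pattern: {pattern!r}")
    if len(syllables) < 2:
        raise ValueError("this sketch needs at least two syllables")
    ultima_long = is_long(syllables[-1], final=True)
    penult_long = is_long(syllables[-2])
    if pattern == "recessive" and len(syllables) >= 3 and not ultima_long:
        # Short ultima: the accent recedes to the antepenult, always acute.
        index, accent = len(syllables) - 3, ACUTE
    else:
        # Accent on the penult: circumflex only for long penult + short ultima.
        index = len(syllables) - 2
        accent = CIRCUMFLEX if penult_long and not ultima_long else ACUTE
    marked = list(syllables)
    marked[index] = _mark(marked[index], accent)
    return "".join(marked)


print(add_accent(["E", "DO", "XSEN"], "recessive"))
print(add_accent(["NI", "KE_S"], "penult"))
```

With unmarked vowels treated as short, the first call yields E/DOXSEN and the second NI/KE_S, matching the examples above. Cases where quantity or accent depends on morphology, such as word-final AI or OI that count as long in some verb forms, fall outside what this sketch can decide.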