A string of Attic Greek can be split into lexical tokens composed of one or more alphabetic characters, editorial characters, and the elision character, but excluding punctuation and tokenizing white space. Tokenizing is equivalent to splitting the string on the specified tokenizing characters and removing all specified punctuation characters. The resulting tokens are suitable for morphological analysis.
The string EDOXSEN TEI BOLEI includes tokenizing white space. We can tokenize this string into:
Token |
---|
EDOXSEN |
TEI |
BOLEI |
The string TAUTA D'ENAI includes a white-space character and an elision marker. We can tokenize this into:
Token |
---|
TAUTA |
D' |
ENAI |
The string TAUTA:D'ENAI includes a punctuation mark and an elision marker. We can tokenize this identically to the preceding example:
Token |
---|
TAUTA |
D' |
ENAI |
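As a rough sketch of this procedure, the following Python function reproduces the three examples above. The character inventories here (white space as the tokenizing characters, a colon among the punctuation characters, the apostrophe as the elision marker) are illustrative assumptions rather than the full sets defined for Attic Greek; punctuation is treated as a token boundary as well, so the third example yields the same tokens as the second.

```python
# Sketch of the tokenizing procedure described above.  The character
# inventories below are illustrative assumptions, not the complete sets
# defined for Attic Greek.
TOKENIZING_CHARS = " \t\n"   # white space splits tokens and is discarded
PUNCTUATION_CHARS = ":.;"    # punctuation ends a token and is discarded
ELISION_CHAR = "'"           # the elision marker ends a token but is kept

def tokenize(text: str) -> list[str]:
    tokens = []
    current = ""
    for ch in text:
        if ch in TOKENIZING_CHARS or ch in PUNCTUATION_CHARS:
            if current:
                tokens.append(current)
            current = ""
        elif ch == ELISION_CHAR:
            tokens.append(current + ch)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(tokenize("EDOXSEN TEI BOLEI"))  # ['EDOXSEN', 'TEI', 'BOLEI']
print(tokenize("TAUTA D'ENAI"))       # ['TAUTA', "D'", 'ENAI']
print(tokenize("TAUTA:D'ENAI"))       # ['TAUTA', "D'", 'ENAI']
```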
Lexical tokens can be split into ordered sequences of syllables, where each syllable is a valid, non-empty Greek string.
The token EDOXSEN can be divided into three syllables:
Syllable |
---|
E |
DO |
XSEN |
The token ENAI can be divided into two syllables:
Syllable |
---|
E |
NAI |
In the token E_NAI, the quantity marker is carried through:
Syllable |
---|
E_ |
NAI |
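A minimal syllabification sketch covering just these three examples is shown below. The vowel and diphthong inventories, the treatment of '_' as a quantity marker attached to the preceding vowel, and the rule that a consonant cluster attaches to the following vowel are simplifying assumptions, not the complete syllabification rules for Attic Greek.

```python
import re

# Sketch of splitting a lexical token into syllables.  The inventories and
# rules below are illustrative simplifications.
VOWELS = "AEIOU"
DIPHTHONGS = ["AI", "AU", "EI", "EU", "OI", "OU", "UI"]

# A syllable nucleus is a diphthong or a single vowel, optionally followed
# by the quantity marker '_'.
NUCLEUS = "(?:" + "|".join(DIPHTHONGS) + "|[" + VOWELS + "])_?"

# A syllable is any run of consonants plus a nucleus; consonants after the
# last nucleus are attached to the final syllable.
SYLLABLE = re.compile("[^" + VOWELS + "_]*" + NUCLEUS + "(?:[^" + VOWELS + "_]+$)?")

def syllabify(token: str) -> list[str]:
    return SYLLABLE.findall(token)

print(syllabify("EDOXSEN"))  # ['E', 'DO', 'XSEN']
print(syllabify("ENAI"))     # ['E', 'NAI']
print(syllabify("E_NAI"))    # ['E_', 'NAI']
```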
Accent characters can be added automatically to word tokens when the recessive or persistent accent pattern does not require further morphological information and vowel quantity is explicitly indicated.
Adding recessive accentuation to EDOXSEN produces E/DOXSEN in the ASCII mapping, or ἔδοχσεν in the Greek Unicode mapping.
Adding persistent accentuation on the penult to NIKE_S produces NI/KE_S in the ASCII mapping, or νίκε_ς in the Greek Unicode mapping.
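As a rough illustration of these two examples, the following sketch places an acute accent in the ASCII mapping, writing '/' after the accented vowel. It considers only explicitly marked vowel quantities and ignores circumflex accents, final diphthongs, and the other refinements of Greek accentuation; the syllable divisions are supplied as input.

```python
import re

def accent_syllable(syllable: str) -> str:
    # Insert '/' after the last vowel of the syllable (and its quantity
    # marker, if any), before any trailing consonants.
    m = re.search(r"([AEIOU]+_?)([^AEIOU]*)$", syllable)
    return syllable[: m.start()] + m.group(1) + "/" + m.group(2)

def add_recessive_accent(syllables: list[str]) -> str:
    # Recessive accent: antepenult when the word has three or more syllables
    # and the final syllable is not marked long, otherwise the penult
    # (or the only syllable of a monosyllable).  This is a simplification.
    if len(syllables) >= 3 and "_" not in syllables[-1]:
        target = len(syllables) - 3
    else:
        target = max(len(syllables) - 2, 0)
    accented = list(syllables)
    accented[target] = accent_syllable(accented[target])
    return "".join(accented)

def add_penult_accent(syllables: list[str]) -> str:
    # Persistent accent placed on the penult, as in the NIKE_S example.
    accented = list(syllables)
    target = max(len(syllables) - 2, 0)
    accented[target] = accent_syllable(accented[target])
    return "".join(accented)

print(add_recessive_accent(["E", "DO", "XSEN"]))  # E/DOXSEN
print(add_penult_accent(["NI", "KE_S"]))          # NI/KE_S
```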