Abstract:
Spell-checking can be reduced to searching for a given word in a comprehensive
dictionary of the target language. Previous research on South African Southern Bantu Languages
(SBLs) has demonstrated that this approach does not work well for conjunctively written
agglutinative languages. It is not possible to create comprehensive dictionaries for such languages
because their morphology allows infinitely many spoken and written word forms to be created in
real contexts. In standard dictionaries, the headwords of entries are very often not complete
words but morphemes around which words are built by inflection and compounding. Developers of
spell-checkers have therefore had to devise alternative approaches to counter this. In
the absence of larger data sets and dictionaries, most of these approaches aim to enhance dictionary
sizes synthetically by using various heuristics. Recently, a data-driven approach has shown promise in
delivering effective results without requiring an increase in dictionary size. However, there is limited
research on how effectively these approaches deal with out-of-vocabulary (OOV) words.
A word is considered out of vocabulary if the system was built without being exposed to it.
Such words are highly prevalent in conjunctively written languages, which include Shona.
This research had two broad aims. First, it sought to establish how developers of spell
checkers have addressed the question of out-of-vocabulary words within Southern Bantu Languages.
Second, it aimed to develop a new method for spell-checking Shona that uses
morphological analysis to improve performance on out-of-vocabulary words.
A meta-narrative review of the literature on the spell-checking of conjunctively written agglutinative
languages was conducted. This revealed a lack of research on how spell
checkers handle out-of-vocabulary words. Following this, a finite-state-transducer-based
morphological analyser for Shona was developed. Verbs, nouns, and pronouns were prioritised for
inclusion in the morphological analyser due to their complexity and relative prevalence. This
morphological analyser is called Morphological Analysis of Shona using Knowledge and Observations
(MAShoKO). A spell checker for Shona that validates spellings in two phases
was built on MAShoKO. It starts with a dictionary lookup and then applies
morphological analysis to OOV words. OOV words that are not morphologically well formed are
flagged as invalid, whilst those that conform to Shona morphology are accepted. This spell
checker’s performance was then tested against a character trigram language model (CTLM) based
spell checker.
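The two-phase check described above can be illustrated with a minimal sketch. It is not the MAShoKO implementation: the tiny lexicon, the morpheme lists, and the function names are all assumptions made for illustration, and a regular expression stands in for the finite state transducer. The verb template (subject prefix, tense marker, root, final vowel) is a heavily simplified fragment that covers only a handful of forms.

```python
import re

# Phase-one lexicon: a toy stand-in for the real dictionary (assumption).
DICTIONARY = {"ndinoenda", "tinofamba"}

# Illustrative, heavily simplified verb slots (assumptions, not MAShoKO's grammar).
SUBJECT_PREFIXES = ["ndi", "ti", "va"]
TENSE_MARKERS = ["no", "cha"]
VERB_ROOTS = ["end", "famb", "taur"]
FINAL_VOWEL = "a"

# A regular expression is used here as a simple finite-state acceptor:
# subject prefix + tense marker + verb root + final vowel.
VERB_PATTERN = re.compile(
    "(?:{})(?:{})(?:{}){}".format(
        "|".join(SUBJECT_PREFIXES),
        "|".join(TENSE_MARKERS),
        "|".join(VERB_ROOTS),
        FINAL_VOWEL,
    )
)


def is_well_formed(word: str) -> bool:
    """Phase two: accept a word only if it fits the toy verb template."""
    return VERB_PATTERN.fullmatch(word) is not None


def check_word(word: str) -> str:
    """Run the dictionary lookup first, then morphological analysis for OOV words."""
    candidate = word.lower()
    if candidate in DICTIONARY:
        return "valid (dictionary)"
    if is_well_formed(candidate):
        return "valid (morphology)"
    return "invalid"


if __name__ == "__main__":
    for token in ["ndinoenda", "vachataura", "ndinoendx"]:
        print(token, "->", check_word(token))
```

In this sketch, a word that is absent from the toy lexicon but fits the template is accepted in the second phase, whereas a word that fits neither is flagged. A word from a part of speech that the template does not encode would also be flagged, even if it were correctly spelled.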
The MAShoKO based spell checker outperforms the CTLM spell checker on OOV words for the parts
of speech that were encoded into it. However, it does not perform as well on those words whose
structure is not encoded in the morphological analyser.
The study concluded that morphological analysis is effective in improving the ability of spell
checkers to handle out-of-vocabulary words in conjunctively written agglutinative languages. This,
however, requires that all parts of speech be adequately encoded in the morphological analyser.