Out Of Vocabulary Words in Spell checking for Southern Bantu Languages: a morphological analysis-based approach for Shona

Kambarami, Farayi

URI: https://ir.cut.ac.zw:8080/xmlui/handle/123456789/298

Date: 2021-12

Abstract:

Spell-checking can be reduced to searching for a given word in a comprehensive dictionary of the target language. Previous research on South African Southern Bantu Languages (SBLs) has demonstrated that this approach does not work well for conjunctively written agglutinative languages. It is not possible to create comprehensive dictionaries for such languages because their morphology allows infinite possibilities for creating spoken and written words in real contexts. In standard dictionaries, the headwords of entries are very often not complete words but morphemes around which words are built by inflection and compounding. Developers of spell checkers have therefore had to devise alternative approaches. In the absence of larger data sets and dictionaries, most of these approaches aim to enlarge dictionaries synthetically by using various heuristics. More recently, a data-driven approach has shown promise in delivering effective results without requiring an increase in dictionary size. However, there is limited research on the effectiveness of these approaches in dealing with out-of-vocabulary (OOV) words. Words are considered out of vocabulary if the system was built without being exposed to them. Such words are highly prevalent in conjunctively written languages, which include Shona. This research had two broad aims. First, it sought to establish how developers of spell checkers have addressed the question of OOV words within Southern Bantu Languages. Second, it aimed to develop a new method for spell-checking Shona that uses morphological analysis to optimize performance on OOV words. A meta-narrative review of the literature on the spell-checking of conjunctively written agglutinative languages was conducted. This revealed a lack of research focus on how spell checkers handle OOV words.
Following this, a morphological analyser for Shona based on finite state transducers was developed. Verbs, nouns, and pronouns were prioritised for inclusion in the morphological analyser due to their complexity and relative prevalence. This morphological analyser is called Morphological Analysis of Shona using Knowledge and Observations (MAShoKO). A spell checker for Shona, which checks the validity of Shona spellings in two phases, was built on MAShoKO. It starts with a dictionary lookup and follows this with a morphological analysis of OOV words. OOV words that are not morphologically well formed are flagged as invalid, whilst those that conform to Shona morphology are accepted. This spell checker’s performance was then tested against a spell checker based on a character trigram language model (CTLM). The MAShoKO-based spell checker outperformed the CTLM spell checker on OOV words for the parts of speech that were encoded into it. However, it did not perform as well on words whose structure is not encoded in the morphological analyser. The study concluded that morphological analysis is effective in improving how spell checkers handle out-of-vocabulary words in conjunctively written agglutinative languages. This, however, requires that all the parts of speech be adequately encoded in the morphological analyser.
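
The two-phase check described in the abstract can be illustrated with a minimal Python sketch. This is not MAShoKO and does not use a finite state transducer: the lexicon, the tiny morpheme lists, and the names `LEXICON`, `is_well_formed_verb`, and `check_spelling` are all assumptions made for illustration, and the verb template (subject concord + tense marker + root + final vowel) is a deliberately simplified fragment of Shona verb morphology.

```python
# Toy sketch only: hypothetical data and names, not the thesis implementation.
# A real analyser (such as an FST) would cover far more morphology.

SUBJECT_CONCORDS = ("ndi", "ti", "mu", "va", "u", "a")  # simplified subset
TENSE_MARKERS = ("no", "cha")                           # present habitual, future
VERB_ROOTS = ("famb", "end", "taur")                    # walk, go, speak
FINAL_VOWEL = "a"

# Phase 1 resource: a small word list standing in for a real dictionary.
LEXICON = {"ndinofamba", "vanhu"}


def is_well_formed_verb(word):
    """Return True if word parses as concord + tense + root + final vowel."""
    for concord in SUBJECT_CONCORDS:
        if not word.startswith(concord):
            continue
        rest = word[len(concord):]
        for tense in TENSE_MARKERS:
            if not rest.startswith(tense):
                continue
            stem = rest[len(tense):]
            if stem.endswith(FINAL_VOWEL) and stem[:-1] in VERB_ROOTS:
                return True
    return False


def check_spelling(word):
    """Phase 1: dictionary lookup. Phase 2: morphological check for OOV words."""
    candidate = word.lower()
    if candidate in LEXICON:
        return "valid (dictionary)"
    if is_well_formed_verb(candidate):
        return "valid (morphology)"
    return "invalid"


if __name__ == "__main__":
    for w in ("ndinofamba", "tichaenda", "ndinofmaba"):
        print(w, "->", check_spelling(w))
```

In this sketch, a word absent from the lexicon is still accepted when it decomposes into known morphemes, which mirrors the abstract's point that OOV words conforming to the encoded morphology are accepted. It also shows the stated limitation: any word whose structure is not encoded in the analyser (here, anything other than this one verb template) is flagged as invalid.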