IMPHAL: Dr Medari Janai Tham, a researcher at Assam Don Bosco University (ADBU), has developed a Natural Language Processing (NLP) resource called the "Tham Khasi Annotated Corpus" for computational processing of the Khasi language.
NLP is a collection of computational methods for analysing and synthesising human language, such as speech and text. Creating a corpus, a collection of machine-readable content, is a crucial step in developing NLP systems for a language.
The British National Corpus (BNC) is among the most widely used corpora for English, and its accessibility makes it popular among scholars. Khasi has been classified as a resource-poor language because no publicly available corpus existed for it.
The publication of the "Tham Khasi Annotated Corpus," which is available through the European Language Resources Association (ELRA), is a significant contribution to this field. The corpus was manually annotated using the BIS (Bureau of Indian Standards) Part-of-Speech (PoS) tagset so that its tagging is standardised with that of other Indian languages.
Tham received a doctorate from ADBU's Computer Science and Engineering Department for her thesis 'Shallow Parsing for Khasi,' which she wrote under the supervision of Prof. Pushpak Bhattacharyya of IIT Bombay.
Details of the corpus, including the annotation scheme and the development of Khasi NLP tools, can be found in research papers published as part of her PhD and on www.grammarkhasi.in, which also serves as a companion website to Macmillan Education's book "Ka Grammar Khasi Da Ka Jingdro."
Tham's other contributions include the BIS Khasi tagset, a hybrid Khasi PoS tagger, an HMM (hidden Markov model) Khasi PoS tagger, an NLTK Khasi PoS tagger, an HMM Khasi shallow parser, and a Khasi shallow parser based on a bidirectional gated recurrent unit.
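For readers unfamiliar with how an HMM PoS tagger works in general, the following is a minimal sketch in Python. It is not Tham's implementation and does not use the Khasi corpus or the BIS tagset: the training sentences are toy English examples, the tag names (DET, N, V) are placeholders, and the function names `train_hmm` and `viterbi` are hypothetical. It assumes add-one smoothing for simplicity. The tagger counts how often each tag follows another and how often each word appears with each tag, then uses the Viterbi algorithm to pick the most likely tag sequence for a new sentence.

```python
from collections import Counter, defaultdict
import math

# Toy English data, purely illustrative. It is NOT drawn from the
# Tham Khasi Annotated Corpus, and the tags are not BIS tags.
TRAIN = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # Each column maps tag -> (best log score, previous tag on that path).
    first = words[0]
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], first, n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        best.append(column)

    # Follow back-pointers from the best final tag to recover the path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


trans, emit, vocab = train_hmm(TRAIN)
print(viterbi(["the", "dog", "sleeps"], trans, emit, vocab))
```

A tagger trained on a real annotated corpus would follow the same pattern, with the toy sentences replaced by the corpus's tagged sentences and its own tagset.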
Copyright©2024 Living Media India Limited. For reprint rights: Syndications Today