Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa (Department of CSE, Swami Keshvanand Institute of Technology, Management & Gramothan)
  • Nain, Neeta (Department of CSE, Malaviya National Institute of Technology)
  • Nehra, Maninder (Department of CSE, Malaviya National Institute of Technology)
  • Received : 2018.06.09
  • Accepted : 2018.07.18
  • Published : 2018.09.30

Abstract

Natural language processing (NLP) is an emerging research area that studies how machines can be used to perceive and alter text written in natural languages. Natural-language text is analyzed through annotation tasks such as parsing, chunking, part-of-speech tagging, and lexical analysis. These annotation tasks depend on the morphological structure of the particular language. The focus of this work is part-of-speech (POS) tagging for Hindi. POS tagging, also known as grammatical tagging, is the process of assigning a grammatical category to each word of a given text. These categories can be noun, verb, time, date, number, and so on. Hindi is the most widely used language in India and an official language of the country. It is also among the five most spoken languages in the world. A diverse range of POS taggers is available for English and other languages, but these taggers cannot be applied to Hindi, because Hindi is one of the most morphologically rich languages and its morphological structure differs significantly from theirs. This work therefore presents a POS tagger for Hindi based on a hybrid approach that combines probability-based and rule-based methods. Known words are tagged with a unigram model from the probability-based class, whereas unknown words are tagged using various lexical and contextual features. Finite-state automata are constructed to describe the individual rules, and regular expressions are then used to implement them. A tagset of 29 standard part-of-speech tags is also prepared for this task. The tagset includes two unique tags, a date tag and a time tag, which support all possible formats. Regular expressions implement all pattern-based tags, such as time, date, number, and special symbols. The aim of the presented approach is to increase the correctness of automatic Hindi POS tagging while limiting the need for a large human-made corpus: the probability-based model increases the share of automatic tagging, and the rule-based model limits the dependence on an already trained corpus. The approach relies on a very small labeled training set (around 9,000 words) and achieves a best precision of 96.54% and an average precision of 95.08%. It also achieves a best accuracy of 91.39% and an average accuracy of 88.15%.
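The hybrid scheme described in the abstract can be illustrated with a minimal sketch. It assumes a unigram lookup for known words and a regular-expression fallback for pattern-based tokens. The tag names (`DATE`, `TIME`, `NUM`, `SYM`, `NN`), the patterns, and the function names are hypothetical and are not taken from the paper. The paper's lexical and contextual rules for unknown words are reduced here to a single default tag.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pattern rules, tried in order. The paper's real rules,
# supported formats, and tag names are not given in the abstract.
PATTERN_RULES = [
    ("DATE", re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}")),
    ("TIME", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("NUM", re.compile(r"\d+([.,]\d+)*")),
    ("SYM", re.compile(r"[^\w\s]+")),
]

# Placeholder for the paper's lexical and contextual unknown-word rules.
DEFAULT_TAG = "NN"


def train_unigram(tagged_words):
    """Map each training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_token(token, model):
    """Tag a known word by unigram lookup, otherwise by pattern rules."""
    if token in model:
        return model[token]
    for tag, pattern in PATTERN_RULES:
        if pattern.fullmatch(token):
            return tag
    return DEFAULT_TAG


def tag_sentence(tokens, model):
    """Tag every token of an already tokenized sentence."""
    return [(token, tag_token(token, model)) for token in tokens]


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    training = [("राम", "NNP"), ("घर", "NN"), ("है", "VAUX"),
                ("है", "VAUX"), ("है", "VM")]
    model = train_unigram(training)
    print(tag_sentence(["राम", "12/05/2018", "10:30", "42", "।", "किताब"], model))
```

Ordering the rules from most to least specific keeps a date such as `12/05/2018` from being consumed by the number rule, and the unigram lookup runs first so that words seen in training are never overridden by a pattern.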
