Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity

  • Lee, Min Seok (Department of Business Administration, The Catholic University of Korea) ;
  • Yang, Seok Woo (Department of Psychology, The Catholic University of Korea) ;
  • Lee, Hong Joo (Department of Business Administration, The Catholic University of Korea)
  • Received : 2019.06.23
  • Accepted : 2019.11.16
  • Published : 2019.12.31

Abstract

Dimensionality reduction is one of the methods for handling big data in text mining. For dimensionality reduction, the density of the data should be considered, since it has a significant influence on the performance of sentence classification: high-dimensional data requires heavy computation, which can lead to large computational costs and to overfitting of the model. A dimension-reduction step is therefore necessary to improve model performance. Diverse methods have been proposed, ranging from simply reducing noise in the data, such as misspellings or informal text, to incorporating semantic and syntactic information. Moreover, in sentence classification, one of the tasks of Natural Language Processing, how text features are represented and which features are selected affect classifier performance. The common goal of dimensionality reduction is to find a latent space that represents the raw data observed in the original space. Existing methods use various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, which learn low-dimensional vector representations of words that capture semantic and syntactic information, are also used. To improve performance further, recent studies have proposed modifying the word embeddings according to the positive and negative scores of pre-defined words. The basic idea of this study is that similar words have similar vector representations: once a feature selection algorithm marks certain words as unimportant, we assume that words similar to them also have little impact on sentence classification. This study proposes two methods for more accurate classification that perform selective word elimination under specific rules and then construct word embeddings based on Word2Vec. To find words of low importance in the text, we use the information gain algorithm to measure importance and cosine similarity to search for similar words. In the first method, we eliminate words with comparatively low information gain from the raw text and then build word embeddings. In the second method, we additionally eliminate words that are similar to the low-information-gain words and then build word embeddings. Finally, the filtered text and word embeddings are fed to two deep learning models, a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB movie reviews, and Yelp user reviews as datasets and classifies each dataset with the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were labeled as helpful reviews. Because Yelp reports only the number of helpful votes, we randomly sampled 100,000 reviews from about 750,000 reviews that received more than five helpful votes. Minimal preprocessing, such as removing numbers and special characters, was applied to each dataset. To evaluate the proposed methods, we compared their performance against Word2Vec and GloVe embeddings that used all the words. One of the proposed methods outperformed the embeddings that used all the words: removing unimportant words improved performance, whereas removing too many words lowered it.
Future research should consider diverse preprocessing methods and an in-depth analysis of word co-occurrence for measuring similarity between words. In addition, the proposed method was applied only with Word2Vec; other embedding methods such as GloVe, fastText, and ELMo can also be combined with the proposed elimination methods, making it possible to examine which combinations of word embedding and word elimination methods work best.
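
To make the elimination step concrete, the following is a minimal sketch of information-gain-based word removal, assuming tokenized documents with discrete class labels; the function names, the vocabulary handling, and the keep_ratio parameter are illustrative choices, not details taken from the paper.

    import math
    from collections import Counter

    def entropy(counts):
        # Shannon entropy of a class-count distribution
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(docs, labels, vocab):
        # IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t) for each term t
        # docs: list of token lists, labels: list of class labels, vocab: set of terms
        n = len(docs)
        class_counts = Counter(labels)
        h_c = entropy(list(class_counts.values()))
        df = {t: Counter() for t in vocab}  # per-class document frequency of each term
        for doc, y in zip(docs, labels):
            for t in set(doc) & vocab:
                df[t][y] += 1
        ig = {}
        for t in vocab:
            present = sum(df[t].values())
            absent = n - present
            h_with = entropy(list(df[t].values())) if present else 0.0
            h_without = entropy([class_counts[c] - df[t][c] for c in class_counts]) if absent else 0.0
            ig[t] = h_c - (present / n) * h_with - (absent / n) * h_without
        return ig

    def remove_low_ig_words(docs, ig, keep_ratio=0.9):
        # keep only the top keep_ratio fraction of the vocabulary by information gain
        ranked = sorted(ig, key=ig.get, reverse=True)
        kept = set(ranked[: int(len(ranked) * keep_ratio)])
        return [[t for t in doc if t in kept] for doc in docs]

Word2Vec embeddings would then be trained on the documents returned by remove_low_ig_words before classification.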

In sentence classification, which determines whether text data belongs to a given category, how sentence features are represented and which features are selected strongly affect classifier performance. The goal of feature selection is to find a representation that still describes the data well after reducing its dimensionality. Diverse methods have been proposed: selecting features with algorithms such as the Fisher Score or information gain, or reducing dimensionality by representing words as vectors learned with Word2Vec, which captures contextual semantic and syntactic information. Modifying word embeddings according to pre-defined positive and negative word scores has also been attempted. This study proposes methods that perform selective word removal and apply word embeddings to improve sentence classification accuracy: the first removes words with low information gain from the text data before applying word embeddings, and the second additionally selects neighboring words with high cosine similarity to the low-information-gain words, removes them from the text, and reconstructs the word embeddings. The data used were customer reviews of the 'Kindle' product on Amazon.com, movie reviews from IMDB, and user reviews from Yelp. An Amazon.com review was judged helpful if it received five or more helpful votes and the ratio of helpful votes among all votes was 70% or higher. For Yelp, 100,000 reviews were randomly sampled from about 750,000 reviews with five or more helpful votes. CNN and Attention-Based Bidirectional LSTM were used as the deep learning models, and Word2Vec and GloVe were used for word embeddings. We tested the statistical significance of the difference between applying Word2Vec and GloVe embeddings without word removal and applying Word2Vec embeddings after the proposed selective word removal.
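
The second variant additionally removes words whose Word2Vec vectors are close to those of the low-information-gain words. Below is a rough sketch under the assumption that gensim 4.x is used; the topn and threshold values are illustrative and are not values reported in the study.

    from gensim.models import Word2Vec

    def expand_removal_set(model, low_ig_words, topn=5, threshold=0.7):
        # add neighbours whose cosine similarity to a low-IG word exceeds the threshold
        removal = set(low_ig_words)
        for w in low_ig_words:
            if w not in model.wv:
                continue
            for neighbour, sim in model.wv.most_similar(w, topn=topn):
                if sim >= threshold:
                    removal.add(neighbour)
        return removal

    def filter_and_reembed(docs, low_ig_words, dim=300):
        # train embeddings, drop low-IG words plus their close neighbours, then retrain
        base = Word2Vec(sentences=docs, vector_size=dim, window=5, min_count=5, workers=4)
        to_remove = expand_removal_set(base, low_ig_words)
        filtered = [[t for t in doc if t not in to_remove] for doc in docs]
        refit = Word2Vec(sentences=filtered, vector_size=dim, window=5, min_count=5, workers=4)
        return refit, filtered

The filtered documents and the retrained embedding would then be fed to the CNN or the Attention-Based Bidirectional LSTM classifier.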

References

  1. Azhagusundari, B. and A.S. Thanamani, "Feature Selection based on Information Gain," International Journal of Innovative Technology and Exploring Engineering (IJITEE), Vol.2, No.2(2013), 18-21.
  2. Barkan, O., "Bayesian Neural Word Embedding," Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), (2017)
  3. Barkan, O. and N. Koenigstein."Item2Vec: Neural Item Embedding for Collaborative Filtering," arXiv Preprint arXiv:1603.04259 (2016).
  4. Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," CoRR abs/1607.04606, (2016)
  5. Deerwester, S., S.T. Dumais, T.K. Landauer, G.W. Furnas, and R. Harshman. "Indexing by latent semantic analysis," Journal of the American Society of Information Science, Vol.41, No.6(1990), 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  6. Duda, R.O., P.E. Hart, and D.G. Stork. Pattern classification, Wiley, 2000.
  7. Frome, A., G. Corrado, and J. Shlens, "Devise: A Deep Visual-Semantic Embedding Model," Advances in Neural Information Processing Systems, 26(2013) 1-11.
  8. Joachims, T., "Text categorization with support vector machines," Technical report, University of Dortmund, (1997).
  9. Jolliffe, I.T., Principal Component Analysis, Springer-Verlag New York, Secaucus, NJ, (1989)
  10. Kim, Y., "Convolutional neural networks for sentence classification," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, 1746-1751.
  11. Lee, M. and H. J. Lee, "Increasing Accuracy of Classifying Useful Reviews by Removing Neutral Terms," Journal of Intelligent Information Systems, Vol.22, No.3(2016), 129-142. https://doi.org/10.13088/jiis.2016.22.3.129
  12. Lee, M. and H. J. Lee, "Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach," Journal of Intelligent Information Systems, Vol.23, No.2(2017), 123-138.
  13. Lewis, D.D., "Naive (Bayes) at forty: The independence assumption in information retrieval," Proceedings of ECML-98, 10th European Conference on Machine Learning, (1998), 4-15.
  14. Lewis, D.D., "Feature selection and feature extraction for text categorization," Proceddings Speech and Natural Language Workshop, San Francisco, (1992), 212-217.
  15. Li, J., K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature Selection: a data perspective," ACM Computing Surveys(CSUR), Vol.50, No.6(2017), 94:1-94:45.
  16. Landauer, T.K., P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis," Discourse Processes, Vol.25(1998), 259-84. https://doi.org/10.1080/01638539809545028
  17. Mika, S., G. Ratsch, J. Weston, B. Scholkopf and K. -R. Muller, "Fisher discriminant analysis with kernels," Proceedings, IEEE Workshop on Neural Network for Signal Processing, (1999).
  18. Mohan, P., I. Paramasivam, "A study on impact of dimensionality reduction on Naive Bayes classifier," Indian Journal of Science and Technology, Vol.10, No. 20(2017).
  19. Peng, H., F. Long, C. Dong, "Feature selection based on mutual information: Criteria of maxdependence, max-relevance, min-redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.27, No.8(2005). https://doi.org/10.1109/TPAMI.2005.152
  20. Pennington, J., R. Socher, and C. D. Manning. "Glove: Global vectors for word representation", EMNLP, (2014).
  21. Peters, M., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. "Deep contextualized word representations", NAACL, (2018).
  22. Rapp, M., F.-J. Lübken, P. Hoffmann, R. Latteck, G. Baumgarten, and T. A. Blix, "PMSE dependence on aerosol charge, number density and aerosol size," Journal of Geophysical Research, Vol.108, No.D8(2003), 1-11.
  23. Roweis, S.T. and Saul, L.K., "Nonlinear dimensionality reduction by Locally Linear Embedding," Science, Vol.290, No.5500(2000), 2323-2326. https://doi.org/10.1126/science.290.5500.2323
  24. Mika, S., G. Ratsch, J. Weston, B. Scholkopf, and K. -R Muller, "Fisher discriminant analysis with kernels," Proceedings of IEEE Workshop on Neural Networks for Signal Processing, (1999).
  25. Sahami, M., "Learning limited dependence Bayesian classifiers". Proceedings 2nd International Conference on Knowledge Discovery and Data Mining, (1996), 334-338.
  26. Sahlgren, M., "The distributional hypothesis," Italian Journal of Linguistics, Vol.20, No.1 (2008), 33-53.
  27. Mikolov, T., K. Chen, G. Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space", ICLR Workshop, (2013).
  28. Yu, L.C., J. Wang, K. R. Lai, and X. Zhang, "Refining word embeddings for sentiment analysis", Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, (2017), 545-550.
  29. Zhang, R. and T. Tran, "An Information gainbased approach for recommending useful product reviews", Knowledge Information Systems, Vol.26, No.3(2011), 419-434. https://doi.org/10.1007/s10115-010-0287-y
  30. Zhou, P., W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu. "Attention-based bidirectional long short-term memory networks for relation classification", The 54th Annual Meeting of the Association for Computational Linguistics, (2016), 207-213.
  31. Zhu, L., G. Wang, and X. Zou, "Improved information gain feature selection method for Chinese text classification based on word embedding", proceedings of the 6th International Conference on Software and Computer Applications, (2017), 72-76.