Self-Supervised Document Representation Method

  • Yun, Yeoil (Graduate School of Business IT, Kookmin University)
  • Kim, Namgyu (School of Management Information Systems, Kookmin University)
  • Received : 2020.04.07
  • Accepted : 2020.05.05
  • Published : 2020.05.29

Abstract

Recently, various text embedding methods based on deep learning algorithms have been proposed. In particular, embedding new text by inferring vectors from a pre-trained language model, trained on a tremendous amount of text data, has become the dominant approach. However, traditional pre-trained language models have a limitation: when a text contains too many tokens, it is difficult for them to capture the unique context of that text. In this paper, we propose a self-supervised learning-based fine-tuning method for pre-trained language models that infers vectors for long texts. We applied the proposed method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirm that the vectors generated by the proposed model express the inherent characteristics of a document more accurately than those generated by the traditional models.

Recent advances in deep learning, a neural-network-based learning algorithm, have produced a variety of document embedding models that take the context of a text into account. In particular, embedding approaches that infer the vector of a target document from a pre-trained language model trained on a large text corpus are being actively studied. However, when an existing pre-trained language model is used to embed new text, it cannot fully exploit the information unique to that text, and this limitation is known to be strongly affected by the number of tokens the text contains. In this study, we propose a self-supervised learning-based fine-tuning method for pre-trained language models that derives the vector of a long text containing many tokens while making maximal use of the information in that text. In addition, we conducted an extrinsic embedding evaluation in which the proposed methodology was applied to real news articles to derive document vectors, which were then used in a news category classification experiment, thereby comparing the performance of the proposed methodology with that of existing document embedding models. The results confirm that the vectors derived by the proposed methodology fully utilize the unique information of the text and thus represent the characteristics of a document more accurately.
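
The abstract describes an extrinsic evaluation: document vectors are derived for news articles and judged by how accurately a downstream classifier recovers the article categories. The sketch below illustrates only that evaluation loop, not the authors' self-supervised fine-tuning step itself; the vanilla bert-base-uncased encoder, the mean pooling over token states, the 512-token truncation, and the logistic-regression classifier are all illustrative assumptions standing in for the paper's fine-tuned model.

# Minimal sketch of the extrinsic (category-classification) evaluation.
# Assumption: a plain pre-trained BERT with mean pooling is used as the
# document encoder; the paper's fine-tuned model would replace it.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(text):
    # Mean-pool token states into one document vector. Long articles are
    # truncated to 512 tokens, which is exactly the limitation the paper targets.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # (768,)

def evaluate(articles, labels):
    # articles: list of news bodies, labels: list of category ids.
    vectors = torch.stack([embed(doc) for doc in articles]).numpy()
    x_tr, x_te, y_tr, y_te = train_test_split(vectors, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return accuracy_score(y_te, clf.predict(x_te))

Under this setup, the comparison reported in the paper amounts to running the same classification protocol once with vectors from a baseline embedding model and once with vectors from the proposed fine-tuned model, and comparing the resulting accuracies.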
