A Deep Learning-based Article- and Paragraph-level Classification

  • Kim, Euhee (Computer Science & Engineering, Shinhan University)
  • Received : 2018.10.22
  • Accepted : 2018.11.10
  • Published : 2018.11.30

Abstract

Text classification has long been studied in the field of Natural Language Processing. In this paper, we propose an article- and paragraph-level genre classification system for large-scale English corpora that uses Word2Vec-based LSTM, GRU, and CNN models. At both the article and paragraph levels, LSTM achieved the highest accuracy, followed by GRU and then CNN. These results suggest that, when pre-trained Word2Vec-based word embeddings are used in both the article- and paragraph-level classification tasks, the word-sequence information captured for articles contributes more to classification performance than the word-feature extraction applied to paragraphs.
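
As a rough illustration of what article-level versus paragraph-level inputs could look like, the sketch below turns genre-tagged articles into labeled, fixed-length sequences of word indices, the form an embedding layer would typically consume. It is not the authors' pipeline: the whitespace tokenizer, the blank-line paragraph delimiter, the `<pad>`/`<unk>` convention, the sample texts and genre labels, and all function names are assumptions made for this example.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (an assumed convention, not from the paper)


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return text.lower().split()


def make_examples(articles, level):
    """Build (genre, text) pairs at the requested level.

    articles: list of (genre, text); paragraphs are assumed to be
    separated by blank lines, and each paragraph inherits its article's genre.
    """
    examples = []
    for genre, text in articles:
        if level == "article":
            examples.append((genre, text))
        elif level == "paragraph":
            examples.extend((genre, p) for p in text.split("\n\n") if p.strip())
        else:
            raise ValueError("level must be 'article' or 'paragraph'")
    return examples


def build_vocab(texts):
    """Map every word seen in texts to an integer index above PAD and UNK."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for word in sorted(counts):
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Convert text to exactly max_len indices: truncate, then right-pad."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical two-article corpus with made-up genre labels.
articles = [
    ("news", "Markets rose today.\n\nAnalysts expect more gains."),
    ("fiction", "She opened the door.\n\nThe room was empty."),
]

paragraph_examples = make_examples(articles, "paragraph")
vocab = build_vocab([text for _, text in paragraph_examples])
encoded = [(genre, encode(text, vocab, max_len=6))
           for genre, text in paragraph_examples]
```

In a full system, each index would select a row of a pre-trained Word2Vec embedding matrix, and the resulting sequence of vectors would be passed to an LSTM, GRU, or CNN classifier. That part is omitted here because the framework and hyperparameters are not recoverable from this page.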

Fig. 1. Deep Learning-based Classification System

Fig. 2. The Preprocessing of Model Training

Fig. 3. Model Training

Fig. 4. LSTM Model

Fig. 5. GRU Model

Fig. 6. CNN Model

Fig. 7. Preprocessing of Genre Selection

Fig. 8. Genre Selection

Fig. 9. Word2Vec Visualization using t-SNE

Fig. 10. Deep Learning Model-based Training and Validation Accuracy in Classifying Articles

Fig. 11. Loss Change in Classifying Articles

Fig. 12. Model Training Accuracy and Validation Accuracy Change in Classifying Paragraphs

Fig. 13. Loss Change in Classifying Paragraphs

Table 1. Number of Articles and Paragraphs in COCA

Table 2. COCA Database Sample

Table 3. Hardware Configuration

Table 4. Software Configuration

Table 5. Article and Article Tagging List

Table 6. Paragraph and Paragraph Tagging List

Table 7. Sentence and Tagging List

Table 8. Maximum and Minimum Sequence Length in Article- and Paragraph-based Sentence List

Table 9. Preprocessing of Model Training

Table 10. Parameters of LSTM and GRU

Table 11. Parameters of CNN

Table 12. Training Accuracy, Validation Accuracy, and Genre Test Accuracy in the Articles Experiment

Table 13. Training Accuracy, Validation Accuracy, and Genre Test Accuracy in the Paragraphs Experiment
