Character-based Subtitle Generation by Learning of Multimodal Concept Hierarchy from Cartoon Videos


  • Kyung-Min Kim (Department of Computer Science and Engineering, Seoul National University) ;
  • Jung-Woo Ha (Department of Computer Science and Engineering, Seoul National University) ;
  • Beom-Jin Lee (Department of Computer Science and Engineering, Seoul National University) ;
  • Byoung-Tak Zhang (Department of Computer Science and Engineering, Seoul National University)
  • Received : 2014.09.01
  • Accepted : 2015.01.28
  • Published : 2015.04.15

Abstract

Previous multimodal learning methods have focused on problem-solving aspects, such as image and video search and tagging, rather than on knowledge acquisition through content modeling. In this paper, we propose the Multimodal Concept Hierarchy (MuCH), a method for modeling the content of cartoon videos, together with a method for generating character-based subtitles from the learned model. The MuCH model consists of a multimodal hypernetwork layer, which represents higher-order patterns of words and image patches, and a concept layer, in which each concept variable is represented as a probability distribution over words and image patches. Through sequential Bayesian learning, the model learns the characteristics of the characters as concepts from video subtitles and scene images, and it can then generate character-based subtitles when given a text query. As an experiment, the MuCH model learned concepts from 'Pororo' cartoon videos totaling 268 minutes in length and generated character-based subtitles. Finally, we compared the results with those of other multimodal learning models. The experimental results indicate that, given the same text query, our model generates subtitles that are more accurate than those of other models and that reflect the distinct traits of each character.



Acknowledgement

Supported by: National Research Foundation of Korea (NRF)

References

  1. N. Srivastava and R. Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, Advances in Neural Information Processing Systems 25 (NIPS 2012), pp. 2231-2239, 2012.
  2. C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda, Learning Systems of Concepts with an Infinite Relational Model, Proc. of the 21st National Conference on Artificial Intelligence (AAAI 2006), pp. 381-388, 2006.
  3. R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal Neural Language Models, Proc. of the 31st International Conference on Machine Learning (ICML 2014), JMLR W&CP, Vol. 32, No. 1, pp. 595-603, 2014.
  4. H. Xiao and T. Stibor, Toward Artificial Synesthesia: Linking Images and Sounds via Words, NIPS Workshop on Machine Learning for Next Generation Computer Vision Challenges, 2010.
  5. B.-T. Zhang, Hypernetworks: A molecular evolutionary architecture for cognitive learning and memory, IEEE Computational Intelligence Magazine, Vol. 3, No. 3, pp. 49-63, 2008. https://doi.org/10.1109/MCI.2008.926615
  6. B.-T. Zhang and J.-K. Kim, DNA hypernetworks for information storage and retrieval, Lecture Notes in Computer Science (DNA12), Vol. 4287, pp. 298-307, 2006.
  7. J.-K. Kim and B.-T. Zhang, Evolving hypernetworks for pattern classification, Proc. of the IEEE Congress on Evolutionary Computation (CEC 2007), pp. 1856-1862, 2007.
  8. B.-T. Zhang and H.-Y. Jang, A Bayesian algorithm for in vitro molecular evolution of pattern classifiers, Lecture Notes in Computer Science, Vol. 3384, pp. 458-467, 2005.
  9. J.-W. Ha, J.-H. Eom, S.-C. Kim, and B.-T. Zhang, Evolutionary hypernetwork models for aptamer-based cardiovascular disease diagnosis, Proc. of the Genetic and Evolutionary Computation Conference (GECCO 2007), pp. 2709-2716, 2007.
  10. S.-J. Kim, J.-W. Ha, and B.-T. Zhang, Bayesian evolutionary hypergraph learning for predicting cancer clinical outcomes, Journal of Biomedical Informatics, Vol. 49, pp. 101-111, 2014. https://doi.org/10.1016/j.jbi.2014.02.002
  11. B.-T. Zhang, J.-W. Ha, and M.-G. Kang, Sparse population code models of word learning in concept drift, Proc. of the Annual Meeting of the Cognitive Science Society (CogSci 2012), pp. 1221-1226, 2012.
  12. A. Martin, The representation of object concepts in the brain, Annual Review of Psychology, Vol. 58, pp. 25-45, 2007. https://doi.org/10.1146/annurev.psych.57.102904.190143
  13. M. Kiefer, E. J. Sim, B. Herrnberger, J. Grothe, and K. Hoenig, The sound of concepts: Four markers for a link between auditory and conceptual brain systems, Journal of Neuroscience, Vol. 28, pp. 12224-12230, 2008. https://doi.org/10.1523/JNEUROSCI.3579-08.2008
  14. W. Prinz, M. Beisert, and A. Herwig, Action Science: Foundations of an Emerging Discipline, MIT Press, 384 pp., 2013.
  15. B.-T. Zhang, An incremental learning algorithm that optimizes network size and sample size in one trial, Proc. of the IEEE International Conference on Neural Networks (ICNN'94), Vol. 1, pp. 215-220, 1994.
  16. J. H. Lee, S. H. Lee, W. H. Chung, E. S. Lee, T. H. Park, R. Deaton, and B.-T. Zhang, A DNA assembly model of sentence generation, BioSystems, Vol. 106, pp. 51-56, 2011. https://doi.org/10.1016/j.biosystems.2011.06.007
  17. M. J. Huiskes and M. S. Lew, The MIR Flickr Retrieval Evaluation, Proc. of the 2008 ACM International Conference on Multimedia Information Retrieval (MIR '08), 2008.
  18. H. Jégou, M. Douze, C. Schmid, and P. Pérez, Aggregating Local Descriptors into a Compact Image Representation, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), pp. 3304-3311, 2010.