Video Captioning with Visual and Semantic Features

  • Lee, Sujin (Dept. of Computer Science, Graduate School of Kyonggi University)
  • Kim, Incheol (Dept. of Computer Science, Kyonggi University)
  • Received : 2017.11.29
  • Accepted : 2018.06.30
  • Published : 2018.12.31

Abstract

Video captioning refers to the process of extracting features from a video and generating a caption for the video from the extracted features. This paper introduces a deep neural network model and its learning method for effective video captioning. In this study, semantic features that effectively express the video content are used in addition to visual features. The visual features of the video are extracted using convolutional neural networks such as C3D and ResNet, while the semantic features are extracted using the semantic feature extraction network proposed in this paper. Furthermore, an attention-based caption generation network is proposed for the effective generation of video captions from the extracted features. The performance and effectiveness of the proposed model are verified through various experiments on two large-scale video benchmarks, the Microsoft Video Description (MSVD) and the Microsoft Research Video-to-Text (MSR-VTT) datasets.
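The abstract describes the architecture only at a high level, so the following is a minimal sketch of what an attention-based caption generation step over frame-level visual features (e.g., C3D/ResNet outputs) and a video-level semantic feature vector could look like. PyTorch, all class names, dimensions, and the concatenation-based fusion of the attended visual context with the semantic vector are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: attention-based caption decoder conditioned on per-frame
# visual features and a video-level semantic feature vector. All names,
# dimensions, and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, visual_dim=2048, semantic_dim=300,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (soft) attention over the frame-level visual features.
        self.att_visual = nn.Linear(visual_dim, hidden_dim, bias=False)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.att_score = nn.Linear(hidden_dim, 1, bias=False)
        # Decoder LSTM input: [word embedding; attended visual; semantic].
        self.lstm = nn.LSTMCell(embed_dim + visual_dim + semantic_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, frame_feats, h):
        # frame_feats: (B, T, visual_dim), h: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_visual(frame_feats) + self.att_hidden(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # attention weights (B, T, 1)
        return (alpha * frame_feats).sum(dim=1)     # attended context (B, visual_dim)

    def forward(self, frame_feats, semantic_feat, captions):
        # frame_feats: (B, T, visual_dim)   e.g. per-segment C3D/ResNet features
        # semantic_feat: (B, semantic_dim)  e.g. video-level attribute scores
        # captions: (B, L) ground-truth token ids (teacher forcing)
        B, L = captions.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = frame_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(L):
            w = self.embed(captions[:, t])
            ctx = self.attend(frame_feats, h)
            h, c = self.lstm(torch.cat([w, ctx, semantic_feat], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, L, vocab_size)

# Example usage with random tensors (vocabulary size and shapes are arbitrary):
# decoder = AttentionCaptionDecoder(vocab_size=10000)
# logits = decoder(torch.randn(2, 20, 2048), torch.randn(2, 300),
#                  torch.randint(0, 10000, (2, 12)))
```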

Keywords

Fig. 1. Example of video captioning.

Fig. 2. Examples of (a) dynamic and (b) static semantic features.

Fig. 3. Video captioning model.

Fig. 4. The dynamic semantic network (DSN).

Fig. 5. The static semantic network (SSN).

Fig. 6. The caption generation network (CGN).

Fig. 7. Qualitative results on the MSVD dataset: correct captions with relevant semantic features.

Fig. 8. Qualitative results on the MSVD dataset: incorrect captions with relevant semantic features.

Table 1. Performance comparison between two semantic networks on the MSVD dataset

Table 2. Performance comparison among different feature models on the MSVD dataset

Table 3. Performance comparison among different models on the MSR-VTT dataset

Table 4. Performance comparison with other state-of-the-art models on the MSVD dataset

References

  1. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778.
  2. D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4489-4497.
  3. S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 2013, pp. 2712-2719.
  4. J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: a large video description dataset for bridging video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 5288-5296.
  5. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence: video to text," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4534-4542.
  6. Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 4594-4602.
  7. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 4507-4515.
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
  9. Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1141-1150.
  10. Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 984-992.
  11. Y. Yu, H. Ko, J. Choi, and G. Kim, "End-to-end concept word detection for video captioning, retrieval, and question answering," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 3261-3269.
  12. F. Nian, T. Li, Y. Wang, X. Wu, B. Ni, and C. Xu, "Learning explicit video attributes from mid-level representation for video captioning," Computer Vision and Image Understanding, vol. 163, pp. 126-138, 2017. https://doi.org/10.1016/j.cviu.2017.06.012
  13. J. Song, Z. Guo, L. Gao, W. Liu, D. Zhang, and H. T. Shen, "Hierarchical LSTM with adjusted temporal attention for video captioning," in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 2737-2743.
  14. A. A. Liu, N. Xu, Y. Wong, J. Li, Y. T. Su, and M. Kankanhalli, "Hierarchical & multimodal video captioning: discovering and transferring multimodal knowledge for vision to language," Computer Vision and Image Understanding, vol. 163, pp. 113-125, 2017. https://doi.org/10.1016/j.cviu.2017.04.013
  15. K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, 2002, pp. 311-318.
  16. R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 4566-4575.