
Robot Vision to Audio Description Based on Deep Learning for Effective Human-Robot Interaction

Korean title: Deep Learning-Based Natural Language Description Generation and Utterance from Robot Vision for Effective Human-Robot Interaction

  • Park, Dongkeon (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology) ;
  • Kang, Kyeong-Min (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology) ;
  • Bae, Jin-Woo (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology) ;
  • Han, Ji-Hyeong (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology)
  • Received : 2018.12.08
  • Accepted : 2019.01.16
  • Published : 2019.02.28

Abstract

For effective human-robot interaction, a robot must not only understand the current situational context well but also convey its understanding to the human participant efficiently. The most natural way for a robot to deliver its understanding is to express it in voice and natural language. Recently, artificial intelligence for video understanding and natural language processing has advanced rapidly, especially through deep learning. This paper therefore proposes a deep learning-based method for generating audio descriptions from robot vision. The method pipelines two deep learning models: one generates a natural language sentence from robot vision, and the other synthesizes voice from the generated sentence. We also conduct a real-robot experiment to demonstrate the effectiveness of our method in human-robot interaction.
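To make the two-stage structure concrete, the sketch below shows how such a vision-to-audio pipeline fits together. This is a minimal illustrative sketch in Python, not the authors' implementation: every function is a hypothetical stub standing in for a trained network (a per-frame CNN feature extractor, an encoder-decoder captioner, and an end-to-end speech synthesizer), and all names, shapes, and dummy outputs are assumptions made for illustration only.

```python
# Minimal structural sketch of the two-stage pipeline described in the
# abstract: (1) robot vision -> natural language sentence, (2) sentence ->
# voice. All functions below are hypothetical stubs, not the authors' code.
import numpy as np


def extract_frame_features(frames: np.ndarray) -> np.ndarray:
    # Stand-in for a pretrained per-frame CNN feature extractor.
    # frames: (num_frames, H, W, 3) -> features: (num_frames, 1) here;
    # a real extractor would return (num_frames, feature_dim).
    return frames.reshape(frames.shape[0], -1).mean(axis=1, keepdims=True)


def caption_from_features(features: np.ndarray) -> str:
    # Stand-in for an encoder-decoder (sequence-to-sequence) captioner.
    # A trained model would decode a sentence token by token.
    return "a robot is looking at a person"  # placeholder output


def synthesize_speech(sentence: str, sample_rate: int = 16000) -> np.ndarray:
    # Stand-in for an end-to-end text-to-speech model; returns a waveform.
    duration_s = 0.06 * len(sentence)  # crude length heuristic
    t = np.linspace(0.0, duration_s, int(sample_rate * duration_s),
                    endpoint=False)
    return 0.1 * np.sin(2.0 * np.pi * 220.0 * t)  # dummy tone, not speech


def vision_to_audio_description(frames: np.ndarray):
    # Full pipeline: camera frames -> sentence -> spoken waveform.
    features = extract_frame_features(frames)
    sentence = caption_from_features(features)
    waveform = synthesize_speech(sentence)
    return sentence, waveform


if __name__ == "__main__":
    dummy_frames = np.zeros((16, 224, 224, 3), dtype=np.float32)  # 16 frames
    sentence, waveform = vision_to_audio_description(dummy_frames)
    print(sentence)
    print("waveform samples:", waveform.shape[0])
```

In a real system each stub would be replaced by the corresponding trained model, and the returned waveform would be played through the robot's speaker.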

