Image classification and captioning model considering a CAM-based disagreement loss

  • Yoon, Yeo Chan (SW Content Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Park, So Young (Department of Game Design and Development, Sangmyung University) ;
  • Park, Soo Myoung (SW Content Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
  • Received : 2018.12.05
  • Accepted : 2019.05.07
  • Published : 2020.02.07

Abstract

Image captioning has received significant interest in recent years, and notable results have been achieved. Most previous approaches focus on generating visual descriptions from images, whereas only a few exploit visual descriptions for image classification. This study demonstrates that good performance can be achieved on both description generation and image classification through end-to-end joint learning with a loss function that encourages the two tasks to reach a consensus. Given images and visual descriptions, the proposed model learns a multimodal intermediate embedding that represents both the textual and visual characteristics of an object, and sharing this embedding improves the performance of both tasks. Through a novel loss function based on class activation mapping (CAM), which localizes the discriminative image regions of a model, higher scores are achieved when the captioning and classification models reach a consensus on the key parts of an object. With the proposed model, we obtain substantially improved performance on both tasks on the UCSD Birds and Oxford Flowers datasets.
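
To make the CAM-based consensus idea concrete, the sketch below shows one way such a disagreement term could be computed in PyTorch. It is an illustrative sketch only, not the authors' implementation: the helper names (cam_from_features, disagreement_loss), the mean-squared-error agreement measure, and the tensor shapes are assumptions, and the paper's exact formulation may differ.

```python
# Illustrative sketch (not the authors' released code): a CAM-based
# disagreement term that penalizes mismatch between the classifier's class
# activation map and the captioner's spatial attention map, so minimizing it
# pushes the two tasks toward a consensus on the discriminative image regions.

import torch
import torch.nn.functional as F


def cam_from_features(feat_maps, fc_weights, class_ids):
    """Standard class activation map: weight the final conv features by the
    classification-layer weights of each image's class.

    feat_maps:  (B, C, H, W) final conv features of the classifier backbone
    fc_weights: (num_classes, C) weights of the layer after global avg pooling
    class_ids:  (B,) class index per image
    returns:    (B, H, W) maps normalized to [0, 1]
    """
    b, c, h, w = feat_maps.shape
    w_c = fc_weights[class_ids]                       # (B, C)
    cam = torch.einsum("bc,bchw->bhw", w_c, feat_maps)
    cam = F.relu(cam).view(b, -1)
    cam = cam / (cam.max(dim=1, keepdim=True).values + 1e-8)
    return cam.view(b, h, w)


def disagreement_loss(cam, attn):
    """Mean-squared error between the classifier CAM and the captioner's
    attention map (both (B, H, W); attention assumed already upsampled to the
    CAM resolution). Chosen here purely as a simple agreement measure.
    """
    attn = attn / (attn.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return F.mse_loss(cam, attn)


if __name__ == "__main__":
    # Toy shapes only, to show how the pieces fit together.
    B, C, H, W, NUM_CLASSES = 2, 512, 7, 7, 200
    feat_maps = torch.rand(B, C, H, W)
    fc_weights = torch.rand(NUM_CLASSES, C)
    labels = torch.tensor([3, 17])
    caption_attn = torch.rand(B, H, W)      # e.g., attention averaged over words

    cam = cam_from_features(feat_maps, fc_weights, labels)
    consensus_term = disagreement_loss(cam, caption_attn)
    # total_loss = caption_xent + class_xent + lambda_ * consensus_term
    print(float(consensus_term))
```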

Keywords


Cited by

  1. Automated optimization for memory-efficient high-performance deep neural network accelerators, ETRI J. 42 (2020), no. 4, https://doi.org/10.4218/etrij.2020-0125
  2. CitiusSynapse: A Deep Learning Framework for Embedded Systems, Appl. Sci. 11 (2021), no. 23, https://doi.org/10.3390/app112311570