Extensible Hierarchical Method of Detecting Interactive Actions for Video Understanding

  • Moon, Jinyoung (SW & Contents Research Laboratory, ETRI) ;
  • Jin, Junho (Hyper-connected Communication Research Laboratory, ETRI) ;
  • Kwon, Yongjin (SW & Contents Research Laboratory, ETRI) ;
  • Kang, Kyuchang (School of IT Information and Control Engineering, Kunsan National University) ;
  • Park, Jongyoul (SW & Contents Research Laboratory, ETRI) ;
  • Park, Kyoung (Memory System Research Lab., SK Hynix)
  • Received : 2016.08.11
  • Accepted : 2017.04.24
  • Published : 2017.08.01

Abstract

For video understanding, namely analyzing who did what in a video, actions, along with objects, are the primary elements. Most studies on actions have handled recognition problems for well-trimmed videos and focused on enhancing classification performance. However, action detection, including localization as well as recognition, is required because, in general, actions intersect in time and space. In addition, most studies have not considered extensibility for a newly added action that has not been previously trained. Therefore, this paper proposes an extensible hierarchical method for detecting generic actions, which combine object movements and spatial relations between two objects, and inherited actions, which are determined by the related objects through an ontology- and rule-based methodology. The hierarchical design of the method enables it to detect any interactive action based on the spatial relations between two objects. Using object information, the method achieves an F-measure of 90.27%. Moreover, this paper describes the extensibility of the method for a new action contained in a video from a domain different from that of the dataset used.
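The abstract's notion of a generic action (object movements plus a spatial relation between two objects) and of an inherited action specialized by the classes of the related objects can be illustrated with a small rule-based sketch. The sketch below is a hypothetical illustration, not the authors' implementation: the track structure, the detect_approach rule, the object labels, and all thresholds are assumptions chosen only to show how a generic interactive action could be detected and temporally localized from two object trajectories, and how an inherited action adds object-class constraints on top of it.

```python
# Hypothetical sketch of rule-based interactive-action detection from object tracks.
# All names and thresholds are illustrative assumptions, not the paper's method.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in pixels


@dataclass
class Track:
    label: str        # detected object class, e.g., "person" or "vehicle"
    boxes: List[Box]  # one bounding box per frame


def center(box: Box) -> Tuple[float, float]:
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def distance(a: Box, b: Box) -> float:
    (ax, ay), (bx, by) = center(a), center(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def is_moving(track: Track, frame: int, window: int = 5, min_disp: float = 10.0) -> bool:
    """Generic movement predicate: center displacement over a short frame window."""
    past = max(0, frame - window)
    return distance(track.boxes[past], track.boxes[frame]) >= min_disp


def detect_approach(subject: Track, target: Track, min_len: int = 10) -> List[Tuple[int, int]]:
    """Generic action rule: the subject moves toward a (roughly) stationary target.

    Returns (start_frame, end_frame) intervals, i.e., a temporal localization of
    the action rather than a single clip-level label.
    """
    n = min(len(subject.boxes), len(target.boxes))
    intervals, start = [], None
    for t in range(1, n):
        closing = (distance(subject.boxes[t], target.boxes[t])
                   < distance(subject.boxes[t - 1], target.boxes[t - 1]))
        active = closing and is_moving(subject, t) and not is_moving(target, t)
        if active and start is None:
            start = t                      # action interval opens
        elif not active and start is not None:
            if t - start >= min_len:       # keep only sufficiently long intervals
                intervals.append((start, t - 1))
            start = None
    if start is not None and n - start >= min_len:
        intervals.append((start, n - 1))
    return intervals


def detect_person_approaches_vehicle(a: Track, b: Track) -> List[Tuple[int, int]]:
    """Inherited action: the generic rule specialized by the related object classes."""
    if a.label == "person" and b.label == "vehicle":
        return detect_approach(a, b)
    return []
```

In the paper's ontology- and rule-based setting, the class constraint in the last function would be expressed as an ontology rule over the detected objects rather than hard-coded, which is what allows a new inherited action to be added without retraining a classifier.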

References

  1. G. Lavee, E. Rivlin, and M. Rudzsky, "Understanding Video Events: A Survey of Methods for Automatic Interpretation of Semantic Occurrences in Video," IEEE Trans. Syst., Man, Cybern., Part C, vol. 39, no. 5, Sept. 2009, pp. 489-504.
  2. R. Poppe, "A Survey on Vision-Based Human Action Recognition," Image Vision Comput., vol. 28, no. 6, June 2010, pp. 976-990. https://doi.org/10.1016/j.imavis.2009.11.014
  3. J.K. Aggarwal and M.S. Ryoo, "Human Activity Analysis: A Review," ACM Comput. Surv., vol. 43, no. 3, Apr. 2011, pp. 1-43.
  4. D. Weinland, R. Ronfard, and E. Boyer, "A Survey of Vision-Based Methods for Action Representation, Segmentation, and Recognition," Comput. Vis. Image Understanding, Feb. 2011, pp. 224-241.
  5. I. Laptev et al., "Learning Realistic Human Actions from Movies," IEEE Conf. Comput. Vis. Pattern Recogn., Anchorage, Alaska, June 23-28, 2008, pp. 1-8.
  6. H. Wang and C. Schmid, "Action Recognition with Improved Trajectories," IEEE Int. Conf. Comput. Vision, Sydney, Australia, Dec. 1-8, 2013, pp. 3551-3558.
  7. H. Wang and C. Schmid, "LEAR-INRIA Submission for the THUMOS Workshop," Int. Conf. Comput. Vision, Workshop Action Recogn. Large Number Classes, Sydney, Australia, Dec. 1-8, 2013.
  8. X. Peng et al., "Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice," Comput. Vis. Image Underst., vol. 150, Sept. 2016, pp. 109-125. https://doi.org/10.1016/j.cviu.2016.03.013
  9. M. Baccouche et al., "Sequential Deep Learning for Human Action Recognition," Int. Workshop Human Behav. Underst., Amsterdam, Netherlands, Nov. 16, 2011, pp. 29-39.
  10. S. Ji et al., "3D Convolutional Neural Networks for Human Action Recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no.1, Jan. 2013, pp. 221-231. https://doi.org/10.1109/TPAMI.2012.59
  11. A. Karpathy et al., "Large-Scale Video Classification with Convolutional Neural Networks," IEEE Conf. Comput. Vis. Pattern Recogn., Columbus, USA, June 24-27, 2014, pp. 1725-1732.
  12. K. Simonyan and A. Zisserman, "Two-Stream Convolutional Networks for Action Recognition in Videos," Int. Conf. Neural Inform. Process. Syst., Montreal, Canada, Dec. 8-13, 2014, pp. 568-576.
  13. J. Ng et al., "Beyond Short Snippets: Deep Networks for Video Classification," IEEE Conf. Comput. Vis. Pattern Recogn., Boston, USA, June 7-12, 2015, pp. 4694-4702.
  14. L. Wang, Y. Qiao, and X. Tang, "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," IEEE Conf. Comput. Vis. Pattern Recogn., Boston, USA, June 7-12, 2015, pp. 4305-4314.
  15. D. Tran et al., "Learning Spatiotemporal Features with 3D Convolutional Networks," IEEE Int. Conf. Comput. Vis., Santiago, Chile, Dec. 13-16, 2015, pp. 4489-4497.
  16. C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional Two-Stream Network Fusion for Video Action Recognition," IEEE Conf. Comput. Vis. Pattern Recogn., Las Vegas, USA, June 26-July 1, 2016, pp. 1933-1941.
  17. J.M. Chaquet, E.J. Carmona, and A. Fernandez-Caballero, "A Survey of Video Dataset for Human Action and Activity Recognition," Comput. Vis. Image Understanding (CVIU), vol. 117, no. 6, June 2013, pp. 633-659. https://doi.org/10.1016/j.cviu.2013.01.013
  18. Z. Shou, D. Wang, and S.-F. Chang, "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs," IEEE Conf. Comput. Vis. Pattern Recogn., Las Vegas, USA, June 26-July 1, 2016, pp. 1049-1058.
  19. S. Yeung et al., "Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos," Preprint, submitted July 31, 2015. http://arxiv.org/abs/1507.05738v2.
  20. G. Gkioxari and J. Malik, "Finding Action Tubes," IEEE Conf. Comput. Vis. Pattern Recogn., Boston, USA, June 7-12, 2015, pp. 759-768.
  21. P. Weinzaepfel et al., "Learning to Track for Spatio-Temporal Action Localization," IEEE Int. Conf. Comput. Vis., Santiago, Chile, Dec. 13-16, 2015, pp. 3164-3172.
  22. J. Gall et al., "Hough Forests for Object Detection, Tracking, and Action Recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, Nov. 2011, pp. 2188-2202. https://doi.org/10.1109/TPAMI.2011.70
  23. S.-C. Cheng, K.-Y. Cheng, and Y.-P. Chen, "GHT-Based Associative Memory Learning and Its Application to Human Action Detection and Classification," Pattern Recogn., vol. 46, no. 11, Nov. 2013, pp. 3117-3128. https://doi.org/10.1016/j.patcog.2013.03.027
  24. S. Ma et al., "Action Recognition and Localization by Hierarchical Space-Time Segments," IEEE Int. Conf. Comput. Vis., Sydney, Australia, Dec. 3-6, 2013, pp. 2744-2751.
  25. T. Lan, Y. Wang, and G. Mori, "Discriminative Figure-Centric Models for Joint Action Localization and Recognition," IEEE Int. Conf. Comput. Vis., Barcelona, Spain, Nov. 6-13, 2011, pp. 2003-2010.
  26. Y.S. Sefidgar et al., "Discriminative Key-Component Models for Interaction Detection and Recognition," Comput. Vis. Image Understanding, vol. 135, June 2015, pp. 16-30. https://doi.org/10.1016/j.cviu.2015.02.012
  27. J. Moon et al., "A Knowledge-Driven Approach to Interactive Event Recognition for Semantic Video Understanding," Int. Conf. IT Convergence Security, Prague, Czech Rep., Sept. 26-29, 2016, pp. 37-39.
  28. C. Schuldt, I. Laptev, and B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Int. Conf. Pattern Recogn., Cambridge, UK, Aug. 23-26, 2004, pp. 32-36.
  29. M. Blank et al., "Actions as Space-Time Shapes," IEEE Int. Conf. Comput. Vis., Beijing, China, Oct. 17-21, 2005, pp. 1395-1402.
  30. H. Kuehne et al., "HMDB: A Large Video Database for Human Motion Recognition," IEEE Int. Conf. Comput. Vis., Barcelona, Spain, Nov. 6-13, 2011, pp. 2556-2563.
  31. K. Soomro, A.R. Zamir, and M. Shah, "UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild," Center for Research in Computer Vision, UCF, Orlando, Tech. Rep. CRCV-TR-12-01, Nov. 2012.
  32. K. Soomro and A.R. Zamir, "Action Recognition in Realistic Sports Videos," in Computer Vision in Sports, Advances in Comput. Vis. Pattern Recogn., Springer International Publishing, Jan. 2015, pp. 181-208.
  33. H. Jhuang et al., "Towards Understanding Action Recognition," IEEE Int. Conf. Comput. Vis., Sydney, Australia, Dec. 3-6, 2013, pp. 3192-3199.
  34. Y.-G. Jiang et al., THUMOS Challenge 2014, Center for Research in Computer Vision, UCF, 2014, Accessed Aug. 8, 2016. http://crcv.ucf.edu/THUMOS14/
  35. A.B. James, "Activities of Daily Living and Instrumental Activities of Daily Living," in Willard and Spackman's Occupational Therapy, Philadelphia, USA: Wolters Kluwer Health/Lippincott Williams & Wilkins, 2014.
  36. L. Chen, C.D. Nugent, and H. Wang, "A Knowledge-Driven Approach to Activity Recognition in Smart Homes," IEEE Trans. Knowl. Data Eng., vol. 24, no. 6, June 2012, pp. 961-974. https://doi.org/10.1109/TKDE.2011.51
  37. D. Riboni and C. Bettini, "OWL 2 Modeling and Reasoning with Complex Human Actions," Pervasive Mobile Comput., vol. 7, no. 3, 2011, pp. 379-395. https://doi.org/10.1016/j.pmcj.2011.02.001
  38. I.H. Bae, "An Ontology-Based Approach to ADL Recognition in Smart Homes," Future Gener. Comput. Syst., vol. 33, Apr. 2014, pp. 32-41. https://doi.org/10.1016/j.future.2013.04.004
  39. G. Okeyo, L. Chen, and H. Wang, "Combining Ontological and Temporal Formalisms for Composite Activity Modelling and Recognition in Smart Homes," Future Gener. Comput. Syst., vol. 39, Oct. 2014, pp. 29-43. https://doi.org/10.1016/j.future.2014.02.014
  40. G. Meditskos, S. Dasiopoulou, and I. Kompatsiaris, "MetaQ: A Knowledge-Driven Framework for Context-Aware Activity Recognition Combining SPARQL and OWL 2 Activity Patterns," Pervasive Mobile Comput., vol. 25, Jan. 2016, pp. 104-124. https://doi.org/10.1016/j.pmcj.2015.01.007
  41. S. Oh et al., Instruction for VIRAT Video Dataset Release 2.0, KITWARE, Sept. 30, 2011, Accessed Aug. 8, 2016. https://data.kitware.com/#collection/56f56db28d777f753209ba9f/folder/56f581c78d777f753209c9c2
  43. G. Baryannis, P. Woznowski, and G. Antoniou, "Rule-Based Real-Time ADL Recognition in a Smart Home Environment," Rule Technol., Res., Tools, Applicat., Int. Web Rule Symp., June 28, 2016, pp. 325-340.
  44. Y. Yildirim, A. Yazici, and T. Yilmaz, "Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-Based Model," IEEE Trans. Knowl. Data Eng., vol. 25, no. 1, Jan. 2013, pp. 47-61. https://doi.org/10.1109/TKDE.2011.189
  45. U. Akdemir, P. Turaga, and R. Chellappa, "An Ontology Based Approach for Activity Recognition from Video," ACM Int. Conf. Multimedia, Vancouver, Canada, Oct. 27-Nov. 1, 2008, pp. 709-712.
  46. M. Bertini, A.D. Bimbo, and G. Serra, "Learning Ontology Rules for Semantic Video Annotation," ACM Int. Conf. Multimed., Workshop Multimed. Semantics, Vancouver, Canada, Oct. 26-31, 2008.
  47. L. Ballan et al., "Video Annotation and Retrieval Using Ontologies and Rule Learning," IEEE MultiMedia, vol. 17, no. 4, Oct. 2010, pp. 80-88. https://doi.org/10.1109/MMUL.2010.4
  48. L. Ballan et al., "Event Detection and Recognition for Semantic Annotation of Video," Multimed. Tools. Appl., vol. 51, no. 1, Jan. 2011, pp. 279-302. https://doi.org/10.1007/s11042-010-0643-7
  49. G. Antoniou and F.V. Harmelen, "Web Ontology Language: OWL," in Handbook on Ontologies, Heidelberg: Springer, 2004, pp. 67-92.
  50. W3C Std., SWRL: A Semantic Web Rule Language Combining OWL and RuleML, May 2004.
  51. J. Moon et al., "ActionNet-VE Dataset: A Dataset for Describing Visual Events by Extending VIRAT Ground 2.0," Int. Conf. Signal Process., Image Process. Pattern Recogn., Nov. 2015, pp. 1-4.
  52. S. Oh et al., "A Large-Scale Benchmark Dataset for Event Recognition in Surveillance Video," IEEE Conf. Comput. Vis. Pattern Recogn., Colorado, USA, June 20-25, 2011, pp. 3153-3160.
  53. X. Wang and Q. Ji, "Hierarchical Context Modeling for Video Event Recognition," IEEE Trans. Pattern Anal. Mach. Intell., Epub, Oct. 2016.
  54. O. Russakovsky and J. Deng, ImageNet Large Scale Visual Recognition Challenge 2016 (ILSVRC2016), Accessed Feb. 1, 2017. http://image-net.org/challenges/LSVRC/2016/

Cited by

  1. Sensor Data Acquisition and Multimodal Sensor Fusion for Human Activity Recognition Using Deep Learning vol.19, pp.7, 2019, https://doi.org/10.3390/s19071716
  2. Vision-based garbage dumping action detection for real-world surveillance platform vol.41, pp.4, 2019, https://doi.org/10.4218/etrij.2018-0520
  3. Zero-Shot Human Activity Recognition Using Non-Visual Sensors vol.20, pp.3, 2020, https://doi.org/10.3390/s20030825
  4. Three-stream network with context convolution module for human-object interaction detection vol.42, pp.2, 2020, https://doi.org/10.4218/etrij.2019-0230
  5. RGB-D Data-Based Action Recognition: A Review vol.21, pp.12, 2021, https://doi.org/10.3390/s21124246