
Search for Optimal Data Augmentation Policy for Environmental Sound Classification with Deep Neural Networks


  • Park, Jinbae (Computer Science and Engineering, Kyung Hee University) ;
  • Kumar, Teerath (Computer Science and Engineering, Kyung Hee University) ;
  • Bae, Sung-Ho (Computer Science and Engineering, Kyung Hee University)
  • Received : 2020.09.10
  • Accepted : 2020.11.16
  • Published : 2020.11.30

Abstract

Deep neural networks have shown remarkable performance in various areas, including image classification and speech recognition. The diversity of data generated by augmentation plays an important role in improving network performance: by transforming the training data, augmentation exposes the network to more varied examples and helps it learn more general representations. In the image domain, prior work has proposed not only new augmentation methods but also ways to search for an optimal augmentation policy that adapts to the dataset and network architecture. Inspired by this line of work, this paper aims to search for an optimal augmentation policy for sound data. We carried out extensive experiments that randomly combine augmentation methods such as noise injection, pitch shifting, and time stretching to determine empirically which combination is most effective. By applying the resulting optimal data augmentation policy, we improve classification accuracy on the environmental sound classification dataset (ESC-50).

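The policy search described in the abstract amounts to sampling random combinations of basic waveform transforms and keeping the best-performing combination. The sketch below shows the three transforms the abstract names (noise injection, pitch shift, time stretch) plus a random-combination step, using plain NumPy; all function names and parameter values here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def add_noise(x, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng()
    noise_power = np.mean(x ** 2) / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def time_stretch(x, rate=1.2):
    """Naive speed change by linear resampling; rate > 1 shortens the clip.
    (A phase-vocoder stretch, e.g. librosa.effects.time_stretch, would
    preserve pitch; this simple version does not.)"""
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def pitch_shift(x, semitones=2.0):
    """Crude pitch shift by playback-rate change: pitch moves by the given
    number of semitones, but the duration changes too (a production
    implementation, e.g. librosa.effects.pitch_shift, keeps it fixed)."""
    return time_stretch(x, rate=2 ** (semitones / 12))

AUGMENTATIONS = [add_noise, time_stretch, pitch_shift]

def random_policy(x, n_ops=2, rng=None):
    """Apply a random combination of n_ops distinct transforms --
    one candidate policy in an empirical search like the paper's."""
    rng = rng or np.random.default_rng()
    for i in rng.choice(len(AUGMENTATIONS), size=n_ops, replace=False):
        x = AUGMENTATIONS[i](x)
    return x
```

In a search of this kind, each candidate policy (a sampled combination and its parameters) is scored by the validation accuracy of a network trained with it, and the best-scoring policy is kept; pitch and speed changes can alter class-relevant cues, which is why the combination has to be searched rather than fixed in advance.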


Acknowledgement

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government (MSIT) (No. 2019-01-01768, Deep Neural Network based Real-Time Accurate Voice Source Localization using Drones).

References

  1. Salamon, Justin, and Juan Pablo Bello. "Deep convolutional neural networks and data augmentation for environmental sound classification." IEEE Signal Processing Letters, 24(3), pp. 279-283, Jan 2017. https://doi.org/10.1109/LSP.2017.2657381
  2. Cubuk, Ekin D., et al. "AutoAugment: Learning augmentation strategies from data." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  3. Cubuk, Ekin D., et al. "Randaugment: Practical automated data augmentation with a reduced search space." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702-703. 2020.
  4. Hendrycks, Dan, et al. "AugMix: A simple data processing method to improve robustness and uncertainty." arXiv preprint arXiv:1912.02781, Dec 5 2019.
  5. Sharma, Jivitesh, Ole-Christoffer Granmo, and Morten Goodwin. "Environment Sound Classification using Multiple Feature Channels and Deep Convolutional Neural Networks." arXiv preprint arXiv:1908.11219, Aug 28 2019.
  6. Park, Daniel S., et al. "SpecAugment: A simple data augmentation method for automatic speech recognition." arXiv preprint arXiv:1904.08779, Apr 18 2019.
  7. Hwang, Yeongtae, et al. "Mel-spectrogram augmentation for sequence to sequence voice conversion." arXiv preprint arXiv:2001.01401, Jan 6 2020.
  8. Piczak, Karol J. "ESC: Dataset for environmental sound classification." Proceedings of the 23rd ACM international conference on Multimedia, pp. 1015-1018, Oct 13 2015.
  9. Loshchilov, Ilya, and Frank Hutter. "SGDR: Stochastic gradient descent with warm restarts." arXiv preprint arXiv:1608.03983, Aug 13 2016.
  10. Boddapati, Venkatesh, Andrej Petef, Jim Rasmusson, and Lars Lundberg. "Classifying environmental sounds using image recognition networks." Procedia Computer Science, 112, pp. 2048-2056, 2017. https://doi.org/10.1016/j.procs.2017.08.250
  11. Tokozume, Yuji, Yoshitaka Ushiku, and Tatsuya Harada. "Learning from between-class examples for deep sound recognition." CoRR, abs/1711.10282, 2017.
  12. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), pp. 892-900, 2016.
  13. Zhang, Zhichao, Shugong Xu, Shan Cao, and Shunqing Zhang. "Deep convolutional neural network with mixup for environmental sound classification." Pattern Recognition and Computer Vision, pp. 356-367, 2018.
  14. Zhang, Z., S. Xu, S. Zhang, T. Qiao, and S. Cao. "Learning attentive representations for environmental sound classification." IEEE Access, 7, pp. 130327-130339, 2019.
  15. Li, Xinyu, Venkata Chebiyyam, and Katrin Kirchhoff. "Multi-stream network with temporal attention for environmental sound classification." CoRR, abs/1901.08608, 2019.