A Two-Phase On-Device Analysis for Gender Prediction of Mobile Users Using Discriminative and Popular Wordsets

Two-Phase On-Device Analysis Based on Discriminative and Popular Word Sets for Gender Prediction of Mobile Users

  • Choi, Yerim (Department of Industrial Engineering, Seoul National University) ;
  • Park, Kyuyon (Department of Industrial Engineering, Seoul National University) ;
  • Kim, Solee (Department of Industrial Engineering, Seoul National University) ;
  • Park, Jonghun (Department of Industrial Engineering, Seoul National University)
  • Received : 2016.01.12
  • Accepted : 2016.02.15
  • Published : 2016.02.28

Abstract

As respecting users' privacy becomes an important issue in mobile device data analysis, on-device analysis, in which data are analyzed inside the mobile device without being sent outside, is gaining attention. One possible application of on-device analysis is gender prediction using text data stored on mobile devices, such as text messages, search keywords, website bookmarks, and contacts, all of which are highly private. The limited computing power of mobile devices can be addressed with the word comparison method, in which words are selected beforehand and delivered to a user's mobile device, and the user's gender is determined by matching the text data on the device against the selected words. Moreover, it is known that filtering instances using definite evidence before performing prediction increases accuracy and reduces computational complexity. In this regard, we propose a two-phase approach to on-device gender prediction in which the discriminability and popularity of words are considered sequentially. The proposed method first predicts all instances using a small number of highly discriminative words and then uses popular words to predict the instances left unclassified by the first phase. In experiments conducted on a real-world dataset, the proposed method outperformed the compared methods.

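As protecting user privacy has emerged as a major issue in analyses that use mobile device data, on-device analysis, which performs the analysis inside the mobile device without transmitting data outside, is attracting attention. On-device analysis makes it possible to predict gender from mobile text such as text messages, search terms, bookmarks, and contacts, which is highly personal but known to be effective for distinguishing gender. The limited resources of mobile devices can be overcome with the word comparison approach, in which a set of preselected words is sent to the mobile device and gender is predicted by comparing these words with the mobile text. In particular, performing prediction after filtering with definite evidence can maximize accuracy and lower complexity. This paper therefore proposes a two-phase on-device gender prediction method that considers the discriminability and popularity of words sequentially. Specifically, the proposed method predicts the gender of all users using a small number of highly discriminative words, and then predicts the gender of the users left unpredicted using highly popular words. In experiments on real data, the proposed method outperformed the compared methods.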
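
The two-phase word comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word sets, the majority-vote matching rule, and the names `DISCRIMINATIVE`, `POPULAR`, `match_vote`, and `predict_gender` are all hypothetical, since the abstract does not specify how words are selected or how matches are scored.

```python
from typing import Dict, Optional, Set

# Hypothetical word sets. In the setting the abstract describes, these would be
# selected beforehand off-device and delivered to the user's mobile device.
DISCRIMINATIVE: Dict[str, Set[str]] = {
    "female": {"lipstick", "nailshop"},
    "male": {"barracks", "shaver"},
}
POPULAR: Dict[str, Set[str]] = {
    "female": {"cafe", "sale", "sister"},
    "male": {"soccer", "game", "brother"},
}


def match_vote(tokens: Set[str], wordsets: Dict[str, Set[str]]) -> Optional[str]:
    """Return the gender whose word set matches more tokens, or None on a tie.

    Assumption: a simple count of matched words decides the label.
    """
    female_hits = len(tokens & wordsets["female"])
    male_hits = len(tokens & wordsets["male"])
    if female_hits == male_hits:
        return None
    return "female" if female_hits > male_hits else "male"


def predict_gender(tokens: Set[str]) -> Optional[str]:
    """Two-phase prediction over the words found in a user's on-device text."""
    # Phase 1: a few highly discriminative words, applied to every user.
    label = match_vote(tokens, DISCRIMINATIVE)
    if label is not None:
        return label
    # Phase 2: popular words, applied only to users left unclassified.
    return match_vote(tokens, POPULAR)
```

Under these assumptions, a user whose text contains a discriminative word is labeled in the first phase regardless of popular-word matches, and only users with no first-phase decision fall through to the popular word set. A user matching neither set stays unclassified (`None`).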
