Clickstream Big Data Mining for Demographics based Digital Marketing

  • Park, Jiae (Dept. of Data Science, Kookmin University)
  • Cho, Yoonho (School of Business Administration, Kookmin University)
  • Received : 2016.08.17
  • Accepted : 2016.09.24
  • Published : 2016.09.30

Abstract

The demographics of Internet users are the most basic and important information for target marketing and personalized advertising on digital marketing channels, which include email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, time-consuming, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data shows which pages users visited, how long they stayed, how often and when they usually visited, which sites they prefer, what keywords they used to find a site, whether they purchased anything, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics. These variables include search keywords, frequency and intensity of use by time, day, and month, the variety of websites visited, and text information from the web pages visited. The demographic attributes to be predicted also vary from paper to paper and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision trees, neural networks, logistic regression, and k-nearest neighbors, were used to build the prediction models. However, this research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model.
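To make the idea of clickstream-derived variables concrete, the following minimal sketch aggregates raw page-view records into per-user attributes such as visit count, dwell time, site variety, and time-of-day and day-of-week shares. The record layout, the site names, the 18:00 cutoff for "evening", and the function name `build_profiles` are all assumptions made for illustration; they are not the variables or the tooling used in the study.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records: (user_id, site, timestamp, seconds_on_page).
# The field layout and values are assumptions for illustration only.
LOG = [
    ("u1", "news.example", "2016-03-01 09:15:00", 40),
    ("u1", "shop.example", "2016-03-01 21:05:00", 120),
    ("u1", "news.example", "2016-03-02 09:30:00", 60),
    ("u2", "shop.example", "2016-03-05 23:10:00", 300),
]


def build_profiles(records):
    """Aggregate raw page views into one feature dict per user."""
    per_user = defaultdict(list)
    for user, site, ts, secs in records:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        per_user[user].append((site, when, secs))

    profiles = {}
    for user, views in per_user.items():
        n = len(views)
        profiles[user] = {
            # How often the user visited.
            "page_views": n,
            # How long the user stayed, on average.
            "avg_seconds_per_view": sum(s for _, _, s in views) / n,
            # Variety of websites visited.
            "distinct_sites": len({site for site, _, _ in views}),
            # When the user usually visits (18:00 cutoff is an assumption).
            "share_evening": sum(1 for _, t, _ in views if t.hour >= 18) / n,
            # Saturday and Sunday have weekday() values 5 and 6.
            "share_weekend": sum(1 for _, t, _ in views if t.weekday() >= 5) / n,
        }
    return profiles


profiles = build_profiles(LOG)
```

Each resulting dictionary plays the role of one row of independent variables; a real profile would also need the keyword and page-text variables mentioned above, which this sketch leaves out.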
The objective of this study is to choose, from the results of previous research, the clickstream attributes most likely to be correlated with the demographics, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. Based on the results of previous research, 64 clickstream attributes are applied to predict the demographic attributes. The overall process of predictive model building is composed of four steps. In the first step, we create user profiles that include 64 clickstream attributes and 5 demographic attributes. The second step performs dimension reduction of the clickstream variables to address the curse of dimensionality and the overfitting problem. We utilize three approaches, based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable, using SVM, neural network, and logistic regression. The last step evaluates the alternative models in terms of accuracy and selects the best one. For the experiments, we used clickstream data that represents 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for our prediction process, and 5-fold cross validation was conducted to enhance the reliability of our experiments. The experimental results verify that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best when using decision tree based dimension reduction with a neural network, whereas the prediction of gender and marital status is most accurate when applying SVM without dimension reduction.
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used effectively to predict their demographics and can thereby be utilized in digital marketing.
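The last two steps of the process, building alternative models and selecting the most accurate one under 5-fold cross validation, can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: the study used IBM SPSS Modeler with SVM, neural network, and logistic regression, whereas this sketch uses two deliberately simple stand-in classifiers (a majority-class baseline and a nearest-centroid rule) and synthetic data so that it runs with the Python standard library alone. All function names here are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_majority(X, y):
    """Stand-in baseline: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda row: label


def fit_nearest_centroid(X, y):
    """Stand-in classifier: predict the label whose mean vector is closest."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[c]))
        return min(centroids, key=sq_dist)

    return predict


def cross_val_accuracy(fit, X, y, k=5):
    """Mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        test = set(held_out)
        train = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(model(X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Synthetic two-feature "profiles"; labels and values are made up.
rng = random.Random(1)
X, y = [], []
for _ in range(100):
    label = rng.choice(["A", "B"])
    center = 0.0 if label == "A" else 3.0
    X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
    y.append(label)

# Evaluate every candidate the same way and keep the most accurate one.
candidates = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
best = max(results, key=results.get)
```

Repeating this comparison once per demographic attribute, and once per dimension reduction variant, would yield the kind of per-attribute "best model" table that the abstract summarizes.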

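Demographic information is the most basic and important information considered for target marketing and personalized advertising aimed at Internet users, which can be regarded as the core of digital marketing. However, because Internet users' online activities are often carried out anonymously, collecting demographic information is not easy. Demographic information can be collected through regular surveys, but this is costly and carries risks such as false entries. In the mobile environment in particular, most users act anonymously, so collecting demographic information is becoming even more difficult. In contrast, clickstream data, which records an Internet user's online activities, can be used to infer that user's demographic information. In particular, page views, one of the characteristics of Internet users' online behavior, are an important factor in predicting demographic information. Building on previous research, this study extracts the online behavioral characteristics of Internet users through clickstream data analysis and uses them to predict those users' demographic information. In addition, we propose and experimentally apply three methods, 1) variable reduction using a decision tree, 2) dimension reduction using principal component analysis, and 3) variable reduction using cluster analysis, with the aim of resolving the curse of dimensionality and the overfitting problem that arise when a prediction model is built with many explanatory variables, and of improving the accuracy of the prediction model. The experimental results show that, for multi-class dependent variables with many categories, the prediction model is more accurate when the proposed methods are applied than when it is built using all explanatory variables. This study is significant in showing that the online behavior of Internet users extracted through clickstream analysis can be used to predict their demographic information, and that the predicted demographic information of anonymous Internet users can be used in digital marketing. Furthermore, we identified which of the proposed methods improves prediction accuracy for which dependent variable. This matters because, when demographic information is predicted from clickstream analysis in the future, the methods proposed in this study can be used to build prediction models with higher accuracy.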
