A Clustering Technique using Common Structures of XML Documents

XML 문서의 공통 구조를 이용한 클러스터링 기법

  • 황정희 (충북대학교 컴퓨터과학과) ;
  • 류근호 (충북대학교 전기전자컴퓨터공학부)
  • Published : 2005.12.01

Abstract

As the Internet is growing, the use of XML which is a standard of semi-structured document is increasing. Therefore, there are on going works about integration and retrieval of XML documents. However, the basis of efficient integration and retrieval of documents is to cluster XML documents with similar structure. The conventional XML clustering approaches use the hierarchical clustering algorithm that produces the demanded number of clusters through repeated merge, but it have some problems that it is difficult to compute the similarity between XML documents and it costs much time to compare similarity repeatedly. In order to address this problem, we use clustering algorithm for transactional data that is scale for large size of data. In this paper we use common structures from XML documents that don't have DTD or schema. In order to use common structures of XML document, we extract representative structures by decomposing the structure from a tree model expressing the XML document, and we perform clustering with the extracted structure. Besides, we show efficiency of proposed method by comparing and analyzing with the previous method.

인터넷의 성장으로 인해 반구조적인 문서의 표준인 XML 문서의 사용이 증가하고 있고 이에 따라 XML 문서의 통합과 검색을 위한 연구가 많이 진행되고 있다. 효율적인 문서의 통합과 검색을 위한 기초 작업은 유사 구조의 XML 문서를 클러스터링 하는 것이다. 기존 연구의 XML 문서 클러스터링에서는 문서간의 구조적 유사도를 이용하여 클러스터를 생성한다. 그러나 이러한 방법은 문서간의 구조적 유사성외 정확한 측정 기준을 만들기 어렵고, 반복적인 유사도의 비교로 인해 처리 속도가 느리다는 단점이 있다. 이러한 문제점을 개선하기 위하여 이 논문에서는 많은 데이타에도 유연하게 적용할 수 있는 트랜잭션 데이타를 위한 클러스터링 알고리즘을 적용하는 새로운 클러스터링 방법을 제안한다. 이 논문에서 제안하는 클러스터링 방법은 하나의 DTD나 XML 스키마를 공유하는 문서 집합이 아닌 스키마가 없는 다양한 구조의 XML 문서들을 대상으로 공통 구조를 이용한다. 공통 구조를 이용하기 위하여 XML 문서의 트리 모델에서 구조를 분리하여 빈발 구조를 추출하고 이를 기반으로 클러스터링을 수행한다. 아울러, 기존 연구와의 비교 및 실험을 통해 제안 기법의 효율성을 보인다.

Keywords

References

  1. W3C, Extensible Markup Language(XML) 1.1. http://www.w3.org/TR/xml11, W3C Working Draft. April 2002
  2. J. T. Wang, D. Shasha, G. J. S. Chang, 'Structural Matching and Discovery in Document Databases,' Proceedings of the ACM SIGMOD on Management of Data, 1997 https://doi.org/10.1145/253260.253406
  3. R. Nayak, R. Witt, A. Tonev, 'Data Mining and XML Documents,' International Conference on Internet Computing, 2002
  4. M. L. Lee, L. H. Yang, W. Hsu, X. Yang, 'XClust: Clustering XML Schemas for Effective Integration,' Proceedings of the ACM International Conference on Information and Knowledge Management, 2002 https://doi.org/10.1145/584792.584841
  5. Y. Shen, B. Wang, 'Clustering Schemaless XML Document,' Proceedings of the 11th International Conference on Cooperative Inormation System, 2003
  6. A. K. Jain, M. N. Murty and P. J. Flynn, 'Data clustering: a review', ACM computing Surveys. vol. 31, no. 3, September 1999 https://doi.org/10.1145/331499.331504
  7. J. Yoon, V. Raghavan, V. Chakilam, 'BitCube:Clustering and Statistical Analysis for XML Documents,' Proceedings of the International Conference on Scientific and Statistical Database Management, 2001
  8. A. Doucet, H. A. Myka, 'Naive Clustering of a Large XML Document Collection,' In Proceedings of INEX Workshop, 2002
  9. D.Hill, 'A Vector Clustering Technique,' Merchandise Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968
  10. T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, 'Efficient Substructure Discovery from Large Semi-structured Data,' Proceedings of the SIAM International Conference on Data Mining, 2002
  11. M. Zaki, 'Efficiently Mining Frequent Tree in a Forest,' Proceedings of the ACM SIGKDD International Conference, 2002
  12. E. Kotasakis, 'Structural Information Retrieval in XML Documents,' ACM Symposium on Applied Computing(SAC), 2002 https://doi.org/10.1145/508791.508919
  13. J. W. Lee, K. Lee, W. Kim, 'Preparation for Semantics-Based XML Mining,' In Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 345-352, 2001 https://doi.org/10.1109/ICDM.2001.989538
  14. A. Termier, M. C. Houster, M. Sebag, 'Tree-Finder: A First Step towards XML Data Mining,' In Proceedings of IEEE International Conference on Data Mining (ICDM), pp.450-457, 2002 https://doi.org/10.1109/ICDM.2002.1183987
  15. J. Widom, 'Data Management for XML: Research Directions,' IEEE Computer Society Technical Committee on Data Engineering, 1999
  16. A. G. Buchner, M. Baumgarten, M. D. Mulvenna, R. Bohm, S. S. Anand, 'Data Mining and XML: Current and Future Issues,' Proceedings of WISE, 2000 https://doi.org/10.1109/WISE.2000.882869
  17. Z. Zhang, R. Li, S. Cao, Y. Zhu, 'Similarity Metric for XML Documents,' Workshop on Knowledge and Experience Management(FGWM), 2003
  18. F. D. Francesca, G. Gordano, G. Manco, R. Ortale, A. Tagarelli, 'A General Framework for XML Document Clustering,' Technical report, n(8), ICAR-CNR, 2003
  19. J. H. Hwang, K. H. Ryu, 'Structure-based Clustering for XML Document Retrieval,' to be published in the Journal of KIPS
  20. Y. Yang, X. Guan, J You, 'CLOPE : A fast and effective clustering algorithm for transaction data,' Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
  21. K. Wang, C. Xu, 'Clustering Transactions Using Large Items,' Proceedings of ACM CIKM-99, 1999 https://doi.org/10.1145/319950.320054
  22. S.Abiteboul, P. Buneman, D. Suciu, 'Data On The Web: From Relational to Semistructured Data and XML,' Morgan Kaufmann Publishers, San Francisco, California, 2000
  23. http://www.cogsci.princeton.edu/~/
  24. J. Pei, J. Han, B. M. Asi, H. Pinto, 'PrefixSpan: Mining Sequential Pattern Efficiently by Prefix-Projected Pattern Growth,' Proceedings of International Conference on Data Engineering (ICDE) , 2001
  25. NIAGARA query engine. http://www.cs.wisc.edu/niagara/data.html
  26. http://sourceforge.net/projects/javawn/