Incremental Clustering of XML Documents Based on Similar Structures

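  • Jeong Hee Hwang (황정희; Department of Computer Science, Chungbuk National University)
  • Keun Ho Ryu (류근호; School of Electrical and Computer Engineering, Chungbuk National University)
  • Published: 2004.12.01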

Abstract

XML is increasingly important in data exchange and information management. The starting point for retrieving document structure and integrating documents efficiently is to cluster documents that have similar structures, because such documents can then be retrieved faster and more flexibly than when the whole collection of differently structured documents is searched. In this paper, we therefore propose an incremental clustering method based on similar structures, which is useful for structure retrieval and integration of XML documents. Unlike existing document clustering methods, which form clusters by measuring the similarity between documents using vectors, we use a clustering algorithm for transactional data that can be applied flexibly to large volumes of data. We first extract the representative structures of XML documents using a sequential pattern algorithm. We then treat each document as a transaction and the representative structures of the document as the items of that transaction, and perform incremental clustering based on similar structure items. In addition, we define cluster cohesion and inter-cluster similarity, and use them to analyze the efficiency of the proposed method through experimental comparison with existing work.
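
The abstract gives the pipeline only at a high level, so the following is a minimal sketch of the idea rather than the authors' algorithm. It makes two loudly stated assumptions: root-to-node tag paths stand in for the representative structures that the paper mines with a sequential pattern algorithm, and a simple Jaccard overlap with a threshold stands in for the transactional clustering criterion. The names `structure_items`, `jaccard`, `IncrementalClusterer`, and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def structure_items(xml_text):
    """Return the set of root-to-node tag paths of one XML document.

    Assumption: these paths are a stand-in for the representative
    structures; the paper extracts them with sequential pattern mining.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        items.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return items


def jaccard(a, b):
    """Jaccard overlap of two item sets (hypothetical similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusterer:
    """Place each arriving transaction in the most similar cluster,
    or open a new cluster when no cluster is similar enough."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.clusters = []  # each cluster: {"items": set, "docs": list}

    def add(self, doc_id, items):
        best, best_sim = None, 0.0
        for idx, cluster in enumerate(self.clusters):
            sim = jaccard(items, cluster["items"])
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar cluster: start a new one.
            self.clusters.append({"items": set(items), "docs": [doc_id]})
            return len(self.clusters) - 1
        # Merge the document's items into the chosen cluster.
        self.clusters[best]["items"] |= items
        self.clusters[best]["docs"].append(doc_id)
        return best


docs = {
    "a": "<book><title/><author/></book>",
    "b": "<book><title/><author/><year/></book>",
    "c": "<cd><artist/><track/></cd>",
}
clusterer = IncrementalClusterer(threshold=0.5)
labels = {doc_id: clusterer.add(doc_id, structure_items(text))
          for doc_id, text in docs.items()}
```

Because each document is compared only against the current clusters, a newly arriving document can be placed without re-clustering the collection, which is the incremental property the abstract emphasizes. In this toy input the two `book` documents share a cluster and the `cd` document opens its own.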
