DOI QR코드

DOI QR Code

Hierarchical Overlapping Clustering to Detect Complex Concepts

중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법

  • Received : 2010.12.30
  • Accepted : 2011.02.18
  • Published : 2011.03.31

Abstract

Clustering is a process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to the cluster. By this process, clustering facilitates fast and correct search for the relevant documents by narrowing down the range of searching only to the collection of documents belonging to related clusters. For effective clustering, techniques are required for identifying similar documents and grouping them into a cluster, and discovering a concept that is most relevant to the cluster. One of the problems often appearing in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and also could not validate the semantic hierarchical relationship between a complex concept and each of simple concepts. In order to solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm that modified the traditional Agglomerative Hierarchical Clustering algorithm to allow overlapped clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not by a tree but by a lattice to detect complex concepts. We developed a system that employs the HOC algorithm to carry out the goal of complex concept detection. This system operates in three phases; 1) the preprocessing of documents, 2) the clustering using the HOC algorithm, and 3) the validation of semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of terms appearing in the documents. First, it goes through some refinement process by applying stopwords removal and stemming to extract index terms. Then, each index term is assigned a TF-IDF weight value and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm in which the similarity between the documents is calculated by applying the Euclidean distance method. Initially, a cluster is generated for each document by grouping those documents that are closest to it. Then, the distance between any two clusters is measured, grouping the closest clusters as a new cluster. This process is repeated until the root cluster is generated. In the validation phase, the feature selection method is applied to validate the appropriateness of the cluster concepts built by the HOC algorithm to see if they have meaningful hierarchical relationships. Feature selection is a method of extracting key features from a document by identifying and assigning weight values to important and representative terms in the document. In order to correctly select key features, a method is needed to determine how each term contributes to the class of the document. Among several methods achieving this goal, this paper adopted the $x^2$�� statistics, which measures the dependency degree of a term t to a class c, and represents the relationship between t and c by a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluation is carried out by using a well-known Reuter-21578 news collection. The result of performance evaluation showed that the HOC algorithm greatly contributes to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.

클러스터링(Clustering)은 유사한 문서나 데이터를 묶어 군집화해주는 프로세스이다. 클러스터링은 문서들을 대표하는 개념별로 그룹화함으로써 사용자가 자신이 원하는 주제의 문서를 찾기 위해 모든 문서를 검사할 필요가 없도록 도와준다. 이를 위해 유사한 문서를 찾아 그룹화하고, 이 그룹의 대표되는 개념을 도출하여 표현해주는 기법이 요구된다. 이 상황에서 문제점으로 대두되는 것이 복합 개념(Complex Concept)의 탐지이다. 복합 개념은 서로 다른 개념의 여러 클러스터에 속하는 중복 개념이다. 기존의 클러스터링 방법으로는 문서를 클러스터링할 때 동일한 레벨에 있는 서로 다른 개념의 클러스터에 속하는 중복된 복합 개념의 클러스터를 찾아서 표현할 수가 없었고, 또한 복합 개념과 각 단순 개념(Simple Concept) 사이의 의미적 계층 관계를 제대로 검증하기가 어려웠다. 본 논문에서는 기존 클러스터링 방법의 문제점을 해결하여 복합 개념을 쉽게 찾아 표현하는 방법을 제안한다. 기존의 계층적 클러스터링 알고리즘을 변형하여 동일 레벨에서 중복을 허용하는 계층적 클러스터링(Hierarchical Overlapping Clustering, HOC) 알고리즘을 개발하였다. HOC 알고리즘은 문서를 클러스터링하여 그 결과를 트리가 아닌 개념 중복이 가능한 Lattice 계층 구조로 표현함으로써 이를 통해 여러 개념이 중복된 복합 개념을 탐지할 수 있었다. HOC 알고리즘을 이용해 생성된 각 클러스터의 개념이 제대로 된 의미적인 계층 관계로 표현되었는지는 특징 선택(Feature Selection) 방법을 적용하여 검증하였다.

Keywords

References

  1. Baeza-Yates, R. and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
  2. Chuang, S. and L. Chien, "A Practical Web-based Approach to Generating Topic Hierarchy for Text Segments, Proc. 13th ACM Intl. Conf. on Information and Knowledge Management (CIKM'04) (2004), 127-136.
  3. Cleuziou, G., "An Extended Version of the Kmeans Method for Overlapping Clustering", Proc. 19th Intl. Conf. on Pattern Recognition (ICPR 2008), (2008), 1-4.
  4. Gath, I. and A. B. Geva, "Unsupervised Optimal Fuzzy Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.11, No.7(1989), 773-780. https://doi.org/10.1109/34.192473
  5. Jain, A. K., M. N. Murty and P. J. Flynn, "Data Clustering : A Review", ACM Computing Surveys, Vol.31, No.3(1999), 264-323. https://doi.org/10.1145/331499.331504
  6. Jonyer, I., D. J. Cook and L. B. Holder, "Graph- Based Hierarchical Conceptual Clustering", Journal of Machine Learning Research, Vol.2(2002), 19-43.
  7. Lavine, B. K., "Clustering and Classification of Analytical Data", in R. A. Meyers (ed.), Encyclopedia of Analytical Chemistry, (2000), 1-21.
  8. Lewis, D., "Reuters-21578 Text Categorization Test Collection", 2004.
  9. http://www.daviddlewis.com/resources/testcollecions/reuters21578/.
  10. Likas, A., N. Viassis and J. J. Verbeek, "The Global K-means Clustering Algorithm", Pattern Recognition, Vol.36, No.2(2003), 451-461. https://doi.org/10.1016/S0031-3203(02)00060-2
  11. Liu, B., Web Data Mining : Exploring Hyperlinks, Contents, and Usage Data, Springer, 2007.
  12. Yeh, J. and S. Sie, "Towards Automatic Concept Hierarchy Generation for Specific Knowledge Network", Lecture Notes in Artificial Intelligence(LNAI), Vol.4031 (2006), 982-989.

Cited by

  1. 통계적 접근법을 기초로 하는 지능형 교육 지원 시스템 vol.18, pp.1, 2011, https://doi.org/10.13088/jiis.2012.18.1.109