A Gene Clustering Method with Masking Cross-matching Fragments Using Modified Suffix Tree Clustering Method

Han, Sang-Il;Lee, Sung-Gun;Hou, Bo-Kyeng;Park, Sung-Hoon;Kim, Young-Han;Hwang, Kyu-Suk;

doi:10.1007/BF02719409

Korean Journal of Chemical Engineering

Volume 22 Issue 3
/
Pages.345-352
/
2005
/
0256-1115(pISSN)
/
1975-7220(eISSN)

The Korean Institute of Chemical Engineers (한국화학공학회)

DOI QR Code

A Gene Clustering Method with Masking Cross-matching Fragments Using Modified Suffix Tree Clustering Method

Han, Sang-Il (Department of Chemical Engineering, Pusan National University) ;
Lee, Sung-Gun (Department of Chemical Engineering, Pusan National University) ;
Hou, Bo-Kyeng (Department of Chemical Engineering, Pusan National University) ;
Park, Sung-Hoon (Department of Chemical Engineering, Pusan National University) ;
Kim, Young-Han (Department of Chemical Engineering, Dong-A University) ;
Hwang, Kyu-Suk (Department of Chemical Engineering, Pusan National University)

Published : 2005.05.31

https://doi.org/10.1007/BF02719409 Citation KSCI

⟨ Previous Next ⟩

Abstract

Multiple sequence alignment is a method for comparing two or more DNA or protein sequences. Most multiple sequence alignment methods rely on pairwise alignment and Smith-Waterman algorithm [Needleman and Wunsch, 1970; Smith and Waterman, 1981] to generate an alignment hierarchy. Therefore, as the mnnber of sequences increases, the runtime increases exponentially. To resolve this problem, this paper presents a multiple sequence alignment method using a parallel processing suffix tree algorithm to search for common subsequences at one time without pairwise alignment. The cross-matched subsequences among the searched common subsequences may be generated and those cause inexact-matching. So the procedure of masking cross-matching pairs was suggested in this study. The proposed method, improved STC (Suffix Tree Clustering), is summarized as follows: (I) construction of suffix tree; (2) search and overlap of common subsequences; (3) grouping of subsequence pairs; (4) masking of cross-matching pairs; and (5) clustering of gene sequences. The new method was successfully evaluated with 23 genes in Mus musculus and 22 genes in three species, clustering nine and eight clusters, respectively.

Keywords

References

Chen, J. Y. and Carlis, J. V., 'Genomic Data Modeling,' Information Systems, 28, 287 (2003) https://doi.org/10.1016/S0306-4379(02)00071-6
Choi, S. H. and Manousiouthakis, V., 'Global Optimization Methods for Chemical Process Design: Deterministic and Stochastic Approaches,' Korean J. Chem. Eng., 19(2), 227 (2002) https://doi.org/10.1007/BF02698406
Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O. and Salzberg, S. L., 'Alignment of Whole Genomes,' Nucleic Acids Res., 27(11), 2369 (1999) https://doi.org/10.1093/nar/27.11.2369
Delcher, A. L., Phillippy, A., Carlton J. and Salzberg, S. L., 'Fast Algorithms for Large-scale Genome Alignment and Comparison,' Nucleic Acids Res., 30(11), 2478 (2002) https://doi.org/10.1093/nar/30.11.2478
Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, London (1997)
Higgins, D. G., Thompson, J. D. and Gibson, T. J., 'Using CLUSTAL for Multiple Sequence Alignments,' Methods Enzymol., 266, 383 (1996) https://doi.org/10.1016/S0076-6879(96)66024-8
Higgins, D. G. and Sharp, P. M., 'CLUSTAL: A Package for Performing Multiple Sequence Alignment on a Microcomputer,' Gene, 73, 237 (1988) https://doi.org/10.1016/0378-1119(88)90330-7
Hon, W. K. and Sadakane, K., 'Space-Economical Algorithms for Finding Maximal Unique Matches,' Proceedings of the I3th Annual Symposium on Combinatorial Pattern Matching, 144 (2002)
Kalyanaraman, A., Aluru, S. and Kothari, S., Parallel EST Clustering, HICOMB 2002,185 (2002)
Kim, D. K., Lee, K. S. and Yang D. R., 'Control of pH Neutralization Proess Using Simulation Based Dynamic Programming,' Korean J. Chem. Eng., 21(5), 942 (2004) https://doi.org/10.1007/BF02705575
Lee, J. M. and Lee, J. H., 'Simulation-Based Learning of Cost- To-Go for Control of Nonlinear Processes,' Korean J. Chem. Eng., 21(2), 338 (2004) https://doi.org/10.1007/BF02705417
McCreight, E., 'A Space Economical Suffix Tree Construction Algorithm,' Journal of the ACM, 23, 262 (1976) https://doi.org/10.1145/321941.321946
Miller, R. T., Christoffels, A. G., Gopalakrishnan, C., Burke, J., Ptitsyn, A. A., Broveak, T. R. and Hide, W. A., 'A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base,' Genome Research, 9, 1143 (1999) https://doi.org/10.1101/gr.9.11.1143
Morgenstern, B., Frech, K., Dress, A. and Wemer, T., 'DIALIGN: Finding Local Similarities by Multiple Sequence Alignment,' Bioinformatics, 14, 290 (1998) https://doi.org/10.1093/bioinformatics/14.3.290
Mount, D. W., Bioinformatics : Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press (2001)
Needleman, S. B. and Wunsch, C. D., 'A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins,' J. Mol. Biol., 48, 443 (1970) https://doi.org/10.1016/0022-2836(70)90057-4
Notredame, C. and Higgins, D. G., 'SAGA: Sequence Alignment by Genetic Algorithm,' Nucleic Acids Res., 24, 1515 (1996) https://doi.org/10.1093/nar/24.8.1515
Ostell, J. M., Wheelan, S. J. and Kans, J. A., 'The NCBI Data Model,' Methods Biochem. Anal., 43, 19 (2001) https://doi.org/10.1002/0471223921.ch2
Pearson, W. R. and Miller, W., 'Dynamic Programming Algorithm for Biological Sequence Comparison,' Methods Enzymol., 210, 575 (1992) https://doi.org/10.1016/0076-6879(92)10029-D
Phillips, A., Janies, D. and Wheeler, W., 'Multiple Sequence Alignment in Phylogenetic Analysis,' Molecular Phylogenetics and Evolution, 16, 317 (2000) https://doi.org/10.1006/mpev.2000.0785
Randal, L. S. and Christiansen, T., Learning Perl, Second Edition, O'Reilly (1997)
Salzberg, S. L., Searls, D. B. and Kasif, S., Trends Guide to Bioinformatics, Elsevier Science (1998)
Shin, P. K., Koo, J. H. and Lee, W J., 'Modeling of Cell Growth and phoA-Directed Expression of Cloned Genes in Recombinant Escherichia coli,' Korean J. Chem. Eng., 13(1), 82 (1996) https://doi.org/10.1007/BF02705893
Smith, T. F. and Waterman, M. S., 'Identification of Common Molecular Sequences,' J. Mol. Biol., 197, 723 (1981)
Thompson, J. D., Higgins, D. G. and Gibson, T. J., 'CLUSTAL W: improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and weight Matrix Choice,' Nucleic Acids Res., 22, 4673 (1994) https://doi.org/10.1093/nar/22.22.4673
Tisdall, J. D., Beginning Perl for Bioinformatics, O'REILLY (2001)
Ukkonen, E., 'On-line Construction of Suffix Trees,' Algorithmica, 14, 249 (1995) https://doi.org/10.1007/BF01206331
Volfovsky, N., Haas, B. J. and Salzberg, S. L., 'A Clustering Method for Repeat Analysis in DNA Sequences,' Genome Biology, 2, 1 (2001)
Weiner, P., 'Linear Pattern Matching Algorithms,' In Proc. of the 14th IEEE Annual Symposium on Switching and Automata Theory, 1 (1973)
Zamir, O., Etzioni, O., Madani, O. and Karp, R. M., 'Fast and Intuitive Clustering of Web Documents,' In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, 287 (1997)
Zamir, O. and Etzioni, O., 'Web Document Clustering: A Feasibility Demonstration,' In Proc. of the $21^{st}$ Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 46 (1998)

Korean Journal of Chemical Engineering

A Gene Clustering Method with Masking Cross-matching Fragments Using Modified Suffix Tree Clustering Method

Abstract

Keywords

References

이메일무단수집거부

이용약관

제 1 장 총칙

제 2 장 이용계약의 체결

제 3 장 계약 당사자의 의무

제 4 장 서비스의 이용

제 5 장 계약 해지 및 이용 제한

제 6 장 손해배상 및 기타사항

Detail Search

Image Search (β)