DOI QR코드

DOI QR Code

A Gene Clustering Method with Masking Cross-matching Fragments Using Modified Suffix Tree Clustering Method

  • Han, Sang-Il (Department of Chemical Engineering, Pusan National University) ;
  • Lee, Sung-Gun (Department of Chemical Engineering, Pusan National University) ;
  • Hou, Bo-Kyeng (Department of Chemical Engineering, Pusan National University) ;
  • Park, Sung-Hoon (Department of Chemical Engineering, Pusan National University) ;
  • Kim, Young-Han (Department of Chemical Engineering, Dong-A University) ;
  • Hwang, Kyu-Suk (Department of Chemical Engineering, Pusan National University)
  • Published : 2005.05.31

Abstract

Multiple sequence alignment is a method for comparing two or more DNA or protein sequences. Most multiple sequence alignment methods rely on pairwise alignment and Smith-Waterman algorithm [Needleman and Wunsch, 1970; Smith and Waterman, 1981] to generate an alignment hierarchy. Therefore, as the mnnber of sequences increases, the runtime increases exponentially. To resolve this problem, this paper presents a multiple sequence alignment method using a parallel processing suffix tree algorithm to search for common subsequences at one time without pairwise alignment. The cross-matched subsequences among the searched common subsequences may be generated and those cause inexact-matching. So the procedure of masking cross-matching pairs was suggested in this study. The proposed method, improved STC (Suffix Tree Clustering), is summarized as follows: (I) construction of suffix tree; (2) search and overlap of common subsequences; (3) grouping of subsequence pairs; (4) masking of cross-matching pairs; and (5) clustering of gene sequences. The new method was successfully evaluated with 23 genes in Mus musculus and 22 genes in three species, clustering nine and eight clusters, respectively.

Keywords

References

  1. Chen, J. Y. and Carlis, J. V., 'Genomic Data Modeling,' Information Systems, 28, 287 (2003) https://doi.org/10.1016/S0306-4379(02)00071-6
  2. Choi, S. H. and Manousiouthakis, V., 'Global Optimization Methods for Chemical Process Design: Deterministic and Stochastic Approaches,' Korean J. Chem. Eng., 19(2), 227 (2002) https://doi.org/10.1007/BF02698406
  3. Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O. and Salzberg, S. L., 'Alignment of Whole Genomes,' Nucleic Acids Res., 27(11), 2369 (1999) https://doi.org/10.1093/nar/27.11.2369
  4. Delcher, A. L., Phillippy, A., Carlton J. and Salzberg, S. L., 'Fast Algorithms for Large-scale Genome Alignment and Comparison,' Nucleic Acids Res., 30(11), 2478 (2002) https://doi.org/10.1093/nar/30.11.2478
  5. Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, London (1997)
  6. Higgins, D. G., Thompson, J. D. and Gibson, T. J., 'Using CLUSTAL for Multiple Sequence Alignments,' Methods Enzymol., 266, 383 (1996) https://doi.org/10.1016/S0076-6879(96)66024-8
  7. Higgins, D. G. and Sharp, P. M., 'CLUSTAL: A Package for Performing Multiple Sequence Alignment on a Microcomputer,' Gene, 73, 237 (1988) https://doi.org/10.1016/0378-1119(88)90330-7
  8. Hon, W. K. and Sadakane, K., 'Space-Economical Algorithms for Finding Maximal Unique Matches,' Proceedings of the I3th Annual Symposium on Combinatorial Pattern Matching, 144 (2002)
  9. Kalyanaraman, A., Aluru, S. and Kothari, S., Parallel EST Clustering, HICOMB 2002,185 (2002)
  10. Kim, D. K., Lee, K. S. and Yang D. R., 'Control of pH Neutralization Proess Using Simulation Based Dynamic Programming,' Korean J. Chem. Eng., 21(5), 942 (2004) https://doi.org/10.1007/BF02705575
  11. Lee, J. M. and Lee, J. H., 'Simulation-Based Learning of Cost- To-Go for Control of Nonlinear Processes,' Korean J. Chem. Eng., 21(2), 338 (2004) https://doi.org/10.1007/BF02705417
  12. McCreight, E., 'A Space Economical Suffix Tree Construction Algorithm,' Journal of the ACM, 23, 262 (1976) https://doi.org/10.1145/321941.321946
  13. Miller, R. T., Christoffels, A. G., Gopalakrishnan, C., Burke, J., Ptitsyn, A. A., Broveak, T. R. and Hide, W. A., 'A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base,' Genome Research, 9, 1143 (1999) https://doi.org/10.1101/gr.9.11.1143
  14. Morgenstern, B., Frech, K., Dress, A. and Wemer, T., 'DIALIGN: Finding Local Similarities by Multiple Sequence Alignment,' Bioinformatics, 14, 290 (1998) https://doi.org/10.1093/bioinformatics/14.3.290
  15. Mount, D. W., Bioinformatics : Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press (2001)
  16. Needleman, S. B. and Wunsch, C. D., 'A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins,' J. Mol. Biol., 48, 443 (1970) https://doi.org/10.1016/0022-2836(70)90057-4
  17. Notredame, C. and Higgins, D. G., 'SAGA: Sequence Alignment by Genetic Algorithm,' Nucleic Acids Res., 24, 1515 (1996) https://doi.org/10.1093/nar/24.8.1515
  18. Ostell, J. M., Wheelan, S. J. and Kans, J. A., 'The NCBI Data Model,' Methods Biochem. Anal., 43, 19 (2001) https://doi.org/10.1002/0471223921.ch2
  19. Pearson, W. R. and Miller, W., 'Dynamic Programming Algorithm for Biological Sequence Comparison,' Methods Enzymol., 210, 575 (1992) https://doi.org/10.1016/0076-6879(92)10029-D
  20. Phillips, A., Janies, D. and Wheeler, W., 'Multiple Sequence Alignment in Phylogenetic Analysis,' Molecular Phylogenetics and Evolution, 16, 317 (2000) https://doi.org/10.1006/mpev.2000.0785
  21. Randal, L. S. and Christiansen, T., Learning Perl, Second Edition, O'Reilly (1997)
  22. Salzberg, S. L., Searls, D. B. and Kasif, S., Trends Guide to Bioinformatics, Elsevier Science (1998)
  23. Shin, P. K., Koo, J. H. and Lee, W J., 'Modeling of Cell Growth and phoA-Directed Expression of Cloned Genes in Recombinant Escherichia coli,' Korean J. Chem. Eng., 13(1), 82 (1996) https://doi.org/10.1007/BF02705893
  24. Smith, T. F. and Waterman, M. S., 'Identification of Common Molecular Sequences,' J. Mol. Biol., 197, 723 (1981)
  25. Thompson, J. D., Higgins, D. G. and Gibson, T. J., 'CLUSTAL W: improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and weight Matrix Choice,' Nucleic Acids Res., 22, 4673 (1994) https://doi.org/10.1093/nar/22.22.4673
  26. Tisdall, J. D., Beginning Perl for Bioinformatics, O'REILLY (2001)
  27. Ukkonen, E., 'On-line Construction of Suffix Trees,' Algorithmica, 14, 249 (1995) https://doi.org/10.1007/BF01206331
  28. Volfovsky, N., Haas, B. J. and Salzberg, S. L., 'A Clustering Method for Repeat Analysis in DNA Sequences,' Genome Biology, 2, 1 (2001)
  29. Weiner, P., 'Linear Pattern Matching Algorithms,' In Proc. of the 14th IEEE Annual Symposium on Switching and Automata Theory, 1 (1973)
  30. Zamir, O., Etzioni, O., Madani, O. and Karp, R. M., 'Fast and Intuitive Clustering of Web Documents,' In Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, 287 (1997)
  31. Zamir, O. and Etzioni, O., 'Web Document Clustering: A Feasibility Demonstration,' In Proc. of the $21^{st}$ Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 46 (1998)