Efficiency of Clustering Algorithms - Large Biological Data Bases

Herold k.


Themes

Efficiency, Clustering Algorithm, Large Biological Data Bases


Abstract

More than one million protein sequences are now known (Sasson et al., 2002), so bioinformatics needs ways to identify meaningful patterns in them in order to understand their functions. Protein and gene sequences have long been analyzed, compared and grouped using alignment methods. According to Cai et al. (2000), alignment methods are algorithms that arrange RNA, DNA and protein sequences so as to detect similarities that may result from evolutionary, functional or structural relationships between the sequences. Mount (2002) notes that sequences are compared and clustered using pairwise alignment methods, of which there are two types: global and local. The local alignment algorithm proposed by Smith and Waterman (Bolten et al., 2001) is used to identify amino acid patterns that are conserved in protein sequences.
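As a rough illustration of local alignment, the sketch below computes a Smith-Waterman score for two sequences. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for readability; real protein comparisons typically use a substitution matrix and affine gap penalties, which are omitted here.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Assumed scoring scheme: fixed match/mismatch values and a linear gap
    penalty (illustrative only, not taken from the study).
    """
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two sequences that share only a short conserved stretch still receive a positive score for that stretch, which is what makes the local variant suitable for finding conserved patterns.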

The global alignment algorithm proposed by Needleman and Wunsch (Bolten et al., 2001) attempts to align as many characters as possible across the entire length of the sequences. Pairwise alignment is computationally expensive, because every protein in a data set is compared with every other protein in that data set (Bolten et al., 2001). Neither the local nor the global pairwise method takes the size of the data set into account, which is a problem for very large data sets that may exceed the computer's memory.
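To make the cost of all-against-all comparison concrete, the following sketch pairs a minimal Needleman-Wunsch global score with a count of the pairwise comparisons needed for n sequences. The scoring values and function names are assumptions for illustration. For a data set of 4,922 sequences, comparing every distinct pair requires n(n-1)/2 = 12,110,581 alignments.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the global alignment score of a and b (assumed linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]   # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)           # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def pairwise_comparisons(n: int) -> int:
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


def all_pairs_scores(seqs: List[str]) -> Dict[Tuple[int, int], int]:
    """Score every distinct pair; time and memory grow quadratically with len(seqs)."""
    return {(i, j): needleman_wunsch_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
```

Storing the full score table, as `all_pairs_scores` does, is the step that becomes impractical in memory as the data set grows.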

Contents

Introduction
Algorithms Performance and Efficiency Evaluation
Efficiency and Performance Measure
Discussion
Conclusion


Extract

[...] The sequences of this family are grouped into eight clusters according to their function. The family comprises 3,737 sequences and is labeled DS2. Protein sequences from the globin family are collected as well and randomly grouped into 8 categories comprising 292 sequences; this set is labeled DS3. In total, there are 28 different clusters of sequences, as classified by experts. The data set in use has 4,922 sequences in total and is labeled DS4. Of these, 3,500 sequences (about 70%) are used for training and the remaining 1,422 (about 30%) are set aside for the testing phase. [...]

[...] This study analyzes four clustering algorithms using four large protein sequence data sets. The analysis highlights the weaknesses of the four algorithms and proposes a new algorithm that addresses them. [...]

[...] The Pro-LEADER algorithm is not efficient at handling large data sets, as shown on the DS4 data set. To address these shortcomings, this study proposes the Pro-PAM algorithm, which generates the maximal set of medoids. The algorithm randomly selects K sequences from the data set D as initial cluster medoids. It uses the Smith-Waterman local alignment algorithm to compute TSih, the total score for every pair consisting of a selected sequence and a non-selected sequence, and it selects the optimal total. [...]
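The extract does not give the full Pro-PAM procedure, so the sketch below is only a generic PAM-style (k-medoids) swap loop driven by a local alignment similarity, not the study's algorithm. It assumes that "optimal total" means the largest summed similarity between non-selected sequences and their best-matching medoid; the function names, scoring values and the fixed random seed are all hypothetical.

```python
import random
from typing import Callable, List


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with an assumed simple scoring scheme."""
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def total_similarity(seqs: List[str], medoids: List[int],
                     sim: Callable[[str, str], int]) -> int:
    """Sum, over non-selected sequences, of the score to their best-matching medoid."""
    return sum(max(sim(seqs[h], seqs[m]) for m in medoids)
               for h in range(len(seqs)) if h not in medoids)


def pam_like_medoids(seqs: List[str], k: int,
                     sim: Callable[[str, str], int] = local_score,
                     seed: int = 0) -> List[int]:
    """Pick k medoid indices: random start, then keep any swap that raises the total."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(seqs)), k)      # random initial selection
    best = total_similarity(seqs, medoids, sim)
    improved = True
    while improved:                                # stops when no swap improves
        improved = False
        for pos in range(k):
            for h in range(len(seqs)):
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                score = total_similarity(seqs, candidate, sim)
                if score > best:
                    medoids, best, improved = candidate, score, True
    return sorted(medoids)


def assign_clusters(seqs: List[str], medoids: List[int],
                    sim: Callable[[str, str], int] = local_score) -> List[int]:
    """Label each sequence with the index of its most similar medoid."""
    return [max(medoids, key=lambda m: sim(s, seqs[m])) for s in seqs]
```

Each pass recomputes alignment scores for every candidate swap, so a practical version would cache pairwise scores; the sketch leaves that out to keep the swap logic visible.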

[...] The specificity and sensitivity of the algorithms discussed above are calculated from the results of the test phase. Here, sensitivity is the probability that a true homologous pair is identified, whereas specificity is the likelihood that a predicted pair is correct. They are defined by

Sensitivity = TP / (TP + FN)

and

Specificity = TP / (TP + FP)

where TP (True Positives) is the number of true homologous pairs that were identified correctly, FN (False Negatives) is the number of true homologous pairs that were not identified, and FP (False Positives) is the number of non-homologous pairs that were classified as homologous. [...]
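A minimal sketch of how these measures could be computed from predicted and true homologous pairs follows. Because the extract defines only TP, FN and FP, the sketch assumes Sensitivity = TP / (TP + FN) and Specificity = TP / (TP + FP); the latter is what is elsewhere often called precision. The helper names and the example pairs are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def as_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Normalise (a, b) tuples so that (a, b) and (b, a) count as the same pair."""
    return {frozenset(p) for p in pairs}


def confusion_counts(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for predicted homologous pairs against the true ones."""
    tp = len(predicted & truth)   # true pairs that were identified
    fp = len(predicted - truth)   # non-homologous pairs reported as homologous
    fn = len(truth - predicted)   # true pairs that were missed
    return tp, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN); returns 0.0 when there are no true pairs."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp: int, fp: int) -> float:
    """TP / (TP + FP), the assumed definition; returns 0.0 when nothing is predicted."""
    return tp / (tp + fp) if tp + fp else 0.0


truth = as_pairs([("p1", "p2"), ("p1", "p3"), ("p2", "p3")])
predicted = as_pairs([("p2", "p1"), ("p4", "p5")])
tp, fp, fn = confusion_counts(predicted, truth)
```

In this toy example one of three true pairs is recovered and one of two predictions is correct, so sensitivity is 1/3 and specificity is 1/2.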

