Efficiency of Clustering Algorithms - Large Biological Data Bases
Today, more than one million protein sequences are known (Sasson et al., 2002), so bioinformatics needs methods for identifying meaningful patterns in them in order to understand their functions. For a long time, protein and gene sequences have been analyzed, compared and grouped using alignment methods. According to Cai et al. (2000), alignment methods are algorithms that arrange RNA, DNA, and protein sequences to detect similarities that may result from evolutionary, functional or structural relationships between the sequences. Mount (2002) asserts that sequences are compared and clustered using pairwise alignment methods, which are of two types: global and local. The local alignment algorithm proposed by Smith and Waterman (Bolten et al., 2001) is used to identify amino acid patterns that have been conserved in protein sequences.
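To make the idea of local alignment concrete, the following is a minimal sketch of a Smith-Waterman style scoring routine. The match, mismatch and gap values are illustrative assumptions, not parameters taken from the study; in practice a substitution matrix such as BLOSUM would normally replace the simple match/mismatch rule.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions (linear gap penalty,
    uniform match/mismatch), not the scheme used in the study.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The conserved region "ACDE" is found despite unrelated flanking residues.
print(smith_waterman_score("XXACDEYY", "ZZACDEWW"))  # 8
```

Because the score is floored at zero, unrelated flanking regions do not penalize a conserved core, which is why this style of alignment suits the search for conserved amino acid patterns.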
The global alignment algorithm proposed by Needleman and Wunsch (Bolten et al., 2001) attempts to align as many characters as possible across the entire length of the sequences. Pairwise alignment is computationally expensive because every protein in a data set is compared with all the other proteins in the data set (Bolten et al., 2001). Neither the local nor the global pairwise alignment method takes the size of the data set into consideration, so very large data sets may overwhelm the computer's memory.
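As a rough illustration of both points, the sketch below computes a Needleman-Wunsch style global score and counts how many pairwise comparisons an all-against-all scheme requires. The scoring values and the function names are assumptions chosen for the example, not details from the study.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of a and b (illustrative scoring)."""
    # Aligning a prefix of b against an empty string costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Unlike local alignment, the score is not floored at zero:
            # the whole of both sequences must be accounted for.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def pairwise_comparisons(n):
    """Number of distinct pairs when each of n sequences meets every other."""
    return n * (n - 1) // 2


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(pairwise_comparisons(1_000_000))               # 499999500000
```

The quadratic growth in the second function is the reason all-against-all alignment becomes impractical as data sets approach the million-sequence scale mentioned earlier.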
[...] This family's sequences are grouped into eight clusters based on their functions. The family comprises 3,737 sequences and is labeled DS2. Sequences from the Globin family are also collected and randomly grouped into 8 categories; this set contains 292 sequences and is labeled DS3. In total, there are 28 different clusters of sequences as classified by experts. The complete data set contains 4,922 sequences and is labeled DS4. Of these, 3,500 sequences (about 70%) are used for training and the remaining 1,422 (about 30%) are set aside for the testing phase. [...]
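The training and testing partition described above could be produced along the following lines. This is only a sketch: the excerpt does not say how the split was drawn, so the random shuffle, the seed, and the placeholder sequence identifiers are all assumptions.

```python
import random


def split_train_test(items, n_test, seed=0):
    """Shuffle items reproducibly and hold out n_test of them for testing."""
    rng = random.Random(seed)  # fixed seed so the split can be repeated
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for the 4,922 sequences of DS4.
sequence_ids = [f"seq{i}" for i in range(4922)]
train, test = split_train_test(sequence_ids, n_test=1422)
print(len(train), len(test))  # 3500 1422
```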
[...] This study analyzes four clustering algorithms using four large protein sequence data sets. The analysis highlights the weaknesses and shortcomings of the four algorithms and proposes a new algorithm designed to address them. [...]
[...] & Kamber, M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann.
Mount, D. Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press.
Rauhert, S.A., Sathgetre, S.R. & Rausxt, A.P. Gene Expression Analysis: A Review for Large Datasets. Journal of Computer Technology Science and Engineering, 4(1). [...]
[...] The pseudo code for the proposed algorithm is as follows:

Input: A sample S of the training set, S = {Oh}, h = 1, ..., m, where m is the size of S
1. Select K objects Ri (i = 1, ..., K) randomly from S.
2. For every pair of non-selected object Oh in S and selected object Ri, calculate the total score TSih.
3. Select the maximal TSih, MaxTSih, and mark the corresponding objects Ri and Oh.
4. If MaxTSih > 0, then set Ri = Oh and go back to Step 2.
   Else, for each Oh in S:
      Compute the similarity score of Oh with each medoid Ri using the Smith-Waterman algorithm.
      Assign Oh to the cluster with the nearest Ri.
   End
Output: BestSets, the best partition of S into K clusters, with each cluster defined by its medoid Ri.

Conclusion
Proteins that have similar sequences possess similar 3D structures and the same biochemical function. [...]
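The steps of the pseudo code can be sketched in Python as shown below. Several details are assumptions, because the excerpt does not define them: TSih is taken here to be the change in summed similarity when medoid Ri is replaced by object Oh, the function names are hypothetical, and the compact local alignment scorer uses illustrative scoring values.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Compact Smith-Waterman style score (illustrative scoring values)."""
    prev, best = [0] * (len(b) + 1), 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[-1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def cluster_by_medoids(objects, k, sim=local_score, seed=0):
    """Greedy medoid swapping followed by nearest-medoid assignment."""
    n = len(objects)
    s = [[sim(a, b) for b in objects] for a in objects]  # all-vs-all scores

    def total(meds):
        # Summed similarity of every object to its most similar medoid.
        return sum(max(s[h][r] for r in meds) for h in range(n))

    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)            # Step 1: random medoids
    while True:
        current = total(medoids)
        best_gain, best_swap = 0, None
        for i in range(k):                       # Step 2: score every swap
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                gain = total(trial) - current    # assumed definition of TSih
                if gain > best_gain:             # Step 3: keep the maximum
                    best_gain, best_swap = gain, (i, h)
        if best_swap is None:                    # MaxTSih <= 0: stop swapping
            break
        i, h = best_swap                         # Step 4: Ri = Oh, repeat
        medoids[i] = h
    clusters = {r: [] for r in medoids}
    for h in range(n):                           # assign to most similar medoid
        clusters[max(medoids, key=lambda r: s[h][r])].append(h)
    return medoids, clusters


toy = ["AAAAAA", "AAAAAT", "AAAATA", "CCCCCC", "CCCCCG", "CCCCGC"]
medoids, clusters = cluster_by_medoids(toy, k=2)
print(sorted(clusters.values()))  # [[0, 1, 2], [3, 4, 5]]
```

Every accepted swap strictly increases the summed similarity, so the loop terminates. Precomputing the full similarity matrix keeps the sketch short but is exactly the all-against-all cost that makes sampling S from the training set attractive.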
[...] It randomly selects C', a neighbor of C that differs from it by only one sequence (Essoussi & Fayech, 2007). The search moves to the neighbor whenever the neighbor's total score is higher than that of the current node. Otherwise, it continues the random checks until a neighbor with a higher score is found or the maximal number of neighbors has been examined. The predetermined maximal number of neighbors should be at least 250. Unlike the previously discussed algorithms, this algorithm aims at optimizing the total score. The similarity score is again computed using the local alignment method. [...]
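The randomized neighbor search described above might look roughly like the following sketch. The function name, the seed, and the toy numeric similarity used in the demonstration are assumptions; the toy score stands in for the alignment-based total score, and only the limit of 250 neighbors comes from the text.

```python
import random


def random_neighbor_search(n_objects, k, total_score, max_neighbors=250, seed=0):
    """Hill-climb over medoid sets by testing randomly chosen neighbors.

    A node is a list of k medoid indices; a neighbor differs in one medoid.
    """
    if not 0 < k < n_objects:
        raise ValueError("k must be between 1 and n_objects - 1")
    rng = random.Random(seed)
    current = rng.sample(range(n_objects), k)
    current_score = total_score(current)
    tried = 0  # non-improving neighbors examined at the current node
    while tried < max_neighbors:
        i = rng.randrange(k)  # which medoid to replace
        h = rng.choice([x for x in range(n_objects) if x not in current])
        neighbor = current[:i] + [h] + current[i + 1:]
        score = total_score(neighbor)
        if score > current_score:
            # Move to the better neighbor and restart the count.
            current, current_score, tried = neighbor, score, 0
        else:
            tried += 1
    return current, current_score


# Toy stand-in for an alignment-based score: points on a line, where
# similarity is the negated distance to the closest medoid.
points = [0, 1, 2, 10, 11, 12]


def toy_total_score(medoids):
    return sum(max(-abs(p - points[r]) for r in medoids) for p in points)


medoids, score = random_neighbor_search(len(points), 2, toy_total_score)
```

Resetting the counter after each move means the limit applies per node, so the search stops only when 250 consecutive random neighbors fail to improve on the current set.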