Efficiency of Clustering Algorithms - Large Biological Data Bases

About the author

Lawyer/Lecturer
Level
General public
Study
criminal law
School/University
USIU

About the document

Herold k.
Published date
Language
documents in English
Format
Word
Type
case study
Pages
6 pages
Level
General public
Accessed
0 times
Validated by
Committee Oboolo.com
  1. Introduction
  2. Algorithms Performance and Efficiency Evaluation
  3. Efficiency and Performance measure
  4. Discussion
  5. Conclusion

Today, there are more than one million known protein sequences (Sasson et al., 2002), and bioinformatics therefore needs ways to identify meaningful patterns in them in order to understand their functions. For a long time, protein and gene sequences have been analyzed, compared and grouped using alignment methods. According to Cai et al. (2000), alignment methods are algorithms that arrange RNA, DNA, and protein sequences to detect similarities that may result from evolutionary, functional or structural relationships between the sequences. Mount (2002) states that sequences are compared and clustered using pair-wise alignment methods, of which there are two types: global and local. The local alignment algorithm proposed by Smith and Waterman (Bolten et al., 2001) is used to identify amino acid patterns that have been conserved in protein sequences.
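As an illustration of local alignment, the following is a minimal sketch of Smith-Waterman scoring. The match, mismatch and gap values are assumptions chosen for readability; real protein comparisons normally use a substitution matrix and tuned gap penalties, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring: fixed match/mismatch values and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The conserved motif "HEAG" is found despite unrelated flanking residues.
    print(smith_waterman_score("XXXHEAGXXX", "YYHEAGYY"))
```

Because the score is clamped at zero, only the best-matching region contributes, which is what makes the method suitable for finding conserved patterns inside otherwise dissimilar sequences.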

The global alignment algorithm proposed by Needleman and Wunsch (Bolten et al., 2001) attempts to align as many characters as possible across the entire length of the sequences. Both approaches are computationally expensive, because every protein in a data set is compared with every other protein in the data set (Bolten et al., 2001). Neither the local nor the global pair-wise alignment method takes the size of the data set into account, so very large data sets may overwhelm the computer's memory.
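A minimal sketch of global alignment scoring, together with the number of comparisons that an all-against-all scheme requires, is given below. The scoring values and function names are assumptions made for illustration only.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of a and b (assumed linear gap penalty)."""
    # Aligning a prefix of b against nothing costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def pairwise_comparisons(n):
    """Number of distinct pairs when each of n sequences is compared with every other."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(pairwise_comparisons(4922))
```

For a data set of 4,922 sequences, the pair count is about 12.1 million alignments, each of which fills a matrix proportional to the product of the two sequence lengths. This is the cost that motivates clustering on samples rather than on the full set.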

[...] This family's sequences are grouped into eight clusters based on their functions. It comprises 3,737 sequences and is labeled DS2. Proteins from the Globin family are collected as well and randomly grouped into 8 categories comprising 292 sequences, labeled DS3. Therefore, there are 28 different clusters of sequences as classified by experts. The data set in use has 4,922 sequences in total and is labeled DS4. Of these, about 3,500 sequences are set aside for training and the remaining 30% (1,422) for the testing phase. [...]


[...] This study analyzes four clustering algorithms using four large protein sequence data sets. The analysis highlights the weaknesses and shortcomings of the four and proposes a new algorithm that addresses them. [...]


[...] & Kamber, M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann. Mount, D. Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press. Rauhert, S.A., Sathgetre, S.R. & Rausxt, A.P. Gene Expression Analysis: A Review for Large Datasets. Journal of Computer Technology Science and Engineering, 4(1). [...]


[...] The pseudo code for the proposed algorithm is as follows:

Input: A sample S of the training set, S = {Oh}, h = 1, ..., m, where m is the size of S
1. Select K objects Ri (i = 1, ..., K) randomly from S;
2. For every pair of non-selected object Oh in S and selected object Ri do
      Calculate the total score TSih;
3. Select the maximal TSih: MaxTSih, and mark the corresponding objects Ri and Oh;
4. If MaxTSih > 0 then
      Ri = Oh; go back to Step 2
   Else
      For each Oh in S do
         Compute the similarity score of Oh with each medoid Ri using the Smith-Waterman algorithm;
         Assign Oh to the cluster with the nearest Ri;
      End
Output: BestSets, the best partition of S into K clusters, with each cluster defined by its medoid Ri

Conclusion

Proteins that have similar sequences possess similar 3D structures and the same biochemical function. [...]
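The pseudo code for the proposed algorithm can be sketched in Python roughly as follows. This is an interpretation under stated assumptions, not the author's implementation: TSih is taken to be the gain in total similarity obtained by swapping medoid Ri with object Oh, and a toy position-match similarity stands in for the Smith-Waterman score so that the example stays self-contained. All function names are hypothetical.

```python
import random


def position_matches(a, b):
    """Toy similarity (assumption): count of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))


def total_similarity(objects, medoid_idx, sim):
    """Sum over all objects of the similarity to their most similar medoid."""
    return sum(max(sim(o, objects[i]) for i in medoid_idx) for o in objects)


def cluster_by_medoids(objects, k, sim, seed=0):
    rng = random.Random(seed)
    # Step 1: select K objects at random as the initial medoids.
    medoid_idx = rng.sample(range(len(objects)), k)
    while True:
        current = total_similarity(objects, medoid_idx, sim)
        best_gain, best_swap = 0, None
        # Steps 2-3: score every (medoid, non-medoid) swap and keep the best one.
        for pos in range(k):
            for h in range(len(objects)):
                if h in medoid_idx:
                    continue
                trial = medoid_idx[:pos] + [h] + medoid_idx[pos + 1:]
                gain = total_similarity(objects, trial, sim) - current
                if gain > best_gain:
                    best_gain, best_swap = gain, (pos, h)
        # Step 4: apply the swap if it improves the total score, else stop.
        if best_swap is None:
            break
        medoid_idx[best_swap[0]] = best_swap[1]
    # Final assignment of each object to its most similar medoid.
    clusters = {i: [] for i in medoid_idx}
    for idx, o in enumerate(objects):
        nearest = max(medoid_idx, key=lambda i: sim(o, objects[i]))
        clusters[nearest].append(idx)
    return medoid_idx, clusters


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "AATA", "CCCC", "CCCG", "CCGC"]
    medoids, clusters = cluster_by_medoids(seqs, 2, position_matches)
    print([seqs[i] for i in medoids], clusters)
```

The loop terminates because each accepted swap strictly increases the total similarity and there are finitely many medoid sets. Passing a local alignment scorer as `sim` would bring the sketch closer to the description in the text.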


[...] It randomly selects C', a neighbor of C that differs from it by only one sequence (Essoussi & Fayech, 2007). The search moves to the neighbor if its total score is higher than that of the current node. Otherwise, it continues the random checks until a neighbor with a higher score is found or the maximal number of neighbors has been examined. The predetermined maximal number of neighbors should be at least 250. This algorithm, as opposed to those discussed previously, aims at optimizing the total score. The similarity score is again computed using the local alignment method. [...]
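The randomized neighbor search described above might look like the sketch below. It rests on several assumptions: a node is a set of k medoids, a neighbor replaces exactly one medoid, the total score is the summed similarity of every sequence to its most similar medoid, and a toy position-match similarity stands in for the local alignment score. The names are hypothetical; only the limit of 250 neighbors comes from the text.

```python
import random


def position_matches(a, b):
    """Toy similarity (assumption): count of positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))


def total_score(objects, node, sim):
    """Sum over all objects of the similarity to their most similar medoid."""
    return sum(max(sim(o, objects[i]) for i in node) for o in objects)


def random_neighbor_search(objects, k, sim, max_neighbors=250, seed=0):
    rng = random.Random(seed)
    n = len(objects)
    node = rng.sample(range(n), k)  # current node C
    score = total_score(objects, node, sim)
    checked = 0  # neighbors examined since the last improvement
    while checked < max_neighbors:
        candidates = [h for h in range(n) if h not in node]
        if not candidates:
            break
        # Neighbor C' differs from C in exactly one medoid.
        pos = rng.randrange(k)
        neighbor = node[:pos] + [rng.choice(candidates)] + node[pos + 1:]
        neighbor_score = total_score(objects, neighbor, sim)
        if neighbor_score > score:
            node, score = neighbor, neighbor_score  # move and restart the count
            checked = 0
        else:
            checked += 1
    return node, score


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "AATA", "CCCC", "CCCG", "CCGC"]
    node, score = random_neighbor_search(seqs, 2, position_matches)
    print([seqs[i] for i in node], score)
```

Unlike an exhaustive swap search, this variant evaluates one random neighbor at a time, so its cost per step is small, at the price of possibly stopping at a local optimum when the neighbor limit is reached.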
