LibCat » Книги » Приключения » unrecognised » Data Analytics in Bioinformatics

Data Analytics in Bioinformatics

Здесь есть возможность читать онлайн «Data Analytics in Bioinformatics» — ознакомительный отрывок электронной книги совершенно бесплатно, а после прочтения отрывка купить полную версию. В некоторых случаях можно слушать аудио, скачать через торрент в формате fb2 и присутствует краткое содержание. Жанр: unrecognised, на английском языке. Описание произведения, (предисловие) а так же отзывы посетителей доступны на портале библиотеки ЛибКат.

Читать книгу

Название:
Data Analytics in Bioinformatics
Автор:
Неизвестный Автор
Жанр:
unrecognised / на английском языке
Год:
неизвестен
ISBN:
нет данных
Рейтинг книги:
5 / 5. Голосов: 1
Избранное:

Добавить в избранное
Отзывы:
Написать комментарий
Ваша оценка:
- 100
- 1
- 2
- 3
- 4
- 5

Data Analytics in Bioinformatics: краткое содержание, описание и аннотация

Предлагаем к чтению аннотацию, описание, краткое содержание или предисловие (зависит от того, что написал сам автор книги «Data Analytics in Bioinformatics»). Если вы не нашли необходимую информацию о книге — напишите в комментариях, мы постараемся отыскать её.

Machine learning techniques are increasingly being used to address problems in computational biology and bioinformatics. Novel machine learning computational techniques to analyze high throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and future drug discovery. Machine learning techniques such as Markov models, support vector machines, neural networks, and graphical models have been successful in analyzing life science data because of their capabilities in handling randomness and uncertainty of data noise and in generalization. Machine Learning in Bioinformatics compiles recent approaches in machine learning methods and their applications in addressing contemporary problems in bioinformatics approximating classification and prediction of disease, feature selection, dimensionality reduction, gene selection and classification of microarray data and many more.

Data Analytics in Bioinformatics — читать онлайн ознакомительный отрывок

Ниже представлен текст книги, разбитый по страницам. Система сохранения места последней прочитанной страницы, позволяет с удобством читать онлайн бесплатно книгу «Data Analytics in Bioinformatics», без необходимости каждый раз заново искать на чём Вы остановились. Поставьте закладку, и сможете в любой момент перейти на страницу, на которой закончили чтение.

Тёмная тема

Шрифт:

↓

↑

Сбросить

Интервал:

↓

↑

Закладка:

Сделать

Drawback—This algorithm is incapable in dealing with closely connected clusters and cannot deal with high dimensional data [14, 15].

2.3.3.3 Intelligent Kernel k-Mean (IKKM)

Generally gene expression data is nonlinearly separable and the gene data is represented in a matrix format using linear kernel function as shown in Figure 2.2. IKKM is applied on the kernel matrix in which clusters need not be predefined.

The algorithmic approach of IKKM is carried by calculating the centre of mass of the entire dataset and then identifies the element (×1) which is distant from the centre of mass (C) another element (×2) is calculated distant from c1 is calculated. Now the elements near to c1 are moved into cluster (S1) and the elements near to c2 are moved into another cluster (S2). This process is repeated until no change is observed in the clusters. In contrast to traditional k-means approach this algorithm implementation doesn’t require predicting the number of clusters in advance.

Drawback—Cannot deal with high dimensional dataset [16].

2.3.3.4 Clustering Large Applications (CLARA)

Rather than considering and calculating medoids for the entire dataset CLARA considers a subset from the dataset with a constant size upon which PAM(partitioning around medoids) algorithm is applied. PAM algorithm aims at reducing the degree of similarity by calculating the distance between the cluster medoids and the elements in the subset space. As a result an optimal set of medoids are obtained which requires less store space when compared to other algorithms. In contrast to other partitioning algorithms this measures the degree of dissimilarity amongst the elements and the medoid of the cluster. This algorithm is applied on huge datasets producing accurate results with less computation time [17].

Drawback—Cannot deal with higher dimensional data and closely connected clusters which leaves this algorithm a second choice in gene expression data.

2.3.4 Hierarchical Clustering Algorithms

Hierarchical clustering technique is again divided into two categories

Agglomerative clustering—This follows a bottom up approach as depicted in Figure 2.4(a). Initially, every other object in the dataset is assumed as a single cluster. Distance measures are computed upon each object and based upon the similarities they are merged to form clusters.

Figure 2.4(a) Agglomerative clustering, (b) divisive clustering.

Divisive Clustering—This follows top down approach as depicted in Figure 2.4(b). Initially, it considers all the data points as single cluster or model. The model separation is performed recursively until required criteria is met based upon the model.

Various algorithms of hierarchical clustering are described below.

2.3.4.1 AGNES (Agglomerative Nesting)

In this algorithm subsequent merging operations are performed on the dataset to arrange them into clusters. Dissimilarity coefficients are acquired using either Euclidean or Manhattan distance metrics [17]. These measured dissimilarities forms a matrix using unweighted pair group method with arithmetic mean [18].

Drawbacks of this method are that the objects once merged cannot be separated and different computation metrics leads to varying results.

2.3.4.2 DIANA (Divisive Analysis)

This follows divisive clustering approach in which the whole dataset is divided into 2 sub parts (clusters) and these clusters are further separated into sub clusters until no splitting can be performed further [17].

Drawback—once separated clusters cannot be merged, this cannot handle huge dataset [19]. This may not be applicable for gene expression data as it cannot handle missing data [17].

2.3.4.3 CURE (Clustering Using Representatives)

This algorithm operates on defining stable scatter points which are able to detect the shape and extent of a cluster. These scatter points based upon the similarities moves towards centroid in order to form a cluster. The scatter points capture the shape and extent of clusters which enables to detect non-spherical clusters. This algorithm is applied to gene expression data which produces the clusters efficiently [20].

2.3.4.4 CHAMELEON

This algorithm is based on dynamic modeling which is executed in two stages. In the first stage graph partitioning algorithm is applied resulting into smaller clusters. As graph partitioning technique is applied the elements in the clusters are strongly connected. This establishes a quality separation among the clusters by discarding the noisy data. The later stage is executed with the application of agglomerative hierarchical clustering combining all the smaller clusters iteratively. This algorithm doesn’t fall short of incorrect merging because of its 2 stage execution. And also CHAMELEON overcomes all the shortcomings of AGNES and DIANA in which the data once split cannot be merged [21].

2.3.4.5 BRICH (Balanced Iterative Reducing and Clustering Using Hierarchies)

In this algorithmic approach entire data in the dataset is maintained in the form of a tree structure called as CF (clustering Feature) which is based on clustering feature. The CF tree also holds the information of clustering. As the outliers are eliminated the sub clusters data is made as leaf nodes. With iterative updating of CF BRICH provides the potential to deal with huge datasets. This works effectively for gene expression data [22].

Drawback—This algorithm gets skewed with the presence of nonspherical clusters [22].

2.3.5 Density-Based Approach

Density based clustering algorithmic approach is composed by splitting the low point density regions and high point density regions in a given data space. The low point density region is observed as outliers [26].

2.3.5.1 DBSCAN

The DBSCAN algorithm proposed by Ref. [27] is based on the following criterion:

Eps(€)—this depicts the space around the data point, if the difference of distance between 2 data points is greater than or equal to € then they are declared as neighbors, if the (€) value is very small then they are termed as outliers, if the (€) value is high then that data point is merged to form cluster.

Minpts—this represents the minimum number of data points within the radius (€) for the larger datasets like microarrays Minpts are obtained from the dimensions of the dataset.

From the following definition of data points clusters and outliers are identified

Core point—presence of higher Minpts within radius of (€)

Border point—presence of less Minpts within radius of (€) but close to core point

Outlier—point which is neither core nor border point.

2.3.6 Model-Based Approach

This clustering algorithm handles the shortcomings of K-Means and Fuzzy K-Means. In this approach a probabilistic model is developed to which the data from the dataset is fit into [23]. Self-Organizing Maps implement the model based approach technique.

2.3.6.1 SOM (Self-Organizing Maps)

This algorithm is popularly known in gene expression data clustering in which the expression values are closely related. SOM adapts a neural network template with a single layer in which data is split over neurons as shown in Figure 2.5. As every neuron is connected to every other neuron and is associated with a reference vector, the data objects are plotted to the nearest reference vectors to form quality clusters [24].