Machine Learning Techniques and Analytics for Cloud Security

This book covers new methods, surveys, case studies, and policy with almost all machine learning techniques and analytics for cloud security solutions
Audience The aim of Machine Learning Techniques and Analytics for Cloud Security

3.3.5 Illustration

The available dataset exists in two states, G_N and G_C, where G_N denotes the dataset of the non-cancerous state and G_C denotes the dataset of the cancerous state. The designed algorithm is examined on both the lung and colon datasets. G_N and G_C are combined into one dataset G. The dataset is then transposed, i.e., rows become columns and columns become rows. A target variable Y is chosen, and the dataset is divided into dependent (Y) and independent (X) data.
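The preparation step can be sketched as follows. This is a minimal illustration on synthetic values, assuming the raw matrices are stored genes-by-samples and that Y is the sample state label (0 for non-cancerous, 1 for cancerous); the text does not state either detail, and the sizes used here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression matrices in genes-by-samples layout (synthetic values).
n_genes = 50
G_N = rng.normal(size=(n_genes, 10))   # non-cancerous samples
G_C = rng.normal(size=(n_genes, 86))   # cancerous samples

# Combine both states into one dataset G, then transpose so that
# rows are samples and columns are genes.
G = np.hstack([G_N, G_C])              # shape: (n_genes, 96)
X = G.T                                # shape: (96, n_genes), independent data

# Assumed target variable Y: 0 = non-cancerous, 1 = cancerous (dependent data).
Y = np.concatenate([
    np.zeros(G_N.shape[1], dtype=int),
    np.ones(G_C.shape[1], dtype=int),
])

print(X.shape, Y.shape)
```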

In each of M iterations, a group of five genes is selected at random from the independent (X) data. These five selected genes now form X, i.e., the independent data, while Y, i.e., the dependent data, remains the same as before. The resulting X and Y data are then divided into training and test data in an 80:20 ratio.

After the split into training and test data, the features of the dataset are scaled to unit scale. Then, PCA is fitted onto the training and test data of X to retain 95% of the variance. Next, LR is fitted on the training data of X and Y, and the predicted values are calculated using the test data of X. Finally, the accuracy score is calculated by comparing the test data of Y with the predicted values. If the accuracy is found to be more than 85%, those genes are considered cancer mediating genes and are stored in a new list as the result set.
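One possible reading of this loop, written with scikit-learn, is sketched below. It is not the authors' code: the data are synthetic, the value of M, the random seeds, and the stratified split are assumptions, and the scaler and PCA are fitted on the training split only and then applied to the test split, which is a common way to realize the step described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_gene_group(X, Y, gene_idx, seed=0):
    """Test accuracy of scaling + PCA + LR on one group of genes."""
    X_sub = X[:, gene_idx]
    # 80:20 split; stratification is an assumption, not stated in the text.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, Y, test_size=0.2, random_state=seed, stratify=Y)
    # Bring features onto unit scale.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    # Keep as many principal components as needed for 95% of the variance.
    pca = PCA(n_components=0.95, svd_solver="full").fit(X_tr)
    X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)
    clf = LogisticRegression(solver="sag", max_iter=5000, random_state=seed)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


# Synthetic stand-in data: 36 samples, 40 genes, 18 samples per class.
rng = np.random.default_rng(0)
Y = np.array([0] * 18 + [1] * 18)
X = rng.normal(size=(36, 40))
X[Y == 1, :5] += 4.0  # genes 0-4 are made informative on purpose

M = 50  # number of random draws (hypothetical value)
result_set = set()
for _ in range(M):
    genes = rng.choice(X.shape[1], size=5, replace=False)
    if evaluate_gene_group(X, Y, genes) > 0.85:
        result_set.update(int(g) for g in genes)

print(sorted(result_set))
```

Note that a group passes the threshold as a whole, so an uninformative gene drawn together with an informative one also lands in the result set.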

3.4 Result

The output of the proposed algorithm is a set of genes whose expression level changes significantly and which can therefore be regarded as genes correlated with cancer. The algorithm is tested on real datasets available from the NCBI database. Two datasets, viz., lung and colon, have been used for the examination. Data of both normal and carcinogenic states are given as input to the algorithm to generate the target output. The algorithm follows a hybrid approach in which PCA is used to reduce the dimensionality of the dataset. The fitted logistic model is then applied as a binary classifier to detect the collection of genes that may be related to cancer. Our developed PC-LR model is applied on both lung and colon data.

3.4.1 Description of the Dataset

Two datasets, viz., lung and colon, are used to test our algorithm and obtain the output. Human gene expression for lung is measured with microarray experiments, and the data are obtained for tumor and normal samples. A total of 96 samples are collected, of which 86 belong to the tumor state and 10 to the normal state. More specifically, among the 86 samples of lung adenocarcinoma, 67 belong to stage I and 19 to stage III. The ten remaining lung samples are identified as non-neoplastic. The colon data consist of 7,464 genes, with 18 samples in the carcinogenic state and 18 in the normal state. More detailed information can be accessed from the site https://www.ncbi.nlm.nih.gov

3.4.2 Result Analysis

The algorithm is executed with r = 5, i.e., a group of five genes is selected at random at a time. For the lung dataset, each group therefore consists of 5 columns (genes/features) and 96 rows (samples), which are divided into test and training datasets. For colon, the dimensions are 5 and 36, divided in the same manner. Here, the test data consist of 20% of the dataset and the remaining 80% belongs to the training dataset. The dataset is scaled with a standard scaler so that its features are brought onto unit scale. Then, PCA is applied on the selected 5 × 96 matrix. While applying PCA, the variance α is taken as 0.95 and passed as the number-of-components parameter for both the lung and colon datasets.
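The effect of passing 0.95 as the number-of-components parameter can be seen in a small sketch. It assumes scikit-learn's `PCA`, where a fractional `n_components` selects the smallest number of components whose cumulative explained variance exceeds that fraction; the 96 × 5 matrix is synthetic, with one gene made nearly redundant so that fewer than five components suffice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic 96 samples x 5 genes; gene 1 is almost a copy of gene 0.
data = rng.normal(size=(96, 5))
data[:, 1] = data[:, 0] + 0.05 * rng.normal(size=96)

# Standard scaling: zero mean and unit variance per gene.
X_scaled = StandardScaler().fit_transform(data)

# A fractional n_components asks PCA to retain that share of the variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative)
```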

After the dimensionality of the dataset is reduced, LR is applied using the “sag” solver for faster convergence. The model fitted on the training dataset predicts values for the test dataset, and accuracy is calculated by comparing these predicted values with the test labels. When the accuracy is found to be more than 85%, those genes are selected as cancer mediating genes and stored in a new list.

For the lung dataset, 886 genes were selected. When these genes were matched with the genes in the NCBI database, 102 were found to be true positive (TP). For the colon dataset, 207 genes were selected, of which 85 were found to be TP when matched with the NCBI database.

3.4.3 Result Set Validation

The result set genes generated for the lung and colon datasets as correlated with cancer have been validated biologically using the NCBI database. NCBI provides a gene database (http://www.ncbi.nlm.nih.gov/Database) where the list of disease mediating genes corresponding to a specific disease can be obtained. The list is arranged in order of relevance of the genes. We have obtained different sets of genes for lung cancer and colon cancer. The algorithm has selected 886 genes for lung and 207 for colon cancer as mutated genes. For the lung expression data, we have compared this set of genes with 1,067 genes from NCBI and identified 102 genes common to both sets. We call these genes TP genes (Figure 3.4). Thus, 784 (886 − 102) genes are not in the list of genes obtained from NCBI; we denote these genes as false positive (FP). The remaining 965 (1,067 − 102) NCBI genes are identified as false negative (FN). Likewise, for the colon data, 1,223 genes are in the NCBI database, and our algorithm has identified 207 genes. When compared with the NCBI database, 85 genes matched and are marked as TP, 1,138 (1,223 − 85) genes are identified as FN, and 122 (207 − 85) genes are FP (Figure 3.3).
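The comparison above is plain set arithmetic, which the following sketch makes explicit. The gene identifiers in the toy example are hypothetical placeholders; only the counts passed to `counts_from_sizes` are taken from the text.

```python
def compare_with_reference(selected, reference):
    """Split genes into TP, FP, and FN relative to a reference list."""
    selected, reference = set(selected), set(reference)
    tp = selected & reference   # selected and present in the reference list
    fp = selected - reference   # selected but absent from the reference list
    fn = reference - selected   # in the reference list but not selected
    return tp, fp, fn


def counts_from_sizes(n_selected, n_reference, n_common):
    """Return (TP, FP, FN) counts from set sizes alone."""
    return n_common, n_selected - n_common, n_reference - n_common


# Toy example with hypothetical gene identifiers.
tp, fp, fn = compare_with_reference(
    {"g1", "g2", "g3", "g4"}, {"g3", "g4", "g5", "g6", "g7"})
print(sorted(tp), sorted(fp), sorted(fn))

# Counts reported in the text.
lung = counts_from_sizes(886, 1067, 102)
colon = counts_from_sizes(207, 1223, 85)
print("lung  (TP, FP, FN):", lung)
print("colon (TP, FP, FN):", colon)
```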

Special care is needed when developing an efficient algorithm using an ML model with a skewed dataset, and the task becomes even more significant when the dataset concerns cancer detection. For a skewed dataset, accuracy alone cannot decide whether the algorithm is working efficiently. Suppose that 99% of the cases in the dataset show no cancer. In a binary classification problem (predicting 1 if cancer and 0 if no cancer), we can simply predict 0 all the time and get 99% accuracy. If we deploy that model, we will have a 99% accurate ML model that never detects cancer: a person who has cancer will never be detected and will not get treatment. In our problem, we want to detect cancer mediating genes whose expression level changes significantly from the normal state to the cancerous state, so here too accuracy alone is not sufficient.

Other evaluation metrics help with these types of datasets, namely the precision-recall evaluation metrics. The F-score combines the precision and recall of the model and is defined as the harmonic mean of the two. It is commonly used for many kinds of ML models. Moreover, for a binary classification problem, it is important to analyze accuracy vs. F-score to evaluate the efficiency of the model. Accuracy is simply the number of correctly categorized examples divided by the total number of examples. Accuracy can be useful but does not take into account the subtleties of class imbalance or the differing costs of FN and FP. The F-score, on the other hand, is an effective measure when FP and FN have differing costs or when there is a large class imbalance.

Our proposed method works with gene expression data in which the number of genes is very large but the number of genes whose mutation is correlated with cancer is very small. In this case, accuracy would be misleading, since a classifier that labels every gene as not related to cancer would automatically get 90% accuracy but would be useless for the proposed work and would contribute little to real-world applications, especially in the field of medical science. As a result, the F-score, through proper application of precision and recall, has been given importance in evaluating the efficacy of the proposed model.
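The contrast between accuracy and F-score can be checked with a short sketch. The 1,000-case population is a hypothetical instance of the 99% example above, and returning 0.0 for an undefined precision or recall is a convention chosen here, not something the text specifies. The precision, recall, and F-score printed for the lung counts are derived arithmetically from the TP, FP, and FN values in Section 3.4.3 and are not figures reported in the text.

```python
def accuracy(tp, fp, fn, tn):
    """Share of correctly categorized examples."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F-score).

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical skewed dataset: 1,000 cases, 10 with cancer (1%).
# A classifier that always predicts 0 finds no positives at all.
tp, fp, fn, tn = 0, 0, 10, 990
acc = accuracy(tp, fp, fn, tn)
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"always-0 classifier: accuracy={acc:.2f}, recall={r:.2f}, F-score={f1:.2f}")

# Lung counts from Section 3.4.3 (TP=102, FP=784, FN=965).
lung_p, lung_r, lung_f1 = precision_recall_f1(102, 784, 965)
print(f"lung: precision={lung_p:.3f}, recall={lung_r:.3f}, F-score={lung_f1:.3f}")
```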