3 Data transformation. Data transformation plays an important role in making data usable so that it can be fed to the model. It is performed to increase the probability of algorithms making precise and meaningful predictions. There are several data transformation techniques:

Categorical encoding. Machine learning models are mathematical models that work with numeric representations, but categorical data consists of label values. Many ML algorithms require input and output variables to be numeric, so categorical data is converted to numerical data in one of two ways:

Integer encoding. Also known as label encoding, it assigns an integer value to every unique category, and machine learning algorithms can use this mapping between integer and category. For example, your qualification can be high school, college, or postgraduate. If we assign an integer to each of these categories, such as high school = 1, college = 2, postgraduate = 3, the data becomes machine-readable.

One-hot encoding. One-hot encoding is used when the categorical variable has no ordinal relationship. In one-hot encoding, a binary value is assigned to each unique value, with each value in its own column. For example, Facebook has a relationship-status feature: engaged, married, separated, divorced, or widowed. Each status gets its own column, so an "engaged" profile is assigned the value 1 in the "engaged" column and 0 in the remaining four columns.

Dealing with skewed data. If your data is not symmetric, meaning one half of the distribution is not the mirror image of the other half, the data is considered asymmetric or skewed. To discover patterns in skewed data, you apply a log transformation, a reciprocal (positive or negative), or a Box-Cox transformation over the whole set of values. The transformed data can then be used in a statistical model.

Bias mitigation. Bias mitigation can be done by altering training data values and labels to obtain a less biased model. Some algorithms that can help in this process are reweighing, optimized preprocessing, learning fair representations, and the disparate impact remover.

Scaling. When you use regression algorithms or algorithms based on Euclidean distances, you need to transform your data into a particular range. This is done by altering the values of each numerical feature to put them on a common scale, using either normalization (min-max scaling) or z-score standardization.
4 Feature engineering. Feature engineering is the process of constructing explanatory variables and features from raw data so that the inputs are in a machine-readable format. These variables and features are used to train the model. This step requires a clear understanding of the data. Feature engineering involves two activities:

Feature extraction. This is done to reduce the processing resources required without losing relevant information. It combines variables into features, reducing the amount of data to process while still accurately describing the original dataset.

Capturing feature relationships. With a better understanding of your data, you can identify relationships between features and thereby help your algorithm focus on what you know is important.
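As a small sketch of feature extraction, two raw variables can be combined into a single derived feature that still describes the behavior of interest. The column names and values below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical raw records: two variables per customer.
raw_records = [
    {"total_purchases": 10, "days_active": 5},
    {"total_purchases": 3, "days_active": 30},
]

def purchases_per_day(record):
    """Combine two raw variables into one rate feature."""
    # Guard against division by zero for brand-new accounts.
    if record["days_active"] == 0:
        return 0.0
    return record["total_purchases"] / record["days_active"]

features = [purchases_per_day(r) for r in raw_records]
```

The model now sees one informative feature per customer instead of two raw columns, which is exactly the reduction feature extraction aims for.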
Select Algorithm and Model (Modeling)
Having completed the tough part, data selection and data pre-processing, we now move to the interesting part: modeling.
Modeling is an iterative process of creating a smart model by continuously training and testing the model until you discover the one with high accuracy and performance.
To train an ML model, you need to provide an ML algorithm with a clean training dataset to learn from. Choosing a learning algorithm depends on the problem at hand. The training data that you are planning to feed to the ML algorithm must contain the target attribute. ML algorithms find patterns in training data and learn from it. This ML model is then tested with new data to make predictions for the unknown target attribute. Let's understand it with an example.
Suppose you want to train your model to separate spam from your regular email. To do so, you provide your learning algorithm with training data that contains a whitelist and a blacklist. The whitelist contains the email addresses of people you want to receive email from. The blacklist contains the addresses of senders you want to avoid receiving email from. The ML algorithm learns from this training data and predicts whether a new mail comes from the blacklist or the whitelist. If it comes from the blacklist, the mail is automatically labeled as spam.
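The whitelist/blacklist idea can be sketched as a trivial rule-based labeler. The addresses below are made up, and a real spam filter would of course learn features from the mail content rather than rely on fixed lists; this only mirrors the example in the text.

```python
# Hypothetical training lists from the spam example.
white_list = {"friend@example.com", "boss@example.com"}
black_list = {"promo@spam.example", "lottery@spam.example"}

def label_email(sender):
    """Label a mail by its sender, using the known lists."""
    if sender in black_list:
        return "spam"
    if sender in white_list:
        return "inbox"
    # A trained model would predict here from learned features.
    return "unknown"

print(label_email("promo@spam.example"))  # spam
print(label_email("boss@example.com"))    # inbox
```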
To create an effective model, it is important to select an accurate algorithm that can find predictable, repeatable patterns. On the one hand, some problems that need ML are very specific and require a unique approach. On the other hand, some problems call for a trial-and-error approach.
Machine learning algorithms are divided into four main types (see Figure 1.5):
1 Supervised learning
2 Unsupervised learning
3 Semi‐supervised learning
4 Reinforcement learning
Let's learn them one by one:
1 Supervised learning. In this learning approach, the machine is trained on well-labeled data and makes predictions with the help of that labeled dataset.

FIGURE 1.5 Machine learning algorithms.

What is labeled data? Data for which you already know the target answer is called labeled data. For example, if I show you an image and tell you that it is a butterfly, that is labeled data. However, if I show you an image without telling you what it is, that is unlabeled data.

Now let's understand, with an example, how labeled data makes a machine learn. We have images labeled as spoon and knife; we feed them to the machine, which analyzes and learns the association between these images and their labels based on features such as shape, size, and sharpness. Now, if a new image is fed to the machine without any label, the past data helps the machine predict accurately whether it is a spoon or a knife. Thus, in supervised machine learning, the algorithm teaches the model to learn from the labeled examples that we provide.

Supervised learning consists of two techniques: classification and regression.

Classification. Used when the output variable is categorical, such as red or blue, disease or no disease, male or female. For example: will I get an increment or not?

Regression. A regression problem is one in which the output variable is a real or continuous value, for example, salary based on work experience or weight based on height. It creates predictive models showing trends in data. For example: how much increment will I get?

The following is a list of commonly used algorithms in supervised learning:

Nearest neighbor
Naive Bayes
Decision trees
Linear regression
Support vector machines (SVM)
Neural networks
Logistic regression
Linear discriminant analysis
Similarity learning
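To make the spoon/knife example concrete, here is a minimal 1-nearest-neighbor classifier, one of the supervised algorithms listed above. The two features (roundness and sharpness) and their values are invented purely for illustration.

```python
# Labeled training examples: ((roundness, sharpness), label).
training = [
    ((1.0, 0.2), "spoon"),
    ((0.9, 0.1), "spoon"),
    ((0.2, 0.9), "knife"),
    ((0.1, 1.0), "knife"),
]

def predict(point):
    """Return the label of the closest labeled example."""
    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda t: dist2(t[0], point))[1]

print(predict((0.15, 0.95)))  # knife
```

A new, unlabeled image (here just a pair of feature values) is classified by looking up the most similar labeled example, which is exactly the learning-from-labels idea described above.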
2 Unsupervised learning. In this learning approach, no labeled training data is given to the machine, so it must act on data that is not labeled. The machine tries to identify patterns on its own and provide predictions. Take the spoon-and-knife example again, but this time we do not tell the machine which is which. The machine identifies patterns in the set by itself and forms groups based on patterns, similarities, differences, and so on.

Unsupervised learning consists of two techniques: clustering and association.

Clustering. In clustering, the machine forms groups based on the behavior of the data. For example: which customers made similar product purchases?

Association. An area of machine learning that identifies interesting relationships between variables in large datasets. For example: which products were purchased together?

The following is a list of commonly used algorithms in unsupervised learning:

k-means clustering
Association rules
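A tiny k-means sketch shows how clustering groups data without any labels. The one-dimensional purchase amounts, k = 2, the naive initialization, and the fixed iteration count are all simplifying assumptions for illustration; library implementations handle initialization and convergence far more carefully.

```python
# Unlabeled data: hypothetical purchase amounts from six customers.
points = [1.0, 1.2, 0.8, 8.0, 8.5, 9.0]
centroids = [points[0], points[3]]  # naive initialization with two seeds

for _ in range(10):
    clusters = [[], []]
    for p in points:
        # Assign each point to its nearest centroid.
        idx = min((abs(p - c), i) for i, c in enumerate(centroids))[1]
        clusters[idx].append(p)
    # Move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]
```

Without ever being told which customer belongs where, the algorithm separates the low spenders from the high spenders, which is the "machine forms groups by itself" behavior described above.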