Learning means the acquisition of information or skills through study or experience. Based on this, we can characterize machine learning (ML) as follows −
It can be defined as a field of computer science, more specifically an application of artificial intelligence, that gives computer systems the ability to learn from data and improve with experience without being explicitly programmed.
Basically, the primary focus of machine learning is to let computers learn automatically, without human intervention. How does such learning start? It starts with observations of data. The data can be examples, instructions, or direct experience. On the basis of this input, the machine makes better decisions by looking for patterns in the data.
Machine learning algorithms help a computer system learn without being explicitly programmed. These algorithms are commonly categorized as supervised, unsupervised, or reinforcement learning. Let us now look at each category −
Supervised learning is the most commonly used type of machine learning. It is called supervised because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. In this kind of ML algorithm, the possible outcomes are already known and the training data is labeled with the correct answers. It can be understood as follows −
Suppose we have input variables x and an output variable Y, and we apply an algorithm to learn the mapping function from the input to the output, for example −
Y = f(x)
Now, the fundamental goal is to approximate the mapping function so well that when we have new input data (x), we can predict the output variable (Y) for that data.
Supervised learning problems can be divided into the following two kinds −
Classification − A problem is called a classification problem when the output is a category, such as “black”, “teaching”, “non-teaching”, etc.
Regression − A problem is called a regression problem when the output is a real value, such as a distance or a weight in kilograms.
Decision tree, random forest, KNN, and logistic regression are examples of supervised machine learning algorithms.
As the name suggests, unsupervised machine learning algorithms do not have any supervisor to provide guidance. That is why they are sometimes described as closely aligned with what some call true artificial intelligence. It can be understood as follows −
Suppose we have input variables x. Unlike in supervised learning, there are no corresponding output variables.
In simple words, in unsupervised learning there is no right answer and no teacher for guidance. The algorithms find interesting patterns in the data on their own.
Unsupervised learning problems can be divided into the following two kinds −
Clustering − In clustering problems, we need to discover the inherent groupings in the data. For example, grouping customers by their buying behavior.
Association − A problem is called an association problem when it requires discovering rules that describe large portions of the data. For example, finding that customers who buy x also tend to buy y.
K-means for clustering and the Apriori algorithm for association are examples of unsupervised machine learning algorithms.
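To make the association idea concrete, the following is a minimal sketch that counts how often pairs of items occur together and how often buyers of one item also buy another. It is not the Apriori algorithm itself, only the two quantities (support and confidence) that such rules are usually built on. The baskets and the function names `pair_support` and `confidence` are hypothetical, invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}


def confidence(transactions, x, y):
    """Estimate how often a transaction containing x also contains y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)


support = pair_support(baskets)
print(support[("bread", "milk")])            # 0.5: two of the four baskets
print(confidence(baskets, "bread", "milk"))  # two of the three bread baskets
```

A rule such as "customers who buy bread also tend to buy milk" would typically be kept only if both numbers exceed thresholds chosen by the analyst.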
Reinforcement learning algorithms are used far less often than the other two kinds. They train a system to make specific decisions. The machine is exposed to an environment in which it trains itself continually by trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate decisions. The Markov decision process is the standard framework for modeling reinforcement learning problems.
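The trial-and-error idea can be sketched with the simplest possible setting, a multi-armed bandit, in which there is a single state and the learner only has to find out which action pays best. This is a deliberately reduced illustration, not a full Markov decision process. The reward values, the function name `run_bandit`, and the epsilon-greedy strategy are assumptions made for this example.

```python
import random


def run_bandit(rewards, trials=200, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays best (epsilon-greedy)."""
    rng = random.Random(seed)
    n_actions = len(rewards)
    counts = [0] * n_actions
    values = [0.0] * n_actions  # running average reward per action

    for step in range(trials):
        if step < n_actions:
            action = step  # try every action once before anything else
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore a random action
        else:
            # Exploit the action that currently looks best.
            action = max(range(n_actions), key=lambda a: values[a])

        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Incremental update of the average reward for this action.
        values[action] += (reward - values[action]) / counts[action]

    return values


# Hypothetical, deterministic rewards for three actions.
estimates = run_bandit(rewards=[0.2, 1.0, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)  # 1, the action with the highest reward
```

In a realistic problem the rewards would be noisy and would depend on the current state, which is what the Markov decision process formalism adds.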
In this part, we will learn about the most common machine learning algorithms. The algorithms are described below −
Linear regression is one of the most well-known algorithms in statistics and machine learning.
Basic concept − Linear regression is a linear model that assumes a linear relationship between the input variables, say x, and a single output variable, say y. In other words, y can be calculated from a linear combination of the input variables x. The relationship between the variables is established by fitting a best-fit line.
Linear regression is of the following two types −
Simple linear regression − Linear regression is called simple linear regression if it has just a single independent variable.
Multiple linear regression − Linear regression is called multiple linear regression if it has more than one independent variable.
Linear regression is mainly used to estimate real values based on continuous variable(s). For example, the total sales of a shop in a day can be estimated by linear regression.
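As an illustration of fitting a best-fit line, here is a minimal sketch of simple linear regression using the ordinary least squares formulas for the slope and intercept. The data (number of customers against daily sales) and the function name `fit_simple_linear_regression` are hypothetical and chosen only so the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: customers per day and total sales for that day.
customers = [10, 20, 30, 40]
sales = [25.0, 45.0, 65.0, 85.0]  # lies exactly on y = 2x + 5

slope, intercept = fit_simple_linear_regression(customers, sales)
predicted_sales = slope * 50 + intercept  # estimate for a day with 50 customers
print(slope, intercept, predicted_sales)  # 2.0 5.0 105.0
```

With more than one independent variable the same idea applies, but the coefficients are usually found with matrix methods rather than these two closed-form expressions.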
Logistic regression is a classification algorithm, also known as logit regression.
It is used to estimate discrete values such as 0 or 1, true or false, yes or no, based on a given set of independent variables. It does so by predicting a probability, so its output lies between 0 and 1.
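The reason the output lies between 0 and 1 is that logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function. The sketch below shows only the prediction step. The weights and bias are hypothetical values picked for illustration, whereas in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of class 1 for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a discrete 0/1 prediction."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical, hand-picked parameters (not learned from data).
weights = [1.5, -2.0]
bias = 0.25

print(sigmoid(0.0))                              # 0.5
print(predict_label([2.0, 0.5], weights, bias))  # 1
print(predict_label([0.0, 1.0], weights, bias))  # 0
```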
A decision tree is a supervised learning algorithm that is mostly used for classification problems.
Basically, it is a classifier expressed as a recursive partition of the data based on the independent variables. A decision tree has nodes that form a rooted tree, which is a directed tree with a node called the “root”. The root has no incoming edges, and every other node has exactly one incoming edge. Nodes with outgoing edges are called internal or decision nodes, and nodes without outgoing edges are called leaves. For example, consider the following decision tree to see whether a person is fit or not.
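In code, a decision tree of this kind amounts to nested conditions: each test is a decision node and each returned value is a leaf. The sketch below is hand-written, and its features and thresholds (`age`, `eats_fast_food_often`, `exercises_regularly`) are hypothetical, chosen only for illustration. A real decision tree algorithm would learn the tests from training data.

```python
def is_fit(age, eats_fast_food_often, exercises_regularly):
    """Walk a tiny hand-written decision tree and return 'fit' or 'unfit'."""
    if age < 30:                      # root node
        if eats_fast_food_often:      # decision node
            return "unfit"            # leaf
        return "fit"                  # leaf
    if exercises_regularly:           # decision node
        return "fit"                  # leaf
    return "unfit"                    # leaf


print(is_fit(25, True, False))   # unfit
print(is_fit(45, False, True))   # fit
```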
Support vector machine (SVM) is used for both classification and regression problems, but mainly for classification. The main concept of SVM is to plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Here n is the number of features. Following is a basic graphical representation to understand the idea of SVM −
In the above diagram, we have two features, so we first plot these two variables in two-dimensional space, where each point has two coordinates. The points that lie closest to the separating line are called support vectors. The line splits the data into two differently classified groups. This line is the classifier.
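The role of the separating line can be sketched as follows. This example does not train an SVM. It assumes a line is already given (the hypothetical weights `w` and offset `b`) and only shows how a point is classified by the side of the line it falls on, and how far each point lies from the line. In a trained SVM, the points nearest to the line are the support vectors.

```python
import math


def decision_value(point, w, b):
    """Signed score w.x + b; its sign tells which side of the line we are on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def classify(point, w, b):
    """Return +1 or -1 depending on the side of the line."""
    return 1 if decision_value(point, w, b) >= 0 else -1


def distance_to_line(point, w, b):
    """Perpendicular distance from the point to the line w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(point, w, b)) / norm


# Hypothetical separating line x1 - x2 = 0 and a few two-feature points.
w, b = [1.0, -1.0], 0.0
points = [(3.0, 1.0), (4.0, 1.0), (1.0, 3.0), (1.0, 5.0)]

labels = [classify(p, w, b) for p in points]
print(labels)  # [1, 1, -1, -1]

# The point with the smallest distance is the one nearest the boundary.
closest = min(points, key=lambda p: distance_to_line(p, w, b))
print(closest, distance_to_line(closest, w, b))
```

Training an SVM means choosing `w` and `b` so that the distance from the line to the nearest points of each class is as large as possible.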
Naïve Bayes is also a classification technique. The logic behind it is to use Bayes' theorem for building classifiers. The assumption is that the predictors are independent. In simple words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Below is the equation for Bayes' theorem −
$$P\left ( A \mid B \right ) = \frac{P\left ( B \mid A \right )P\left ( A \right )}{P\left ( B \right )}$$
The Naïve Bayes model is easy to build and particularly helpful for large data sets.
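As a small worked example of Bayes' theorem, the sketch below computes the probability of a class A given an observed feature B. The scenario (A is "the message is spam", B is "the message contains a certain word") and all the probabilities are hypothetical numbers chosen for illustration. The denominator P(B) is expanded with the law of total probability.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) using Bayes' theorem."""
    # P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of messages are spam, the word appears in
# 50% of spam messages and in 5% of non-spam messages.
p_spam_given_word = posterior(
    prior_a=0.2,
    likelihood_b_given_a=0.5,
    likelihood_b_given_not_a=0.05,
)
print(round(p_spam_given_word, 3))  # 0.714
```

A Naïve Bayes classifier applies the same calculation with several features at once, multiplying their individual likelihoods because of the independence assumption.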
K-nearest neighbors (KNN) is used for both classification and regression problems, though it is most widely used for classification. The main concept of this algorithm is that it stores all the available cases and classifies a new case by a majority vote of its k nearest neighbors. The case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function. The distance function can be Euclidean, Minkowski, or Hamming distance. Consider the following when using KNN −
Computationally, KNN is more expensive than other algorithms used for classification problems.
Variables need to be normalized; otherwise, variables with a larger range can bias the result.
In KNN, we need to spend more effort on the pre-processing stage, for example on noise removal.
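The majority-vote idea can be sketched in a few lines. This is a minimal illustration using Euclidean distance, with hypothetical training points and labels, and the function name `knn_classify` is an assumption of this example. It does no normalization or noise removal, which the points above say a real application would need.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Sort all stored cases by their distance to the query and keep the first k.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (2.0, 2.0)))  # A
print(knn_classify(points, labels, (6.0, 7.0)))  # B
```

Because every prediction computes the distance to every stored case, the cost grows with the size of the training set, which is the computational expense mentioned above.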
As the name suggests, K-means clustering is used to solve clustering problems. It is a type of unsupervised learning. The main logic of the K-means clustering algorithm is to partition the data set into a number of clusters. Follow these steps to form clusters by K-means −
K-means picks k points, one for each cluster, known as centroids.
Each data point then forms a cluster with its nearest centroid, giving k clusters.
It then recomputes the centroid of each cluster based on the current cluster members.
The assignment and recomputation steps are repeated until convergence occurs, that is, until the centroids no longer change.
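The steps above can be sketched as follows. This is a minimal illustration rather than a production implementation. It assumes the first k data points are used as the initial centroids (real implementations usually pick them randomly or with a smarter scheme), and the data and helper names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iters=100):
    """Return (centroids, clusters) after running K-means."""
    # Step 1: pick k initial centroids (assumption: the first k points).
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)

        # Step 3: recompute each centroid from its current members.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]

        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, clusters


# Hypothetical data: two visibly separate groups.
data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
print(clusters)
```

The result of K-means depends on the initial centroids, so in practice the algorithm is often run several times with different starting points.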
Random forest is a supervised learning algorithm. Its advantage is that it can be used for both classification and regression problems. It is a collection of decision trees (i.e., a forest), or in other words an ensemble of decision trees. The basic concept of random forest is that each tree gives a classification and the forest chooses the classification that receives the most votes. Following are the benefits of the random forest algorithm −
A random forest can be used for both classification and regression tasks.
It can handle missing values.
Adding more trees to the forest does not by itself cause the model to overfit.
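The voting step of a random forest can be sketched as below. The three "trees" are hand-written stand-ins with hypothetical features, used only to show how individual classifications are combined. A real random forest would train each tree on a random sample of the data and of the features.

```python
from collections import Counter


# Hypothetical stand-ins for trained decision trees.
def tree_a(sample):
    return "fit" if sample["age"] < 30 else "unfit"


def tree_b(sample):
    return "fit" if sample["exercises"] else "unfit"


def tree_c(sample):
    return "unfit" if sample["eats_fast_food"] else "fit"


def forest_predict(trees, sample):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_a, tree_b, tree_c]

person_1 = {"age": 25, "exercises": False, "eats_fast_food": True}
person_2 = {"age": 40, "exercises": True, "eats_fast_food": False}

print(forest_predict(forest, person_1))  # unfit (two votes against one)
print(forest_predict(forest, person_2))  # fit (two votes against one)
```

For regression tasks the same structure applies, except that the forest averages the numeric outputs of the trees instead of counting votes.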