Data Science is the future. According to Forbes, Machine learning patents grew at a 34% Rate between 2013 and 2017 and this is only set to increase in coming times. Data Science is a detailed study of the flow of information from the colossal amounts of data present in an organization’s repository. It involves obtaining meaningful insights from raw and unstructured data which is processed through analytical, programming, and business skills. So, here lets look into top Algorithms used in it.
Contents of Tutorial
1>> Linear Regression (LIR)
Linear regression is one of the most well-known Algorithms for Data Science in statistics and Machine Learning.
Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. It will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.
The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y). Finding specific weightings for the input variables is called coefficients (B).
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
Linear regression has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar and to remove noise from your data, if possible. It is a fast and simple technique and good first algorithm to try.
2>> Logistic Regression (LOR)
Logistic regression is another technique in the field of Data science. It is the go-to method for binary classification problems.
Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 and predict a class value.
Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It’s a fast model to learn and effective on binary classification problems.
The Apriori is Algorithms for Data Science and used in a transactional database to mine frequent item sets and then generate association rules. It is popularly used in market basket analysis, where one checks for combinations of products that frequently co-occur in the database. In general, we write the association rule for ‘if a person purchases item X, then he purchases item Y’ as : X -> Y.
The Support measure helps prune the number of candidate item sets to be considered during frequent item set generation. This support measure is guided by the Apriori principle. The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent.
We start by choosing a value of k. Here, let us say k = 3. Then, we randomly assign each data point to any of the 3 clusters. Compute cluster centroid for each of the clusters. The red, blue and green stars denote the centroids for each of the 3 clusters.
Next, reassign each point to the closest cluster centroid. In the figure above, the upper 5 points got assigned to the cluster with the blue centroid. Follow the same procedure to assign points to the clusters containing the red and green centroids.
Then, calculate centroids for the new clusters. The old centroids are gray stars; the new centroids are the red, green, and blue stars.
4>> Classification And Regression Trees (CRT)
Decision Trees are an important type of algorithm for predictive modeling, machine learning and Data science analysis.
The non-terminal nodes of Classification and Regression Trees are the root node and the internal node. The terminal nodes are the leaf nodes. Each non-terminal node represents a single input variable (x) and a splitting point on that variable; the leaf nodes represent the output variable (y). The model is used as follows to make predictions: walk the splits of the tree to arrive at a leaf node and output the value present at the leaf node.
The representation of the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable.
The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and output the class value at that leaf node.
Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.
5 >> Naive Bayes (NB)
Naive Bayes is a simple but surprisingly powerful algorithm Algorithms for Data Science for predictive modeling.
The model is comprised of two types of probabilities that can be calculated directly from your training data: 1) The probability of each class; and 2) The conditional probability for each class given each x value. Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem. When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.
Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data, nevertheless, the technique is very effective on a large range of complex problems.
6>> K-Nearest Neighbors (KNN)
The KNN algorithm is very simple and very effective Algorithms for Data Science. The model representation for KNN is the entire training dataset. It is an iterative algorithm that groups similar data into clusters. It calculates the centroids of k clusters and assigns a data point to that cluster having least distance between its centroid and the data point.
Predictions are made for a new data point by searching through the entire training set for the K most similar instances. For regression problems, this might be the mean output variable, for classification problems this might be the mode class value.
When an outcome is required for a new data instance, the KNN algorithm goes through the entire data set to find the k-nearest instances to the new instance. The k number of instances most similar to the new record, and then outputs the mean of the outcomes or the mode for a classification problem. The value of k is user-specified.
KNN can require a lot of memory or space to store all of the data, but only performs a calculation when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.
7>> Learning Vector Quantization (LVQ)
A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm is an artificial neural network algorithm.It allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. After learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor is found by calculating the distance between each codebook vector and the new data instance. The class value or for the best matching unit is then returned as the prediction. Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
8 >> Support Vector Machines (SVM)
Support Vector Machines are perhaps one of the most popular and talked about Algorithms for Data Science.
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class1.In two-dimensions, you can visualize this as a line and let’s assume that all of our input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that results in the best separation of the classes by the hyperplane.
The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin. Only these points are relevant in defining the hyperplane and in the construction of the classifier. These points are called the support vectors. They support or define the hyperplane. In practice, an optimization algorithm is used to find the values for the coefficients that maximizes the margin.
SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.
9>> Bagging And Random Forest (BRF)
Random Forest is one of the most popular and most powerful Algorithms for Data Science. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity from a data sample. Such as a mean. You take lots of samples of your data, calculate the mean, then average all of your mean values to give you a better estimation of the true mean value.
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees. Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged.
Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points. Suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be. Combining their predictions results in a better estimate of the true underlying output value.
If you get good results with an algorithm with high variance, you can often get better results by bagging that algorithm.
10>> Boosting and Adaboost (BA)
Adaboost is effective Algorithms for Data Science and stands for Adaptive Boosting. Bagging is a parallel ensemble because each model is built independently. On the other hand, boosting is a sequential ensemble where each model is built based on correcting the misclassifications of the previous model.
Bagging mostly involves ‘simple voting’, where each classifier votes to obtain a final outcome– one that is determined by the majority of the parallel models; boosting involves ‘weighted voting’, where each classifier votes to obtain a final outcome which is determined by the majority– but the sequential models were built by assigning greater weights to misclassified instances of the previous models.
In the above Figure, steps 1, 2, 3 involve a weak learner called a decision stump (a 1-level decision tree making a prediction based on the value of only 1 input feature; a decision tree with its root immediately connected to its leaves).
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
Conclusion: Algorithms for Data Science
To recap, we have covered some of the the most important machine learning algorithms for data science:
- 5 supervised learning techniques- Linear Regression, Logistic Regression, KNN, CRT, Navie Byes.
- 3 unsupervised learning techniques- Apriori, LVQ, SVM.
- 2 ensembling techniques- Bagging with Random Forests, Boosting with AdaBoost.