❓ “What is Supervised Learning? How is it different from other machine learning algorithms like Unsupervised learning and Reinforcement Learning?” — Every Aspiring Data Scientist
In this blog, I’ll discuss supervised learning in depth.
I hope you have a basic understanding of machine learning, its types, and the “Idea behind Machine Learning”.
If you would like to freshen up your basics, read this blog on Introduction to Machine Learning.
What is Supervised Learning?
In supervised learning, the system learns from examples for which the correct output is already known.
The system is given training data with both the input and the respective output columns. Then, the system tries to map the relationship between input and output variables using a hypothesis/model, which is then further used to predict the output accurately on any new input examples. An ideal model tries to generalize the mapping so it can predict the output for unseen instances as well.
For example, suppose I am given this dataset and asked to build a model to predict a house’s sale price. Here, I will keep the columns marked in blue as input columns and train the model on those input features, with the column marked in green as the output column.
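To make the split between input and output columns concrete, here is a minimal sketch in plain Python. The column names and values are made up for illustration and are not the dataset from the image above.

```python
# Hypothetical housing records; names and numbers are assumptions for illustration.
rows = [
    {"area_sqft": 1200, "bedrooms": 2, "age_years": 15, "sale_price": 210000},
    {"area_sqft": 1800, "bedrooms": 3, "age_years": 7, "sale_price": 320000},
    {"area_sqft": 2400, "bedrooms": 4, "age_years": 2, "sale_price": 450000},
]

input_columns = ["area_sqft", "bedrooms", "age_years"]  # the "blue" columns
output_column = "sale_price"                            # the "green" column

# X holds one feature vector per house, y holds the matching known output.
X = [[row[col] for col in input_columns] for row in rows]
y = [row[output_column] for row in rows]

print(X[0], "->", y[0])
```

A supervised learning algorithm is then trained on the pairs in `X` and `y`, one input vector for each known sale price.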
Compared to the other types of machine learning, such as Unsupervised Learning, Reinforcement Learning, and Semi-Supervised Learning, Supervised Learning is the one used most frequently. That said, which machine learning algorithm to use really depends on the business use case and the data.
The Concept behind Supervised Machine Learning
Usually, when we are learning anything, our teachers (or parents in my case) always say: “Learn the concept and you will be able to solve any problem given to you on this topic.”
But no one actually tells us what they mean by “learn the concept”. Well, LET ME help you 😉
If I ask you to pick out the apple from the images on the right, I am sure you will first roll your eyes at me and then do it instantly.
But what if I ask you, “Which of these two apples might turn out to be rotten on the inside?” (Btw, I have seen my parents do this 😆).
Most people would fail, because we have never actually learned which attributes a rotten apple shows and how they differ from those of a healthy apple.
In the earlier example, by contrast, you already know the features that distinguish apples from grapes.
This is what it actually means to learn the concept!
Learning the key differences that distinguish one output from another is also what drives supervised machine learning.
💡 Bruner, Goodnow & Austin defined Concept Learning in 1967 as “exploration and listing of features/attributes which can be used to distinguish one thing, event or idea from another“. – Wikipedia
That is why it is important to always have a dataset that represents these differences clearly. Including all the possible combinations of inputs and outputs helps us create a balanced dataset.
Step-wise Flow of Supervised Machine Learning
The diagram below illustrates the step-wise execution for a supervised learning algorithm:
- Data Ingestion: Reading the data from different file formats such as CSV, XLSX, etc.
- Data Cleaning: Data needs to be cleaned before it is given to the algorithms, because it might contain null values, noise, or other errors.
- EDA (Exploratory Data Analysis) / Feature Engineering: In real-world data, it is very rare to get the desired features up front. Instead, we do statistical analysis to understand the data better and then create new features from it. Domain knowledge helps a lot with feature engineering.
- Model Training and Evaluation: We need to choose a supervised learning algorithm that fits the data well. Usually, we fit several algorithms to the data and choose the one that performs best on held-out data. Some supervised learning algorithms also need the user to tune some settings (known as hyperparameters) for better accuracy and generalization.
- Prediction: After finalizing the algorithm and its hyperparameters, we re-train on the whole dataset, which gives us our final model. Then we start making predictions on new (unseen) input data.
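The five steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the CSV content is made up, and the "different algorithms" are stood in for by two single-feature linear models that are compared on held-out rows.

```python
import csv
import io
import random

# Hypothetical raw data; in practice this would be read from a CSV or XLSX file.
RAW_CSV = """area_sqft,bedrooms,sale_price
1000,2,200000
1500,3,290000
,3,310000
2000,3,410000
2500,4,500000
1200,2,245000
1800,3,
3000,5,590000
"""

def ingest(text):
    """Data ingestion: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Data cleaning: drop rows with missing values, convert the rest to floats."""
    complete = [r for r in rows if all(v not in (None, "") for v in r.values())]
    return [{k: float(v) for k, v in r.items()} for r in complete]

def add_features(rows):
    """Feature engineering: derive a new column from the existing ones."""
    return [{**r, "area_per_bedroom": r["area_sqft"] / r["bedrooms"]} for r in rows]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and known outputs."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

def column(rows, name):
    return [r[name] for r in rows]

rows = add_features(clean(ingest(RAW_CSV)))

# Model training and evaluation: hold out part of the data, then compare candidates.
random.Random(0).shuffle(rows)
train, test = rows[:4], rows[4:]
target = "sale_price"
candidates = ["area_sqft", "area_per_bedroom"]  # one single-feature model each

errors = {}
for feature in candidates:
    model = fit_line(column(train, feature), column(train, target))
    errors[feature] = mean_abs_error(model, column(test, feature), column(test, target))
best_feature = min(errors, key=errors.get)

# Prediction: re-train the chosen model on the whole dataset, then predict unseen input.
slope, intercept = fit_line(column(rows, best_feature), column(rows, target))
new_house = add_features([{"area_sqft": 2200.0, "bedrooms": 4.0}])[0]
predicted_price = slope * new_house[best_feature] + intercept
print(best_feature, round(predicted_price))
```

With only six usable rows the comparison is noisy, so treat the chosen feature as a demonstration of the workflow rather than a meaningful result.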
Types of Supervised Learning
Based on the type of problem you are solving, supervised learning is further divided into two categories: regression and classification.
Regression
Whenever the output column represents a continuous value, and we have to model the relationship between a dependent variable (output column or target variable) and one or more independent variables (input columns or features), we use regression modeling.
Some of the examples for regression problems include:
- Predicting real estate prices.
- Predicting household electric power consumption
- Predicting salary from HR data
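As a small illustration of a regression problem, the sketch below fits a straight line to made-up HR data (years of experience against salary in thousands) using gradient descent on the mean squared error. The data, the function name `train_linear_regression`, the learning rate, and the number of epochs are all assumptions chosen for this example.

```python
# Made-up HR data: years of experience -> salary in thousands.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]

def train_linear_regression(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example under the current line.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

slope, intercept = train_linear_regression(years, salary)
predicted = slope * 6.0 + intercept  # continuous output for an unseen input
print(f"salary = {slope:.2f} * years + {intercept:.2f}; 6 years -> {predicted:.1f}")
```

The key point is that the prediction is a continuous number rather than a category.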
Discussing regression in depth is out of the scope of this blog, so hop over to Introduction to Regression Algorithms for an in-depth understanding of regression algorithms such as Linear Regression, Step-Wise Regression, etc.
Classification
When the output column (target variable) holds discrete values (individually separate and distinct), in other words when we have to predict which category a new input belongs to, we solve it as a classification problem.
Some of the examples for classification problems include:
- Predicting cervical cancer risk
- Face Recognition
- Mushroom classification
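For a feel of what a classifier does, here is a minimal k-nearest-neighbours sketch in plain Python. It reuses the apple and grapes idea from earlier; the measurements (weight in grams, diameter in cm) and the helper name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter

# Made-up training data: (weight in grams, diameter in cm) -> label.
training_data = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((140.0, 6.8), "apple"),
    ((5.0, 1.5), "grape"),
    ((7.0, 1.8), "grape"),
    ((6.0, 1.6), "grape"),
]

def knn_predict(features, data, k=3):
    """Return the majority label among the k training points closest to `features`."""
    nearest = sorted(data, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((160.0, 7.2), training_data))  # apple
print(knn_predict((6.5, 1.7), training_data))    # grape
```

Unlike regression, the output here is one of a fixed set of labels.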
If you want an in-depth understanding of different classification algorithms such as Support Vector Machines, Decision Trees, etc., refer to the Introduction to Classification Algorithms blog.
I hope you got the basic idea about supervised learning and its types. Now a question may arise!
📌 “Why would someone use Supervised Learning?” or specifically “When should someone use Supervised Learning?”
Advantages of Supervised Learning
- Predefined Classes: In classification, the data has a predefined number of output classes. Hence, as a data scientist or machine learning engineer, you have a clear understanding of the classes being predicted.
- Data Understanding: The data is easier to understand because the input and output columns (or labels) are given to us beforehand.
- Control: We have control over how the algorithm is trained, so we can teach it to distinguish different classes and set an ideal decision boundary.
- Accuracy: Because the model is trained and evaluated against known correct outputs, its results are generally more accurate and reliable than those of unsupervised methods on the same task.
Summing up, we should use supervised learning when we have labeled data and a pre-defined set of classes (or a known target value) that we want to predict.
But, there are also some limitations to supervised machine learning which are very important to discuss.
Disadvantages of Supervised Learning
- Data Preparation Dependency: If the person labeling the input data makes a mistake and that data is fed to a supervised learning algorithm, the algorithm learns a wrong mapping between input and output and ends up with a wrong hypothesis (or model). Hence, the accuracy depends on the quality of the labeled data.
- Uncertainty in Data: It can be an issue if you are uncertain about the new data your system will receive, because the model won’t consider any classes apart from the pre-defined ones. For example, if I have trained a classifier to predict whether an image contains an apple or grapes, and I give it an image of a banana, it will still classify it as either apple or grapes.
- No Hidden Patterns: It doesn’t reveal any hidden patterns or structures in the data. So, it is the sole responsibility of the data scientist to understand and supervise the learning algorithm to fit an ideal hypothesis (or model).
- Dataset Dependency: To avoid a biased or underfit model, the dataset needs to be prepared carefully.
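The banana problem from the list above can be shown with a tiny nearest-centroid classifier. This is a sketch under assumptions: the measurements (weight in grams, diameter in cm) are made up, and nearest-centroid is just one simple classifier picked for illustration.

```python
import math

# Made-up training data: the model only ever sees these two classes.
training_data = {
    "apple": [(150.0, 7.0), (170.0, 7.5), (140.0, 6.8)],
    "grape": [(5.0, 1.5), (7.0, 1.8), (6.0, 1.6)],
}

def centroid(points):
    """Average each feature over all points of one class."""
    return tuple(sum(values) / len(points) for values in zip(*points))

centroids = {label: centroid(points) for label, points in training_data.items()}

def classify(features):
    """Return the label whose centroid is closest to `features`."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

banana = (120.0, 3.5)  # a fruit the model was never trained on
print(classify(banana))  # apple
```

The model has no way to answer "neither", so the banana is forced into one of the classes it already knows.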
Many supervised learning algorithms, such as Linear Regression, Support Vector Machines, Decision Trees, and Random Forests, are discriminative training methods (also known as discriminative models).
Discriminative training methods learn the mapping from inputs to outputs directly, for example a decision boundary that separates the different output classes.
But there is another approach to modeling, known as generative modeling, which instead models how the data itself is distributed. It includes algorithms like Naive Bayes and Gaussian Mixture Models. If you are interested in learning more about these approaches, I would highly recommend reading this blog, which covers “What are discriminative models? What are generative models? How are they different?” in detail and with examples.
Test your conceptual knowledge of supervised learning by attempting the quiz below. Take it from me, it’s fun and rewarding!
Supervised Learning Quiz
Well done! By now, you should have a clear idea of supervised machine learning, its types, and when you should actually consider using it.
If you have any queries, you can always use the comments section, and I’ll try to answer asap.
But believe me, this is just the tip of the iceberg and there is so much more to learn. Next, I would suggest you read this blog on Unsupervised Learning or explore Introduction to Linear Regression (where I code a linear regression model from scratch in Python). You can also learn about the differences between machine learning algorithms here.
🙏🏼 Until next time! Keep Learning and Keep Hustling!