A machine learning pipeline may consist of many tasks, such as data cleaning and feature extraction. But the most important step is identifying the category of machine learning problem in front of us: out of types like supervised, semi-supervised, unsupervised, and reinforcement learning, which one should we use, and why?
Broadly, machine learning problems are divided into three categories:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
As I have already discussed Supervised Learning and Reinforcement Learning in depth in my other blogs, I’ll mostly focus on Unsupervised Learning here. Before proceeding, I would recommend going through those blogs to brush up on the concepts. And if you are new to machine learning, I would highly recommend starting with Introduction to Machine Learning and then moving on to Supervised Learning and Unsupervised Learning.
❓ “What is Unsupervised Learning? What are the types of Unsupervised learning? How is it different from Supervised or Reinforcement Learning?”
We will be answering these and many more questions pertaining to unsupervised machine learning in this blog.
What is Unsupervised learning?
How about a simple game? Ready?
Okay! Try to guess the object in the image on your right.
Yes! 10/10 if you answered “Clock” and get a clock if you didn’t 😜.
But have you wondered how you were able to get it right, even though the image was blurred?
After all, it is quite rare that you would actually own a clock that looks exactly like this.
It’s really simple. You were able to recognize it because of the clock dial and the hour and minute hands.
You actually understood the hidden pattern of a clock, i.e. whenever you see the numbers one to twelve arranged in a circle with some hands pointing at them, it is a clock.
So no matter what kind of clock I show you, you will instantly be able to recognize it.
Unsupervised Learning works on a similar approach. Unlike supervised learning, it is not given a labeled dataset; rather, it is given only the input data and has to find the hidden patterns with minimal human supervision, hence the term Un-Supervised Learning.
Types of Unsupervised Learning
Depending on the type of application, unsupervised learning is divided broadly into three categories:
1. Clustering
Simply put, Clustering is the art of grouping together similar objects/data (in terms of features).
It’s similar to having a black box with a few colored balls in it, where you are asked to draw one ball at a time and keep grouping them. You don’t know what color or pattern the next ball will have, but you can still group the balls using their features, i.e. color or pattern.
An interesting real-world application of clustering is Google Photos. It scans your photos and creates separate albums for your friends and family. It has no idea how many friends you have, but it tries to match common features and create albums. A lot goes on in the backend, but at a high level it is a clustering task.
Clustering is mostly used for applications like:
- Market Segmentation
- Anomaly Detection etc.
Numerous clustering algorithms have been devised, mainly because “defining a cluster” is very problem-oriented, i.e. the definition varies with the type of problem you are solving. In simple terms, we can group objects based on different criteria like distance, mean vectors, statistical distributions, and many more. And a cluster model that fits one kind of data well might not necessarily perform well on other kinds of data.
There are some basic cluster models that one needs to understand to have a good grasp of clustering algorithms. But as a detailed treatment of clustering is out of the scope of this blog, you can read Introduction to Clustering for in-depth coverage of clustering algorithms.
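To make the “distance” and “mean vector” idea concrete, here is a minimal sketch of k-means, one of many clustering algorithms. The data points, the starting centroids, and the function name `kmeans` are made up for illustration; this is not how any particular product does it.

```python
import math

def kmeans(points, initial_centroids, iterations=10):
    """Tiny k-means: alternate between assigning points and updating means."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centroids

# Hypothetical 2-D data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)   # cluster index assigned to each point
print(centers)  # final mean vector of each cluster
```

Notice that no labels were supplied: the groups come only from how close the points are to each other. Here the number of clusters is fixed by hand through the two starting centroids, and choosing it is itself part of the problem in practice.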
2. Association
As the dictionary meaning suggests, an association is “a connection or cooperative link between people or organizations”.
In terms of machine learning, association helps us find hidden patterns or connections in the data. And as it is a rule-based machine learning method, it derives strong rules from the data.
For example, whenever I go out to buy myself some bread, I get butter with it. So bread has a connection with butter here. If a system runs an association (unsupervised learning) algorithm on that store’s data, it would easily find this connection, provided the same pattern repeats for other customers as well. The owner can then place butter beside bread to increase sales.
Association in unsupervised learning is used mainly for the applications below:
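The bread-and-butter connection can be measured with two standard quantities from association rule mining: support (how often the items appear together) and confidence (how often butter is bought given that bread is bought). The sketch below uses made-up baskets and a hypothetical helper `rule_stats`; real algorithms such as Apriori add an efficient search over many item combinations on top of this counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one customer's purchase.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how often each item and each pair of items occurs.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def rule_stats(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence

support, confidence = rule_stats("bread", "butter")
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, bread and butter appear together in 3 of 5 baskets, and 3 of the 4 bread buyers also bought butter, so the rule has a support of 0.60 and a confidence of 0.75.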
- Market Basket Analysis (What do people buy together?)
- Automating Marketing Strategy (What to show people more?)
- Recommender System (Major Application)
- Price Bundling and Discounts
- Intrusion Detection
You might have seen the “Frequently Bought Together” section on Amazon. I searched for a bulb socket on Amazon and it gave me these suggestions. Cool, right?
The recommender system is the most widely used application of association-based learning.
📌 If you want to learn more about association algorithms like Apriori and Eclat, you can head to the blog Association in Unsupervised learning Algorithms
3. Dimensionality Reduction
In real-world applications, we sometimes come across datasets that may contain millions of features. Even Excel breaks while loading such datasets.
💡 Being a machine learning enthusiast, you must know about the curse of dimensionality. In brief, it states that for a fixed amount of training data, beyond a certain point the performance of a machine learning algorithm decreases as the dimensionality of the data increases.
Dimensionality Reduction helps us represent the same high-dimensional data in a lower-dimensional space. For example, 3-D data is tough to plot and read, but if the same data is represented as 2-D points, it is much easier to visualize.
Now, a question might arise: “But won’t we lose important information while reducing the data?”
Absolutely right! That is why we need to be careful and reduce the data in such a way that minimal loss of information happens. A common rule of thumb, for variance-based methods such as PCA, is to keep enough components that about 90-95% of the variance remains in the data.
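One facet of the curse of dimensionality can be seen with a small experiment: as the number of dimensions grows, the nearest and the farthest neighbor of a point end up at almost the same distance, which hurts any method that relies on distances. The sketch below is only an illustration on uniformly random data; the function name `distance_contrast` and all parameters are assumptions chosen for the demo.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest neighbor of a random query."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

# The contrast shrinks as the dimensionality increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

A large value means “near” and “far” are clearly different; a value close to zero means every point looks roughly equally far away.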
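To see what “variance remaining” means, here is a minimal sketch of the idea behind PCA on a tiny 2-D dataset, reduced to 1-D. The data is made up, and the closed-form eigenvalue formula used here only works for a 2x2 covariance matrix; real datasets would need a linear algebra library.

```python
import math

# Hypothetical 2-D data where the two features are strongly correlated.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Entries of the 2x2 sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (var_x + var_y) / 2
spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
largest, smallest = half_trace + spread, half_trace - spread

# Share of the total variance kept if only the first principal component is kept.
explained_ratio = largest / (largest + smallest)
print(f"variance kept by 1 component: {explained_ratio:.1%}")
```

Because the two features move together, a single component keeps roughly 96% of the variance here, so dropping from 2-D to 1-D loses very little information. With less correlated features the ratio would be lower and more components would be needed to reach the 90-95% range.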
Dimensionality Reduction is used for applications like:
- Noise Reduction
- Data Visualization
- Risk Management
- Topic Modelling
- As a preprocessing step before modeling.
It is a very vast topic. Approaches are commonly split into feature selection and feature extraction, and extraction methods can be further divided into linear and non-linear ones.
Some of the commonly used algorithms under dimensionality reduction include PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis). Note that LDA uses class labels, so unlike PCA it is a supervised technique. For an in-depth understanding of Dimensionality reduction, refer to this blog.
Why do we need Unsupervised Learning? (Advantages)
❓ A question might arise: why do we actually need an unsupervised approach when we have a “tried and tested” supervised approach?
Well, some of the important factors may include:
- Data Understanding: Although supervised learning methods give us predictions with decent accuracy, they often fail to extract hidden patterns in the data. In other words, supervised learning is only as good as the data and labels you prepare, whereas unsupervised learning tries to find the hidden patterns in the data itself.
- Labelling Data: With the ever-growing volume of unlabeled data, manually labeling it becomes a huge task. And believe me, no Data Scientist wants to do it. It’s that part of the cooking recipe where you have to peel the onion. Yeah, literally tears just roll out of you 🤣! Just kidding, but you get my point. There are services like Mechanical Turk that can do it for you at a low price per label, but that still adds to the overall cost. Hence, it is much more important now to have algorithms that do not depend only on labeled data. One interesting approach that uses some labeled data and some unlabeled data is Semi-Supervised learning. If you are intrigued to learn more about it, refer to this blog.
- Undefined Classes: You might have understood by now that Supervised Learning needs labels or classes as the output column. But what if I don’t know the exact number of classes present in the data? People in data mining might relate to what I am saying; for everyone else, consider a dataset with customer details. What label/class would you choose for each person? Yup, you can’t! There might be 6, 10, 100, or more classes.
Disadvantages of Unsupervised Learning
Although unsupervised learning brings hidden patterns and associations to light, it is still not as widely used in real-world applications, due to restrictions like:
- High Complexity: Unsupervised learning is considerably harder than other types of machine learning.
- Results Verification – Although the system maps out hidden patterns, considerable effort is needed to actually verify the authenticity of results.
📌 If you want to learn the differences between supervised learning, unsupervised learning and reinforcement learning, read this blog.
I hope you now have a clear understanding of unsupervised learning. In case you have any queries, shoot them in the comment box and I will try to answer as soon as possible.
Well! From my experience, I know it can get overwhelming after looking at all the types and topics. So to everyone who completed this: “Amazing Work”!
But believe me, it’ll get better as there is so much more to learn.
For the hungry ones, head on to read more about unsupervised learning algorithms – Clustering, Association, and Dimension Reduction.
You can also read related blogs on Supervised Learning or Linear Regression from Scratch in Python.
🙏 Until next time! Keep Learning and Keep Hustling!