Regression analysis is like Tinder (a dating app) for data – you try to find meaningful relationships that will generalize to upcoming data. Just as only swiping right is not a great way to find relationships, only calling pre-built packages is not a great way to build an in-depth understanding of regression.
Some people might think it’s better to hop on to the practical aspects of regression analysis and ditch the theory part. But to build something powerful you have to understand the underlying concepts behind it.
Before getting deep into regression analysis, if you are just getting started with machine learning – I would strongly recommend going through Introduction to Machine Learning and Supervised Learning. This will help you get the basics right and not scratch your head while reading this blog 😜
What is Regression Analysis?
In simple terms, it is the process of finding the relationship between one or more dependent variables and one or more independent variables.
✅ Dependent Variable – Target value or outcome variable
✅ Independent Variable – Predictors or Feature columns
Let’s understand this with an example most of you can relate to…!
Problem Statement – Consider you’re brewing a cup of coffee. And we want to apply some regression analysis to your cup of coffee.
So here, the dependent variable is the sweetness of the coffee, which is the end result. And independent variables could be coffee beans and sugar.
Some patterns which you will agree with, might be…
- The sweetness of the coffee decreases⬇ as I increase the number of coffee beans.
- The sweetness of the coffee increases⬆ as more sugar is added to the coffee.
In brief, the taste of the coffee depends on the amounts of coffee beans and sugar added to the milk.
There it is! The relationship of coffee beans and sugar to the sweetness of the coffee is the relationship we wanted to model.
In an ideal world, the equation might look something like: sweetness of the coffee = amount of sugar added / amount of coffee beans added.
🙆♂️ Free Advice alert! Ditch coffee and eat an apple 😂- Source
We will discuss some practical use cases later in the blog.
💡 The earliest form of regression was the method of least squares, published by Legendre in 1805.
Types of Regression
Three factors distinguish different types of regression algorithms:
- The number of independent variables (features)
- The shape of the regression line.
- The type of the dependent variable (target label)
There are numerous regression algorithms out there, and you can even build your own by considering the above three factors. The two most commonly used and easy-to-implement algorithms are linear regression and logistic regression.
Linear Regression
This is one of the simplest and most well-studied forms of regression. With its origins in statistics, linear regression has been used by data scientists to model relationships within data and get promising results.
Linear Regression builds a hypothesis or model which reflects a linear relationship between feature variables and target variables. In other terms, it assumes that the relationship between variables is linear and tries to map it using a straight line.
The best fit straight line in linear regression is represented as:
$$y = mx + c + e$$
Here $m$ is the slope of the line, $c$ is the intercept and $e$ is the error term.
We can further divide linear regression into two categories, depending on the number of independent variables (or feature columns):
- Simple LR (when independent variables = 1)
- Multiple LR (when independent variables > 1)
To come up with the best fit straight line, we need to minimize the error between the predicted line and the actual output. We can use different estimation methods, such as ordinary least squares (the most commonly used) or weighted least squares, to minimize the error.
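As a minimal sketch of the least squares idea for a single feature, the snippet below fits a line of the form y = m*x + c using the closed-form ordinary least squares solution. The function names and the sugar/sweetness numbers are made up for illustration; they are not taken from any library or from real data.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = m*x + c for a single feature."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (co-variation of x and y) / (variation of x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    c = mean_y - m * mean_x
    return m, c


def sum_of_squared_errors(xs, ys, m, c):
    """The quantity that least squares minimizes."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: spoons of sugar vs. a made-up sweetness score.
sugar = [0.0, 1.0, 2.0, 3.0, 4.0]
sweetness = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_simple_linear_regression(sugar, sweetness)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("SSE at the fit:", sum_of_squared_errors(sugar, sweetness, slope, intercept))
```

Any other choice of slope or intercept gives a larger sum of squared errors on the same data, which is what "best fit" means here.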
If you want to read more about Linear Regression and a Python implementation (from scratch) of the same, read the blog – Linear Regression.
Logistic Regression
This form of regression is commonly used when we want to predict the probability of a binary event. In data terms, if we need to predict a target column with values like 0/1, True/False or Yes/No, we use logistic regression.
Logistic regression measures the relationship between a dependent variable and one or more independent variables using the logistic function, also called the sigmoid function, as shown in the image. It takes any real-valued input and outputs a value between zero and one.
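To make the sigmoid concrete, here is a small sketch of the standard logistic function, sigmoid(z) = 1 / (1 + e^(-z)). The function name is my own choice, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function.

    Mathematically the result lies strictly between 0 and 1; in floating
    point it can round to exactly 0.0 or 1.0 for extreme inputs.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so that
    # exp() is never called with a large positive argument.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

Large negative inputs are squashed towards 0, large positive inputs towards 1, and an input of 0 maps to exactly 0.5.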
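It generates a hypothesis that is used to validate a relationship rather than state the relationship. A logistic regression model builds such a hypothesis, which can then be combined with a threshold to build a classifier. For example, with a threshold of 0.5, a predicted probability of 0.8 is classified as 1 and a predicted probability of 0.3 is classified as 0.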
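The thresholding step can be sketched as follows. The weight and bias are hand-picked, hypothetical values (nothing is learned from data here), and the function names are my own; the point is only to show how a probability becomes a 0/1 label and how the threshold changes the outcome.

```python
import math


def predicted_probability(x, weight, bias):
    """Sigmoid of a linear score, written to avoid overflow in exp()."""
    z = weight * x + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 label using a cut-off."""
    return 1 if probability >= threshold else 0


# Hypothetical, hand-picked parameters (not learned from data).
weight, bias = 1.5, -3.0
sugar_spoons = [0.5, 2.0, 4.0]

probabilities = [predicted_probability(x, weight, bias) for x in sugar_spoons]
labels_default = [classify(p) for p in probabilities]
labels_strict = [classify(p, threshold=0.9) for p in probabilities]

print([round(p, 3) for p in probabilities])
print(labels_default, labels_strict)
```

Raising the threshold makes the classifier more conservative about predicting 1, so the middle example flips from 1 to 0 while the probabilities themselves stay the same.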
If you want to read more about Logistic Regression, its types and implementing it in Python from scratch, read the blog – Logistic Regression.
Other Regression Algorithms
Although you may not use the algorithms listed below very often, they are equally important, because we can use them for certain types of data and target variables. As a data scientist, you should have a basic understanding of these as well:
Discussing each regression algorithm in detail is out of the scope of this blog. In any case you can read more about other algorithms here – Top 10 regression algorithms to know about.
Mathematics behind Regression Analysis
It is always good to know the mathematics behind the algorithms, as it clarifies the basic idea of how they work.
You can skip this section if you are having a hard time understanding the maths. It’s not absolutely important but I have included this as it is always good to know.
A regression analysis involves the following parameters:
- $\beta$ – Unknown parameters which need to be calculated
- $X_i$ – Rows of independent variables or the feature columns from the data
- $Y_i$ – Rows of the dependent variable or the target column from the data
- $e_i$ – When we do regression analysis, we only compute an approximation of the real relationship between variables. This is the error term that denotes the difference between our approximation and the real relationship (theoretically). Please note that I am not talking about the difference between our model’s prediction and actual values. Refer to this video for a better understanding.
For every row $i$ in the data, most regression algorithms approximate $Y_i$ as a function of $X_i$ and $\beta$, with a random statistical noise or error term $e_i$. We represent it as,
$$Y_i = f(X_i, \beta) + e_i$$
Here, the term $f(X_i, \beta)$ is the hypothesis. The better the hypothesis, the better the predictions of the model. To select the appropriate hypothesis, a data scientist needs to evaluate the data and understand which function would work the best.
Once the form of the function is chosen, a cost function is used to estimate the parameters $\beta$. The estimate is sometimes denoted by $\hat{\beta}$ to distinguish it from the original parameters used to produce the data. There are several approaches, like least squares, which help in approximating the value of the parameters.
The value of the estimated parameters ($\hat{\beta}$) is then used to predict and to assess the performance of the model. The output is denoted by $\hat{Y}_i = f(X_i, \hat{\beta})$.
The flow of the regression analysis looks something like,
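A minimal sketch of this flow in Python, under some assumptions of my own: the hypothesis is a straight line f(x, beta) = beta0 + beta1 * x, the cost is the mean squared error, and the parameters are estimated with plain gradient descent. The data, the learning rate and the number of steps are illustrative choices, not recommendations.

```python
def hypothesis(x, beta):
    """f(x, beta): a straight-line hypothesis with beta = (beta0, beta1)."""
    return beta[0] + beta[1] * x


def mean_squared_error(xs, ys, beta):
    """Cost: average squared gap between predictions and targets."""
    return sum((hypothesis(x, beta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def estimate_beta(xs, ys, learning_rate=0.05, steps=5000):
    """Estimate beta by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [hypothesis(x, (b0, b1)) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to b0 and b1.
        grad_b0 = 2.0 * sum(residuals) / n
        grad_b1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Step 1: data (hypothetical feature/target pairs with some noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

# Step 2: estimate the parameters (beta hat) by minimizing the cost.
beta_hat = estimate_beta(xs, ys)

# Step 3: use beta hat to predict (Y hat) and assess the fit.
y_hat = [hypothesis(x, beta_hat) for x in xs]
print("beta_hat:", [round(b, 3) for b in beta_hat])
print("cost at beta_hat:", round(mean_squared_error(xs, ys, beta_hat), 4))
```

For a straight-line hypothesis, gradient descent ends up at the same parameters as the closed-form least squares solution; the iterative version is shown here because the same loop structure carries over to hypotheses that have no closed form.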
Applications of Regression analysis
In machine learning, we use regression analysis normally for the following:
Prediction and Forecasting
Regression models are used to learn the mapping from the features to the target variable using data and trends, which is then used to make predictions on new data points.
- Predicting real estate prices
- Predicting deaths caused by bike accidents
You can also use it to forecast from trends and historical data, like forecasting the weather or the sales of a product.
Cause and Effect Relationship
Additionally, you can use it to study cause and effect relationships. For example, the sweetness of the coffee will increase as the amount of sugar is increased. It is important to understand, however, that a relationship found in the data does not always justify a prediction. Simply put, a relationship that holds within the dataset does not necessarily hold for new data points.
Filtering important features
Regression analysis allows us to determine which factors matter and which can be ignored. It also tells us the magnitude of the impact of important factors.
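One common heuristic for this, sketched below, is to standardize each feature, fit a multiple linear regression, and compare the absolute sizes of the coefficients. Everything here is an assumption for illustration: the data is made up (sweetness is constructed as 5 + 2 * sugar - 1 * beans, and the cup temperature is deliberately irrelevant), the helper names are my own, and the heuristic is only reasonable when the features are not strongly correlated with each other.

```python
def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (features may be collinear)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple_regression(columns, y):
    """Least squares via the normal equations, with an intercept term."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * target for row, target in zip(design, y))
           for j in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data; sweetness = 5 + 2 * sugar - 1 * beans by construction.
sugar = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
beans = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
cup_temperature = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]  # irrelevant
sweetness = [6.0, 8.0, 9.0, 11.0, 4.0, 6.0, 7.0, 9.0]

features = [standardize(col) for col in (sugar, beans, cup_temperature)]
intercept, coef_sugar, coef_beans, coef_temp = fit_multiple_regression(
    features, sweetness
)
for name, coef in (("sugar", coef_sugar), ("beans", coef_beans),
                   ("cup_temperature", coef_temp)):
    print(f"{name}: {coef:+.3f}")
```

On this toy data the sugar coefficient has the largest magnitude, the beans coefficient is smaller and negative, and the temperature coefficient is essentially zero, which matches how the data was constructed.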
Problems with Regression
Depending on the type of data, sometimes regression algorithms may fall prey to the below conditions:
- Multicollinearity: Sometimes the independent variables in the data may be highly correlated, which causes problems in approximating the parameters and selecting the important features.
💡 Two variables $X_1$ and $X_2$ are collinear if they have a linear relationship between them.
- Outliers: Outliers are data points whose values are much higher or lower than those of the other data points. They can cause issues as they do not represent the distribution of the original data and should be removed.
- Insufficient Data: If the number of data points is lower than the number of independent variables, the regression has infinitely many solutions and a unique model cannot be determined.
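Two of these problems are easy to see on toy data. The sketch below uses made-up numbers and helper names of my own: first it measures the correlation between two features to spot collinearity, then it shows that with one data point and two features several different parameter vectors fit the data perfectly.

```python
def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Multicollinearity: sugar in grams is sugar in spoons times a constant,
# so the two columns carry the same information.
sugar_spoons = [1.0, 2.0, 3.0, 4.0, 5.0]
sugar_grams = [4.0 * s for s in sugar_spoons]
beans = [3.0, 1.0, 4.0, 1.0, 5.0]

r_collinear = pearson_correlation(sugar_spoons, sugar_grams)
r_other = pearson_correlation(sugar_spoons, beans)
print("spoons vs grams:", round(r_collinear, 3))
print("spoons vs beans:", round(r_other, 3))


# Insufficient data: one row, two features (sugar, beans), one target.
def predict(features, beta):
    """Linear prediction without an intercept, to keep the example small."""
    return sum(f * b for f, b in zip(features, beta))


single_row = [2.0, 1.0]
target = 5.0
candidates = [[2.5, 0.0], [1.0, 3.0], [3.0, -1.0]]
for beta in candidates:
    print(beta, "->", predict(single_row, beta))
```

All three candidate parameter vectors reproduce the single target exactly, so the data alone cannot tell us which one is right. In the collinear case a correlation close to 1 (or -1) is the warning sign; how close counts as "too close" is a judgment call that depends on the problem.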
Regression vs Classification
Both regression and classification are types of supervised learning algorithms. The key difference between regression and classification is the type of the target label. For a target label with,
- Continuous value – We use Regression
- Discrete Value – We use Classification
Continuous values can take any real value within an interval, e.g. 11.3, 11.6, 20.3. Discrete values belong to a set of distinct, separate values, e.g. 10, 20, 11.
I hope you have a clear understanding of regression analysis, its types, and its applications. Going ahead, I would recommend that you read about the top 10 regression algorithms and start putting this knowledge into actual projects.
You can also refer to Introduction to Classification Algorithms if you want to learn more about classification.
Was this blog useful to you? Tell me your views and doubts (if any) in the comments section below.