Linear Regression in Machine Learning (from Scratch !!)


In this post, I will talk about one of the most crucial techniques in Regression Analysis/Machine Learning, called Linear Regression. As per Wikipedia, Regression Analysis is a set of statistical processes used to estimate the strength of the relationship between a dependent variable and one or more independent variables. When this relationship is assumed to be linear, the technique is called Linear Regression. In simple words, in Linear Regression we estimate the relationship between the independent variables (also called features) and the dependent variable (also called the target variable) under the assumption that the relationship is linear.

Linear Regression: Example

Some of you might ask: what does a linear regression problem look like? In this section I tackle this question with an example of predicting the sales of a product using the Advertising data. The dataset is available on Kaggle and can be accessed from here, and the entire code can be accessed from here. The variables in the dataset are as follows:

  1. TV: TV Advertising
  2. Radio: Radio Advertising
  3. Newspaper: Newspaper Advertising
  4. Sales: Sales of the product.

So, in this problem we will analyse the relation between the independent variables and the dependent variable using a Linear Regression model created from scratch.

Let’s take a look at the data.

Advertising Data

From the above data it is clear that all the variables are continuous. Before going into the depths of Linear Regression, let's discuss its assumptions.
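To follow along, the data can be loaded with pandas. The file name below is a placeholder for wherever the Kaggle CSV is saved locally, and the inline rows are illustrative values with the same column layout, not a substitute for the full dataset:

```python
import pandas as pd

# Placeholder path -- adjust to wherever the Kaggle file is saved:
# df = pd.read_csv("advertising.csv")

# A few illustrative rows with the same column layout:
df = pd.DataFrame({
    "TV":        [230.1, 44.5, 17.2],
    "Radio":     [37.8, 39.3, 45.9],
    "Newspaper": [69.2, 45.1, 69.3],
    "Sales":     [22.1, 10.4, 9.3],
})
print(df.dtypes)  # every column is continuous (float64)
```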

Assumptions of Linear Regression

There are five assumptions associated with a linear regression model:

  1. Linearity: The relationship between the independent variable and the mean of the dependent variable is linear.
  2. Homoscedasticity: The variance of residual is the same for any value of the independent variable.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of the independent variable, the dependent variable is normally distributed.
  5. No Autocorrelation: There should be no correlation between the residuals of successive observations.
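Some of these assumptions can be checked empirically from the residuals. As a minimal sketch (on synthetic data, not the Advertising set), one can fit a line and confirm the residuals average to zero and show no pattern against the predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, 200)  # synthetic data that meets the assumptions

# Fit a line (np.polyfit returns [slope, intercept]) and inspect the residuals
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

print(round(resid.mean(), 3))                 # ~0: linearity holds
print(round(np.corrcoef(x, resid)[0, 1], 3))  # ~0: no pattern against x
```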

Next we talk about the basics of Linear Regression and eventually move into its depths.

Basics of Linear Regression

In this section, we discuss the mathematics behind Linear Regression. When we have only one independent variable, it is called Simple Linear Regression or Univariate Linear Regression and it is given by,

\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i}
Simple Linear Regression

In the above equation, ŷi represents the predicted target (dependent variable), xi represents the independent variable, 𝜷1 represents the weight of the independent variable and 𝜷0 represents the bias term or, in simple terms, the intercept of the linear equation.
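For simple linear regression these two coefficients also have a well-known closed form, 𝜷1 = cov(x, y) / var(x) and 𝜷0 = mean(y) − 𝜷1 · mean(x). A quick sketch with made-up numbers:

```python
import numpy as np

# Closed-form estimates for simple linear regression (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # cov(x, y) / var(x)
beta0 = y.mean() - beta1 * x.mean()             # intercept
y_hat = beta0 + beta1 * x                       # fitted line

print(round(beta1, 2), round(beta0, 2))  # → 1.99 0.05
```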

If we have more than one independent variable then the linear regression is called Multivariate Linear Regression. The above equation can be generalised to p independent variables as,

y_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \cdots + \beta_{p} x_{ip} + \varepsilon_{i}, \quad i = 1, \dots, n.
Multivariate Linear Regression

Linear Algebra makes it easier for us to calculate this equation. We use the vectorised form for calculating the Linear Regression equation. It is given by,

\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}
Vectorised form of Linear Regression

Where y, X, 𝛃 and ε are given by,

\mathbf{y} = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}, \quad
X = \begin{pmatrix} \mathbf{x}_{1}^{\mathsf{T}} \\ \mathbf{x}_{2}^{\mathsf{T}} \\ \vdots \\ \mathbf{x}_{n}^{\mathsf{T}} \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{pmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{pmatrix}.
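In code, this vectorised form is a single matrix product. A small sketch with illustrative numbers, prepending a column of ones for the intercept:

```python
import numpy as np

# n = 3 samples, p = 2 features (illustrative values)
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # intercept column of ones
beta = np.array([0.5, 2.0, -1.0])                  # [beta0, beta1, beta2]

y_hat = X @ beta  # one matrix product gives all n predictions at once
print(y_hat)      # → [0.5 2.5 4.5]
```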

We will be using this vectorisation technique in our code as well. Next we discuss what are loss functions and which loss function can be used to train Linear Regression.

Loss Functions

A loss function is a function used to evaluate a candidate solution; in the case of Linear Regression, the candidate solution is the set of parameters or weights that we want to evaluate. In Linear Regression, we prefer a function which is continuous, differentiable and smooth. One such function which satisfies all these criteria is the Mean Squared Error (MSE for short). It is given as,

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^{2}

where MSE is the mean squared error, n is the number of data points, Y_i are the observed values and Ŷ_i are the predicted values.

The loss function is also referred to as the cost function. Next, we look at the algorithm that can be used to train the Linear Regression model.
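The MSE is straightforward to compute with NumPy; a small sketch on made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values

# Mean of the squared differences
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # → 0.375
```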

Gradient Descent Algorithm

Gradient Descent is the algorithm that we can use to train the Linear Regression model. On a high level, we initialise a set of weights randomly, equal to the number of independent variables, then we measure the loss of the model with these weights using our loss function and then update these weights using the gradient descent update rule. This process is repeated until convergence. Mathematically, the weights are updated as described by the following equation.

\beta_{i} := \beta_{i} - \alpha \frac{\partial J}{\partial \beta_{i}}
Gradient Descent Update rule

Here 𝜷i represents the ith weight, J represents the loss function and α represents the learning rate. Intuitively, the gradient descent algorithm can be explained as a spherical object rolling down a hill to its bottom. This can be visualised as follows,

Gradient Descent Algorithm
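The update rule can be seen in action on a toy one-dimensional problem. The sketch below (not from the post) minimises J(w) = (w − 3)², whose gradient is 2(w − 3):

```python
# Minimal 1-D gradient descent on J(w) = (w - 3)**2
w = 0.0          # initial weight
alpha = 0.1      # learning rate
for _ in range(100):
    grad = 2 * (w - 3)      # dJ/dw
    w = w - alpha * grad    # the update rule from above
print(round(w, 4))  # → 3.0, the minimiser of J
```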

Enough of the theory part now. Let’s dive into the code. In the next section, we code all the concepts explained above right from scratch.

Linear Regression Code From Scratch

In this section, I have attached the code for a custom Linear Regression model.

import numpy as np

class LinearRegression_Custom:
    # Constructor
    def __init__(self, X, y, lr=0.01, n_iter=1000):
        self.X = X
        self.y = y = lr
        self.n_iter = n_iter
        self.theta = np.zeros(shape=(self.X.shape[1],))
        self.error_list = []
    # 1. Predictions
    def predictions(self, data):
        return, self.theta)
    # 2. Loss Function (MSE)
    def loss_function(self):
        preds = self.predictions(data=self.X)
        act = self.y
        mse = np.mean((act - preds) ** 2)
        return mse
    # 3. Gradient of the MSE with respect to theta
    def gradient(self):
        preds = self.predictions(data=self.X)
        act = self.y
        m = self.X.shape[0]
        error = act - preds
        return -2 * (, error) / m)
    # 4. Gradient Descent
    def train(self):
        for _ in range(self.n_iter):
            # Compute gradient
            grad = self.gradient()
            # Track the training error for later visualisation
            self.error_list.append(self.loss_function())
            # Perform the gradient descent update
            self.theta = self.theta - * grad
    # 5. Compute the R squared score:
    # R2 = 1 - sum((actual - predictions)**2) / sum((actual - mean)**2)
    def score_R2(self, data, test):
        # 1. Make predictions
        preds = self.predictions(data=data)
        act = test
        # 2. Compute RSS: Residual Sum of Squares
        rss = np.sum((act - preds) ** 2)
        # 3. Compute TSS: Total Sum of Squares
        tss = np.sum((act - np.mean(act)) ** 2)
        # 4. Compute R squared
        r2_score = 1 - (rss / tss)
        return r2_score

Next, we create an object of this class and train the model on our training data. We also visualise the predictions on a scatter plot, along with the loss against the number of iterations.

# Create the model object and train the model using the train function
lr = LinearRegression_Custom(X_train, y_train, lr=0.1, n_iter=500)

# Make predictions
y_preds = lr.predictions(data=X_test)

# Plot the loss curve and the predictions against the actual values
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].plot(lr.error_list)
sns.scatterplot(x=y_test, y=y_preds, ax=ax[1])
ax[1].plot(np.arange(5.0, 25.0), np.arange(5.0, 25.0), color='orange', label="45 Degree Line")
ax[0].set_xlabel("Number of Iterations")
ax[0].set_ylabel("MSE")
ax[1].set_xlabel("Actual Values")
ax[1].set_ylabel("Predicted Values")
Error Analysis and results

We are able to achieve an R squared score of 0.91 using this custom model. You can access the entire notebook here.


So, this was the implementation of Linear Regression from scratch. I hope you find my blogpost informative. I keep on posting Data Science content regularly on my blog as well as on other platforms like Medium, Kaggle and LinkedIn. Please do subscribe to my blog and if you would like to connect with me feel free to do so over LinkedIn. I am quite active there and I will be happy to have a conversation with you. The link to my LinkedIn profile is attached here. I will catch you in another post till then, happy learning 🙂


