Linear Regression in Machine Learning: Fundamentals, Loss Functions, and Regularization Techniques



Raj Shaikh

Welcome to the land of Linear Regression, where math meets fortune-telling! This is the place where AI tries to predict the future, one straight line at a time. Whether it’s predicting house prices, stock trends, or how many tacos you’ll eat tomorrow 🌮, linear regression is the OG of machine learning. Let’s start with the first piece of the puzzle: Least Squares Regression.

1. Least Squares Regression: Fitting the “Best Line”

What is Linear Regression?

Linear regression is like trying to draw the perfect straight line through a scatterplot of data points. Your goal? Find the line that minimizes the “oops moments” (a.k.a. errors) between the actual data and your predictions.

Mathematically, we model a target \( y \) as a linear function of input \( x \):

\[ y = mx + b \]

Where:

  • \( m \) is the slope (how steep your line is).
  • \( b \) is the y-intercept (where the line crosses the y-axis).

But how do we find \( m \) and \( b \) to make the line fit like a glove? 🧤 That’s where Least Squares Regression steps in.


The Least Squares Method

The idea is simple: minimize the sum of squared errors (the “oops moments”) between the predicted \( \hat{y} \) and actual \( y \). The error for a single point is:

\[ \text{Error} = y - \hat{y} \]

The total error (called the Residual Sum of Squares, or RSS) is:

\[ \text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

The goal is to find \( m \) and \( b \) that make RSS as small as possible.
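
Before the closed-form formulas, it helps to see the objective itself. Here is a minimal NumPy sketch (with made-up numbers, not the ice cream data below) that scores two candidate lines by their RSS; least squares simply picks the \( m \) and \( b \) with the lowest score:

import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def rss(m, b, x, y):
    """Residual Sum of Squares for the candidate line y_hat = m*x + b."""
    y_hat = m * x + b
    return np.sum((y - y_hat) ** 2)

print("RSS for m=2, b=0:", rss(2.0, 0.0, x, y))  # small RSS: a good fit
print("RSS for m=1, b=1:", rss(1.0, 1.0, x, y))  # larger RSS: a worse fit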


Math Behind the Magic

Using calculus, we can find the optimal slope \( m \) and intercept \( b \) with these formulas:

\[ m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

\[ b = \bar{y} - m \bar{x} \]

Where \( \bar{x} \) and \( \bar{y} \) are the means of \( x \) and \( y \).


Numerical Example

Let’s predict ice cream sales (\( y \)) based on temperature (\( x \)):

| \( x \) (Temperature in °C) | \( y \) (Sales in $) |
|---|---|
| 20 | 200 |
| 25 | 250 |
| 30 | 300 |
  1. Compute \( \bar{x} = 25 \), \( \bar{y} = 250 \).
  2. Compute \( m \): \[ m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{(20-25)(200-250) + (25-25)(250-250) + (30-25)(300-250)}{(20-25)^2 + (25-25)^2 + (30-25)^2} = \frac{250 + 0 + 250}{25 + 0 + 25} = \frac{500}{50} = 10 \]
  3. Compute \( b \): \[ b = \bar{y} - m \bar{x} = 250 - 10(25) = 0 \]

The best-fit line is:

\[ y = 10x \]
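
A quick sanity check: plugging the data back in gives \( 10 \times 20 = 200 \), \( 10 \times 25 = 250 \), and \( 10 \times 30 = 300 \), matching the observed sales exactly. These three points happen to be perfectly collinear, so the RSS of this line is 0.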

Real-Life Analogy

Imagine you’re throwing spaghetti at the wall to see if it sticks 🍝. Linear regression is like finding the perfect angle and throw strength so the spaghetti lands as close to the same spot as possible every time. 🎯


AI Applications of Linear Regression

  1. Predictive Modeling:
    • Predict house prices, stock trends, or exam scores.
  2. Feature Engineering:
    • Use linear regression to identify relationships between features.
  3. Early ML Models:
    • Before neural networks, linear regression was the MVP.

Code Example: Linear Regression in Python

Here’s how to implement least squares regression using NumPy:

import numpy as np

# Data
x = np.array([20, 25, 30])  # Temperature
y = np.array([200, 250, 300])  # Sales

# Compute means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Compute slope (m) and intercept (b)
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean)**2)
b = y_mean - m * x_mean

print("Slope (m):", m)
print("Intercept (b):", b)

# Predict sales for 28°C
temperature = 28
sales_prediction = m * temperature + b
print(f"Predicted sales at {temperature}°C: ${sales_prediction}")

Mermaid.js Diagram: Least Squares Flow

graph TD
    DataPoints["Data Points (x, y)"] --> ComputeRSS["Compute Residual Sum of Squares (RSS)"]
    ComputeRSS --> MinimizeRSS[Find m, b to Minimize RSS]
    MinimizeRSS --> BestFitLine[Best Fit Line y = mx + b]

2. Loss Functions: AI’s Toughest Critics 🍝💔

What is a Loss Function?

A loss function measures how far off our predictions (\( \hat{y} \)) are from the actual values (\( y \)). In essence, it gives us a single number that represents the “badness” of our model. Our goal? Minimize the loss and make Gordon Ramsay say, “Finally, something cooked properly!” 👨‍🍳✨


Common Loss Functions in Linear Regression

1. Mean Squared Error (MSE): The Classic Critic

MSE is the gold standard for regression tasks. It calculates the average of the squared differences between predictions (\( \hat{y}_i \)) and actual values (\( y_i \)).

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 \]

Why squared? To penalize big mistakes more than small ones (because nobody likes a soggy spaghetti landing five feet off the plate).


2. Mean Absolute Error (MAE): The Straight-Talker

MAE calculates the average of the absolute differences between predictions and actual values. No fancy squaring—just straight-up “How far off are you?”

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |\hat{y}_i - y_i| \]

MAE is like the friend who says, “I don’t care about details, just tell me how wrong you are.” 🗣️


3. Huber Loss: The Best of Both Worlds

Huber Loss combines the sensitivity of MSE with the robustness of MAE. It behaves like MSE for small residuals (quadratic) and like MAE for large ones (linear), so a single wild outlier can't blow up the loss. For small residuals, \( |y - \hat{y}| \le \delta \):

\[ L = \frac{1}{2}(y - \hat{y})^2 \]

For large residuals, \( |y - \hat{y}| > \delta \):

\[ L = \delta |y - \hat{y}| - \frac{1}{2}\delta^2 \]

Think of Huber Loss as the judge who’s strict but fair—like your math teacher who gave bonus points for good handwriting. ✏️✨


Example: Predicting Ice Cream Sales

Let’s revisit our example where temperature (\( x \)) predicts ice cream sales (\( y \)):

| \( x \) (Temp in °C) | \( y \) (Actual Sales, $) | \( \hat{y} \) (Predicted Sales, $) |
|---|---|---|
| 20 | 200 | 190 |
| 25 | 250 | 260 |
| 30 | 300 | 290 |
  1. MSE:

    \[ \text{MSE} = \frac{1}{3} [(190-200)^2 + (260-250)^2 + (290-300)^2] = \frac{1}{3} [100 + 100 + 100] = 100 \]
  2. MAE:

    \[ \text{MAE} = \frac{1}{3} [|190-200| + |260-250| + |290-300|] = \frac{1}{3} [10 + 10 + 10] = 10 \]

MSE penalizes big mistakes more heavily, while MAE gives a simpler, less punishing score.
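
To see the outlier sensitivity concretely, compare two made-up error sets with the same average miss: errors of \( (10, 10, 10) \) give MAE \( = 10 \) and MSE \( = 100 \), while errors of \( (0, 0, 30) \) also give MAE \( = 10 \) but MSE \( = \frac{900}{3} = 300 \). One large miss triples the MSE even though the mean absolute error is unchanged.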


Why Loss Functions Matter

Loss functions aren’t just for regression—they’re the heart of every machine learning algorithm. Whether it’s classifying cats and dogs 🐾 or translating languages 🗺️, every model relies on loss functions to know if it’s improving or totally flopping.


Code Example: Calculating Loss Functions

Let’s calculate MSE and MAE using Python:

import numpy as np

# Actual and predicted values
y = np.array([200, 250, 300])  # Actual sales
y_pred = np.array([190, 260, 290])  # Predicted sales

# Mean Squared Error
mse = np.mean((y - y_pred)**2)
print("Mean Squared Error (MSE):", mse)

# Mean Absolute Error
mae = np.mean(np.abs(y - y_pred))
print("Mean Absolute Error (MAE):", mae)

Fun Analogy

Imagine a spaghetti-throwing competition 🍝:

  • MSE: Freaks out if you miss by a mile and shames you on national TV.
  • MAE: Just casually says, “Bro, you missed by 5 inches.”
  • Huber Loss: The chill judge who says, “Eh, 5 inches is fine, but let’s not go crazy with 5 feet.”

Mermaid.js Diagram: Loss Function Flow

graph TD
    Predictions[Model Predictions] --> CalculateError[Calculate Error]
    CalculateError --> MSE[Mean Squared Error]
    CalculateError --> MAE[Mean Absolute Error]
    CalculateError --> HuberLoss[Huber Loss]
    MSE --> UpdateModel[Update Model Parameters]
    MAE --> UpdateModel
    HuberLoss --> UpdateModel

3. Overfitting and Regularization – Teaching Your Model to Chill Out 😎

What is Overfitting?

Imagine you’re at a karaoke night 🎤. Overfitting is like memorizing every lyric and vocal inflection from one singer. Sure, you sound great singing that one song, but the moment someone hands you a new tune, you’re completely lost. 😅

In AI, overfitting happens when a model learns the noise in the training data instead of the general pattern. It performs brilliantly on training data but flops when faced with new data.

Example:

You train a model to predict house prices, and it memorizes your specific dataset:

  • Training data: “If the house has 3 bathrooms, the price is always $300,000.”
  • Test data: “Here’s a house with 3 bathrooms in another city.” Model response: “Uhhh… $300,000?” 🚨

What Causes Overfitting?

  1. Too Complex Models: When your model has too many parameters (like trying to fit a wavy roller coaster to a straight road; see the sketch after this list).
  2. Too Little Data: Less data = easier to memorize instead of generalize.
  3. Training for Too Long: The model keeps fine-tuning itself to the quirks of the training set.
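
Here's a quick sketch of cause #1 in action, using made-up data where the true relationship is roughly linear. A straight line and an overly flexible degree-5 polynomial are both fit with np.polyfit; the flexible model typically hugs the training points more tightly but does worse on held-out points (exact numbers depend on the random seed):

import numpy as np

# Made-up data: roughly linear with noise
rng = np.random.default_rng(0)
x_train = np.linspace(0, 10, 8)
y_train = 3 * x_train + rng.normal(0, 2, size=x_train.shape)
x_test = np.linspace(0.5, 9.5, 8)
y_test = 3 * x_test + rng.normal(0, 2, size=x_test.shape)

line = np.polyfit(x_train, y_train, 1)    # simple model
wiggly = np.polyfit(x_train, y_train, 5)  # overly complex model

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

print("Line:     train MSE =", mse(line, x_train, y_train), " test MSE =", mse(line, x_test, y_test))
print("Degree-5: train MSE =", mse(wiggly, x_train, y_train), " test MSE =", mse(wiggly, x_test, y_test))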

Regularization to the Rescue

What is Regularization?

Regularization is like a pair of blinders for your model—it prevents it from focusing too much on the quirks of the training data and forces it to simplify its learning.


Types of Regularization

1. L1 Regularization (Lasso): Simplicity Rules

L1 regularization adds a penalty proportional to the absolute values of the model’s weights:

\[ \text{Penalty} = \lambda \sum_{i} |w_i| \]

Effect: Encourages some weights to become exactly zero, simplifying the model (like Marie Kondo cleaning out your cluttered closet 🧹).

2. L2 Regularization (Ridge): Keep It Balanced

L2 regularization adds a penalty proportional to the square of the model’s weights:

\[ \text{Penalty} = \lambda \sum_{i} w_i^2 \]

Effect: Shrinks weights toward zero but doesn’t eliminate them. It’s like a model on a healthy diet—it trims the excess but doesn’t starve itself. 🥗


The New Loss Function

Regularization modifies the loss function by adding a penalty for large weights:

\[ \text{Regularized Loss} = \text{Original Loss} + \lambda \cdot \text{Penalty} \]

For example:

  • With L2 regularization: \[ \text{Loss} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2 + \lambda \sum_{j} w_j^2 \]

Here, \( \lambda \) is the regularization strength:

  • Small \( \lambda \): Model is flexible but prone to overfitting.
  • Large \( \lambda \): Model is rigid; it avoids overfitting but can underfit (see the sketch after this list).
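
To make the trade-off concrete, here is a tiny sketch that just evaluates the L2-regularized loss formula for a few values of \( \lambda \), using the earlier sales predictions and a made-up weight vector (this only illustrates the formula, not a full training loop):

import numpy as np

y = np.array([200.0, 250.0, 300.0])       # actual sales
y_pred = np.array([190.0, 260.0, 290.0])  # predicted sales
w = np.array([10.0, 0.5])                 # made-up model weights

mse = np.mean((y_pred - y) ** 2)
for lam in [0.0, 0.1, 1.0, 10.0]:
    regularized_loss = mse + lam * np.sum(w ** 2)
    print(f"lambda = {lam}: regularized loss = {regularized_loss}")

Larger \( \lambda \) values add a bigger penalty for the same weights, which is what pushes the optimizer toward smaller (simpler) weights during training.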

Example: Predicting House Prices

| Feature (Bathrooms) | Target (Price, $) | Prediction Without Regularization ($) | Prediction With Regularization ($) |
|---|---|---|---|
| 1 | 100,000 | 99,000 | 101,000 |
| 2 | 200,000 | 205,000 | 198,000 |
| 3 | 300,000 | 300,000 | 299,500 |

Without regularization, the model hugs the training data. With regularization, it smooths out the predictions.


Code Example: Regularization with Scikit-Learn

Let’s compare L1 and L2 regularization:

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])  # Number of bathrooms
y = np.array([100, 200, 300, 400, 500])  # Prices in $1000s

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression (No Regularization)
lr = LinearRegression()
lr.fit(X_train, y_train)

# Lasso Regression (L1)
lasso = Lasso(alpha=0.1)  # Lambda = 0.1
lasso.fit(X_train, y_train)

# Ridge Regression (L2)
ridge = Ridge(alpha=0.1)  # Lambda = 0.1
ridge.fit(X_train, y_train)

# Predictions
print("Linear Regression Prediction:", lr.predict(X_test))
print("Lasso Prediction:", lasso.predict(X_test))
print("Ridge Prediction:", ridge.predict(X_test))

Fun Analogy

Overfitting is like a try-hard student 🧑‍🎓 who memorizes every word in a textbook. Regularization steps in like a wise teacher, saying, “Stop cramming every detail and focus on the big picture!” 💡


Mermaid.js Diagram: Regularization Flow

graph TD
    TrainingData[Training Data] --> ModelTraining[Train Model]
    ModelTraining --> LossFunction[Compute Loss]
    LossFunction --> Regularization[Add Regularization Term]
    Regularization --> UpdateWeights[Update Weights]
    UpdateWeights --> BetterGeneralization[Better Generalization]

Why Regularization Matters in AI

  1. Prevents Overfitting:
    • Keeps models from obsessing over training data.
  2. Feature Selection:
    • L1 regularization helps eliminate irrelevant features (see the sketch below).
  3. Smooth Predictions:
    • Ensures models behave predictably with new data.
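
As a quick illustration of point 2, here is a minimal sketch with synthetic data where one feature is pure noise. Lasso (L1) typically drives the noise feature's coefficient to, or very near, zero while keeping the useful one (exact values depend on the random seed and alpha):

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: price depends on bathrooms; the second feature is pure noise
rng = np.random.default_rng(42)
bathrooms = np.arange(1, 21)
noise_feature = rng.normal(size=20)
X = np.column_stack([bathrooms, noise_feature])
y = 100 * bathrooms + rng.normal(0, 5, size=20)  # prices in $1000s

lasso = Lasso(alpha=5.0)
lasso.fit(X, y)
print("Lasso coefficients:", lasso.coef_)  # noise feature's weight is typically ~0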