UNIT IV- IDS

 

UNIT IV

Model Development: Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.

MODEL DEVELOPMENT

Model development is the process of creating a mathematical model that describes the relationship between input variables (features) and an output variable (target).
In data science, regression models are commonly used when the target variable is continuous.

REGRESSION

Regression is a statistical and data science method used to understand and predict the relationship between variables.

Simple Definition

Regression means finding a relationship between a dependent variable (output) and one or more independent variables (inputs) so we can predict values or analyze how changes in inputs affect the output.

Example

If you want to predict exam scores based on hours studied:

  • Hours studied → independent variable
  • Exam score → dependent variable

Regression helps estimate how exam scores change as study hours increase.

Regression is mainly used for:

  • Prediction (e.g., house prices, sales, temperature)
  • Trend analysis
  • Understanding relationships between variables

Types of Regression

  • Simple Linear Regression – one input variable
  • Multiple Linear Regression – multiple input variables
  • Polynomial Regression – curved relationships
  • Logistic Regression – used for classification (yes/no outcomes)

Regression fits a best-fit line or curve through data points to minimize error and make accurate predictions.

Supervised Learning: -

Supervised Machine Learning is an ML technique in which models are trained on labeled data, i.e., the output variable is provided for each example. The model learns a mapping function from the input variables to the output variable (the labels). Regression and Classification problems are both part of Supervised Machine Learning.

Regression: -

Regression Analysis in Machine learning

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

 We can understand the concept of regression analysis using the below example:




Example: Suppose a marketing company A runs various advertisements every year and earns sales from them. Suppose we have a record of the amount spent on advertisement in each of the last 5 years and the corresponding sales.

Now, the company plans to spend $200 on advertisement in the year 2023 and wants a prediction of the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique that helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining the cause-and-effect relationship between variables.

In regression, we fit a line or curve that best matches the given data points; using this fit, the machine learning model can make predictions about the data.

The distance between the data points and the line indicates whether the model has captured a strong relationship.

Some examples of regression can be as:

o   Prediction of rain using temperature and other factors

o   Determining Market trends

o   Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

  • Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent variable. It is also called target variable.
  • Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent variables are called independent variable, also called as a predictor.
  • Outliers: An outlier is an observation with either a very low or very high value compared to the other observed values. Outliers can distort the result, so they should be handled carefully.
  • Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking which variable most affects the target.
  • Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such problem is called underfitting.

 Why do we use Regression Analysis?

Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales, and marketing trends; for such cases we need a technique that can make these predictions accurately.

·    Regression estimates the relationship between the target and the independent variable.

·    It is used to find the trends in data.

·    It helps to predict real/continuous values.

·    By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

 Types of Regression

o   Linear Regression

o   Logistic Regression

o   Polynomial Regression

o   Support Vector Regression

o   Decision Tree Regression

o   Random Forest Regression

o   Ridge Regression

o   Lasso Regression

Linear Regression:

Linear regression is a statistical regression method which is used for predictive analysis.

It is one of the simplest and easiest regression algorithms, and it models the relationship between continuous variables.

It is used for solving the regression problem in machine learning.

In the simplest words, Linear Regression is a supervised machine learning model that finds the best-fit straight line between the independent and dependent variables, i.e., it finds the linear relationship between them.

If there is only one input variable (x), then such linear regression is called simple linear regression.

And if there is more than one input variable, then such linear regression is called multiple linear regression.




Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. For example, we might predict the salary of an employee on the basis of years of experience.

 Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable), X = independent variable (predictor variable), and a and b are the linear coefficients.




Error is the difference between the actual value and Predicted value and the goal is to reduce this difference.


In a typical linear regression diagram:

  • x is the independent variable, plotted on the x-axis, and y is the dependent variable, plotted on the y-axis.
  • Black dots are the data points, i.e., the actual values.
  • b0 is the intercept (10 in this example) and b1 is the slope of the x variable.
  • The blue line is the best-fit line predicted by the model, i.e., the predicted values lie on the blue line.

The vertical distance between the data point and the regression line is known as error or residual. Each data point has one residual and the sum of all the differences is known as the Sum of Residuals/Errors.

 

Mathematical Approach:

Residual/Error = Actual value − Predicted value

Sum of Residuals/Errors = Σ(Actual − Predicted)

Sum of Squared Residuals/Errors = Σ(Actual − Predicted)²

Linear regression finds the line that minimizes this sum of squared errors.
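As a concrete illustration, here is a minimal Python sketch of these quantities using NumPy; the actual and predicted values are hypothetical, chosen only for demonstration:

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative only)
y_actual = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y_predicted = np.array([11.0, 12.5, 14.0, 17.5, 21.0])

residuals = y_actual - y_predicted        # Residual = Actual - Predicted
sum_of_residuals = residuals.sum()        # Sum of Residuals/Errors
sse = (residuals ** 2).sum()              # Sum of Squared Residuals/Errors

print("Residuals:", residuals)
print("Sum of residuals:", sum_of_residuals)
print("SSE:", sse)
```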

Some popular applications of linear regression are:

o   Analyzing trends and sales estimates

o   Salary forecasting

o   Real estate prediction

o   Arriving at ETAs in traffic.

 

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

  • Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
  • Multiple Linear regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Equation of Simple Linear Regression, where b0 is the intercept, b1 is the coefficient or slope, x is the independent variable and y is the dependent variable:

y = b0 + b1x

Equation of Multiple Linear Regression, where b0 is the intercept, b1, b2, b3, b4, …, bn are the coefficients or slopes of the independent variables x1, x2, x3, x4, …, xn and y is the dependent variable:

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + … + bnxn

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

  • Positive Linear Relationship:

If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then such a relationship is termed as a Positive linear relationship.

  • Negative Linear Relationship:

If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.


Finding the best fit line: -

When working with linear regression, our main goal is to find the best fit line that means the error between predicted values and actual values should be minimized. The best fit line will have the least error.

Simple Linear Regression: -

Simple Linear Regression is a type of Regression algorithms that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.

Simple Linear regression algorithm has mainly two objectives:

  • Model the relationship between the two variables. Such as the relationship between Income and expenditure, experience and Salary, etc.
  • Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a company according to the investments in a year, etc.

 

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y = a + bx + ε

Where,

  • a = the intercept of the regression line (obtained by putting x = 0)
  • b = the slope of the regression line, which tells whether the line is increasing or decreasing
  • ε = the error term (for a good model it will be negligible)

We can find the values of a and b using the least-squares formulas:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

a = ȳ − b·x̄

where x̄ and ȳ are the means of x and y.
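Below is a minimal NumPy sketch of these least-squares formulas, assuming a small made-up dataset of hours studied (x) versus exam score (y); the numbers are illustrative only:

```python
import numpy as np

# Illustrative data: hours studied (x) vs. exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 71.0, 78.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates from the formulas above
b = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()  # slope
a = y_mean - b * x_mean                                              # intercept

y_pred = a + b * x   # predictions on the fitted regression line
print(f"Fitted model: y = {a:.2f} + {b:.2f}x")
```

The same coefficients can also be obtained with a library such as scikit-learn's LinearRegression; the manual version above simply makes the formulas explicit.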
 

Multiple Linear Regression: -

In Simple Linear Regression, a single independent/predictor variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it takes more than one predictor variable to predict the response variable. We can define it as:

Multiple Linear Regression is an important regression algorithm that models the linear relationship between a single continuous dependent variable and more than one independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Assumptions of multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression:

  1. Linearity: The relationship between the dependent and independent variables should be linear.
  2. Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (independent variables) are not independent of each other.
  3. Multivariate Normality: Multiple regression assumes that the residuals are normally distributed.

 

Multiple linear regression formula

The formula for a multiple linear regression is:

y = b0 + b1x1 + b2x2 + … + bnxn + ε

where b0 is the intercept, b1, …, bn are the regression coefficients of the independent variables x1, …, xn, and ε is the error term.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

1. How strong the relationship is between two or more independent variables and one dependent variable (e.g., how rainfall, temperature, and amount of fertilizer added affect crop growth).

2. The value of the dependent variable at a certain value of the independent variables (e.g., the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Multiple linear regression example

You are a public health researcher interested in social factors that influence heart disease. You survey 500 towns and gather data on the percentage of people in each town who smoke, the percentage of people in each town who bike to work, and the percentage of people in each town who have heart disease.

Because you have two independent variables and one dependent variable, and all your variables are quantitative, you can use multiple linear regression to analyze the relationship between them.
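A minimal sketch of such an analysis with scikit-learn's LinearRegression is shown below. The town-level percentages are invented for illustration; a real study of 500 towns would load the survey data from a file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical town-level data: [% smokers, % biking to work] -> % heart disease
X = np.array([[30, 5], [25, 10], [20, 20], [15, 30], [10, 40]], dtype=float)
y = np.array([18.0, 15.0, 11.0, 8.0, 5.0])

model = LinearRegression().fit(X, y)
print("Intercept b0:", model.intercept_)
print("Coefficients b1, b2:", model.coef_)

# Predicted heart-disease rate for a town with 22% smokers and 15% biking
print("Prediction:", model.predict([[22, 15]]))
```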

Difference between Simple Linear Regression and Multi Linear Regression

Simple Linear Regression

Simple Linear Regression establishes the relationship between two variables using a straight line. It attempts to draw a line that comes closest to the data by finding the slope and intercept which define the line and minimize regression errors. Simple linear regression has only one x and one y variable.

Multiple Linear Regression

Multiple Linear Regression is based on the assumption that there is a linear relationship between the dependent (target) variable and the independent (predictor) variables. It also assumes that there is no major correlation between the independent variables. It has one y variable and two or more x variables, i.e., one dependent variable and two or more independent variables.

4.2) Model Evaluation using Visualization in Data Science

1. Introduction

Model Evaluation is the process of assessing how well a machine learning or statistical model performs on given data.
Visualization-based evaluation uses graphical techniques to analyze errors, patterns, bias, variance, and goodness of fit instead of relying only on numerical metrics.

Visualization helps to:

  • Understand model behavior

  • Detect overfitting or underfitting

  • Identify outliers and residual patterns

  • Compare multiple models effectively


2. Importance of Visualization in Model Evaluation

  • Makes complex results easy to interpret

  • Reveals patterns hidden behind metrics

  • Helps in model selection and tuning

  • Improves communication with non-technical stakeholders


3. Common Visualization Techniques for Model Evaluation


3.1 Residual Plot

Definition

A Residual Plot shows the difference between actual values and predicted values.

Residual = y_actual − y_predicted

Plot Description

  • X-axis: Independent variable or predicted values

  • Y-axis: Residuals

Interpretation

  • Random scatter around zero → Good model

  • Curve pattern → Non-linear relationship

  • Funnel shape → Heteroscedasticity

  • Extreme points → Outliers

Uses

  • Detects non-linearity

  • Identifies overfitting and underfitting

  • Checks assumptions of regression models
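A quick way to draw such a residual plot is seaborn's residplot, which fits a simple regression internally and plots the residuals against x. The sketch below uses synthetic linear data with noise, for illustration only:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)   # synthetic linear data with noise

sns.residplot(x=x, y=y)                 # residuals of an internal linear fit
plt.axhline(0, color="red", linestyle="--")   # zero line (perfect prediction)
plt.xlabel("x")
plt.ylabel("Residuals")
plt.show()   # random scatter around zero -> the linear model is appropriate
```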


3.2 Distribution Plot (Error Distribution)

Definition

A Distribution Plot visualizes the distribution of prediction errors or residuals.

Plot Description

  • X-axis: Residual values

  • Y-axis: Frequency or density

Interpretation

  • Symmetric, bell-shaped distribution → Well-performing model

  • Skewed distribution → Model bias

  • Wide spread → High variance

Uses

  • Checks normality of errors

  • Evaluates prediction stability

  • Helps compare models


3.3 Actual vs Predicted Plot

Definition

Compares actual output values with predicted values.

Plot Description

  • X-axis: Actual values

  • Y-axis: Predicted values

Interpretation

  • Points near diagonal line → Accurate predictions

  • Systematic deviation → Model bias

  • Large scatter → Poor prediction quality

Uses

  • Measures goodness of fit visually

  • Highlights prediction errors
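A minimal matplotlib sketch of an actual vs. predicted plot, using hypothetical values; the dashed diagonal marks perfect predictions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical actual and model-predicted values (illustrative only)
y_actual = np.array([3.0, 5.0, 7.5, 9.0, 12.0, 14.5])
y_pred = np.array([3.4, 4.6, 7.9, 8.5, 12.8, 13.9])

plt.scatter(y_actual, y_pred)
lims = [y_actual.min(), y_actual.max()]
plt.plot(lims, lims, "r--", label="Perfect prediction (y = x)")
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.legend()
plt.show()   # points hugging the diagonal indicate accurate predictions
```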


3.4 Learning Curve

Definition

Shows model performance with respect to training set size.

Plot Description

  • X-axis: Number of training samples

  • Y-axis: Error or accuracy

  • Two curves: Training error and validation error

Interpretation

  • High bias → Both errors high

  • High variance → Large gap between curves

  • Optimal model → Errors converge at low value

Uses

  • Detects overfitting and underfitting

  • Helps decide if more data is needed
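Scikit-learn's learning_curve utility produces the data for such a plot. The sketch below uses synthetic regression data and plots the mean R² score, so higher is better (unlike an error curve):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel() + rng.normal(0, 1, 200)   # synthetic linear data

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="Training score")
plt.plot(sizes, val_scores.mean(axis=1), label="Validation score")
plt.xlabel("Number of training samples")
plt.ylabel("R² score")
plt.legend()
plt.show()   # curves converging at a high score -> neither high bias nor high variance
```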


3.5 ROC Curve (for Classification Models)

Definition

The Receiver Operating Characteristic (ROC) curve plots:

  • True Positive Rate (TPR)

  • False Positive Rate (FPR)

Interpretation

  • Curve close to top-left → Better classifier

  • Diagonal line → Random classifier

  • AUC (Area Under Curve) close to 1 → Excellent model

Uses

  • Evaluates classification models

  • Compares multiple classifiers
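A minimal ROC sketch using scikit-learn's roc_curve and roc_auc_score on a synthetic binary classification dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()   # a curve bulging toward the top-left means a better classifier
```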


3.6 Precision–Recall Curve

Definition

Plots Precision vs Recall, especially useful for imbalanced datasets.

Interpretation

  • High precision and recall → Good model

  • Sharp drop → Poor generalization

Uses

  • Fraud detection

  • Medical diagnosis

  • Spam filtering
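A minimal sketch with scikit-learn's precision_recall_curve on a synthetic imbalanced dataset (roughly 10% positives), the setting where this curve is most informative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()   # a curve staying high across recall values indicates a good model
```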


4. Advantages of Visualization-Based Evaluation

✔ Easy interpretation
✔ Detects hidden errors
✔ Better model comparison
✔ Enhances explainability


5. Limitations

❌ Subjective interpretation
❌ Not sufficient alone (needs metrics)
❌ May be misleading without domain knowledge


6. Applications

  • Regression model evaluation

  • Classification performance analysis

  • Model selection and hyperparameter tuning

  • Business analytics and forecasting


7. Conclusion

Model Evaluation using Visualization is a powerful approach that complements statistical metrics. It provides deeper insights into model performance, error behavior, and reliability, making it an essential step in the data science workflow.


4.4) Residual Plot – Model Evaluation using Visualization

1. Introduction 

In data science and machine learning, building a predictive model is not sufficient unless its performance is properly evaluated. Model evaluation using visualization provides deep insights into model behavior beyond numerical metrics.
One of the most important visualization techniques for evaluating regression models is the Residual Plot.

A Residual Plot helps in analyzing prediction errors and checking whether the assumptions of the regression model are satisfied.


2. Definition of Residual Plot 

A Residual Plot is a scatter plot that displays the residuals (errors) of a model against the independent variable or predicted values.

Residual Formula:

Residual = y_actual − y_predicted

Where:

  • y_actual = Actual observed value

  • y_predicted = Value predicted by the model


3. Purpose of Residual Plot 

The residual plot is used to:

  • Evaluate the goodness of fit of a regression model

  • Detect non-linearity in data

  • Identify overfitting or underfitting

  • Detect outliers

  • Verify assumptions of linear regression


4. Axes Description

A residual plot consists of:

  • X-axis:

    • Independent variable (X) or

    • Predicted values (y_predicted)

  • Y-axis:

    • Residuals (errors)


5. Residual Plot Diagram 

Neat Labeled Diagram (Exam-Oriented)

Residuals
 +5│        ●              ●
 +3│    ●          ●
 +1│
  0│────────────────────────── Zero Line (Residual = 0)
 -1│      ●            ●
 -3│   ●          ●
 -5│         ●
   └──────────────────────────→ Predicted Values

Zero Line:
The horizontal line at residual = 0 represents perfect prediction.


6. Interpretation of Residual Plot 

(a) Random Scatter Around Zero – Good Model

  • Residuals are randomly distributed

  • No visible pattern

  • Indicates linear relationship

  • Model fits data well

Conclusion: Model is appropriate


(b) Curved or Systematic Pattern – Non-Linearity

  • Residuals form a curve or pattern

  • Indicates missing non-linear relationship

Conclusion: Linear model is unsuitable


(c) Funnel Shape – Heteroscedasticity

  • Residuals increase or decrease with X

  • Error variance is not constant

Conclusion: Model assumptions violated


(d) Large Isolated Points – Outliers

  • Points far from zero line

  • Indicates unusual or noisy data

Conclusion: Outliers affect model accuracy


(e) Clustering of Residuals – Overfitting / Underfitting

  • Too tight → Overfitting

  • Too spread → Underfitting
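To see one of these patterns concretely, the sketch below deliberately fits a straight line to data that is truly quadratic; the residual plot then shows the curved (U-shaped) pattern of case (b), signaling missed non-linearity. The data is synthetic, for illustration only:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 150)
y = 0.5 * x**2 + rng.normal(0, 1, 150)   # the true relationship is quadratic

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)     # deliberately misspecified linear fit
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red", linestyle="--", label="Zero line")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.legend()
plt.show()   # expect a U-shaped (curved) pattern -> non-linearity was missed
```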




7. Advantages of Residual Plot

  • Simple and intuitive visualization

  • Detects hidden patterns in errors

  • Helps validate regression assumptions


8. Limitations of Residual Plot 

  • Subjective interpretation

  • Less effective for very large datasets


9. Applications 

  • Linear and polynomial regression evaluation

  • Time-series forecasting

  • Business analytics and prediction systems


10. Conclusion 

A Residual Plot is a powerful visualization tool used to evaluate regression models by analyzing prediction errors. It helps identify non-linearity, bias, outliers, and assumption violations, making it an essential step in the model evaluation process.

4.5) Distribution Plot in Data Science

1. Introduction 

In Data Science, building a predictive model is only one part of the analytical process. Evaluating how well the model performs is equally important. Model evaluation using visualization helps data scientists understand error behavior, bias, variance, and overall prediction quality.

A Distribution Plot is one of the most important visualization techniques used to analyze the distribution of data or model errors (residuals). It provides insights into the shape, spread, and symmetry of the data, which numerical metrics alone cannot reveal.


2. Definition of Distribution Plot 

A Distribution Plot is a graphical representation that shows how data values or residuals are distributed across a range of values.

In model evaluation, it is most commonly used to visualize the distribution of residuals:

Residual = y_actual − y_predicted

It helps determine whether the errors follow a normal (Gaussian) distribution, which is an important assumption in many statistical models.


3. Purpose of Distribution Plot 

A Distribution Plot is used to:

  • Analyze prediction errors

  • Detect model bias

  • Identify skewness in data

  • Check normality of residuals

  • Compare performance of multiple models

  • Detect outliers and noise


4. Types of Distribution Plots 

4.1 Histogram

  • Displays frequency of values in bins

  • Simple and widely used

  • Sensitive to bin size

4.2 Density Plot (KDE – Kernel Density Estimation)

  • Smooth continuous curve

  • Shows probability density

  • Better than histogram for understanding shape

4.3 Combined Histogram + Density Plot

  • Histogram shows frequency

  • Density curve shows smooth distribution

  • Most commonly used in data science


5. Axes Description 

  • X-axis:

    • Data values or residual values

  • Y-axis:

    • Frequency (Histogram)

    • Probability Density (Density Plot)


6. Distribution Plot Diagram 

Neat Exam-Oriented Diagram (Draw This)

Frequency / Density
 ↑
8│              ╭──────╮
7│           ╭─╯      ╰─╮
6│         ╭─╯          ╰─╮
5│       ╭─╯              ╰─╮
4│     ╭─╯                  ╰─╮
3│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
2│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
1│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
 └────────────────────────────────→ Residual Values
      -3   -2   -1    0    1    2    3

Legend:
▇ → Histogram bars (frequency)
╭─╯ ╰─╮ → Smooth density curve
Center at 0 → Errors balanced, model unbiased

Explanation:

  • Bars → Histogram (frequency of residuals)

  • Curve → Density plot

  • Centered around zero → Unbiased model
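A combined histogram + density plot like the one sketched above can be produced with seaborn's histplot. The residuals here are synthetic, drawn from a normal distribution to mimic a well-behaved, unbiased model:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 500)   # hypothetical residuals from a good model

sns.histplot(residuals, kde=True, bins=30)    # histogram bars + KDE curve
plt.axvline(0, color="red", linestyle="--")   # centered at 0 -> unbiased model
plt.xlabel("Residual values")
plt.ylabel("Frequency")
plt.show()
```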


7. Interpretation of Distribution Plot 

(a) Symmetric Bell-Shaped Distribution (Good Model)

  • Residuals centered around zero

  • Follows normal distribution

  • Low bias and stable predictions

Conclusion: Model is well-trained


(b) Left-Skewed Distribution

  • More residuals on negative side

  • Model over-predicts

Conclusion: Systematic bias exists


(c) Right-Skewed Distribution

  • More residuals on positive side

  • Model under-predicts

Conclusion: Needs recalibration


(d) Wide Distribution

  • Large spread of residuals

  • High variance

  • Unstable predictions

Conclusion: Model may be overfitting


(e) Multiple Peaks (Multimodal)

  • Multiple data groups

  • Indicates hidden patterns or mixed populations

Conclusion: Data segmentation required


(f) Heavy Tails

  • Extreme residual values

  • Presence of outliers

Conclusion: Data cleaning required


8. Role in Model Evaluation 

Distribution plots help to:

  • Validate regression assumptions

  • Compare multiple models visually

  • Improve feature engineering

  • Decide model selection and tuning


9. Advantages of Distribution Plot 

  • Simple and intuitive

  • Reveals bias and variance

  • Helps detect outliers

  • Complements numerical metrics


10. Limitations of Distribution Plot 

  • Interpretation can be subjective

  • Bin size affects histogram

  • Not sufficient alone for evaluation


11. Applications 

  • Regression model evaluation

  • Error analysis in machine learning

  • Financial forecasting

  • Quality control systems

  • Business analytics

  • Risk assessment


12. Comparison with Residual Plot 

| Residual Plot         | Distribution Plot          |
|-----------------------|----------------------------|
| Shows error pattern   | Shows error distribution   |
| Detects non-linearity | Detects bias & skewness    |
| Scatter plot          | Histogram / Density plot   |

13. Conclusion 

A Distribution Plot is a powerful visualization tool in data science that helps analyze the distribution of data values or prediction errors. It provides valuable insights into bias, variance, normality, and model reliability. When used along with residual plots and numerical metrics, it ensures robust and trustworthy model evaluation.
