UNIT IV- IDS

 

UNIT IV

Model Development: Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.

MODEL DEVELOPMENT

Model development is the process of creating a mathematical model that describes the relationship between input variables (features) and an output variable (target).
In data science, regression models are commonly used when the target variable is continuous.

REGRESSION

Regression is a statistical and data science method used to understand and predict the relationship between variables.

Simple Definition

Regression means finding a relationship between a dependent variable (output) and one or more independent variables (inputs) so we can predict values or analyze how changes in inputs affect the output.

Example

If you want to predict exam scores based on hours studied:

  • Hours studied → independent variable
  • Exam score → dependent variable

Regression helps estimate how exam scores change as study hours increase.

Regression is mainly used for:

  • Prediction (e.g., house prices, sales, temperature)
  • Trend analysis
  • Understanding relationships between variables

Types of Regression

  • Simple Linear Regression – one input variable
  • Multiple Linear Regression – multiple input variables
  • Polynomial Regression – curved relationships
  • Logistic Regression – used for classification (yes/no outcomes)

Regression fits a best-fit line or curve through data points to minimize error and make accurate predictions.

Supervised Learning: -

Supervised Machine Learning is an ML technique in which models are trained on labeled data, i.e., the output variable is provided for each example. The model learns a mapping function from the input variables to the output variable (the labels). Regression and Classification problems are both part of Supervised Machine Learning.

Regression: -

Regression Analysis in Machine learning

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

 We can understand the concept of regression analysis using the below example:




Example: Suppose a marketing company A runs various advertisements every year and earns sales from them. Suppose we have a record of the amount spent on advertisement in each of the last 5 years and the corresponding sales.

Now, the company plans to spend $200 on advertisement in the year 2023 and wants a prediction of the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique that helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining the cause-and-effect relationship between variables.

In regression, we fit a line or curve that best matches the given data points; using this fit, the machine learning model can make predictions about the data.

The distance between the data points and the line indicates whether the model has captured a strong relationship.

Some examples of regression can be as:

o   Prediction of rain using temperature and other factors

o   Determining Market trends

o   Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

  • Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent variable. It is also called target variable.
  • Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent variables are called independent variable, also called as a predictor.
  • Outliers: An outlier is an observation with either a very low or very high value compared to the other observed values. Outliers can distort the result, so they should be handled carefully.
  • Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking which variable most affects the target.
  • Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such problem is called underfitting.

 Why do we use Regression Analysis?

Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales, and marketing trends; for such cases we need a technique that can make these predictions accurately.

·    Regression estimates the relationship between the target and the independent variable.

·    It is used to find the trends in data.

·    It helps to predict real/continuous values.

·    By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

 Types of Regression

o   Linear Regression

o   Logistic Regression

o   Polynomial Regression

o   Support Vector Regression

o   Decision Tree Regression

o   Random Forest Regression

o   Ridge Regression

o   Lasso Regression

Linear Regression:

Linear regression is a statistical regression method which is used for predictive analysis.

It is one of the simplest and easiest regression algorithms, and it models the relationship between continuous variables.

It is used for solving the regression problem in machine learning.

In the simplest words, Linear Regression is a supervised machine learning model that finds the best-fit straight line between the independent and dependent variables, i.e., it finds the linear relationship between them.

If there is only one input variable (x), then such linear regression is called simple linear regression.

And if there is more than one input variable, then such linear regression is called multiple linear regression.




Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. For example, we might predict the salary of an employee on the basis of years of experience.

 Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable), X = independent variable (predictor variable), and a and b are the linear coefficients.




Error is the difference between the actual value and Predicted value and the goal is to reduce this difference.


In a typical linear regression diagram:

  • x is the independent variable, plotted on the x-axis, and y is the dependent variable, plotted on the y-axis.
  • Black dots are the data points, i.e., the actual values.
  • b0 is the intercept (10 in this example) and b1 is the slope of the x variable.
  • The blue line is the best-fit line predicted by the model, i.e., the predicted values lie on the blue line.

The vertical distance between the data point and the regression line is known as error or residual. Each data point has one residual and the sum of all the differences is known as the Sum of Residuals/Errors.

 

Mathematical Approach:

Residual/Error = Actual value − Predicted value

Sum of Residuals/Errors = Σ(Actual − Predicted)

Sum of Squared Residuals/Errors = Σ(Actual − Predicted)²

Linear regression finds the line that minimizes this sum of squared errors.
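As a concrete illustration, here is a minimal Python sketch of these quantities using NumPy; the actual and predicted values are hypothetical, chosen only for demonstration:

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative only)
y_actual = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y_predicted = np.array([11.0, 12.5, 14.0, 17.5, 21.0])

residuals = y_actual - y_predicted        # Residual = Actual - Predicted
sum_of_residuals = residuals.sum()        # Sum of Residuals/Errors
sse = (residuals ** 2).sum()              # Sum of Squared Residuals/Errors

print("Residuals:", residuals)
print("Sum of residuals:", sum_of_residuals)
print("SSE:", sse)
```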

Some popular applications of linear regression are:

o   Analyzing trends and sales estimates

o   Salary forecasting

o   Real estate prediction

o   Arriving at ETAs in traffic.

 

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

  • Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
  • Multiple Linear regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Equation of Simple Linear Regression, where b0 is the intercept, b1 is the coefficient or slope, x is the independent variable and y is the dependent variable:

y = b0 + b1x

Equation of Multiple Linear Regression, where b0 is the intercept, b1, b2, b3, b4, …, bn are the coefficients or slopes of the independent variables x1, x2, x3, x4, …, xn and y is the dependent variable:

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + … + bnxn

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

  • Positive Linear Relationship:

If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then such a relationship is termed as a Positive linear relationship.

  • Negative Linear Relationship:

If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.


Finding the best fit line: -

When working with linear regression, our main goal is to find the best fit line that means the error between predicted values and actual values should be minimized. The best fit line will have the least error.

Simple Linear Regression: -

Simple Linear Regression is a type of Regression algorithms that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.

Simple Linear regression algorithm has mainly two objectives:

  • Model the relationship between the two variables. Such as the relationship between Income and expenditure, experience and Salary, etc.
  • Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a company according to the investments in a year, etc.

 

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y = a + bx + ε

Where,

  • a = the intercept of the regression line (obtained by putting x = 0)
  • b = the slope of the regression line, which tells whether the line is increasing or decreasing
  • ε = the error term (for a good model it will be negligible)

We can find the values of a and b using the least-squares formulas:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

a = ȳ − b·x̄

where x̄ and ȳ are the means of x and y.
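Below is a minimal NumPy sketch of these least-squares formulas, assuming a small made-up dataset of hours studied (x) versus exam score (y); the numbers are illustrative only:

```python
import numpy as np

# Illustrative data: hours studied (x) vs. exam score (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 65.0, 71.0, 78.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates from the formulas above
b = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()  # slope
a = y_mean - b * x_mean                                              # intercept

y_pred = a + b * x   # predictions on the fitted regression line
print(f"Fitted model: y = {a:.2f} + {b:.2f}x")
```

The same coefficients can also be obtained with a library such as scikit-learn's LinearRegression; the manual version above simply makes the formulas explicit.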
 

Multiple Linear Regression: -

In Simple Linear Regression, a single independent/predictor variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it takes more than one predictor variable to predict the response variable. We can define it as:

Multiple Linear Regression is an important regression algorithm that models the linear relationship between a single continuous dependent variable and more than one independent variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Assumptions of multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression:

  1. Linearity: The relationship between the dependent and independent variables should be linear.
  2. Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (independent variables) are not independent of each other.
  3. Multivariate Normality: Multiple regression assumes that the residuals are normally distributed.

 

Multiple linear regression formula

The formula for a multiple linear regression is:

y = b0 + b1x1 + b2x2 + … + bnxn + ε

where b0 is the intercept, b1, …, bn are the regression coefficients of the independent variables x1, …, xn, and ε is the error term.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

1. How strong the relationship is between two or more independent variables and one dependent variable (e.g., how rainfall, temperature, and amount of fertilizer added affect crop growth).

2. The value of the dependent variable at a certain value of the independent variables (e.g., the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Multiple linear regression example

You are a public health researcher interested in social factors that influence heart disease. You survey 500 towns and gather data on the percentage of people in each town who smoke, the percentage of people in each town who bike to work, and the percentage of people in each town who have heart disease.

Because you have two independent variables and one dependent variable, and all your variables are quantitative, you can use multiple linear regression to analyze the relationship between them.
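A minimal sketch of such an analysis with scikit-learn's LinearRegression is shown below. The town-level percentages are invented for illustration; a real study of 500 towns would load the survey data from a file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical town-level data: [% smokers, % biking to work] -> % heart disease
X = np.array([[30, 5], [25, 10], [20, 20], [15, 30], [10, 40]], dtype=float)
y = np.array([18.0, 15.0, 11.0, 8.0, 5.0])

model = LinearRegression().fit(X, y)
print("Intercept b0:", model.intercept_)
print("Coefficients b1, b2:", model.coef_)

# Predicted heart-disease rate for a town with 22% smokers and 15% biking
print("Prediction:", model.predict([[22, 15]]))
```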

Difference between Simple Linear Regression and Multi Linear Regression

Simple Linear Regression

Simple Linear Regression establishes the relationship between two variables using a straight line. It attempts to draw a line that comes closest to the data by finding the slope and intercept which define the line and minimize regression errors. Simple linear regression has only one x and one y variable.

Multiple Linear Regression

Multiple Linear Regression is based on the assumption that there is a linear relationship between the dependent (target) variable and the independent (predictor) variables. It also assumes that there is no major correlation between the independent variables. It has one y variable and two or more x variables, i.e., one dependent variable and two or more independent variables.

4.2) Model Evaluation using Visualization in Data Science

1. Introduction

Model Evaluation is the process of assessing how well a machine learning or statistical model performs on given data.
Visualization-based evaluation uses graphical techniques to analyze errors, patterns, bias, variance, and goodness of fit instead of relying only on numerical metrics.

Visualization helps to:

  • Understand model behavior

  • Detect overfitting or underfitting

  • Identify outliers and residual patterns

  • Compare multiple models effectively


2. Importance of Visualization in Model Evaluation

  • Makes complex results easy to interpret

  • Reveals patterns hidden behind metrics

  • Helps in model selection and tuning

  • Improves communication with non-technical stakeholders


3. Common Visualization Techniques for Model Evaluation


3.1 Residual Plot

Definition

A Residual Plot shows the difference between actual values and predicted values.

Residual = y_actual − y_predicted

Plot Description

  • X-axis: Independent variable or predicted values

  • Y-axis: Residuals

Interpretation

  • Random scatter around zero → Good model

  • Curve pattern → Non-linear relationship

  • Funnel shape → Heteroscedasticity

  • Extreme points → Outliers

Uses

  • Detects non-linearity

  • Identifies overfitting and underfitting

  • Checks assumptions of regression models
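A quick way to draw such a residual plot is seaborn's residplot, which fits a simple regression internally and plots the residuals against x. The sketch below uses synthetic linear data with noise, for illustration only:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)   # synthetic linear data with noise

sns.residplot(x=x, y=y)                 # residuals of an internal linear fit
plt.axhline(0, color="red", linestyle="--")   # zero line (perfect prediction)
plt.xlabel("x")
plt.ylabel("Residuals")
plt.show()   # random scatter around zero -> the linear model is appropriate
```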


3.2 Distribution Plot (Error Distribution)

Definition

A Distribution Plot visualizes the distribution of prediction errors or residuals.

Plot Description

  • X-axis: Residual values

  • Y-axis: Frequency or density

Interpretation

  • Symmetric, bell-shaped distribution → Well-performing model

  • Skewed distribution → Model bias

  • Wide spread → High variance

Uses

  • Checks normality of errors

  • Evaluates prediction stability

  • Helps compare models


3.3 Actual vs Predicted Plot

Definition

Compares actual output values with predicted values.

Plot Description

  • X-axis: Actual values

  • Y-axis: Predicted values

Interpretation

  • Points near diagonal line → Accurate predictions

  • Systematic deviation → Model bias

  • Large scatter → Poor prediction quality

Uses

  • Measures goodness of fit visually

  • Highlights prediction errors
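A minimal matplotlib sketch of an actual vs. predicted plot, using hypothetical values; the dashed diagonal marks perfect predictions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical actual and model-predicted values (illustrative only)
y_actual = np.array([3.0, 5.0, 7.5, 9.0, 12.0, 14.5])
y_pred = np.array([3.4, 4.6, 7.9, 8.5, 12.8, 13.9])

plt.scatter(y_actual, y_pred)
lims = [y_actual.min(), y_actual.max()]
plt.plot(lims, lims, "r--", label="Perfect prediction (y = x)")
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.legend()
plt.show()   # points hugging the diagonal indicate accurate predictions
```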


3.4 Learning Curve

Definition

Shows model performance with respect to training set size.

Plot Description

  • X-axis: Number of training samples

  • Y-axis: Error or accuracy

  • Two curves: Training error and validation error

Interpretation

  • High bias → Both errors high

  • High variance → Large gap between curves

  • Optimal model → Errors converge at low value

Uses

  • Detects overfitting and underfitting

  • Helps decide if more data is needed
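Scikit-learn's learning_curve utility produces the data for such a plot. The sketch below uses synthetic regression data and plots the mean R² score, so higher is better (unlike an error curve):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel() + rng.normal(0, 1, 200)   # synthetic linear data

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="Training score")
plt.plot(sizes, val_scores.mean(axis=1), label="Validation score")
plt.xlabel("Number of training samples")
plt.ylabel("R² score")
plt.legend()
plt.show()   # curves converging at a high score -> neither high bias nor high variance
```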


3.5 ROC Curve (for Classification Models)

Definition

The Receiver Operating Characteristic (ROC) curve plots:

  • True Positive Rate (TPR)

  • False Positive Rate (FPR)

Interpretation

  • Curve close to top-left → Better classifier

  • Diagonal line → Random classifier

  • AUC (Area Under Curve) close to 1 → Excellent model

Uses

  • Evaluates classification models

  • Compares multiple classifiers
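A minimal ROC sketch using scikit-learn's roc_curve and roc_auc_score on a synthetic binary classification dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()   # a curve bulging toward the top-left means a better classifier
```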


3.6 Precision–Recall Curve

Definition

Plots Precision vs Recall, especially useful for imbalanced datasets.

Interpretation

  • High precision and recall → Good model

  • Sharp drop → Poor generalization

Uses

  • Fraud detection

  • Medical diagnosis

  • Spam filtering
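A minimal sketch with scikit-learn's precision_recall_curve on a synthetic imbalanced dataset (roughly 10% positives), the setting where this curve is most informative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()   # a curve staying high across recall values indicates a good model
```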


4. Advantages of Visualization-Based Evaluation

✔ Easy interpretation
✔ Detects hidden errors
✔ Better model comparison
✔ Enhances explainability


5. Limitations

❌ Subjective interpretation
❌ Not sufficient alone (needs metrics)
❌ May be misleading without domain knowledge


6. Applications

  • Regression model evaluation

  • Classification performance analysis

  • Model selection and hyperparameter tuning

  • Business analytics and forecasting


7. Conclusion

Model Evaluation using Visualization is a powerful approach that complements statistical metrics. It provides deeper insights into model performance, error behavior, and reliability, making it an essential step in the data science workflow.


4.4) Residual Plot – Model Evaluation using Visualization

1. Introduction 

In data science and machine learning, building a predictive model is not sufficient unless its performance is properly evaluated. Model evaluation using visualization provides deep insights into model behavior beyond numerical metrics.
One of the most important visualization techniques for evaluating regression models is the Residual Plot.

A Residual Plot helps in analyzing prediction errors and checking whether the assumptions of the regression model are satisfied.


2. Definition of Residual Plot 

A Residual Plot is a scatter plot that displays the residuals (errors) of a model against the independent variable or predicted values.

Residual Formula:

Residual = y_actual − y_predicted

Where:

  • y_actual = Actual observed value

  • y_predicted = Value predicted by the model


3. Purpose of Residual Plot 

The residual plot is used to:

  • Evaluate the goodness of fit of a regression model

  • Detect non-linearity in data

  • Identify overfitting or underfitting

  • Detect outliers

  • Verify assumptions of linear regression


4. Axes Description

A residual plot consists of:

  • X-axis:

    • Independent variable (X) or

    • Predicted values (y_predicted)

  • Y-axis:

    • Residuals (errors)


5. Residual Plot Diagram 

Neat Labeled Diagram (Exam-Oriented)

Residuals
 +5│        ●              ●
 +3│    ●          ●
 +1│
  0│────────────────────────── Zero Line (Residual = 0)
 -1│      ●            ●
 -3│   ●          ●
 -5│         ●
   └──────────────────────────→ Predicted Values

Zero Line:
The horizontal line at residual = 0 represents perfect prediction.


6. Interpretation of Residual Plot 

(a) Random Scatter Around Zero – Good Model

  • Residuals are randomly distributed

  • No visible pattern

  • Indicates linear relationship

  • Model fits data well

Conclusion: Model is appropriate


(b) Curved or Systematic Pattern – Non-Linearity

  • Residuals form a curve or pattern

  • Indicates missing non-linear relationship

Conclusion: Linear model is unsuitable


(c) Funnel Shape – Heteroscedasticity

  • Residuals increase or decrease with X

  • Error variance is not constant

Conclusion: Model assumptions violated


(d) Large Isolated Points – Outliers

  • Points far from zero line

  • Indicates unusual or noisy data

Conclusion: Outliers affect model accuracy


(e) Clustering of Residuals – Overfitting / Underfitting

  • Too tight → Overfitting

  • Too spread → Underfitting
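To see one of these patterns concretely, the sketch below deliberately fits a straight line to data that is truly quadratic; the residual plot then shows the curved (U-shaped) pattern of case (b), signaling missed non-linearity. The data is synthetic, for illustration only:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 150)
y = 0.5 * x**2 + rng.normal(0, 1, 150)   # the true relationship is quadratic

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)     # deliberately misspecified linear fit
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, s=10)
plt.axhline(0, color="red", linestyle="--", label="Zero line")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.legend()
plt.show()   # expect a U-shaped (curved) pattern -> non-linearity was missed
```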




7. Advantages of Residual Plot

  • Simple and intuitive visualization

  • Detects hidden patterns in errors

  • Helps validate regression assumptions


8. Limitations of Residual Plot 

  • Subjective interpretation

  • Less effective for very large datasets


9. Applications 

  • Linear and polynomial regression evaluation

  • Time-series forecasting

  • Business analytics and prediction systems


10. Conclusion 

A Residual Plot is a powerful visualization tool used to evaluate regression models by analyzing prediction errors. It helps identify non-linearity, bias, outliers, and assumption violations, making it an essential step in the model evaluation process.

4.5) Distribution Plot in Data Science

1. Introduction 

In Data Science, building a predictive model is only one part of the analytical process. Evaluating how well the model performs is equally important. Model evaluation using visualization helps data scientists understand error behavior, bias, variance, and overall prediction quality.

A Distribution Plot is one of the most important visualization techniques used to analyze the distribution of data or model errors (residuals). It provides insights into the shape, spread, and symmetry of the data, which numerical metrics alone cannot reveal.


2. Definition of Distribution Plot 

A Distribution Plot is a graphical representation that shows how data values or residuals are distributed across a range of values.

In model evaluation, it is most commonly used to visualize the distribution of residuals:

Residual = y_actual − y_predicted

It helps determine whether the errors follow a normal (Gaussian) distribution, which is an important assumption in many statistical models.


3. Purpose of Distribution Plot 

A Distribution Plot is used to:

  • Analyze prediction errors

  • Detect model bias

  • Identify skewness in data

  • Check normality of residuals

  • Compare performance of multiple models

  • Detect outliers and noise


4. Types of Distribution Plots 

4.1 Histogram

  • Displays frequency of values in bins

  • Simple and widely used

  • Sensitive to bin size

4.2 Density Plot (KDE – Kernel Density Estimation)

  • Smooth continuous curve

  • Shows probability density

  • Better than histogram for understanding shape

4.3 Combined Histogram + Density Plot

  • Histogram shows frequency

  • Density curve shows smooth distribution

  • Most commonly used in data science


5. Axes Description 

  • X-axis:

    • Data values or residual values

  • Y-axis:

    • Frequency (Histogram)

    • Probability Density (Density Plot)


6. Distribution Plot Diagram 

Neat Exam-Oriented Diagram (Draw This)

Frequency / Density
 ↑
8│              ╭──────╮
7│           ╭─╯      ╰─╮
6│         ╭─╯          ╰─╮
5│       ╭─╯              ╰─╮
4│     ╭─╯                  ╰─╮
3│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
2│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
1│   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
 └────────────────────────────────→ Residual Values
      -3   -2   -1    0    1    2    3

Legend:
▇ → Histogram bars (frequency)
╭─╯ ╰─╮ → Smooth density curve
Center at 0 → Errors balanced, model unbiased

Explanation:

  • Bars → Histogram (frequency of residuals)

  • Curve → Density plot

  • Centered around zero → Unbiased model
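A combined histogram + density plot like the one sketched above can be produced with seaborn's histplot. The residuals here are synthetic, drawn from a normal distribution to mimic a well-behaved, unbiased model:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 500)   # hypothetical residuals from a good model

sns.histplot(residuals, kde=True, bins=30)    # histogram bars + KDE curve
plt.axvline(0, color="red", linestyle="--")   # centered at 0 -> unbiased model
plt.xlabel("Residual values")
plt.ylabel("Frequency")
plt.show()
```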


7. Interpretation of Distribution Plot 

(a) Symmetric Bell-Shaped Distribution (Good Model)

  • Residuals centered around zero

  • Follows normal distribution

  • Low bias and stable predictions

Conclusion: Model is well-trained


(b) Left-Skewed Distribution

  • More residuals on negative side

  • Model over-predicts

Conclusion: Systematic bias exists


(c) Right-Skewed Distribution

  • More residuals on positive side

  • Model under-predicts

Conclusion: Needs recalibration


(d) Wide Distribution

  • Large spread of residuals

  • High variance

  • Unstable predictions

Conclusion: Model may be overfitting


(e) Multiple Peaks (Multimodal)

  • Multiple data groups

  • Indicates hidden patterns or mixed populations

Conclusion: Data segmentation required


(f) Heavy Tails

  • Extreme residual values

  • Presence of outliers

Conclusion: Data cleaning required


8. Role in Model Evaluation 

Distribution plots help to:

  • Validate regression assumptions

  • Compare multiple models visually

  • Improve feature engineering

  • Decide model selection and tuning


9. Advantages of Distribution Plot 

  • Simple and intuitive

  • Reveals bias and variance

  • Helps detect outliers

  • Complements numerical metrics


10. Limitations of Distribution Plot 

  • Interpretation can be subjective

  • Bin size affects histogram

  • Not sufficient alone for evaluation


11. Applications 

  • Regression model evaluation

  • Error analysis in machine learning

  • Financial forecasting

  • Quality control systems

  • Business analytics

  • Risk assessment


12. Comparison with Residual Plot 

| Residual Plot         | Distribution Plot          |
|-----------------------|----------------------------|
| Shows error pattern   | Shows error distribution   |
| Detects non-linearity | Detects bias & skewness    |
| Scatter plot          | Histogram / Density plot   |

13. Conclusion 

A Distribution Plot is a powerful visualization tool in data science that helps analyze the distribution of data values or prediction errors. It provides valuable insights into bias, variance, normality, and model reliability. When used along with residual plots and numerical metrics, it ensures robust and trustworthy model evaluation.
