DM-BCA-UNIT I
BCA
DATAMINING AND WAREHOUSING-320E6B
UNIT I
Introduction:
Data mining – Functionalities – Classification – Introduction to Data
Warehousing – Data Pre-processing: Pre-processing the Data – Data cleaning –
Data Integration and Transformation – Data Reduction
1.1)INTRODUCTION
Introduction
to Data Mining
Data
mining (also called data or knowledge discovery) is the process of analyzing
data from different perspectives and summarizing it into useful
information.
This information can be used to increase revenue, reduce costs, or improve
decision-making.
Data
mining software is one of the analytical tools that allows users to:
- Analyze data
from multiple dimensions
- Categorize
data
- Identify
relationships between data items
Technically, data mining is the process of finding correlations,
patterns, and trends in large relational databases containing many fields.
Simple definition:
Data mining is the process of extracting knowledge from large volumes of
data.
Applications
of Data Mining
The
knowledge extracted can be used in:
- Market
Analysis
- Fraud
Detection
- Customer
Retention
- Production
Control
- Scientific
Exploration
Data
Mining Applications
Data
mining is highly useful in various domains such as:
- Market
Analysis and Management
- Corporate
Analysis and Risk Management
- Fraud
Detection
Additionally,
it can be applied in:
- Production
Control
- Customer
Retention
- Science and
Research
- Sports
Analytics
- Astrology
- Web usage
analysis (Web Surf-Aid)
Data
Mining Applications with Examples
1.
Market Analysis and Management
a)
Customer Profiling
Identifies
what type of customers buy specific products.
Example:
An online store analyses purchase history and discovers that college
students mostly buy budget laptops.
This helps the company target ads specifically to students.
b)
Identifying Customer Requirements
Finds
out what products customers need or prefer.
Example:
Netflix analyzes viewers’ watching habits and recommends movies/series
based on their interests.
This increases customer engagement.
c)
Cross-Market Analysis (Association Analysis)
Finds
relationships between products that are bought together.
Example:
A supermarket discovers that customers who buy bread often also buy butter.
So they place butter near the bread section to increase sales.
d)
Target Marketing
Identifies
groups of customers with similar habits or characteristics.
Example:
A bank uses data mining to group customers into:
- High spenders
- Middle-income
families
- Students
Then
the bank sends personalized loan offers to each group.
e)
Customer Purchasing Pattern
Analyzes
how customers buy products over time.
Example:
Amazon detects that customers buy mobile phones mostly during festival
seasons, so they offer special discounts at that time.
f)
Summary Information
Provides
summarized reports for decision-making.
Example:
A retail chain gets a report showing total monthly sales and best-selling
items, helping managers plan inventory.
2.
Corporate Analysis and Risk Management
a)
Finance Planning and Asset Evaluation
Predicts
future financial situations and evaluates assets.
Example:
Banks use data mining to predict loan default risk by analyzing:
- Salary
- Credit
history
- Spending
behavior
This
helps the bank decide whether to approve or reject a loan.
b)
Resource Planning
Helps
companies allocate resources efficiently.
Example:
A manufacturing company uses data mining to analyze machine usage and finds
that two machines are underused.
So they shift production to avoid wastage.
c)
Competition Analysis
Monitors
competitors and market trends.
Example:
An airline compares its ticket prices with competitors and identifies the best
pricing strategy to stay competitive.
3.
Fraud Detection
a)
Credit Card Fraud Detection
Detects
unusual or suspicious transactions.
Example:
If a customer usually spends ₹2,000 per month and suddenly a transaction of
₹50,000 appears, the bank gets an alert.
The transaction may be blocked until verified.
b)
Telecommunication Fraud Detection
Finds
abnormal call behavior.
Example:
If a customer suddenly makes hundreds of international calls, the system
detects unusual patterns and warns the service provider.
c)
Insurance Fraud Detection
Identifies
false claims.
Example:
An insurance company finds that a customer has claimed compensation for the
same accident multiple times, indicating fraud.
4.
Production Control
Helps
organizations improve product quality and reduce defects.
Example:
A
factory analyzes machine sensor data and discovers when machines are likely
to fail.
This allows for preventive maintenance before breakdowns.
5.
Customer Retention
Identifies
customers who may stop using a service.
Example:
A
telecom company detects users whose data usage has reduced suddenly.
They send special discounts to retain those customers.
6.
Science and Research / Exploration
Used
to discover patterns in scientific data.
Example:
Astronomers
use data mining to identify new stars, planets, and galaxies by
analyzing massive space datasets.
7.
Web Usage Mining (Internet Web Surf-Aid)
Analyzes
user behavior on websites.
Example:
Google
tracks user search patterns and uses them to show personalized search
results and ads.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
Knowledge discovery in
databases (KDD) is the process of discovering useful knowledge from a
collection of data. This widely used data mining technique is a process that
includes data preparation and selection, data cleansing, incorporating prior
knowledge on data sets and interpreting accurate solutions from the observed
results.
Here is the list of steps
involved in the knowledge discovery process −
·
Data Cleaning − In this step,
noise and inconsistent data are removed.
·
Data Integration − In this step,
multiple data sources are combined.
·
Data Selection − In this step,
data relevant to the analysis task are retrieved from the database.
·
Data
Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
·
Data Mining − In this step,
intelligent methods are applied in order to extract data patterns.
·
Pattern Evaluation
−
In this step, data patterns are evaluated.
·
Knowledge
Presentation − In this step, knowledge is represented.
The following diagram shows
the process of knowledge discovery
Fig 1: Data Mining as a process of knowledge discovery
Architecture of
data mining system
Fig 2: Architecture of data mining system
A
typical data mining system consists of several important components. Each
component plays a specific role in the process of discovering useful patterns
from large datasets.
1.
Knowledge Base
The
Knowledge Base stores all the domain knowledge required to guide the
data mining process.
- Concept hierarchies: Helps organize attributes into multiple levels (e.g., city → state → country).
- User beliefs or expectations: Used to evaluate whether a discovered pattern is interesting or surprising.
- Interestingness measures and thresholds: Defines what kind of patterns the user wants (e.g., minimum support in association rules).
- Metadata: Describes data from different sources, especially when data comes from various databases.
Purpose:
- Helps the
system focus on meaningful patterns.
- Guides the
mining algorithm toward useful and relevant results.
2.
Data Mining Engine
The
Data Mining Engine is the core component of the entire system.
It
contains functional modules that perform different data mining tasks, such as:
- Characterization (general summaries of data)
- Association
and correlation analysis
(finding relationships between items)
- Classification (assigning data to predefined
categories)
- Prediction (forecasting future values)
- Clustering (grouping similar data items)
- Outlier
analysis (detecting unusual
data)
- Evolution
analysis (finding trends and
changes over time)
Purpose:
This
engine performs all major operations required to extract patterns from data.
3.
Pattern Evaluation Module
The
Pattern Evaluation Module checks the quality and usefulness of the
patterns found.
- Uses interestingness
measures (like support, confidence, accuracy) to evaluate discovered
patterns.
- Filters out unimportant
or irrelevant patterns.
- Works closely
with the data mining engine to guide it toward more meaningful patterns.
- May use thresholds
to show only the patterns that meet user criteria.
Purpose:
- Ensures only valuable
and interesting patterns are presented to the user.
- Makes the
mining process more efficient by removing unnecessary results
early.
4.
User Interface
The
User Interface (UI) connects the user with the data mining system.
It allows the user to:
- Enter queries or specify data mining tasks (e.g., “find association rules from sales data”)
- Provide constraints or preferences to narrow the search
- View and
explore intermediate results
- Browse
database or data warehouse structures
- Evaluate and
refine mined patterns
- Visualize
patterns (graphs, charts, plots)
Purpose:
- Makes the
data mining system interactive and user-friendly
- Helps users
easily understand and analyze the results
5.Data
Sources (Bottom Layer)
The
system collects data from different sources such as:
a)
Database
Traditional
databases storing structured data (e.g., banking records).
b)
Data Warehouse
A
large collection of historical data used for analysis (e.g., sales data from
past 10 years).
c)
World Wide Web
Web
data such as browsing history, webpage content, and clickstream data.
d)
Other Information Repositories
Text
files, documents, sensor data, scientific databases, etc.
These
sources provide the raw data for mining.
6.Data
Cleaning, Integration, and Selection
Before
mining can begin, the raw data must be prepared.
This
step includes:
- Data
Cleaning: Removing noise,
missing values, and errors
- Data
Integration: Combining
data from multiple sources
- Data
Selection: Selecting
only relevant data for analysis
Example:
From a large sales database, only the “customer purchase details” may be
selected for mining.
This
cleaned and prepared data is sent to the database or data warehouse server.
7.
Database or Data Warehouse Server
This
server acts as a bridge between the data and the mining system.
It:
- Stores
cleaned data
- Retrieves
required data efficiently
- Provides it
to the Data Mining Engine
Think
of it as the storage backbone of the mining process.
1.2)DATA MINING FUNCTIONALITIES
Data mining discovers
patterns from large datasets.
These patterns depend on the type of data mining tasks, which fall into
two main categories:
1. Descriptive Data
Mining Tasks
Describe existing data
and reveal general properties or patterns.
2. Predictive Data Mining
Tasks
Use existing data to predict
future outcomes.
Major Functionalities
1. Characterization
(Descriptive)
Data characterization
summarizes the general features of a target group (class).
It produces characteristic rules.
How it works
- Data is retrieved using queries
- Passed through a summarization module
- Generates high-level information
using concept hierarchies
- OLAP (data cube operations) can
perform roll-up summaries
Example
Characterizing customers
of OurVideoStore who rent more than 30 movies per year.
Summary may include: age group, preferred genres, visiting days, etc.
2. Discrimination
(Descriptive)
Compares the general
characteristics of two or more classes:
- A target class
- A contrast class
Purpose
Identifies differences
between groups using discriminant rules.
Example
Compare customers who:
- Rent >30 movies yearly (target class)
- Rent <5 movies yearly (contrast class)
Differences in age, genre
preference, or spending patterns may be identified.
3. Association Analysis
(Descriptive)
Discovers relationships
among items in transactional databases, producing association rules.
Key measures
- Support (s): frequency of items appearing
together
- Confidence (c): conditional probability that item Q
appears when P is present
Purpose
Useful in market
basket analysis and recommendation systems.
Example Rule
RentType(X,
"game") AND Age(X, "13–19") → Buy(X, "pop") [s =
2%, c = 55%]
Meaning:
- 2% transactions are by teenagers
renting games AND buying pop
- 55% of teen game renters also buy pop
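For illustration, here is a small Python sketch (with invented transactions, not the 2% / 55% figures above) showing how support and confidence are computed for a rule of this form:

```python
# Hypothetical transactions; each is the set of items in one rental/purchase.
transactions = [
    {"teen", "game", "pop"},
    {"teen", "game"},
    {"adult", "drama"},
    {"teen", "game", "pop"},
    {"adult", "game", "pop"},
]

antecedent = {"teen", "game"}   # P: teenager renting a game
consequent = {"pop"}            # Q: buys pop

n_total = len(transactions)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
n_antecedent = sum(1 for t in transactions if antecedent <= t)

support = n_both / n_total            # how often P and Q occur together
confidence = n_both / n_antecedent    # how often Q occurs when P occurs
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```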
4. Classification
(Predictive)
Assigns objects to predefined
classes based on training data.
Also called supervised learning.
How it works
- Used with labeled training data
- Algorithm learns a model
- Model classifies new data
Example
In OurVideoStore,
customers may be classified as:
- “Safe”
- “Risky”
- “Very risky”
based on rental and
payment history.
Later, the model is used
to approve or reject credit requests.
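A hedged sketch of such a classifier, assuming scikit-learn and two invented attributes (rentals per year, number of late payments); the labels and values are made up for illustration only:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [rentals_per_year, late_payments] -> risk label
X_train = [[40, 0], [35, 1], [30, 0], [8, 5], [5, 6], [3, 7]]
y_train = ["safe", "safe", "safe", "risky", "risky", "risky"]

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Classify a new credit request
print(model.predict([[25, 1]]))   # e.g. ['safe']
```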
5. Prediction
(Predictive)
Forecasts missing data,
future values, or trends.
Types
- Predict numerical values (regression)
- Predict trends (time-series analysis)
- Predict class labels (linked to
classification)
Example
Predict next month’s
movie rental demand based on past 12 months’ data.
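A minimal sketch of such a forecast, fitting a straight-line trend with NumPy to twelve hypothetical monthly rental counts:

```python
import numpy as np

# Hypothetical rentals for the past 12 months
rentals = [310, 325, 330, 360, 355, 380, 400, 395, 420, 430, 445, 460]
months = np.arange(1, 13)

slope, intercept = np.polyfit(months, rentals, deg=1)  # least-squares line
forecast = slope * 13 + intercept                      # extrapolate to month 13
print(f"forecast for next month: {forecast:.0f} rentals")
```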
6. Clustering
(Descriptive)
Groups data into clusters
based on similarity without predefined labels.
Also called unsupervised learning.
Objective
- Maximize intra-class similarity
- Minimize inter-class similarity
Example
Grouping customers into:
- Frequent renters
- Moderate renters
- Rare renters
without any prior labels.
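A possible sketch of this grouping, assuming scikit-learn's k-means and invented yearly rental counts; no labels are given in advance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented yearly rental counts, one row per customer
rentals_per_year = np.array([[52], [48], [40], [20], [18], [15], [3], [2], [1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(rentals_per_year)
print(kmeans.labels_)           # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # roughly "frequent", "moderate", "rare" renters
```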
7. Outlier Analysis
(Descriptive)
Finds data objects that do
not fit into normal patterns.
Also called anomalies, exceptions, or surprises.
Outliers may indicate:
- Fraud
- Rare events
- Errors
- Unexpected behavior
Example
A customer suddenly rents
200 movies in one month → suspicious activity.
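One simple statistical way to flag such a value is shown below (numbers are hypothetical): values more than two standard deviations from the mean are treated as outliers.

```python
import numpy as np

# Hypothetical monthly rental counts for one customer; the last value is unusual
monthly_rentals = np.array([4, 6, 5, 7, 3, 5, 6, 4, 200])

mean, std = monthly_rentals.mean(), monthly_rentals.std()
z_scores = (monthly_rentals - mean) / std
outliers = monthly_rentals[np.abs(z_scores) > 2]
print(outliers)   # -> [200]
```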
8. Evolution and
Deviation Analysis (Descriptive + Predictive)
Studies time-related
data that changes over time.
Includes
Evolution Analysis
- Identifies long-term trends,
growth, seasonality.
Deviation Analysis
- Finds differences between actual
and expected values
- Helps detect anomalies or shifts
Example
- Evolution: Movie rentals increase
during summer each year
- Deviation: Sudden rental drop last
month → requires investigation
| Functionality | Type | Purpose | Example |
|---|---|---|---|
| Characterization | Descriptive | Summarizes target group | Customers renting > 30 movies |
| Discrimination | Descriptive | Compares two groups | High vs low renters |
| Association | Descriptive | Finds item relationships | Game rental → pop purchase |
| Classification | Predictive | Assigns labels | Safe / risky customers |
| Prediction | Predictive | Forecasts future values | Next month rentals |
| Clustering | Descriptive | Groups similar objects | Renting behavior clusters |
| Outlier Analysis | Descriptive | Detects unusual patterns | 200 rentals in a month |
| Evolution & Deviation | Both | Time-based trends | Yearly rental trends |
1.3) DATA MINING SYSTEM CLASSIFICATION
A data mining
system can be classified according to the following criteria −
· Database Technology
· Statistics
· Machine Learning
· Information Science
· Visualization
· Other Disciplines
Apart from these, a data mining system can also be
classified based on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the
kind of knowledge mined. It means the data mining system is classified on the
basis of functionalities such as −
· Characterization
· Discrimination
· Association and Correlation Analysis
· Classification
· Prediction
· Outlier Analysis
· Evolution Analysis
Classification Based on the Techniques Utilized
Data mining systems employ and provide different
techniques. This classification categorizes data mining systems according to
the data analysis approach used such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data
warehouse-oriented, etc. The classification can also
take into account the degree of user interaction involved in the data mining
process such as query-driven systems, interactive exploratory systems, or
autonomous systems. A comprehensive system would provide a wide variety of data
mining techniques to fit different situations and options, and offer different
degrees of user interaction.
Classification Based on the Applications Adapted
We can classify a data mining system according to the
applications adapted. These applications are as follows −
· Finance
· Telecommunications
· DNA
· Stock Markets
· E-mail
Integrating a Data Mining
System with a DB/DW System
If a data mining system is not integrated with a
database or a data warehouse system, then there will be no system to
communicate with. This scheme is known as the non-coupling scheme. In this
scheme, the main focus is on data mining design and on developing efficient and
effective algorithms for mining the available data sets.
The list of Integration Schemes
is as follows −
·
No
Coupling − In this scheme, the data mining system does not
utilize any of the database or data warehouse functions. It fetches the data
from a particular source and processes that data using some data mining
algorithms. The data mining result
is stored in another file.
·
Loose
Coupling − In this scheme, the data mining system may use some of the functions of database and
data warehouse system. It fetches the data from the data repository managed by
these systems and performs data mining on that data. It then stores the mining result either in a file or in a
designated place in a database or in a data warehouse.
·
Semi−tight
Coupling - In this scheme, the data mining system is linked
with a database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the
database.
·
Tight
coupling − In this coupling scheme, the data mining system is
smoothly integrated into the database
or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
1.4)INTRODUCTION TO DATA WAREHOUSING
Data warehousing refers
to the process of collecting, storing, and managing data from different sources
in a centralized repository. It allows businesses to analyze historical data
and make informed decisions. The data is structured in a way that makes it easy
to query and generate reports.
- A data warehouse consolidates data
from multiple sources.
- It helps businesses track historical
trends and performance.
- Facilitates complex queries and
analysis for decision-making.
- Enables efficient reporting and
business intelligence.
DATA
WAREHOUSE ARCHITECTURE
Designing
a data warehouse requires choosing the right approach for how the system will
be structured, developed, and scaled. There are two common approaches to
constructing a data warehouse:
- Top-Down
Approach: This method
starts with designing the overall data warehouse architecture first and
then creating individual data marts.
- Bottom-Up
Approach: In this
method, data marts are built first to meet specific business needs and
later integrated into a central data warehouse.
Components
of Data Warehouse Architecture
A
data warehouse architecture consists of several key components that work
together to store, manage and analyze data.
- External
Sources: Where data
originates. Includes: Structured (databases, spreadsheets),
Semi-structured (XML, JSON) & Unstructured (emails, images)
- Staging Area: A temporary space where raw
data is cleaned and validated before moving to the warehouse. ETL tools
manage this process: Extract (E) - Pulls raw data from sources, Transform
(T) - Standardizes and formats the data & Load (L) - Moves the data into
the data warehouse
- Data
Warehouse: A central storage
for organized, cleansed data, including both raw data and metadata.
Supports analysis, reporting and decision-making.
- Data Marts: Smaller, focused sections of the
data warehouse for specific teams (e.g., sales, marketing), enabling quick
access to relevant data.
- Data Mining: Analyzing large datasets in the
warehouse to find patterns, trends and insights that support decisions and
improve operations.
Working
of Top-Down Approach
- Central Data
Warehouse: The process
begins with creating a comprehensive data warehouse where data from
various sources is collected, integrated and stored. This involves the ETL
(Extract, Transform, Load) process to clean and transform the data.
- Specialized
Data Marts: Once the
central warehouse is established, smaller, department-specific data marts
(e.g., for finance or marketing) are built. These data marts pull
information from the main data warehouse, ensuring consistency across
departments.
Top-Down Approach
Advantages
of Top-Down Approach
- Consistent
View: Data marts
built from a central warehouse ensure uniform data across departments,
reducing reporting discrepancies.
- High Data
Consistency: Standardizing
data through one source minimizes errors and improves the reliability of
insights.
- Simplified
Maintenance: Updates
in the central warehouse automatically reflect in all data marts, saving
time and effort.
- Scalable
Architecture: New
data marts can be added easily as business needs grow or change.
- Stronger
Governance: Centralized
control improves data security, access management and compliance.
- Less Data
Duplication: Data is
stored once in the warehouse, saving space and avoiding redundant or
conflicting records.
- Better
Reporting: A
unified data source enables faster, more accurate reporting and
decision-making.
- Improved
Integration: Central
sourcing makes it easier to combine data from multiple systems for deeper
analysis.
Disadvantages
of Top-Down Approach
- High Cost
& Time: Building
a central data warehouse and data marts requires major investment and long
implementation time, making it hard for smaller organizations.
- Complex
Setup: Designing and
managing a centralized system is technically complex and requires skilled
resources and careful planning.
- Low
Flexibility: Predefined
structures make it hard to adapt quickly to changing business needs or
reporting requirements.
- Limited User
Input: IT-led
development can exclude business users, resulting in solutions that may
not meet their actual needs.
- Data Delays: Pulling data from various
systems can cause processing delays, affecting real-time reporting and
insights.
- Unclear Data
Ownership: Centralization
can blur responsibility, making it unclear who manages or maintains
specific data.
Bottom-Up
Approach
Instead
of starting with a central data warehouse, it begins by building small,
department-specific data marts that cater to the immediate needs of individual
teams, such as sales or finance.
These
data marts are later integrated to form a larger, unified data warehouse.
Working
of Bottom-Up Approach
- Department-Specific
Data Marts: The
process starts with creating data marts for individual departments or
specific business functions. These data marts are designed to meet
immediate data analysis and reporting needs, allowing departments to gain
quick insights.
- Integration
into a Data Warehouse: Over
time, these data marts are connected and consolidated to create a unified
data warehouse. The integration ensures consistency and provides a
comprehensive view of the organization’s data.
Bottom-Up Approach
Advantages
of Bottom-Up Approach
- Faster
Reporting: Data marts
allow quick insights and report generation.
- Step-by-Step
Development: Enables
gradual rollout with quick wins.
- User-Centric: Involves business users to meet
actual needs.
- Highly
Flexible: Easily customized
for departments or evolving needs.
- Quick
Results: Early setup
gives immediate value.
Disadvantages
of Bottom-Up Approach
- Inconsistent
Views: Different
structures can lead to inconsistent reporting.
- Data Silos: Independent marts may cause
duplication and isolation.
- Integration
Difficulty: Combining
varied marts into one warehouse is hard.
- Redundant
Efforts: Similar marts
may be built by different teams.
- Harder to Manage: Multiple marts increase maintenance overhead.
Major Issues in Data Mining:
Mining different kinds of knowledge in databases. - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction. - The data mining process needs to be interactive because
it allows users to focus the search for patterns, providing and refining data
mining requests based on returned results.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
Handling noisy or
incomplete data. - The data cleaning methods are
required that can handle the noise, incomplete objects while mining the data
regularities. If data cleaning methods are not there then the accuracy of the
discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. The patterns discovered should be interesting; a pattern is considered uninteresting if it merely represents common knowledge or lacks novelty.
Efficiency and scalability of data mining algorithms.
- In order to effectively extract the information from huge
amount of data in databases, data mining algorithm must be efficient and
scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mining results when the database is updated, without having to mine the entire data again from scratch.
DATA
PREPROCESSING
· Data preprocessing is the first and most
important step in the data mining process.
· Real-world data is often incomplete,
inconsistent, noisy, or duplicated, and cannot be used directly for
analysis.
· Preprocessing improves the quality of data,
which leads to better mining results and accurate predictions.
Why preprocessing?
1. Real world data are generally
o
Incomplete: lacking
attribute values, lacking
certain attributes of interest, or containing only aggregate data
o
Noisy: containing errors or outliers
o
Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
o
Data cleaning:
fill in missing
values, smooth noisy data, identify
or remove outliers, and
resolve inconsistencies.
o
Data integration: using multiple databases, data cubes, or files.
o
Data transformation: normalization and aggregation.
o
Data reduction: reducing the volume but producing the same or similar analytical results.
o
Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Major
Steps of Data Preprocessing
1.
Data Cleaning
Fixes
errors and removes noise.
Includes:
- Handling
missing values
(fill, delete, or estimate)
- Removing
noise
(using smoothing techniques)
- Correcting
inconsistencies
- Removing
duplicates
Example:
If customer age is missing, we fill it with the average age.
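A minimal sketch of this step, assuming pandas (not prescribed by the text) and an invented customer table:

```python
import pandas as pd

customers = pd.DataFrame({
    "name": ["Asha", "Bala", "Chitra", "Dev"],
    "age":  [21, 35, None, 28],      # Chitra's age is missing
})

# Fill the missing age with the mean of the known ages
customers["age"] = customers["age"].fillna(customers["age"].mean())
print(customers)   # Chitra's age becomes 28.0, the average of 21, 35 and 28
```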
2.
Data Integration
Combining
data from multiple sources.
Sources may include:
- Databases
- Files
- Web
data
- Sensors
Problems
handled:
- Schema
conflicts
- Naming
conflicts
- Duplicate
records
Example:
Integrating sales data from Excel with customer data from SQL
database.
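As a hedged illustration, assuming the two sources have already been loaded into pandas DataFrames that share a cust_id key:

```python
import pandas as pd

# Sales records (e.g. from a spreadsheet) and customer details (e.g. from a database)
sales = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [250, 120, 340]})
customers = pd.DataFrame({"cust_id": [1, 2], "city": ["Chennai", "Madurai"]})

# Combine the two sources into one unified table on the common key
integrated = sales.merge(customers, on="cust_id", how="left")
print(integrated)
```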
3.
Data Transformation
Converting
the data into a suitable format for mining.
Includes:
- Normalization: Scaling
values (0 to 1)
- Aggregation: Summarizing
data
- Generalization: Replacing
low-level data with higher concepts
- Encoding: Converting
categories into numbers
Example:
Converting “High/Medium/Low” into 3/2/1.
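A small sketch (pandas assumed, values invented) of two of these transformations: min-max normalization to the 0 to 1 range and the High/Medium/Low encoding described above:

```python
import pandas as pd

df = pd.DataFrame({
    "income":  [20000, 50000, 80000],
    "segment": ["Low", "High", "Medium"],
})

# Min-max normalization: (x - min) / (max - min) scales income into [0, 1]
value_range = df["income"].max() - df["income"].min()
df["income_scaled"] = (df["income"] - df["income"].min()) / value_range

# Simple ordinal encoding of the category labels
df["segment_code"] = df["segment"].map({"High": 3, "Medium": 2, "Low": 1})
print(df)
```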
4.
Data Reduction
Reducing
data size while keeping its meaning.
Used when data is huge.
Techniques:
- Dimensionality
reduction
(e.g., PCA)
- Numerosity
reduction
(sampling, clustering)
- Data
compression
Example:
Instead of storing daily sales for 5 years, store monthly totals.
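A possible sketch of this kind of reduction, assuming pandas and synthetic daily sales figures:

```python
import pandas as pd

# Synthetic daily sales for roughly three months
daily = pd.DataFrame({
    "date":  pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Replace 90 daily rows with one total per month
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```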
5.
Data Discretization
Converting
continuous values into categories (bins).
Helps in classification and association rule mining.
Example:
Age →
- 0–12
(Child)
- 13–19
(Teen)
- 20–60
(Adult)
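A minimal sketch of the same binning, assuming pandas and its cut function:

```python
import pandas as pd

ages = pd.Series([4, 15, 22, 37, 59, 11])

# Bin continuous ages into the Child / Teen / Adult categories above
categories = pd.cut(ages, bins=[0, 12, 19, 60], labels=["Child", "Teen", "Adult"])
print(categories)
```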
📌 Summary
| Step | Purpose |
|---|---|
| Data Cleaning | Fix missing, noisy, inconsistent data |
| Data Integration | Combine data from multiple sources |
| Data Transformation | Normalize, aggregate, convert formats |
| Data Reduction | Reduce size while keeping essential info |
| Data Discretization | Convert continuous data → categories |
Importance of data
preprocessing
Preprocessing data is an important step for data
analysis.
1.
It improves accuracy and reliability.
2.
It makes data consistent.
3.
It makes the data easier for mining algorithms to read and process.
DATA CLEANING
·
Data Cleaning
(also called data cleansing or data scrubbing) is the process of
detecting, correcting, and removing errors in a dataset to improve its quality.
·
It deals with
problems such as missing values, noise, duplicate records, inconsistent
formats, and incorrect data to ensure that the data is accurate, complete, and
reliable for analysis or data mining.
Data cleaning
mainly involves:
- Handling missing values
- Removing noise
- Detecting and correcting
inconsistencies
1. Handling Missing
Values
Missing data must be
handled carefully to avoid wrong mining results.
Common methods:
- Ignore the tuple: Used when the class label is missing; not effective for large missing portions.
- Fill manually: Accurate but slow and not feasible for large datasets.
- Use a global constant: Replace with “Unknown” or “N/A”; may create false patterns.
- Use mean/median of the attribute: Mean for normal data; median for skewed data.
- Use mean/median of similar class: More accurate; uses class-specific statistics.
- Use most probable value: Uses regression, Bayesian methods, or decision trees to predict missing values.
→ Most powerful because it uses relationships between attributes.
Note: Missing value does
not always imply error (e.g., no driver’s license).
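As a hedged sketch of the “mean/median of similar class” strategy above, assuming pandas and an invented table with a customer class column:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["student", "student", "salaried", "salaried"],
    "income": [8000, None, 45000, 55000],   # one student's income is missing
})

# Fill each missing income with the mean income of the same class
df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)   # the missing student income becomes 8000.0 (mean of that class)
```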
2. Handling Noisy Data
Noise = random errors or
variations.
Techniques to smooth noise:
a) Binning
- Sort data and divide into bins.
- Replace values using:
o Bin mean
o Bin median
o Bin boundaries
→ Local smoothing.
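A small Python sketch of smoothing by bin means, using invented prices split into equal-frequency bins of three values each:

```python
# Sorted (invented) prices, split into equal-frequency bins of size 3
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(prices), bin_size):
    bin_values = prices[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))   # replace by bin mean

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```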
b) Regression
- Fit the data to a line/curve (linear
or multiple regression).
c) Outlier Detection
- Use clustering or statistical methods
to detect unusual values.
Many smoothing methods
also help in discretization and reduction.
3. Data Cleaning as a
Process
Cleaning involves detecting
and correcting discrepancies.
a) Discrepancy Detection
Errors may come from:
- Poor forms, human entry errors,
outdated values
- Inconsistent codes/formats (e.g.,
date formats)
- Data integration differences
Use metadata and
statistical descriptions (mean, range, standard deviation) to identify:
- Outliers
- Inconsistent values
- Invalid formats
- Violations of rules (unique rule,
consecutive rule, null rule)
b) Data Transformation
Once discrepancies are
found, transformations are applied:
- Reformatting
- Correcting values
- Standardizing codes (e.g., “gender” →
“sex”)
Tools:
- Data scrubbing tools (detect & fix using domain
knowledge)
- Data auditing tools (find rule violations using mining
techniques)
- ETL tools (apply transformations)
c) Interactive Cleaning
New tools like Potter’s
Wheel allow:
- Step-by-step transformations
- Immediate feedback
- Undo/redo options
- Automatic discrepancy checking
DATA INTEGRATION AND TRANSFORMATION
DATA INTEGRATION
- Data integration is
the process of consolidating information from multiple, diverse sources
that use different technologies, creating a cohesive and unified dataset.
- This consolidated
system is commonly known as a data warehouse.
- It involves combining
data from various repositories, including databases, data cubes, and flat
files.
- Effective data
integration relies on managing metadata, performing correlation analysis,
detecting data conflicts, and resolving semantic inconsistencies to ensure
seamless merging of information.
Benefits:
- Operates
independently.
- Processes queries
quickly.
- Handles complex
questions.
- Can summarize and
store data effectively.
- Manages large amounts
of data.
Drawbacks:
- Slower response due
to data loading.
- More expensive
because of data storage and security costs.
Key issues in data integration:
- Schema integration
- Redundancy
- Detecting and fixing
data conflicts
Schema integration means matching
real-world entities from different sources, called the entity identification
problem. For example, ensuring customer_id in one database matches cust_number
in another. Metadata stores data about data in databases and warehouses.
Redundancy:
- An attribute is
redundant if it can be derived from another table, like annual revenue.
- Correlation analysis
can detect some redundancies by measuring how strongly one attribute
implies another.
- For example,
correlation between attributes A and B shows their relationship based on
data.
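A minimal numeric sketch (values invented): when one attribute is exactly derivable from another, their correlation coefficient is 1, which flags the redundancy.

```python
import numpy as np

monthly_revenue = np.array([10, 12, 15, 18, 20, 25])
annual_revenue = 12 * monthly_revenue      # derivable attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(f"correlation = {r:.2f}")            # 1.00, so one attribute is redundant
```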
Detecting and fixing data value conflicts
is crucial in data integration. Different sources may show the same entity's
attributes differently due to varied formats, scales, or detail levels. For
example, "total sales" might mean sales for one store branch in one
database but sales for all stores in a region in another.
DATA TRANSFORMATION:
In data transformation, the data are transformed or consolidated into forms appropriate for mining. For example, in
normalization, attribute data are scaled so as to fall within a small range
such as 0.0 to 1.0. Other examples are data discretization and concept
hierarchy generation.
Data transformation can involve the following:
1.
Smoothing, which works
to remove noise from the data. Such techniques include binning, regression, and
clustering.
2.
Aggregation,
where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in
constructing a data cube for analysis of the
data at multiple granularities.
3.
Generalization
of the data, where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example,
categorical attributes, like street, can be generalized to higher-
level concepts, like city or country.
4.
Normalization,
where the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
5.
Attribute construction (or feature
construction),where new attributes are constructed
and added from the given set of attributes to help the mining process.
Data Reduction
Data reduction techniques can be applied to obtain a
reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the
original data. That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results.
Strategies for data reduction include
the following:
- Data cube aggregation: where aggregation operations are applied to the data in the construction of a data cube.
- Attribute subset selection: where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
- Dimensionality reduction: where encoding mechanisms are used to reduce the dataset size.
- Numerosity reduction: where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
Discretization
and concept hierarchy generation, where raw data
values for attributes are replaced by ranges or higher conceptual levels. Data
discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies. Discretization and concept
hierarchy generation are powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction.
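As a hedged sketch of dimensionality reduction, assuming scikit-learn's PCA and synthetic data in which two of the four attributes are almost derivable from the other two:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Four columns, but the last two are nearly linear copies of the first two
data = np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(100, 2))])

reduced = PCA(n_components=2).fit_transform(data)
print(data.shape, "->", reduced.shape)   # (100, 4) -> (100, 2)
```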