DM-BCA-UNIT I

 BCA

DATAMINING AND WAREHOUSING-320E6B

UNIT I

Introduction: Data mining – Functionalities – Classification – Introduction to Data Warehousing – Data Pre-processing: Pre-processing the Data – Data cleaning – Data Integration and Transformation – Data Reduction

1.1)INTRODUCTION

Introduction to Data Mining

Data mining (also called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.
This information can be used to increase revenue, reduce costs, or improve decision-making.

Data mining software is an analytical tool that allows users to:

  • Analyze data from multiple dimensions
  • Categorize data
  • Identify relationships between data items

Technically, data mining is the process of finding correlations, patterns, and trends in large relational databases containing many fields.

Simple definition:
Data mining is the process of extracting knowledge from large volumes of data.

Applications of Data Mining

The knowledge extracted can be used in:

  • Market Analysis
  • Fraud Detection
  • Customer Retention
  • Production Control
  • Scientific Exploration

Data Mining Applications

Data mining is highly useful in various domains such as:

  • Market Analysis and Management
  • Corporate Analysis and Risk Management
  • Fraud Detection

Additionally, it can be applied in:

  • Production Control
  • Customer Retention
  • Science and Research
  • Sports Analytics
  • Astrology
  • Web usage analysis (Web Surf-Aid)

Data Mining Applications with Examples

1. Market Analysis and Management

a) Customer Profiling

Identifies what type of customers buy specific products.

Example:
An online store analyzes purchase history and discovers that college students mostly buy budget laptops.
This helps the company target ads specifically to students.


b) Identifying Customer Requirements

Finds out what products customers need or prefer.

Example:
Netflix analyzes viewers’ watching habits and recommends movies/series based on their interests.
This increases customer engagement.


c) Cross-Market Analysis (Association Analysis)

Finds relationships between products that are bought together.

Example:
A supermarket discovers that customers who buy bread often also buy butter.
So they place butter near the bread section to increase sales.


d) Target Marketing

Identifies groups of customers with similar habits or characteristics.

Example:
A bank uses data mining to group customers into:

  • High spenders
  • Middle-income families
  • Students

Then the bank sends personalized loan offers to each group.


e) Customer Purchasing Pattern

Analyzes how customers buy products over time.

Example:
Amazon detects that customers buy mobile phones mostly during festival seasons, so they offer special discounts at that time.


f) Summary Information

Provides summarized reports for decision-making.

Example:
A retail chain gets a report showing total monthly sales and best-selling items, helping managers plan inventory.


2. Corporate Analysis and Risk Management

a) Finance Planning and Asset Evaluation

Predicts future financial situations and evaluates assets.

Example:
Banks use data mining to predict loan default risk by analyzing:

  • Salary
  • Credit history
  • Spending behavior

This helps the bank decide whether to approve or reject a loan.


b) Resource Planning

Helps companies allocate resources efficiently.

Example:
A manufacturing company uses data mining to analyze machine usage and finds that two machines are underused.
So they shift production to avoid wastage.


c) Competition Analysis

Monitors competitors and market trends.

Example:
An airline compares its ticket prices with competitors and identifies the best pricing strategy to stay competitive.


3. Fraud Detection

a) Credit Card Fraud Detection

Detects unusual or suspicious transactions.

Example:
If a customer usually spends ₹2,000 per month and suddenly a transaction of ₹50,000 appears, the bank gets an alert.
The transaction may be blocked until verified.


b) Telecommunication Fraud Detection

Finds abnormal call behavior.

Example:
If a customer suddenly makes hundreds of international calls, the system detects unusual patterns and warns the service provider.


c) Insurance Fraud Detection

Identifies false claims.

Example:
An insurance company finds that a customer has claimed compensation for the same accident multiple times, indicating fraud.


4. Production Control

Helps organizations improve product quality and reduce defects.

Example:

A factory analyzes machine sensor data and discovers when machines are likely to fail.
This allows for preventive maintenance before breakdowns.


5. Customer Retention

Identifies customers who may stop using a service.

Example:

A telecom company detects users whose data usage has reduced suddenly.
They send special discounts to retain those customers.


6. Science and Research / Exploration

Used to discover patterns in scientific data.

Example:

Astronomers use data mining to identify new stars, planets, and galaxies by analyzing massive space datasets.


7. Web Usage Mining (Internet Web Surf-Aid)

Analyzes user behavior on websites.

Example:

Google tracks user search patterns and uses them to show personalized search results and ads.

 

KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results.

Here is the list of steps involved in the knowledge discovery process −

·       Data Cleaning − In this step, noise and inconsistent data are removed.

·       Data Integration − In this step, multiple data sources are combined.

·       Data Selection − In this step, data relevant to the analysis task are retrieved from the database.

·       Data Transformation − In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

·       Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

·       Pattern Evaluation − In this step, data patterns are evaluated.

·       Knowledge Presentation − In this step, knowledge is represented.

 

The following diagram shows the process of knowledge discovery

 


 

Fig 1: Data Mining as a process of knowledge discovery

 

 

 

 

 

 

Architecture of data mining system



 

Fig 2: Architecture of data mining system


A typical data mining system consists of several important components. Each component plays a specific role in the process of discovering useful patterns from large datasets.

1. Knowledge Base

The Knowledge Base stores all the domain knowledge required to guide the data mining process.

  • Concept hierarchies:
    Helps organize attributes into multiple levels (e.g., city → state → country).
  • User beliefs or expectations:
    Used to evaluate whether a discovered pattern is interesting or surprising.
  • Interestingness measures and thresholds:
    Defines what kind of patterns the user wants (e.g., minimum support in association rules).
  • Metadata:
    Describes data from different sources, especially when data comes from various databases.

Purpose:

  • Helps the system focus on meaningful patterns.
  • Guides the mining algorithm toward useful and relevant results.

2. Data Mining Engine

The Data Mining Engine is the core component of the entire system.

It contains functional modules that perform different data mining tasks, such as:

  • Characterization (general summaries of data)
  • Association and correlation analysis (finding relationships between items)
  • Classification (assigning data to predefined categories)
  • Prediction (forecasting future values)
  • Clustering (grouping similar data items)
  • Outlier analysis (detecting unusual data)
  • Evolution analysis (finding trends and changes over time)

Purpose:

This engine performs all major operations required to extract patterns from data.


 

3. Pattern Evaluation Module

The Pattern Evaluation Module checks the quality and usefulness of the patterns found.

  • Uses interestingness measures (like support, confidence, accuracy) to evaluate discovered patterns.
  • Filters out unimportant or irrelevant patterns.
  • Works closely with the data mining engine to guide it toward more meaningful patterns.
  • May use thresholds to show only the patterns that meet user criteria.

Purpose:

  • Ensures only valuable and interesting patterns are presented to the user.
  • Makes the mining process more efficient by removing unnecessary results early.

4. User Interface

The User Interface (UI) connects the user with the data mining system.

It allows the user to:

  • Enter queries or specify data mining tasks
    (e.g., “find association rules from sales data”)
  • Provide constraints or preferences to narrow the search
  • View and explore intermediate results
  • Browse database or data warehouse structures
  • Evaluate and refine mined patterns
  • Visualize patterns (graphs, charts, plots)

Purpose:

  • Makes the data mining system interactive and user-friendly
  • Helps users easily understand and analyze the results

5. Data Sources (Bottom Layer)

The system collects data from different sources such as:

a) Database

Traditional databases storing structured data (e.g., banking records).

b) Data Warehouse

A large collection of historical data used for analysis (e.g., sales data from the past 10 years).

c) World Wide Web

Web data such as browsing history, webpage content, and clickstream data.

d) Other Information Repositories

Text files, documents, sensor data, scientific databases, etc.

These sources provide the raw data for mining.


6. Data Cleaning, Integration, and Selection

Before mining can begin, the raw data must be prepared.

This step includes:

  • Data Cleaning: Removing noise, missing values, and errors
  • Data Integration: Combining data from multiple sources
  • Data Selection: Selecting only relevant data for analysis

Example:
From a large sales database, only the “customer purchase details” may be selected for mining.

This cleaned and prepared data is sent to the database or data warehouse server.


7. Database or Data Warehouse Server

This server acts as a bridge between the data and the mining system.

It:

  • Stores cleaned data
  • Retrieves required data efficiently
  • Provides it to the Data Mining Engine

Think of it as the storage backbone of the mining process.

 

1.2)DATA MINING FUNCTIONALITIES

Data mining discovers patterns from large datasets.
These patterns depend on the type of data mining tasks, which fall into two main categories:

1. Descriptive Data Mining Tasks

Describe existing data and reveal general properties or patterns.

2. Predictive Data Mining Tasks

Use existing data to predict future outcomes.

 Major Functionalities

1. Characterization (Descriptive)

Data characterization summarizes the general features of a target group (class).
It produces characteristic rules.

 How it works

  • Data is retrieved using queries
  • Passed through a summarization module
  • Generates high-level information using concept hierarchies
  • OLAP (data cube operations) can perform roll-up summaries

Example

Characterizing customers of OurVideoStore who rent more than 30 movies per year.
Summary may include: age group, preferred genres, visiting days, etc.

2. Discrimination (Descriptive)

Compares the general characteristics of two or more classes:

  • A target class
  • A contrast class

Purpose

Identifies differences between groups using discriminant rules.

Example

Compare customers who:

  • Rent >30 movies yearly (target class)
    vs.
  • Rent <5 movies yearly (contrast class)

Differences in age, genre preference, or spending patterns may be identified.

3. Association Analysis (Descriptive)

Discovers relationships among items in transactional databases, producing association rules.

Key measures

  • Support (s): frequency of items appearing together
  • Confidence (c): conditional probability that item Q appears when P is present

Purpose

Useful in market basket analysis and recommendation systems.

Example Rule

RentType(X, "game") AND Age(X, "13–19") → Buy(X, "pop") [s = 2%, c = 55%]

Meaning:

  • 2% transactions are by teenagers renting games AND buying pop
  • 55% of teen game renters also buy pop
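
The support and confidence figures above can be computed directly from a transaction list. Below is a minimal Python sketch (not from the text); the toy transactions and item names are invented for illustration.

```python
# A minimal sketch: computing support and confidence for the rule
# "renting a game -> buying pop" from invented toy transactions.
transactions = [
    {"game", "pop"}, {"game"}, {"bread", "butter"},
    {"game", "pop"}, {"pop"}, {"game", "chips"},
]

antecedent = {"game"}   # P
consequent = {"pop"}    # Q

both   = sum(1 for t in transactions if (antecedent | consequent) <= t)
only_p = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # how often P and Q occur together
confidence = both / only_p           # P(Q | P)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```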

4. Classification (Predictive)

Assigns objects to predefined classes based on training data.
Also called supervised learning.

How it works

  • Used with labeled training data
  • Algorithm learns a model
  • Model classifies new data

Example

In OurVideoStore, customers may be classified as:

  • “Safe”
  • “Risky”
  • “Very risky”

based on rental and payment history.

Later, the model is used to approve or reject credit requests.
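
As a rough illustration of how such a classifier could be built, here is a minimal sketch assuming scikit-learn is installed; the features (rentals per year, late payments) and the safe/risky labels are invented, not taken from the text.

```python
# A minimal sketch of supervised classification with a decision tree.
# Training data and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# each row: [rentals_per_year, late_payments]
X_train = [[40, 0], [35, 1], [5, 4], [8, 6], [30, 0], [3, 5]]
y_train = ["safe", "safe", "risky", "risky", "safe", "risky"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# classify a new customer: 25 rentals per year, 1 late payment
print(model.predict([[25, 1]]))   # e.g. ['safe']
```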

5. Prediction (Predictive)

Forecasts missing data, future values, or trends.

Types

  • Predict numerical values (regression)
  • Predict trends (time-series analysis)
  • Predict class labels (linked to classification)

Example

Predict next month’s movie rental demand based on past 12 months’ data.
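
A simple way to produce such a forecast is to fit a trend line to the historical values. The sketch below assumes NumPy is installed and uses 12 months of invented rental counts.

```python
# A minimal sketch of numeric prediction using simple linear regression.
import numpy as np

months  = np.arange(1, 13)   # past 12 months
rentals = np.array([210, 220, 235, 240, 255, 260,
                    270, 285, 290, 300, 315, 320])   # invented counts

slope, intercept = np.polyfit(months, rentals, 1)    # fit a straight trend line
forecast = slope * 13 + intercept                    # extrapolate to month 13
print(f"Forecast for next month: about {forecast:.0f} rentals")
```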

6. Clustering (Descriptive)

Groups data into clusters based on similarity without predefined labels.
Also called unsupervised learning.

Objective

  • Maximize intra-class similarity
  • Minimize inter-class similarity

Example

Grouping customers into:

  • Frequent renters
  • Moderate renters
  • Rare renters
    without any prior labels.
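
One common way to form such groups automatically is k-means clustering. The sketch below assumes scikit-learn is installed; customers are described only by their (invented) rentals per month, with no labels given.

```python
# A minimal sketch of unsupervised clustering with k-means.
from sklearn.cluster import KMeans

X = [[1], [2], [1], [8], [9], [10], [25], [28], [30]]   # rentals per month

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster id assigned to each customer
print(kmeans.cluster_centers_)   # roughly: rare, moderate, frequent renters
```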

7. Outlier Analysis (Descriptive)

Finds data objects that do not fit into normal patterns.
Also called anomalies, exceptions, or surprises.

Outliers may indicate:

  • Fraud
  • Rare events
  • Errors
  • Unexpected behavior

Example

A customer suddenly rents 200 movies in one month → suspicious activity.
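
A very simple statistical way to flag such a value is the z-score test sketched below (plain Python). The rental counts are invented, and the cut-off of 2 standard deviations is an arbitrary choice for this tiny sample, not a rule from the text.

```python
# A minimal sketch of outlier detection using z-scores.
from statistics import mean, stdev

rentals = [12, 15, 10, 14, 11, 13, 200]   # one month stands out

mu, sigma = mean(rentals), stdev(rentals)
outliers = [x for x in rentals if abs(x - mu) / sigma > 2]
print(outliers)   # [200]
```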

8. Evolution and Deviation Analysis (Descriptive + Predictive)

Studies time-related data that changes over time.

Includes

Evolution Analysis

  • Identifies long-term trends, growth, seasonality.

Deviation Analysis

  • Finds differences between actual and expected values
  • Helps detect anomalies or shifts

Example

  • Evolution: Movie rentals increase during summer each year
  • Deviation: Sudden rental drop last month → requires investigation

 

| Functionality | Type | Purpose | Example |
| --- | --- | --- | --- |
| Characterization | Descriptive | Summarizes a target group | Customers renting > 30 movies |
| Discrimination | Descriptive | Compares two groups | High vs. low renters |
| Association | Descriptive | Finds item relationships | Game rental → pop purchase |
| Classification | Predictive | Assigns labels | Safe / risky customers |
| Prediction | Predictive | Forecasts future values | Next month's rentals |
| Clustering | Descriptive | Groups similar objects | Renting behavior clusters |
| Outlier Analysis | Descriptive | Detects unusual patterns | 200 rentals in a month |
| Evolution & Deviation | Both | Time-based trends | Yearly rental trends |

 

1.3) DATA MINING SYSTEM CLASSIFICATION

A data mining system can be classified according to the following criteria −

 

·       Database Technology

·       Statistics

·       Machine Learning

·       Information Science

·       Visualization

·       Other Disciplines

Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.

Classification Based on the Databases Mined

We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models or types of data, and the data mining system can be classified accordingly. For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.

Classification Based on the kind of Knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such as −

 

·       Characterization

·       Discrimination

·       Association and Correlation Analysis

·       Classification

·       Prediction

·       Outlier Analysis

·       Evolution Analysis

Classification Based on the Techniques Utilized

Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented approaches. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. These applications are as follows −

 

·       Finance

·       Telecommunications

·       DNA

·       Stock Markets

·       E-mail

Integrating a Data Mining System with a DB/DW System

If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.

 

The list of Integration Schemes is as follows

·       No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.

·       Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.

·       Semi−tight Coupling - In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient implementations of a few data mining primitives can be provided in the database.

·       Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.

 

1.4)INTRODUCTION TO DATA WAREHOUSING

Data warehousing refers to the process of collecting, storing, and managing data from different sources in a centralized repository. It allows businesses to analyze historical data and make informed decisions. The data is structured in a way that makes it easy to query and generate reports.

  • A data warehouse consolidates data from multiple sources.
  • It helps businesses track historical trends and performance.
  • Facilitates complex queries and analysis for decision-making.
  • Enables efficient reporting and business intelligence.

 

DATA WAREHOUSE ARCHITECTURE

Designing a data warehouse requires choosing the right approach for how the system will be structured, developed, and scaled. There are two common approaches to constructing a data warehouse:

  • Top-Down Approach: This method starts with designing the overall data warehouse architecture first and then creating individual data marts.
  • Bottom-Up Approach: In this method, data marts are built first to meet specific business needs and later integrated into a central data warehouse.

Components of Data Warehouse Architecture

A data warehouse architecture consists of several key components that work together to store, manage and analyze data.

  • External Sources: Where data originates. Includes: Structured (databases, spreadsheets), Semi-structured (XML, JSON) & Unstructured (emails, images)
  • Staging Area: A temporary space where raw data is cleaned and validated before moving to the warehouse. ETL tools manage this process: Extract (E) pulls raw data from the sources, Transform (T) standardizes and formats the data, and Load (L) moves the data into the data warehouse (a minimal ETL sketch follows this list).
  • Data Warehouse: A central storage for organized, cleansed data, including both raw data and metadata. Supports analysis, reporting and decision-making.
  • Data Marts: Smaller, focused sections of the data warehouse for specific teams (e.g., sales, marketing), enabling quick access to relevant data.
  • Data Mining: Analyzing large datasets in the warehouse to find patterns, trends and insights that support decisions and improve operations.
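
To make the Extract-Transform-Load idea above concrete, here is a minimal Python sketch using plain dictionaries as stand-ins for a real source system and warehouse; all field names and values are invented.

```python
# A minimal ETL sketch: extract raw rows, clean/standardize them, load them.
raw_orders = [                                    # Extract: raw rows from a source
    {"id": 1, "amount": "250.5", "region": "south "},
    {"id": 2, "amount": "99.9",  "region": "North"},
]

def transform(row):                               # Transform: clean and standardize
    return {
        "id": row["id"],
        "amount": float(row["amount"]),           # cast text to a number
        "region": row["region"].strip().title(),  # fix spacing and letter case
    }

warehouse = []                                    # Load: append to the warehouse store
warehouse.extend(transform(r) for r in raw_orders)
print(warehouse)
```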

Working of Top-Down Approach

  • Central Data Warehouse: The process begins with creating a comprehensive data warehouse where data from various sources is collected, integrated and stored. This involves the ETL (Extract, Transform, Load) process to clean and transform the data.
  • Specialized Data Marts: Once the central warehouse is established, smaller, department-specific data marts (e.g., for finance or marketing) are built. These data marts pull information from the main data warehouse, ensuring consistency across departments.

 

Top-Down Approach

Advantages of Top-Down Approach   

  • Consistent View: Data marts built from a central warehouse ensure uniform data across departments, reducing reporting discrepancies.
  • High Data Consistency: Standardizing data through one source minimizes errors and improves the reliability of insights.
  • Simplified Maintenance: Updates in the central warehouse automatically reflect in all data marts, saving time and effort.
  • Scalable Architecture: New data marts can be added easily as business needs grow or change.
  • Stronger Governance: Centralized control improves data security, access management and compliance.
  • Less Data Duplication: Data is stored once in the warehouse, saving space and avoiding redundant or conflicting records.
  • Better Reporting: A unified data source enables faster, more accurate reporting and decision-making.
  • Improved Integration: Central sourcing makes it easier to combine data from multiple systems for deeper analysis.

Disadvantages of Top-Down Approach

  • High Cost & Time: Building a central data warehouse and data marts requires major investment and long implementation time, making it hard for smaller organizations.
  • Complex Setup: Designing and managing a centralized system is technically complex and requires skilled resources and careful planning.
  • Low Flexibility: Predefined structures make it hard to adapt quickly to changing business needs or reporting requirements.
  • Limited User Input: IT-led development can exclude business users, resulting in solutions that may not meet their actual needs.
  • Data Delays: Pulling data from various systems can cause processing delays, affecting real-time reporting and insights.
  • Unclear Data Ownership: Centralization can blur responsibility, making it unclear who manages or maintains specific data.

Bottom-Up Approach 

Instead of starting with a central data warehouse, it begins by building small, department-specific data marts that cater to the immediate needs of individual teams, such as sales or finance.

These data marts are later integrated to form a larger, unified data warehouse.

Working of Bottom-Up Approach

  • Department-Specific Data Marts: The process starts with creating data marts for individual departments or specific business functions. These data marts are designed to meet immediate data analysis and reporting needs, allowing departments to gain quick insights.
  • Integration into a Data Warehouse: Over time, these data marts are connected and consolidated to create a unified data warehouse. The integration ensures consistency and provides a comprehensive view of the organization’s data.

Bottom-Up Approach

Advantages of Bottom-Up Approach   

  • Faster Reporting: Data marts allow quick insights and report generation.
  • Step-by-Step Development: Enables gradual rollout with quick wins.
  • User-Centric: Involves business users to meet actual needs.
  • Highly Flexible: Easily customized for departments or evolving needs.
  • Quick Results: Early setup gives immediate value.

Disadvantages of Bottom-Up Approach   

  • Inconsistent Views: Different structures can lead to inconsistent reporting.
  • Data Silos: Independent marts may cause duplication and isolation.
  • Integration Difficulty: Combining varied marts into one warehouse is hard.
  • Redundant Efforts: Similar marts may be built by different teams.
  • Harder to Manage: Multiple marts increase maintenance overhead.

 

Major Issues in Data Mining:

Mining different kinds of knowledge in databases − Different users are interested in different kinds of knowledge, so data mining should cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.

Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

Handling noisy or incomplete data − Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation − This refers to the interestingness of the discovered patterns. A pattern is not interesting if it merely represents common knowledge or lacks novelty, so such patterns should be filtered out before being reported to the user.

Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mined knowledge as the database changes, without mining the entire data again from scratch.

 

 

 

 

 

DATA PREPROCESSING

·       Data preprocessing is the first and most important step in the data mining process.

·       Real-world data is often incomplete, inconsistent, noisy, or duplicated, and cannot be used directly for analysis.

·       Preprocessing improves the quality of data, which leads to better mining results and accurate predictions.

Why preprocessing?

1.      Real world data are generally

o      Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

o      Noisy: containing errors or outliers

o      Inconsistent: containing discrepancies in codes or names

2.      Tasks in data preprocessing

o      Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

o      Data integration: using multiple databases, data cubes, or files.

o      Data transformation: normalization and aggregation.

o      Data reduction: reducing the volume but producing the same or similar analytical results.

o      Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Major Steps of Data Preprocessing

1. Data Cleaning

Fixes errors and removes noise.
Includes:

  • Handling missing values (fill, delete, or estimate)
  • Removing noise (using smoothing techniques)
  • Correcting inconsistencies
  • Removing duplicates

Example:
If customer age is missing, we fill it with the average age.

2. Data Integration

Combining data from multiple sources.
Sources may include:

  • Databases
  • Files
  • Web data
  • Sensors

Problems handled:

  • Schema conflicts
  • Naming conflicts
  • Duplicate records

Example:
Integrating sales data from Excel with customer data from SQL database.

3. Data Transformation

Converting the data into a suitable format for mining.
Includes:

  • Normalization: Scaling values (0 to 1)
  • Aggregation: Summarizing data
  • Generalization: Replacing low-level data with higher concepts
  • Encoding: Converting categories into numbers

Example:
Converting “High/Medium/Low” into 3/2/1.
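
Two of these transformations are easy to show in a few lines of plain Python; the ratings and salary figures below are invented for illustration.

```python
# A minimal sketch: encoding ordered categories and min-max scaling to [0, 1].
rating_code = {"Low": 1, "Medium": 2, "High": 3}
ratings = ["High", "Low", "Medium", "High"]
encoded = [rating_code[r] for r in ratings]            # -> [3, 1, 2, 3]

salaries = [20000, 35000, 50000, 80000]
lo, hi = min(salaries), max(salaries)
scaled = [(s - lo) / (hi - lo) for s in salaries]      # -> [0.0, 0.25, 0.5, 1.0]
print(encoded, scaled)
```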

 

4. Data Reduction

Reducing data size while keeping its meaning.
Used when data is huge.
Techniques:

  • Dimensionality reduction (e.g., PCA)
  • Numerosity reduction (sampling, clustering)
  • Data compression

Example:
Instead of storing daily sales for 5 years, store monthly totals.
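
The monthly-totals idea can be sketched with pandas (assumed installed); the daily sales figures are randomly generated stand-ins, not real data.

```python
# A minimal sketch of data reduction by aggregation: daily rows -> monthly totals.
import numpy as np
import pandas as pd

days = pd.date_range("2020-01-01", "2020-12-31", freq="D")
daily_sales = pd.Series(np.random.randint(100, 500, len(days)), index=days)

monthly_totals = daily_sales.resample("MS").sum()   # 366 rows reduced to 12
print(monthly_totals)
```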

 

5. Data Discretization

Converting continuous values into categories (bins).
Helps in classification and association rule mining.

Example:
Age →

  • 0–12 (Child)
  • 13–19 (Teen)
  • 20–60 (Adult)

 

📌 Summary

| Step | Purpose |
| --- | --- |
| Data Cleaning | Fix missing, noisy, inconsistent data |
| Data Integration | Combine data from multiple sources |
| Data Transformation | Normalize, aggregate, convert formats |
| Data Reduction | Reduce size while keeping essential info |
| Data Discretization | Convert continuous data → categories |

 

Importance of data preprocessing

Preprocessing data is an important step for data analysis.

1.     It improves accuracy and reliability.

2.     It makes data consistent.

3.     It makes the data easier for mining algorithms to read and process.

 

DATA CLEANING

·       Data Cleaning (also called data cleansing or data scrubbing) is the process of detecting, correcting, and removing errors in a dataset to improve its quality.

·       It deals with problems such as missing values, noise, duplicate records, inconsistent formats, and incorrect data to ensure that the data is accurate, complete, and reliable for analysis or data mining.

Data cleaning mainly involves:

  • Handling missing values
  • Removing noise
  • Detecting and correcting inconsistencies

1. Handling Missing Values

Missing data must be handled carefully to avoid wrong mining results.
Common methods:

  1. Ignore the tuple
    Used when the class label is missing; not effective for large missing portions.
  2. Fill manually
    Accurate but slow and not feasible for large datasets.
  3. Use a global constant
    Replace with “Unknown” or “N/A”; may create false patterns.
  4. Use mean/median of the attribute
    Mean for normal data; median for skewed data.
  5. Use mean/median of similar class
    More accurate—uses class-specific statistics.
  6. Use most probable value
    Uses regression, Bayesian methods, or decision trees to predict missing values.
    → Most powerful because it uses relationships between attributes.

Note: Missing value does not always imply error (e.g., no driver’s license).
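
Two of the strategies above (mean of the attribute, and mean of the same class) can be sketched with pandas, assumed installed; the small customer table is invented.

```python
# A minimal sketch of filling missing ages with the overall mean and with
# the mean of the customer's own segment (class).
import pandas as pd

df = pd.DataFrame({
    "segment": ["student", "student", "family", "family", "student"],
    "age":     [19,        None,      41,       39,        22],
})

# strategy 4: fill with the attribute mean
df["age_mean"] = df["age"].fillna(df["age"].mean())

# strategy 5: fill with the mean of the same class (segment)
df["age_class_mean"] = df["age"].fillna(
    df.groupby("segment")["age"].transform("mean")
)
print(df)
```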

2. Handling Noisy Data

Noise = random errors or variations.
Techniques to smooth noise:

a) Binning

  • Sort the data and divide it into bins.
  • Replace the values in each bin using:
    o   Bin mean
    o   Bin median
    o   Bin boundaries
  • Binning performs local smoothing (see the sketch below).
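
Here is a minimal sketch of equal-depth binning with smoothing by bin means; the sorted price list is a standard illustrative example, and the choice of 3 values per bin is arbitrary.

```python
# A minimal sketch of smoothing noisy values by replacing each value
# with the mean of its bin (equal-depth bins of size 3).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

bin_size = 3
smoothed = []
for i in range(0, len(prices), bin_size):
    bin_vals = prices[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([round(bin_mean, 1)] * len(bin_vals))

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```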

b) Regression

  • Fit the data to a line/curve (linear or multiple regression).

c) Outlier Detection

  • Use clustering or statistical methods to detect unusual values.

Many smoothing methods also help in discretization and reduction.


3. Data Cleaning as a Process

Cleaning involves detecting and correcting discrepancies.

a) Discrepancy Detection

Errors may come from:

  • Poor forms, human entry errors, outdated values
  • Inconsistent codes/formats (e.g., date formats)
  • Data integration differences

Use metadata and statistical descriptions (mean, range, standard deviation) to identify:

  • Outliers
  • Inconsistent values
  • Invalid formats
  • Violations of rules (unique rule, consecutive rule, null rule)

b) Data Transformation

Once discrepancies are found, transformations are applied:

  • Reformatting
  • Correcting values
  • Standardizing codes (e.g., “gender” → “sex”)

Tools:

  • Data scrubbing tools (detect & fix using domain knowledge)
  • Data auditing tools (find rule violations using mining techniques)
  • ETL tools (apply transformations)

c) Interactive Cleaning

New tools like Potter’s Wheel allow:

  • Step-by-step transformations
  • Immediate feedback
  • Undo/redo options
  • Automatic discrepancy checking

 

DATA INTEGRATION AND TRANSFORMATION

DATA INTEGRATION

  • Data integration is the process of consolidating information from multiple, diverse sources that use different technologies, creating a cohesive and unified dataset.
  • This consolidated system is commonly known as a data warehouse.
  • It involves combining data from various repositories, including databases, data cubes, and flat files.
  • Effective data integration relies on managing metadata, performing correlation analysis, detecting data conflicts, and resolving semantic inconsistencies to ensure seamless merging of information.

Benefits:

  • Operates independently.
  • Processes queries quickly.
  • Handles complex questions.
  • Can summarize and store data effectively.
  • Manages large amounts of data.

Drawbacks:

  • Slower response due to data loading.
  • More expensive because of data storage and security costs.

Key issues in data integration:

  1. Schema integration
  2. Redundancy
  3. Detecting and fixing data conflicts

Schema integration means matching real-world entities from different sources, called the entity identification problem. For example, ensuring customer_id in one database matches cust_number in another. Metadata stores data about data in databases and warehouses.

Redundancy:

  • An attribute is redundant if it can be derived from another table, like annual revenue.
  • Correlation analysis can detect some redundancies by measuring how strongly one attribute implies another.
  • For example, correlation between attributes A and B shows their relationship based on data.
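
Correlation analysis can be sketched in plain Python with the standard Pearson coefficient; the two attribute columns below are invented, and a value close to +1 or −1 suggests one attribute largely duplicates the other.

```python
# A minimal sketch: Pearson correlation between two numeric attributes.
from statistics import mean

A = [10, 20, 30, 40, 50]   # attribute from source 1 (invented)
B = [12, 24, 35, 41, 52]   # attribute from source 2 (invented)

ma, mb = mean(A), mean(B)
cov   = sum((a - ma) * (b - mb) for a, b in zip(A, B))
var_a = sum((a - ma) ** 2 for a in A)
var_b = sum((b - mb) ** 2 for b in B)

r = cov / (var_a ** 0.5 * var_b ** 0.5)
print(round(r, 3))   # close to 1 -> the attributes are likely redundant
```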

Detecting and fixing data value conflicts is crucial in data integration. Different sources may show the same entity's attributes differently due to varied formats, scales, or detail levels. For example, "total sales" might mean sales for one store branch in one database but sales for all stores in a region in another.

DATA TRANSFORMATION:

In data transformation, the data are transformed or consolidated into forms appropriate for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation.

Data transformation can involve the following:

1.     Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.

2.     Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

3.     Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.

4.     Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).

5.     Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
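
Normalization to an arbitrary target range and attribute construction can both be sketched in a few lines of plain Python; the income values, the plot records, and the derived "area" attribute are all invented.

```python
# A minimal sketch of min-max normalization to [-1.0, 1.0] and of
# constructing a new attribute from existing ones.
def min_max(values, new_min=-1.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 30000, 54000, 98000]
print(min_max(incomes))   # scaled into [-1.0, 1.0]

# attribute construction: derive "area" from width and height
plots = [{"width": 10, "height": 20}, {"width": 8, "height": 15}]
for p in plots:
    p["area"] = p["width"] * p["height"]
print(plots)
```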

Data Reduction

 

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Strategies for data reduction include the following:

Data cube aggregation: where aggregation operations are applied to the data in the construction of a data cube.

Attribute subset selection: where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

Dimensionality reduction: where encoding mechanisms are used to reduce the data set size.

Numerosity reduction: where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

 

Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
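
Numerosity reduction by simple random sampling, mentioned above, can be sketched in plain Python; the record list and the 1% sample size are invented for illustration.

```python
# A minimal sketch of numerosity reduction: keep a 1% random sample
# of the records instead of the full data set.
import random

records = list(range(100_000))   # stand-in for a large data set

random.seed(0)
sample = random.sample(records, k=len(records) // 100)
print(len(sample))   # 1000
```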

 

 
