📘 Data Mining & Warehousing Unit 3

Data Types, Data Quality, Preprocessing, Similarity Measures, KDD, Data Mining Tasks and Fuzzy Logic

🎯 Unit 3 Overview

Unit 3 introduces the core concepts of data and data mining: data types, data quality, data preprocessing, similarity measures, summary statistics, data distributions, basic data mining tasks, KDD, issues in data mining, and fuzzy logic.

Exam Tip: Data preprocessing, similarity measures, KDD, data mining tasks and fuzzy logic are important topics for RGPV exams.

📊 Introduction to Data

Data is a collection of facts, values, observations or records. In data mining, data is analyzed to discover useful patterns and knowledge.

Examples of Data

📂 Data Types

Data Type | Description | Example
Nominal Data | Categories without order. | Gender, city, branch
Ordinal Data | Categories with order. | Low, medium, high
Interval Data | Numeric data without true zero. | Temperature in Celsius
Ratio Data | Numeric data with true zero. | Age, income, weight
Discrete Data | Countable values. | Number of students
Continuous Data | Measured values. | Height, time, distance
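
The difference between interval and ratio data can be checked with a little arithmetic. The sketch below is only an illustration: the two temperatures are made-up values, and `celsius_to_kelvin` is a helper name introduced here. It shows that differences are meaningful on the Celsius scale, while ratios only make sense on a scale with a true zero such as Kelvin.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15


# Hypothetical readings, chosen only for illustration.
c_low, c_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = c_high - c_low

# The naive Celsius ratio suggests "twice as hot", which is misleading
# because 0 degrees Celsius is not a true zero.
naive_ratio = c_high / c_low

# On the Kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference, naive_ratio, round(kelvin_ratio, 3))
```

Age, income and weight already have a true zero, so their ratios can be used directly.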

✅ Quality of Data

Data quality describes how accurate, complete, consistent and useful data is for analysis. Poor-quality data leads to unreliable data mining results.

Data Quality Issues

🧹 Data Preprocessing

Data preprocessing is the process of converting raw data into clean and useful data before mining.

Steps of Data Preprocessing

  1. Data cleaning
  2. Data integration
  3. Data transformation
  4. Data reduction
  5. Data discretization

Raw data is not directly suitable for mining, so preprocessing is done first.
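
Three of the steps above (cleaning, transformation and discretization) can be sketched in a few lines of Python. This is a minimal illustration, not a standard recipe: the `raw_ages` data, the function names, the mean-fill strategy and the bin edges are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw records: ages with one missing value (None).
raw_ages = [22, 35, None, 58, 41]


def fill_missing(values):
    """Data cleaning: replace missing entries with the mean of known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges, labels):
    """Data discretization: map each value to the label of its bin.

    A value below edges[i] gets labels[i]; values at or above the last
    edge get the final label.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


cleaned = fill_missing(raw_ages)
scaled = min_max_scale(cleaned)
binned = discretize(cleaned, edges=[30, 50], labels=["young", "middle", "senior"])

print(cleaned)
print(binned)
```

Filling with the mean is only one possible cleaning choice; dropping incomplete records or using the median are equally common alternatives.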

🔍 Similarity Measures

Similarity measures are used to find how similar or different two data objects are. They are mostly used in clustering and classification.

Common Similarity / Distance Measures

Measure | Use
Euclidean Distance | Straight-line distance between two points in space.
Manhattan Distance | Distance measured along right-angle (axis-parallel) paths.
Cosine Similarity | Measures the cosine of the angle between two vectors.
Jaccard Similarity | Measures the similarity of two sets.
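
The four measures in the table can be written directly from their usual definitions. The following is a small sketch in plain Python; the function names and sample points are chosen here for illustration, and treating two empty sets as identical in `jaccard` is an assumed convention.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard(s, t):
    """Size of the intersection divided by size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(s & t) / len(s | t)


p, q = (1, 2), (4, 6)
print(euclidean(p, q))   # coordinate differences are 3 and 4
print(manhattan(p, q))
print(cosine_similarity((1, 0), (0, 1)))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar).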

📈 Summary Statistics

Summary statistics describe the main features of data using numerical values.
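
As a quick illustration, commonly used summary measures such as mean, median, mode, range and standard deviation can be computed with Python's built-in `statistics` module. The `marks` list is made-up sample data, and this choice of measures is an assumption for the example.

```python
import statistics

# Hypothetical exam marks, used only for illustration.
marks = [45, 50, 50, 60, 95]

mean_mark = statistics.mean(marks)        # central value (average)
median_mark = statistics.median(marks)    # middle value after sorting
mode_mark = statistics.mode(marks)        # most frequent value
mark_range = max(marks) - min(marks)      # spread between the extremes
std_dev = statistics.pstdev(marks)        # population standard deviation

print(mean_mark, median_mark, mode_mark, mark_range, round(std_dev, 2))
```

Here the single high mark of 95 pulls the mean above the median, which is why both are usually reported together.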

Important Measures

📊 Data Distributions

Data distribution shows how data values are spread over a range.

Types

Understanding distribution helps in selecting suitable data mining algorithms.
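
One simple way to inspect how values are spread is a frequency count. The sketch below uses made-up `values` and prints a rough text histogram; it is only meant to show the idea.

```python
from collections import Counter

# Hypothetical observations, used only for illustration.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Count how often each value occurs.
frequency = Counter(values)

# Print one row of stars per distinct value, in ascending order.
for value in sorted(frequency):
    print(f"{value}: {'*' * frequency[value]}")
```

In this sample the counts rise to a peak at 3 and fall off evenly on both sides, so the distribution is symmetric.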

⛏️ Basic Data Mining Tasks

Task | Description
Classification | Assigns data into predefined classes.
Clustering | Groups similar data objects.
Association Rule Mining | Finds relationships between items.
Regression | Predicts continuous numeric values.
Prediction | Predicts future outcomes.
Anomaly Detection | Finds abnormal or unusual data.
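
To make one of these tasks concrete, association rule mining usually scores a rule such as "bread implies milk" with two numbers, support and confidence. The sketch below is an illustration under assumptions: the `transactions` list is invented, and the helper names are chosen for this example.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


print(support({"bread", "milk"}, transactions))
print(round(confidence({"bread"}, {"milk"}, transactions), 3))
```

Bread appears in three of the four transactions and bread with milk in two, so the rule holds in two out of three bread purchases.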

🧠 Data Mining vs Knowledge Discovery in Databases

KDD means Knowledge Discovery in Databases. It is the complete process of discovering useful knowledge from large datasets. Data mining is one important step of KDD.

KDD Process Steps

  1. Data selection
  2. Data cleaning
  3. Data transformation
  4. Data mining
  5. Pattern evaluation
  6. Knowledge presentation

KDD is the complete process; data mining is the main step of that process.
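
The six steps can be pictured as a chain of small functions, each feeding the next. The following toy pipeline is only a sketch: the `records`, the spending threshold of 500 and every function name are hypothetical, and the "mining" step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, city, purchase_amount).
records = [
    (1, "CityA", 250.0),
    (2, "CityB", None),
    (3, "CityA", 900.0),
    (4, "CityB", 120.0),
    (5, "CityA", 640.0),
]


def select(rows):
    """1. Data selection: keep only the attributes of interest."""
    return [(city, amount) for _, city, amount in rows]


def clean(rows):
    """2. Data cleaning: drop rows with a missing amount."""
    return [(city, amount) for city, amount in rows if amount is not None]


def transform(rows):
    """3. Data transformation: turn the amount into a 'high'/'low' label."""
    return [(city, "high" if amount >= 500 else "low") for city, amount in rows]


def mine(rows):
    """4. Data mining: count how often each (city, label) pair occurs."""
    return Counter(rows)


def evaluate(patterns, min_count=2):
    """5. Pattern evaluation: keep pairs seen at least min_count times."""
    return {pair: n for pair, n in patterns.items() if n >= min_count}


def present(patterns):
    """6. Knowledge presentation: format the surviving patterns as text."""
    return [f"{city} / {label}: {n} records"
            for (city, label), n in sorted(patterns.items())]


knowledge = present(evaluate(mine(transform(clean(select(records))))))
print(knowledge)
```

Only step 4 is data mining in the narrow sense; the other five functions are the surrounding KDD work.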

⚖️ Data Mining vs KDD

Data Mining | KDD
It is a step in KDD. | It is the complete knowledge discovery process.
Focuses on pattern extraction. | Includes selection, cleaning, mining and interpretation.
Uses algorithms. | Uses complete methodology.
Output is patterns. | Output is useful knowledge.

⚠️ Issues in Data Mining

🌫️ Introduction to Fuzzy Sets

A fuzzy set allows partial membership. In a classical set, an element either belongs to the set or does not. In a fuzzy set, the membership value can be any number between 0 and 1.

Example

A person can be partially tall: the membership value may be 0.7 rather than simply true or false.

Classical set: 0 or 1 only. Fuzzy set: value between 0 and 1.
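
A membership function makes the "partially tall" example concrete. The sketch below assumes a simple linear ramp: the function name and the 160 cm and 190 cm thresholds are illustrative choices, not fixed definitions of tallness.

```python
def tall_membership(height_cm, lower=160.0, upper=190.0):
    """Degree of membership in the fuzzy set 'tall', between 0 and 1.

    Assumed shape: 0 at or below `lower`, 1 at or above `upper`,
    and a straight line in between.
    """
    if height_cm <= lower:
        return 0.0
    if height_cm >= upper:
        return 1.0
    return (height_cm - lower) / (upper - lower)


for height in (150, 175, 181, 195):
    print(height, round(tall_membership(height), 2))
```

With these assumed thresholds a height of 181 cm gets a membership of 0.7, whereas a classical set with a single cut-off could only answer 0 or 1.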

🧩 Fuzzy Logic

Fuzzy logic is a form of logic that handles uncertainty and approximate reasoning. It is useful where answers are not simply true or false.
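
One common way to reason with such degrees of truth is to use the standard min, max and complement operators for AND, OR and NOT. The sketch below assumes these operators and uses made-up membership values; other operator families also exist.

```python
def fuzzy_and(a, b):
    """Fuzzy AND: the smaller of the two membership values."""
    return min(a, b)


def fuzzy_or(a, b):
    """Fuzzy OR: the larger of the two membership values."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy NOT: the complement of the membership value."""
    return 1.0 - a


# Hypothetical degrees of truth for one person.
tall, heavy = 0.7, 0.4

print(fuzzy_and(tall, heavy))        # degree of "tall AND heavy"
print(fuzzy_or(tall, heavy))         # degree of "tall OR heavy"
print(round(fuzzy_not(tall), 2))     # degree of "NOT tall"
```

When every value is exactly 0 or 1, these operators reduce to ordinary Boolean AND, OR and NOT.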

Applications

⭐ Important Questions

  1. Explain different types of data with examples.
  2. What is data quality? Explain data quality issues.
  3. Explain data preprocessing and its steps.
  4. Explain similarity measures used in data mining.
  5. Explain summary statistics.
  6. Explain basic data mining tasks.
  7. Differentiate between data mining and KDD.
  8. Explain KDD process with steps.
  9. Explain issues in data mining.
  10. Explain fuzzy sets and fuzzy logic.

🔥 Last Minute Revision
