Big Data Unit 3

Hive, HiveQL, Pig, Pig Latin, Pig Architecture, ETL Processing, Operators, Functions and Data Types

Unit 3 Overview

Big Data Unit 3 mainly covers two important Hadoop ecosystem tools: Hive and Pig. Hive is used for SQL-like querying on Hadoop, while Pig is used for data flow and ETL processing.

Exam Tip: Hive Architecture, HiveQL, Pig Architecture, Pig Latin and Hive vs Pig are very important for RGPV exams.

Introduction to Hive

Hive is a data warehouse tool built on top of Hadoop. It allows users to write SQL-like queries called HiveQL to analyze large datasets stored in HDFS.

Why Is Hive Used?

Simple Meaning: Hive is used to run SQL-like queries on data stored in Hadoop.

Hive Architecture

Hive architecture consists of components that convert HiveQL queries into jobs for an execution engine (MapReduce, Tez or Spark) and run them on Hadoop.

Component        | Work
User Interface   | Allows the user to submit HiveQL queries.
Driver           | Receives the query and manages its execution lifecycle.
Compiler         | Checks query syntax and converts the query into an execution plan.
Metastore        | Stores metadata such as table names, columns, data types and locations.
Execution Engine | Executes the query plan using MapReduce, Tez or Spark.
HDFS             | Stores the actual data files.
Important: The Hive Metastore does not store the actual data; it stores only metadata.
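
To make the metadata/data split concrete, here is a minimal Python sketch. It is not Hive code: the `metastore` and `hdfs` dictionaries, the `/warehouse/student` path, the sample rows and the `scan_table` helper are all hypothetical stand-ins chosen for illustration.

```python
# Toy model of the split between Hive's Metastore and HDFS.
# All names and values here are illustrative assumptions, not Hive APIs.

# "Metastore": describes the table only; it holds no rows.
metastore = {
    "student": {
        "columns": [("id", int), ("name", str), ("marks", int)],
        "location": "/warehouse/student",  # hypothetical HDFS path
        "delimiter": ",",
    }
}

# "HDFS": holds the raw file contents and knows nothing about the schema.
hdfs = {
    "/warehouse/student": "1,Amit,72\n2,Riya,55\n3,Neha,81\n",
}


def scan_table(table_name):
    """Look up the schema in the metastore, then parse the raw file with it."""
    meta = metastore[table_name]
    raw = hdfs[meta["location"]]
    rows = []
    for line in raw.splitlines():
        values = line.split(meta["delimiter"])
        row = {
            col: cast(value)
            for (col, cast), value in zip(meta["columns"], values)
        }
        rows.append(row)
    return rows


rows = scan_table("student")
print(rows[0])
```

The schema is applied only when the file is read, which is why deleting the "metastore" entry in this sketch would leave the raw file untouched but unreadable as a table.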

Hive Data Types

1. Primitive Data Types

2. Complex Data Types

Data Type | Example
INT       | 101
STRING    | 'RGPV'
ARRAY     | ['Java','Python','SQL']
MAP       | {'name':'Amit','city':'Bhopal'}
STRUCT    | Fields are accessed with dot notation, e.g. student.name, student.rollno
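
As a rough analogy, the three complex types behave like familiar Python containers. The sketch below is only an illustration: the `Student` class and the roll number 101 are assumptions made for the example, and the Hive type names in the comments show the corresponding declarations.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """Python stand-in for STRUCT<name:STRING, rollno:INT>."""
    name: str
    rollno: int


skills = ["Java", "Python", "SQL"]             # like ARRAY<STRING>
profile = {"name": "Amit", "city": "Bhopal"}   # like MAP<STRING,STRING>
student = Student(name="Amit", rollno=101)     # like STRUCT (values assumed)

print(skills[0])                     # ARRAY element by index
print(profile["city"])               # MAP value by key
print(student.name, student.rollno)  # STRUCT field by name
```

HiveQL uses the same access patterns: `skills[0]` for an array element, `profile['city']` for a map value, and `student.name` for a struct field.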

Hive Query Language

Hive Query Language or HiveQL is similar to SQL. It is used to create tables, load data, query data and perform analysis on Hadoop data.

Common HiveQL Commands

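-- Create a database and switch to it
CREATE DATABASE college;
USE college;
-- The comma delimiter is an assumption about the layout of student.txt
CREATE TABLE student(id INT, name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Move the file from its HDFS path into the table
LOAD DATA INPATH '/student.txt' INTO TABLE student;
SELECT * FROM student;
SELECT name, marks FROM student WHERE marks > 60;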
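
To see what the last query returns without a Hadoop cluster, the same filter can be tried in Python with SQLite. This is only a stand-in: SQLite is not Hive, `INSERT` replaces `LOAD DATA`, the column types are SQLite's, and the three sample rows are invented for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student(id INTEGER, name TEXT, marks INTEGER)")

# Hypothetical sample rows (Hive would load these from a file in HDFS).
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Amit", 72), (2, "Riya", 55), (3, "Neha", 81)],
)

# Same projection and filter as the HiveQL query on the student table.
passed = conn.execute(
    "SELECT name, marks FROM student WHERE marks > 60 ORDER BY id"
).fetchall()
print(passed)
```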

Uses of HiveQL

Introduction to Pig

Pig is a high-level platform used for analyzing large datasets in Hadoop. Pig uses a scripting language called Pig Latin.

Features of Pig

Simple Meaning: Pig is used to clean, transform and process large data.

Pig Architecture

Pig architecture explains how Pig Latin scripts are converted into MapReduce jobs.

Component         | Work
Pig Latin Script  | The user writes data processing commands.
Parser            | Checks syntax and creates the logical plan.
Optimizer         | Improves the logical plan for better performance.
Compiler          | Converts the logical plan into MapReduce jobs.
Execution Engine  | Runs the MapReduce jobs on Hadoop.
HDFS              | Stores input and output data.

Pig on Hadoop

Pig runs on Hadoop and uses HDFS for storage and MapReduce for processing. The user writes Pig Latin scripts, and Pig automatically converts them into MapReduce jobs.

  1. User writes Pig Latin script.
  2. Pig parses the script.
  3. Logical plan is generated.
  4. Plan is optimized.
  5. MapReduce jobs are created.
  6. Jobs are executed on Hadoop cluster.
  7. Output is stored in HDFS.
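
The steps above can be sketched as a toy pipeline in Python: operations are first recorded as a plan, the plan is rewritten, and work happens only when execution is requested. The `ToyPlan` class, its methods and the sample records are hypothetical; Pig's real planner and optimizer apply their own rule set and generate MapReduce jobs rather than Python lists.

```python
# Toy "script -> logical plan -> optimize -> execute" pipeline.
# Everything here is an illustrative assumption, not Pig's implementation.

class ToyPlan:
    def __init__(self, source):
        self.source = source   # input records (stands in for a file in HDFS)
        self.steps = []        # logical plan: operations recorded, not yet run

    def filter(self, predicate):
        self.steps.append(("FILTER", predicate))
        return self

    def foreach(self, func):
        self.steps.append(("FOREACH", func))
        return self

    def optimize(self):
        """Merge adjacent FILTER steps into one (a simple, safe rewrite)."""
        merged = []
        for kind, func in self.steps:
            if kind == "FILTER" and merged and merged[-1][0] == "FILTER":
                previous = merged[-1][1]
                merged[-1] = (
                    "FILTER",
                    lambda r, p=previous, q=func: p(r) and q(r),
                )
            else:
                merged.append((kind, func))
        self.steps = merged
        return self

    def execute(self):
        """Run the recorded steps in order and return the output records."""
        records = list(self.source)
        for kind, func in self.steps:
            if kind == "FILTER":
                records = [r for r in records if func(r)]
            else:  # FOREACH
                records = [func(r) for r in records]
        return records


students = [(1, "Amit", 72), (2, "Riya", 55), (3, "Neha", 81)]
plan = (
    ToyPlan(students)
    .filter(lambda r: r[2] > 60)
    .filter(lambda r: r[1] != "Neha")
    .foreach(lambda r: (r[1], r[2]))
)
steps_before = len(plan.steps)
plan.optimize()
steps_after = len(plan.steps)
result = plan.execute()
print(steps_before, steps_after, result)
```

Nothing is computed while the plan is being built, which mirrors how Pig defers work until a DUMP or STORE statement asks for output.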

Pig Latin

Pig Latin is a data flow language used in Apache Pig. It is simple and suitable for data transformation.

Example Pig Latin Script

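-- Load comma-separated student records with a declared schema
student = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
-- Keep only students who scored more than 60
passed = FILTER student BY marks > 60;
-- Group the passing records by name (defined here, but not displayed below)
grouped = GROUP passed BY name;
-- Display the filtered records on screen
DUMP passed;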
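
For readers who know Python better than Pig Latin, the script above corresponds roughly to the following sketch. The contents of `STUDENT_TXT` are invented sample data, and the helper names are hypothetical; the point is only to show what each statement does to the records.

```python
import csv
import io
from collections import defaultdict

# Hypothetical contents of student.txt
STUDENT_TXT = "1,Amit,72\n2,Riya,55\n3,Neha,81\n4,Amit,90\n"


def load(text):
    """Like LOAD ... USING PigStorage(',') AS (id:int, name:chararray, marks:int)."""
    return [(int(i), name, int(m)) for i, name, m in csv.reader(io.StringIO(text))]


def filter_passed(records):
    """Like FILTER student BY marks > 60."""
    return [r for r in records if r[2] > 60]


def group_by_name(records):
    """Like GROUP passed BY name: each key maps to a bag of matching tuples."""
    groups = defaultdict(list)
    for r in records:
        groups[r[1]].append(r)
    return dict(groups)


student = load(STUDENT_TXT)
passed = filter_passed(student)
grouped = group_by_name(passed)

for record in passed:  # like DUMP passed
    print(record)
```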

Common Pig Latin Commands

ETL Processing in Pig

ETL means Extract, Transform and Load. Pig is widely used for ETL operations in Big Data.

Step      | Meaning
Extract   | Data is collected from different sources.
Transform | Data is cleaned, filtered and converted into a useful format.
Load      | Processed data is stored in the target system.
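
The three ETL steps can be sketched in a few lines of Python. This is an illustration of the pattern, not Pig code: the `RAW` input, the cleaning rules (trim names, drop rows with missing marks) and the JSON-lines output format are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical raw source with messy names and one missing value.
RAW = "id,name,marks\n1, amit ,72\n2,Riya,\n3,NEHA,81\n"


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, normalize names, convert types."""
    cleaned = []
    for row in rows:
        marks = row["marks"].strip()
        if not marks:  # skip records with missing marks
            continue
        cleaned.append(
            {
                "id": int(row["id"]),
                "name": row["name"].strip().title(),
                "marks": int(marks),
            }
        )
    return cleaned


def load(rows):
    """Load: serialize to JSON lines (stands in for writing to a target system)."""
    return "\n".join(json.dumps(r) for r in rows)


output = load(transform(extract(RAW)))
print(output)
```

In Pig the same roles are played by LOAD for extraction, operators such as FILTER and FOREACH for transformation, and STORE for loading.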

Operators in Pig

Operator | Use
LOAD     | Loads data from the file system.
FILTER   | Selects records based on a condition.
FOREACH  | Generates the required fields for each record.
GROUP    | Groups records.
JOIN     | Combines two datasets.
ORDER    | Sorts data.
DISTINCT | Removes duplicate records.
STORE    | Saves output to the file system.
DUMP     | Displays output on screen.
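
Several of these operators have close equivalents in plain Python, which can help when reasoning about what a script will produce. The sketch below uses invented `students` and `cities` records; it mimics DISTINCT, FOREACH, ORDER and an inner JOIN on small in-memory lists rather than on distributed data.

```python
# Hypothetical sample relations (note the duplicate student record).
students = [(1, "Amit", 72), (2, "Riya", 55), (3, "Neha", 81), (3, "Neha", 81)]
cities = [(1, "Bhopal"), (3, "Indore")]

# DISTINCT: remove duplicate records. Keeping first-seen order is a property
# of this Python idiom, not something Pig promises.
unique = list(dict.fromkeys(students))

# FOREACH ... GENERATE name, marks: keep only the required fields.
projected = [(name, marks) for _, name, marks in unique]

# ORDER ... BY marks DESC: sort records by a field.
ordered = sorted(unique, key=lambda r: r[2], reverse=True)

# JOIN students BY id, cities BY id: inner join that keeps the fields
# of both relations, including both copies of the key.
joined = [s + c for s in unique for c in cities if s[0] == c[0]]

print(unique)
print(projected)
print(ordered)
print(joined)
```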

Functions in Pig

Pig supports built-in functions and user-defined functions for data processing.

Common Built-in Functions

User Defined Functions

If built-in functions are not sufficient, users can create their own functions using Java, Python or other supported languages.
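
As a sketch of what the body of a user-defined function might look like, here is a small Python function that converts marks into a grade. The function name and the grade thresholds are assumptions made for illustration. It is shown as plain Python so it can be run anywhere; to use it from Pig it would additionally need to be registered and given an output schema as described in Pig's UDF documentation.

```python
def grade(marks):
    """Map integer marks to a letter grade (thresholds are illustrative)."""
    if marks is None:
        return None  # fields can be null, so handle that case explicitly
    if marks >= 75:
        return "A"
    if marks >= 60:
        return "B"
    return "C"


grades = [grade(m) for m in (81, 60, 42, None)]
print(grades)
```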

Data Types in Pig

Data Type | Meaning
int       | 32-bit signed integer
long      | 64-bit signed integer
float     | 32-bit floating-point (decimal) value
double    | 64-bit floating-point (decimal) value
chararray | String value
bytearray | Raw bytes (binary data)
tuple     | Ordered set of fields
bag       | Collection of tuples
map       | Set of key-value pairs
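
The three complex types nest inside each other, which is easiest to see with a small example. The Python values below are only an analogy with invented data: a Python tuple stands in for a Pig tuple, a list of tuples for a bag, and a dict for a map.

```python
# Python analogies for Pig's complex types (sample values are assumptions).
record = (1, "Amit", 72)                     # tuple: ordered set of fields
bag = [(1, "Amit", 72), (3, "Neha", 81)]     # bag: collection of tuples
info = {"city": "Bhopal", "branch": "CSE"}   # map: key-value pairs

# GROUP produces nested data: one tuple per key, holding the key and a bag
# of all the tuples that share it.
group = ("Amit", [(1, "Amit", 72), (4, "Amit", 90)])

# Aggregating over the inner bag, as SUM would over a grouped field.
total = sum(t[2] for t in group[1])
print(record[1], len(bag), info["city"], total)
```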

Hive vs Pig

Hive                     | Pig
Uses HiveQL              | Uses Pig Latin
SQL-like query language  | Data flow scripting language
Best for structured data | Best for semi-structured data
Used by analysts         | Used by developers
Good for reporting       | Good for ETL processing

Important Questions

  1. Explain Hive Architecture with diagram.
  2. What is Hive? Explain features of Hive.
  3. Explain Hive data types.
  4. What is HiveQL? Write common HiveQL commands.
  5. Explain Pig Architecture.
  6. What is Pig Latin? Explain with example.
  7. Explain Pig on Hadoop.
  8. Explain ETL processing in Pig.
  9. Explain operators and functions in Pig.
  10. Differentiate between Hive and Pig.

Last Minute Revision
