Hive Commands

The document provides a comprehensive guide on using Apache Hive, including commands for starting Hive, creating databases and tables, inserting and selecting data, and managing partitions and buckets. It explains the differences between partitioning and bucketing, as well as data types in Hive such as arrays, maps, and structs. Additionally, it covers various SQL-like queries for data manipulation and retrieval in Hive.

Uploaded by

vaishnavi kumari

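Steps to start Hive:

1. cd C:\hadoopsetup\hadoop-3.2.4\sbin (Hadoop sbin directory)
2. start-all.cmd (starts the Hadoop services)
3. cd C:\hive\apache-hive-3.1.3-bin\apache-hive-3.1.3-bin\bin (Hive bin directory)
4. In a new Command Prompt window: StartNetworkServer -h 0.0.0.0 (this appears to be the Apache Derby network server script, which this setup presumably uses for the Hive metastore; leave the window open)
5. Back in the original Command Prompt window: hive

Now start with the Hive commands: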


1. Create Database
CREATE DATABASE lpu;
What it does:
Creates a new database named lpu.
Purpose:
Databases are used to organize tables into separate logical groups.

2. Use Database
USE lpu;
What it does:
Switches the active database to lpu, so any new table you create will belong to it.
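
A slightly more defensive variant of steps 1 and 2, as a sketch rather than a required change: IF NOT EXISTS avoids an error when the database was already created in an earlier session, and current_database() confirms which database the session is using.

```sql
-- Create the database only if it is not already there (safe to re-run).
CREATE DATABASE IF NOT EXISTS lpu;

-- Make it the active database for the statements that follow.
USE lpu;

-- Confirm which database the session is using; this should report lpu.
SELECT current_database();
```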

3. Create Table
CREATE TABLE students (id INT, name STRING);
What it does:
Creates a table students with two columns:
id → integer type
name → string type

4. Show Tables
SHOW TABLES;
What it does:
Lists all the tables available in the currently selected database.

5. Describe Table
DESCRIBE students;
What it does:
Displays the schema (columns and data types) of the students table.
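
When the column list alone is not enough, DESCRIBE FORMATTED prints extra table metadata. The sketch below only contrasts the two forms; the exact layout of the output depends on the Hive version.

```sql
-- Column names and data types only.
DESCRIBE students;

-- Columns plus table metadata such as the HDFS location, the table type
-- (MANAGED_TABLE or EXTERNAL_TABLE) and the storage format.
DESCRIBE FORMATTED students;
```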

6. Insert Data into Table


INSERT INTO students VALUES (1, "abc");
What it does:
Inserts one record into students table:
(id=1, name="abc").

7. Select Data
SELECT * FROM students;
What it does:
Fetches all rows and all columns from the students table.

8. Create Table with Custom Settings
CREATE TABLE customer (id INT, fname STRING, lname STRING, city STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
What it does:
Creates a customer table where:
Each row's fields are separated by commas.
Table is saved in plain text format.

9. Load Data into Table
LOAD DATA LOCAL INPATH 'C:/Users/ASUS/Desktop/HADOOPFILES/hive.txt' INTO TABLE customer;
What it does:
Loads data from a local file hive.txt into the customer table.
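
LOAD DATA copies the file into the table's storage directory as it is; it does not parse or validate the rows, so the file has to match the table's delimiter and column order. The sketch below shows the append and overwrite forms. The sample rows in the comment are hypothetical, since the contents of hive.txt are not shown here.

```sql
-- Assumed layout of hive.txt (hypothetical rows, one record per line,
-- fields in the same order as the customer columns):
--   1,Asha,Verma,Delhi
--   2,Ravi,Kumar,Jalandhar

-- INTO TABLE appends the file to whatever the table already holds.
LOAD DATA LOCAL INPATH 'C:/Users/ASUS/Desktop/HADOOPFILES/hive.txt' INTO TABLE customer;

-- OVERWRITE INTO TABLE replaces the existing contents instead of appending.
LOAD DATA LOCAL INPATH 'C:/Users/ASUS/Desktop/HADOOPFILES/hive.txt' OVERWRITE INTO TABLE customer;

-- Quick check that the fields landed in the right columns.
SELECT * FROM customer LIMIT 5;
```

With LOCAL the path is read from the machine running the Hive client; without LOCAL the path is taken to be in HDFS and the file is moved rather than copied.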

10. Rename Table


ALTER TABLE customer RENAME TO employees;
What it does:
Changes the table name from customer to employees.

11. Add Column to Table


ALTER TABLE employees ADD COLUMNS (salary INT);
What it does:
Adds a new column salary (integer type) to the employees table.

12. Truncate Table
TRUNCATE TABLE employees;
What it does:
Removes all rows from the employees table but keeps the table structure.

13. Drop Table


DROP TABLE employees;
What it does:
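Deletes the employees table. Because it is a managed (internal) table, Hive removes its data files along with its metadata; for an external table only the metadata would be removed.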

Queries on student_data Table


Create student_data Table
CREATE TABLE student_data (
student_id INT,
student_name STRING,
department STRING,
marks INT,
advisor_id INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
What it does:
Creates a comma-delimited text table for storing student records.

Insert Multiple Rows
INSERT INTO TABLE student_data VALUES
(1, 'Anya', 'CS', 88, 501),
(2, 'Brian', 'Math', 76, 502),
(3, 'Cara', 'CS', 92, 501),
(4, 'Daniel', 'Physics', 65, 503),
(5, 'Eva', 'Math', 81, NULL);
What it does:
Inserts multiple student records into the table at once.

Select CS Students with Marks > 90


SELECT * FROM student_data
WHERE department = 'CS' AND marks > 90;
Purpose:
Fetches CS students who scored more than 90 marks.

Students Not in Math Department


SELECT * FROM student_data
WHERE department != 'Math';
Purpose:
Fetches students whose department is NOT Math.

Students Whose Names Start with 'A'


SELECT * FROM student_data
WHERE student_name LIKE 'A%';
Purpose:
Fetches students whose names begin with the letter A.

Students in CS or Physics Department


SELECT * FROM student_data
WHERE department IN ('CS', 'Physics');
Purpose:
Fetches students enrolled either in CS or Physics departments.

Students with Marks Between 70 and 90


SELECT * FROM student_data
WHERE marks BETWEEN 70 AND 90;
Purpose:
Fetches students whose marks fall between 70 and 90, inclusive.
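
The filters above can be combined in a single WHERE clause, and NULL values need their own test. A small sketch against the same student_data rows:

```sql
-- Students without an advisor. NULL must be tested with IS NULL;
-- a comparison such as advisor_id = NULL never evaluates to true.
SELECT student_name
FROM student_data
WHERE advisor_id IS NULL;

-- Combine several of the filters above:
-- CS or Math students scoring between 80 and 90, highest marks first.
SELECT student_name, department, marks
FROM student_data
WHERE department IN ('CS', 'Math')
  AND marks BETWEEN 80 AND 90
ORDER BY marks DESC;
```

With the five rows inserted earlier, the first query should return Eva, and the second should return Anya (88) followed by Eva (81).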

Extra Useful Hive Commands (Added by me!)


Show All Databases
SHOW DATABASES;
Lists all databases available in Hive.

Drop Database
DROP DATABASE lpu;
Deletes the lpu database. This fails if the database still contains tables, unless you add CASCADE.

Drop Database with All Tables


DROP DATABASE lpu CASCADE;
Deletes the lpu database along with all its tables.

Create Table as Select (CTAS)


CREATE TABLE high_scorers AS
SELECT * FROM student_data WHERE marks > 85;
Creates a new table (high_scorers) with data from a SELECT query.

Count Rows in Table


SELECT COUNT(*) FROM student_data;
Returns the total number of rows in the student_data table.

What is Hive?

➔ Apache Hive is a data warehouse system built on top of Hadoop.


It is used for querying and managing big data stored in the Hadoop Distributed File System (HDFS) using a SQL-like language called HiveQL.

In simple words:

Hive = SQL for Hadoop.

What is HBase?

➔ Apache HBase is a NoSQL distributed database that runs on top of Hadoop HDFS.
It is designed to provide random real-time read/write access to big data.

In simple words:

HBase = NoSQL Database for Hadoop.

Partitioning in Hive:
show tables;

(If the two-column students table created in step 3 still exists, drop it first with DROP TABLE students; otherwise the CREATE TABLE below fails because the table already exists.)

create table students(id INT, name STRING, branch STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath 'C:/Users/ASUS/Desktop/HADOOPFILES/hivepartitioning.txt' into table students;

create table part_stu_branch(id INT, name STRING)
partitioned by (branch STRING);

set hive.exec.dynamic.partition.mode = nonstrict;

By default, Hive runs in strict mode for safety: it guards against accidentally creating too many partitions, which can slow down the system. Here we want full dynamic partitioning, where Hive reads the branch values and creates the partitions automatically, so the mode is switched to nonstrict.

insert overwrite table part_stu_branch partition(branch)
select id, name, branch from students;

Verifying Partitioning in HDFS
(Open another Command Prompt.)

If the Hadoop services are not already running, navigate to Hadoop's sbin directory and start them:
cd C:\hadoopsetup\hadoop-3.2.4\sbin
start-all.cmd
Starts HDFS (namenode, datanode) and YARN (resourcemanager, nodemanager).

List the folders inside part_stu_branch:
hdfs dfs -ls /user/hive/warehouse/part_stu_branch
(This path assumes the table was created in the default database; a table created in the lpu database lives under /user/hive/warehouse/lpu.db/ instead.)

You will see folders like:

/branch=CSE
/branch=ECE
/branch=MECH

These are partition folders.

List the files inside the partition folder for the CSE branch:
hdfs dfs -ls "/user/hive/warehouse/part_stu_branch/branch=CSE"

Display the data for students in the CSE branch:
hdfs dfs -cat "/user/hive/warehouse/part_stu_branch/branch=CSE/000000_0"

Static Partitioning:
In static partitioning, you must manually specify the partition column value during INSERT.
insert into table part_stu_branch partition(branch='CSE')
select id, name from students where branch='CSE';

You tell Hive exactly:
➔ "Put this data into the branch = CSE partition."
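
Static partitioning needs one INSERT per partition value. Hive's multi-insert form lets several such inserts share a single scan of the source table. A sketch, assuming the students table contains CSE and ECE rows as in the examples here:

```sql
-- One pass over students, two statically named target partitions.
FROM students s
INSERT INTO TABLE part_stu_branch PARTITION (branch = 'CSE')
  SELECT s.id, s.name WHERE s.branch = 'CSE'
INSERT INTO TABLE part_stu_branch PARTITION (branch = 'ECE')
  SELECT s.id, s.name WHERE s.branch = 'ECE';
```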

Dynamic Partitioning:
In dynamic partitioning, you don't specify partition values manually.
Hive automatically reads partition column values from your SELECT statement.

insert overwrite table part_stu_branch partition(branch)
select id, name, branch from students;

Here, Hive looks at the branch column (it must be the last column in the SELECT list) and creates partitions automatically, such as:
• branch = CSE
• branch = ECE
• branch = MECH, etc.
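
To see what the dynamic insert produced without leaving the Hive prompt, the partitions can be listed and counted. A brief sketch; the branch values returned depend on the data in students:

```sql
-- List the partitions Hive created, one line per branch value.
SHOW PARTITIONS part_stu_branch;

-- The partition column can be queried like any other column.
SELECT branch, COUNT(*) AS students_in_branch
FROM part_stu_branch
GROUP BY branch;
```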

Mode               Meaning
strict (default)   Dynamic partitioning is restricted. You must specify at least one static partition value.
nonstrict          Dynamic partitioning is fully allowed. No need to specify any static partition values. Hive will create partitions dynamically for all data.
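
What strict mode still permits is a mix of static and dynamic partition columns. The sketch below uses a hypothetical two-level table (part_stu_year_branch, with an assumed batch_year column and the value 2024) purely to illustrate the rule; none of these names come from the steps above.

```sql
-- Hypothetical table with two partition columns.
CREATE TABLE part_stu_year_branch (id INT, name STRING)
PARTITIONED BY (batch_year INT, branch STRING);

SET hive.exec.dynamic.partition.mode = strict;

-- batch_year is static (it is given a literal value), branch is dynamic and
-- is taken from the last column of the SELECT.
-- Static partition columns must be listed before dynamic ones.
INSERT OVERWRITE TABLE part_stu_year_branch PARTITION (batch_year = 2024, branch)
SELECT id, name, branch FROM students;
```

Leaving out the static value for batch_year in this statement is what strict mode rejects.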

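Hive Bucketing:
SET hive.enforce.bucketing=true;
• Makes Hive respect the table's bucketing definition during insert operations.
• This setting matters on Hive 1.x and earlier, where Hive could ignore the buckets unless it was enabled. From Hive 2.0 onward (including the 3.1.3 release used here) bucketing is always enforced and the property has been removed, so this line can be skipped.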

create table st_bucket(id INT, name STRING, branch STRING)
clustered by (id) into 3 buckets
row format delimited
fields terminated by ',';

• Creates a table st_bucket.
• Data is bucketed into 3 files based on id (clustered by id).
• Bucketing can speed up joins and sampling on the bucketing column, because rows with the same id always land in the same file.

insert overwrite table st_bucket select * from students;

• Inserts all the data from students into st_bucket and divides it into 3 buckets (files).

Verifying Bucketing in HDFS
(Open another new CMD terminal.)
hdfs dfs -ls "/user/hive/warehouse/st_bucket"
• Lists all the bucket files created inside the st_bucket directory.
• You will find 3 files (buckets), named something like:
  o 000000_0
  o 000001_0
  o 000002_0
hdfs dfs -cat "/user/hive/warehouse/st_bucket/000000_0"
• Displays the content of a bucket file.
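
The buckets can also be inspected from inside Hive with TABLESAMPLE, which reads the rows belonging to one bucket. A sketch, assuming st_bucket was populated as above; which ids end up in which bucket depends on the hash function of the Hive version in use.

```sql
-- Rows whose id hashes to the first of the 3 buckets (roughly one third of the table).
SELECT * FROM st_bucket TABLESAMPLE (BUCKET 1 OUT OF 3 ON id);

-- Counting each bucket in turn gives a rough picture of how evenly the ids were spread.
SELECT COUNT(*) FROM st_bucket TABLESAMPLE (BUCKET 2 OUT OF 3 ON id);
```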

What is Partitioning?
Partitioning means dividing the data into separate folders based on the value of a specific column.
Purpose of Partitioning:
• Faster Queries:
When you query only CSE students, Hive goes directly to /branch=CSE/ instead of scanning the full table.
• Less I/O:
Hive reads only the necessary partitions, not the entire data set.
• Better Management:
It is easier to maintain and delete specific partitions.
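
The three benefits above map onto ordinary statements. A sketch against the part_stu_branch table used earlier, assuming a MECH partition exists:

```sql
-- Filtering on the partition column lets Hive read only the branch=CSE folder.
SELECT id, name FROM part_stu_branch WHERE branch = 'CSE';

-- EXPLAIN prints the query plan, which can be used to check what will be scanned.
EXPLAIN SELECT id, name FROM part_stu_branch WHERE branch = 'CSE';

-- Remove a single partition without touching the others.
ALTER TABLE part_stu_branch DROP IF EXISTS PARTITION (branch = 'MECH');
```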

What is Bucketing?
Bucketing means dividing data into a fixed number of files based on the hash of a column.
• Instead of organizing data into folders, it is organized into N buckets (files).
• You specify how many buckets you want.
• Rows are assigned to a bucket based on the hash value of a column (e.g., id).
Purpose of Bucketing:
• Even Distribution:
Hashing spreads rows across the buckets fairly evenly, provided the bucketing column has many distinct values.
• Efficient Joins:
If two tables are bucketed on the same column, joins on that column can be much faster, because matching rows sit in matching buckets.
• Parallel Processing:
MapReduce can process multiple buckets in parallel.

Partitioning vs Bucketing

Feature      Partitioning                                 Bucketing
Divides by   Column value                                 Hash of column value
Storage      Folders                                      Files inside a folder
Number       Depends on unique column values (dynamic)    Fixed number (you decide)
Best for     Filtering data (WHERE branch='CSE')          Efficient joins, sampling
Example      /branch=CSE/ folder                          000000_0, 000001_0 bucket files
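
The two techniques are not mutually exclusive: a table can be partitioned by one column and bucketed by another. The sketch below uses a hypothetical table name (st_part_bucket) and reuses the students table as the source.

```sql
-- Hypothetical table: one folder per branch, 3 bucket files inside each folder.
CREATE TABLE st_part_bucket (id INT, name STRING)
PARTITIONED BY (branch STRING)
CLUSTERED BY (id) INTO 3 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Needed because branch is filled in dynamically from the SELECT.
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE st_part_bucket PARTITION (branch)
SELECT id, name, branch FROM students;
```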

Hive data types:

Array:
CREATE TABLE temperature(
sno INT,
place STRING,
temp ARRAY<DOUBLE>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

LOAD DATA LOCAL INPATH 'D:/temperature.txt' INTO TABLE temperature;

SELECT temp[0] FROM temperature;
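
Array elements are addressed with a zero-based index, so temp[0] is the first reading. Beyond indexing, an array can be measured and flattened. A sketch; the sample line in the comment is hypothetical, since the contents of temperature.txt are not shown here.

```sql
-- Assumed layout of one line of temperature.txt
-- (hypothetical values, <TAB> stands for a tab character):
--   1<TAB>Delhi<TAB>30.5,32.1,29.8

-- Number of readings stored in each row's array.
SELECT place, size(temp) AS readings FROM temperature;

-- One output row per array element.
SELECT place, reading
FROM temperature
LATERAL VIEW explode(temp) t AS reading;
```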

Map:
CREATE TABLE country(
city STRING,
temp MAP<INT, INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

LOAD DATA LOCAL INPATH 'D:/mapset.txt' INTO TABLE country;

SELECT * FROM country;

SELECT temp[2018] FROM country;

SELECT temp[2018] FROM country WHERE city='jalandhar';
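
A map is looked up by key rather than by position, and a key that is absent yields NULL. The sketch below lists keys and values and filters out the missing lookups; the sample line in the comment is hypothetical.

```sql
-- Assumed layout of one line of mapset.txt
-- (hypothetical values, <TAB> stands for a tab character):
--   jalandhar<TAB>2018:35,2019:37

-- All keys, all values, and the number of entries in each row's map.
SELECT city,
       map_keys(temp)   AS years,
       map_values(temp) AS readings,
       size(temp)       AS entries
FROM country;

-- Keep only the cities that actually have an entry for 2018.
SELECT city, temp[2018] AS temp_2018
FROM country
WHERE temp[2018] IS NOT NULL;
```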

Struct:
CREATE TABLE result(
name STRING,
city STRING,
marks STRUCT<subject:STRING, grade:FLOAT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

LOAD DATA LOCAL INPATH 'D:/result.txt' INTO TABLE result;

Query struct elements:

SELECT * FROM result;
• Shows the entire table.

SELECT marks.grade FROM result;
• Fetches only grade from the struct.

SELECT marks.subject FROM result;
• Fetches only subject from the struct.
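
Struct fields can be used anywhere an ordinary column can, including WHERE and ORDER BY. A sketch; the sample line and the 8.0 threshold are hypothetical.

```sql
-- Assumed layout of one line of result.txt
-- (hypothetical values, <TAB> stands for a tab character):
--   Anya<TAB>Jalandhar<TAB>Maths,8.5

-- Filter and sort on a field inside the struct.
SELECT name, marks.subject AS subject, marks.grade AS grade
FROM result
WHERE marks.grade >= 8.0
ORDER BY grade DESC;
```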

Create the transactions table:

CREATE TABLE transactions (
transaction_id INT,
account_number STRING,
amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Sample Input (CSV format):

1,ACC001,100.50
2,ACC002,250.00
3,ACC001,150.75
4,ACC003,300.25

To get the total sum of all transactions:

SELECT SUM(amount) AS total_amount_spent FROM transactions;

To get the total amount spent per account:

SELECT account_number, SUM(amount) AS total_amount
FROM transactions
GROUP BY account_number;
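
As a quick check of the two aggregate queries in this section, the sketch below inserts the four sample records directly instead of loading a file. It assumes the transactions table is empty beforehand; the totals in the comments are worked out by hand from the sample rows.

```sql
-- The same four records as the sample input.
INSERT INTO TABLE transactions VALUES
  (1, 'ACC001', 100.50),
  (2, 'ACC002', 250.00),
  (3, 'ACC001', 150.75),
  (4, 'ACC003', 300.25);

-- Overall total: 100.50 + 250.00 + 150.75 + 300.25 = 801.50
SELECT SUM(amount) AS total_amount_spent FROM transactions;

-- Per-account totals: ACC001 = 251.25, ACC002 = 250.00, ACC003 = 300.25
SELECT account_number, SUM(amount) AS total_amount
FROM transactions
GROUP BY account_number
ORDER BY account_number;
```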
