Lab6E - Creating Hive Partition Table

The document provides a comprehensive guide on creating partitioned tables in Hive, covering both static and dynamic partitioning methods. It includes multiple scenarios with step-by-step instructions for creating partitioned tables based on one or two columns, as well as exercises for practical application. Additionally, it outlines the benefits of partitioning and the command structure for creating and managing partitioned tables.

Uploaded by

2024740897
L6E-Creating Hive Partition Table

Outlines
• Partitioning concept
• Types of partitioning
• Scenario 1 - creating a partitioned table (static partitioning) based on one column and loading data from HDFS
• Scenario 2 - creating a partitioned table (static partitioning) based on one column where the data comes from an existing table
• Scenario 3 - creating a partitioned table (dynamic partitioning) based on one column where the data comes from an existing table
• Scenario 4 - creating a partitioned table (static partitioning) based on two columns where the data comes from an existing table
• Scenario 5 - creating a partitioned table (dynamic partitioning) based on two columns where the data comes from an existing table
• Exercise 1
• Exercise 2

Partitioning concept

https://data-flair.training/blogs/apache-hive-partitions/

purpose:
• to divide tables into different parts (smaller tables) based on partition keys
• the partition keys can refer to any particular columns such as gender, date, city, and department

benefit:
• partitioning is an optimization technique in Hive that can improve query performance significantly
• performance improves because a query that filters on a partition key reads only the matching partitions instead of scanning the entire table, which matters most with large datasets

the command structure:


CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
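As a hedged illustration of this command structure, the sketch below defines a hypothetical table sales_part partitioned by country, followed by a query that filters on the partition key. The table and column names are assumptions made for this example and are not part of the lab. Note that the partition column appears only in the PARTITIONED BY clause, not in the regular column list.

```sql
-- Hypothetical table for illustration only; names are not part of this lab
create table if not exists sales_part (
  sale_id int,
  amount double)
partitioned by (country string)
row format delimited
fields terminated by ',';

-- Filtering on the partition key lets Hive read only the country=MY
-- partition directory instead of scanning the whole table
select sale_id, amount from sales_part where country = 'MY';
```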

Types of partitioning

https://data-flair.training/blogs/apache-hive-partitions/

Static Partitioning
• Input data files are inserted individually into a partitioned table
• Static partitions are usually preferred when loading files (big files) into Hive tables
• Loading data takes less time than with dynamic partitioning
• We can alter the partition in the static partition.

Dynamic Partitioning
• A single insert populates the partitioned table
• Usually, dynamic partitioning loads the data from a non-partitioned table
• Dynamic partitioning takes more time to load data than static partitioning
• Dynamic partitioning is suitable when a large amount of data is already stored in a table
• It is also suitable when you want to partition on some columns but do not know in advance how many partitions the data will produce
• We can’t perform alter on the Dynamic partition.
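The "alter" mentioned above presumably refers to the ALTER TABLE ... PARTITION statements. A minimal sketch of adding, renaming, and dropping a named partition follows; the table sales_part and the values 'MY' and 'SG' are hypothetical and used only for illustration.

```sql
-- Hypothetical table for illustration only; names are not part of this lab
create table if not exists sales_part (
  sale_id int,
  amount double)
partitioned by (country string);

-- add an empty partition explicitly
alter table sales_part add if not exists partition (country='MY');

-- rename the partition, which changes its key value
alter table sales_part partition (country='MY') rename to partition (country='SG');

-- list the partitions, then drop the one created above
show partitions sales_part;
alter table sales_part drop if exists partition (country='SG');
```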

Scenario 1 concepts
• To create a partitioned internal table (static partitioning) where the partition key is based on one column
• Then, to use the load command to load data from HDFS into the partitioned table

Steps
• transfer the dataset into HDFS
• construct and execute an HQL command to create an empty partitioned table
• load the dataset into the partitioned table using the respective partition key (i.e. rating)
• check the outcome

Dataset: ratings.csv

This dataset contains four columns:


• user id
• movie id
• rating (will be used as the partition key)
• unixtime

Transfer the dataset into HDFS

Recall this tutorial - Transferring file into HDFS

Note:
• Make sure you have successfully transferred this file into HDFS before proceeding to the next task.
• Make sure the file exists in the directory.

Create a table

Note: remember to select your database (e.g. use student_saXX) before creating any table.

Run this command in Hive:

create table movie_rating_part (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;

Load the data

Note:
• We need to specify the key for static partitioning
• Assume we are interested in rating = 5 and rating = 4

1) Load data where rating = 5

load data inpath '/user/student30/movie_rating/ratings.csv' overwrite into table movie_rating_part partition(rating=5);

2) Load data where rating = 4
• you need to update the command accordingly

Note: Make sure ratings.csv exists in the directory before executing command 2).
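A minimal sketch of step 2), assuming ratings.csv has been transferred again to the same HDFS path used in step 1). The load data inpath command moves the source file into the partition directory rather than copying it, which is why the file has to be present again before the second load. A load into a static partition also does not filter rows: every row in the file is placed under the named partition, whatever rating value the row itself carries. In addition, fields in a delimited text file are matched to the table columns by position, so with four fields in the file and three non-partition columns in the table, it is worth inspecting a few rows to see what each column actually contains.

```sql
-- Assumes ratings.csv was re-uploaded to this path after step 1),
-- because the first load moved the file out of this directory
load data inpath '/user/student30/movie_rating/ratings.csv'
overwrite into table movie_rating_part
partition(rating=4);

-- Inspect a few rows to see what actually landed in the partition
select * from movie_rating_part where rating=4 limit 5;
```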

Check the outcome

To check the created partition, run this command:
• show partitions movie_rating_part;

To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_part;

You should be able to see this location info:

To check the data for a specific partition, run this command:


• select userid, rating from movie_rating_part where rating=5 limit 5;

Scenario 2 concepts
• To create a partitioned table (static partitioning) from an existing table where the partition key is based on one column

Steps
• make sure your existing table exists and contains data
• construct and execute an HQL command to create an empty partitioned table
• insert data from the existing table into the partitioned table by using the partition key (i.e. rating)
• check the outcome

Existing table

Note:
• in this exercise, the existing table is not a partitioned table
• you will need to recall L6A (scenario 1), if you have not created this table yet

Create a table

Note: remember to select your database (e.g. use student_saXX) before creating any table.

Run this command in Hive:

create table movie_rating_part2 (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;

Insert data from an existing table

Run this command:

insert into table movie_rating_part2 partition(rating=5)
select userid, movieid, unixtime from movie_rating
where rating=5;

Check the outcome

To check the created partition, run this command:
• show partitions movie_rating_part2;

To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_part2;

You should be able to see this location info:

To check the data for a specific partition, run this command:


• select userid, rating from movie_rating_part2 where rating=5 limit 5;

Insert another partitioned data

Run this command:

insert into table movie_rating_part2 partition(rating=4)
select userid, movieid, unixtime from movie_rating
where rating=4;

Then, check the output.
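One possible way to check the output is to compare row counts per rating between the source table and the partitioned table. This is a sketch that assumes both inserts above have been run exactly once; because insert into appends, running an insert twice would double the count for that partition.

```sql
-- rows in the source table for each rating of interest
select rating, count(*) as total
from movie_rating
where rating in (4, 5)
group by rating;

-- rows in the partitioned table; the totals should match per rating
select rating, count(*) as total
from movie_rating_part2
group by rating;
```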

Scenario 3
• To create a partitioned table (dynamic partitioning) from an existing table where the partition key is based on one column

Note:
• With dynamic partitioning, data cannot be loaded directly from HDFS into the partitioned table.
• Instead, load the data into a table first (staging table), then use that table to insert data into a new table using dynamic partitioning.

Steps
• make sure your existing table is already created and contains data
• construct and execute an HQL command to create an empty partitioned table
• set for dynamic partitioning
• insert data from the existing table into the partitioned table by using the partition key (i.e. rating)
• check the outcome

Existing table

Note:
• in this exercise, the existing table is not a partitioned table
• you will need to recall L6A (scenario 1), if you have not created this table yet

Create a table

Note: remember to select your database (e.g. use student_saXX) before creating any table.

Run this command in Hive:

create table movie_rating_dynpart (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;

Set for dynamic partitioning

Run these commands in Hive:
• set hive.exec.dynamic.partition=true;
• set hive.exec.dynamic.partition.mode=nonstrict;

Note:
• The first setting is to enable dynamic partitioning
• The second setting is to allow all partitions to be dynamic; otherwise, at least one partition has to be statically defined
• Without this setting, you may get the following error:

Insert data from an existing table

Run this command:

insert into table movie_rating_dynpart partition(rating)
select userid, movieid, unixtime, rating from movie_rating;

Check the outcome

To check the created partition, run this command:
• show partitions movie_rating_dynpart;

To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_dynpart;

You should be able to see this location info:

To check the total records, run this command:


• select count (*) as total from movie_rating_dynpart where rating=1;
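To see all partitions at once rather than a single rating, a grouped count can be used, as sketched below. In a dynamic-partition insert, Hive takes the partition value from the last column of the select list by position, not by name, so the totals per rating here should line up with the totals per rating in the source table movie_rating.

```sql
-- one row per partition created by the dynamic insert
select rating, count(*) as total
from movie_rating_dynpart
group by rating
order by rating;

-- overall total in the source table, for comparison with the sum above
select count(*) as total from movie_rating;
```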

Scenario 4
• To create a partitioned table (static partitioning) from an existing table where the partition key is based on two columns

Steps
• prepare the data source
• construct and execute an HQL command to create an empty partitioned table
• insert data from the existing table into the partitioned table by using the partition keys
• check the output

Prepare the data source (recall sqoop tutorial)

The dataset refers to the orders table, given as follows:

• This table is available in the retail_db database, in MariaDB
• You will need to sqoop this table from MariaDB into Hive Metastore (if you have not done it yet)
• Recall this tutorial to guide you - L5C
• The partition keys to be used for this exercise are:
o order date
o order status

Create the partitioned table

Note: remember to select your database (e.g. use student30) before creating any table.

Run this command:

create table orders_part (
order_id int,
order_customer_id int)
partitioned by (order_date string, order_status string)
row format delimited
fields terminated by ',' ;

Insert data from an existing table

Note:
• Assume we are interested in storing data into the partition where order_date='2014-07-24' and order_status='COMPLETE'

Run this command:

insert overwrite table orders_part partition (order_date='2014-07-24', order_status='COMPLETE')
select order_id, order_customer_id from orders where
order_date = '2014-07-24 00:00:00.0' and order_status='COMPLETE';

Check the outcome

To check the created partition, run this command:
• show partitions orders_part;

To check the actual directory where the data is stored in Hive, run this command:
• show create table orders_part;

You should be able to see this location info:

To check the data, run this command:


• select count(*) as total from orders_part;

Exploration / exercise

There are other categories for order status:

• Insert another partition where order_date='2014-07-24' and order_status='PENDING'
• Tip: you need to use insert into, and don't forget to change the order_status in your command
• You should get the following partitions created:

• The total count of records should be:

• The total of newly added records is:
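One possible shape for the exercise command, following the tip above, is sketched here. It assumes the orders table stores the date in the same '2014-07-24 00:00:00.0' form used in the COMPLETE example; the status value has to be changed in both the partition clause and the where clause.

```sql
-- static partition values are given as literals in the partition clause;
-- the where clause selects the matching rows from the source table
insert into table orders_part partition (order_date='2014-07-24', order_status='PENDING')
select order_id, order_customer_id from orders
where order_date = '2014-07-24 00:00:00.0' and order_status='PENDING';

-- the new partition should now be listed next to the COMPLETE one
show partitions orders_part;
```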

Scenario 5
• To create a partitioned table (dynamic partitioning) from an existing table where the partition key is based on two columns

Steps
• prepare the data source
• construct and execute an HQL command to create an empty partitioned table
• set for dynamic partitioning
• insert data from the existing table into the partitioned table by using the partition keys
• check the output

Prepare the data source (recall sqoop tutorial)

The dataset refers to the customers table, given as follows:

• This table is available in the retail_db database, in MariaDB
• You will need to sqoop this table from MariaDB into Hive Metastore
• Recall this tutorial - L5C
• The partition keys are:
o customer state
o customer city

Create the partitioned table

Note: remember to select your database (e.g. use student_saXX) before creating any table.

Run this command in Hive:

create table customers_dynpart (
cust_id int,
cust_fname string,
cust_lname string,
cust_email string,
cust_zipcode string)
partitioned by (cust_state string, cust_city string)
row format delimited
fields terminated by ',' ;

Set for dynamic partitioning

Run these commands in Hive:
• set hive.exec.dynamic.partition=true;
• set hive.exec.dynamic.partition.mode=nonstrict;
• set hive.exec.max.dynamic.partitions.pernode = 600;

Note:
• The first setting is to enable dynamic partitioning
• The second setting is to allow all partitions to be dynamic; otherwise, at least one partition has to be statically defined
• The third setting is to increase the maximum number of partitions (the expected number of partitions to be created is close to 600)
Insert data from an existing table

Run this command:

insert overwrite table customers_dynpart partition (cust_state, cust_city)
select customer_id, customer_fname, customer_lname, customer_email, customer_zipcode, customer_state as cust_state, customer_city as cust_city from customers;

This process will take some time. You can monitor the progress via:
• YARN application monitor - http://10.5.19.231:8088/cluster/apps
• YARN job monitor - http://10.5.19.231:19888/jobhistory/app

• You can also find out the number of mappers executed:

Check the outcome

To check the created partition, run this command:
• show partitions customers_dynpart;

• Also, notice the number of partitions created:


To check the actual directory where the data is stored in Hive, run this command:
• show create table customers_dynpart;

You should be able to see this location info:

To check the total number, run this command:
• select count(*) from customers_dynpart;
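As a further check, the row count and the expected number of partitions can be compared against the source table. This is a sketch that assumes the customers table imported with sqoop is in the currently selected database.

```sql
-- row totals should match between source and partitioned table
select count(*) as source_rows from customers;
select count(*) as partitioned_rows from customers_dynpart;

-- number of distinct (state, city) pairs in the source, to compare with
-- the number of lines returned by show partitions customers_dynpart
select count(*) as expected_partitions
from (select distinct customer_state, customer_city from customers) t;
```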

Exercise 1
• Load this dataset into HDFS
• Create an external table to hold this dataset
• Create a partitioned table (dynamic partitioning) where the data comes from the previously created external table

Dataset: student_record.csv

Exercise 2
• Load this dataset into HDFS
• Create an external table to hold this dataset, which contains Id, Url, Date, PubId, AdvertiserId
• Notice that AdvertiserId can be split into subfields
• Thus, create a new external table to store the processed dataset, which contains Id, Date, PubId, AdvertiserId, Keyword, Country
• Create a partitioned table (dynamic partitioning) based on country, where the data comes from the previously created external table
• The partitioned table should contain Id, Date, PubId, AdvertiserId, Keyword

Dataset: advertisement.txt

Sample output

Sample of created partitions:

Accessing HUE
• to access HUE, go to https://bigdatalab-rm-en1.uitm.edu.my:8889/hue/accounts/login?next=/
• then log in using the given account

Accessing Hive
• to access Hive, execute the following command:
o beeline -u jdbc:hive2://bigdatalab-cdh-mn1.uitm.edu.my:10000 -n yourusername -p yourpassword
• then type in:
o use yourdatabasename;
• then, you can browse the available tables by typing in:
o show tables;

Accessing MariaDB
Type in the following:
• mysql -ustudent -pp@ssw0rd retail_db
