L6E-Creating Hive Partition Table
Outline
• Partitioning concept
• Types of partitioning
• Scenario 1 - creating a partitioned table (static partitioning) based on one column and loading data from HDFS
• Scenario 2 - creating a partitioned table (static partitioning) based on one column where the data comes from an existing table
• Scenario 3 - creating a partitioned table (dynamic partitioning) based on one column where the data comes from an existing table
• Scenario 4 - creating a partitioned table (static partitioning) based on two columns where the data comes from an existing table
• Scenario 5 - creating a partitioned table (dynamic partitioning) based on two columns where the data comes from an existing table
• Exercise 1
• Exercise 2
Partitioning concept
Source: https://data-flair.training/blogs/apache-hive-partitions/
Purpose:
• to divide a table into smaller parts based on partition keys
• a partition key can be any suitable column, such as gender, date, city, or department
Benefit:
• partitioning is an optimization technique in Hive that can significantly improve query performance
• a query that filters on a partition key reads only the matching partitions, so it avoids scanning the entire table when dealing with a large data set
Command structure:
CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type, …);
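As a minimal sketch of this structure, the following uses a hypothetical table named sales_part (the table and column names are illustrative assumptions, not part of this tutorial). It is meant to show that the partition column is declared only in PARTITIONED BY, and that a filter on the partition key restricts which partition is read.

```sql
-- Hypothetical table: item and amount are regular columns.
-- sale_date is the partition key and is NOT repeated in the column list.
create table sales_part (
  item string,
  amount double)
partitioned by (sale_date string)
row format delimited
fields terminated by ',';

-- Filtering on the partition key lets Hive read only the matching
-- partition directory instead of scanning the whole table.
select item, amount
from sales_part
where sale_date = '2014-07-24';
```

Although sale_date is not in the column list, it can be selected and filtered like any other column.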
Types of partitioning
Source: https://data-flair.training/blogs/apache-hive-partitions/
Static Partitioning
• Input data files are inserted individually into a partitioned table, one partition at a time
• Static partitioning is usually preferred when loading large files into Hive tables
• Loading data is faster than with dynamic partitioning
• We can alter the partition in the static partition.
Dynamic Partitioning
• A single insert statement populates the partitioned table
• Dynamic partitioning usually loads data from a non-partitioned table
• Loading data takes more time than with static partitioning
• Dynamic partitioning is suitable when a large amount of data is already stored in a table
• It is also suitable when you do not know in advance how many partitions (distinct key values) the data will produce
• We can’t perform alter on the Dynamic partition.
Scenario 1
Concepts
• To create a partitioned internal table (static partitioning) where the partition key is based on one column
• Then, to use the load data command to load data from HDFS into the partitioned table
Steps
• transfer the dataset into HDFS
• construct and execute an HQL command to create an empty partitioned table
• load the dataset into the partitioned table using the respective partition key (i.e. rating)
• check the outcome
Dataset: ratings.csv
This dataset contains four columns:
• user id
• movie id
• rating (will be used as the partition key)
• unixtime
Transfer the dataset into HDFS
Recall this tutorial - Transferring file into HDFS
Note:
• Make sure you have successfully transferred this file into HDFS before proceeding to the next task.
• Make sure the file exists in the directory.
Create a table
Note: remember to select your database, e.g. (use student_saXX), before creating any table.
Run this command in Hive:
create table movie_rating_part (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;
Load the data
Note:
• We need to specify the partition key value for static partitioning
• Assume we are interested in rating = 5 and rating = 4
1) Load data where rating = 5
load data inpath '/user/student30/movie_rating/ratings.csv' overwrite into table movie_rating_part
partition(rating=5);
2) Load data where rating = 4
• you need to update the command accordingly
Note: Make sure ratings.csv exists in the directory before executing command 2). load data inpath moves the file out of its source directory, so the file has to be transferred into HDFS again first.
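A possible form of command 2) is sketched below. It assumes that ratings.csv has been transferred again to the same HDFS path used in command 1); adjust the path if your file is stored elsewhere.

```sql
-- Assumption: ratings.csv has been copied to this HDFS path again,
-- because the earlier load moved the file into the table's directory.
load data inpath '/user/student30/movie_rating/ratings.csv'
overwrite into table movie_rating_part
partition (rating=4);

-- Both partitions (rating=4 and rating=5) should now be listed.
show partitions movie_rating_part;
```

Keep in mind that load data does not filter or transform rows: the whole file is placed under the named partition, whatever rating values its lines contain. If the file stores its four columns in the order listed earlier, the third field of each line would also be read as unixtime. Inserting from an existing table with a where clause, as in Scenario 2, selects only the matching rows.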
Check the outcome
To check the created partition, run this command:
• show partitions movie_rating_part;
To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_part;
You should be able to see this location info:
To check the data for a specific partition, run this command:
• select userid, rating from movie_rating_part where rating=5 limit 5;
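To see where one particular partition is stored, rather than the table as a whole, a sketch along these lines can be used. It assumes the rating=5 partition has already been loaded.

```sql
-- Prints the metadata of a single partition, including its Location field.
describe formatted movie_rating_part partition (rating=5);
```

Each partition is kept in its own sub-directory under the table directory, named after the key and value (here, a directory ending in rating=5).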
Scenario 2
Concepts
• To create a partitioned table (static partitioning) from an existing table where the partition key is based on one column
Steps
• make sure your existing table exists and contains data
• construct and execute an HQL command to create an empty partitioned table
• insert data from the existing table into the partitioned table by using the partition key (i.e. rating)
• check the outcome
Existing table
Note:
• in this exercise, the existing table is not a partitioned table
• you will need to recall L6A (scenario 1), if you have not created this table yet
Create a table
Note: remember to select your database, e.g. (use student_saXX), before creating any table.
Run this command in Hive:
create table movie_rating_part2 (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;
Insert data from an existing table
Run this command:
insert into table movie_rating_part2 partition(rating=5)
select userid, movieid, unixtime from movie_rating
where rating=5;
Check the outcome
To check the created partition, run this command:
• show partitions movie_rating_part2;
To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_part2;
You should be able to see this location info:
To check the data for a specific partition, run this command:
• select userid, rating from movie_rating_part2 where rating=5 limit 5;
Insert data into another partition
Run this command:
insert into table movie_rating_part2 partition(rating=4)
select userid, movieid, unixtime from movie_rating
where rating=4;
Then, check the output.
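One point worth noting when re-running these commands: insert into appends to a partition, so executing the same statement twice would duplicate its rows. The sketch below uses insert overwrite instead, which replaces the current contents of the partition; it reuses the table names from this scenario.

```sql
-- insert overwrite replaces whatever the rating=4 partition currently holds,
-- so the statement can be repeated without duplicating rows.
insert overwrite table movie_rating_part2 partition(rating=4)
select userid, movieid, unixtime from movie_rating
where rating=4;

-- The count should stay the same no matter how often the overwrite is run.
select count(*) as total from movie_rating_part2 where rating=4;
```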
Scenario 3
• To create a partitioned table (dynamic partitioning) from an existing table where the partition key is based on one column
Note:
• With dynamic partitioning, we cannot load data directly from HDFS into the partitioned table.
• The data must first be loaded into a table (staging table), which is then used to insert data into a new table using dynamic partitioning
Steps
• make sure your existing table is already created and contains data
• construct and execute an HQL command to create an empty partitioned table
• set the properties for dynamic partitioning
• insert data from the existing table into the partitioned table by using the partition key (i.e. rating)
• check the outcome
Existing table
Note:
• in this exercise, the existing table is not a partitioned table
• you will need to recall L6A (scenario 1), if you have not created this table yet
Create a table
Note: remember to select your database, e.g. (use student_saXX), before creating any table.
Run this command in Hive:
create table movie_rating_dynpart (
userid int,
movieid int,
unixtime string)
partitioned by (rating int)
row format delimited
fields terminated by ',' ;
Set for dynamic partitioning
Run these commands in Hive:
• set hive.exec.dynamic.partition=true;
• set hive.exec.dynamic.partition.mode=nonstrict;
Note:
• The first setting enables dynamic partitioning
• The second setting allows all partition columns to be dynamic; otherwise (strict mode), at least one partition column has to be statically defined
• without the second setting, you may get the following error:
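Insert data from an existing table
Run this command:
insert into table movie_rating_dynpart partition(rating)
select userid, movieid, unixtime, rating from movie_rating;
Note: the dynamic partition column (rating) must be the last column in the select list; Hive places each row in a partition according to its value.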
Check the outcome
To check the created partitions, run this command:
• show partitions movie_rating_dynpart;
To check the actual directory where the data is stored in Hive, run this command:
• show create table movie_rating_dynpart;
You should be able to see this location info:
To check the total records, run this command:
• select count (*) as total from movie_rating_dynpart where rating=1;
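To confirm that the dynamic insert distributed every row, the per-partition counts can be compared against the source table. This is only a sketch using the table names from this scenario; both queries should return the same totals for each rating.

```sql
-- Row count per partition in the dynamically partitioned table.
select rating, count(*) as total
from movie_rating_dynpart
group by rating;

-- Row count per rating in the non-partitioned source table.
select rating, count(*) as total
from movie_rating
group by rating;
```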
Scenario 4
• To create a partitioned table (static partitioning) from an existing table where the partition key is based on two columns
Steps
• prepare the data source
• construct and execute an HQL command to create an empty partitioned table
• insert data from the existing table into the partitioned table by using the partition keys
• check the output
Prepare the data source (recall sqoop tutorial)
The dataset refers to the orders table, given as follows:
• This table is available in the retail_db database, in MariaDB
• You will need to sqoop this table from MariaDB into the Hive Metastore (if you have not done it yet)
• Recall this tutorial to guide you - L5C
• The partition keys to be used for this exercise are:
o order date
o order status
Create the partitioned table
Note: remember to select your database, e.g. (use student30), before creating any table.
Run this command:
create table orders_part (
order_id int,
order_customer_id int)
partitioned by (order_date string, order_status string)
row format delimited
fields terminated by ',' ;
Insert data from an existing table
Note:
• Assume we want to store data in the partition where order_date='2014-07-24' and order_status='COMPLETE'
Run this command:
insert overwrite table orders_part partition (order_date='2014-07-24', order_status='COMPLETE')
select order_id, order_customer_id from orders where
order_date = '2014-07-24 00:00:00.0' and order_status='COMPLETE';
Check the outcome
To check the created partition, run this command:
• show partitions orders_part;
To check the actual directory where the data is stored in Hive, run this command:
• show create table orders_part;
You should be able to see this location info:
To check the data, run this command:
• select count(*) as total from orders_part;
Exploration / exercise
There are other categories for order status:
• Insert another partition where order_date='2014-07-24' and order_status='PENDING'
• tips: you need to use insert into, and don't forget to change the order_status in your command
• you should get the following partitions created:
• the total count of records should be:
• the total of newly added records is:
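To find out which order status categories exist for this date, and how many records each partition should receive, a query along the following lines can be run against the source table first. It is a sketch that assumes the orders table stores order_date in the same format used in the insert command above.

```sql
-- Lists every order status on the chosen date with its record count,
-- which gives the expected size of each (order_date, order_status) partition.
select order_status, count(*) as total
from orders
where order_date = '2014-07-24 00:00:00.0'
group by order_status;
```

The count shown for PENDING can then be compared with the number of records added by your insert.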
Scenario 5
• To create a partitioned table (dynamic partitioning) from an existing table where the partition key is based on two columns
Steps
• prepare the data source
• construct and execute an HQL command to create an empty partitioned table
• set the properties for dynamic partitioning
• insert data from the existing table into the partitioned table by using the partition keys
• check the output
Prepare the data source (recall sqoop tutorial)
The dataset refers to the customers table, given as follows:
• This table is available in the retail_db database, in MariaDB
• You will need to sqoop this table from MariaDB into the Hive Metastore
• Recall this tutorial - L5C
• The partition keys are:
o customer state
o customer city
Create the partitioned table
Note: remember to select your database, e.g. (use student_saXX), before creating any table.
Run this command in Hive:
create table customers_dynpart (
cust_id int,
cust_fname string,
cust_lname string,
cust_email string,
cust_zipcode string)
partitioned by (cust_state string, cust_city string)
row format delimited
fields terminated by ',' ;
Set for dynamic partitioning
Run these commands in Hive:
• set hive.exec.dynamic.partition=true;
• set hive.exec.dynamic.partition.mode=nonstrict;
• set hive.exec.max.dynamic.partitions.pernode=600;
Note:
• The first setting enables dynamic partitioning
• The second setting allows all partition columns to be dynamic; otherwise, at least one partition column has to be statically defined
• The third setting raises the maximum number of dynamic partitions that each mapper or reducer node may create (the expected number of partitions is close to 600)
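The expected number of partitions can be estimated before running the insert by counting the distinct state and city combinations in the source table. The sketch below assumes the customers table has already been imported into Hive with the column names used in this scenario.

```sql
-- Each distinct (customer_state, customer_city) pair becomes one partition.
select count(*) as expected_partitions
from (
  select distinct customer_state, customer_city
  from customers
) t;
```

If the result exceeds the per-node limit, raise hive.exec.max.dynamic.partitions.pernode accordingly. Hive also has an overall limit, hive.exec.max.dynamic.partitions, which may need raising for tables with many more partitions than this one.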
Insert data from an existing table
Run this command:
insert overwrite table customers_dynpart partition (cust_state, cust_city)
select customer_id, customer_fname, customer_lname, customer_email, customer_zipcode, customer_state as cust_state,
customer_city as cust_city from customers;
This process will take some time. You can monitor the progress via:
• YARN application monitor - http://10.5.19.231:8088/cluster/apps
• YARN job monitor - http://10.5.19.231:19888/jobhistory/app
• You can also find out the number of mappers executed:
Check the outcome
To check the created partitions, run this command:
• show partitions customers_dynpart;
• also, notice the number of partitions created:
To check the actual directory where the data is stored in Hive, run this command:
• show create table customers_dynpart;
You should be able to see this location info:
To check the total number of records, run this command:
• select count(*) from customers_dynpart;
Exercise 1
• Load this dataset into HDFS
• Create an external table to hold this dataset
• Create a partitioned table (dynamic partitioning) where the data comes from the previously created external table
The dataset: student_record.csv
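The overall shape of a solution might look like the skeleton below. The column names (student_id, student_name, program), the choice of program as the partition key, and the HDFS location are all hypothetical placeholders, since the actual layout of student_record.csv is not described here; replace them with the real columns and path.

```sql
-- Hypothetical columns and location: adapt to the actual student_record.csv.
create external table student_record_ext (
  student_id int,
  student_name string,
  program string)
row format delimited
fields terminated by ','
location '/user/yourusername/student_record';

-- The partition key (program, a placeholder) is left out of the column list.
create table student_record_dynpart (
  student_id int,
  student_name string)
partitioned by (program string)
row format delimited
fields terminated by ',';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column comes last in the select list.
insert into table student_record_dynpart partition (program)
select student_id, student_name, program from student_record_ext;
```

The location of an external table refers to the HDFS directory that contains the file, not to the file itself.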
Exercise 2
• Load this dataset into HDFS
• Create an external table to hold this dataset, which contains Id, Url, Date, PubId, AdvertiserId
• Notice that AdvertiserId can be split into sub-fields
• Thus, create a new external table to store the processed dataset, which contains Id, Date, PubId, AdvertiserId, Keyword, Country
• Create a partitioned table (dynamic partitioning) based on country where the data comes from the previously created external table
• The partitioned table should contain Id, Date, PubId, AdvertiserId, Keyword
Dataset: advertisement.txt
Sample output
Sample of created partitions:
Accessing HUE
• to access HUE, go to https://bigdatalab-rm-en1.uitm.edu.my:8889/hue/accounts/login?next=/
• then log in using the given account
Accessing Hive
• to access Hive, execute the following command:
o beeline -u jdbc:hive2://bigdatalab-cdh-mn1.uitm.edu.my:10000 -n yourusername -p yourpassword
• then type in:
o use yourdatabasename;
• then, you can browse the available tables by typing in:
o show tables;
Accessing MariaDB
Type in the following:
• mysql -ustudent -pp@ssw0rd retail_db