# SQOOP IMPORT EXERCISE
=======================
SESSION - 1
============
Sqoop Import - Databases to HDFS (the most frequently used)
Sqoop Export - HDFS to Databases
Sqoop Eval - to run queries on the database
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
SESSION - 2
============
INSERT INTO people values (101,'Raj','Pali','Itwara chowk','Yavatmal');
Sqoop import
=============
(transfers data from your relational db to HDFS)
It runs as a MapReduce job in which only mappers work; there is no reducer.
By default there are 4 mappers which do the work, and yes, we can change the
number of mappers.
These mappers divide the work based on the primary key.
What happens if there is no primary key? There are two options:
1. you change the number of mappers to 1.
2. you use a split-by column.
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "describe retail_db.orders"
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
--target-dir peopleresult
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \ {the people table doesn't contain a primary key,
so we set the number of mappers to 1}
-m 1 \ {if you don't set the mappers to 1, the import will give
an error}
--target-dir peopleresult
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/trendytech" \
--username root \
--password cloudera \
--table people \
-m 1 \
--warehouse-dir peopleresult1
Now the output path will be peopleresult1/people
Target dir vs. Warehouse dir
=============================
Suppose you are importing an employee table from MySQL.
With a target directory, the directory path mentioned is
the final path where the data is copied:
/data
With a warehouse directory, the system creates a
subdirectory named after the table:
/data/employee
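For comparison, a minimal sketch of the two options (assuming a hypothetical
employee table in a database named testdb) and where the files land:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/testdb" \
--username root \
--password cloudera \
--table employee \
--target-dir /data {files land directly under /data as part-m-* files}
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/testdb" \
--username root \
--password cloudera \
--table employee \
--warehouse-dir /data {files land under /data/employee/part-m-*}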
sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
SESSION - 3
============
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P {the password will not be shown on the console when you type it}
How to Redirect the logs for later use ?
----------------------------------------
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /queryresult4 1>query.out 2>query.err
1>query.out will mostly contain the output content (e.g. the result of an eval command),
and 2>query.err will contain all the other logs and errors. (You can use any file
names; the files are stored in the cwd from where the command is run.)
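For example, the same redirection with an eval command (a sketch; query.out and
query.err are just placeholder file names):
sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera \
--query "select count(*) from retail_db.orders" 1>query.out 2>query.err
cat query.out {the query result is here}
cat query.err {the connection messages, warnings and errors are here}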
Boundary query
===============
In a sqoop import, the work is divided among the mappers based on the
primary key.
Employee table
===============
empId, empname, age, salary (empId is the primary key)
0
1
2
3
4
5
6
.
.
100000
By default there will be 4 mappers.
How do the mappers distribute the work on the basis of the P.K.?
Sqoop finds the max and the min of the primary key, then:
split size = (max_of_pk - min_of_pk)/Num_Mappers
           = (100000 - 0)/4
           = 25000
mapper1 0 - 25000
mapper2 25001 - 50000
mapper3 50001 - 75000
mapper4 75001 - 100000
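Internally, Sqoop first issues a boundary-value query to find this min and max,
roughly like the sketch below for the Employee example (the exact SQL Sqoop logs
may differ slightly):
SELECT MIN(empId), MAX(empId) FROM Employee {the "BoundingValsQuery" printed in the import log}
Each mapper then pulls only its own slice, approximately:
SELECT empId, empname, age, salary FROM Employee WHERE empId >= 0 AND empId <= 25000 {mapper1}
SELECT empId, empname, age, salary FROM Employee WHERE empId > 25000 AND empId <= 50000 {mapper2}
... and similarly for mapper3 and mapper4.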
SESSION - 4
============
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--warehouse-dir /user/cloudera/compressresult
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--compression-codec BZip2Codec \
--warehouse-dir /user/cloudera/bzipcompresult
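To confirm the compression actually happened, list the output files (a sketch;
--compress alone uses the default gzip codec):
hadoop fs -ls /user/cloudera/compressresult/orders {part files ending in .gz expected}
hadoop fs -ls /user/cloudera/bzipcompresult/orders {part files ending in .bz2 expected}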
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('complete','closed')" \ {the where clause is also applied
inside the boundary-value query}
--warehouse-dir /user/cloudera/customimportresult
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--boundary-query "SELECT 1, 68883" \ {here we hardcode the min & max for the
boundary-value query, e.g. because of an outlier}
--warehouse-dir /user/cloudera/ordersboundval
SESSION - 5
============
sqoop-import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('processing')" \ {Where clause internally add
to boundary query, no matter what}
--warehouse-dir /user/cloudera/whereclauseresult
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \ {this will fail because the table has no primary key, so the
mappers don't know how to divide the work among themselves}
--warehouse-dir /ordersnopk
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_no_pk \
--split-by order_id \
--target-dir /ordersnopk
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--warehouse-dir /user/cloudera/autoreset1mresult \
--autoreset-to-one-mapper \ {uses one mapper if a table with no P.K. is
encountered}
--num-mappers 2
{If you have 100 tables and 98 of them have a primary key while the remaining 2
do not, then the tables with a primary key will be imported with 2 mappers, and
for the tables without a primary key the mapper count automatically falls back to 1.}
SESSION - 6
============
sqoop create-hive-table \ {creates an empty table in Hive based on the metadata
in MySQL}
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \ {by default the Hive table gets the same name as the
source table, but we can change it}
--hive-table emps \ {the table will be named emps in Hive and will carry
the metadata (schema) of the orders table in MySQL}
--fields-terminated-by ','
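Once it runs, the empty table can be checked from Hive (a quick sketch):
hive -e "describe emps;" {should list the columns of the mysql orders table with mapped Hive types}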
# SQOOP EXPORT EXERCISE
=======================
SESSION - 1
============
SQOOP EXPORT
IS USED TO TRANSFER DATA FROM HDFS TO RDBMS.
CREATE TABLE card_transactions (
transaction_id INT,
card_id BIGINT,
member_id BIGINT,
amount INT,
postcode INT,
pos_id BIGINT,
transaction_dt varchar(255),
status varchar(255),
PRIMARY KEY(transaction_id)
);
WE HAVE CARD_TRANS.CSV ON THE DESKTOP LOCALLY IN CLOUDERA.
WE SHOULD BE MOVING THIS FILE FROM LOCAL TO HDFS
hadoop fs -mkdir /data
hadoop fs -put Desktop/card_trans.csv /data
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans.csv \
--fields-terminated-by ","
2 IMPORTANT THINGS:
1. Why did the job fail? {check your job tracking URL}
2. If a job fails, how do we make sure that the target table is not
impacted? {that means nothing should be transferred if the job fails,
i.e. there should be no partial load}
Caused by:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
Duplicate entry '345925144288000-10-10-2017 18:02:40' for key 'PRIMARY'
>>Concept : a staging table comes into play to avoid partial transfer of data.
>>First, create a table with the same schema in the MySQL database, with "stage"
attached to its name:
CREATE TABLE card_transactions_stage (
card_id BIGINT,
member_id BIGINT,
amount INT(10),
postcode INT(10),
pos_id BIGINT,
transaction_dt varchar(255),
status varchar(255),
PRIMARY KEY (card_id, transaction_dt)
);
>>Now, run the export command with --staging-table <table name>:
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_transactions.csv \
--fields-terminated-by ','
>>If only some records get transferred, the partial data is kept in the stage table
and is not transferred to the main table.
>>If the data is successfully transferred to the staging table, MySQL migrates it
from the stage table to the main table, and the stage table becomes empty because
its data has been migrated.
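>>If an earlier failed run has left rows behind in the staging table, Sqoop's
--clear-staging-table option can be added to empty it before the export starts
(a sketch based on the command above):
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--clear-staging-table \
--export-dir /data/card_transactions.csv \
--fields-terminated-by ','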
SESSION - 8
============
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /user/cloudera/data/card_transactions_new.csv \
--fields-terminated-by ','
SESSION - 9
============
Incremental Import
Suppose the orders table in MySQL has 50000 records and order_id is the primary key.
You have already imported those 50000 records using sqoop import,
and 100 new orders are coming into the orders table tomorrow.
Rather than importing everything again, in such a case you should go with
incremental import.
2 choices
==========
1. append mode - used when there are no updates to existing data,
and there are just new inserts.
2. lastmodified mode - used when we need to capture the updates also;
in this case we use a date column on the basis of which we fetch the data.
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0 {meaning: import every record whose order_id is > 0}
insert into orders values(68884,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68885,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68886,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68887,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68888,'2014-07-23 00:00:00',5522,'COMPLETE');
insert into orders values(68889,'2014-07-23 00:00:00',5522,'COMPLETE');
>>commit
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 68883 \
--append
SESSION - 10
=============
incremental import using append mode - only inserts, no updates.
incremental import using lastmodified mode - when there are updates
as well.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \ {here we specify a timestamp/date column}
--last-value 0 \ {normally this should be a date, but for the first load we
want to consider everything}
--append
>> '2023-02-07 22:35:59' {the last value reported at the end of this run; next time
we replace 0 with this value, which is why we note it down}
insert into orders values(68890,current_timestamp,5523,'COMPLETE');
insert into orders values(68891,current_timestamp,5523,'COMPLETE');
insert into orders values(68892,current_timestamp,5523,'COMPLETE');
insert into orders values(68893,current_timestamp,5523,'COMPLETE');
insert into orders values(68894,current_timestamp,5523,'COMPLETE');
update orders set order_status='COMPLETE',order_date = current_timestamp WHERE
ORDER_ID = 68862;
commit;
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \ {just save this date for the
next import}
--append {once we have imported and we run an incremental import again
over the same output dir, then
we have to choose either append or merge-key, depending on the
requirement}
If a record is updated in your table and we then use incremental
import with lastmodified, we will also get the updated record.
5000 oldtimestamp in hdfs
5000 newtimestamp in hdfs {it means HDFS ends up with 2 records for the same key,
one with the old timestamp and one with the new timestamp,
because we are using the --append parameter}
But you want the HDFS file to always be in sync with the table.
{e.g. if you have 1000 records in your MySQL table, there should be 1000
records in HDFS}
{i.e. we only want the latest version of an updated record, not the old one}
{that means when primary key 5000 has 2 records, only the record with the
latest timestamp should be kept in HDFS, so there is no duplicate entry
in HDFS. For that we use --merge-key}
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental lastmodified \
--check-column order_date \
--last-value '2023-02-07 22:35:59' \
--merge-key order_id {if we use merge-key instead of append, it makes sure that
for each key (order_id) only one record is kept in HDFS,
the one with the latest timestamp}
{The above import first brings in the records that were newly added as well as the
existing records that were updated in the table. It then merges the duplicate
records on the basis of the --merge-key column, and after the merge it produces
only 1 file in the output dir, a part-r file, because merging is a reduce activity.}
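One way to verify the merge is to list the output directory (a sketch; exact file
names may vary):
hadoop fs -ls /user/cloudera/data/orders {after a merge run, expect a single part-r-*
file instead of the multiple part-m-* files left by plain --append runs}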
2 modes
========
1. append - handles only new inserts
--incremental append
--check-column order_id
--last-value 0 {any order_id greater than 0 should be imported}
2. lastmodified - when we have updates as well
--incremental lastmodified
--check-column order_date {it should be a date/timestamp column}
--last-value previousdate {all records entered or updated after this date
should be imported}
>>After the first incremental import you must pass either --append or --merge-key,
otherwise Sqoop will show an error that the output dir already exists:
--append {will create duplicates in HDFS: the old record plus its updated version}
--merge-key order_id {will merge the duplicates through a reduce activity on the
basis of the primary key; the newer record, based on the
timestamp, replaces the older one}
SESSION - 11
=============
incremental import
In this session we will talk about
1. sqoop job
2. password management.
sqoop job \
--create job_orders \ {job name should be unique}
-- import \ {note: there must be one space after the two hyphens, before "import"}
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0
sqoop job --list : This command will show us all the created sqoop jobs.
sqoop job --exec job_orders
sqoop job --show job_orders : To see all the parameters saved for the job.
sqoop job --delete job_orders : Deleting a sqoop job
echo -n "cloudera" >> .password.file , it's is created in local cloudera
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-file file:///home/cloudera/.password.file \
--table orders \
--warehouse-dir /user/cloudera/data \
--incremental append \
--check-column order_id \
--last-value 0
With the password stored in a file, we expect the above job to run fully
automatically, without prompting for a password.
We have successfully created the job.
sqoop job --exec job_orders
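After each execution, the stored last value is updated in the Sqoop metastore; one
way to confirm (a sketch, assuming the property name shown by --show):
sqoop job --show job_orders | grep incremental.last.value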