Sqoop 1
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
OR (-e)
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
-e "select * from retail_db.customers limit 10"
ifconfig shows the VM's IP address (here 10.0.2.15), which can be used in place of the hostname in the JDBC URL.
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "describe retail_db.orders"
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /ordersresult2
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/trendytech \
--username root \
--password cloudera \
--table people \
-m 1 \
--target-dir /peopleresult
2. SPLIT BY
1. ON NUMERIC COLUMN
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--split-by order_id \
--warehouse-dir /ordersresult2
2. ON TEXT COLUMN
sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--split-by "category_name" \
--warehouse-dir /ordersresult2
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--autoreset-to-one-mapper \
-m 8 \
--warehouse-dir /user/cloudera/npkresult
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--as-sequencefile \
--autoreset-to-one-mapper \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
--autoreset-to-one-mapper falls back to a single mapper for any table that has no primary key.
Tables that do have a primary key still use the number of mappers specified with -m.
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
SQOOP HELP
sqoop help
SQOOP VERSION
sqoop version
-P
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P
REDIRECTING LOGS
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /ordersresult2 1>query.out 2>query.err
cat query.out
cat query.err
COMPRESSION TECHNIQUES
1. USING GZIP
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--warehouse-dir /user/cloudera/compresult
Note: -z is a shorthand for --compress; the default codec is gzip.
2. USING BZip2Codec
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--compression-codec BZip2Codec \
--warehouse-dir /user/cloudera/bzipcomp
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--columns customer_id,customer_fname,customer_city \
--warehouse-dir /user/cloudera/customerresult
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('complete ','closed')" \
--warehouse-dir /user/cloudera/result199
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--boundary-query "SELECT 1,68883" \
--warehouse-dir /user/cloudera/boundaryq
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_items \
--boundary-query "SELECT min(order_item_order_id),max(order_item_order_id) FROM
order_items WHERE order_item_order_id > 10000" \
--warehouse-dir /user/cloudera/boundaryq1
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('processing')" \
--warehouse-dir /user/cloudera/result8
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--fields-terminated-by '|' \
--lines-terminated-by ';' \
--target-dir /user/cloudera/result1234
sqoop create-hive-table \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--hive-table emps
SQOOP VERBOSE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--verbose \
--target-dir /queryresult8
SQOOP APPEND
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult8 \
--append
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult8 \
--delete-target-dir
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--target-dir /user/cloudera/result1234 \
--delete-target-dir \
--null-non-string "-1"
HDFS to RDBMS
If the file you want to export to the RDBMS is on the local filesystem, it first needs to be moved into HDFS.
Suppose we have the CARD_TRANS CSV file on the desktop of the Cloudera VM; move it from LOCAL to HDFS first.
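A minimal sketch of that move, assuming the file sits on the cloudera user's desktop (the exact local path is an assumption):
hadoop fs -mkdir -p /data
hadoop fs -put /home/cloudera/Desktop/card_trans-200913-220429.csv /data/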
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans-200913-220429.xlsx \
--fields-terminated-by ","
Go to the logs --> click on the tracking URL --> click on the number (1) of failed maps
--> Logs --> here you will see the complete details of the failure (in this case the export fails because the .xlsx file is not a plain delimited text file).
STAGING TABLE
There should not be any partial transfer of data: either the full data is transferred or nothing at all.
This is where the STAGING TABLE comes in.
Create a staging table in MySQL.
The schema of this staging table should be exactly the same as the target table.
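One way to create that staging table is through sqoop eval, sketched below (the same DDL could equally be run in the mysql client; LIKE copies the schema of the target table):
sqoop eval \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
-e "CREATE TABLE card_transactions_stage LIKE card_transactions"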
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_trans-200913-220429.csv \
--fields-terminated-by ","
1. APPEND MODE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
The next time you import, you will need to supply these details as input; they are available in the logs of the previous import:
--incremental append \
--check-column order_id \
--last-value 68883
Now let's say you have added 5 new records to the orders table in MySQL:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 68883
2. LASTMODIFIED MODE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value 0 \
--append
When we use lastmodified mode, the check column must be a date/timestamp column.
This first run imports all the rows. For the next run, pick up the last value from the logs:
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09'
Now add some records to the orders table and update one of the existing records.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09' \
--append
If a record is updated in the source table and we then use incremental import in lastmodified mode with --append, we also get the updated record as a new copy, alongside the old one.
If you want the HDFS files to always stay in sync with the table (keeping only the updated record), use --merge-key instead of --append:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09' \
--merge-key order_id
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
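The saved job can then be listed, inspected, executed and removed with the standard sqoop job subcommands:
sqoop job --list
sqoop job --show job_orders
sqoop job --exec job_orders
sqoop job --delete job_orders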
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-file file:///home/cloudera/.password-file \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
file:// indicates that the password file is on the local filesystem, not in HDFS.
If you do not mention file://, Sqoop will expect the file in HDFS.
Now, during job execution, Sqoop will not ask for the password; it becomes a fully automated process.
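A minimal sketch of creating that password file (echo -n avoids a trailing newline, and restrictive permissions are recommended):
echo -n "cloudera" > /home/cloudera/.password-file
chmod 400 /home/cloudera/.password-file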
ls -altr /home/cloudera
cd .sqoop ---> ls ---> metastore.db.script (the saved job definitions are stored in this hidden .sqoop directory)
sqoop eval \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/cloudera/mysql.password.jceks \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-alias mysql.banking.password \
--query "select count(*) from orders"
By default, Sqoop will import a table named orders to a directory named orders inside your
home directory in HDFS. For example, if your username is someuser, then the import tool
will write to /user/someuser/orders/(files)
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders
Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.
When importing a free-form query, you must specify a destination directory with
--target-dir.
If you want to import the results of a query in parallel, then each map task will need to
execute a copy of the query, with results partitioned by bounding conditions inferred by
Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will
replace with a unique condition expression. You must also select a splitting column with
--split-by.
Note: If you are issuing the query wrapped with double quotes ("), you will have to use
\$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell
variable.
Example 1:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query 'select * from orders where $CONDITIONS AND order_id > 50000' \
--target-dir /data/orders3 \
--split-by order_id
Example 2:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "select * from orders where \$CONDITIONS AND order_id > 50000" \
--target-dir /data/orders3 \
--split-by order_id
Note: The facility of using free-form query in the current version of Sqoop is limited to
simple queries where there are no ambiguous projections and no OR conditions in the
WHERE clause. Use of complex queries such as queries that have sub-queries or joins
leading to ambiguous projections can lead to unexpected results.
By default, the import process will use JDBC which provides a reasonable cross-vendor
import channel. Some databases can perform imports in a more high-performance fashion
by using database-specific data movement tools.
For example, MySQL provides the mysqldump tool which can export data from MySQL to
other systems very quickly.
By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC, but it can be used for fairly basic imports only.
Example:
sqoop-import \
--username root \
--password cloudera \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--table orders \
--target-dir /data/orders \
--direct
Validation compares the row counts from the source and the target after the copy, for either an import or an export.
Validation currently only works for data copied from a single table into HDFS, and it has several limitations (for instance, it compares row counts only, not the data itself).
Example:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /data/orders \
--validate
Exports may fail for a number of reasons:
1. Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
2. Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
3. Attempting to parse an incomplete or malformed record from the HDFS source data
Note: If an export map task fails due to these or other reasons, it will cause the export job to
fail. The results of a failed export are undefined. Each export map task operates in a
separate transaction. Furthermore, individual map tasks commit their current transaction
periodically. If a task fails, the current transaction will be rolled back. Any previously-
committed transactions will remain durable in the database, leading to a partially-complete
export.
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table.
Each row of the input table will be transformed into an HBase Put operation to a row of the
output table. The key for each row is taken from a column of the input. By default, Sqoop
will use the split-by column as the row key column. If that is not specified, it will try to
identify the primary key column, if any, of the source table.
You can manually specify the row key column with --hbase-row-key. Each output column
will be placed in the same column family, which must be specified with --column-family.
If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.
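A minimal sketch of such an import, assuming an HBase table named customers_hbase and a column family named cf (both names are illustrative):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--hbase-table customers_hbase \
--column-family cf \
--hbase-row-key customer_id \
--hbase-create-table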
Sqoop's import tool's main function is to upload your data into files in HDFS. If you have a
Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive
by generating and executing a CREATE TABLE statement to define the data's layout in Hive.
Importing data into Hive is as simple as adding the --hive-import option to your Sqoop
command line.
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that
existing table in hive must be replaced. After your data is imported into HDFS or this step is
omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining
your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files
into Hive's warehouse directory.
Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should add the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:
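For example, a sketch of an import into Hive with those arguments (the connect/table options follow the other examples in these notes):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--hive-import \
--null-string '\\N' \
--null-non-string '\\N'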
The table name used in Hive is, by default, the same as that of the source table. You can
control the output table name with the --hive-table option.
Example-
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--hive-import \
--hive-table orders_new \
--verbose
How is the split size calculated when the primary key is not in sequence? I am referring to the example of sqoop import with a where clause, where we only consider the records with order status 'processing'. Hence order_id is not in sequence... does it not lead to an unequal distribution of data among the mappers?
It is always preferred to go with numeric columns, ideally ones with unique values.
Split-by should be used on a column that has no outliers and is indexed.
Even when the primary key is not in sequence (when we apply a where condition), the work is still divided automatically among all 4 mappers.
What happens when a Sqoop job fails in the middle of a large transfer?
If it is an export job with --staging-table, your staging table will hold the partial data.
If it is an import job, you may find a few mapper output files in your target directory, provided one or more mappers completed. We have to delete the directories formed during the import and restart the job.
What can the Sqoop command be if we wish to import only the first 10 records from a table?
You can use a LIMIT clause in the select statement while importing. sqoop eval is preferred if you only want a sample of the data.
When we modify the boundary query like "select 1, 68883" and we have outliers in the data (with index 200000), then we would end up pulling the data without the outlier record, right?
If our table has a COMPOSITE primary key, then how does sqoop import work? I mean, how will the boundary query find min(primary key) and max(primary key)?
--merge-key is not meant to merge files but to merge keys, so that we do not end up with repeated keys. If you want to merge files, use hadoop fs -getmerge instead.
What is the meaning of overriding an argument of a saved job in Sqoop, and how is it done with --exec?
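As a sketch of the documented behaviour: arguments given after a lone -- on sqoop job --exec override the ones stored with the saved job, for example to change the warehouse directory for a single run (the directory here is illustrative):
sqoop job --exec job_orders -- --warehouse-dir /data_adhoc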
It will divide the data based on characters. The issue comes when multiple entries starting with the same character are encountered: it then considers the 2nd character, and so on.
It becomes more complex when it finds uppercase letters, numbers, special characters etc. The algorithm is very complex and sometimes unable to resolve conflicts, resulting in repeated or missing entries. That is why splitting on a text column is not recommended.
How do we use the -e or --query argument with sqoop import? For example, if I want to limit the import to 10 rows (e.g. select * from abc limit 10), how can these arguments be used?
Your --query must include the token $CONDITIONS, e.g.
--query "select * from retail_db.orders where \$CONDITIONS LIMIT 10"
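A minimal sketch of the full command (the target directory is illustrative; -m 1 avoids running parallel copies of the query, which keeps the LIMIT meaningful):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "select * from orders where \$CONDITIONS limit 10" \
--target-dir /data/orders_sample \
-m 1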
If the primary key is not a numeric type, how will the min and max for the boundary value query be calculated?
Non-numeric keys are not recommended: Sqoop internally converts the strings to character (ASCII) values, an even distribution of load among the mappers is not guaranteed, and the job might fail as well.
How are Sqoop queries scheduled in real PRODUCTION environments? Is it Oozie or some other tool?
Oozie or Airflow.
What if we want the outlier as well when importing? Suppose records are from 1 to 100 and 555 is an outlier?
--clear-staging-table will ensure that data is deleted from the staging table before the export.
Is it feasible to use a staging table while exporting a large amount of data, from a performance perspective?
Not necessarily a problem. We tried exporting GBs of data and didn't find any issue. If you still have a performance issue, you can increase the frequency of the job: for example, rather than running it every 4 hours, run it every 2 hours.
Is there a way to import all columns of a table except one? I know we can use the --columns and --query options, but is there something like --exclude-columns?
I don't think there is a direct parameter in the sqoop command to exclude columns (if there is, please let me know). Only --exclude-tables is available, to skip certain tables while doing import-all-tables.
In the boundary value query, if we split by a non-primary-key column and it contains duplicates, will all the records get imported?
For example, there are two records in the order_items table with order_item_order_id = 68880.
If it is not a primary key and you have duplicates, both rows with id 68880 should be imported to HDFS.
Please check the rows retrieved and count(*) from the table to make sure.
Cat all the files and grep for 68880:
hadoop fs -cat /user/cloudera/bvqresult1/order_items/* | grep 68880
What is the efficient way to handle outlier data if that data is valid? I understand we can ignore it while customising the boundary query, but in that case the record will be filtered out, right? In the example given in the video, the record with ID 200000 will not be transferred to HDFS.
The first option is to let it run with the default boundary values, but this is inefficient from the mappers' point of view. The second approach is to run 2 Sqoop imports. Suppose you have data from 1 to 500 and then 10000 to 10003: in the first import I process only the 500 records, and in the second I process the last 4. This way the imports are faster and the records are distributed properly across the mappers (see the sketch below).
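A minimal sketch of that two-import approach, using illustrative id ranges and an illustrative target directory on the orders table:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--where "order_id <= 500" \
--target-dir /data/orders_main
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--where "order_id > 500" \
--target-dir /data/orders_main \
--append \
-m 1
The first import splits the dense range across the default 4 mappers; the second appends the few outlier rows with a single mapper.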
We can use --validate in Sqoop to validate the data; however, this only checks that the row counts of the source and the destination match, not the data itself.
What is the role of the JDBC driver in a Sqoop setup? Is the JDBC driver alone enough to connect Sqoop to the database?
Sqoop needs a connector to work with the different relational databases. Almost every database vendor makes a JDBC connector available specific to that database, and Sqoop needs the JDBC driver of the database for interaction.
No, the JDBC driver alone is not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.
The Sqoop metastore is a tool that hosts a shared metadata repository. Multiple users and remote users can define and execute saved jobs stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
Sqoop allows us to define saved jobs which make this process simple. A saved job records
the configuration information required to execute a Sqoop command at a later time. sqoop-
job tool describes how to create and work with saved jobs. Job descriptions are saved to a
private repository stored in $HOME/.sqoop/.
We can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
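A minimal sketch of using a shared metastore (the host name and job name are illustrative; 16000 is the default metastore port):
sqoop metastore
sqoop job \
--meta-connect jdbc:hsqldb:hsql://metastore.host:16000/sqoop \
--create job_orders_shared \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--table orders \
--warehouse-dir /data
The first command starts the metastore service on one node; the second creates a saved job in it from any client (password handling as in the earlier saved-job examples).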
Is the Sqoop import command the same as DistCp? No. Although the DistCp command and the Sqoop import command look similar in that both submit parallel map-only jobs, the two commands do different things. DistCp copies files of any type between Hadoop filesystems, whereas Sqoop transfers data records between an RDBMS and the Hadoop ecosystem.
split-by is a clause used to specify the column of the table that is used to generate the splits when importing data into the Hadoop cluster. It helps improve performance through greater parallelism, and it should ideally be a column whose data is evenly distributed, so that the splits carry similar amounts of data.
How can you execute a free-form SQL query in Sqoop to import the rows in a sequential
manner?
Ans. By using the -m 1 option in the Sqoop import command we can accomplish it.
Basically, it will create only one MapReduce task, which will then import the rows serially.
How will you list all the columns of a table using Apache Sqoop?
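There is no dedicated list-columns tool in Sqoop 1; a common workaround (a sketch, assuming a MySQL source) is to query the catalog with sqoop eval:
sqoop eval \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "SELECT column_name FROM information_schema.columns WHERE table_schema = 'retail_db' AND table_name = 'orders'"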
Parallel import/export
When it comes to importing and exporting data, Sqoop uses the YARN framework, which offers fault tolerance on top of parallelism.
Incremental Load
Moreover, we can load just the changed parts of a table whenever it is updated, since Sqoop offers an incremental load facility.
Full Load
This is one of the important features of Sqoop: we can load a whole table with a single command, and we can also load all the tables of a database with a single command.
Compression
We can compress our data with the deflate (gzip) algorithm using the --compress argument, or by specifying the --compression-codec argument. In addition, we can also load a compressed table into Apache Hive.
What is the advantage of using --password-file rather than the -P option to prevent the password from being displayed in the sqoop import statement?
The --password-file option can be used inside a Sqoop script, whereas the -P option reads from standard input, which prevents automation.
What is a disadvantage of using the --direct parameter for faster data loads in Sqoop?
The native utilities used by databases to support faster loads do not work for binary data formats like SequenceFile.
How will you update rows that have already been exported?
To update existing rows we can use the --update-key parameter. It takes a comma-separated list of columns that uniquely identify a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
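A minimal sketch of such an export (the HDFS directory is illustrative; order_id uniquely identifies a row in orders):
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--export-dir /data/orders_updates \
--update-key order_id \
--fields-terminated-by ","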
--validate means to validate the data copied, for either an import or an export, by comparing the row counts from the source and the target after the copy. We use this option to compare the row counts between the source and the target just after the data is imported into HDFS. Moreover, when rows are deleted or added during the import, Sqoop tracks this change and updates the log file.
The main purpose of validation in Sqoop is thus to confirm the copied data by comparing the row counts from the source and the target after the copy, for either import or export.