Sqoop 1
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
--password cloudera
sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "select * from retail_db.customers limit 10"
OR (-e)
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
-e "select * from retail_db.customers limit 10"
ifconfig shows the VM's IP address (here 10.0.2.15), which can be used in place of the hostname in the JDBC URL.
sqoop-eval \
--connect "jdbc:mysql://10.0.2.15:3306" \
--username retail_dba \
--password cloudera \
--query "describe retail_db.orders"
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /ordersresult2
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/trendytech \
--username root \
--password cloudera \
--table people \
-m 1 \
--target-dir /peopleresult
2. SPLIT BY
1. ON NUMERIC COLUMN
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--split-by order_id \
--warehouse-dir /ordersresult2
2. ON TEXT COLUMN
sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--split-by "category_name" \
--warehouse-dir /ordersresult2
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders_no_primarykey \
--autoreset-to-one-mapper \
-m 8 \
--warehouse-dir /user/cloudera/npkresult
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--as-sequencefile \
--autoreset-to-one-mapper \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
--autoreset-to-one-mapper falls back to a single mapper for any table that has no primary key.
Tables that do have a primary key still use the number of mappers specified with -m.
sqoop import-all-tables \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--as-sequencefile \
-m 4 \
--warehouse-dir /user/cloudera/sqoopdir
SQOOP HELP
sqoop help
SQOOP VERSION
sqoop version
-P
sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username retail_dba \
-P
REDIRECTING LOGS
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /ordersresult2 1>query.out 2>query.err
cat query.out
cat query.err
COMPRESSION TECHNIQUES
1. USING GZIP
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--compress \
--warehouse-dir /user/cloudera/compresult
Note: -z is a shorthand for --compress; the default codec is gzip.
2. USING BZip2Codec
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--compression-codec BZip2Codec \
--warehouse-dir /user/cloudera/bzipcomp
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--columns customer_id,customer_fname,customer_city \
--warehouse-dir /user/cloudera/customerresult
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('complete ','closed')" \
--warehouse-dir /user/cloudera/result199
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--boundary-query "SELECT 1,68883" \
--warehouse-dir /user/cloudera/boundaryq
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table order_items \
--boundary-query "SELECT min(order_item_order_id),max(order_item_order_id) FROM
order_items WHERE order_item_order_id > 10000" \
--warehouse-dir /user/cloudera/boundaryq1
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--columns order_id,order_customer_id,order_status \
--where "order_status in ('processing')" \
--warehouse-dir /user/cloudera/result8
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--fields-terminated-by '|' \
--lines-terminated-by ';' \
--target-dir /user/cloudera/result1234
sqoop create-hive-table \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--hive-table emps
SQOOP VERBOSE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--verbose \
--target-dir /queryresult8
SQOOP APPEND
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult8 \
--append
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult8 \
--delete-target-dir
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username retail_dba \
--password cloudera \
--table orders \
--target-dir /user/cloudera/result1234 \
--delete-target-dir \
--null-non-string "-1"
HDFS to RDBMS
If the file you want to export to the RDBMS is on the local filesystem, it first needs to be moved into HDFS.
Suppose we have the CARD_TRANS CSV file on the desktop of the Cloudera VM; move it from LOCAL to HDFS first.
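A minimal sketch of that move, assuming the file sits on the cloudera user's desktop (the exact local path is an assumption):
hadoop fs -mkdir -p /data
hadoop fs -put /home/cloudera/Desktop/card_trans-200913-220429.csv /data/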
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans-200913-220429.xlsx \
--fields-terminated-by ","
Go to the logs --> click on the tracking URL --> click on the number (1) of failed maps
--> Logs --> here you will see the complete details of the failure (in this case the export fails because the .xlsx file is not a plain delimited text file).
STAGING TABLE
There should not be any partial transfer of data: either the full data is transferred or nothing at all.
This is where the STAGING TABLE comes in.
Create a staging table in MySQL.
The schema of this staging table should be exactly the same as the target table.
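One way to create that staging table is through sqoop eval, sketched below (the same DDL could equally be run in the mysql client; LIKE copies the schema of the target table):
sqoop eval \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
-e "CREATE TABLE card_transactions_stage LIKE card_transactions"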
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/banking \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_trans-200913-220429.csv \
--fields-terminated-by ","
1. APPEND MODE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
The next time you import, you will need to supply these details as input; they are available in the logs of the previous import:
--incremental append \
--check-column order_id \
--last-value 68883
Now let's say you have added 5 new records to the orders table in MySQL:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 68883
2. LASTMODIFIED MODE
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value 0 \
--append
When we use lastmodified mode, the check column must be a date/timestamp column.
This first run imports all the rows. For the next run, pick up the last value from the logs:
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09'
Now add some records to the orders table and update one of the existing records.
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09' \
--append
If a record is updated in the source table and we then use incremental import in lastmodified mode with --append, we also get the updated record as a new copy, alongside the old one.
If you want the HDFS files to always stay in sync with the table (keeping only the updated record), use --merge-key instead of --append:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental lastmodified \
--check-column order_date \
--last-value '2022-09-22 03:20:09' \
--merge-key order_id
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
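The saved job can then be listed, inspected, executed and removed with the standard sqoop job subcommands:
sqoop job --list
sqoop job --show job_orders
sqoop job --exec job_orders
sqoop job --delete job_orders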
sqoop job \
--create job_orders \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-file file:///home/cloudera/.password-file \
--table orders \
--warehouse-dir /data \
--incremental append \
--check-column order_id \
--last-value 0
file:// indicates that the password file is on the local filesystem, not in HDFS.
If you do not mention file://, Sqoop will expect the file in HDFS.
Now, during job execution, Sqoop will not ask for the password; it becomes a fully automated process.
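A minimal sketch of creating that password file (echo -n avoids a trailing newline, and restrictive permissions are recommended):
echo -n "cloudera" > /home/cloudera/.password-file
chmod 400 /home/cloudera/.password-file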
ls -altr /home/cloudera
cd .sqoop ---> ls ---> metastore.db.script (the saved job definitions are stored in this hidden .sqoop directory)
sqoop eval \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/cloudera/mysql.password.jceks \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password-alias mysql.banking.password \
--query "select count(*) from orders"
By default, Sqoop will import a table named orders to a directory named orders inside your
home directory in HDFS. For example, if your username is someuser, then the import tool
will write to /user/someuser/orders/(files)
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders
Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.
When importing a free-form query, you must specify a destination directory with
--target-dir.
If you want to import the results of a query in parallel, then each map task will need to
execute a copy of the query, with results partitioned by bounding conditions inferred by
Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will
replace with a unique condition expression. You must also select a splitting column with
--split-by.
Note: If you are issuing the query wrapped with double quotes ("), you will have to use
\$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell
variable.
Example 1:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query 'select * from orders where $CONDITIONS AND order_id > 50000' \
--target-dir /data/orders3 \
--split-by order_id
Example 2:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "select * from orders where \$CONDITIONS AND order_id > 50000" \
--target-dir /data/orders3 \
--split-by order_id
Note: The facility of using free-form query in the current version of Sqoop is limited to
simple queries where there are no ambiguous projections and no OR conditions in the
WHERE clause. Use of complex queries such as queries that have sub-queries or joins
leading to ambiguous projections can lead to unexpected results.
By default, the import process will use JDBC which provides a reasonable cross-vendor
import channel. Some databases can perform imports in a more high-performance fashion
by using database-specific data movement tools.
For example, MySQL provides the mysqldump tool which can export data from MySQL to
other systems very quickly.
By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC, but it can be used for fairly basic imports only.
Example:
sqoop-import \
--username root \
--password cloudera \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--table orders \
--target-dir /data/orders \
--direct
Validation compares the row counts from the source and the target after the copy, for either an import or an export.
Validation currently only works for data copied from a single table into HDFS, and it has several limitations (for instance, it compares row counts only, not the data itself).
Example:
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--target-dir /data/orders \
--validate
Exports may fail for a number of reasons:
1. Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
2. Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
3. Attempting to parse an incomplete or malformed record from the HDFS source data
Note: If an export map task fails due to these or other reasons, it will cause the export job to
fail. The results of a failed export are undefined. Each export map task operates in a
separate transaction. Furthermore, individual map tasks commit their current transaction
periodically. If a task fails, the current transaction will be rolled back. Any previously-
committed transactions will remain durable in the database, leading to a partially-complete
export.
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table.
Each row of the input table will be transformed into an HBase Put operation to a row of the
output table. The key for each row is taken from a column of the input. By default, Sqoop
will use the split-by column as the row key column. If that is not specified, it will try to
identify the primary key column, if any, of the source table.
You can manually specify the row key column with --hbase-row-key. Each output column
will be placed in the same column family, which must be specified with --column-family.
If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.
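A minimal sketch of such an import, assuming an HBase table named customers_hbase and a column family named cf (both names are illustrative):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table customers \
--hbase-table customers_hbase \
--column-family cf \
--hbase-row-key customer_id \
--hbase-create-table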
Sqoop's import tool's main function is to upload your data into files in HDFS. If you have a
Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive
by generating and executing a CREATE TABLE statement to define the data's layout in Hive.
Importing data into Hive is as simple as adding the --hive-import option to your Sqoop
command line.
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that
existing table in hive must be replaced. After your data is imported into HDFS or this step is
omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining
your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files
into Hive's warehouse directory.
Sqoop will by default import NULL values as the string null. Hive, however, uses the string \N to denote NULL values, so predicates dealing with NULL (like IS NULL) will not work correctly. You should add the parameters --null-string and --null-non-string for an import job, or --input-null-string and --input-null-non-string for an export job, if you wish to properly preserve NULL values. Because Sqoop uses those parameters in generated code, you need to properly escape the value \N to \\N:
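For example, a sketch of an import into Hive with those arguments (the connect/table options follow the other examples in these notes):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--hive-import \
--null-string '\\N' \
--null-non-string '\\N'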
The table name used in Hive is, by default, the same as that of the source table. You can
control the output table name with the --hive-table option.
Example-
sqoop-import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--hive-import \
--hive-table orders_new \
--verbose
How is the split size calculated when the primary key is not in sequence? I am referring to the example of sqoop import with a where clause, where we only consider the records with order status 'processing'. Hence order_id is not in sequence... does it not lead to an unequal distribution of data among the mappers?
It is always preferred to go with numeric columns, ideally ones with unique values.
Split-by should be used on a column that has no outliers and is indexed.
Even when the primary key is not in sequence (when we apply a where condition), the work is still divided automatically among all 4 mappers.
What happens when a Sqoop job fails in the middle of a large transfer?
If it is an export job with --staging-table, your staging table will hold the partial data.
If it is an import job, you may find a few mapper output files in your target directory, provided one or more mappers completed. We have to delete the directories formed during the import and restart the job.
What can the Sqoop command be if we wish to import only the first 10 records from a table?
You can use a LIMIT clause in the select statement while importing. sqoop eval is preferred if you only want a sample of the data.
When we modify the boundary query like "select 1, 68883" and we have outliers in the data (with index 200000), then we would end up pulling the data without the outlier record, right?
If our table has a COMPOSITE primary key, then how does sqoop import work? I mean, how will the boundary query find min(primary key) and max(primary key)?
--merge-key is not meant to merge files but to merge keys, so that we do not end up with repeated keys. If you want to merge files, use hadoop fs -getmerge instead.
What is the meaning of overriding an argument of a saved job in Sqoop, and how is it done with --exec?
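As a sketch of the documented behaviour: arguments given after a lone -- on sqoop job --exec override the ones stored with the saved job, for example to change the warehouse directory for a single run (the directory here is illustrative):
sqoop job --exec job_orders -- --warehouse-dir /data_adhoc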
It will divide the data based on characters. The issue comes when multiple entries starting with the same character are encountered: it then considers the 2nd character, and so on.
It becomes more complex when it finds uppercase letters, numbers, special characters etc. The algorithm is very complex and sometimes unable to resolve conflicts, resulting in repeated or missing entries. That is why splitting on a text column is not recommended.
How do we use the -e or --query argument with sqoop import? For example, if I want to limit the import to 10 rows (e.g. select * from abc limit 10), how can these arguments be used?
Your --query must include the token $CONDITIONS, e.g.
--query "select * from retail_db.orders where \$CONDITIONS LIMIT 10"
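A minimal sketch of the full command (the target directory is illustrative; -m 1 avoids running parallel copies of the query, which keeps the LIMIT meaningful):
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "select * from orders where \$CONDITIONS limit 10" \
--target-dir /data/orders_sample \
-m 1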
If the primary key is not a numeric type, how will the min and max for the boundary value query be calculated?
Non-numeric keys are not recommended: Sqoop internally converts the strings to character (ASCII) values, an even distribution of load among the mappers is not guaranteed, and the job might fail as well.
How are Sqoop queries scheduled in real PRODUCTION environments? Is it Oozie or some other tool?
Oozie or Airflow.
What if we want the outlier as well when importing? Suppose records are from 1 to 100 and 555 is an outlier?
--clear-staging-table will ensure that data is deleted from the staging table before the export.
Is it feasible to use a staging table while exporting a large amount of data, from a performance perspective?
Not necessarily a problem. We tried exporting GBs of data and didn't find any issue. If you still have a performance issue, you can increase the frequency of the job: for example, rather than running it every 4 hours, run it every 2 hours.
Is there a way to import all columns of a table except one? I know we can use the --columns and --query options, but is there something like --exclude-columns?
I don't think there is a direct parameter in the sqoop command to exclude columns (if there is, please let me know). Only --exclude-tables is available, to skip certain tables while doing import-all-tables.
In the boundary value query, if we split by a non-primary-key column and it contains duplicates, will all the records get imported?
For example, there are two records in the order_items table with order_item_order_id = 68880.
If it is not a primary key and you have duplicates, both rows with id 68880 should be imported to HDFS.
Please check the rows retrieved and count(*) from the table to make sure.
Cat all the files and grep for 68880:
hadoop fs -cat /user/cloudera/bvqresult1/order_items/* | grep 68880
What is the efficient way to handle outlier data if that data is valid? I understand we can ignore it while customising the boundary query, but in that case the record will be filtered out, right? In the example given in the video, the record with ID 200000 will not be transferred to HDFS.
The first option is to let it run with the default boundary values, but this is inefficient from the mappers' point of view. The second approach is to run 2 Sqoop imports. Suppose you have data from 1 to 500 and then 10000 to 10003: in the first import I process only the 500 records, and in the second I process the last 4. This way the imports are faster and the records are distributed properly across the mappers (see the sketch below).
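A minimal sketch of that two-import approach, using illustrative id ranges and an illustrative target directory on the orders table:
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--where "order_id <= 500" \
--target-dir /data/orders_main
sqoop import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--where "order_id > 500" \
--target-dir /data/orders_main \
--append \
-m 1
The first import splits the dense range across the default 4 mappers; the second appends the few outlier rows with a single mapper.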
We can use --validate in Sqoop to validate the data; however, this only checks that the row counts of the source and the destination match, not the data itself.
What is the role of the JDBC driver in a Sqoop setup? Is the JDBC driver alone enough to connect Sqoop to the database?
Sqoop needs a connector to work with the different relational databases. Almost every database vendor makes a JDBC connector available specific to that database, and Sqoop needs the JDBC driver of the database for interaction.
No, the JDBC driver alone is not enough: Sqoop needs both the JDBC driver and a connector to connect to a database.
The Sqoop metastore is a tool that hosts a shared metadata repository. Multiple users and remote users can define and execute saved jobs stored in the metastore. End users are configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
Sqoop allows us to define saved jobs which make this process simple. A saved job records
the configuration information required to execute a Sqoop command at a later time. sqoop-
job tool describes how to create and work with saved jobs. Job descriptions are saved to a
private repository stored in $HOME/.sqoop/.
We can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
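A minimal sketch of using a shared metastore (the host name and job name are illustrative; 16000 is the default metastore port):
sqoop metastore
sqoop job \
--meta-connect jdbc:hsqldb:hsql://metastore.host:16000/sqoop \
--create job_orders_shared \
-- import \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--table orders \
--warehouse-dir /data
The first command starts the metastore service on one node; the second creates a saved job in it from any client (password handling as in the earlier saved-job examples).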
Is the Sqoop import command the same as DistCp? No. Although the DistCp command and the Sqoop import command look similar in that both submit parallel map-only jobs, the two commands do different things. DistCp copies files of any type between Hadoop filesystems, whereas Sqoop transfers data records between an RDBMS and the Hadoop ecosystem.
split-by is a clause used to specify the column of the table that is used to generate the splits when importing data into the Hadoop cluster. It helps improve performance through greater parallelism, and it should ideally be a column whose data is evenly distributed, so that the splits carry similar amounts of data.
How can you execute a free-form SQL query in Sqoop to import the rows in a sequential
manner?
Ans. By using the -m 1 option in the Sqoop import command we can accomplish it.
Basically, it will create only one MapReduce task, which will then import the rows serially.
How will you list all the columns of a table using Apache Sqoop?
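There is no dedicated list-columns tool in Sqoop 1; a common workaround (a sketch, assuming a MySQL source) is to query the catalog with sqoop eval:
sqoop eval \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--query "SELECT column_name FROM information_schema.columns WHERE table_schema = 'retail_db' AND table_name = 'orders'"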
Parallel import/export
When it comes to importing and exporting data, Sqoop uses the YARN framework, which offers fault tolerance on top of parallelism.
Incremental Load
Moreover, we can load just the changed parts of a table whenever it is updated, since Sqoop offers an incremental load facility.
Full Load
This is one of the important features of Sqoop: we can load a whole table with a single command, and we can also load all the tables of a database with a single command.
Compression
We can compress our data with the deflate (gzip) algorithm using the --compress argument, or by specifying the --compression-codec argument. In addition, we can also load a compressed table into Apache Hive.
What is the advantage of using --password-file rather than the -P option to prevent the password from being displayed in the sqoop import statement?
The --password-file option can be used inside a Sqoop script, whereas the -P option reads from standard input, which prevents automation.
What is a disadvantage of using the --direct parameter for faster data loads in Sqoop?
The native utilities used by databases to support faster loads do not work for binary data formats like SequenceFile.
How will you update rows that have already been exported?
To update existing rows we can use the --update-key parameter. It takes a comma-separated list of columns that uniquely identify a row. All of these columns are used in the WHERE clause of the generated UPDATE query, and all other table columns are used in the SET part of the query.
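A minimal sketch of such an export (the HDFS directory is illustrative; order_id uniquely identifies a row in orders):
sqoop export \
--connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
--username root \
--password cloudera \
--table orders \
--export-dir /data/orders_updates \
--update-key order_id \
--fields-terminated-by ","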
--validate means to validate the data copied, for either an import or an export, by comparing the row counts from the source and the target after the copy. We use this option to compare the row counts between the source and the target just after the data is imported into HDFS. Moreover, when rows are deleted or added during the import, Sqoop tracks this change and updates the log file.
The main purpose of validation in Sqoop is thus to confirm the copied data by comparing the row counts from the source and the target after the copy, for either import or export.