HIVE
Cost Based Optimization
The main goal of a CBO is to generate efficient execution plans by examining the tables and
conditions specified in the query, ultimately cutting down on query execution time and
reducing resource utilization. Apache Calcite provides an efficient plan pruner that selects the
cheapest query plan. Hive converts every SQL query into a physical operator tree, optimizes it,
and translates it into Tez/MapReduce jobs that are then executed on the Hadoop cluster. This
conversion includes SQL parsing and transformation, as well as operator-tree optimization.
Vertex Error
This error typically occurs because the Tez containers are not allocated enough memory to run
the query. Updating the parameters below may resolve the issue.
tez-site.xml:
<property>
  <name>tez.am.launch.cmd-opts</name>
  <value>-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC</value>
</property>
<property>
  <name>tez.task.launch.cmd-opts</name>
  <value>-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC</value>
</property>
hive-site.xml:
<property>
  <name>hive.tez.java.opts</name>
  <value>-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps</value>
</property>
HIVE Managed vs External table
- When we load data into a managed table, Hive moves the data into the Hive warehouse
directory.
- If we drop a managed table, both the table metadata and its data are deleted.
External table in HIVE:
- can be created using a CREATE TABLE statement with the keyword EXTERNAL.
- The default location can be overridden with the keyword LOCATION, so the data will reside
at that external location.
- When we drop an external table, Hive leaves the data untouched and deletes only the
metadata.
DBMS
Explain Normalization and De-Normalization.
Ans: Normalization is the process of removing redundant data from the database by
splitting tables in a well-defined manner in order to maintain data integrity. This process
also saves storage space.
De-normalization is the process of adding redundant data to tables in order to speed up
complex queries and thus achieve better performance.
What are the different types of Normalization?
Ans: Different Types of Normalization are:
First Normal Form (1NF): A relation is said to be in 1NF only when all the attributes of the
table contain atomic (indivisible) values.
Second Normal Form (2NF): A relation is said to be in 2NF only if it is in 1NF and every
non-key attribute of the table is fully functionally dependent on the primary key (no partial
dependency).
Third Normal Form (3NF): A relation is said to be in 3NF only if it is in 2NF and no
non-key attribute of the table is transitively dependent on the primary key.
Boyce-Codd Normal Form (BCNF): A stricter version of 3NF in which every determinant is a
candidate key; it removes the anomalies 3NF can leave when there are multiple overlapping
candidate keys.
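The split-and-rejoin idea can be sketched with Python's sqlite3 (the employee/department tables and column names here are hypothetical). A table in which dept_name depends on dept_id rather than on the key emp_id has a transitive dependency (so it is not in 3NF), and decomposing it removes the repeated values, while a join reconstructs the original rows (de-normalization):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Not in 3NF: dept_name depends on dept_id, not on the key emp_id
# (a transitive dependency), so it is repeated for every employee.
cur.execute("CREATE TABLE emp_flat (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER, dept_name TEXT)")
cur.executemany("INSERT INTO emp_flat VALUES (?, ?, ?, ?)",
                [(1, "Ann", 10, "Sales"), (2, "Bob", 10, "Sales")])

# Normalized: each dept_name is now stored exactly once.
cur.execute("CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
cur.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER REFERENCES dept)")
cur.execute("INSERT INTO dept SELECT DISTINCT dept_id, dept_name FROM emp_flat")
cur.execute("INSERT INTO emp SELECT emp_id, emp_name, dept_id FROM emp_flat")

# De-normalization: a join rebuilds the original redundant view.
rows = cur.execute("""SELECT e.emp_id, e.emp_name, d.dept_name
                      FROM emp e JOIN dept d USING (dept_id)
                      ORDER BY e.emp_id""").fetchall()
print(rows)  # [(1, 'Ann', 'Sales'), (2, 'Bob', 'Sales')]
```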
What are the different types of dimensions?
1) Conformed dimensions: A dimension that is used in multiple areas is called a
conformed dimension. It may be shared by different fact tables in a single database or across
numerous data marts/warehouses. For example, if the subscriber dimension is connected to two fact
tables – billing and claim – then the subscriber dimension is treated as a conformed
dimension.
2) Junk Dimension: It is a dimension table comprising attributes that don't have a place in
the fact table or in any of the existing dimension tables. Generally, these are properties like
flags or indicators. For example, a member eligibility flag set to 'Y' or 'N', any other
indicator set to true/false, specific comments, etc. If we kept all such indicator attributes in
the fact table, its size would grow. So we combine all such attributes into a single
dimension table called a junk dimension, which holds unique junk IDs for the possible
combinations of indicator values.
3) Role-Playing Dimension: These are dimensions that are used for multiple purposes
in the same database. For example, a date dimension can be used for "Date of Claim", "Billing
date" or "Plan Term date". Such a dimension is called a role-playing dimension. The
primary key of the date dimension is associated with multiple foreign keys in the fact table.
4) Slowly Changing Dimension (SCD): These are the most important of all the dimensions.
These are the dimensions whose attribute values vary with time. Below are the various types of
SCDs:
Type-0: These are the dimensions where attribute value remains steady with time. For
example, Subscriber’s DOB is a type-0 SCD because it will always remain the same
irrespective of the time.
Type-1: These are the dimensions where previous value of the attribute is replaced by the
current value. No history is maintained in Type-1 dimension. For example, Subscriber’s
address (where the business requires to keep the only current address of subscriber) can
be a Type-1 dimension.
Type-2: These are the dimensions where unlimited history is preserved. For example,
Subscriber’s address (where the business requires to keep a record of all the previous
addresses of the subscriber). In this case, multiple rows for a subscriber will be inserted
in the table with his/her different addresses.
There will be some column(s) that will identify the current address. For example, ‘start
date’ and ‘End date’. The row where ‘End date’ value will be blank would contain
subscriber’s current address and all other rows will be having previous addresses of the
subscriber.
Type-3: These are the type of dimensions where limited history is preserved. And we use
an additional column to maintain the history. For example, Subscriber’s address (where
the business requires to keep a record of current & just one previous address). In this
case, we can dissolve the ‘address’ column into two different columns – ‘current address’
and ‘previous address’.
So, instead of having multiple rows, we will be having just one row showing current as
well as the previous address of the subscriber.
Type-4: In this type of dimension, the historical data is preserved in a separate table. The
main dimension table holds only the current data.
For example, the main dimension table will have only one row per subscriber holding its
current address. All other previous addresses of the subscriber will be kept in the separate
history table. This type of dimension is hardly ever used.
5) Degenerated Dimension: A degenerated dimension is a dimension which is not a fact but
is present in the fact table as a key. It does not have its own dimension table. We can also
call it a single-attribute dimension.
Instead of keeping it separately in a dimension table and paying for an additional join, we put
this attribute in the fact table directly as a key. Since it does not have its own dimension table, it
can never act as a foreign key in the fact table.
What are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than
on a regular, time-based schedule. In a data warehouse there is a need to track changes in
dimension attributes in order to report historical data. Examples of such dimensions are:
customer, geography, employee.
There are many approaches to dealing with SCDs. The most popular are:
Type 0 - The passive method
Type 1 - Overwriting the old value
Type 2 - Creating a new additional record
Type 3 - Adding a new column
Type 4 - Using historical table
Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)
Type 0 - The passive method. In this method no special action is performed upon dimensional
changes. Some dimension data can remain the same as when it was first inserted, other data may
be overwritten.
Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in
the database. The old dimension value is simply overwritten by the new one. This type is easy
to maintain and is often used for data whose changes are caused by processing corrections (e.g.
removing special characters, correcting spelling errors).
Before the change:
Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate
After the change:
Customer_ID Customer_Name Customer_Type
1 Cust_1 Retail
Type 2 - Creating a new additional record. In this methodology all history of dimension
changes is kept in the database. You capture an attribute change by adding a new row with a new
surrogate key to the dimension table. Both the prior and new rows contain as attributes the
natural key (or other durable identifier). 'Effective date' and 'current indicator' columns are also
used in this method. There can be only one record with the current indicator set to 'Y'. For the
'effective date' columns, i.e. start_date and end_date, the end_date of the current record is usually
set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very
expensive database operation, so it is not recommended for dimensions where a new
attribute could be added in the future.
Before the change:
Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 31-12-9999 Y
After the change:
Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 17-05-2012 N
2 Cust_1 Retail 18-05-2012 31-12-9999 Y
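The Type-2 update above can be sketched in Python with sqlite3 (a hypothetical dim_customer table; for simplicity the change date is used both to close the old row and to open the new one, whereas the tables above use the previous day as the end date). The current row is expired and a new row with a fresh surrogate key is inserted:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    surrogate_key INTEGER PRIMARY KEY,   -- new key per version of the row
    customer_id   INTEGER,               -- natural / durable identifier
    customer_type TEXT,
    start_date    TEXT,
    end_date      TEXT,
    current_flag  TEXT)""")
con.execute("INSERT INTO dim_customer VALUES (1, 1, 'Corporate', '2010-07-22', '9999-12-31', 'Y')")

def scd2_update(con, customer_id, new_type, change_date):
    """Type-2 change: expire the current row, then insert a new current row."""
    con.execute("""UPDATE dim_customer
                   SET end_date = ?, current_flag = 'N'
                   WHERE customer_id = ? AND current_flag = 'Y'""",
                (change_date, customer_id))
    con.execute("""INSERT INTO dim_customer
                   (customer_id, customer_type, start_date, end_date, current_flag)
                   VALUES (?, ?, ?, '9999-12-31', 'Y')""",
                (customer_id, new_type, change_date))

scd2_update(con, 1, 'Retail', '2012-05-18')
rows = con.execute("SELECT customer_type, current_flag FROM dim_customer "
                   "ORDER BY surrogate_key").fetchall()
print(rows)  # [('Corporate', 'N'), ('Retail', 'Y')]
```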
Type 3 - Adding a new column. In this type usually only the current and previous values of the
dimension are kept in the database. The new value is loaded into the 'current/new' column and the
old one into the 'old/previous' column. Generally speaking, the history is limited to the number of
columns created for storing historical data. This is the least commonly needed technique.
Before the change:
Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Corporate Corporate
After the change:
Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Retail Corporate
Type 4 - Using a historical table. In this method a separate historical table is used to track all
historical changes of the dimension's attributes. The 'main' dimension table
keeps only the current data, e.g. customer and customer_history tables.
Current table:
Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate
Historical table:
Customer_ID Customer_Name Customer_Type Start_Date End_Date
1 Cust_1 Retail 01-01-2010 21-07-2010
1 Cust_1 Other 22-07-2010 17-05-2012
1 Cust_1 Corporate 18-05-2012 31-12-9999
Type 6 - Combine approaches of types 1, 2, 3 (1+2+3=6). In this type the dimension
table has such additional columns as:
current_type - for keeping the current value of the attribute. All history records for a given
item have the same current value.
historical_type - for keeping the historical value of the attribute. History records for a given
item can have different values.
start_date - for keeping the start date of the 'effective date' range of the attribute's history.
end_date - for keeping the end date of the 'effective date' range of the attribute's history.
current_flag - for keeping information about the most recent record.
In this method, to capture an attribute change we add a new record as in type 2. The current_type
information is overwritten with the new value as in type 1. We store the history in the
historical_type column as in type 3.
Customer_ID Customer_Name Current_Type Historical_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate Retail 01-01-2010 21-07-2010 N
2 Cust_1 Corporate Other 22-07-2010 17-05-2012 N
3 Cust_1 Corporate Corporate 18-05-2012 31-12-9999 Y
ACID Properties:
Atomicity
Transactions are often composed of multiple statements. Atomicity guarantees that each
transaction is treated as a single "unit", which either succeeds completely, or fails completely: if
any of the statements constituting a transaction fails to complete, the entire transaction fails
and the database is left unchanged. An atomic system must guarantee atomicity in each and
every situation, including power failures, errors and crashes.
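Atomicity can be demonstrated with Python's sqlite3 (a hypothetical accounts table). A transfer is two UPDATE statements; when the first one fails, rolling back leaves the database exactly as it was, so neither half of the transfer applies:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
            "balance INTEGER NOT NULL CHECK (balance >= 0))")
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
con.commit()

# A transfer is two statements; atomicity means either both apply or neither.
try:
    con.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")  # violates CHECK
    con.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
    con.commit()
except sqlite3.IntegrityError:
    con.rollback()  # undo the partial transaction

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} -- the database is unchanged
```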
Consistency
Consistency ensures that a transaction can only bring the database from one valid state to
another, maintaining database invariants: any data written to the database must be valid
according to all defined rules, including constraints, cascades, triggers, and any combination
thereof. This prevents database corruption by an illegal transaction, but does not guarantee
that a transaction is correct.
Isolation
Transactions are often executed concurrently (e.g., reading and writing to multiple tables at the
same time). Isolation ensures that concurrent execution of transactions leaves the database in
the same state that would have been obtained if the transactions were executed sequentially.
Isolation is the main goal of concurrency control; depending on the method used, the effects of
an incomplete transaction might not even be visible to other transactions.
Durability
Durability guarantees that once a transaction has been committed, it will remain committed
even in the case of a system failure (e.g., power outage or crash). This usually means that
completed transactions (or their effects) are recorded in non-volatile memory.
VIEW
A view is a virtual table that does not hold data of its own; rather, its data is derived from
one or more underlying base tables.
VDL is View Definition language which represents user views and their mapping to the
conceptual schema.
SDL is Storage Definition Language which specifies the mapping between two schemas.
Data Definition Language (DDL) commands are used to define the structure that holds the
data. These commands are auto-committed i.e. changes done by the DDL commands on the
database are saved permanently.
Data Manipulation Language (DML) commands are used to manipulate the data of the
database. These commands are not auto-committed and can be rolled back.
Data Control Language (DCL) commands are used to control the visibility of the data in the
database like revoke access permission for using data in the database.
What is schema?
Ans. A schema is a collection of database objects of a User.
Define Cursor and its types.
Ans: A cursor is a temporary work area that stores the retrieved data as well as the result set
produced after manipulating that data. A cursor can hold only one row at a time.
The 2 types of cursor are:
Implicit cursors are declared automatically when DML statements like INSERT, UPDATE, DELETE
are executed.
Explicit cursors have to be declared for SELECT statements that return more than
one row.
Define Database Lock and its types.
Ans: A database lock basically informs a transaction about the current status of a data item,
i.e. whether that data is being used by other transactions at the present point in time.
There are two types of Database lock which are Shared Lock and Exclusive Lock.
What is DEADLOCK?
Ans: A deadlock is a situation where two or more transactions wait indefinitely for one another
to release the locks they hold, so none of them can proceed.
Define Phantom deadlock.
Ans: Phantom deadlock detection is the condition where the deadlock does not actually exist
but due to a delay in propagating local information, deadlock detection algorithms identify the
deadlocks.
What are triggers?
Trigger in SQL is used to create a response to a specific action performed on the table such as
Insert, Update or Delete. You can invoke triggers explicitly on the table in the database.
Action and Event are two main components of SQL triggers when certain actions are performed
the event occurs in response to that action.
Syntax: CREATE TRIGGER name {BEFORE|AFTER} (event [OR..]}
ON table_name [FOR [EACH] {ROW|STATEMENT}]
EXECUTE PROCEDURE functionname {arguments}
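A runnable sketch using Python's sqlite3 (hypothetical employees/audit_log tables; SQLite uses an inline trigger body rather than EXECUTE PROCEDURE, so this illustrates the event/action idea, not the exact syntax above). The event is an INSERT on employees; the action writes an audit row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE audit_log (action TEXT, emp_id INTEGER);

-- Event: INSERT on employees.  Action: record it in audit_log.
CREATE TRIGGER log_insert AFTER INSERT ON employees
FOR EACH ROW
BEGIN
    INSERT INTO audit_log VALUES ('insert', NEW.id);
END;
""")

# The trigger fires implicitly; we never call it ourselves.
con.execute("INSERT INTO employees (name) VALUES ('Ann')")
log = con.execute("SELECT action, emp_id FROM audit_log").fetchall()
print(log)  # [('insert', 1)]
```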
Differentiate between ‘DELETE’, ‘TRUNCATE’ and ‘DROP’ commands.
Ans: A 'DELETE' operation can be undone: until a COMMIT is issued, a ROLLBACK statement
can be performed to retrieve the deleted data.
A 'TRUNCATE' operation cannot be rolled back, so the lost data cannot be
retrieved.
The 'DROP' command is used to remove the whole table, or an object such as a primary
key/foreign key constraint.
What are super, primary, candidate and foreign keys?
Ans: A superkey is a set of attributes of a relation schema upon which all attributes of the
schema are functionally dependent. No two rows can have the same value of super key
attributes.
A Candidate key is minimal superkey, i.e., no proper subset of Candidate key attributes can be a
superkey.
A Primary Key is one of the candidate keys: one candidate key is selected as the most
important and becomes the primary key. There cannot be more than one primary key in a
table.
A Foreign key is a field (or collection of fields) in one table that refers to the primary key of
another table.
What is the difference between having and where clause?
Ans: HAVING is used to specify a condition for a group or an aggregate function used in a SELECT
statement. The WHERE clause filters rows before grouping; the HAVING clause filters groups after
grouping. Unlike the HAVING clause, the WHERE clause cannot contain aggregate functions.
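The difference can be seen in a small sketch with Python's sqlite3 (a hypothetical sales table): WHERE drops individual rows before the GROUP BY, while HAVING drops whole groups based on the aggregate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100), ("east", 300), ("west", 50), ("west", 40), ("north", 500)])

rows = con.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 45          -- row-level filter: drops the 40 before grouping
    GROUP BY region
    HAVING SUM(amount) > 150   -- group-level filter on the aggregate: drops west (50)
    ORDER BY region
""").fetchall()
print(rows)  # [('east', 400), ('north', 500)]
```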
Explain the working of SQL Privileges?
System Privilege: System privileges deal with objects of a particular type and specify
the right to perform one or more actions on them. Examples include ADMIN (which allows a
user to perform administrative tasks), ALTER ANY INDEX, ALTER ANY CACHE GROUP,
CREATE/ALTER/DELETE TABLE, CREATE/ALTER/DELETE VIEW, etc.
Object Privilege: This allows performing actions on an object of another user,
viz. table, view, index, etc. Some of the object privileges are EXECUTE, INSERT,
UPDATE, DELETE, SELECT, FLUSH, LOAD, INDEX, REFERENCES, etc.
SQL GRANT and REVOKE commands are used to implement privileges in SQL multiple user
environments. The administrator of the database can grant or revoke privileges to or from
users of database object like SELECT, INSERT, UPDATE, DELETE, ALL etc.
GRANT Command: This command is used to provide database access to users other than the
administrator.
Syntax: GRANT privilege_name
ON object_name
TO {user_name|PUBLIC|role_name}
[WITH GRANT OPTION];
In the above syntax, WITH GRANT OPTION indicates that the user can grant the same access to
another user too.
REVOKE Command: This command is used to deny or remove a user's access to
database objects.
Syntax: REVOKE privilege_name
ON object_name
FROM {user_name|PUBLIC|role_name};
Why do we use SQL constraints?
Constraints are used to set rules for all records in the table. If any constraint is violated,
the action that caused the violation is aborted.
Constraints can be defined while creating the table itself with the CREATE TABLE statement, or
after the table is created with the ALTER TABLE statement.
The major constraints used in SQL are:
NOT NULL: That indicates that the column must have some value and cannot be left null
UNIQUE: This constraint is used to ensure that every value in the column is unique, i.e. no
value is repeated in any other row of that column
PRIMARY KEY: This constraint is used in association with NOT NULL and UNIQUE constraints
such as on one or the combination of more than one columns to identify the particular
record with a unique identity.
FOREIGN KEY: It is used to ensure the referential integrity of data in the table and also
matches the value in one table with another using Primary Key
CHECK: It is used to ensure whether the value in columns fulfills the specified condition
DEFAULT - Sets a default value for a column when no value is specified
INDEX - Used to create and retrieve data from the database very quickly
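The constraints above can be exercised with Python's sqlite3 (hypothetical emp/dept tables; note SQLite enforces foreign keys only after PRAGMA foreign_keys = ON). Each violating INSERT is aborted with an IntegrityError while the valid data is untouched:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY);
CREATE TABLE emp (
    emp_id  INTEGER PRIMARY KEY,
    email   TEXT NOT NULL UNIQUE,
    age     INTEGER CHECK (age >= 18),
    status  TEXT DEFAULT 'active',
    dept_id INTEGER REFERENCES dept(dept_id)
);
INSERT INTO dept VALUES (10);
INSERT INTO emp (emp_id, email, age, dept_id) VALUES (1, 'a@x.com', 30, 10);
""")

violations = 0
for bad in [
    "INSERT INTO emp (emp_id, email) VALUES (2, NULL)",                   # NOT NULL
    "INSERT INTO emp (emp_id, email) VALUES (3, 'a@x.com')",              # UNIQUE
    "INSERT INTO emp (emp_id, email, age) VALUES (4, 'b@x.com', 12)",     # CHECK
    "INSERT INTO emp (emp_id, email, dept_id) VALUES (5, 'c@x.com', 99)", # FOREIGN KEY
]:
    try:
        con.execute(bad)
    except sqlite3.IntegrityError:
        violations += 1  # the offending statement is aborted

# DEFAULT filled in 'active' because status was not supplied.
status = con.execute("SELECT status FROM emp WHERE emp_id = 1").fetchone()[0]
print(violations, status)  # 4 active
```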
What are clustered and non-clustered Indexes?
Ans: A clustered index is the index according to which data is physically stored on disk.
Therefore, only one clustered index can be created on a given database table.
Non-clustered indexes don't define the physical ordering of data, but the logical ordering. Typically,
a tree is created whose leaves point to disk records. B-Tree or B+ Tree structures are used for this
purpose.
What are transaction and its controls?
A transaction can be defined as a sequence of tasks performed on a database in a logical
manner to achieve a certain result. Operations such as creating, updating and deleting records
in the database come from transactions.
In simple words, a transaction is a group of SQL queries executed on database records as a
single unit of work.
There are 4 transaction controls:
COMMIT: It is used to save all changes made through the transaction
ROLLBACK: It is used to roll back the transaction such as all changes made by the
transaction are reverted back and database remains as before
SET TRANSACTION: Set the name of transaction
SAVEPOINT: It is used to set the point from where the transaction is to be rolled back
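These controls can be seen in a short sketch with Python's sqlite3 (a hypothetical single-column table; isolation_level=None lets us issue BEGIN/COMMIT ourselves). ROLLBACK TO a savepoint undoes only the work done after that savepoint, and COMMIT then saves what remains:

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions explicitly
con.execute("CREATE TABLE t (x INTEGER)")

con.execute("BEGIN")
con.execute("INSERT INTO t VALUES (1)")
con.execute("SAVEPOINT sp1")            # a point we can roll back to
con.execute("INSERT INTO t VALUES (2)")
con.execute("ROLLBACK TO sp1")          # undoes only the insert of 2
con.execute("COMMIT")                   # saves the insert of 1

rows = con.execute("SELECT x FROM t").fetchall()
print(rows)  # [(1,)]
```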
What is Star Schema and Snowflake Schema?
In a star schema, a central fact table joins directly to denormalized dimension tables; in a
snowflake schema, the dimension tables are further normalized into multiple related tables.
The fact table contains business facts (or measures), and foreign keys which refer to candidate
keys (normally primary keys) in the dimension tables. It is located at the center of a star schema
or a snowflake schema, surrounded by dimension tables.
Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are
typically textual fields (or discrete numbers that behave like text). These attributes are designed
to serve two critical purposes: query constraining and/or filtering, and query result set labeling.
Dimension attributes should be:
Verbose (labels consisting of full words)
Descriptive
Complete (having no missing values)
Discretely valued (having only one value per dimension table row)
Quality assured (having no misspellings or impossible values)
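A minimal star schema can be sketched with Python's sqlite3 (hypothetical dim_date, dim_customer and fact_sales tables): the fact table holds a measure plus foreign keys to the dimensions, and a typical query constrains on a dimension attribute and aggregates the facts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

-- The fact table sits at the center: measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    amount       REAL
);
INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024);
INSERT INTO dim_customer VALUES (1, 'Cust_1');
INSERT INTO fact_sales VALUES (20240101, 1, 99.5);
""")

# Constrain on a dimension attribute, aggregate the fact measure.
total = con.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.year = 2024
""").fetchone()[0]
print(total)  # 99.5
```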
What is candidate key? What is surrogate key? What is Composite key? Super key?
Ans: A super key is any set of attributes that uniquely identifies a row; a candidate key is a
minimal super key; a composite key is a key made up of more than one column; and a surrogate
key is an artificial, system-generated identifier (with no business meaning) used in place of a
natural key.
The PRIMARY KEY constraint uniquely identifies each record in a database table. Primary keys
must contain UNIQUE values, and cannot contain NULL values. A table can have only one
primary key, which may consist of single or multiple fields.
A FOREIGN KEY is a key used to link two tables together. A FOREIGN KEY is a field (or collection
of fields) in one table that refers to the PRIMARY KEY in another table.
The table containing the foreign key is called the child table, and the table containing the
candidate key is called the referenced or parent table.
SQL Injection
SQL injection is the placement of malicious code in SQL statements, via web page input.
SQL injection is a code injection technique and one of the most common web hacking
techniques. It might destroy your database.
Example
When the user enters a user ID / password on a web page, the backend may build the query like
this:
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;
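The danger of that concatenation, and the standard defense, can be shown in Python with sqlite3 (a hypothetical Users table): a payload like "105 OR 1=1" turns the concatenated WHERE clause into a condition that matches every row, whereas a parameterized query treats the whole input as a single value:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (UserId TEXT, Name TEXT)")
con.execute("INSERT INTO Users VALUES ('105', 'Ann'), ('106', 'Bob')")

user_input = "105 OR 1=1"  # a classic injection payload

# Vulnerable: the input is concatenated into the statement, so the condition
# becomes "UserId = 105 OR 1=1" and every row is returned.
unsafe = con.execute("SELECT * FROM Users WHERE UserId = " + user_input).fetchall()

# Safe: a parameterized query binds the whole input as one value.
safe = con.execute("SELECT * FROM Users WHERE UserId = ?", (user_input,)).fetchall()

print(len(unsafe), len(safe))  # 2 0
```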
What is a Stored Procedure?
A stored procedure is a prepared SQL code with multiple statements that you can save, so the
code can be reused over and over again, just call it to execute it.
You can also pass parameters to a stored procedure, so that the stored procedure can act
based on the parameter value(s) that is passed.
Stored Procedure Syntax
CREATE PROCEDURE procedure_name AS
sql_statement
GO
Execute a Stored Procedure
EXEC procedure_name;
Data Models
There are three types of data models – conceptual, logical and physical. The level of complexity
and detail increases from conceptual to logical to a physical data model.
The conceptual model shows a very basic, high-level view of the design, while the physical data
model shows a very detailed view.
The conceptual model portrays only entity names and entity relationships.
The logical model shows entity names, entity relationships, attributes, primary keys
and foreign keys in each entity.
The physical data model shows primary keys, foreign keys, table names, column
names and column data types. This view elaborates how the model will actually be
implemented in the database.
KAFKA
Kafka is a distributed publish-subscribe messaging system
- designed to be fast, scalable, and durable.
Kafka maintains messages in topics.
Producers write data to topics and consumers read from topics.
Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.
A Kafka producer writes messages using the publish/send method.
A Kafka consumer reads messages using the subscribe method and the poll interface.
---------------------------------------------------------------------------------------------------------------------
Row comparison operators used in subqueries - IN, ANY and ALL
There are several differences between HashMap and Hashtable in Java:
1. Hashtable is synchronized, whereas HashMap is not. This makes HashMap better for non-
threaded applications, as unsynchronized Objects typically perform better than
synchronized ones.
2. Hashtable does not allow null keys or values. HashMap allows one null key and any
number of null values.
3. One of HashMap's subclasses is LinkedHashMap, so in the event that you'd want
predictable iteration order (which is insertion order by default), you could easily swap out
the HashMap for a LinkedHashMap. This wouldn't be as easy if you were using
Hashtable.
Since synchronization is not an issue for you, I'd recommend HashMap. If synchronization
becomes an issue, you may also look at ConcurrentHashMap. ConcurrentHashMap is useful in
the case of synchronization because it uses entry-level (row-level) locking, whereas Hashtable
uses object-level locking.
Synchronization
Synchronized methods enable a simple strategy for preventing thread interference and memory
consistency errors: if an object is visible to more than one thread, all reads and writes to that
object's variables are done through synchronized methods.
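The same strategy can be sketched in Python with threading (a hypothetical Counter class; Python uses an explicit lock where Java would mark the methods synchronized). Every access to the shared variable goes through a lock-guarded method, so four threads incrementing concurrently never lose an update:

```python
import threading

class Counter:
    """Analog of a Java class with synchronized methods: all reads and
    writes to the shared variable go through lock-guarded methods."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:      # like entering a synchronized method
            self._value += 1

    def value(self):
        with self._lock:
            return self._value

counter = Counter()
threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(10_000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value())  # 40000 -- no increments lost
```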