
Unit 2 (Big Data Analytics)

Introduction to NoSQL

A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is to build distributed data stores with very large data storage needs. NoSQL is used for big data and real-time web applications; for example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.

aggregate data models

A data model is the model through which we perceive and manipulate our data. For people using a
database, the data model describes how we interact with the data in the database. This is distinct from a
storage model, which describes how the database stores and manipulates the data internally. In an ideal
world, we should be ignorant of the storage model, but in practice we need at least some inkling of it—
primarily to achieve decent performance.

In conversation, the term “data model” often means the model of the specific data in an application. A developer might point to an entity-relationship diagram of their database and refer to that as their data model containing customers, orders, products, and the like.

aggregates

Aggregation of NoSQL data sets is an important feature in many applications. restdb.io supports queries with both grouping and aggregation of data sets, which is very helpful for developing custom reports, visual charts, data analysis, etc. The table below shows all aggregation and grouping functions:

Function                     | Format                          | Comment                                                                | Example
Min                          | MIN:field                       | Returns object                                                         | h={"$aggregate":["MIN:score"]}
Max                          | MAX:field                       | Returns object                                                         | h={"$aggregate":["MAX:score"]}
Avg                          | AVG:field                       | Returns value                                                          | h={"$aggregate":["AVG:score"]}
Sum                          | SUM:field                       | Returns value                                                          | h={"$aggregate":["SUM:score"]}
Count                        | COUNT:property                  | Returns value with chosen property name                                | h={"$aggregate":["COUNT:nplayers"]}
Groupby                      | $groupby: ["field", ...]        | Returns "groupkey":[array]                                             | h={"$groupby":["category"]}
Groupby (dates)              | $groupby: ["$YEAR:field", ...]  | Predefined values for $YEAR, $MONTH, $DAY, $HOUR, $SEC                 | h={"$groupby":["$YEAR:registered"]}
Groupby (dates with formats) | $groupby: ["$DATE:format", ...] | Format strings for ss, hh, mm, dd, MM, YY; all formats at momentjs.com | h={"$groupby":["$DATE.MMM:registered"]}
Grand totals                 | $aggregate-grand-total: true    | Recursive aggregation functions of groups                              | h={"$groupby":["category"], "$aggregate":["AVG:score"], "$aggregate-grand-total": true}
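
As an illustration of how such a query might be issued, here is a minimal Python sketch using the requests library. It assumes a restdb.io-style REST endpoint of the form https://<dbname>.restdb.io/rest/<collection> that accepts the h query parameter shown above together with an x-apikey header; the database name, collection name, and API key are placeholders, not real values.

import json
import requests

# Placeholders: substitute your own restdb.io database, collection and API key.
BASE_URL = "https://yourdb.restdb.io/rest/games"
API_KEY = "your-api-key"

# Average score per category, with a grand total (mirrors the table above).
query = {
    "$groupby": ["category"],
    "$aggregate": ["AVG:score"],
    "$aggregate-grand-total": True,
}

response = requests.get(
    BASE_URL,
    params={"h": json.dumps(query)},
    headers={"x-apikey": API_KEY, "Content-Type": "application/json"},
)
print(response.json())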

key-value
A key-value data model or database is also referred to as a key-value store. It is a non-relational type of database in which an associative array is used as the basic data structure: each individual key is linked to just one value in a collection. Keys are unique identifiers for the values, and a value can be any kind of entity. A collection of key-value pairs stored as separate records forms a key-value database, and such databases do not have a predefined structure.

How do key-value databases work?

A key-value database associates a value, which can be anything from a simple string to a complex object, with a key that is used to look up that value. As in many programming paradigms, a key-value database resembles a map, array, or dictionary object; however, it is persisted to durable storage and managed by a DBMS.

A key-value store uses an efficient and compact index structure so that it can quickly and reliably locate a value by its key. For example, Redis is a key-value store used to track lists, maps, heaps, and primitive types (simple data structures) in a persistent database. By supporting only a predetermined number of value types, Redis can expose a very simple interface for querying and manipulating those values, and when configured appropriately it can deliver high throughput.
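
For illustration, here is a minimal sketch using the redis-py client against a local Redis server; the key names and values are made up for the example.

import redis

# Connect to a local Redis instance (host/port are placeholders).
r = redis.Redis(host="localhost", port=6379, db=0)

# Simple string value keyed by a unique identifier.
r.set("user:42:name", "Alice")
print(r.get("user:42:name"))                    # b'Alice'

# Redis also supports richer value types, e.g. lists and hashes.
r.rpush("user:42:recent_pages", "/home", "/cart")
print(r.lrange("user:42:recent_pages", 0, -1))  # [b'/home', b'/cart']

r.hset("user:42:profile", mapping={"city": "Pune", "plan": "pro"})
print(r.hgetall("user:42:profile"))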

document data models

A document data model stores data in the form of documents rather than relational tables. Document models are more free-form than the rows and columns of the relational model. See XML, JSON, DOM, relational database, and MongoDB.
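
A minimal sketch of storing and querying a document, using the pymongo driver against a local MongoDB; the connection string, database, collection, and field names are placeholders.

from pymongo import MongoClient

# Placeholder connection string, database and collection.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# A document keeps nested, free-form fields together in a single record.
db.products.insert_one({
    "name": "laptop",
    "price": 55000,
    "specs": {"ram_gb": 16, "storage": "512 GB SSD"},
    "tags": ["electronics", "portable"],
})

# Query by any field, including values inside arrays.
print(db.products.find_one({"tags": "electronics"}))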

relationships

Relationships are associations between different collections in a database. You can create relationships and define their object properties for NoSQL databases using either of the following methods (a sketch of both appears after the list):
 Embedding
Embeds the related data into a single structured collection (or a few of them)
 Referencing
Relates the data in multiple collections through identifying or non-identifying relationships
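
A minimal Python sketch of the two approaches, using made-up order and customer records represented as plain dictionaries; the field names are illustrative.

# Embedding: the related data travels inside one structured document.
order_embedded = {
    "_id": "order-1001",
    "customer": {"name": "Ram", "city": "Hyderabad"},
    "items": [
        {"product": "pen", "qty": 10},
        {"product": "notebook", "qty": 2},
    ],
}

# Referencing: the order stores only identifiers; related data lives in
# other collections and is resolved by looking up those ids.
customers = {"cust-7": {"name": "Ram", "city": "Hyderabad"}}
order_referenced = {
    "_id": "order-1001",
    "customer_id": "cust-7",
    "item_ids": ["item-1", "item-2"],
}

print(customers[order_referenced["customer_id"]]["name"])  # resolves the reference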

graph databases

A graph database is defined as a specialized, single-purpose platform for creating and manipulating graphs. Graphs contain nodes, edges, and properties, all of which are used to represent and store data in a way that relational databases are not equipped to do.
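
A minimal in-memory sketch of the node/edge/property idea in Python; real graph databases add indexing, a query language, and persistent storage, but the shape of the data is the same. The node and edge names are made up.

# Nodes and edges, each carrying properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob":   {"label": "Person", "age": 35},
    "bda":   {"label": "Course", "title": "Big Data Analytics"},
}
edges = [
    ("alice", "KNOWS", "bob", {"since": 2019}),
    ("alice", "ENROLLED_IN", "bda", {}),
]

# A traversal: which people does alice know?
for src, rel, dst, props in edges:
    if src == "alice" and rel == "KNOWS":
        print(dst, props)   # bob {'since': 2019}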

schema less databases

A schemaless database makes almost no changes to your data; each item is saved in its own document with a partial schema, leaving the raw information untouched. This means that every detail is always available and nothing is stripped out to match the current schema, which is particularly valuable if your analytics needs change at some point in the future.
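
A small illustration in Python: two records in the same collection carry different fields, and nothing is dropped to fit a fixed schema. The field names are made up.

# Heterogeneous records in one "collection"; later analysis can still use
# fields that earlier records never had.
events = [
    {"user": "ram",   "action": "login",    "ts": "2024-01-01T10:00:00"},
    {"user": "robin", "action": "purchase", "amount": 499, "coupon": "NEWYEAR"},
]

total_purchased = sum(e.get("amount", 0) for e in events)
print(total_purchased)   # 499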

materialized views

A materialized view is a pre-computed data set derived from a query specification (the SELECT
in the view definition) and stored for later use. Because the data is pre-computed, querying a
materialized view is faster than executing a query against the base table of the view. This
performance difference can be significant when a query is run frequently or is sufficiently
complex. As a result, materialized views can speed up expensive aggregation, projection, and
selection operations, especially those that run frequently and that run on large data sets.
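
A rough Python analogy of the idea (not any particular database's implementation): the aggregate is computed once, stored, and later queries read the stored result instead of re-scanning the base data.

from collections import defaultdict

# Base "table": one row per sale.
sales = [
    {"region": "south", "amount": 100},
    {"region": "south", "amount": 250},
    {"region": "north", "amount": 400},
]

# "Materialized view": the expensive aggregation runs once and its result is stored.
totals_by_region = defaultdict(int)
for row in sales:
    totals_by_region[row["region"]] += row["amount"]

# Frequent queries now hit the precomputed result, not the base rows.
print(totals_by_region["south"])   # 350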

distribution models

Aggregate oriented databases make distribution of data easier, since the distribution
mechanism has to move the aggregate and not have to worry about related data, as all the
related data is contained in the aggregate. There are two styles of distributing data:

 Sharding: Sharding distributes different data across multiple servers, so each server acts
as the single source for a subset of data.
 Replication: Replication copies data across multiple servers, so each bit of data can be
found in multiple places. Replication comes in two forms,
 Master-slave replication makes one node the authoritative copy that handles
writes while slaves synchronize with the master and may handle reads.
 Peer-to-peer replication allows writes to any node; the nodes coordinate to
synchronize their copies of the data.

Master-slave replication reduces the chance of update conflicts, while peer-to-peer replication avoids loading all writes onto a single server and thus avoids a single point of failure. A system may use either or both techniques; for example, the Riak database shards the data and also replicates it based on a replication factor.
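
A minimal Python sketch of hash-based sharding combined with a replication factor; the server names, hash choice, and placement rule are illustrative and not any particular database's algorithm.

import hashlib

SERVERS = ["node-a", "node-b", "node-c"]   # hypothetical cluster
REPLICATION_FACTOR = 2

def home_shard(key: str) -> int:
    # Hash the aggregate's key to pick the server that owns it.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % len(SERVERS)

def replicas(key: str) -> list:
    # The owner plus the next (REPLICATION_FACTOR - 1) servers hold copies.
    start = home_shard(key)
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(REPLICATION_FACTOR)]

print(replicas("order-1001"))   # e.g. ['node-b', 'node-c']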

Consistency Models

In the past, almost all architectures used in database systems were strongly consistent. In those cases, most architectures had a single database instance responding to only a few hundred clients. Nowadays, many systems are accessed by hundreds of thousands of clients, so system architectures are required to scale. However, considering the CAP theorem, high availability and consistency conflict in distributed systems when a network partition occurs. The majority of projects that experience such high traffic have chosen high availability over a strongly consistent architecture by relaxing the consistency level.

Version Stamps

Many critics of NoSQL databases focus on the lack of support for transactions. Transactions are
a useful tool that helps programmers support consistency. One reason why many NoSQL
proponents worry less about a lack of transactions is that aggregate-oriented NoSQL databases
do support atomic updates within an aggregate—and aggregates are designed so that their
data forms a natural unit of update. That said, it’s true that transactional needs are something
to take into account when you decide what database to use.

As part of this, it’s important to remember that transactions have limitations. Even within a
transactional system we still have to deal with updates that require human intervention and
usually cannot be run within transactions because they would involve holding a transaction
open for too long. We can cope with these using version stamps—which turn out to be handy
in other situations as well, particularly as we move away from the single-server distribution
model.
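
A minimal in-memory Python sketch of the compare-and-set idea behind version stamps; the record layout and function are made up for illustration.

# Each record carries a version stamp. An update succeeds only if the stamp
# the client read is still current; otherwise someone else changed the record.
store = {"account-1": {"balance": 100, "version": 7}}

def update_balance(key, new_balance, expected_version):
    record = store[key]
    if record["version"] != expected_version:
        return False                 # conflict: caller should re-read and retry
    record["balance"] = new_balance
    record["version"] += 1
    return True

print(update_balance("account-1", 150, expected_version=7))  # True
print(update_balance("account-1", 175, expected_version=7))  # False, stale stamp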

Cassandra Create Table
In Cassandra, the CREATE TABLE command is used to create a table. Here, a column family is used to store data, just like a table in an RDBMS.

So, you can say that the CREATE TABLE command is used to create a column family in Cassandra.

Syntax:

CREATE (TABLE | COLUMNFAMILY) <tablename>
('<column-definition>' , '<column-definition>')
(WITH <option> AND <option>)

Or

For declaring a primary key:

CREATE TABLE tablename(
   column1_name datatype PRIMARY KEY,
   column2_name datatype,
   column3_name datatype
)

You can also define a primary key by using the following syntax:

Create table TableName
(
   ColumnName DataType,
   ColumnName DataType,
   ColumnName DataType
   .
   .
   .
   Primary key(ColumnName)
) with PropertyName=PropertyValue;

There are two types of primary keys:

o Single primary key: Use the following syntax for a single primary key.

Primary key (ColumnName)

o Compound primary key: Use the following syntax for a compound primary key.

Primary key (ColumnName1, ColumnName2, . . .)

Example:

Let's take an example to demonstrate the CREATE TABLE command.

Here, we are using the already created keyspace "javatpoint".

CREATE TABLE student(
   student_id int PRIMARY KEY,
   student_name text,
   student_city text,
   student_fees varint,
   student_phone varint
);

Cassandra - Read Data


Reading Data using the SELECT Clause
The SELECT clause is used to read data from a table in Cassandra. Using this clause, you can read a whole table, a single column, or a particular cell. Given below is the syntax of the SELECT clause.
SELECT <column list | *> FROM <tablename>;

Example

Assume there is a table in the keyspace named emp with the following details −

emp_id | emp_name | emp_city  | emp_phone  | emp_sal
1      | ram      | Hyderabad | 9848022338 | 50000
2      | robin    | null      | 9848022339 | 50000
3      | rahman   | Chennai   | 9848022330 | 50000
4      | rajeev   | Pune      | 9848022331 | 30000

The following example shows how to read a whole table using the SELECT clause. Here we are reading a table called emp.
cqlsh:tutorialspoint> select * from emp;

emp_id | emp_city | emp_name | emp_phone | emp_sal


--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | null | robin | 9848022339 | 50000
3 | Chennai | rahman | 9848022330 | 50000
4 | Pune | rajeev | 9848022331 | 30000

(4 rows)

Reading Required Columns


The following example shows how to read particular columns of a table.
cqlsh:tutorialspoint> SELECT emp_name, emp_sal from emp;

emp_name | emp_sal
----------+---------
ram | 50000
robin | 50000
rajeev | 30000
rahman | 50000

(4 rows)

Where Clause
Using the WHERE clause, you can put a constraint on the required columns. Its syntax is as follows:

SELECT * FROM <tablename> WHERE <condition>;
Note − The WHERE clause can be used only on columns that are part of the primary key or that have a secondary index on them.
In the following example, we read the details of employees whose salary is 50000.
First, create a secondary index on the column emp_sal.

cqlsh:tutorialspoint> CREATE INDEX ON emp(emp_sal);
cqlsh:tutorialspoint> SELECT * FROM emp WHERE emp_sal=50000;

emp_id | emp_city | emp_name | emp_phone | emp_sal


--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | null | robin | 9848022339 | 50000
3 | Chennai | rahman | 9848022330 | 50000

Inserting data using a CSV file in Cassandra


If you want to store data in bulk, inserting it from a CSV file is one of the most convenient approaches. If your data is already in a file, you can load it directly into the database by using the COPY command in Cassandra. This is very useful when you have a very large data set in a CSV file and want to load it quickly.
Syntax –
You can see the COPY command syntax for your reference as follows.
COPY table_name [( column_list )]
FROM 'file_name path'[, 'file2_name path', ...] | STDIN
[WITH option = 'value' [AND ...]]
Now, let’s create the sample data for implementing the approach.
Step-1 :
Creating keyspace – data
Here, you can use the following cqlsh command to create the keyspace as follows.
CREATE KEYSPACE data
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'datacenter1' : 1
};
Step-2 :
Creating the Student_personal_data table –
Here, you can use the following cqlsh command to create the Student_personal_data table as
follows.
CREATE TABLE data.Student_personal_data (
   S_id UUID PRIMARY KEY,
   S_firstname text,
   S_lastname text
);
Step-3 :
Creating the CSV file –
Consider the following table as a CSV file named personal_data.csv. In practice, you would put this data into a CSV file and save it on your computer drive.
S_id (UUID)                          | S_firstname | S_lastname
e1ae4cf0-d358-4d55-b511-85902fda9cc1 | Ashish      | christopher
e2ae4cf0-d358-4d55-b511-85902fda9cc2 | Joshua      | D
e3ae4cf0-d358-4d55-b511-85902fda9cc3 | Ken         | N
e4ae4cf0-d358-4d55-b511-85902fda9cc4 | Christine   | christopher
e5ae4cf0-d358-4d55-b511-85902fda9cc5 | Allie       | K
e6ae4cf0-d358-4d55-b511-85902fda9cc6 | Lina        | M

Step-4 :
Inserting data from the CSV file –
Here you insert the data from the existing CSV file into the database, using the following cqlsh command.
COPY data.Student_personal_data (S_id, S_firstname, S_lastname)
FROM 'personal_data.csv'
WITH HEADER = TRUE;
Step-5 :
Verifying the result –
Once you execute the above command, you will get the following result.
Using 7 child processes

Starting copy of data.Student_personal_data with columns [S_id, S_firstname, S_lastname].


Processed: 6 rows; Rate: 10 rows/s; Avg. rate: 14 rows/s
6 rows imported from 1 files in 0.422 seconds (0 skipped).
You can use the following command to see the output as follows.
select * from data.Student_personal_data;
Output :
S_id                                 | S_firstname | S_lastname
e5ae4cf0-d358-4d55-b511-85902fda9cc5 | Allie       | K
e6ae4cf0-d358-4d55-b511-85902fda9cc6 | Lina        | M
e2ae4cf0-d358-4d55-b511-85902fda9cc2 | Joshua      | D
e1ae4cf0-d358-4d55-b511-85902fda9cc1 | Ashish      | christopher
e3ae4cf0-d358-4d55-b511-85902fda9cc3 | Ken         | N
e4ae4cf0-d358-4d55-b511-85902fda9cc4 | Christine   | christopher

