Big Data Analytics (BDA) Lab Manual
MGM's COLLEGE OF ENGINEERING AND TECHNOLOGY
Navi Mumbai-410209
Prepared by
Ms Vrushali Thakur
VISION:
To become one of the outstanding Engineering Institutes in India by providing a
conducive and vibrant environment to achieve excellence in the field of Technology.
MISSION:
To empower aspiring professional students to be prudent enough to explore the
world of technology and mould them to be proficient enough to reach the pinnacle of success in the
competitive global economy.
To enhance the competency of the faculty in the latest technology through continuous
development programs.
To foster networking with alumni and industries.
PROGRAM EDUCATIONAL OBJECTIVES (PEOs)
PEO 1: To prepare the Learner with a sound foundation in the mathematical, scientific and engineering fundamentals.
PEO 2: To motivate the Learner in the art of self-learning and to use modern tools for solving real-life problems.
PEO 3: To equip the Learner with a broad education necessary to understand the impact of Computer Science and Engineering in a global and social context.
PEO 4: To encourage, motivate and prepare the Learner for lifelong learning.
PEO 5: To inculcate a professional and ethical attitude, good leadership qualities and commitment to social responsibilities in the Learner's thought process.
PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and modern computer engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.
PO6: The Engineer and Society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to computer engineering practice.
PO7: Environment and Sustainability: Understand the impact of professional computer engineering solutions in societal and environmental contexts and demonstrate knowledge of and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics, responsibilities and norms of computer engineering practice.
PO9: Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams and in multidisciplinary settings.
PO12: Life-long Learning: Recognize the need for, and have the preparation and ability to, engage in independent and lifelong learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES (PSOs)
PSO1: Acquire skills to design, analyze and develop algorithms and implement them using high-level programming languages.
Lab Objectives
1. Solve Big Data problems using the MapReduce technique and apply it to various algorithms.
2. Apply streaming analytics to real-time applications.
Lab Outcomes
1. To interpret business models and scientific computing paradigms, and apply software tools for big data analytics.
2. To implement algorithms that use MapReduce on structured and unstructured data.
3. To gain hands-on experience with NoSQL databases such as Cassandra, HBase, MongoDB, etc.
4. To implement various data stream algorithms.
5. To develop and analyze social network graphs with data visualization techniques.
MGM’s COLLEGE OF ENGINEERING AND TECHNOLOGY
NAVI MUMBAI-410209
List of Experiments
S/W Requirements: Hadoop, VM, Java. Hadoop is written in Java; the recommended Java version is the JDK 1.6 release and the recommended minimum revision is 31 (v1.6.31).
How to install
Step 1 – Install VM Player
OR
Step 1 – Install Oracle VirtualBox and the Cloudera virtual machine
Open terminal
Theory:
What Is Hadoop?
Hadoop is an open-source software platform designed to store and process quantities of data that are too large for any single device or server. Hadoop's strength lies in its ability to scale across thousands of commodity servers that don't share memory or disk space.
NameNode: Runs on a "master node" that tracks and directs the storage of the cluster.
DataNode: Runs on "slave nodes," which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won't go down if one server fails or is taken offline, a property known as "fault tolerance."
Client machine: Neither a NameNode nor a DataNode, client machines have Hadoop installed on them. They are responsible for loading data into the cluster, submitting MapReduce jobs and viewing the results of the jobs once complete.
MapReduce
MapReduce is the system used to efficiently process the large amounts of data Hadoop stores in HDFS. Originally created by Google, its strength lies in the ability to divide a single large data processing job into smaller tasks. MapReduce jobs are typically written in Java, but other languages can be used via the Hadoop Streaming API, a utility that comes with Hadoop.
JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and divided among nodes within the cluster.
TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work and alerts the JobTracker once it is done. TaskTrackers and DataNodes are located on the same nodes to improve performance.
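As an illustration of the Hadoop Streaming API mentioned above, a word-count job can be sketched as a pair of small Python scripts. The file names mapper.py and reducer.py are only placeholders, and this is a minimal sketch rather than the exact code used in the lab.

mapper.py:
#!/usr/bin/env python3
# mapper.py: read lines from standard input and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

reducer.py:
#!/usr/bin/env python3
# reducer.py: Hadoop sorts the mapper output by key, so counts for the same
# word arrive on consecutive lines and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Such a job is typically submitted with the hadoop-streaming jar shipped with Hadoop (the exact path depends on the installation), passing -input, -output, -mapper and -reducer options.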
Hadoop 2.6.5 - Installing on Ubuntu 16.04 (Single-Node Cluster)
Installing Java
Adding a dedicated Hadoop user (hduser) in the hadoop group:
$ groups hduser
hduser : hadoop
Installing SSH
staff_111_05@staff-111-05:~$ su hduser
Password:
hduser@staff-111-05:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:/M18Dv+ku5js8npZvYi45Fr4F84SzoqXBUO5xAfo+/8 hduser@laptop
The key's randomart image is also printed (output omitted here).
The second command, cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys, adds the newly created key to the list of authorized keys so that Hadoop can use ssh without prompting for a password.
Install Hadoop
We want to move the Hadoop installation to the /usr/local/hadoop directory. So, we should create
the directory first:
hduser@staff-111-05:~$sudo -v
Sorry, user hduser may not run sudo on staff-111-05.
This can be resolved by logging in as the root user and then adding hduser to the sudo group:
hduser@staff-111-05:~$ cd hadoop-2.6.5
hduser@staff-111-05:~/hadoop-2.6.5$ su staff_111_05
Password:
hduser@staff-111-05:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
Now that hduser has sudo privileges, we can move the Hadoop installation to the /usr/local/hadoop directory without any problem:
1. ~/.bashrc
hduser@staff-111-05:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME
variable will be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
Open the file and enter the following in between the <configuration></configuration> tag:
hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:
hduser@staff-111-05:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at.</description>
</property>
</configuration>
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the
cluster that is being used.
It specifies the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and
the datanode for this Hadoop installation.
This can be done by creating the two directories (for example with mkdir -p) and giving hduser ownership of them.
Then open the file and enter the following content in between the <configuration></configuration> tags:
hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.</description>
</property>
</configuration>
Note that the hadoop namenode -format command should be executed once, before we start using Hadoop.
If this command is executed again after Hadoop has been used, it will destroy all the data on the Hadoop file system.
Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)
hduser@staff-111-05:~$ cd /usr/local/hadoop/sbin
hduser@staff-111-05:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh start-all.cmd stop-balancer.sh
hadoop-daemon.sh start-all.sh stop-dfs.cmd
hadoop-daemons.sh start-balancer.sh stop-dfs.sh
hdfs-config.cmd start-dfs.cmd stop-secure-dns.sh
hdfs-config.sh start-dfs.sh stop-yarn.cmd
httpfs.sh start-secure-dns.sh stop-yarn.sh
kms.sh start-yarn.cmd yarn-daemon.sh
mr-jobhistory-daemon.sh start-yarn.sh yarn-daemons.sh
slaves.sh stop-all.sh
hduser@staff-111-05:/usr/local/hadoop/sbin$ start-all.sh
Stopping Hadoop
In order to stop all the daemons running on our machine, we can run stop-all.sh or (stop-dfs.sh and stop-yarn.sh):
Conclusion:
Thus we studied the Hadoop ecosystem and successfully installed Hadoop.
All of the files in the input directory (called in-dir in the command line above) are read and the counts of words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS. If your input is not already in HDFS but is in a local file system, you need to copy the data into HDFS first, using a command such as hadoop fs -put <local input directory> in-dir.
Step-4
Add the external Hadoop libraries.
Right click on the WordCount project -> Build Path -> Configure Build Path -> click on Libraries -> click on the 'Add External JARs...' button.
Theory:
What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and
external datastores such as relational databases, enterprise data warehouses.
Sqoop is used to import data from external datastores into Hadoop Distributed File System or
related Hadoop eco-systems like Hive and HBase. Similarly, Sqoop can also be used to extract data
from Hadoop or its eco-systems and export it to external datastores such as relational databases,
enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza,
Oracle, MySQL, Postgres etc.
Why is Sqoop used?
For Hadoop developers, the interesting work starts after data is loaded into HDFS. Developers play around with the data in order to find the magical insights concealed in that Big Data. For this, the data residing in relational database management systems needs to be transferred to HDFS, played around with, and possibly transferred back to the relational database management systems. Sqoop automates most of this process, depending on the database to describe the schema of the data to be imported. Sqoop uses the MapReduce framework to import and export the data, which provides a parallel mechanism as well as fault tolerance. Sqoop makes developers' lives easier by providing a command line interface. Developers just need to provide basic information such as the source, destination and database authentication details in the sqoop command, and Sqoop takes care of the remaining part. The Sqoop architecture is as follows:
1. Sqoop-Import
The Sqoop import command imports a table from an RDBMS to HDFS. Each row from the table is stored as a separate record in HDFS. Records can be stored as text files, or in binary representation as Avro or SequenceFiles.
Generic Syntax:
$ sqoop import (generic args) (import args)
The Hadoop-specific generic arguments must precede any import arguments, and the import arguments can be in any order.
--query: Executes the SQL query provided and imports the results.
5. Incremental Exports
Syntax:
7. Sqoop-List-Databases
Used to list all the databases available on the RDBMS server. Generic Syntax:
Syntax:
8. Sqoop-List-Tables
Used to list all the tables in a specified database. Generic Syntax:
Syntax:
Conclusion:
Thus we installed Sqoop successfully and studied the basic commands.
Theory:
NoSQL stands for "Not only SQL," to emphasize the fact that NoSQL databases are an alternative to SQL and can, in fact, apply SQL-like query concepts.
NoSQL covers any database that is not a traditional relational database management system (RDBMS). The motivation behind NoSQL is mainly simplified design, horizontal scaling, and finer control over the availability of data. NoSQL databases are more specialized for particular types of data, which makes them more efficient and better performing than RDBMS servers in most instances.
Key-Value Databases
The simplest type of NoSQL database is the key-value store. These databases store data in a completely schema-less way, meaning that no defined structure governs what is being stored. A key can point to any type of data, from an object, to a string value, to a programming language function.
The advantage of key-value stores is that they are easy to implement and add data to. That makes them great as simple storage for storing and retrieving data based on a key. The downside is that you cannot find elements based on the stored values.
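As a minimal Python sketch of this idea (the keys and values here are made up), note that looking an entry up by key is direct, while finding an entry by its stored value forces a scan of everything:

store = {}                                   # a toy key-value store: no schema at all
store["user:1001"] = {"name": "Asha", "city": "Navi Mumbai"}   # an object
store["greeting"] = "hello"                                    # a plain string
store["square"] = lambda x: x * x                              # even a function

print(store["user:1001"])                    # fast: direct lookup by key

# The downside described above: to find which key holds "hello",
# every entry has to be examined.
print([k for k, v in store.items() if v == "hello"])           # ['greeting']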
2. Upgrade installed packages
sudo apt-get upgrade
3. Install MongoDB
sudo apt-get install mongodb
Download the sample restaurants data set and rename it primer-dataset.json.
mongo
# End of File
# MongoDB Operations
# Save primer-dataset.json in the home directory
# Start mongodb services
mongo
# Insert a Document
db.restaurants.insert(
{
"address" : {
"street" : "2 Avenue",
"zipcode" : "10075",
"building" : "1480",
"coord" : [ -73.9557413, 40.7720266 ]
},
"borough" : "Manhattan",
"cuisine" : "Italian",
"grades" : [
{
"grade" : "A",
"score" : 11
},
{
"grade" : "B",
"score" : 17
}
],
"name" : "Vella",
"restaurant_id" : "41704620" }
)
# Find or Query Data
# You can use the find() method to issue a query to retrieve data from a collection in MongoDB. All queries in MongoDB have the scope of a single collection.
# Queries can return all documents in a collection or only the documents that match a specified filter or criteria. You can specify the filter or criteria in a document and pass it as a parameter to the find() method.
# The result set contains all documents in the restaurants collection.
db.restaurants.find()
# The following operation specifies an equality condition on the zipcode field in the address
embedded document.
# The result set includes only the matching documents.
db.restaurants.find( { "address.zipcode": "10075" } )
# Update Data
# You can use the update() method to update documents of a collection. The method takes a filter document that selects the documents to update and an update document with the modifications to apply. Examples:
1. db.restaurants.update({'cuisine':'American'},{$set:{'cuisine':'Indian'}})
2. db.restaurants.update(
{ "name" : "Juni" },
{
$set: { "cuisine": "American (New)" }
})
3. db.restaurants.update(
{ "restaurant_id" : "41156888" },
{ $set: { "address.street": "East 31st Street" } }
)
# Remove Data
# You can use the remove() method to remove documents from a collection. The method takes a
conditions document that determines the documents to remove.
db.restaurants.remove( { "borough": "Manhattan" } )
# End of File
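The same operations can also be issued from a Python program. The sketch below assumes the pymongo package is installed and a mongod instance is listening on the default port 27017 (the lab itself uses the mongo shell):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["test"]                          # the mongo shell's default database

# Insert a document
db.restaurants.insert_one({
    "address": {"street": "2 Avenue", "zipcode": "10075",
                "building": "1480", "coord": [-73.9557413, 40.7720266]},
    "borough": "Manhattan", "cuisine": "Italian",
    "name": "Vella", "restaurant_id": "41704620",
})

# Find or query data
for doc in db.restaurants.find({"address.zipcode": "10075"}):
    print(doc["name"])

# Update data
db.restaurants.update_many({"cuisine": "American"}, {"$set": {"cuisine": "Indian"}})

# Remove data
db.restaurants.delete_many({"borough": "Manhattan"})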
Conclusion:
MongoDB is successfully installed and basic commands are executed.
Theory:
If M is a matrix with element m_ij in row i and column j, and N is a matrix with element n_jk in row j and column k, then the product P = MN is the matrix with element p_ik in row i and column k, where p_ik = sum over j of m_ij * n_jk.
• Map: Send each matrix element m_ij to the key-value pair (j, (M, i, m_ij)). Analogously, send each matrix element n_jk to the key-value pair (j, (N, k, n_jk)).
• Reduce: For each key j, examine its list of associated values. For each value that comes from M, say (M, i, m_ij), and each value that comes from N, say (N, k, n_jk), produce the tuple (i, k, v = m_ij * n_jk). The output of the Reduce function is the key j paired with the list of all the tuples of this form that we get from j:
(j, [(i1, k1, v1), (i2, k2, v2), ..., (ip, kp, vp)]).
Matrix multiplication can also be done in a single MapReduce step:
• Map: For each element m_ij of M, produce a key-value pair ((i, k), (M, j, m_ij)) for each column k of N. Analogously, for each element n_jk of N, produce a key-value pair ((i, k), (N, j, n_jk)) for each row i of M.
• Reduce: Each key (i, k) will have an associated list with all the values (M, j, m_ij) and (N, j, n_jk), for all possible values of j. We connect the two values on the list that have the same value of j, for each j: we sort the values that begin with M by j and, in a separate list, sort the values that begin with N by j. The jth values on each list must have their third components, m_ij and n_jk, extracted and multiplied. Then these products are summed and the result is paired with (i, k) in the output of the Reduce function.
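A minimal Python sketch of this single-step algorithm, simulating the Map and Reduce phases with ordinary dictionaries (the matrices and their sizes are made up for illustration):

from collections import defaultdict

# Matrices stored as {(row, col): value}; M is I x J, N is J x K.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
I, J, K = 2, 2, 2

# Map: m_ij -> ((i, k), ('M', j, m_ij)) for every k;  n_jk -> ((i, k), ('N', j, n_jk)) for every i.
groups = defaultdict(list)
for (i, j), m_ij in M.items():
    for k in range(K):
        groups[(i, k)].append(('M', j, m_ij))
for (j, k), n_jk in N.items():
    for i in range(I):
        groups[(i, k)].append(('N', j, n_jk))

# Reduce: for each key (i, k), pair values with the same j and sum the products m_ij * n_jk.
P = {}
for (i, k), values in groups.items():
    m_vals = {j: v for tag, j, v in values if tag == 'M'}
    n_vals = {j: v for tag, j, v in values if tag == 'N'}
    P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

print(P)    # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}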
CONCLUSION:
A matrix multiplication algorithm is implemented using MapReduce.
Theory:
What is Hive?
Apache Hive is a data warehouse system built to work on Hadoop. It is used for querying and managing large datasets residing in distributed storage. Before becoming an open-source project of Apache Hadoop, Hive originated at Facebook. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive is used because its tables are similar to tables in a relational database, and many users can simultaneously query the data using Hive-QL.
What is HQL?
Hive defines a simple SQL-like query language for querying and managing large datasets, called Hive-QL (HQL). It is easy to use if you are familiar with SQL. Hive also allows programmers who are familiar with MapReduce to plug in their custom mappers and reducers to perform more sophisticated analysis.
Uses of Hive:
1. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
Hive Commands :
Data Definition Language (DDL )
DDL statements are used to build and modify the tables and other objects in the database.
Go to the Hive shell by giving the command sudo hive, and enter the command 'create database <database name>' to create a new database in Hive.
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the database.
Example :
Syntax :
The LOAD operation is used to move data into the corresponding Hive table. If the keyword LOCAL is specified, the load command takes a local file system path; if LOCAL is not specified, the HDFS path of the file has to be used.
The INSERT command is used to load data into a Hive table. Inserts can be done to a table or a partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append data to the existing data in a table. (Note: the INSERT INTO syntax works from version 0.8 onwards.)
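These statements can also be run from a Python program. The sketch below assumes the PyHive package is installed and a HiveServer2 service is running on localhost:10000 (the lab itself uses the hive shell directly); the database, table and file names are only examples:

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hduser")
cur = conn.cursor()

# DDL: create a database and a table
cur.execute("CREATE DATABASE IF NOT EXISTS college")
cur.execute("USE college")
cur.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")

# DML: LOAD with the LOCAL keyword takes a local file system path, as explained above
cur.execute("LOAD DATA LOCAL INPATH '/tmp/employees.txt' INTO TABLE employees")

cur.execute("SELECT * FROM employees")
for row in cur.fetchall():
    print(row)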
CONCLUSION:
Hive is demonstrated successfully.
Theory:
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on. Since we only need to distinguish positions
within the window of length N, we shall represent timestamps modulo N, so they can be represented
by log2 N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent
timestamp) modulo N, then we can determine from a timestamp modulo N where in the current
window the bit with that timestamp is.
Example: Consider the bit stream ...1011011000101110110010110, divided into buckets in a way that satisfies the DGIM rules.
At the right (most recent) end we see two buckets of size 1. To its left we see one bucket of size
2. Note that this bucket covers four positions, but only two of them are 1. Proceeding left, we see
two buckets of size 4, and we suggest that a bucket of size 8 exists further left.
Notice that it is OK for some 0's to lie between buckets. Also, observe from the figure that the buckets do not overlap; there are one or two of each size up to the largest size, and sizes only increase moving left.
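A minimal Python sketch of the bucket bookkeeping described above (the window length N and the input bits are made up): each bucket is stored as (timestamp of its right end, size), sizes are powers of two, and whenever three buckets of the same size appear, the two oldest of them are merged.

from collections import deque

N = 20                  # window length (illustrative)
buckets = deque()       # buckets as (right-end timestamp, size), newest first
t = 0                   # timestamp of the most recent bit

def add_bit(bit):
    """Process one incoming bit of the stream."""
    global t
    t += 1
    # Drop the oldest bucket once its right end falls outside the window.
    while buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.appendleft((t, 1))
        # Restore the invariant: never three buckets of the same size.
        size = 1
        while sum(1 for _, s in buckets if s == size) == 3:
            idx = [i for i, (_, s) in enumerate(buckets) if s == size]
            i1, i2 = idx[-2], idx[-1]                   # the two oldest buckets of this size
            buckets[i1] = (buckets[i1][0], size * 2)    # keep the more recent right end
            del buckets[i2]
            size *= 2

def estimate_ones():
    """Estimate the count of 1s in the last N bits: all buckets, but only half of the oldest."""
    if not buckets:
        return 0
    return sum(s for _, s in buckets) - buckets[-1][1] // 2

for b in [1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]:
    add_bit(b)
print(buckets, estimate_ones())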
CONCLUSION:
One data streaming algorithm is explained and implemented.
Aim: Streaming data analysis – use flume for data capture, HIVE/PYSpark for analysis of twitter
data, chat data, weblog analysis etc.
Theory: Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports large amounts of streaming data, such as log files and events from various sources like network traffic, social media and email messages, to HDFS. Flume is highly reliable and distributed.
The main idea behind Flume's design is to capture streaming data from various web servers into HDFS. It has a simple and flexible architecture based on streaming data flows. It is fault-tolerant and provides mechanisms for fault tolerance and failure recovery.
A Flume agent ingests the streaming data from various data sources into HDFS. The web server acts as the data source, and Twitter is one of the most popular sources of streaming data.
The Flume agent has 3 components: source, channel and sink.
1. Source: It accepts the data from the incoming stream and stores the data in the channel.
2. Channel: In general, the reading speed is faster than the writing speed, so we need a buffer to match the read and write speed difference. The channel acts as an intermediate store that temporarily holds the data being transferred and therefore prevents data loss; it is the local, temporary storage between the source of the data and the persistent data in HDFS.
3. Sink: Our last component, the sink, collects the data from the channel and commits or writes the data to HDFS permanently.
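This is not Flume itself, but the source -> channel -> sink pattern described above can be illustrated with a small Python sketch, where a bounded queue plays the role of the channel and absorbs the difference between the read and write speeds:

import queue, threading, time

channel = queue.Queue(maxsize=100)      # the buffer between source and sink

def source():
    # Source: push incoming events into the channel (a fast producer).
    for i in range(10):
        channel.put("event-%d" % i)
    channel.put(None)                   # sentinel: no more events

def sink():
    # Sink: drain the channel and "commit" each event (here we just print it).
    while True:
        event = channel.get()
        if event is None:
            break
        time.sleep(0.1)                 # writing is slower than reading
        print("written:", event)

t_src = threading.Thread(target=source)
t_snk = threading.Thread(target=sink)
t_src.start(); t_snk.start()
t_src.join(); t_snk.join()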
Step 1:
Login to the twitter account
In the sink configuration we are going to configure the HDFS properties: the HDFS path, write format, file type, batch size, etc. Finally, we are going to configure the memory channel.
CONCLUSION:
Thus the streaming data from Twitter is successfully captured and analyzed using Flume.
Aim: To study and implement basic functions and commands in R Programming and Implement
predictive Analytics techniques (regression / time series, etc.) using R.
Theory:
Predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, data scientists could use predictive models to forecast crop yields based on rainfall and temperature, or to determine whether patients with certain traits are more likely to react badly to a new medication. Before we talk about linear regression specifically, let's remind ourselves what a typical data science workflow might look like. A lot of the time, we'll start with a question we want to answer, and do something like the following:
1. Collect some data relevant to the problem (more is almost always better).
2. Clean, augment, and preprocess the data into a convenient form, if needed.
3. Conduct an exploratory analysis of the data to get a better sense of it.
4. Using what you find as a guide, construct a model of some aspect of the data.
5. Use the model to answer the question you started with, and validate your results.
Linear regression is one of the simplest and most common supervised machine learning algorithms that data scientists use for predictive modeling. Here, we'll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.
We'll use R to explore this data set and learn the basics of linear regression. It will also help to have some very basic statistics knowledge, but if you know what a mean and standard deviation are, you'll be able to follow along. If you want to practice building the models and visualizations yourself, we'll be using the following R packages:
datasets: This package contains a wide variety of practice data sets. We'll be using one of them, "trees", to learn about building linear regression models.
ggplot2: We'll use this popular data visualization package to build plots of our models.
GGally: This package extends the functionality of ggplot2. We'll be using it to create a plot matrix as part of our initial exploratory data visualization.
scatterplot3d: We'll use this package for visualizing more complex linear regression models with multiple predictors.
# Add the CRAN repository to /etc/apt/sources.list:
# deb https://cran.csiro.au/ trusty-backports main restricted universe
# deb https://cran.csiro.au/bin/linux/ubuntu trusty/
sudo apt-get update
sudo apt-get install r-base
# Users who need to compile R packages from source [e.g. package maintainers, or anyone installing packages with install.packages()] should also install the r-base-dev package:
sudo apt-get install r-base-dev
Conclusion:
Thus the predictive analysis is demonstrated in R.
Theory:
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is linking the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
Text mining is different from what we're familiar with in web search. In search, the user is typically
looking for something that is already known and has been written by someone else. The problem is
pushing aside all the material that currently isn't relevant to your needs in order to find the relevant
information.
The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can "read" text and will not have such programs for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.
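As a minimal illustration of extracting structure from free text, the Python sketch below counts term frequencies in a couple of made-up sentences; the stop-word list is also just an example:

import re
from collections import Counter

docs = [
    "Text mining extracts new information from written resources.",
    "Search finds what is already written; text mining finds new patterns.",
]
stopwords = {"the", "is", "from", "what", "a", "new", "already"}

terms = Counter()
for doc in docs:
    for token in re.findall(r"[a-z]+", doc.lower()):
        if token not in stopwords:
            terms[token] += 1

print(terms.most_common(5))   # [('text', 2), ('mining', 2), ('written', 2), ('finds', 2), ('extracts', 1)]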
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal
of discovering useful information, suggesting conclusions, and supporting decision-making. Data
analysis has multiple facets and approaches, encompassing diverse techniques under a variety of
names, in different business, science, and social science domains.
Data mining is a particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes. Business intelligence coversdata
analysis that relies heavily on aggregation, focusing on business information. In statistical
applications, some people divide data analysis into descriptive statistics, exploratory data analysis
(EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the
data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on
application of statistical models for predictive forecasting or classification, while text analytics
applies statistical, linguistic, and structural techniques to extract and classify information from
textual sources, a species of unstructured data. All are varieties of data analysis.
Social data analysis comprises two main constituent parts: 1) data generated from social networking sites (or through social applications), and 2) sophisticated analysis of that data, in many cases requiring real-time (or near real-time) data analytics, measurements which understand and appropriately weigh factors such as influence, reach, and relevancy, an understanding of the context of the data being analyzed, and the inclusion of time-horizon considerations. In short, social data analytics involves the analysis of social media in order to understand and surface insights that are embedded within the data.
CONCLUSION: