Module 1
Introduction to Big Data
By Dr Bijoy Kumar Mandal
What is Big Data?
• Big data refers to extremely large and complex datasets that are
difficult to process using traditional data processing methods.
• It encompasses structured, semi-structured, and unstructured data,
and its analysis can reveal valuable insights for organizations.
• The key characteristics of big data are often described by the "3 Vs":
• Volume (large amounts of data)
• Velocity (high speed of data creation and processing)
• Variety (different data types).
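The difference between structured, semi-structured, and unstructured data can be illustrated with a minimal Python sketch. All sample values below are made up for illustration; the point is only how much work is needed before each kind of data can answer a question.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (made-up sample).
structured = "order_id,amount\n1001,250.0\n1002,99.5\n"
rows = list(csv.DictReader(io.StringIO(structured)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: self-describing keys, but fields may differ per record.
semi_structured = '[{"user": "a", "tags": ["x", "y"]}, {"user": "b"}]'
records = json.loads(semi_structured)
tag_counts = [len(rec.get("tags", [])) for rec in records]

# Unstructured: free text with no schema; even a simple question
# requires some form of text processing.
unstructured = "Great phone, but the battery drains fast. Battery life is poor."
battery_mentions = unstructured.lower().count("battery")

print(total, tag_counts, battery_mentions)
```

The structured data can be summed directly, the semi-structured data needs a default for missing fields, and the unstructured text has to be searched before it yields anything countable.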
History of big data
• The concept of big data, though widely discussed recently, has roots tracing back to the
1960s and 70s with the emergence of data centers and relational databases.
• Here's a more detailed look at the evolution:
• Early Days (1960s-1980s):
∙ Data Centers and Relational Databases:
• The first data centers and the development of relational database management
systems (RDBMS) laid the groundwork for managing large datasets.
∙ Early Data Processing:
• In the 1960s, IBM introduced mainframe systems, such as the System/360 (announced in 1964),
that could process large amounts of data.
∙ Spreadsheets and Number Crunching:
• Early businesses used large spreadsheets to analyze data and identify trends.
• The Rise of Big Data (1990s-2000s):
∙ The Term Emerges:
• The term "big data" began circulating, with John Mashey playing a key role in popularizing it.
∙ Web 2.0 and User-Generated Content:
• The rise of the internet and social media brought user-generated content (such as blogs,
posts, and videos), which led to an explosion in data volume and variety.
∙ New Data Sources:
• Mobile devices, search engines, and other online services contributed to the exponential
growth of data.
∙ Hadoop and NoSQL:
• Open-source frameworks like Apache Hadoop and NoSQL databases emerged to handle the
challenges of storing and processing large, unstructured datasets.
• The Big Data Era (2010s-Present):
∙ The 3 Vs:
• Volume, velocity, and variety, first described by analyst Doug Laney in 2001 and later adopted by Gartner, became the standard way to characterize big data.
∙ Internet of Things (IoT):
• The proliferation of connected devices (IoT) generated even more data about user behavior and
product performance.
∙ Machine Learning and Artificial Intelligence:
• These technologies require vast amounts of data for training and analysis, further fueling the
big data landscape.
∙ Cloud Computing:
• The cloud has significantly impacted the adoption and accessibility of big data technologies,
making it easier for organizations of all sizes to leverage big data.
Elements of big data
• Big data is often characterized by five key elements, referred to as the "5 Vs": Volume, Velocity,
Variety, Veracity, and Value. These extend the original "3 Vs" with Veracity and Value.
• These characteristics define the unique challenges and opportunities presented by large,
complex datasets.
Characteristics of big data: Explanation
∙ Volume:
∙ This refers to the massive amount of data that is generated and stored. Big data often involves
petabytes or even exabytes of information.
∙ Velocity:
∙ This describes the speed at which data is generated, processed, and accessed. Big data often requires
real-time or near real-time processing capabilities.
∙ Variety:
∙ Big data includes diverse types of data, ranging from structured data in databases to unstructured
data like social media posts, images, and videos.
∙ Veracity:
∙ This refers to the quality and reliability of the data. Big data can be messy and inconsistent,
requiring techniques to ensure accuracy and trustworthiness.
∙ Value:
∙ This highlights the potential for deriving valuable insights and business intelligence from big data.
The ultimate goal is to extract meaningful information that can drive better decision-making.
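To give a feel for the Volume characteristic, here is a rough back-of-the-envelope calculation in Python. The 100 MB/s read rate per disk and the 1,000-machine cluster are illustrative assumptions, not measurements, and the units are decimal (SI).

```python
# Decimal (SI) storage units, in bytes.
TB = 10**12
PB = 10**15
EB = 10**18

def scan_hours(total_bytes, bytes_per_second, workers=1):
    """Hours needed to read total_bytes once, split evenly across workers."""
    return total_bytes / (bytes_per_second * workers) / 3600

DISK_RATE = 100 * 10**6  # assumed 100 MB/s per disk (illustrative only)

single = scan_hours(PB, DISK_RATE)          # one machine reading alone
parallel = scan_hours(PB, DISK_RATE, 1000)  # 1,000 machines reading in parallel

print(f"1 PB on 1 disk: about {single / 24:.0f} days")
print(f"1 PB on 1000 disks: about {parallel:.1f} hours")
```

Under these assumptions a single disk needs roughly 116 days just to read one petabyte once, while 1,000 disks working in parallel need under three hours. This is why big data systems spread both storage and processing across many machines.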
Why Big Data
• Big data is important because it allows organizations to make more
informed decisions, improve operational efficiency, enhance
customer experiences, and drive innovation.
• By analyzing vast amounts of data from various sources, companies
can uncover hidden patterns, predict future trends, and optimize
their strategies in real-time.
• This leads to better resource management, increased agility, and
ultimately, a competitive advantage.
Unstructured data in Big Data
• Unstructured data in big data refers to information that doesn't
conform to a predefined data model or structure, making it difficult
to analyze using traditional methods.
Characteristics of Unstructured Data
• Lack of Format:
• Unstructured data does not fit neatly into tables or databases. It can be textual or non-textual,
making it difficult to categorize and organize.
• Variety: This type of data can include a wide range of formats, such as:
• Text documents (e.g., emails, reports, articles)
• Multimedia files (e.g., images, audio, video)
• Social media content (e.g., posts, comments, tweets)
• Web pages and blogs
• Volume:
• Unstructured data represents a significant portion of the data generated today. It is often larger in
volume compared to structured data.
• Diverse Sources:
• It can originate from various sources, including user-generated content, sensor data, customer
interactions and more.
What are some examples of unstructured data?
• Unstructured data can be created by people or generated by machines.
• Here are some examples of the human-generated variety:
• Email:
• The body of an email message is unstructured and cannot be readily parsed by traditional analytics tools. That said, email
metadata (such as sender, recipient, and date) gives it some structure, which explains why email is sometimes considered semi-structured data.
• Text files:
• This category includes word processing documents, spreadsheets, presentations, email, and log files.
• Social media and websites:
• This category includes data from social networks like Twitter, LinkedIn, and Facebook, and from websites such as Instagram, photo-sharing
sites, and YouTube.
• Mobile and communications data:
• For this category, look no further than text messages, phone recordings, collaboration software, chat, and
instant messaging.
• Media:
• This data includes digital photos, audio, and video files.
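The email example above (structured metadata around an unstructured body) can be shown with Python's standard `email` module. The message below is a made-up sample; this is only a sketch of why email is often called semi-structured.

```python
from email import message_from_string

# A made-up message: header lines, a blank line, then the free-text body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "Date: Mon, 06 Jan 2025 09:30:00 +0000\n"
    "\n"
    "Hi Bob, the numbers look good but shipping costs worry me.\n"
)

msg = message_from_string(raw)

# Metadata: named fields that can be queried like table columns.
metadata = {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}

# Body: free text with no schema; extracting meaning needs text analysis.
body = msg.get_payload()

print(metadata)
print(body)
```

The headers can be filtered and counted like any structured record, while the body would need text mining or natural language processing before it could be analyzed.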
• Here are some examples of unstructured data generated by machines:
• Scientific data:
• This includes oil and gas surveys, space exploration, seismic imagery, and atmospheric data.
• Digital surveillance:
• This category features data like reconnaissance photos and videos.
• Satellite imagery:
• This data includes weather data, land forms, and military movements.
Data storage and analysis in Big Data
• Big data storage and analysis involves managing and extracting
insights from massive, diverse datasets.
• Storage solutions must handle the volume, variety, and velocity of
big data, often utilizing distributed systems like Hadoop and
cloud-based options.
• Analysis techniques include data mining, machine learning, and
statistical analysis to uncover trends and patterns.
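Distributed systems such as Hadoop are commonly associated with the MapReduce programming model. The sketch below imitates that model in plain Python on a single machine. It does not use Hadoop itself, and the function names and sample text are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk stands in for a block of a large file stored on a different node.
chunks = ["big data is big", "data is everywhere"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))

print(counts)
```

In a real cluster the map step would run on the nodes that hold each block, so the computation moves to the data instead of the data moving to one machine.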
Using big data in businesses
• Big data significantly impacts businesses by enabling them to analyze
vast amounts of information to gain valuable insights and improve
decision-making.
Here's how big data is utilized in businesses:
• 1. Improving Efficiency and Operations.
∙ Optimizing Processes.
∙ Inventory Management.
∙ Predictive Maintenance.
∙ Supply Chain Optimization.
• 2. Enhancing Customer Experience.
∙ Personalized Marketing.
∙ Improved Customer Service.
∙ Product Development.
∙ Targeted Advertising.
• 3. Driving Innovation and Growth.
∙ Predictive Analytics.
∙ New Product Development.
∙ Competitive Advantage.
∙ Risk Management.
∙ Fraud Detection.
• 4. Examples of Big Data in Different Industries.
∙ Retail.
∙ Finance.
∙ Transportation.
∙ Manufacturing.
Some practical examples of companies using big data
• Netflix
• Netflix began as a DVD mailing service and developed algorithms to
predict viewers’ preferences and habits. Now that it delivers films over the
internet, it can easily collect information about when movies are watched,
how often films are stopped and restarted, where they are
abandoned, and how users rate them. This allows Netflix to predict which
films will be popular with which customers. Netflix also uses this data to
guide the production of its own TV series, with much greater assurance that they will be
hits.
• Amazon
• The world’s leading e-retailer collects huge amounts of information about
customers’ preferences and habits, which allows it to market very accurately to
each customer. For example, it routinely makes recommendations to
customers based on products they previously purchased.
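The idea of recommending products based on earlier purchases can be sketched with a simple co-purchase count. This is a toy illustration with made-up baskets, not Amazon's actual recommendation method, and the `recommend` helper is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up purchase baskets, one set of products per order.
orders = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "usb hub"},
]

# Count how often each ordered pair of products appears in the same basket.
co_counts = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, top_n=2):
    """Return the products most often bought together with `item`."""
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [name for name, _ in scores.most_common(top_n)]

print(recommend("laptop"))
```

With these baskets, a customer who bought a laptop would be shown the mouse first and the laptop bag second. Production recommenders use far richer signals, but the principle of learning from past behavior is the same.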
• Airlines
• Airlines know where you’ve flown, your preferred seats and cabin class, when you fly,
how often you search for a flight before booking, how susceptible you are to
price reductions, and probably which airline you might book with instead. They also know
whether you are returning with them but didn’t fly out with them, whether
you purchased car hire last time, what class of hotel you might book through
their site, which routes are growing in popularity, and how demand on each route varies by season. They
also know the profitability of each customer so that, for example, if a flight is
cancelled they can help the most valuable customers first.
• This information allows airlines to design new routes and timings, match
routes to planes and also to make individualised offers to each potential
passenger.
Challenges in Big Data Analytics
• Data quality
• One of the biggest challenges most businesses face is ensuring that the data they collect is reliable.
• Data access
• Companies often have data scattered across multiple systems and departments, and in structured, unstructured,
and semi-structured formats. This makes the data both difficult to consolidate and analyze and vulnerable to unauthorized
use.
• Bad visualizations
• Transforming data into graphs or charts through data visualization efforts helps present complex information in a
tangible, accurate way that makes it easier to understand. But using the wrong visualization method or including
too much data can lead to misleading visualizations and incorrect conclusions.
• Data privacy and security
• Controlling access to data is a never-ending challenge that requires data classification as well as security technology.
• Talent shortage
• Many companies can’t find the talent they need to turn their vast supplies of data into usable information. The
demand for data analysts, data scientists, and other data-related roles has outpaced the supply of qualified
professionals with the necessary skills to handle complex data analytics tasks.
• Too many analytics systems and tools
• It’s not uncommon that, once an organization embarks on a data analytics strategy, it ends up buying separate tools
for each layer of the analytics process.
• Cost
• Data analytics requires investment in technology, staff, and infrastructure. But unless organizations are clear on the
benefits they’re getting from an analytics effort, IT teams may struggle to justify the cost of implementing the
initiative properly.
• Changing technology
• The data analytics landscape is constantly evolving, with new tools, techniques, and technologies emerging all the
time. For example, the race is currently on for companies to get advanced capabilities such as artificial intelligence
(AI) and machine learning (ML) into the hands of business users as well as data scientists.
• Resistance to change
• Applying data analytics often requires what can be an uncomfortable level of change. Suddenly, teams have new
information about what’s happening in the business and different options for how they should react.
• Goal setting
• Without clear goals and objectives, businesses will struggle to determine which data sources to use for a project,
how to analyze data, what they want to do with results, and how they’ll measure success. A lack of clear goals can
lead to unfocused data analytics efforts that don’t deliver meaningful insights or returns.
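Of the challenges listed above, data quality is the easiest to make concrete. The sketch below runs three basic checks (missing values, exact duplicates, and out-of-range values) on a few made-up records. The field names and the 0 to 120 age range are assumptions chosen for illustration.

```python
# Made-up customer records containing typical quality problems.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},             # missing email
    {"id": 2, "email": None, "age": 29},             # exact duplicate
    {"id": 3, "email": "c@example.com", "age": -5},  # impossible age
]

def quality_report(rows):
    """Count missing emails, exact duplicate rows, and out-of-range ages."""
    seen = set()
    report = {"missing_email": 0, "duplicates": 0, "invalid_age": 0}
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
            continue  # do not count a duplicate's other problems twice
        seen.add(key)
        if row["email"] is None:
            report["missing_email"] += 1
        if not 0 <= row["age"] <= 120:
            report["invalid_age"] += 1
    return report

report = quality_report(records)
print(report)
```

Checks like these are usually run before any analysis, so that unreliable records are fixed or excluded before they can distort the results.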
Thank You