Datascience Notes

The document provides an introduction to data science, defining key concepts such as data, variables, and big data, along with their characteristics. It outlines the data life cycle, detailing the stages from data generation to management, emphasizing the importance of data processing, storage, and governance. Overall, it highlights the interdisciplinary nature of data science and its relevance in analyzing large datasets.

Uploaded by

keneitarus
Copyright
© All Rights Reserved

Data Science

LECTURE ONE
Introduction to Data Science
Definitions

What is data?
• Data refers to raw elements or unprocessed facts
• Information, on the other hand, is data that has been processed, organized, and
interpreted to add meaning and value
• Data can be either quantitative, i.e. numerical and measurable (for example, weight
and height), or qualitative, i.e. non-numerical and describing qualities and attributes
(for example, customer satisfaction, level of education, and gender)
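
The difference between data and information can be sketched in a few lines of Python. The weights below are made-up values and the `summarize` helper is a hypothetical name used only for illustration; the point is that the raw list is data, while the processed summary is information.

```python
from statistics import mean

# Raw data: unprocessed facts (hypothetical weights in kilograms).
raw_weights_kg = [62.5, 70.1, 58.3, 81.0, 66.4]


def summarize(values):
    """Process raw numbers into a small, interpretable summary."""
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": round(mean(values), 2),
    }


# Information: the same data, processed and organized so that it carries meaning.
information = summarize(raw_weights_kg)
print(information)
```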

What is a variable?
• A variable is a characteristic or attribute that can assume different values
• Variables store the values of data points
• They can be either quantitative variables, which are amounts or counts (for example,
age, number of children, and income), or categorical variables, which represent
groupings (for example, gender, type of car, and brand of shoes)

• A data set is a collection of related data that is usually organized in a standardized
format
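
As a rough illustration, a data set can be represented as a list of records that share the same variables. The records and the helper names `variable_kind` and `column` are assumptions made for this sketch. The check is deliberately simple: it treats a variable as quantitative when every value is a number, so categories coded as numbers would be mislabelled.

```python
from collections import Counter
from statistics import mean

# A small data set: every record holds values for the same variables.
# All values are made up for illustration.
dataset = [
    {"age": 34, "children": 2, "income": 52000, "car_type": "sedan"},
    {"age": 27, "children": 0, "income": 41000, "car_type": "hatchback"},
    {"age": 45, "children": 3, "income": 67000, "car_type": "sedan"},
    {"age": 31, "children": 1, "income": 48000, "car_type": "suv"},
]


def column(records, name):
    """Return all values of one variable across the data set."""
    return [record[name] for record in records]


def variable_kind(values):
    """Label a variable as quantitative if every value is a number."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "categorical"


kinds = {name: variable_kind(column(dataset, name)) for name in dataset[0]}

# Quantitative variables support arithmetic such as a mean ...
mean_age = mean(column(dataset, "age"))
# ... while categorical variables are summarized by counting groups.
car_counts = Counter(column(dataset, "car_type"))

print(kinds)
print(mean_age, car_counts)
```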

What is Data Science?
• Data science refers to using methods to analyze massive amounts of data (big data)
and extract the knowledge it contains
• Data science can also be defined as "a field of study that uses scientific methods,
processes, and systems to extract knowledge and insights from data"
• It is an interdisciplinary subject that combines expertise from statistics, computer
science, mathematics, and domain-specific knowledge

Domain of Data Science
What is Big Data?
• Big data is a collection of data sets so large or complex that they become difficult to
process using traditional data management techniques, such as relational database
management systems (RDBMS)

Characteristics of Big Data

• Volume: The size of the data being generated and collected. How much data is
there?
• Variety: How diverse are the different types of data?
• Velocity: At what speed is new data generated? Velocity can also refer to how
quickly the data can be accessed and analysed in order to make decisions

• Veracity: How accurate is the data? Data can be biased, noisy, obsolete,
erroneous, or misleading, and therefore unreliable
• Value: The capability of turning big data into real value, which includes the
ability to collect the data and then leverage it to achieve specific goals

LECTURE TWO
Data Life Cycle

Definition of Data Life Cycle

• A data lifecycle is the sequence of stages that a unit of data goes through from its
initial generation or capture to its archiving or deletion at the end of its useful life.
• It encompasses a series of eight stages through which data pass from their
creation to their end use in decision-making
1. Data generation
This is the first stage in the data life cycle. It involves the creation of data from a
variety of sources such as:
• Customer interactions, such as point-of-sale (POS) transactions
• Business and financial transactions, e.g. e-commerce shopping carts
• Social media activities
• Internet of Things (IoT) devices, such as sensors and vehicle trackers

2. Data collection
Data collection is the second stage in the data lifecycle. It involves gathering relevant
data from a variety of sources, such as:
• Surveys and questionnaires
• Web scraping, i.e. extracting data from websites
• IoT sensors
• Application programming interfaces (APIs)
• Transaction records
• Social media monitoring
• Observations
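
As one small example of collection, transaction records are often exported as CSV text. The sketch below assumes a hypothetical export (`RAW_EXPORT`) held in a string; in practice the text might come from a file, an API response, or a survey tool. The function name `collect_records` is also an assumption.

```python
import csv
import io

# Hypothetical CSV export of transaction records (made-up values).
RAW_EXPORT = """transaction_id,customer,amount
1001,Amina,250.00
1002,Brian,99.50
1003,Chen,1200.00
"""


def collect_records(csv_text):
    """Parse CSV text into a list of dictionaries, one per transaction."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


records = collect_records(RAW_EXPORT)
for record in records:
    print(record)
```

Every collected value is still a string at this point; converting amounts to numbers belongs to the processing stage.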

3. Data processing
Data processing is the third stage in the data lifecycle. It involves the following steps
that prepare data for analysis:
• Data cleaning: Removing duplicate content, correcting errors, and filling in missing
values
• Data transformation: Converting raw or unstructured data into a suitable format or
structure
• Data integration: Combining data from different sources to provide a complete,
accurate, and up-to-date dataset for BI, data analysis and other applications and
business processes
• Data reduction: Simplifying datasets by eliminating redundant or irrelevant data
• Data validation: Ensuring processed data meets organizational standards and
accurately reflects its original sources
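
The five processing steps above can be sketched as a small pipeline in plain Python. Everything here is an assumption made for illustration: the two sources (`customers` and `orders`), the fill value of 30 for missing ages, the function names, and the validation rule. Real projects usually rely on dedicated tools and far more careful rules.

```python
# Two hypothetical sources with made-up values.
customers = [
    {"id": 1, "name": " alice ", "age": "34", "fax": ""},
    {"id": 2, "name": "Bob", "age": None, "fax": ""},
    {"id": 2, "name": "Bob", "age": None, "fax": ""},  # duplicate record
    {"id": 3, "name": "carol", "age": "29", "fax": ""},
]
orders = {1: 120.0, 2: 75.5, 3: 210.0}  # customer id -> order total


def clean(rows, fill_age):
    """Data cleaning: drop duplicate ids and fill in missing ages."""
    seen, result = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        age = row["age"] if row["age"] is not None else fill_age
        result.append({**row, "age": age})
    return result


def transform(rows):
    """Data transformation: tidy names and convert ages to integers."""
    return [
        {**r, "name": r["name"].strip().title(), "age": int(r["age"])} for r in rows
    ]


def integrate(rows, totals):
    """Data integration: attach order totals from the second source."""
    return [{**r, "order_total": totals.get(r["id"], 0.0)} for r in rows]


def reduce_columns(rows, keep):
    """Data reduction: keep only the variables that are relevant."""
    return [{key: r[key] for key in keep} for r in rows]


def validate(rows):
    """Data validation: check simple rules on the processed records."""
    return all(r["age"] >= 0 and r["order_total"] >= 0 for r in rows)


processed = reduce_columns(
    integrate(transform(clean(customers, fill_age=30)), orders),
    keep=["id", "name", "age", "order_total"],
)
is_valid = validate(processed)
print(processed, is_valid)
```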
4. Data storage
Data storage is the fourth stage of the data lifecycle. It ensures that data is accessible,
safeguarded, and backed up for future use. This stage also focuses on data privacy: the
storage solution is configured so that processed data is stored securely in:
• Databases, such as RDBMSs like MySQL, PostgreSQL, and Microsoft SQL Server, and
NoSQL databases
• Data warehouses, which store cleaned and processed data
• Cloud storage solutions, such as Google Drive, Amazon S3 (Simple Storage Service),
which stores any amount of data, and Dropbox
• Data lakes, which store data in its original format
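
A minimal sketch of storing processed data in a relational database follows. It uses SQLite through Python's built-in `sqlite3` module as a stand-in for the RDBMSs named above; that choice, the `customers` table, and the sample rows are all assumptions made for illustration. An in-memory database keeps the example self-contained.

```python
import sqlite3


def store_records(connection, rows):
    """Create the table if needed and insert the processed rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, order_total REAL NOT NULL)"
    )
    # Parameterized queries keep values separate from the SQL text.
    connection.executemany(
        "INSERT INTO customers (id, name, order_total) VALUES (?, ?, ?)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
store_records(connection, [(1, "Alice", 120.0), (2, "Bob", 75.5)])

stored = connection.execute(
    "SELECT id, name, order_total FROM customers ORDER BY id"
).fetchall()
print(stored)
```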

5. Data management
Data management is the fifth stage in the data lifecycle. It encompasses the ongoing
organization and maintenance of data through:
• Data governance: Establishing standards, defining user roles, and ensuring
compliance.
• Data quality management: Monitoring, cleaning, and validating data.
• Data security: Implementing encryption and access controls and conducting
security audits.
• Data access and retrieval: Setting up and using indexing and cataloging techniques.
• Data integration: Creating a unified view of data and ensuring consistency.
• Data archiving and deletion: Archiving or deleting outdated or infrequently used data
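
The archiving and deletion activity can be sketched as a simple retention rule. The thresholds (archive after 180 days, delete after 730 days), the `last_used` field, and the function name `split_by_age` are hypothetical; real retention policies come from an organization's governance standards.

```python
from datetime import date


def split_by_age(records, today, archive_after_days, delete_after_days):
    """Sort records into active, archived, and deleted groups by last use."""
    active, archived, deleted = [], [], []
    for record in records:
        age_days = (today - record["last_used"]).days
        if age_days >= delete_after_days:
            deleted.append(record)
        elif age_days >= archive_after_days:
            archived.append(record)
        else:
            active.append(record)
    return active, archived, deleted


# Made-up records with the date each one was last used.
records = [
    {"id": "A", "last_used": date(2023, 12, 15)},
    {"id": "B", "last_used": date(2023, 3, 1)},
    {"id": "C", "last_used": date(2021, 6, 1)},
]

active, archived, deleted = split_by_age(
    records,
    today=date(2024, 1, 1),
    archive_after_days=180,
    delete_after_days=730,
)
print([r["id"] for r in active], [r["id"] for r in archived], [r["id"] for r in deleted])
```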
