Data Collection
Hamza Yar Khan & Muhammad Hassan Najeeb
Contents
Data Science Pipeline
Pre-Data Collection
Data Collection and Its Types
Primary Data Collection
Secondary Data Collection
Challenges in Data Collection
Web Scraping
Beautiful Soup
Scrapy
Expertise
DATA SCIENCE PIPELINE
[Figure: the data science pipeline]
PRE-DATA COLLECTION
Before an analyst begins collecting data, they must answer three questions:
What is the goal or purpose of this research?
What kinds of data are they planning to gather?
What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break data into qualitative and quantitative types.
Qualitative data covers descriptions such as color, size, quality, and appearance.
Quantitative data, unsurprisingly, deals with numbers: statistics, poll results, percentages, and so on.
For example, a product survey might capture qualitative data (written feedback) alongside quantitative data (a rating from 1 to 5).
DATA COLLECTION
Data collection is the systematic approach to gathering relevant information from a variety of sources, depending on the problem statement.
Data collection is classified into two categories:
Primary
Secondary
PRIMARY VS. SECONDARY
Primary data is collected first-hand for the problem at hand (e.g., surveys, interviews, experiments), while secondary data comes from existing sources such as published datasets, reports, and websites.
CHALLENGES IN DATA COLLECTION
Inconsistent Data
When working with multiple data sources, the same information may appear in incompatible forms.
The differences can be in formats, units, or occasionally spellings (see the normalization sketch at the end of this list).
Data Downtime
Data downtime (periods when data is missing, erroneous, or otherwise unusable) has many causes; schema modifications and migration problems are just two examples.
Downtime must be continuously monitored and reduced through automation.
Data engineers spend about 80% of their time updating, maintaining, and guaranteeing the integrity of the data pipeline.
Ambiguous / Inaccurate Data
Even with thorough oversight, some errors still creep into massive databases and data lakes.
Hidden Data
Spelling mistakes can go unnoticed, formatting problems can occur, and column headers can be misleading.
Duplicate Data
The same records can arrive from multiple sources, inflating counts and skewing analysis.
Too Much Data
Data scientists, data analysts, and business users devote 80% of their work to finding and organizing the appropriate data.
As data volume grows, other data-quality problems become more serious, particularly with streaming data and large files or databases.
Relevant Data
Identifying which of the available data is actually relevant to the problem statement is a challenge in itself.
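Several of these challenges (inconsistent units and spellings, duplicate records) are typically handled in code during cleaning. Below is a minimal pandas sketch; the column names, unit convention, and values are made up for illustration.

import pandas as pd

# Hypothetical listings merged from two sources: one reports prices in plain
# rupees, the other in "lakh" (1 lakh = 100,000), and city spellings differ.
df = pd.DataFrame({
    "city":  ["Karachi", "karachi ", "Lahore", "Lahore"],
    "price": ["1500000", "15 lakh", "2200000", "2200000"],
})

def to_rupees(value: str) -> int:
    """Normalize a price string to an integer number of rupees."""
    value = value.strip().lower()
    if value.endswith("lakh"):
        return int(float(value.removesuffix("lakh").strip()) * 100_000)
    return int(value)

df["city"] = df["city"].str.strip().str.title()   # fix inconsistent spellings
df["price"] = df["price"].map(to_rupees)          # fix inconsistent units
df = df.drop_duplicates()                         # remove duplicate records
print(df)

After normalization, the two Karachi rows become identical, so drop_duplicates() keeps only one of them.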
WEB SCRAPING
Web scraping refers to the extraction of data from a website.
The information is collected and then exported into a format more useful to the user, be it a spreadsheet or an API.
Is web scraping legal?
In short, the act of web scraping itself isn't illegal; however, some rules need to be followed.
Web scraping becomes illegal when non-publicly available data is extracted.
Some libraries for manual web scraping:
Beautiful Soup
Scrapy
Selenium
BEAUTIFUL SOUP
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide
idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of
work.
Documentation -> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Example dataset: PakWheels car listings data.
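As a self-contained sketch of the Beautiful Soup API, the snippet below parses a made-up HTML fragment (not PakWheels' actual markup) and pulls out each listing's title and price.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A made-up HTML fragment standing in for a downloaded listings page.
html = """
<ul class="listings">
  <li><span class="title">Honda Civic 2020</span> <span class="price">PKR 5,500,000</span></li>
  <li><span class="title">Suzuki Alto 2022</span> <span class="price">PKR 2,800,000</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree: find every listing and extract its fields.
for item in soup.select("ul.listings li"):
    title = item.select_one(".title").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    print(title, "-", price)

On a live page you would first download the HTML (for example with the requests library) and pass response.text to BeautifulSoup.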
SCRAPY
Scrapy is an open-source, collaborative web crawling framework for Python.
It is used for extracting data from websites for a wide range of purposes, such as data mining and data processing.
Scrapy works by sending HTTP requests to a website's server and parsing the server's response to extract the data, which it then saves in a structured format such as CSV or JSON.
Scrapy is fast and efficient: it can handle multiple concurrent requests and is easy to use.
It also has built-in support for common web scraping tasks such as logging in and handling cookies.
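A minimal spider sketch follows; the start URL and CSS selectors are placeholders for illustration, not a real site's markup.

import scrapy  # pip install scrapy

class BookSpider(scrapy.Spider):
    """Crawls a hypothetical catalogue page and yields one item per book."""
    name = "books"
    start_urls = ["https://example.com/books"]  # placeholder URL

    def parse(self, response):
        # Scrapy hands every downloaded response to parse(); the selectors
        # below assume a made-up page structure.
        for book in response.css("article.book"):
            yield {
                "name": book.css("h3.title::text").get(),
                "price": book.css("p.price::text").get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as books_spider.py, it can be run without a full project using: scrapy runspider books_spider.py -o books.csv, which writes the yielded items out as CSV.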
BOOKBERRY.PK
Information to be scraped: "Book names and prices"
[The demo code and the resulting CSV output were shown as screenshots.]
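The demo code itself is not in the extracted text. A sketch of how such a scrape might be written with requests and Beautiful Soup follows; the selectors are guesses for illustration, not bookberry.pk's actual markup.

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://bookberry.pk"  # site from the demo; selectors below are hypothetical

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for product in soup.select(".product"):              # hypothetical card class
    name = product.select_one(".product-title")      # hypothetical selectors
    price = product.select_one(".price")
    if name and price:
        rows.append((name.get_text(strip=True), price.get_text(strip=True)))

# Export the scraped fields to CSV, mirroring the demo's output.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)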
Thank You.
[email protected] www.folio3.com
Head Office: 1301 Shoreway Road, Suite 160, Belmont, CA 94002, USA
Other offices: Pleasanton, California; Chicago, Illinois; Toronto, Canada; Guadalajara, Mexico; Surrey, United Kingdom; Dubai, UAE; Sofia, Bulgaria; Lahore & Karachi, Pakistan