Data Collection
Hamza Yar Khan & Muhammad Hassan Najeeb
Contents
Data Science Pipeline
Pre-Data Collection
Data Collection and Its Types
Primary Data Collection
Secondary Data Collection
Challenges in Data Collection
Web Scraping
Beautiful Soup
Scrapy
Expertise
DATA SCIENCE PIPELINE
[Figure: the data science pipeline]
PRE-DATA COLLECTION
Before an analyst begins collecting data, they must answer three questions:
What is the goal or purpose of this research?
What kinds of data are they planning to gather?
What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break data into qualitative and quantitative types.
Qualitative data covers descriptions such as color, size, quality, and appearance.
Quantitative data, unsurprisingly, deals with numbers: statistics, poll results, percentages, and so on.
For example, a product survey might capture qualitative data (written feedback) alongside quantitative data (a rating from 1 to 5).
DATA COLLECTION
Data collection is the systematic approach to gathering relevant information from a variety of sources, depending on the problem statement.
Data collection is classified into two categories:
Primary
Secondary
PRIMARY VS. SECONDARY
Primary data is collected first-hand for the problem at hand (e.g., surveys, interviews, experiments), while secondary data comes from existing sources such as published datasets, reports, and websites.
CHALLENGES IN DATA COLLECTION
Inconsistent Data
When working with multiple data sources, the same information may appear in incompatible forms.
The differences can be in formats, units, or occasionally spellings (see the normalization sketch at the end of this list).
Data Downtime
Data downtime (periods when data is missing, erroneous, or otherwise unusable) has many causes; schema modifications and migration problems are just two examples.
Downtime must be continuously monitored and reduced through automation.
Data engineers spend about 80% of their time updating, maintaining, and guaranteeing the integrity of the data pipeline.
Ambiguous / Inaccurate Data
Even with thorough oversight, some errors still creep into massive databases and data lakes.
Hidden Data
Spelling mistakes can go unnoticed, formatting problems can occur, and column headers can be misleading.
Duplicate Data
The same records can arrive from multiple sources, inflating counts and skewing analysis.
Too Much Data
Data scientists, data analysts, and business users devote 80% of their work to finding and organizing the appropriate data.
As data volume grows, other data-quality problems become more serious, particularly with streaming data and large files or databases.
Relevant Data
Identifying which of the available data is actually relevant to the problem statement is a challenge in itself.
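Several of these challenges (inconsistent units and spellings, duplicate records) are typically handled in code during cleaning. Below is a minimal pandas sketch; the column names, unit convention, and values are made up for illustration.

import pandas as pd

# Hypothetical listings merged from two sources: one reports prices in plain
# rupees, the other in "lakh" (1 lakh = 100,000), and city spellings differ.
df = pd.DataFrame({
    "city":  ["Karachi", "karachi ", "Lahore", "Lahore"],
    "price": ["1500000", "15 lakh", "2200000", "2200000"],
})

def to_rupees(value: str) -> int:
    """Normalize a price string to an integer number of rupees."""
    value = value.strip().lower()
    if value.endswith("lakh"):
        return int(float(value.removesuffix("lakh").strip()) * 100_000)
    return int(value)

df["city"] = df["city"].str.strip().str.title()   # fix inconsistent spellings
df["price"] = df["price"].map(to_rupees)          # fix inconsistent units
df = df.drop_duplicates()                         # remove duplicate records
print(df)

After normalization, the two Karachi rows become identical, so drop_duplicates() keeps only one of them.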
WEB SCRAPING
Web scraping refers to the extraction of data from a website.
The information is collected and then exported into a format more useful to the user, be it a spreadsheet or an API.
Is web scraping legal?
In short, the act of web scraping itself isn't illegal; however, some rules need to be followed.
Web scraping becomes illegal when non-publicly available data is extracted.
Some libraries for manual web scraping:
Beautiful Soup
Scrapy
Selenium
BEAUTIFUL SOUP
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide
idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of
work.
Documentation -> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Example dataset: PakWheels car listings data.
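As a self-contained sketch of the Beautiful Soup API, the snippet below parses a made-up HTML fragment (not PakWheels' actual markup) and pulls out each listing's title and price.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A made-up HTML fragment standing in for a downloaded listings page.
html = """
<ul class="listings">
  <li><span class="title">Honda Civic 2020</span> <span class="price">PKR 5,500,000</span></li>
  <li><span class="title">Suzuki Alto 2022</span> <span class="price">PKR 2,800,000</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree: find every listing and extract its fields.
for item in soup.select("ul.listings li"):
    title = item.select_one(".title").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    print(title, "-", price)

On a live page you would first download the HTML (for example with the requests library) and pass response.text to BeautifulSoup.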
SCRAPY
Scrapy is an open-source, collaborative web crawling framework for Python.
It is used for extracting data from websites for a wide range of purposes, such as data mining and data processing.
Scrapy works by sending HTTP requests to a website's server and parsing the server's response to extract the data, which it then saves in a structured format such as CSV or JSON.
Scrapy is fast and efficient: it can handle multiple concurrent requests and is easy to use.
It also has built-in support for common web scraping tasks such as logging in and handling cookies.
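A minimal spider sketch follows; the start URL and CSS selectors are placeholders for illustration, not a real site's markup.

import scrapy  # pip install scrapy

class BookSpider(scrapy.Spider):
    """Crawls a hypothetical catalogue page and yields one item per book."""
    name = "books"
    start_urls = ["https://example.com/books"]  # placeholder URL

    def parse(self, response):
        # Scrapy hands every downloaded response to parse(); the selectors
        # below assume a made-up page structure.
        for book in response.css("article.book"):
            yield {
                "name": book.css("h3.title::text").get(),
                "price": book.css("p.price::text").get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as books_spider.py, it can be run without a full project using: scrapy runspider books_spider.py -o books.csv, which writes the yielded items out as CSV.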
BOOKBERRY.PK
Information to be scraped: "Book names and prices"
[The demo code and the resulting CSV output were shown as screenshots.]
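The demo code itself is not in the extracted text. A sketch of how such a scrape might be written with requests and Beautiful Soup follows; the selectors are guesses for illustration, not bookberry.pk's actual markup.

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://bookberry.pk"  # site from the demo; selectors below are hypothetical

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for product in soup.select(".product"):              # hypothetical card class
    name = product.select_one(".product-title")      # hypothetical selectors
    price = product.select_one(".price")
    if name and price:
        rows.append((name.get_text(strip=True), price.get_text(strip=True)))

# Export the scraped fields to CSV, mirroring the demo's output.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)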
Thank You.
[email protected] www.folio3.com
Head Office: 1301 Shoreway Road, Suite 160, Belmont, CA 94002, USA
Other offices: Pleasanton, California; Chicago, Illinois; Toronto, Canada; Guadalajara, Mexico; Surrey, United Kingdom; Dubai, UAE; Sofia, Bulgaria; Lahore & Karachi, Pakistan