Data Wrangling Study Guide

The document is a study guide on data wrangling, covering its definition, importance, and essential tasks such as data collection and cleaning. It discusses tools for data parsing, database concepts, data quality, visualization techniques, and web scraping methods. Key comparisons between data formats (CSV, JSON, XML) and database types (MySQL, PostgreSQL, NoSQL) are also included.


UNIT I: Fundamentals of Data Wrangling

What is Data Wrangling?

- Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for analysis.

Importance:

- Ensures data quality, consistency, and usability.

- Crucial for analytics, machine learning (ML), and business intelligence (BI).

Tasks:

- Data collection, cleaning, transformation, integration, validation, exporting.

Tools:

- Python (pandas, numpy), R, Power BI, Alteryx, Trifacta.

CSV vs JSON vs XML:

- CSV: Simple, flat, no schema. JSON: Nested structures, human-readable. XML: Schema-rich but verbose.
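The comparison above can be made concrete by serializing the same record in all three formats with only the Python standard library (the record itself is illustrative):

```python
# One record, three formats, stdlib only.
import csv, io, json
import xml.etree.ElementTree as ET

record = {"name": "Ada", "city": "London"}

# CSV: flat rows, no schema; structure lives entirely in the header line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "city"])
writer.writeheader()
writer.writerow(record)
csv_text = buf.getvalue()

# JSON: supports nesting and is easy to read.
json_text = json.dumps({"person": record})

# XML: verbose, but can be validated against a schema (DTD/XSD).
root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(csv_text.strip())
print(json_text)
print(xml_text)
```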

UNIT II: Data Parsing and Database Concepts

Parsing PDFs:

- Tools: PyMuPDF, pdfplumber, PDFMiner, Tesseract (OCR).

- Steps: Load -> Extract -> Parse structure.
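The Load -> Extract -> Parse steps can be sketched with pdfplumber; this assumes the package is installed, and the import is deferred so the sketch itself loads without it:

```python
# Minimal Load -> Extract -> Parse sketch; assumes pdfplumber is installed.
def extract_pdf_text(path):
    """Load a PDF and return the text of each page as a list of strings."""
    import pdfplumber  # deferred so this module loads even without the package

    with pdfplumber.open(path) as pdf:        # Load
        return [page.extract_text() or ""     # Extract text per page
                for page in pdf.pages]        # Parse page-by-page structure

# For scanned PDFs with no text layer, the fallback is OCR with Tesseract
# (commonly via the pytesseract wrapper).
```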


MySQL vs PostgreSQL vs NoSQL:

- MySQL: Simpler apps, less extensible.

- PostgreSQL: Complex queries, strong ACID, extensibility.

- NoSQL: Big data, flexible schema, real-time use.

NoSQL:

- Types: Document, Key-Value, Column, Graph.

- Uses: Real-time apps, unstructured data.
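A toy in-memory illustration (plain dicts, not a real database) of why the document model suits flexible schemas: documents in one collection need not share fields.

```python
# Document store modeled with Python dicts; each value is one "document".
users = {
    "u1": {"name": "Ada", "email": "ada@example.com"},
    "u2": {"name": "Grace", "roles": ["admin"], "last_login": "2024-01-01"},
}

# Key-value access: fetch a whole document by its key.
doc = users["u2"]

# Fields may be absent; reads supply defaults instead of failing.
roles = doc.get("roles", [])
```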

UNIT III: Data Quality & Cleanup

Duplicates, Fuzzy, Bad Data:

- Tools: pandas, fuzzywuzzy (now maintained as thefuzz), recordlinkage.
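A sketch of both kinds of duplicate detection; the stdlib's difflib stands in here for fuzzywuzzy's similarity ratio:

```python
# Exact duplicates via pandas; fuzzy near-duplicates via a similarity ratio.
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({"name": ["Jon Smith", "John Smith", "Jon Smith"]})

# Exact duplicates: the third row repeats the first and is dropped.
deduped = df.drop_duplicates()

# Fuzzy match: ratio in [0, 1]; near-duplicates like these score high.
score = SequenceMatcher(None, "Jon Smith", "John Smith").ratio()
```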

Regex vs Normalization:

- Regex: Pattern matching to find, validate, or extract text.

- Normalization: Standardizing values into one consistent format (case, spacing, units).
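The distinction in one sketch: regex checks whether a value matches a pattern, while normalization rewrites it into a canonical form (the phone-number format is illustrative):

```python
import re

raw = "(555) 123-4567"

# Regex: does the string *look like* a US-style phone number?
is_phone = bool(re.fullmatch(r"\(\d{3}\)\s*\d{3}-\d{4}", raw))

# Normalization: strip everything but digits so all variants compare equal.
normalized = re.sub(r"\D", "", raw)
```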

Data Cleanup:

- Automated scripts in Python/Bash/SQL to clean and format data.
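A minimal automated cleanup script in pandas, as a sketch; the column names and sample data are illustrative:

```python
# Trim whitespace, coerce types, fill gaps, drop duplicates.
import pandas as pd

def clean(df):
    df = df.copy()
    df["name"] = df["name"].str.strip().str.title()              # standardize text
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # bad values -> NaN
    df["amount"] = df["amount"].fillna(0)                        # fill gaps
    return df.drop_duplicates()                                  # remove exact repeats

messy = pd.DataFrame({"name": ["  ada ", "ada", "grace"],
                      "amount": ["10", "10", "oops"]})
tidy = clean(messy)
```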

UNIT IV: Relationships & Visualization

Multiple Datasets & Correlation:

- Merge datasets on shared keys; use pandas .corr() for numeric correlation.
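Sketch of the two steps together, with made-up data chosen so the correlation is exactly 1:

```python
# Join two datasets on a shared key, then check numeric correlation.
import pandas as pd

sales = pd.DataFrame({"store": [1, 2, 3], "revenue": [100, 200, 300]})
ads = pd.DataFrame({"store": [1, 2, 3], "ad_spend": [10, 20, 30]})

merged = sales.merge(ads, on="store")              # inner join on the key column
corr = merged["revenue"].corr(merged["ad_spend"])  # Pearson correlation by default
```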

Time-related Charts:

- Line, Gantt charts with matplotlib, seaborn, Power BI.
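A line-chart sketch with matplotlib; the Agg backend keeps it runnable headless, and the data and output path are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
values = [10, 12, 9, 15]

fig, ax = plt.subplots()
ax.plot(months, values, marker="o")   # one line tracing values over time
ax.set_xlabel("Month")
ax.set_ylabel("Value")
fig.savefig("trend.png")              # illustrative output path
```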

Data Maps & Interactives:

- Use folium, Plotly for interactive geographic visualizations.

UNIT V: Web Scraping

Web Scraping:

- Extracting data from websites using Python tools.

Reading Web Pages (lxml):

- Use requests + lxml to parse HTML and extract data via XPath.
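A self-contained XPath sketch with lxml; an inline HTML string stands in for a page that would normally come from `requests.get(url).text`:

```python
from lxml import html

page = """
<html><body>
  <h1>Example</h1>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""

tree = html.fromstring(page)
title = tree.xpath("//h1/text()")[0]   # text content of the heading
links = tree.xpath("//a/@href")        # all link targets
```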

PySpider:

- Distributed scraping system with web UI, scheduling, task retrying.
