Data Wrangling Study Guide
UNIT I: Fundamentals of Data Wrangling
What is Data Wrangling?
- Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for analysis.
Importance:
- Ensures data quality, consistency, and usability.
- Crucial for analytics, machine learning (ML), and business intelligence (BI).
Tasks:
- Data collection, cleaning, transformation, integration, validation, and exporting (a minimal pass is sketched below).
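A minimal pandas sketch of one wrangling pass; the file name sales.csv and its columns (date, region, amount) are hypothetical:

```python
import pandas as pd

# Collect: load raw data (file and columns are hypothetical)
df = pd.read_csv("sales.csv")

# Clean: drop exact duplicates and rows missing the key measure
df = df.drop_duplicates().dropna(subset=["amount"])

# Transform: parse dates and standardize text casing
df["date"] = pd.to_datetime(df["date"])
df["region"] = df["region"].str.strip().str.title()

# Validate: fail loudly if bad values slipped through
assert (df["amount"] >= 0).all(), "negative amounts found"

# Export: write the tidy result for downstream analysis
df.to_csv("sales_clean.csv", index=False)
```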
Tools:
- Python (pandas, numpy), R, Power BI, Alteryx, Trifacta.
CSV vs JSON vs XML:
- CSV: flat and tabular, no schema; lightweight and ubiquitous.
- JSON: supports nested structures; human-readable; the default for web APIs.
- XML: supports schemas (XSD/DTD) and namespaces, but verbose. A reading sketch for all three follows.
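All three formats can be read into a pandas DataFrame for comparison; the file names are hypothetical, and read_xml requires pandas 1.3+ with lxml installed:

```python
import pandas as pd

csv_df = pd.read_csv("records.csv")     # flat rows and columns
json_df = pd.read_json("records.json")  # handles (some) nesting
xml_df = pd.read_xml("records.xml")     # schema-rich but verbose source
```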
UNIT II: Data Parsing and Database Concepts
Parsing PDFs:
- Tools: PyMuPDF, pdfplumber, PDFMiner; Tesseract (OCR) for scanned, image-only PDFs.
- Steps: load the file, extract text and tables, then parse the structure (see the pdfplumber sketch below).
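A minimal pdfplumber sketch of those steps, assuming a hypothetical report.pdf that has a text layer (a scanned, image-only PDF would need Tesseract OCR instead):

```python
import pdfplumber

# Load -> Extract -> Parse: pull text and tables from each page
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()      # None if the page is image-only
        tables = page.extract_tables()  # each table as nested lists
        print(text)
        for table in tables:
            print(table)
```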
MySQL vs PostgreSQL vs NoSQL:
- MySQL: well suited to simpler applications; fewer advanced SQL features and less extensible.
- PostgreSQL: complex queries, strong ACID guarantees, rich extensibility (custom types, extensions).
- NoSQL: flexible schemas and horizontal scaling for big data and real-time workloads.
NoSQL:
- Types: Document, Key-Value, Column-family (wide-column), Graph.
- Uses: real-time applications and unstructured or rapidly evolving data (a document-store sketch follows).
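To illustrate the Document type, a minimal pymongo sketch; it assumes a MongoDB server on localhost, and the database and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents in one collection need not share a schema
events.insert_one({"user": "alice", "action": "login"})
events.insert_one({"user": "bob", "action": "purchase", "amount": 19.99})

# Query by field; documents lacking the field simply don't match
for doc in events.find({"action": "purchase"}):
    print(doc["user"], doc.get("amount"))
```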
UNIT III: Data Quality & Cleanup
Duplicates, Fuzzy, Bad Data:
- Tools: pandas (drop_duplicates) for exact duplicates, fuzzywuzzy for fuzzy string matching, recordlinkage for record-level entity matching (see the sketch below).
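A minimal sketch contrasting exact deduplication with fuzzy matching; the sample names are made up, and fuzzywuzzy is also published under the name thefuzz:

```python
import pandas as pd
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy (or thefuzz)

df = pd.DataFrame({"name": ["Acme Corp", "ACME Corporation", "Acme Corp", "Globex"]})

# Exact duplicates are cheap to remove...
df = df.drop_duplicates(subset=["name"])

# ...but near-duplicates need a fuzzy similarity score (0-100);
# partial_ratio scores substring-like containment highly
names = df["name"].tolist()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        score = fuzz.partial_ratio(a.lower(), b.lower())
        if score >= 90:  # the threshold is a judgment call per dataset
            print(f"possible match: {a!r} ~ {b!r} (score {score})")
```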
Regex vs Normalization:
- Regex: pattern matching to find, validate, or extract text (e.g., phone numbers, emails).
- Normalization: rewriting values into one consistent standard form (case, whitespace, date and number formats). Both are contrasted in the sketch below.
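A short sketch showing both on the same input: regex finds phone-number-like strings, then normalization rewrites every match into one standard form:

```python
import re

raw = "Call (555) 123-4567 or 555.987.6543"

# Regex: find loosely formatted phone numbers in free text
phones = re.findall(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}", raw)

# Normalization: reduce each match to digits, then one canonical format
digits = [re.sub(r"\D", "", p) for p in phones]
normalized = [f"{d[:3]}-{d[3:6]}-{d[6:]}" for d in digits]
print(normalized)  # ['555-123-4567', '555-987-6543']
```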
Data Cleanup:
- Automated scripts in Python, Bash, or SQL to clean and format data repeatably; a minimal command-line example follows.
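A minimal command-line cleanup script in that spirit; the specific rules (drop empty rows, trim and lowercase text columns) are illustrative, not a fixed recipe:

```python
import sys
import pandas as pd

def clean(path_in: str, path_out: str) -> None:
    """Drop empty rows, trim whitespace, and lowercase text columns."""
    df = pd.read_csv(path_in)
    df = df.dropna(how="all")
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip().str.lower()
    df.to_csv(path_out, index=False)

if __name__ == "__main__":
    # Usage: python clean.py input.csv output.csv
    clean(sys.argv[1], sys.argv[2])
```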
UNIT IV: Relationships & Visualization
Multiple Datasets & Correlation:
- Merge datasets on shared key columns with pandas merge()/join(), then compute pairwise numeric correlation with DataFrame.corr() (sketch below).
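A minimal sketch with two hypothetical per-store datasets:

```python
import pandas as pd

sales = pd.DataFrame({"store": [1, 2, 3], "revenue": [100, 250, 175]})
visits = pd.DataFrame({"store": [1, 2, 3], "foot_traffic": [40, 90, 60]})

# Merge on the shared key column
merged = sales.merge(visits, on="store")

# Pairwise Pearson correlation over the numeric columns
print(merged[["revenue", "foot_traffic"]].corr())
```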
Time-related Charts:
- Line charts for trends over time and Gantt charts for schedules, built with matplotlib, seaborn, or Power BI (a line-chart sketch follows).
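A minimal matplotlib line chart over a hypothetical daily series:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 30-day series, smoothed with a 3-day rolling mean
dates = pd.date_range("2024-01-01", periods=30, freq="D")
values = pd.Series(range(30), index=dates).rolling(3).mean()

plt.plot(values.index, values.values)
plt.xlabel("Date")
plt.ylabel("3-day rolling mean")
plt.title("Trend over time")
plt.tight_layout()
plt.show()
```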
Data Maps & Interactives:
- Use folium or Plotly for interactive geographic visualizations rendered as standalone HTML (see the folium sketch below).
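A minimal folium sketch; the coordinates and marker are illustrative:

```python
import folium

# Center the map on London (coordinates are illustrative)
m = folium.Map(location=[51.5074, -0.1278], zoom_start=12)
folium.Marker(
    [51.5033, -0.1196],
    popup="London Eye",
    tooltip="Click for details",
).add_to(m)

# Writes a standalone interactive HTML map
m.save("map.html")
```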
UNIT V: Web Scraping
Web Scraping:
- Programmatically extracting data from websites, typically with Python tools such as requests and lxml.
Reading Web Pages (lxml):
- Use requests to fetch a page and lxml to parse the HTML, then extract data with XPath expressions (sketch below).
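A minimal sketch of that pattern; example.com is a placeholder target, and real scraping should respect a site's robots.txt and terms of service:

```python
import requests
from lxml import html

resp = requests.get("https://example.com", timeout=10)
tree = html.fromstring(resp.content)

# XPath: select every link, then read its text and href attribute
for a in tree.xpath("//a"):
    print(a.text_content().strip(), a.get("href"))
```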
PySpider:
- A distributed Python crawler system with a web-based UI, built-in scheduler, task monitoring, and automatic retrying of failed tasks; a minimal handler in its canonical style is sketched below.
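A minimal handler in PySpider's canonical style (the decorators, response.doc, and self.crawl are PySpider's own API); example.com is a placeholder:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run the seed URL once a day
    def on_start(self):
        self.crawl("http://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat a result as fresh for 10 days
    def index_page(self, response):
        # Follow every outbound link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Returned dicts are stored by PySpider's result workers
        return {"url": response.url, "title": response.doc("title").text()}
```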