Data Science: Topics of Study - Explained Notes
Introduction
- Big Data and Data Science hype and getting past the hype: Data science is often surrounded by
exaggerated expectations. It's important to focus on real-world applications and measurable
outcomes.
- Why now? - Datafication: This refers to the transformation of social action into online quantified
data, enabling real-time tracking and predictive analysis.
- The current landscape of perspectives: Different industries have different perspectives on data
science, ranging from customer analytics to operations and logistics.
- Skill sets needed: Includes statistics, programming (Python/R), data wrangling, machine learning,
and domain knowledge.
Statistical Inference
- Populations and samples: A population is every member of a defined group; a sample is a subset
drawn from it and used to infer properties of the whole population.
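- Statistical modelling, probability distributions, fitting a model: A statistical model describes
how the data are assumed to be generated, usually through a probability distribution. Fitting a
model means estimating its parameters from a sample so it can describe relationships between
variables and make predictions.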
- Python packages for data science: Common ones include NumPy, pandas, SciPy, scikit-learn, and
statsmodels.
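As a minimal sketch of these ideas, the example below draws a random sample from a synthetic
population and fits a normal distribution by estimating its mean and standard deviation. The
population parameters (mean 170, standard deviation 10), the sample size, and the variable names
are illustrative assumptions; it uses only the Python standard library rather than the packages
listed above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

# Hypothetical population: 100,000 heights (cm) drawn from a normal distribution.
population = [random.gauss(170, 10) for _ in range(100_000)]

# A sample is a subset of the population chosen at random.
sample = random.sample(population, 500)

# Fitting a normal model here means estimating its two parameters from the sample.
fitted = statistics.NormalDist.from_samples(sample)

print(f"population mean: {statistics.fmean(population):.2f}")
print(f"sample mean:     {fitted.mean:.2f}")
print(f"sample stdev:    {fitted.stdev:.2f}")

# The fitted model can then answer probability questions about the population.
print(f"P(height > 185): {1 - fitted.cdf(185):.3f}")
```

The sample estimates will not match the population values exactly; the gap shrinks as the sample
size grows.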
Exploratory Data Analysis and the Data Science Process
- Basic tools of EDA: Histograms, boxplots, scatterplots, and summary statistics.
- Philosophy of EDA: Emphasizes understanding data patterns before applying models.
- The Data Science Process: Steps include data collection, cleaning, EDA, modeling, interpretation,
and deployment.
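The EDA tools above can be sketched without any plotting library. The example below computes the
five-number summary behind a boxplot, flags outliers with the common 1.5 * IQR rule of thumb, and
prints a crude text histogram. The data and the helper names (five_number_summary, text_histogram)
are made up for illustration.

```python
import statistics
from collections import Counter

# Hypothetical data: daily order counts with one suspicious value.
orders = [12, 15, 14, 10, 13, 16, 15, 14, 11, 13, 95, 14]

def five_number_summary(values):
    """Return min, Q1, median, Q3, max (the numbers behind a boxplot)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def text_histogram(values, bin_width):
    """Print a crude histogram by counting values per fixed-width bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    for start in sorted(bins):
        print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}")

low, q1, median, q3, high = five_number_summary(orders)
iqr = q3 - q1
# Rule of thumb: points more than 1.5 * IQR outside the quartiles are outliers.
outliers = [v for v in orders if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("five-number summary:", low, q1, median, q3, high)
print("mean vs median:", round(statistics.fmean(orders), 2), median)
print("outliers:", outliers)
text_histogram(orders, bin_width=10)
```

Comparing the mean with the median is a quick check for skew: a single extreme value pulls the
mean far more than the median, which is the kind of pattern EDA is meant to surface before
modeling.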
Three Basic Machine Learning Algorithms
- Linear Regression: A method to model the relationship between a dependent variable and one or
more independent variables.
- k-Nearest Neighbors (k-NN): A non-parametric method for classification and regression that
predicts from the k training points closest to the query point under a chosen distance metric.
- k-means: An unsupervised learning algorithm that clusters data into k groups by repeatedly
assigning each point to its nearest centroid and recomputing the centroids.
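The three algorithms above can be sketched in a few lines each. This is a teaching sketch under
simplifying assumptions (one predictor for the regression, Euclidean distance, a fixed number of
k-means iterations, toy data invented for the example); in practice a library such as scikit-learn
would normally be used instead. The function names are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import fmean

def fit_line(xs, ys):
    """Simple linear regression: least-squares slope and intercept."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def knn_predict(train, query, k=3):
    """k-NN classification: majority label among the k closest points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans(points, k, iterations=20, seed=0):
    """k-means (Lloyd's algorithm): alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[closest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(fmean(dim) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Toy data, made up for illustration.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points lie on y = 2x + 1

labeled = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
           ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
prediction = knn_predict(labeled, (2, 2), k=3)

points = [p for p, _ in labeled]
centroids, clusters = kmeans(points, k=2)

print(slope, intercept)
print(prediction)
print(sorted(centroids))
```

Note the difference in what each function needs: fit_line and knn_predict use known outputs
(supervised), while kmeans sees only the points and discovers the groups itself (unsupervised).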
One More Machine Learning Algorithm and Usage in Applications
- Filtering Spam as an application: A common real-world use case of machine learning.
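- Why Linear Regression and k-NN are poor for spam filtering: With one feature per word, emails
become very high-dimensional, sparse vectors. Linear regression predicts a continuous value rather
than a class and can face more features than training emails; k-NN distances become less
informative in so many dimensions, and every prediction requires comparing against the stored
training set.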
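- Naive Bayes: Works well for spam filtering. It uses Bayes' rule to calculate the probability of
an email being spam given the words it contains, under the "naive" assumption that words occur
independently of each other within each class.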
- Data Wrangling: The process of cleaning and unifying complex data sets for easy access and
analysis. APIs and web scraping are often used to collect the raw data.
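The Naive Bayes idea above can be sketched as a small word-count classifier. This is a minimal
illustration, assuming whitespace tokenization, Laplace (add-one) smoothing, and a tiny invented
training set; the function names train_naive_bayes and spam_probability are hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(messages):
    """Count words per class. `messages` is a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary

def spam_probability(text, model):
    """Return P(spam | words) via Bayes' rule with Laplace (add-one) smoothing."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # log P(class) + sum of log P(word | class); logs avoid numeric underflow.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the two log scores into a probability.
    peak = max(log_scores.values())
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    return exp_scores["spam"] / sum(exp_scores.values())

# Tiny made-up training set; a real filter would need far more data.
training = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_naive_bayes(training)
print(round(spam_probability("free cash prize", model), 3))
print(round(spam_probability("monday meeting", model), 3))
```

Training only counts words, so the model handles a large, sparse vocabulary cheaply, which is the
property the regression and k-NN approaches lack.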
Feature Generation and Feature Selection
- Motivating application: Customer retention, where the goal is to identify the factors that best
predict whether a customer stays.
- Feature Generation: Creating new features based on domain knowledge or data transformations.
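- Feature Selection: Reducing the number of input variables. Filters score each feature on its own
(for example, by correlation with the target); wrappers search over feature subsets using model
performance; embedded methods such as Decision Trees and Random Forests rank features as part of
training.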
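As a sketch of both steps, the example below generates one new feature from two raw columns and
then applies a filter-style selection that ranks features by absolute Pearson correlation with the
target. The customer records, column names, and the choice of keeping the top two features are all
assumptions made for illustration.

```python
import math
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical customer records; `retained` is the target (1 = stayed).
customers = [
    {"logins": 40, "days_active": 20, "support_tickets": 1, "retained": 1},
    {"logins": 3,  "days_active": 30, "support_tickets": 2, "retained": 0},
    {"logins": 25, "days_active": 10, "support_tickets": 1, "retained": 1},
    {"logins": 5,  "days_active": 25, "support_tickets": 1, "retained": 0},
    {"logins": 18, "days_active": 12, "support_tickets": 2, "retained": 1},
    {"logins": 20, "days_active": 40, "support_tickets": 1, "retained": 0},
]

# Feature generation: derive a rate from two raw columns.
for row in customers:
    row["logins_per_day"] = row["logins"] / row["days_active"]

target = [row["retained"] for row in customers]
features = [name for name in customers[0] if name != "retained"]

# Filter-style selection: score each feature on its own, then keep the top k.
scores = {
    name: abs(pearson([row[name] for row in customers], target))
    for name in features
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
selected = [name for name, _ in ranked[:2]]

for name, score in ranked:
    print(f"{name:<16} {score:.3f}")
print("selected:", selected)
```

A filter like this is cheap but scores features one at a time, so it can miss features that are
only useful in combination; wrappers and embedded methods address that at higher cost.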
Recommendation Systems
- Algorithmic ingredients: Include collaborative filtering, content-based filtering, and hybrid methods.
- Dimensionality Reduction: Helps reduce data complexity, e.g., using PCA or SVD.
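- Singular Value Decomposition (SVD): Factorizes a matrix into the product of three matrices (U, a
diagonal matrix of singular values, and V transposed). In recommendation engines, keeping only the
largest singular values of the user-item matrix gives a low-rank approximation that captures
latent factors.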
- Principal Component Analysis (PCA): Projects data onto a small number of orthogonal directions
(principal components) that capture the most variance, which brings out the strongest patterns in
a dataset.
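The two decompositions above can be sketched with NumPy. The rating matrix below is invented for
illustration, k = 2 is an arbitrary choice, and PCA is computed here through the SVD of the
mean-centered data, which is one standard way to obtain it.

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items).
ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# SVD factorizes the matrix as U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

# Keeping only the top-k singular values gives a rank-k approximation.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# PCA via SVD of the mean-centered data: the rows of Vt_c are the principal
# directions, and squared singular values are proportional to variance.
centered = ratings - ratings.mean(axis=0)
_, s_c, Vt_c = np.linalg.svd(centered, full_matrices=False)
explained_ratio = s_c**2 / np.sum(s_c**2)
projected = centered @ Vt_c[:k].T  # each user as a point in k dimensions

print(np.round(approx, 2))
print(np.round(explained_ratio, 3))
```

In this toy matrix the users fall into two taste groups, so most of the variance lies along a
single direction and a low-rank approximation loses little information.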
Mining Social-Network Graphs
- Social networks as graphs: Representing individuals as nodes and relationships as edges.
- Clustering of graphs: Grouping nodes that are similar or closely connected.
- Community discovery: Finding groups of nodes that are more densely connected to each other than
to the rest of the network.
- Partitioning of graphs: Dividing graphs into parts to simplify analysis.
- Neighbourhood properties: Analyzing a node's local connections, such as its degree and how many
of its neighbours are connected to each other.
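A small sketch of these graph ideas follows, assuming an undirected network stored as adjacency
sets. The names and edges are invented. Connected components stand in here as the simplest
possible partition of a graph; this is not a full community-detection algorithm. The local
clustering coefficient is shown as one example of a neighbourhood property.

```python
from collections import deque

# Hypothetical friendship network as an undirected edge list.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("cat", "dan"), ("eve", "fay")]

def build_graph(edge_list):
    """Adjacency sets: each node maps to the set of its neighbours."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def connected_components(graph):
    """Partition nodes into groups reachable from one another (BFS)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components

def local_clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = graph[node]
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    links = sum(1 for a in neighbours for b in neighbours
                if a < b and b in graph[a])
    return links / (degree * (degree - 1) / 2)

graph = build_graph(edges)
components = connected_components(graph)

print("degree of cat:", len(graph["cat"]))
print("components:", [sorted(c) for c in components])
print("clustering of cat:", round(local_clustering(graph, "cat"), 3))
```

A node whose friends all know each other has a coefficient near 1, while a node bridging
otherwise separate groups has a coefficient near 0.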
Data Visualization
- Principles and tools: Principles include clarity and accuracy; tools include visualization
libraries such as Matplotlib, Seaborn, and Plotly.
- Examples of inspiring projects: Dashboards, storytelling with data, and visual analytics used in
industries.
Data Science and Ethical Issues
- Privacy, security, ethics: Involves protecting data and using it responsibly.
- A look back at Data Science: Reflecting on its evolution and impact.
- Next-generation data scientists: Professionals who are technically strong and ethically aware.