EdYoda
Data Scientist Program
Program Curriculum
Learning outcomes:
• Learn to implement Machine Learning techniques using Python
• Learn data visualization techniques
• Learn to analyze raw data
• Learn Big Data and Spark
Python
1. Introduction to Python
• Useful Python Resources
• Python Tools and Utilities
• Python Features
2. Python Environment
• Local Environment Setup
• Downloads and Installations
• Setting up Environment Path
3. Executing Python
• Interactive Mode
• Scripting Mode
• Integrated Development Environment
4. Python Basic Syntax
• Python Identifiers
• Reserved Words
• Lines and Indentation
www.edyoda.com [email protected]
5. Python Variable Types
• Assigning Values to Variables
• Multiple Assignment
• Standard Data Types
• Data Type Conversion
6. Python Basic Operators
• Arithmetic Operators
• Comparison Operators
• Assignment Operators
• Bitwise Operators
• Logical Operators
• Membership Operators
• Identity Operators
• Operators Precedence
7. Python Decision Making
• IF statements
• IF...ELIF...ELSE Statements
• Nested IF statements
8. Python Loops
• While loop
• For loop
• Nested loop
• Break control statement
• Continue statement
• Pass statement
9. Python Numbers
• Number type conversion
• Mathematical function
• Random number function
• Trigonometric function
10. Python Strings
• String special operators
• String formatting operator
• Built-in string methods
11. Python Lists
• Basic list operations
• Indexing and slicing
• Built-in functions and methods
12. Python Tuples
• Basic tuple operations
• Indexing and slicing
• Built-in functions
13. Python Dictionary
• Basic Dictionary operations
• Built-in Functions and Methods
• Use cases
14. Python Functions
• Pass by reference and value
• Function Arguments
• Scope of variables
• Default Argument Values
• Keyword Arguments
• Arbitrary Argument Lists
• Unpacking Argument Lists
• Lambda Expressions
• Documentation Strings
15. Python Modules
• Importing Modules
• Namespaces and scoping
• Packages
16. Python Files I/O
• Writing and Parsing Text Files
• Parsing Text Using Regular Expressions
• Writing and Parsing XML Files
• Writing and Parsing JSON Files
• Writing and Parsing CSV Files
17. Python Exceptions
• The except clause with multiple exceptions
• The try-finally clause
• Argument of an Exception
• Raising an exception
• User-Defined Exceptions
18. Python Classes and Objects
• Creating Classes
• Creating instance objects
• Destroying Objects (Garbage Collection)
• Custom Classes
• Attributes and Methods
• Inheritance and Polymorphism
• Using Properties to Control Attribute Access
19. Functional Programming
• Lambda
• Filter
• Map
• Functools
20. Iterators and Generators
• Itertools
• Generators
• Decorators
21. Collections
• Deque
• Counter
• OrderedDict
• ChainMap
23. Debugging, Testing
• Pdb
• Breakpoints
24. Regular Expressions
• Characters and Character Classes
• Quantifiers
• Grouping and Capturing
• Assertions and Flags
• The Regular Expression Module
25. Deploying Python Applications
• Pip
• Virtualenv
• The __init__.py files
• The setup.py file
• Installing the package
• Software deployment in Python
Data Wrangling
1. Black Box Introduction to Machine Learning
• What is not Machine Learning
• What is Machine Learning
• Types of ML - Supervised, Unsupervised
• Supervised - Classification, Regression
• Unsupervised - Clustering, Association
• Machine Learning Pipeline
2. Essential NumPy
• Introduction to NumPy
• Creation
• Access
• Stacking and Splitting
• Methods
• Broadcasting
3. Pandas for Machine Learning
• Introduction to Pandas
• Understanding Series & DataFrames
• Loading CSV, JSON
• Connecting databases
• Descriptive Statistics
• Accessing subsets of data - Rows, Columns, Filters
• Handling Missing Data
• Dropping rows & columns
• Handling Duplicates
• Function Application - map, apply, groupby, rolling, str
• Merge, Join & Concatenate
• Stacking, Unstacking & Melting
• Pivot-tables
• Normalizing JSON
• Application - EDA on Employee data, sales data
4. Understanding Visualization
• Introduction to matplotlib & seaborn
• Basic Plotting
• Title, Labels, Legends, Grid, colormap, xticks, yticks
• Color, linewidth
• Sub Plotting
• Scatter plot
• Histogram
• Bar Graphs
• Plotting distributions
• Plotting 3D data
• Fundamentals of Tableau
Mathematics Fundamentals
1. Essential Maths & Statistics
• Essential Linear Algebra
• Matrix Operations
• Understanding distributions
• Probability Concepts
• Calculus
• Mean, Median, Mode, Quantile
• Other statistics Concepts
• Sampling Techniques
Machine Learning
1. Linear Models for Classification & Regression
• Simple Linear Regression using Ordinary Least Squares
• Gradient Descent Algorithm
• Regularized Regression Methods - Ridge, Lasso, Elastic Net
• Logistic Regression for Classification
• Online Learning Methods - Stochastic Gradient Descent & Passive Aggressive
• Robust Regression - Dealing with outliers & Model errors
• Polynomial Regression
• Bias-Variance Tradeoff
• Application - House Price, Cancer Prediction, Insurance Prediction
2. Preprocessing for Machine Learning
• Introduction to Preprocessing
• StandardScaler
• MinMaxScaler
• RobustScaler
• Normalization
• Binarization
• Encoding Categorical (Ordinal & Nominal) Features
• Imputation
• Polynomial Features
• Custom Transformer
• Text Processing
• CountVectorizer
• TfIdf
• HashingVectorizer
• Image Processing using skimage
3. Decision Trees
• Introduction to Decision Trees
• The Decision Tree Algorithms
• Decision Tree for Classification
• Decision Tree for Regression
• Advantages & Limitations of Decision Trees
• Application - Cloth Prediction
4. Naive Bayes
• Introduction to Bayes' Theorem
• Naive Bayes Classifier
• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Bernoulli Naive Bayes
• Naive Bayes for out-of-core
• Application - Text Classification, Sentiment Analysis and Spam & Non-spam classification
5. Composite Estimators using Pipelines & FeatureUnions
• Introduction to Composite Estimators
• Pipelines
• Transformed Target Regressor
• FeatureUnions
• ColumnTransformer
• GridSearch on pipeline
• Application - Author classification
6. Model Selection & Evaluation
• Cross Validation
• Hyperparameter Tuning
• Model Evaluation
• Model Persistence
• Validation Curves
• Learning Curves
7. Feature Selection & Dimensionality Reduction
• Introduction to Feature Selection
• Variance Threshold
• Chi-squared stats
• ANOVA using f_classif
• Univariate Linear Regression Tests using f_regression
• F-score vs Mutual Information
• Mutual Information for discrete values
• Mutual Information for continuous values
• SelectKBest
• SelectPercentile
• SelectFromModel
• Recursive Feature Elimination
• PCA
• SVD
• Application - Credit Risk Prediction
8. Nearest Neighbors
• Fundamentals of Nearest Neighbor Algorithm
• Unsupervised Nearest Neighbors
• Nearest Neighbors for Classification
• Nearest Neighbors for Regression
• Nearest Centroid Classifier
• Application - Nearest neighbour for face inpainting
9. Clustering Techniques
• Introduction to Unsupervised Learning
• Clustering
• Similarity or Distance Calculation
• Clustering as an Optimization Function
• Types of Clustering Methods
• Partitioning Clustering - KMeans & Meanshift
• Hierarchical Clustering - Agglomerative
• Density Based Clustering - DBSCAN
• Measuring Performance of Clusters
• Comparing all clustering methods
• Application - Grouping similar customers
10. Anomaly Detection
• What are Outliers?
• Statistical Methods for Univariate Data
• Using Gaussian Mixture Models
• Fitting an elliptic envelope
• Isolation Forest
• Local Outlier Factor
• Using clustering methods such as DBSCAN
• Application - Anomaly detection for credit risk prediction
11. Support Vector Machines
• Introduction to Support Vector Machines
• Maximal Margin Classifier
• Soft Margin Classifier
• SVM Algorithm for Classification
• SVM for Regression
• Hyper-parameters in SVM
• Application - Face recognition and breast cancer classification
12. Dealing with Imbalanced Classes
• What are imbalanced classes & their impact?
• OverSampling
• UnderSampling
• Connecting Sampler to pipelines
• Making classification algorithm aware of Imbalance
• Anomaly Detection
• Application - Fraud detection
13. Ensemble Methods
• Introduction to Ensemble Methods
• RandomForest
• AdaBoost
• Gradient Boosting Tree
• VotingClassifier
• XGBoost
• Application - Malicious data detection
14. Recommendation Engine
• Understanding distance vector calculation - cosine, euclidean, manhattan
• Types of Recommendation Engines
• Recommendation based on similarity
• Application - Grouping videos based on description, user rating prediction
15. Time Series Modeling
• Simple Average & Moving Average
• Single Exponential Smoothing
• Holt’s linear trend method
• Holt-Winters seasonal method
• ARIMA
16. Packaging & Deployment
• Creating Python Package
• Deploy trained model behind REST interface
• Deploy model behind API call
• Deploy on AWS cloud (optional)
Big Data Ecosystem
1. Introduction to Big Data
• Big Data
• Understanding distributed computing
• Introduction to Hadoop
• HDFS, YARN, MapReduce
• Limitations of Hadoop
• Introduction to Spark
• Introduction to Kafka
• Hive
• Cassandra
2. Internal Details of Spark
• Driver
• Executors
• Partitions
• Jobs
• Stages
• Tasks
• Resilient Distributed Dataset (RDD)
• DataFrames as a High-Level Data Structure
3. Foundations of Spark using RDD
• Basics of Distributed Computing
• Resilient Distributed Dataset
• Simple Transformations - map, filter, groupBy
• Actions - collect, count, foreach
• Complex API - combineByKey
• Caching, Debugging
• Important Configuration
4. Data Wrangling using DataFrames
• Creating DataFrames from collections
• Creating a DataFrame from CSV, JSON, etc.
• DataFrame Row
• DataFrame Column
• Creating tables from dataframe
• SQL query
• DataFrame Grouping
• DataFrame Functions
• User Defined Functions (UDF)
5. Packaging & Deployment of Spark Applications
• The spark-submit command
• Command line parameters
• Deploying the app programmatically
• Configuring your SparkSession
• Modularizing code
• Structure of the module
• Building an egg
• User defined functions in Spark
• Submitting a job
• Monitoring execution