Data analysis and mining
Xuan–Hieu Phan
VNU University of Engineering and Technology
Updated: February 1, 2024
data analysis and mining course @ Xuan–Hieu Phan course introduction 1 / 38
Outline
1 Data analysis and mining concepts
2 Course requirement and evaluation
3 Course contents and schedule
4 Course resources
5 References
6 Summary
Why data analysis and mining?
Data is growing rapidly
What is data mining?
According to Aggarwal [2]:
Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights
from data.
According to Leskovec et al. [3]:
The most commonly accepted definition of “data mining” is the discovery of “models” for data.
According to Wikipedia:
Data mining is the process of extracting and discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems. Data mining is
an interdisciplinary subfield of computer science and statistics with an overall goal of extracting
information (with intelligent methods) from a data set and transforming the information into a
comprehensible structure for further use.
Data mining and knowledge discovery process
Data mining: searching for knowledge (interesting
patterns) from data [1]
Knowledge discovery process [1]
CRISP–DM model
The cross–industry standard process for data mining (CRISP–DM) is a process
model that serves as the base for a data science process. It has six sequential phases:
1 Business understanding
2 Data understanding
3 Data preparation
4 Modeling
5 Evaluation
6 Deployment
Published in 1999 to standardize data mining processes across industries,
CRISP–DM has since become the most common methodology for data mining,
analytics, and data science projects.
CRISP–DM process
CRISP–DM phase 1: Business understanding
This phase focuses on understanding the objectives and requirements of the project. It
includes four tasks:
1 Determine business objectives:
thoroughly understand, from a business perspective, what the customer/company
really wants to accomplish, and then define business success criteria.
2 Assess situation:
determine resource availability and project requirements, assess risks and contingencies,
and conduct a cost–benefit analysis.
3 Determine data mining goals:
in addition to defining the business objectives, you should also define what success
looks like from a technical data mining perspective.
4 Produce project plan:
select technologies and tools and define detailed plans for each project phase.
CRISP–DM phase 2: Data understanding
This phase focuses on identifying, collecting, and analyzing the data sets that can help
you accomplish the project goals. This phase also has four tasks:
1 Collect initial data:
acquire the necessary data and (if necessary) load it into your analysis tool.
2 Describe data:
examine the data and document its surface properties like data format, number of
records, or field identities.
3 Explore data:
dig deeper into the data. Query it, visualize it, and identify relationships among the
data.
4 Verify data quality:
how clean/dirty is the data? Document any quality issues.
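The Describe data and Verify data quality tasks can be sketched in a few lines of plain Python. The records, field names, and the helper names describe and quality_issues below are hypothetical, made up for illustration; a real project would more likely rely on a library such as pandas.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Hanoi"},
    {"id": 2, "age": None, "city": "Hue"},
    {"id": 3, "age": 29, "city": None},
    {"id": 4, "age": 41, "city": "Hanoi"},
]


def describe(rows):
    """Return surface properties: record count, field names, value types."""
    fields = sorted({key for row in rows for key in row})
    types = {
        field: sorted({type(row[field]).__name__
                       for row in rows if row.get(field) is not None})
        for field in fields
    }
    return {"n_records": len(rows), "fields": fields, "types": types}


def quality_issues(rows):
    """Count missing values per field as a simple data quality check."""
    fields = sorted({key for row in rows for key in row})
    return {
        field: sum(1 for row in rows if row.get(field) is None)
        for field in fields
    }


summary = describe(records)
missing = quality_issues(records)
print(summary)
print(missing)
```

Counting missing values is only one possible quality check; duplicates, out-of-range values, and inconsistent formats would be documented in the same way.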
CRISP–DM phase 3: Data preparation
This phase prepares the final data for modeling. It has five tasks:
1 Select data:
determine which data sets will be used and document reasons for inclusion/exclusion.
2 Clean data:
often this is the lengthiest task. Without it, you will likely fall victim to garbage-in,
garbage-out. A common practice during this task is to correct, impute, or remove
erroneous values.
3 Construct data:
derive new attributes that will be helpful. For example, derive someone’s body mass
index from height and weight fields.
4 Integrate data:
create new data sets by combining data from multiple sources.
5 Format data:
re-format data as necessary. For example, you might convert string values that store
numbers to numeric values so that you can perform mathematical operations.
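The Clean, Construct, and Format tasks above can be illustrated with a minimal sketch in plain Python. The records, the field names, and the choice of median imputation are assumptions made for this example; heights are assumed to be present in every record.

```python
from statistics import median

# Hypothetical raw records: numbers stored as strings, one missing weight.
raw = [
    {"name": "A", "height_cm": "170", "weight_kg": "65"},
    {"name": "B", "height_cm": "160", "weight_kg": ""},
    {"name": "C", "height_cm": "180", "weight_kg": "81"},
]


def to_number(text):
    """Format data: convert a numeric string to float, or None if invalid."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def prepare(rows):
    # Format: turn string fields into numeric values.
    cleaned = [
        {"name": r["name"],
         "height_cm": to_number(r["height_cm"]),
         "weight_kg": to_number(r["weight_kg"])}
        for r in rows
    ]
    # Clean: impute missing weights with the median of the known weights.
    known = [r["weight_kg"] for r in cleaned if r["weight_kg"] is not None]
    fill = median(known)
    for r in cleaned:
        if r["weight_kg"] is None:
            r["weight_kg"] = fill
        # Construct: body mass index = weight (kg) / height (m) squared.
        height_m = r["height_cm"] / 100
        r["bmi"] = round(r["weight_kg"] / height_m ** 2, 1)
    return cleaned


prepared = prepare(raw)
for row in prepared:
    print(row)
```

Whether to impute, correct, or remove an erroneous value depends on the data and the project goals; the median is used here only because it is simple.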
CRISP–DM phase 4: Modeling
In this phase, you will likely build and assess various models based on several different
modeling techniques. This phase has four tasks:
1 Select modeling techniques:
determine which algorithms to try (e.g., logistic regression, random forest, neural nets).
2 Generate test design:
depending on your modeling approach, you might need to split the data into training, test,
and validation sets.
3 Build model:
select (hyper)parameters and train your models. This might just be executing a few
lines of code like “reg_model = LinearRegression().fit(X, y)”.
4 Assess model:
generally, multiple models are competing against each other, and the data scientist
needs to interpret the model results based on domain knowledge, the pre-defined
success criteria, and the test design.
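The test design, model building, and model assessment tasks can be sketched end to end in plain Python. This is a minimal illustration, assuming a single input feature, a 60/20/20 split, and mean squared error as the metric; the function names and the synthetic data are hypothetical. With scikit-learn, the fit_line step would correspond to a call such as LinearRegression().fit(X, y).

```python
import random


def split(rows, train=0.6, valid=0.2, seed=0):
    """Generate test design: shuffle, then cut into train/validation/test."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])


def fit_line(pairs):
    """Build model: ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, pairs):
    """Assess model: mean squared error on a held-out set."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in pairs) / len(pairs)


# Hypothetical noise-free data generated from y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train_set, valid_set, test_set = split(data)
model = fit_line(train_set)
print("slope, intercept:", model)
print("validation MSE:", mse(model, valid_set))
print("test MSE:", mse(model, test_set))
```

In practice the validation set is used to compare competing models or hyperparameter settings, and the test set is touched only once for the final estimate.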
CRISP–DM phase 5: Evaluation
Whereas the Assess model task of the Modeling phase focuses on technical model
assessment, the Evaluation phase looks more broadly at which model best meets the
business needs and what to do next. This phase has three tasks:
1 Evaluate results:
do the models meet the business success criteria? Which one(s) should we approve for
the business based on “real” tests like A/B testing?
2 Review process:
review the work accomplished. Was anything overlooked? Were all steps properly
executed? Summarize findings and correct anything if needed.
3 Determine next steps:
based on the previous two tasks, determine whether to proceed to deployment, iterate
further, or initiate new projects.
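As one possible way to read the results of an A/B test mentioned under Evaluate results, the sketch below applies a two-proportion z-test to conversion counts. The counts, the 0.05 significance level, and the approval rule are assumptions for illustration only; the business success criteria of a real project would define the actual rule.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF evaluated at |z| via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: variant A = current model, variant B = candidate model.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
approve = z > 0 and p_value < 0.05  # assumed success criterion
print(f"z = {z:.2f}, p = {p_value:.4f}, approve B: {approve}")
```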
CRISP–DM phase 6: Deployment
A model is not particularly useful unless the customer can access its results. The
complexity of this phase varies widely. This final phase has four tasks:
1 Plan deployment:
develop and document a plan for deploying the model.
2 Plan monitoring and maintenance:
develop a thorough monitoring and maintenance plan to avoid issues during the
operational phase (or post-project phase) of a model.
3 Produce final report:
the project team documents a summary of the project which might include a final
presentation of data mining results.
4 Review project:
conduct a project retrospective about what went well, what could have been better,
and how to improve in the future.
Types of data analytics
Data types (by domains)
Retail data (product items, cart, basket, transaction)
Financial data (bank, credit, payment, stock, cryptocurrency)
Telecommunication data (CDR, etc.)
Transportation data
Scientific and educational data
Medical and biological data (x-ray, MRI, gene, etc.)
Web data (web pages, web logs, online user behaviors, etc.)
Marketing and advertising data
Online social network and social media data
IoT, sensor, and surveillance data
Manufacturing data
Map and spatial data
Software development data
Data in other areas (environment, agriculture, hydrology, etc.)
Major problems in data analysis and mining
Data classification
Regression
Data clustering
Association pattern mining
Outlier detection
Other application–oriented problems
Applications of data analysis and mining
Market basket analysis
Finance: credit scoring, fraud detection, forecasting, etc.
Customer analysis: demographic, segmentation, intent, etc.
Telco: customer segmentation, churn prediction, fraud, etc.
Marketing and advertising: ad targeting, ad optimization, etc.
Recommender systems: next best offer, cross-sell, up-sell, etc.
Social network analysis: link prediction, suggestion, etc.
Text/web mining: opinion mining, social media listening, etc.
Data mining in biology and medicine
Data analysis in software development
Data mining in production, environment, hydrology, etc.
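To make the first application in this list concrete, here is a minimal sketch of the two basic measures used in market basket analysis, support and confidence, in plain Python. The baskets and item names are hypothetical toy data.

```python
# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


print("support({bread, milk}) =", support({"bread", "milk"}, baskets))
print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, baskets))
```

Here 3 of the 5 baskets contain both bread and milk, and 4 contain bread, so the rule bread -> milk has support 3/5 and confidence 3/4.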
Data mining is an interdisciplinary field
Databases
Mathematics (linear algebra, analysis, optimization)
Probability and statistics
Machine learning
Visualization
High–performance computing
Prerequisite courses and skills
Prerequisite courses:
Databases
Data structures and algorithms
Probability and statistics
Advanced mathematics
Required skills:
Programming with Python, R, or Java
Course evaluation
Class contributions (10%)
Midterm evaluation (40%):
Group projects, or
Group seminar
Final exam (50%):
Written exam, or
Oral exam
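The weights above combine into a final score as a weighted average. The following sketch shows the arithmetic; the component names, the 10-point scale, and the example scores are assumptions for illustration.

```python
# Weights taken from the evaluation scheme above.
WEIGHTS = {"class_contributions": 0.10, "midterm": 0.40, "final_exam": 0.50}


def final_score(scores):
    """Weighted average of component scores (all on the same scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student scores on a 10-point scale (the scale is an assumption).
example = {"class_contributions": 9.0, "midterm": 8.0, "final_exam": 7.0}
print(round(final_score(example), 2))
```

For these scores the result is 0.10 × 9 + 0.40 × 8 + 0.50 × 7 = 7.6.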
Lectures and schedule
Course introduction (this one)
Basic probability and statistics (optional)
Data understanding
Data preprocessing and preparation
Distance and similarity
Data classification
Classification model assessment
Regression (optional)
Data clustering
Association pattern mining
Recommender systems
Data visualization (optional)
Group seminar or project presentation
Seminar topics
1 Dimensionality reduction (Book3 Chap.11, Book2 Chap.2–2.4.3, Book4 Chap.7)
2 Advanced data classification (Book1 Chap.9, Book2 Chap.11)
3 Outlier detection (Book1 Chap.12, Book2 Chap.8–9)
4 Finding similar items (Book3 Chap.3)
5 Mining text data (Book2 Chap.13)
6 Mining web data (Book2 Chap.18, Book3 Chap.5)
7 Mining time-series data (Book2 Chap.14)
8 Mining data streams (Book2 Chap.12, Book3 Chap.4)
9 Mining discrete sequences (Book2 Chap.15)
10 Mining spatial data (Book2 Chap.16)
11 Mining graph data (Book2 Chap.17)
12 Social network analysis (Book2 Chap.19, Book3 Chap.10)
13 Privacy-preserving data mining (Book2 Chap.20)
Seminar papers
Will be updated every year. Papers will be selected from major data mining related
conferences like SIGKDD, WSDM, SIGIR, WWW, SIGMOD, VLDB, ICDE, ICML,
AAAI, IJCAI, ACL, EMNLP, . . .
Seminar papers (for terms 1 and 2, 2023–2024)
If students do not pick a seminar topic (book chapters) from the Seminar topics list, they can select
any paper from the following conferences for the group seminar:
Accepted papers from KDD-2021, KDD-2022, KDD-2023.
Accepted papers from WSDM-2021, WSDM-2022, WSDM-2023.
Accepted papers from SIGIR-2021, SIGIR-2022, SIGIR-2023.
Accepted papers from WWW-2021, WWW-2022, WWW-2023.
Any data mining related papers from SIGMOD, VLDB, ICDE, AAAI, IJCAI, etc.
The selected papers must be approved by the lecturer.
Textbooks and references
Textbooks:
Data Mining: Concepts and Techniques [Book1]
Data Mining: The Textbook [Book2]
Further reading:
Mining of Massive Datasets [Book3]
Data Mining and Analysis: Fundamental Concepts and Algorithms [Book4]
Networks, Crowds, and Markets: Reasoning About a Highly Connected World [Book5]
Books for practice:
Python Data Science Handbook: Essential Tools for Working with Data [Book6]
Data Science from Scratch: First Principles with Python [Book7]
Software libraries and tools
Scikit–learn: Machine learning in Python
Weka: Data mining software in Java
H2O: Fast scalable machine learning
Apache Mahout: Scalable machine learning and data mining
TensorFlow: Open source software library for machine intelligence
Torch and PyTorch
Keras: the Python deep learning API
Dato Open Source (GraphLab)
MLPack: A scalable C++ machine learning library
NLTK: Natural language toolkit
OpenCV: Open source for computer vision
R: The R project for statistical computing
...
Data mining forums and organizations
ACM SIGKDD: ACM Special Interest Group on Knowledge Discovery and Data
Mining (www.sigkdd.org)
KDnuggets: Data Mining, Analytics, Big Data, and Data Science
(www.kdnuggets.com)
IEEE Big Data: bigdata.ieee.org
Data Science Central: www.datasciencecentral.com
Big Data University: bigdatauniversity.com
Data Science Community: datascience.community
Kaggle: machine learning and data science community (kaggle.com)
Data mining conferences and journals
Conferences:
ACM SIGKDD: International Conference on Knowledge Discovery and Data Mining
WSDM: ACM International Conference on Web Search and Data Mining
SIAM International Conference on Data Mining
IEEE ICDM: IEEE International Conference on Data Mining
Big Data: International Conference on Big Data
PKDD: Principles and Practice of Knowledge Discovery in Databases
PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
Journals:
Data Mining and Knowledge Discovery (Springer)
IEEE Transactions on Knowledge and Data Engineering (TKDE)
ACM Transactions on Knowledge Discovery from Data (TKDD)
SIGKDD Explorations
Data mining related conferences
Databases and Data Engineering: SIGMOD, VLDB, ICDE, . . .
Machine Learning: NeurIPS, ICML, ICLR, COLT, ECML, ACML, . . .
IR and Web: ACM SIGIR, WWW, ICEC, ECIR, . . .
NLP and Text Mining: ACL, NAACL, EMNLP, HLT, CoNLL, COLING, EACL,
IJCNLP, PACLING, PACLIC, . . .
Artificial Intelligence: IJCAI, AAAI, AISTATS, UAI, . . .
References
[1] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann, Elsevier, 2012 [Book1].
[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
[3] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets.
Cambridge University Press, 2014 [Book3].
[4] M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and
Algorithms. Cambridge University Press, 2013 [Book4].
[5] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press, 2010 [Book5].
[6] J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with
Data. O’Reilly, 2023 [Book6].
[7] J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2019
[Book7].
Summary
Important concepts in data mining: definition, data mining process, CRISP–DM
model, types of data analytics, major data mining problems, data types, and
real–world applications of data mining.
Course and requirements: prerequisite and related courses, project or seminar,
course evaluation, final exam, etc.
Lectures and schedule.
Seminar topics.
Related resources: conferences, journals, forums, software tools.