CRASH COURSE DATA SCIENCE
(BEGINNER LEVEL)
DATA COLLECTION
1) Data collection is the process of gathering relevant
information from various sources to analyze and derive insights.
2) In data science, the quality of collected data directly impacts
the accuracy of the resulting analysis and models.
3) A well-defined sampling strategy ensures that collected data
is representative of the larger population.
4) Surveys, interviews, and questionnaires are common
methods for collecting primary data directly from individuals.
5) Web scraping involves extracting information from websites
and is often used to collect data from online sources (see the
sketch after this list).
6) Sensor networks and Internet of Things (IoT) devices
contribute to the collection of real-time data in various
applications.
7) Secondary data refers to data collected by someone else for
a different purpose but can still be useful for analysis.
8) The bias present in collected data can lead to skewed
insights and inaccurate conclusions.
9) Data curation involves organizing, cleaning, and preparing
collected data for analysis.
10) The process of data collection should follow ethical
guidelines to ensure privacy and respect for individuals' rights.
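A minimal web-scraping sketch in Python, assuming the requests and
beautifulsoup4 packages are installed; the URL and the choice of <h2>
tags are hypothetical placeholders:

    import requests
    from bs4 import BeautifulSoup

    # Fetch a (hypothetical) page and fail fast on HTTP errors.
    response = requests.get("https://example.com/articles", timeout=10)
    response.raise_for_status()

    # Parse the HTML and collect the text of every second-level heading.
    soup = BeautifulSoup(response.text, "html.parser")
    titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
    print(titles)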
DESCRIPTIVE STATISTICS
1) Descriptive statistics summarize and describe the main features of a dataset.
2) Descriptive statistics can be used to summarize both categorical and
numerical variables.
3) Range is a measure of dispersion that represents the difference between the
maximum and minimum values in a dataset.
4) The range is a measure of dispersion, not of central tendency; it is
the median that represents the middle value in a dataset.
5) The interquartile range (IQR) is a measure of spread that represents the range
between the first quartile (Q1) and the third quartile (Q3).
6) The mode is the value that occurs most frequently in a dataset.
7) The median is less affected by outliers than the mean.
8) The median is less influenced by extreme values in the dataset, making it a
more robust measure of central tendency compared to the mean.
9) Standard deviation measures the average distance of values from the mean.
10) Standard deviation quantifies the dispersion or spread of data by measuring
the average distance between each data point and the mean.
11) Variance is the square of the standard deviation, not its square root.
12) Equivalently, the standard deviation is the square root of the variance.
13) Skewness is a measure of the symmetry of a distribution.
14) Skewness indicates the extent to which a distribution is skewed or
asymmetrical.
15) Correlation measures the strength and direction of the linear relationship
between two numerical variables.
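A short Python sketch of the measures listed above, using pandas and
scipy on a toy series (the numbers are made up):

    import pandas as pd
    from scipy.stats import skew

    data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

    print("mean:", data.mean())
    print("median:", data.median())        # robust to outliers
    print("mode:", data.mode().tolist())   # most frequent value(s)
    print("range:", data.max() - data.min())
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    print("IQR:", q3 - q1)
    print("std:", data.std())              # spread around the mean
    print("variance:", data.var())         # square of the std
    print("skewness:", skew(data))         # 0 means symmetric

    # Correlation between two numerical variables:
    other = pd.Series([1, 2, 2, 3, 4, 4, 6, 8])
    print("correlation:", data.corr(other))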
EXPLORATORY DATA ANALYSIS
1) Exploratory data analysis involves summarizing and visualizing data to
gain insights and understand patterns.
2) Exploratory data analysis is typically performed after data cleaning and
preprocessing to ensure the data is in a suitable format for analysis.
3) Exploratory data analysis includes identifying outliers (extreme values) and
missing values in the dataset, which can impact the validity of the analysis.
4) Descriptive statistics, such as mean, median, and standard deviation, are
commonly calculated during exploratory data analysis to summarize the
central tendency and dispersion of the data.
5) Exploratory data analysis is a flexible and iterative process, not a
rigid, one-pass procedure.
6) Exploratory data analysis can help detect relationships and correlations
between variables, which can provide valuable insights into the dataset.
7) The primary goal of exploratory data analysis is to gain an understanding
of the data rather than to perform formal hypothesis testing and statistical
inference.
8) Exploratory data analysis can reveal potential data quality issues, such as
inconsistent or erroneous values, and identify data anomalies that require
further investigation.
9) Graphical techniques, such as histograms, scatter plots, and box plots,
are commonly used in exploratory data analysis to visualize the distribution,
relationships, and outliers in the data.
10) Exploratory data analysis is an ongoing process that is revisited as
new questions arise during an analysis.
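A minimal EDA sketch with pandas; the DataFrame and its column names are
hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "age": [23, 25, 25, 28, 31, 90, None],
        "income": [30, 32, 35, 40, 41, 250, 38],
    })

    print(df.describe())       # central tendency and dispersion
    print(df.isna().sum())     # missing values per column

    # Flag outliers in "income" using the 1.5 * IQR rule.
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["income"] < q1 - 1.5 * iqr) |
                  (df["income"] > q3 + 1.5 * iqr)]
    print(outliers)

    print(df.corr(numeric_only=True))  # pairwise correlations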
DATA VISUALISATIONS
1) Data visualisation is the presentation of data in a graphical or pictorial
format
2) Bar charts, line charts, and pie charts are among the most common types
of visualisation
3) A line chart is a data visualisation technique suited to displaying
trends over time
4) A heat map is used to represent the distribution of values with colours
5) A tree map is used to show hierarchical data using nested rectangles
6) A box plot is used to show the distribution of data
7) A choropleth map is used to represent geographic data with colour
variations
8) The points on a scatter plot show the relationship between two
variables
9) In a bar chart, the y-axis shows the dependent variable while the
x-axis shows the independent variable
10) Python is among the most commonly used programming languages for
creating interactive data visualisations
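A sketch of three of these chart types with matplotlib; the data is
made up:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [10, 14, 9, 17]

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))

    axes[0].bar(months, sales)               # bar chart: categorical comparison
    axes[0].set_title("Bar chart")

    axes[1].plot(months, sales, marker="o")  # line chart: trend over time
    axes[1].set_title("Line chart")

    axes[2].scatter([1, 2, 3, 4], sales)     # scatter: relationship between variables
    axes[2].set_title("Scatter plot")

    plt.tight_layout()
    plt.show()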
DATA CLEANING
1) Imputation techniques are used to fill in missing values
2) Outlier detection is used to identify and handle unusual data points
3) Standardization is used to bring all variables to a common scale
4) Deduplication is used to identify and handle duplicate records
5) Regular Expressions are used for pattern matching and extraction
6) One-Hot Encoding is used for handling categorical variables
7) Scaling is used to re-scale numerical variables
8) Trimming is used to remove unnecessary white spaces
9) Mean imputation: replacing missing values with the mean of the variable
10) Forward filling: filling missing values with the value before them
11) Interpolation: estimating missing values based on the adjacent values
12) Deleting rows: removing rows with missing values
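A sketch of several of the cleaning steps above, using pandas on a toy
DataFrame (the values and column names are made up):

    import pandas as pd

    df = pd.DataFrame({
        "city": [" London", "Paris ", "Paris ", None],
        "temp": [12.0, None, None, 14.0],
    })

    df["city"] = df["city"].str.strip()   # trimming stray white space
    df = df.drop_duplicates()             # deduplication (rows 2 and 3 now match)

    df["temp_mean"] = df["temp"].fillna(df["temp"].mean())  # mean imputation
    df["temp_ffill"] = df["temp"].ffill()                   # forward filling
    df["temp_interp"] = df["temp"].interpolate()            # interpolation

    df = df.dropna(subset=["city"])            # deleting rows with missing values
    df = pd.get_dummies(df, columns=["city"])  # one-hot encoding
    print(df)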
MACHINE LEARNING
1) The two main categories of machine learning models are supervised and
unsupervised.
2) Labeled data in supervised learning provides correct answers for training
the model to learn relationships between input features and output labels.
3) Precision is the ratio of correctly predicted positive observations to the
total predicted positives, while recall is the ratio of correctly predicted
positive observations to the total actual positives (see the metrics sketch
after this list).
4) Accuracy might not be suitable for imbalanced datasets because it can be
dominated by the majority class and may not reflect the true model
performance.
5) Cross-validation assesses a machine learning model's performance by
dividing the dataset into subsets, training and evaluating the model on
different combinations, and providing insight into its generalization
capability (see the cross-validation sketch below).
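A precision/recall sketch with scikit-learn; the labels below are made up:

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
    print("accuracy:", accuracy_score(y_true, y_pred))    # can mislead on imbalanced data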
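A 5-fold cross-validation sketch using scikit-learn's built-in iris dataset
and a logistic-regression model (one illustrative choice, not the only one):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # Train and evaluate on 5 different train/test splits.
    scores = cross_val_score(model, X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean accuracy:", scores.mean())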