Alright Bhavik, here are all 35 questions rewritten in proper Part A style:
each answer 4–5 lines, 40–60 words, definition + example/point.
DEV – Part A (2 Marks Answers)
Unit 1: Basics of Data Exploration
1. Two basic organizing concepts of data analysis
The two basic concepts are cases (units about which information is collected) and
variables (characteristics measured across cases).
Example: In a school dataset, each student is a case, while age, marks, and attendance
are variables.
2. Cases and Variables
A case is a single observation in a dataset, while a variable is a property or
measurement that varies across cases.
Example: In e-commerce data, each order is a case, and variables include order value,
product type, and payment method.
3. Twyman’s Law
Twyman’s Law states: “Any figure that looks interesting or different is usually
wrong.”
Example: If a survey shows a 120% participation rate, that points to a data error rather
than a meaningful insight.
4. Responsibilities of a Data Analyst (+ example)
A data analyst collects, cleans, analyzes, and visualizes data for decision-making.
Example: In e-commerce, if sales drop suddenly, the analyst investigates causes like
low website traffic, competitor discounts, or stock issues.
5. Gini Coefficient & Inequality
The Gini coefficient measures inequality in a distribution, ranging from 0 (perfect
equality) to 1 (maximum inequality).
Example: A Gini of 0.20 indicates a relatively equal income distribution, while 0.60
indicates sharp inequality within the population.
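As a rough illustration, the Gini coefficient can be computed from the mean absolute
difference between all pairs of values. The function name gini and the sample incomes
below are assumptions made for this sketch. Note that for a finite sample of n values the
largest possible result is (n - 1)/n rather than exactly 1.

```python
def gini(values):
    """Gini coefficient of non-negative values, via pairwise absolute differences."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    # Sum |x_i - x_j| over all ordered pairs, then divide by 2 * n^2 * mean,
    # which equals 2 * n * total.
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


# Hypothetical incomes: everyone equal vs. one person holding everything.
print(gini([10, 10, 10, 10]))   # 0.0
print(gini([0, 0, 0, 100]))     # 0.75, the maximum for n = 4
```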
6. Features visible in Histogram
Histograms reveal shape (normal/skewed), center (mean/median), spread
(range/variance), and outliers.
Example: A bell-shaped histogram of exam marks indicates most students scored near
the average.
7. Mean vs Hanning
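The mean summarizes the central tendency of a dataset in a single value, while Hanning
is a smoothing technique: each point in a time series is replaced by a weighted running
mean of itself and its two neighbours (weights 1/4, 1/2, 1/4).
Example: The mean is used for average salaries, while Hanning helps smooth noisy stock
price movements.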
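A minimal sketch of Hanning, assuming the usual weights of 1/4, 1/2, 1/4 and assuming the
two end values are simply copied across unchanged (one common convention). The function
name hanning and the sample series are illustrative only.

```python
def hanning(series):
    """Smooth a series with a 3-point weighted running mean (1/4, 1/2, 1/4)."""
    if len(series) < 3:
        return list(series)
    smoothed = [series[0]]                      # end value copied as-is
    for i in range(1, len(series) - 1):
        smoothed.append(0.25 * series[i - 1] + 0.5 * series[i] + 0.25 * series[i + 1])
    smoothed.append(series[-1])                 # end value copied as-is
    return smoothed


noisy = [10, 20, 10, 20, 10]                    # hypothetical zig-zag series
print(hanning(noisy))                           # [10, 15.0, 15.0, 15.0, 10]
print(sum(noisy) / len(noisy))                  # plain mean: one number, 14.0
```

The mean collapses the whole series to one number, while Hanning keeps one smoothed value
per time point.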
8. Easy way to visualize distribution
The simplest way is using a histogram or boxplot.
Example: A histogram of monthly incomes shows distribution frequency, while a boxplot
highlights the median, quartiles, and outliers.
9. Purpose of smoothing in time series
Smoothing removes random noise from data to highlight overall patterns and trends.
Example: A 12-month moving average of monthly sales averages out seasonal swings and
reveals the long-term trend more clearly.
10. Contingency Table
A contingency table displays counts for two categorical variables to show associations.
Example: Gender (male/female) vs shopping preference (online/offline) can be analyzed
using a 2×2 table.
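A contingency table is just a count of how often each pair of categories occurs. The
sketch below builds one from raw (row, column) pairs; the function name contingency_table
and the survey responses are made up for illustration.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (row_category, column_category) pairs into a nested dict."""
    pairs = list(pairs)
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical survey responses: (gender, shopping preference)
responses = [
    ("male", "online"), ("male", "offline"), ("male", "online"),
    ("female", "online"), ("female", "online"), ("female", "offline"),
]
table = contingency_table(responses)
for gender, row in table.items():
    print(gender, row)
```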
11. Purpose of Resistant Line
A resistant line is fitted to a scatterplot using medians, so it shows the overall trend
while being little affected by outliers.
Example: In hours studied vs marks, the resistant line still shows a positive trend even
if one student scored abnormally low.
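One way to fit such a line is a simplified version of Tukey's three-group method: sort the
points by x, take the medians of the left and right thirds to get the slope, then take the
median residual as the intercept. This is a sketch under those assumptions, not the only
definition; the function name and the marks data are hypothetical.

```python
from statistics import median


def resistant_line(xs, ys):
    """Return (slope, intercept) of a median-based line through the outer thirds."""
    points = sorted(zip(xs, ys))
    n = len(points)
    if n < 6:
        raise ValueError("need at least 6 points to form three groups")
    k = n // 3
    left, right = points[:k], points[-k:]
    x_left, y_left = median([x for x, _ in left]), median([y for _, y in left])
    x_right, y_right = median([x for x, _ in right]), median([y for _, y in right])
    if x_right == x_left:
        raise ValueError("x medians of the outer groups coincide")
    slope = (y_right - y_left) / (x_right - x_left)
    # Median residual keeps the intercept resistant to outliers as well.
    intercept = median([y - slope * x for x, y in points])
    return slope, intercept


hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
marks = [15, 25, 35, 45, 5, 65, 75, 85, 95]    # the student with 5 hours is an outlier
print(resistant_line(hours, marks))            # (10.0, 5.0)
```

The outlier at 5 hours does not change the fitted line, which is the point of using
medians.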
12. Standardized Variables
Standardization converts each value into a z-score, z = (value − mean) / SD, so the
standardized variable has mean 0 and SD 1.
Example: Converting maths and English test scores to z-scores allows fair comparison
despite different scoring scales.
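A short sketch of the z-score calculation, assuming the population SD is used (dividing by
n rather than n - 1). The function name z_scores and the sample values are illustrative.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and SD 1 using the population SD."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; SD is zero")
    return [(v - m) / sd for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical scores: mean 5, SD 2
print(z_scores(scores))                 # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```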
13. Interquartile Range (IQR)
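Dataset: 56, 65, 72, 75, 80, 82, 85, 88, 90, 95.
Taking Q1 and Q3 as the medians of the lower and upper halves: Q1 = 72, Q3 = 88 →
IQR = 88 − 72 = 16. The IQR measures the spread of the middle 50% of the data and helps
detect variability and outliers.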
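The calculation can be checked with a few lines of code. This sketch assumes the
"median of each half" rule for quartiles; software that interpolates between values can
give slightly different quartiles for the same data. The function name is hypothetical.

```python
from statistics import median


def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using medians of the lower and upper halves."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]          # for odd n the middle value is left out
    upper = data[-half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1


marks = [56, 65, 72, 75, 80, 82, 85, 88, 90, 95]
q1, q3, iqr = iqr_by_halves(marks)
print(q1, q3, iqr)                                   # 72 88 16
# Common rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)                # 48.0 112.0
```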
14. Why transformations are used
Transformations stabilize variance, reduce skewness, and make patterns clearer.
Example: Taking the log of income data compresses extreme high values, reducing right
skew and making comparisons fairer.
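To see the effect, here is a small sketch of a base-10 log transform. The incomes are
invented for illustration, and the helper name log_transform is an assumption.

```python
import math


def log_transform(values):
    """Return base-10 logs of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log10(v) for v in values]


incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000]   # one extreme earner
logged = log_transform(incomes)

# On the raw scale the top income is 50 times the lowest;
# on the log scale the values sit much closer together.
print(max(incomes) / min(incomes))
print([round(v, 2) for v in logged])
```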
15. Third variable affecting relationship
Sometimes a third variable changes the relationship between two others.
Example: Ice cream sales and drowning appear correlated, but temperature is the third
variable influencing both.
16. Correlation vs Causation
Correlation means two variables move together, but causation means one directly
influences the other.
Example: Height and weight are correlated; smoking causes lung cancer (causation).
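Correlation can be measured with Pearson's r, which ranges from -1 to +1. The sketch below
computes it by hand; the function name pearson_r and the height and weight figures are
hypothetical. A high r only shows that the variables move together, it says nothing about
which one (if either) causes the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the variables has no variation")
    return sxy / math.sqrt(sxx * syy)


heights = [150, 160, 170, 180, 190]      # cm, made-up data
weights = [52, 57, 68, 72, 85]           # kg, made-up data
print(round(pearson_r(heights, weights), 3))
```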
17. R Libraries for Visualization
Three popular libraries in R are:
• ggplot2 (for layered graphics),
• lattice (multivariate plots),
• plotly (interactive visualizations).
18. Relationship between two variables
The relationship may be positive, negative, or absent (no relationship).
Example: Hours studied and exam scores show a positive relationship, while stress and
productivity often show a negative relationship.
19. Features in histogram & data type
Histograms show the shape, spread, center, and outliers of continuous (numerical) data.
Example: A histogram of daily temperatures shows whether the values are roughly
normally distributed.
20. Causes of Resistant Lines
Resistant lines arise from fitting methods that reduce the influence of outliers,
typically by using medians instead of means.
Example: A median-based line fits better when extreme values exist in the data.
Unit 2 & 3
21. Basic methods of processing in API
The seven stages of visualizing data (Ben Fry’s framework) are: Acquire, Parse, Filter,
Mine, Represent, Refine, Interact.
Example: COVID-19 dashboards acquire case data, filter by country, and represent it as
charts.
22. Median in even vs odd datasets
For odd n, median = middle value. For even n, it is the average of two middle values.
Example: Dataset [3,5,7] → median = 5; [3,5,7,9] → median = (5+7)/2 = 6.
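The odd/even rule translates directly into code. A minimal sketch, with median_of as a
hypothetical helper name:

```python
def median_of(values):
    """Median: middle value for odd n, mean of the two middle values for even n."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


print(median_of([3, 5, 7]))       # 5
print(median_of([3, 5, 7, 9]))    # 6.0
```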
23. Work on EDA
Exploratory Data Analysis (EDA) was pioneered by John Tukey in the 1970s.
He introduced visual tools such as stem-and-leaf plots and boxplots, and emphasized
exploring data visually before formal modeling.
24. Purpose of Histogram in EDA
Histograms show the frequency distribution of data.
Example: In exam marks, a histogram reveals if scores are normally distributed or
skewed.
25. Regression line for prediction
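Regression finds the best-fit (least-squares) line relating a response Y to a predictor X.
Example: Predict marks (Y) from study hours (X) using Y = a + bX, where a is the
intercept and b is the slope. If b is positive, more study hours predict higher marks.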
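As an illustration, the intercept a and slope b of Y = a + bX can be found with the
ordinary least-squares formulas. The function name fit_line and the hours/marks data are
assumptions for this sketch.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variation; slope is undefined")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = my - b * mx            # intercept
    return a, b


hours = [1, 2, 3, 4, 5]                 # hypothetical study hours
marks = [52, 58, 61, 70, 74]            # hypothetical marks
a, b = fit_line(hours, marks)
print(round(a, 2), round(b, 2))         # 46.2 5.6
print(round(a + b * 6, 2))              # predicted marks for 6 hours: 79.8
```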
26. Median in even vs odd datasets (repeat)
Covered in Q22.
27. Twyman’s Law (repeat)
Covered in Q3.
28. Key contributions in EDA
John Tukey contributed boxplots, resistant statistics, and exploratory visualizations.
Example: Boxplots highlight outliers and spread effectively in social science research.
29. Histogram in EDA (repeat)
Covered in Q24.
30. Regression line predicting marks (repeat)
Covered in Q25.
31. Interpreting contingency table
Compare row percentages and column percentages to see whether the categories are
associated.
Example: A table of gender vs shopping mode shows whether preference differs by
gender.
32. Main use of Chi-square test
The chi-square test checks whether two categorical variables are independent by comparing
observed counts with the counts expected under independence.
Example: Checking if education level is independent of job satisfaction.
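The statistic is the sum of (observed - expected)^2 / expected over all cells, where
expected = row total × column total / grand total. A minimal sketch with made-up counts;
the function name is hypothetical, and it returns only the statistic and degrees of
freedom (the p-value would come from a chi-square table or a statistics package).

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical 2x2 counts: rows = gender, columns = online / offline
observed = [[30, 20],
            [20, 30]]
stat, dof = chi_square_statistic(observed)
print(stat, dof)     # 4.0 1
```

With 1 degree of freedom the 5% critical value is 3.841, so a statistic of 4.0 would lead
to rejecting independence at that level.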
33. Percentage Table
Converts frequencies to percentages, making group comparison easier.
Example: 60% of males vs 40% of females prefer online shopping.
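Converting counts to row percentages is a one-line calculation per cell. The sketch below
assumes the table is stored as a nested dict of counts; the counts are invented so that
they reproduce the 60% vs 40% figures in the example.

```python
def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        result[row_label] = {col: 100 * c / total for col, c in counts.items()}
    return result


counts = {
    "male":   {"online": 30, "offline": 20},
    "female": {"online": 20, "offline": 30},
}
for label, row in row_percentages(counts).items():
    print(label, row)
# male   {'online': 60.0, 'offline': 40.0}
# female {'online': 40.0, 'offline': 60.0}
```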
34. Parse & Refine in Visualization
Parse: Structure raw data into usable form.
Refine: Improve design for clarity (e.g., colors, labels).
Example: Parsing survey data, then refining chart with clear labels.
35. Tree Diagram Purpose
A tree diagram shows hierarchical relationships in branching form.
Example: A company’s organization chart or a decision tree in machine learning.
Now all 35 questions are in exam-ready Part A format.
Do you want me to also prepare Part B (5-mark) answers in ~150–200 words each,
or should we first finish a one-page revision sheet of all formulas/diagrams for quick
recall?