PDF For Ds

Uploaded by

TC Mathan
Copyright © All Rights Reserved

Can you give me the 100 questions? I will paste them into my Word document.

Certainly! Here are 100 important data science interview questions (with concise
answers), organized into logical sections. You can copy and paste these directly
into your Word document for preparation.

# Data Science Interview Questions and Answers

## Fundamentals

1. What is Data Science?


Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and
unstructured data.

2. Differentiate between Data Science, Data Analytics, and Machine Learning.


- Data Science: Full data pipeline, from gathering to modeling.
- Data Analytics: Focuses on analyzing data and reporting insights.
- Machine Learning: Creating models that learn patterns from data.

3. What is the difference between structured and unstructured data?


Structured data is organized (tables, databases). Unstructured data includes text,
images, audio, etc.

4. What is data wrangling?


The process of cleaning, structuring, and enriching raw data into a form suitable
for analysis.

5. What are the steps in a data science project?


Define problem, collect data, clean data, explore/analyze data, model building,
evaluation, deployment.

6. Name some real-life applications of Data Science.


Fraud detection, recommendation systems, healthcare diagnostics, sentiment
analysis, self-driving cars.

7. What is feature engineering?


Creating or transforming raw data features to improve model performance.

8. What is EDA (Exploratory Data Analysis)?


Analyzing datasets to summarize their main characteristics, often visually, before modeling.

9. What are categorical and numerical variables?


Categorical: represent types/groups. Numerical: represent quantities/values.

10. What is the importance of data visualization?


It helps understand data patterns, outliers, and trends, making communication
easier.

## Statistics and Probability

11. Define mean, median, and mode.


Mean: average. Median: middle value. Mode: most frequent value.

12. What is standard deviation?


Measures the spread of data around the mean.
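As a quick illustration of these summary statistics, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for the example, and the population versions (`pstdev`, `pvariance`) are used; sample versions (`stdev`, `variance`) would divide by n - 1 instead.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # arithmetic average
median = statistics.median(data)       # middle value of the sorted data
mode = statistics.mode(data)           # most frequent value
std_dev = statistics.pstdev(data)      # population standard deviation
variance = statistics.pvariance(data)  # population variance (std_dev squared)

print(mean, median, mode, std_dev, variance)
```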

13. Explain variance.


The average of squared differences from the mean; shows data dispersion.

14. What is the Central Limit Theorem?


The distribution of the sample mean approaches a normal distribution as the sample
size grows, whatever the shape of the underlying distribution (provided it has
finite variance).

15. What is correlation?


Measures the strength and direction of the (typically linear) association between two variables.

16. Difference between correlation and covariance.


Correlation is covariance divided by both standard deviations, so it is unitless and
lies between -1 and 1; covariance is unscaled and depends on the variables' units.
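The unit dependence can be checked with a small sketch. The helper functions and the height/weight numbers below are hypothetical and written in plain Python for illustration: converting heights from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements, for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50.0, 58.0, 65.0, 72.0, 80.0]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)   # about 100 times cov_m
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)  # same as corr_m

print(cov_m, cov_cm, corr_m, corr_cm)
```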

17. What is hypothesis testing?


A statistical method to determine if there’s enough evidence to reject a null
hypothesis.

18. Explain Type I and Type II errors.


Type I: False positive (rejecting a true null). Type II: False negative (failing to
reject a false null).

19. What is p-value?


The probability, assuming the null hypothesis is true, of obtaining a result at least
as extreme as the one observed.
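To make "at least as extreme" concrete, here is a sketch of an exact two-sided test for a fair coin. The scenario (60 heads in 100 flips), the function names, and the 0.05 threshold are assumptions chosen for the example.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(heads, flips):
    """Exact p-value for the null hypothesis that the coin is fair.

    Sums the probability of every outcome at least as far from the
    expected count (flips / 2) as the observed one.
    """
    observed_distance = abs(heads - flips / 2)
    return sum(
        binomial_pmf(k, flips)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )

p_value = two_sided_p_value(heads=60, flips=100)
alpha = 0.05                   # assumed significance level
reject_null = p_value < alpha  # reject only if the result is rare under the null

print(p_value, reject_null)
```

With these numbers the p-value comes out slightly above 0.05, so the null hypothesis would not be rejected at that level.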

20. What is a confidence interval?


A range, computed from sample data, that would contain the true population parameter
in a stated proportion (e.g., 95%) of repeated samples.
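One common way to compute such an interval for a mean is the normal approximation, sketched below. The function name and sample values are hypothetical; for small samples a t-distribution would usually be more appropriate than the z-value used here.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * std_err, mean + z * std_err

# Hypothetical measurements, for illustration only.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low, high = mean_confidence_interval(sample)
wide_low, wide_high = mean_confidence_interval(sample, confidence=0.99)

print(low, high)
```

Asking for more confidence (99% instead of 95%) widens the interval.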

## Machine Learning Basics

21. What is supervised learning?


Learning using labeled data; the model predicts outcomes based on known inputs and
outputs.

22. List some supervised learning algorithms.


Linear regression, logistic regression, decision trees, KNN, SVM, random forest.

23. What is unsupervised learning?


Learning from unlabeled data; models find hidden patterns or groupings.

24. List some unsupervised learning algorithms.


K-means clustering, hierarchical clustering, PCA, t-SNE, DBSCAN.

25. What is semi-supervised learning?


Uses both labeled and unlabeled data for training to improve learning accuracy.

26. What is reinforcement learning?


An agent learns by interacting with an environment, receiving rewards/penalties.

27. Name differences between classification and regression.


Classification predicts categories; regression predicts continuous values.

28. What is cross-validation?


A technique to validate model performance by splitting data into train/test sets
multiple times.
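A minimal sketch of k-fold splitting, the most common form of cross-validation, is shown below. The function name is hypothetical and no shuffling is done; libraries such as scikit-learn provide more complete versions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in a test fold exactly once, so each observation is used for both training and evaluation across the k rounds.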

29. Explain overfitting and underfitting.


Overfitting: Model learns noise; performs poorly on new data.
Underfitting: Model is too simple; can’t capture data patterns.

30. How can you prevent overfitting?


Use cross-validation, regularization, pruning (trees), gather more data, early
stopping.

## Algorithms and Models

31. What is linear regression?


A model predicting continuous outcomes based on input variables using a linear
relationship.

32. Assumptions of linear regression?


Linearity, independence, homoscedasticity, normality of errors.

33. What is logistic regression?


A model used for binary classification problems; outputs probabilities.

34. What is a Decision Tree?


A tree-like model of decisions; splits data on feature values for prediction.

35. How do you build a Decision Tree?


Select the best feature to split based on impurity; split recursively; stop at
certain depth or purity.
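The "best split based on impurity" step can be sketched for a single numeric feature using Gini impurity. The function names and the toy data are assumptions for illustration; a full tree would apply this search recursively to each resulting subset and across all features.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    # Skip the maximum value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: customer age and whether they bought a product.
ages = [22, 25, 30, 35, 40, 50]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(ages, bought)

print(threshold, impurity)
```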

36. What is pruning in Decision Trees?


Removing sections that provide little predictive power to prevent overfitting.

37. Explain Random Forest.


An ensemble of decision trees; combines outputs of multiple trees to improve
performance.

38. What is bagging?


Bootstrap Aggregating: trains multiple models on different samples; aggregates
results.

39. What is boosting?


Sequentially combines weak models to form a stronger model (e.g., AdaBoost,
XGBoost).

40. What is K-Nearest Neighbors?


A non-parametric method where prediction is based on the k closest data points.
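A minimal sketch of KNN classification by majority vote, in plain Python. The function name and the toy points are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.4, 6.6)))
```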

41. What is SVM (Support Vector Machine)?


A model that finds the maximum-margin boundary separating classes; uses kernels for
non-linear data.

42. What is Naive Bayes?


A probabilistic classifier based on Bayes’ theorem, assuming feature independence.

43. What is clustering?


Grouping similar data points without using labels; it is a form of unsupervised learning.

44. What is K-means clustering?


Partitions data into k clusters by minimizing the sum of squared distances between
points and their cluster's centroid.
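The usual procedure (Lloyd's algorithm) alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. Below is a sketch on 1-D data; the function name, the starting centroids, and the values are assumptions for illustration, and real implementations also handle initialization and higher dimensions.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # an empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical values forming two obvious groups.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])

print(centroids, clusters)
```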

45. What is PCA (Principal Component Analysis)?


Reduces data dimensionality by transforming to new axes with maximum variance.

## Model Evaluation

46. What is a confusion matrix?


A table showing true positives (TP), false positives (FP), true negatives (TN), and
false negatives (FN).

47. Define precision and recall.


Precision: TP / (TP + FP).
Recall: TP / (TP + FN).

48. What is F1-score?


The harmonic mean of precision and recall.

49. What is ROC-AUC?


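The Receiver Operating Characteristic curve plots true positive rate against false
positive rate across classification thresholds; the Area Under the Curve summarizes
it as one score (0.5 is random guessing, 1.0 is perfect).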

50. What is accuracy?


(Number of correct predictions) / (total predictions).

51. What is specificity?


TN / (TN + FP): Measures true negative rate.
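The confusion-matrix counts and the metrics built on them can be computed directly from true and predicted labels. This is a sketch with made-up labels and a hypothetical helper name; it assumes binary labels with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / len(y_true)
specificity = tn / (tn + fp)

print(tp, fp, fn, tn, precision, recall, f1, accuracy, specificity)
```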

52. How is model performance affected by imbalanced data?


Accuracy can be misleading; use metrics like F1, ROC-AUC.

53. How do you handle imbalanced datasets?


Oversample minority, undersample majority, use synthetic data, change metrics, use
ensemble methods.

54. What is cross-entropy loss?


Loss function measuring difference between predicted probability and actual label.
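For the binary case, the loss is the mean of -[y log(p) + (1 - y) log(1 - p)]. The sketch below is a plain-Python version; the function name, the clipping constant `eps`, and the example probabilities are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])

print(confident_right, confident_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones.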

55. How do you select important features?


Using correlation, feature importance, regularization (L1), wrapper methods, or model-based selection.

56. What is bias-variance trade-off?


Bias: error from incorrect assumptions; variance: error from sensitivity to small
fluctuations. Need a balance.

57. What are ROC and Precision-Recall curves?


ROC: trade-off between TPR and FPR; PR: precision vs recall.

58. What is grid search?


A way to systematically work through multiple parameter combinations to tune
models.
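A sketch of the idea using `itertools.product`. The `toy_score` function is a hypothetical stand-in for "train a model with these settings and return its validation score"; the parameter names and values are also assumptions.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid`; return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function that peaks at max_depth=4, learning_rate=0.1.
def toy_score(max_depth, learning_rate):
    return -((max_depth - 4) ** 2) - (learning_rate - 0.1) ** 2

grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)

print(best_params, best_score)
```

The number of combinations grows multiplicatively with each added parameter, which is why grid search becomes expensive for large grids.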

59. What is regularization?


Adding a penalty to model to discourage complexity and prevent overfitting (L1,
L2).
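The shrinking effect of an L2 penalty can be seen in the simplest possible case: one feature and no intercept, where minimizing sum((y - w*x)^2) + penalty * w^2 gives w = sum(x*y) / (sum(x^2) + penalty). The function name and data below are hypothetical, and L1 is not shown.

```python
def ridge_slope(xs, ys, l2_penalty):
    """Slope minimizing sum((y - w*x)^2) + l2_penalty * w^2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)

# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

penalties = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, penalty) for penalty in penalties]

for penalty, slope in zip(penalties, slopes):
    print(penalty, round(slope, 4))
```

A larger penalty pulls the coefficient toward zero, trading some fit on the training data for a simpler model.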

60. What is early stopping?


Halting training of a model when validation performance stops improving.
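A sketch of the stopping rule with a "patience" setting, applied to a list of per-epoch validation losses. The function name, the patience value, and the loss numbers are assumptions for illustration; in a real training loop the check would run after each epoch.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # rule never triggered: all epochs were run

# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch = early_stopping_epoch(losses, patience=2)

print(stop_epoch)
```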

## Python and Data Science Tools

61. Name key Python libraries for Data Science.


Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow, Keras, PyTorch.

62. What is Pandas used for?


Data manipulation and analysis using DataFrames.

63. How do you read a CSV file in Python?


Using pd.read_csv('filename.csv') from pandas.

64. How do you handle missing data in pandas?


Using functions like dropna() or fillna().

65. What is NumPy?


A library for numerical computations and handling arrays.

66. What is Matplotlib used for?


Creating visualizations/plots.

67. What is Scikit-learn?


Machine learning library for classical ML algorithms.

68. What is TensorFlow?


Open-source library for deep learning/neural network development.

69. What is Keras?


High-level deep learning API; historically run on top of TensorFlow, and since Keras 3 also on JAX and PyTorch.

70. How do you split a dataset into train/test sets?


Using train_test_split from scikit-learn.
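To show what such a split does, here is a plain-Python sketch. The function `simple_train_test_split` is a hypothetical stand-in written for this example, not scikit-learn's implementation, which offers more options such as stratification.

```python
import random

def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of `rows` and split off the last `test_size` fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_size)
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]

train_rows, test_rows = simple_train_test_split(range(20), test_size=0.25)

print(len(train_rows), len(test_rows))
```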

71. How do you find missing values in a DataFrame?


Using isnull().sum().

72. What is data serialization?


Converting in-memory objects into a storable format so they can be reloaded later (e.g., using pickle, joblib).

73. How to merge DataFrames in pandas?


Using merge(), join(), or concat().
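The pandas calls mentioned in the answers above (`read_csv`, `isnull().sum()`, `fillna()`, `dropna()`, `merge()`) can be tried together in a short sketch. The CSV text, column names, and fill values are made up for the example, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "id,age,city\n1,34,Pune\n2,,Delhi\n3,29,\n"
people = pd.read_csv(io.StringIO(csv_text))

missing_per_column = people.isnull().sum()  # count of missing values per column
filled = people.fillna({"age": people["age"].mean(), "city": "unknown"})
complete_rows = people.dropna()             # keep only rows with no missing values

# A second hypothetical table to merge on the shared "id" column.
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [250, 100, 75]})
merged = filled.merge(orders, on="id", how="left")  # keep every person, matched or not

print(missing_per_column)
print(merged)
```

A left merge keeps people without orders (their `amount` is missing) and repeats people who have several orders.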

74. What is a Jupyter Notebook?


An interactive web tool for coding, visualizations, and documentation.

75. How do you plot a histogram in matplotlib?


Using plt.hist(data).

## Big Data and Ecosystem

76. What is Hadoop?


An open-source framework for distributed storage/processing of big data.

77. What is MapReduce?


Programming model for processing large datasets with parallel/distributed
algorithms.
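The classic illustration is word counting. The sketch below runs the map, shuffle, and reduce phases in a single Python process; the function names and documents are hypothetical, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big ideas", "data beats ideas"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))

print(word_counts)
```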

78. What is Spark?


A big data processing framework that keeps intermediate data in memory, which typically
makes it faster than Hadoop’s MapReduce.

79. What are NoSQL databases?


Non-relational databases (MongoDB, Cassandra) for flexible, large-scale storage.

80. What is HDFS?


Hadoop Distributed File System for storing huge data files.

81. Difference between SQL and NoSQL?


SQL: structured, relational; NoSQL: flexible schema, non-relational.

82. What is a data lake?


Centralized repository for storing structured and unstructured data.

83. What is the purpose of Apache Hive?


Warehouse software for querying and managing large datasets in distributed storage.

84. What is schema-on-read?


Applying a schema when the data is read, rather than enforcing one when it is written.

85. What is real-time data processing?


Analyzing data as it arrives instead of waiting for batch processing.

86. What is batch processing?


Processing data in groups or batches at intervals.

87. List common visualization tools.


Matplotlib, Seaborn, Tableau, Power BI, Plotly.

88. What is OLAP?


Online Analytical Processing: enables fast analysis of multi-dimensional data.

89. What is ETL?


Extract, Transform, Load: the process of moving and preparing data from sources into storage.

## Advanced Topics

90. What is deep learning?


A subset of machine learning using neural networks with multiple layers.

91. When would you use a CNN vs. an RNN?


CNN for image data, RNN for sequential/time-series/text data.

92. What is an autoencoder?


A neural network to learn efficient data encoding/decoding, often for
dimensionality reduction or anomaly detection.

93. What is transfer learning?


Using a pretrained model on a new, related problem to improve performance.

94. What is NLP (Natural Language Processing)?


Domain of extracting, processing, and modeling text/language data.

95. What are word embeddings?


Vector representations of words capturing context and meaning (e.g., Word2Vec,
GloVe).

96. What is time series analysis?


Study of data points collected/observed at specific time intervals.

97. What is LSTM?


Long Short-Term Memory: a type of RNN for capturing long-term dependencies in
sequence data.

98. What is GAN?


Generative Adversarial Network: two neural networks (generator, discriminator)
competing in a zero-sum game.

99. What is hyperparameter tuning?


Searching for the best hyperparameters (learning rate, tree depth, etc.) for improved
performance.

## Scenario-Based Questions

100. Given an imbalanced dataset, how would you improve model performance?


Use sampling techniques, try different metrics, algorithmic changes, or ensemble
models.

Feel free to copy and paste this collection into your Word document for your
interview preparation! If you need detailed answers for any specific question, just
ask.
