What is Big Data?
Big data refers to datasets so large and complex that traditional data-processing techniques
struggle to manage, process, or analyze them. It is commonly characterized by five properties:
volume, velocity, variety, veracity, and value.
What are the 5 V’s of Big Data?
1. Volume: The amount of data being generated and processed.
2. Velocity: The speed at which data is generated, collected, and analyzed.
3. Variety: The diversity of data types (structured, unstructured, and semi-structured).
4. Veracity: The reliability and accuracy of the data.
5. Value: The meaningful insights and benefits derived from data.
What is Machine Learning?
Machine learning is a subset of artificial intelligence (AI) that focuses on creating systems
that learn and improve from experience without being explicitly programmed for each task. It
uses algorithms and statistical models to find patterns in data and to make predictions or
decisions from them.
Importance of Data Science
Data science is critical because it enables organizations to:
- Extract meaningful insights from raw data.
- Enhance decision-making with data-driven strategies.
- Optimize business operations and customer experiences.
- Solve complex problems in various fields like healthcare, finance, and technology.
- Predict trends and future outcomes using advanced analytics.
Define Data Manipulation
Data manipulation involves modifying, organizing, or transforming data to make it more useful
for analysis. This process includes cleaning, filtering, sorting, aggregating, and reshaping data to
gain meaningful insights.
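A minimal sketch of three of these steps (filtering, sorting, and aggregating) using pandas. The sales DataFrame and its column names are invented for illustration and are not part of the notes above.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration only
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "units": [12, 5, 7, 20, 9],
})

# Filtering: keep rows with more than 6 units
filtered = sales[sales["units"] > 6]

# Sorting: order the remaining rows from largest to smallest
ordered = filtered.sort_values("units", ascending=False)

# Aggregating: total units per region
totals = ordered.groupby("region")["units"].sum()

print(totals)
```

Each step returns a new object, so the original sales data stays unchanged.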
What is NumPy?
NumPy (Numerical Python) is a Python library used for numerical and scientific computing. It
provides support for:
- Multi-dimensional arrays and matrices.
- Mathematical functions for data operations.
- High-performance tools for manipulating large datasets.
How to Reshape Arrays?
Reshaping in NumPy changes an array's shape without altering its data. The new shape must hold
the same total number of elements as the original.
Example:
import numpy as np
# Original array
arr = np.array([1, 2, 3, 4, 5, 6])
# Reshaping into a 2x3 array
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
 [4 5 6]]
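As a further sketch, reshape() can infer one dimension when it is given as -1, and it raises ValueError when the requested shape does not match the number of elements. The variable names here are illustrative.

```python
import numpy as np

arr = np.arange(1, 7)          # array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the array size
inferred = arr.reshape(-1, 2)  # shape (3, 2)

# A shape whose element count differs from arr.size is rejected
try:
    arr.reshape(4, 2)
    error_raised = False
except ValueError:
    error_raised = True

print(inferred.shape, error_raised)
```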
What is Meant by Absolute Value?
The absolute value of a number is its magnitude, that is, its distance from zero without regard
to its sign. In Python, the built-in abs() function calculates the absolute value.
Example:
print(abs(-5)) # Output: 5
print(abs(7)) # Output: 7
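For whole arrays, NumPy offers np.abs, which works element by element. A short sketch with made-up values:

```python
import numpy as np

values = np.array([-3, 0, 4, -1.5])

# np.abs applies the absolute value to every element at once
magnitudes = np.abs(values)

# The built-in abs() also accepts complex numbers and returns the modulus
modulus = abs(3 + 4j)

print(magnitudes, modulus)
```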
Categories of Basic Array Manipulation in NumPy
1. Reshaping Arrays: Changing the shape of arrays (e.g., .reshape()).
2. Joining Arrays: Combining arrays (e.g., np.concatenate()).
3. Splitting Arrays: Dividing arrays into sub-arrays (e.g., np.split()).
4. Broadcasting: Performing operations on arrays of different shapes.
5. Transposing: Reversing or swapping axes (e.g., .T or np.transpose()).
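The categories above can be sketched on two small arrays. The arrays a and b are illustrative only; reshaping is shown separately in these notes.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Joining: stack b under a along axis 0 (rows)
joined = np.concatenate([a, b], axis=0)     # shape (3, 2)

# Splitting: cut the joined array into three one-row arrays
parts = np.split(joined, 3, axis=0)

# Broadcasting: the 1-D row [10, 20] is stretched across both rows of a
shifted = a + np.array([10, 20])

# Transposing: swap rows and columns
flipped = a.T

print(joined.shape, len(parts), shifted.tolist(), flipped.tolist())
```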
Example for Index Alignment in Series
Index alignment ensures that operations between two pandas Series are performed based on
matching index labels.
Example:
import pandas as pd
# Creating two Series
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
Output:
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
Labels present in only one Series ('a' and 'd' here) produce NaN, which is also why the result
is float64 rather than int64.
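If NaN is not wanted for unmatched labels, one option is the Series.add method with fill_value, which treats a label missing from one Series as the given value. A sketch using the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# fill_value=0 substitutes 0 for a label missing from one of the Series
filled = s1.add(s2, fill_value=0)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for some analyses leaving NaN in place is the safer choice.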
1. What is NumPy?
NumPy (Numerical Python) is a Python library widely used for numerical and scientific
computing. It provides:
- Support for multi-dimensional arrays and matrices.
- A collection of mathematical functions to operate on arrays.
- Fast computation for large datasets due to its C-based implementation.
2. How to Reshape Arrays?
Reshaping allows you to change the dimensions of an array without altering its data.
Example:
import numpy as np
# Original array
arr = np.array([1, 2, 3, 4, 5, 6])
# Reshape into a 2x3 matrix
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
Output:
[[1 2 3]
 [4 5 6]]
3. What is Meant by Absolute Value?
Absolute value is the non-negative value of a number, regardless of its sign. In Python, the
abs() function computes the absolute value.
Example:
print(abs(-10)) # Output: 10
print(abs(7)) # Output: 7
4. Categories of Basic Array Manipulation in NumPy
1. Reshaping Arrays: Change the shape of arrays (.reshape()).
2. Concatenating Arrays: Combine arrays (np.concatenate()).
3. Splitting Arrays: Divide arrays into sub-arrays (np.split()).
4. Transposing: Swap or reverse axes of arrays (.T or np.transpose()).
5. Broadcasting: Perform operations on arrays of different shapes.
5. Example of Index Alignment in Series
Index alignment ensures operations on pandas Series are based on matching index labels.
Example:
import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
Output:
a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
Labels that appear in only one Series result in NaN.
6. Two Commonly Used Data Structures in Pandas
1. Series: A one-dimensional labeled array capable of holding any data type.
2. DataFrame: A two-dimensional labeled data structure, similar to a table or spreadsheet.
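A short sketch of both structures. The names, ages, and cities are made up for illustration.

```python
import pandas as pd

# Series: one column of values with an index of labels
ages = pd.Series([29, 34, 41], index=["ana", "ben", "cy"], name="age")

# DataFrame: several columns sharing one index
people = pd.DataFrame({
    "age": ages,
    "city": pd.Series(["Lima", "Oslo", "Pune"], index=["ana", "ben", "cy"]),
})

# Selecting a single column of a DataFrame gives back a Series
age_column = people["age"]
print(type(age_column).__name__, people.shape)
```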
7. Define Boolean Indexing
Boolean indexing filters a pandas object with a boolean mask. A condition such as data > 25
produces a Series of True/False values, and only the entries marked True are kept.
Example:
import pandas as pd
data = pd.Series([10, 20, 30, 40, 50])
filtered_data = data[data > 25]
print(filtered_data)
Output:
2    30
3    40
4    50
dtype: int64
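Conditions can also be combined. A sketch on the same Series, using & for "and" and ~ for "not"; each condition needs its own parentheses because of operator precedence.

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50])

# Combine conditions with & (and) or | (or); wrap each one in parentheses
middle = data[(data > 15) & (data < 45)]

# ~ negates a mask, keeping everything the mask would drop
not_middle = data[~((data > 15) & (data < 45))]

print(middle.tolist(), not_middle.tolist())
```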
8. Example for Working with Time Series
pandas supports time-indexed data through the DatetimeIndex. pd.date_range generates evenly
spaced timestamps that can serve as the index of a Series.
Example:
import pandas as pd
# Creating a time series
date_range = pd.date_range('2024-01-01', periods=5, freq='D')
time_series = pd.Series([100, 200, 300, 400, 500], index=date_range)
print(time_series)
Output:
2024-01-01    100
2024-01-02    200
2024-01-03    300
2024-01-04    400
2024-01-05    500
Freq: D, dtype: int64
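Two common operations on such a Series are slicing by date and resampling. A sketch reusing the series above; the 2-day bin size is an arbitrary choice for illustration.

```python
import pandas as pd

date_range = pd.date_range('2024-01-01', periods=5, freq='D')
time_series = pd.Series([100, 200, 300, 400, 500], index=date_range)

# Label-based slicing with date strings (both endpoints are included)
window = time_series['2024-01-02':'2024-01-04']

# Resampling: group the daily values into 2-day bins and sum each bin
two_day_totals = time_series.resample('2D').sum()

print(window.tolist(), two_day_totals.tolist())
```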
9. Differentiate Between Concat and Append
Aspect       Concat                                          Append
Usage        Combines DataFrames along a particular axis.    Adds rows to a DataFrame or Series.
Parameters   Offers more flexibility with axis, keys, etc.   Simpler, often used for appending rows.
Efficiency   Faster for combining multiple objects.          Less efficient for large datasets.
Note: DataFrame.append and Series.append were deprecated in pandas 1.4 and removed in pandas
2.0, so current code should use pd.concat.
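A minimal sketch of pd.concat along both axes. The two small DataFrames are invented for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "score": [90, 85]})
df2 = pd.DataFrame({"id": [3, 4], "score": [70, 95]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1
rows = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([df1, df2], axis=1)

print(rows.shape, side_by_side.shape)
```

Passing all frames in one list lets pandas build the result in a single step, which is generally preferable to adding rows one at a time in a loop.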
10. What are Contour Plots?
Contour plots represent a three-dimensional surface z = f(x, y) in two dimensions. Contour
lines, or filled color regions, connect points that share the same z value.
Example (using Matplotlib):
import numpy as np
import matplotlib.pyplot as plt
# Define data
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
# Plot contour
plt.contour(X, Y, Z, levels=10, cmap='viridis')
plt.colorbar()
plt.title("Contour Plot")
plt.show()
1. What is Big Data?
Big Data refers to extremely large, complex, and diverse datasets that traditional data processing
methods cannot handle efficiently. Storing, processing, and analyzing such data to extract
meaningful insights requires specialized tools and techniques.
2. What are the 5 V’s of Big Data?
1. Volume: The vast amount of data generated every second from sources like social media,
IoT devices, and sensors.
2. Velocity: The speed at which data is generated, collected, and processed.
3. Variety: The diversity in data formats, including structured (databases), unstructured
(text, images), and semi-structured (JSON, XML).
4. Veracity: The accuracy and reliability of the data.
5. Value: The usefulness and insights derived from the data for decision-making.
3. What is Machine Learning?
Machine Learning (ML) is a subset of artificial intelligence (AI) focused on building algorithms
that allow systems to learn from data and improve performance over time without being
explicitly programmed for each task. Common ML applications include image recognition,
natural language processing, and predictive analytics.
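As a very small illustration of learning from data, the sketch below fits a straight line to a few made-up points by least squares with NumPy and then predicts an unseen input. It is only a toy example; real ML work typically uses dedicated libraries and far more data.

```python
import numpy as np

# Made-up training data that roughly follows y = 2x + 1
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# "Learning": estimate slope and intercept by least squares
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# "Predicting": apply the fitted line to an input not seen in training
prediction = slope * 5.0 + intercept
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

The parameters come from the data rather than from hand-written rules, which is the sense of "without explicit programming" here.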
4. Importance of Data Science
Data science plays a vital role in various domains by enabling:
- Decision-making: Using data-driven insights to improve processes and strategies.
- Problem-solving: Identifying trends, patterns, and solutions in large datasets.
- Business growth: Optimizing operations and enhancing customer experiences.
- Automation: Building AI models for tasks like recommendation systems and chatbots.
- Advancing research: In fields like healthcare, finance, and technology.
5. Define Data Mining
Data mining is the process of discovering patterns, trends, and actionable insights from large
datasets using techniques like machine learning, statistical analysis, and database systems. It is
often used in applications such as fraud detection, market analysis, and customer segmentation.
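One simple pattern-discovery task is finding items that are often bought together (market-basket analysis). The sketch below counts co-occurring pairs using only the standard library. The baskets and the 50% support threshold are assumptions made for illustration, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs that occur in at least half of the baskets
min_support = len(baskets) / 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```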