Azure Data Engineering
Interview Q&A (0–3 Years)
Part 2
15. Have you heard of the persist command? What is its purpose?
16. You have a job with some performance issues. There are some joins that are
taking too much time. What will be your approach?
17. Apart from broadcast, are there other ways to optimize joins?
18. Why do you think partitioning will improve performance?
19. Have you worked with pandas?
20. How are you running your Spark jobs?
21. Do you have anything to ask?
22. What are the key differences between RDDs, DataFrames, and Datasets in
PySpark?
23. What are common issues in PySpark and how do you resolve them?
24. Describe how you would optimize a PySpark job that is running slowly. What
are the key factors you would look at?
25. What is the small file issue in Spark, and how do you resolve it?
26. Assume you have a dataset of 500 GB that needs to be processed on a Spark
cluster. The cluster has 10 nodes, each with 64 GB of memory and 16 cores. How
would you allocate resources for your Spark job?
27. How do you debug and fix Out of Memory errors in Spark?
28. How do you diagnose and improve Spark application performance?
29. What is coalesce in Spark?
30. What is repartition in Spark?
31. What is the difference between cache() and persist() in Spark?
32. What are actions and transformations in Spark?
33. What is the difference between map and flatMap in Spark?
34. When will shuffling happen in Spark?
35. How do you read a Spark file with a delimiter | or \t in a DataFrame?
36. Explain a scenario where you apply an optimization technique in Spark.
37. How do you find duplicates from a table in Spark?
38. How do you find the 2nd highest salary from a table?
39. How do you use LAG to add the previous ID to the next row?
40. How do you perform a Left Join in PySpark?
41. How do you perform a Right Join in PySpark?
42. How do you perform an Inner Join in PySpark?
43. How do you perform a Full Outer Join in PySpark?
44. How do you perform a Left Anti Join in PySpark?
45. How do you perform a Left Semi Join in PySpark?
46. How do you perform a Cross Join in PySpark?
47. What is an Anti Join in PySpark?
48. Given two DataFrames df1 and df2, how do you:
1. Get the sum of total items grouped on charid
2. Get the sum of sales grouped on sales_units?
1. Have you heard of the persist command? What is its purpose?
Answer: Yes, the persist command in Spark is used to store an RDD or DataFrame in memory or
on disk across operations. This helps in improving the performance of iterative algorithms and
interactive queries by avoiding recomputation of the data.
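A minimal sketch of using persist to avoid recomputation (the file path and the "value" column are placeholders for illustration):
from pyspark.sql import SparkSession
from pyspark import StorageLevel
spark = SparkSession.builder.appName("PersistExample").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
# Persist so that both actions below reuse the materialized data
# instead of re-reading and re-parsing the file
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())                              # first action materializes the data
print(df.filter(df["value"] > 100).count())    # second action reuses the persisted data
df.unpersist()                                 # release storage when no longer needed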
2. You have a job with some performance issues. There are some joins that are taking too
much time. What will be your approach?
Answer: To address performance issues with joins, I would:
1. Use broadcast joins for small tables to avoid shuffling large datasets (see the sketch after this list).
2. Ensure proper partitioning of the data to minimize data movement.
3. Optimize the join conditions and filter out unnecessary data before the join.
4. Use caching or persisting intermediate results to avoid recomputation.
5. Tune Spark configurations such
as spark.sql.shuffle.partitions and spark.executor.memory.
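A minimal sketch combining a few of these steps; the file paths, the "key" and "status" columns, and the shuffle partition count are assumptions for illustration:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col
spark = SparkSession.builder.appName("JoinTuning").getOrCreate()
# Tune shuffle parallelism for the data volume (assumed value)
spark.conf.set("spark.sql.shuffle.partitions", "200")
large_df = spark.read.parquet("path_to_large_table")
small_df = spark.read.parquet("path_to_small_table")
# Filter before the join so less data is shuffled
filtered_large = large_df.filter(col("status") == "active")
# Broadcast the small table to avoid shuffling the large one
joined = filtered_large.join(broadcast(small_df), on="key", how="inner")
joined.cache()    # persist if the result is reused downstream
joined.count()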
3. Apart from broadcast, are there other ways to optimize joins?
Answer: Yes, other ways to optimize joins include:
1. Using sort-merge join for large datasets that are already sorted.
2. Using shuffle hash join for smaller datasets.
3. Ensuring proper indexing and partitioning of the data.
4. Using bucketing to colocate related data in the same partitions (see the sketch below).
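A minimal bucketing sketch, assuming two tables joined on customer_id; the table names and the bucket count of 16 are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BucketedJoin").getOrCreate()
df_orders = spark.read.parquet("path_to_orders")
df_customers = spark.read.parquet("path_to_customers")
# Write both tables bucketed (and sorted) by the join key
df_orders.write.bucketBy(16, "customer_id").sortBy("customer_id") \
    .mode("overwrite").saveAsTable("orders_bucketed")
df_customers.write.bucketBy(16, "customer_id").sortBy("customer_id") \
    .mode("overwrite").saveAsTable("customers_bucketed")
# Joining the bucketed tables on customer_id can now avoid a full shuffle
joined = spark.table("orders_bucketed").join(spark.table("customers_bucketed"), "customer_id")
joined.explain()    # the plan should show no Exchange on the bucketed sides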
4. Why do you think partitioning will improve performance?
Answer: Partitioning improves performance by:
1. Reducing the amount of data shuffled across the network.
2. Enabling parallel processing of data partitions.
3. Improving data locality by colocating related data.
4. Allowing efficient filtering and partition pruning during query execution (see the sketch below).
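A minimal sketch of writing data partitioned by a column and reading it back with a partition filter; the paths and the "country" column are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("PartitionPruning").getOrCreate()
df = spark.read.csv("path_to_input", header=True, inferSchema=True)
# Write the data partitioned by country; each country value becomes its own directory
df.write.mode("overwrite").partitionBy("country").parquet("path_to_partitioned_output")
# A query that filters on the partition column only reads the matching directories
india_df = spark.read.parquet("path_to_partitioned_output").filter(col("country") == "IN")
india_df.explain()    # the plan shows PartitionFilters, i.e. partition pruning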
5. Have you worked with pandas?
Answer: Yes, I have worked with pandas for data manipulation and analysis in Python. It
provides powerful data structures like DataFrames and Series, which are useful for handling
structured data.
6. How are you running your Spark jobs?
Answer: I run Spark jobs using various methods, including:
1. Submitting jobs with the spark-submit command (see the example below).
2. Using notebooks such as Jupyter or Databricks.
3. Scheduling jobs with workflow managers like Apache Airflow.
4. Running jobs on cloud platforms like AWS EMR or Azure HDInsight.
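An illustrative spark-submit invocation; the master URL, resource values, and script name are placeholders, not a recommendation:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 5 \
  --executor-memory 8g \
  --driver-memory 4g \
  my_job.py arg1 arg2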
7. Do you have anything to ask?
Answer: Yes, I would like to know more about the specific challenges and goals of your data
engineering projects. Additionally, I am interested in understanding the team's workflow
and the tools and technologies you use.
8. What are the key differences between RDDs, DataFrames, and Datasets in PySpark?
Spark Resilient Distributed Datasets (RDD), DataFrame, and Datasets are key abstractions in
Spark that enable us to work with structured data in a distributed computing environment. Even
though they are all ways of representing data, they have key differences:
RDDs are low-level APIs that lack a schema and offer fine-grained control over the data. They are immutable, distributed collections of objects.
DataFrames are high-level APIs built on top of RDDs and optimized by Spark's Catalyst engine, but they are not type-safe. They organize structured and semi-structured data into named columns.
Datasets combine the benefits of RDDs and DataFrames: they are high-level APIs that provide a type-safe abstraction with compile-time type checking. They are available in Scala and Java; in PySpark, the DataFrame is the primary structured API.
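A minimal PySpark sketch contrasting the RDD and DataFrame APIs on the same data (the typed Dataset API is not available in PySpark); the sample rows are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()
sc = spark.sparkContext
# RDD: low-level, no schema; you work with raw Python objects and positional access
rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])
adults_rdd = rdd.filter(lambda row: row[1] > 40)    # no optimizer involved
print(adults_rdd.collect())
# DataFrame: named columns plus the Catalyst optimizer
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
adults_df = df.filter(df["age"] > 40)               # column-based, optimized plan
adults_df.show()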
9. Common Issues in PySpark and How to Resolve Them
1. Environment Setup Issues
Problem: PySpark not installed or environment variables not set correctly.
Solution:
Install PySpark using pip install pyspark.
Set JAVA_HOME, HADOOP_HOME, and SPARK_HOME environment variables
properly.
2. Out of Memory Errors
Problem: Tasks running out of memory due to large data volumes.
Solution:
Optimize the number of partitions using repartition() or coalesce().
Increase executor memory (--executor-memory) and driver memory (--driver-memory) in the configuration.
3. Skewed Data
Problem: Uneven data distribution causing slow performance.
Solution:
Use the salting technique to balance partitions (a salting sketch appears after this list).
Use broadcast join for small datasets.
4. Shuffle Performance Bottlenecks
Problem: Excessive shuffling during operations like groupBy or join.
Solution:
Use narrow transformations like map and filter where possible.
Adjust spark.sql.shuffle.partitions to reduce shuffle partitions.
5. Serialization Issues
Problem: Incorrect serialization causing errors or slowdowns.
Solution:
Use Kryo serialization by
setting spark.serializer to org.apache.spark.serializer.KryoSerializer.
Register custom classes if required for better performance.
6. Schema Mismatch
Problem: Input data schema not matching the expected schema.
Solution:
Define schemas explicitly using StructType instead of inferring.
Validate schema compatibility before processing.
7. Slow UDF Performance
Problem: Python UDFs slowing down processing.
Solution:
Use PySpark’s built-in functions instead of UDFs when possible.
Switch to pandas UDFs for better performance.
8. Dependency Conflicts
Problem: Version mismatches between PySpark, Hadoop, or libraries.
Solution:
Ensure compatible versions of Spark, Hadoop, and Python are installed.
Use virtual environments to manage dependencies.
9. Debugging Challenges
Problem: Limited visibility into distributed jobs.
Solution:
Use explain() to analyze query execution plans.
Enable Spark UI for monitoring job execution and troubleshooting.
10. File Handling Issues
Problem: Errors while reading/writing data to/from storage.
Solution:
Ensure correct file paths and permissions.
Use supported file formats like Parquet or ORC for better performance.
11. Inefficient Partitioning
Problem: Too many or too few partitions affecting performance.
Solution:
Use df.rdd.getNumPartitions() to check partition count.
Adjust partitions using repartition() for better parallelism.
12. Ambiguous Column References
Problem: Errors during operations due to duplicate column names in joins.
Solution:
Rename columns before joining using withColumnRenamed().
13. Catalyst Optimizer Limitations
Problem: PySpark's optimizer fails to optimize complex queries.
Solution:
Simplify the query logic.
Use caching (df.cache()) for repeated computations.
14. Missing Dependencies in Cluster
Problem: Errors due to missing Python or Java libraries on cluster nodes.
Solution:
Use --py-files to distribute Python dependencies.
Ensure all cluster nodes have the required dependencies installed.
15. Long Execution Time
Problem: Jobs taking too long to execute.
Solution:
Profile and optimize transformations.
Cache intermediate results to avoid recomputation.
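As referenced in the skewed-data item (3) above, a minimal salting sketch; the table paths, the skewed "key" column, and the salt factor of 10 are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, concat_ws, explode, floor, lit, rand
spark = SparkSession.builder.appName("SaltingExample").getOrCreate()
large_df = spark.read.parquet("path_to_skewed_table")       # skewed on "key"
small_df = spark.read.parquet("path_to_dimension_table")    # joined on "key"
SALT_BUCKETS = 10    # assumed salt factor
# Add a random salt to the skewed side so one hot key spreads across several partitions
salted_large = (large_df
    .withColumn("salt", floor(rand() * SALT_BUCKETS).cast("string"))
    .withColumn("salted_key", concat_ws("_", col("key"), col("salt"))))
# Replicate each row of the small side once per salt value so every salted key finds a match
salted_small = (small_df
    .withColumn("salt", explode(array([lit(str(i)) for i in range(SALT_BUCKETS)])))
    .withColumn("salted_key", concat_ws("_", col("key"), col("salt"))))
joined = salted_large.join(salted_small, on="salted_key", how="inner")
joined.count()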
10. Describe how you would optimize a PySpark job that is running slowly. What are the key
factors you would look at?
If a PySpark job is running slowly, there are several aspects we can improve to optimize its
performance:
Ensuring a proper size and number of data partitions to minimize data shuffling
during transformations.
Using DataFrames instead of RDDs, because DataFrames benefit from Spark's built-in
optimizations (the Catalyst optimizer and Tungsten execution engine).
Using broadcasting joins and broadcast variables for joining a small dataset with a
larger dataset.
Caching and persisting intermediate DataFrames that are reused (see the sketch after this list).
Adjusting the number of partitions, executor cores, and instances to effectively use
cluster resources.
Choosing the appropriate file formats to minimize data size.
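A small sketch combining a couple of these ideas; the paths, column names, and partition count are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SlowJobTuning").getOrCreate()
# Tune shuffle parallelism for the data volume (assumed value)
spark.conf.set("spark.sql.shuffle.partitions", "400")
# Prefer columnar formats such as Parquet over CSV to cut I/O
df = spark.read.parquet("path_to_input")
# Cache an intermediate DataFrame that several downstream steps reuse
cleaned = df.dropna().dropDuplicates()
cleaned.cache()
daily = cleaned.groupBy("event_date").count()
by_user = cleaned.groupBy("user_id").count()
daily.show()
by_user.show()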
11. What is the small file issue in Spark, and how do you resolve it?
The Small File Problem is a common issue in Apache Spark that can hurt the performance of
Spark applications. It occurs when Spark jobs process a large number of small files, each
typically much smaller than the block size of the Hadoop Distributed File System (HDFS).
Causes of the Small File Problem
The Small File Problem in Spark can occur due to the following reasons:
Input data partitioning: When input data is partitioned in a way that creates many small
partitions, each partition may contain small files, which can lead to the Small File
Problem.
Data generation process: If the data generation process generates many small files
instead of larger ones, it can lead to the Small File Problem.
Data ingestion process: If the data ingestion process writes data in a manner that
creates many small files, it can lead to the Small File Problem.
Resolving the Small File Problem
There are several ways to resolve the Small File Problem in Spark. Here are some of the most
common methods:
1. Combine Small Files: You can combine small files into larger files using repartition
(or coalesce) in Spark. By combining small files, you reduce the number of tasks
required to process the data, which can improve performance (see the compaction sketch after this list).
2. Increase Block Size: Increasing the block size of HDFS can also help reduce the
number of small files. By increasing the block size, you can ensure that files are written
in larger blocks, reducing the number of files in the directory.
3. Use Sequence Files: Sequence files are a file format that can store multiple small files
in a single file. By using sequence files, you can reduce the number of files in a directory
and improve performance.
4. Use Hadoop Archive Files: Hadoop archive files (HAR) are another file format that can
store multiple small files in a single file. By using HAR files, you can reduce the number
of files in a directory and improve performance.
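As referenced in resolution 1, a minimal compaction sketch that rewrites a directory of many small files into fewer, larger files; the paths and the target of 16 output files are assumptions:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CompactSmallFiles").getOrCreate()
# Read the directory that contains many small files
df = spark.read.parquet("path_to_small_files_dir")
print("Input partitions:", df.rdd.getNumPartitions())
# Rewrite the data as a small number of larger files
# (coalesce(16) would also work and avoids a full shuffle)
df.repartition(16).write.mode("overwrite").parquet("path_to_compacted_dir")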
12. Assume you have a dataset of 500 GB that needs to be processed
on a Spark cluster. The cluster has 10 nodes, each with 64 GB of memory
and 16 cores. How would you allocate resources for your Spark job?
Resource allocation in Spark depends on multiple factors such as the size of the dataset, the
complexity of computations, and the configuration of the cluster. Here are some guidelines for
different scenarios:
Scenario 1: General Guidelines
Number of Executors: Reserve some resources for the Operating System and Hadoop
daemons. For example, reserve 1 core and 1 GB per node. This leaves 15 cores and 63
GB per node for Spark. To avoid network I/O during shuffles, it's best to have as many
executors as nodes. So, you could have 10 executors.
Memory per Executor: Allocate the total memory available per node, i.e., 63 GB.
Cores per Executor: 15 cores are available per node. For better concurrency, assign
around 5 cores per executor, which gives around 3 executors per node.
Driver Memory: The driver program can be run on a separate node, or if on the same
cluster, allocate it 1–2 cores and about 5–10% of the total memory.
Scenario 2: Processing 1 TB of Data with 5 Nodes (8 Cores, 32 GB RAM Each)
Number of Executors: Reserve 1 core for Hadoop and OS daemons, leaving 7 cores per
node. With 5 nodes, you have a total of 35 available cores. Allocating around 5 cores per
executor, you would have 7 executors.
Memory per Executor: Reserve 1 GB for the OS and Hadoop daemons, leaving 31 GB.
Allocate around 27 GB to the executor, leaving some off-heap memory.
Cores per Executor: Keep it at 5 cores per executor.
Driver Memory: Assign around 3–4 GB of memory to the driver, and run it on a separate
node if possible to avoid resource competition.
Memory and Core Allocation for Hadoop Daemons: 1 GB and 1 core respectively.
Scenario 3: Processing 5 TB of Data with 50 Nodes (16 Cores, 128 GB RAM Each)
Number of Executors: After reserving resources for the OS and daemons, you are left
with 15 cores and 127 GB per node. Run 15 executors per node, for a total of 750
executors.
Memory per Executor: With 127 GB available on each node, allocate approximately 8 GB
per executor, allowing for some off-heap usage.
Cores per Executor: Allocate 1 core per executor in this layout, which maximizes the
number of parallel executors.
Driver Memory: Run the driver on a separate node if possible, and allocate about 10–20
GB, as it needs to collect task states from a large number of executors.
Data Serialization: Consider using Kryo serialization for more efficient serialization of
data.
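As a sketch, the Scenario 2 numbers above could be expressed as a spark-submit command; the master URL and script name are placeholders:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 7 \
  --executor-cores 5 \
  --executor-memory 27g \
  --driver-memory 4g \
  process_1tb_job.py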
13. How do you debug and fix Out of Memory errors in Spark?
If a Spark job is running out of memory with the error: “java.lang.OutOfMemoryError: Java heap
space”, here are some steps to debug and fix the issue:
Increase Executor Memory: Increase the executor memory with
the spark.executor.memory property.
Increase Driver Memory: If the driver is running out of memory, increase it with
the spark.driver.memory property.
Memory Management: Analyze how memory is being used in your Spark job. If caching
is excessive, consider reducing it or using the MEMORY_AND_DISK storage level to spill
to disk when necessary.
Data Serialization: Use Kryo serialization, which is more memory-efficient than Java
serialization.
Partitioning: If some tasks handle significantly more data than others, you might have a
data skew problem. Re-partitioning the data might help.
14. How do you diagnose and improve Spark application performance?
If your Spark application is running slower than expected, here are some steps to diagnose and
improve performance:
Check Resource Utilization: Use Spark’s web UI or other monitoring tools to check CPU
and memory utilization. Low CPU usage could indicate an I/O or network bottleneck,
while high garbage collection times could indicate memory issues.
Data Skew: If some tasks take much longer than others, you might have a data skew
problem. Consider repartitioning your data.
Serialization: If a lot of time is spent on serialization and deserialization, switch to Kryo
serialization, which is more efficient.
Tuning Parallelism: Adjust the level of parallelism. Too few partitions can lead to less
concurrency, while too many can lead to excessive overhead. A rule of thumb is to have
2–3 tasks per CPU core in your cluster (see the sketch after this list).
Caching: If your application reuses intermediate RDDs or DataFrames, use caching to
avoid recomputation.
Tune Spark Configuration: Depending on the characteristics of your application and
dataset, you may need to tune various Spark configurations. For example,
increase spark.driver.memory, spark.executor.memory, or spark.network.timeout, or
decrease spark.memory.fraction.
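As referenced in the Tuning Parallelism point above, a small sketch of applying the 2–3 tasks per core rule of thumb; the input path and the multiplier of 3 are assumptions:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParallelismTuning").getOrCreate()
sc = spark.sparkContext
# defaultParallelism roughly corresponds to the total cores available to the application
total_cores = sc.defaultParallelism
target_partitions = total_cores * 3    # 2-3 tasks per core rule of thumb
spark.conf.set("spark.sql.shuffle.partitions", str(target_partitions))
df = spark.read.parquet("path_to_input")
df = df.repartition(target_partitions)    # align input partitions with the target
print("Partitions:", df.rdd.getNumPartitions())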
15. What is coalesce in Spark?
Answer: coalesce is a transformation in Spark that reduces the number of partitions in a
DataFrame or RDD. It is often used to optimize the performance of a job by reducing the
number of partitions to a specified number, which can be useful when you have a large number
of small partitions. Unlike repartition, coalesce avoids a full shuffle of the data.
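A minimal sketch; the paths and the target partition count of 4 are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CoalesceExample").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
print("Before:", df.rdd.getNumPartitions())
df_small = df.coalesce(4)    # merges existing partitions without a full shuffle
print("After:", df_small.rdd.getNumPartitions())
# Typical use: write fewer output files after heavy filtering
df_small.write.mode("overwrite").parquet("path_to_output")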
16. What is repartition in Spark?
Answer: repartition is a transformation in Spark that reshuffles the data in a DataFrame or RDD
to increase or decrease the number of partitions. This operation involves a full shuffle of the
data across the cluster, which can be useful for balancing the data distribution or increasing
parallelism.
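A minimal sketch; the path, the partition count of 200, and the "country" column are placeholders:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RepartitionExample").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
# Increase the number of partitions for more parallelism (triggers a full shuffle)
df_more = df.repartition(200)
# Repartition by a column so rows with the same value land in the same partition
df_by_country = df.repartition("country")
print(df_more.rdd.getNumPartitions(), df_by_country.rdd.getNumPartitions())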
17. What is the difference between cache() and persist() in Spark?
Answer: Both cache() and persist() are used to store DataFrames or RDDs in memory to speed
up subsequent actions. The difference is:
cache(): Uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) and takes no arguments.
persist(): Allows you to specify different storage levels (e.g., memory, disk, or a
combination) using the StorageLevel class.
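A minimal sketch showing both; the file path is a placeholder:
from pyspark.sql import SparkSession
from pyspark import StorageLevel
spark = SparkSession.builder.appName("CacheVsPersist").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
df.cache()                                   # default storage level, no arguments
df.count()                                   # an action materializes the cache
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)     # explicitly chosen storage level
df.count()
print(df.storageLevel)                       # shows the level currently in use
df.unpersist()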
18. What are actions and transformations in Spark?
Answer:
Transformations: These are operations that create a new DataFrame or RDD from an
existing one. They are lazily evaluated, meaning they are not executed until an action is
called. Examples include map(), filter(), flatMap(), groupBy(), and join().
Actions: These are operations that trigger the execution of transformations and return a
result to the driver program or write data to an external storage system. Examples
include collect(), count(), saveAsTextFile(), and reduce().
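A minimal sketch of lazy evaluation; the path and the "amount" column are placeholders:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("LazyEvaluation").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
# Transformations: only build a logical plan, nothing runs yet
filtered = df.filter(col("amount") > 100)
projected = filtered.select("id", "amount")
# Action: triggers execution of the whole plan
print(projected.count())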
19. What is the difference between map and flatMap in Spark?
Answer:
map: Applies a function to each element of an RDD and returns a new RDD with the
results. The number of elements remains the same.
flatMap: Similar to map, but the function can return a sequence of elements for each
input, and the results are flattened into a single RDD. This can change the number of
elements.
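A minimal RDD sketch; the sample sentences are illustrative:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MapVsFlatMap").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(["hello world", "spark is fast"])
# map: one output element per input element (here, a list per sentence)
print(rdd.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['spark', 'is', 'fast']]
# flatMap: each input can produce many elements, flattened into one RDD
print(rdd.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'spark', 'is', 'fast']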
20. When will shuffling happen in Spark?
Answer: Shuffling occurs when data is redistributed across the cluster, which can happen
during operations like groupByKey(), reduceByKey(), join(), distinct(), and repartition().
Shuffling involves moving data between partitions and can be an expensive operation in
terms of performance.
21. How do you read a Spark file with a delimiter | or \t in a DataFrame?
Answer: You can specify the delimiter using the option method when reading the file.
Here is an example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ReadFile").getOrCreate()
# Reading a file with delimiter '|'
df_pipe = spark.read.option("delimiter", "|").csv("path_to_file", header=True,
inferSchema=True)
# Reading a file with delimiter '\t'
df_tab = spark.read.option("delimiter", "\t").csv("path_to_file", header=True,
inferSchema=True)
22. Explain a scenario where you apply an optimization technique in Spark.
Answer: One common optimization technique is using broadcast joins to handle skewed
data. For example, if you have a large DataFrame df_large and a small
DataFrame df_small, you can use a broadcast join to avoid shuffling the large
DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder.appName("BroadcastJoin").getOrCreate()
df_large = spark.read.csv("path_to_large_file", header=True, inferSchema=True)
df_small = spark.read.csv("path_to_small_file", header=True, inferSchema=True)
# Using broadcast join to optimize performance
df_joined = df_large.join(broadcast(df_small), df_large["key"] == df_small["key"])
This technique helps in reducing the shuffle overhead and improves the performance of
the join operation.
23. Find the Duplicates from a Table
SQL Query:
SELECT column_name, COUNT(*)
FROM your_table
GROUP BY column_name
HAVING COUNT(*) > 1;
PySpark DataFrame Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("FindDuplicates").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
duplicates = df.groupBy("column_name").count().filter(col("count") > 1)
duplicates.show()
24. Find the 2nd Highest Salary from a Table
SQL Query:
SELECT salary
FROM (
SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM your_table
) AS t
WHERE row_num = 2;
PySpark DataFrame Code:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
spark = SparkSession.builder.appName("SecondHighestSalary").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
windowSpec = Window.orderBy(df["salary"].desc())
df_with_row_num = df.withColumn("row_num",
row_number().over(windowSpec))
second_highest_salary = df_with_row_num.filter(df_with_row_num["row_num"] == 2).select("salary")
second_highest_salary.show()
25. How do you use LAG to add the previous ID to the next row?
SQL Query:
SELECT id, LAG(id) OVER (ORDER BY id) AS previous_id
FROM your_table;
PySpark DataFrame Code:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag
spark = SparkSession.builder.appName("LagFunction").getOrCreate()
df = spark.read.csv("path_to_file", header=True, inferSchema=True)
windowSpec = Window.orderBy("id")
df_with_lag = df.withColumn("previous_id", lag("id").over(windowSpec))
df_with_lag.show()
26. Left Join in PySpark
Explanation: A left join returns all the rows from the left DataFrame and the matching
rows from the right DataFrame. If there is no match, the result will contain null values
for columns from the right DataFrame.
Use: Left joins are useful when you want to keep all records from the left DataFrame and
include corresponding records from the right DataFrame if they exist.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LeftJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_left_join = df_left.join(df_right, df_left["key"] == df_right["key"], "left")
df_left_join.show()
27. Right Join in PySpark
Explanation: A right join returns all the rows from the right DataFrame and the matching
rows from the left DataFrame. If there is no match, the result will contain null values for
columns from the left DataFrame.
Use: Right joins are useful when you want to keep all records from the right DataFrame
and include corresponding records from the left DataFrame if they exist.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RightJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_right_join = df_left.join(df_right, df_left["key"] == df_right["key"], "right")
df_right_join.show()
28. Inner Join in PySpark
Explanation: An inner join returns only the rows that have matching values in both
DataFrames. If there is no match, the row is excluded from the result.
Use: Inner joins are useful when you want to retrieve only the records that have
corresponding matches in both DataFrames.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("InnerJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_inner_join = df_left.join(df_right, df_left["key"] == df_right["key"], "inner")
df_inner_join.show()
29. Full Outer Join in PySpark
Explanation: A full outer join returns all the rows when there is a match in either the left or
right DataFrame. If there is no match, the result will contain null values for columns from the
DataFrame that does not have a match.
Use: Full outer joins are useful when you want to retrieve all records from both DataFrames,
regardless of whether there is a match.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FullOuterJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_full_outer_join = df_left.join(df_right, df_left["key"] == df_right["key"], "outer")
df_full_outer_join.show()
30. Left Anti Join in PySpark
Explanation: A left anti join returns only the rows from the left DataFrame that do not
have a match in the right DataFrame.
Use: Left anti joins are useful when you want to find records in the left DataFrame that
do not have corresponding matches in the right DataFrame.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LeftAntiJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_left_anti = df_left.join(df_right, df_left["key"] == df_right["key"], "left_anti")
df_left_anti.show()
31. Left Semi Join in PySpark
Explanation: A left semi join returns only the rows from the left DataFrame that have a
match in the right DataFrame. It is similar to an inner join, but it returns only columns
from the left DataFrame.
Use: Left semi joins are useful when you want to filter the left DataFrame to include only
rows that have corresponding matches in the right DataFrame.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LeftSemiJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_left_semi = df_left.join(df_right, df_left["key"] == df_right["key"], "left_semi")
df_left_semi.show()
32. Cross Join in PySpark
Explanation: A cross join returns the Cartesian product of the two DataFrames, meaning
it returns all possible combinations of rows from the left and right DataFrames.
Use: Cross joins are useful when you need to generate all possible combinations of rows
from two DataFrames. However, they can be very expensive in terms of computation
and memory, so use them with caution.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CrossJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_cross_join = df_left.crossJoin(df_right)
df_cross_join.show()
33. Anti Join in PySpark
Explanation: An anti join returns only the rows from the left DataFrame that do not
have a match in the right DataFrame. In Spark, "anti" is simply an alias for "left_anti",
so the behavior is the same as a left anti join.
Use: Anti joins are useful when you want to exclude records from the left DataFrame
that have corresponding matches in the right DataFrame.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AntiJoin").getOrCreate()
df_left = spark.read.csv("path_to_left_file", header=True, inferSchema=True)
df_right = spark.read.csv("path_to_right_file", header=True, inferSchema=True)
df_anti = df_left.join(df_right, df_left["key"] == df_right["key"], "anti")
df_anti.show()
34. Write a PySpark program based on the following schemas:
df1 = productcode, charid, productid, items, sales_units
df2 = productcode, charid, productid
1) Get the sum of total items grouped on charid
2) Get the sum of sales grouped on sales_units
Answer:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Initialize Spark session
spark = SparkSession.builder.appName("AggregationExample").getOrCreate()
# Sample data for df1
data1 = [
("P1", "C1", "PR1", 10, 100),
("P2", "C1", "PR2", 20, 200),
("P3", "C2", "PR3", 30, 300),
("P4", "C2", "PR4", 40, 400)
]
# Sample data for df2
data2 = [
("P1", "C1", "PR1"),
("P2", "C1", "PR2"),
("P3", "C2", "PR3"),
("P4", "C2", "PR4")
]
# Define schema for df1
schema1 = ["productcode", "charid", "productid", "items", "sales_units"]
# Define schema for df2
schema2 = ["productcode", "charid", "productid"]
# Create DataFrames
df1 = spark.createDataFrame(data1, schema1)
df2 = spark.createDataFrame(data2, schema2)
# 1. Get the sum of total items grouped by charid
sum_items = df1.groupBy("charid").agg(sum("items").alias("total_items"))
sum_items.show()
# 2. Get the sum of sales grouped by sales_units
sum_sales = df1.groupBy("sales_units").agg(sum("sales_units").alias("total_sales"))
sum_sales.show()
Explanation:
1. Initialize Spark session: This sets up the Spark environment.
2. Sample data for df1 and df2: These are the sample data based on the given schema.
3. Define schema for df1 and df2: These are the column names for the DataFrames.
4. Create DataFrames: The sample data is converted into Spark DataFrames.
5. Get the sum of total items grouped by charid: The groupBy method is used to group the
data by charid, and the agg method is used to calculate the sum of items.
6. Get the sum of sales grouped by sales_units: The groupBy method is used to group the
data by sales_units, and the agg method is used to calculate the sum of sales_units.
Was this helpful?
Support by giving it a like and a comment.
Follow and connect for more:
www.linkedin.com/in/ranjit-a873ba243