Create submit_spark_job_to_driver_node_group_cluster.py #13423
base: main
Conversation
Sample code to submit a Spark job to a Dataproc driver node group cluster.
Here is the summary of changes. You are about to add 1 region tag.
This comment is generated by snippet-bot.
Summary of Changes
Hello @aman-ebay, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new Python sample that illustrates the process of submitting a Spark job to a Google Cloud Dataproc cluster. The sample focuses on demonstrating how to correctly configure job submission, particularly for clusters that are set up with a driver node group, and includes steps to retrieve the job's output.
Highlights
- New Sample Script: Adds a new Python script (dataproc/snippets/submit_spark_job_to_driver_node_group_cluster.py) to demonstrate submitting a Spark job to a Dataproc cluster.
- Driver Node Group Support: Specifically shows how to configure the driver_scheduling_config when submitting a job to a cluster utilizing a driver node group (see the sketch after this list).
- Job Submission Logic: Utilizes the google.cloud.dataproc_v1.JobControllerClient to submit the Spark job asynchronously.
- Output Retrieval: Includes code to fetch and print the job output from the Cloud Storage bucket where Dataproc stores results.
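For orientation, here is a minimal sketch of the kind of job payload such a sample builds. It is not the exact code from this PR, and the driver_scheduling_config field names (memory_mb, vcores) are assumptions based on the dataproc_v1 API that should be verified against the actual change; the cluster name reuses the placeholder from the sample's __main__ block.

from google.cloud import dataproc_v1 as dataproc

# Minimal sketch (not the code under review): a Spark job request targeting a
# driver node group cluster. The driver_scheduling_config values are illustrative.
job = {
    "placement": {"cluster_name": "your-node-group-cluster"},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
    # Resources reserved for the Spark driver on the driver node group.
    "driver_scheduling_config": {"memory_mb": 2048, "vcores": 2},
}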
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
The pull request introduces a new Python script, submit_spark_job_to_driver_node_group_cluster.py, for submitting Spark jobs to Dataproc clusters. The review suggests enhancing command-line usability with argparse, improving error handling with try-except blocks, clarifying code with type hints and docstrings, and ensuring proper resource cleanup with context managers.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
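As a side note on the snippet above, operation.result() blocks until the job completes. A hedged variant that bounds the wait (google.api_core operation futures accept a timeout argument) might look like this; the 30-minute value is an arbitrary choice, not something from this PR.

# Variant sketch: bound the wait on the long-running operation so a stalled job
# fails the script instead of hanging indefinitely.
response = operation.result(timeout=1800)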
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

output = (
    storage.Client()
    .get_bucket(matches.group(1))
    .blob(f"{matches.group(2)}.000000000")
    .download_as_bytes()
    .decode("utf-8")
)
Improve error handling and resource management in this section:
- Check if re.match returns None before accessing match groups.
- Wrap GCS operations in a try...except block to catch potential exceptions.
- Use storage.Client() as a context manager.
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
if not matches:
print(f"Error: Could not parse driver output URI: {response.driver_output_resource_uri}")
raise ValueError
try:
with storage.Client() as storage_client:
bucket_name = matches.group(1)
blob_name = f"{matches.group(2)}.000000000"
output = (
storage_client.get_bucket(bucket_name)
.blob(blob_name)
.download_as_bytes()
.decode("utf-8")
)
except Exception as e:
print(f"Error downloading job output: {e}")
raise
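One caveat on the suggestion above: whether storage.Client can be used as a context manager depends on the installed google-cloud-storage version, so a fallback sketch without the with block is shown here under that assumption.

# Fallback sketch if the installed google-cloud-storage version does not support
# using storage.Client as a context manager: create the client directly.
storage_client = storage.Client()
try:
    output = (
        storage_client.get_bucket(matches.group(1))
        .blob(f"{matches.group(2)}.000000000")
        .download_as_bytes()
        .decode("utf-8")
    )
except Exception as e:
    print(f"Error downloading job output: {e}")
    raise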
done.
from google.cloud import dataproc_v1 as dataproc
from google.cloud import storage


def submit_job(project_id, region, cluster_name):
Consider adding a docstring (as per PEP 257) and type hints (as per PEP 484) to the submit_job function to improve readability and maintainability.
def submit_job(project_id: str, region: str, cluster_name: str) -> None:
"""Submits a Spark job to the specified Dataproc cluster with a driver node group and prints the output.
Args:
project_id: The Google Cloud project ID.
region: The Dataproc region where the cluster is located.
cluster_name: The name of the Dataproc cluster.
"""
done.
job_client = dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
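The regional api_endpoint here must match the region where the cluster lives. If resource cleanup matters (as the review notes for the storage client), the generated Dataproc clients can typically be used as context managers as well; a hedged sketch follows, assuming the installed google-cloud-dataproc version supports the context-manager protocol.

# Sketch: closing the client's underlying gRPC channel on exit. Verify that the
# installed dataproc_v1 client supports the context-manager protocol before relying on this.
with dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
) as job_client:
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()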
done.
if __name__ == "__main__":

    my_project_id = "your_cluster"  # <-- REPLACE THIS
    my_region = "us-central1"  # <-- REPLACE THIS
    my_cluster_name = "your-node-group-cluster"  # <-- REPLACE THIS

    submit_job(my_project_id, my_region, my_cluster_name)
Use argparse to handle command-line arguments for project_id, region, and cluster_name to make the script more usable.
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(
description="Submits a Spark job to a Dataproc driver node group cluster."
)
parser.add_argument("--project_id", help="The Google Cloud project ID.", required=True)
parser.add_argument("--region", help="The Dataproc region where the cluster is located.", required=True)
parser.add_argument("--cluster_name", help="The name of the Dataproc cluster.", required=True)
args = parser.parse_args()
submit_job(args.project_id, args.region, args.cluster_name)
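With this in place, the sample could be invoked along the lines of python submit_spark_job_to_driver_node_group_cluster.py --project_id=my-project --region=us-central1 --cluster_name=my-cluster (the values here are placeholders), and required=True gives a clear usage error when an argument is missing.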
done
This sample needs a test before merging. The current test error is on

I just checked the quota on the testing project and there should be 24 available N2 CPUs in us-central1. This is most likely due to multiple tests running at the same time, all of them trying to allocate the N2 CPUs at the same time. I know you didn't modify that file, but the tests will run for all samples under the affected directory.

From the error it looks like it's trying to allocate 12 CPUs (probably the default?), which is a lot for a sample test. There's probably some other sample also trying to allocate a large number of CPUs. I would try to configure the clusters on existing samples to start with the minimum number of CPUs they can; that'll fix any quota issues and tests will run faster as well.
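As a starting point for that missing test, here is a hedged pytest sketch. The environment-variable names and the assumption that a driver-node-group cluster already exists (so the test itself does not allocate new N2 CPUs and worsen the quota contention described above) are hypothetical and would need to match the conventions already used under dataproc/snippets.

# Hypothetical test sketch, not part of this PR. Assumes a pre-provisioned driver
# node group cluster identified by a placeholder environment variable.
import os

import submit_spark_job_to_driver_node_group_cluster as sample


def test_submit_spark_job_to_driver_node_group_cluster(capsys):
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    region = os.environ.get("DATAPROC_REGION", "us-central1")
    cluster_name = os.environ["DRIVER_NODE_GROUP_CLUSTER_NAME"]  # placeholder name

    sample.submit_job(project_id, region, cluster_name)

    out, _ = capsys.readouterr()
    # The sample prints the driver output it downloads from Cloud Storage,
    # so a successful run should produce some non-empty output.
    assert out.strip()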