
Create submit_spark_job_to_driver_node_group_cluster.py #13423


Open · wants to merge 14 commits into main

Conversation

aman-ebay (Contributor)

Sample code to submit a Spark job to a Dataproc driver node group cluster.
@aman-ebay aman-ebay requested review from a team as code owners June 12, 2025 00:30
@product-auto-label product-auto-label bot added the samples label (Issues that are directly related to samples) on Jun 12, 2025

snippet-bot bot commented Jun 12, 2025

Here is the summary of changes.

You are about to add 1 region tag.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add the snippet-bot:force-run label.
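For context, a region tag is a pair of start/end comment markers that the snippets pipeline uses to extract the sample into documentation. A minimal sketch of the pattern, with a hypothetical tag name rather than the exact tag added by this PR:

    # [START dataproc_submit_spark_job_to_driver_node_group_cluster]  # hypothetical tag name
    def submit_job(project_id: str, region: str, cluster_name: str) -> None:
        ...
    # [END dataproc_submit_spark_job_to_driver_node_group_cluster]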

@product-auto-label product-auto-label bot added the api: dataproc label (Issues related to the Dataproc API) on Jun 12, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @aman-ebay, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new Python sample that illustrates the process of submitting a Spark job to a Google Cloud Dataproc cluster. The sample focuses on demonstrating how to correctly configure job submission, particularly for clusters that are set up with a driver node group, and includes steps to retrieve the job's output.

Highlights

  • New Sample Script: Adds a new Python script (dataproc/snippets/submit_spark_job_to_driver_node_group_cluster.py) to demonstrate submitting a Spark job to a Dataproc cluster.
  • Driver Node Group Support: Specifically shows how to configure the driver_scheduling_config when submitting a job to a cluster that uses a driver node group (see the sketch after this list).
  • Job Submission Logic: Utilizes the google.cloud.dataproc_v1.JobControllerClient to submit the Spark job asynchronously.
  • Output Retrieval: Includes code to fetch and print the job output from the Cloud Storage bucket where Dataproc stores results.
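To make the highlights concrete, here is a minimal sketch of the flow the sample implements. The SparkPi workload and the driver_scheduling_config values are illustrative assumptions, not the exact contents of the new file:

    import re

    from google.cloud import dataproc_v1 as dataproc
    from google.cloud import storage


    def submit_job(project_id: str, region: str, cluster_name: str) -> None:
        """Submits a Spark job to a driver node group cluster and prints its output."""
        # Create a Job Controller client pointed at the cluster's regional endpoint.
        with dataproc.JobControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        ) as job_client:
            job = {
                "placement": {"cluster_name": cluster_name},
                "spark_job": {
                    # Illustrative workload: the SparkPi example bundled with Dataproc images.
                    "main_class": "org.apache.spark.examples.SparkPi",
                    "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
                    "args": ["1000"],
                },
                # Driver resources to request from the cluster's driver node group
                # (memory_mb and vcores values are illustrative assumptions).
                "driver_scheduling_config": {"memory_mb": 2048, "vcores": 2},
            }
            # Submit asynchronously and block until the job finishes.
            operation = job_client.submit_job_as_operation(
                request={"project_id": project_id, "region": region, "job": job}
            )
            response = operation.result()

        # Dataproc writes driver output to Cloud Storage; parse the URI and download it.
        matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
        output = (
            storage.Client()
            .get_bucket(matches.group(1))
            .blob(f"{matches.group(2)}.000000000")
            .download_as_bytes()
            .decode("utf-8")
        )
        print(output)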
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new Python script, submit_spark_job_to_driver_node_group_cluster.py, for submitting Spark jobs to Dataproc clusters. The review suggests enhancing command-line usability with argparse, improving error handling with try-except blocks, clarifying code with type hints and docstrings, and ensuring proper resource cleanup with context managers.

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()

Severity: high

Wrap the operation.result() call in a try...except block to handle potential exceptions if the job submission fails or times out.

    try:
        response = operation.result()
    except Exception as e:
        print(f"Error submitting job or waiting for completion: {e}")
        raise

Comment on lines 63 to 71
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

output = (
    storage.Client()
    .get_bucket(matches.group(1))
    .blob(f"{matches.group(2)}.000000000")
    .download_as_bytes()
    .decode("utf-8")
)

Severity: high

Improve error handling and resource management in this section:

  • Check if re.match returns None before accessing match groups.
  • Wrap GCS operations in a try...except block to catch potential exceptions.
  • Use storage.Client() as a context manager.
    matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)
    if not matches:
        print(f"Error: Could not parse driver output URI: {response.driver_output_resource_uri}")
        raise ValueError

    try:
        with storage.Client() as storage_client:
            bucket_name = matches.group(1)
            blob_name = f"{matches.group(2)}.000000000"
            output = (
                storage_client.get_bucket(bucket_name)
                .blob(blob_name)
                .download_as_bytes()
                .decode("utf-8")
            )
    except Exception as e:
        print(f"Error downloading job output: {e}")
        raise

aman-ebay (Contributor, Author):

done.

from google.cloud import dataproc_v1 as dataproc
from google.cloud import storage

def submit_job(project_id, region, cluster_name):

Severity: medium

Consider adding a docstring (as per PEP 257) and type hints (as per PEP 484) to the submit_job function to improve readability and maintainability.

def submit_job(project_id: str, region: str, cluster_name: str) -> None:
    """Submits a Spark job to the specified Dataproc cluster with a driver node group and prints the output.

    Args:
        project_id: The Google Cloud project ID.
        region: The Dataproc region where the cluster is located.
        cluster_name: The name of the Dataproc cluster.
    """

aman-ebay (Contributor, Author):

done.

Comment on lines 35 to 37
job_client = dataproc.JobControllerClient(
client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

Severity: medium

Using JobControllerClient as a context manager (with ... as ...:) ensures proper resource management and release, even if errors occur.

    with dataproc.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    ) as job_client:

aman-ebay (Contributor, Author):

done.

Comment on lines 79 to 85
if __name__ == "__main__":

my_project_id = "your_cluster" # <-- REPLACE THIS
my_region = "us-central1" # <-- REPLACE THIS
my_cluster_name = "your-node-group-cluster" # <-- REPLACE THIS

submit_job(my_project_id, my_region, my_cluster_name)

Severity: medium

Use argparse to handle command-line arguments for project_id, region, and cluster_name to make the script more usable.

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Submits a Spark job to a Dataproc driver node group cluster."
    )
    parser.add_argument("--project_id", help="The Google Cloud project ID.", required=True)
    parser.add_argument("--region", help="The Dataproc region where the cluster is located.", required=True)
    parser.add_argument("--cluster_name", help="The name of the Dataproc cluster.", required=True)

    args = parser.parse_args()
    submit_job(args.project_id, args.region, args.cluster_name)
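For reference, a hypothetical invocation of the argparse version would look like this (the project and cluster names below are placeholders):

    python submit_spark_job_to_driver_node_group_cluster.py \
        --project_id your-project-id \
        --region us-central1 \
        --cluster_name your-node-group-cluster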

aman-ebay (Contributor, Author):

done.

@aman-ebay aman-ebay self-assigned this Jun 13, 2025
@leahecole leahecole assigned davidcavazos and unassigned aman-ebay Jun 13, 2025

davidcavazos commented Jun 24, 2025

This sample needs a test before merging.

The current test error is on dataproc/snippets/instantiate_inline_workflow_template.py:
google.api_core.exceptions.Aborted: 409 Error submitting create cluster request: Insufficient 'N2_CPUS' quota. Requested 12.0, available 0.0. Your resource request exceeds your available quota.

I just checked the quota on the testing project, and there should be 24 available N2 CPUs in us-central1. This is most likely due to multiple tests running concurrently, all of them trying to allocate N2 CPUs at the same time.

I know you didn't modify that file, but the tests run for all samples under the affected directory. From the error it looks like that cluster is trying to allocate 12 CPUs (probably the default?), which is a lot for a sample test, and some other sample is probably also trying to allocate a large number of CPUs. I would configure the clusters in the existing samples to start with the minimum number of CPUs they can; that will fix the quota issues, and the tests will run faster as well.
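For illustration, a minimal sketch of what shrinking a test cluster's footprint could look like; the machine types and instance counts are assumptions, not the actual settings of the affected samples:

    from google.cloud import dataproc_v1 as dataproc


    def create_small_test_cluster(project_id: str, region: str, cluster_name: str) -> None:
        """Creates a small Dataproc cluster to keep CPU quota usage low (illustrative sizing)."""
        with dataproc.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        ) as cluster_client:
            cluster = {
                "project_id": project_id,
                "cluster_name": cluster_name,
                "config": {
                    # 2 vCPUs for the master plus 2 vCPUs per worker: 6 N2 CPUs total,
                    # well under the 12 CPUs seen in the quota error.
                    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-2"},
                    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-2"},
                },
            }
            operation = cluster_client.create_cluster(
                request={"project_id": project_id, "region": region, "cluster": cluster}
            )
            operation.result()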

Labels
  • api: dataproc (Issues related to the Dataproc API)
  • samples (Issues that are directly related to samples)