# Cloud Dataproc API Examples

These samples have been moved to
https://github.com/googleapis/python-dataproc/tree/master/samples.

[![Open in Cloud Shell][shell_img]][shell_link]

[shell_img]: http://gstatic.com/cloudssh/images/open-btn.png
[shell_link]: https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataproc/README.md

Sample command-line programs for interacting with the Cloud Dataproc API.

See [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for a walkthrough you can run to try out the Cloud Dataproc API sample code.

Note that while these samples demonstrate interacting with Dataproc via the API, the same functionality could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program that demonstrates connecting to the Cloud Dataproc API and listing the clusters in a region.
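
At its core, listing clusters with the `google-cloud-dataproc` client library looks roughly like the sketch below; the regional endpoint and field names are an illustration of the client-library usage, not the exact contents of `list_clusters.py`.

    # Minimal sketch, assuming the google-cloud-dataproc client library.
    from google.cloud import dataproc_v1

    def list_clusters(project_id, region):
        # Dataproc uses regional endpoints for non-global regions.
        client = dataproc_v1.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        for cluster in client.list_clusters(project_id=project_id, region=region):
            print(cluster.cluster_name, cluster.status.state.name)

    list_clusters("your-project-id", "us-central1")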

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.
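
The job-submission step boils down to a call like the following; this is a hedged sketch of the client-library usage (names and file URIs are placeholders), not the sample's exact code.

    # Sketch: submit a PySpark job and wait for it to finish.
    from google.cloud import dataproc_v1

    def submit_pyspark_job(project_id, region, cluster_name, main_python_file_uri):
        job_client = dataproc_v1.JobControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        job = {
            "placement": {"cluster_name": cluster_name},
            "pyspark_job": {"main_python_file_uri": main_python_file_uri},
        }
        operation = job_client.submit_job_as_operation(
            request={"project_id": project_id, "region": region, "job": job}
        )
        finished = operation.result()  # blocks until the job completes
        # URI prefix in GCS where the driver's output is written.
        print(finished.driver_output_resource_uri)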

`single_job_workflow.py` uses the Cloud Dataproc InstantiateInlineWorkflowTemplate API to create an ephemeral cluster, run a job, and then delete the cluster, all with a single API request.
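
Conceptually, an inline workflow template bundles a managed (ephemeral) cluster definition and the jobs to run on it. The sketch below is a simplified illustration of that shape; machine types, step IDs, and file URIs are placeholder assumptions.

    # Sketch: run one job on an ephemeral cluster via an inline workflow template.
    from google.cloud import dataproc_v1

    def run_inline_workflow(project_id, region, main_python_file_uri):
        client = dataproc_v1.WorkflowTemplateServiceClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        template = {
            "placement": {
                "managed_cluster": {
                    "cluster_name": "ephemeral-cluster",
                    "config": {
                        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
                    },
                }
            },
            "jobs": [
                {
                    "step_id": "pyspark-sort",
                    "pyspark_job": {"main_python_file_uri": main_python_file_uri},
                }
            ],
        }
        operation = client.instantiate_inline_workflow_template(
            request={
                "parent": f"projects/{project_id}/regions/{region}",
                "template": template,
            }
        )
        operation.result()  # cluster is created, the job runs, then the cluster is deleted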

`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.
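
The difference amounts to pointing Spark at a `gs://` URI instead of local data, along these lines (the bucket path is a placeholder, and the bundled script may differ in detail):

    # Illustrative PySpark snippet: read lines from GCS and sort them.
    import pyspark

    sc = pyspark.SparkContext()
    rdd = sc.textFile("gs://your-staging-bucket/input.txt")
    print(sorted(rdd.collect()))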

## Prerequisites to run locally

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

To install the dependencies, run the following command. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the command within a virtualenv.

    pip install -r requirements.txt

## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach to running these samples is to use a Service Account with a JSON key.
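
Once you have downloaded the key, the client libraries can pick it up via Application Default Credentials. A minimal sketch of the wiring, assuming you prefer to set the variable from Python rather than exporting it in your shell (the key path is a placeholder):

    # Point Application Default Credentials at the service account key.
    import os
    from google.cloud import dataproc_v1

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
    client = dataproc_v1.ClusterControllerClient()  # credentials are resolved automatically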

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1  # or your region
    CLUSTER_NAME=your-cluster-name
    ZONE=us-central1-b

## Running the samples

To run `list_clusters.py`:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing cluster. To create a cluster before running the code, you can use the [Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name
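
If you would rather create the cluster from Python, the client-library call looks roughly like this sketch (machine types and names are illustrative assumptions, not the sample's exact configuration):

    # Sketch: create a Dataproc cluster programmatically.
    from google.cloud import dataproc_v1

    def create_cluster(project_id, region, cluster_name):
        client = dataproc_v1.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        }
        operation = client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        print(operation.result().cluster_name)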

To run `submit_job_to_cluster.py`, first create a GCS bucket (used by Cloud Dataproc to stage files) from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>
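
The same setup can be done from Python with the `google-cloud-storage` library; a short sketch (the project ID and bucket name are placeholders, and bucket names must be globally unique):

    # Sketch: create the staging bucket and upload the PySpark script.
    from google.cloud import storage

    client = storage.Client(project="your-project-id")
    bucket = client.create_bucket("your-staging-bucket-name")
    bucket.blob("pyspark_sort.py").upload_from_filename("pyspark_sort.py")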

Next, set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to use an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Alternatively, to create a new cluster, which will be deleted at the end of the job, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

The script will set up a cluster, upload the PySpark file, submit the job, print the result, and then, if it created the cluster, delete it.

Optionally, you can pass the `--pyspark_file` argument to replace the default `pyspark_sort.py` included with this script with a script of your own.