
Commit ca4e6a4

Symmetries, vuppalli, tk744, and leahecole authored
Data science onramp Data ingestion (GoogleCloudPlatform#4447)
* add data ingestion code
* begin addressing comments
* change submit job
* address code structure and global variable issues
* get dataproc job output and fix linting
* fix PR comments
* linting and global vars
* address Brad PR comments
* broken clean.py
* Revert "broken clean.py". This reverts commit 580c8e1.
* optimize data ingestion
* fix linting errors
* fix minor style issues
* remove pip from cluster config
* load external datasets from url
* add citibike dataset notebook
* address leahs comments
* add gas dataset code
* add scikit learn
* rename file
* small wording change
* address brad and diego comments
* fix linting issues
* add outputs
* add US holidays feature engineering
* Delete noxfile.py
* minor changes
* change output
* added dry-run flag
* address leah comments
* address brads comments
* add weather feature engineering
* dry-run flag
* normalize weather values
* address some review comments
* address comments from live code review
* add env var support and upload to gcs bucket
* change import order and clear incorrect output
* optimize setup test
* query data in test
* address live session comments
* add break statement
* revert breaking table and dataset name change
* small cleanup
* more cleanup
* fix incorrect outputs
* Data cleaning script
* addressed PR comments 1
* addressed PR comments 2
* Refactored to match tutorial doc
* added testing files
* added sh script
* added dry-run flag
* changed --test flag to --dry-run
* gcs upload now writes to temp location before copying objects to final location
* fixed linting
* linting fixes
* Revert "Dataset Feature Engineering"
* fix datetime formatting in setup job
* uncomment commented dataset creation and writing
* add data ingestion code
* begin addressing comments
* change submit job
* address code structure and global variable issues
* get dataproc job output and fix linting
* fix PR comments
* linting and global vars
* address Brad PR comments
* broken clean.py
* Revert "broken clean.py". This reverts commit 580c8e1.
* optimize data ingestion
* fix linting errors
* fix minor style issues
* remove pip from cluster config
* load external datasets from url
* added dry-run flag
* dry-run flag
* address some review comments
* optimize setup test
* query data in test
* address live session comments
* add break statement
* revert breaking table and dataset name change
* fix datetime formatting in setup job
* uncomment commented dataset creation and writing
* fix import order
* use GOOGLE_CLOUD_PROJECT environment variable
* blacken and add f-strings to dms notation
* change test variables names to match data cleaning
* blacken setup_test file
* fix unchanged variable name
* WIP: address PR comments
* apply temporary fix for ANACONDA optional component
* remove data cleaning files

Co-authored-by: vuppalli <[email protected]>
Co-authored-by: Tushar Khan <[email protected]>
Co-authored-by: Vismita Uppalli <[email protected]>
Co-authored-by: Leah E. Cole <[email protected]>
Co-authored-by: Tushar Khan <[email protected]>
1 parent 60d97ae commit ca4e6a4

File tree

6 files changed: +439 -29 lines changed


.gitignore

Lines changed: 0 additions & 29 deletions
This file was deleted.
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
pytest==6.0.0
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
#grpcio==1.29.0
#google-auth==1.16.0
#google-auth-httplib2==0.0.3
google-cloud-storage==1.28.1
google-cloud-dataproc==2.0.0
google-cloud-bigquery==1.25.0
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
"""Setup Dataproc job for Data Science Onramp Sample Application
This job ingests an external gas prices in NY dataset as well as
takes a New York Citibike dataset available on BigQuery and
"dirties" the dataset before uploading it back to BigQuery
It needs the following arguments
* the name of the Google Cloud Storage bucket to be used
* the name of the BigQuery dataset to be created
* an optional --test flag to upload a subset of the dataset for testing
"""

import random
import sys

from google.cloud import bigquery
import pandas as pd
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, expr, UserDefinedFunction, when
from pyspark.sql.types import FloatType, StringType, StructField, StructType

TABLE = "bigquery-public-data.new_york_citibike.citibike_trips"
CITIBIKE_TABLE_NAME = "RAW_DATA"
EXTERNAL_TABLES = {
    "gas_prices": {
        "url": "https://data.ny.gov/api/views/wuxr-ni2i/rows.csv",
        "schema": StructType(
            [
                StructField("Date", StringType(), True),
                StructField("New_York_State_Average_USD_per_Gal", FloatType(), True),
                StructField("Albany_Average_USD_per_Gal", FloatType(), True),
                StructField("Blinghamton_Average_USD_per_Gal", FloatType(), True),
                StructField("Buffalo_Average_USD_per_Gal", FloatType(), True),
                StructField("Nassau_Average_USD_per_Gal", FloatType(), True),
                StructField("New_York_City_Average_USD_per_Gal", FloatType(), True),
                StructField("Rochester_Average_USD_per_Gal", FloatType(), True),
                StructField("Syracuse_Average_USD_per_Gal", FloatType(), True),
                StructField("Utica_Average_USD_per_Gal", FloatType(), True),
            ]
        ),
    },
}


# START MAKING DATA DIRTY
def trip_duration(duration):
    """Converts trip duration to other units"""
    if not duration:
        return None
    seconds = f"{str(duration)} s"
    minutes = f"{str(float(duration) / 60)} min"
    hours = f"{str(float(duration) / 3600)} h"
    return random.choices(
        [seconds, minutes, hours, str(random.randint(-1000, -1))],
        weights=[0.3, 0.3, 0.3, 0.1],
    )[0]


def station_name(name):
    """Replaces '&' with '/' with a 50% chance"""
    if not name:
        return None
    return random.choice([name, name.replace("&", "/")])


def user_type(user):
    """Manipulates the user type string"""
    if not user:
        return None
    return random.choice(
        [
            user,
            user.upper(),
            user.lower(),
            "sub" if user == "Subscriber" else user,
            "cust" if user == "Customer" else user,
        ]
    )


def gender(s):
    """Manipulates the gender string"""
    if not s:
        return None
    return random.choice(
        [
            s.upper(),
            s.lower(),
            s[0].upper() if len(s) > 0 else "",
            s[0].lower() if len(s) > 0 else "",
        ]
    )


def convert_angle(angle):
    """Converts long and lat to DMS notation"""
    if not angle:
        return None
    degrees = int(angle)
    minutes = int((angle - degrees) * 60)
    seconds = int((angle - degrees - minutes / 60) * 3600)
    new_angle = f"{degrees}\u00B0{minutes}'{seconds}\""
    return random.choices([str(angle), new_angle], weights=[0.55, 0.45])[0]


def create_bigquery_dataset(dataset_name):
    # Create BigQuery Dataset
    client = bigquery.Client()
    dataset_id = f"{client.project}.{dataset_name}"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = "US"
    dataset = client.create_dataset(dataset)


def write_to_bigquery(df, table_name, dataset_name):
    """Write a dataframe to BigQuery"""
    client = bigquery.Client()
    dataset_id = f"{client.project}.{dataset_name}"

    # Saving the data to BigQuery
    df.write.format("bigquery").option("table", f"{dataset_id}.{table_name}").save()

    print(f"Table {table_name} successfully written to BigQuery")


def main():
    # Get command line arguments
    BUCKET_NAME = sys.argv[1]
    DATASET_NAME = sys.argv[2]

    # Create a SparkSession under the name "setup"
    spark = SparkSession.builder.appName("setup").getOrCreate()

    spark.conf.set("temporaryGcsBucket", BUCKET_NAME)

    create_bigquery_dataset(DATASET_NAME)

    # Whether we are running the job as a test
    test = False

    # Check whether or not the job is running as a test
    if "--test" in sys.argv:
        test = True
        print("A subset of the whole dataset will be uploaded to BigQuery")
    else:
        print("Results will be uploaded to BigQuery")

    # Ingest External Datasets
    for table_name, data in EXTERNAL_TABLES.items():
        df = spark.createDataFrame(pd.read_csv(data["url"]), schema=data["schema"])

        write_to_bigquery(df, table_name, DATASET_NAME)

    # Check if table exists
    try:
        df = spark.read.format("bigquery").option("table", TABLE).load()
        # if we are running a test, perform computations on a subset of the data
        if test:
            df = df.sample(False, 0.00001)
    except Py4JJavaError:
        print(f"{TABLE} does not exist. ")
        return

    # Declare dictionary with keys column names and values user defined
    # functions and return types
    udf_map = {
        "tripduration": (trip_duration, StringType()),
        "start_station_name": (station_name, StringType()),
        "start_station_latitude": (convert_angle, StringType()),
        "start_station_longitude": (convert_angle, StringType()),
        "end_station_name": (station_name, StringType()),
        "end_station_latitude": (convert_angle, StringType()),
        "end_station_longitude": (convert_angle, StringType()),
        "usertype": (user_type, StringType()),
        "gender": (gender, StringType()),
    }

    # Declare which columns to set some values to null randomly
    null_columns = [
        "tripduration",
        "starttime",
        "stoptime",
        "start_station_latitude",
        "start_station_longitude",
        "end_station_latitude",
        "end_station_longitude",
    ]

    # Dirty the columns
    for name, udf in udf_map.items():
        df = df.withColumn(name, UserDefinedFunction(*udf)(name))

    # Format the datetimes correctly
    for name in ["starttime", "stoptime"]:
        df = df.withColumn(name, date_format(name, "yyyy-MM-dd'T'HH:mm:ss"))

    # Randomly set about 5% of the values in some columns to null
    for name in null_columns:
        df = df.withColumn(name, when(expr("rand() < 0.05"), None).otherwise(df[name]))

    # Duplicate about 0.01% of the rows
    dup_df = df.sample(True, 0.0001)

    # Create final dirty dataframe
    df = df.union(dup_df)

    print("Uploading citibike dataset...")
    write_to_bigquery(df, CITIBIKE_TABLE_NAME, DATASET_NAME)


if __name__ == "__main__":
    main()
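
As a quick sanity check of what the "dirtying" step produces, the sketch below calls a few of the UDF helpers directly on sample values. It is not part of this commit; it assumes the script above is saved as setup.py (the name the submit script below passes to gcloud) and that its dependencies, including pyspark and the BigQuery client, are installed, since importing the module pulls them in.

import random

# Hypothetical preview, not part of the commit: import the dirtying helpers
# from the setup job above (assumes the file is importable as `setup`).
from setup import convert_angle, gender, trip_duration, user_type

random.seed(0)  # make the random choices repeatable for this preview

# Each helper returns one randomly chosen "dirty" variant of its input.
print(trip_duration(600))       # e.g. "600 s", "10.0 min", an hours string, or a random negative number
print(convert_angle(40.7128))   # either "40.7128" or a DMS string like 40°42'46"
print(user_type("Subscriber"))  # e.g. "Subscriber", "SUBSCRIBER", "subscriber", or "sub"
print(gender("male"))           # e.g. "MALE", "male", "M", or "m"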
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Submit a PySpark job via the Cloud Dataproc Jobs API
# Requires having CLUSTER_NAME and BUCKET_NAME set as
# environment variables

gcloud dataproc jobs submit pyspark \
    --cluster ${CLUSTER_NAME} \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=FATAL \
    setup.py -- ${BUCKET_NAME} new_york_citibike_trips
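
The same job can also be submitted programmatically with the google-cloud-dataproc client pinned in the requirements above. The sketch below is a rough, hedged equivalent of the gcloud command rather than part of the commit: the project, region, cluster, and bucket values are placeholders, and it assumes setup.py has already been copied to the bucket, since the Jobs API takes a GCS URI instead of a local file.

from google.cloud import dataproc_v1 as dataproc

# Placeholder values -- substitute your own project, region, cluster, and bucket.
project_id = "your-project-id"
region = "us-central1"
cluster_name = "your-cluster"
bucket_name = "your-bucket"

# Use the regional endpoint that matches the cluster's region.
job_client = dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Mirrors the gcloud invocation above: same main file, jar, and arguments.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {
        "main_python_file_uri": f"gs://{bucket_name}/setup.py",  # assumes setup.py was uploaded to GCS
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
        "args": [bucket_name, "new_york_citibike_trips"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # blocks until the job finishes
print(f"Job finished: {response.reference.job_id}")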
