Data science onramp Data ingestion #4447
Merged: leahecole merged 105 commits into GoogleCloudPlatform:master from leahecole:data-ingestion on Aug 14, 2020.

Commits (105):
92cf763 add data ingestion code (vuppalli)
739114a begin addressing comments (Symmetries)
681eaf3 change submit job (vuppalli)
4afbf1c address code structure and global variable issues (Symmetries)
744f80c get dataproc job output and fix linting (vuppalli)
8cd7dc6 fix PR comments (vuppalli)
81265d2 linting and global vars (vuppalli)
3e86bda address Brad PR comments (vuppalli)
580c8e1 broken clean.py (tk744)
4ed5a15 Revert "broken clean.py" (tk744)
e6fe99d optimize data ingestion (Symmetries)
540acaa fix linting errors (vuppalli)
a7e2972 fix minor style issues (Symmetries)
3e5ba3b remove pip from cluster config (Symmetries)
2106153 load external datasets from url (Symmetries)
0692122 add citibike dataset notebook (vuppalli)
01eb916 address leahs comments (vuppalli)
56faae7 add gas dataset code (vuppalli)
5c672c4 add scikit learn (vuppalli)
f6faeeb rename file (vuppalli)
d47841c small wording change (vuppalli)
d548dd0 address brad and diego comments (vuppalli)
129a5b9 fix linting issues (vuppalli)
5c284ea add outputs (vuppalli)
6c565f4 add US holidays feature engineering (vuppalli)
c9b5dd7 Delete noxfile.py (vuppalli)
c5283c9 minor changes (vuppalli)
74881a9 change output (vuppalli)
7d21865 Merge branch 'feature-engineering' of https://github.com/Symmetries/p… (vuppalli)
9febbad added dry-run flag (tk744)
1980346 address leah comments (vuppalli)
2557afe address brads comments (vuppalli)
44bf12e add weather feature engineering (vuppalli)
5d56b97 dry-run flag (Symmetries)
0561803 normalize weather values (vuppalli)
22be5d3 address some review comments (Symmetries)
bbf8d03 address comments from live code review (vuppalli)
20964d5 add env var support and upload to gcs bucket (vuppalli)
266708b change import order and clear incorrect output (vuppalli)
f040542 optimize setup test (Symmetries)
55354df query data in test (Symmetries)
5f80974 address live session comments (Symmetries)
e883765 add break statement (Symmetries)
2ec8b30 revert breaking table and dataset name change (Symmetries)
1d13604 small cleanup (vuppalli)
c017aff more cleanup (vuppalli)
c86b3ee fix incorrect outputs (vuppalli)
5b66d5c Data cleaning script (tk744)
6fe9ac6 addressed PR comments 1 (tk744)
c5d8b90 addressed PR comments 2 (tk744)
68ad196 Refactored to match tutorial doc (tk744)
2df1dbf added testing files (tk744)
9b7fe27 added sh script (tk744)
dcb5a65 added dry-run flag (tk744)
75fb658 changed --test flag to --dry-run (tk744)
3620271 gcs upload now writes to temp location before copying objects to fina… (tk744)
8c9e44e fixed linting (tk744)
99cf698 linting fixes (tk744)
d38e1b2 Merge pull request #3 from Symmetries/feature-engineering (Symmetries)
8a5e5fd Revert "Dataset Feature Engineering" (Symmetries)
817172d Merge pull request #5 from Symmetries/revert-3-feature-engineering (Symmetries)
6f6f404 Merge branch 'master' into master (leahecole)
ade554f Merge branch 'master' into master (tk744)
b5ea09e fix datetime formatting in setup job (Symmetries)
213dfca uncomment commented dataset creation and writing (Symmetries)
589568a add data ingestion code (vuppalli)
9148f5b begin addressing comments (Symmetries)
1abf664 change submit job (vuppalli)
c600724 address code structure and global variable issues (Symmetries)
ce04a6f get dataproc job output and fix linting (vuppalli)
ef2d2b3 fix PR comments (vuppalli)
93394a3 linting and global vars (vuppalli)
a6fc6e6 address Brad PR comments (vuppalli)
1c9f526 broken clean.py (tk744)
327cf5b Revert "broken clean.py" (tk744)
4bf07ee optimize data ingestion (Symmetries)
8dbd3bc fix linting errors (vuppalli)
4cdd733 fix minor style issues (Symmetries)
0769754 remove pip from cluster config (Symmetries)
52da79a load external datasets from url (Symmetries)
2ac38ab added dry-run flag (tk744)
5ead6b2 dry-run flag (Symmetries)
3bb0f79 address some review comments (Symmetries)
c753ed7 optimize setup test (Symmetries)
e0ffb41 query data in test (Symmetries)
b0d334b address live session comments (Symmetries)
33afd6c add break statement (Symmetries)
9acb94e revert breaking table and dataset name change (Symmetries)
c97d454 fix datetime formatting in setup job (Symmetries)
41406f9 uncomment commented dataset creation and writing (Symmetries)
0fcb63e Merge branch 'master' into data-ingestion (Symmetries)
ca3c592 fix import order (Symmetries)
cf3aae3 use GOOGLE_CLOUD_PROJECT environment variable (Symmetries)
c0dc053 resolve merge issue (Symmetries)
5c3df6e Merge branch 'master' of https://github.com/Symmetries/python-docs-sa… (Symmetries)
4a3c941 Merge branch 'master' into data-ingestion (leahecole)
dc11440 blacken and add f-strings to dms notation (Symmetries)
39b5289 Merge branch 'data-ingestion' of https://github.com/Symmetries/python… (Symmetries)
d35b855 change test variables names to match data cleaning (Symmetries)
6105f79 blacken setup_test file (Symmetries)
35ec8cb fix unchanged variable name (Symmetries)
9561f35 WIP: address PR comments (Symmetries)
3242654 apply temporary fix for ANACONDA optional component (Symmetries)
b82059b remove data cleaning files (Symmetries)
2f655e3 Merge branch 'master' into data-ingestion (leahecole)
New file:
@@ -0,0 +1 @@
pytest==6.0.0
New file:
@@ -0,0 +1,6 @@
#grpcio==1.29.0
#google-auth==1.16.0
#google-auth-httplib2==0.0.3
google-cloud-storage==1.28.1
google-cloud-dataproc==2.0.0
google-cloud-bigquery==1.25.0
New file:
@@ -0,0 +1,211 @@
"""Setup Dataproc job for Data Science Onramp Sample Application
This job ingests an external gas prices in NY dataset as well as
takes a New York Citibike dataset available on BigQuery and
"dirties" the dataset before uploading it back to BigQuery
It needs the following arguments
* the name of the Google Cloud Storage bucket to be used
* the name of the BigQuery dataset to be created
* an optional --test flag to upload a subset of the dataset for testing
"""

import random
import sys

from google.cloud import bigquery
import pandas as pd
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, expr, UserDefinedFunction, when
from pyspark.sql.types import FloatType, StringType, StructField, StructType

TABLE = "bigquery-public-data.new_york_citibike.citibike_trips"
CITIBIKE_TABLE_NAME = "RAW_DATA"
EXTERNAL_TABLES = {
    "gas_prices": {
        "url": "https://data.ny.gov/api/views/wuxr-ni2i/rows.csv",
        "schema": StructType(
            [
                StructField("Date", StringType(), True),
                StructField("New_York_State_Average_USD_per_Gal", FloatType(), True),
                StructField("Albany_Average_USD_per_Gal", FloatType(), True),
                StructField("Blinghamton_Average_USD_per_Gal", FloatType(), True),
                StructField("Buffalo_Average_USD_per_Gal", FloatType(), True),
                StructField("Nassau_Average_USD_per_Gal", FloatType(), True),
                StructField("New_York_City_Average_USD_per_Gal", FloatType(), True),
                StructField("Rochester_Average_USD_per_Gal", FloatType(), True),
                StructField("Syracuse_Average_USD_per_Gal", FloatType(), True),
                StructField("Utica_Average_USD_per_Gal", FloatType(), True),
            ]
        ),
    },
}


# START MAKING DATA DIRTY
def trip_duration(duration):
    """Converts trip duration to other units"""
    if not duration:
        return None
    seconds = f"{str(duration)} s"
    minutes = f"{str(float(duration) / 60)} min"
    hours = f"{str(float(duration) / 3600)} h"
    return random.choices(
        [seconds, minutes, hours, str(random.randint(-1000, -1))],
        weights=[0.3, 0.3, 0.3, 0.1],
    )[0]


def station_name(name):
    """Replaces '&' with '/' with a 50% chance"""
    if not name:
        return None
    return random.choice([name, name.replace("&", "/")])


def user_type(user):
    """Manipulates the user type string"""
    if not user:
        return None
    return random.choice(
        [
            user,
            user.upper(),
            user.lower(),
            "sub" if user == "Subscriber" else user,
            "cust" if user == "Customer" else user,
        ]
    )


def gender(s):
    """Manipulates the gender string"""
    if not s:
        return None
    return random.choice(
        [
            s.upper(),
            s.lower(),
            s[0].upper() if len(s) > 0 else "",
            s[0].lower() if len(s) > 0 else "",
        ]
    )


def convert_angle(angle):
    """Converts long and lat to DMS notation"""
    if not angle:
        return None
    degrees = int(angle)
    minutes = int((angle - degrees) * 60)
    seconds = int((angle - degrees - minutes / 60) * 3600)
    new_angle = f"{degrees}\u00B0{minutes}'{seconds}\""
    return random.choices([str(angle), new_angle], weights=[0.55, 0.45])[0]


def create_bigquery_dataset(dataset_name):
    # Create BigQuery Dataset
    client = bigquery.Client()
    dataset_id = f"{client.project}.{dataset_name}"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = "US"
    dataset = client.create_dataset(dataset)


def write_to_bigquery(df, table_name, dataset_name):
    """Write a dataframe to BigQuery"""
    client = bigquery.Client()
    dataset_id = f"{client.project}.{dataset_name}"

    # Saving the data to BigQuery
    df.write.format("bigquery").option("table", f"{dataset_id}.{table_name}").save()

    print(f"Table {table_name} successfully written to BigQuery")


def main():
    # Get command line arguments
    BUCKET_NAME = sys.argv[1]
    DATASET_NAME = sys.argv[2]

    # Create a SparkSession under the name "setup"
    spark = SparkSession.builder.appName("setup").getOrCreate()

    spark.conf.set("temporaryGcsBucket", BUCKET_NAME)

    create_bigquery_dataset(DATASET_NAME)

    # Whether we are running the job as a test
    test = False

    # Check whether or not the job is running as a test
    if "--test" in sys.argv:
        test = True
        print("A subset of the whole dataset will be uploaded to BigQuery")
    else:
        print("Results will be uploaded to BigQuery")

    # Ingest External Datasets
    for table_name, data in EXTERNAL_TABLES.items():
        df = spark.createDataFrame(pd.read_csv(data["url"]), schema=data["schema"])

        write_to_bigquery(df, table_name, DATASET_NAME)

    # Check if table exists
    try:
        df = spark.read.format("bigquery").option("table", TABLE).load()
        # if we are running a test, perform computations on a subset of the data
        if test:
            df = df.sample(False, 0.00001)
    except Py4JJavaError:
        print(f"{TABLE} does not exist. ")
        return

    # Declare dictionary with keys column names and values user defined
    # functions and return types
    udf_map = {
        "tripduration": (trip_duration, StringType()),
        "start_station_name": (station_name, StringType()),
        "start_station_latitude": (convert_angle, StringType()),
        "start_station_longitude": (convert_angle, StringType()),
        "end_station_name": (station_name, StringType()),
        "end_station_latitude": (convert_angle, StringType()),
        "end_station_longitude": (convert_angle, StringType()),
        "usertype": (user_type, StringType()),
        "gender": (gender, StringType()),
    }

    # Declare which columns to set some values to null randomly
    null_columns = [
        "tripduration",
        "starttime",
        "stoptime",
        "start_station_latitude",
        "start_station_longitude",
        "end_station_latitude",
        "end_station_longitude",
    ]

    # Dirty the columns
    for name, udf in udf_map.items():
        df = df.withColumn(name, UserDefinedFunction(*udf)(name))

    # Format the datetimes correctly
    for name in ["starttime", "stoptime"]:
        df = df.withColumn(name, date_format(name, "yyyy-MM-dd'T'HH:mm:ss"))

    # Randomly set about 5% of the values in some columns to null
    for name in null_columns:
        df = df.withColumn(name, when(expr("rand() < 0.05"), None).otherwise(df[name]))

    # Duplicate about 0.01% of the rows
    dup_df = df.sample(True, 0.0001)

    # Create final dirty dataframe
    df = df.union(dup_df)

    print("Uploading citibike dataset...")
    write_to_bigquery(df, CITIBIKE_TABLE_NAME, DATASET_NAME)


if __name__ == "__main__":
    main()
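The dirtying helpers above are easiest to understand by running them on a couple of rows locally. The following is a minimal sketch, not part of the PR: it assumes PySpark is installed locally, that the file above is saved as setup.py (the name the submit script below uses) alongside this snippet with its requirements importable, and the two sample rows are invented for illustration.

# Minimal local sketch of the dirtying UDFs above; sample rows are invented.
from pyspark.sql import SparkSession
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

from setup import convert_angle, station_name, trip_duration

spark = SparkSession.builder.appName("dirty-demo").master("local[*]").getOrCreate()

rows = [
    (1500, "Broadway & W 24 St", 40.7441),
    (300, "Carmine St & 6 Ave", 40.7303),
]
df = spark.createDataFrame(
    rows, ["tripduration", "start_station_name", "start_station_latitude"]
)

# Same (function, return type) pattern that main() uses via udf_map
for name, fn in [
    ("tripduration", trip_duration),
    ("start_station_name", station_name),
    ("start_station_latitude", convert_angle),
]:
    df = df.withColumn(name, UserDefinedFunction(fn, StringType())(name))

# Each run randomizes units, '&' vs '/', and decimal vs DMS notation
df.show(truncate=False)
spark.stop()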
New file:
@@ -0,0 +1,9 @@
# Submit a PySpark job via the Cloud Dataproc Jobs API
# Requires having CLUSTER_NAME and BUCKET_NAME set as
# environment variables

gcloud dataproc jobs submit pyspark \
    --cluster ${CLUSTER_NAME} \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=FATAL \
    setup.py -- ${BUCKET_NAME} new_york_citibike_trips
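The same submission can be done programmatically with the google-cloud-dataproc client pinned in the requirements above. This sketch is not part of the PR and makes several assumptions: setup.py has already been uploaded to the bucket, the region is a placeholder, and the submit_job_as_operation call is taken from the rewritten 2.x client line (verify it against the pinned 2.0.0 before relying on it).

# Hedged sketch: submit the same PySpark job via the Dataproc client library.
import os

from google.cloud import dataproc_v1

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
cluster_name = os.environ["CLUSTER_NAME"]
bucket_name = os.environ["BUCKET_NAME"]
region = "us-central1"  # placeholder; the PR does not specify a region

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {
        # Assumes setup.py was copied into the bucket beforehand
        "main_python_file_uri": f"gs://{bucket_name}/setup.py",
        "args": [bucket_name, "new_york_citibike_trips"],
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # blocks until the job finishes
print(f"Job finished; driver output at {response.driver_output_resource_uri}")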