5/5/2020 AdTech Sample Notebook (Part 1) - Databricks
Advertising Technology Sample Notebook (Part 1)
The purpose of this notebook is to provide example code for making sense of advertising-based web logs. This notebook does the following:
Sets up the connection to your S3 bucket to access the web logs
Creates an external table against these web logs, using a regular expression to parse the logs
Identifies the country (ISO 3166-1 three-letter country codes) based on IP address by calling a REST web service API
Identifies browser and OS information from the User Agent string in the web logs using the user-agents PyPI package
Converts the Apache web log date information, creates a userid, and joins back to the browser and OS information
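The date-conversion step listed above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the helper name `parse_apache_timestamp` is hypothetical, and it assumes Python 3 (for `%z` support) and an English locale (for the `%b` month abbreviation).

```python
from datetime import datetime, timezone

# Timestamp layout used by Apache access logs, e.g. 14/Aug/2015:00:05:15 -0800
# Note: %b (month abbreviation) is locale-dependent; an English locale is assumed.
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_apache_timestamp(raw):
    """Return a timezone-aware datetime for an Apache log timestamp string."""
    return datetime.strptime(raw, APACHE_TS_FORMAT)

# Timestamp taken from a sample Apache access log row
ts = parse_apache_timestamp("14/Aug/2015:00:05:15 -0800")

# Normalizing to UTC makes rows from different offsets comparable
ts_utc = ts.astimezone(timezone.utc)

print(ts.isoformat())      # 2015-08-14T00:05:15-08:00
print(ts_utc.isoformat())  # 2015-08-14T08:05:15+00:00
```

Keeping the offset during parsing and converting to UTC afterwards avoids silently treating log times as local time.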
Setup Instructions
Please refer to the Databricks Data Import How-To Guide (https://databricks.com/wp-content/uploads/2015/08/Databricks-how-to-data-import.pdf) on how to import data into S3 for use with Databricks notebooks.
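> # Set up AWS configuration
# urllib.quote exists only in Python 2; Python 3 moved it to urllib.parse
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2
ACCESS_KEY = "[REPLACE_WITH_ACCESS_KEY]"
SECRET_KEY = "[REPLACE_WITH_SECRET_KEY]"
# URL-encode the secret key; safe="" also escapes any "/" it contains
ENCODED_SECRET_KEY = quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "[REPLACE_WITH_BUCKET_NAME]"
MOUNT_NAME = "mdl"
# Mount S3 bucket at /mnt/<MOUNT_NAME>
dbutils.fs.mount("s3n://%s:%s@%s/" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)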
Out[9]: True
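The mount cell URL-encodes the secret key before embedding it in the bucket URI. The sketch below illustrates why: AWS secret keys can contain characters such as "/" and "+", which would otherwise be misread as URI delimiters. The secret, access key, and bucket name here are hypothetical placeholders, not real values.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

# Hypothetical placeholder value, not a real credential
example_secret = "abc/def+ghi"

# By default quote() leaves "/" unescaped, which would corrupt the URI
default_encoded = quote(example_secret)

# Passing safe="" escapes "/" as well
encoded = quote(example_secret, "")

# Same URI layout as the mount cell, with placeholder key and bucket names
mount_source = "s3n://%s:%s@%s/" % ("EXAMPLE_ACCESS_KEY", encoded, "example-bucket")

print(default_encoded)  # abc/def%2Bghi
print(encoded)          # abc%2Fdef%2Bghi
print(mount_source)     # s3n://EXAMPLE_ACCESS_KEY:abc%2Fdef%2Bghi@example-bucket/
```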
> # View the log files within the mdl mount
display(dbutils.fs.ls("/mnt/mdl/accesslogs/"))
path name
dbfs:/mnt/mdl/accesslogs/databricks.com-access.log databricks.com-acce
> # Count the number of rows within the sample Apache Access logs
myAccessLogs = sc.textFile("/mnt/mdl/accesslogs/")
myAccessLogs.count()
Out[90]: 5383
Create External Table
Create an external table against the access log data, defining a regular expression format as part of the serializer/deserializer (SerDe) definition.
The table definition itself handles the parsing, so no separate ETL logic is needed.
Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
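To show how a regular expression can split a row in this format into fields, here is a minimal plain-Python sketch. It is not the SerDe definition used by the table: the pattern is an assumed translation of the format string above, and the field names in `FIELDS` are hypothetical labels chosen for illustration.

```python
import re

# Assumed pattern mirroring the format: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP/1\.1" (\S+) (\S+)$'
)

# Hypothetical column names, one per capture group
FIELDS = ("ip", "identd", "user", "timestamp",
          "method", "endpoint", "status", "size")

def parse_log_line(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))

# The example web log row shown above
sample = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
          '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')

row = parse_log_line(sample)
print(row["ip"], row["timestamp"], row["endpoint"], row["status"])
```

Returning `None` for non-matching lines makes malformed rows easy to count or filter rather than failing the whole parse.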
https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Miscellaneous/AdTech_Sample_Notebook_Part_1.html