

AdTech Sample Notebook (Part 1) - Databricks

The AdTech Sample Notebook provides example code for analyzing advertising-based web logs using Databricks. It includes steps for setting up an S3 connection, creating an external table with regular expressions, identifying country, browser, and OS information from web logs, and processing Apache log data. Instructions for data import and AWS configuration are also provided.

Uploaded by

Tuan Minh Pham

5/5/2020 AdTech Sample Notebook (Part 1) - Databricks


Advertising Technology Sample Notebook (Part 1)



The purpose of this notebook is to provide example code for making sense of advertising-based web logs. This notebook does the following:
- Set up the connection to your S3 bucket to access the web logs
- Create an external table against these web logs, using a regular expression to parse the logs
- Identify the country (ISO 3166-1 three-letter country code) based on IP address by calling a REST web service API
- Identify browser and OS information based on the User Agent string within the web logs, using the user-agents PyPI package
- Convert the Apache web log date information, create a userid, and join back to the browser and OS information
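As a rough illustration of the date-conversion step, the sketch below parses the timestamp from the sample log row shown later in this notebook using only the Python standard library. The helper name parse_apache_timestamp and the constant APACHE_TS_FORMAT are illustrative assumptions, not names from the notebook, and the notebook itself may perform this conversion differently (for example in Spark SQL).

```python
from datetime import datetime, timezone

# Format of the bracketed timestamp in an Apache access log,
# e.g. 14/Aug/2015:00:05:15 -0800
# Note: %b (month abbreviation) is locale-dependent; this assumes an English/C locale.
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_timestamp(raw):
    """Parse an Apache access-log timestamp into a timezone-aware datetime."""
    return datetime.strptime(raw, APACHE_TS_FORMAT)


ts = parse_apache_timestamp("14/Aug/2015:00:05:15 -0800")
print(ts.isoformat())                           # 2015-08-14T00:05:15-08:00
print(ts.astimezone(timezone.utc).isoformat())  # 2015-08-14T08:05:15+00:00
```

Keeping the UTC offset in the parsed value makes it possible to normalize all rows to one time zone before joining them with other data.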

Setup Instructions
Please refer to the Databricks Data Import How-To Guide (https://databricks.com/wp-content/uploads/2015/08/Databricks-how-to-data-import.pdf) on how to import data into S3 for use with Databricks notebooks.

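> # Set up AWS configuration


from urllib.parse import quote  # Python 3; in Python 2 this function was urllib.quote

ACCESS_KEY = "[REPLACE_WITH_ACCESS_KEY]"
SECRET_KEY = "[REPLACE_WITH_SECRET_KEY]"
# URL-encode the secret key; safe="" also encodes "/" so the key can be embedded in the mount URI
ENCODED_SECRET_KEY = quote(SECRET_KEY, safe="")
AWS_BUCKET_NAME = "[REPLACE_WITH_BUCKET_NAME]"
MOUNT_NAME = "mdl"

# Mount the S3 bucket at /mnt/mdl
dbutils.fs.mount("s3n://%s:%s@%s/" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)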

Out[9]: True
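The secret key is URL-encoded before mounting because AWS secret keys can contain characters such as "/" and "+", which would otherwise be misread as part of the URI. The following standalone sketch shows the effect of passing an empty safe set. The key value is made up for illustration, encode_secret_key is a hypothetical helper name, and the code assumes Python 3, where the function lives in urllib.parse.

```python
from urllib.parse import quote


def encode_secret_key(secret_key):
    """Percent-encode every reserved character, including '/'."""
    return quote(secret_key, safe="")


example_key = "abc/def+ghi"  # made-up value, not a real credential

print(encode_secret_key(example_key))  # abc%2Fdef%2Bghi
print(quote(example_key))              # abc/def%2Bghi  (default safe="/" leaves the slash)
```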

> # View the log files within the mdl mount


display(dbutils.fs.ls("/mnt/mdl/accesslogs/"))

path name
dbfs:/mnt/mdl/accesslogs/databricks.com-access.log databricks.com-acce

> # Count the number of rows within the sample Apache Access logs
myAccessLogs = sc.textFile("/mnt/mdl/accesslogs/")
myAccessLogs.count()

Out[90]: 5383


Create External Table


Create an external table against the access log data, defining a regular expression format as part of the serializer/deserializer (SerDe) definition.
The table definition itself parses each log line, so no separate ETL logic is needed.
Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
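
The SerDe regular expression itself is not shown in this excerpt. As a minimal sketch of the same idea in plain Python, the pattern below captures one group per %s placeholder in the original format and is applied to the example row above. The pattern, the field names, and the parse_log_line helper are all assumptions made for illustration and are not taken from the notebook's table definition.

```python
import re

# One capture group per %s in: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP/1\.1" (\d+) (\d+)$'
)

# Hypothetical column names, chosen for readability only
FIELD_NAMES = (
    "ip_address", "client_id", "user_id", "timestamp",
    "method", "endpoint", "status", "size",
)


def parse_log_line(line):
    """Return a dict of fields for a matching log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


row = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
       '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')
parsed = parse_log_line(row)
print(parsed["ip_address"], parsed["endpoint"], parsed["status"])
```

Returning None for non-matching lines mirrors how a regex-based table definition typically yields empty columns for malformed rows rather than failing the whole query.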

https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Miscellaneous/AdTech_Sample_Notebook_Part_1.html
