Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!
Whether you're a data scientist, engineer, or analyst, this tutorial will guide you through the process of creating realistic and diverse datasets using Faker and then harnessing the distributed computing capabilities of PySpark to aggregate and analyze the generated data. Throughout this guide, you will explore effective techniques for data generation that enhance performance and optimize resource usage. Whether you're working with large datasets or simply seeking to streamline your data generation process, this tutorial offers valuable insights to elevate your skills.
Note: this is not synthetic data in the statistical sense: it is generated with straightforward methods and is unlikely to conform to any real-life distribution. Still, it serves as a valuable resource for testing purposes when authentic data is unavailable.
The Python faker module needs to be installed. Note that on Google Colab you can use !pip as well as plain pip (no exclamation mark).
!pip install faker
Collecting faker Downloading Faker-30.3.0-py3-none-any.whl.metadata (15 kB) Requirement already satisfied: python-dateutil>=2.4 in /usr/local/lib/python3.10/dist-packages (from faker) (2.8.2) Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from faker) (4.12.2) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.4->faker) (1.16.0) Downloading Faker-30.3.0-py3-none-any.whl (1.8 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 14.9 MB/s eta 0:00:00 Installing collected packages: faker Successfully installed faker-30.3.0
Import Faker and set a random seed ($42$).
from faker import Faker
# Set the seed value of the shared `random.Random` object
# across all internal generators that will ever be created
Faker.seed(42)
fake is a fake data generator with the de_DE locale.
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
# Creates and seeds a unique `random.Random` object for
# each internal generator of this `Faker` instance
fake.seed_instance(42)
With fake you can generate fake data, such as name, email, etc.
print(f"A fake name: {fake.name()}")
print(f"A fake email: {fake.email()}")
A fake name: Aleksandr Weihmann A fake email: [email protected]
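Because the instance was seeded, re-seeding it with the same value reproduces the same sequence, which is handy for reproducible tests:
fake.seed_instance(42)
# this should print the same name as above ('Aleksandr Weihmann')
print(f"A fake name: {fake.name()}")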
Import Pandas to save data into a dataframe
# true if running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if not IN_COLAB:
    !pip install pandas==1.5.3
import pandas as pd
The function create_row_faker creates num rows of fake data (one row by default). Here we choose to generate rows containing the following fields:
fake.name()
fake.random_int(0, 100) (used for the age field)
fake.postcode()
fake.email()
fake.country()
def create_row_faker(num=1):
    fake = Faker('de_DE')
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
               } for x in range(num)]
    return output
Generate a single row
create_row_faker()
[{'name': 'Aleksandr Weihmann', 'age': 35, 'postcode': '32181', 'email': '[email protected]', 'nationality': 'Fidschi'}]
Generate n=3 rows
create_row_faker(3)
[{'name': 'Aleksandr Weihmann', 'age': 35, 'postcode': '32181', 'email': '[email protected]', 'nationality': 'Fidschi'}, {'name': 'Prof. Kurt Bauer B.A.', 'age': 91, 'postcode': '37940', 'email': '[email protected]', 'nationality': 'Guatemala'}, {'name': 'Ekkehart Wiek-Kallert', 'age': 13, 'postcode': '61559', 'email': '[email protected]', 'nationality': 'Brasilien'}]
Generate a dataframe df_fake of 5000 rows using create_row_faker. We're using the cell magic %%time to time the operation.
%%time
df_fake = pd.DataFrame(create_row_faker(5000))
CPU times: user 604 ms, sys: 21.6 ms, total: 626 ms Wall time: 922 ms
View dataframe
df_fake
 | name | age | postcode | email | nationality |
---|---|---|---|---|---|
0 | Aleksandr Weihmann | 35 | 32181 | [email protected] | Fidschi |
1 | Prof. Kurt Bauer B.A. | 91 | 37940 | [email protected] | Guatemala |
2 | Ekkehart Wiek-Kallert | 13 | 61559 | [email protected] | Brasilien |
3 | Annelise Rohleder-Hornig | 80 | 93103 | [email protected] | Guatemala |
4 | Magrit Knappe B.A. | 47 | 34192 | [email protected] | Guadeloupe |
... | ... | ... | ... | ... | ... |
4995 | Hanno Jopich-Rädel | 99 | 13333 | [email protected] | Syrien |
4996 | Herr Arno Ebert B.A. | 63 | 36790 | [email protected] | Slowenien |
4997 | Miroslawa Schüler | 22 | 11118 | [email protected] | Republik Moldau |
4998 | Janusz Nerger | 74 | 33091 | [email protected] | Belarus |
4999 | Frau Cathleen Bähr | 97 | 89681 | [email protected] | St. Barthélemy |
5000 rows × 5 columns
For more fake data generators see Faker's standard providers as well as community providers.
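For instance, here are a few more generators from the standard providers (a quick sketch; the actual values depend on the locale and the seed):
print(fake.company())        # a company name
print(fake.address())        # a multi-line street address
print(fake.date_of_birth())  # a datetime.date
print(fake.iban())           # a bank account number (IBAN)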
Install PySpark.
!pip install pyspark
Collecting pyspark Downloading pyspark-3.5.3.tar.gz (317.3 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 4.2 MB/s eta 0:00:00 Preparing metadata (setup.py) ... done Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7) Building wheels for collected packages: pyspark Building wheel for pyspark (setup.py) ... done Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840625 sha256=d49b95160902fd735151aa83f8f91b2b5d347b39ed3bc52a08467c1d73f9b4c7 Stored in directory: /root/.cache/pip/wheels/1b/3a/92/28b93e2fbfdbb07509ca4d6f50c5e407f48dce4ddbda69a4ab Successfully built pyspark Installing collected packages: pyspark Successfully installed pyspark-3.5.3
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Faker demo") \
.getOrCreate()
df = spark.createDataFrame(create_row_faker(5000))
Creating the dataframe directly from a list of dicts makes Spark infer the datatypes and issue a deprecation warning. To avoid the warning, either use pyspark.sql.Row and let Spark infer datatypes, or create a schema for the dataframe specifying the datatypes of all fields (here's the list of all datatypes).
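The Row-based alternative could look like this (a brief sketch; df_rows is not used in the rest of the tutorial):
from pyspark.sql import Row
rows = [Row(**r) for r in create_row_faker(10)]
df_rows = spark.createDataFrame(rows)
df_rows.printSchema()
The schema-based approach, used in the rest of this tutorial, looks like this: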
from pyspark.sql.types import *
schema = StructType([StructField('name', StringType()),
StructField('age',IntegerType()),
StructField('postcode',StringType()),
StructField('email', StringType()),
StructField('nationality',StringType())])
df = spark.createDataFrame(create_row_faker(5000), schema)
df.printSchema()
root |-- name: string (nullable = true) |-- age: integer (nullable = true) |-- postcode: string (nullable = true) |-- email: string (nullable = true) |-- nationality: string (nullable = true)
Let's generate some more data (a dataframe with $5\cdot10^4$ rows). The dataframe will be partitioned by Spark.
%%time
n = 5*10**4
df = spark.createDataFrame(create_row_faker(n), schema)
CPU times: user 4.61 s, sys: 42.8 ms, total: 4.66 s Wall time: 5.55 s
df.show(10, truncate=False)
+---------------------------------+---+--------+----------------------------+----------------------+ |name |age|postcode|email |nationality | +---------------------------------+---+--------+----------------------------+----------------------+ |Aleksandr Weihmann |35 |32181 |[email protected] |Fidschi | |Prof. Kurt Bauer B.A. |91 |37940 |[email protected] |Guatemala | |Ekkehart Wiek-Kallert |13 |61559 |[email protected] |Brasilien | |Annelise Rohleder-Hornig |80 |93103 |[email protected] |Guatemala | |Magrit Knappe B.A. |47 |34192 |[email protected]|Guadeloupe | |Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413 |[email protected] |Litauen | |Franjo Etzold-Hentschel |95 |96965 |[email protected] |Belize | |Steffen Dörschner |19 |69166 |[email protected] |Tunesien | |Milos Ullmann |14 |51462 |[email protected] |Griechenland | |Prof. Urban Döring |80 |89325 |[email protected] |Vereinigtes Königreich| +---------------------------------+---+--------+----------------------------+----------------------+ only showing top 10 rows
It took a long time (about $5$ seconds of wall time for $50000$ rows)!
Can we do better?
The function create_row_faker() returns a list. This is not memory-efficient; what we need instead is a generator.
d = create_row_faker(5)
# what type is d?
type(d)
list
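A generator produces rows lazily instead of materialising them all in memory at once. A quick illustration of the difference in footprint, using plain Python objects rather than Faker:
import sys
as_list = [i for i in range(10**6)]
as_gen = (i for i in range(10**6))
# the list holds a million references, while the generator is just a small object
print(sys.getsizeof(as_list), sys.getsizeof(as_gen))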
Let us turn d into a generator.
d = ({"name": fake.name(),
"age": fake.random_int(0, 100),
"postcode": fake.postcode(),
"email": fake.email(),
"nationality": fake.country()} for i in range(5))
# what type is d?
type(d)
generator
%%time
n = 5*10**4
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
fake.seed_instance(42)
d = ({"name": fake.name(),
"age": fake.random_int(0, 100),
"postcode": fake.postcode(),
"email": fake.email(),
"nationality": fake.country()}
for i in range(n))
df = spark.createDataFrame(d, schema)
CPU times: user 3.69 s, sys: 20.6 ms, total: 3.71 s Wall time: 3.77 s
df.show(10, truncate=False)
+---------------------------------+---+--------+----------------------------+----------------------+ |name |age|postcode|email |nationality | +---------------------------------+---+--------+----------------------------+----------------------+ |Aleksandr Weihmann |35 |32181 |[email protected] |Fidschi | |Prof. Kurt Bauer B.A. |91 |37940 |[email protected] |Guatemala | |Ekkehart Wiek-Kallert |13 |61559 |[email protected] |Brasilien | |Annelise Rohleder-Hornig |80 |93103 |[email protected] |Guatemala | |Magrit Knappe B.A. |47 |34192 |[email protected]|Guadeloupe | |Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413 |[email protected] |Litauen | |Franjo Etzold-Hentschel |95 |96965 |[email protected] |Belize | |Steffen Dörschner |19 |69166 |[email protected] |Tunesien | |Milos Ullmann |14 |51462 |[email protected] |Griechenland | |Prof. Urban Döring |80 |89325 |[email protected] |Vereinigtes Königreich| +---------------------------------+---+--------+----------------------------+----------------------+ only showing top 10 rows
This was somewhat faster, but still slow: the data is still generated sequentially on the driver.
Let us look at how one can leverage Spark's parallelism to generate dataframes and speed up the process.
We are going to use Spark's RDDs and the parallelize function. In order to do this, we need to extract the Spark context from the current session.
sc = spark.sparkContext
sc
In order to decide on the number of partitions, we are going to look at the number of (virtual) CPUs on the local machine. On a cluster you would have more CPUs across multiple nodes, but that is not the case here.
The standard Google Colab virtual machine has $2$ virtual CPUs (one core with two threads), so that is the maximum parallelization that you can achieve.
Note:
CPUs = threads per core × cores per socket × sockets
!lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s): 2 Thread(s) per core: 2 Core(s) per socket: 1 Socket(s): 1
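For reference, the same check can be done from Python; sc.defaultParallelism shows how many partitions Spark uses by default on this machine:
import os
print(os.cpu_count())          # logical CPUs visible to the OS
print(sc.defaultParallelism)   # Spark's default parallelism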
Due to the limited number of CPUs on this machine, we'll use only $2$ partitions. Even so, the data generation time improves dramatically! (Keep in mind that Spark evaluates lazily, so part of the work is deferred until an action such as show() runs.)
%%time
n = 5*10**4
num_partitions = 2
# Create an empty RDD with the specified number of partitions
empty_rdd = sc.parallelize(range(num_partitions), num_partitions)
# Define a function that will run on each partition to generate the fake data
def generate_fake_data(_):
    fake = Faker('de_DE')  # Create a new Faker instance per partition
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
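    # Caveat: every partition is seeded with the same value here, so all
    # partitions produce identical rows; derive the seed from the partition's
    # input (e.g. next(iter(_))) if distinct data per partition is needed.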
    for _ in range(n // num_partitions):  # Divide work across partitions
        yield {
            "name": fake.name(),
            "age": fake.random_int(0, 100),
            "postcode": fake.postcode(),
            "email": fake.email(),
            "nationality": fake.country()
        }
# Use mapPartitions to generate fake data for each partition
rdd = empty_rdd.mapPartitions(generate_fake_data)
# Convert the RDD to a DataFrame
df = rdd.toDF()
CPU times: user 17.3 ms, sys: 5.13 ms, total: 22.5 ms Wall time: 636 ms
I'm convinced that the reason everyone always looks at the first $5$ rows in Spark's RDDs is an homage to the classic jazz piece 🎷🎶.
rdd.take(5)
[{'name': 'Aleksandr Weihmann', 'age': 35, 'postcode': '32181', 'email': '[email protected]', 'nationality': 'Fidschi'}, {'name': 'Prof. Kurt Bauer B.A.', 'age': 91, 'postcode': '37940', 'email': '[email protected]', 'nationality': 'Guatemala'}, {'name': 'Ekkehart Wiek-Kallert', 'age': 13, 'postcode': '61559', 'email': '[email protected]', 'nationality': 'Brasilien'}, {'name': 'Annelise Rohleder-Hornig', 'age': 80, 'postcode': '93103', 'email': '[email protected]', 'nationality': 'Guatemala'}, {'name': 'Magrit Knappe B.A.', 'age': 47, 'postcode': '34192', 'email': '[email protected]', 'nationality': 'Guadeloupe'}]
df.show()
+---+--------------------+--------------------+--------------------+--------+ |age| email| name| nationality|postcode| +---+--------------------+--------------------+--------------------+--------+ | 35|bbeckmann@example...| Aleksandr Weihmann| Fidschi| 32181| | 91|hildaloechel@exam...|Prof. Kurt Bauer ...| Guatemala| 37940| | 13| [email protected]|Ekkehart Wiek-Kal...| Brasilien| 61559| | 80|[email protected]|Annelise Rohleder...| Guatemala| 93103| | 47|gottliebmisicher@...| Magrit Knappe B.A.| Guadeloupe| 34192| | 29| [email protected]|Univ.Prof. Gotthi...| Litauen| 56413| | 95|frederikpechel@ex...|Franjo Etzold-Hen...| Belize| 96965| | 19| [email protected]| Steffen Dörschner| Tunesien| 69166| | 14| [email protected]| Milos Ullmann| Griechenland| 51462| | 80|augustewulff@exam...| Prof. Urban Döring|Vereinigtes König...| 89325| | 62|polinarosenow@exa...| Krzysztof Junitz| Belarus| 15430| | 16| [email protected]|Frau Zita Wesack ...| Samoa| 82489| | 85|carlokambs@exampl...| Olaf Jockel MBA.| Nordmazedonien| 78713| | 90|karl-heinrichstau...|Prof. Emil Albers...| Falklandinseln| 31051| | 60| [email protected]|Otfried Rudolph-Rust| Madagaskar| 76311| | 82|hans-hermannreisi...| Michail Söding| Bulgarien| 06513| | 31|davidssusan@examp...| Dr. Erna Misicher| Côte d’Ivoire| 78108| | 51| [email protected]|Dipl.-Ing. Jana H...| Äußeres Ozeanien| 26064| | 7|bjoernpechel@exam...|Dr. Cordula Hübel...| Trinidad und Tobago| 50097| | 86|scholldarius@exam...|Herr Konstantinos...| Republik Korea| 61939| +---+--------------------+--------------------+--------------------+--------+ only showing top 20 rows
Show the first five records in the dataframe of fake data.
df.show(n=5, truncate=False)
+---+----------------------------+------------------------+-----------+--------+ |age|email |name |nationality|postcode| +---+----------------------------+------------------------+-----------+--------+ |35 |[email protected] |Aleksandr Weihmann |Fidschi |32181 | |91 |[email protected] |Prof. Kurt Bauer B.A. |Guatemala |37940 | |13 |[email protected] |Ekkehart Wiek-Kallert |Brasilien |61559 | |80 |[email protected] |Annelise Rohleder-Hornig|Guatemala |93103 | |47 |[email protected]|Magrit Knappe B.A. |Guadeloupe |34192 | +---+----------------------------+------------------------+-----------+--------+ only showing top 5 rows
Do some data aggregation:
import pyspark.sql.functions as F
df.groupBy('postcode') \
.agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \
.filter('Count>3') \
.orderBy('Average age', ascending=False) \
.show(5)
+--------+-----+-----------+ |postcode|Count|Average age| +--------+-----+-----------+ | 60653| 4| 98.5| | 59679| 4| 98.5| | 37287| 4| 98.5| | 63287| 4| 98.0| | 37841| 4| 97.5| +--------+-----+-----------+ only showing top 5 rows
In this run, postcodes $60653$, $59679$ and $37287$ share the highest average age ($98.5$); since the data is random, the exact values change from run to run. Show all entries for a given postcode using filter (a postcode that does not occur in the data, such as $18029$ below, returns an empty dataframe).
df.filter('postcode==18029').show(truncate=False)
+---+-----+----+-----------+--------+ |age|email|name|nationality|postcode| +---+-----+----+-----------+--------+ +---+-----+----+-----------+--------+
We are going to use multiple locales with weights (following the examples in the documentation).
Here's the list of all available locales.
from faker import Faker
# set a seed for the random generator
Faker.seed(0)
Generate data with locales de_DE and de_AT, with weights of $5$ and $2$ respectively.
The distribution of locales will be:
- de_DE: $71.43\%$ of the time ($5 / (5+2)$)
- de_AT: $28.57\%$ of the time ($2 / (5+2)$)
from collections import OrderedDict
locales = OrderedDict([
('de_DE', 5),
('de_AT', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales
['de_DE', 'de_AT']
fake.seed_locale('de_DE', 0)
fake.seed_locale('de_AT', 0)
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
'mail', 'current_location'])
{'current_location': (Decimal('26.547114'), Decimal('-10.243190')), 'blood_group': 'B-', 'name': 'Axel Jung', 'sex': 'M', 'mail': '[email protected]', 'birthdate': datetime.date(2004, 1, 27)}
from pyspark.sql.types import *
location = StructField('current_location',
StructType([StructField('lat', DecimalType()),
StructField('lon', DecimalType())])
)
schema = StructType([StructField('name', StringType()),
StructField('birthdate', DateType()),
StructField('sex', StringType()),
StructField('blood_group', StringType()),
StructField('mail', StringType()),
location
])
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
'mail', 'current_location'])
{'current_location': (Decimal('79.153888'), Decimal('-0.003034')), 'blood_group': 'B-', 'name': 'Dr. Anita Suppan', 'sex': 'F', 'mail': '[email protected]', 'birthdate': datetime.date(1980, 10, 9)}
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Faker demo - part 2") \
.getOrCreate()
Create dataframe with $5\cdot10^3$ rows.
%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
'mail', 'current_location'])
for i in range(n))
df = spark.createDataFrame(d, schema)
CPU times: user 2.85 s, sys: 11 ms, total: 2.86 s Wall time: 3.16 s
df.printSchema()
root |-- name: string (nullable = true) |-- birthdate: date (nullable = true) |-- sex: string (nullable = true) |-- blood_group: string (nullable = true) |-- mail: string (nullable = true) |-- current_location: struct (nullable = true) | |-- lat: decimal(10,0) (nullable = true) | |-- lon: decimal(10,0) (nullable = true)
Note how location represents a tuple data structure (a StructType of StructFields).
df.show(n=10, truncate=False)
+---------------------------+----------+---+-----------+-------------------------+----------------+ |name |birthdate |sex|blood_group|mail |current_location| +---------------------------+----------+---+-----------+-------------------------+----------------+ |Prof. Valentine Niemeier |1979-11-12|F |B- |[email protected] |{74, 164} | |Magrit Graf |1943-09-15|F |A- |[email protected] |{-86, -34} | |Harriet Weiß-Liebelt |1960-07-24|F |AB+ |[email protected] |{20, 126} | |Marisa Heser |1919-08-27|F |B- |[email protected] |{73, 169} | |Alexa Loidl-Schönberger |1934-08-30|F |O- |[email protected] |{-23, -117} | |Rosa-Maria Schwital B.Sc. |1928-02-08|F |O- |[email protected] |{2, -113} | |Herr Roland Caspar B.Sc. |1932-09-09|M |O- |[email protected]|{24, 100} | |Bernard Mude |1943-03-01|M |O- |[email protected] |{22, -104} | |Prof. Violetta Eberl |1913-09-09|F |B+ |[email protected] |{80, -135} | |Alexandre Oestrovsky B.Eng.|1925-03-29|M |A- |[email protected] |{-62, 43} | +---------------------------+----------+---+-----------+-------------------------+----------------+ only showing top 10 rows
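Note that the coordinates are shown without decimal places: DecimalType() defaults to a precision of $10$ and a scale of $0$. To keep the fractional part, the scale can be declared explicitly, for example:
location = StructField('current_location',
                       StructType([StructField('lat', DecimalType(9, 6)),
                                   StructField('lon', DecimalType(9, 6))]))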
Write to parquet file (Parquet is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):
df.write.mode("overwrite").parquet("fakedata.parquet")
Check the size of parquet file (it is actually a directory containing the partitions):
!du -h fakedata.parquet
188K fakedata.parquet
!ls -lh fakedata.parquet
total 172K -rw-r--r-- 1 root root 71K Oct 13 21:28 part-00000-7184573d-f659-4fd7-83ba-7291658ca6ed-c000.snappy.parquet -rw-r--r-- 1 root root 98K Oct 13 21:28 part-00001-7184573d-f659-4fd7-83ba-7291658ca6ed-c000.snappy.parquet -rw-r--r-- 1 root root 0 Oct 13 21:28 _SUCCESS
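To read the data back into a dataframe (a quick check, assuming the Spark session is still active):
df2 = spark.read.parquet("fakedata.parquet")
print(df2.count())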
Don't forget to close the Spark session when you're done!
Even when no jobs are running, the Spark session holds memory resources that are released only when the session is properly stopped.
# Function to check memory usage
import subprocess
def get_memory_usage_ratio():
    # Run the 'free -h' command
    result = subprocess.run(['free', '-h'], stdout=subprocess.PIPE, text=True)
    # Parse the output
    lines = result.stdout.splitlines()
    # Initialize used and total memory
    used_memory = None
    total_memory = None
    # The second line contains the memory information
    if len(lines) > 1:
        # Split the line into parts
        memory_parts = lines[1].split()
        total_memory = memory_parts[1]  # Total memory
        used_memory = memory_parts[2]  # Used memory
    return used_memory, total_memory
Stop the session and compare.
# Check memory usage before stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used before stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")
Memory used before stopping Spark session: 1.7Gi Total Memory: 12Gi
# Stop the Spark session
spark.stop()
# Check memory usage after stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used after stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")
Memory used after stopping Spark session: 1.6Gi Total Memory: 12Gi
The amount of memory released may not be impressive in this case, but holding onto unnecessary resources is inefficient. Also, memory waste can add up quickly when multiple sessions are running.