Merged
Changes from all commits
Commits
57 commits
f4e9cdb
Converted tests to pytest. Build a Python package. Update requirement…
rjurney Feb 16, 2025
c256244
Restore Python .gitignore
rjurney Feb 16, 2025
6c3df0b
Extra newline removed
rjurney Feb 16, 2025
b2838d2
Merge branch 'master' of github.com:graphframes/graphframes into rjur…
rjurney Feb 16, 2025
caf5091
Added VERSION file set to 0.8.5
rjurney Feb 16, 2025
7cfa2d1
isort; fixed edgesDF variable name.
rjurney Feb 16, 2025
2ca9a15
Merge branch 'master' of github.com:graphframes/graphframes into rjur…
rjurney Feb 16, 2025
a8bf0be
Back out Dockerfile changes
rjurney Feb 16, 2025
54a942d
Back out version change in build.sbt
rjurney Feb 16, 2025
8b0e346
Backout changes to config and run-tests
rjurney Feb 16, 2025
46c2b93
Back out pytest conversion
rjurney Feb 16, 2025
18b5da0
Back out version changes to make nose tests pass
rjurney Feb 16, 2025
8eca097
Remove changes to requirements
rjurney Feb 16, 2025
277c06f
Put nose back in requirements.txt
rjurney Feb 16, 2025
b55ee48
Remove version bump to version.sbt
rjurney Feb 16, 2025
f8a8fd9
Remove packages related to testing
rjurney Feb 16, 2025
bc2cb36
Remove old setup.py / setup.cfg
rjurney Feb 16, 2025
728be33
New pyproject.toml and poetry.lock
rjurney Feb 16, 2025
3cea1a8
Short README for Python package, poetry won't allow a ../README.md path
rjurney Feb 16, 2025
87cc975
Remove requirements files in favor of pyproject.toml
rjurney Feb 16, 2025
6f84a5a
Try to poetrize CI build
rjurney Feb 16, 2025
9a8eef0
pyspark min 3.4
rjurney Feb 16, 2025
75ecd99
Local python README in pyproject.toml
rjurney Feb 16, 2025
80231d0
Trying to remove the working folder to debug scala issue
rjurney Feb 16, 2025
2a9170b
Set Python working directory again
rjurney Feb 16, 2025
3de2263
Accidental newline
rjurney Feb 16, 2025
4662717
Install Python for test...
rjurney Feb 17, 2025
1b7b9f8
Run tests from python/ folder
rjurney Feb 17, 2025
58da493
Try running tests from python/
rjurney Feb 17, 2025
9f4aa24
poetry run the unit tests
rjurney Feb 17, 2025
11b2782
poetry run the tests
rjurney Feb 17, 2025
9772344
Try just using 'python' instead of a path
rjurney Feb 17, 2025
d55dbfe
poetry run the last line, graphframes.main
rjurney Feb 17, 2025
2fc4d08
Remove test/ folder from style paths, it doesn't exist
rjurney Feb 17, 2025
8297a13
Remove .vscode
rjurney Feb 17, 2025
2035d98
VERSION back to 0.8.4
rjurney Feb 17, 2025
f9f4bd7
Remove tutorials reference
rjurney Feb 17, 2025
9ddd6b2
VERSION is a Python thing, it belongs in python/
rjurney Feb 17, 2025
7065647
Include the README.md and LICENSE in the Python package
rjurney Feb 17, 2025
a6c7e91
Some classifiers for pyproject.toml
rjurney Feb 17, 2025
51e3e6d
Trying poetry install action instead of manual install
rjurney Feb 17, 2025
272be06
Removing SPARK_HOME
rjurney Feb 17, 2025
4587999
Returned SPARK_HOME settings
rjurney Feb 17, 2025
2422b22
Minimized the PR to just these files
rjurney Feb 17, 2025
073dced
Merge in rjurney/build-upgrades and in turn master
rjurney Feb 17, 2025
0a1faba
Created tutorials dependency group to minimize main bloat
rjurney Feb 17, 2025
c0d6d7b
Make motif.py execute in whole again
rjurney Feb 17, 2025
5bb4c26
Minor isort format and cleanup of download.py
rjurney Feb 17, 2025
99e6a4d
Minor isort format and cleanup of utils.py
rjurney Feb 17, 2025
662e197
Removed case sensitivity from the script - that was confusing people …
rjurney Feb 17, 2025
beaa35d
motif.py now matches tutorial code, runs and handles case insensitivity.
rjurney Feb 17, 2025
1bf4a9e
Regenerate poetry.lock
rjurney Feb 21, 2025
ef19784
Set up a 'graphframes stackexchange' command.
rjurney Feb 21, 2025
4400cb4
Make graphframes.tutorials.motif use a checkpoint dir unique, and fro…
rjurney Feb 21, 2025
d549c56
Use spark.sparkContext.setCheckpointDir directly instead of instantia…
rjurney Feb 21, 2025
b970636
Using 'from __future__ import annotations' instead of List and Tuple
rjurney Feb 21, 2025
3788941
Now retry three times if we can't connect for any reason in 'graphfra…
rjurney Feb 21, 2025
1 change: 1 addition & 0 deletions python/MANIFEST.in
@@ -7,3 +7,4 @@ recursive-exclude * __pycache__
recursive-exclude * *.pyc
include README.md
include LICENSE
include graphframes/tutorials/data/.exists
19 changes: 19 additions & 0 deletions python/graphframes/console.py
@@ -0,0 +1,19 @@
import click
from graphframes.tutorials import download


@click.group()
def cli():
"""GraphFrames CLI: a collection of commands for graphframes."""
pass


cli.add_command(download.stackexchange)


def main():
cli()


if __name__ == "__main__":
main()
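The `console.py` above wires a single subcommand into a click group. The same dispatch pattern can be sketched with only the standard library; the names here (`stackexchange`, `subdomain`) mirror the real command, but the body is an illustrative stand-in, not the actual download logic:

```python
import argparse


def stackexchange(args) -> str:
    # Stand-in for graphframes.tutorials.download.stackexchange
    return f"would download {args.subdomain}.stackexchange.com"


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="graphframes")
    subparsers = parser.add_subparsers(required=True)

    # Register the subcommand, as cli.add_command does for click
    sub = subparsers.add_parser("stackexchange")
    sub.add_argument("subdomain")
    sub.set_defaults(func=stackexchange)

    args = parser.parse_args(argv)
    return args.func(args)


if __name__ == "__main__":
    print(main(["stackexchange", "stats.meta"]))
```

click adds help text, option parsing, and colored errors on top of this; the routing from `graphframes stackexchange <subdomain>` to the command function is the same idea.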
88 changes: 88 additions & 0 deletions python/graphframes/tutorials/download.py
@@ -0,0 +1,88 @@
#!/usr/bin/env python

"""Download and decompress the Stack Exchange data dump from the Internet Archive."""

import os

import click
import py7zr
import requests # type: ignore


@click.command()
@click.argument("subdomain")
@click.option(
"--data-dir",
default="python/graphframes/tutorials/data",
help="Directory to store downloaded files",
)
@click.option(
"--extract/--no-extract", default=True, help="Whether to extract the archive after download"
)
def stackexchange(subdomain: str, data_dir: str, extract: bool) -> None:
"""Download Stack Exchange archive for a given SUBDOMAIN.

Example: python/graphframes/tutorials/download.py stats.meta

Note: This won't work for stackoverflow.com archives due to size.
"""
# Create data directory if it doesn't exist
os.makedirs(data_dir, exist_ok=True)

# Construct archive URL and filename
archive_url = f"https://archive.org/download/stackexchange/{subdomain}.stackexchange.com.7z"
archive_path = os.path.join(data_dir, f"{subdomain}.stackexchange.com.7z")

click.echo(f"Downloading archive from {archive_url}")

try:
# Download the file with retries
max_retries = 3
retry_count = 0

while retry_count < max_retries:
try:
response = requests.get(archive_url, stream=True)
response.raise_for_status() # Raise exception for bad status codes
break
except (
requests.exceptions.RequestException,
requests.exceptions.ConnectionError,
requests.exceptions.HTTPError,
requests.exceptions.Timeout,
) as e:
retry_count += 1
if retry_count == max_retries:
click.echo(f"Failed to download after {max_retries} attempts: {e}", err=True)
raise click.Abort()
click.echo(f"Download attempt {retry_count} failed, retrying...")

total_size = int(response.headers.get("content-length", 0))

with click.progressbar(length=total_size, label="Downloading") as bar: # type: ignore
with open(archive_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
bar.update(len(chunk))

click.echo(f"Download complete: {archive_path}")

# Extract if requested
if extract:
click.echo("Extracting archive...")
output_dir = f"{subdomain}.stackexchange.com"
with py7zr.SevenZipFile(archive_path, mode="r") as z:
z.extractall(path=os.path.join(data_dir, output_dir))
click.echo(f"Extraction complete: {output_dir}")

except requests.exceptions.RequestException as e:
click.echo(f"Error downloading archive: {e}", err=True)
raise click.Abort()
except py7zr.Bad7zFile as e:
click.echo(f"Error extracting archive: {e}", err=True)
raise click.Abort()


if __name__ == "__main__":
stackexchange()
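The retry loop inside `stackexchange` can be factored into a small reusable helper. A minimal sketch, assuming any exception should trigger up to `max_retries` attempts; `flaky` is a stand-in for `requests.get`, not part of the tutorial code:

```python
import time


def with_retries(fn, max_retries=3, delay=0.0):
    """Call fn(), retrying on any exception up to max_retries times."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt} failed ({e}), retrying...")
            time.sleep(delay)


# Stand-in for requests.get: fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "200 OK"


if __name__ == "__main__":
    print(with_retries(flaky))  # succeeds on the third attempt
```

In the real command, `delay` could back off exponentially between attempts; the inline loop in `download.py` retries immediately.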
203 changes: 203 additions & 0 deletions python/graphframes/tutorials/motif.py
@@ -0,0 +1,203 @@
"""Demonstrate GraphFrames network motif finding capabilities. Code from the Network Motif Finding Tutorial."""

#
# Interactive Usage: pyspark --packages graphframes:graphframes:0.8.4-spark3.5-s_2.12
#
# Batch Usage: spark-submit --packages graphframes:graphframes:0.8.4-spark3.5-s_2.12 python/graphframes/tutorials/motif.py
#

import click
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession

from graphframes import GraphFrame

# Initialize a SparkSession
spark: SparkSession = SparkSession.builder.appName("Stack Overflow Motif Analysis").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints/motif")

# Change me if you download a different stackexchange site
STACKEXCHANGE_SITE = "stats.meta.stackexchange.com"
BASE_PATH = f"python/graphframes/tutorials/data/{STACKEXCHANGE_SITE}"


#
# Load the nodes and edges from disk, then repartition, checkpoint (to truncate the long query plan) and cache.
#

# We created these in stackexchange.py from Stack Exchange data dump XML files
NODES_PATH: str = f"{BASE_PATH}/Nodes.parquet"
nodes_df: DataFrame = spark.read.parquet(NODES_PATH)

# Repartition the nodes to give our motif searches parallelism
nodes_df = nodes_df.repartition(50).checkpoint().cache()

# We created these in stackexchange.py from Stack Exchange data dump XML files
EDGES_PATH: str = f"{BASE_PATH}/Edges.parquet"
edges_df: DataFrame = spark.read.parquet(EDGES_PATH)

# Repartition the edges to give our motif searches parallelism
edges_df = edges_df.repartition(50).checkpoint().cache()

# What kind of nodes do we have to work with?
node_counts = (
nodes_df.select("id", F.col("Type").alias("Node Type"))
.groupBy("Node Type")
.count()
.orderBy(F.col("count").desc())
# Add a comma formatted column for display
.withColumn("count", F.format_number(F.col("count"), 0))
)
node_counts.show()

# What kind of edges do we have to work with?
edge_counts = (
edges_df.select("src", "dst", F.col("relationship").alias("Edge Type"))
.groupBy("Edge Type")
.count()
.orderBy(F.col("count").desc())
# Add a comma formatted column for display
.withColumn("count", F.format_number(F.col("count"), 0))
)
edge_counts.show()
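The node and edge tallies above are plain group-by counts ordered descending. The same operation in stdlib Python over a toy relationship column (the values here are illustrative, not from the data dump):

```python
from collections import Counter

# Toy stand-in for edges_df's 'relationship' column
relationships = ["CastFor", "CastFor", "Links", "Asked", "CastFor"]

# groupBy("Edge Type").count().orderBy(count desc), in miniature
counts = Counter(relationships)
for edge_type, count in counts.most_common():
    print(f"{edge_type}: {count:,}")
```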

g = GraphFrame(nodes_df, edges_df)

g.vertices.show(10)
click.echo(f"Node columns: {g.vertices.columns}")

g.edges.sample(0.0001).show(10)

# Sanity test that all edges have valid ids
edge_count = g.edges.count()
valid_edge_count = (
g.edges.join(g.vertices, on=g.edges.src == g.vertices.id)
.select("src", "dst", "relationship")
.join(g.vertices, on=g.edges.dst == g.vertices.id)
.count()
)

# Just up and die if we have edges that point to non-existent nodes
assert (
edge_count == valid_edge_count
), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
click.echo(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}")
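The double join above checks one invariant: every edge's `src` and `dst` must resolve to a vertex id. The same check in plain Python over toy data (ids and relationship names here are made up for illustration):

```python
# Toy stand-ins for the vertices and edges DataFrames
node_ids = {"q1", "q2", "u1"}
edges = [("u1", "q1", "Asked"), ("q1", "q2", "Links")]

# An edge is valid only if both endpoints exist as nodes
valid = [e for e in edges if e[0] in node_ids and e[1] in node_ids]

assert len(edges) == len(valid), f"{len(edges)} != {len(valid)}"
print(f"Edge count: {len(edges)} == Valid edge count: {len(valid)}")
```

Edges that fail this check would silently vanish from motif matches, which is why the script dies loudly instead.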

# G4: Continuous Triangles
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)")

# Show the first path
paths.show(3)

graphlet_type_df = paths.select(
F.col("a.Type").alias("A_Type"),
F.col("e1.relationship").alias("(a)-[e1]->(b)"),
F.col("b.Type").alias("B_Type"),
F.col("e2.relationship").alias("(b)-[e2]->(c)"),
F.col("c.Type").alias("C_Type"),
F.col("e3.relationship").alias("(c)-[e3]->(a)"),
)

graphlet_count_df = (
graphlet_type_df.groupby(
"A_Type", "(a)-[e1]->(b)", "B_Type", "(b)-[e2]->(c)", "C_Type", "(c)-[e3]->(a)"
)
.count()
.orderBy(F.col("count").desc())
# Add a comma formatted column for display
.withColumn("count", F.format_number(F.col("count"), 0))
)
graphlet_count_df.show()
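The `g.find` pattern above declares structure rather than iteration. Its triangle semantics can be sketched in plain Python over a small edge list; the toy graph below is illustrative, not the Stack Exchange data, and like `find`, the sketch returns every rotation of a triangle as a separate binding:

```python
# Toy directed edge list standing in for g.edges (src, dst)
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c"), ("c", "d")]

# Adjacency set for O(1) edge lookups
adj = set(edges)
nodes = {n for e in edges for n in e}


def find_triangles(nodes, adj):
    """Enumerate (x, y, z) with x->y, y->z, z->x, as in
    g.find("(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)")."""
    return [
        (x, y, z)
        for x in nodes
        for y in nodes
        for z in nodes
        if (x, y) in adj and (y, z) in adj and (z, x) in adj
    ]


if __name__ == "__main__":
    print(find_triangles(nodes, adj))
```

GraphFrames runs the equivalent of this search as a series of distributed joins, which is why the repartitioning above matters for parallelism.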

# G5: Divergent Triangles
paths = g.find("(a)-[e1]->(b); (a)-[e2]->(c); (c)-[e3]->(b)")

graphlet_type_df = paths.select(
F.col("a.Type").alias("A_Type"),
F.col("e1.relationship").alias("(a)-[e1]->(b)"),
F.col("b.Type").alias("B_Type"),
F.col("e2.relationship").alias("(a)-[e2]->(c)"),
F.col("c.Type").alias("C_Type"),
F.col("e3.relationship").alias("(c)-[e3]->(b)"),
)

graphlet_count_df = (
graphlet_type_df.groupby(
"A_Type", "(a)-[e1]->(b)", "B_Type", "(a)-[e2]->(c)", "C_Type", "(c)-[e3]->(b)"
)
.count()
.orderBy(F.col("count").desc())
# Add a comma formatted column for display
.withColumn("count", F.format_number(F.col("count"), 0))
)
graphlet_count_df.show()

# G17: A directed 3-path is a surprisingly diverse graphlet
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c); (d)-[e3]->(c)")

# Visualize the directed 3-path by counting instances of paths by node / edge type
graphlet_type_df = paths.select(
F.col("a.Type").alias("A_Type"),
F.col("e1.relationship").alias("(a)-[e1]->(b)"),
F.col("b.Type").alias("B_Type"),
F.col("e2.relationship").alias("(b)-[e2]->(c)"),
F.col("c.Type").alias("C_Type"),
F.col("e3.relationship").alias("(d)-[e3]->(c)"),
F.col("d.Type").alias("D_Type"),
)
graphlet_count_df = (
graphlet_type_df.groupby(
"A_Type",
"(a)-[e1]->(b)",
"B_Type",
"(b)-[e2]->(c)",
"C_Type",
"(d)-[e3]->(c)",
"D_Type",
)
.count()
.orderBy(F.col("count").desc())
# Add a comma formatted column for display
.withColumn("count", F.format_number(F.col("count"), 0))
)
graphlet_count_df.show()

graphlet_count_df.orderBy(
[
"A_Type",
"(a)-[e1]->(b)",
"B_Type",
"(b)-[e2]->(c)",
"C_Type",
"(d)-[e3]->(c)",
"D_Type",
],
ascending=False,
).show(104)

# A vote cast for a question that links to another question that also received a vote.
linked_vote_paths = paths.filter(
(F.col("a.Type") == "Vote")
& (F.col("e1.relationship") == "CastFor")
& (F.col("b.Type") == "Question")
& (F.col("e2.relationship") == "Links")
& (F.col("c.Type") == "Question")
& (F.col("e3.relationship") == "CastFor")
& (F.col("d.Type") == "Vote")
)

# Sanity check the count - it should match the table above
click.echo(f"Linked vote paths: {linked_vote_paths.count():,}")

b_vote_counts = linked_vote_paths.select("a", "b").distinct().groupBy("b").count()
c_vote_counts = linked_vote_paths.select("c", "d").distinct().groupBy("c").count()

linked_vote_counts = (
linked_vote_paths.filter((F.col("a.VoteTypeId") == 2) & (F.col("d.VoteTypeId") == 2))
.select("b", "c")
.join(b_vote_counts, on="b", how="inner")
.withColumnRenamed("count", "b_count")
.join(c_vote_counts, on="c", how="inner")
.withColumnRenamed("count", "c_count")
)
click.echo(linked_vote_counts.stat.corr("b_count", "c_count"))
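`DataFrame.stat.corr` computes the Pearson correlation between the two count columns. For reference, the statistic itself, sketched in stdlib Python over toy `b_count` / `c_count` pairs (the data is illustrative):

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient, as computed by DataFrame.stat.corr."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Perfectly linear toy data gives a correlation of 1.0
    print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```

A value near 1.0 here would mean questions that link to each other tend to attract similar numbers of upvotes.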