Adding motif finding tutorial using the stats.meta.stackexchange.com data dump#473
|
Codecov Report
All modified and coverable lines are covered by tests ✅
@@ Coverage Diff @@
## master #473 +/- ##
==========================================
- Coverage 91.43% 91.20% -0.24%
==========================================
Files 18 18
Lines 829 864 +35
Branches 52 101 +49
==========================================
+ Hits 758 788 +30
- Misses 71 76 +5 |
…sts. Later will make these extras?
…xchange Data Dump from Internet Archive
…orial and two Databricks blog posts on GraphFrames.
|
@SauronShepherd @bjornjorgensen Can you please review this PR? My motif finding tutorial is finally ready :) I want to ship it and then cut a new release. It includes a new extended README and other improvements. Please forgive me for the size - it got out of hand - I'll create smaller PRs in the future. |
|
@rjurney Can we split this PR into a series of smaller PRs? At least separate the infrastructure part (CI, build, gitignore, etc.) from the tutorial itself? |
|
I agree on that, Sem.
I've only reviewed the first few commits, but I have some doubts:
- In many lines the only difference seems to be the line-ending character. Is that
OK?
13c5e74
- Why explicitly mention a specific IDE in the .gitignore? Maybe that's
something every developer should do on their own, according to their IDE.
.vscode
1319434
- Why exclude a data folder that maybe shouldn't be located inside the
project in the first place? python/graphframes/examples/data
Why download local test data inside the project?
8d84baa
These are not critical points; they just crossed my mind while having a
look at the PR.
On Sat, 8 Feb 2025 at 12:31, Sem ***@***.***>) wrote:
… @rjurney <https://github.com/rjurney> Can we split this PR to series of
smaller PRs? At least separate infrastructure part (CI, build, gitignore,
etc.) and tutorial itself?
|
|
Offhand I didn't know how to break it up, but I will figure it out and do so. |
The most important changes that block other PRs are related to the |
| FROM ubuntu:22.04 | ||
|
|
||
| ARG PYTHON_VERSION=3.8 | ||
| ARG PYTHON_VERSION=3.9 |
This is for a Dockerfile that we don't use here on GitHub.
Put this in its own PR,
e.g. "Update Dockerfile".
|
|
||
| ```bash | ||
| # Interactive Scala/Java | ||
| $ spark-shell --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12 |
graphframes:0.8.3 should be 0.8.4, I believe?
|
|
||
| ## GraphFrames Internals | ||
|
|
||
| To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and |
Add a note about the Google user group?
| This project is compatible with Spark 2.4+. However, significant speed improvements have been | ||
| made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest | ||
| Spark version. | ||
| This project is compatible with Spark 2.4+. However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version. |
Spark 3.4 or something.
| subprojects into the `docs` directory (and then also into the `_site` directory). We use a | ||
| jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it | ||
| may take some time as it generates all of the scaladoc. The jekyll plugin also generates the | ||
| When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the |
Put this, dev/release_guide.md, and docs/_config.yml in their own PR: "Update docs".
| .withVertexColumn( | ||
| "rank", | ||
| F.lit(1.0 / numVertices), | ||
| F.coalesce(Pregel.msg(), F.lit(0.0)) * F.lit(1.0 - alpha) |
This is not the same as before.
lit(0.0)) * lit(1.0 - alpha) + lit(alpha / numVertices)) seems to have been changed to F.lit(0.0)) * F.lit(1.0 - alpha)
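To illustrate why the missing term matters, here is a minimal plain-Python sketch of one PageRank update, with and without the `alpha / numVertices` term. It is not the GraphFrames Pregel code: the function `pagerank_step`, the three-vertex cycle, and the reading of `alpha` as the reset probability are all assumptions made for illustration.

```python
def pagerank_step(ranks, out_links, alpha, include_teleport=True):
    """One PageRank update on a small in-memory graph (illustrative only)."""
    n = len(ranks)
    # Sum the rank each vertex receives from its in-neighbours.
    incoming = {v: 0.0 for v in ranks}
    for src, dsts in out_links.items():
        share = ranks[src] / len(dsts)
        for dst in dsts:
            incoming[dst] += share
    # The term the review comment is asking about: alpha / numVertices.
    teleport = alpha / n if include_teleport else 0.0
    return {v: incoming[v] * (1.0 - alpha) + teleport for v in ranks}


# Hypothetical 3-vertex cycle: a -> b -> c -> a.
out_links = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / len(out_links) for v in out_links}

with_teleport = pagerank_step(ranks, out_links, alpha=0.15)
without_teleport = pagerank_step(ranks, out_links, alpha=0.15, include_teleport=False)

print(sum(with_teleport.values()))     # total rank is preserved
print(sum(without_teleport.values()))  # total rank shrinks by a factor of (1 - alpha)
```

In this sketch, dropping the term makes the total rank decay on every iteration, so the two expressions quoted in the comment would not be interchangeable.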
| # Collect and sort results. | ||
| resultRows = ranks.sort(ranks.id).collect() | ||
| result = map(lambda x: x.rank, resultRows) | ||
| result = list(map(lambda x: x.rank, resultRows)) |
I don't think you need a list here when you are using a zip 3 lines down.
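As a small sketch of the point being discussed: in Python 3, `map` returns a lazy iterator, and `zip` accepts any iterable, so the `list(...)` wrapper only matters if the values are reused or indexed later. The `Row` namedtuple below is a hypothetical stand-in for a PySpark row, and the numbers are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, for illustration only.
Row = namedtuple("Row", ["id", "rank"])
result_rows = [Row(0, 0.25), Row(1, 0.75)]
expected = [0.25, 0.75]

# map() is lazy in Python 3; zip() consumes it directly without a list().
lazy = map(lambda r: r.rank, result_rows)
pairs = list(zip(lazy, expected))

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(lazy)
```

The one-pass behaviour is the trade-off: wrapping in `list` is harmless and only becomes necessary if the result is iterated more than once.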
| # Compare each result with its expected value using a tolerance of 1e-3. | ||
| for a, b in zip(result, expected): | ||
| self.assertAlmostEqual(a, b, delta = 1e-3) | ||
| assert a == pytest.approx(b, abs=1e-3) |
What happens with delta?
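On the `delta` question, a minimal standard-library sketch: `assertAlmostEqual(a, b, delta=d)` passes when `abs(a - b) <= d`. According to the pytest documentation, `pytest.approx(b, abs=d)` with only `abs` given applies the same absolute-tolerance rule. The helper `within_abs` and the sample values are assumptions for illustration; pytest itself is not imported here.

```python
import unittest


def within_abs(a, b, tol):
    """Absolute-tolerance check, the rule that delta= expresses in unittest."""
    return abs(a - b) <= tol


case = unittest.TestCase()

# Difference of 0.0001 is inside the 1e-3 tolerance, so this passes.
case.assertAlmostEqual(0.3334, 0.3333, delta=1e-3)

# Difference of 0.0017 is outside the tolerance, so this raises.
delta_rejects = False
try:
    case.assertAlmostEqual(0.335, 0.3333, delta=1e-3)
except AssertionError:
    delta_rejects = True
```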
| assert len(all1) == 1 | ||
| labels2 = labels.filter("id >= 5").select("label").collect() | ||
| all2 = set([x.label for x in labels2]) | ||
| all2 = {row.label for row in labels2} |
What is this?
Changing a set to a dict?
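For reference, a short sketch of the syntax in question: braces around a comprehension with no `key: value` pair build a set, not a dict. The label values are hypothetical.

```python
labels = ["a", "a", "b"]  # hypothetical label values

from_list = set([x for x in labels])      # original style: list, then set()
from_comprehension = {x for x in labels}  # set comprehension, same result
as_dict = {x: True for x in labels}       # a dict needs "key: value"

print(type(from_comprehension).__name__)
print(type(as_dict).__name__)
```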
| # Create bidirectional edges. | ||
| all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]] | ||
| edges = self.spark.createDataFrame(all_edges, ["src", "dst"]) | ||
| edgesDF = self.spark.createDataFrame(all_edges, ["src", "dst"]) |
No.
edges is another DataFrame.
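A plain-Python sketch of the comprehension in this hunk, leaving Spark out: it emits each pair followed by its reverse. The input pairs are hypothetical.

```python
edges = [(0, 1), (1, 2)]  # hypothetical directed edge pairs

# For every (a, b), emit both (a, b) and (b, a).
all_edges = [z for (a, b) in edges for z in [(a, b), (b, a)]]

print(all_edges)
```

Note that `edges` still refers to the original list of pairs here; whether the DataFrame built from `all_edges` should reuse that name is the naming question raised in the comment.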
|
@SemyonSinchenko @bjornjorgensen thank you guys VERY much for these reviews! Would you recommend I split it up before addressing the issues, or address the issues before splitting it up into multiple PRs? |
I would recommend leaving it as is for now and opening a series of small PRs related to CI, pytest, build, etc. |
|
Okay guys, diving into splitting this PR up... |
…ney/motif-tutorial
|
@SauronShepherd @SemyonSinchenko @bjornjorgensen please have a look at #511 - the actual documentation portion of the PR. I will do a second and third one now for the docs code and build improvement stuff. |
|
@SauronShepherd @SemyonSinchenko @bjornjorgensen @WeichenXu123 okay also created #512 and #513. I want to try to merge these and ship a new release this coming week, in advance of the GraphFrames Hackathon. |
This PR makes the following additions to create a tutorial on motif finding using the stats.meta.stackexchange.com data dump at the Internet Archive. Teaching the concepts behind this powerful tool will drive increased adoption of GraphFrames.
- docs/motif-tutorial.md
- python/graphframes/examples: download.py, xml_to_parquet.py, graph.py and motif.py
- python/graphframes/examples/data
This code was originally written by me under the MIT License for a class at Connected Data London 2024 called Full Stack Graph Machine Learning. It can be found at https://github.com/Graphlet-AI/graphml-class/tree/main/graphml_class/stats