
feat: Spark 4.0.x support #603

Closed

SemyonSinchenko wants to merge 11 commits into graphframes:spark-4x from SemyonSinchenko:spark-4

Conversation

@SemyonSinchenko
Collaborator

@SemyonSinchenko SemyonSinchenko commented Jun 11, 2025

What changes were proposed in this pull request?

Everything related to Spark 4.0.x support.

Why are the changes needed?

Close #576

@SemyonSinchenko SemyonSinchenko mentioned this pull request Jun 11, 2025
    new Column(expr.expr.transform { case UnresolvedAttribute(nameParts) =>
      UnresolvedAttribute(colName +: nameParts)
    })

  private def applyExprToCol(c: Column, colName: String, fieldNames: Seq[String]): Column = {
Contributor

I spent some time trying to get the 4.0 support working as well; I think there will need to be major version-specific shims for this (and a few other Connect-related helper functions). What I got working for Spark 4 is:

def applyExprToCol(spark: SparkSession, expr: Column, colName: String): Column = {
  val converted = spark.asInstanceOf[ClassicSparkSession].converter(expr.node)
  ExpressionUtils.column(converted.transform { case UnresolvedAttribute(nameParts) =>
    UnresolvedAttribute(colName +: nameParts)
  })
}

The only downside is that it's not automatically compatible with Connect as well, but I'm not sure how else to go about it. It would be nice if ColumnNode had the same tree-transform helpers; then it could be Connect-compatible.

Collaborator Author

Don't worry about Connect! It will work via the Plugin anyway, so we do not even need to care about this. And thanks for the snippet!!!

Collaborator Author

I just found org.apache.spark.sql.classic.RichColumn that provides expr: Expression, so maybe there is no need to add a shim...

Contributor

Well, the shim would also be for maintaining support for Spark 3.

@SemyonSinchenko
Collaborator Author

I'm really surprised that it works!

@SemyonSinchenko SemyonSinchenko changed the title [DO-NOT-MERGE] feat: Spark 4.0.x support feat: Spark 4.0.x support Jun 12, 2025
@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review June 12, 2025 07:40
@SemyonSinchenko
Collaborator Author

@SauronShepherd @james-willis @rjurney Hi! Could you take a look please?

Note: the target branch is not main but spark-4x

@SemyonSinchenko SemyonSinchenko requested a review from rjurney June 12, 2025 07:43
@SemyonSinchenko SemyonSinchenko linked an issue Jun 12, 2025 that may be closed by this pull request
@SemyonSinchenko SemyonSinchenko added the pyspark-classic (GraphFrames on PySpark Classic) and pyspark-connect (GraphFrames on PySpark Connect) labels Jun 12, 2025
@james-willis
Collaborator

I am a little worried that we are not targeting a single branch that supports both Spark 3 and 4.

We will probably want to support Spark 3 for some time, and this might increase the maintenance burden of bringing patches and features to both Spark 3 and 4.

In sedona we have directories that have copies of the same code for different versions of spark: https://github.com/apache/sedona/tree/master/spark

not sure if that pattern would succeed here.
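For context, the Sedona-style layout mentioned above could be wired into an sbt build roughly like this. This is only an illustrative sketch: the `sparkVersion` setting, the system property, and the directory names are assumptions, not taken from this repository's build.

```scala
// build.sbt fragment (illustrative sketch): compile a version-specific
// source directory alongside the shared sources, Sedona-style.
val sparkVersion = settingKey[String]("Spark version to build against")

sparkVersion := sys.props.getOrElse("spark.version", "3.5.1")

Compile / unmanagedSourceDirectories += {
  // e.g. src/main/spark-3 or src/main/spark-4, each holding the
  // per-version copies (or shims) of the incompatible code.
  val dir = if (sparkVersion.value.startsWith("4")) "spark-4" else "spark-3"
  baseDirectory.value / "src" / "main" / dir
}
```

With such a setup, only the code that actually differs between Spark majors lives in the per-version directories; everything else stays shared.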

@SemyonSinchenko
Collaborator Author

> I am a little worried that we are not targeting a single branch that supports both Spark 3 and 4.
>
> We will probably want to support Spark 3 for some time, and this might increase the maintenance burden of bringing patches and features to both Spark 3 and 4.
>
> In sedona we have directories that have copies of the same code for different versions of spark: https://github.com/apache/sedona/tree/master/spark
>
> not sure if that pattern would succeed here.

That is the big question. In GraphFrames there are:

  • core
  • spark connect
  • pyspark classic
  • pyspark connect

In the near future I'm also going to add property graph support as a subproject, and I hope to have an I/O subproject too (for import/export to and from GraphAr, Neo4j, Nebula, etc.).

And all of these projects would need to be copied to support both 3.5.x and 4.0.x from the same branch. That is why I'm thinking about relying on branches instead of relying on multiple subprojects for different versions of Spark.

I do not think that shims and/or reflection magic will work here either, because from 3.5.x to 4.0.x there are breaking changes not only in access modifiers and methods...

Overall: I'm a terrible release manager, I do not like CI/CD or build-system topics, and I'm not very experienced in best practices for supporting multiple versions. So if someone experienced can draw a diagram or something like that about what the best way would be, that would be nice!

@Kimahriman
Contributor

I have a mostly working version of using shims to support 3.x and 4.x; I can try to clean it up and make a PR next week.
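The shim idea being discussed can be illustrated with a toy sketch: shared code programs against a small trait, and each Spark-version-specific source directory supplies its own implementation. All names here (ExprShim, Spark3ExprShim, prefixNameParts) are hypothetical, not from this PR or #608.

```scala
// Toy illustration of the shim pattern (names are hypothetical).
trait ExprShim {
  // In this sketch, the only version-specific operation: rewriting the
  // name parts of an unresolved attribute under a new column prefix.
  def prefixNameParts(colName: String, nameParts: Seq[String]): Seq[String]
}

// What a Spark 3 source directory might provide; a Spark 4 directory would
// ship an object with the same name but the 4.x-specific Column plumbing.
object Spark3ExprShim extends ExprShim {
  def prefixNameParts(colName: String, nameParts: Seq[String]): Seq[String] =
    colName +: nameParts
}
```

Shared code would then depend only on the trait, so only the thin per-version objects need to exist in both source trees.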

@SauronShepherd
Contributor

Is the goal just to support Spark 4, or actually to take advantage of and leverage some of this new version's features? I have some ideas in mind that may improve some algorithms a lot (not sure about that, but I think it's worth trying).

Sorry for being away for so long, but I’m back and quite interested in this ticket. I can start helping out next week.

@Kimahriman
Contributor

Made the PR with supporting multiple versions: #608

@SemyonSinchenko
Collaborator Author

Closed in favor of #608

@SemyonSinchenko SemyonSinchenko deleted the spark-4 branch July 19, 2025 11:56
Labels

pyspark-classic (GraphFrames on PySpark Classic), pyspark-connect (GraphFrames on PySpark Connect), scala

Development

Successfully merging this pull request may close these issues.

[EPIC] Spark 4.0 support

4 participants