feat: Spark 4.0.x support #603
Conversation
new Column(expr.expr.transform { case UnresolvedAttribute(nameParts) =>
  UnresolvedAttribute(colName +: nameParts)
})
private def applyExprToCol(c: Column, colName: String, fieldNames: Seq[String]): Column = {
I spent some time trying to get the 4.0 support working as well; I think there will need to be major-version-specific shims for this (and a few other Connect-related helper functions). What I got working for Spark 4 is:
// Assumed imports (Spark 4.0): org.apache.spark.sql.{Column, SparkSession},
// org.apache.spark.sql.classic.{SparkSession => ClassicSparkSession},
// org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute,
// org.apache.spark.sql.internal.ExpressionUtils
def applyExprToCol(spark: SparkSession, expr: Column, colName: String): Column = {
  val converted = spark.asInstanceOf[ClassicSparkSession].converter(expr.node)
  ExpressionUtils.column(converted.transform { case UnresolvedAttribute(nameParts) =>
    UnresolvedAttribute(colName +: nameParts)
  })
}

The only downside is that it's not automatically compatible with Connect as well, but I'm not sure how else to go about it. It'd maybe be nice if ColumnNode had the same tree transform helpers; then it could be Connect-compatible.
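For context, a minimal usage sketch of that helper (the DataFrame df and its src struct column are hypothetical, not from this PR):

import org.apache.spark.sql.functions.{col, lit}

// The transform rewrites the unqualified reference "age" into "src.age",
// so a user-supplied predicate resolves against the struct column rather
// than the top-level schema.
val scoped = applyExprToCol(spark, col("age") > lit(30), "src")
val filtered = df.filter(scoped) // behaves like df.filter(col("src.age") > lit(30))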
Don't worry about Connect! It will work via the Plugin anyway, so we do not even need to care about this. And thanks for the snippet!!!
I just found org.apache.spark.sql.classic.RichColumn, which provides expr: Expression, so maybe there is no need to add a shim...
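A sketch of what that could look like (assuming RichColumn is exposed through the ClassicConversions implicits in Spark 4.0; the exact import location is an assumption):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.classic.ClassicConversions._ // assumed: brings RichColumn into scope
import org.apache.spark.sql.internal.ExpressionUtils

// With the implicit expr: Expression back on Column, the Spark 3 body
// carries over almost unchanged, and no SparkSession parameter is needed.
def applyExprToCol(expr: Column, colName: String): Column =
  ExpressionUtils.column(expr.expr.transform { case UnresolvedAttribute(nameParts) =>
    UnresolvedAttribute(colName +: nameParts)
  })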
Well, the shim would also be there to maintain support for Spark 3.
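For illustration, a minimal sketch of that shim pattern (the module layout and the Shims object name are hypothetical): each Spark-major source tree compiles its own body behind a common signature, and the shared code calls only the shim.

// src/spark3/scala/org/graphframes/Shims.scala (hypothetical layout)
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

object Shims {
  // Spark 3: Column still exposes its Catalyst expression directly.
  def applyExprToCol(spark: SparkSession, expr: Column, colName: String): Column =
    new Column(expr.expr.transform { case UnresolvedAttribute(nameParts) =>
      UnresolvedAttribute(colName +: nameParts)
    })
}
// The src/spark4 twin would carry the converter-based version from the snippet above.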
I'm really surprised that it works!
@SauronShepherd @james-willis @rjurney Hi! Could you take a look, please? Note: the target branch is not …
I am a little worried that we are not targeting a single branch that supports both Spark 3 and 4. We will probably want to support Spark 3 for some time, and this might increase the maintenance burden of bringing patches and features to both Spark 3 and 4. In Sedona we have directories that contain copies of the same code for different versions of Spark: https://github.com/apache/sedona/tree/master/spark. Not sure if that pattern would succeed here.
That is the big question. In GraphFrames there are already multiple subprojects. In the near future I'm going to add property graphs support as a subproject, and I hope to have an i/o subproject too (for the support of import/export to/from GraphAr, Neo4j, Nebula, etc.). All of these projects would need to be copied to support both 3.5.x and 4.0.x from the same branch. That is why I'm thinking about relying on branches instead of relying on multiple subprojects for different versions of Spark. I also do not think that shims and/or reflection magic will work here, just because from 3.5.x to 4.0.x there are breaking changes not only in access modifiers and methods...

Overall: I'm a terrible release manager, I do not like CI/CD or build-system topics, I'm not very experienced in best practices for supporting multiple versions, etc. So if someone experienced can draw a diagram or something like that about what would be the best way, that would be nice!
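For what it's worth, the Sedona-style directory-per-version layout can be wired up in sbt roughly like this (a sketch; the sparkVersion setting, the default version, and the directory names are assumptions):

// build.sbt (sketch)
val sparkVersion = settingKey[String]("Spark version to build against")
sparkVersion := sys.props.getOrElse("spark.version", "3.5.5") // default is an assumption

// Compile src/main/scala plus exactly one per-major-version tree,
// e.g. src/main/scala-spark-3 or src/main/scala-spark-4.
Compile / unmanagedSourceDirectories +=
  (Compile / sourceDirectory).value / s"scala-spark-${sparkVersion.value.split('.').head}"

Code shared between versions stays in src/main/scala; only the version-specific shims are duplicated across the per-version trees.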
I have a mostly working version using shims to support 3.x and 4.x. I can try to clean it up and make a PR next week.
Is the goal just to support Spark 4, or actually to take advantage of and leverage some of the new features of this new version? I have some ideas in mind that may improve some algorithms a lot (not sure about that, but I think it's worth trying). Sorry for being away for so long, but I'm back and quite interested in this ticket. I can start helping out next week.
Made the PR supporting multiple versions: #608
Closed in favor of #608
What changes were proposed in this pull request?
Everything related to Spark 4.0.x support.
Why are the changes needed?
Closes #576