Description
Vector UDF Incompatibility in Databricks Environment
Background
When parsing the binary stream, the dotnet.spark worker expects a single large Arrow batch of data.
However, on Databricks Runtime 14.3 (and, as I understand it, on any Linux environment) the behavior differs: the driver divides the dataset into multiple batches of 10k rows each.
This batched format uses a different binary representation (it carries additional arguments).
Because the binary representations differ, the worker misinterprets the incoming byte stream and reads incorrect data, which results in a hang with no response.
Reproduction
Create a Spark pipeline that:
- Ingests a moderate-sized dataset.
- Performs a GroupBy (ensure groups contain >10k rows).
- Applies a vector UDF via GroupBy().Apply(...), operating on the grouped RecordBatch.
Actual results
The UDF execution never completes. The .NET Spark worker attempts to read more bytes than are available, blocking indefinitely while waiting for data that is never sent.
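The failure mode can be illustrated with a toy framing mismatch. This is a minimal Python sketch, not the actual Spark or dotnet.spark wire format: the frame layout, `write_frame`, and `read_frame_legacy` are all hypothetical. It only shows how a reader that does not know about extra header fields can take one of them for a length and then ask for more bytes than the sender ever wrote.

```python
import io
import struct


def write_frame(payload: bytes, extra_args: tuple = ()) -> bytes:
    """Hypothetical writer: optional int32 extras, then an int32 length, then the payload."""
    header = b"".join(struct.pack(">i", arg) for arg in extra_args)
    return header + struct.pack(">i", len(payload)) + payload


def read_frame_legacy(stream) -> tuple:
    """Hypothetical reader that assumes the stream starts with the length field.

    Returns (bytes requested, bytes actually received).
    """
    (length,) = struct.unpack(">i", stream.read(4))
    data = stream.read(length)
    return length, len(data)


payload = b"x" * 16

# Old layout: the reader asks for 16 bytes and gets 16 bytes.
ok = read_frame_legacy(io.BytesIO(write_frame(payload)))

# New layout with one extra argument: the reader mistakes it for the length.
bad = read_frame_legacy(io.BytesIO(write_frame(payload, extra_args=(10_000,))))

print(ok)
print(bad)
```

With an in-memory buffer the second read simply comes back short. On a blocking socket the same reader would keep waiting for the missing bytes, which matches the hang described above.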
Expected result
UDF should execute and return successfully.
TBD: Either a single consolidated RecordBatch or an IEnumerable of RecordBatch instances should be passed into the UDF.
Investigation Summary
Two issues prevent vector UDFs from functioning correctly in the Databricks environment.
1. Vector UDFs operate on a collection of RecordBatch instances, not a single batch
The spark.sql.execution.arrow.maxRecordsPerBatch setting (introduced in Spark 2.3.0, default: 10,000) is central to this behavior. Batch-based processing is advantageous for .NET for Spark, as it avoids the 2GB Arrow buffer limit encountered with GroupBy().Apply(...).
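To make the effect of the setting concrete, here is a small arithmetic sketch in Python. It assumes a group is simply cut into consecutive chunks of at most maxRecordsPerBatch rows. `batch_sizes` is a hypothetical helper, and the exact slicing Spark performs may differ in detail.

```python
from typing import List


def batch_sizes(group_rows: int, max_records_per_batch: int = 10_000) -> List[int]:
    """Row counts of the batches one group would be split into (assumed chunking)."""
    if group_rows < 0:
        raise ValueError("group_rows must be non-negative")
    if group_rows == 0:
        return []
    # Spark documents zero or a negative value as "no limit": one batch per group.
    if max_records_per_batch <= 0:
        return [group_rows]
    full, rest = divmod(group_rows, max_records_per_batch)
    return [max_records_per_batch] * full + ([rest] if rest else [])


print(batch_sizes(8_000))   # a group below the limit still fits in one batch
print(batch_sizes(25_000))  # a larger group arrives as several batches
```

Any group above the limit therefore reaches the worker as more than one batch, which is the case the reproduction steps trigger with groups of more than 10k rows.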
I implemented a PoC to validate performance for a specific use case. It successfully resolves the issue with the following changes:
- The vector UDF now accepts an IEnumerable of RecordBatch instances instead of a single RecordBatch.
- Additional parameters are passed but intentionally ignored. In the Python implementation, these enforce batch ordering; in .NET, however, enforcing the order would mean collecting all batches before reordering them and sending them to the UDF, which is memory-intensive. In the observed cases, batches arrived sequentially, so this functionality is omitted from the PoC.
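The shape of such a UDF can be sketched as follows. This is a Python stand-in, not the PoC code: `Batch` is a plain dictionary standing in for an Arrow RecordBatch, and `group_mean_udf` and `incoming_batches` are hypothetical names. The point is that a UDF taking an iterable can consume batches one at a time in arrival order, without first collecting the whole group in memory.

```python
import math
from typing import Dict, Iterable, Iterator

# Stand-in for an Arrow RecordBatch: column name -> list of values.
Batch = Dict[str, list]


def group_mean_udf(batches: Iterable[Batch]) -> Batch:
    """Aggregate one group incrementally, holding only running totals."""
    total = 0.0
    count = 0
    for batch in batches:  # batches are consumed in arrival order
        values = batch["value"]
        total += sum(values)
        count += len(values)
    mean = total / count if count else math.nan
    return {"count": [count], "mean": [mean]}


def incoming_batches() -> Iterator[Batch]:
    """Simulates one group arriving as two sequential batches."""
    yield {"value": [1.0, 2.0, 3.0]}
    yield {"value": [4.0, 5.0]}


result = group_mean_udf(incoming_batches())
print(result)
```

A UDF that needs the whole group at once can still materialize the iterable itself, so the iterable form does not rule out the single-batch style; it only stops forcing it.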
2. In DBR 15.3+, all UDFs fail to execute
This issue arises from a change in Databricks’ fork of Spark. The CreatePythonFunction API introduces an additional optional parameter that is not accounted for in the current implementation. As a result, UDF execution fails with the following error:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null])
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
References:
- PoC Commit, see SqlCommandExecutor::ExecuteArrowGroupedMapCommand
Notes:
- CoGroupedMap UDFs are unaffected as they do not leverage batching (as of Spark 3.5).
- Databricks Runtime 15.3 and 16.3 do not work with dotnet.spark at all because of the second issue, so no testing was performed on them. Details are in the linked PR.
- UseArrow can be disabled to enable the backward-compatible format, which should work with the current dotnet.spark implementation.