Replies: 6 comments 10 replies
-
What you describe here are so-called scalar UDFs. They are called "scalar" because they, at least conceptually, apply to a single row of data in a single call and also return a single scalar value as a result. We have already thought about scalar Python UDFs at length. Implementing them is not immediately straightforward due to two main issues:
There are some ideas for how we could implement Python UDFs 1 and 2, but they each come with their own drawbacks.
There is some good news: we have already implemented table-valued UDFs for Python. These functions get a chunk of the input table passed as a Pandas data frame, can manipulate it in whatever way they want, and are required to return a data frame again. Again, the UDF only sees a chunk of the input at a time. Here is an example:

```python
import duckdb

conn = duckdb.connect()
query_result = conn.query("SELECT * FROM generate_series(10)")
print(query_result)
map_result = query_result.map(lambda df: df['generate_series'].add(42).to_frame())
print(map_result)
```

Here, the table-valued UDF is the lambda in the call to `map`.
-
@hannes I think you're overthinking scalar UDFs by trying to make them performant. I would also really like scalar UDFs in DuckDB, and I don't care about performance; I just need them to work. So I hope it wouldn't be difficult to implement something simple to use and register. My use case is mainly using DuckDB for CI / unit testing / compatibility. I think scalar UDFs would be valuable regardless of performance. Having them available but slow would be a great value add, imo.
-
Hey @hannes, is there any update on this feature request?
-
I've got a branch here that can do this:

```python
import duckdb
from duckdb.typing import *

con = duckdb.connect()

def table_func(df):
    print(df)
    return df

con.register_table_function('no_op_tablefunc', table_func)
rel2 = con.sql('select * from no_op_tablefunc((from range(10)))')
print(rel2)
```

This just utilizes the same code as `map`.
-
Scalar UDFs have been implemented in #7171
-
pohuan#3
-
Currently, you can create UDFs in Python and use them via `duckdb.from_df(df).map(my_udf)`. I would like to be able to register my UDF with DuckDB such that I can call it from within SQL statements, similar to how you can register Pandas DataFrames, e.g. `con.register('my_table', df)`. This is also similar to how you can register custom Python UDFs for use in PySpark. I propose something like:

The motivation for this is that I am trying to use splink for fuzzy deduplication, and in that framework you specify comparison functions using SQL statements. I want to be able to compare using

`is_nickname_of(first_name_l, first_name_r) OR is_nickname_of(first_name_r, first_name_l)`

I can do this currently using splink's PySpark backend, but it would be great to be able to do it with DuckDB as well, so I could swap between backends for exploration (DuckDB) and full runs (PySpark).