Add DataFrame.persist, and notes on execution model #307
Would you mind elaborating a bit more on what the benefit of `maybe_execute` would be?

Sure - anything related to automatic execution is going to result in people accidentally double-computing things. Quick example:

```python
df: DataFrame
features = df.drop_columns('target').to_array()  # in Polars, triggers the whole DAG behind `df`
target = df.col('target').to_array()  # in Polars, also triggers the whole DAG behind `df`
```

as opposed to

```python
df: DataFrame
df = df.maybe_execute()
features = df.drop_columns('target').to_array()
target = df.col('target').to_array()
```

In the first example, Polars would trigger the whole DAG behind `df` twice. Note that for your library, you'd be free to ignore `maybe_execute`.
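The double-computation footgun can be sketched with a toy deferred pipeline (hypothetical names, not the real Polars or standard API) that counts how many times its DAG is materialised, with `maybe_execute` modelled as an execute-once hint:

```python
class ToyLazyFrame:
    """Toy stand-in for a lazy dataframe; counts DAG executions."""

    def __init__(self, data, counter):
        self._data = data
        self._counter = counter  # shared mutable list: [n_executions]

    def drop_columns(self, name):
        return type(self)(
            {k: v for k, v in self._data.items() if k != name}, self._counter
        )

    def col(self, name):
        return type(self)({name: self._data[name]}, self._counter)

    def to_array(self):
        self._counter[0] += 1  # each materialisation re-runs the whole DAG
        return [list(v) for v in self._data.values()]

    def maybe_execute(self):
        self._counter[0] += 1  # execute the DAG once, up front
        return ToyEagerFrame(self._data, self._counter)


class ToyEagerFrame(ToyLazyFrame):
    def to_array(self):
        # already computed: no re-execution
        return [list(v) for v in self._data.values()]


lazy_counter = [0]
df = ToyLazyFrame({'target': [1, 0], 'x': [0.5, 0.7]}, lazy_counter)
df.drop_columns('target').to_array()
df.col('target').to_array()
print(lazy_counter[0])  # 2: the DAG ran twice

hinted_counter = [0]
df2 = ToyLazyFrame({'target': [1, 0], 'x': [0.5, 0.7]}, hinted_counter).maybe_execute()
df2.drop_columns('target').to_array()
df2.col('target').to_array()
print(hinted_counter[0])  # 1: computed once, at the hint
```

A real implementation would of course cache the materialised data rather than a counter; the counter is just there to make the two execution patterns visible.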
I'm afraid I still don't quite see it, sorry! How would you implement `maybe_execute`? Would there be some cache associated with it?
Thanks for asking, this will help clarify things. So, let's consider the following cases. Case 1: lazy dataframe, with lazy array counterpart, which requires computation for …
Thanks for the further explanations, even though I must say that I am still a bit confused about the semantics of `maybe_execute`. This example benefits from a collection taking place immediately when `maybe_execute` is called. However, the latter example appears to revolve around the idea of allowing a deferred collection when `__bool__` is called.

A fully lazy implementation would need to error out when calling `__bool__` regardless of whether a user called `maybe_execute` or not. A lazy/eager hybrid such as a Polars-backed implementation certainly benefits from explicit "collection hints" if they are computed explicitly where invoked.

This makes me wonder about the benefit of also using `maybe_execute` to allow subsequent collections, compared to simply always allowing those by default.
> To be guaranteed to run across all implementations, :meth:`maybe_execute` should be executed at some point before calling this method.
Does this mean that all operations are potentially (e.g. in a Polars-based implementation) eager after a call to `maybe_execute`?
that's right, though `Column` would still be backed by an expression (which is lazy), but the parent dataframe would be eager. you can try this out with https://github.com/data-apis/dataframe-api-compat
> This method may force execution. If necessary, it should be called at most once per dataframe, and as late as possible in the pipeline.
Why "at most once" rather than "as few times as possible"?
if you're using it multiple times, then you're potentially re-executing things
Sure, but there are reasonable cases where that would be what you want, are there not? For example, you may want to collect a dataframe, filter it further in a lazy manner, and then collect it again.
sure but why would you collect it before filtering?
It is a bit of a constructed example, but maybe you want to do computations on the entire dataframe and also on some subset of it. It would make sense to collect just prior to the first computation on the entire frame so that whatever came before it doesn't have to be recomputed when doing the computation on the subset.
something like

```python
df: DataFrame
df = df.persist()
sub_df_1 = df.filter(df.col('a') > 0)
sub_df_2 = df.filter(df.col('a') <= 0)

features_1 = []
for column_name in sub_df_1.column_names:
    if sub_df_1.col(column_name).std() > 0:
        features_1.append(column_name)

features_2 = []
for column_name in sub_df_2.column_names:
    if sub_df_2.col(column_name).std() > 0:
        features_2.append(column_name)
```

?
You'd still just be calling it once per dataframe - could you show an example of where you'd want to call it twice for the same dataframe?
> The Dataframe API has a `DataFrame.maybe_evaluate` for addressing the above. We can use it to rewrite the code above as follows:
>
> ```python
> df: DataFrame
> df = df.maybe_execute()
> features = []
> for column_name in df.column_names:
>     if df.col(column_name).std() > 0:
>         features.append(column_name)
> return features
> ```
>
> Note that `maybe_evaluate` is to be interpreted as a hint, rather than as a directive - the implementation itself may decide whether to force execution at this step, or whether to defer it to later. For example, a dataframe which can convert to a lazy array could decide to ignore `maybe_evaluate` when evaluating `DataFrame.to_array` but to respect it when evaluating `float(Column.std())`.
So, `maybe_evaluate` may do:

- Nothing at all
- Nothing at this point, but allow later collections when the backend thinks it is expedient
- An immediate collection

What happens to subsequent calls to `df`, assuming that a collection did take place? Are they eager or lazy?

```python
df: DataFrame
column_name: str
df = df.maybe_execute()
col = df.col(column_name)
filtered_col = col.filter(col > 42)  # is this computation eager now?
filtered_col.std()
```
> is this computation eager now?

It's implementation-dependent. It can stay lazy.

What really matters is when you do `bool(filtered_col.std())` (which you might trigger via `if filtered_col.std() > 0`) - at that point:

- if `maybe_execute` wasn't called previously, this is unsupported by the standard and may vary across implementations
- if `maybe_execute` was called, then libraries supporting eager evaluation should return a result
When would you want the first option rather than an implicit default for the latter behavior? It seems rather obvious that `bool(filtered_col.std())` needs to materialize something, so it is hardly a surprise to the user at this point. Sure, the user may want to strategically place a `maybe_execute` earlier for performance reasons, but why introduce undefined behavior if they don't?
you don't need to introduce undefined behaviour, I just mean that it's undefined by the Dataframe API - the Standard makes no guarantee of what will happen there
>     features.append(column_name)
> return features
>
> as that will potentially re-trigger the same execution multiple times.
But there are no guarantees here, are there? Given that `maybe_execute` still allows for deferred execution, the backend may still re-trigger the same execution multiple times.
yes you're right, that's an issue with the suggestion I put in #307 (comment)
not sure what to suggest, will think about this
Looks like there's two cases that really need addressing:

- `bool(scalar)` requires computation in all cases
- `to_array` only requires computation in some cases
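The distinction between the two cases can be sketched with toy classes (all names here are illustrative, not part of the standard): `bool()` has no choice but to materialise, because Python demands a concrete `True`/`False`, whereas `to_array` can hand back something deferred when the consumer accepts it.

```python
class LazyColumn:
    def __init__(self, compute):
        self._compute = compute  # zero-arg callable producing the values

    def to_array(self, allow_lazy=True):
        if allow_lazy:
            return self._compute   # hand back the deferred computation itself
        return list(self._compute())  # some consumers need concrete values

    def any(self):
        return LazyScalar(lambda: any(self._compute()))


class LazyScalar:
    def __init__(self, compute):
        self._compute = compute

    def __bool__(self):
        # no way around it: Python's bool() needs a concrete True/False
        return bool(self._compute())


col = LazyColumn(lambda: [0, 0, 3])
deferred = col.to_array()                   # stays lazy: nothing computed yet
concrete = col.to_array(allow_lazy=False)   # forced materialisation
print(callable(deferred), concrete)         # True [0, 0, 3]
print(bool(col.any()))                      # True - bool() forces computation
```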
Thanks for your comments. I've excluded `Column.to_array` for now.
If we look at lazy execution frameworks (ignoring Ibis, since it defers to different backends for actual implementation / execution), they all have a method like `persist`. What would your thoughts be if we added a method like that?
Initial thought - sounds good! Will try it out in my implementation / skrub and see where this takes us.
Just tried, and no. Having said that, maybe…
Having said that, as far as I can tell,

```python
df: dask.DataFrame
df = df.persist()
```

is very roughly equivalent (ok, not exactly, as Dask would write to the cluster rather than to your local machine, but bear with me) to

```python
df: polars.LazyFrame
df = df.collect().lazy()
```

Maybe that's the equivalence to aim for. So then

```python
# DataFrame Standard
df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```

would be roughly equivalent to

```python
# polars
df: polars.LazyFrame
df = df.collect().lazy()
features = []
for column_name in df.column_names:
    if df.collect()[column_name].std() > 0:
        features.append(column_name)
return features
```

and

```python
# Dask
df: dask.DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df[column_name].std().compute() > 0:
        features.append(column_name)
return features
```

And maybe that's fine?
Yea, this is more or less what I had in mind. There's still a footgun if someone doesn't use the method, but it at least gives a standard-compliant way for folks to write code that nicely works across both eager and lazy implementations, without introducing any implementation burden onto any of the libraries.
Sure, but we could raise if e.g. `persist` wasn't called:

```python
df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # raises, tells you to call `persist` on parent dataframe
        features.append(column_name)
return features
```

Correct way:

```python
df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```

I think the real footgun would be calling `persist` repeatedly inside the loop:

```python
df: DataFrame
features = []
for column_name in df.column_names:
    if df.persist().col(column_name).std() > 0:
        features.append(column_name)
return features
```

so this is why the "call `persist` as late as possible, and only once per dataframe" advice is there.
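This raise-unless-persisted behaviour can be sketched with toy classes (hypothetical names, not the actual `dataframe-api-compat` implementation): scalar comparisons stay symbolic, and only a persisted parent dataframe allows them to be brought into Python.

```python
import statistics


class Scalar:
    def __init__(self, value, persisted):
        self._value = value
        self._persisted = persisted

    def __gt__(self, other):
        # comparisons stay symbolic: they just produce another Scalar
        return Scalar(self._value > other, self._persisted)

    def __bool__(self):
        if not self._persisted:
            raise RuntimeError("call `persist` on the parent dataframe first")
        return bool(self._value)


class Column:
    def __init__(self, values, persisted):
        self._values = values
        self._persisted = persisted

    def std(self):
        return Scalar(statistics.stdev(self._values), self._persisted)


class Frame:
    def __init__(self, data, persisted=False):
        self._data = data
        self.column_names = list(data)
        self._persisted = persisted

    def persist(self):
        return Frame(self._data, persisted=True)

    def col(self, name):
        return Column(self._data[name], self._persisted)


df = Frame({'a': [1, 1, 1], 'b': [4, 5, 6]})
# [c for c in df.column_names if df.col(c).std() > 0]  -> raises RuntimeError
df = df.persist()
features = [c for c in df.column_names if df.col(c).std() > 0]
print(features)  # ['b']
```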
What are the rules for propagating a `persist`?
It doesn't need to be part of the standard, but an implementation could raise if you try to bring a scalar into Python without having called `persist`.
Trying to catch up here. This seems like a reasonable thing to add, given that all lazy libraries seem to have it. I'm not sure if it solves the same problem as `maybe_execute`, though.
It looks to me like (1) is solved by:

```python
class LazyFrame:
    ...

    def persist(self) -> PermissiveLazyFrame:
        return self._to_permissive()


class PermissiveLazyFrame(LazyFrame):
    def __bool__(self) -> bool:
        return self.collect().__bool__()

    def __int__(self) -> int:
        return self.collect().__int__()

    def __float__(self) -> float:
        return self.collect().__float__()

    def to_array(self) -> numpy.ndarray:
        return self.collect().numpy()

    # add some magic here to ensure that for all other method calls,
    # the return type is PermissiveLazyFrame, not LazyFrame
```

It could be a short implementation - the above is pretty much all that's needed. This would also be in line with Marco's answer on when things raise. For the standard, the description of `persist` …
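For illustration, here is a runnable toy version of that sketch. The `Eager` result type and the single stored value are assumptions made for the demo; they stand in for whatever a real `collect()` returns.

```python
class Eager:
    """Stand-in for a collected (eager) result."""

    def __init__(self, value):
        self.value = value

    def __bool__(self):
        return bool(self.value)

    def __int__(self):
        return int(self.value)

    def __float__(self):
        return float(self.value)


class LazyFrame:
    def __init__(self, value):
        self._value = value

    def collect(self):
        return Eager(self._value)

    def __bool__(self):
        # fully lazy: refuse to be brought into Python without persist()
        raise RuntimeError("call .persist() before converting to a Python scalar")

    def persist(self):
        return PermissiveLazyFrame(self._value)


class PermissiveLazyFrame(LazyFrame):
    # after persist(), Python-scalar conversions are allowed: they collect
    def __bool__(self):
        return self.collect().__bool__()

    def __int__(self):
        return int(self.collect())

    def __float__(self):
        return float(self.collect())


# bool(LazyFrame(3.5)) would raise RuntimeError
print(float(LazyFrame(3.5).persist()))  # 3.5
print(int(LazyFrame(2).persist()))      # 2
```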
Something like that (but note that it's …). I've tried this out in `dataframe-api-compat`.
little demo:

```python
# t.py
from __future__ import annotations

from typing import TYPE_CHECKING

import pandas as pd
import polars as pl

if TYPE_CHECKING:
    from dataframe_api.typing import SupportsDataFrameAPI
    from dataframe_api import DataFrame

dfpd = pd.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})
dfpl = pl.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})


def this_raises(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features


def this_runs(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df = df.persist()  # type: ignore
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features


def this_runs_but_dont_do_it(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.persist().col(column_name).std() > 0:  # type: ignore
            features.append(column_name)
    return features
```

Then (note: tracebacks shortened):

```
In [1]: this_raises(dfpd)
---------------------------------------------------------------------------
ValueError: Method scalar operation requires you to call `.persist` first on the parent dataframe.
Note: `.persist` forces materialisation in lazy libraries and so should be called as late as possible in your pipeline, and only once per dataframe.

In [2]: this_raises(dfpl)
---------------------------------------------------------------------------
ValueError: Cannot materialise a lazy dataframe, please call `persist` first

In [3]: this_runs(dfpd)
Out[3]: ['b']

In [4]: this_runs(dfpl)
Out[4]: ['b']

In [5]: this_runs_but_dont_do_it(dfpd)
Out[5]: ['b']

In [6]: this_runs_but_dont_do_it(dfpl)
Out[6]: ['b']
```

Error messages need sorting out, but this is the idea.
I've updated, and removed the "propagation" part. We can talk about that next time - for now, let's just get `persist` in. I think people agreed on everything in this PR.
Thanks Marco. Apologies for missing yesterday's meeting where perhaps this was discussed.
For what it's worth, I think asking the user of an API to "think lazy" when they don't want/need lazy semantics might make the API difficult for a general audience. But this API is not for the general audience, and I understand that this is the best we can do to support more DataFrame libraries.
Approving, and thanks for the work here!
Overall LGTM, +1 for getting this in. Two minor comments to consider.
spec/purpose_and_scope.md (outdated):

```diff
@@ -125,9 +125,10 @@
 Implementation details of the dataframes and execution of operations. This includes:

 - How data is represented and stored (whether the data is in memory, disk, distributed)
-- Expectations on when the execution is happening (in an eager or lazy way)
+- Expectations on when the execution is happening (in an eager or lazy way), other than `DataFrame.persist`
```
minor: not entirely accurate, since it's only a hint, so there is still no "when" prescribed. How about saying instead: "(see Execution model for some caveats)" in order to keep things in one place?
Co-authored-by: Ralf Gommers <[email protected]>
thanks all, merging. then we can discuss propagation (or lack of) next time, but I'm glad we've been able to agree on this. it's something to be proud of. well done all! 🎉
For now I'm keeping `Column.to_array` out of it - once we sort out #298, we can add that too.