
Add DataFrame.persist, and notes on execution model #307


Merged
merged 12 commits into data-apis:main
Nov 10, 2023

Conversation

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Oct 31, 2023

For now I'm keeping Column.to_array out of it - once we sort out #298, we can add that too

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 31, 2023 16:26
@MarcoGorelli MarcoGorelli changed the title wip: add notes on execution model Add DataFrame.maybe_execute, and notes on execution model Oct 31, 2023
@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Would you mind elaborating a bit more on what the benefit of maybe_execute is, compared to explicitly allowing certain functions such as shape (unless we reuse the semantics from the array-API standard's shape function?) and to_array to optionally materialize?

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 6, 2023

Sure - anything related to automatic execution is going to result in people accidentally double-computing things

Quick example:

df: DataFrame
features = df.drop_columns('target').to_array()  # in Polars, triggers the whole DAG behind `df`
target = df.col('target').to_array()  # in Polars, also triggers the whole DAG behind `df`

as opposed to

df: DataFrame
df = df.maybe_execute()
features = df.drop_columns('target').to_array()
target = df.col('target').to_array()

In the first example, Polars would push the 'target' selection down into each query plan, but the DAG behind `df` would still be executed twice
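
In raw Polars terms, that first pattern is roughly (a sketch, assuming a LazyFrame sits behind the standard wrapper):

import polars as pl

lf: pl.LazyFrame
# two independent collections - each one re-runs the full query plan
# behind `lf`, even though Polars optimises each individual query
features = lf.drop('target').collect().to_numpy()
target = lf.select('target').collect().to_series().to_numpy()

whereas after an up-front collection, the plan behind `lf` runs only once.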


Note that for your library, you'd be free to ignore maybe_execute here, because you can do df.drop_columns('target').to_array() lazily

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

I'm afraid I still don't quite see it, sorry! How would you implement to_array in your second example to take advantage of the maybe_execute hint, assuming that (a) to_array returns an array-API compliant array and (b) that array is a numpy array for the pandas-backed dataframe implementation in question?

Would there be some cache associated with df after the maybe_execute call? If so, would that not deteriorate the user experience with respect to memory consumption? I.e. the user may expect the memory of the features array to be GC'ed once features goes out of scope, but instead it would live for as long as df does.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 6, 2023

Thanks for asking, this will help clarify things

So, let's consider the following cases

case 1: lazy dataframe, with lazy array counterpart, which requires computation for bool(col.std())

I think Dask would be an example of this. They have a lazy array, but bool(df['a'].std()) raises, telling you to call .compute.
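
For concreteness, a minimal sketch of that behaviour (the column values are illustrative):

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3]}), npartitions=1)
# bool() on the lazy Dask scalar raises, with a message telling you to call `.compute()`:
# bool(ddf['a'].std() > 0)
bool((ddf['a'].std() > 0).compute())  # works: returns True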

You could have maybe_execute just set a boolean flag, which is then ignored for Column.to_array and DataFrame.to_array but is respected for Column.std().__bool__()

Something like:

class Scalar:
    def __bool__(self):
        if self.parent_df.maybe_execute:
            # the hint was given: materialise (and cache) the parent dataframe,
            # then compute the scalar itself
            self.parent_df = self.parent_df.compute()
            return bool(self.value.compute())
        else:
            raise RuntimeError("please call `maybe_execute` on the parent dataframe first")

class Column:
    def to_array(self):
        # ignore `maybe_execute` here, as `to_array` doesn't need it -
        # a lazy array can be returned as-is
        return self.column.to_array()

    def std(self):
        return Scalar(self.column.std(), parent_df=self.parent_df)

case 2: lazy dataframe, with eager array counterpart, which requires computation for bool(col.std())

Could just alias maybe_execute to the underlying collect / compute call, e.g.

class DataFrame:
    # [...]
    def maybe_execute(self):
        return DataFrame(self.dataframe.collect())

case 3: eager dataframe, everything's eager

maybe_execute is a no-op

class DataFrame:
    # [...]
    def maybe_execute(self):
        return DataFrame(self.dataframe)

EDIT

As noted below, the above (case 1) may not be a great idea anyway #307 (comment)

Contributor

@cbourjau cbourjau left a comment

Thanks for the further explanations, even though I must say that I am still a bit confused about the semantics of maybe_execute. This example benefits from a collection taking place immediately when maybe_execute is called, whereas the latter example appears to revolve around the idea of allowing a deferred collection when __bool__ is called.

A fully lazy implementation would need to error out when calling __bool__ regardless of whether a user called maybe_execute or not. A lazy/eager hybrid such as a Polars-backed implementation certainly benefits from explicit "collection hints", if the collections happen explicitly where the hints are invoked.

This makes me wonder about the benefit of also using maybe_execute to allow subsequent collections compared to simply always allowing those by default.

Comment on lines 70 to 71
To be guaranteed to run across all implementations, :meth:`maybe_execute` should
be executed at some point before calling this method.
Contributor

Does this mean that all operations are potentially (e.g. in a polars-based implementation) eager after a call to maybe_execute?

Contributor Author

that's right - though Column would still be backed by an expression (which is lazy), the parent dataframe would be eager. you can try this out with https://github.com/data-apis/dataframe-api-compat
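
i.e. something like this (a sketch of the idea only - the attribute names are illustrative, not dataframe-api-compat's actual internals):

import polars as pl

class Column:
    def __init__(self, expr: pl.Expr, parent_df: pl.DataFrame):
        self.expr = expr            # still an expression, i.e. lazy
        self.parent_df = parent_df  # eager once `maybe_execute` has taken effect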

Comment on lines 994 to 995
This method may force execution. If necessary, it should be called
at most once per dataframe, and as late as possible in the pipeline.
Contributor

Why "at most once" rather than "as few times as possible"?

Contributor Author

if you're using it multiple times, then you're potentially re-executing things

Contributor

Sure, but there are reasonable cases where that would be what you want, are there not? For example, you may want to collect a dataframe, filter it further in a lazy manner, and then collect it again.

Contributor Author

sure but why would you collect it before filtering?

Contributor

It is a bit of a constructed example, but maybe you want to do computations on the entire dataframe and also on some subset of it. It would make sense to collect just prior to the first computation on the entire frame so that whatever came before it doesn't have to be recomputed when doing the computation on the subset.

Contributor Author

something like

df: DataFrame
df = df.persist()
sub_df_1 = df.filter(df.col('a') > 0)
sub_df_2 = df.filter(df.col('a') <= 0)
features_1 = []
for column_name in sub_df_1.column_names:
    if sub_df_1.col(column_name).std() > 0:
        features_1.append(column_name)
features_2 = []
for column_name in sub_df_2.column_names:
    if sub_df_2.col(column_name).std() > 0:
        features_2.append(column_name)

?

You'd still just be calling it once per dataframe - could you show an example of where you'd want to call it twice for the same dataframe?

Comment on lines 46 to 63
The Dataframe API has a `DataFrame.maybe_execute` for addressing the above. We can use it to rewrite the code above
as follows:
```python
df: DataFrame
df = df.maybe_execute()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```

Note that `maybe_execute` is to be interpreted as a hint, rather than as a directive -
the implementation itself may decide
whether to force execution at this step, or whether to defer it to later.
For example, a dataframe which can convert to a lazy array could decide to ignore
`maybe_execute` when evaluating `DataFrame.to_array` but to respect it when evaluating
`float(Column.std())`.
Contributor

So, maybe_execute may do:

  • Nothing at all
  • Nothing at this point but allows later collections when the backend thinks it is expedient
  • An immediate collection

What happens to subsequent calls to df assuming that a collection did take place? Are they eager or lazy?

df: DataFrame
column_name: str
df = df.maybe_execute()
col = df.col(column_name)
filtered_col = col.filter(col > 42)  # is this computation eager now?
filtered_col.std()

Contributor Author

@MarcoGorelli MarcoGorelli Nov 6, 2023

is this computation eager now?

It's implementation-dependent. It can stay lazy

What really matters is when you do

bool(filtered_col.std())

(which you might trigger via `if filtered_col.std() > 0:`) - at that point:

  • if maybe_execute wasn't called previously, this is unsupported by the standard and may vary across implementations
  • if maybe_execute was called, then libraries supporting eager evaluation should return a result

Contributor

When would you want the first option rather than an implicit default for the latter behavior? It seems rather obvious that bool(filtered_col.std()) needs to materialize something so it is hardly a surprise to the user at this point. Sure, the user may want to strategically place a maybe_execute earlier for performance reasons, but why introduce undefined behavior if they don't?

Contributor Author

you don't need to introduce undefined behaviour, I just mean that it's undefined by the Dataframe API - the Standard makes no guarantee of what will happen there

features.append(column_name)
return features
```
as that will potentially re-trigger the same execution multiple times.
Contributor

But there are no guarantees here, are there? Given that maybe_execute still allows for deferred execution, the backend may still re-trigger the same execution multiple times.

Contributor Author

yes you're right, that's an issue with the suggestion I put in #307 (comment)

not sure what to suggest, will think about this

Contributor Author

Looks like there are two cases that really need addressing:

  • bool(scalar) requires computation in all cases
  • to_array only requires computation in some cases
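
For example (a sketch; whether `to_array` can stay lazy depends on the backend having a lazy array type, as in Dask):

df: DataFrame

# bool() must hand a concrete Python bool back to the interpreter,
# so some computation is unavoidable here:
flag = bool(df.col('a').std() > 0)

# to_array may be able to return a lazy array (e.g. a dask.array)
# without computing anything, so it only sometimes forces execution:
arr = df.col('a').to_array()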

@MarcoGorelli
Contributor Author

Thanks for your comments

I've excluded to_array from this PR, as it's probably not necessary at this point

Conversely, the `if df.col('a').std() > 0:` call is really necessary to resolve (#305 (comment))

@kkraus14
Collaborator

kkraus14 commented Nov 7, 2023

I think maybe_execute is too ambiguous here. It just implies that the dataframe may hit a blocking operation at some point, but doesn't clearly indicate that it should explicitly execute and block, or otherwise do something to influence downstream usage. Also, does doing another operation after maybe_execute invalidate the maybe_execute flag?

If we look at lazy execution frameworks (ignoring Ibis, since it defers to different backends for actual implementation / execution), they all have a method like cache (Polars, Spark) or persist (Dask, Spark) that doesn't block, but explicitly instructs that all future usage after that operation will not retrigger computation from before that operation.

What would your thoughts be if we added a method like cache that is a no-op for eager DataFrames, which gives a better path to avoid performance pitfalls in lazy dataframes, but then generally allowed operations like to_array and bool(Scalar) to always run?
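
For illustration, a sketch of the persist semantics described above (the parquet path and column names are hypothetical):

import dask.dataframe as dd

ddf = dd.read_parquet('data.parquet')  # hypothetical input
ddf = ddf[ddf.x > 0]                   # lazy filter
ddf = ddf.persist()                    # starts computation (in the background on a distributed cluster)
mean_y = ddf.y.mean().compute()        # reuses the persisted partitions
sum_z = ddf.z.sum().compute()          # the filter is not re-executed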

@MarcoGorelli
Contributor Author

Initial thought - sounds good! Will try it out in my implementation / skrub and see where this takes us

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

Initial thought - sounds good! Will try it out in my implementation / skrub and see where this takes us

Just tried, and no, polars.LazyFrame.cache doesn't persist the dataframe into memory, see pola-rs/polars#2842

Having said that, maybe maybe_execute could just defer to .collect in Polars and persist in Dask?

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

Having said that, as far as I can tell,

df: dask.DataFrame
df = df.persist()

is very roughly equivalent (ok, not exactly, as Dask would persist to the cluster rather than to your local machine, but bear with me) to

df: polars.LazyFrame
df = df.collect().lazy()

Maybe .persist is OK then, and we document the list of methods which implicitly force computation

So then

# DataFrame Standard
df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

would be roughly equivalent to

# polars
df: polars.LazyFrame
df = df.collect().lazy()
features = []
for column_name in df.columns:
    if df.collect()[column_name].std() > 0:
        features.append(column_name)
return features

and

# Dask
df: dask.DataFrame
df = df.persist()
features = []
for column_name in df.columns:
    if df[column_name].std().compute() > 0:
        features.append(column_name)
return features

And maybe that's fine?

@kkraus14
Collaborator

kkraus14 commented Nov 8, 2023

Yea, this is more or less what I had in mind. There's still a footgun if someone doesn't use the method, but it at least gives a standard-compliant way for folks to write code that works nicely across both eager and lazy implementations, without introducing any implementation burden onto any of the libraries.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

There's still a footgun if someone doesn't use the method

Sure but we could raise if persist hasn't been called earlier?

e.g.

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # raises, telling you to call `persist` on the parent dataframe
        features.append(column_name)
return features

Correct way:

df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

I think the real footgun would be calling .persist within a loop:

df: DataFrame
features = []
for column_name in df.column_names:
    if df.persist().col(column_name).std() > 0:
        features.append(column_name)
return features

so this is why the "use .persist as late and as little as possible" rule (which we'd document) should still apply

EDIT: though even this last case could be prevented by erroring if persist is called when is_persisted = True. Footguns are still possible, but if we document the best practice and have examples, I think it's OK.

EDIT2: the previous EDIT doesn't hold, because operations here aren't in-place.

@kkraus14
Collaborator

kkraus14 commented Nov 8, 2023

Sure but we could raise if persist hasn't been called earlier?

e.g.

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # raises, telling you to call `persist` on the parent dataframe
        features.append(column_name)
return features

What are the rules for propagating a persist status vs invalidating it? I think it would be difficult to have a set of cohesive rules for this.
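
For instance (an illustrative sketch of the ambiguity, not proposed API):

df = df.persist()
df2 = df.filter(df.col('a') > 0)  # does df2 inherit the persisted status?
df3 = df2.drop_columns('a')       # does each further operation keep it, or invalidate it?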

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 9, 2023

It doesn't need to be part of the standard, but an implementation could raise if you try to bring a scalar into Python without having called persist at some point between __dataframe_consortium_standard__ and the call which forces computation
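
A minimal sketch of how an implementation could track that (the class names and `_is_persisted` flag are illustrative, not part of the standard; a Polars backend is assumed):

class StandardDataFrame:
    def __init__(self, native_df, is_persisted=False):
        self._df = native_df
        self._is_persisted = is_persisted

    def persist(self):
        # e.g. `collect().lazy()` for Polars, `.persist()` for Dask
        return StandardDataFrame(self._df.collect().lazy(), is_persisted=True)

class StandardScalar:
    def __init__(self, value, parent):
        self._value = value
        self._parent = parent

    def __bool__(self):
        if not self._parent._is_persisted:
            raise RuntimeError("call `.persist` on the parent dataframe before materialising a scalar")
        return bool(self._value)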

@rgommers
Member

rgommers commented Nov 9, 2023

If we look at lazy execution frameworks (ignoring Ibis, since it defers to different backends for actual implementation / execution), they all have a method like cache (Polars, Spark) or persist (Dask, Spark) that doesn't block, but explicitly instructs that all future usage after that operation will not retrigger computation from before that operation.

Trying to catch up here. This seems like a reasonable thing to add, given that all lazy libraries seem to have it. I'm not sure it solves the same problem as maybe_execute(); there are two more or less orthogonal things here:

  1. Avoid recomputing expensive calls more than once in lazy implementations
  2. Allow writing code where method calls that cannot be kept lazy by a library do work

It looks to me like (1) is solved by .persist, while for (2) it's not yet 100% clear to me from the discussion above. It would solve the problem if it returned something that, for Polars, is neither a LazyFrame nor an eager DataFrame, but rather a lazyframe with a few methods that are able to trigger execution. I.e. something like:

class LazyFrame:
    ...
    def persist(self) -> PermissiveLazyFrame:
        return self._to_permissive()

class PermissiveLazyFrame(LazyFrame):
    def __bool__(self) -> bool:
        return self.collect().__bool__()

    def __int__(self) -> int:
        return self.collect().__int__()

    def __float__(self) -> float:
        return self.collect().__float__()

    def to_array(self) -> numpy.ndarray:
        return self.collect().to_numpy()

    # add some magic here to ensure that for all other method calls,
    # the return type is PermissiveLazyFrame, not LazyFrame
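
One way that "magic" could be realised (a sketch, not part of the proposal): use composition plus __getattr__ instead of inheritance, so that any method returning a LazyFrame gets re-wrapped:

import polars as pl

class PermissiveLazyFrame:
    def __init__(self, lf: pl.LazyFrame) -> None:
        self._lf = lf

    def __getattr__(self, name):
        attr = getattr(self._lf, name)
        if not callable(attr):
            return attr
        def wrapped(*args, **kwargs):
            result = attr(*args, **kwargs)
            # keep the permissive wrapper through the whole call chain
            if isinstance(result, pl.LazyFrame):
                return PermissiveLazyFrame(result)
            return result
        return wrapped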

It could be a short implementation - the sketch above is pretty much all that's needed. This would also be in line with Marco's answer on when things raise (they stop raising once .persist has been called at least once) - because LazyFrame continues raising. Is my understanding correct there?

For the standard, the description of __bool__ & co would then still be slightly awkward, something like: "returns a boolean scalar; may raise for lazy implementations; is guaranteed not to raise if .persist() has been called on the dataframe before unless the library is not able to execute anything eagerly."

@MarcoGorelli
Contributor Author

Something like that (but note that it's Scalar.__bool__, not LazyFrame.__bool__)

I've tried this out in dataframe-api-compat anyway - if anyone fancied trying it out (pip install dataframe-api-compat - I'm releasing a bit liberally whilst we have no users) and seeing if it matches their expectations, that'd be really helpful

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 9, 2023

little demo:

# t.py
from __future__ import annotations

import pandas as pd
import polars as pl

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from dataframe_api.typing import SupportsDataFrameAPI
    from dataframe_api import DataFrame

dfpd = pd.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})
dfpl = pl.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})

def this_raises(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features

def this_runs(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df = df.persist()  # type: ignore
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features

def this_runs_but_dont_do_it(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.persist().col(column_name).std() > 0:  # type: ignore
            features.append(column_name)
    return features

Then (note: tracebacks shortened):

In [1]: this_raises(dfpd)
---------------------------------------------------------------------------
ValueError: Method scalar operation requires you to call `.persist` first on the parent dataframe.

Note: `.persist` forces materialisation in lazy libraries and so should be called as late as possible in your pipeline, and only once per dataframe.

In [2]: this_raises(dfpl)
---------------------------------------------------------------------------
ValueError: Cannot materialise a lazy dataframe, please call `persist` first

In [3]: this_runs(dfpd)
Out[3]: ['b']

In [4]: this_runs(dfpl)
Out[4]: ['b']

In [5]: this_runs_but_dont_do_it(dfpd)
Out[5]: ['b']

In [6]: this_runs_but_dont_do_it(dfpl)
Out[6]: ['b']

Error messages need sorting out, but this is the idea

@MarcoGorelli MarcoGorelli changed the title Add DataFrame.maybe_execute, and notes on execution model Add DataFrame.persist, and notes on execution model Nov 9, 2023
@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 10, 2023

I've updated, and removed the "propagation" part. We can talk about that next time - for now let's just get persist in?

I think people agreed on everything in this PR

Contributor

@shwina shwina left a comment

Thanks Marco. Apologies for missing yesterday's meeting where perhaps this was discussed.

For what it's worth, I think asking the user of an API to "think lazy" when they don't want/need lazy semantics might make the API difficult for a general audience. But this API is not for a general audience, and I understand that this is the best we can do to support more DataFrame libraries.

Approving, and thanks for the work here!

Member

@rgommers rgommers left a comment

Overall LGTM, +1 for getting this in. Two minor comments to consider.

@@ -125,9 +125,10 @@ See the [use cases](use_cases.md) section for details on the exact use cases con
Implementation details of the dataframes and execution of operations. This includes:

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way)
- Expectations on when the execution is happening (in an eager or lazy way), other than `DataFrame.persist`
Member

minor: not entirely accurate, since it's only a hint so there is still no "when" prescribed.

How about saying instead: "(see Execution model for some caveats)" in order to keep things in one place?

@MarcoGorelli
Contributor Author

thanks all, merging then

we can discuss propagation (or lack of) next time, but I'm glad we've been able to agree on this. it's something to be proud of. well done all! 🎉

@MarcoGorelli MarcoGorelli merged commit 7be00b6 into data-apis:main Nov 10, 2023