
Add DataFrame.persist, and notes on execution model #307


Merged
merged 12 commits into data-apis:main
Nov 10, 2023

Conversation

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Oct 31, 2023

For now I'm keeping Column.to_array out of it - once we sort out #298, we can add that too

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 31, 2023 16:26
@MarcoGorelli MarcoGorelli changed the title wip: add notes on execution model Add DataFrame.maybe_execute, and notes on execution model Oct 31, 2023
@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Would you mind elaborating a bit more on what the benefit of maybe_execute is, compared to explicitly allowing certain functions such as shape (unless we reuse the semantics from the array-API standard's shape function?) and to_array to optionally materialize?

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 6, 2023

Sure - anything related to automatic execution is going to result in people accidentally double-computing things

Quick example:

df: DataFrame
features = df.drop_columns('target').to_array()  # in Polars, triggers the whole DAG behind `df`
target = df.col('target').to_array()  # in Polars, also triggers the whole DAG behind `df`

as opposed to

df: DataFrame
df = df.maybe_execute()
features = df.drop_columns('target').to_array()
target = df.col('target').to_array()

In the first example, Polars would push the 'target' selection down into each query plan, but the DAG behind `df` would still be executed twice
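
In raw Polars terms, that first pattern is roughly (a sketch, assuming a LazyFrame sits behind the standard wrapper):

import polars as pl

lf: pl.LazyFrame
# two independent collections - each one re-runs the full query plan
# behind `lf`, even though Polars optimises each individual query
features = lf.drop('target').collect().to_numpy()
target = lf.select('target').collect().to_series().to_numpy()

whereas after an up-front collection, the plan behind `lf` runs only once.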


Note that for your library, you'd be free to ignore maybe_execute here, because you can do df.drop_columns('target').to_array() lazily

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

I'm afraid I still don't quite see it, sorry! How would you implement to_array in your second example to take advantage of the maybe_execute hint, assuming that (a) to_array returns an array-API compliant array and (b) that array is a numpy array for the pandas-backed dataframe implementation in question?

Would there be some cache associated with df after the maybe_execute call? If so, would that not deteriorate the user experience with respect to memory consumption? I.e. the user may expect the memory of the features array to be GC'ed once features goes out of scope, but instead it would live for as long as df does.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 6, 2023

Thanks for asking, this will help clarify things

So, let's consider the following cases

case 1: lazy dataframe, with lazy array counterpart, which requires computation for bool(col.std())

I think Dask would be an example of this. They have a lazy array, but bool(df['a'].std()) raises, telling you to call .compute.
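
For concreteness, a minimal sketch of that behaviour (the column values are illustrative):

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3]}), npartitions=1)
# bool() on the lazy Dask scalar raises, with a message telling you to call `.compute()`:
# bool(ddf['a'].std() > 0)
bool((ddf['a'].std() > 0).compute())  # works: returns True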

You could have maybe_execute just set a boolean flag, which is then ignored for Column.to_array and DataFrame.to_array but is respected for Column.std().__bool__()

Something like:

class Scalar:
    def __bool__(self):
        if self.parent_df.maybe_execute:
            # the hint was given: materialise (and cache) the parent dataframe,
            # then compute the scalar itself
            self.parent_df = self.parent_df.compute()
            return bool(self.value.compute())
        else:
            raise RuntimeError("please call `maybe_execute` on the parent dataframe first")

class Column:
    def to_array(self):
        # ignore `maybe_execute` here, as `to_array` doesn't need it -
        # a lazy array can be returned as-is
        return self.column.to_array()

    def std(self):
        return Scalar(self.column.std(), parent_df=self.parent_df)

case 2: lazy dataframe, with eager array counterpart, which requires computation for bool(col.std())

Could just alias maybe_execute to the underlying collect / compute call, e.g.

class DataFrame:
    # [...]
    def maybe_execute(self):
        return DataFrame(self.dataframe.collect())

case 3: eager dataframe, everything's eager

maybe_execute is a no-op

class DataFrame:
    # [...]
    def maybe_execute(self):
        return DataFrame(self.dataframe)

EDIT

As noted below, the above (case 1) may not be a great idea anyway #307 (comment)

Contributor

@cbourjau cbourjau left a comment

Thanks for the further explanations, even though I must say that I am still a bit confused about the semantics of maybe_execute. This example benefits from a collection taking place immediately when maybe_execute is called, whereas the latter example appears to revolve around the idea of allowing a deferred collection when __bool__ is called.

A fully lazy implementation would need to error out when calling __bool__ regardless of whether a user called maybe_execute or not. A lazy/eager hybrid such as a Polars-backed implementation certainly benefits from explicit "collection hints", if the collections happen explicitly where the hints are invoked.

This makes me wonder about the benefit of also using maybe_execute to allow subsequent collections compared to simply always allowing those by default.

Comment on lines 70 to 71
To be guaranteed to run across all implementations, :meth:`maybe_execute` should
be executed at some point before calling this method.
Contributor

Does this mean that all operations are potentially (e.g. in a polars-based implementation) eager after a call to maybe_execute?

Contributor Author

that's right - though Column would still be backed by an expression (which is lazy), the parent dataframe would be eager. you can try this out with https://github.com/data-apis/dataframe-api-compat
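
i.e. something like this (a sketch of the idea only - the attribute names are illustrative, not dataframe-api-compat's actual internals):

import polars as pl

class Column:
    def __init__(self, expr: pl.Expr, parent_df: pl.DataFrame):
        self.expr = expr            # still an expression, i.e. lazy
        self.parent_df = parent_df  # eager once `maybe_execute` has taken effect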

Comment on lines 994 to 995
This method may force execution. If necessary, it should be called
at most once per dataframe, and as late as possible in the pipeline.
Contributor

Why "at most once" rather than "as few times as possible"?

Contributor Author

if you're using it multiple times, then you're potentially re-executing things

Contributor

Sure, but there are reasonable cases where that would be what you want, are there not? For example, you may want to collect a dataframe, filter it further in a lazy manner, and then collect it again.

Contributor Author

sure but why would you collect it before filtering?

Contributor

It is a bit of a constructed example, but maybe you want to do computations on the entire dataframe and also on some subset of it. It would make sense to collect just prior to the first computation on the entire frame so that whatever came before it doesn't have to be recomputed when doing the computation on the subset.

Contributor Author

something like

df: DataFrame
df = df.persist()
sub_df_1 = df.filter(df.col('a') > 0)
sub_df_2 = df.filter(df.col('a') <= 0)
features_1 = []
for column_name in sub_df_1.column_names:
    if sub_df_1.col(column_name).std() > 0:
        features_1.append(column_name)
features_2 = []
for column_name in sub_df_2.column_names:
    if sub_df_2.col(column_name).std() > 0:
        features_2.append(column_name)

?

You'd still just be calling it once per dataframe - could you show an example of where you'd want to call it twice for the same dataframe?

Comment on lines 46 to 63
The Dataframe API has a `DataFrame.maybe_execute` for addressing the above. We can use it to rewrite the code above
as follows:
```python
df: DataFrame
df = df.maybe_execute()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features
```

Note that `maybe_execute` is to be interpreted as a hint, rather than as a directive -
the implementation itself may decide
whether to force execution at this step, or whether to defer it to later.
For example, a dataframe which can convert to a lazy array could decide to ignore
`maybe_execute` when evaluating `DataFrame.to_array` but to respect it when evaluating
`float(Column.std())`.
Contributor

So, maybe_execute may do:

  • Nothing at all
  • Nothing at this point but allows later collections when the backend thinks it is expedient
  • An immediate collection

What happens to subsequent calls to df assuming that a collection did take place? Are they eager or lazy?

df: DataFrame
column_name: str
df = df.maybe_execute()
col = df.col(column_name)
filtered_col = col.filter(col > 42)  # is this computation eager now?
filtered_col.std()

Contributor Author

@MarcoGorelli MarcoGorelli Nov 6, 2023

is this computation eager now?

It's implementation-dependent. It can stay lazy

What really matters is when you do

bool(filtered_col.std())

(which you might trigger via `if filtered_col.std() > 0:`) - at that point:

  • if maybe_execute wasn't called previously, this is unsupported by the standard and may vary across implementations
  • if maybe_execute was called, then libraries supporting eager evaluation should return a result

Contributor

When would you want the first option rather than an implicit default for the latter behavior? It seems rather obvious that bool(filtered_col.std()) needs to materialize something so it is hardly a surprise to the user at this point. Sure, the user may want to strategically place a maybe_execute earlier for performance reasons, but why introduce undefined behavior if they don't?

Contributor Author

you don't need to introduce undefined behaviour, I just mean that it's undefined by the Dataframe API - the Standard makes no guarantee of what will happen there

features.append(column_name)
return features
```
as that will potentially re-trigger the same execution multiple times.
Contributor

But there are no guarantees here, are there? Given that maybe_execute still allows for deferred execution, the backend may still re-trigger the same execution multiple times.

Contributor Author

yes you're right, that's an issue with the suggestion I put in #307 (comment)

not sure what to suggest, will think about this

Contributor Author

Looks like there are two cases that really need addressing:

  • bool(scalar) requires computation in all cases
  • to_array only requires computation in some cases
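
For example (a sketch; whether `to_array` can stay lazy depends on the backend having a lazy array type, as in Dask):

df: DataFrame

# bool() must hand a concrete Python bool back to the interpreter,
# so some computation is unavoidable here:
flag = bool(df.col('a').std() > 0)

# to_array may be able to return a lazy array (e.g. a dask.array)
# without computing anything, so it only sometimes forces execution:
arr = df.col('a').to_array()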

@MarcoGorelli
Contributor Author

Thanks for your comments

I've excluded to_array from this PR, as it's probably not necessary at this point

Conversely, the `if df.col('a').std() > 0:` call is really necessary to resolve (#305 (comment))

@kkraus14
Collaborator

kkraus14 commented Nov 7, 2023

I think maybe_execute is too ambiguous here. It just implies that the dataframe may hit a blocking operation at some point, but doesn't clearly indicate that it should explicitly execute and block, or otherwise do something to influence downstream usage. Also, does doing another operation after maybe_execute invalidate the maybe_execute flag?

If we look at lazy execution frameworks (ignoring Ibis, since it defers to different backends for actual implementation / execution), they all have a method like cache (Polars, Spark) or persist (Dask, Spark) that doesn't block, but explicitly instructs that all future usage after that operation will not retrigger computation from before that operation.

What would your thoughts be if we added a method like cache that is a no-op for eager DataFrames, which gives a better path to avoid performance pitfalls in lazy dataframes, but then generally allowed operations like to_array and bool(Scalar) to always run?
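
For illustration, a sketch of the persist semantics described above (the parquet path and column names are hypothetical):

import dask.dataframe as dd

ddf = dd.read_parquet('data.parquet')  # hypothetical input
ddf = ddf[ddf.x > 0]                   # lazy filter
ddf = ddf.persist()                    # starts computation (in the background on a distributed cluster)
mean_y = ddf.y.mean().compute()        # reuses the persisted partitions
sum_z = ddf.z.sum().compute()          # the filter is not re-executed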

@MarcoGorelli
Contributor Author

Initial thought - sounds good! Will try it out in my implementation / skrub and see where this takes us

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

Initial thought - sounds good! Will try it out in my implementation / skrub and see where this takes us

Just tried, and no, polars.LazyFrame.cache doesn't persist the dataframe into memory, see pola-rs/polars#2842

Having said that, maybe maybe_execute could just defer to .collect in Polars and persist in Dask?

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

Having said that, as far as I can tell,

df: dask.DataFrame
df = df.persist()

is very roughly equivalent (ok, not exactly, as Dask would persist to the cluster rather than to your local machine, but bear with me) to

df: polars.LazyFrame
df = df.collect().lazy()

Maybe .persist is OK then, and we document the list of methods which implicitly force computation

So then

# DataFrame Standard
df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

would be roughly equivalent to

# polars
df: polars.LazyFrame
df = df.collect().lazy()
features = []
for column_name in df.columns:
    if df.collect()[column_name].std() > 0:
        features.append(column_name)
return features

and

# Dask
df: dask.DataFrame
df = df.persist()
features = []
for column_name in df.columns:
    if df[column_name].std().compute() > 0:
        features.append(column_name)
return features

And maybe that's fine?

@kkraus14
Collaborator

kkraus14 commented Nov 8, 2023

Yea, this is more or less what I had in mind. There's still a footgun if someone doesn't use the method, but it at least gives a standard-compliant way for folks to write code that works nicely across both eager and lazy implementations, without introducing any implementation burden onto any of the libraries.

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 8, 2023

There's still a footgun if someone doesn't use the method

Sure but we could raise if persist hasn't been called earlier?

e.g.

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # raises, telling you to call `persist` on the parent dataframe
        features.append(column_name)
return features

Correct way:

df: DataFrame
df = df.persist()
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

I think the real footgun would be calling .persist within a loop:

df: DataFrame
features = []
for column_name in df.column_names:
    if df.persist().col(column_name).std() > 0:
        features.append(column_name)
return features

so this is why the "use .persist as late and as little as possible" rule (which we'd document) should still apply

EDIT: though even this last case could be prevented by erroring if persist is called when is_persisted = True. Footguns are still possible, but if we document the best practice and have examples, I think it's OK.

EDIT2: the previous EDIT doesn't hold, because operations here aren't in-place.

@kkraus14
Collaborator

kkraus14 commented Nov 8, 2023

Sure but we could raise if persist hasn't been called earlier?

e.g.

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # raises, telling you to call `persist` on the parent dataframe
        features.append(column_name)
return features

What are the rules for propagating a persist status vs invalidating it? I think it would be difficult to have a set of cohesive rules for this.
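
For instance (an illustrative sketch of the ambiguity, not proposed API):

df = df.persist()
df2 = df.filter(df.col('a') > 0)  # does df2 inherit the persisted status?
df3 = df2.drop_columns('a')       # does each further operation keep it, or invalidate it?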

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 9, 2023

It doesn't need to be part of the standard, but an implementation could raise if you try to bring a scalar into Python without having called persist at some point between __dataframe_consortium_standard__ and the call which forces computation
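
A minimal sketch of how an implementation could track that (the class names and `_is_persisted` flag are illustrative, not part of the standard; a Polars backend is assumed):

class StandardDataFrame:
    def __init__(self, native_df, is_persisted=False):
        self._df = native_df
        self._is_persisted = is_persisted

    def persist(self):
        # e.g. `collect().lazy()` for Polars, `.persist()` for Dask
        return StandardDataFrame(self._df.collect().lazy(), is_persisted=True)

class StandardScalar:
    def __init__(self, value, parent):
        self._value = value
        self._parent = parent

    def __bool__(self):
        if not self._parent._is_persisted:
            raise RuntimeError("call `.persist` on the parent dataframe before materialising a scalar")
        return bool(self._value)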

@rgommers
Member

rgommers commented Nov 9, 2023

If we look at lazy execution frameworks (ignoring Ibis, since it defers to different backends for actual implementation / execution), they all have a method like cache (Polars, Spark) or persist (Dask, Spark) that doesn't block, but explicitly instructs that all future usage after that operation will not retrigger computation from before that operation.

Trying to catch up here. This seems like a reasonable thing to add, given that all lazy libraries seem to have it. I'm not sure it solves the same problem as maybe_execute(); there are two more or less orthogonal things here:

  1. Avoid recomputing expensive calls more than once in lazy implementations
  2. Allow writing code where method calls that cannot be kept lazy by a library do work

It looks to me like (1) is solved by .persist, while for (2) it's not yet 100% clear to me from the discussion above. It would solve the problem if it returned something that, for Polars, is neither a LazyFrame nor an eager DataFrame, but rather a lazyframe with a few methods that are able to trigger execution. I.e. something like:

class LazyFrame:
    ...
    def persist(self) -> PermissiveLazyFrame:
        return self._to_permissive()

class PermissiveLazyFrame(LazyFrame):
    def __bool__(self) -> bool:
        return self.collect().__bool__()

    def __int__(self) -> int:
        return self.collect().__int__()

    def __float__(self) -> float:
        return self.collect().__float__()

    def to_array(self) -> numpy.ndarray:
        return self.collect().to_numpy()

    # add some magic here to ensure that for all other method calls,
    # the return type is PermissiveLazyFrame, not LazyFrame
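
One way that "magic" could be realised (a sketch, not part of the proposal): use composition plus __getattr__ instead of inheritance, so that any method returning a LazyFrame gets re-wrapped:

import polars as pl

class PermissiveLazyFrame:
    def __init__(self, lf: pl.LazyFrame) -> None:
        self._lf = lf

    def __getattr__(self, name):
        attr = getattr(self._lf, name)
        if not callable(attr):
            return attr
        def wrapped(*args, **kwargs):
            result = attr(*args, **kwargs)
            # keep the permissive wrapper through the whole call chain
            if isinstance(result, pl.LazyFrame):
                return PermissiveLazyFrame(result)
            return result
        return wrapped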

It could be a short implementation - the sketch above is pretty much all that's needed. This would also be in line with Marco's answer on when things raise (they stop raising once .persist has been called at least once) - because LazyFrame continues raising. Is my understanding correct there?

For the standard, the description of __bool__ & co would then still be slightly awkward, something like: "returns a boolean scalar; may raise for lazy implementations; is guaranteed not to raise if .persist() has been called on the dataframe before unless the library is not able to execute anything eagerly."

@MarcoGorelli
Contributor Author

Something like that (but note that it's Scalar.__bool__, not LazyFrame.__bool__)

I've tried this out in dataframe-api-compat anyway - if anyone fancied trying it out (pip install dataframe-api-compat - I'm releasing a bit liberally whilst we have no users) and seeing if it matches their expectations, that'd be really helpful

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 9, 2023

little demo:

# t.py
from __future__ import annotations

import pandas as pd
import polars as pl

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from dataframe_api.typing import SupportsDataFrameAPI
    from dataframe_api import DataFrame

dfpd = pd.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})
dfpl = pl.DataFrame({'a': [1, 1, 1], 'b': [4, 5, 6]})

def this_raises(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features

def this_runs(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    df = df.persist()  # type: ignore
    features = []
    for column_name in df.column_names:
        if df.col(column_name).std() > 0:
            features.append(column_name)
    return features

def this_runs_but_dont_do_it(df_raw: SupportsDataFrameAPI):
    df = df_raw.__dataframe_consortium_standard__(api_version='2023.11-beta')
    features = []
    for column_name in df.column_names:
        if df.persist().col(column_name).std() > 0:  # type: ignore
            features.append(column_name)
    return features

Then (note: tracebacks shortened):

In [1]: this_raises(dfpd)
---------------------------------------------------------------------------
ValueError: Method scalar operation requires you to call `.persist` first on the parent dataframe.

Note: `.persist` forces materialisation in lazy libraries and so should be called as late as possible in your pipeline, and only once per dataframe.

In [2]: this_raises(dfpl)
---------------------------------------------------------------------------
ValueError: Cannot materialise a lazy dataframe, please call `persist` first

In [3]: this_runs(dfpd)
Out[3]: ['b']

In [4]: this_runs(dfpl)
Out[4]: ['b']

In [5]: this_runs_but_dont_do_it(dfpd)
Out[5]: ['b']

In [6]: this_runs_but_dont_do_it(dfpl)
Out[6]: ['b']

Error messages need sorting out, but this is the idea

@MarcoGorelli MarcoGorelli changed the title Add DataFrame.maybe_execute, and notes on execution model Add DataFrame.persist, and notes on execution model Nov 9, 2023
@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 10, 2023

I've updated, and removed the "propagation" part. We can talk about that next time - for now let's just get persist in?

I think people agreed on everything in this PR

Contributor

@shwina shwina left a comment

Thanks Marco. Apologies for missing yesterday's meeting where perhaps this was discussed.

For what it's worth, I think asking the user of an API to "think lazy" when they don't want/need lazy semantics might make the API difficult for a general audience. But this API is not for a general audience, and I understand that this is the best we can do to support more DataFrame libraries.

Approving, and thanks for the work here!

Member

@rgommers rgommers left a comment

Overall LGTM, +1 for getting this in. Two minor comments to consider.

@@ -125,9 +125,10 @@ See the [use cases](use_cases.md) section for details on the exact use cases con
Implementation details of the dataframes and execution of operations. This includes:

- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way)
- Expectations on when the execution is happening (in an eager or lazy way), other than `DataFrame.persist`
Member

minor: not entirely accurate, since it's only a hint so there is still no "when" prescribed.

How about saying instead: "(see Execution model for some caveats)" in order to keep things in one place?

@MarcoGorelli
Contributor Author

thanks all, merging then

we can discuss propagation (or lack of) next time, but I'm glad we've been able to agree on this. it's something to be proud of. well done all! 🎉

@MarcoGorelli MarcoGorelli merged commit 7be00b6 into data-apis:main Nov 10, 2023