[DataFrame] DataFrame operations "obj" return type #5707
This is the 2nd time a request to infer the return type has come up this week. I'll consider different approaches here. At a high level, it feels like having strongly typed column fields based on the current schema will solve this issue, #5684 and the F# specific request from #5670. One idea I've been thinking about here is generating fields at runtime using reflection. Something along the lines of:
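The sketch itself was lost in extraction. As a rough stand-in, here is one hedged shape such typed access could take — a plain downcast helper rather than reflection-generated fields (a reflection-based approach would emit accessors like this automatically); the helper and its usage are assumptions, not library API:

```fsharp
open Microsoft.Data.Analysis

// Hypothetical helper, not part of Microsoft.Data.Analysis: recover a
// strongly typed column by downcasting, so later calls can infer 'T.
let typedColumn<'T when 'T : unmanaged> (df: DataFrame) (name: string) =
    df.[name] :?> PrimitiveDataFrameColumn<'T>

// e.g.  let petal = typedColumn<float32> df "PetalLength"
```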
@pgovind Sounds good!
Thirded! Please, oh please, add some typing! Firstly, its absence makes API discoverability very poor. For example, if I write
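The snippet itself was lost in extraction; a plausible shape, with the DataFrame and column name assumed:

```fsharp
open Microsoft.Data.Analysis

// Hypothetical setup; the commenter's actual data is unknown.
let df = DataFrame(PrimitiveDataFrameColumn<float32>("SepalLength", [ 5.1f; 4.9f ]))

// In the 0.2.0 API discussed here, 'max' is typed obj.
let max = df.["SepalLength"].Max()
```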
it's entirely unclear what I'm getting back when 'max' is an object. Maybe I'm actually getting back some kind of structured object that represents the cell rather than the value within it? FWIW, I'm less interested in using DataFrames for Jupyter than for desktop applications, so having a decent intellisense experience is pretty important. Secondly, the lack of typing just results in increasingly ugly code with unnecessary casts or Select operations. FWIW I wrote a DataFrame-like library for internal use. The operation to fetch a column was:
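The original snippet was dropped in extraction. A hedged guess at its shape, with minimal stand-in types for the commenter's internal library (`Frame`, `Column<'T>`, and `GetColumn` are all invented names, not a real API):

```fsharp
// Minimal hypothetical types standing in for the commenter's internal library:
type Column<'T>(values: 'T[]) =
    member _.Item with get (i: int) = values.[i]

type Frame(data: Map<string, obj>) =
    member _.GetColumn<'T>(name: string) : Column<'T> =
        Column<'T>(data.[name] :?> 'T[])

let frame = Frame(Map [ "PetalLength", box [| 1.4f; 1.3f; 1.5f |] ])

// The element type is named once at the indexing operation...
let col = frame.GetColumn<float32>("PetalLength")
// ...and subsequent accesses rely on inference; no casts needed.
let first = col.[0]
```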
Although it's slightly annoying to have to specify the type in these kinds of indexing operations, it does have the advantage that subsequent accesses to the data can use type inference. Whilst I'm wishing for things... it would be highly desirable to be able to specify columns that can't contain NULLS and to have operations that can 'clean' null cells. E.g. I'd really like to do something like...
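A hedged sketch of the wished-for surface; neither member below exists on today's DataFrameColumn, and the names and typed results are invented:

```fsharp
// Hypothetical API, not the real Microsoft.Data.Analysis surface:
// drop rows whose cell is null and get back a non-nullable, typed column ...
let cleaned : PrimitiveDataFrameColumn<float32> =
    df.["PetalLength"].DropNulls<float32>()
// ... or replace nulls with a meaningful default before processing.
let filled : PrimitiveDataFrameColumn<float32> =
    df.["PetalLength"].FillNulls<float32>(0.0f)
```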
In your specific example,
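the snippet in this reply was lost in extraction; one plausible shape, assuming a downcast to the concrete column type so subsequent operations can be strongly typed:

```fsharp
open Microsoft.Data.Analysis

let df = DataFrame(PrimitiveDataFrameColumn<float32>("PetalLength", [ 1.4f; 1.3f ]))

// Downcast once; 'petal' is then a PrimitiveDataFrameColumn<float32>.
let petal = df.["PetalLength"] :?> PrimitiveDataFrameColumn<float32>
```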
I think this solves most of our type inference problems: point 1 for Jupyter and point 2 for desktop users. Also, have you looked at
Thanks Prashanth. :-) It's possible things have moved on in the API - I'm using the current nuget release (0.2.0). In that release Max and similar functions on a PrimitiveDataFrameColumn actually return an object. (And it's not obvious how to fetch a typed column from a DataFrame.) It sounds like things are moving in a good direction though with the changes you mention. I do have one question though - as a consumer of the API I don't really see why there's a difference between PrimitiveDataFrameColumn and StringDataFrameColumn? Why not just a single DataFrameColumn<T> where T can be any primitive System type or even an arbitrary user-supplied type? (I recognise that some operations are only really meaningful for numeric types - maybe the design goal is to focus on numeric processing rather than general tabular data?)
I can't claim it's a killer requirement :-) It mainly derives from wanting to avoid having to write all my filter/mutation operations as functions accepting a nullable type (I find all those '?'s a bit ugly) and a general desire to separate the processing pipeline into a 'cleanup' phase where nulls are dropped/replaced with meaningful data and a 'processing' phase where special cases (nulls) can be ignored. I'm probably being a bit over-fastidious on this though!
This stems from a desire to support the Apache Arrow format. The Arrow format lays out the memory for different primitive types, and going from Arrow -> DataFrame or vice-versa is zero-copy. It also comes with the advantage that we can support hardware intrinsics much better in the future.
This has been one of my biggest annoyances when working with DataFrames and I'd love to see it fixed. Currently, I have to unbox every value with a cast. It is not clear to me why this is necessary when the data type of the value is stored in the DataFrameColumn, which should be able to do the unboxing for me. It seems like the two PRs have stalled, so is there any progress on this? If there are problems with Apache Arrow support, would it be possible to focus on native types first?
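For instance, with today's API every cell fetched through the untyped surface comes back as obj and needs a cast, even though the column knows its element type (sample data invented for illustration):

```fsharp
open Microsoft.Data.Analysis

let col = PrimitiveDataFrameColumn<float32>("PetalLength", [ 1.4f; 1.3f; 1.5f ])
let df = DataFrame(col)

// df.Rows.[0L].[0] is typed obj even though the column holds float32,
// so every access needs an explicit unboxing cast.
let v = df.Rows.[0L].[0] :?> float32
```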
Given a DataFrameColumn with data similar to the following:
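The sample data itself was elided; the values below are invented for illustration, with the column name taken from the examples later in the issue:

```fsharp
open Microsoft.Data.Analysis

// Invented sample values; the original data is not shown in the issue.
let petalLength = PrimitiveDataFrameColumn<float32>("PetalLength", [ 1.4f; 1.3f; 1.5f ])
let df = DataFrame(petalLength)
```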
Performing operations like Sum returns values of type obj. This causes issues when trying to use them in other contexts / operations. Example:
Suppose I want to manually calculate the sum of a numeric column and divide it by 3:
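The code itself was elided; a plausible reconstruction, assuming the DataFrame from the issue's setup:

```fsharp
// Does not compile: Sum() is typed obj, so dividing by the int literal 3
// is rejected with the error quoted below.
let result = df.["PetalLength"].Sum() / 3
```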
The code fails with the following typecheck error - The type 'int' does not match the type 'obj'.
The first thought is to cast:
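A plausible shape for the elided cast attempt, using F#'s float32 conversion function:

```fsharp
// Also rejected: the float32 conversion function has no conversion from obj,
// producing the typecheck error quoted below.
let result = float32 (df.["PetalLength"].Sum()) / 3.0f
```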
This approach also fails, with a typecheck error - The type 'obj' does not support a conversion to the type 'float32'.
The two ways that actually work are the following:
Using the casting operator:
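Reconstructed from the surrounding discussion; the setup (column name and data) is assumed:

```fsharp
open Microsoft.Data.Analysis
let df = DataFrame(PrimitiveDataFrameColumn<float32>("PetalLength", [ 1.4f; 1.3f; 1.5f ]))

// The dynamic cast operator :?> unboxes the obj returned by Sum().
let result = (df.["PetalLength"].Sum() :?> float32) / 3.0f
```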
or unboxing:
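Likewise a hedged reconstruction, with the same assumed setup:

```fsharp
open Microsoft.Data.Analysis
let df = DataFrame(PrimitiveDataFrameColumn<float32>("PetalLength", [ 1.4f; 1.3f; 1.5f ]))

// unbox<'T> performs the same unboxing explicitly.
let result = unbox<float32> (df.["PetalLength"].Sum()) / 3.0f
```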
In a sense it's not really a bug, but the ability to specify the return type and avoid casting would make it cleaner / simpler for the user, i.e.
df.["PetalLength"].Sum<float32>()