
[DataFrame] DataFrame operations "obj" return type #5707


Open
luisquintanilla opened this issue Jan 24, 2020 · 7 comments

@luisquintanilla
Contributor

Given a DataFrameColumn with data similar to the following:

[image: sample of the DataFrameColumn's data]

Performing operations like Sum returns a value of type obj. This causes issues when trying to use the result in other contexts / operations.

Example:

Suppose I want to manually calculate the sum of a numeric column and divide it by 3:

df.["PetalLength"].Sum() / 3

The code fails with the following error - The type 'int' does not match the type 'obj'

The first thought is to cast.

((float32)df.["PetalLength"].Sum())

This approach fails with - typecheck error The type 'obj' does not support a conversion to the type 'float32'.

The two ways that actually work are the following:

Using the casting operator:

(df.["PetalLength"].Sum() :?> float32) / 3.f

or unboxing:

df.["PetalLength"].Sum() |> unbox<float32> 

In a sense it's not really a bug, but the ability to specify the return type and avoid casting would make things cleaner / simpler for the user.

i.e. df.["PetalLength"].Sum<float32>()
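Until such an overload exists, the unboxing can at least be centralised in a type extension. A minimal F# sketch against the current object-returning Sum(); SumAs is a hypothetical name, not a library API:

```fsharp
open Microsoft.Data.Analysis

// Hypothetical helper: wraps the object-returning Sum() and unboxes
// to the caller-specified type. Throws InvalidCastException on mismatch.
type DataFrameColumn with
    member this.SumAs<'T>() : 'T = this.Sum() |> unbox<'T>

// Usage: the division now typechecks without a cast at the call site.
// df.["PetalLength"].SumAs<float32>() / 3.0f
```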

@pgovind

pgovind commented Jan 25, 2020

This is the 2nd time a request to infer the return type has come up this week. I'll consider different approaches here. At a high level, it feels like having strongly typed column fields based on the current schema will solve this issue, #5684 and the F# specific request from #5670.

One idea I've been thinking about here is generating fields at runtime using reflection. Something along the lines of DataFrame inferredDataFrame = DataFrame.ReadCsv(); where we create properties on the returned DataFrame for each column in the CSV file. Then code such as inferredDataFrame.PetalLength could return a PrimitiveDataFrameColumn<float>, and subsequent ops on it such as Sum would return float. Not sure if this is even possible yet, so I'll prototype it next week :)

@luisquintanilla
Contributor Author

@pgovind Sounds good!

@NeilMacMullen

Thirded! Please, oh please, add some typing!

Firstly, its absence makes API discoverability very poor. For example, if I write

 var col = new PrimitiveDataFrameColumn<int>("column of ints");
 var max = col.Max();

it's entirely unclear what I'm getting back when 'max' is an object. Maybe I'm actually getting back some kind of structured object that represents the cell rather than the value within it?

FWIW, I'm less interested in using DataFrames for Jupyter than for desktop applications so having a decent intellisense experience is pretty important.

Secondly, the lack of typing just results in increasingly ugly code with unnecessary casts or Select operations.

FWIW I wrote a DataFrame-like library for internal use. The operation to fetch a column was:

var columnOfT = frame.GetColumn<T>("name");

Although it's slightly annoying to have to specify the type in these kinds of indexing operations, it does have the advantage that subsequent accesses to the data can use type inference.
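In Microsoft.Data.Analysis a comparable accessor can be layered on today as an extension method. A sketch, assuming the 0.x API where the string indexer returns a weakly typed DataFrameColumn; GetColumn here is my own name, not a library member:

```csharp
using Microsoft.Data.Analysis;

public static class DataFrameExtensions
{
    // Hypothetical typed accessor: fails loudly with InvalidCastException
    // if the stored column is not a PrimitiveDataFrameColumn<T>.
    public static PrimitiveDataFrameColumn<T> GetColumn<T>(this DataFrame frame, string name)
        where T : unmanaged
        => (PrimitiveDataFrameColumn<T>)frame[name];
}

// var ints = frame.GetColumn<int>("ages");   // typed from here on
```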

Whilst I'm wishing for things... it would be highly desirable to be able to specify columns that can't contain NULLs and to have operations that can 'clean' null cells. E.g. I'd really like to do something like...

var frame = DataFrame.LoadCsv("someCsvWithEmptyCells.csv");
var sanitisedAges = frame
    .Columns<int>("ages")
    .ReplaceNull(0);

//The idea here is that instead of an age column that may contain NULLs,
//we've replaced them with a specific value AND generated a column
//whose type and semantics no longer allow the admission of NULL. I.e. the type
//of sanitisedAges is something like NonNullablePrimitiveDataFrameColumn<int>
 
var ages = sanitisedAges.ToArray(); 

//results in an array of ints whereas the same operation on the original
//PrimitiveDataFrameColumn<int> object would have resulted in an array of int?s 

@pgovind

pgovind commented Feb 14, 2020

In your specific example, max would be an int. We only lose type information when APIs are called on the base DataFrameColumn objects. We're working on a couple ways to improve this at the moment though:

  1. Extension for DataFrame + Jupyter notebooks that adds properties returning concrete columns: eerhardt/DotNetInteractiveExtension#25 is enabling an extension that generates strongly typed column properties on a DataFrame in Jupyter. So, something like df.Price would return a PrimitiveDataFrameColumn<float>, as opposed to df["Price"], which returns a weakly typed DataFrameColumn
  2. Similar to what you suggested, corefxlab#2827 (Add support for window operations on columns) adds the following APIs on DataFrame:
    GetPrimitiveDataFrameColumn<T>(ColumnName)
    GetStringDataFrameColumn(ColumnName)
    GetArrowStringDataFrameColumn(ColumnName)
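If those APIs land as listed, a desktop call site could stay typed end to end. A sketch, assuming the signatures above and a float PetalLength column in the CSV:

```csharp
using Microsoft.Data.Analysis;

DataFrame df = DataFrame.LoadCsv("iris.csv");

// Typed from the first access onward; no object, no casts.
PrimitiveDataFrameColumn<float> petalLength =
    df.GetPrimitiveDataFrameColumn<float>("PetalLength");

float third = petalLength.Sum() / 3f;
```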

I think this solves most of our type inference problems: 1 for Jupyter and 2 for desktop users.

Also, have you looked at FillNulls? It exists on all the column types and replaces null values with a specified value. It returns the same column type though, so the resulting column still has the ability to contain nulls. Out of curiosity, do you have examples of when a NonNullablePrimitiveDataFrameColumn<int> would be useful?
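For reference, a FillNulls call looks roughly like this (a sketch against the current API; the sample data is illustrative):

```csharp
using Microsoft.Data.Analysis;

var ages = new PrimitiveDataFrameColumn<int>("ages", new int?[] { 21, null, 34 });

// Returns a new column with nulls replaced (pass inPlace: true to mutate).
// The column type is unchanged, so it can still admit nulls afterwards.
var filled = ages.FillNulls(0);
```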

@NeilMacMullen

NeilMacMullen commented Feb 15, 2020

Thanks Prashanth. :-) It's possible things have moved on in the API - I'm using the current nuget release (0.2.0). In that release Max and similar functions on a PrimitiveDataFrameColumn actually return an object. (And it's not obvious how to fetch a typed column from a DataFrame.) It sounds like things are moving in a good direction though with the changes you mention. I do have one question though - as a consumer of the API I don't really see why there's a difference between PrimitiveDataFrameColumn and StringDataFrameColumn. Why not just DataFrameColumn<T> where T can be any primitive System type or even an arbitrary user-supplied type? (I recognise that some operations are only really meaningful for numeric types - maybe the design goal is to focus on numeric processing rather than general tabular data?)

Also, have you looked at FillNulls?

Thanks - I'd missed that - looks very useful.

Out of curiosity, do you have examples of when a NonNullablePrimitiveDataFrameColumn would be useful?

I can't claim it's a killer requirement :-) It mainly derives from wanting to avoid having to write all my filter/mutation operations as functions accepting a nullable type (I find all those '?'s a bit ugly), and a general desire to separate the processing pipeline into a 'cleanup' phase where nulls are dropped/replaced with meaningful data and a 'processing' phase where special cases (nulls) can be ignored. I'm probably being a bit over-fastidious on this though!

@pgovind

pgovind commented Feb 20, 2020

Why not just DataFrameColumn<T> where T can be any primitive System type or even an arbitrary user-supplied type?

This stems from a desire to support the Apache Arrow format. The Arrow format lays out the memory for different primitive types and going from Arrow -> DataFrame or vice-versa is zero-copy. It also comes with the advantage that we can support hardware intrinsics much better in the future.

@pgovind pgovind transferred this issue from dotnet/corefxlab Mar 11, 2021
@TheJanzap

TheJanzap commented May 6, 2025

This has been one of my biggest annoyances when working with DataFrames and I'd love to see it fixed. Currently, I have to unbox every value with a cast. It's not clear to me why this is necessary when the data type of the value is stored in the DataFrameColumn, which should be able to do the unboxing for me.

It seems like the two PRs have stalled, so is there any progress on this? If there are problems with Apache Arrow support, would it be possible to focus on native types first?
