Trying to define "data frame" #2

There was a question on the sync call today about defining "what is a data frame?". People may have different perspectives, but I wanted to offer mine:

A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, being a collection of columns, each having its own logical data type) in a general-purpose programming language. The interface often exposes imperative, composable constructs where operations consist of multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.

Things that IMHO should not be included in the definition, because they are implementation-specific concerns on which any given "data frame library" may work differently:

Hopefully one objective of this group will be to define a standardized programming interface that avoids commingling implementation-specific details into the interface.

That said, there may be people who want to create "RPandas" (see RPython) -- i.e. to provide for substituting new objects into existing code that uses pandas. If that's what some people want, we will need to clarify that up front.

Comments
Thanks Wes. I think the only thing I would clarify in your definition is that the columns are named. I think it's safe to require named columns in a data frame (whether to require uniqueness, or to allow only string names, is a topic to discuss though). It probably isn't appropriate to include row labels in the definition, despite them being present in at least R and pandas. Agreed with all of your exclusions from the definition. One clarifying question about your RPandas comment: are you saying that the API we define shouldn't limit itself to a subset of the pandas API? If so, then agreed.
I think that this definition is too high-level. The terms "dataset", "column", and "logical data type" require precise definition, and the same goes for "data frame" itself. The data frame as a concept was introduced in S, where it has a very specific meaning (carried over into R): a table-like structure that allows matrix-like operations. I strongly prefer to have a definition in terms of an algebra, similar to the relational model, which will allow defining the semantics of the operations, thus defining the execution model, and then one can derive an API. SQL is just one example of an API implementing Codd's relational algebra. I think that Kepner's associative array math is a good example of the kind of formalism that could power this.
Yes, that's what I mean. One productive thing we could do would be to highlight some of the known deficiencies/limitations of the pandas API as examples of ways that we ideally would like to do better going forward.
This is tricky. If you survey the spectrum of libraries that are considered by users to be "data frame" libraries, the only real commonality is that they provide a programming interface for manipulating and analyzing tabular datasets. In R/S, a data.frame is essentially a named list of equal-length vectors, with row labels and matrix-style indexing layered on top.
We can say that the goal is to define a general-purpose "data frame algebra" (this is what Ibis does, for example), but there are plenty of "data frame" libraries that have no such algebra implemented. They merely provide a simple programming interface for interacting with tabular data. Berkeley BIDS, for example, created a minimalistic data frame library for pedagogical purposes because they felt the expansive nature of pandas got in the way of teaching certain things. This presentation of mine from 2015 goes more in depth on this topic of defining "data frame".
@wesm I think this viewpoint is problematic for a number of reasons:

1. If it is just a programming interface for tables, is an ORM a dataframe? ORMs are composable interfaces for relational tables, so why isn't SQLAlchemy a dataframe by this definition? And if SQLAlchemy is a dataframe, why would anyone use pandas?
2. If a dataframe is just an interface for tables, we have nothing to discuss, because standards are already well defined for relational structures. Dataframes were created in S to treat objects as both matrix-like and table-like, with no pre-defined schema: Read Here! What is the difference between a dataframe and a relation in your definition?
3. It leaves out some important, widely used components of dataframes.

I am of the mindset that we should try to understand, support, and optimize for what users are trying to do. We need to differentiate the dataframe from other existing, well-defined data structures, or agree that it is just a table/matrix/spreadsheet. There is no sense in defining new APIs on top of existing APIs/standards/structures. It is much easier to answer questions like "How is a dataframe different from X?" than "What is a dataframe?", so that is where I propose we start.

My perspective is taken from lessons I have learned studying user behaviors: I think it is important that we try to support and maintain those behaviors and functionalities and optimize within that constraint. It is also taken from history, where the dataframe emerged from a specific need that is/was not met by existing data structures. While the dataframe has roots in both relational and linear algebra systems, it is neither a table nor a matrix. We can conceptualize dataframes from both relational and linear algebra points of view; however, the dataframe has some data model differences that ultimately conflict with the fundamental data model of each. From a relational viewpoint, dataframes consist of:
The lazily-induced schema basically allows the dataframe system to interpret the types itself, rather than requiring that the types be declared upfront. This is something that relational systems cannot do. From a linear algebra viewpoint, dataframes consist of:
It is important to note that we don't know how to solve some of these problems optimally yet. That is the exciting thing about working on dataframes: there are plenty of unsolved problems, so it won't just be an engineering exercise. I think this thread will likely get very cluttered if we try to discuss each component of the dataframe in one place, and it will be difficult to gauge consensus. It is very likely that there will be disagreement on certain components of the data model, since a disparate set of tools with very different capabilities is represented here.
The primary modality of the programming interface is "expressing data manipulations and analytical operations on tabular datasets". I don't think that describes the primary modality of an ORM like SQLAlchemy. (For example, some backends of Ibis are implemented using SQLAlchemy.) My point really is that "data frame" is just a name, and it means different things to different people. It happens that some "data frame libraries" have substantially different scopes of features, but their commonality is providing a programming interface whose dominant modality is data manipulation and analytics on tabular datasets.
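To make the modality contrast concrete, a rough sketch (assuming SQLAlchemy >= 1.4; the table and data are invented):

```python
import pandas as pd
from sqlalchemy import column, select, table

# An ORM/SQL-expression library builds relational statements to be
# executed elsewhere:
users = table("users", column("age"), column("name"))
stmt = select(users).where(users.c.age > 30)  # a SQL SELECT, not data

# A dataframe library's dominant modality is direct manipulation of the
# tabular data itself:
df = pd.DataFrame({"name": ["ann", "bo"], "age": [34, 27]})
adults = df[df["age"] > 30]  # filtering happens right here, in Python
```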
I think we should not try to include all systems that market themselves as dataframes in this definition. We should be opinionated and precise, otherwise we will end up where we are now, with no well-defined way of determining what a dataframe is or what the API should be. Users will call something what it is marketed as, so I don't think that calling a project a dataframe makes it so.
Yes, I completely agree that this is the problem. It has more or less become a marketing term. In my opinion we should define some standard that will determine whether a system conforms to a precise and specific definition, otherwise we may end up not making meaningful progress in defining the API either.
Maybe we shouldn't try to attach a very specific definition to the word "dataframe" if it is used very broadly/loosely. What about having a descriptive name for different use cases? Let's say (for lack of a better name) we call a dataframe "level-0" a bag of ordered columns, as described above by Wes: a definition that fits almost all dataframes and will allow us to define an interchange API. Maybe this is what we should call a dataframe in the end, but if we want to be specific in a discussion, we may want to give it a more explicit name. Later on, when we work on a computational API/operations/features, we can talk about a dataframe "level-1" (again, for lack of a better name). Maybe we can start by having very specific/boring names that match very specific goals and see if we can regroup/rename them when we have a better perspective. For instance, I don't think Vaex will ever match the description by @devin-petersohn, but still, it would be good to have names/APIs for the overlap between Vaex/pandas/Modin, and to be able to describe what Modin/pandas can do in addition to Vaex and vice versa (if we ever get there).
To make this concrete, we could consider adopting some of the nomenclature in https://arxiv.org/pdf/2001.00888.pdf, such as a matrix dataframe (all columns of the same type, e.g. int or float).
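For instance, a matrix dataframe maps cleanly onto a 2-D array; a small pandas sketch (data invented):

```python
import numpy as np
import pandas as pd

# Every column shares one dtype, so the frame converts to a dense matrix
# and supports matrix-style operations without per-column boxing.
mdf = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
arr = mdf.to_numpy()          # float64 array of shape (2, 2)
print(arr @ np.ones((2, 1)))  # matrix multiply is well defined here
```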
@aregm and @devin-petersohn I think it would be useful to separate concerns a little here. You are diving straight into defining things so precisely that a lot of semantics get fixed. Those are good things to worry about, and we'll indeed need to deal with them; however, that is the most detailed level at which we should define things in an API. A few thoughts:
Agreed with "all systems". The ones I listed above though (plus cuDF, and probably a few more) all seem reasonable to take into account and consider impact on. It would also be useful to list ones that do market themselves as dataframes but are so different that there's not much sense in taking them into account. Do you have any in mind?
This sounds like a good idea. It would be good to have a separate tracking issue for this, and then, if needed, split off from there to go into detail on specific methods/topics. @TomAugspurger I'm sure you have a bunch in mind, would you be able to start this issue?
Opened #4 for the sub-discussion of avoiding issues with the pandas API. Meta-comment: I'd like to see the discussion on definitions go on a bit further, and then let's try to summarize the definition in a hackmd document that we can achieve consensus on. I'd be happy to write / co-write a draft sometime next week.
@maartenbreddels I think Vaex is close to the definition, much closer than many other systems 😄. I have a list below of classifications, similar to what you were mentioning, of what certain types of systems can do vs. others; feel free to edit! @rgommers I see your point; my definition is more along the lines of traditional dataframes. We do need specific definitions and precision to meaningfully describe an API and standard. Perhaps binning systems to define multiple standards will help? Here is a candidate high-level binning from my perspective, each of which can potentially have its own standard. My PhD thesis focuses on what I have labeled "Traditional Dataframes", so that set of properties is going to be more precise at first than the others (system maintainers, feel free to edit to add properties or put your system in the right bin; you will know your system better than I do). Note: my intention with this binning is to enable creating standards; misclassifying a system will make it difficult to create a standard for that group, so we should try to be as precise as possible.

Traditional Dataframes

Properties:
Systems:
Columnar Dataframes

Properties:
Systems:
Relational Dataframes

Properties:
Systems:
Unclassified Dataframes

Systems:
...
Thanks @devin-petersohn. I think your binning is interesting and we'll need it at some point; it does drag in a lot of assumptions though, for example on underlying storage. Re schema predefined vs. at runtime: this is execution- rather than API-related, I think. Re matrix multiply: that's quite specific (e.g. not defined for heterogeneously typed columns, and extremely inefficient unless one has 2-D contiguous storage and can call on a BLAS-optimized implementation over the underlying array). If you're doing linear algebra as a user, you probably should be using arrays rather than dataframes that act as arrays with axis labels tacked on. Re row/column symmetry: also a little artificial, probably. Normally rows and columns have different semantic meanings (rows -> observations, columns -> features); there's typically a dtype per column, not per row, etc.
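A minimal pandas sketch of that asymmetry (my own illustration; data invented):

```python
import pandas as pd

# dtypes are attached to columns, not rows: transposing a heterogeneous
# frame forces every resulting column to degrade to 'object'.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})
print(df.dtypes)    # name: object, score: float64
print(df.T.dtypes)  # both transposed columns become object
```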
This is an important point. I think we have a bit of a misunderstanding here. The aim is a single API standard for things that you split into multiple "bins", not multiple standards. Or maybe what you mean is similar to what @maartenbreddels called "levels". At the lowest (most core/common) levels, one can write, e.g., programs that pandas, Dask, and Vaex can all run and give the same results for (just with different performance profiles), even though you put them in different bins. We didn't have a chance to talk beforehand about this consortium; maybe it's worth having a quick call. I'd also be interested to learn what your main use case or objective is. I'll send you an email.
Thanks @rgommers, I'll sync offline about the consortium discussion, but some comments on your points. I understand the purpose of this group; I am just trying to be precise. Without precision there is nothing to discuss, and we may end up with a non-impactful API.
I must completely disagree here. If a schema is required before building a table, you are limited to APIs that have a known output schema given the input schema and operator. Relational systems can only support schema-compatible relational operators, so if that is the minimum common subset, then a lot is left out of the other types of systems that are commonly used and valuable to users. I don't think it will be meaningful if we just decide to go with a new API for doing SQL queries, which is the lowest common denominator of all of these systems. This is why precisely deciding on "what is a dataframe" is so important to do at the outset.
This was an illustrative example, and yes, it is hard to optimize in a dataframe setting, but that is a challenge specific to the implementation of those systems. All listed systems support it, which was the main point of the example. Those systems are hybrids of the relational and linear algebra data models.
You are correct that there is a schematic asymmetry, but I did not want to get that low-level in those points. The interchangeability of columns and rows is possible in each of those systems, and is commonly used in pandas. Often this is because data in the wild comes in many different formats, and the orientation of the data may not be correct in the source files. The bins do have a hierarchy, and @maartenbreddels's comment on "levels" was what I was getting at: the groups later in that list cannot/do not implement data model features of the groups earlier in the list, and the groups earlier in the list are a superset of, or can emulate, the data model of every group below.
The Ibis definition relies heavily on relational algebra as a basis, with Python semantics for the operations. That is one approach, but IMHO it is one step ahead of the thing I am talking about. Let me try to explain. First of all, what I understand by "algebra". In broad terms, the Chambers Dictionary definition: "any of a number of systems using symbols and involving reasoning about relationships and operations". More specifically, an algebra consists of a set of objects and a set of operators that together satisfy certain axioms or laws (such as closure, associativity, commutativity, etc.). What you are referring to as a set of APIs is, at the foundational level, called a calculus. In the same way that relational algebra served as a vehicle for implementing the relational calculus (which led to QUEL -> SEQUEL -> SQL), a dataframe algebra can serve as a vehicle to define a dataframe calculus, which may lead to many new systems and help existing dataframe systems remove redundancies and actually deserve to be called dataframe systems. But to get there, we first need to define the underlying object. Is the dataframe a simple relation? Is it a relation on top of lists? Is it a simple collection of associative arrays? Is it a matrix of simple data types? Is it a matrix of structured data types? And the calculus: do we need a reduction algorithm? Should we follow Python, R, or SparkSQL? Or maybe S? Or maybe they all got it wrong? I am not suggesting we lay down a full mathematical foundation (I guess nobody is interested), but I would suggest that, at least in this workgroup and at least at an intuitive level, we have a common understanding of what the dataframe object is, what the operators are, and what the laws are. Maybe it is associative arrays, maybe Python/pandas, maybe SparkSQL, but let's give it a thought and discuss. Maybe interested parties can come up with their definitions and we can discuss and compare them?
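To give a flavor of what "operators satisfying laws" means, here are two standard identities from Codd's relational algebra, the kind of axioms a dataframe algebra would likewise have to state ($\sigma$ = selection, $\pi$ = projection):

```latex
% Cascading selections commute and fuse into a single selection:
\sigma_{p}\bigl(\sigma_{q}(R)\bigr) = \sigma_{q}\bigl(\sigma_{p}(R)\bigr) = \sigma_{p \land q}(R)
% A narrower projection absorbs a wider one applied before it (A \subseteq B):
\pi_{A}\bigl(\pi_{B}(R)\bigr) = \pi_{A}(R)
```

A dataframe algebra would need analogous laws for its own operators (e.g. transpose, pivot).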
"associative arrays" are basically dictionaries if I read Wikipedia correctly. Which sounds about right; a mapping with column names as keys and 1-D arrays (homogeneous data type per column) as values. @wesm said "a collection of columns each having their own logical data type". your "matrix of simple data types" could be the same as well (under the condition that matrix doesn't imply underlying 2-D contiguous storage or other restriction beyond a set of 1-D arrays of the same length). It sounds like we're all saying similar things in slightly different words.
I'd suggest adding structured data types as out of scope for the API standard (v1 at least), even if it's possible to include structured dtypes in some current dataframe libraries. Data types should be limited to something tractable (e.g. integers, floats, strings, datetime types, categoricals). Everything else can be postponed till a future version of the API standard.
Doing this exercise from the perspective of:

is going to be a lot more productive than doing it from first principles, I believe.
The answer to that should be motivated by:
Hard to believe they got it all wrong. Mostly they implement the same math/logic, with different APIs. We'd like to end up with a clean, Pythonic API I'd think.
So, an attempt to summarize the discussion thus far. @wesm started with a general definition:

A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets.

And a few properties that the definition explicitly takes no stance on. To this I would add the restriction (brought up by @devin-petersohn and others) that the rows are ordered, and possibly that there are row labels. There's some concern that this definition leaves out crucial components of what makes a dataframe a dataframe. I think this is mostly around "what operations must a dataframe support" (linear algebra? joins? etc.), but I'm not sure it's necessary to pin that down here. That will be made clear by the API standard. This relates to the "levels" brought up by @maartenbreddels in #2 (comment). I think it's worth asking: what is this definition for? For me, it will aid us in designing the API. When we're discussing any given method, we'll use the definition to inform the answer (should the dataframe have an index, for example). One property I don't understand is "a lazily induced schema". Could you expand on that @devin-petersohn?
Sure, happy to expand on the lazily-induced schema. The idea is that the user need not define their data schema upfront, nor does the schema need to be known for every operator's output. This is in contrast with relational systems, which require data to be defined schema-first. Operators in systems that support lazily induced schemas do not need to have a known output schema for any given input schema (but they can). This idea is all about flexibility. Data in the wild is often schemaless and semi-structured. Relational systems have (mostly) solved the problem of structured data; dataframes like pandas fit the need for schemaless, poorly structured data.
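For instance, a small pandas sketch (file contents invented):

```python
import io
import pandas as pd

# No schema is declared upfront; pandas infers dtypes while parsing.
csv = io.StringIO("id,value\n1,3.5\n2,oops\n")
df = pd.read_csv(csv)
print(df.dtypes)  # 'value' comes out as object because of the mixed content

# The output schema of an operator can depend on the data itself:
# the columns of this pivot are the *values* found in 'id' at runtime.
wide = df.pivot(columns="id", values="value")
print(wide.columns.tolist())  # [1, 2]
```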
Thanks, I think this is the component I was missing earlier. An example would be something like pivot, where the output columns depend on the data itself.