Study on the pandas API: What is the most commonly used? #3

devin-petersohn · 2020-05-15T21:25:22Z

I have spent a lot of time trying to understand users and their behaviors in order to optimize for them. As a part of this work, I have done numerous studies on what gets used in pandas.

This will be extremely useful when it comes to defining a dataframe standard, because what people are using can help inform us on what behaviors to support.

For this study, we scraped the top 6000 notebooks from Kaggle by upvote.

Repo here, reproduction script included: https://github.com/modin-project/study_kaggle_usage

Results here: results.csv
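
For readers who want a feel for how such a count can be produced, here is a minimal sketch. It is not the study's script (that lives in the repo linked above); it assumes notebooks in the nbformat 4 JSON layout and simply counts attribute names in the parsed code, so it cannot tell a pandas `.values` from any other object's `.values`. The helper names `code_cells` and `count_attribute_uses` and the inline notebook are hypothetical.

```python
import ast
import json
from collections import Counter


def code_cells(notebook_json):
    """Yield the source of each code cell in an .ipynb document (nbformat 4 layout)."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # nbformat stores source either as one string or as a list of lines.
            yield "".join(source) if isinstance(source, list) else source


def count_attribute_uses(sources):
    """Count attribute names (``values`` in ``df.values``) across code snippets."""
    counts = Counter()
    for source in sources:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells with IPython magics or shell escapes are not plain Python; skip them.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Attribute):
                counts[node.attr] += 1
    return counts


# Hypothetical four-cell notebook, built inline so the example is self-contained.
notebook_json = json.dumps({
    "cells": [
        {"cell_type": "code",
         "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')\n"]},
        {"cell_type": "markdown", "source": ["# notes"]},
        {"cell_type": "code",
         "source": "X = df.drop('target', axis=1).values\ny = df['target'].values\n"},
        {"cell_type": "code", "source": "%matplotlib inline\n"},
    ]
})

usage = count_attribute_uses(code_cells(notebook_json))
print(usage)
```

A real pipeline would also need to resolve which names actually refer to pandas objects, which a purely syntactic count like this one does not attempt.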

datapythonista · 2020-05-15T22:15:57Z

This is really cool, thanks a lot for sharing!

amueller · 2020-05-18T21:51:29Z

This is awesome! I'm somewhat surprised at the common use of values tbh.
Probably obvious but maybe worth mentioning: kaggle is a very biased source. Basically by design everything is read from CSV files, for example, and it's unlikely anyone reads sql. Also, the workloads are quite specific. Still really cool!

datapythonista · 2020-05-18T22:03:58Z

I use values to "export" pandas data to numpy to train scikit-learn models. Not sure if that's the reason, but it doesn't surprise me. I guess it's like read_csv: the sample is biased, in that kaggle users will be converting data to numpy to use in scikit-learn.

amueller · 2020-05-18T22:33:33Z

At some point a pandas dev told me to use np.array instead of values, but I guess that's not very common. Also, you don't need to do any conversion for using sklearn ;)

datapythonista · 2020-05-18T23:20:15Z

I agree with both points (I think they meant DataFrame.array, which is somewhat recent). But I think that's still the reason why the notebooks use .values frequently.
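
To make the options in this exchange concrete, here is a small sketch of the ways to get a NumPy array out of a DataFrame. It assumes pandas 0.24 or later, which, as far as I know, is where `DataFrame.to_numpy()` and `Series.array` were introduced; note that the `.array` accessor is defined on Series and Index rather than on DataFrame. The column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame; all columns share one dtype, so no upcasting happens.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

via_values = df.values        # the long-standing attribute seen in many notebooks
via_to_numpy = df.to_numpy()  # the method spelling of the same conversion
via_asarray = np.asarray(df)  # NumPy-side conversion through the array protocol

# All three give a plain 2-D ndarray with the same contents.
print(type(via_values).__name__, via_values.shape)
print(np.array_equal(via_values, via_to_numpy), np.array_equal(via_values, via_asarray))

# Series.array returns the array backing one column, which may be a pandas
# extension array instead of an ndarray.
column_array = df["a"].array
print(type(column_array).__name__)
```

For a frame with mixed dtypes, all three conversions have to pick a common dtype (often `object`), which is one reason the explicit `to_numpy(dtype=...)` form can be easier to reason about.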

devin-petersohn · 2020-05-18T23:43:38Z

> kaggle is a very biased source.

We have done the same thing over 1 million ~~all~~ public GitHub repos and gists, published here: https://arxiv.org/pdf/2001.00888.pdf in Section 3.6. I thought it better to share some raw data we have collected rather than the summary in that paper. The GitHub scripts still need some cleaning up to be made public.

I do not think the bias of Kaggle diminishes the value of this data (not that you are implying this).

TomAugspurger · 2020-05-28T15:54:26Z

Here are the top 40ish pandas methods by pageviews on pandas' docs

read_csv
DataFrame
DataFrame.drop
DataFrame.sort_values
DataFrame.to_csv
DataFrame.groupby
DataFrame.merge
DataFrame.rename
read_excel
concat
DataFrame.dropna
DataFrame.append
to_datetime
DataFrame.loc
DataFrame.set_index
DataFrame.reset_index
DataFrame.fillna
DataFrame.replace
DataFrame.apply
DataFrame.astype
DataFrame.iloc
DataFrame.to_excel
DataFrame.join
DataFrame.drop_duplicates
DataFrame.from_dict
Series
DataFrame.index
DataFrame.dtypes
pivot_table
DataFrame.columns
Series.value_counts
get_dummies
DataFrame.filter
DataFrame.plot
DataFrame.describe

I view the number of page views as some function(usefulness, complexity), hence read_csv with its 50+ parameters at the top :)

tdimitri · 2020-06-13T18:25:37Z

I love this list. Below are some decisions we made in riptide to work with existing pandas users while trying to eliminate duplicate methods and cases where too much is put into a single method (like sort and merge).

na: fill_na and drop_na. We broke out fill_na to fill_forward, fill_backward
sort: we did sort_copy, sort_inplace, sort_view <-- new one that pandas does not have
dtypes/describe/stats: all related to information about the columns
apply: (big topic!) apply_reduce, apply_nonreduce, apply_numba (last one not yet implemented)
note: apply_reduce has a transform=True (and this got rid of .transform)
columns: for all column operations we started all methods with col_. Thus 'col_rename' and 'col_map' are the most used.
join/merge: (another big topic!) this got broken out into merge_enrich, and a few more merge_ methods. concat went to 'stack_rows'; we started using stack_rows as a generic API to stack anything.
set_index/reset_index: we eliminated these (this will end up being a big discussion topic and is related to row labels -- I think?)
drop_duplicates: this is really 'first', 'last' and 'nth' and is related to groups or groupings (which is going back to categoricals)
groupby: this goes back to groups which goes back to categoricals
.loc/[,:], etc: These are related to row and column indexing which we reduced into always using [,] syntax
pivot_table: we have this also (was a fight as I tried to eliminate it) -- we often use another API to do what pivot does (specifically set_index, followed by pivot can be done in a new method).
filter: this needs more discussion -- we have filter= kwarg in many methods
value_counts: this became .count() everywhere
to/from: from_dict -- surprised to see this on the list. not surprisingly we have to_pandas, from_pandas
save/load: not on the list but related
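
For context on the list above, the sketch below shows the pandas behaviors that several of these items split apart: directional fills, copy versus in-place sorting, and reducing versus broadcasting group operations. It uses only pandas calls and does not show riptide's API, whose signatures are not given in this thread; the data is made up for illustration.

```python
import pandas as pd

# Filling: pandas offers fillna plus the directional ffill / bfill.
s = pd.Series([1.0, None, None, 4.0])
forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward
dropped = s.dropna()   # remove missing entries instead of filling

# Sorting: one method, two behaviors selected by the inplace flag.
df = pd.DataFrame({"key": ["b", "a", "b"], "x": [3, 1, 2]})
sorted_copy = df.sort_values("x")          # returns a new frame, df is unchanged
df_inplace = df.copy()
df_inplace.sort_values("x", inplace=True)  # mutates and returns None

# Grouped operations: a reduction gives one row per group, while transform
# broadcasts the group result back to the original row count.
reduced = df.groupby("key")["x"].sum()
broadcast = df.groupby("key")["x"].transform("sum")

print(forward.tolist(), backward.tolist(), len(dropped))
print(sorted_copy["x"].tolist(), df["x"].tolist())
print(reduced.to_dict(), broadcast.tolist())
```

Seen this way, names like fill_forward or sort_copy move a choice that pandas makes through a keyword argument or a separate method into the method name itself.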

kgryte · 2020-12-10T17:36:02Z

To add to this discussion, we've done some analysis of downstream library usage of pandas APIs, which can be found here.

devin-petersohn mentioned this issue May 16, 2020

Avoiding the "pandas trap" #4

Open

kgryte mentioned this issue Dec 10, 2020

API candidates for standardization #34

Open

rgommers mentioned this issue Dec 2, 2021

potentially relevant usage patterns / targets for a developer-focused API #71

Open

jbrockmendel mentioned this issue Aug 26, 2022

DISC: pd.DataFrame methods we specifically _don't_ want included #83

Closed
