Study on the pandas API: What is the most commonly used? #3

Open
devin-petersohn opened this issue May 15, 2020 · 9 comments

@devin-petersohn
Member

I have spent a lot of time trying to understand users and their behaviors in order to optimize for them. As a part of this work, I have done numerous studies on what gets used in pandas.

This will be extremely useful when it comes to defining a dataframe standard, because what people are using can help inform us on what behaviors to support.

For this study, we scraped the top 6000 notebooks from Kaggle by upvote.

Repo here, reproduction script included: https://github.com/modin-project/study_kaggle_usage

Results here: results.csv
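For readers who want a feel for how this kind of usage count can be produced, here is a minimal sketch. It is not the script from the linked repo; it assumes the notebook cells have already been extracted as Python source strings, and `count_attribute_usage` and `example_cell` are hypothetical names made up for illustration. It also counts every attribute access, so a real study would need extra work to tell pandas objects apart from everything else.

```python
import ast
from collections import Counter


def count_attribute_usage(source: str) -> Counter:
    """Count attribute names (e.g. `read_csv`, `values`) accessed in Python source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        # ast.Attribute covers both `df.values` and the `drop` in `df.drop(...)`.
        if isinstance(node, ast.Attribute):
            counts[node.attr] += 1
    return counts


# Hypothetical notebook cell, made up for illustration only.
example_cell = """
import pandas as pd
df = pd.read_csv("train.csv")
X = df.drop("target", axis=1).values
y = df["target"].values
"""

usage = count_attribute_usage(example_cell)
print(usage.most_common())
```

Summing the counters over many notebooks would give a ranking similar in spirit to the results file, under the assumptions above.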

@datapythonista
Member

This is really cool, thanks a lot for sharing!

@amueller

This is awesome! I'm somewhat surprised at the common use of values tbh.
Probably obvious but maybe worth mentioning: kaggle is a very biased source. Basically by design everything is read from CSV files, for example, and it's unlikely anyone reads sql. Also, the workloads are quite specific. Still really cool!

@datapythonista
Member

datapythonista commented May 18, 2020

I use values to "export" pandas data to numpy to train scikit-learn models. Not sure if that's the reason, but it doesn't surprise me. I guess it's like read_csv: biased, in that kaggle users will be converting data to numpy to use in scikit-learn.

@amueller

At some point a pandas dev told me to use np.array instead of values, but I guess that's not very common. Also, you don't need to do any conversion for using sklearn ;)

@datapythonista
Member

I agree with both points (I think they meant DataFrame.array, which is somewhat recent). But I think that's still the reason why the notebooks use .values frequently.
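To make the conversion options in the last few comments concrete, here is a small sketch comparing them on a made-up frame. As far as I know, `.array` is defined on Series and Index rather than on DataFrame, and `DataFrame.to_numpy()` is the explicit method added in pandas 0.24; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

via_values = df.values          # attribute-style access, common in older notebooks
via_asarray = np.asarray(df)    # conversion driven from the NumPy side
via_to_numpy = df.to_numpy()    # explicit method, available since pandas 0.24

# For this all-float frame the three routes give the same 2-D array.
assert np.array_equal(via_values, via_asarray)
assert np.array_equal(via_values, via_to_numpy)

# `.array` lives on a single column and returns a pandas ExtensionArray,
# not a NumPy ndarray.
col_array = df["a"].array

print(type(via_to_numpy).__name__, via_to_numpy.shape)  # ndarray (2, 2)
```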

@devin-petersohn
Member Author

devin-petersohn commented May 18, 2020

> kaggle is a very biased source.

We have done the same thing across 1 million public GitHub repos and gists, published in Section 3.6 of https://arxiv.org/pdf/2001.00888.pdf. I thought it better to share some of the raw data we have collected rather than the summary in that paper. The GitHub scripts still need some cleaning up before they can be made public.

I do not think the bias of Kaggle diminishes the value of this data (not that you are implying this).

@TomAugspurger

TomAugspurger commented May 28, 2020

Here are the top 40ish pandas methods by pageviews on pandas' docs:

read_csv
DataFrame
DataFrame.drop
DataFrame.sort_values
DataFrame.to_csv
DataFrame.groupby
DataFrame.merge
DataFrame.rename
read_excel
concat
DataFrame.dropna
DataFrame.append
to_datetime
DataFrame.loc
DataFrame.set_index
DataFrame.reset_index
DataFrame.fillna
DataFrame.replace
DataFrame.apply
DataFrame.astype
DataFrame.iloc
DataFrame.to_excel
DataFrame.join
DataFrame.drop_duplicates
DataFrame.from_dict
Series
DataFrame.index
DataFrame.dtypes
pivot_table
DataFrame.columns
Series.value_counts
get_dummies
DataFrame.filter
DataFrame.plot
DataFrame.describe

I view the number of page views as some function(usefulness, complexity), hence read_csv with its 50+ parameters at the top :)

@tdimitri

tdimitri commented Jun 13, 2020

I love this list. Below are some decisions we made in riptide to work with existing pandas users while trying to eliminate duplicate methods, or methods that have too much put into them (like sort and merge).

na: fill_na and drop_na. We broke out fill_na to fill_forward, fill_backward
sort: we did sort_copy, sort_inplace, sort_view <-- new one that pandas does not have
dtypes/describe/stats: all related to information about the columns
apply: (big topic!) apply_reduce, apply_nonreduce, apply_numba (last one not yet implemented)
note: apply_reduce has a transform=True (and this got rid of .transform)
columns: for all column operations we started all methods with col_. Thus 'col_rename' and 'col_map' are the most used.
join/merge: (another big topic!) this got broken out into merge_enrich, and a few more merge_ methods. concat went to 'stack_rows'; we started using stack_rows as a generic API to stack anything.
set_index/reset_index: we eliminated these (this will end up being a big discussion topic and is related to row labels -- I think?)
drop_duplicates: this is really 'first', 'last' and 'nth' and is related to groups or groupings (which is going back to categoricals)
groupby: this goes back to groups which goes back to categoricals
.loc/[,:], etc: These are related to row and column indexing which we reduced into always using [,] syntax
pivot_table: we have this also (was a fight as I tried to eliminate it) -- we often use another API to do what pivot does (specifically set_index, followed by pivot can be done in a new method).
filter: this needs more discussion -- we have filter= kwarg in many methods
value_counts: this became .count() everywhere
to/from: from_dict -- surprised to see this on the list. Not surprisingly, we have to_pandas, from_pandas
save/load: not on the list but related
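For context on the pandas side of a few of the items above, here is a sketch of how one pandas method covers behaviours that the list describes splitting apart. It uses only pandas calls; none of the riptide method names from the list appear in the code, and the frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [3.0, np.nan, 1.0, 2.0]})

# sort: one method, copy vs. in-place selected by a flag.
sorted_copy = df.sort_values("val")          # returns a new frame
in_place = df.copy()
in_place.sort_values("val", inplace=True)    # mutates `in_place`, returns None

# na: filling forward and backward are already separate methods in pandas.
filled_forward = df["val"].ffill()
filled_backward = df["val"].bfill()

# groupby/apply: reducing to one value per group vs. broadcasting the
# group result back to the shape of the input.
reduced = df.groupby("key")["val"].sum()
broadcast = df.groupby("key")["val"].transform("sum")

print(reduced)
print(broadcast)
```

The reduce versus transform pair is the closest pandas analogue I can point to for the `transform=True` note in the list; whether the semantics match exactly is an assumption on my part.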

@kgryte
Contributor

kgryte commented Dec 10, 2020

To add to this discussion, we've done some analysis of downstream library usage of pandas APIs, which can be found here.

6 participants