Add writing functionality for dataframes #45
rdata/conversion/_conversion.py
Outdated
    value = None

elif obj.info.type == parser.RObjectType.ALTREP:
    value = convert_altrep_to_range(obj)
Do you think this is ok?
In principle it is fine to write altreps by default when possible. I am not sure whether we should add an option for the Converter not to use altreps, for compatibility with tools that do not understand them.
rdata/_write.py
Outdated
compression: Compression.
encoding: Encoding to be used for strings within data.
format_version: File format version.
constructor_dict: Dictionary mapping Python types to R classes.
That is not really true, right? It maps Python classes to functions to convert them to R classes (which is more powerful, as it can choose a different R class depending on the attributes of the object).
True, changed to "Dictionary mapping Python classes to functions converting them to R classes." here and in other locations.
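To illustrate why mapping classes to conversion functions is more flexible than mapping classes to R classes directly, here is a minimal, self-contained sketch. It does not use rdata's actual API: `RLike`, `convert_sequence`, `DEFAULT_CONSTRUCTORS`, and `convert` are hypothetical names, and an "R object" is modelled as a plain `(r_class, payload)` tuple.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in for an R object: (R class name, payload).
RLike = Tuple[str, Any]
Constructor = Callable[[Any], RLike]


def convert_sequence(data: list) -> RLike:
    """Pick the R class from the contents, not only from the Python type."""
    # bool is a subclass of int, so it has to be checked first.
    if all(isinstance(x, bool) for x in data):
        return ("logical", data)
    if all(isinstance(x, int) for x in data):
        return ("integer", data)
    return ("list", data)


# Python class -> function that builds the R-like object.
DEFAULT_CONSTRUCTORS: Dict[type, Constructor] = {
    list: convert_sequence,
    str: lambda s: ("character", [s]),
}


def convert(obj: Any, constructors: Dict[type, Constructor] = DEFAULT_CONSTRUCTORS) -> RLike:
    """Dispatch on the first registered class that obj is an instance of."""
    for cls, constructor in constructors.items():
        if isinstance(obj, cls):
            return constructor(obj)
    raise TypeError(f"No constructor registered for {type(obj).__name__}")


# A user can override one entry without touching the rest.
always_list = {**DEFAULT_CONSTRUCTORS, list: lambda d: ("list", d)}
```

With a plain class-to-class map, `[1, 2]` and `[1, "a"]` would have to map to the same R class; with functions, the same Python type can yield different results depending on the object.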
This reverts commit 8a269ae.
trossi
left a comment
@vnmabus Sorry for the long delay with this PR - my past months were pretty busy. I have pushed fixes based on your suggestions.
rdata/conversion/__init__.py
Outdated
build_r_data as build_r_data,
convert_to_r_object as convert_to_r_object,
convert_to_r_object_for_rda as convert_to_r_object_for_rda,
DEFAULT_CONSTRUCTOR_DICT as DEFAULT_CONSTRUCTOR_DICT,
I agree it is clearer to import this directly from to_r; removed from here. Regarding symmetric naming, do you mean we should rename it to DEFAULT_CLASS_MAP (same as in reading, user imports the correct name from to_r / to_python) or more explicit (but longer) like DEFAULT_PYTHON_TO_R_CLASS_MAP?
rdata/conversion/_conversion.py
Outdated
    value = None

elif obj.info.type == parser.RObjectType.ALTREP:
    value = convert_altrep_to_range(obj)
Now altrep is created only for suitable pd.RangeIndex objects. Users could avoid altreps here altogether by overriding the default constructor with:
def rangeindex_constructor(data: pd.RangeIndex, converter: Converter) -> RObject:
    return build_r_object(RObjectType.INT, value=np.array(data))
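For readers unfamiliar with compact integer sequences, the idea behind this thread can be sketched without rdata or pandas. The sketch assumes that a compact intseq is described by a `(length, start, step)` state with a step of 1 or -1, as in R's compact integer sequences; Python's built-in `range` stands in for `pd.RangeIndex`, and both function names are hypothetical rather than taken from the PR.

```python
from typing import Optional, Tuple

# Assumed state layout of a compact integer sequence: (length, start, step).
IntseqState = Tuple[int, int, int]


def range_to_intseq_state(r: range) -> Optional[IntseqState]:
    """Return a compact state if the range is suitable, otherwise None.

    Only non-empty ranges with a step of 1 or -1 are treated as suitable;
    anything else would have to be written as a regular integer vector.
    """
    if len(r) == 0 or r.step not in (1, -1):
        return None
    return (len(r), r.start, r.step)


def intseq_state_to_range(state: IntseqState) -> range:
    """Rebuild the range described by a compact (length, start, step) state."""
    length, start, step = state
    return range(start, start + length * step, step)
```

A range such as `range(0, 10, 2)` yields `None` here, which corresponds to falling back to a plain integer vector, much like the `rangeindex_constructor` override above does unconditionally.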
vnmabus
left a comment
Thank you so much! I think this is ready. If there is something we missed, we can just change it later, I think.
@vnmabus Thank you for the thorough review once again! It would be useful to have a pypi release with the writing functionality. Do you think the code would be ready for that? The writing functionality would need documentation in rdata.readthedocs.io (I can make a PR for that), but would something else be needed too?
I would say that the only thing missing is the documentation.
References to issues or other PRs
Closes #20.
Describe the proposed changes
This PR will add support for writing pandas dataframes. This turned out to be a quite large change to the current Python-to-R conversion functionality. Overview of changes:
- rdata.conversion.to_r reorganized, a class ConverterFromPythonToR added to simplify keeping track of references in RData files.
- rdata.missing
- convert_altrep_to_range() (rdata.conversion._conversion) that enable converting compact intseq to range object (e.g. as a dataframe index).

Additional information
The functions in rdata.missing could be useful for users to handle NA values in the desired way, e.g., pd.arrays.FloatingArray(array) would set all NaNs (including R's NA value) as "missing", but pd.arrays.FloatingArray(array, is_na(array)) can be used to set only R's NA values as "missing".
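The difference between "any NaN" and "R's NA" can be shown with the standard library alone. R represents a missing double (NA_real_) as a NaN whose low 32-bit word is 1954, so the two can be told apart by inspecting the bit pattern. The helpers below are a hypothetical sketch of that check, not the implementation in rdata.missing.

```python
import math
import struct

# Value R stores in the low 32 bits of the NaN that represents NA_real_.
R_NA_LOW_WORD = 1954


def make_r_na() -> float:
    """Build a float carrying R's NA bit pattern (a NaN with payload 1954)."""
    bits = (0x7FF00000 << 32) | R_NA_LOW_WORD
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_r_na(x: float) -> bool:
    """True only for NaNs carrying R's NA payload, not for ordinary NaNs."""
    if not math.isnan(x):
        return False
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return (bits & 0xFFFFFFFF) == R_NA_LOW_WORD


values = [1.0, float("nan"), make_r_na()]
nan_mask = [math.isnan(v) for v in values]   # marks both NaN and NA
na_mask = [is_r_na(v) for v in values]       # marks only R's NA
```

A mask like `na_mask` plays the role of `is_na(array)` in the description above: it flags only R's NA values and leaves ordinary NaNs as regular floating-point values.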