Contributor

@trossi trossi commented Sep 16, 2024

References to issues or other PRs

Closes #20.

Describe the proposed changes

This PR adds support for writing pandas dataframes. This turned out to be quite a large change to the existing Python-to-R conversion functionality. Overview of changes:

  • Reorganized the functions in rdata.conversion.to_r and added a class ConverterFromPythonToR to simplify keeping track of references in RData files.
  • Added functionality to rdata.missing for distinguishing R's NA float value from other NaN values.
  • Added more tests on reading and writing dataframes with various dtypes and a mix of NA and NaN values.
  • Added convert_altrep_to_range() (rdata.conversion._conversion), which enables converting a compact intseq to a range object (e.g., for use as a dataframe index).
  • Added unparsing of REF and ALTREP objects.
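
To illustrate the compact intseq conversion mentioned above, here is a minimal sketch. It assumes the ALTREP state carries three numbers (length, first value, increment), which is my understanding of R's compact integer sequences. The function name compact_intseq_to_range is hypothetical and is not the convert_altrep_to_range() added in this PR.

```python
def compact_intseq_to_range(length: int, first: int, step: int) -> range:
    """Expand a (length, first, step) triple into the equivalent range.

    Hypothetical helper for illustration only; the triple layout is an
    assumption about how a compact integer sequence is stored.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if step == 0:
        raise ValueError("step must be non-zero")
    # The stop value is exclusive, so it sits one step past the last element.
    return range(first, first + length * step, step)


# An ascending sequence such as R's 1:5 would be length=5, first=1, step=1.
ascending = compact_intseq_to_range(5, 1, 1)
# A descending sequence such as R's 5:1 would be length=5, first=5, step=-1.
descending = compact_intseq_to_range(5, 5, -1)
```

A range keeps the compactness of the original representation, which is why it maps naturally onto something like a dataframe index.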

Additional information

The functions in rdata.missing could be useful for users to handle NA values in the desired way, e.g., pd.arrays.FloatingArray(array) would set all NaNs (including R's NA value) as "missing", but pd.arrays.FloatingArray(array, is_na(array)) can be used to set only R's NA values as "missing".
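
For readers unfamiliar with the NA/NaN distinction, the following standard-library sketch shows the idea. It relies on the fact that R represents its NA float as a NaN whose low 32 bits hold the payload 1954. The names make_r_na and is_r_na are hypothetical and are not the rdata.missing API; the actual implementation in this PR may differ.

```python
import math
import struct

# Bit pattern of R's NA float: a NaN whose low 32 bits hold 1954 (0x7A2).
R_NA_BITS = 0x7FF00000000007A2
R_NA_PAYLOAD = 1954


def make_r_na() -> float:
    """Build a float with the bit pattern of R's NA (illustrative helper)."""
    return struct.unpack("<d", struct.pack("<Q", R_NA_BITS))[0]


def is_r_na(x: float) -> bool:
    """Return True only for NaNs carrying R's NA payload in the low word."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Only the low word is compared, so a changed quiet bit does not matter.
    return math.isnan(x) and (bits & 0xFFFFFFFF) == R_NA_PAYLOAD


values = [1.0, float("nan"), make_r_na()]
na_mask = [is_r_na(v) for v in values]    # marks only R's NA
nan_mask = [math.isnan(v) for v in values]  # marks every NaN, including NA
```

The two masks correspond to the two policies described above: treating every NaN as missing, or treating only R's NA as missing.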

Checklist before requesting a review

  • I have performed a self-review of my code
  • The code conforms to the style used in this package (checked with Ruff)
  • The code is fully documented and typed (type-checked with Mypy)
  • I have added thorough tests for the new/changed functionality

value = None

elif obj.info.type == parser.RObjectType.ALTREP:
value = convert_altrep_to_range(obj)
Owner

Do you think this is ok?

In principle it is ok to write altreps by default when possible. I am not sure whether we should add an option for the Converter not to use altreps, for compatibility with tools that do not understand them.

rdata/_write.py Outdated
compression: Compression.
encoding: Encoding to be used for strings within data.
format_version: File format version.
constructor_dict: Dictionary mapping Python types to R classes.
Owner

That is not really true, right? It maps Python classes to functions to convert them to R classes (which is more powerful, as it can choose a different R class depending on the attributes of the object).

Contributor Author

True, changed to "Dictionary mapping Python classes to functions converting them to R classes." here and in other locations.
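
The difference between mapping classes to R classes and mapping classes to conversion functions can be sketched as follows. This is a generic illustration of the dispatch pattern, not the rdata API: every name here (RLike, DEFAULT_CONSTRUCTORS, convert, and the tuple stand-in for an R object) is hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in for an R object: (R type name, payload).
RLike = Tuple[str, Any]
Constructor = Callable[[Any], RLike]


def convert_range(data: range) -> RLike:
    # A function can inspect the object and choose the R representation.
    if data.step == 1:
        return ("ALTREP compact_intseq", (len(data), data.start, data.step))
    return ("INT", list(data))


def convert_list(data: list) -> RLike:
    return ("VEC", data)


DEFAULT_CONSTRUCTORS: Dict[type, Constructor] = {
    range: convert_range,
    list: convert_list,
}


def convert(obj: Any, constructors: Dict[type, Constructor] = DEFAULT_CONSTRUCTORS) -> RLike:
    """Look up a conversion function by walking the object's class hierarchy."""
    for cls in type(obj).__mro__:
        if cls in constructors:
            return constructors[cls](obj)
    raise TypeError(f"no constructor registered for {type(obj).__name__}")


# Overriding one entry changes the output without touching the defaults,
# e.g. to always write plain integers instead of a compact sequence.
no_altrep = {**DEFAULT_CONSTRUCTORS, range: lambda r: ("INT", list(r))}
```

Because the values are functions, a single Python class can map to different R representations depending on the attributes of the object, which a plain class-to-class mapping cannot express.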

Contributor Author

@trossi trossi left a comment

@vnmabus Sorry for the long delay with this PR - my past months were pretty busy. I have pushed fixes based on your suggestions.

build_r_data as build_r_data,
convert_to_r_object as convert_to_r_object,
convert_to_r_object_for_rda as convert_to_r_object_for_rda,
DEFAULT_CONSTRUCTOR_DICT as DEFAULT_CONSTRUCTOR_DICT,
Contributor Author

I agree it is clearer to import this directly from to_r; removed from here. Regarding symmetric naming, do you mean we should rename it to DEFAULT_CLASS_MAP (same as in reading, user imports the correct name from to_r / to_python) or more explicit (but longer) like DEFAULT_PYTHON_TO_R_CLASS_MAP?

value = None

elif obj.info.type == parser.RObjectType.ALTREP:
value = convert_altrep_to_range(obj)
Contributor Author

An altrep is now created only for suitable pd.RangeIndex objects. Users could avoid altreps here altogether by overriding the default constructor with:

def rangeindex_constructor(data: pd.RangeIndex, converter: Converter) -> RObject:
    return build_r_object(RObjectType.INT, value=np.array(data))

Owner

@vnmabus vnmabus left a comment

Thank you so much! I think this is ready. If there is something we missed, we can just change it later, I think.

@vnmabus vnmabus merged commit 9e3b147 into vnmabus:develop Mar 9, 2025
15 checks passed
Contributor Author

trossi commented Mar 21, 2025

@vnmabus Thank you for the thorough review once again!

It would be useful to have a pypi release with the writing functionality. Do you think the code would be ready for that? The writing functionality would need documentation in rdata.readthedocs.io (I can make a PR for that), but would something else be needed too?

Owner

vnmabus commented Mar 21, 2025

> It would be useful to have a pypi release with the writing functionality. Do you think the code would be ready for that? The writing functionality would need documentation in rdata.readthedocs.io (I can make a PR for that), but would something else be needed too?

I would say that the only thing missing is the documentation.

Development

Successfully merging this pull request may close these issues.

Write the DF as rds file
