Add writing functionality for dataframes #45

trossi · 2024-09-16T12:44:01Z

References to issues or other PRs

Closes #20.

Describe the proposed changes

This PR adds support for writing pandas dataframes. This turned out to be quite a large change to the current Python-to-R conversion functionality. Overview of changes:

Reorganized the functions in rdata.conversion.to_r and added a class, ConverterFromPythonToR, to simplify keeping track of references in RData files.
Added functionality to rdata.missing for distinguishing R's NA float value from other NaN values.
Added more tests on reading and writing dataframes with various dtypes and a mix of NA and NaN values.
Added convert_altrep_to_range() (in rdata.conversion._conversion), which enables converting a compact intseq to a range object (e.g. for use as a dataframe index).
Added unparsing of REF and ALTREP objects.
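
To illustrate the compact intseq item above, here is a minimal sketch of the idea behind converting such a sequence to a Python range and back. It assumes that the compact state is a triple of (length, first value, increment) with an increment of 1 or -1; the function names are hypothetical and this is not the code added in this PR.

```python
from typing import Sequence, Tuple


def compact_intseq_to_range(state: Sequence[float]) -> range:
    """Expand an assumed (length, first, increment) state into a range."""
    length, first, step = (int(x) for x in state)
    if step not in (1, -1):
        raise ValueError("compact sequences are assumed to have step 1 or -1")
    return range(first, first + length * step, step)


def range_to_compact_intseq(values: range) -> Tuple[float, float, float]:
    """Pack a unit-step range into the assumed compact state."""
    if values.step not in (1, -1):
        raise ValueError("only unit-step ranges fit the compact form")
    return (float(len(values)), float(values.start), float(values.step))


# A sequence such as 1:5 would be stored as length 5, first value 1, increment 1.
row_index = compact_intseq_to_range((5.0, 1.0, 1.0))
print(list(row_index))
```

A range with any other step cannot be stored this way and would have to be written as an ordinary integer vector.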

Additional information

The functions in rdata.missing could be useful for users who want to handle NA values in a particular way. For example, pd.arrays.FloatingArray(array) would mark all NaNs (including R's NA value) as "missing", whereas pd.arrays.FloatingArray(array, is_na(array)) can be used to mark only R's NA values as "missing".
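
As a rough illustration of how R's NA can be told apart from an ordinary NaN, the sketch below relies on the fact that R encodes NA_real_ as a NaN whose low 32 bits hold the value 1954. The helper names (make_r_na, is_r_na) are hypothetical and operate on single floats; they are not the functions in rdata.missing, which may be implemented differently.

```python
import math
import struct

R_NA_LOW_WORD = 1954              # payload R stores in the low 32 bits of NA_real_
R_NA_BITS = 0x7FF00000000007A2    # NaN exponent bits plus that payload (0x7A2 == 1954)


def make_r_na() -> float:
    """Build a float carrying the bit pattern of R's NA_real_."""
    return struct.unpack("<d", struct.pack("<Q", R_NA_BITS))[0]


def is_r_na(value: float) -> bool:
    """Return True only for NaNs whose low word matches R's NA payload."""
    if not math.isnan(value):
        return False
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return (bits & 0xFFFFFFFF) == R_NA_LOW_WORD


values = [1.5, float("nan"), make_r_na()]
all_nan_mask = [math.isnan(v) for v in values]  # treats NaN and NA alike
r_na_mask = [is_r_na(v) for v in values]        # singles out R's NA
print(all_nan_mask, r_na_mask)
```

Only the low word is compared, so the check does not depend on whether the NaN's quiet bit is set.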

Checklist before requesting a review

I have performed a self-review of my code
The code conforms to the style used in this package (checked with Ruff)
The code is fully documented and typed (type-checked with Mypy)
I have added thorough tests for the new/changed functionality

vnmabus · 2024-10-29T11:34:20Z

rdata/conversion/_conversion.py

            value = None

+        elif obj.info.type == parser.RObjectType.ALTREP:
+            value = convert_altrep_to_range(obj)


Do you think this is ok?

In principle, it is OK to write altreps by default when possible. I am not sure whether we should add an option for the Converter not to use altreps, for compatibility with tools that do not understand them.

rdata/conversion/_conversion.py

rdata/conversion/to_r.py

vnmabus · 2024-11-02T11:47:00Z

rdata/_write.py

        compression: Compression.
        encoding: Encoding to be used for strings within data.
        format_version: File format version.
+        constructor_dict: Dictionary mapping Python types to R classes.


That is not really true, right? It maps Python classes to functions that convert them to R classes (which is more powerful, as a function can choose a different R class depending on the attributes of the object).

True, changed to "Dictionary mapping Python classes to functions converting them to R classes." here and in other locations.
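
The distinction discussed in this thread can be sketched as follows. Everything here is a hypothetical stand-in (RValue, list_constructor, CONSTRUCTORS, convert) rather than rdata's actual API; it only shows why a map from Python classes to conversion functions is more flexible than a map from Python classes to fixed R classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class RValue:
    """Hypothetical stand-in for an R object: a class name plus a payload."""
    r_class: str
    payload: Any


Constructor = Callable[[Any], RValue]


def list_constructor(data: list) -> RValue:
    # The function inspects the object, so one Python class can map to
    # several R classes. bool is checked first because bool subclasses int.
    if all(isinstance(x, bool) for x in data):
        return RValue("logical", data)
    if all(isinstance(x, int) for x in data):
        return RValue("integer", data)
    return RValue("list", data)


def str_constructor(data: str) -> RValue:
    return RValue("character", [data])


CONSTRUCTORS: Dict[type, Constructor] = {
    list: list_constructor,
    str: str_constructor,
}


def convert(data: Any, constructors: Dict[type, Constructor] = CONSTRUCTORS) -> RValue:
    # Walk the method resolution order so subclasses reuse a parent's constructor.
    for cls in type(data).__mro__:
        if cls in constructors:
            return constructors[cls](data)
    raise TypeError(f"no constructor registered for {type(data).__name__}")


print(convert([True, False]).r_class, convert([1, 2]).r_class, convert([1, "a"]).r_class)
```

Looking the constructor up through the method resolution order is a design choice of this sketch, not something the thread states about rdata.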

rdata/conversion/__init__.py

rdata/conversion/to_r.py

rdata/missing.py

This reverts commit 8a269ae.

trossi

@vnmabus Sorry for the long delay with this PR; the past few months were pretty busy for me. I have pushed fixes based on your suggestions.

rdata/conversion/__init__.py

rdata/conversion/to_r.py

trossi · 2025-01-22T13:18:11Z

rdata/_write.py

        compression: Compression.
        encoding: Encoding to be used for strings within data.
        format_version: File format version.
+        constructor_dict: Dictionary mapping Python types to R classes.


True, changed to "Dictionary mapping Python classes to functions converting them to R classes." here and in other locations.

trossi · 2025-01-22T13:35:13Z

rdata/conversion/__init__.py

-    build_r_data as build_r_data,
-    convert_to_r_object as convert_to_r_object,
-    convert_to_r_object_for_rda as convert_to_r_object_for_rda,
+    DEFAULT_CONSTRUCTOR_DICT as DEFAULT_CONSTRUCTOR_DICT,


I agree it is clearer to import this directly from to_r, so I removed it from here. Regarding symmetric naming, do you mean we should rename it to DEFAULT_CLASS_MAP (the same name as in reading, so the user imports the right one from to_r or to_python), or to something more explicit (but longer) such as DEFAULT_PYTHON_TO_R_CLASS_MAP?

rdata/conversion/to_r.py

rdata/conversion/_conversion.py

trossi · 2025-01-22T14:54:35Z

rdata/conversion/_conversion.py

            value = None

+        elif obj.info.type == parser.RObjectType.ALTREP:
+            value = convert_altrep_to_range(obj)


Now an altrep is created only for suitable pd.RangeIndex objects. Users could avoid altreps here altogether by overriding the default constructor with:

    def rangeindex_constructor(data: pd.RangeIndex, converter: Converter) -> RObject:
        return build_r_object(RObjectType.INT, value=np.array(data))
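
The override mechanism described here can be sketched with plain Python stand-ins. The names below (DEFAULTS, compact_range_constructor, expanded_range_constructor) are hypothetical and the constructors return simple tuples instead of R objects; the point is only that a user-supplied entry replaces the default one for the same Python class.

```python
from typing import Any, Callable, Dict, Tuple


def compact_range_constructor(data: range) -> Tuple[Any, ...]:
    # Stand-in for the default behaviour: keep the range in compact form.
    return ("compact", len(data), data.start, data.step)


def expanded_range_constructor(data: range) -> Tuple[Any, ...]:
    # Stand-in for the override: write every value out explicitly.
    return ("expanded", list(data))


DEFAULTS: Dict[type, Callable[[Any], Tuple[Any, ...]]] = {
    range: compact_range_constructor,
}

# Later keys win in a dict merge, so the user's entry replaces the default
# while all other default entries are kept.
overridden = {**DEFAULTS, range: expanded_range_constructor}

print(DEFAULTS[range](range(1, 4)))
print(overridden[range](range(1, 4)))
```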

rdata/conversion/to_r.py

vnmabus

Thank you so much! I think this is ready. If there is something we missed, we can just change it later.

trossi · 2025-03-21T10:43:59Z

@vnmabus Thank you for the thorough review once again!

It would be useful to have a PyPI release with the writing functionality. Do you think the code is ready for that? The writing functionality would need documentation on rdata.readthedocs.io (I can make a PR for that), but is anything else needed?

vnmabus · 2025-03-21T11:04:46Z

> It would be useful to have a PyPI release with the writing functionality. Do you think the code is ready for that? The writing functionality would need documentation on rdata.readthedocs.io (I can make a PR for that), but is anything else needed?

I would say that the only thing missing is the documentation.

trossi added 30 commits September 5, 2024 17:13

Add reference type to unparser

3a6664d

Add draft dataframe conversion

153f803

Add helper function for creating unicode arrays

4557559

Add more pd.Series types

6eeb992

Fix the order of symbol references

ffddf74

Add a converter class for Python-to-R conversion

eb82ff6

Fix masked values in masked array

1868d8a

Compare first string representations

8d9cb55

Fix conversion of dataframe columns

398d1e9

Add support for dataframe with string index

9cdd37c

Add assertions for strings

5084d2d

Add conversion for rangeindex and range

af0f6fe

Add conversion of integer index

1c71a86

Add unparsing altreps

8fa951e

Move build_r_data function under converter class

b205d8d

Convert range to array for old format

963a9bc

Fix ruff

61a2ea2

Set object flag explicitly

937908b

Fix mypy

8eda454

Add tests for different dataframe index types

efbb09d

Test converting expanded altrep

32a2cc6

Add only non-nil attributes to expanded altrep

1f4e8d8

Enable general rangeindex in dataframe

237bc22

Test conversion of altreps

6859b8c

Change attribute order to match test files

5ac49d0

Add comment about reordering attributes

6ad1408

Fix ruff and mypy

1c458ba

Add test for dataframe with different dtypes

92429ca

Add conversion of boolean pd arrays

5cf678d

Add test for pandas dtypes

f379fc9

trossi added 13 commits October 25, 2024 14:36

Merge branch 'develop' into dataframe-writer

aa7239d

Recreate test files in common attribute order

df8b391

Skip altreps with attributes in test

1a00c1d

Fix ruff

ff6b6a9

Filter expected warnings

daf1e3a

Pass converter object to constructor functions

943e697

Allow constructor functions without converter

8a269ae

Convert only pandas rangeindex to altrep

9718161

Use more robust indexing

a7c7066

Add tests for rangeindex

5a430aa

Remove conversion of altrep to range

87f4c65

Clarify skip message

25a14af

Fix ruff formatting

7984089

vnmabus requested changes Nov 2, 2024


trossi added 8 commits January 22, 2025 14:25

Fix docstring

9570204

Include converter always in constructor functions

b057ee1

This reverts commit 8a269ae.

Return R object from constructors

fb76598

Fix docstring

ad30ca3

Do not expose DEFAULT_CONSTRUCTOR_DICT

8a4758a

Do not expose DEFAULT_FORMAT_VERSION

760684f

Remove asserts encoded in type hints

be97c4c

Add comment on default row names

d755d08

trossi commented Jan 22, 2025


trossi added 2 commits January 22, 2025 17:08

Fix ruff

a086138

Rename default constructors to DEFAULT_CLASS_MAP

e50eb9c

vnmabus mentioned this pull request Mar 9, 2025

Consider masking float arrays with NaNs in Pandas to NumPy conversion #48

Open

vnmabus approved these changes Mar 9, 2025


vnmabus merged commit 9e3b147 into vnmabus:develop Mar 9, 2025
15 checks passed
