@vincentarelbundock vincentarelbundock commented Aug 12, 2024

This is a refactor of the histogram, boxplot, area, and other helper functions to kickstart a larger refactoring strategy that I could implement.

Concepts:

  1. Store by, facet, x, and y in a single data frame named datapoints so we can use split-apply-combine for facets and groups, instead of the complicated use of interaction().
  2. In a future PR, datapoints is converted directly to the current split_data.
  3. This lays the groundwork for user-supplied type_*(), which must accept a datapoints data.frame with known characteristics and return another data frame with the same characteristics.
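
A minimal sketch of the split-apply-combine idea in the list above, under stated assumptions: the column names (x, y, by, facet) come from the description, but the sample values and the `type_center()` helper are hypothetical and are not part of tinyplot.

```r
# Hypothetical `datapoints` data frame. Column names follow the PR
# description; the values are made up for illustration.
datapoints <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6, 7, 8),
  y     = c(2, 4, 6, 8, 1, 3, 5, 7),
  by    = rep(c("a", "b"), each = 4),
  facet = rep(c("f1", "f2"), times = 4)
)

# Hypothetical user-supplied transformation (not a tinyplot function):
# it receives one group's rows and returns a data frame with the same
# columns. Here it centres y within the group.
type_center <- function(dp) {
  dp$y <- dp$y - mean(dp$y)
  dp
}

# Split on the by/facet columns, apply the transformation, recombine.
pieces <- split(datapoints, list(datapoints$by, datapoints$facet), drop = TRUE)
result <- do.call(rbind, lapply(pieces, type_center))
rownames(result) <- NULL
```

Because the grouping columns travel with the rows, each piece still carries its own by and facet values after the transformation, and the original `datapoints` is left untouched.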

Benefits:

  1. Code simplification: it's easier to keep track of by and facet indices when we split-apply-combine on a dataframe.
  2. Avoid overwriting x and friends, in case we need to check what the original user input was.
  3. Pathways to user-supplied type_*() functions.

@vincentarelbundock vincentarelbundock marked this pull request as draft August 12, 2024 00:28
@vincentarelbundock vincentarelbundock marked this pull request as ready for review August 12, 2024 01:59
@vincentarelbundock
Collaborator Author

I think this is ready for review.

This PR lays the groundwork for further simplification, but I think it already stands alone as a nice useful chunk, so it makes sense to review and merge before doing anything else.

@vincentarelbundock vincentarelbundock changed the title from "Histogram refactor" to "More refactor: data frame split apply combine" Aug 12, 2024
This was referenced Aug 12, 2024
@grantmcdermott
Owner

Super, thanks for this @vincentarelbundock. I'm currently on vacation without a laptop, but will aim to do a proper review once I'm back (and have had a chance to look at #197 too).

Qq: Does going through the data.frame intermediary affect performance (snappiness) at all?

@vincentarelbundock
Collaborator Author

> Super, thanks for this @vincentarelbundock. I'm currently on vacation without a laptop, but will aim to do a proper review once I'm back (and have had a chance to look at #197 too).

Cool cool. No rush at all.

> Qq: Does going through the data.frame intermediary affect performance (snappiness) at all?

I have not benchmarked, but I wouldn't expect this to have any effect at all. We're just storing equal length vectors into a data.frame and calling split() only once, instead of holding them as separate variables, putting them in a list, and calling Map()+split(). If anything, this simplifies things and reduces the number of calls.
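
A rough sketch of the two approaches being compared, for illustration only. The inputs are made up, and the "before" version is an assumption about what a Map()+split() pattern might look like, not the actual tinyplot code.

```r
# Illustrative inputs; not taken from the tinyplot source.
x  <- c(1, 2, 3, 4, 5, 6)
y  <- c(10, 20, 30, 40, 50, 60)
by <- c("a", "b", "a", "b", "a", "b")

# Approach 1 (assumed "before" pattern): keep the vectors separate,
# split each one, then zip the pieces back together with Map().
xs <- split(x, by)
ys <- split(y, by)
split_list <- Map(function(x, y) list(x = x, y = y), xs, ys)

# Approach 2: store the equal-length vectors in one data frame and
# call split() once.
datapoints <- data.frame(x = x, y = y, by = by)
split_df <- split(datapoints, datapoints$by)
```

Both versions produce one element per group with the same x and y values; the second needs a single split() call regardless of how many columns are added later.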

@grantmcdermott
Owner

grantmcdermott commented Aug 20, 2024

@vincentarelbundock do you mind merging in the recent changes that we pushed to the main branch? I know there aren't any file conflicts, but I edited some tests and wanted to make sure that everything is passing against the up-to-date test suite. Cheers.

(I'm aiming to review this PR properly by the end of the week.)

@vincentarelbundock
Collaborator Author

done

Owner

@grantmcdermott grantmcdermott left a comment

Thanks again @vincentarelbundock. I really appreciate the continued efforts at further modularization.

I must admit that some of this feels like over-engineering to me at the moment (e.g., all the datapoints <-> dp internal shuffling that happens within the individual types). But I'm willing to take it on faith that this will enable the functionality that you highlight in your concept outline. (And I'm really excited about this possibility.)

The requested changes here are mostly minor.

@vincentarelbundock
Collaborator Author

I think that all issues are resolved.

> I must admit that some of this feels like over-engineering to me at the moment (e.g., all the datapoints <-> dp internal shuffling

Fair enough.

I removed all assignments to intermediary dp.

I'm optimistic about the long term vision. Think it's going to be very cool. I'm still going to do one more run of simplification to fully exploit the datapoints.

But note that we are already benefiting. The old code to prep data for histograms was roughly 96 lines long; the new one is roughly 51 lines. Not a massive deal, but it illustrates some of the simplifications possible, and there are some more.

Owner

@grantmcdermott grantmcdermott left a comment

Merci, Vincent!

Thanks again for actioning these changes and for the continued improvements to the tinyplot codebase. Again, I'm very excited by some of the proposed features that this should unlock.

As a heads-up, I would like to submit a patch release to CRAN in the next day or two, since we've fixed quite a few bugs in the last month. (Hopefully, I'll be able to include a fix for #206 too, which is proving a little more finicky than I originally thought.) But then we can work on a bigger release after that, which includes these new features.

@grantmcdermott grantmcdermott merged commit 3917b16 into grantmcdermott:main Aug 25, 2024
@vincentarelbundock vincentarelbundock deleted the histogram_refactor branch September 9, 2024 00:55