Contributor

@ehinman commented Feb 9, 2023

This branch consists of:

  • Consistent date format YYYY-MM-DD in all functions and examples
  • TADAdataRetrieval input names changed to match the WQP query inputs (project and organization)
  • New special characters function that acts on a user-specified column rather than the hard-coded ResultMeasureValue and DetectionQuantitationLimitMeasure.MeasureValue columns
  • Spec char function also recognizes more special character situations (% and #,####) and flags them for the user
  • Spec char function does not flag scientific notation as text
  • Added lines to replace the current special characters function with the new function if desired/approved.
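
The column-based behavior described in the list above could be sketched roughly as follows. This is an illustration only, not the function added in this branch: the helper name `flag_special_chars`, the `.Flag` / `.Numeric` suffixes, and the flag labels are all assumptions made for the example.

```r
# Hypothetical helper (not the TADA implementation): flag and convert the
# values of ONE user-chosen column instead of a hard-coded column name.
flag_special_chars <- function(df, col) {
  stopifnot(col %in% names(df))
  raw <- trimws(as.character(df[[col]]))

  # as.numeric() already understands scientific notation such as "1.5e-3",
  # so those values are treated as numeric rather than text.
  direct <- suppressWarnings(as.numeric(raw))

  # Second attempt after stripping percent signs and thousands separators.
  cleaned <- suppressWarnings(as.numeric(gsub("[%,]", "", raw)))

  # Flag labels below are made up for this sketch.
  flag <- ifelse(is.na(raw) | raw == "", "ND or NA",
          ifelse(!is.na(direct), "Numeric",
          ifelse(grepl("%", raw, fixed = TRUE) & !is.na(cleaned), "Percentage",
          ifelse(grepl(",", raw, fixed = TRUE) & !is.na(cleaned),
                 "Comma-Separated Numeric", "Text"))))

  df[[paste0(col, ".Flag")]] <- flag
  df[[paste0(col, ".Numeric")]] <- ifelse(is.na(direct), cleaned, direct)
  df
}

# Usage on a small made-up data frame
example <- data.frame(
  ResultMeasureValue = c("12", "1.5e-3", "85%", "1,234", "<0.5", NA),
  stringsAsFactors = FALSE
)
flagged <- flag_special_chars(example, "ResultMeasureValue")
```

Because the column name is an argument, the same helper can be pointed at a detection limit column without any change to the function body.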

- all dates are YYYY-MM-DD in functions, examples
- changed TADAdataRetrieval inputs to "project" and "organization" to match WQP queries.
- new function could replace MeasureValueSpecialCharacters: more generalized, not column specific, handles % and flags numbers with commas as well
- Fixed the issue flagging scientific notation as text.
- added new functions to try
ResultIdentifiers are unique to samples. There may be multiple detection limit types related to a result identifier, but result identifiers connect to one observation and one activity. This change in code streamlines the process so that duplicates are only checked on result identifiers represented in two or more rows.
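
A minimal sketch of the streamlined check described above, assuming a data frame with a `ResultIdentifier` column. The helper name `remove_exact_duplicates` and the sample data are hypothetical and are not taken from the branch.

```r
# Hypothetical helper: drop exact duplicate rows, but only compare rows whose
# ResultIdentifier appears two or more times.
remove_exact_duplicates <- function(df, id_col = "ResultIdentifier") {
  ids <- df[[id_col]]

  # Cheap test first: if every identifier is unique, no row can be a duplicate.
  if (!anyDuplicated(ids)) {
    return(df)
  }

  # Rows whose identifier is represented in two or more rows.
  repeated <- ids %in% ids[duplicated(ids)]

  # Run the expensive all-column comparison on that subset only.
  drop <- rep(FALSE, nrow(df))
  drop[repeated] <- duplicated(df[repeated, , drop = FALSE])

  df[!drop, , drop = FALSE]
}

# Made-up data: R2 is an exact duplicate, R3 has two detection limit types.
results <- data.frame(
  ResultIdentifier = c("R1", "R2", "R2", "R3", "R3"),
  DetectionQuantitationLimitTypeName = c("MDL", "MDL", "MDL", "MDL", "PQL"),
  ResultMeasureValue = c("1", "2", "2", "3", "3"),
  stringsAsFactors = FALSE
)
deduplicated <- remove_exact_duplicates(results)
```

In this example the second R2 row is removed, while both R3 rows are kept because they differ in detection limit type.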
Collaborator

@cristinamullin left a comment

Overall these changes look great! I added a few comments/suggestions for edits & will try to dig up some examples for you with exact duplicates and potential duplicates.

# Remove duplicate rows - turned into a test because duplicated() takes a long
# time acting on all columns in a large dataset.
if(!length(unique(.data$ResultIdentifier))==dim(.data)[1]){
  print("Duplicate records may be present. Filtering to unique records. This may take a while on large datasets.")
Collaborator

Change to "Duplicate records are present"?

Contributor Author

I wrote "may be present" because I could see someone using autoclean on a dataset in which detection limit data was joined to result data, producing unique rows that share a result identifier but have different detection limits. In that case, the function checks that these rows are truly unique.

- updated data harmonization vignette to include information on other TADAdataRetrieval inputs (project and organization), and information on the discrepancy between WQP and dataRetrieval regarding date format.
- Added more commentary on input specification origins and date format discrepancy in TADAdataRetrieval function
- Added date format check to help user ensure dates are in format YYYY-MM-DD.
- updated documentation for ConvertSpecialChars
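
As an illustration of the date format check mentioned in the commits above, a validation along these lines would reject anything that is not a real calendar date written as YYYY-MM-DD. The helper name `is_ymd` is hypothetical and this sketch is not the code added to TADAdataRetrieval.

```r
# Hypothetical helper: TRUE for values that are real calendar dates written
# as YYYY-MM-DD, FALSE for anything else (including NA).
is_ymd <- function(x) {
  x <- as.character(x)

  # Shape check: four digits, dash, two digits, dash, two digits.
  shape_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)

  # Calendar check: as.Date() returns NA for impossible dates such as Feb 30.
  parsed <- as.Date(x, format = "%Y-%m-%d")

  shape_ok & !is.na(parsed)
}

# Usage: stop early with a readable message when an input is malformed.
start_date <- "2023-02-09"
if (!is_ymd(start_date)) {
  stop("Dates must be in the format YYYY-MM-DD.")
}
```

Checking both the shape and the parsed value catches inputs such as "02-09-2023" as well as well-formed strings that are not valid dates.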
@cristinamullin merged commit d3f13e7 into develop Feb 13, 2023
@cristinamullin deleted the inputs_spec_chars_eh branch February 13, 2023 20:13
Development

Successfully merging this pull request may close these issues.

  • date format
  • TADAdataRetrieval function input name consistency with WQP UI, WQP Profiles, and dataRetrieval

3 participants