Inputs spec chars eh #205
- all dates are YYYY-MM-DD in functions and examples
- changed TADAdataRetrieval inputs to "project" and "organization" to match WQP queries
- could replace MeasureValueSpecialCharacters: more generalized, not column specific, handles % and flags numbers with commas as well
- fixed the issue flagging scientific notation as text
- added new functions to try
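To illustrate the generalized special-character handling described in these notes, here is a minimal base R sketch. It is not the package's actual implementation: the function name `flag_special_chars`, the flag labels, and the output column names are all hypothetical and chosen only for this example. It works on any character vector rather than a specific column, treats scientific notation as numeric, and flags percentages and comma-separated numbers.

```r
# Hypothetical helper for illustration only; names and flag labels are assumptions.
flag_special_chars <- function(x) {
  x <- trimws(as.character(x))
  # Plain or scientific-notation numbers, e.g. "12", "-0.5", "1.2e-3"
  num_pattern <- "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$"

  flag <- rep("Text", length(x))
  flag[is.na(x) | x == ""] <- "NA"
  flag[grepl(num_pattern, x)] <- "Numeric"

  # A trailing % on an otherwise numeric value
  pct <- grepl("%$", x) & grepl(num_pattern, sub("%$", "", x))
  flag[pct] <- "Percentage"

  # Commas inside an otherwise numeric value, e.g. "1,000"
  comma <- grepl(",", x) & grepl(num_pattern, gsub(",", "", x))
  flag[comma] <- "Comma-Separated Numeric"

  # Convert only the values that were recognized as some kind of number
  cleaned <- gsub(",", "", sub("%$", "", x))
  value <- suppressWarnings(as.numeric(cleaned))
  value[flag %in% c("Text", "NA")] <- NA_real_

  data.frame(original = x, flag = flag, value = value, stringsAsFactors = FALSE)
}

res <- flag_special_chars(c("1.2e-3", "5%", "1,000", "<5", NA))
print(res)
```

Keeping the original string next to the flag and the converted value lets a reviewer see exactly which entries were reinterpreted and which were left as text.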
ResultIdentifiers are unique to samples. There may be multiple detection limit types related to a result identifier, but result identifiers connect to one observation and one activity. This change in code streamlines the process so that duplicates are only checked on result identifiers represented in two or more rows.
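The streamlined duplicate check can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `remove_exact_duplicates` is a hypothetical name, and the only column it relies on is `ResultIdentifier`. The idea is to run the expensive all-column `duplicated()` comparison only on rows whose result identifier appears two or more times.

```r
# Hypothetical helper for illustration only; not the function changed in this PR.
remove_exact_duplicates <- function(.data) {
  ids <- .data$ResultIdentifier
  # Rows whose identifier appears in two or more rows
  repeated <- ids %in% ids[duplicated(ids)]
  if (!any(repeated)) {
    return(.data)
  }
  message("Duplicate records may be present. Filtering to unique records.")
  # Compare all columns, but only within the candidate rows
  drop <- rep(FALSE, nrow(.data))
  drop[repeated] <- duplicated(.data[repeated, , drop = FALSE])
  .data[!drop, , drop = FALSE]
}

df <- data.frame(
  ResultIdentifier = c("A", "B", "B", "C", "C"),
  DetectionLimitType = c("MDL", "MDL", "MDL", "MDL", "PQL"),
  stringsAsFactors = FALSE
)
out <- remove_exact_duplicates(df)
print(out)
```

In this example the second "B" row is an exact duplicate and is dropped, while both "C" rows are kept because they differ in detection limit type.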
cristinamullin left a comment
Overall these changes look great! I added a few comments/suggestions for edits & will try to dig up some examples for you with exact duplicates and potential duplicates.
# Remove duplicate rows - turned into a test because duplicated() takes a long
# time acting on all columns in a large dataset.
if(!length(unique(.data$ResultIdentifier))==dim(.data)[1]){
  print("Duplicate records may be present. Filtering to unique records. This may take a while on large datasets.")
Change to "Duplicate records are present"?
I wrote "may be present" because I could potentially see someone using autoclean on a dataset where they joined detection limit data to result data and thus have unique rows with the same result identifier and different detection limits. In this case, the function will check to make sure these cases are truly unique.
- updated data harmonization vignette to include information on other TADAdataRetrieval inputs (project and organization), and information on the discrepancy between WQP and dataRetrieval regarding date format
- added more commentary on input specification origins and date format discrepancy in TADAdataRetrieval function
- added date format check to help user ensure dates are in format YYYY-MM-DD
- updated documentation for ConvertSpecialChars
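A date format check of the kind mentioned above could look something like this base R sketch. The function name `is_ymd` is hypothetical and this is not the check added to TADAdataRetrieval. It tests both the YYYY-MM-DD shape and whether the string is a real calendar date.

```r
# Hypothetical helper for illustration only; not the package's date check.
is_ymd <- function(x) {
  x <- as.character(x)
  # Shape check: four digits, two digits, two digits, separated by hyphens
  shaped <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  # Calendar check: as.Date() returns NA for impossible dates such as Feb 30
  parsed <- as.Date(x, format = "%Y-%m-%d")
  shaped & !is.na(parsed)
}

dates <- c("2020-05-01", "05-01-2020", "2020-02-30", "2020/05/01")
print(is_ymd(dates))
```

A caller could stop with an informative error when `all(is_ymd(dates))` is false, so a wrongly formatted date is caught before any query is sent.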
This branch consists of: