Censored data eh #217
Conversation
- Created CensoredDataSuite file and started adding code to discern non-detects (NDs).
- Added a duplicates file that is under development.
- Edited ConvertResultUnits to accommodate the new columns.
- Edited autoclean to use the new special characters function.
- Updated the special characters function to handle input columns that are already numeric; it now creates new TADA columns.
Bug note
- Result Detection Condition Text
- Detection Quantitation Limit Type Name
These functions fetch the domain table, update the self-contained csv, and include a test (in the tests folder) that checks whether the csv is up to date.
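For reference, a minimal sketch of that fetch/update/test pattern; the URL, cache path, and function name below are illustrative placeholders, not the package's actual values:

```r
# Rough sketch of the domain-table refresh pattern described above.
UpdateDetCondRef <- function(url = "https://example.com/ResultDetectionConditionText.CSV",  # hypothetical URL
                             cache = "inst/extdata/DetCondRef.csv") {                       # hypothetical path
  fresh <- utils::read.csv(url, stringsAsFactors = FALSE)  # fetch the current domain table
  utils::write.csv(fresh, cache, row.names = FALSE)        # refresh the self-contained csv
  invisible(fresh)
}

# A companion test can then flag a stale cache, e.g. with testthat:
# testthat::expect_identical(utils::read.csv(cache), utils::read.csv(url))
```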
Started building censored data functions that identify censored data and provide simple methods for filling in censored data values.
- The convert units and censored data ID functions now move detection limit values and units into TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode.
- The change above removes the need to run the convert units function on both the result value/unit and the detection limit value/unit, so those sections are commented out. When run with those sections included, the function does not have the intended effect, because the unit reference joins only to the result value unit, not the detection limit unit; if the result value unit is NA, the detection limit unit is not converted.
- Changed language in the flag columns.
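A hedged sketch of that move, assuming the standard WQP detection limit column names (DetectionQuantitationLimitMeasure.MeasureValue and .MeasureUnitCode) and that the value columns have already been cleaned to compatible types:

```r
library(dplyr)

# Where the result value is missing (censored), carry the detection limit
# value and unit into the TADA columns so one unit-conversion pass covers both.
dat <- dat %>%
  mutate(
    TADA.ResultMeasureValue = if_else(
      is.na(TADA.ResultMeasureValue),
      DetectionQuantitationLimitMeasure.MeasureValue,
      TADA.ResultMeasureValue
    ),
    TADA.ResultMeasure.MeasureUnitCode = if_else(
      is.na(TADA.ResultMeasure.MeasureUnitCode),
      DetectionQuantitationLimitMeasure.MeasureUnitCode,
      TADA.ResultMeasure.MeasureUnitCode
    )
  )
```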
Added TADA columns in autoclean and started editing the harmonization vignette to accommodate the change.
- Moved the meters-to-m conversion to the ConvertDepthUnits function.
- Removed the relocation of TADA columns.
Made summarizecharacteristics no longer column-specific; it will now summarize any column.
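Something along these lines (a sketch; the actual function name and signature may differ):

```r
library(dplyr)

# Summarize counts for any column the user names, not just characteristics.
SummarizeAnyColumn <- function(df, col = "TADA.CharacteristicName") {
  df %>%
    group_by(.data[[col]]) %>%
    summarise(Count = n(), .groups = "drop") %>%
    arrange(desc(Count))
}

# e.g. SummarizeAnyColumn(TADAProfile, "TADA.CharacteristicGroup")
```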
Added a period between TADA and CharacteristicGroup to ensure it matches other column conventions.
- Any function performing operations on the WQP dataframe now creates new columns to support work/conversions, rather than changing content in the original columns.
- Moved the meters-to-m conversion to ConvertDepthUnits.
- Created a new function to rearrange TADA and WQX columns to the end of the dataframe; will implement in a new commit.
- Updated the vignette with the new convention and tested with AK data.
- ConvertDepthUnits now creates TADA columns via ConvertSpecChars for depth measures that join to the unit reference table.
- Fixed a warning in OrderTADACols that calling a vector of column names is ambiguous.
- Fixed an issue with unit conversion when the result value field is NA, using ifelse (see the sketch below).
- Changed the harmonization reference table join, because the previous code was trying to join on columns not present in the harm.raw object.
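A minimal sketch of the NA-safe conversion; WQX.ConversionFactor is an illustrative column name, not necessarily the one used in the package:

```r
# NA-safe unit conversion: rows with an NA result value stay NA rather than
# producing misleading output in the converted column.
dat$TADA.ResultMeasureValue <- ifelse(
  is.na(dat$TADA.ResultMeasureValue),
  NA_real_,
  dat$TADA.ResultMeasureValue * dat$WQX.ConversionFactor
)
```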
- Changed the half-detection-limit method to a multiplier argument, so any value can be multiplied by the detection limit value.
maps is required for the vignette even if it is not required by the package.
Added a water filter to the vignette, added a start date to the DR test (runs faster now), reviewed autoclean (added comments within the function for future development), namespaced runif as stats::runif within simpleCensoredMethods, and added stats and maps back as imports in the DESCRIPTION file.
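A minimal sketch of the simple censored-data fills these commits describe (the multiplier method above and the stats::runif draw below the limit); function and argument names are assumptions, not the package's actual signature:

```r
FillCensoredValue <- function(detlimit,
                              method = c("multiplier", "randombelowlimit"),
                              multiplier = 0.5) {
  method <- match.arg(method)
  if (method == "multiplier") {
    detlimit * multiplier  # any multiple of the detection limit, not just half
  } else {
    stats::runif(length(detlimit), min = 0, max = 1) * detlimit  # random value below the limit
  }
}

FillCensoredValue(c(0.1, 0.25), method = "multiplier", multiplier = 0.5)
#> [1] 0.050 0.125
```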
ehinman
left a comment
Looks good to me! I just made one comment on the global variables added that are actually functions. I wasn't sure if that was the convention for functions called in the package.
Ah okay, I didn't realize those were functions. In that case, we need to find where these are used and specify the package. It looks like last_col is a function in both tidyselect and dplyr (e.g., update to "dplyr::last_col"?). Where is summ used?
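For example, something like this wherever the column reordering happens (an illustrative call, not the exact package code):

```r
# Fully qualifying last_col() lets R CMD check resolve it without a
# utils::globalVariables() entry.
dat <- dplyr::relocate(dat, dplyr::starts_with("TADA."), .after = dplyr::last_col())
```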
cristinamullin
left a comment
Looks great. Let's try to get rid of the remaining notes (truncated text and lat/long numeric issue in autoclean).
```r
print("This dataframe is empty because we did not find any invalid fraction/characteristic combinations in your dataframe")
empty.data <- dplyr::filter(check.data, WQX.SampleFractionValidity == "Invalid")
empty.data <- dplyr::select(empty.data, -WQX.SampleFractionValidity)
empty.data <- OrderTADACols(empty.data)
```
Let's discuss the logic for all these flag functions:
When flagging: if the input dataframe does not have any issues that require flagging, then I believe the function returns the input dataframe unchanged, with a note saying so. In that case the output does not include the flag column, since it would contain only NAs. For tracking purposes, do you think it would make sense to still include the flag column in the output, even if it would be all NAs (no flags)?
When flagging "errors only", it works a little differently: in that case the function returns an empty dataframe (also with no flag column).
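To make the two behaviors concrete, a sketch of the pattern under discussion (the function name and the JoinValidityRef helper are illustrative, not the package's actual API):

```r
FlagInvalidFraction <- function(.data, errorsonly = FALSE) {
  # JoinValidityRef is a hypothetical helper that adds WQX.SampleFractionValidity
  check.data <- JoinValidityRef(.data)
  if (!any(check.data$WQX.SampleFractionValidity == "Invalid", na.rm = TRUE)) {
    print("No invalid fraction/characteristic combinations found.")
    return(.data)  # input returned unchanged; no flag column in the output
  }
  if (errorsonly == TRUE) {
    # "errors only": return just the flagged rows (empty if none exist)
    return(dplyr::filter(check.data, WQX.SampleFractionValidity == "Invalid"))
  }
  check.data  # full dataframe with the flag column
}
```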
```r
#' TADAProfile <- TADAdataRetrieval(statecode = "UT",
#'   characteristicName = c("Ammonia", "Nitrate", "Nitrogen"),
#'   startDate = "10-01-2020")
# TADAProfile <- TADAdataRetrieval(statecode = "UT",
```
Can you please update these in the Shiny data.R and data folder as well for consistency (in a new PR/branch on the Shiny repo)? https://github.com/USEPA/TADAShiny/tree/develop/data
Can do. But one thing: I have not yet updated WaterTemp_US because I haven't remembered to set it to run overnight. It seems like a huge data pull. I will set this to run tonight.
```
@@ -0,0 +1,47 @@
#' Identify Potential Duplicate Data Uploads
```
Looks good. Is this a replacement for the original potential duplicates function, or in addition to it?
Here is some context/notes to consider for inclusion in the function description or vignette:
Notes: each unique result has a unique ResultIdentifier that should be included in all the result profiles. That should be all you need (for exact duplicates from the same submitting org), unless two different organizations submitted the same exact dataset to WQP (with only the org info being different). In addition to the org info, the site information (or really anything else org-specific) could also be different if the two (or more) orgs are submitting the same dataset. In this scenario, the same datasets are being made available under different, but duplicate or nearby, sites created separately by the submitting orgs.
There are a lot of cases where the state and the USGS (state) Water Science Center are both submitting the same datasets. This is something we may be able to address in the future through better communication between the states and USGS on who is submitting the data to WQP (but it might be more complicated than just a communication issue... it is possible some states feel they need (or maybe even have state requirements) to have the data linked to their orgs' sites and org names for use in their CWA assessments).
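For the cross-org case, a rough sketch of one way to surface candidates (the grouping columns are chosen for illustration):

```r
library(dplyr)

# Rows identical on result content but submitted by more than one organization
# are candidate cross-org duplicates. ResultIdentifier alone cannot catch
# these, since each org's upload gets its own identifiers.
cross_org_dupes <- dat %>%
  group_by(ActivityStartDate, ActivityStartTime.Time,
           CharacteristicName, ResultMeasureValue) %>%
  filter(n_distinct(OrganizationIdentifier) > 1) %>%
  ungroup()
```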
I was having trouble finding an instance where the only thing different in a duplicate dataset was the organization metadata, but we can certainly keep both or even combine them. This one is definitely still in development and needs tweaking. It should create a distance matrix between all sites with suspected duplicates and only flag those that are within X meters of one another with identical sample results for the same characteristic at the same date and time.
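A sketch of that proximity screen; the 100 m threshold stands in for the undecided "X meters", and it assumes numeric WQP site columns (LatitudeMeasure/LongitudeMeasure) joined onto the data:

```r
# Great-circle (haversine) distance in meters between coordinate vectors.
haversine_m <- function(lat1, lon1, lat2, lon2, R = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

sites <- unique(dat[, c("MonitoringLocationIdentifier", "LatitudeMeasure", "LongitudeMeasure")])
n <- nrow(sites)
dmat <- outer(seq_len(n), seq_len(n), function(i, j) {
  haversine_m(sites$LatitudeMeasure[i], sites$LongitudeMeasure[i],
              sites$LatitudeMeasure[j], sites$LongitudeMeasure[j])
})
threshold_m <- 100  # illustrative stand-in for "X meters"
near_pairs <- which(dmat < threshold_m & upper.tri(dmat), arr.ind = TRUE)  # site pairs to check
```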
@ehinman I kept summ but removed last_col here and updated this to dplyr::last_col where used
This is a big pull request. In a nutshell, I created new censored data functions and converted the TADA package framework to rely upon TADA-generated columns for all operations to preserve original column contents.