
Conversation

@ehinman ehinman (Contributor) commented Feb 24, 2023

This is a big pull request. In a nutshell, I created new censored data functions and converted the TADA package framework to rely on TADA-generated columns for all operations, preserving the original column contents.

- Created CensoredDataSuite file and started adding code to discern non-detects (NDs).
- Added a duplicates file that is under development.
- Edited ConvertResultUnits to accommodate the new columns.
- Edited autoclean to use the new special characters function.
- Updated the special characters function to handle input columns that are already numeric; it now creates new TADA columns.
- Added functions for the Result Detection Condition Text and Detection Quantitation Limit Type Name domains. These functions fetch the domain table, update the self-contained CSV, and test whether the CSV is up to date in the test folder.
- Started building censored data functions that identify censored data and provide simple methods for filling censored data values.
- The unit conversion and censored data identification functions move detection limit values and units into TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode.
- The change above removes the need to run the unit conversion function on both the result value/unit and the detection limit value/unit, so those sections are commented out. When run with the commented-out sections, the function does not have the intended effect, because the unit reference joins only to the result value unit, not the detection limit unit; if the result value unit is NA, the detection limit unit is not converted.
- Changed language in the flag columns.
- Added TADA columns in autoclean and started editing the harmonization vignette to accommodate the change.
- Moved the "meters" to "m" conversion into the ConvertDepthUnits function.
- Removed the relocation of TADA columns.
- Made SummarizeCharacteristics no longer column-specific; it will summarize any column.
- Added a period between TADA and CharacteristicGroup (TADA.CharacteristicGroup) so it matches the other column naming conventions.
- Any function that performs operations on the WQP dataframe now creates new columns to support the work/conversions, rather than changing content in the original columns.
- Created a new function that rearranges the TADA and WQX columns to the end of the dataframe; to be implemented in a new commit.
- Updated the vignette with the new convention and tested it with AK data.
- ConvertDepthUnits now creates TADA columns via ConvertSpecChars for depth measures that join to the unit reference table.
- Fixed a warning in OrderTADACols that calling a vector of column names is ambiguous.
- Fixed an issue with unit conversion when the result value field is NA by using an if/else.
- Changed the harmonization reference table join, because the previous code tried to join on columns not present in the harm.raw object.
- Changed the half detection limit method to a multiplier so that any value can be multiplied by the detection limit value (see the sketch after this list).
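A minimal sketch of the multiplier idea in the last bullet (illustrative only, not the package implementation; the TADA.CensoredData.Flag column name is an assumption):

# Fill non-detects with multiplier * detection limit. Assumes
# TADA.ResultMeasureValue already holds the detection limit for non-detects
# and a hypothetical TADA.CensoredData.Flag column marks them.
fill_censored <- function(.data, multiplier = 0.5) {
  dplyr::mutate(.data, TADA.ResultMeasureValue = dplyr::if_else(
    TADA.CensoredData.Flag == "Non-Detect",
    TADA.ResultMeasureValue * multiplier,  # e.g. 0.5 * detection limit
    TADA.ResultMeasureValue,
    missing = TADA.ResultMeasureValue      # leave unflagged rows untouched
  ))
}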
@ehinman ehinman marked this pull request as draft February 24, 2023 22:02
@ehinman ehinman linked an issue Feb 24, 2023 that may be closed by this pull request
ehinman and others added 3 commits February 27, 2023 11:13
maps is required for the vignette even if not required in the package.
Added a water filter to the vignette; added a start date to the dataRetrieval test (it runs faster now); reviewed autoclean (added comments within the function for future development); changed runif to stats::runif within simpleCensoredMethods; added stats and maps back as Imports in the DESCRIPTION file.
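For context, qualifying runif as stats::runif makes the dependency explicit and avoids the R CMD check note about an undefined global function. A hedged sketch of that kind of call (the random-below-limit fill shown here is illustrative, not necessarily what simpleCensoredMethods does):

# Draw a random value between 0 and each detection limit, with the
# namespace spelled out so R CMD check can resolve the function.
random_fill <- function(detection_limit) {
  stats::runif(length(detection_limit), min = 0, max = detection_limit)
}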
@ehinman ehinman (Contributor, Author) left a comment

Looks good to me! I just made one comment on the global variables added that are actually functions. I wasn't sure if that was the convention for functions called in the package.

@cristinamullin (Collaborator)

> Looks good to me! I just made one comment on the global variables added that are actually functions. I wasn't sure if that was the convention for functions called in the package.

Ah okay, I didn't realize those were functions. In that case, we need to find where these are used and specify the package. It looks like last_col is a function in both tidyselect and dplyr (e.g., update to "dplyr::last_col"?). Where is summ used?
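For reference, a hedged sketch of the disambiguation being suggested (the relocate call is illustrative, not the package's OrderTADACols code):

# last_col() is exported by both tidyselect and dplyr, so qualify it inside
# a package. Example: move all TADA columns to the end of the dataframe.
df <- dplyr::relocate(df, dplyr::starts_with("TADA."), .after = dplyr::last_col())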

@cristinamullin cristinamullin (Collaborator) left a comment

Looks great. Let's try to get rid of the remaining notes (truncated text and lat/long numeric issue in autoclean).
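One possible way to clear the lat/long note, as a sketch (assumes the WQP columns are named LatitudeMeasure/LongitudeMeasure here, and that autoclean creates TADA versions rather than overwriting the originals):

# Coerce coordinates to numeric in new TADA columns so downstream math works.
.data$TADA.LatitudeMeasure <- as.numeric(.data$LatitudeMeasure)
.data$TADA.LongitudeMeasure <- as.numeric(.data$LongitudeMeasure)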

print("This dataframe is empty because we did not find any invalid fraction/characteristic combinations in your dataframe")
empty.data <- dplyr::filter(check.data, WQX.SampleFractionValidity == "Invalid")
empty.data <- dplyr::select(empty.data, -WQX.SampleFractionValidity)
empty.data = OrderTADACols(empty.data)
@cristinamullin (Collaborator) commented:

Let's discuss the logic for all these flag functions:

When flagging: if the input dataframe does not have any issues that require flagging, I believe the function returns the input dataframe unchanged, with a note saying so. In that case, the output does not include the flag column, since it would contain only NAs. For tracking purposes, do you think it would make sense to still include the flag column in the output, even if it would be all NAs (no flags)?

When flagging "errors only", it works a little differently. In that case the function will return an empty dataframe (also with no flag column).

#' TADAProfile <- TADAdataRetrieval(statecode = "UT",
#' characteristicName = c("Ammonia", "Nitrate", "Nitrogen"),
#' startDate = "10-01-2020")
# TADAProfile <- TADAdataRetrieval(statecode = "UT",
@cristinamullin (Collaborator) commented:

Can you please update these in the Shiny data.R and data folder as well for consistency (in a new PR/branch on the Shiny repo)? https://github.com/USEPA/TADAShiny/tree/develop/data

@ehinman ehinman (Contributor, Author) commented Feb 28, 2023:

Can do. But one thing: I have not yet updated WaterTemp_US because I haven't remembered to set it to run overnight. It seems like a huge data pull. I will set this to run tonight.

@@ -0,0 +1,47 @@
#' Identify Potential Duplicate Data Uploads
@cristinamullin (Collaborator) commented:

Looks good. Is this a replacement for the original potential duplicates function, or in addition to it?

Here is some context/notes to consider for inclusion in the function description or vignette:

Notes: each unique result has a unique ResultIdentifier that should be included in all the result profiles. That should be all you need for exact duplicates from the same submitting org, unless two different organizations submitted the same exact dataset to WQP (with only the org info being different). In addition to the org info, the site information (or anything else org-specific) could also differ when two or more orgs submit the same dataset; in that scenario, the same data are made available under different but duplicate or nearby sites created separately by the submitting orgs.

There are a lot of cases where a state and the USGS (state) Water Science Center are both submitting the same datasets. We may be able to address this in the future through better communication between the states and USGS about who is submitting the data to WQP, but it might be more complicated than a communication issue: some states may feel they need (or may even have state requirements) to have the data linked to their own org sites and org names for use in their CWA assessments.
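A sketch of how cross-organization duplicates could be surfaced, grouping on result content while ignoring org-specific columns (the grouping columns are assumptions, not the package's):

# Flag rows where more than one organization reported the same value for
# the same characteristic at the same date/time.
grouped <- dplyr::group_by(df, CharacteristicName, ActivityStartDate,
                           ActivityStartTime.Time, ResultMeasureValue)
dups <- dplyr::filter(grouped, dplyr::n_distinct(OrganizationIdentifier) > 1)
dups <- dplyr::ungroup(dups)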

@ehinman (Contributor, Author) commented:

I was having trouble finding an instance where the only thing different in a duplicate dataset was the organization metadata, but we can certainly keep both or even combine them. This one is definitely still in development and needs tweaking. It should create a distance matrix between all sites with suspected duplicates and flag only those within X meters of one another that have identical sample results for the same characteristic at the same date and time.
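A sketch of the distance-matrix step described above (assumes the geosphere package, numeric lat/long, and WQP-style column names; the 100 m threshold stands in for the unspecified X):

# Pairwise distances (in meters) between candidate duplicate sites.
coords <- unique(df[, c("MonitoringLocationIdentifier",
                        "LongitudeMeasure", "LatitudeMeasure")])
dmat <- geosphere::distm(coords[, c("LongitudeMeasure", "LatitudeMeasure")])
# Site pairs within the threshold; upper.tri drops self-pairs and mirrors.
near <- which(upper.tri(dmat) & dmat < 100, arr.ind = TRUE)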

@cristinamullin (Collaborator)

> Looks good to me! I just made one comment on the global variables added that are actually functions. I wasn't sure if that was the convention for functions called in the package.

> Ah okay, I didn't realize those were functions. In that case, we need to find where these are used and specify the package. It looks like last_col is a function in both tidyselect and dplyr (e.g., update to "dplyr::last_col"?). Where is summ used?

@ehinman I kept summ but removed last_col here and updated it to dplyr::last_col where it is used.

@cristinamullin cristinamullin marked this pull request as ready for review February 28, 2023 17:05
@cristinamullin cristinamullin self-requested a review February 28, 2023 17:06
@ehinman ehinman merged commit 1277bf3 into develop Feb 28, 2023
@ehinman ehinman deleted the censored_data_eh branch February 28, 2023 17:09
@cristinamullin cristinamullin restored the censored_data_eh branch February 28, 2023 18:16
@cristinamullin cristinamullin deleted the censored_data_eh branch February 28, 2023 18:17
@ehinman ehinman linked an issue Feb 28, 2023 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

- ConvertDepthUnits Issue
- Ordering added columns
