Censored data eh #217
Conversation
- Created CensoredDataSuite file and started adding code to discern non-detects (NDs).
- Added a duplicates file that is under development.
- Edited ConvertResultUnits to accommodate the new columns.
- Edited autoclean to use the new special characters function.
- Updated the special characters function to handle input columns that are already numeric; it now creates new TADA columns.
Bug note
- Result Detection Condition Text
- Detection Quantitation Limit Type Name
These functions fetch the domain table, update the self-contained csv, and include a test (in the tests folder) that checks whether the csv is up to date.
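For reference, a minimal sketch of that fetch/update/test pattern; the URL, cache path, and function name below are illustrative placeholders, not the package's actual values:

```r
# Rough sketch of the domain-table refresh pattern described above.
UpdateDetCondRef <- function(url = "https://example.com/ResultDetectionConditionText.CSV",  # hypothetical URL
                             cache = "inst/extdata/DetCondRef.csv") {                       # hypothetical path
  fresh <- utils::read.csv(url, stringsAsFactors = FALSE)  # fetch the current domain table
  utils::write.csv(fresh, cache, row.names = FALSE)        # refresh the self-contained csv
  invisible(fresh)
}

# A companion test can then flag a stale cache, e.g. with testthat:
# testthat::expect_identical(utils::read.csv(cache), utils::read.csv(url))
```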
Started building censored data functions that identify censored data and provide simple methods for filling in censored data values.
- The convert units and censored data ID functions now move detection limit values and units into TADA.ResultMeasureValue and TADA.ResultMeasure.MeasureUnitCode.
- The change above removes the need to run the convert units function on both the result value/unit and the detection limit value/unit, so those sections are commented out. When run with those sections included, the function does not have the intended effect, because the unit reference joins only to the result value unit, not the detection limit unit; if the result value unit is NA, the detection limit unit is not converted.
- Changed language in the flag columns.
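A hedged sketch of that move, assuming the standard WQP detection limit column names (DetectionQuantitationLimitMeasure.MeasureValue and .MeasureUnitCode) and that the value columns have already been cleaned to compatible types:

```r
library(dplyr)

# Where the result value is missing (censored), carry the detection limit
# value and unit into the TADA columns so one unit-conversion pass covers both.
dat <- dat %>%
  mutate(
    TADA.ResultMeasureValue = if_else(
      is.na(TADA.ResultMeasureValue),
      DetectionQuantitationLimitMeasure.MeasureValue,
      TADA.ResultMeasureValue
    ),
    TADA.ResultMeasure.MeasureUnitCode = if_else(
      is.na(TADA.ResultMeasure.MeasureUnitCode),
      DetectionQuantitationLimitMeasure.MeasureUnitCode,
      TADA.ResultMeasure.MeasureUnitCode
    )
  )
```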
Added TADA columns in autoclean and started editing the harmonization vignette to accommodate the change.
- Moved the meters-to-m conversion to the ConvertDepthUnits function.
- Removed the relocation of TADA columns.
Made summarizecharacteristics no longer column-specific; it will now summarize any column.
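Something along these lines (a sketch; the actual function name and signature may differ):

```r
library(dplyr)

# Summarize counts for any column the user names, not just characteristics.
SummarizeAnyColumn <- function(df, col = "TADA.CharacteristicName") {
  df %>%
    group_by(.data[[col]]) %>%
    summarise(Count = n(), .groups = "drop") %>%
    arrange(desc(Count))
}

# e.g. SummarizeAnyColumn(TADAProfile, "TADA.CharacteristicGroup")
```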
Added a period between TADA and CharacteristicGroup to ensure it matches other column conventions.
- Any function performing operations on the WQP dataframe now creates new columns to support work/conversions, rather than changing content in the original columns.
- Moved the meters-to-m conversion to ConvertDepthUnits.
- Created a new function to rearrange TADA and WQX columns to the end of the dataframe; will implement in a new commit.
- Updated the vignette with the new convention and tested with AK data.
- ConvertDepthUnits now creates TADA columns via ConvertSpecChars for depth measures that join to the unit reference table.
- Fixed a warning in OrderTADACols that calling a vector of column names is ambiguous.
- Fixed an issue with unit conversion when the result value field is NA, using ifelse (see the sketch below).
- Changed the harmonization reference table join, because the previous code was trying to join on columns not present in the harm.raw object.
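A minimal sketch of the NA-safe conversion; WQX.ConversionFactor is an illustrative column name, not necessarily the one used in the package:

```r
# NA-safe unit conversion: rows with an NA result value stay NA rather than
# producing misleading output in the converted column.
dat$TADA.ResultMeasureValue <- ifelse(
  is.na(dat$TADA.ResultMeasureValue),
  NA_real_,
  dat$TADA.ResultMeasureValue * dat$WQX.ConversionFactor
)
```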
- Changed the half-detection-limit method to a multiplier argument, so any value can be multiplied by the detection limit value.
maps is required for the vignette even if it is not required by the package.
Added a water filter to the vignette, added a start date to the DR test (runs faster now), reviewed autoclean (added comments within the function for future development), namespaced runif as stats::runif within simpleCensoredMethods, and added stats and maps back as imports in the DESCRIPTION file.
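A minimal sketch of the simple censored-data fills these commits describe (the multiplier method above and the stats::runif draw below the limit); function and argument names are assumptions, not the package's actual signature:

```r
FillCensoredValue <- function(detlimit,
                              method = c("multiplier", "randombelowlimit"),
                              multiplier = 0.5) {
  method <- match.arg(method)
  if (method == "multiplier") {
    detlimit * multiplier  # any multiple of the detection limit, not just half
  } else {
    stats::runif(length(detlimit), min = 0, max = 1) * detlimit  # random value below the limit
  }
}

FillCensoredValue(c(0.1, 0.25), method = "multiplier", multiplier = 0.5)
#> [1] 0.050 0.125
```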
ehinman
left a comment
Looks good to me! I just made one comment on the global variables added that are actually functions. I wasn't sure if that was the convention for functions called in the package.
Ah okay, I didn't realize those were functions. In that case, we need to find where these are used and specify the package. It looks like last_col is a function in both tidyselect and dplyr (e.g., update to "dplyr::last_col"?). Where is summ used?
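For example, something like this wherever the column reordering happens (an illustrative call, not the exact package code):

```r
# Fully qualifying last_col() lets R CMD check resolve it without a
# utils::globalVariables() entry.
dat <- dplyr::relocate(dat, dplyr::starts_with("TADA."), .after = dplyr::last_col())
```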
cristinamullin
left a comment
Looks great. Let's try to get rid of the remaining notes (truncated text and lat/long numeric issue in autoclean).
```r
print("This dataframe is empty because we did not find any invalid fraction/characteristic combinations in your dataframe")
empty.data <- dplyr::filter(check.data, WQX.SampleFractionValidity == "Invalid")
empty.data <- dplyr::select(empty.data, -WQX.SampleFractionValidity)
empty.data <- OrderTADACols(empty.data)
```
Let's discuss the logic for all these flag functions:
When flagging: if the input dataframe does not have any issues that require flagging, then I believe the function returns the input dataframe unchanged, with a note saying so. In that case the output does not include the flag column, since it would contain only NAs. For tracking purposes, do you think it would make sense to still include the flag column in the output, even if it would be all NAs (no flags)?
When flagging "errors only", it works a little differently: in that case the function returns an empty dataframe (also with no flag column).
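To make the two behaviors concrete, a sketch of the pattern under discussion (the function name and the JoinValidityRef helper are illustrative, not the package's actual API):

```r
FlagInvalidFraction <- function(.data, errorsonly = FALSE) {
  # JoinValidityRef is a hypothetical helper that adds WQX.SampleFractionValidity
  check.data <- JoinValidityRef(.data)
  if (!any(check.data$WQX.SampleFractionValidity == "Invalid", na.rm = TRUE)) {
    print("No invalid fraction/characteristic combinations found.")
    return(.data)  # input returned unchanged; no flag column in the output
  }
  if (errorsonly == TRUE) {
    # "errors only": return just the flagged rows (empty if none exist)
    return(dplyr::filter(check.data, WQX.SampleFractionValidity == "Invalid"))
  }
  check.data  # full dataframe with the flag column
}
```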
```r
#' TADAProfile <- TADAdataRetrieval(statecode = "UT",
#'   characteristicName = c("Ammonia", "Nitrate", "Nitrogen"),
#'   startDate = "10-01-2020")
# TADAProfile <- TADAdataRetrieval(statecode = "UT",
```
Can you please update these in the Shiny data.R and data folder as well for consistency (in a new PR/branch on the Shiny repo)? https://github.com/USEPA/TADAShiny/tree/develop/data
Can do. But one thing: I have not yet updated WaterTemp_US because I haven't remembered to set it to run overnight. It seems like a huge data pull. I will set this to run tonight.
```
@@ -0,0 +1,47 @@
#' Identify Potential Duplicate Data Uploads
```
Looks good. Is this a replacement for the original potential duplicates function, or in addition to it?
Here is some context/notes to consider for inclusion in the function description or vignette:
Notes: each unique result has a unique ResultIdentifier that should be included in all the result profiles. That should be all you need (for exact duplicates from the same submitting org), unless two different organizations submitted the same exact dataset to WQP (with only the org info being different). In addition to the org info, the site information (or really anything else org-specific) could also be different if the two (or more) orgs are submitting the same dataset. In this scenario, the same datasets are being made available under different, but duplicate or nearby, sites created separately by the submitting orgs.
There are a lot of cases where the state and the USGS (state) Water Science Center are both submitting the same datasets. This is something we may be able to address in the future through better communication between the states and USGS on who is submitting the data to WQP (but it might be more complicated than just a communication issue... it is possible some states feel they need (or maybe even have state requirements) to have the data linked to their orgs' sites and org names for use in their CWA assessments).
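For the cross-org case, a rough sketch of one way to surface candidates (the grouping columns are chosen for illustration):

```r
library(dplyr)

# Rows identical on result content but submitted by more than one organization
# are candidate cross-org duplicates. ResultIdentifier alone cannot catch
# these, since each org's upload gets its own identifiers.
cross_org_dupes <- dat %>%
  group_by(ActivityStartDate, ActivityStartTime.Time,
           CharacteristicName, ResultMeasureValue) %>%
  filter(n_distinct(OrganizationIdentifier) > 1) %>%
  ungroup()
```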
I was having trouble finding an instance where the only thing different in a duplicate dataset was the organization metadata, but we can certainly keep both or even combine them. This one is definitely still in development and needs tweaking. It should create a distance matrix between all sites with suspected duplicates and only flag those that are within X meters of one another with identical sample results for the same characteristic at the same date and time.
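A sketch of that proximity screen; the 100 m threshold stands in for the undecided "X meters", and it assumes numeric WQP site columns (LatitudeMeasure/LongitudeMeasure) joined onto the data:

```r
# Great-circle (haversine) distance in meters between coordinate vectors.
haversine_m <- function(lat1, lon1, lat2, lon2, R = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

sites <- unique(dat[, c("MonitoringLocationIdentifier", "LatitudeMeasure", "LongitudeMeasure")])
n <- nrow(sites)
dmat <- outer(seq_len(n), seq_len(n), function(i, j) {
  haversine_m(sites$LatitudeMeasure[i], sites$LongitudeMeasure[i],
              sites$LatitudeMeasure[j], sites$LongitudeMeasure[j])
})
threshold_m <- 100  # illustrative stand-in for "X meters"
near_pairs <- which(dmat < threshold_m & upper.tri(dmat), arr.ind = TRUE)  # site pairs to check
```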
@ehinman I kept summ but removed last_col here and updated this to dplyr::last_col where used
This is a big pull request. In a nutshell, I created new censored data functions and converted the TADA package framework to rely upon TADA-generated columns for all operations to preserve original column contents.