Updates to ATTAINS fxns + new USGS continuous data fxns #589

kathryn-willi · 2025-04-22T18:57:24Z

This PR combines two major updates to the TADA package:

1. ATTAINS Function Enhancements

fetchATTAINS() Iterative Clustering for Large Spatial Areas: Implemented a new {dbscan} clustering approach that partitions large spatial areas into smaller clusters so that ATTAINS features can be fetched more efficiently. This improves performance when loading features over large extents.
TADA_GetATTAINS() Enhancements: Added a distance column that reports the distance (in meters) between each WQP observation and intersecting ATTAINS features within its catchment.
Introduced a new argument, return_nearest. When set to TRUE, the function returns only the nearest ATTAINS feature to each WQP observation. When FALSE, it returns all features in the same catchment (i.e., the previous behavior).
Replaced the index column with ResultIdentifier to more clearly and consistently handle cases where multiple ATTAINS features relate to a single WQP observation, per Cristina’s feedback (closes TADA_GetATTAINS index usability #539).
Testing & Tidying: Added new {testthat} functions to validate the updated geospatial behavior. General code tidying and improvements to documentation/examples across the affected functions.
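
To make the return_nearest idea concrete, here is a minimal base R sketch. It is not the package's implementation. It assumes a table with one row per observation/feature pair; the feature_id column and all values are hypothetical toy data, while ResultIdentifier and TADA.DistanceAway.Meters are names that appear elsewhere in this PR.

```r
# Toy matches between WQP observations and ATTAINS features (made-up values).
matches <- data.frame(
  ResultIdentifier = c("R1", "R1", "R2", "R3", "R3"),
  feature_id = c("AU-01", "AU-02", "AU-03", "AU-04", "AU-05"), # hypothetical column
  TADA.DistanceAway.Meters = c(120, 35, 0, 410, 980),
  stringsAsFactors = FALSE
)

keep_nearest <- function(df) {
  # Sort by observation, then by distance, so the nearest match comes first.
  ordered <- df[order(df$ResultIdentifier, df$TADA.DistanceAway.Meters), ]
  # Keep only the first (nearest) row for each ResultIdentifier.
  nearest <- ordered[!duplicated(ordered$ResultIdentifier), ]
  rownames(nearest) <- NULL
  nearest
}

# Analogous to return_nearest = TRUE: one row per observation.
nearest_only <- keep_nearest(matches)
# Analogous to return_nearest = FALSE: every match is kept.
all_matches <- matches
```

Keying on ResultIdentifier rather than a positional index means the one-to-many relationship stays traceable even after rows are reordered or filtered.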

2. New NWIS Continuous Data Functions (partial fulfilment of #222)

TADA_listNWIS(): Returns a spatial summary of available USGS NWIS daily data based on site numbers, state codes, or a specified spatial area.
TADA_getNWIS(): Retrieves and tidies USGS NWIS daily values for selected sites, states, or areas using parameter codes and date range.
Included examples and basic usage in the Module2 vignette.
Added new {testthat} tests to verify both functions.

These NWIS functions fulfill the continuous data access functionality TADA was aiming for and provide a foundation for future enhancements.

Let me know if anything is unclear or if you'd like changes to the implementation, documentation, or examples. Looking forward to your feedback. 🙂

Thanks!
Katie

Co-authored-by: Matt Brousil <[email protected]>

Functions for NWIS continuous data

Co-authored-by: Matt Brousil <[email protected]>

big updates

build getATTAINS

cristinamullin · 2025-04-28T21:43:52Z

vignettes/TADAModule2.Rmd

+  dplyr::group_by(site_no) %>%
+  dplyr::summarize(
+    parameters = paste(unique(parameter), collapse = ";  "),
+    parameter_codes = paste(unique(parameter_code), collapse = ";  ")


When we meet this week, let's chat more about TADA comparable data IDs, USGS p codes, and USGS observed properties in the context of bringing in the USGS continuous data and integrating it with the WQP data (compatibility).

I'd be happy to help work on WQP/USGS continuous data compatibility issues.

hillarymarler

The increase in efficiency for TADA_GetATTAINS is great, as are the options to include only the nearest true feature and to use ResultIdentifier to track instances of duplicate rows. The continuous data functions worked well for a variety of test queries I tried them with - looking forward to discussing how to further integrate them w/ the EPATADA workflow as a future effort.

R/ContinuousDataFunctions.R

cristinamullin · 2025-05-01T18:28:52Z

There are some global variable notes. We can address these by removing the variable using rm() once the variable is no longer needed.

   TADA_GetATTAINS: no visible binding for global variable 'count'
   TADA_GetATTAINS : find_distances: no visible binding for global
     variable 'TADA.DistanceAway.Meters'
   TADA_GetATTAINS: no visible binding for global variable
     'TADA.DistanceAway.Meters'
   TADA_GetATTAINS: no visible global function definition for
     'st_drop_geometry'
   TADA_getNWIS: no visible binding for global variable 'dec_long_va'
   TADA_getNWIS: no visible binding for global variable 'dec_lat_va'
   TADA_getNWIS: no visible binding for global variable 'site_no'
   TADA_getNWIS: no visible binding for global variable 'agency_cd'
   TADA_getNWIS: no visible binding for global variable 'Date'
   TADA_getNWIS: no visible binding for global variable 'NWIS.parameter'
   TADA_getNWIS: no visible binding for global variable 'NWIS.value'
   TADA_getNWIS: no visible binding for global variable 'NWIS.status'
   TADA_listNWIS : pcodes: no visible binding for global variable
     'parameter_code'
   TADA_listNWIS: no visible binding for global variable 'dec_long_va'
   TADA_listNWIS: no visible binding for global variable 'dec_lat_va'
   TADA_listNWIS: no visible binding for global variable 'site_no'
   TADA_listNWIS: no visible binding for global variable 'station_nm'
   TADA_listNWIS: no visible binding for global variable 'site_type'
   TADA_listNWIS: no visible binding for global variable 'site_tp_cd'
   TADA_listNWIS: no visible binding for global variable 'data_type'
   TADA_listNWIS: no visible binding for global variable 'data_type_cd'
   TADA_listNWIS: no visible binding for global variable
     'parameter_name_description'
   TADA_listNWIS: no visible binding for global variable 'parm_cd'
   TADA_listNWIS: no visible binding for global variable 'count_nu'
   TADA_listNWIS: no visible binding for global variable 'begin_date'
   TADA_listNWIS: no visible binding for global variable 'end_date'
   fetchATTAINS : perform_iterative_clustering : bbox_area: no visible
     binding for global variable 'cluster'
   fetchATTAINS : perform_iterative_clustering : split_clusters_by_area:
     no visible binding for global variable 'cluster'
   fetchATTAINS: no visible binding for global variable 'cluster'
   Undefined global functions or variables:
     Date NWIS.parameter NWIS.status NWIS.value TADA.DistanceAway.Meters
     agency_cd begin_date cluster count count_nu data_type data_type_cd
     dec_lat_va dec_long_va end_date parameter_code
     parameter_name_description parm_cd site_no site_tp_cd site_type
     st_drop_geometry station_nm
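
For context, notes like the ones above are produced by static code analysis during R CMD check, whereas rm() acts at run time. The sketch below shows approaches commonly used in R packages to address them; it is not necessarily what this PR ended up doing. The helper names and the station names are hypothetical; site_no and station_nm are taken from the note list.

```r
# Toy data (made-up station names).
sites <- data.frame(
  site_no = c("11530500", "11532500"),
  station_nm = c("Site A", "Site B"),
  stringsAsFactors = FALSE
)

# Option 1: refer to the column by string, so no bare symbol is left to flag.
pick_site_standard <- function(df, id) {
  df[df[["site_no"]] == id, , drop = FALSE]
}

# Option 2: keep non-standard evaluation but give the symbol a local binding.
pick_site_bound <- function(df, id) {
  site_no <- NULL # placeholder only; subset() still finds the column in df first
  subset(df, site_no == id)
}

# Option 3 (package level, shown as a comment because it belongs in package code):
# utils::globalVariables(c("site_no", "dec_lat_va", "dec_long_va"))
#
# For "no visible global function definition", the usual fix is a namespaced
# call such as sf::st_drop_geometry(x), or an @importFrom sf st_drop_geometry tag.

picked <- pick_site_bound(sites, "11532500")
```

Option 1 avoids the note entirely; options 2 and 3 only tell the checker that the symbol is intentional, so they are best reserved for column names used inside data-masking calls.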

cristinamullin · 2025-05-01T20:16:11Z

@kathryn-willi A few of these look the same except for the total N. Do you know if this is mostly duplicate data? Curious if this is a common issue we would want to handle before analysis (making sure we don't have dups)

# Example 2: Query by specific site numbers
site_nums <- c("11530500", "11532500")
sites_specific <- TADA_listNWIS(sites = site_nums)

cristinamullin · 2025-05-01T20:51:53Z

R/ContinuousDataFunctions.R

+#' features (or "rows" in the sf data frame) must be under 118,078 square miles
+#' (roughly the area of Nevada).
+#' @param statecode Character vector of two-letter state codes (e.g., c("CA", "OR")).
+#' @param sites Character vector of USGS site numbers.


Suggest changing sites to siteid for consistency with other TADA data retrieval functions. I am working on this now.

sites to siteid

kathryn-willi · 2025-05-01T22:10:40Z

site_nums <- c("11530500", "11532500")
sites_specific <- TADA_listNWIS(sites = site_nums)

Ooof, I think this is exposing my ignorance of non-flow USGS data. I bet this has to do with different statistical data being collected (e.g., both the max and mean). I will make the necessary changes to the functions to ensure we are capturing the statistical information related to both listNWIS and getNWIS! Please stay tuned for another commit from me...

kathryn-willi · 2025-05-01T23:53:14Z

@cristinamullin - it was in fact that there were multiple statistics being published for the same site/parameter combos. (The "mean" value is the default and by far the most common statistic for flow data, hence why I didn't catch this earlier. My apologies!) I have incorporated code to 1) list all available statistics in TADA_listNWIS(), and 2) allow the user to select which statistic(s) to download in TADA_getNWIS().

On another topic - I notice that many of the checks fail when I submit PRs, and I'm not sure how to keep this from happening; sorry for the inconvenience! For future PRs I'd love to make sure that doesn't continue to happen. If discussing how to ensure the tests pass would be something easier to discuss over a call please let me know.
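
The multiple-statistics situation described above can be illustrated with a small base R sketch. It assumes a listing with one row per site/parameter/statistic combination; the rows are made-up toy data, and stat_cd is an assumed column name. The USGS statistic codes used are 00001 (maximum), 00002 (minimum), and 00003 (mean).

```r
# Toy listing of available daily series (made-up rows).
available <- data.frame(
  site_no = c("11530500", "11530500", "11530500", "11532500"),
  parm_cd = c("00010", "00010", "00010", "00060"),
  stat_cd = c("00001", "00002", "00003", "00003"), # assumed column name
  stringsAsFactors = FALSE
)

# Count how many statistics exist for each site/parameter combination.
key <- paste(available$site_no, available$parm_cd, sep = "_")
n_stats <- table(key)

# Combinations that look like duplicates unless the statistic is considered.
multi <- names(n_stats)[as.vector(n_stats) > 1]

# Selecting a single statistic (here the mean) leaves one row per combination.
mean_only <- available[available$stat_cd == "00003", ]
```

Rows that differ only in stat_cd are distinct series rather than true duplicates, which is why selecting the statistic explicitly removes the apparent duplication.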

kathryn-willi · 2025-05-02T14:00:34Z

@kathryn-willi A few of these look the same except for the total N. Do you know if this is mostly duplicate data? Curious if this is a common issue we would want to handle before analysis (making sure we don't have dups)
# Example 2: Query by specific site numbers
site_nums <- c("11530500", "11532500")
sites_specific <- TADA_listNWIS(sites = site_nums)

cristinamullin · 2025-05-02T19:44:19Z

Did you consider using the USGS DV service vs. the statistics service? It sounds like the main difference is that the DV service provides provisional results (most recent values) while the statistics service only provides final results. We may want to chat with them to confirm which is most appropriate for TADA use cases.

"Please note that most recent data are marked provisional, so these data should be interpreted with caution as it is possible (although unlikely) to be incorrect. See the USGS Provisional Data Disclaimer page for more information."

https://waterservices.usgs.gov/docs/

Daily Values
Daily values are summarized data about our nation’s streams, springs, lakes and wells derived from regular time-series equipment at these sites. Daily data available for USGS water sites include mean, median, maximum, minimum, and/or other derived values. Many sites have periods of record for a decade or more. This service allows you to find daily values for time-series sites, both current and historical, using a number of flexible filters.

Statistics
Retrieve daily, monthly or annual statistics for sites. Statistics are provided on approved data only for time-series sites. Statistics are available for any parameter on these sites with approved data. Statistics include mean, minimum, maximum, mean and various percentiles.

kathryn-willi · 2025-05-02T20:52:09Z

Yes! We could definitely swap to a statistics call... However, as-is, TADA_getNWIS()'s NWIS.status column contains the information about whether the data is provisional or approved. So, the user could filter to only approved data if they wanted. In a future PR, I could modify the codes to make them more clear (A=approved, P=provisional, etc.) and add some deeper clarifying information about what that column is?
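
As a rough illustration of the filtering idea, here is a minimal base R sketch. It assumes the NWIS.status column holds the single-letter codes mentioned above (A = approved, P = provisional); real data may contain additional qualifiers, and the values below are made up.

```r
# Toy daily values (made-up numbers); column names follow the check notes in this PR.
daily <- data.frame(
  site_no = rep("11530500", 4),
  Date = as.Date(c("2025-04-28", "2025-04-29", "2025-04-30", "2025-05-01")),
  NWIS.value = c(10.2, 10.4, 10.1, 9.9),
  NWIS.status = c("A", "A", "P", "P"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup that spells out the codes for readability.
status_labels <- c(A = "approved", P = "provisional")
daily$status_label <- unname(status_labels[daily$NWIS.status])

# Keep only approved records; provisional rows stay available in `daily`.
approved <- daily[daily$NWIS.status == "A", ]
```

Filtering after download, rather than at the service level, keeps the most recent provisional values available for users who want them.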

cristinamullin · 2025-05-02T21:19:13Z

Sounds good. I like the idea of keeping the most recent data

kathryn-willi and others added 27 commits March 26, 2025 17:00

NWIS functions beginning

ef10802

continuous nwis data tests

41da3c2

start to vignette

8f4df5b

tweaks

497994e

editorial changes

b0e735c

add PR

08bcce6

Update R/ContinuousDataFunctions.R

d705d2e

Co-authored-by: Matt Brousil <[email protected]>

Update R/ContinuousDataFunctions.R

b1549b1

Co-authored-by: Matt Brousil <[email protected]>

Update R/ContinuousDataFunctions.R

1ae158d

Co-authored-by: Matt Brousil <[email protected]>

Update R/ContinuousDataFunctions.R

9810c5b

Co-authored-by: Matt Brousil <[email protected]>

Update R/ContinuousDataFunctions.R

3565cf0

Co-authored-by: Matt Brousil <[email protected]>

Update vignettes/TADAModule2.Rmd

7df89b5

Co-authored-by: Matt Brousil <[email protected]>

overhaul

8eaaaaa

text changes

96e30a3

Update R/ContinuousDataFunctions.R

9d79b3e

Co-authored-by: Matt Brousil <[email protected]>

Merge pull request #9 from kathryn-willi/develop

7ce9a3c

Functions for NWIS continuous data

big updates

48a0c47

Update R/GeospatialFunctions.R

c98287c

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

1ce90de

Co-authored-by: Matt Brousil <[email protected]>

Update R/GeospatialFunctions.R

ead6fcf

Co-authored-by: Matt Brousil <[email protected]>

logic desc

460a4b5

comments for find_distances

0fc7aa0

distance description

8640027

mod returh for get_attains

173ebb9

Merge pull request #10 from kathryn-willi/develop

5712792

big updates

more descriptions

8e1d3a4

Merge pull request #11 from kathryn-willi/develop

292e855

build getATTAINS

kathryn-willi requested review from cristinamullin, hillarymarler and wokenny13 April 22, 2025 18:59

cristinamullin added 2 commits April 24, 2025 18:00

spelling and style

d9acac6

fix test failures

942d7d3

cristinamullin reviewed Apr 28, 2025

hillarymarler approved these changes May 1, 2025

cristinamullin approved these changes May 1, 2025

states to statecode + added US territories

de0503a

fix notes and update TADA_GetATTAINS docs

60ca98e

cristinamullin reviewed May 1, 2025

TADA_getNWIS updates

c314f30

sites to siteid

kathryn-willi added 2 commits May 1, 2025 17:42

incorporate stat code

c4f550a

incorporate stat code

cf627f8

kathryn-willi closed this May 2, 2025

kathryn-willi reopened this May 2, 2025

cristinamullin added 5 commits May 2, 2025 10:35

Update test-ContinuousDataFunctions.R

3def98b

fix url issues

965ec10

remove url

cf66ac2

update refs and ex data

1e0aaff

Update TADA_listNWIS.Rd

ea16b7e

update docs

5613611

cristinamullin merged commit f14b001 into USEPA:develop May 2, 2025
3 of 7 checks passed

wokenny13 linked an issue May 5, 2025 that may be closed by this pull request

TADA_GetATTAINS returning "..no ATTAINS catchment associated with these WQP observations..." #583

Closed

wokenny13 mentioned this pull request Aug 21, 2025

troubleshoot 429 errors for functions relying on ATTAINS geospatial web services #636

Open
