@kathryn-willi kathryn-willi commented Apr 22, 2025

This PR combines two major updates to the TADA package:

1. ATTAINS Function Enhancements

  • fetch_ATTAINS() Iterative Clustering for Large Spatial Areas: Implemented a new {dbscan} clustering approach that partitions large spatial areas into smaller clusters, allowing ATTAINS features to be fetched more efficiently. This improves performance when loading features over large extents.

  • TADA_GetATTAINS() Enhancements: Added a distance column that reports the distance (in meters) between each WQP observation and intersecting ATTAINS features within its catchment.

  • Introduced a new argument, return_nearest. When set to TRUE, the function returns only the nearest ATTAINS feature to each WQP observation. When FALSE, it returns all features in the same catchment (i.e., the previous behavior).

  • Replaced the index column with ResultIdentifier to more clearly and consistently handle cases where multiple ATTAINS features relate to a single WQP observation, per Cristina’s feedback (closes TADA_GetATTAINS index usability #539).

  • Testing & Tidying: Added new {testthat} functions to validate the updated geospatial behavior. General code tidying and improvements to documentation/examples across the affected functions.
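To make the return_nearest behavior concrete, here is a minimal base R sketch of the idea, not the package's implementation. The data frame and the feature_id column are hypothetical; ResultIdentifier and TADA.DistanceAway.Meters are names that appear in this PR, and their use here is an assumption about the output layout.

```r
# Hypothetical join result: one row per WQP observation x ATTAINS feature pair.
# feature_id and all values are invented for illustration.
joined <- data.frame(
  ResultIdentifier = c("R-001", "R-001", "R-002", "R-003", "R-003", "R-003"),
  feature_id = c("AU-10", "AU-11", "AU-10", "AU-20", "AU-21", "AU-22"),
  TADA.DistanceAway.Meters = c(125.4, 12.0, 300.2, 48.9, 51.3, 7.5),
  stringsAsFactors = FALSE
)

# Keep only the closest feature for each observation, which is what
# return_nearest = TRUE is described as doing.
keep_nearest <- function(df) {
  # Sort by observation, then by distance, so the nearest feature comes first.
  ordered <- df[order(df$ResultIdentifier, df$TADA.DistanceAway.Meters), ]
  # Keep the first row of each ResultIdentifier group.
  nearest <- ordered[!duplicated(ordered$ResultIdentifier), ]
  rownames(nearest) <- NULL
  nearest
}

nearest <- keep_nearest(joined)
print(nearest)
```

Keying on ResultIdentifier rather than a positional index means each observation can be traced back to its source row even when several features match it.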

2. New NWIS Continuous Data Functions (partial fulfilment of #222)

  • TADA_listNWIS(): Returns a spatial summary of available USGS NWIS daily data based on site numbers, state codes, or a specified spatial area.
  • TADA_getNWIS(): Retrieves and tidies USGS NWIS daily values for selected sites, states, or areas using parameter codes and date range.
  • Included examples and basic usage in the Module2 vignette.
  • Added new {testthat} tests to verify both functions.

These NWIS functions fulfill the continuous data access functionality TADA was aiming for and provide a foundation for future enhancements.

Let me know if anything is unclear or if you'd like changes to the implementation, documentation, or examples. Looking forward to your feedback. 🙂

Thanks!
Katie

kathryn-willi and others added 27 commits March 26, 2025 17:00
Functions for NWIS continuous data
dplyr::group_by(site_no) %>%
dplyr::summarize(
parameters = paste(unique(parameter), collapse = "; "),
parameter_codes = paste(unique(parameter_code), collapse = "; ")

When we meet this week, let's chat more about TADA comparable data IDs, USGS p codes, and USGS observed properties in this context of bringing in the USGS continuous data & integrating it with the WQP data (compatibility).

I'd be happy to help work on WQP/USGS continuous data compatibility issues.

@hillarymarler hillarymarler left a comment

The increase in efficiency for TADA_GetATTAINS is great, as are the options to return only the nearest feature and to use ResultIdentifier to track instances of duplicate rows. The continuous data functions worked well for a variety of test queries I tried. Looking forward to discussing how to further integrate them with the EPATADA workflow as a future effort.

@cristinamullin

There are some global variable notes. We can address these by removing the variable using rm() once the variable is no longer needed.

   TADA_GetATTAINS: no visible binding for global variable 'count'
   TADA_GetATTAINS : find_distances: no visible binding for global
     variable 'TADA.DistanceAway.Meters'
   TADA_GetATTAINS: no visible binding for global variable
     'TADA.DistanceAway.Meters'
   TADA_GetATTAINS: no visible global function definition for
     'st_drop_geometry'
   TADA_getNWIS: no visible binding for global variable 'dec_long_va'
   TADA_getNWIS: no visible binding for global variable 'dec_lat_va'
   TADA_getNWIS: no visible binding for global variable 'site_no'
   TADA_getNWIS: no visible binding for global variable 'agency_cd'
   TADA_getNWIS: no visible binding for global variable 'Date'
   TADA_getNWIS: no visible binding for global variable 'NWIS.parameter'
   TADA_getNWIS: no visible binding for global variable 'NWIS.value'
   TADA_getNWIS: no visible binding for global variable 'NWIS.status'
   TADA_listNWIS : pcodes: no visible binding for global variable
     'parameter_code'
   TADA_listNWIS: no visible binding for global variable 'dec_long_va'
   TADA_listNWIS: no visible binding for global variable 'dec_lat_va'
   TADA_listNWIS: no visible binding for global variable 'site_no'
   TADA_listNWIS: no visible binding for global variable 'station_nm'
   TADA_listNWIS: no visible binding for global variable 'site_type'
   TADA_listNWIS: no visible binding for global variable 'site_tp_cd'
   TADA_listNWIS: no visible binding for global variable 'data_type'
   TADA_listNWIS: no visible binding for global variable 'data_type_cd'
   TADA_listNWIS: no visible binding for global variable
     'parameter_name_description'
   TADA_listNWIS: no visible binding for global variable 'parm_cd'
   TADA_listNWIS: no visible binding for global variable 'count_nu'
   TADA_listNWIS: no visible binding for global variable 'begin_date'
   TADA_listNWIS: no visible binding for global variable 'end_date'
   fetchATTAINS : perform_iterative_clustering : bbox_area: no visible
     binding for global variable 'cluster'
   fetchATTAINS : perform_iterative_clustering : split_clusters_by_area:
     no visible binding for global variable 'cluster'
   fetchATTAINS: no visible binding for global variable 'cluster'
   Undefined global functions or variables:
     Date NWIS.parameter NWIS.status NWIS.value TADA.DistanceAway.Meters
     agency_cd begin_date cluster count count_nu data_type data_type_cd
     dec_lat_va dec_long_va end_date parameter_code
     parameter_name_description parm_cd site_no site_tp_cd site_type
     st_drop_geometry station_nm
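For context, "no visible binding" notes like these usually come from column names used unquoted inside dplyr-style calls. One widely used option, sketched here as a possibility rather than as what this PR does, is to declare the names with utils::globalVariables() in a package source file. The selection of names below is taken from the notes above; the variable name nwis_globals and the suggested file location are assumptions.

```r
# Typically placed in a package source file such as R/globals.R
# (the file name is a convention, not something this PR specifies).
nwis_globals <- c(
  "site_no", "agency_cd", "dec_lat_va", "dec_long_va",
  "parameter_code", "parm_cd", "count_nu", "begin_date", "end_date",
  "cluster", "TADA.DistanceAway.Meters"
)

# Registers the names so that codetools-based checks (as run by R CMD check)
# treat them as known globals instead of reporting them.
utils::globalVariables(nwis_globals)
```

Two related options: referring to columns through the rlang `.data` pronoun (for example `.data$site_no`) avoids the note without a global declaration, and the "no visible global function definition for 'st_drop_geometry'" note is normally resolved by calling sf::st_drop_geometry() with its namespace or by importing it.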

cristinamullin commented May 1, 2025

@kathryn-willi A few of these look the same except for the total N. Do you know if this is mostly duplicate data? Curious if this is a common issue we would want to handle before analysis (making sure we don't have dups)

# Example 2: Query by specific site numbers
site_nums <- c("11530500", "11532500")
sites_specific <- TADA_listNWIS(sites = site_nums)

[Image: screenshot of the query output]

#' features (or "rows" in the sf data frame) must be under 118,078 square miles
#' (roughly the area of Nevada).
#' @param statecode Character vector of two-letter state codes (e.g., c("CA", "OR")).
#' @param sites Character vector of USGS site numbers.

I suggest changing sites to siteid for consistency with other TADA data retrieval functions. I am working on this now.

sites to siteid
@kathryn-willi

site_nums <- c("11530500", "11532500")
sites_specific <- TADA_listNWIS(sites = site_nums)

Ooof, this, I think, is exposing my ignorance of non-flow USGS data. I bet this has to do with different statistics being collected (e.g., both the max and the mean). I will make the necessary changes to the functions to ensure we capture the statistical information in both listNWIS and getNWIS! Please stay tuned for another commit from me...

kathryn-willi commented May 1, 2025

@cristinamullin - it was in fact that there were multiple statistics being published for the same site/parameter combos. (The "mean" value is the default and by far the most common statistic for flow data, hence why I didn't catch this earlier. My apologies!) I have incorporated code to 1) list all available statistics in TADA_listNWIS(), and 2) allow the user to select which statistic(s) to download in TADA_getNWIS().
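A small base R sketch of why such rows look like duplicates, and how a statistic column separates them. The counts are made up, and the stat_cd column name is an assumption about how the statistic is exposed; the site numbers come from the example query, and the statistic codes (00001 maximum, 00002 minimum, 00003 mean) are standard USGS codes.

```r
# Hypothetical listing with invented counts. Without stat_cd, the first three
# rows would differ only in their totals.
listing <- data.frame(
  site_no = c("11530500", "11530500", "11530500", "11532500"),
  parm_cd = c("00010", "00010", "00010", "00060"),
  stat_cd = c("00001", "00002", "00003", "00003"),
  count_nu = c(4100, 4098, 4102, 32000),
  stringsAsFactors = FALSE
)

# USGS statistic codes: 00001 = maximum, 00002 = minimum, 00003 = mean.
stat_labels <- c("00001" = "maximum", "00002" = "minimum", "00003" = "mean")
listing$statistic <- unname(stat_labels[listing$stat_cd])

# Count how many statistics are published per site/parameter combination.
n_stats <- aggregate(stat_cd ~ site_no + parm_cd, data = listing, FUN = length)
names(n_stats)[names(n_stats) == "stat_cd"] <- "n_statistics"

# Selecting a single statistic (here the mean) leaves one row per
# site/parameter combination.
mean_only <- listing[listing$stat_cd == "00003", ]

print(n_stats)
print(mean_only)
```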

On another topic - I notice that many of the checks fail when I submit PRs, and I'm not sure how to keep this from happening; sorry for the inconvenience! For future PRs I'd love to make sure that doesn't continue to happen. If discussing how to ensure the tests pass would be something easier to discuss over a call please let me know.

@cristinamullin

Did you consider using the USGS DV service vs. the statistics service? It sounds like the main difference is that the DV service provides provisional results (most recent values) while the statistics service only provides final results. We may want to chat with them to confirm which is most appropriate for TADA use cases.

"Please note that most recent data are marked provisional, so these data should be interpreted with caution as it is possible (although unlikely) to be incorrect. See the USGS Provisional Data Disclaimer page for more information."

https://waterservices.usgs.gov/docs/

Daily Values
Daily values are summarized data about our nation’s streams, springs, lakes and wells derived from regular time-series equipment at these sites. Daily data available for USGS water sites include mean, median, maximum, minimum, and/or other derived values. Many sites have periods of record for a decade or more. This service allows you to find daily values for time-series sites, both current and historical, using a number of flexible filters.

Statistics
Retrieve daily, monthly or annual statistics for sites. Statistics are provided on approved data only for time-series sites. Statistics are available for any parameter on these sites with approved data. Statistics include mean, minimum, maximum, and various percentiles.

@cristinamullin cristinamullin merged commit f14b001 into USEPA:develop May 2, 2025
3 of 7 checks passed
@kathryn-willi

Yes! We could definitely swap to a statistics call... However, as-is, TADA_getNWIS()'s NWIS.status column contains the information about whether the data is provisional or approved. So, the user could filter to only approved data if they wanted. In a future PR, I could modify the codes to make them more clear (A=approved, P=provisional, etc.) and add some deeper clarifying information about what that column is?
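As a rough illustration of the filtering described here, in base R with made-up values. NWIS.status, NWIS.value, Date, and site_no are column names that appear in this PR's check notes; the status_label column and the data are hypothetical, and the code-to-label mapping covers only the two codes named above.

```r
# Hypothetical daily values; all numbers are invented for illustration.
daily <- data.frame(
  site_no = rep("11530500", 4),
  Date = as.Date(c("2025-04-01", "2025-04-02", "2025-04-03", "2025-04-04")),
  NWIS.value = c(12.1, 12.4, 12.9, 13.0),
  NWIS.status = c("A", "A", "P", "P"),
  stringsAsFactors = FALSE
)

# Readable labels for the two codes named in the comment.
status_labels <- c(A = "approved", P = "provisional")
daily$status_label <- unname(status_labels[daily$NWIS.status])

# Keep approved values only; the provisional (most recent) rows are dropped.
approved <- daily[daily$NWIS.status == "A", ]

print(approved)
```

Real status fields may carry additional qualifier flags alongside A or P, so an exact match like this could be too strict; checking the distinct values in the column first would be a sensible precaution.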

@cristinamullin

Sounds good. I like the idea of keeping the most recent data.

Development

Successfully merging this pull request may close these issues.

  • TADA_GetATTAINS returning "..no ATTAINS catchment associated with these WQP observations..."
  • TADA_GetATTAINS index usability

3 participants