
Conversation

@ehinman
Contributor

@ehinman ehinman commented Jan 30, 2023

  • Chunks by the site list generated by readWQPsummary, rather than by sites within each FIPS code (see the sketch after this list).
  • Changed statecode default to "null" to match other inputs.
  • Specified startDate and endDate if not populated in inputs (this is required when using readWQPdata()).
  • Made readWQPsummary a flexible query in case site type or statecode are not included; this avoids extra if statements.
  • State code can be an input but is not the default.
  • Removed objects not needed in later computation (df_summary, sites) to save space.
  • Removed temp RDS file creation/reading within the package.
  • Added a testthat test back in that compares TADAdataRetrieval with TADABigdataRetrieval and does not depend upon statecode.
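A minimal sketch of the chunking approach described above (not the package's actual code), assuming dataRetrieval and dplyr are available; the state, characteristic, dates, and 300-site chunk size are placeholders:

library(dataRetrieval)
library(dplyr)

# Summarize available sites for the query; note the summary service
# cannot be filtered by date range
site_summary <- readWQPsummary(statecode = "US:09", characteristicName = "Phosphorus")
sites <- unique(site_summary$MonitoringLocationIdentifier)

# Split the site list into chunks and query each chunk separately
chunks <- split(sites, ceiling(seq_along(sites) / 300))
results <- lapply(chunks, function(chunk) {
  readWQPdata(siteid = chunk, startDate = "2020-01-01", endDate = "2020-12-31")
})
all_data <- bind_rows(results)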

@cristinamullin
Collaborator

Example in the documentation does not work (a characteristic is required - can we update the error message?):

tada3 <- TADABigdataRetrieval(statecode = "CT")
Error in STUSAB %in% statecode : object 'STUSAB' not found
Called from: STUSAB %in% statecode
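For illustration, a friendlier check might look like the hypothetical sketch below (the argument name and message wording are assumptions, not the current code):

if (is.null(characteristicName) || characteristicName == "null") {
  stop("TADABigdataRetrieval requires a characteristicName. For example: ",
       "TADABigdataRetrieval(characteristicName = 'Phosphorus', statecode = 'CT')")
}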

@cristinamullin
Collaborator

cristinamullin commented Jan 30, 2023

Instead of requiring characteristicName, can we make it more flexible, so that at least one of statecode, sitetype, characteristicName, or HUC (if added) could be provided?

We can also add HUC to TADAdataRetrieval.

Add information to the function documentation and vignette about potential memory issues users may experience when working with big data in R (so users are aware). Users can bind_rows() the chunks and save them to RDS, but that doesn't fix the issue of reading ALL those files back in for use in their analyses. Data may still need to be chunked into manageable queries. A sketch of the save-to-RDS pattern follows.
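A minimal sketch of that pattern; the directory, file names, and object names are illustrative:

# Inside the chunk loop: write each chunk to disk so it can be dropped from memory
saveRDS(chunk_data, file.path("wqp_chunks", paste0("chunk_", i, ".rds")))

# Later: reading every chunk back in reintroduces the memory pressure,
# since the combined data must fit in memory at once
files <- list.files("wqp_chunks", pattern = "\\.rds$", full.names = TRUE)
all_data <- dplyr::bind_rows(lapply(files, readRDS))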

Suggestion: add HUC as an input (for spatial queries), and add the information below to the vignette as well?
Users could use the WATERS GeoViewer to find HUC codes and query spatially if desired. See the hydrologic units layer, and zoom in and out on the map - HUC sizes change with the zoom level, and any HUC number can be used to query the WQP: https://epa.maps.arcgis.com/apps/webappviewer/index.html?id=074cfede236341b6a1e03779c2bd0692
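For example, a HUC-based query through dataRetrieval might look like this; huc is a standard WQP web-service parameter that readWQPdata passes through, and the HUC code, characteristic, and date here are placeholders:

library(dataRetrieval)
huc_data <- readWQPdata(huc = "01010002",
                        characteristicName = "Phosphorus",
                        startDate = "2020-01-01")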

cristinamullin and others added 3 commits January 30, 2023 17:02
removed bad example
- added inputs huc, sampleMedia, and applyautoclean
- sampleMedia defaults to "Water" but will accept "null" as well
- applyautoclean defaults to FALSE in TADABigdataRetrieval, TRUE in TADAdataRetrieval
- added a warning that the function may exhaust available R memory
- added examples and more commenting
- made queries to the summary service and readWQPdata more flexible with user inputs (lists of lists) -- allows multiple values for a single parameter (characteristic, huc, media, etc.)
- added verbose printing that lets the user know the status of chunked downloads
- added a mutate_at step that converts all columns whose names contain the string "MeasureValue" to character prior to binding each chunk to the larger dataset (see the sketch after this list). This was added in response to repeated failures when the larger data frame's column X was character and the new chunk's column X was numeric.
- added a test to ensure TADABigdataRetrieval filters dates correctly
- fixed syntax for calling multiple state summaries; requires "US:##" in the list
- converted statecode.csv to an .Rdata file to ensure the STATE column is preserved as character, since states 01-09 need the leading "0" in front of the number. This behavior is hard to control using Excel across machines.
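A minimal sketch of that mutate_at step; object names are illustrative:

library(dplyr)

# Coerce every column whose name contains "MeasureValue" to character before
# binding, so a character column in one chunk and a numeric column in another
# no longer cause bind_rows() to fail
new_chunk <- mutate_at(new_chunk, vars(contains("MeasureValue")), as.character)
all_data <- bind_rows(all_data, new_chunk)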
@ehinman
Contributor Author

ehinman commented Jan 31, 2023

@cristinamullin This function is ready for your review and testing. It appears robust on my machine, but I am certain fresh eyes will identify weaknesses. Thank you!

@cristinamullin
Collaborator

Do you think we should make the "ProjectIdentifier" input "project" for consistency with dataRetrieval? Do any others deviate?

@cristinamullin cristinamullin merged commit fc6156b into develop Feb 1, 2023
@cristinamullin cristinamullin deleted the bigdataRetrieval_chunking branch February 1, 2023 02:26
@ehinman
Contributor Author

ehinman commented Feb 1, 2023

@cristinamullin It's a great question. I wonder what data submitters see in WQX when they upload a project ID. As a data user, I see the column names in the returned data profiles and it makes sense to me to have the function input match the column name on which I want to query. For example, I would be much more likely to know what a "ProjectIdentifier" input is, rather than "project". Similarly with "ActivityMediaName" rather than "sampleMedia". However, it is interesting that even the WQP browser interface is different. Continuing with these two examples, those fields are termed "Project ID" and "Sample Media" on the page itself. I think whichever we pick, adding documentation of field synonyms across querying platforms will make functions relatively simple to use (e.g. "project input is synonymous with the ProjectIdentifier column returned from the WQP and Project ID field on the web interface").

@cristinamullin
Collaborator

cristinamullin commented Feb 1, 2023

I agree the exact word used should be consistent across the WQP UI, dataRetrieval, TADA, WQX, and the data profiles returned. I'll add a new issue for this since it is a cross-system topic.

@cristinamullin
Collaborator

@ehinman I have another question related to this thread. Do you think it would make sense for TADAdataRetrieval and TADABigdataRetrieval to be a single function, where the chunking is only activated if needed, that is, if the summary service returns over X unique MonitoringLocationIdentifiers?
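For illustration, the proposed dispatch might look like the sketch below; the 300-site threshold and both branch contents are assumptions:

library(dataRetrieval)

site_summary <- readWQPsummary(statecode = "US:09", characteristicName = "Phosphorus")
n_sites <- length(unique(site_summary$MonitoringLocationIdentifier))

if (n_sites > 300) {
  # chunked download path (TADABigdataRetrieval-style)
} else {
  # single-query path (TADAdataRetrieval-style)
}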

@ehinman
Contributor Author

ehinman commented Feb 1, 2023

@cristinamullin Interesting idea, and it certainly can be done. The only downside I see is that running the summary will slow down the function, especially since we do not have the ability to specify the date range of interest in the summary service. Someone could specify that they want one month of data for one characteristic name in one state, and the summary will still cover that characteristic name and state over the entire period data are available. It's not likely to take too long, but it will add some time. Another consideration is the prevalence of sites with little to no data, such that they add to the site count but may not require chunking to successfully return data. What do you think? Would you like me to try this? It would be so nice if there were a web service that reported the size of a request, so one (or a function!) could decide which process to use.

