
Conversation

@ehinman
Contributor

@ehinman ehinman commented Jan 30, 2023

  • Chunks by the site list generated by readWQPsummary, rather than by sites within each FIPS code (see the sketch after this list).
  • Changed statecode default to "null" to match other inputs.
  • Specified startDate and endDate if not populated in inputs (this is required when using readWQPdata()).
  • Made readWQPsummary a flexible query in case site type or statecode are not included; this avoids extra if statements.
  • State code can be an input but is not the default.
  • Removed objects not needed in later computation (df_summary, sites) to save space.
  • Removed temp RDS file creation/reading within the package.
  • Added a testthat test back in that compares TADAdataRetrieval with TADABigdataRetrieval and does not depend upon statecode.
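A minimal sketch of the chunking approach described above (not the package's actual code), assuming dataRetrieval and dplyr are available; the state, characteristic, dates, and 300-site chunk size are placeholders:

library(dataRetrieval)
library(dplyr)

# Summarize available sites for the query; note the summary service
# cannot be filtered by date range
site_summary <- readWQPsummary(statecode = "US:09", characteristicName = "Phosphorus")
sites <- unique(site_summary$MonitoringLocationIdentifier)

# Split the site list into chunks and query each chunk separately
chunks <- split(sites, ceiling(seq_along(sites) / 300))
results <- lapply(chunks, function(chunk) {
  readWQPdata(siteid = chunk, startDate = "2020-01-01", endDate = "2020-12-31")
})
all_data <- bind_rows(results)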

@cristinamullin
Collaborator

Example in the documentation does not work (a characteristic is required - can we update the error message?):

tada3 <- TADABigdataRetrieval(statecode = "CT")
Error in STUSAB %in% statecode : object 'STUSAB' not found
Called from: STUSAB %in% statecode
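For illustration, a friendlier check might look like the hypothetical sketch below (the argument name and message wording are assumptions, not the current code):

if (is.null(characteristicName) || characteristicName == "null") {
  stop("TADABigdataRetrieval requires a characteristicName. For example: ",
       "TADABigdataRetrieval(characteristicName = 'Phosphorus', statecode = 'CT')")
}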

@cristinamullin
Collaborator

cristinamullin commented Jan 30, 2023

Instead of requiring characteristicName, can we make it more flexible, so that at least one of statecode, sitetype, characteristicName, or HUC (if added) could be provided?

We can also add HUC to TADAdataRetrieval.

Add information to the function documentation and vignette about potential memory issues users may experience when working with big data in R (so users are aware). Users can bind_rows() the chunks and save them to RDS, but that doesn't fix the issue of reading ALL those files back in for use in their analyses. Data may still need to be chunked into manageable queries. A sketch of the save-to-RDS pattern follows.
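A minimal sketch of that pattern; the directory, file names, and object names are illustrative:

# Inside the chunk loop: write each chunk to disk so it can be dropped from memory
saveRDS(chunk_data, file.path("wqp_chunks", paste0("chunk_", i, ".rds")))

# Later: reading every chunk back in reintroduces the memory pressure,
# since the combined data must fit in memory at once
files <- list.files("wqp_chunks", pattern = "\\.rds$", full.names = TRUE)
all_data <- dplyr::bind_rows(lapply(files, readRDS))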

Suggestion: add HUC as an input (for spatial queries), and add the information below to the vignette as well?
Users could use the WATERS GeoViewer to find HUC codes and query spatially if desired. See the hydrologic units layer, and zoom in and out on the map - HUC sizes change with the zoom level, and any HUC number can be used to query the WQP: https://epa.maps.arcgis.com/apps/webappviewer/index.html?id=074cfede236341b6a1e03779c2bd0692
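For example, a HUC-based query through dataRetrieval might look like this; huc is a standard WQP web-service parameter that readWQPdata passes through, and the HUC code, characteristic, and date here are placeholders:

library(dataRetrieval)
huc_data <- readWQPdata(huc = "01010002",
                        characteristicName = "Phosphorus",
                        startDate = "2020-01-01")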

cristinamullin and others added 3 commits January 30, 2023 17:02
removed bad example
- added inputs huc, sampleMedia, and applyautoclean
- sampleMedia defaults to "Water" but will accept "null" as well
- applyautoclean defaults to FALSE in TADABigdataRetrieval, TRUE in TADAdataRetrieval
- added a warning that the function may exhaust available R memory
- added examples and more commenting
- made queries to the summary service and readWQPdata more flexible with user inputs (lists of lists) -- allows multiple values for a single parameter (characteristic, huc, media, etc.)
- added verbose printing that lets the user know the status of chunked downloads
- added a mutate_at step that converts all columns whose names contain the string "MeasureValue" to character prior to binding each chunk to the larger dataset (see the sketch after this list). This was added in response to repeated failures when the larger data frame's column X was character and the new chunk's column X was numeric.
- added a test to ensure TADABigdataRetrieval filters dates correctly
- fixed syntax for calling multiple state summaries; requires "US:##" in the list
- converted statecode.csv to an .Rdata file to ensure the STATE column is preserved as character, since states 01-09 need the leading "0" in front of the number. This behavior is hard to control using Excel across machines.
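A minimal sketch of that mutate_at step; object names are illustrative:

library(dplyr)

# Coerce every column whose name contains "MeasureValue" to character before
# binding, so a character column in one chunk and a numeric column in another
# no longer cause bind_rows() to fail
new_chunk <- mutate_at(new_chunk, vars(contains("MeasureValue")), as.character)
all_data <- bind_rows(all_data, new_chunk)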
@ehinman
Contributor Author

ehinman commented Jan 31, 2023

@cristinamullin This function is ready for your review and testing. It appears robust on my machine, but I am certain fresh eyes will identify weaknesses. Thank you!

@cristinamullin
Collaborator

Do you think we should make the "ProjectIdentifier" input "project" for consistency with dataRetrieval? Do any others deviate?

@cristinamullin cristinamullin merged commit fc6156b into develop Feb 1, 2023
@cristinamullin cristinamullin deleted the bigdataRetrieval_chunking branch February 1, 2023 02:26
@ehinman
Contributor Author

ehinman commented Feb 1, 2023

@cristinamullin It's a great question. I wonder what data submitters see in WQX when they upload a project ID. As a data user, I see the column names in the returned data profiles and it makes sense to me to have the function input match the column name on which I want to query. For example, I would be much more likely to know what a "ProjectIdentifier" input is, rather than "project". Similarly with "ActivityMediaName" rather than "sampleMedia". However, it is interesting that even the WQP browser interface is different. Continuing with these two examples, those fields are termed "Project ID" and "Sample Media" on the page itself. I think whichever we pick, adding documentation of field synonyms across querying platforms will make functions relatively simple to use (e.g. "project input is synonymous with the ProjectIdentifier column returned from the WQP and Project ID field on the web interface").

@cristinamullin
Collaborator

cristinamullin commented Feb 1, 2023

I agree the exact word used should be consistent across the WQP UI, dataRetrieval, TADA, WQX, and the data profiles returned. I'll add a new issue for this since it is a cross-system topic.

@cristinamullin
Collaborator

@ehinman I have another question related to this thread. Do you think it would make sense for TADAdataRetrieval and TADABigdataRetrieval to be a single function, where the chunking is only activated if needed, that is, if the summary service returns over X unique MonitoringLocationIdentifiers?
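For illustration, the proposed dispatch might look like the sketch below; the 300-site threshold and both branch contents are assumptions:

library(dataRetrieval)

site_summary <- readWQPsummary(statecode = "US:09", characteristicName = "Phosphorus")
n_sites <- length(unique(site_summary$MonitoringLocationIdentifier))

if (n_sites > 300) {
  # chunked download path (TADABigdataRetrieval-style)
} else {
  # single-query path (TADAdataRetrieval-style)
}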

@ehinman
Contributor Author

ehinman commented Feb 1, 2023

@cristinamullin Interesting idea, and it certainly can be done. The only downside I see is that running the summary will slow down the function, especially since we do not have the ability to specify the date range of interest in the summary service. Someone could specify that they want one month of data for one characteristic name in one state, and the summary will still cover that characteristic name and state over the entire period data are available. It's not likely to take too long, but it will add some time. Another consideration is the prevalence of sites with little to no data, such that they add to the site count but may not require chunking to successfully return data. What do you think? Would you like me to try this? It would be so nice if there were a web service that reported the size of a request, so one (or a function!) could decide which process to use.

