TADAbigdataRet chunking by site #193
Conversation
- Changed statecode default to "no" to match other inputs.
- Specified startDate and endDate if not populated in input.
- Made readWQPsummary a flexible query in case site type or statecode are not included; state code can be used but is not the default (see the sketch below).
- Removes objects not needed in future computation (to save space).
- Removed temp RDS file creation/reading within the package.
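A minimal sketch of what the flexible summary query could look like, assuming dataRetrieval's readWQPsummary and the "null" sentinel used elsewhere in TADA; the helper name build_summary_query and its defaults are illustrative, not the merged code:

```r
library(dataRetrieval)

# Hypothetical helper: keep only the inputs the user actually supplied
# so the WQP summary call stays valid when siteType or statecode are omitted.
build_summary_query <- function(statecode = "null", siteType = "null",
                                characteristicName = "null") {
  args <- list(statecode = statecode,
               siteType = siteType,
               characteristicName = characteristicName)
  args <- args[vapply(args, function(x) !identical(x, "null"), logical(1))]
  do.call(readWQPsummary, args)
}
```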
Example in the documentation does not work (a characteristic is required; can we update the error message?): tada3 <- TADABigdataRetrieval(statecode = "CT")
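A clearer check could stop early with a message that names the missing argument. This is only an illustrative sketch; the argument name and wording are assumptions, not the current package code:

```r
# Hypothetical early check inside TADABigdataRetrieval
if (is.null(characteristicName) || identical(characteristicName, "null")) {
  stop("A characteristicName is required. Please supply at least one, e.g. ",
       "TADABigdataRetrieval(characteristicName = 'Phosphorus', statecode = 'CT').")
}
```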
Instead of requiring characteristicname, can we make it more flexible, so that at least one of statecode, sitetype, characteristicname, or HUC (if added) could be provided? We can also add HUC to TADAdataRetrieval. Add information to the function documentation and vignette about the potential memory issues users may experience when working with big data in R (so users are aware). Users can bind_rows() and save to RDS, but that doesn't fix the issue of reading in ALL those files for use in their analyses. Data may still need to be chunked into manageable queries. Suggestion: add HUC as an input (for spatial queries), and add the information below to the vignette as well?
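One way to express the "at least one of" requirement is a single guard at the top of the function. A rough sketch, assuming the same "null" defaults; the argument names (including a hypothetical huc input) are illustrative:

```r
# Require at least one spatial or characteristic filter before querying the WQP
# (statecode, siteType, characteristicName, huc are illustrative argument names)
user_inputs <- list(statecode = statecode,
                    siteType = siteType,
                    characteristicName = characteristicName,
                    huc = huc)
supplied <- vapply(user_inputs, function(x) !identical(x, "null"), logical(1))

if (!any(supplied)) {
  stop("Please supply at least one of: statecode, siteType, ",
       "characteristicName, or huc.")
}
```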
removed bad example
- Added inputs huc, sampleMedia, and applyautoclean.
- sampleMedia defaults to "Water" but will accept "null" as well.
- applyautoclean defaults to FALSE in TADABigdataRetrieval and TRUE in TADAdataRetrieval.
- Added a warning that R memory could be compromised by the function.
- Added examples and more commenting.
- Made queries to the summary service and readWQPdata more flexible with user inputs (lists of lists), which allows multiple inputs for a single parameter/huc/media/etc.
- Added verbose printing that lets the user know the status of the chunked downloads.
- Added a mutate_at step that converts all columns whose names contain the string "MeasureValue" to character prior to binding to the larger dataset (see the sketch below). This was added in response to repeated failures of the function when the larger data frame's column X was character() and the new chunk's column X was numeric().
- Added a test to ensure TADABigdataRetrieval filters dates correctly.
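The MeasureValue type guard mentioned above could look roughly like this, using dplyr's (now superseded) mutate_at interface; full_data and new_chunk are placeholder names, not the package's actual objects:

```r
library(dplyr)

# Coerce every column whose name contains "MeasureValue" to character before
# binding, so a numeric chunk can't clash with a character column already
# present in the accumulated data frame. (full_data / new_chunk are placeholders.)
new_chunk <- new_chunk %>%
  mutate_at(vars(contains("MeasureValue")), as.character)

full_data <- bind_rows(full_data, new_chunk)
```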
- Fixed syntax for calling multiple state summaries; requires "US:##" in a list (see the sketch below).
- Converted statecode.csv to an .Rdata file to ensure the STATE column is preserved as character, since states 01-09 need the leading "0". This behavior is hard to control using Excel across machines.
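A small sketch of the state-code handling described above; the object names are illustrative. The point is that FIPS codes below 10 keep their leading zero only if STATE stays character, and the summary service expects the "US:##" form when several states are requested:

```r
# Keep STATE as character so codes "01" through "09" retain the leading zero
statecodes <- data.frame(STATE = sprintf("%02d", c(1, 9, 44)),
                         stringsAsFactors = FALSE)
save(statecodes, file = "statecode.Rdata")

# Build the "US:##" form used when querying summaries for multiple states
state_list <- paste0("US:", statecodes$STATE)  # "US:01" "US:09" "US:44"
```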
@cristinamullin This function is ready for your review and testing. Appears robust on my machine, but I am certain fresh eyes will identify weaknesses. Thank you!
Do you think we should make the "ProjectIdentifier" input "project" for consistency with dataRetrieval? Do any others deviate?
@cristinamullin It's a great question. I wonder what data submitters see in WQX when they upload a project ID. As a data user, I see the column names in the returned data profiles, and it makes sense to me to have the function input match the column name on which I want to query. For example, I would be much more likely to know what a "ProjectIdentifier" input is, rather than "project". Similarly with "ActivityMediaName" rather than "sampleMedia". However, it is interesting that even the WQP browser interface is different. Continuing with these two examples, those fields are termed "Project ID" and "Sample Media" on the page itself. I think whichever we pick, adding documentation of field synonyms across querying platforms will make the functions relatively simple to use (e.g. "the project input is synonymous with the ProjectIdentifier column returned from the WQP and the Project ID field on the web interface").
I agree the exact word used should be consistent across the WQP UI, dataRetrieval, TADA, WQX, and the data profiles returned. I'll add a new issue for this since it is a cross-system topic.
@ehinman I have another question related to this thread. Do you think it would make sense for TADAdataRetrieval and TADABigdataRetrieval to be a single function, where the chunking is only activated if needed, that is, if the summary service returns over X unique MonitoringLocationIdentifiers?
@cristinamullin Interesting idea, and it certainly can be done. The only downside I see is that running the summary will slow down the function, especially since we do not have the ability to specify the date range of interest in the summary service. Someone could specify that they want one month of data for one characteristic name in one state, and it will pull data for that characteristic name and state over the entire time data are available. It's not likely to take too long, but it will add some time. Another consideration is the prevalence of sites with little to no data, such that they add to the site count but may not require chunking to successfully return data. What do you think? Would you like me to try this? It would be so nice if there were a web service that reported the size of a request so one (or a function!) could decide which process to use.
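If the single-function route is explored, the decision could be as simple as counting unique sites in the summary before choosing a path. This is only a sketch of the idea; the threshold and the helper names run_simple_query/run_chunked_query are hypothetical, not an agreed design:

```r
library(dataRetrieval)

# Hypothetical combined retrieval function: query the summary service first,
# then chunk only when the site count exceeds a threshold.
TADA_retrieve <- function(..., site_threshold = 300) {
  # Note: the summary service covers the full period of record, so this adds
  # one date-unbounded request before any result data are pulled.
  site_summary <- readWQPsummary(...)
  n_sites <- length(unique(site_summary$MonitoringLocationIdentifier))

  if (n_sites > site_threshold) {
    run_chunked_query(...)   # hypothetical: TADABigdataRetrieval-style chunking by site
  } else {
    run_simple_query(...)    # hypothetical: single TADAdataRetrieval-style call
  }
}
```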