494 epatada continuous data flag error #501
Updated the function to find matches by grouping rather than with a loop. The update takes about 1.3 min on a ~100,000-row data set, while the previous version took ~45 min.
Corrected the row count used to check how many continuous results there are
Moved location of flag.data to address check issue
Styler updates; checked and corrected {} placement in function
wokenny13
left a comment
Ran through the examples with Data_Nutrients_UT and tried different time_difference values rather than the default of 4. Also ran through my own example dataset with just Florida, and the flagging of continuous data seems to have worked as expected.
Run time was quick, and I think the options to flag or clean the dataset for continuous data are nice. I tested fractional time values, and the function runs and seems to work, but perhaps fractional values are not of much interest.
#' in hours between measurements of the same TADA.ComparableDataIdentifier taken at the same
#' latitude, longitude, and depth. This is used to search for
#' continuous time series data (i.e., if there are multiple measurements within the selected
#' time_difference, then the row will be flagged as continuous). The default time window is 4 hours.
Would decimal values for the time window work? I tried examples with 0.25 and even 0.20 and they seemed to work, but I didn't experiment much with fractional times.
Update examples, remove extraneous column
renaemyers
left a comment
The filtering approach is much better than the loop that the function previously used. Run times ranged from a couple of seconds with the Data_Nutrients_UT example data to 2.5 minutes with a large custom dataset of about 207,000 rows.
cefergus
left a comment
I think the revised function looks good from what I could see. I ran the revised function on Fond du Lac data and a random TADA test data set. I looked for observations with the same location, depth, comparable data identifier, and organization, and the function appears to be correctly flagging continuous vs. discrete observations. Some observations labeled "Field Msr/Obs" and flagged as "Continuous" look like duplicated result values, but maybe a different function can flag those instances.
Thanks for the reviews and comments, @renaemyers and @cefergus. I know we need to revisit the find QA/QC and paired replicates functions at some point in the future, so I will copy @cefergus's comment to the appropriate paired replicates issue or create a new one, so that we can consider it in the context of future updates to the QC or continuous flagging functions.
New approach to flagging continuous data. Instead of relying on a loop, this version groups the data (by TADA.ComparableDataIdentifier, MonitoringLocation, and various depth fields), arranges each group chronologically by ActivityStartDateTime, and then calculates the time difference between each result and the results immediately before and after it within its group.
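The grouping approach described above can be sketched roughly as follows. This is a Python illustration using only the standard library, not the package's R code. The function name `flag_continuous`, the dictionary keys (`identifier`, `location`, `depth`, `start`), and the inclusive `<=` comparison against the window are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_continuous(rows, time_difference=4):
    """Return one flag per input row: "Continuous" or "Discrete".

    rows: list of dicts with hypothetical keys "identifier", "location",
    "depth", and "start" (a datetime). time_difference is in hours and
    may be fractional.
    """
    window = timedelta(hours=time_difference)

    # Group row indices by the fields that define "the same measurement".
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = (row["identifier"], row["location"], row["depth"])
        groups[key].append(index)

    flags = ["Discrete"] * len(rows)
    for indices in groups.values():
        # Arrange each group chronologically.
        indices.sort(key=lambda i: rows[i]["start"])
        for pos, i in enumerate(indices):
            current = rows[i]["start"]
            # Compare with the result before and the result after.
            # Assumption: a gap exactly equal to the window counts as within it.
            near_prev = pos > 0 and current - rows[indices[pos - 1]]["start"] <= window
            near_next = (
                pos + 1 < len(indices)
                and rows[indices[pos + 1]]["start"] - current <= window
            )
            if near_prev or near_next:
                flags[i] = "Continuous"
    return flags


sample_rows = [
    {"identifier": "A", "location": "site1", "depth": 1, "start": datetime(2020, 1, 1, 0, 0)},
    {"identifier": "A", "location": "site1", "depth": 1, "start": datetime(2020, 1, 1, 1, 0)},
    {"identifier": "A", "location": "site1", "depth": 1, "start": datetime(2020, 1, 2, 12, 0)},
    {"identifier": "B", "location": "site1", "depth": 1, "start": datetime(2020, 1, 1, 0, 30)},
]

print(flag_continuous(sample_rows))
# ['Continuous', 'Continuous', 'Discrete', 'Discrete']
```

In this sketch a result that is alone in its group has no neighbour to compare against, so it is always labeled "Discrete". Because the window is converted from hours with `timedelta(hours=...)`, a fractional value such as 0.25 simply gives a 15-minute window; with that setting the two results one hour apart above would both be "Discrete".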
There are some slight differences in the results between this function and the original. Additional results are identified as "discrete" in this version. I think this is correct: when I reviewed the data set, in each of these cases the result labeled as "discrete" was the only one in its "group". I am not sure why they were identified as continuous by the previous function.
I have an example data set with results from the old function. It was too large to attach here, but I can send it if you would like to compare its results to results from the updated function.