Centralization of 3W Dataset in BibMon Toolkit: Data Loading and Structuring Functions #132

nabelly19 · 2024-10-19T19:34:52Z

Description:

This Pull Request focuses on the centralization and integration of the 3W Dataset within the BibMon toolkit, providing structured data loading and preparation functions.

Data Loading and Unification: Combines multiple Parquet files into a unified dataset for streamlined analysis.
Automatic Folder and File Mapping: Organizes data by operational situations based on predefined folder structures.
Timestamp Extraction and Formatting: Extracts and formats timestamps from filenames for consistency.
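
As an illustration of the timestamp step, here is a minimal sketch rather than the code in this PR. It assumes instance files are named like `WELL-00001_20140124213136.parquet`, with a 14-digit `YYYYMMDDhhmmss` stamp right before the extension. The helper names `extract_timestamp` and `format_timestamp` and the sample file names are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed naming pattern: "_" followed by a 14-digit YYYYMMDDhhmmss stamp
# at the end of the file name (extension excluded).
_STAMP = re.compile(r"_(\d{14})$")


def extract_timestamp(path: str) -> Optional[datetime]:
    """Return the timestamp encoded in a file name, or None if there is none."""
    match = _STAMP.search(Path(path).stem)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # 14 digits that do not form a valid date and time.
        return None


def format_timestamp(stamp: datetime) -> str:
    """Format a timestamp consistently as ISO 8601 with seconds precision."""
    return stamp.isoformat(sep=" ", timespec="seconds")
```

Files without a stamp in their name (for example, a hypothetical `SIMULATED_00001.parquet`) yield `None`, so the caller can decide how to handle them instead of failing mid-load.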

By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):

ICLA: Individual Contributor License Agreement on behalf of only yourself;
CCLA: Corporate Contributor License Agreement on behalf of your employer.

Our CLAs are based on the Apache Software Foundation's CLAs:

ICLA: Individual Contributor License Agreement
CCLA: Corporate Contributor License Agreement


ricardoevvargas · 2025-01-22T17:53:32Z

Hello, @nabelly19.

Thank you for submitting this PR.

Over the last few weeks, we have been concentrating on the steps needed to add to the 3W Dataset some types of undesirable events that occur during the well drilling stage. We believe that this expansion (planned for the coming months) will promote significant progress in the 3W Project.

We intend to evaluate this and the other open PRs over the next few days.

Once again, thank you for submitting this PR.

ricardoevvargas · 2025-02-26T15:26:47Z

Hello, @nabelly19.

This PR contains an interesting proposal. Using Dask can be a very good alternative to the 3W Toolkit. However, the implementation of this PR needs to be adjusted:

- A suitable version of Dask needs to be included in environment.yml. Remember that adding any package to the specification of the 3W Project's virtual environment may require other packages to be updated. Also remember that, before making any changes to environment.yml, you need to make sure that the 3W Toolkit features are working correctly. For example, Dask 2025.2.0 requires Pandas 2.0.0 or newer, whereas version 1.5.3 is currently used;
- The relationship between this PR and BibMon is unclear. Does this project use Dask? Can you explain it better?
- The folder_mapping variable is hard-coded. This type of configuration lives in dataset.ini and is loaded as constants from the toolkit package. Examples: LABELS_DESCRIPTIONS and EVENT_NAMES_LABELS;
- About the load_and_combine_data() method:
  - The documentation (docstrings) needs to be generated in Google format with autoDocstring - Python Docstring Generator, which follows PEP 257, and pdoc3. Further recommendations can be found in the 3W Project contributing guide;
  - The number of directories (10) is hard-coded. This method will not work after updates to the 3W Dataset structure. The load_instance() method should be used to avoid this problem;
  - The label_and_file_generator() method could be used to filter out unwanted types of instances.
- About the classify_events() method:
  - The documentation (docstrings) needs to be generated in Google format with autoDocstring - Python Docstring Generator, which follows PEP 257, and pdoc3. Further recommendations can be found in the 3W Project contributing guide;
  - The number of directories (10) is hard-coded;
  - There are differences in nomenclature:
    - Where event is used, observation should be used. A sample is a collection of contiguous observations and an instance is a collection of contiguous samples;
    - Where classifies is used, counts should be used. Classifying means estimating whether a sample contains observations associated with a certain type of event.
  - I can't see any point in knowing the number of observations grouped by event type (directory) and sample type. Remember that other useful counts are performed by the create_table_of_instances() and calc_stats_instances() methods. Can you explain the benefits of this method?
- About the visualize_data() method:
  - I can't see any point in visualizing the number of observations grouped by event type (directory) and sample type. Remember that other useful counts are performed by the create_table_of_instances() and calc_stats_instances() methods. Can you explain the benefits of this method?
- About the unify-data-tutorial.ipynb:
  - There are comments in Portuguese, which can easily be resolved by you or by any other member of the 3W Community (including repository administrators);
  - Ideally, the dataset_dir should be a relative path, not a specific absolute path on your computer.
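
Two of the points above, the hard-coded directory count and the Google docstring format, can be illustrated with a small sketch. It is not the 3W Toolkit API: for real code, load_instance() and the constants loaded from dataset.ini remain the recommended route. `discover_label_dirs` is a hypothetical helper, and the layout it assumes (one subdirectory per event type, named by its integer label) is an assumption made only for this example.

```python
from pathlib import Path
from typing import Dict


def discover_label_dirs(dataset_dir: Path) -> Dict[int, Path]:
    """Map each integer label to its directory inside the dataset.

    Directories are discovered at run time, so the function keeps working
    if event types are added to or removed from the dataset.

    Args:
        dataset_dir (Path): Root directory of the dataset. A relative path
            such as ``Path("..") / "dataset"`` keeps notebooks portable.

    Returns:
        Dict[int, Path]: Label-to-directory mapping, ordered by label.

    Raises:
        NotADirectoryError: If ``dataset_dir`` is not an existing directory.
    """
    if not dataset_dir.is_dir():
        raise NotADirectoryError(f"{dataset_dir} is not a directory")
    # Keep only subdirectories whose names are plain ASCII integers;
    # anything else (files, helper folders) is ignored.
    label_dirs = {
        int(entry.name): entry
        for entry in dataset_dir.iterdir()
        if entry.is_dir() and entry.name.isascii() and entry.name.isdigit()
    }
    return dict(sorted(label_dirs.items()))
```

In a notebook, a call such as `discover_label_dirs(Path("..") / "dataset")` would then stand in for both the absolute path and the fixed count of 10.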

Please think about it and let us know how you prefer to proceed.

Once again, thank you for submitting this PR.

tpsiqueira · 2025-03-26T14:28:41Z

Hello, @nabelly19.

Have you had a chance to evaluate my comment from February 26th?

Please think about it and let us know how you prefer to proceed.

Once again, thank you for submitting this PR.

tpsiqueira · 2025-05-08T12:53:25Z

Hello, @nabelly19.

Since our comment on February 26th, we have not received a response on this PR.

We’ll go ahead and close this PR for now, but it can be reopened in the future if needed.

Let us know if you need anything and thank you for submitting this PR.

nabelly19 added 5 commits October 19, 2024 11:15

- add load_and_combine_data function (7520363)
- add cassify_events function (36c55d0)
- add visualize_data function (029087f)
- add functions to initpy (6919286)
- add tutorial notebook and fix some codes in functions added in last c… (4097351)


tpsiqueira closed this May 8, 2025

tpsiqueira removed the "waiting author" label (Waiting for the author's input to proceed) May 8, 2025
