@nabelly19

Description:

This Pull Request focuses on the centralization and integration of the 3W Dataset within the BibMon toolkit, providing structured data loading and preparation functions.

  • Data Loading and Unification: Combines multiple Parquet files into a unified dataset for streamlined analysis.
  • Automatic Folder and File Mapping: Organizes data by operational situations based on predefined folder structures.
  • Timestamp Extraction and Formatting: Extracts and formats timestamps from filenames for consistency.
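
As a rough illustration of the timestamp step, here is a minimal sketch; it is not the code from this PR. It assumes that real-instance filenames end in a 14-digit `YYYYMMDDhhmmss` suffix, as in `WELL-00001_20140124213136.parquet`, and the names `extract_timestamp` and `format_timestamp` are hypothetical. Filenames without such a suffix yield `None`.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumption: the timestamp is an underscore followed by exactly 14 digits
# at the end of the file stem (YYYYMMDDhhmmss).
_TIMESTAMP_RE = re.compile(r"_(\d{14})$")


def extract_timestamp(filename: str) -> Optional[datetime]:
    """Return the timestamp encoded in a filename, or None if there is none."""
    stem = Path(filename).stem  # drops the ".parquet" extension
    match = _TIMESTAMP_RE.search(stem)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # 14 digits that do not form a valid date/time
        return None


def format_timestamp(ts: datetime) -> str:
    """Format a timestamp consistently for display or indexing."""
    return ts.strftime("%Y-%m-%d %H:%M:%S")
```

Returning `None` rather than raising keeps the caller free to decide how to treat files that carry no timestamp in their name.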

By creating this pull request, I confirm that I have read and fully accept and agree with one of Petrobras' Contributor License Agreements (CLAs):

Our CLAs are based on the Apache Software Foundation's CLAs:

@ricardoevvargas
Collaborator

Hello, @nabelly19.

Thank you for submitting this PR.

Over the last few weeks, we have been concentrating on the necessary steps to add to the 3W Dataset some types of undesirable events that occur during the well drilling stage. We believe that this increase (planned for the coming months) will promote significant progress in the 3W Project.

We intend to evaluate this and the other open PRs over the next few days.

Once again, thank you for submitting this PR.

@ricardoevvargas
Collaborator

ricardoevvargas commented Feb 26, 2025

Hello, @nabelly19.

This PR contains an interesting proposal. Using Dask can be a very good alternative to the 3W Toolkit. However, the implementation of this PR needs to be adjusted:

  • A suitable version of Dask needs to be added to environment.yml. Remember that adding any package to the specification of the 3W Project's virtual environment may require other packages to be updated. Also remember that before making any changes to environment.yml, you need to make sure that the 3W Toolkit features are working correctly. For example, Dask 2025.2.0 requires pandas 2.0.0 or newer, whereas version 1.5.3 is currently used;
  • The relationship between this PR and BibMon is unclear. Does this project use Dask? Can you explain it better?
  • The folder_mapping variable is hard-coded. This type of configuration belongs in dataset.ini and is loaded in constants from the toolkit package. Examples: LABELS_DESCRIPTIONS and EVENT_NAMES_LABELS;
  • About the load_and_combine_data() method:
    • The documentation (docstrings) needs to be generated in Google format with autoDocstring - Python Docstring Generator, which follows PEP 257, and pdoc3. Further recommendations can be found in the 3W Project contributing guide;
    • The number of directories (10) is hard-coded, so this method will stop working after updates to the 3W Dataset structure. The load_instance() method should be used to avoid this problem;
    • The label_and_file_generator() method could be used to filter out unwanted types of instances.
  • About the classify_events() method:
    • The documentation (docstrings) needs to be generated in Google format with autoDocstring - Python Docstring Generator, which follows PEP 257, and pdoc3. Further recommendations can be found in the 3W Project contributing guide;
    • The number of directories (10) is hard-coded;
    • There are differences in nomenclature:
      • Where event is used, observation should be used. A sample is a collection of contiguous observations and an instance is a collection of contiguous samples;
      • Where classifies is used, counts should be used. Classifying means estimating whether a sample contains observations associated with a certain type of event.
    • I can't see any point in knowing the number of observations grouped by event type (directory) and sample type. Remember that other useful counts are performed by the create_table_of_instances() and calc_stats_instances() methods. Can you explain the benefits of this method?
  • About the visualize_data() method:
    • I can't see any point in visualizing the number of observations grouped by event type (directory) and sample type. Remember that other useful counts are performed by the create_table_of_instances() and calc_stats_instances() methods. Can you explain the benefits of this method?
  • About the unify-data-tutorial.ipynb:
    • There are comments in Portuguese, which can easily be translated either by you or by any other member of the 3W Community (including repository administrators);
    • Ideally, the dataset_dir should be a relative path, not a specific absolute path on your computer.
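
To make the points about hard-coded values more concrete, here is a minimal sketch, assuming a hypothetical INI section named `FOLDER_MAPPING` and numerically named label folders. It is not the 3W Toolkit's implementation: inside the toolkit, the existing constants and the load_instance() and label_and_file_generator() methods mentioned above would be the preferred route. The function names, the section name, and the example values are illustrative assumptions, and the docstrings follow the Google format requested above.

```python
import configparser
from pathlib import Path
from typing import Dict, List

# Hypothetical INI content for illustration; the real dataset.ini may differ.
EXAMPLE_INI = """
[FOLDER_MAPPING]
0 = Normal Operation
1 = Abrupt Increase of BSW
"""


def load_folder_mapping(ini_text: str, section: str = "FOLDER_MAPPING") -> Dict[int, str]:
    """Reads a label-to-description mapping from INI text.

    Args:
        ini_text: Contents of an INI file.
        section: Name of the section holding the mapping (assumed name).

    Returns:
        A dictionary mapping each integer label to its description.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {int(key): value for key, value in parser[section].items()}


def discover_label_dirs(dataset_dir: Path) -> List[Path]:
    """Lists the numerically named subdirectories of a dataset folder.

    The folders are discovered at run time, so no fixed directory count
    is assumed.

    Args:
        dataset_dir: Root folder of the dataset, ideally a relative path.

    Returns:
        The label folders sorted by their integer label.
    """
    label_dirs = (
        path for path in dataset_dir.iterdir()
        if path.is_dir() and path.name.isdecimal()
    )
    return sorted(label_dirs, key=lambda path: int(path.name))
```

A notebook could then call, for example, `discover_label_dirs(Path("..") / "dataset")`, where the relative location is only an assumption about where the dataset sits with respect to the notebook.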

Please think about it and let us know how you prefer to proceed.

Once again, thank you for submitting this PR.

@tpsiqueira added the waiting author, enhancement, advance, and documentation labels and removed the enhancement and advance labels on Mar 25, 2025
@tpsiqueira
Collaborator

Hello, @nabelly19.

Have you had a chance to evaluate my comment from February 26th?

Please think about it and let us know how you prefer to proceed.

Once again, thank you for submitting this PR.

@tpsiqueira
Collaborator

Hello, @nabelly19.

Since our comment on February 26th, we have not received a response on this PR.

We’ll go ahead and close this PR for now, but it can be reopened in the future if needed.

Let us know if you need anything and thank you for submitting this PR.

@tpsiqueira closed this on May 8, 2025
@tpsiqueira removed the waiting author label on May 8, 2025

Labels

advance, documentation, enhancement
