diff --git a/overviews/_baseline/unify-data-tutorial.ipynb b/overviews/_baseline/unify-data-tutorial.ipynb
new file mode 100644
index 00000000..e6196c86
--- /dev/null
+++ b/overviews/_baseline/unify-data-tutorial.ipynb
@@ -0,0 +1,162 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3W Dataset's General Presentation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is a general presentation of the 3W Dataset: to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark for the development of machine learning techniques related to the inherent difficulties of actual data.\n",
+    "\n",
+    "For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells**, published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223))."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This Jupyter Notebook presents the 3W Dataset 2.0.0 in a general way. To do so, it demonstrates some functionalities for data unification and the benefits of this process.\n",
+    "\n",
+    "In complex datasets like 3W, data is often distributed across multiple folders and files, which may hinder quick insights and analysis. The data unification process involves loading, cleaning, and merging these scattered files into a single, well-structured data frame. Its main functionalities and benefits are summarized below.\n",
+    "\n",
+    "**Functionalities of Data Unification**\n",
+    "\n",
+    "* **Automated loading of distributed data**: the notebook loads all Parquet files from multiple folders efficiently, filters out irrelevant files (e.g., simulated data), and extracts important metadata, such as timestamps, directly from file names.\n",
+    "* **Data normalization**: additional columns (e.g., folder ID, date, and time) are added, ensuring consistency across different data points and harmonizing the different segments for downstream analysis.\n",
+    "* **Handling large-scale data with Dask**: Dask allows seamless processing of datasets that would otherwise not fit into memory, making it easier to explore and manipulate the entire dataset efficiently (see the short sketch below).\n",
+    "\n",
+    "**Benefits of Data Unification**\n",
+    "\n",
+    "* **Improved data accessibility**: with all data combined into a single structure, researchers and engineers can access relevant information faster, minimizing the time spent searching across files.\n",
+    "* **Enhanced analytical capabilities**: unified data allows for richer analytics, such as visualizing trends and patterns across the entire dataset; anomalies and transient events can be identified and classified more accurately.\n",
+    "* **Simplified visualization**: by consolidating data into a single DataFrame, it is easier to generate comprehensive visualizations that provide meaningful insights about operational states.\n",
+    "* **Facilitated collaboration**: when datasets are standardized and merged, it becomes easier for teams to share findings and collaborate on data-driven projects; the unified dataset serves as a single source of truth.\n",
+    "\n",
+    "This notebook demonstrates these functionalities and benefits by loading the 3W Dataset, classifying events across multiple operational states, and generating visualizations that offer a deeper understanding of system behavior."
+   ]
+  },
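+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following is a minimal sketch of the lazy-loading pattern mentioned above. The path is only a placeholder; the actual loading used in this notebook is done by `load_and_combine_data` further below.\n",
+    "\n",
+    "```python\n",
+    "import dask.dataframe as dd\n",
+    "\n",
+    "# Lazy read: file contents are not pulled into memory yet\n",
+    "df = dd.read_parquet(\"/path/to/3W/dataset/0/*.parquet\")\n",
+    "\n",
+    "# Work happens only when a concrete result is requested\n",
+    "print(df[\"class\"].value_counts().compute())\n",
+    "```"
+   ]
+  },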
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 2. Imports and Configurations"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "import dask.dataframe as dd\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# from toolkit.misc import load_and_combine_data, classify_events, visualize_data\n",
+    "\n",
+    "plt.style.use('ggplot')  # Matplotlib style\n",
+    "pd.set_option('display.max_columns', None)  # Show all DataFrame columns\n",
+    "\n",
+    "# Adjust this path to the local copy of the dataset\n",
+    "dataset_dir = \"C:/Users/anabe/OneDrive/Área de Trabalho/HACKATHON PETROBRÁS/dataset_modificado/dataset_modificado\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3. Instances' Structure"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this section, we explain the organization of the folders and files within the dataset. The 3W Dataset contains subfolders numbered from 0 to 9, where each folder represents a specific operational situation, as described below:\n",
+    "\n",
+    "* 0 = Normal Operation\n",
+    "* 1 = Abrupt Increase of BSW\n",
+    "* 2 = Spurious Closure of DHSV\n",
+    "* 3 = Severe Slugging\n",
+    "* 4 = Flow Instability\n",
+    "* 5 = Rapid Productivity Loss\n",
+    "* 6 = Quick Restriction in PCK\n",
+    "* 7 = Scaling in PCK\n",
+    "* 8 = Hydrate in Production Line\n",
+    "* 9 = Hydrate in Service Line\n",
+    "\n",
+    "Each file follows the naming pattern WELL-00008_20170818000222.parquet, where:\n",
+    "\n",
+    "* WELL-00008: identification of the well.\n",
+    "* 20170818000222: timestamp in the format yyyyMMddhhmmss.\n",
+    "* .parquet: file extension indicating the data format.\n",
+    "\n",
+    "The next cell sketches how this timestamp can be parsed."
+   ]
+  },
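+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A minimal sketch of how the timestamp embedded in a file name maps to a date\n",
+    "# and a time, assuming the WELL-XXXXX_yyyyMMddhhmmss pattern described above\n",
+    "example_name = \"WELL-00008_20170818000222.parquet\"\n",
+    "stamp = example_name.split(\"_\")[1].split(\".\")[0]\n",
+    "date = f\"{stamp[:4]}-{stamp[4:6]}-{stamp[6:8]}\"      # '2017-08-18'\n",
+    "time = f\"{stamp[8:10]}:{stamp[10:12]}:{stamp[12:]}\"  # '00:02:22'\n",
+    "print(date, time)"
+   ]
+  },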
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from toolkit.misc import load_and_combine_data, classify_events, visualize_data\n",
+    "\n",
+    "datatype = 'SIMULATED'\n",
+    "df = load_and_combine_data(dataset_dir, datatype)\n",
+    "\n",
+    "if df is not None:\n",
+    "    event_summary = classify_events(df)\n",
+    "\n",
+    "    visualize_data(event_summary)\n",
+    "else:\n",
+    "    print(\"No data was loaded.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/toolkit/__init__.py b/toolkit/__init__.py
index e3419e2e..cd95047b 100644
--- a/toolkit/__init__.py
+++ b/toolkit/__init__.py
@@ -107,4 +107,7 @@
     load_instances,
     resample,
     plot_instance,
+    load_and_combine_data,
+    classify_events,
+    visualize_data,
 )
diff --git a/toolkit/misc.py b/toolkit/misc.py
index 6f2323e5..b2e7e9a4 100644
--- a/toolkit/misc.py
+++ b/toolkit/misc.py
@@ -9,6 +9,7 @@
 import matplotlib.dates as mdates
 import matplotlib.colors as mcolors
 import os
+import dask.dataframe as dd
 from matplotlib.patches import Patch
 from pathlib import Path
@@ -35,9 +36,134 @@
     PARQUET_ENGINE,
 )
 
+folder_mapping = {
+    0: 'Normal Operation', 1: 'Abrupt Increase of BSW', 2: 'Spurious Closure of DHSV',
+    3: 'Severe Slugging', 4: 'Flow Instability', 5: 'Rapid Productivity Loss',
+    6: 'Quick Restriction in PCK', 7: 'Scaling in PCK', 8: 'Hydrate in Production Line',
+    9: 'Hydrate in Service Line'
+}
+
 # Methods #
+
+def load_and_combine_data(dataset_dir, datatype):
+    """
+    Loads and combines Parquet files from multiple folders, adding columns
+    for folder ID, date, and time extracted from the file names.
+
+    Args:
+    ----------
+    dataset_dir : str
+        Path to the root directory containing subfolders (0 to 9) with Parquet files.
+
+    datatype : str
+        File name prefix identifying the instance type to be excluded from the
+        analysis (e.g., 'SIMULATED').
+
+    Returns:
+    --------
+    dask.dataframe.DataFrame or None
+        A combined Dask DataFrame with all the data from the Parquet files, or None
+        if no files were found.
+
+    Functionality:
+    --------------
+    - Iterates through folders (0-9) and loads all valid Parquet files, ignoring those
+      whose names start with the prefix given in `datatype` (e.g., 'SIMULATED').
+    - Extracts date and time from the file name and adds them as new columns ('data', 'hora').
+    - Adds a 'folder_id' column to identify the folder each file originated from.
+
+    Example:
+    --------
+    df = load_and_combine_data('/path/to/dataset', 'SIMULATED')
+    """
+    dfs = []
+    for folder in range(10):
+        folder_path = os.path.join(dataset_dir, str(folder))
+        if os.path.exists(folder_path):
+            for file_name in os.listdir(folder_path):
+                # Skip files whose names start with the prefix to be excluded
+                if file_name.endswith('.parquet') and not file_name.startswith(datatype):
+                    df = dd.read_parquet(os.path.join(folder_path, file_name))
+                    file_name_without_ext = os.path.splitext(file_name)[0]
+                    date_str = file_name_without_ext.split('_')[1]
+                    formatted_date = f"{date_str[:4]}-{date_str[4:6]}-{date_str[6:8]}"
+                    formatted_time = f"{date_str[8:10]}:{date_str[10:12]}:{date_str[12:]}"
+                    df = df.assign(folder_id=folder, data=formatted_date, hora=formatted_time)
+                    dfs.append(df)
+    return dd.concat(dfs) if dfs else None
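+
+# Class-code convention assumed below, as documented for the 3W Dataset:
+#   0          -> normal operation
+#   1 to 9     -> steady-state (permanent) anomaly of the corresponding type
+#   101 to 109 -> transient period leading into that anomaly (the type code + 100)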
+
+def classify_events(df):
+    """
+    Classifies events in the dataset by folder and event type, and summarizes the
+    occurrences of the different event types.
+
+    Args:
+    ----------
+    df : dask.dataframe.DataFrame
+        The DataFrame containing the event data, including a 'folder_id' column and a 'class' column.
+
+    Returns:
+    --------
+    dict
+        A dictionary summarizing the count of events by event type ('Normal Operation',
+        'Transient', 'Permanent Anomaly') for each folder.
+
+    Functionality:
+    --------------
+    - For each folder (0-9), counts the occurrences of observations in three categories:
+        - 'Normal Operation': observations labeled 0.
+        - 'Permanent Anomaly': observations labeled between 1 and 9 (steady-state anomaly).
+        - 'Transient': observations labeled between 101 and 109 (transient period of an anomaly).
+
+    Example:
+    --------
+    event_summary = classify_events(df)
+    """
+    data = {folder_mapping[i]: {'Normal Operation': 0, 'Transient': 0, 'Permanent Anomaly': 0} for i in range(10)}
+    for folder in range(10):
+        folder_data = df[df['folder_id'] == folder]
+        if len(folder_data.index) > 0:
+            dtb = folder_data['class'].value_counts().compute()
+            data[folder_mapping[folder]]['Normal Operation'] = dtb.get(0, 0)
+            data[folder_mapping[folder]]['Permanent Anomaly'] = dtb[(dtb.index >= 1) & (dtb.index <= 9)].sum()
+            data[folder_mapping[folder]]['Transient'] = dtb[(dtb.index >= 101) & (dtb.index <= 109)].sum()
+    return data
+
+def visualize_data(data):
+    """
+    Visualizes the event distribution by type using a stacked area chart.
+
+    Args:
+    ----------
+    data : dict
+        A dictionary whose keys are the operational situations and whose values are dictionaries
+        with counts of the different event types ('Normal Operation', 'Transient', 'Permanent Anomaly').
+
+    Returns:
+    --------
+    None
+        Displays a stacked area chart showing the distribution of event types for each situation.
+
+    Functionality:
+    --------------
+    - Converts the input dictionary into a DataFrame for plotting.
+    - Generates a stacked area chart with event types represented in different colors.
+    - Adds labels for the x and y axes, and a title.
+
+    Example:
+    --------
+    visualize_data(event_summary)
+    """
+    df_plot = pd.DataFrame(data).T
+    df_plot.plot(kind='area', stacked=True, color=['blue', 'orange', 'purple'], figsize=(14, 8), alpha=0.6)
+    plt.title('Occurrences by Event Type', fontsize=16)
+    plt.xlabel('Situations', fontsize=14)
+    plt.ylabel('Count', fontsize=14)
+    plt.xticks(rotation=45, ha='right', fontsize=12)
+    plt.legend(title='Event Type', loc='upper right')
+    plt.tight_layout()
+    plt.show()
+
 def label_and_file_generator(real=True, simulated=False, drawn=False):
     """This is a generating function that returns tuples for all indicated instance sources (`real`, `simulated` and/or