diff --git a/.gitignore b/.gitignore index 92c504d28..5d0e761ce 100644 --- a/.gitignore +++ b/.gitignore @@ -136,4 +136,7 @@ spark-warehouse/ spark-checkpoints/ # Delta Sharing -config.share \ No newline at end of file +config.share + +# JetBrains +.idea/
diff --git a/README.md b/README.md index e6730513c..6e3883197 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,6 @@ [![PyPI version](https://img.shields.io/pypi/v/rtdip-sdk.svg?logo=pypi&logoColor=FFE873)](https://pypi.org/project/rtdip-sdk/) [![Supported Python versions](https://img.shields.io/pypi/pyversions/rtdip-sdk.svg?logo=python&logoColor=FFE873)](https://pypi.org/project/rtdip-sdk/) [![PyPI downloads](https://img.shields.io/pypi/dm/rtdip-sdk.svg)](https://pypistats.org/packages/rtdip-sdk) -![PyPI Downloads](https://static.pepy.tech/badge/rtdip-sdk) [![OpenSSF Best Practices](https://bestpractices.coreinfrastructure.org/projects/7557/badge)](https://bestpractices.coreinfrastructure.org/projects/7557) [![Code Style Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) @@ -116,4 +115,4 @@ Distributed under the Apache License Version 2.0. See [LICENSE.md](https://githu * Check previous questions and answers or ask new ones on our slack channel [**#rtdip**](https://lfenergy.slack.com/archives/C0484R9Q6A0) ### Community -* Chat with other community members by joining the **#rtdip** Slack channel. [Click here to join our slack community](https://lfenergy.slack.com/archives/C0484R9Q6A0) +* Chat with other community members by joining the **#rtdip** Slack channel. [Click here to join our slack community](https://lfenergy.slack.com/archives/C0484R9Q6A0) \ No newline at end of file
diff --git a/docs/blog/.authors.yml b/docs/blog/.authors.yml index 966175639..ff16faf83 100644 --- a/docs/blog/.authors.yml +++ b/docs/blog/.authors.yml @@ -24,4 +24,8 @@ authors: GBARAS: name: Amber Rigg description: Contributor - avatar: https://github.com/Amber-Rigg.png \ No newline at end of file + avatar: https://github.com/Amber-Rigg.png + TUBCM: + name: Christian Munz + description: Contributor + avatar: https://github.com/chris-1187.png \ No newline at end of file
diff --git a/docs/blog/images/agile.svg b/docs/blog/images/agile.svg new file mode 100644 index 000000000..8f206ff30 --- /dev/null +++ b/docs/blog/images/agile.svg @@ -0,0 +1,1827 @@ +[1,827 lines of SVG markup for the blog header illustration omitted]
diff --git a/docs/blog/images/amos_mvi.png b/docs/blog/images/amos_mvi.png new file mode 100644 index 000000000..93fd89a78 Binary files /dev/null and b/docs/blog/images/amos_mvi.png differ
diff --git a/docs/blog/images/amos_mvi_raw.png b/docs/blog/images/amos_mvi_raw.png new file mode 100644 index 000000000..bcb1105b5 Binary files /dev/null and b/docs/blog/images/amos_mvi_raw.png differ
diff --git a/docs/blog/posts/enhancing_data_quality_amos.md b/docs/blog/posts/enhancing_data_quality_amos.md new file mode 100644 index 000000000..af4e117a7 --- /dev/null +++ b/docs/blog/posts/enhancing_data_quality_amos.md @@ -0,0 +1,94 @@ +--- +date: 2025-02-05 +authors: + - TUBCM +--- + +# Enhancing Data Quality in Real-Time: Our Experience with RTDIP and the AMOS Project + +
+ +![blog](../images/agile.svg){width=60%} +1 +
+ +Real-time data integration and preparation are crucial in today's data-driven world, especially when dealing with time series data from often distributed, heterogeneous data sources. As data scientists often spend no less than 80%2 of their time finding, integrating, and cleaning datasets, the importance of automated ingestion pipelines inevitably rises. Building such ingestion and integration frameworks can be challenging and can entail all sorts of technical debt like glue code, pipeline jungles, or dead code paths, which calls for precise conception and development of such systems. Modern software development approaches try to mitigate technical debt and enhance quality by introducing agile, more iterative methodologies that are designed to foster rapid feedback and continuous progress. + + + +As part of the Agile Methods and Open Source (AMOS) project, we had the unique opportunity to work in a SCRUM team of students from TU Berlin and FAU Erlangen-Nürnberg to build data quality measures for the RTDIP Ingestion Pipeline framework. With the goal of enhancing data quality, we got to work and built modular pipeline components that aim to help data scientists and engineers with data integration, data cleaning, and data preparation. + +But what does it mean to work in an agile framework? The Agile Manifesto is above all a set of guiding values, principles, ideals, and goals. The overarching goal is to improve performance and be most effective while adding business value. By prioritizing the right fundamentals like individuals and interactions, working software, customer collaboration, and responding to change, cross-functional teams can ship viable products more easily and faster. + +How did that work out for us in building data quality measures? True to the motto "User stories drive everything," we got together with contributors from the RTDIP Team to hear about concepts, the end users' stake in the project, and the current state, so we could get a grasp of the expectations to set for ourselves. With that, we planned our first sprint, and we soon saw how agile practice is designed to expose deficiencies in our processes. Through regular team meetings, we fostered a culture of continuous feedback and testing, leveraging reviews and retrospectives to identify roadblocks and drive the changes needed to improve the overall development process. + +## Enhancing Data Quality in RTDIP's Pipeline Framework + +Coming up with modular steps that enhance data quality was the initial and arguably most critical part of a successful development process. So the question was: what exactly do the terms data integration, data cleaning, and data preparation entail? To expand on the key parts of that, here is how we turned these aspects into RTDIP components. + +### Data Validation and Schema Alignment + +Data validation and schema alignment are critical for ensuring the reliability and usability of data, serving as a foundational step before implementing other quality measures. For the time series data at hand, we developed an InputValidator component to verify that incoming data adheres to predefined quality standards, including compliance with an expected schema, correct PySpark data types, and proper handling of null values, raising exceptions when inconsistencies are detected. Additionally, the component enforces schema integration, harmonizing data from multiple sources into a unified, predefined structure.
To maintain a consistent and efficient workflow, we required all data quality components to inherit the validation functionality of the InputValidator. + +### Data Cleansing + +Data cleansing is a vital process in enhancing the quality of data within a data integration pipeline, ensuring consistency, reliability, and usability. We implemented functionalities such as duplicate detection, which identifies and removes redundant records to prevent skewed analysis, and flatline filters, which eliminate constant, non-informative data points. Interval and range filters are employed to validate the time series data against predefined temporal or value ranges, ensuring conformity with expected patterns. Additionally, a K-sigma anomaly detection component identifies outliers based on statistical deviations, enabling the isolation of erroneous or anomalous values. Together, these methods ensure the pipeline delivers high-quality, actionable data for downstream processes. + +### Missing Value Imputation + +With a dataset refined to exclude unwanted data points and accounting for potential sensor failures, the next step toward ensuring high-quality data is to address any missing values through imputation. The component we developed first identifies and flags missing values by leveraging PySpark’s capabilities in windowing and UDF operations. With these techniques, we are able to dynamically determine the expected interval for each sensor by analyzing historical data patterns within defined partitions. Spline interpolation allows us to estimate missing values in time series data, seamlessly filling gaps with plausible and mathematically derived substitutes. By doing so, data scientists can not only improve the consistency of integrated datasets but also prevent errors or biases in analytics and machine learning models. +To show how this is realized with the new RTDIP component, here is a short example of how a few lines of code can enhance an exemplary time series load profile: +```python +from rtdip_sdk.pipelines.data_quality import MissingValueImputation +from pyspark.sql import SparkSession +import pandas as pd + +spark_session = SparkSession.builder.master("local[2]").appName("test").getOrCreate() + +source_df = pd.read_csv('./solar_energy_production_germany_April02.csv') +incomplete_spark_df = spark_session.createDataFrame(source_df, ['Value', 'EventTime', 'TagName', 'Status']) + +# Before Missing Value Imputation +incomplete_spark_df.show() + +# Execute RTDIP Pipeline component +clean_df = MissingValueImputation(spark_session, df=incomplete_spark_df).filter_data() + +# After Missing Value Imputation +clean_df.show() +``` +To illustrate this visually, plotting the before-and-after DataFrames reveals that all gaps have been successfully filled with meaningful data. + +
+ +![blog](../images/amos_mvi_raw.png){width=70%} + +![blog](../images/amos_mvi.png){width=70%} + +
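+For readers who want to reproduce a before-and-after comparison like the one above, a minimal plotting sketch could look as follows. This snippet is not part of the RTDIP SDK; it assumes the `incomplete_spark_df` and `clean_df` DataFrames from the example above and that matplotlib is available locally:
+```python
+import pandas as pd
+import matplotlib.pyplot as plt
+
+# Collect both Spark DataFrames to pandas for plotting (fine for small, local examples)
+before_pdf = incomplete_spark_df.toPandas()
+after_pdf = clean_df.toPandas()
+
+fig, (ax_raw, ax_clean) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
+ax_raw.plot(pd.to_datetime(before_pdf["EventTime"]), before_pdf["Value"].astype(float), ".", markersize=3)
+ax_raw.set_title("Raw load profile with gaps")
+ax_clean.plot(pd.to_datetime(after_pdf["EventTime"]), after_pdf["Value"].astype(float), ".", markersize=3)
+ax_clean.set_title("After MissingValueImputation")
+ax_clean.set_xlabel("EventTime")
+plt.tight_layout()
+plt.show()
+```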
+ + +### Normalization + +Normalization is a critical step in ensuring data quality within data integration pipelines that combine various sources. Techniques like mean normalization, min-max scaling, and z-score standardization help transform raw time series data into a consistent scale, eliminating biases caused by differing units or magnitudes across features. It enables fair comparisons between variables, accelerates algorithm convergence, and ensures that data from diverse sources aligns seamlessly, supporting possible downstream processes such as entity resolution, data augmentation, and machine learning. To offer a variety of use cases within the RTDIP pipeline, we implemented these normalization techniques along with their respective denormalization methods. + +### Data Monitoring + +Data monitoring is another aspect of enhancing data quality within the RTDIP pipeline, ensuring the reliability and consistency of incoming data streams. Techniques such as flatline detection identify periods of unchanging values, which may indicate sensor malfunctions or stale data. Missing data identification leverages predefined intervals or historical patterns to detect and flag gaps, enabling proactive resolution. By continuously monitoring for these anomalies, the pipeline maintains high data integrity and surfaces inconsistencies before they distort downstream analysis. + +### Data Prediction + +Forecasting based on historical data patterns is essential for making informed decisions on a business level. Linear Regression is a simple yet powerful approach for predicting continuous outcomes by establishing a relationship between input features and the target variable. However, for time series data, the ARIMA (Autoregressive Integrated Moving Average) model is often preferred due to its ability to model temporal dependencies and trends in the data. The ARIMA model combines autoregressive (AR) and moving average (MA) components, along with differencing to stabilize the variance and trends in the time series. ARIMA with autonomous parameter selection takes this a step further by automatically optimizing the model’s parameters (p, d, q) using techniques like grid search or other statistical criteria, ensuring that the model is well-suited to the data’s underlying structure for more accurate predictions. To address this, we incorporated both an ARIMA component and an AUTO-ARIMA component, enabling the prediction of future time series data points for each sensor. + +
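+To make the idea of autonomous parameter selection a bit more concrete, here is a small, purely illustrative sketch using `pmdarima` (one of the dependencies added for these components). It is not the RTDIP component API itself; it only shows how the (p, d, q) orders of an ARIMA model can be chosen automatically for a single, hypothetical sensor series based on the example CSV from above:
+```python
+import pandas as pd
+import pmdarima as pm
+
+# Hypothetical univariate series: one sensor's values indexed by event time
+series = (
+    pd.read_csv("./solar_energy_production_germany_April02.csv", parse_dates=["EventTime"])
+    .set_index("EventTime")["Value"]
+    .astype(float)
+)
+
+# auto_arima searches over candidate (p, d, q) orders and keeps the model
+# with the best information criterion (AIC by default)
+model = pm.auto_arima(series, seasonal=False, stepwise=True, suppress_warnings=True)
+print(model.summary())
+
+# Forecast the next 24 data points of the series
+forecast = model.predict(n_periods=24)
+```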
+ +Working on the RTDIP Project within AMOS has been a fantastic journey, highlighting the importance of people and teamwork in agile development. By focusing on enhancing data quality, we’ve significantly boosted the reliability, consistency, and usability of the data going through the RTDIP pipeline. + +To look back, our regular team meetings were the key to our success. Through open communication and collaboration, we tackled challenges and kept improving our processes. This showed us the power of working together in an agile framework and growing as a dedicated SCRUM team. + +We’re excited about the future and how these advancements will help data scientists and engineers make better decisions. + +
+ +1 Designed by Freepik
+2 Michael Stonebraker, Ihab F. Ilyas: Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 41(2) (2018) \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.md new file mode 100644 index 000000000..f3ef84937 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.dimensionality_reduction diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/duplicate_detection.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/duplicate_detection.md new file mode 100644 index 000000000..a76a79164 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/duplicate_detection.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.duplicate_detection \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/flatline_filter.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/flatline_filter.md new file mode 100644 index 000000000..5c82a11d3 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/flatline_filter.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.flatline_filter diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.md new file mode 100644 index 000000000..3a4018f46 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.gaussian_smoothing diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/interval_filtering.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/interval_filtering.md new file mode 100644 index 000000000..fe5f3e968 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/interval_filtering.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.interval_filtering diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.md new file mode 100644 index 000000000..70e69b3ea --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.k_sigma_anomaly_detection diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.md new file mode 100644 index 000000000..23e7fd491 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.md @@ -0,0 +1,2 @@ +::: 
src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation + diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.md new file mode 100644 index 000000000..c2d5a19cb --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.denormalization diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization.md new file mode 100644 index 000000000..2483f8dc8 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.md new file mode 100644 index 000000000..84cb4c997 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_mean diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.md new file mode 100644 index 000000000..b0ca874ad --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_minmax diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.md new file mode 100644 index 000000000..509474b78 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_zscore diff --git a/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.md b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.md new file mode 100644 index 000000000..af684fb77 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.out_of_range_value_filter \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/check_value_ranges.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/check_value_ranges.md new file mode 100644 index 000000000..c3cf7dd82 --- /dev/null +++ 
b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/check_value_ranges.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.check_value_ranges \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/flatline_detection.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/flatline_detection.md new file mode 100644 index 000000000..0b1965ff1 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/flatline_detection.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.flatline_detection \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/great_expectations.md similarity index 71% rename from docs/sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations.md rename to docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/great_expectations.md index 8f26a67bf..1f2dfd23c 100644 --- a/docs/sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations.md +++ b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/great_expectations.md @@ -2,4 +2,4 @@ Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. -::: src.sdk.python.rtdip_sdk.pipelines.monitoring.spark.data_quality.great_expectations_data_quality \ No newline at end of file +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.great_expectations_data_quality \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.md new file mode 100644 index 000000000..91215567e --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.md new file mode 100644 index 000000000..26d3b7fec --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_pattern \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/moving_average.md b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/moving_average.md new file mode 100644 index 000000000..0b13b472d --- /dev/null +++ b/docs/sdk/code-reference/pipelines/data_quality/monitoring/spark/moving_average.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.moving_average \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/forecasting/spark/arima.md b/docs/sdk/code-reference/pipelines/forecasting/spark/arima.md new file mode 100644 index 000000000..c0052fccd --- /dev/null +++ b/docs/sdk/code-reference/pipelines/forecasting/spark/arima.md @@ -0,0 
+1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.arima diff --git a/docs/sdk/code-reference/pipelines/forecasting/spark/auto_arima.md b/docs/sdk/code-reference/pipelines/forecasting/spark/auto_arima.md new file mode 100644 index 000000000..dd27e599a --- /dev/null +++ b/docs/sdk/code-reference/pipelines/forecasting/spark/auto_arima.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.auto_arima diff --git a/docs/sdk/code-reference/pipelines/forecasting/spark/data_binning.md b/docs/sdk/code-reference/pipelines/forecasting/spark/data_binning.md new file mode 100644 index 000000000..a64da6b3d --- /dev/null +++ b/docs/sdk/code-reference/pipelines/forecasting/spark/data_binning.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.data_binning diff --git a/docs/sdk/code-reference/pipelines/forecasting/spark/k_nearest_neighbors.md b/docs/sdk/code-reference/pipelines/forecasting/spark/k_nearest_neighbors.md new file mode 100644 index 000000000..215a2c4b0 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/forecasting/spark/k_nearest_neighbors.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.k_nearest_neighbors \ No newline at end of file diff --git a/docs/sdk/code-reference/pipelines/forecasting/spark/linear_regression.md b/docs/sdk/code-reference/pipelines/forecasting/spark/linear_regression.md new file mode 100644 index 000000000..653fc5400 --- /dev/null +++ b/docs/sdk/code-reference/pipelines/forecasting/spark/linear_regression.md @@ -0,0 +1 @@ +::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.linear_regression diff --git a/environment.yml b/environment.yml index 288d35e9a..652104c27 100644 --- a/environment.yml +++ b/environment.yml @@ -68,6 +68,8 @@ dependencies: - black>=24.1.0 - joblib==1.3.2,<2.0.0 - great-expectations>=0.18.8,<1.0.0 + - statsmodels>=0.14.1,<0.15.0 + - pmdarima>=2.0.4 - pip: - databricks-sdk>=0.20.0,<1.0.0 - dependency-injector>=4.41.0,<5.0.0 @@ -85,4 +87,4 @@ dependencies: - eth-typing>=4.2.3,<5.0.0 - pandas<3.0.0 - moto[s3]>=5.0.16,<6.0.0 - - pyarrow>=14.0.1,<17.0.0 + - pyarrow>=14.0.1,<17.0.0 \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 6d9b13888..834d47916 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -235,10 +235,38 @@ nav: - Azure Key Vault: sdk/code-reference/pipelines/secrets/azure_key_vault.md - Deploy: - Databricks: sdk/code-reference/pipelines/deploy/databricks.md - - Monitoring: - - Data Quality: - - Great Expectations: - - Data Quality Monitoring: sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations.md + - Data Quality: + - Monitoring: + - Check Value Ranges: sdk/code-reference/pipelines/data_quality/monitoring/spark/check_value_ranges.md + - Great Expectations: + - Data Quality Monitoring: sdk/code-reference/pipelines/data_quality/monitoring/spark/great_expectations.md + - Flatline Detection: sdk/code-reference/pipelines/data_quality/monitoring/spark/flatline_detection.md + - Identify Missing Data: + - Interval Based: sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.md + - Pattern Based: sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.md + - Moving Average: sdk/code-reference/pipelines/data_quality/monitoring/spark/moving_average.md + - Data Manipulation: + - Duplicate Detection: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/duplicate_detection.md + - Out of Range Value Filter: 
sdk/code-reference/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.md + - Flatline Filter: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/flatline_filter.md + - Gaussian Smoothing: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.md + - Dimensionality Reduction: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.md + - Interval Filtering: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/interval_filtering.md + - K-Sigma Anomaly Detection: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.md + - Missing Value Imputation: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.md + - Normalization: + - Normalization: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization.md + - Normalization Mean: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.md + - Normalization MinMax: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.md + - Normalization ZScore: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.md + - Denormalization: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.md + - Forecasting: + - Data Binning: sdk/code-reference/pipelines/forecasting/spark/data_binning.md + - Linear Regression: sdk/code-reference/pipelines/forecasting/spark/linear_regression.md + - Arima: sdk/code-reference/pipelines/forecasting/spark/arima.md + - Auto Arima: sdk/code-reference/pipelines/forecasting/spark/auto_arima.md + - K Nearest Neighbors: sdk/code-reference/pipelines/forecasting/spark/k_nearest_neighbors.md + - Jobs: sdk/pipelines/jobs.md - Deploy: - Databricks Workflows: sdk/pipelines/deploy/databricks.md @@ -330,4 +358,4 @@ nav: - blog/index.md - University: - University: university/overview.md - \ No newline at end of file + diff --git a/setup.py b/setup.py index a15c1c1a8..564b5ba67 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,4 @@ -# Copyright 2022 RTDIP +# Copyright 2025 RTDIP # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -46,6 +46,9 @@ "langchain>=0.2.0,<0.3.0", "langchain-community>=0.2.0,<0.3.0", "openai>=1.13.3,<2.0.0", + "pydantic>=2.6.0,<3.0.0", + "statsmodels>=0.14.1,<0.15.0", + "pmdarima>=2.0.4", ] PYSPARK_PACKAGES = [ @@ -71,6 +74,7 @@ "joblib>=1.3.2,<2.0.0", "sqlparams>=5.1.0,<6.0.0", "entsoe-py>=0.5.10,<1.0.0", + "numpy>=1.23.4,<2.0.0", ] EXTRAS_DEPENDENCIES: dict[str, list[str]] = { diff --git a/src/sdk/python/rtdip_sdk/pipelines/_pipeline_utils/spark.py b/src/sdk/python/rtdip_sdk/pipelines/_pipeline_utils/spark.py index 198b91431..71b960d7b 100644 --- a/src/sdk/python/rtdip_sdk/pipelines/_pipeline_utils/spark.py +++ b/src/sdk/python/rtdip_sdk/pipelines/_pipeline_utils/spark.py @@ -1,4 +1,4 @@ -# Copyright 2022 RTDIP +# Copyright 2025 RTDIP # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # limitations under the License. 
import logging -from pyspark.sql import SparkSession +from pyspark.sql import SparkSession, DataFrame from pyspark.sql.types import ( StructType, StructField, @@ -28,6 +28,7 @@ DoubleType, FloatType, ) +from pyspark.sql.functions import col from .models import Libraries from ..._sdk_utils.compare_versions import _package_version_meets_minimum @@ -117,6 +118,96 @@ def get_dbutils( # def onQueryTerminated(self, event): # logging.info("Query terminated: {} {}".format(event.id, event.name)) + +def is_dataframe_partially_conformed_in_schema( + dataframe: DataFrame, schema: StructType, throw_error: bool = True +) -> bool: + """ + Checks if all columns in the dataframe are contained in the schema with appropriate types. + + Parameters: + dataframe (DataFrame): The dataframe to check. + schema (StructType): The schema to conform to. + throw_error (bool): If True, raises an error on non-conformance. Defaults to True. + + Returns: + bool: True if the dataframe conforms to the schema, False otherwise. + """ + for column in dataframe.schema: + if column.name in schema.names: + schema_field = schema[column.name] + if not isinstance(column.dataType, type(schema_field.dataType)): + if throw_error: + raise ValueError( + "Column {0} is of Type {1}, expected Type {2}".format( + column, column.dataType, schema_field.dataType + ) + ) + return False + else: + # dataframe contains column not expected ins schema + if not throw_error: + return False + else: + raise ValueError( + "Column {0} is not expected in dataframe".format(column) + ) + return True + + +def conform_dataframe_to_schema( + dataframe: DataFrame, schema: StructType, throw_error: bool = True +) -> DataFrame: + """ + Tries to convert all columns to the given schema. + + Parameters: + dataframe (DataFrame): The dataframe to conform. + schema (StructType): The schema to conform to. + throw_error (bool): If True, raises an error on non-conformance. Defaults to True. + + Returns: + DataFrame: The conformed dataframe. + """ + for column in dataframe.schema: + c_name = column.name + if c_name in schema.names: + schema_field = schema[c_name] + if not isinstance(column.dataType, type(schema_field.dataType)): + dataframe = dataframe.withColumn( + c_name, dataframe[c_name].cast(schema_field.dataType) + ) + else: + if throw_error: + raise ValueError(f"Column '{c_name}' is not expected in the dataframe") + else: + dataframe = dataframe.drop(c_name) + return dataframe + + +def split_by_source(df: DataFrame, split_by_col: str, timestamp_col: str) -> dict: + """ + + Helper method to separate individual time series based on their source. + + Parameters: + df (DataFrame): The input DataFrame. + split_by_col (str): The column name to split the DataFrame by. + timestamp_col (str): The column name to order the DataFrame by. + + Returns: + dict: A dictionary where keys are distinct values from split_by_col and values are DataFrames filtered and ordered by timestamp_col. 
+ """ + tag_names = df.select(split_by_col).distinct().collect() + tag_names = [row[split_by_col] for row in tag_names] + source_dict = { + tag: df.filter(col(split_by_col) == tag).orderBy(timestamp_col) + for tag in tag_names + } + + return source_dict + + EVENTHUB_SCHEMA = StructType( [ StructField("body", BinaryType(), True), @@ -469,6 +560,15 @@ def get_dbutils( ] ) +PROCESS_DATA_MODEL_EVENT_SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] +) + KAFKA_SCHEMA = StructType( [ StructField("key", BinaryType(), True), diff --git a/src/sdk/python/rtdip_sdk/pipelines/monitoring/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py similarity index 86% rename from src/sdk/python/rtdip_sdk/pipelines/monitoring/__init__.py rename to src/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py index 17e525274..734152471 100644 --- a/src/sdk/python/rtdip_sdk/pipelines/monitoring/__init__.py +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py @@ -1,4 +1,4 @@ -# Copyright 2022 RTDIP +# Copyright 2025 RTDIP # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,4 +11,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from .spark.data_quality.great_expectations_data_quality import * + +from .data_manipulation import * +from .monitoring import * diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py new file mode 100644 index 000000000..76bb6a388 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .spark import * diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/interfaces.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/interfaces.py new file mode 100644 index 000000000..2e226f20d --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/interfaces.py @@ -0,0 +1,24 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from abc import abstractmethod + +from pyspark.sql import DataFrame +from ...interfaces import PipelineComponentBaseInterface + + +class DataManipulationBaseInterface(PipelineComponentBaseInterface): + @abstractmethod + def filter_data(self) -> DataFrame: + pass diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py new file mode 100644 index 000000000..0d716ab8a --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py @@ -0,0 +1,22 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .normalization import * +from .dimensionality_reduction import DimensionalityReduction +from .duplicate_detection import DuplicateDetection +from .interval_filtering import IntervalFiltering +from .k_sigma_anomaly_detection import KSigmaAnomalyDetection +from .missing_value_imputation import MissingValueImputation +from .out_of_range_value_filter import OutOfRangeValueFilter +from .flatline_filter import FlatlineFilter diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py new file mode 100644 index 000000000..2009e5145 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py @@ -0,0 +1,157 @@ +# Copyright 2025 Project Team +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.ml.stat import Correlation +from pyspark.sql.functions import col +from pyspark.ml.feature import VectorAssembler + +from ..interfaces import DataManipulationBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class DimensionalityReduction(DataManipulationBaseInterface): + """ + Detects and combines columns based on correlation or exact duplicates. 
+ + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.dimensionality_reduction import DimensionalityReduction + + from pyspark.sql import SparkSession + + column_correlation_monitor = DimensionalityReduction( + df, + columns=['column1', 'column2'], + threshold=0.95, + combination_method='mean' + ) + + result = column_correlation_monitor.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be analyzed and transformed. + columns (list): List of column names to check for correlation. Only two columns are supported. + threshold (float, optional): Correlation threshold for column combination [0-1]. If the absolute value of the correlation is equal or bigger, than the columns are combined. Defaults to 0.9. + combination_method (str, optional): Method to combine correlated columns. + Supported methods: + - 'mean': Average the values of both columns and write the result to the first column + (New value = (column1 + column2) / 2) + - 'sum': Sum the values of both columns and write the result to the first column + (New value = column1 + column2) + - 'first': Keep the first column, drop the second column + - 'second': Keep the second column, drop the first column + - 'delete': Remove both columns entirely from the DataFrame + Defaults to 'mean'. + """ + + df: PySparkDataFrame + columns_to_check: list + threshold: float + combination_method: str + + def __init__( + self, + df: PySparkDataFrame, + columns: list, + threshold: float = 0.9, + combination_method: str = "mean", + ) -> None: + # Validate inputs + if not columns or not isinstance(columns, list): + raise ValueError("columns must be a non-empty list of column names.") + if len(columns) != 2: + raise ValueError( + "columns must contain exactly two columns for correlation." + ) + + if not 0 <= threshold <= 1: + raise ValueError("Threshold must be between 0 and 1.") + + valid_methods = ["mean", "sum", "first", "second", "delete"] + if combination_method not in valid_methods: + raise ValueError(f"combination_method must be one of {valid_methods}") + + self.df = df + self.columns_to_check = columns + self.threshold = threshold + self.combination_method = combination_method + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def _calculate_correlation(self) -> float: + """ + Calculate correlation between specified columns. + + Returns: + float: Correlation matrix between columns + """ + assembler = VectorAssembler( + inputCols=self.columns_to_check, outputCol="features" + ) + vector_df = assembler.transform(self.df) + + correlation_matrix = Correlation.corr( + vector_df, "features", method="pearson" + ).collect()[0][0] + + # Correlation between first and second column + return correlation_matrix.toArray()[0][1] + + def filter_data(self) -> PySparkDataFrame: + """ + Process DataFrame by detecting and combining correlated columns. 
+ + Returns: + PySparkDataFrame: Transformed PySpark DataFrame + """ + correlation = self._calculate_correlation() + + # If the absolute correlation is below the threshold, return the original DataFrame + if abs(correlation) < self.threshold: + return self.df + + col1, col2 = self.columns_to_check + if self.combination_method == "mean": + return self.df.withColumn(col1, (col(col1) + col(col2)) / 2).drop(col2) + elif self.combination_method == "sum": + return self.df.withColumn(col1, col(col1) + col(col2)).drop(col2) + elif self.combination_method == "first": + return self.df.drop(col2) + elif self.combination_method == "second": + return self.df.drop(col1) + elif self.combination_method == "delete": + return self.df.drop(col1).drop(col2) + else: + return self.df
diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/duplicate_detection.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/duplicate_detection.py new file mode 100644 index 000000000..20df3eded --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/duplicate_detection.py @@ -0,0 +1,81 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from pyspark.sql.functions import desc +from pyspark.sql import DataFrame as PySparkDataFrame + +from ..interfaces import DataManipulationBaseInterface +from ...input_validator import InputValidator +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class DuplicateDetection(DataManipulationBaseInterface, InputValidator): + """ + Cleanses a PySpark DataFrame from duplicates. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.duplicate_detection import DuplicateDetection + + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + duplicate_detection_monitor = DuplicateDetection(df, primary_key_columns=["TagName", "EventTime"]) + + result = duplicate_detection_monitor.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be cleansed. + primary_key_columns (list): List of column names that serve as primary key for duplicate detection. + """ + + df: PySparkDataFrame + primary_key_columns: list + + def __init__(self, df: PySparkDataFrame, primary_key_columns: list) -> None: + if not primary_key_columns or not isinstance(primary_key_columns, list): + raise ValueError( + "primary_key_columns must be a non-empty list of column names." + ) + self.df = df + self.primary_key_columns = primary_key_columns + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self) -> PySparkDataFrame: + """ + Returns: + PySparkDataFrame: A cleansed PySpark DataFrame from all duplicates based on primary key columns. 
+ """ + cleansed_df = self.df.dropDuplicates(self.primary_key_columns) + return cleansed_df diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/flatline_filter.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/flatline_filter.py new file mode 100644 index 000000000..4809dde0b --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/flatline_filter.py @@ -0,0 +1,92 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from pyspark.sql import DataFrame as PySparkDataFrame + +from ...monitoring.spark.flatline_detection import FlatlineDetection +from ..interfaces import DataManipulationBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class FlatlineFilter(DataManipulationBaseInterface): + """ + Removes and logs rows with flatlining detected in specified columns of a PySpark DataFrame. + + Args: + df (pyspark.sql.DataFrame): The input DataFrame to process. + watch_columns (list): List of column names to monitor for flatlining (null or zero values). + tolerance_timespan (int): Maximum allowed consecutive flatlining period. Rows exceeding this period are removed. + + Example: + ```python + from pyspark.sql import SparkSession + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.flatline_filter import FlatlineFilter + + + spark = SparkSession.builder.master("local[1]").appName("FlatlineFilterExample").getOrCreate() + + # Example DataFrame + data = [ + (1, "2024-01-02 03:49:45.000", 0.0), + (1, "2024-01-02 03:50:45.000", 0.0), + (1, "2024-01-02 03:51:45.000", 0.0), + (2, "2024-01-02 03:49:45.000", 5.0), + ] + columns = ["TagName", "EventTime", "Value"] + df = spark.createDataFrame(data, columns) + + filter_flatlining_rows = FlatlineFilter( + df=df, + watch_columns=["Value"], + tolerance_timespan=2, + ) + + result_df = filter_flatlining_rows.filter_data() + result_df.show() + ``` + """ + + def __init__( + self, df: PySparkDataFrame, watch_columns: list, tolerance_timespan: int + ) -> None: + self.df = df + self.flatline_detection = FlatlineDetection( + df=df, watch_columns=watch_columns, tolerance_timespan=tolerance_timespan + ) + + @staticmethod + def system_type(): + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self) -> PySparkDataFrame: + """ + Removes rows with flatlining detected. + + Returns: + pyspark.sql.DataFrame: A DataFrame without rows with flatlining detected. 
+ """ + flatlined_rows = self.flatline_detection.check_for_flatlining() + flatlined_rows = flatlined_rows.select(*self.df.columns) + return self.df.subtract(flatlined_rows) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.py new file mode 100644 index 000000000..49a0cd8f7 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.py @@ -0,0 +1,146 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np +from pyspark.sql.types import FloatType +from scipy.ndimage import gaussian_filter1d +from pyspark.sql import DataFrame as PySparkDataFrame, Window +from pyspark.sql import functions as F + +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ..interfaces import DataManipulationBaseInterface + + +class GaussianSmoothing(DataManipulationBaseInterface): + """ + Applies Gaussian smoothing to a PySpark DataFrame. This method smooths the values in a specified column + using a Gaussian filter, which helps reduce noise and fluctuations in time-series or spatial data. + + The smoothing can be performed in two modes: + - **Temporal mode**: Applies smoothing along the time axis within each unique ID. + - **Spatial mode**: Applies smoothing across different IDs for the same timestamp. + + Example + -------- + ```python + from pyspark.sql import SparkSession + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.gaussian_smoothing import GaussianSmoothing + + + spark = SparkSession.builder.getOrCreate() + df = ... # Load your PySpark DataFrame + + smoothed_df = GaussianSmoothing( + df=df, + sigma=2.0, + mode="temporal", + id_col="sensor_id", + timestamp_col="timestamp", + value_col="measurement" + ).filter_data() + + smoothed_df.show() + ``` + + Parameters: + df (PySparkDataFrame): The input PySpark DataFrame. + sigma (float): The standard deviation for the Gaussian kernel, controlling the amount of smoothing. + mode (str, optional): The smoothing mode, either `"temporal"` (default) or `"spatial"`. + id_col (str, optional): The name of the column representing unique entity IDs (default: `"id"`). + timestamp_col (str, optional): The name of the column representing timestamps (default: `"timestamp"`). + value_col (str, optional): The name of the column containing the values to be smoothed (default: `"value"`). + + Raises: + TypeError: If `df` is not a PySpark DataFrame. + ValueError: If `sigma` is not a positive number. + ValueError: If `mode` is not `"temporal"` or `"spatial"`. + ValueError: If `id_col`, `timestamp_col`, or `value_col` are not found in the DataFrame. 
+ """ + + def __init__( + self, + df: PySparkDataFrame, + sigma: float, + mode: str = "temporal", + id_col: str = "id", + timestamp_col: str = "timestamp", + value_col: str = "value", + ) -> None: + if not isinstance(df, PySparkDataFrame): + raise TypeError("df must be a PySpark DataFrame") + if not isinstance(sigma, (int, float)) or sigma <= 0: + raise ValueError("sigma must be a positive number") + if mode not in ["temporal", "spatial"]: + raise ValueError("mode must be either 'temporal' or 'spatial'") + + if id_col not in df.columns: + raise ValueError(f"Column {id_col} not found in DataFrame") + if timestamp_col not in df.columns: + raise ValueError(f"Column {timestamp_col} not found in DataFrame") + if value_col not in df.columns: + raise ValueError(f"Column {value_col} not found in DataFrame") + + self.df = df + self.sigma = sigma + self.mode = mode + self.id_col = id_col + self.timestamp_col = timestamp_col + self.value_col = value_col + + @staticmethod + def system_type(): + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + @staticmethod + def create_gaussian_smoother(sigma_value): + def apply_gaussian(values): + if not values: + return None + values_array = np.array([float(v) for v in values]) + smoothed = gaussian_filter1d(values_array, sigma=sigma_value) + return float(smoothed[-1]) + + return apply_gaussian + + def filter_data(self) -> PySparkDataFrame: + + smooth_udf = F.udf(self.create_gaussian_smoother(self.sigma), FloatType()) + + if self.mode == "temporal": + window = ( + Window.partitionBy(self.id_col) + .orderBy(self.timestamp_col) + .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing) + ) + else: # spatial mode + window = ( + Window.partitionBy(self.timestamp_col) + .orderBy(self.id_col) + .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing) + ) + + collect_list_expr = F.collect_list(F.col(self.value_col)).over(window) + + return self.df.withColumn(self.value_col, smooth_udf(collect_list_expr)) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/interval_filtering.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/interval_filtering.py new file mode 100644 index 000000000..35cf723e0 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/interval_filtering.py @@ -0,0 +1,184 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from datetime import timedelta + +import pandas as pd +from pyspark.sql.types import StringType +from pyspark.sql import functions as F +from pyspark.sql import SparkSession +from pyspark.sql import DataFrame + +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ..interfaces import DataManipulationBaseInterface +from ...input_validator import InputValidator + + +class IntervalFiltering(DataManipulationBaseInterface, InputValidator): + """ + Cleanses a DataFrame by removing rows outside a specified interval window. Supported time stamp columns are DateType and StringType. + + Parameters: + spark (SparkSession): A SparkSession object. + df (DataFrame): PySpark DataFrame to be converted + interval (int): The interval length for cleansing. + interval_unit (str): 'hours', 'minutes', 'seconds' or 'milliseconds' to specify the unit of the interval. + time_stamp_column_name (str): The name of the column containing the time stamps. Default is 'EventTime'. + tolerance (int): The tolerance for the interval. Default is None. + """ + + """ Default time stamp column name if not set in the constructor """ + DEFAULT_TIME_STAMP_COLUMN_NAME: str = "EventTime" + + def __init__( + self, + spark: SparkSession, + df: DataFrame, + interval: int, + interval_unit: str, + time_stamp_column_name: str = None, + tolerance: int = None, + ) -> None: + self.spark = spark + self.df = df + self.interval = interval + self.interval_unit = interval_unit + self.tolerance = tolerance + if time_stamp_column_name is None: + self.time_stamp_column_name = self.DEFAULT_TIME_STAMP_COLUMN_NAME + else: + self.time_stamp_column_name = time_stamp_column_name + + def filter_data(self) -> DataFrame: + """ + Filters the DataFrame based on the interval + """ + + if self.time_stamp_column_name not in self.df.columns: + raise ValueError( + f"Column {self.time_stamp_column_name} not found in the DataFrame." 
+ ) + is_string_time_stamp = isinstance( + self.df.schema[self.time_stamp_column_name].dataType, StringType + ) + + original_schema = self.df.schema + self.df = self.convert_column_to_timestamp().orderBy( + self.time_stamp_column_name + ) + + tolerance_in_ms = None + if self.tolerance is not None: + tolerance_in_ms = self.get_time_delta(self.tolerance).total_seconds() * 1000 + + time_delta_in_ms = self.get_time_delta(self.interval).total_seconds() * 1000 + + rows = self.df.collect() + last_time_stamp = rows[0][self.time_stamp_column_name] + first_row = rows[0].asDict() + + first_row[self.time_stamp_column_name] = ( + self.format_date_time_to_string(first_row[self.time_stamp_column_name]) + if is_string_time_stamp + else first_row[self.time_stamp_column_name] + ) + + cleansed_df = [first_row] + + for i in range(1, len(rows)): + current_row = rows[i] + current_time_stamp = current_row[self.time_stamp_column_name] + + if self.check_outside_of_interval( + current_time_stamp, last_time_stamp, time_delta_in_ms, tolerance_in_ms + ): + current_row_dict = current_row.asDict() + current_row_dict[self.time_stamp_column_name] = ( + self.format_date_time_to_string( + current_row_dict[self.time_stamp_column_name] + ) + if is_string_time_stamp + else current_row_dict[self.time_stamp_column_name] + ) + + cleansed_df.append(current_row_dict) + last_time_stamp = current_time_stamp + + result_df = self.spark.createDataFrame(cleansed_df, schema=original_schema) + + return result_df + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def convert_column_to_timestamp(self) -> DataFrame: + try: + return self.df.withColumn( + self.time_stamp_column_name, F.to_timestamp(self.time_stamp_column_name) + ) + except Exception as e: + raise ValueError( + f"Error converting column {self.time_stamp_column_name} to timestamp: {e}" + f"{self.df.schema[self.time_stamp_column_name].dataType} might be unsupported!" 
+ ) + + def get_time_delta(self, value: int) -> timedelta: + if self.interval_unit == "minutes": + return timedelta(minutes=value) + elif self.interval_unit == "days": + return timedelta(days=value) + elif self.interval_unit == "hours": + return timedelta(hours=value) + elif self.interval_unit == "seconds": + return timedelta(seconds=value) + elif self.interval_unit == "milliseconds": + return timedelta(milliseconds=value) + else: + raise ValueError( + "interval_unit must be either 'days', 'hours', 'minutes', 'seconds' or 'milliseconds'" + ) + + def check_outside_of_interval( + self, + current_time_stamp: pd.Timestamp, + last_time_stamp: pd.Timestamp, + time_delta_in_ms: float, + tolerance_in_ms: float, + ) -> bool: + time_difference = (current_time_stamp - last_time_stamp).total_seconds() * 1000 + if not tolerance_in_ms is None: + time_difference += tolerance_in_ms + return time_difference >= time_delta_in_ms + + def format_date_time_to_string(self, time_stamp: pd.Timestamp) -> str: + try: + return time_stamp.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + except Exception as e: + raise ValueError(f"Error converting timestamp to string: {e}") diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.py new file mode 100644 index 000000000..090c149a5 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.py @@ -0,0 +1,142 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pyspark.sql import DataFrame, SparkSession +from pyspark.sql.functions import mean, stddev, abs, col +from ..interfaces import DataManipulationBaseInterface +from ...input_validator import InputValidator +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from pyspark.sql.types import ( + DoubleType, + StructType, + StructField, +) + + +class KSigmaAnomalyDetection(DataManipulationBaseInterface, InputValidator): + """ + Anomaly detection with the k-sigma method. This method either computes the mean and standard deviation, or the median and the median absolute deviation (MAD) of the data. + The k-sigma method then filters out all data points that are k times the standard deviation away from the mean, or k times the MAD away from the median. + Assuming a normal distribution, this method keeps around 99.7% of the data points when k=3 and use_median=False. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.k_sigma_anomaly_detection import KSigmaAnomalyDetection + + + spark = ... # SparkSession + df = ... # Get a PySpark DataFrame + + filtered_df = KSigmaAnomalyDetection( + spark, df, [""] + ).filter_data() + + filtered_df.show() + ``` + + Parameters: + spark (SparkSession): A SparkSession object. + df (DataFrame): Dataframe containing the raw data. 
+ column_names (list[str]): The names of the columns to be filtered (currently only one column is supported). + k_value (float): The number of deviations to build the threshold. + use_median (book): If True the median and the median absolute deviation (MAD) are used, instead of the mean and standard deviation. + """ + + def __init__( + self, + spark: SparkSession, + df: DataFrame, + column_names: list[str], + k_value: float = 3.0, + use_median: bool = False, + ) -> None: + if len(column_names) == 0: + raise Exception("You must provide at least one column name") + if len(column_names) > 1: + raise NotImplementedError("Multiple columns are not supported yet") + + self.column_names = column_names + self.use_median = use_median + self.spark = spark + self.df = df + self.k_value = k_value + + self.validate( + StructType( + [StructField(column, DoubleType(), True) for column in column_names] + ) + ) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self) -> DataFrame: + """ + Filter anomalies based on the k-sigma rule + """ + + column_name = self.column_names[0] + mean_value, deviation = 0, 0 + + if self.use_median: + mean_value = self.df.approxQuantile(column_name, [0.5], 0.0)[0] + if mean_value is None: + raise Exception("Failed to calculate the mean value") + + df_with_deviation = self.df.withColumn( + "absolute_deviation", abs(col(column_name) - mean_value) + ) + deviation = df_with_deviation.approxQuantile( + "absolute_deviation", [0.5], 0.0 + )[0] + if deviation is None: + raise Exception("Failed to calculate the deviation value") + else: + stats = self.df.select( + mean(column_name), stddev(self.column_names[0]) + ).first() + if stats is None: + raise Exception( + "Failed to calculate the mean value and the standard deviation value" + ) + + mean_value = stats[0] + deviation = stats[1] + + shift = self.k_value * deviation + lower_bound = mean_value - shift + upper_bound = mean_value + shift + + return self.df.filter( + (self.df[column_name] >= lower_bound) + & (self.df[column_name] <= upper_bound) + ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.py new file mode 100644 index 000000000..955d49ea2 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.py @@ -0,0 +1,290 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
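As a quick illustration of the rule that `KSigmaAnomalyDetection` implements, the sketch below reproduces both branches (mean/standard deviation and median/MAD) on a small, hypothetical NumPy array; the component itself computes the same statistics with Spark aggregations and `approxQuantile`.

```python
# Minimal sketch of the k-sigma rule on hypothetical values, no Spark needed.
import numpy as np

values = np.array([10.0, 11.0, 9.5, 10.5, 42.0, 10.2])
k = 3.0

# use_median=False branch: mean and (sample) standard deviation
mean, std = values.mean(), values.std(ddof=1)
kept = values[(values >= mean - k * std) & (values <= mean + k * std)]

# use_median=True branch: median and median absolute deviation (MAD)
median = np.median(values)
mad = np.median(np.abs(values - median))
kept_robust = values[np.abs(values - median) <= k * mad]
```

With a sample this small the outlier inflates the standard deviation, so the mean-based bounds keep all six values, while the median/MAD branch drops the 42.0 — the kind of situation the `use_median` option is intended for.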
+ +from pyspark.sql import SparkSession, DataFrame as PySparkDataFrame, functions as F, Row +from pyspark.sql.functions import col, udf +from pyspark.sql.types import StringType, TimestampType, FloatType, ArrayType +from pyspark.sql.window import Window +from scipy.interpolate import UnivariateSpline +import numpy as np +from datetime import timedelta +from typing import List +from ..interfaces import DataManipulationBaseInterface +from ...input_validator import InputValidator +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class MissingValueImputation(DataManipulationBaseInterface, InputValidator): + """ + Imputes missing values in a univariate time series creating a continuous curve of data points. For that, the + time intervals of each individual source is calculated, to then insert empty records at the missing timestamps with + NaN values. Through spline interpolation the missing NaN values are calculated resulting in a consistent data set + and thus enhance your data quality. + + Example + -------- + ```python + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + from pyspark.sql.types import StructType, StructField, StringType + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation import ( + MissingValueImputation, + ) + + spark = spark_session() + + schema = StructType([ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True) + ]) + + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 15:39:03.000", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 19:42:37.000", "Good", "5.0"), + #("A2PS64V0J.:ZUX09R", "2024-01-01 23:46:11.000", "Good", "6.0"), # Test values + #("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "7.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "8.0"), + ] + df = spark.createDataFrame(data, schema=schema) + + missing_value_imputation = MissingValueImputation(spark, df) + result = missing_value_imputation.filter_data() + ``` + + Parameters: + df (DataFrame): Dataframe containing the raw data. 
+ tolerance_percentage (int): Percentage value that indicates how much the time series data points may vary + in each interval + """ + + df: PySparkDataFrame + + def __init__( + self, + spark: SparkSession, + df: PySparkDataFrame, + tolerance_percentage: int = 5, + ) -> None: + self.spark = spark + self.df = df + self.tolerance_percentage = tolerance_percentage + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + @staticmethod + def _impute_missing_values_sp(df) -> PySparkDataFrame: + """ + Imputes missing values by Spline Interpolation + """ + data = np.array( + df.select("Value").rdd.flatMap(lambda x: x).collect(), dtype=float + ) + mask = np.isnan(data) + + x_data = np.arange(len(data)) + y_data = data[~mask] + + spline = UnivariateSpline(x_data[~mask], y_data, s=0) + + data_imputed = data.copy() + data_imputed[mask] = spline(x_data[mask]) + data_imputed_list = data_imputed.tolist() + + imputed_rdd = df.rdd.zipWithIndex().map( + lambda row: Row( + TagName=row[0][0], + EventTime=row[0][1], + Status=row[0][2], + Value=float(data_imputed_list[row[1]]), + ) + ) + imputed_df = imputed_rdd.toDF(df.schema) + + return imputed_df + + @staticmethod + def _flag_missing_values(df, tolerance_percentage) -> PySparkDataFrame: + """ + Determines intervals of each respective source time series and inserts empty records at missing timestamps + with NaN values + """ + window_spec = Window.partitionBy("TagName").orderBy("EventTime") + + df = df.withColumn("prev_event_time", F.lag("EventTime").over(window_spec)) + df = df.withColumn( + "time_diff_seconds", + (F.unix_timestamp("EventTime") - F.unix_timestamp("prev_event_time")), + ) + + df_diff = df.filter(F.col("time_diff_seconds").isNotNull()) + interval_counts = df_diff.groupBy("time_diff_seconds").count() + most_frequent_interval = interval_counts.orderBy(F.desc("count")).first() + expected_interval = ( + most_frequent_interval["time_diff_seconds"] + if most_frequent_interval + else None + ) + + tolerance = ( + (expected_interval * tolerance_percentage) / 100 if expected_interval else 0 + ) + + existing_timestamps = ( + df.select("TagName", "EventTime") + .rdd.map(lambda row: (row["TagName"], row["EventTime"])) + .groupByKey() + .collectAsMap() + ) + + def generate_missing_timestamps(prev_event_time, event_time, tag_name): + # Check for first row + if ( + prev_event_time is None + or event_time is None + or expected_interval is None + ): + return [] + + # Check against existing timestamps to avoid duplicates + tag_timestamps = set(existing_timestamps.get(tag_name, [])) + missing_timestamps = [] + current_time = prev_event_time + + while current_time < event_time: + next_expected_time = current_time + timedelta(seconds=expected_interval) + time_diff = abs((next_expected_time - event_time).total_seconds()) + if time_diff <= tolerance: + break + if next_expected_time not in tag_timestamps: + missing_timestamps.append(next_expected_time) + current_time = next_expected_time + + return missing_timestamps + + generate_missing_timestamps_udf = udf( + generate_missing_timestamps, ArrayType(TimestampType()) + ) + + df_with_missing = df.withColumn( + "missing_timestamps", + generate_missing_timestamps_udf("prev_event_time", "EventTime", "TagName"), + ) + + df_missing_entries = df_with_missing.select( + "TagName", + 
F.explode("missing_timestamps").alias("EventTime"), + F.lit("Good").alias("Status"), + F.lit(float("nan")).cast(FloatType()).alias("Value"), + ) + + df_combined = ( + df.select("TagName", "EventTime", "Status", "Value") + .union(df_missing_entries) + .orderBy("EventTime") + ) + + return df_combined + + @staticmethod + def _is_column_type(df, column_name, data_type): + """ + Helper method for data type checking + """ + type_ = df.schema[column_name] + + return isinstance(type_.dataType, data_type) + + def filter_data(self) -> PySparkDataFrame: + """ + Imputate missing values based on [Spline Interpolation, ] + """ + if not all( + col_ in self.df.columns + for col_ in ["TagName", "EventTime", "Value", "Status"] + ): + raise ValueError("Columns not as expected") + + if not self._is_column_type(self.df, "EventTime", TimestampType): + if self._is_column_type(self.df, "EventTime", StringType): + # Attempt to parse the first format, then fallback to the second + self.df = self.df.withColumn( + "EventTime", + F.coalesce( + F.to_timestamp("EventTime", "yyyy-MM-dd HH:mm:ss.SSS"), + F.to_timestamp("EventTime", "dd.MM.yyyy HH:mm:ss"), + ), + ) + if not self._is_column_type(self.df, "Value", FloatType): + self.df = self.df.withColumn("Value", self.df["Value"].cast(FloatType())) + + dfs_by_source = self._split_by_source() + + imputed_dfs: List[PySparkDataFrame] = [] + + for source, df in dfs_by_source.items(): + # Determine, insert and flag all the missing entries + flagged_df = self._flag_missing_values(df, self.tolerance_percentage) + + # Impute the missing values of flagged entries + try: + imputed_df_sp = self._impute_missing_values_sp(flagged_df) + except Exception as e: + if flagged_df.count() != 1: # Account for single entries + raise Exception( + "Something went wrong while imputing missing values" + ) + + imputed_dfs.append(imputed_df_sp) + + result_df = imputed_dfs[0] + for df in imputed_dfs[1:]: + result_df = result_df.unionByName(df) + + return result_df + + def _split_by_source(self) -> dict: + """ + Helper method to separate individual time series based on their source + """ + tag_names = self.df.select("TagName").distinct().collect() + tag_names = [row["TagName"] for row in tag_names] + source_dict = { + tag: self.df.filter(col("TagName") == tag).orderBy("EventTime") + for tag in tag_names + } + + return source_dict diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/__init__.py new file mode 100644 index 000000000..672fdd6d3 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/__init__.py @@ -0,0 +1,18 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from .denormalization import Denormalization +from .normalization_mean import NormalizationMean +from .normalization_minmax import NormalizationMinMax +from .normalization_zscore import NormalizationZScore diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.py new file mode 100644 index 000000000..3e7a7fc8b --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.py @@ -0,0 +1,75 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from pyspark.sql import DataFrame as PySparkDataFrame +from ....input_validator import InputValidator +from ...interfaces import ( + DataManipulationBaseInterface, +) +from ....._pipeline_utils.models import ( + Libraries, + SystemType, +) +from .normalization import ( + NormalizationBaseClass, +) + + +class Denormalization(DataManipulationBaseInterface, InputValidator): + """ + Applies the appropriate denormalization method to revert values to their original scale. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.denormalization import Denormalization + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + denormalization = Denormalization(normalized_df, normalization) + denormalized_df = denormalization.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be reverted to its original scale. + normalization_to_revert (NormalizationBaseClass): An instance of the specific normalization subclass (NormalizationZScore, NormalizationMinMax, NormalizationMean) that was originally used to normalize the data. + """ + + df: PySparkDataFrame + normalization_to_revert: NormalizationBaseClass + + def __init__( + self, df: PySparkDataFrame, normalization_to_revert: NormalizationBaseClass + ) -> None: + self.df = df + self.normalization_to_revert = normalization_to_revert + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self) -> PySparkDataFrame: + return self.normalization_to_revert.denormalize(self.df) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization.py new file mode 100644 index 000000000..dd4c3cad3 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization.py @@ -0,0 +1,149 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from abc import abstractmethod +from pyspark.sql import DataFrame as PySparkDataFrame +from typing import List +from pyspark.sql.types import DoubleType, StructField, StructType +from ....input_validator import InputValidator +from ...interfaces import ( + DataManipulationBaseInterface, +) +from ....._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class NormalizationBaseClass(DataManipulationBaseInterface, InputValidator): + """ + A base class for applying normalization techniques to multiple columns in a PySpark DataFrame. + This class serves as a framework to support various normalization methods (e.g., Z-Score, Min-Max, and Mean), + with specific implementations in separate subclasses for each normalization type. + + Subclasses should implement specific normalization and denormalization methods by inheriting from this base class. + + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization import NormalizationZScore + + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + normalization = NormalizationZScore(df, column_names=["value_column_1", "value_column_2"], in_place=False) + normalized_df = normalization.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be normalized. + column_names (List[str]): List of columns in the DataFrame to be normalized. + in_place (bool): If true, then result of normalization is stored in the same column. + + Attributes: + NORMALIZATION_NAME_POSTFIX : str + Suffix added to the column name if a new column is created for normalized values. + + """ + + df: PySparkDataFrame + column_names: List[str] + in_place: bool + + reversal_value: List[float] + + # Appended to column name if new column is added + NORMALIZATION_NAME_POSTFIX: str = "normalization" + + def __init__( + self, df: PySparkDataFrame, column_names: List[str], in_place: bool = False + ) -> None: + self.df = df + self.column_names = column_names + self.in_place = in_place + + EXPECTED_SCHEMA = StructType( + [StructField(column_name, DoubleType()) for column_name in column_names] + ) + self.validate(EXPECTED_SCHEMA) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self): + return self.normalize() + + def normalize(self) -> PySparkDataFrame: + """ + Applies the specified normalization to each column in column_names. + + Returns: + DataFrame: A PySpark DataFrame with the normalized values. + """ + normalized_df = self.df + for column in self.column_names: + normalized_df = self._normalize_column(normalized_df, column) + return normalized_df + + def denormalize(self, input_df) -> PySparkDataFrame: + """ + Denormalizes the input DataFrame. Intended to be used by the denormalization component. + + Parameters: + input_df (DataFrame): Dataframe containing the current data. 
+ """ + denormalized_df = input_df + if not self.in_place: + for column in self.column_names: + denormalized_df = denormalized_df.drop( + self._get_norm_column_name(column) + ) + else: + for column in self.column_names: + denormalized_df = self._denormalize_column(denormalized_df, column) + return denormalized_df + + @property + @abstractmethod + def NORMALIZED_COLUMN_NAME(self): ... + + @abstractmethod + def _normalize_column(self, df: PySparkDataFrame, column: str) -> PySparkDataFrame: + pass + + @abstractmethod + def _denormalize_column( + self, df: PySparkDataFrame, column: str + ) -> PySparkDataFrame: + pass + + def _get_norm_column_name(self, column_name: str) -> str: + if not self.in_place: + return f"{column_name}_{self.NORMALIZED_COLUMN_NAME}_{self.NORMALIZATION_NAME_POSTFIX}" + else: + return column_name diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.py new file mode 100644 index 000000000..55f29de37 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.py @@ -0,0 +1,81 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math + +from .normalization import NormalizationBaseClass +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F + + +class NormalizationMean(NormalizationBaseClass): + """ + Implements mean normalization for specified columns in a PySpark DataFrame. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_mean import NormalizationMean + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + normalization = NormalizationMean(df, column_names=["value_column_1", "value_column_2"], in_place=False) + normalized_df = normalization.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be normalized. + column_names (List[str]): List of columns in the DataFrame to be normalized. + in_place (bool): If true, then result of normalization is stored in the same column. + """ + + NORMALIZED_COLUMN_NAME = "mean" + + def _normalize_column(self, df: PySparkDataFrame, column: str) -> PySparkDataFrame: + """ + Private method to apply Mean normalization to the specified column. 
+ Mean normalization: (value - mean) / (max - min) + """ + mean_val = df.select(F.mean(F.col(column))).collect()[0][0] + min_val = df.select(F.min(F.col(column))).collect()[0][0] + max_val = df.select(F.max(F.col(column))).collect()[0][0] + + divisor = max_val - min_val + if math.isclose(divisor, 0.0, abs_tol=10e-8) or not math.isfinite(divisor): + raise ZeroDivisionError("Division by Zero in Mean") + + store_column = self._get_norm_column_name(column) + self.reversal_value = [mean_val, min_val, max_val] + + return df.withColumn( + store_column, + (F.col(column) - F.lit(mean_val)) / (F.lit(max_val) - F.lit(min_val)), + ) + + def _denormalize_column( + self, df: PySparkDataFrame, column: str + ) -> PySparkDataFrame: + """ + Private method to revert Mean normalization to the specified column. + Mean denormalization: normalized_value * (max - min) + mean = value + """ + mean_val = self.reversal_value[0] + min_val = self.reversal_value[1] + max_val = self.reversal_value[2] + + store_column = self._get_norm_column_name(column) + + return df.withColumn( + store_column, + F.col(column) * (F.lit(max_val) - F.lit(min_val)) + F.lit(mean_val), + ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.py new file mode 100644 index 000000000..0c2ad583a --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.py @@ -0,0 +1,79 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math + +from .normalization import NormalizationBaseClass +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F + + +class NormalizationMinMax(NormalizationBaseClass): + """ + Implements Min-Max normalization for specified columns in a PySpark DataFrame. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_minmax import NormalizationMinMax + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + normalization = NormalizationMinMax(df, column_names=["value_column_1", "value_column_2"], in_place=False) + normalized_df = normalization.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be normalized. + column_names (List[str]): List of columns in the DataFrame to be normalized. + in_place (bool): If true, then result of normalization is stored in the same column. + """ + + NORMALIZED_COLUMN_NAME = "minmax" + + def _normalize_column(self, df: PySparkDataFrame, column: str) -> PySparkDataFrame: + """ + Private method to revert Min-Max normalization to the specified column. 
+ Min-Max denormalization: normalized_value * (max - min) + min = value + """ + min_val = df.select(F.min(F.col(column))).collect()[0][0] + max_val = df.select(F.max(F.col(column))).collect()[0][0] + + divisor = max_val - min_val + if math.isclose(divisor, 0.0, abs_tol=10e-8) or not math.isfinite(divisor): + raise ZeroDivisionError("Division by Zero in MinMax") + + store_column = self._get_norm_column_name(column) + self.reversal_value = [min_val, max_val] + + return df.withColumn( + store_column, + (F.col(column) - F.lit(min_val)) / (F.lit(max_val) - F.lit(min_val)), + ) + + def _denormalize_column( + self, df: PySparkDataFrame, column: str + ) -> PySparkDataFrame: + """ + Private method to revert Z-Score normalization to the specified column. + Z-Score denormalization: normalized_value * std_dev + mean = value + """ + min_val = self.reversal_value[0] + max_val = self.reversal_value[1] + + store_column = self._get_norm_column_name(column) + + return df.withColumn( + store_column, + (F.col(column) * (F.lit(max_val) - F.lit(min_val))) + F.lit(min_val), + ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.py new file mode 100644 index 000000000..da13aaac9 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.py @@ -0,0 +1,78 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math + +from .normalization import NormalizationBaseClass +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F + + +class NormalizationZScore(NormalizationBaseClass): + """ + Implements Z-Score normalization for specified columns in a PySpark DataFrame. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_zscore import NormalizationZScore + from pyspark.sql import SparkSession + from pyspark.sql.dataframe import DataFrame + + normalization = NormalizationZScore(df, column_names=["value_column_1", "value_column_2"], in_place=False) + normalized_df = normalization.filter_data() + ``` + + Parameters: + df (DataFrame): PySpark DataFrame to be normalized. + column_names (List[str]): List of columns in the DataFrame to be normalized. + in_place (bool): If true, then result of normalization is stored in the same column. + """ + + NORMALIZED_COLUMN_NAME = "zscore" + + def _normalize_column(self, df: PySparkDataFrame, column: str) -> PySparkDataFrame: + """ + Private method to apply Z-Score normalization to the specified column. 
+ Z-Score normalization: (value - mean) / std_dev + """ + mean_val = df.select(F.mean(F.col(column))).collect()[0][0] + std_dev_val = df.select(F.stddev(F.col(column))).collect()[0][0] + + if math.isclose(std_dev_val, 0.0, abs_tol=10e-8) or not math.isfinite( + std_dev_val + ): + raise ZeroDivisionError("Division by Zero in ZScore") + + store_column = self._get_norm_column_name(column) + self.reversal_value = [mean_val, std_dev_val] + + return df.withColumn( + store_column, (F.col(column) - F.lit(mean_val)) / F.lit(std_dev_val) + ) + + def _denormalize_column( + self, df: PySparkDataFrame, column: str + ) -> PySparkDataFrame: + """ + Private method to revert Z-Score normalization to the specified column. + Z-Score denormalization: normalized_value * std_dev + mean = value + """ + mean_val = self.reversal_value[0] + std_dev_val = self.reversal_value[1] + + store_column = self._get_norm_column_name(column) + + return df.withColumn( + store_column, F.col(column) * F.lit(std_dev_val) + F.lit(mean_val) + ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.py new file mode 100644 index 000000000..8f9b80115 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.py @@ -0,0 +1,127 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from pyspark.sql import DataFrame as PySparkDataFrame +from ...monitoring.spark.check_value_ranges import CheckValueRanges +from ..interfaces import DataManipulationBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class OutOfRangeValueFilter(DataManipulationBaseInterface): + """ + Filters data in a DataFrame by checking the 'Value' column against expected ranges for specified TagNames. + Logs events when 'Value' exceeds the defined ranges for any TagName and deletes the rows. + + Args: + df (pyspark.sql.DataFrame): The DataFrame to monitor. + tag_ranges (dict): A dictionary where keys are TagNames and values are dictionaries specifying 'min' and/or + 'max', and optionally 'inclusive_bounds' values. 
+ Example: + { + 'A2PS64V0J.:ZUX09R': {'min': 0, 'max': 100, 'inclusive_bounds': True}, + 'B3TS64V0K.:ZUX09R': {'min': 10, 'max': 200, 'inclusive_bounds': False}, + } + + Example: + ```python + from pyspark.sql import SparkSession + from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.out_of_range_value_filter import OutOfRangeValueFilter + + + spark = SparkSession.builder.master("local[1]").appName("DeleteOutOfRangeValuesExample").getOrCreate() + + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", 25.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", -5.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", 50.0), + ("B3TS64V0K.:ZUX09R", "2024-01-02 16:00:12.000", "Good", 80.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 100.0), + ] + + columns = ["TagName", "EventTime", "Status", "Value"] + + df = spark.createDataFrame(data, columns) + + tag_ranges = { + "A2PS64V0J.:ZUX09R": {"min": 0, "max": 50, "inclusive_bounds": True}, + "B3TS64V0K.:ZUX09R": {"min": 50, "max": 100, "inclusive_bounds": False}, + } + + out_of_range_value_filter = OutOfRangeValueFilter( + df=df, + tag_ranges=tag_ranges, + ) + + result_df = out_of_range_value_filter.filter_data() + ``` + """ + + df: PySparkDataFrame + + def __init__( + self, + df: PySparkDataFrame, + tag_ranges: dict, + ) -> None: + self.df = df + self.check_value_ranges = CheckValueRanges(df=df, tag_ranges=tag_ranges) + + # Configure logging + self.logger = logging.getLogger(self.__class__.__name__) + if not self.logger.handlers: + handler = logging.StreamHandler() + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s" + ) + handler.setFormatter(formatter) + self.logger.addHandler(handler) + self.logger.setLevel(logging.INFO) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def filter_data(self) -> PySparkDataFrame: + """ + Executes the value range checking logic for the specified TagNames. Identifies, logs and deletes any rows + where 'Value' exceeds the defined ranges for each TagName. + + Returns: + pyspark.sql.DataFrame: + Returns a PySpark DataFrame without the rows that were out of range. + """ + out_of_range_df = self.check_value_ranges.check_for_out_of_range() + + if out_of_range_df.count() > 0: + self.check_value_ranges.log_out_of_range_values(out_of_range_df) + else: + self.logger.info(f"No out of range values found in 'Value' column.") + return self.df.subtract(out_of_range_df) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/input_validator.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/input_validator.py new file mode 100644 index 000000000..434113cf0 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/input_validator.py @@ -0,0 +1,171 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from pyspark.sql.types import DataType, StructType +from pyspark.sql import functions as F +from pyspark.sql import DataFrame as SparkDataFrame +from ..interfaces import PipelineComponentBaseInterface +from .._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class InputValidator(PipelineComponentBaseInterface): + """ + Validates the PySpark DataFrame of the respective child class instance against a schema dictionary or pyspark + StructType. Checks for column availability and column data types. If data types differ, it tries to cast the + column into the expected data type. Casts "None", "none", "Null", "null" and "" to None. Raises Errors if some step fails. + + Example: + -------- + import pytest + from pyspark.sql import SparkSession + from pyspark.sql.types import StructType, StructField, StringType, TimestampType, FloatType + from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation import ( + MissingValueImputation, + ) + + @pytest.fixture(scope="session") + def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + spark = spark_session() + + test_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "3.0"), + ] + + test_df = spark_session.createDataFrame(test_data, schema=test_schema) + test_component = MissingValueImputation(spark_session, test_df) + + print(test_component.validate(expected_schema)) # True + + ``` + + Parameters: + schema_dict: dict or pyspark StructType + A dictionary where keys are column names, and values are expected PySpark data types. + Example: {"column1": StringType(), "column2": IntegerType()} + + Returns: + True: if data is valid + Raises Error else + + Raises: + ValueError: If a column is missing or has a mismatched pyspark data type. + TypeError: If a column does not hold or specify a pyspark data type. + """ + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def validate(self, schema_dict, df: SparkDataFrame = None): + """ + Used by child data quality utility classes to validate the input data. 
+ """ + if df is None: + dataframe = getattr(self, "df", None) + + if isinstance(schema_dict, StructType): + schema_dict = {field.name: field.dataType for field in schema_dict.fields} + + dataframe_schema = { + field.name: field.dataType for field in dataframe.schema.fields + } + + for column, expected_type in schema_dict.items(): + if column in dataframe.columns: + dataframe = dataframe.withColumn( + column, + F.when( + F.col(column).isin("None", "none", "null", "Null", ""), None + ).otherwise(F.col(column)), + ) + + for column, expected_type in schema_dict.items(): + # Check if the column exists + if column not in dataframe_schema: + raise ValueError(f"Column '{column}' is missing in the DataFrame.") + + # Check if both types are of a pyspark data type + actual_type = dataframe_schema[column] + if not isinstance(actual_type, DataType) or not isinstance( + expected_type, DataType + ): + raise TypeError( + "Expected and actual types must be instances of pyspark.sql.types.DataType." + ) + + # Check if actual type is expected type, try to cast else + dataframe = self.cast_column_if_needed( + dataframe, column, expected_type, actual_type + ) + + self.df = dataframe + return True + + def cast_column_if_needed(self, dataframe, column, expected_type, actual_type): + if not isinstance(actual_type, type(expected_type)): + try: + original_null_count = dataframe.filter(F.col(column).isNull()).count() + casted_column = dataframe.withColumn( + column, F.col(column).cast(expected_type) + ) + new_null_count = casted_column.filter(F.col(column).isNull()).count() + + if new_null_count > original_null_count: + raise ValueError( + f"Column '{column}' cannot be cast to {expected_type}." + ) + dataframe = casted_column + except Exception as e: + raise ValueError( + f"Error during casting column '{column}' to {expected_type}: {str(e)}" + ) + + return dataframe diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__.py new file mode 100644 index 000000000..76bb6a388 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .spark import * diff --git a/src/sdk/python/rtdip_sdk/pipelines/monitoring/interfaces.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/interfaces.py similarity index 80% rename from src/sdk/python/rtdip_sdk/pipelines/monitoring/interfaces.py rename to src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/interfaces.py index 2c446c5bc..34176beeb 100644 --- a/src/sdk/python/rtdip_sdk/pipelines/monitoring/interfaces.py +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/interfaces.py @@ -13,8 +13,13 @@ # limitations under the License. 
from abc import abstractmethod -from ..interfaces import PipelineComponentBaseInterface + +from pyspark.sql import DataFrame + +from ...interfaces import PipelineComponentBaseInterface class MonitoringBaseInterface(PipelineComponentBaseInterface): - pass + @abstractmethod + def check(self) -> DataFrame: + pass diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py new file mode 100644 index 000000000..50c574207 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py @@ -0,0 +1,22 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import sys + +from .check_value_ranges import CheckValueRanges +from .flatline_detection import FlatlineDetection + +if "great_expectations" in sys.modules: + from .great_expectations_data_quality import GreatExpectationsDataQuality +from .identify_missing_data_interval import IdentifyMissingDataInterval +from .identify_missing_data_pattern import IdentifyMissingDataPattern diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/check_value_ranges.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/check_value_ranges.py new file mode 100644 index 000000000..f226f4561 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/check_value_ranges.py @@ -0,0 +1,260 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql.functions import col +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) +from functools import reduce +from operator import or_ +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ...input_validator import InputValidator + + +class CheckValueRanges(MonitoringBaseInterface, InputValidator): + """ + Monitors data in a DataFrame by checking the 'Value' column against expected ranges for specified TagNames. + Logs events when 'Value' exceeds the defined ranges for any TagName. + + Args: + df (pyspark.sql.DataFrame): The DataFrame to monitor. + tag_ranges (dict): A dictionary where keys are TagNames and values are dictionaries specifying 'min' and/or + 'max', and optionally 'inclusive_bounds' values. 
+ Example: + { + 'A2PS64V0J.:ZUX09R': {'min': 0, 'max': 100, 'inclusive_bounds': True}, + 'B3TS64V0K.:ZUX09R': {'min': 10, 'max': 200, 'inclusive_bounds': False}, + } + + Example: + ```python + from pyspark.sql import SparkSession + from rtdip_sdk.pipelines.data_quality.monitoring.spark.check_value_ranges import CheckValueRanges + + + spark = SparkSession.builder.master("local[1]").appName("CheckValueRangesExample").getOrCreate() + + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", 25.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", -5.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", 50.0), + ("B3TS64V0K.:ZUX09R", "2024-01-02 16:00:12.000", "Good", 80.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 100.0), + ] + + columns = ["TagName", "EventTime", "Status", "Value"] + + df = spark.createDataFrame(data, columns) + + tag_ranges = { + "A2PS64V0J.:ZUX09R": {"min": 0, "max": 50, "inclusive_bounds": True}, + "B3TS64V0K.:ZUX09R": {"min": 50, "max": 100, "inclusive_bounds": False}, + } + + check_value_ranges = CheckValueRanges( + df=df, + tag_ranges=tag_ranges, + ) + + result_df = check_value_ranges.check() + ``` + """ + + df: PySparkDataFrame + tag_ranges: dict + EXPECTED_SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + def __init__( + self, + df: PySparkDataFrame, + tag_ranges: dict, + ) -> None: + self.df = df + self.validate(self.EXPECTED_SCHEMA) + self.tag_ranges = tag_ranges + + # Configure logging + self.logger = logging.getLogger(self.__class__.__name__) + if not self.logger.handlers: + handler = logging.StreamHandler() + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s" + ) + handler.setFormatter(formatter) + self.logger.addHandler(handler) + self.logger.setLevel(logging.INFO) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def check(self) -> PySparkDataFrame: + """ + Executes the value range checking logic for the specified TagNames. Identifies and logs any rows + where 'Value' exceeds the defined ranges for each TagName. + + Returns: + pyspark.sql.DataFrame: + Returns the original PySpark DataFrame without changes. + """ + out_of_range_df = self.check_for_out_of_range() + + if out_of_range_df.count() > 0: + self.log_out_of_range_values(out_of_range_df) + else: + self.logger.info(f"No out of range values found in 'Value' column.") + + return self.df + + def check_for_out_of_range(self) -> PySparkDataFrame: + """ + Identifies rows where 'Value' exceeds defined ranges. + + Returns: + pyspark.sql.DataFrame: A DataFrame containing rows with out-of-range values. 
+ """ + + self._validate_inputs() + + out_of_range_df = self.df.filter("1=0") + + for tag_name, range_dict in self.tag_ranges.items(): + df = self.df.filter(col("TagName") == tag_name) + + if df.count() == 0: + self.logger.warning(f"No data found for TagName '{tag_name}'.") + continue + + min_value = range_dict.get("min", None) + max_value = range_dict.get("max", None) + inclusive_bounds = range_dict.get("inclusive_bounds", True) + + conditions = [] + + # Build minimum value condition + self.add_min_value_condition(min_value, inclusive_bounds, conditions) + + # Build maximum value condition + self.add_max_value_condition(max_value, inclusive_bounds, conditions) + + if conditions: + condition = reduce(or_, conditions) + tag_out_of_range_df = df.filter(condition) + out_of_range_df = out_of_range_df.union(tag_out_of_range_df) + + return out_of_range_df + + def add_min_value_condition(self, min_value, inclusive_bounds, conditions): + if min_value is not None: + if inclusive_bounds: + min_condition = col("Value") < min_value + else: + min_condition = col("Value") <= min_value + conditions.append(min_condition) + + def add_max_value_condition(self, max_value, inclusive_bounds, conditions): + if max_value is not None: + if inclusive_bounds: + max_condition = col("Value") > max_value + else: + max_condition = col("Value") >= max_value + conditions.append(max_condition) + + def log_out_of_range_values(self, out_of_range_df: PySparkDataFrame): + """ + Logs out-of-range values for all TagNames. + """ + for tag_name in ( + out_of_range_df.select("TagName") + .distinct() + .rdd.map(lambda row: row[0]) + .collect() + ): + tag_out_of_range_df = out_of_range_df.filter(col("TagName") == tag_name) + count = tag_out_of_range_df.count() + self.logger.info( + f"Found {count} rows in 'Value' column for TagName '{tag_name}' out of range." + ) + for row in tag_out_of_range_df.collect(): + self.logger.info(f"Out of range row for TagName '{tag_name}': {row}") + + def _validate_inputs(self): + if not isinstance(self.tag_ranges, dict): + raise TypeError("tag_ranges must be a dictionary.") + + available_tags = ( + self.df.select("TagName").distinct().rdd.map(lambda row: row[0]).collect() + ) + + for tag_name, range_dict in self.tag_ranges.items(): + self.validate_tag_name(available_tags, tag_name, range_dict) + + inclusive_bounds = range_dict.get("inclusive_bounds", True) + if not isinstance(inclusive_bounds, bool): + raise ValueError( + f"Inclusive_bounds for TagName '{tag_name}' must be a boolean." + ) + + min_value = range_dict.get("min", None) + max_value = range_dict.get("max", None) + if min_value is not None and not isinstance(min_value, (int, float)): + raise ValueError( + f"Minimum value for TagName '{tag_name}' must be a number." + ) + if max_value is not None and not isinstance(max_value, (int, float)): + raise ValueError( + f"Maximum value for TagName '{tag_name}' must be a number." + ) + + def validate_tag_name(self, available_tags, tag_name, range_dict): + if not isinstance(tag_name, str): + raise ValueError(f"TagName '{tag_name}' must be a string.") + + if tag_name not in available_tags: + raise ValueError(f"TagName '{tag_name}' not found in DataFrame.") + + if "min" not in range_dict and "max" not in range_dict: + raise ValueError( + f"TagName '{tag_name}' must have at least 'min' or 'max' specified." 
+ ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/flatline_detection.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/flatline_detection.py new file mode 100644 index 000000000..41e75c10c --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/flatline_detection.py @@ -0,0 +1,234 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import logging +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql.functions import col, when, lag, sum, lit, abs +from pyspark.sql.window import Window +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ...input_validator import InputValidator + + +class FlatlineDetection(MonitoringBaseInterface, InputValidator): + """ + Detects flatlining in specified columns of a PySpark DataFrame and logs warnings. + + Flatlining occurs when a column contains consecutive null or zero values exceeding a specified tolerance period. + This class identifies such occurrences and logs the rows where flatlining is detected. + + Args: + df (pyspark.sql.DataFrame): The input DataFrame to monitor for flatlining. + watch_columns (list): List of column names to monitor for flatlining (null or zero values). + tolerance_timespan (int): Maximum allowed consecutive flatlining period. If exceeded, a warning is logged. 
+
+    Example:
+    ```python
+    from rtdip_sdk.pipelines.data_quality.monitoring.spark.flatline_detection import FlatlineDetection
+
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.master("local[1]").appName("FlatlineDetectionExample").getOrCreate()
+
+    # Example DataFrame using the expected TagName, EventTime, Status and Value schema
+    data = [
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", 5.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", 0.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", 0.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", 0.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 5.0),
+    ]
+    columns = ["TagName", "EventTime", "Status", "Value"]
+    df = spark.createDataFrame(data, columns)
+
+    # Initialize FlatlineDetection
+    flatline_detection = FlatlineDetection(
+        df,
+        watch_columns=["Value"],
+        tolerance_timespan=2
+    )
+
+    # Detect flatlining
+    flatline_detection.check()
+    ```
+    """
+
+    df: PySparkDataFrame
+    watch_columns: list
+    tolerance_timespan: int
+    EXPECTED_SCHEMA = StructType(
+        [
+            StructField("TagName", StringType(), True),
+            StructField("EventTime", TimestampType(), True),
+            StructField("Status", StringType(), True),
+            StructField("Value", FloatType(), True),
+        ]
+    )
+
+    def __init__(
+        self, df: PySparkDataFrame, watch_columns: list, tolerance_timespan: int
+    ) -> None:
+        if not watch_columns or not isinstance(watch_columns, list):
+            raise ValueError("watch_columns must be a non-empty list of column names.")
+        if not isinstance(tolerance_timespan, int) or tolerance_timespan <= 0:
+            raise ValueError("tolerance_timespan must be a positive integer.")
+
+        self.df = df
+        self.validate(self.EXPECTED_SCHEMA)
+        self.watch_columns = watch_columns
+        self.tolerance_timespan = tolerance_timespan
+
+        self.logger = logging.getLogger(self.__class__.__name__)
+        if not self.logger.handlers:
+            handler = logging.StreamHandler()
+            formatter = logging.Formatter(
+                "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+            )
+            handler.setFormatter(formatter)
+            self.logger.addHandler(handler)
+            self.logger.setLevel(logging.INFO)
+
+    @staticmethod
+    def system_type():
+        """
+        Attributes:
+            SystemType (Environment): Requires PYSPARK
+        """
+        return SystemType.PYSPARK
+
+    @staticmethod
+    def libraries():
+        libraries = Libraries()
+        return libraries
+
+    @staticmethod
+    def settings() -> dict:
+        return {}
+
+    def check(self) -> PySparkDataFrame:
+        """
+        Detects flatlining and logs relevant rows.
+
+        Returns:
+            pyspark.sql.DataFrame: The original DataFrame without changes.
+        """
+        flatlined_rows = self.check_for_flatlining()
+        print("Flatlined Rows:")
+        flatlined_rows.show(truncate=False)
+        self.log_flatlining_rows(flatlined_rows)
+        return self.df
+
+    def check_for_flatlining(self) -> PySparkDataFrame:
+        """
+        Identifies rows with flatlining based on the specified columns and tolerance.
+
+        Returns:
+            pyspark.sql.DataFrame: A DataFrame containing rows with flatlining detected.
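+
+        Note:
+            For each watched column the returned rows include two helper columns,
+            "<column>_flatline_flag" (1 where the value is null or zero) and "<column>_group"
+            (an identifier for each run of consecutive flagged values); only runs longer than
+            tolerance_timespan are returned.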
+ """ + partition_column = "TagName" + sort_column = "EventTime" + window_spec = Window.partitionBy(partition_column).orderBy(sort_column) + + # Start with an empty DataFrame, ensure it has the required schema + flatlined_rows = ( + self.df.withColumn("Value_flatline_flag", lit(None).cast("int")) + .withColumn("Value_group", lit(None).cast("bigint")) + .filter("1=0") + ) + + for column in self.watch_columns: + flagged_column = f"{column}_flatline_flag" + group_column = f"{column}_group" + + # Add flag and group columns + df_with_flags = self.df.withColumn( + flagged_column, + when( + (col(column).isNull()) | (abs(col(column) - 0.0) <= 1e-09), + 1, + ).otherwise(0), + ).withColumn( + group_column, + sum( + when( + col(flagged_column) + != lag(col(flagged_column), 1, 0).over(window_spec), + 1, + ).otherwise(0) + ).over(window_spec), + ) + + # Identify flatlining groups + group_counts = ( + df_with_flags.filter(col(flagged_column) == 1) + .groupBy(group_column) + .count() + ) + large_groups = group_counts.filter(col("count") > self.tolerance_timespan) + large_group_ids = [row[group_column] for row in large_groups.collect()] + + if large_group_ids: + relevant_rows = df_with_flags.filter( + col(group_column).isin(large_group_ids) + ) + + # Ensure both DataFrames have the same columns + for col_name in flatlined_rows.columns: + if col_name not in relevant_rows.columns: + relevant_rows = relevant_rows.withColumn(col_name, lit(None)) + + flatlined_rows = flatlined_rows.union(relevant_rows) + + return flatlined_rows + + def log_flatlining_rows(self, flatlined_rows: PySparkDataFrame): + """ + Logs flatlining rows for all monitored columns. + + Args: + flatlined_rows (pyspark.sql.DataFrame): The DataFrame containing rows with flatlining detected. + """ + if flatlined_rows.count() == 0: + self.logger.info("No flatlining detected.") + return + + for column in self.watch_columns: + flagged_column = f"{column}_flatline_flag" + + if flagged_column not in flatlined_rows.columns: + self.logger.warning( + f"Expected column '{flagged_column}' not found in DataFrame." + ) + continue + + relevant_rows = flatlined_rows.filter(col(flagged_column) == 1).collect() + + if relevant_rows: + for row in relevant_rows: + self.logger.warning( + f"Flatlining detected in column '{column}' at row: {row}." 
+ ) + else: + self.logger.info(f"No flatlining detected in column '{column}'.") diff --git a/src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/great_expectations_data_quality.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/great_expectations_data_quality.py similarity index 93% rename from src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/great_expectations_data_quality.py rename to src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/great_expectations_data_quality.py index f8022e41c..4aed6a90c 100644 --- a/src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/great_expectations_data_quality.py +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/great_expectations_data_quality.py @@ -14,25 +14,29 @@ import great_expectations as gx from pyspark.sql import DataFrame, SparkSession -from ...interfaces import MonitoringBaseInterface -from ...._pipeline_utils.models import Libraries, SystemType +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) from great_expectations.checkpoint import ( Checkpoint, ) from great_expectations.expectations.expectation import ( ExpectationConfiguration, ) +from ...input_validator import InputValidator # Create a new context -class GreatExpectationsDataQuality(MonitoringBaseInterface): +class GreatExpectationsDataQuality(MonitoringBaseInterface, InputValidator): """ Data Quality Monitoring using Great Expectations allowing you to create and check your data quality expectations. Example -------- ```python - from src.sdk.python.rtdip_sdk.monitoring.data_quality.great_expectations.python.great_expectations_data_quality import GreatExpectationsDataQuality + from src.sdk.python.rtdip_sdk.monitoring.data_manipulation.great_expectations.python.great_expectations_data_quality import GreatExpectationsDataQuality from rtdip_sdk.pipelines.utilities import SparkSessionUtility import json @@ -74,7 +78,7 @@ class GreatExpectationsDataQuality(MonitoringBaseInterface): GX.display_expectations(suite) - #Run the Data Quality Check by Validating your data against set expecations in the suite + #Run the Data Quality Check by Validating your data against set expectations in the suite checkpoint_name = "checkpoint_name" run_name_template = "run_name_template" @@ -215,7 +219,7 @@ def check( action_list: list, ): """ - Validate your data against set expecations in the suite + Validate your data against set expectations in the suite Args: checkpoint_name (str): The name of the checkpoint. run_name_template (str): The name of the run. diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.py new file mode 100644 index 000000000..f91ce5f17 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.py @@ -0,0 +1,218 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F +from pyspark.sql.window import Window +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ....utilities.spark.time_string_parsing import parse_time_string_to_ms +from ...input_validator import InputValidator +from ....logging.logger_manager import LoggerManager + + +class IdentifyMissingDataInterval(MonitoringBaseInterface, InputValidator): + """ + Detects missing data intervals in a DataFrame by identifying time differences between consecutive + measurements that exceed a specified tolerance or a multiple of the Median Absolute Deviation (MAD). + Logs the start and end times of missing intervals along with their durations. + + + Args: + df (pyspark.sql.Dataframe): DataFrame containing at least the 'EventTime' column. + interval (str, optional): Expected interval between data points (e.g., '10ms', '500ms'). If not specified, the median of time differences is used. + tolerance (str, optional): Tolerance time beyond which an interval is considered missing (e.g., '10ms'). If not specified, it defaults to 'mad_multiplier' times the Median Absolute Deviation (MAD) of time differences. + mad_multiplier (float, optional): Multiplier for MAD to calculate tolerance. Default is 3. + min_tolerance (str, optional): Minimum tolerance for pattern-based detection (e.g., '100ms'). Default is '10ms'. + + Returns: + df (pyspark.sql.Dataframe): Returns the original PySparkDataFrame without changes. + + Example + -------- + ```python + from rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval import IdentifyMissingDataInterval + + from pyspark.sql import SparkSession + + missing_data_monitor = IdentifyMissingDataInterval( + df=df, + interval='100ms', + tolerance='10ms', + ) + + df_result = missing_data_monitor.check() + ``` + + """ + + df: PySparkDataFrame + EXPECTED_SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + def __init__( + self, + df: PySparkDataFrame, + interval: str = None, + tolerance: str = None, + mad_multiplier: float = 3, + min_tolerance: str = "10ms", + ) -> None: + + self.df = df + self.interval = interval + self.tolerance = tolerance + self.mad_multiplier = mad_multiplier + self.min_tolerance = min_tolerance + self.validate(self.EXPECTED_SCHEMA) + + # Use global pipeline logger + self.logger_manager = LoggerManager() + self.logger = self.logger_manager.create_logger("IdentifyMissingDataInterval") + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def check(self) -> PySparkDataFrame: + """ + Executes the identify missing data logic. + + Returns: + pyspark.sql.DataFrame: + Returns the original PySpark DataFrame without changes. 
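+
+        Example (a minimal, self-contained sketch; the sample data and the '1s'/'500ms'
+        settings below are illustrative values, not taken from a real dataset):
+        ```python
+        from pyspark.sql import SparkSession
+        from rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval import IdentifyMissingDataInterval
+
+        spark = SparkSession.builder.master("local[1]").appName("IdentifyMissingDataIntervalExample").getOrCreate()
+
+        # Illustrative data sampled roughly every second, with one gap between 00:00:02 and 00:00:06
+        data = [
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", 1.0),
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:01.000", "Good", 2.0),
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:02.000", "Good", 3.0),
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:06.000", "Good", 4.0),
+        ]
+        columns = ["TagName", "EventTime", "Status", "Value"]
+        df = spark.createDataFrame(data, columns)
+
+        # Logs the missing interval from 00:00:02 to 00:00:06
+        IdentifyMissingDataInterval(df=df, interval="1s", tolerance="500ms").check()
+        ```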
+ """ + if "EventTime" not in self.df.columns: + self.logger.error("The DataFrame must contain an 'EventTime' column.") + raise ValueError("The DataFrame must contain an 'EventTime' column.") + + df = self.df.withColumn("EventTime", F.to_timestamp("EventTime")) + df_sorted = df.orderBy("EventTime") + # Calculate time difference in milliseconds between consecutive rows + df_with_diff = df_sorted.withColumn( + "TimeDeltaMs", + ( + F.col("EventTime").cast("double") + - F.lag("EventTime").over(Window.orderBy("EventTime")).cast("double") + ) + * 1000, + ).withColumn( + "StartMissing", F.lag("EventTime").over(Window.orderBy("EventTime")) + ) + # Parse interval to milliseconds if given + if self.interval is not None: + try: + interval_ms = parse_time_string_to_ms(self.interval) + self.logger.info(f"Using provided expected interval: {interval_ms} ms") + except ValueError as e: + self.logger.error(e) + raise + else: + # Calculate interval based on median of time differences + median_expr = F.expr("percentile_approx(TimeDeltaMs, 0.5)") + median_row = df_with_diff.select(median_expr.alias("median")).collect()[0] + interval_ms = median_row["median"] + self.logger.info( + f"Using median of time differences as expected interval: {interval_ms} ms" + ) + # Parse tolernace to milliseconds if given + if self.tolerance is not None: + try: + tolerance_ms = parse_time_string_to_ms(self.tolerance) + self.logger.info(f"Using provided tolerance: {tolerance_ms} ms") + except ValueError as e: + self.logger.error(e) + raise + else: + # Calculate tolerance based on MAD + mad_expr = F.expr( + f"percentile_approx(abs(TimeDeltaMs - {interval_ms}), 0.5)" + ) + mad_row = df_with_diff.select(mad_expr.alias("mad")).collect()[0] + mad = mad_row["mad"] + calculated_tolerance_ms = self.mad_multiplier * mad + min_tolerance_ms = parse_time_string_to_ms(self.min_tolerance) + tolerance_ms = max(calculated_tolerance_ms, min_tolerance_ms) + self.logger.info(f"Calculated tolerance: {tolerance_ms} ms (MAD-based)") + # Calculate the maximum acceptable interval with tolerance + max_interval_with_tolerance_ms = interval_ms + tolerance_ms + self.logger.info( + f"Maximum acceptable interval with tolerance: {max_interval_with_tolerance_ms} ms" + ) + + # Identify missing intervals + missing_intervals_df = df_with_diff.filter( + (F.col("TimeDeltaMs") > max_interval_with_tolerance_ms) + & (F.col("StartMissing").isNotNull()) + ).select( + "TagName", + "StartMissing", + F.col("EventTime").alias("EndMissing"), + "TimeDeltaMs", + ) + # Convert time delta to readable format + missing_intervals_df = missing_intervals_df.withColumn( + "DurationMissing", + F.concat( + F.floor(F.col("TimeDeltaMs") / 3600000).cast("string"), + F.lit("h "), + F.floor((F.col("TimeDeltaMs") % 3600000) / 60000).cast("string"), + F.lit("m "), + F.floor(((F.col("TimeDeltaMs") % 3600000) % 60000) / 1000).cast( + "string" + ), + F.lit("s"), + ), + ).select("TagName", "StartMissing", "EndMissing", "DurationMissing") + missing_intervals = missing_intervals_df.collect() + if missing_intervals: + self.logger.info("Detected Missing Intervals:") + for row in missing_intervals: + self.logger.info( + f"Tag: {row['TagName']} Missing Interval from {row['StartMissing']} to {row['EndMissing']} " + f"Duration: {row['DurationMissing']}" + ) + else: + self.logger.info("No missing intervals detected.") + return self.df diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.py 
b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.py new file mode 100644 index 000000000..debb59b1e --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.py @@ -0,0 +1,362 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + +import pandas as pd +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + + +from ....logging.logger_manager import LoggerManager +from ...input_validator import InputValidator +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ....utilities.spark.time_string_parsing import parse_time_string_to_ms + + +class IdentifyMissingDataPattern(MonitoringBaseInterface, InputValidator): + """ + Identifies missing data in a DataFrame based on specified time patterns. + Logs the expected missing times. + + Args: + df (pyspark.sql.Dataframe): DataFrame containing at least the 'EventTime' column. + patterns (list of dict): List of dictionaries specifying the time patterns. + - For 'minutely' frequency: Specify 'second' and optionally 'millisecond'. + Example: [{'second': 0}, {'second': 13}, {'second': 49}] + - For 'hourly' frequency: Specify 'minute', 'second', and optionally 'millisecond'. + Example: [{'minute': 0, 'second': 0}, {'minute': 30, 'second': 30}] + frequency (str): Frequency of the patterns. Must be either 'minutely' or 'hourly'. + - 'minutely': Patterns are checked every minute at specified seconds. + - 'hourly': Patterns are checked every hour at specified minutes and seconds. + tolerance (str, optional): Maximum allowed deviation from the pattern (e.g., '1s', '500ms'). + Default is '10ms'. 
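+
+    The check generates the expected timestamps implied by the patterns over the time range of
+    the data and left-joins them against the actual EventTime values within the given tolerance;
+    any expected timestamp without a matching event is logged as missing.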
+ + Example: + ```python + from pyspark.sql import SparkSession + + spark = SparkSession.builder.master("local[1]").appName("IdentifyMissingDataPatternExample").getOrCreate() + + patterns = [ + {"second": 0}, + {"second": 20}, + ] + + frequency = "minutely" + tolerance = "1s" + + identify_missing_data = IdentifyMissingDataPattern( + df=df, + patterns=patterns, + frequency=frequency, + tolerance=tolerance, + ) + + identify_missing_data.check() + ``` + + """ + + df: PySparkDataFrame + EXPECTED_SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + def __init__( + self, + df: PySparkDataFrame, + patterns: list, + frequency: str = "minutely", + tolerance: str = "10ms", + ) -> None: + + self.df = df + self.patterns = patterns + self.frequency = frequency.lower() + self.tolerance = tolerance + self.validate(self.EXPECTED_SCHEMA) + + # Configure logging + self.logger = LoggerManager().create_logger(self.__class__.__name__) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def check(self) -> PySparkDataFrame: + """ + Executes the missing pattern detection logic. Identifies and logs any missing patterns + based on the provided patterns and frequency within the specified tolerance. + + Returns: + pyspark.sql.DataFrame: + Returns the original PySpark DataFrame without changes. + """ + self._validate_inputs() + df = self.df.withColumn("EventTime", F.to_timestamp("EventTime")) + df_sorted = df.orderBy("EventTime") + # Determine if the DataFrame is empty + count = df_sorted.count() + if count == 0: + self.logger.info("Generated 0 expected times based on patterns.") + self.logger.info("DataFrame is empty. No missing patterns to detect.") + return self.df + # Determine the time range of the data + min_time, max_time = df_sorted.agg( + F.min("EventTime"), F.max("EventTime") + ).first() + if not min_time or not max_time: + self.logger.info("Generated 0 expected times based on patterns.") + self.logger.info("DataFrame is empty. No missing patterns to detect.") + return self.df + # Generate all expected times based on patterns and frequency + expected_times_df = self._generate_expected_times(min_time, max_time) + # Identify missing patterns by left joining expected times with actual EventTimes within tolerance + missing_patterns_df = self._find_missing_patterns(expected_times_df, df_sorted) + self._log_missing_patterns(missing_patterns_df) + return self.df + + def _validate_inputs(self): + if self.frequency not in ["minutely", "hourly"]: + error_msg = "Frequency must be either 'minutely' or 'hourly'." 
+ self.logger.error(error_msg) + raise ValueError(error_msg) + for pattern in self.patterns: + if self.frequency == "minutely": + self.validate_minutely_pattern(pattern) + elif self.frequency == "hourly": + self.validate_hourly_patterns(pattern) + try: + self.tolerance_ms = parse_time_string_to_ms(self.tolerance) + self.tolerance_seconds = self.tolerance_ms / 1000 + self.logger.info( + f"Using tolerance: {self.tolerance_ms} ms ({self.tolerance_seconds} seconds)" + ) + except ValueError as e: + error_msg = f"Invalid tolerance format: {self.tolerance}" + self.logger.error(error_msg) + raise ValueError(error_msg) from e + + def validate_hourly_patterns(self, pattern): + if "minute" not in pattern or "second" not in pattern: + raise ValueError( + "Each pattern must have 'minute' and 'second' keys for 'hourly' frequency." + ) + if pattern.get("minute", 0) >= 60: + raise ValueError("For 'hourly' frequency, 'minute' must be less than 60.") + if "hour" in pattern: + raise ValueError( + "For 'hourly' frequency, pattern should not contain 'hour'." + ) + + def validate_minutely_pattern(self, pattern): + if "second" not in pattern: + raise ValueError( + "Each pattern must have a 'second' key for 'minutely' frequency." + ) + if pattern.get("second", 0) >= 60: + raise ValueError("For 'minutely' frequency, 'second' must be less than 60.") + if "minute" in pattern or "hour" in pattern: + raise ValueError( + "For 'minutely' frequency, pattern should not contain 'minute' or 'hour'." + ) + + def _generate_expected_times(self, min_time, max_time) -> PySparkDataFrame: + floor_min_time = self._get_floor_min_time(min_time) + ceil_max_time = self._get_ceil_max_time(max_time) + base_times_df = self._create_base_times_df(floor_min_time, ceil_max_time) + expected_times_df = self._apply_patterns( + base_times_df, floor_min_time, max_time + ) + return expected_times_df + + def _get_floor_min_time(self, min_time): + if self.frequency == "minutely": + return min_time.replace(second=0, microsecond=0) + elif self.frequency == "hourly": + return min_time.replace(minute=0, second=0, microsecond=0) + + def _get_ceil_max_time(self, max_time): + if self.frequency == "minutely": + return (max_time + pd.Timedelta(minutes=1)).replace(second=0, microsecond=0) + elif self.frequency == "hourly": + return (max_time + pd.Timedelta(hours=1)).replace( + minute=0, second=0, microsecond=0 + ) + + def _create_base_times_df(self, floor_min_time, ceil_max_time): + step = F.expr(f"INTERVAL 1 {self.frequency.upper()[:-2]}") + return self.df.sparkSession.createDataFrame( + [(floor_min_time, ceil_max_time)], ["start", "end"] + ).select( + F.explode( + F.sequence( + F.col("start").cast("timestamp"), + F.col("end").cast("timestamp"), + step, + ) + ).alias("BaseTime") + ) + + def _apply_patterns(self, base_times_df, floor_min_time, max_time): + expected_times = [] + for pattern in self.patterns: + expected_time = self._calculate_expected_time(base_times_df, pattern) + expected_times.append(expected_time) + expected_times_df = ( + base_times_df.withColumn( + "ExpectedTime", F.explode(F.array(*expected_times)) + ) + .select("ExpectedTime") + .distinct() + .filter( + (F.col("ExpectedTime") >= F.lit(floor_min_time)) + & (F.col("ExpectedTime") <= F.lit(max_time)) + ) + ) + return expected_times_df + + def _calculate_expected_time(self, base_times_df, pattern): + if self.frequency == "minutely": + seconds = pattern.get("second", 0) + milliseconds = pattern.get("millisecond", 0) + return ( + F.col("BaseTime") + + F.expr(f"INTERVAL {seconds} SECOND") + + 
F.expr(f"INTERVAL {milliseconds} MILLISECOND") + ) + elif self.frequency == "hourly": + minutes = pattern.get("minute", 0) + seconds = pattern.get("second", 0) + milliseconds = pattern.get("millisecond", 0) + return ( + F.col("BaseTime") + + F.expr(f"INTERVAL {minutes} MINUTE") + + F.expr(f"INTERVAL {seconds} SECOND") + + F.expr(f"INTERVAL {milliseconds} MILLISECOND") + ) + + def _find_missing_patterns( + self, expected_times_df: PySparkDataFrame, actual_df: PySparkDataFrame + ) -> PySparkDataFrame: + """ + Finds missing patterns by comparing expected times with actual EventTimes within tolerance. + + Args: + expected_times_df (PySparkDataFrame): DataFrame with expected 'ExpectedTime'. + actual_df (PySparkDataFrame): Actual DataFrame with 'EventTime'. + + Returns: + PySparkDataFrame: DataFrame with missing 'ExpectedTime'. + """ + # Format tolerance for SQL INTERVAL + tolerance_str = self._format_timedelta_for_sql(self.tolerance_ms) + # Perform left join with tolerance window + actual_event_time = "at.EventTime" + missing_patterns_df = ( + expected_times_df.alias("et") + .join( + actual_df.alias("at"), + ( + F.col(actual_event_time) + >= F.expr(f"et.ExpectedTime - INTERVAL {tolerance_str}") + ) + & ( + F.col(actual_event_time) + <= F.expr(f"et.ExpectedTime + INTERVAL {tolerance_str}") + ), + how="left", + ) + .filter(F.col(actual_event_time).isNull()) + .select(F.col("et.ExpectedTime")) + ) + self.logger.info(f"Identified {missing_patterns_df.count()} missing patterns.") + return missing_patterns_df + + def _log_missing_patterns(self, missing_patterns_df: PySparkDataFrame): + """ + Logs the missing patterns. + + Args: + missing_patterns_df (PySparkDataFrame): DataFrame with missing 'ExpectedTime'. + """ + missing_patterns = missing_patterns_df.collect() + if missing_patterns: + self.logger.info("Detected Missing Patterns:") + # Sort missing patterns by ExpectedTime + sorted_missing_patterns = sorted( + missing_patterns, key=lambda row: row["ExpectedTime"] + ) + for row in sorted_missing_patterns: + # Format ExpectedTime to include milliseconds correctly + formatted_time = row["ExpectedTime"].strftime("%Y-%m-%d %H:%M:%S.%f")[ + :-3 + ] + self.logger.info(f"Missing Pattern at {formatted_time}") + else: + self.logger.info("No missing patterns detected.") + + @staticmethod + def _format_timedelta_for_sql(tolerance_ms: float) -> str: + """ + Formats a tolerance in milliseconds to a string suitable for SQL INTERVAL. + + Args: + tolerance_ms (float): Tolerance in milliseconds. + + Returns: + str: Formatted string (e.g., '1 SECOND', '500 MILLISECONDS'). + """ + if tolerance_ms >= 3600000: + hours = int(tolerance_ms // 3600000) + return f"{hours} HOURS" + elif tolerance_ms >= 60000: + minutes = int(tolerance_ms // 60000) + return f"{minutes} MINUTES" + elif tolerance_ms >= 1000: + seconds = int(tolerance_ms // 1000) + return f"{seconds} SECONDS" + else: + milliseconds = int(tolerance_ms) + return f"{milliseconds} MILLISECONDS" diff --git a/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/moving_average.py b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/moving_average.py new file mode 100644 index 000000000..ac9e096f6 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/moving_average.py @@ -0,0 +1,146 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql.functions import col, avg +from pyspark.sql.window import Window +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from ..interfaces import MonitoringBaseInterface +from ...._pipeline_utils.models import ( + Libraries, + SystemType, +) +from ...input_validator import InputValidator + + +class MovingAverage(MonitoringBaseInterface, InputValidator): + """ + Computes and logs the moving average over a specified window size for a given PySpark DataFrame. + + Args: + df (pyspark.sql.DataFrame): The DataFrame to process. + window_size (int): The size of the moving window. + + Example: + ```python + from pyspark.sql import SparkSession + from rtdip_sdk.pipelines.data_quality.monitoring.spark.data_quality.moving_average import MovingAverage + + spark = SparkSession.builder.master("local[1]").appName("MovingAverageExample").getOrCreate() + + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", 1.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", 2.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", 3.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", 4.0), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 5.0), + ] + + columns = ["TagName", "EventTime", "Status", "Value"] + + df = spark.createDataFrame(data, columns) + + moving_avg = MovingAverage( + df=df, + window_size=3, + ) + + moving_avg.check() + ``` + """ + + df: PySparkDataFrame + window_size: int + EXPECTED_SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + def __init__( + self, + df: PySparkDataFrame, + window_size: int, + ) -> None: + if not isinstance(window_size, int) or window_size <= 0: + raise ValueError("window_size must be a positive integer.") + + self.df = df + self.validate(self.EXPECTED_SCHEMA) + self.window_size = window_size + + self.logger = logging.getLogger(self.__class__.__name__) + if not self.logger.handlers: + handler = logging.StreamHandler() + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s" + ) + handler.setFormatter(formatter) + self.logger.addHandler(handler) + self.logger.setLevel(logging.INFO) + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def check(self) -> None: + """ + Computes and logs the moving average using a specified window size. 
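+
+        The average at each row is taken over the current row and the preceding window_size - 1
+        rows of the same TagName, ordered by EventTime. For the values 1.0 to 5.0 in the class
+        example with window_size=3, the logged moving averages are 1.0, 1.5, 2.0, 3.0 and 4.0.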
+ """ + + self._validate_inputs() + + window_spec = ( + Window.partitionBy("TagName") + .orderBy("EventTime") + .rowsBetween(-(self.window_size - 1), 0) + ) + + self.logger.info("Computing moving averages:") + + for row in ( + self.df.withColumn("MovingAverage", avg(col("Value")).over(window_spec)) + .select("TagName", "EventTime", "Value", "MovingAverage") + .collect() + ): + self.logger.info( + f"Tag: {row.TagName}, Time: {row.EventTime}, Value: {row.Value}, Moving Avg: {row.MovingAverage}" + ) + + def _validate_inputs(self): + if not isinstance(self.window_size, int) or self.window_size <= 0: + raise ValueError("window_size must be a positive integer.") diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py new file mode 100644 index 000000000..76bb6a388 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .spark import * diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/interfaces.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/interfaces.py new file mode 100644 index 000000000..f79a36232 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/interfaces.py @@ -0,0 +1,32 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from abc import abstractmethod + +from great_expectations.compatibility.pyspark import DataFrame + +from ..interfaces import PipelineComponentBaseInterface + + +class MachineLearningInterface(PipelineComponentBaseInterface): + @abstractmethod + def __init__(self): + pass + + @abstractmethod + def train(self, train_df: DataFrame): + return self + + @abstractmethod + def predict(self, predict_df: DataFrame, *args, **kwargs) -> DataFrame: + pass diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py new file mode 100644 index 000000000..e2ca763d4 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py @@ -0,0 +1,19 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .data_binning import DataBinning +from .linear_regression import LinearRegression +from .arima import ArimaPrediction +from .auto_arima import ArimaAutoPrediction +from .k_nearest_neighbors import KNearestNeighbors diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/arima.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/arima.py new file mode 100644 index 000000000..f92f00135 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/arima.py @@ -0,0 +1,446 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import copy +import statistics +from enum import Enum +from typing import List, Tuple + +import pandas as pd +from pandas import DataFrame +from pyspark.sql import ( + DataFrame as PySparkDataFrame, + SparkSession, + functions as F, + DataFrame as SparkDataFrame, +) +from pyspark.sql.functions import col, lit +from pyspark.sql.types import StringType, StructField, StructType +from regex import regex +from statsmodels.tsa.arima.model import ARIMA +import numpy as np + +from ...data_quality.data_manipulation.interfaces import DataManipulationBaseInterface +from ...data_quality.input_validator import InputValidator +from ...._sdk_utils.pandas import _prepare_pandas_to_convert_to_spark +from ..._pipeline_utils.models import ( + Libraries, + SystemType, +) + + +class ArimaPrediction(DataManipulationBaseInterface, InputValidator): + """ + Extends the timeseries data in given DataFrame with forecasted values from an ARIMA model. + It forecasts a value column of the given time series dataframe based on the historical data points and constructs + full entries based on the preceding timestamps. It is advised to place this step after the missing value imputation + to prevent learning on dirty data. + + It supports dataframes in a source-based format (where each row is an event by a single sensor) and column-based format (where each row is a point in time). + + The similar component AutoArimaPrediction wraps around this component and needs less manual parameters set. 
+ + ARIMA-Specific parameters can be viewed at the following statsmodels documentation page: + [ARIMA Documentation](https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima.model.ARIMA.html) + + Example + ------- + ```python + import numpy as np + import matplotlib.pyplot as plt + import numpy.random + import pandas + from pyspark.sql import SparkSession + + from rtdip_sdk.pipelines.forecasting.spark.arima import ArimaPrediction + + import rtdip_sdk.pipelines._pipeline_utils.spark as spark_utils + + spark_session = SparkSession.builder.master("local[2]").appName("test").getOrCreate() + df = pandas.DataFrame() + + numpy.random.seed(0) + arr_len = 250 + h_a_l = int(arr_len / 2) + df['Value'] = np.random.rand(arr_len) + np.sin(np.linspace(0, arr_len / 10, num=arr_len)) + df['Value2'] = np.random.rand(arr_len) + np.cos(np.linspace(0, arr_len / 2, num=arr_len)) + 5 + df['index'] = np.asarray(pandas.date_range(start='1/1/2024', end='2/1/2024', periods=arr_len)) + df = df.set_index(pandas.DatetimeIndex(df['index'])) + + learn_df = df.head(h_a_l) + + # plt.plot(df['Value']) + # plt.show() + + input_df = spark_session.createDataFrame( + learn_df, + ['Value', 'Value2', 'index'], + ) + arima_comp = ArimaPrediction(input_df, to_extend_name='Value', number_of_data_points_to_analyze=h_a_l, number_of_data_points_to_predict=h_a_l, + order=(3,0,0), seasonal_order=(3,0,0,62)) + forecasted_df = arima_comp.filter_data().toPandas() + print('Done') + ``` + + Parameters: + past_data (PySparkDataFrame): PySpark DataFrame which contains training data + to_extend_name (str): Column or source to forecast on + past_data_style (InputStyle): In which format is past_data formatted + value_name (str): Name of column in source-based format, where values are stored + timestamp_name (str): Name of column, where event timestamps are stored + source_name (str): Name of column in source-based format, where source of events are stored + status_name (str): Name of column in source-based format, where status of events are stored + external_regressor_names (List[str]): Currently not working. Names of the columns with data to use for prediction, but not extend + number_of_data_points_to_predict (int): Amount of points to forecast + number_of_data_points_to_analyze (int): Amount of most recent points to train on + order (tuple): ARIMA-Specific setting + seasonal_order (tuple): ARIMA-Specific setting + trend (str): ARIMA-Specific setting + enforce_stationarity (bool): ARIMA-Specific setting + enforce_invertibility (bool): ARIMA-Specific setting + concentrate_scale (bool): ARIMA-Specific setting + trend_offset (int): ARIMA-Specific setting + missing (str): ARIMA-Specific setting + """ + + df: PySparkDataFrame = None + pd_df: DataFrame = None + spark_session: SparkSession + + column_to_predict: str + rows_to_predict: int + rows_to_analyze: int + + value_name: str + timestamp_name: str + source_name: str + external_regressor_names: List[str] + + class InputStyle(Enum): + """ + Used to describe style of a dataframe + """ + + COLUMN_BASED = 1 # Schema: [EventTime, FirstSource, SecondSource, ...] 
+ SOURCE_BASED = 2 # Schema: [EventTime, NameSource, Value, OptionalStatus] + + def __init__( + self, + past_data: PySparkDataFrame, + to_extend_name: str, # either source or column + # Metadata about past_date + past_data_style: InputStyle = None, + value_name: str = None, + timestamp_name: str = None, + source_name: str = None, + status_name: str = None, + # Options for ARIMA + external_regressor_names: List[str] = None, + number_of_data_points_to_predict: int = 50, + number_of_data_points_to_analyze: int = None, + order: tuple = (0, 0, 0), + seasonal_order: tuple = (0, 0, 0, 0), + trend=None, + enforce_stationarity: bool = True, + enforce_invertibility: bool = True, + concentrate_scale: bool = False, + trend_offset: int = 1, + missing: str = "None", + ) -> None: + self.past_data = past_data + # Convert dataframe to general column-based format for internal processing + self._initialize_self_df( + past_data, + past_data_style, + source_name, + status_name, + timestamp_name, + to_extend_name, + value_name, + ) + + if number_of_data_points_to_analyze > self.df.count(): + raise ValueError( + "Number of data points to analyze exceeds the number of rows present" + ) + + self.spark_session = past_data.sparkSession + self.column_to_predict = to_extend_name + self.rows_to_predict = number_of_data_points_to_predict + self.rows_to_analyze = number_of_data_points_to_analyze or past_data.count() + self.order = order + self.seasonal_order = seasonal_order + self.trend = trend + self.enforce_stationarity = enforce_stationarity + self.enforce_invertibility = enforce_invertibility + self.concentrate_scale = concentrate_scale + self.trend_offset = trend_offset + self.missing = missing + self.external_regressor_names = external_regressor_names + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + @staticmethod + def _is_column_type(df, column_name, data_type): + """ + Helper method for data type checking + """ + type_ = df.schema[column_name] + + return isinstance(type_.dataType, data_type) + + def _initialize_self_df( + self, + past_data, + past_data_style, + source_name, + status_name, + timestamp_name, + to_extend_name, + value_name, + ): + # Initialize self.df with meta parameters if not already done by previous constructor + if self.df is None: + ( + self.past_data_style, + self.value_name, + self.timestamp_name, + self.source_name, + self.status_name, + ) = self._constructor_handle_input_metadata( + past_data, + past_data_style, + value_name, + timestamp_name, + source_name, + status_name, + ) + + if self.past_data_style == self.InputStyle.COLUMN_BASED: + self.df = past_data + elif self.past_data_style == self.InputStyle.SOURCE_BASED: + self.df = ( + past_data.groupby(self.timestamp_name) + .pivot(self.source_name) + .agg(F.first(self.value_name)) + ) + if not to_extend_name in self.df.columns: + raise ValueError("{} not found in the DataFrame.".format(to_extend_name)) + + def _constructor_handle_input_metadata( + self, + past_data: PySparkDataFrame, + past_data_style: InputStyle, + value_name: str, + timestamp_name: str, + source_name: str, + status_name: str, + ) -> Tuple[InputStyle, str, str, str, str]: + # Infer names of columns from past_data schema. If nothing is found, leave self parameters at None. 
+ if past_data_style is not None: + return past_data_style, value_name, timestamp_name, source_name, status_name + # Automatic calculation part + schema_names = past_data.schema.names.copy() + + assumed_past_data_style = None + value_name = None + timestamp_name = None + source_name = None + status_name = None + + def pickout_column( + rem_columns: List[str], regex_string: str + ) -> (str, List[str]): + rgx = regex.compile(regex_string) + sus_columns = list(filter(rgx.search, rem_columns)) + found_column = sus_columns[0] if len(sus_columns) == 1 else None + return found_column + + # Is there a status column? + status_name = pickout_column(schema_names, r"(?i)status") + # Is there a source name / tag + source_name = pickout_column(schema_names, r"(?i)tag") + # Is there a timestamp column? + timestamp_name = pickout_column(schema_names, r"(?i)time|index") + # Is there a value column? + value_name = pickout_column(schema_names, r"(?i)value") + + if source_name is not None: + assumed_past_data_style = self.InputStyle.SOURCE_BASED + else: + assumed_past_data_style = self.InputStyle.COLUMN_BASED + + # if self.past_data_style is None: + # raise ValueError( + # "Automatic determination of past_data_style failed, must be specified in parameter instead.") + return ( + assumed_past_data_style, + value_name, + timestamp_name, + source_name, + status_name, + ) + + def filter_data(self) -> PySparkDataFrame: + """ + Forecasts a value column of a given time series dataframe based on the historical data points using ARIMA. + + Constructs full entries based on the preceding timestamps. It is advised to place this step after the missing + value imputation to prevent learning on dirty data. + + Returns: + DataFrame: A PySpark DataFrame with forecasted value entries depending on constructor parameters. 
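+
+        The model is fitted on the most recent number_of_data_points_to_analyze non-null values of
+        the target column, and number_of_data_points_to_predict forecasted rows are appended, with
+        timestamps continuing at the most common spacing observed in the training data.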
+ """ + # expected_scheme = StructType( + # [ + # StructField("TagName", StringType(), True), + # StructField("EventTime", TimestampType(), True), + # StructField("Status", StringType(), True), + # StructField("Value", NumericType(), True), + # ] + # ) + pd_df = self.df.toPandas() + pd_df.loc[:, self.timestamp_name] = pd.to_datetime( + pd_df[self.timestamp_name], format="mixed" + ).astype("datetime64[ns]") + pd_df.loc[:, self.column_to_predict] = pd_df.loc[ + :, self.column_to_predict + ].astype(float) + pd_df.sort_values(self.timestamp_name, inplace=True) + pd_df.reset_index(drop=True, inplace=True) + # self.validate(expected_scheme) + + # limit df to specific data points + pd_to_train_on = pd_df[pd_df[self.column_to_predict].notna()].tail( + self.rows_to_analyze + ) + pd_to_predict_on = pd_df[pd_df[self.column_to_predict].isna()].head( + self.rows_to_predict + ) + pd_df = pd.concat([pd_to_train_on, pd_to_predict_on]) + + main_signal_df = pd_df[pd_df[self.column_to_predict].notna()] + + input_data = main_signal_df[self.column_to_predict].astype(float) + exog_data = None + # if self.external_regressor_names is not None: + # exog_data = [] + # for column_name in self.external_regressor_names: + # signal_df = pd.concat([pd_to_train_on[column_name], pd_to_predict_on[column_name]]) + # exog_data.append(signal_df) + + source_model = ARIMA( + endog=input_data, + exog=exog_data, + order=self.order, + seasonal_order=self.seasonal_order, + trend=self.trend, + enforce_stationarity=self.enforce_stationarity, + enforce_invertibility=self.enforce_invertibility, + concentrate_scale=self.concentrate_scale, + trend_offset=self.trend_offset, + missing=self.missing, + ).fit() + + forecast = source_model.forecast(steps=self.rows_to_predict) + inferred_freq = pd.Timedelta( + value=statistics.mode(np.diff(main_signal_df[self.timestamp_name].values)) + ) + + pd_forecast_df = pd.DataFrame( + { + self.timestamp_name: pd.date_range( + start=main_signal_df[self.timestamp_name].max() + inferred_freq, + periods=self.rows_to_predict, + freq=inferred_freq, + ), + self.column_to_predict: forecast, + } + ) + + pd_df = pd.concat([pd_df, pd_forecast_df]) + + if self.past_data_style == self.InputStyle.COLUMN_BASED: + for obj in self.past_data.schema: + simple_string_type = obj.dataType.simpleString() + if simple_string_type == "timestamp": + continue + pd_df.loc[:, obj.name] = pd_df.loc[:, obj.name].astype( + simple_string_type + ) + # Workaround needed for PySpark versions <3.4 + pd_df = _prepare_pandas_to_convert_to_spark(pd_df) + predicted_source_pyspark_dataframe = self.spark_session.createDataFrame( + pd_df, schema=copy.deepcopy(self.past_data.schema) + ) + return predicted_source_pyspark_dataframe + elif self.past_data_style == self.InputStyle.SOURCE_BASED: + data_to_add = pd_forecast_df[[self.column_to_predict, self.timestamp_name]] + data_to_add = data_to_add.rename( + columns={ + self.timestamp_name: self.timestamp_name, + self.column_to_predict: self.value_name, + } + ) + data_to_add[self.source_name] = self.column_to_predict + data_to_add[self.timestamp_name] = data_to_add[ + self.timestamp_name + ].dt.strftime("%Y-%m-%dT%H:%M:%S.%f") + + pd_df_schema = StructType( + [ + StructField(self.source_name, StringType(), True), + StructField(self.timestamp_name, StringType(), True), + StructField(self.value_name, StringType(), True), + ] + ) + + # Workaround needed for PySpark versions <3.4 + data_to_add = _prepare_pandas_to_convert_to_spark(data_to_add) + + predicted_source_pyspark_dataframe = 
self.spark_session.createDataFrame( + _prepare_pandas_to_convert_to_spark( + data_to_add[ + [self.source_name, self.timestamp_name, self.value_name] + ] + ), + schema=pd_df_schema, + ) + + if self.status_name is not None: + predicted_source_pyspark_dataframe = ( + predicted_source_pyspark_dataframe.withColumn( + self.status_name, lit("Predicted") + ) + ) + + to_return = self.past_data.unionByName(predicted_source_pyspark_dataframe) + return to_return + + def validate(self, schema_dict, df: SparkDataFrame = None): + return super().validate(schema_dict, self.past_data) diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/auto_arima.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/auto_arima.py new file mode 100644 index 000000000..a47ff7a77 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/auto_arima.py @@ -0,0 +1,151 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import statistics +from typing import List, Tuple + +from pyspark.sql import DataFrame as PySparkDataFrame, SparkSession, functions as F +from pmdarima import auto_arima + +from .arima import ArimaPrediction + + +class ArimaAutoPrediction(ArimaPrediction): + """ + A wrapper for ArimaPrediction which uses pmdarima auto_arima for data prediction. + It selectively tries various sets of p and q (also P and Q for seasonal models) parameters and selects the model with the minimal AIC. 
+
+    Example
+    -------
+    ```python
+    import numpy as np
+    import matplotlib.pyplot as plt
+    import numpy.random
+    import pandas
+    from pyspark.sql import SparkSession
+
+    from rtdip_sdk.pipelines.forecasting.spark.arima import ArimaPrediction
+
+    import rtdip_sdk.pipelines._pipeline_utils.spark as spark_utils
+    from rtdip_sdk.pipelines.forecasting.spark.auto_arima import ArimaAutoPrediction
+
+    spark_session = SparkSession.builder.master("local[2]").appName("test").getOrCreate()
+    df = pandas.DataFrame()
+
+    numpy.random.seed(0)
+    arr_len = 250
+    h_a_l = int(arr_len / 2)
+    df['Value'] = np.random.rand(arr_len) + np.sin(np.linspace(0, arr_len / 10, num=arr_len))
+    df['Value2'] = np.random.rand(arr_len) + np.cos(np.linspace(0, arr_len / 2, num=arr_len)) + 5
+    df['index'] = np.asarray(pandas.date_range(start='1/1/2024', end='2/1/2024', periods=arr_len))
+    df = df.set_index(pandas.DatetimeIndex(df['index']))
+
+    learn_df = df.head(h_a_l)
+
+    # plt.plot(df['Value'])
+    # plt.show()
+
+    input_df = spark_session.createDataFrame(
+        learn_df,
+        ['Value', 'Value2', 'index'],
+    )
+    arima_comp = ArimaAutoPrediction(input_df, to_extend_name='Value', number_of_data_points_to_analyze=h_a_l, number_of_data_points_to_predict=h_a_l,
+                                     seasonal=True)
+    forecasted_df = arima_comp.filter_data().toPandas()
+    print('Done')
+    ```
+
+    Parameters:
+        past_data (PySparkDataFrame): PySpark DataFrame which contains training data
+        to_extend_name (str): Column or source to forecast on
+        past_data_style (InputStyle): In which format is past_data formatted
+        value_name (str): Name of column in source-based format, where values are stored
+        timestamp_name (str): Name of column, where event timestamps are stored
+        source_name (str): Name of column in source-based format, where source of events are stored
+        status_name (str): Name of column in source-based format, where status of events are stored
+        external_regressor_names (List[str]): Currently not working. Names of the columns with data to use for prediction, but not extend
+        number_of_data_points_to_predict (int): Amount of points to forecast
+        number_of_data_points_to_analyze (int): Amount of most recent points to train on
+        seasonal (bool): Setting for AutoArima, is past_data seasonal?
+ enforce_stationarity (bool): ARIMA-Specific setting + enforce_invertibility (bool): ARIMA-Specific setting + concentrate_scale (bool): ARIMA-Specific setting + trend_offset (int): ARIMA-Specific setting + missing (str): ARIMA-Specific setting + """ + + def __init__( + self, + past_data: PySparkDataFrame, + past_data_style: ArimaPrediction.InputStyle = None, + to_extend_name: str = None, + value_name: str = None, + timestamp_name: str = None, + source_name: str = None, + status_name: str = None, + external_regressor_names: List[str] = None, + number_of_data_points_to_predict: int = 50, + number_of_data_points_to_analyze: int = None, + seasonal: bool = False, + enforce_stationarity: bool = True, + enforce_invertibility: bool = True, + concentrate_scale: bool = False, + trend_offset: int = 1, + missing: str = "None", + ) -> None: + # Convert source-based dataframe to column-based if necessary + self._initialize_self_df( + past_data, + past_data_style, + source_name, + status_name, + timestamp_name, + to_extend_name, + value_name, + ) + # Prepare Input data + input_data = self.df.toPandas() + input_data = input_data[input_data[to_extend_name].notna()].tail( + number_of_data_points_to_analyze + )[to_extend_name] + + auto_model = auto_arima( + y=input_data, + seasonal=seasonal, + stepwise=True, + suppress_warnings=True, + trace=False, # Set to true if to debug + error_action="ignore", + max_order=None, + ) + + super().__init__( + past_data=past_data, + past_data_style=self.past_data_style, + to_extend_name=to_extend_name, + value_name=self.value_name, + timestamp_name=self.timestamp_name, + source_name=self.source_name, + status_name=self.status_name, + external_regressor_names=external_regressor_names, + number_of_data_points_to_predict=number_of_data_points_to_predict, + number_of_data_points_to_analyze=number_of_data_points_to_analyze, + order=auto_model.order, + seasonal_order=auto_model.seasonal_order, + trend="c" if auto_model.order[1] == 0 else "t", + enforce_stationarity=enforce_stationarity, + enforce_invertibility=enforce_invertibility, + concentrate_scale=concentrate_scale, + trend_offset=trend_offset, + missing=missing, + ) diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/data_binning.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/data_binning.py new file mode 100644 index 000000000..7138c547f --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/data_binning.py @@ -0,0 +1,91 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import pyspark.ml.clustering as clustering +from pyspark.sql import DataFrame +from ..interfaces import MachineLearningInterface +from ..._pipeline_utils.models import Libraries, SystemType + + +class DataBinning(MachineLearningInterface): + """ + Data binning using clustering methods. This method partitions the data points into a specified number of clusters (bins) + based on the specified column. Each data point is assigned to the nearest cluster center. 
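+    The input column is expected to contain Spark ML feature vectors (for example, the output of a
+    VectorAssembler), since the underlying KMeans model operates on vector features.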
+
+    Example
+    --------
+    ```python
+    from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.data_binning import DataBinning
+
+    df = ... # Get a PySpark DataFrame with features column
+
+    binning = DataBinning(
+        column_name="features",
+        bins=3,
+        output_column_name="bin",
+        method="kmeans"
+    )
+    binned_df = binning.train(df).predict(df)
+    binned_df.show()
+    ```
+
+    Parameters:
+        column_name (str): The name of the input column to be binned (default: "features").
+        bins (int): The number of bins/clusters to create (default: 2).
+        output_column_name (str): The name of the output column containing bin assignments (default: "bin").
+        method (str): The binning method to use. Currently only supports "kmeans".
+    """
+
+    def __init__(
+        self,
+        column_name: str = "features",
+        bins: int = 2,
+        output_column_name: str = "bin",
+        method: str = "kmeans",
+    ) -> None:
+        self.column_name = column_name
+
+        if method == "kmeans":
+            self.method = clustering.KMeans(
+                featuresCol=column_name, predictionCol=output_column_name, k=bins
+            )
+        else:
+            raise ValueError("Unknown method: {}".format(method))
+
+    @staticmethod
+    def system_type():
+        """
+        Attributes:
+            SystemType (Environment): Requires PYSPARK
+        """
+        return SystemType.PYSPARK
+
+    @staticmethod
+    def libraries():
+        libraries = Libraries()
+        return libraries
+
+    @staticmethod
+    def settings() -> dict:
+        return {}
+
+    def train(self, train_df):
+        """
+        Fits the underlying k-means clustering model on the training data.
+        """
+        self.model = self.method.fit(train_df)
+        return self
+
+    def predict(self, predict_df):
+        return self.model.transform(predict_df)
diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/k_nearest_neighbors.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/k_nearest_neighbors.py
new file mode 100644
index 000000000..da4a7cd86
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/k_nearest_neighbors.py
@@ -0,0 +1,205 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from pyspark.sql import DataFrame
+from pyspark.sql.functions import col, udf
+from pyspark.sql.types import DoubleType
+from ..interfaces import MachineLearningInterface
+from ..._pipeline_utils.models import Libraries, SystemType
+import numpy as np
+
+
+class KNearestNeighbors(MachineLearningInterface):
+    """
+    Implements the K-Nearest Neighbors (KNN) algorithm to predict missing values in a dataset.
+    This component is compatible with time series data and supports customizable weighted or unweighted averaging for predictions.
+
+    Example:
+    ```python
+    from pyspark.ml.feature import StandardScaler, VectorAssembler
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import to_timestamp
+    from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.k_nearest_neighbors import KNearestNeighbors
+
+    spark_session = SparkSession.builder.master("local[2]").appName("KNN").getOrCreate()
+    data = [
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", 25.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", -5.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", 50.0),
+        ("B3TS64V0K.:ZUX09R", "2024-01-02 16:00:12.000", "Good", 80.0),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 100.0),
+    ]
+    columns = ["TagName", "EventTime", "Status", "Value"]
+    raw_df = spark_session.createDataFrame(data, columns)
+    raw_df = raw_df.withColumn("EventTime", to_timestamp("EventTime"))
+
+    assembler = VectorAssembler(inputCols=["Value"], outputCol="assembled_features")
+    df = assembler.transform(raw_df)
+    scaler = StandardScaler(inputCol="assembled_features", outputCol="features", withStd=True, withMean=True)
+    scaled_df = scaler.fit(df).transform(df)
+
+    knn = KNearestNeighbors(
+        features_col="features",
+        label_col="Value",
+        timestamp_col="EventTime",
+        k=3,
+        weighted=True,
+        distance_metric="combined",  # Options: "euclidean", "temporal", "combined"
+        temporal_weight=0.3,  # Weight for temporal distance when using the combined metric
+    )
+    train_df, test_df = scaled_df.randomSplit([0.8, 0.2], seed=42)
+    knn.train(train_df)
+    predictions = knn.predict(test_df)
+    ```
+
+    Parameters:
+        features_col (str): Name of the column containing the feature vectors used as model input
+        label_col (str): Name of the column containing the label to predict
+        timestamp_col (str, optional): Name of the column containing timestamps
+        k (int): The number of neighbors to consider in the KNN algorithm. Default is 3
+        weighted (bool): Whether to use weighted averaging based on distance. Default is False (unweighted averaging)
+        distance_metric (str): Type of distance calculation ("euclidean", "temporal", or "combined"). Default is "euclidean"
+        temporal_weight (float): Weight for temporal distance in the combined metric (0 to 1). Default is 0.5
+    """
+
+    def __init__(
+        self,
+        features_col,
+        label_col,
+        timestamp_col=None,
+        k=3,
+        weighted=False,
+        distance_metric="euclidean",
+        temporal_weight=0.5,
+    ):
+        self.features_col = features_col
+        self.label_col = label_col
+        self.timestamp_col = timestamp_col
+        self.k = k
+        self.weighted = weighted
+        self.distance_metric = distance_metric
+        self.temporal_weight = temporal_weight
+        self.train_features = None
+        self.train_labels = None
+        self.train_timestamps = None
+
+        if distance_metric not in ["euclidean", "temporal", "combined"]:
+            raise ValueError(
+                "distance_metric must be 'euclidean', 'temporal', or 'combined'"
+            )
+
+        if distance_metric in ["temporal", "combined"] and timestamp_col is None:
+            raise ValueError(
+                "timestamp_col must be provided when using temporal or combined distance metrics"
+            )
+
+    @staticmethod
+    def system_type():
+        return SystemType.PYSPARK
+
+    @staticmethod
+    def libraries():
+        libraries = Libraries()
+        return libraries
+
+    @staticmethod
+    def settings() -> dict:
+        return {}
+
+    def train(self, train_df: DataFrame):
+        """
+        Sets up the training DataFrame, including temporal information if specified.
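+
+        Note: the training features, labels and (optionally) timestamps are collected to the
+        driver as NumPy arrays, so this component is intended for moderately sized training sets.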
+ """ + if self.timestamp_col: + df = train_df.select( + self.features_col, self.label_col, self.timestamp_col + ).collect() + self.train_timestamps = np.array( + [row[self.timestamp_col].timestamp() for row in df] + ) + else: + df = train_df.select(self.features_col, self.label_col).collect() + + self.train_features = np.array([row[self.features_col] for row in df]) + self.train_labels = np.array([row[self.label_col] for row in df]) + return self + + def predict(self, test_df: DataFrame) -> DataFrame: + """ + Predicts labels using the specified distance metric. + """ + train_features = self.train_features + train_labels = self.train_labels + train_timestamps = self.train_timestamps + k = self.k + weighted = self.weighted + distance_metric = self.distance_metric + temporal_weight = self.temporal_weight + + def calculate_distances(features, timestamp=None): + test_point = np.array(features) + + if distance_metric == "euclidean": + return np.sqrt(np.sum((train_features - test_point) ** 2, axis=1)) + + elif distance_metric == "temporal": + return np.abs(train_timestamps - timestamp) + + else: # combined + feature_distances = np.sqrt( + np.sum((train_features - test_point) ** 2, axis=1) + ) + temporal_distances = np.abs(train_timestamps - timestamp) + + # Normalize distances to [0, 1] range + feature_distances = (feature_distances - feature_distances.min()) / ( + feature_distances.max() - feature_distances.min() + 1e-10 + ) + temporal_distances = (temporal_distances - temporal_distances.min()) / ( + temporal_distances.max() - temporal_distances.min() + 1e-10 + ) + + # Combine distances with weights + return ( + 1 - temporal_weight + ) * feature_distances + temporal_weight * temporal_distances + + def knn_predict(features, timestamp=None): + distances = calculate_distances(features, timestamp) + k_nearest_indices = np.argsort(distances)[:k] + k_nearest_labels = train_labels[k_nearest_indices] + + if weighted: + k_distances = distances[k_nearest_indices] + weights = 1 / (k_distances + 1e-10) + weights /= np.sum(weights) + unique_labels = np.unique(k_nearest_labels) + weighted_votes = { + label: np.sum(weights[k_nearest_labels == label]) + for label in unique_labels + } + return float(max(weighted_votes.items(), key=lambda x: x[1])[0]) + else: + return float( + max(set(k_nearest_labels), key=list(k_nearest_labels).count) + ) + + if self.distance_metric in ["temporal", "combined"]: + predict_udf = udf( + lambda features, timestamp: knn_predict( + features, timestamp.timestamp() + ), + DoubleType(), + ) + return test_df.withColumn( + "prediction", + predict_udf(col(self.features_col), col(self.timestamp_col)), + ) + else: + predict_udf = udf(lambda features: knn_predict(features), DoubleType()) + return test_df.withColumn("prediction", predict_udf(col(self.features_col))) diff --git a/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/linear_regression.py b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/linear_regression.py new file mode 100644 index 000000000..b4195c37c --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/forecasting/spark/linear_regression.py @@ -0,0 +1,159 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from pyspark.sql import DataFrame +import pyspark.ml as ml +from pyspark.ml.evaluation import RegressionEvaluator +from ..interfaces import MachineLearningInterface +from ..._pipeline_utils.models import Libraries, SystemType +from typing import Optional + + +class LinearRegression(MachineLearningInterface): + """ + This class uses pyspark.ml.LinearRegression to train a linear regression model on time data + and then uses the model to predict next values in the time series. + + Args: + features_col (str): Name of the column containing the features (the input). Default is 'features'. + label_col (str): Name of the column containing the label (the input). Default is 'label'. + prediction_col (str): Name of the column to which the prediction will be written. Default is 'prediction'. + + Example: + -------- + ```python + from pyspark.sql import SparkSession + from pyspark.ml.feature import VectorAssembler + from rtdip_sdk.pipelines.forecasting.spark.linear_regression import LinearRegression + + spark = SparkSession.builder.master("local[2]").appName("LinearRegressionExample").getOrCreate() + + data = [ + (1, 2.0, 3.0), + (2, 3.0, 4.0), + (3, 4.0, 5.0), + (4, 5.0, 6.0), + (5, 6.0, 7.0), + ] + columns = ["id", "feature1", "label"] + df = spark.createDataFrame(data, columns) + + assembler = VectorAssembler(inputCols=["feature1"], outputCol="features") + df = assembler.transform(df) + + lr = LinearRegression(features_col="features", label_col="label", prediction_col="prediction") + train_df, test_df = lr.split_data(df, train_ratio=0.8) + lr.train(train_df) + predictions = lr.predict(test_df) + rmse, r2 = lr.evaluate(predictions) + print(f"RMSE: {rmse}, R²: {r2}") + ``` + + """ + + def __init__( + self, + features_col: str = "features", + label_col: str = "label", + prediction_col: str = "prediction", + ) -> None: + self.features_col = features_col + self.label_col = label_col + self.prediction_col = prediction_col + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def split_data( + self, df: DataFrame, train_ratio: float = 0.8 + ) -> tuple[DataFrame, DataFrame]: + """ + Splits the dataset into training and testing sets. + + Args: + train_ratio (float): The ratio of the data to be used for training. Default is 0.8 (80% for training). + + Returns: + tuple[DataFrame, DataFrame]: Returns the training and testing datasets. + """ + train_df, test_df = df.randomSplit([train_ratio, 1 - train_ratio], seed=42) + return train_df, test_df + + def train(self, train_df: DataFrame): + """ + Trains a linear regression model on the provided data. + """ + linear_regression = ml.regression.LinearRegression( + featuresCol=self.features_col, + labelCol=self.label_col, + predictionCol=self.prediction_col, + ) + + self.model = linear_regression.fit(train_df) + return self + + def predict(self, prediction_df: DataFrame): + """ + Predicts the next values in the time series. 
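+
+        Example (a minimal sketch, reusing the names from the class-level example above;
+        assumes `train()` has already been called):
+
+        ```python
+        predictions = lr.predict(test_df)
+        predictions.select("features", "label", "prediction").show()
+        ```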
+        """
+
+        return self.model.transform(
+            prediction_df,
+        )
+
+    def evaluate(self, test_df: DataFrame) -> Optional[tuple[float, float]]:
+        """
+        Evaluates the trained model using RMSE and R².
+
+        Args:
+            test_df (DataFrame): The testing dataset to evaluate the model.
+
+        Returns:
+            Optional[tuple[float, float]]: A tuple of (RMSE, R²), or None if the prediction column doesn't exist.
+        """
+
+        if self.prediction_col not in test_df.columns:
+            print(
+                f"Error: '{self.prediction_col}' column is missing in the test DataFrame."
+            )
+            return None
+
+        # Evaluator for RMSE
+        evaluator_rmse = RegressionEvaluator(
+            labelCol=self.label_col,
+            predictionCol=self.prediction_col,
+            metricName="rmse",
+        )
+        rmse = evaluator_rmse.evaluate(test_df)
+
+        # Evaluator for R²
+        evaluator_r2 = RegressionEvaluator(
+            labelCol=self.label_col, predictionCol=self.prediction_col, metricName="r2"
+        )
+        r2 = evaluator_r2.evaluate(test_df)
+
+        return rmse, r2
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/logging/__init__.py
similarity index 95%
rename from tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/__init__.py
rename to src/sdk/python/rtdip_sdk/pipelines/logging/__init__.py
index 5305a429e..1832b01ae 100644
--- a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/__init__.py
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/__init__.py
@@ -1,4 +1,4 @@
-# Copyright 2022 RTDIP
+# Copyright 2025 RTDIP
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/interfaces.py b/src/sdk/python/rtdip_sdk/pipelines/logging/interfaces.py
new file mode 100644
index 000000000..f72d565be
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/interfaces.py
@@ -0,0 +1,24 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from abc import abstractmethod
+
+from pyspark.sql import DataFrame
+from ..interfaces import PipelineComponentBaseInterface
+
+
+class LoggingBaseInterface(PipelineComponentBaseInterface):
+    @abstractmethod
+    def get_logs_as_df(self, logger_name: str) -> DataFrame:
+        pass
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/logger_manager.py b/src/sdk/python/rtdip_sdk/pipelines/logging/logger_manager.py
new file mode 100644
index 000000000..1e68e181f
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/logger_manager.py
@@ -0,0 +1,82 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + + +from pyspark.pandas.usage_logging.usage_logger import get_logger + + +class LoggerManager: + """ + Manages creation and storage of all loggers in the application. This is a singleton class. + Please create loggers with the LoggerManager if you want your logs to be handled and stored properly. + + + Example Usage + -------- + ```python + logger_manager = LoggerManager() + logger = logger_manager.create_logger("my_logger") + logger.info("This is a log message") + my_logger = logger_manager.get_logger("my_logger") + ``` + """ + + _instance = None + _initialized = False + + # dictionary to store all loggers + loggers = {} + + def __new__(cls): + if cls._instance is None: + cls._instance = super(LoggerManager, cls).__new__(cls) + return cls._instance + + def __init__(self): + if not LoggerManager._initialized: + logging.basicConfig( + level=logging.INFO, + format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", + ) + LoggerManager._initialized = True + + @classmethod + def create_logger(cls, name: str): + """ + Creates a logger with the specified name. + + Args: + name (str): The name of the logger. + + Returns: + logging.Logger: Configured logger instance. + """ + if name not in cls.loggers: + logger = logging.getLogger(name) + cls.loggers[name] = logger + return logger + + return cls.get_logger(name) + + @classmethod + def get_logger(cls, name: str): + if name not in cls.loggers: + return None + return cls.loggers[name] + + @classmethod + def get_all_loggers(cls) -> dict: + return cls.loggers diff --git a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/__init__.py similarity index 95% rename from tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py rename to src/sdk/python/rtdip_sdk/pipelines/logging/spark/__init__.py index 5305a429e..1832b01ae 100644 --- a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py +++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/__init__.py @@ -1,4 +1,4 @@ -# Copyright 2022 RTDIP +# Copyright 2025 RTDIP # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. diff --git a/src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/__init__.py similarity index 95% rename from src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py rename to src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/__init__.py index 5305a429e..1832b01ae 100644 --- a/src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/__init__.py +++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/__init__.py @@ -1,4 +1,4 @@ -# Copyright 2022 RTDIP +# Copyright 2025 RTDIP # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/dataframe_log_handler.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/dataframe_log_handler.py
new file mode 100644
index 000000000..f0d8ebdb6
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/dataframe/dataframe_log_handler.py
@@ -0,0 +1,72 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+
+from pyspark.sql import DataFrame as PySparkDataFrame, SparkSession
+from datetime import datetime
+
+
+from pyspark.sql.types import StructField, TimestampType, StringType, StructType, Row
+
+
+class DataFrameLogHandler(logging.Handler):
+    """
+    Handles logs from an attached logger and stores them in a PySpark DataFrame at runtime.
+    Each log record is stored as a row in the format {timestamp, name, level, message}.
+
+    Args:
+        spark (SparkSession): SparkSession used to create the underlying log DataFrame
+
+    Example
+    --------
+    ```python
+    import logging
+
+    spark = ... # SparkSession
+    log_manager = logging.getLogger('log_manager')
+    handler = DataFrameLogHandler(spark)
+    log_manager.addHandler(handler)
+    log_manager.warning("This is a log message")
+    logs_df = handler.get_logs_as_df()
+    ```
+    """
+
+    logs_df: PySparkDataFrame = None
+    spark: SparkSession
+
+    def __init__(self, spark: SparkSession):
+        self.spark = spark
+        schema = StructType(
+            [
+                StructField("timestamp", TimestampType(), True),
+                StructField("name", StringType(), True),
+                StructField("level", StringType(), True),
+                StructField("message", StringType(), True),
+            ]
+        )
+
+        self.logs_df = self.spark.createDataFrame([], schema)
+        super().__init__()
+
+    def emit(self, record: logging.LogRecord) -> None:
+        """Process and store a log record"""
+        new_log_entry = Row(
+            timestamp=datetime.fromtimestamp(record.created),
+            name=record.name,
+            level=record.levelname,
+            message=record.msg,
+        )
+
+        self.logs_df = self.logs_df.union(self.spark.createDataFrame([new_log_entry]))
+
+    def get_logs_as_df(self) -> PySparkDataFrame:
+        return self.logs_df
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/__init__.py
new file mode 100644
index 000000000..1832b01ae
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/file_log_handler.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/file_log_handler.py
new file mode 100644
index 000000000..d820348a9
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/log_file/file_log_handler.py
@@ -0,0 +1,61 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+
+
+from datetime import datetime
+
+
+class FileLogHandler(logging.Handler):
+    """
+    Handles logs from an attached logger and stores them in a .log file.
+
+    Args:
+        file_path (str): Path of the log file to write to
+        mode (str): File opening mode ('a' for append, 'w' for write)
+
+    Example
+    --------
+    ```python
+    import logging
+
+    log_manager = logging.getLogger('log_manager')
+    handler = FileLogHandler('my_logs.log')
+    log_manager.addHandler(handler)
+    ```
+    """
+
+    def __init__(self, file_path: str, mode: str = "a"):
+        super().__init__()
+        self.mode = mode
+        self.file_path = file_path
+
+    def emit(self, record: logging.LogRecord) -> None:
+        """Format a log record and append it to the log file"""
+        try:
+            log_entry = (
+                f"{datetime.fromtimestamp(record.created).isoformat()} | "
+                f"{record.name} | "
+                f"{record.levelname} | "
+                f"{record.msg}"
+            )
+            with open(self.file_path, self.mode, encoding="utf-8") as log_file:
+                log_file.write(log_entry + "\n")
+
+        except Exception as e:
+            print(f"Error writing log entry to file: {e}")
diff --git a/src/sdk/python/rtdip_sdk/pipelines/logging/spark/runtime_log_collector.py b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/runtime_log_collector.py
new file mode 100644
index 000000000..7b3ad84fb
--- /dev/null
+++ b/src/sdk/python/rtdip_sdk/pipelines/logging/spark/runtime_log_collector.py
@@ -0,0 +1,73 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
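+
+# Example (a minimal usage sketch; "my_logger" is a hypothetical logger name):
+#
+#     spark = ... # SparkSession
+#     logger = LoggerManager().create_logger("my_logger")
+#     collector = RuntimeLogCollector(spark)
+#     handler = collector._attach_dataframe_handler_to_logger("my_logger")
+#     logger.info("pipeline step finished")
+#     handler.get_logs_as_df().show()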
+import os + +from pyspark.sql import SparkSession + +from src.sdk.python.rtdip_sdk.pipelines._pipeline_utils.models import ( + Libraries, + SystemType, +) + +from src.sdk.python.rtdip_sdk.pipelines.logging.logger_manager import LoggerManager +from src.sdk.python.rtdip_sdk.pipelines.logging.spark.dataframe.dataframe_log_handler import ( + DataFrameLogHandler, +) +from src.sdk.python.rtdip_sdk.pipelines.logging.spark.log_file.file_log_handler import ( + FileLogHandler, +) + + +class RuntimeLogCollector: + """Collects logs from all loggers in the LoggerManager at runtime.""" + + logger_manager: LoggerManager = LoggerManager() + + spark: SparkSession + + def __init__(self, spark: SparkSession): + self.spark = spark + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def _attach_dataframe_handler_to_logger( + self, logger_name: str + ) -> DataFrameLogHandler: + """Attaches the DataFrameLogHandler to the logger. Returns True if the handler was attached, False otherwise.""" + logger = self.logger_manager.get_logger(logger_name) + df_log_handler = DataFrameLogHandler(self.spark) + if logger is not None: + if df_log_handler not in logger.handlers: + logger.addHandler(df_log_handler) + return df_log_handler + + def _attach_file_handler_to_loggers( + self, filename: str, path: str = ".", mode: str = "a" + ) -> None: + """Attaches the FileLogHandler to the logger.""" + + loggers = self.logger_manager.get_all_loggers() + file_path = os.path.join(path, filename) + file_handler = FileLogHandler(file_path, mode) + for logger in loggers.values(): + # avoid duplicate handlers + if file_handler not in logger.handlers: + logger.addHandler(file_handler) diff --git a/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/__init__.py b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/__init__.py new file mode 100644 index 000000000..9a4ecff83 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/__init__.py @@ -0,0 +1,16 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .columns_to_vector import * +from .polynomial_features import * diff --git a/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/columns_to_vector.py b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/columns_to_vector.py new file mode 100644 index 000000000..df856bf57 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/columns_to_vector.py @@ -0,0 +1,86 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from pyspark.ml.feature import VectorAssembler +from pyspark.sql import DataFrame +from ...._pipeline_utils.models import Libraries, SystemType +from ...interfaces import TransformerInterface + + +class ColumnsToVector(TransformerInterface): + """ + Converts columns containing numbers to a column containing a vector. + + Parameters: + df (DataFrame): PySpark DataFrame + input_cols (list[str]): List of columns to convert to a vector. + output_col (str): Name of the output column where the vector will be stored. + override_col (bool): If True, the output column can override an existing column. + """ + + def __init__( + self, + df: DataFrame, + input_cols: list[str], + output_col: str, + override_col: bool = False, + ) -> None: + self.input_cols = input_cols + self.output_col = output_col + self.override_col = override_col + self.df = df + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def pre_transform_validation(self): + if self.output_col in self.df.columns and not self.override_col: + return False + return True + + def post_transform_validation(self): + return True + + def transform(self): + if not self.pre_transform_validation(): + raise ValueError( + f"Output column {self.output_col} already exists and override_col is set to False." + ) + + temp_col = ( + f"{self.output_col}_temp" if self.output_col in self.df.columns else None + ) + transformed_df = VectorAssembler( + inputCols=self.input_cols, outputCol=(temp_col or self.output_col) + ).transform(self.df) + + if temp_col: + return transformed_df.drop(self.output_col).withColumnRenamed( + temp_col, self.output_col + ) + return transformed_df diff --git a/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/one_hot_encoding.py b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/one_hot_encoding.py new file mode 100644 index 000000000..37a0d2ae1 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/one_hot_encoding.py @@ -0,0 +1,135 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from pyspark.sql import DataFrame as PySparkDataFrame +from pyspark.sql import functions as F +from ...interfaces import TransformerInterface +from ...._pipeline_utils.models import Libraries, SystemType + + +class OneHotEncoding(TransformerInterface): + """ + Performs One-Hot Encoding on a specified column of a PySpark DataFrame. + + Example + -------- + ```python + from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.one_hot_encoding import OneHotEncoding + from pyspark.sql import SparkSession + + + spark = ... # SparkSession + df = ... # Get a PySpark DataFrame + + one_hot_encoder = OneHotEncoding(df, "column_name", ["list_of_distinct_values"]) + result_df = one_hot_encoder.encode() + result_df.show() + ``` + + Parameters: + df (DataFrame): The PySpark DataFrame to apply encoding on. + column (str): The name of the column to apply the encoding to. + values (list, optional): A list of distinct values to encode. If not provided, + the distinct values from the data will be used. + """ + + df: PySparkDataFrame + column: str + values: list + + def __init__(self, df: PySparkDataFrame, column: str, values: list = None) -> None: + self.df = df + self.column = column + self.values = values + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def pre_transform_validation(self): + """ + Validate the input data before transformation. + - Check if the specified column exists in the DataFrame. + - If no values are provided, check if the distinct values can be computed. + - Ensure the DataFrame is not empty. + """ + if self.df is None or self.df.count() == 0: + raise ValueError("The DataFrame is empty.") + + if self.column not in self.df.columns: + raise ValueError(f"Column '{self.column}' does not exist in the DataFrame.") + + if not self.values: + distinct_values = [ + row[self.column] + for row in self.df.select(self.column).distinct().collect() + ] + if not distinct_values: + raise ValueError(f"No distinct values found in column '{self.column}'.") + self.values = distinct_values + + def post_transform_validation(self): + """ + Validate the result after transformation. + - Ensure that new columns have been added based on the distinct values. + - Verify the transformed DataFrame contains the expected number of columns. 
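+
+        For example, encoding a column "Status" whose distinct values are ["Good", "Bad"]
+        is expected to add the columns "Status_Good" and "Status_Bad".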
+ """ + expected_columns = [ + f"{self.column}_{value if value is not None else 'None'}" + for value in self.values + ] + missing_columns = [ + col for col in expected_columns if col not in self.df.columns + ] + + if missing_columns: + raise ValueError( + f"Missing columns in the transformed DataFrame: {missing_columns}" + ) + + if self.df.count() == 0: + raise ValueError("The transformed DataFrame is empty.") + + def transform(self) -> PySparkDataFrame: + + self.pre_transform_validation() + + if not self.values: + self.values = [ + row[self.column] + for row in self.df.select(self.column).distinct().collect() + ] + + for value in self.values: + self.df = self.df.withColumn( + f"{self.column}_{value if value is not None else 'None'}", + F.when(F.col(self.column) == value, 1).otherwise(0), + ) + + self.post_transform_validation() + + return self.df diff --git a/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/polynomial_features.py b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/polynomial_features.py new file mode 100644 index 000000000..b3456fe65 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/transformers/spark/machine_learning/polynomial_features.py @@ -0,0 +1,110 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import pyspark.ml as ml +from pyspark.sql import DataFrame + +from ...._pipeline_utils.models import Libraries, SystemType +from ...interfaces import TransformerInterface + + +class PolynomialFeatures(TransformerInterface): + """ + This transformer takes a vector column and generates polynomial combinations of the input features + up to the specified degree. For example, if the input vector is [a, b] and degree=2, + the output features will be [a, b, a^2, ab, b^2]. + + Parameters: + df (DataFrame): PySpark DataFrame + input_col (str): Name of the input column in the DataFrame that contains the feature vectors + output_col (str): + poly_degree (int): The degree of the polynomial features to generate + override_col (bool): If True, the output column can override an existing column. + """ + + def __init__( + self, + df: DataFrame, + input_col: str, + output_col: str, + poly_degree: int, + override_col: bool = False, + ): + self.df = df + self.input_col = input_col + self.output_col = output_col + self.poly_degree = poly_degree + self.override_col = override_col + + @staticmethod + def system_type(): + """ + Attributes: + SystemType (Environment): Requires PYSPARK + """ + return SystemType.PYSPARK + + @staticmethod + def libraries(): + libraries = Libraries() + return libraries + + @staticmethod + def settings() -> dict: + return {} + + def pre_transform_validation(self): + if not (self.input_col in self.df.columns): + raise ValueError( + f"Input column '{self.input_col}' does not exist in the DataFrame." + ) + if self.output_col in self.df.columns and not self.override_col: + raise ValueError( + f"Output column '{self.output_col}' already exists in the DataFrame and override_col is set to False." 
+ ) + if not isinstance(self.df.schema[self.input_col].dataType, ml.linalg.VectorUDT): + raise ValueError( + f"Input column '{self.input_col}' is not of type VectorUDT." + ) + return True + + def post_transform_validation(self): + if self.output_col not in self.df.columns: + raise ValueError( + f"Output column '{self.output_col}' does not exist in the transformed DataFrame." + ) + return True + + def transform(self): + + self.pre_transform_validation() + + temp_col = ( + f"{self.output_col}_temp" if self.output_col in self.df.columns else None + ) + transformed_df = ml.feature.PolynomialExpansion( + degree=self.poly_degree, + inputCol=self.input_col, + outputCol=(temp_col or self.output_col), + ).transform(self.df) + + if temp_col: + return transformed_df.drop(self.output_col).withColumnRenamed( + temp_col, self.output_col + ) + + self.df = transformed_df + self.post_transform_validation() + + return transformed_df diff --git a/src/sdk/python/rtdip_sdk/pipelines/utilities/spark/time_string_parsing.py b/src/sdk/python/rtdip_sdk/pipelines/utilities/spark/time_string_parsing.py new file mode 100644 index 000000000..0bad557a7 --- /dev/null +++ b/src/sdk/python/rtdip_sdk/pipelines/utilities/spark/time_string_parsing.py @@ -0,0 +1,46 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import re + + +def parse_time_string_to_ms(time_str: str) -> float: + """ + Parses a time string and returns the total time in milliseconds. + + Args: + time_str (str): Time string (e.g., '10ms', '1s', '2m', '1h'). + + Returns: + float: Total time in milliseconds. + + Raises: + ValueError: If the format is invalid. 
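+
+    Example:
+        ```python
+        parse_time_string_to_ms("10ms")   # 10.0
+        parse_time_string_to_ms("1.5s")   # 1500.0
+        parse_time_string_to_ms("2m")     # 120000.0
+        ```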
+ """ + pattern = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h)$") + match = pattern.match(time_str) + if not match: + raise ValueError(f"Invalid time format: {time_str}") + value, unit = match.groups() + value = float(value) + if unit == "ms": + return value + elif unit == "s": + return value * 1000 + elif unit == "m": + return value * 60 * 1000 + elif unit == "h": + return value * 3600 * 1000 + else: + raise ValueError(f"Unsupported time unit in time: {unit}") diff --git a/src/sdk/python/rtdip_sdk/queries/time_series/time_series_query_builder.py b/src/sdk/python/rtdip_sdk/queries/time_series/time_series_query_builder.py index a19b740be..383ab7fca 100644 --- a/src/sdk/python/rtdip_sdk/queries/time_series/time_series_query_builder.py +++ b/src/sdk/python/rtdip_sdk/queries/time_series/time_series_query_builder.py @@ -227,7 +227,7 @@ def raw( "end_date": end_date, "include_bad_data": include_bad_data, "display_uom": display_uom, - "sirt": sort, + "sort": sort, "limit": limit, "offset": offset, "tagname_column": self.tagname_column, @@ -323,7 +323,7 @@ def resample( "time_interval_rate": time_interval_rate, "time_interval_unit": time_interval_unit, "agg_method": agg_method, - ":fill": fill, + "fill": fill, "pivot": pivot, "display_uom": display_uom, "sort": sort, diff --git a/src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/__init__.py similarity index 100% rename from src/sdk/python/rtdip_sdk/pipelines/monitoring/spark/__init__.py rename to tests/sdk/python/rtdip_sdk/pipelines/__init__.py diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_dimensionality_reduction.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_dimensionality_reduction.py new file mode 100644 index 000000000..0e19edd6e --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_dimensionality_reduction.py @@ -0,0 +1,119 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import pytest +from pyspark.sql import SparkSession + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.dimensionality_reduction import ( + DimensionalityReduction, +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +@pytest.fixture +def test_data(spark_session): + normal_distribution = [ + 0.30832997, + 0.22166579, + -1.68713693, + 1.41243689, + 1.25282623, + -0.70494665, + 0.52186887, + -0.34352648, + -1.38233527, + -0.76870644, + 1.72735928, + -0.14838714, + -0.76086769, + 1.81330706, + -1.84541331, + -1.05816002, + 0.86864253, + -2.47756826, + 0.19112086, + -0.72390124, + ] + + noise = [ + 2.39757601, + 0.40913959, + 0.40281196, + 0.43624341, + 0.57281305, + 0.15978893, + 0.09098515, + 0.18199072, + 2.9758837, + 1.38059478, + 1.55032586, + 0.88507288, + 2.13327, + 2.21896827, + 0.61288938, + 0.17535961, + 1.83386377, + 1.08476656, + 1.86311249, + 0.44964528, + ] + + data_with_noise = [ + (normal_distribution[i], normal_distribution[i] + noise[i]) + for i in range(len(normal_distribution)) + ] + + identical_data = [ + (normal_distribution[i], normal_distribution[i]) + for i in range(len(normal_distribution)) + ] + + return [ + spark_session.createDataFrame(data_with_noise, ["Value1", "Value2"]), + spark_session.createDataFrame(identical_data, ["Value1", "Value2"]), + ] + + +def test_with_correlated_data(spark_session, test_data): + identical_data = test_data[1] + + dimensionality_reduction = DimensionalityReduction( + identical_data, ["Value1", "Value2"] + ) + result_df = dimensionality_reduction.filter_data() + + assert ( + result_df.count() == identical_data.count() + ), "Row count does not match expected result" + assert "Value1" in result_df.columns, "Value1 should be in the DataFrame" + assert "Value2" not in result_df.columns, "Value2 should have been removed" + + +def test_with_uncorrelated_data(spark_session, test_data): + uncorrelated_data = test_data[0] + + dimensionality_reduction = DimensionalityReduction( + uncorrelated_data, ["Value1", "Value2"] + ) + result_df = dimensionality_reduction.filter_data() + + assert ( + result_df.count() == uncorrelated_data.count() + ), "Row count does not match expected result" + assert "Value1" in result_df.columns, "Value1 should be in the DataFrame" + assert "Value2" in result_df.columns, "Value2 should be in the DataFrame" diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_duplicate_detection.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_duplicate_detection.py new file mode 100644 index 000000000..270f2c36e --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_duplicate_detection.py @@ -0,0 +1,163 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import pytest +import os +from pyspark.sql import SparkSession +from pyspark.sql.dataframe import DataFrame +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.duplicate_detection import ( + DuplicateDetection, +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +@pytest.fixture +def test_data(spark_session): + data = [ + ("key1", "time1", "value1"), + ("key2", "time2", "value2"), + ("key2", "time3", "value2"), + ("key1", "time1", "value3"), + ("key4", "time4", "value4"), + ("key5", "time4", "value5"), + ] + columns = ["TagName", "EventTime", "Value"] + return spark_session.createDataFrame(data, columns) + + +def test_duplicate_detection_two_columns(spark_session, test_data): + expected_data = [ + ("key1", "time1", "value1"), + ("key2", "time2", "value2"), + ("key2", "time3", "value2"), + ("key4", "time4", "value4"), + ("key5", "time4", "value5"), + ] + columns = ["TagName", "EventTime", "Value"] + expected_df = spark_session.createDataFrame(expected_data, columns) + + duplicate_detection = DuplicateDetection( + test_data, primary_key_columns=["TagName", "EventTime"] + ) + result_df = duplicate_detection.filter_data() + result_df.show() + + assert ( + result_df.count() == expected_df.count() + ), "Row count does not match expected result" + assert sorted(result_df.collect()) == sorted( + expected_df.collect() + ), "Data does not match expected result" + + +def test_duplicate_detection_one_column(spark_session, test_data): + expected_data = [ + ("key1", "time1", "value1"), + ("key2", "time2", "value2"), + ("key4", "time4", "value4"), + ("key5", "time4", "value5"), + ] + columns = ["TagName", "EventTime", "Value"] + expected_df = spark_session.createDataFrame(expected_data, columns) + + duplicate_detection = DuplicateDetection(test_data, primary_key_columns=["TagName"]) + result_df = duplicate_detection.filter_data() + result_df.show() + + assert ( + result_df.count() == expected_df.count() + ), "Row count does not match expected result" + assert sorted(result_df.collect()) == sorted( + expected_df.collect() + ), "Data does not match expected result" + + +def test_duplicate_detection_large_data_set(spark_session: SparkSession): + test_path = os.path.dirname(__file__) + data_path = os.path.join(test_path, "../../test_data.csv") + + actual_df = spark_session.read.option("header", "true").csv(data_path) + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + duplicate_detection_component = DuplicateDetection( + actual_df, primary_key_columns=["TagName", "EventTime"] + ) + result_df = DataFrame + + try: + if duplicate_detection_component.validate(expected_schema): + result_df = duplicate_detection_component.filter_data() + except Exception as e: + print(repr(e)) + + assert isinstance(actual_df, DataFrame) + + assert result_df.schema == expected_schema + assert result_df.count() < actual_df.count() + assert result_df.count() == (actual_df.count() - 4) + + +def test_duplicate_detection_wrong_datatype(spark_session: SparkSession): + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", 
StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "5.0"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + duplicate_detection_component = DuplicateDetection( + test_df, primary_key_columns=["TagName", "EventTime"] + ) + + with pytest.raises(ValueError) as exc_info: + duplicate_detection_component.validate(expected_schema) + + assert ( + "Error during casting column 'EventTime' to TimestampType(): Column 'EventTime' cannot be cast to TimestampType()." + in str(exc_info.value) + ) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_flatline_filter.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_flatline_filter.py new file mode 100644 index 000000000..6e5086b9a --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_flatline_filter.py @@ -0,0 +1,131 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import pytest +import os +from pyspark.sql import SparkSession +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.flatline_filter import ( + FlatlineFilter, +) + + +@pytest.fixture(scope="session") +def spark(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("FlatlineDetectionTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +def test_flatline_filter_no_flatlining(spark): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + detector = FlatlineFilter(df, watch_columns=["Value"], tolerance_timespan=2) + result = detector.filter_data() + + assert sorted(result.collect()) == sorted(df.collect()) + + +def test_flatline_detection_with_flatlining(spark): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "Null"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + detector = FlatlineFilter(df, watch_columns=["Value"], tolerance_timespan=2) + result = detector.filter_data() + + rows_to_remove = [ + { + "TagName": "A2PS64V0J.:ZUX09R", + "EventTime": "2024-01-02 07:53:11.000", + "Status": "Good", + "Value": "0.0", + }, + { + "TagName": "A2PS64V0J.:ZUX09R", + "EventTime": "2024-01-02 07:53:11.000", + "Status": "Good", + "Value": "0.0", + }, + { + "TagName": "A2PS64V0J.:ZUX09R", + "EventTime": "2024-01-02 11:56:42.000", + "Status": "Good", + "Value": "0.0", + }, + { + "TagName": "A2PS64V0J.:ZUX09R", + "EventTime": "2024-01-02 16:00:12.000", + "Status": "Good", + "Value": "None", + }, + ] + rows_to_remove_df = spark.createDataFrame(rows_to_remove) + expected_df = df.subtract(rows_to_remove_df) + assert sorted(result.collect()) == sorted(expected_df.collect()) + + +def test_large_dataset(spark): + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + df = spark.read.option("header", "true").csv(file_path) + + assert df.count() > 0, "Dataframe was not loaded correctly" + + detector = FlatlineFilter(df, watch_columns=["Value"], tolerance_timespan=2) + result = detector.filter_data() + + rows_to_remove = [ + { + "TagName": "FLATLINE_TEST", + "EventTime": "2024-01-02 02:35:10.511000", + "Status": "Good", + "Value": "0.0", + }, + { + "TagName": "FLATLINE_TEST", + "EventTime": "2024-01-02 02:49:10.408000", + "Status": "Good", + "Value": "0.0", + }, + { + "TagName": "FLATLINE_TEST", + "EventTime": "2024-01-02 14:57:10.372000", + "Status": "Good", + "Value": "0.0", + }, + ] + rows_to_remove_df = spark.createDataFrame(rows_to_remove) + + expected_df = df.subtract(rows_to_remove_df) + + assert sorted(result.collect()) == sorted(expected_df.collect()) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_gaussian_smoothing.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_gaussian_smoothing.py new file 
mode 100644 index 000000000..1c8131903 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_gaussian_smoothing.py @@ -0,0 +1,142 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import pytest +from pyspark.sql import SparkSession + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.gaussian_smoothing import ( + GaussianSmoothing, +) + + +@pytest.fixture(scope="session") +def spark_session(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("GaussianSmoothingTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +def test_gaussian_smoothing_temporal(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + smoother = GaussianSmoothing( + df=df, + sigma=2.0, + id_col="TagName", + mode="temporal", + timestamp_col="EventTime", + value_col="Value", + ) + result_df = smoother.filter_data() + + original_values = df.select("Value").collect() + smoothed_values = result_df.select("Value").collect() + + assert ( + original_values != smoothed_values + ), "Values should be smoothed and not identical" + + assert result_df.count() == df.count(), "Result should have same number of rows" + + +def test_gaussian_smoothing_spatial(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + # Apply smoothing + smoother = GaussianSmoothing( + df=df, + sigma=3.0, + id_col="TagName", + mode="spatial", + timestamp_col="EventTime", + value_col="Value", + ) + result_df = smoother.filter_data() + + original_values = df.select("Value").collect() + smoothed_values = result_df.select("Value").collect() + + assert ( + original_values != smoothed_values + ), "Values should be smoothed and not identical" + assert result_df.count() == df.count(), "Result should have same number of rows" + + +def test_interval_detection_large_data_set(spark_session: SparkSession): + # Should not timeout + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + + df = spark_session.read.option("header", "true").csv(file_path) + + smoother = GaussianSmoothing( + df=df, + sigma=1, + 
id_col="TagName", + mode="temporal", + timestamp_col="EventTime", + value_col="Value", + ) + + actual_df = smoother.filter_data() + assert ( + actual_df.count() == df.count() + ), "Output should have same number of rows as input" + + +def test_gaussian_smoothing_invalid_mode(spark_session: SparkSession): + # Create test data + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + # Attempt to initialize with an invalid mode + with pytest.raises(ValueError, match="mode must be either 'temporal' or 'spatial'"): + GaussianSmoothing( + df=df, + sigma=2.0, + id_col="TagName", + mode="invalid_mode", # Invalid mode + timestamp_col="EventTime", + value_col="Value", + ) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_interval_filtering.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_interval_filtering.py new file mode 100644 index 000000000..a8fa04f32 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_interval_filtering.py @@ -0,0 +1,377 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +from datetime import datetime + +import pytest + + +from pyspark.sql import SparkSession +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.interval_filtering import ( + IntervalFiltering, +) +from tests.sdk.python.rtdip_sdk.pipelines.logging.test_log_collection import spark + + +@pytest.fixture(scope="session") +def spark_session(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("CheckValueRangesTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +def convert_to_datetime(date_time: str): + return datetime.strptime(date_time, "%Y-%m-%d %H:%M:%S.%f") + + +def test_interval_detection_easy(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 1, "seconds", "EventTime" + ) + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_easy_unordered(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 1, "seconds", "EventTime" + ) + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_milliseconds(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:03:46.020"), + 
("A2PS64asd.:ZUX09R", "2024-01-02 20:03:46.030"), + ], + ["TagName", "Time"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:03:46.020"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.025"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:03:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 20:03:46.035"), + ], + ["TagName", "Time"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 10, "milliseconds", "Time" + ) + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_minutes(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:06:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:12:46.030"), + ], + ["TagName", "Time"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:06:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:09:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-02 20:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 20:03:46.035"), + ], + ["TagName", "Time"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 3, "minutes", "Time" + ) + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_hours(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:06:46.000"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:06:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 21:09:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering(spark_session, df, 1, "hours") + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_days(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-03 21:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-04 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2028-01-01 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-03 21:03:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 21:03:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-04 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2028-01-01 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering(spark_session, df, 1, "days") + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + 
assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_wrong_time_stamp_column_name(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:06:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 21:09:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 1, "hours", "Time" + ) + + with pytest.raises(ValueError): + interval_filtering_wrangler.filter_data() + + +def test_interval_detection_wrong_interval_unit_pass(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:06:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 21:09:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 1, "years", "EventTime" + ) + + with pytest.raises(ValueError): + interval_filtering_wrangler.filter_data() + + +def test_interval_detection_faulty_time_stamp(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "2024-01-09-02 20:03:46.000"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:06:46.000"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 21:09:45.999"), + ("A2PS64asd.:ZUX09R", "2024-01-02 21:12:46.030"), + ("A2PS64V0J.:ZUasdX09R", "2024-01-02 23:03:46.035"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 1, "minutes", "EventTime" + ) + + with pytest.raises(ValueError): + interval_filtering_wrangler.filter_data() + + +def test_interval_tolerance(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:47.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:50.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:52.000", "Good", "0.129999995"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:46.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:47.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:50.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:51.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:52.000", "Good", "0.129999995"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + interval_filtering_wrangler = IntervalFiltering( + spark_session, df, 3, "seconds", "EventTime", 1 + ) + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_date_time_columns(spark_session: SparkSession): + expected_df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", convert_to_datetime("2024-01-02 20:03:46.000")), + ("A2PS64asd.:ZUX09R", convert_to_datetime("2024-01-02 21:06:46.000")), + ("A2PS64V0J.:ZUasdX09R", convert_to_datetime("2024-01-02 23:03:46.035")), + ], + 
["TagName", "EventTime"], + ) + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", convert_to_datetime("2024-01-02 20:03:46.000")), + ("A2PS64asd.:ZUX09R", convert_to_datetime("2024-01-02 21:06:46.000")), + ("A2PS64V0J.:ZUX09R", convert_to_datetime("2024-01-02 21:09:45.999")), + ("A2PS64asd.:ZUX09R", convert_to_datetime("2024-01-02 21:12:46.030")), + ("A2PS64V0J.:ZUasdX09R", convert_to_datetime("2024-01-02 23:03:46.035")), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering(spark_session, df, 1, "hours") + actual_df = interval_filtering_wrangler.filter_data() + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + assert expected_df.collect() == actual_df.collect() + + +def test_interval_detection_large_data_set(spark_session: SparkSession): + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + + df = spark_session.read.option("header", "true").csv(file_path) + + interval_filtering_wrangler = IntervalFiltering(spark_session, df, 1, "hours") + + actual_df = interval_filtering_wrangler.filter_data() + assert actual_df.count() == 25 + + +def test_interval_detection_wrong_datatype(spark_session: SparkSession): + df = spark_session.createDataFrame( + [ + ("A2PS64V0JR", "invalid_data_type"), + ("A2PS64asd.:ZUX09R", "invalid_data_type"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type"), + ("A2PS64asd.:ZUX09R", "invalid_data_type"), + ("A2PS64V0J.:ZUasdX09R", "invalid_data_type"), + ], + ["TagName", "EventTime"], + ) + + interval_filtering_wrangler = IntervalFiltering(spark_session, df, 1, "hours") + + with pytest.raises(ValueError): + interval_filtering_wrangler.filter_data() diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_k_sigma_anomaly_detection.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_k_sigma_anomaly_detection.py new file mode 100644 index 000000000..bee9b0678 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_k_sigma_anomaly_detection.py @@ -0,0 +1,140 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from pyspark.sql import SparkSession +import pytest +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.k_sigma_anomaly_detection import ( + KSigmaAnomalyDetection, +) +import os + +# Normal data mean=10 stddev=5 + 3 anomalies +# fmt: off +normal_input_values = [ 5.19811497, 8.34437927, 3.62104032, 10.02819525, 6.1183447 , + 20.10067378, 10.32313075, 14.090119 , 21.43078927, 2.76624332, + 10.84089416, 1.90722629, 11.19750641, 13.70925639, 5.61011921, + 4.50072694, 13.79440311, 13.30173747, 7.07183589, 12.79853139, 100] + +normal_expected_values = [ 5.19811497, 8.34437927, 3.62104032, 10.02819525, 6.1183447 , + 20.10067378, 10.32313075, 14.090119 , 21.43078927, 2.76624332, + 10.84089416, 1.90722629, 11.19750641, 13.70925639, 5.61011921, + 4.50072694, 13.79440311, 13.30173747, 7.07183589, 12.79853139] +# fmt: on + +# These values are tricky for the mean method, as the anomaly has a big effect on the mean +input_values = [1, 2, 3, 4, 20] +expected_values = [1, 2, 3, 4] + + +def test_filter_with_mean(spark_session: SparkSession): + # Test with normal data + normal_input_df = spark_session.createDataFrame( + [(float(num),) for num in normal_input_values], schema=["value"] + ) + normal_expected_df = spark_session.createDataFrame( + [(float(num),) for num in normal_expected_values], schema=["value"] + ) + + normal_filtered_df = KSigmaAnomalyDetection( + spark_session, + normal_input_df, + column_names=["value"], + k_value=3, + use_median=False, + ).filter_data() + + assert normal_expected_df.collect() == normal_filtered_df.collect() + + # Test with data that has an anomaly that shifts the mean significantly + input_df = spark_session.createDataFrame( + [(float(num),) for num in input_values], schema=["value"] + ) + expected_df = spark_session.createDataFrame( + [(float(num),) for num in expected_values], schema=["value"] + ) + + filtered_df = KSigmaAnomalyDetection( + spark_session, input_df, column_names=["value"], k_value=3, use_median=False + ).filter_data() + + assert expected_df.collect() != filtered_df.collect() + + +def test_filter_with_median(spark_session: SparkSession): + # Test with normal data + normal_input_df = spark_session.createDataFrame( + [(float(num),) for num in normal_input_values], schema=["value"] + ) + normal_expected_df = spark_session.createDataFrame( + [(float(num),) for num in normal_expected_values], schema=["value"] + ) + + normal_filtered_df = KSigmaAnomalyDetection( + spark_session, + normal_input_df, + column_names=["value"], + k_value=3, + use_median=True, + ).filter_data() + + assert normal_expected_df.collect() == normal_filtered_df.collect() + + # Test with data that has an anomaly that shifts the mean significantly + input_df = spark_session.createDataFrame( + [(float(num),) for num in input_values], schema=["value"] + ) + expected_df = spark_session.createDataFrame( + [(float(num),) for num in expected_values], schema=["value"] + ) + + filtered_df = KSigmaAnomalyDetection( + spark_session, input_df, column_names=["value"], k_value=3, use_median=True + ).filter_data() + + assert expected_df.collect() == filtered_df.collect() + + +def test_filter_with_wrong_types(spark_session: SparkSession): + wrong_column_type_df = spark_session.createDataFrame( + [(f"string {i}",) for i in range(10)], schema=["value"] + ) + + # wrong value type + with pytest.raises(ValueError): + KSigmaAnomalyDetection( + spark_session, + wrong_column_type_df, + column_names=["value"], + k_value=3, + use_median=True, + ).filter_data() + + # missing column + with 
pytest.raises(ValueError): + KSigmaAnomalyDetection( + spark_session, + wrong_column_type_df, + column_names=["$value"], + k_value=3, + use_median=True, + ).filter_data() + + +def test_large_dataset(spark_session): + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + df = spark_session.read.option("header", "true").csv(file_path) + + assert df.count() > 0, "Dataframe was not loaded correct" + + KSigmaAnomalyDetection(spark_session, df, column_names=["Value"]).filter_data() diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_missing_value_imputation.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_missing_value_imputation.py new file mode 100644 index 000000000..242581571 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_missing_value_imputation.py @@ -0,0 +1,403 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import pytest +import os + +from pyspark.sql import SparkSession +from pyspark.sql.dataframe import DataFrame +from pyspark.sql.functions import col, unix_timestamp, abs as A +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation import ( + MissingValueImputation, +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +def test_missing_value_imputation(spark_session: SparkSession): + + schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 15:39:03.000", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 19:42:37.000", "Good", "5.0"), + # ("A2PS64V0J.:ZUX09R", "2024-01-01 23:46:11.000", "Good", "6.0"), # Test values + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "7.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "8.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "9.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "10.0"), + ( + "A2PS64V0J.:ZUX09R", + "2024-01-02 20:13:46.000", + "Good", + "11.0", + ), # Tolerance Test + ("A2PS64V0J.:ZUX09R", "2024-01-03 00:07:20.000", "Good", "12.0"), + # ("A2PS64V0J.:ZUX09R", "2024-01-03 04:10:54.000", "Good", 
"13.0"), + # ("A2PS64V0J.:ZUX09R", "2024-01-03 08:14:28.000", "Good", "14.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 12:18:02.000", "Good", "15.0"), + # ("A2PS64V0J.:ZUX09R", "2024-01-03 16:21:36.000", "Good", "16.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 20:25:10.000", "Good", "17.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 00:28:44.000", "Good", "18.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 04:32:18.000", "Good", "19.0"), + # Real missing values + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:01:43", "Good", "4686.259766"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:02:44", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:04:44", "Good", "4686.259766"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:05:44", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:11:46", "Good", "4686.259766"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:13:46", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:16:47", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:19:48", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:20:48", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:25:50", "Good", "4681.35791"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:26:50", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:27:50", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:28:50", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:31:51", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:32:52", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:52", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:54", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:43:54", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:44:54", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:45:54", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:46:55", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:47:55", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:51:56", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:52:56", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:55:57", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:56:58", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:57:58", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:59:59", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:00:59", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:05:01", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:10:02", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:11:03", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:13:06", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:17:07", "Good", "4691.161621"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:18:07", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:20:07", "Good", "4686.259766"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:21:07", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", 
"2023-12-31 01:25:09", "Good", "4676.456055"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:26:09", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:30:09", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:35:10", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:36:10", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:40:11", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:42:11", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:43:11", "Good", "4705.867676"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:44:11", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:46:11", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:47:11", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:53:13", "Good", "4696.063477"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:54:13", "Good", "4700.96582"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:55:13", "Good", "4686.259766"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:56:13", "Good", "4700.96582"), + ] + + expected_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 15:39:03", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 19:42:37", "Good", "5.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 23:46:10", "Good", "6.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45", "Good", "7.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11", "Good", "8.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42", "Good", "9.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12", "Good", "10.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:13:46", "Good", "11.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 00:07:20", "Good", "12.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 04:10:50", "Good", "13.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 08:14:20", "Good", "14.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 12:18:02", "Good", "15.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 16:21:30", "Good", "16.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 20:25:10", "Good", "17.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 00:28:44", "Good", "18.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 04:32:18", "Good", "19.0"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:01:43", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:02:44", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:03:44", "Good", "4688.019"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:04:44", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:05:44", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:06:44", "Good", "4694.203"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:07:44", "Good", "4693.92"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:08:44", "Good", "4691.6475"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:09:44", "Good", "4688.722"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:10:44", "Good", "4686.481"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:11:46", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:12:46", "Good", "4688.637"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:13:46", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:14:46", "Good", 
"4691.4985"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:15:46", "Good", "4690.817"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:16:47", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:17:47", "Good", "4693.7354"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:18:47", "Good", "4696.372"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:19:48", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:20:48", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:21:48", "Good", "4684.8516"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:22:48", "Good", "4679.2305"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:23:48", "Good", "4675.784"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:24:48", "Good", "4675.998"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:25:50", "Good", "4681.358"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:26:50", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:27:50", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:28:50", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:29:50", "Good", "4691.056"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:30:50", "Good", "4694.813"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:31:51", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:32:52", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:33:52", "Good", "4685.6963"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:34:52", "Good", "4681.356"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:35:52", "Good", "4678.175"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:36:52", "Good", "4676.186"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:37:52", "Good", "4675.423"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:38:52", "Good", "4675.9185"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:39:52", "Good", "4677.707"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:40:52", "Good", "4680.8213"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:41:52", "Good", "4685.295"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:52", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:54", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:43:52", "Good", "4692.863"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:43:54", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:44:54", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:45:54", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:46:55", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:47:55", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:48:55", "Good", "4689.178"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:49:55", "Good", "4692.111"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:50:55", "Good", "4695.794"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:51:56", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:52:56", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:53:56", "Good", "4687.381"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:54:56", "Good", "4687.1104"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:55:57", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:56:58", "Good", 
"4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:57:58", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:58:58", "Good", "4693.161"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:59:59", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:00:59", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:01:59", "Good", "4688.2207"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:02:59", "Good", "4689.07"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:03:59", "Good", "4692.1904"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:05:01", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:06:01", "Good", "4699.3506"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:07:01", "Good", "4701.433"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:08:01", "Good", "4701.872"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:09:01", "Good", "4700.228"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:10:02", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:11:03", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:12:03", "Good", "4692.6973"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:13:06", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:14:06", "Good", "4695.113"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:15:06", "Good", "4691.5415"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:16:06", "Good", "4689.0054"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:17:07", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:18:07", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:19:07", "Good", "4688.7515"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:20:07", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:21:07", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:22:07", "Good", "4700.935"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:23:07", "Good", "4687.808"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:24:07", "Good", "4675.1323"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:25:09", "Good", "4676.456"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:26:09", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:27:09", "Good", "4708.868"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:28:09", "Good", "4711.2476"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:29:09", "Good", "4707.2603"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:30:09", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:31:09", "Good", "4695.7764"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:32:09", "Good", "4692.5146"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:33:09", "Good", "4691.358"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:34:09", "Good", "4692.482"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:35:10", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:36:10", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:37:10", "Good", "4702.4126"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:38:10", "Good", "4700.763"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:39:10", "Good", "4697.9897"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:40:11", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:41:11", "Good", 
"4696.747"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:42:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:43:11", "Good", "4705.8677"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:44:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:45:11", "Good", "4695.9624"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:46:11", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:47:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:48:11", "Good", "4702.187"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:49:11", "Good", "4699.401"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:50:11", "Good", "4695.0015"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:51:11", "Good", "4691.3823"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:52:11", "Good", "4690.9385"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:53:13", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:54:13", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:55:13", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:56:13", "Good", "4700.966"), + ] + + test_df = spark_session.createDataFrame(test_data, schema=schema) + expected_df = spark_session.createDataFrame(expected_data, schema=schema) + + missing_value_imputation = MissingValueImputation(spark_session, test_df) + actual_df = DataFrame + + try: + if missing_value_imputation.validate(expected_schema): + actual_df = missing_value_imputation.filter_data() + except Exception as e: + print(repr(e)) + + assert isinstance(actual_df, DataFrame) + + assert expected_df.columns == actual_df.columns + assert expected_schema == actual_df.schema + + def assert_dataframe_similar( + expected_df, actual_df, tolerance=1e-4, time_tolerance_seconds=5 + ): + + expected_df = expected_df.orderBy(["TagName", "EventTime"]) + actual_df = actual_df.orderBy(["TagName", "EventTime"]) + + expected_df = expected_df.withColumn("Value", col("Value").cast("float")) + actual_df = actual_df.withColumn("Value", col("Value").cast("float")) + + for expected_row, actual_row in zip(expected_df.collect(), actual_df.collect()): + for expected_val, actual_val, column_name in zip( + expected_row, actual_row, expected_df.columns + ): + if column_name == "Value": + assert ( + abs(expected_val - actual_val) < tolerance + ), f"Value mismatch: {expected_val} != {actual_val}" + elif column_name == "EventTime": + expected_event_time = unix_timestamp(col("EventTime")).cast( + "timestamp" + ) + actual_event_time = unix_timestamp(col("EventTime")).cast( + "timestamp" + ) + + time_diff = A( + expected_event_time.cast("long") + - actual_event_time.cast("long") + ) + condition = time_diff <= time_tolerance_seconds + + mismatched_rows = expected_df.join( + actual_df, on=["TagName", "EventTime"], how="inner" + ).filter(~condition) + + assert ( + mismatched_rows.count() == 0 + ), f"EventTime mismatch: {expected_val} != {actual_val} (tolerance: {time_tolerance_seconds}s)" + else: + assert ( + expected_val == actual_val + ), f"Mismatch in column '{column_name}': {expected_val} != {actual_val}" + + assert_dataframe_similar(expected_df, actual_df, tolerance=1e-4) + + +def test_missing_value_imputation_large_data_set(spark_session: SparkSession): + test_path = os.path.dirname(__file__) + data_path = os.path.join(test_path, "../../test_data.csv") + + actual_df = spark_session.read.option("header", "true").csv(data_path) + + expected_schema = 
StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + missing_value_imputation_component = MissingValueImputation( + spark_session, actual_df + ) + result_df = DataFrame + + try: + if missing_value_imputation_component.validate(expected_schema): + result_df = missing_value_imputation_component.filter_data() + except Exception as e: + print(repr(e)) + + assert isinstance(actual_df, DataFrame) + + assert result_df.schema == expected_schema + assert result_df.count() > actual_df.count() + + +def test_missing_value_imputation_wrong_datatype(spark_session: SparkSession): + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "5.0"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + missing_value_imputation_component = MissingValueImputation(spark_session, test_df) + + with pytest.raises(ValueError) as exc_info: + missing_value_imputation_component.validate(expected_schema) + + assert ( + "Error during casting column 'EventTime' to TimestampType(): Column 'EventTime' cannot be cast to TimestampType()." + in str(exc_info.value) + ) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_normalization.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_normalization.py new file mode 100644 index 000000000..128ee14c5 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_normalization.py @@ -0,0 +1,184 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
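+# Reference sketch, not used by the tests below: the min-max round trip is
+# z = (x - min) / (max - min) and x = z * (max - min) + min, which maps
+# [1.0, 2.0] to [0.0, 1.0] exactly as test_non_inplace_normalization expects,
+# and divides by zero when every value is identical — the reason
+# helper_assert_idempotence tolerates ZeroDivisionError. How NormalizationMean
+# scales values is not shown here, so only the min-max pair is sketched.
+def _minmax_normalize(values):
+    """Return (normalized_values, (low, high)) so the transform can be undone."""
+    low, high = min(values), max(values)
+    return [(value - low) / (high - low) for value in values], (low, high)
+
+
+def _minmax_denormalize(normalized, bounds):
+    """Invert _minmax_normalize using the saved (low, high) bounds."""
+    low, high = bounds
+    return [z * (high - low) + low for z in normalized]
+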
+ +from pandas.io.formats.format import math +import pytest +import os + +from pyspark.sql import SparkSession +from pyspark.sql.dataframe import DataFrame + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.denormalization import ( + Denormalization, +) +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization import ( + NormalizationBaseClass, +) +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_mean import ( + NormalizationMean, +) +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_minmax import ( + NormalizationMinMax, +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +def test_nonexistent_column_normalization(spark_session: SparkSession): + input_df = spark_session.createDataFrame( + [ + (1.0,), + (2.0,), + ], + ["Value"], + ) + + with pytest.raises(ValueError): + NormalizationMean(input_df, column_names=["NonexistingColumn"], in_place=True) + + +def test_wrong_column_type_normalization(spark_session: SparkSession): + input_df = spark_session.createDataFrame( + [ + ("a",), + ("b",), + ], + ["Value"], + ) + + with pytest.raises(ValueError): + NormalizationMean(input_df, column_names=["Value"]) + + +def test_non_inplace_normalization(spark_session: SparkSession): + input_df = spark_session.createDataFrame( + [ + (1.0,), + (2.0,), + ], + ["Value"], + ) + + expected_normalised_df = spark_session.createDataFrame( + [ + (1.0, 0.0), + (2.0, 1.0), + ], + ["Value", "Value_minmax_normalization"], + ) + + normalization_component = NormalizationMinMax( + input_df, column_names=["Value"], in_place=False + ) + normalised_df = normalization_component.filter_data() + + assert isinstance(normalised_df, DataFrame) + + assert expected_normalised_df.columns == normalised_df.columns + assert expected_normalised_df.schema == normalised_df.schema + assert expected_normalised_df.collect() == normalised_df.collect() + + denormalization_component = Denormalization(normalised_df, normalization_component) + reverted_df = denormalization_component.filter_data() + + assert isinstance(reverted_df, DataFrame) + + assert input_df.columns == reverted_df.columns + assert input_df.schema == reverted_df.schema + assert input_df.collect() == reverted_df.collect() + + +@pytest.mark.parametrize("class_to_test", NormalizationBaseClass.__subclasses__()) +def test_idempotence_with_positive_values( + spark_session: SparkSession, class_to_test: NormalizationBaseClass +): + input_df = spark_session.createDataFrame( + [ + (1.0,), + (2.0,), + (3.0,), + (4.0,), + (5.0,), + ], + ["Value"], + ) + + expected_df = input_df.alias("input_df") + helper_assert_idempotence(class_to_test, input_df, expected_df) + + +@pytest.mark.parametrize("class_to_test", NormalizationBaseClass.__subclasses__()) +def test_idempotence_with_zero_values( + spark_session: SparkSession, class_to_test: NormalizationBaseClass +): + input_df = spark_session.createDataFrame( + [ + (0.0,), + (0.0,), + (0.0,), + (0.0,), + (0.0,), + ], + ["Value"], + ) + + expected_df = input_df.alias("input_df") + helper_assert_idempotence(class_to_test, input_df, expected_df) + + +@pytest.mark.parametrize("class_to_test", NormalizationBaseClass.__subclasses__()) +def test_idempotence_with_large_data_set( + spark_session: SparkSession, class_to_test: NormalizationBaseClass +): + base_path = 
os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + input_df = spark_session.read.option("header", "true").csv(file_path) + input_df = input_df.withColumn("Value", input_df["Value"].cast("double")) + assert input_df.count() > 0, "Dataframe was not loaded correct" + input_df.show() + + expected_df = input_df.alias("input_df") + helper_assert_idempotence(class_to_test, input_df, expected_df) + + +def helper_assert_idempotence( + class_to_test: NormalizationBaseClass, + input_df: DataFrame, + expected_df: DataFrame, +): + try: + normalization_component = class_to_test( + input_df, column_names=["Value"], in_place=True + ) + actual_df = normalization_component.filter_data() + + denormalization_component = Denormalization(actual_df, normalization_component) + actual_df = denormalization_component.filter_data() + + assert isinstance(actual_df, DataFrame) + + assert expected_df.columns == actual_df.columns + assert expected_df.schema == actual_df.schema + + for row1, row2 in zip(expected_df.collect(), actual_df.collect()): + for col1, col2 in zip(row1, row2): + if isinstance(col1, float) and isinstance(col2, float): + assert math.isclose(col1, col2, rel_tol=1e-9) + else: + assert col1 == col2 + except ZeroDivisionError: + pass diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_one_hot_encoding.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_one_hot_encoding.py new file mode 100644 index 000000000..c3d54dda5 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_one_hot_encoding.py @@ -0,0 +1,194 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
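+# Reference sketch, not used by the tests below: one-hot encoding adds one
+# indicator column per distinct value of the encoded column, named
+# "<column>_<value>" and set to 1 where the row matches and 0 otherwise — the
+# column names and 0/1 values these tests assert. Null handling is a special
+# case (test_null_values expects a null TagName to get 0 in every indicator
+# column), which this plain-Python illustration does not try to reproduce.
+def _one_hot(rows, column):
+    """Add "<column>_<value>" indicator keys for each distinct value of `column`."""
+    distinct = {row[column] for row in rows}
+    return [
+        {
+            **row,
+            **{
+                f"{column}_{value}": int(row[column] == value)
+                for value in distinct
+            },
+        }
+        for row in rows
+    ]
+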
+import pytest +import math + +from pyspark.sql import SparkSession +from pyspark.sql.types import StructType, StructField, StringType, FloatType +from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.one_hot_encoding import ( + OneHotEncoding, +) + +# Define the schema outside the test functions +SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +def test_empty_df(spark_session): + """Empty DataFrame""" + empty_df = spark_session.createDataFrame([], SCHEMA) + encoder = OneHotEncoding(empty_df, "TagName") + + with pytest.raises(ValueError, match="The DataFrame is empty."): + encoder = OneHotEncoding(empty_df, "TagName") + encoder.transform() + + +def test_single_unique_value(spark_session): + """Single Unique Value""" + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46", "Good", 0.34), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12", "Good", 0.15), + ] + df = spark_session.createDataFrame(data, SCHEMA) + encoder = OneHotEncoding(df, "TagName") + result_df = encoder.transform() + + expected_columns = [ + "TagName", + "EventTime", + "Status", + "Value", + "TagName_A2PS64V0J.:ZUX09R", + ] + assert ( + result_df.columns == expected_columns + ), "Columns do not match for single unique value." + for row in result_df.collect(): + assert ( + row["TagName_A2PS64V0J.:ZUX09R"] == 1 + ), "Expected 1 for the one-hot encoded column." + + +def test_null_values(spark_session): + """Column with Null Values""" + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46", "Good", 0.34), + (None, "2024-01-02 16:00:12", "Good", 0.15), + ] + df = spark_session.createDataFrame(data, SCHEMA) + encoder = OneHotEncoding(df, "TagName") + result_df = encoder.transform() + + expected_columns = [ + "TagName", + "EventTime", + "Status", + "Value", + "TagName_A2PS64V0J.:ZUX09R", + "TagName_None", + ] + assert ( + result_df.columns == expected_columns + ), f"Columns do not match for null value case. Expected {expected_columns}, but got {result_df.columns}" + for row in result_df.collect(): + if row["TagName"] == "A2PS64V0J.:ZUX09R": + assert ( + row["TagName_A2PS64V0J.:ZUX09R"] == 1 + ), "Expected 1 for valid TagName." + assert ( + row["TagName_None"] == 0 + ), "Expected 0 for TagName_None for valid TagName." + elif row["TagName"] is None: + assert ( + row["TagName_A2PS64V0J.:ZUX09R"] == 0 + ), "Expected 0 for TagName_A2PS64V0J.:ZUX09R for None TagName." + assert ( + row["TagName_None"] == 0 + ), "Expected 0 for TagName_None for None TagName." + + +def test_large_unique_values(spark_session): + """Large Number of Unique Values""" + data = [ + (f"Tag_{i}", f"2024-01-02 20:03:{i:02d}", "Good", i * 1.0) for i in range(1000) + ] + df = spark_session.createDataFrame(data, SCHEMA) + encoder = OneHotEncoding(df, "TagName") + result_df = encoder.transform() + + assert ( + len(result_df.columns) == len(SCHEMA.fields) + 1000 + ), "Expected 1000 additional columns for one-hot encoding." 
+ + +def test_special_characters(spark_session): + """Special Characters in Column Values""" + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46", "Good", 0.34), + ("@Special#Tag!", "2024-01-02 16:00:12", "Good", 0.15), + ] + df = spark_session.createDataFrame(data, SCHEMA) + encoder = OneHotEncoding(df, "TagName") + result_df = encoder.transform() + + expected_columns = [ + "TagName", + "EventTime", + "Status", + "Value", + "TagName_A2PS64V0J.:ZUX09R", + "TagName_@Special#Tag!", + ] + assert ( + result_df.columns == expected_columns + ), "Columns do not match for special characters." + for row in result_df.collect(): + for tag in ["A2PS64V0J.:ZUX09R", "@Special#Tag!"]: + expected_value = 1 if row["TagName"] == tag else 0 + column_name = f"TagName_{tag}" + assert ( + row[column_name] == expected_value + ), f"Expected {expected_value} for {column_name}." + + +def test_distinct_value(spark_session): + """Dataset with Multiple TagName Values""" + + data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46", "Good", 0.3400000035762787), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12", "Good", 0.15000000596046448), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + "2024-01-02 20:09:58", + "Good", + 7107.82080078125, + ), + ("_LT2EPL-9PM0.OROTENV3:", "2024-01-02 12:27:10", "Good", 19407.0), + ("1N325T3MTOR-P0L29:9.T0", "2024-01-02 23:41:10", "Good", 19376.0), + ] + + df = spark_session.createDataFrame(data, SCHEMA) + + encoder = OneHotEncoding(df, "TagName") + result_df = encoder.transform() + + result = result_df.collect() + + expected_columns = df.columns + [ + f"TagName_{row['TagName']}" for row in df.select("TagName").distinct().collect() + ] + + assert set(result_df.columns) == set(expected_columns) + + tag_names = df.select("TagName").distinct().collect() + for row in result: + tag_name = row["TagName"] + for tag in tag_names: + column_name = f"TagName_{tag['TagName']}" + if tag["TagName"] == tag_name: + assert math.isclose(row[column_name], 1.0, rel_tol=1e-09, abs_tol=1e-09) + else: + assert math.isclose(row[column_name], 0.0, rel_tol=1e-09, abs_tol=1e-09) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_out_of_range_value_filter.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_out_of_range_value_filter.py new file mode 100644 index 000000000..913ae9ffa --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/test_out_of_range_value_filter.py @@ -0,0 +1,111 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import pytest
+from pyspark.sql import SparkSession
+import os
+
+
+from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.out_of_range_value_filter import (
+    OutOfRangeValueFilter,
+)
+
+
+@pytest.fixture(scope="session")
+def spark():
+    spark = (
+        SparkSession.builder.master("local[2]")
+        .appName("DeleteOutOfRangeValuesTest")
+        .getOrCreate()
+    )
+    yield spark
+    spark.stop()
+
+
+@pytest.fixture
+def test_data(spark):
+    data = [
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "1"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "2"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "3"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "4"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "5"),
+        ("Tag2", "2024-01-02 03:49:45.000", "Good", "1"),
+        ("Tag2", "2024-01-02 07:53:11.000", "Good", "2"),
+        ("Tag2", "2024-01-02 11:56:42.000", "Good", "3"),
+        ("Tag2", "2024-01-02 16:00:12.000", "Good", "4"),
+        ("Tag2", "2024-01-02 20:03:46.000", "Good", "5"),
+    ]
+    return spark.createDataFrame(data, ["TagName", "EventTime", "Status", "Value"])
+
+
+def test_basic(spark, test_data):
+    tag_ranges = {
+        "A2PS64V0J.:ZUX09R": {"min": 2, "max": 4, "inclusive_bounds": True},
+        "Tag2": {"min": 1, "max": 5, "inclusive_bounds": False},
+    }
+    manipulator = OutOfRangeValueFilter(test_data, tag_ranges)
+
+    rows_to_remove = [
+        {
+            "TagName": "A2PS64V0J.:ZUX09R",
+            "EventTime": "2024-01-02 07:53:11.000",
+            "Status": "Good",
+            "Value": "2",
+        },
+        {
+            "TagName": "Tag2",
+            "EventTime": "2024-01-02 11:56:42.000",
+            "Status": "Good",
+            "Value": "3",
+        },
+    ]
+    rows_to_remove_df = spark.createDataFrame(rows_to_remove)
+    expected = test_data.subtract(rows_to_remove_df)
+
+    result = manipulator.filter_data()
+
+    assert sorted(result.collect()) == sorted(expected.collect())
+
+
+def test_large_dataset(spark):
+    base_path = os.path.dirname(__file__)
+    file_path = os.path.join(base_path, "../../test_data.csv")
+    df = spark.read.option("header", "true").csv(file_path)
+    assert df.count() > 0, "DataFrame was not loaded correctly"
+
+    tag_ranges = {
+        "value_range": {"min": 2, "max": 4, "inclusive_bounds": True},
+    }
+    manipulator = OutOfRangeValueFilter(df, tag_ranges)
+
+    rows_to_remove = [
+        {
+            "TagName": "value_range",
+            "EventTime": "2024-01-02 03:49:45",
+            "Status": "Good",
+            "Value": "1.0",
+        },
+        {
+            "TagName": "value_range",
+            "EventTime": "2024-01-02 20:03:46",
+            "Status": "Good",
+            "Value": "5.0",
+        },
+    ]
+    rows_to_remove_df = spark.createDataFrame(rows_to_remove)
+    expected = df.subtract(rows_to_remove_df)
+
+    result = manipulator.filter_data()
+
+    assert sorted(result.collect()) == sorted(expected.collect())
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/__init__ .py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__ .py
similarity index 95%
rename from tests/sdk/python/rtdip_sdk/pipelines/monitoring/__init__ .py
rename to tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__ .py
index 5305a429e..1832b01ae 100644
--- a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/__init__ .py
+++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/__init__ .py
@@ -1,4 +1,4 @@
-# Copyright 2022 RTDIP
+# Copyright 2025 RTDIP
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_check_value_ranges.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_check_value_ranges.py new file mode 100644 index 000000000..9e036666b --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_check_value_ranges.py @@ -0,0 +1,140 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import pytest
+from pyspark.sql import SparkSession
+from io import StringIO
+import logging
+import os
+
+
+from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.check_value_ranges import (
+    CheckValueRanges,
+)
+
+
+@pytest.fixture(scope="session")
+def spark():
+    spark = (
+        SparkSession.builder.master("local[2]")
+        .appName("CheckValueRangesTest")
+        .getOrCreate()
+    )
+    yield spark
+    spark.stop()
+
+
+@pytest.fixture
+def log_capture():
+    log_stream = StringIO()
+    logger = logging.getLogger("CheckValueRanges")
+    logger.setLevel(logging.INFO)
+    handler = logging.StreamHandler(log_stream)
+    formatter = logging.Formatter("%(message)s")
+    handler.setFormatter(formatter)
+    logger.addHandler(handler)
+    yield log_stream
+    logger.removeHandler(handler)
+    handler.close()
+
+
+@pytest.fixture
+def test_data(spark):
+    data = [
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "1"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "2"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "3"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "4"),
+        ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "5"),
+        ("Tag2", "2024-01-02 03:49:45.000", "Good", "1"),
+        ("Tag2", "2024-01-02 07:53:11.000", "Good", "2"),
+        ("Tag2", "2024-01-02 11:56:42.000", "Good", "3"),
+        ("Tag2", "2024-01-02 16:00:12.000", "Good", "4"),
+        ("Tag2", "2024-01-02 20:03:46.000", "Good", "5"),
+    ]
+    return spark.createDataFrame(data, ["TagName", "EventTime", "Status", "Value"])
+
+
+def test_basic(test_data, log_capture):
+    tag_ranges = {
+        "A2PS64V0J.:ZUX09R": {"min": 2, "max": 4, "inclusive_bounds": True},
+        "Tag2": {"min": 1, "max": 5, "inclusive_bounds": False},
+    }
+    monitor = CheckValueRanges(test_data, tag_ranges)
+    monitor.check()
+    expected_logs = [
+        # For TagName 'A2PS64V0J.:ZUX09R' with inclusive_bounds=True
+        "Found 2 rows in 'Value' column for TagName 'A2PS64V0J.:ZUX09R' out of range.",
+        "Out of range row for TagName 'A2PS64V0J.:ZUX09R': Row(TagName='A2PS64V0J.:ZUX09R', EventTime=datetime.datetime(2024, 1, 2, 3, 49, 45), Status='Good', Value=1.0)",
+        "Out of range row for TagName 'A2PS64V0J.:ZUX09R': Row(TagName='A2PS64V0J.:ZUX09R', EventTime=datetime.datetime(2024, 1, 2, 20, 3, 46), Status='Good', Value=5.0)",
+        "Found 2 rows in 'Value' column for TagName 'Tag2' out of range.",
+        "Out of range row for TagName 'Tag2': Row(TagName='Tag2', EventTime=datetime.datetime(2024, 1, 2, 3, 49, 45), Status='Good', Value=1.0)",
+        "Out of range row for TagName 'Tag2': Row(TagName='Tag2', EventTime=datetime.datetime(2024, 1, 2, 20, 3, 46), Status='Good', Value=5.0)",
+    ]
+    log_contents = log_capture.getvalue()
+    actual_logs = log_contents.strip().split("\n")
+    assert len(expected_logs) == len(
+        actual_logs
+    ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}"
+    for expected, actual in zip(expected_logs, actual_logs):
+        assert expected == actual, f"Expected: '{expected}', got: '{actual}'"
+
+
+def test_invalid_tag_name(test_data):
+    tag_ranges = {
+        "InvalidTagName": {"min": 0, "max": 100},
+    }
+    with pytest.raises(ValueError) as excinfo:
+        monitor = CheckValueRanges(df=test_data, tag_ranges=tag_ranges)
+        monitor.check()
+
+    assert "TagName 'InvalidTagName' not found in DataFrame." in str(excinfo.value)
+
+
+def test_no_min_or_max(test_data):
+    tag_ranges = {
+        "A2PS64V0J.:ZUX09R": {},  # Neither 'min' nor 'max' specified
+    }
+    with pytest.raises(ValueError) as excinfo:
+        monitor = CheckValueRanges(df=test_data, tag_ranges=tag_ranges)
+        monitor.check()
+    assert (
+        "TagName 'A2PS64V0J.:ZUX09R' must have at least 'min' or 'max' specified."
+        in str(excinfo.value)
+    )
+
+
+def test_large_dataset(spark, log_capture):
+    base_path = os.path.dirname(__file__)
+    file_path = os.path.join(base_path, "../../test_data.csv")
+    df = spark.read.option("header", "true").csv(file_path)
+    assert df.count() > 0, "DataFrame was not loaded correctly"
+
+    tag_ranges = {
+        "value_range": {"min": 2, "max": 4, "inclusive_bounds": True},
+    }
+    monitor = CheckValueRanges(df, tag_ranges)
+    monitor.check()
+
+    expected_logs = [
+        "Found 2 rows in 'Value' column for TagName 'value_range' out of range.",
+        "Out of range row for TagName 'value_range': Row(TagName='value_range', EventTime=datetime.datetime(2024, 1, 2, 3, 49, 45), Status=' Good', Value=1.0)",
+        "Out of range row for TagName 'value_range': Row(TagName='value_range', EventTime=datetime.datetime(2024, 1, 2, 20, 3, 46), Status=' Good', Value=5.0)",
+    ]
+    actual_logs = log_capture.getvalue().strip().split("\n")
+
+    assert len(expected_logs) == len(
+        actual_logs
+    ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}"
+    for expected, actual in zip(expected_logs, actual_logs):
+        assert expected in actual, f"Expected: '{expected}', got: '{actual}'"
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_flatline_detection.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_flatline_detection.py
new file mode 100644
index 000000000..64aac49b2
--- /dev/null
+++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_flatline_detection.py
@@ -0,0 +1,155 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import pytest +import os +from pyspark.sql import SparkSession +from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.flatline_detection import ( + FlatlineDetection, +) + +import logging +from io import StringIO + + +@pytest.fixture(scope="session") +def spark(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("FlatlineDetectionTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +@pytest.fixture +def log_capture(): + log_stream = StringIO() + logger = logging.getLogger("FlatlineDetection") + logger.setLevel(logging.INFO) + handler = logging.StreamHandler(log_stream) + formatter = logging.Formatter("%(message)s") + handler.setFormatter(formatter) + logger.addHandler(handler) + yield log_stream + logger.removeHandler(handler) + handler.close() + + +def test_flatline_detection_no_flatlining(spark, log_capture): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + detector = FlatlineDetection(df, watch_columns=["Value"], tolerance_timespan=2) + detector.check() + + expected_logs = [ + "No flatlining detected.", + ] + actual_logs = log_capture.getvalue().strip().split("\n") + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}" + for expected, actual in zip(expected_logs, actual_logs): + assert expected == actual, f"Expected: '{expected}', got: '{actual}'" + + +def test_flatline_detection_with_flatlining(spark, log_capture): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "Null"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + detector = FlatlineDetection(df, watch_columns=["Value"], tolerance_timespan=2) + detector.check() + + expected_logs = [ + "Flatlining detected in column 'Value' at row: Row(TagName='A2PS64V0J.:ZUX09R', EventTime=datetime.datetime(2024, 1, 2, 7, 53, 11), Status='Good', Value=0.0, Value_flatline_flag=1, Value_group=1).", + "Flatlining detected in column 'Value' at row: Row(TagName='A2PS64V0J.:ZUX09R', EventTime=datetime.datetime(2024, 1, 2, 11, 56, 42), Status='Good', Value=0.0, Value_flatline_flag=1, Value_group=1).", + "Flatlining detected in column 'Value' at row: Row(TagName='A2PS64V0J.:ZUX09R', EventTime=datetime.datetime(2024, 1, 2, 16, 0, 12), Status='Good', Value=None, Value_flatline_flag=1, Value_group=1).", + ] + actual_logs = log_capture.getvalue().strip().split("\n") + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}" + for expected, actual in zip(expected_logs, actual_logs): + assert expected in actual, f"Expected: '{expected}', got: '{actual}'" + + +def test_flatline_detection_with_tolerance(spark, log_capture): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", 
"2024-01-02 07:53:11.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42.000", "Good", "0.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "Null"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + detector = FlatlineDetection(df, watch_columns=["Value"], tolerance_timespan=3) + detector.check() + + expected_logs = [ + "No flatlining detected.", + ] + actual_logs = log_capture.getvalue().strip().split("\n") + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}" + for expected, actual in zip(expected_logs, actual_logs): + assert expected in actual, f"Expected: '{expected}', got: '{actual}'" + + +def test_large_dataset(spark, log_capture): + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../test_data.csv") + df = spark.read.option("header", "true").csv(file_path) + + print(df.count) + assert df.count() > 0, "Dataframe was not loaded correct" + + detector = FlatlineDetection(df, watch_columns=["Value"], tolerance_timespan=2) + detector.check() + + expected_logs = [ + "Flatlining detected in column 'Value' at row: Row(TagName='FLATLINE_TEST', EventTime=datetime.datetime(2024, 1, 2, 2, 35, 10, 511000), Status='Good', Value=0.0, Value_flatline_flag=1, Value_group=1).", + "Flatlining detected in column 'Value' at row: Row(TagName='FLATLINE_TEST', EventTime=datetime.datetime(2024, 1, 2, 2, 49, 10, 408000), Status='Good', Value=0.0, Value_flatline_flag=1, Value_group=1).", + "Flatlining detected in column 'Value' at row: Row(TagName='FLATLINE_TEST', EventTime=datetime.datetime(2024, 1, 2, 14, 57, 10, 372000), Status='Good', Value=0.0, Value_flatline_flag=1, Value_group=1).", + ] + actual_logs = log_capture.getvalue().strip().split("\n") + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}" + for expected, actual in zip(expected_logs, actual_logs): + assert expected in actual, f"Expected: '{expected}', got: '{actual}'" diff --git a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/test_great_expectations_data_quality.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_great_expectations_data_quality.py similarity index 85% rename from tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/test_great_expectations_data_quality.py rename to tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_great_expectations_data_quality.py index 00bb57902..23ee3f970 100644 --- a/tests/sdk/python/rtdip_sdk/pipelines/monitoring/spark/data_quality/test_great_expectations_data_quality.py +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_great_expectations_data_quality.py @@ -1,8 +1,20 @@ -import pytest +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from pytest_mock import MockerFixture -from pyspark.sql import SparkSession, DataFrame +from pyspark.sql import SparkSession -from src.sdk.python.rtdip_sdk.pipelines.monitoring.spark.data_quality.great_expectations_data_quality import ( +from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.great_expectations_data_quality import ( GreatExpectationsDataQuality, ) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_interval.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_interval.py new file mode 100644 index 000000000..2f3fc9482 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_interval.py @@ -0,0 +1,247 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import pytest +import os +from pyspark.sql import SparkSession + +from src.sdk.python.rtdip_sdk.pipelines.logging.logger_manager import LoggerManager +from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval import ( + IdentifyMissingDataInterval, +) + +import logging +from io import StringIO + + +@pytest.fixture(scope="session") +def spark(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("IdentifyMissingDataIntervalTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +@pytest.fixture +def log_capture(): + log_stream = StringIO() + logger_manager = LoggerManager() + logger = logger_manager.create_logger("IdentifyMissingDataInterval") + + handler = logging.StreamHandler(log_stream) + formatter = logging.Formatter("%(message)s") + handler.setFormatter(formatter) + logger.addHandler(handler) + yield log_stream + logger.removeHandler(handler) + handler.close() + + +def test_missing_intervals_with_given_interval_multiple_tags(spark, caplog): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:20.000", "Good", "0.129999995"), + ( + "A2PS64V0J.:ZUX09R", + "2024-01-02 00:00:36.000", + "Good", + "0.150000006", + ), # Missing interval (20s to 36s) + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:45.000", "Good", "0.340000004"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:55.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:05.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:15.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:25.000", "Good", "0.150000006"), + ( + "A2PS64V0J.:ZUX09R", + "2024-01-02 00:01:41.000", + "Good", + "0.340000004", + ), # Missing interval (25s to 41s) + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + monitor = IdentifyMissingDataInterval( + df=df, + interval="10s", + tolerance="500ms", + ) + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"): + monitor.check() + expected_logs = [ + "Using 
provided expected interval: 10000.0 ms", + "Using provided tolerance: 500.0 ms", + "Maximum acceptable interval with tolerance: 10500.0 ms", + "Detected Missing Intervals:", + "Tag: A2PS64V0J.:ZUX09R Missing Interval from 2024-01-02 00:00:20 to 2024-01-02 00:00:36 Duration: 0h 0m 16s", + "Tag: A2PS64V0J.:ZUX09R Missing Interval from 2024-01-02 00:01:25 to 2024-01-02 00:01:41 Duration: 0h 0m 16s", + ] + actual_logs = [ + record.message + for record in caplog.records + if record.levelname == "INFO" and record.name == "IdentifyMissingDataInterval" + ] + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)} " + for expected, actual in zip(expected_logs, actual_logs): + assert expected == actual, f"Expected: '{expected}', got: '{actual}'" + + +def test_missing_intervals_with_calculated_interval(spark, caplog): + + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:20.000", "Good", "0.129999995"), + ( + "A2PS64V0J.:ZUX09R", + "2024-01-02 00:00:36.000", + "Good", + "0.150000006", + ), # Missing interval (20s to 36s) + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:45.000", "Good", "0.340000004"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:55.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:05.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:15.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:25.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:30.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + monitor = IdentifyMissingDataInterval( + df=df, + ) + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"): + monitor.check() + expected_logs = [ + "Using median of time differences as expected interval: 10000.0 ms", + "Calculated tolerance: 10.0 ms (MAD-based)", + "Maximum acceptable interval with tolerance: 10010.0 ms", + "Detected Missing Intervals:", + "Tag: A2PS64V0J.:ZUX09R Missing Interval from 2024-01-02 00:00:20 to 2024-01-02 00:00:36 Duration: 0h 0m 16s", + ] + actual_logs = [ + record.message + for record in caplog.records + if record.levelname == "INFO" and record.name == "IdentifyMissingDataInterval" + ] + + assert len(expected_logs) == len( + actual_logs + ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)} " + for expected, actual in zip(expected_logs, actual_logs): + assert expected == actual, f"Expected: '{expected}', got: '{actual}'" + + +def test_no_missing_intervals(spark, caplog): + + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:20.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:30.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:40.000", "Good", "0.340000004"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:50.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:00.000", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:10.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:20.000", "Good", "0.150000006"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:30.000", "Good", "0.340000004"), + ], + ["TagName", "EventTime", "Status", "Value"], 
+    )
+    monitor = IdentifyMissingDataInterval(
+        df=df,
+        interval="10s",
+        tolerance="5s",
+    )
+
+    with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"):
+        monitor.check()
+    expected_logs = [
+        "Using provided expected interval: 10000.0 ms",
+        "Using provided tolerance: 5000.0 ms",
+        "Maximum acceptable interval with tolerance: 15000.0 ms",
+        "No missing intervals detected.",
+    ]
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "INFO" and record.name == "IdentifyMissingDataInterval"
+    ]
+
+    assert len(expected_logs) == len(
+        actual_logs
+    ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)} "
+    for expected, actual in zip(expected_logs, actual_logs):
+        assert expected == actual, f"Expected: '{expected}', got: '{actual}'"
+
+
+def test_invalid_timedelta_format(spark, caplog):
+    df = spark.createDataFrame(
+        [
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12.000", "Good", "0.150000006"),
+            ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "0.340000004"),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+    monitor = IdentifyMissingDataInterval(
+        df=df,
+        interval="10seconds",  # should be '10s'
+    )
+
+    with pytest.raises(ValueError) as exc_info:
+        with caplog.at_level(logging.ERROR, logger="IdentifyMissingDataInterval"):
+            monitor.check()
+
+    assert "Invalid time format: 10seconds" in str(exc_info.value)
+    assert "Invalid time format: 10seconds" in caplog.text
+
+
+def test_large_data_set(spark, caplog):
+    base_path = os.path.dirname(__file__)
+    file_path = os.path.join(base_path, "../../test_data.csv")
+    df = spark.read.option("header", "true").csv(file_path)
+    assert df.count() > 0, "DataFrame was not loaded correctly"
+    monitor = IdentifyMissingDataInterval(
+        df=df,
+        interval="1s",
+        tolerance="10ms",
+    )
+    with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"):
+        monitor.check()
+    expected_logs = [
+        "Tag: MISSING_DATA Missing Interval from 2024-01-02 00:08:11 to 2024-01-02 00:08:13 Duration: 0h 0m 2s"
+    ]
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "INFO"
+        and record.name == "IdentifyMissingDataInterval"
+        and "MISSING_DATA" in record.message
+    ]
+
+    assert any(
+        expected in actual for expected in expected_logs for actual in actual_logs
+    ), "Expected logs not found in actual logs"
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_pattern.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_pattern.py
new file mode 100644
index 000000000..52fb27799
--- /dev/null
+++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_identify_missing_data_pattern.py
@@ -0,0 +1,244 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import pytest
+import logging
+import os
+
+from pyspark.sql import SparkSession
+
+from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_pattern import (
+    IdentifyMissingDataPattern,
+)
+
+
+@pytest.fixture(scope="session")
+def spark():
+    spark = (
+        SparkSession.builder.master("local[2]")
+        .appName("IdentifyMissingDataPatternTest")
+        .getOrCreate()
+    )
+    spark.sparkContext.setLogLevel("ERROR")  # Suppress WARN messages
+    yield spark
+    spark.stop()
+
+
+def test_no_missing_patterns(spark, caplog):
+    df = spark.createDataFrame(
+        [
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:00", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:13", "Good", "0.119999997"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:49", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:00", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:13", "Good", "0.119999997"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:49", "Good", "0.129999995"),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+    patterns = [{"second": 0}, {"second": 13}, {"second": 49}]
+    monitor = IdentifyMissingDataPattern(
+        df=df, patterns=patterns, frequency="minutely", tolerance="1s"
+    )
+
+    with caplog.at_level(logging.INFO, logger="IdentifyMissingDataPattern"):
+        monitor.check()
+
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "INFO" and record.name == "IdentifyMissingDataPattern"
+    ]
+    assert "Using tolerance: 1000.0 ms (1.0 seconds)" in actual_logs
+    assert "Identified 0 missing patterns." in actual_logs
+    assert "No missing patterns detected." in actual_logs
+
+
+def test_some_missing_patterns(spark, caplog):
+    df = spark.createDataFrame(
+        [
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:00", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:13", "Good", "0.119999997"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:49", "Good", "0.129999995"),
+            (
+                "A2PS64V0J.:ZUX09R",
+                "2024-02-11 00:01:05",
+                "Good",
+                "0.129999995",
+            ),  # Nothing matches in minute 1
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:17", "Good", "0.119999997"),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+    patterns = [{"second": 0}, {"second": 13}, {"second": 49}]
+    monitor = IdentifyMissingDataPattern(
+        df=df, patterns=patterns, frequency="minutely", tolerance="1s"
+    )
+
+    with caplog.at_level(logging.INFO, logger="IdentifyMissingDataPattern"):
+        monitor.check()
+
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "INFO" and record.name == "IdentifyMissingDataPattern"
+    ]
+    assert "Using tolerance: 1000.0 ms (1.0 seconds)" in actual_logs
+    assert "Identified 2 missing patterns."
in actual_logs + assert "Detected Missing Patterns:" in actual_logs + assert "Missing Pattern at 2024-02-11 00:01:00.000" in actual_logs + assert "Missing Pattern at 2024-02-11 00:01:13.000" in actual_logs + + +def test_all_missing_patterns(spark, caplog): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:05", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:17", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:29", "Good", "0.129999995"), + ( + "A2PS64V0J.:ZUX09R", + "2024-02-11 00:01:05", + "Good", + "0.129999995", + ), + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:17", "Good", "0.119999997"), + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:29", "Good", "0.129999995"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + patterns = [{"second": 0}, {"second": 13}, {"second": 49}] + monitor = IdentifyMissingDataPattern( + df=df, patterns=patterns, frequency="minutely", tolerance="1s" + ) + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataPattern"): + monitor.check() + + actual_logs = [ + record.message + for record in caplog.records + if record.levelname == "INFO" and record.name == "IdentifyMissingDataPattern" + ] + assert "Using tolerance: 1000.0 ms (1.0 seconds)" in actual_logs + assert "Identified 5 missing patterns." in actual_logs + assert "Detected Missing Patterns:" in actual_logs + missing_patterns = [ + "Missing Pattern at 2024-02-11 00:00:00.000", + "Missing Pattern at 2024-02-11 00:00:13.000", + "Missing Pattern at 2024-02-11 00:00:49.000", + "Missing Pattern at 2024-02-11 00:01:00.000", + "Missing Pattern at 2024-02-11 00:01:13.000", + ] + for pattern in missing_patterns: + assert pattern in actual_logs + + +def test_invalid_patterns(spark, caplog): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:49", "Good", "0.129999995"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + patterns = [ + {"minute": 0}, # Invalid for 'minutely' frequency + {"second": 13}, + {"second": 49}, + ] + monitor = IdentifyMissingDataPattern( + df=df, patterns=patterns, frequency="minutely", tolerance="1s" + ) + + with pytest.raises(ValueError) as exc_info, caplog.at_level( + logging.ERROR, logger="IdentifyMissingDataPattern" + ): + monitor.check() + + assert "Each pattern must have a 'second' key for 'minutely' frequency." 
in str(
+        exc_info.value
+    )
+
+
+def test_invalid_tolerance_format(spark, caplog):
+    df = spark.createDataFrame(
+        [
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:01:49", "Good", "0.129999995"),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+    patterns = [{"second": 0}, {"second": 13}, {"second": 49}]
+    monitor = IdentifyMissingDataPattern(
+        df=df, patterns=patterns, frequency="minutely", tolerance="1minute"
+    )
+
+    with pytest.raises(ValueError) as exc_info, caplog.at_level(
+        logging.ERROR, logger="IdentifyMissingDataPattern"
+    ):
+        monitor.check()
+
+    assert "Invalid tolerance format: 1minute" in str(exc_info.value)
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "ERROR" and record.name == "IdentifyMissingDataPattern"
+    ]
+    assert "Invalid tolerance format: 1minute" in actual_logs
+
+
+def test_hourly_patterns_with_microseconds(spark, caplog):
+    df = spark.createDataFrame(
+        [
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:00:00.200", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 00:59:59.800", "Good", "0.129999995"),
+            ("A2PS64V0J.:ZUX09R", "2024-02-11 01:00:30.500", "Good", "0.129999995"),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+
+    patterns = [
+        {"minute": 0, "second": 0, "millisecond": 0},
+        {"minute": 30, "second": 30, "millisecond": 500},
+    ]
+    monitor = IdentifyMissingDataPattern(
+        df=df, patterns=patterns, frequency="hourly", tolerance="500ms"
+    )
+
+    with caplog.at_level(logging.INFO, logger="IdentifyMissingDataPattern"):
+        monitor.check()
+
+    actual_logs = [
+        record.message
+        for record in caplog.records
+        if record.levelname == "INFO" and record.name == "IdentifyMissingDataPattern"
+    ]
+    assert "Using tolerance: 500.0 ms (0.5 seconds)" in actual_logs
+    assert "Identified 1 missing patterns." in actual_logs
+    assert "Detected Missing Patterns:" in actual_logs
+    assert "Missing Pattern at 2024-02-11 00:30:30.500" in actual_logs
+
+
+def test_large_data_set(spark):
+    base_path = os.path.dirname(__file__)
+    file_path = os.path.join(base_path, "../../test_data.csv")
+    df = spark.read.option("header", "true").csv(file_path)
+    assert df.count() > 0, "DataFrame was not loaded correctly"
+    patterns = [{"second": 0}, {"second": 13}, {"second": 49}]
+    monitor = IdentifyMissingDataPattern(
+        df=df, patterns=patterns, frequency="minutely", tolerance="1s"
+    )
+    monitor.check()
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_moving_average.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_moving_average.py
new file mode 100644
index 000000000..46b7396f9
--- /dev/null
+++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/monitoring/spark/test_moving_average.py
@@ -0,0 +1,104 @@
+# Copyright 2025 RTDIP
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import pytest
+import os
+from pyspark.sql import SparkSession
+from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.moving_average import (
+    MovingAverage,
+)
+import logging
+from io import StringIO
+
+
+@pytest.fixture(scope="session")
+def spark():
+    spark = (
+        SparkSession.builder.master("local[2]")
+        .appName("MovingAverageTest")
+        .getOrCreate()
+    )
+    yield spark
+    spark.stop()
+
+
+@pytest.fixture
+def log_capture():
+    log_stream = StringIO()
+    logger = logging.getLogger("MovingAverage")
+    logger.setLevel(logging.INFO)
+    handler = logging.StreamHandler(log_stream)
+    formatter = logging.Formatter("%(message)s")
+    handler.setFormatter(formatter)
+    logger.addHandler(handler)
+    yield log_stream
+    logger.removeHandler(handler)
+    handler.close()
+
+
+def test_moving_average_basic(spark, log_capture):
+    df = spark.createDataFrame(
+        [
+            ("Tag1", "2024-01-02 03:49:45.000", "Good", 1.0),
+            ("Tag1", "2024-01-02 07:53:11.000", "Good", 2.0),
+            ("Tag1", "2024-01-02 11:56:42.000", "Good", 3.0),
+            ("Tag1", "2024-01-02 16:00:12.000", "Good", 4.0),
+            ("Tag1", "2024-01-02 20:03:46.000", "Good", 5.0),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+
+    detector = MovingAverage(df, window_size=3)
+    detector.check()
+
+    expected_logs = [
+        "Computing moving averages:",
+        "Tag: Tag1, Time: 2024-01-02 03:49:45, Value: 1.0, Moving Avg: 1.0",
+        "Tag: Tag1, Time: 2024-01-02 07:53:11, Value: 2.0, Moving Avg: 1.5",
+        "Tag: Tag1, Time: 2024-01-02 11:56:42, Value: 3.0, Moving Avg: 2.0",
+        "Tag: Tag1, Time: 2024-01-02 16:00:12, Value: 4.0, Moving Avg: 3.0",
+        "Tag: Tag1, Time: 2024-01-02 20:03:46, Value: 5.0, Moving Avg: 4.0",
+    ]
+
+    actual_logs = log_capture.getvalue().strip().split("\n")
+
+    assert len(expected_logs) == len(
+        actual_logs
+    ), f"Expected {len(expected_logs)} logs, got {len(actual_logs)}"
+
+    for expected, actual in zip(expected_logs, actual_logs):
+        assert expected in actual, f"Expected: '{expected}', got: '{actual}'"
+
+
+def test_moving_average_invalid_window_size(spark):
+    df = spark.createDataFrame(
+        [
+            ("Tag1", "2024-01-02 03:49:45.000", "Good", 1.0),
+            ("Tag1", "2024-01-02 07:53:11.000", "Good", 2.0),
+        ],
+        ["TagName", "EventTime", "Status", "Value"],
+    )
+
+    with pytest.raises(ValueError, match="window_size must be a positive integer."):
+        MovingAverage(df, window_size=-2)
+
+
+def test_large_dataset(spark):
+    base_path = os.path.dirname(__file__)
+    file_path = os.path.join(base_path, "../../test_data.csv")
+    df = spark.read.option("header", "true").csv(file_path)
+
+    assert df.count() > 0, "DataFrame was not loaded correctly."
+ + detector = MovingAverage(df, window_size=5) + detector.check() diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_data.csv b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_data.csv new file mode 100644 index 000000000..71e1e0895 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_data.csv @@ -0,0 +1,1019 @@ +TagName,EventTime,Status,Value +A2PS64V0J.:ZUX09R,2024-01-02 20:03:46.000,Good,0.3400000035762787 +A2PS64V0J.:ZUX09R,2024-01-02 16:00:12.000,Good,0.1500000059604644 +A2PS64V0J.:ZUX09R,2024-01-02 11:56:42.000,Good,0.1299999952316284 +A2PS64V0J.:ZUX09R,2024-01-02 07:53:11.000,Good,0.1199999973177909 +A2PS64V0J.:ZUX09R,2024-01-02 03:49:45.000,Good,0.1299999952316284 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 20:09:58.053,Good,7107.82080078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:27:10.518,Good,19407.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:23:10.143,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:31:10.086,Good,19399.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:41:10.358,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:09:10.488,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:15:10.492,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:51:10.077,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 07:42:24.227,Good,6.55859375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:08:23.777,Good,5921.5498046875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:14:10.896,Good,5838.216796875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 01:37:10.967,Good,5607.82568359375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 00:26:53.449,Good,5563.7080078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:11:10.361,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:01:10.150,Good,19409.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 10:22:10.018,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:58:10.496,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:50:10.483,Good,19402.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 07:26:20.495,Good,6.55126953125 +R0:Z24WVP.0S10L,2024-01-02 21:26:00.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:16:08.988,Good,7205.85986328125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:25:10.252,Good,19410.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:18:10.275,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:12:10.288,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:04:10.256,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:16:10.178,Good,19401.0 +R0:Z24WVP.0S10L,2024-01-02 16:21:00.001,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 10:28:01.001,Good,2344.558349609375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 07:23:40.514,Good,6132.33349609375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:34:57.886,Good,5818.609375 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:45:10.416,Good,19371.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:35:10.108,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:22:10.381,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 01:08:10.214,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:57:10.083,Good,19397.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:44:10.054,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:57:10.201,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:38:10.450,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:13:10.477,Good,19385.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:12:10.466,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:22:10.145,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:42:10.099,Good,19404.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 17:12:09.997,Good,6867.62548828125 
+-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:54:59.922,Good,6249.98046875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:45:10.238,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:52:10.381,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:37:10.213,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:13:10.226,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:43:10.096,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 21:08:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 04:44:01.001,Good,2307.78564453125 +R0:Z24WVP.0S10L,2024-01-02 03:38:00.001,Good,2306.006103515625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:30:10.341,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:06:10.475,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:36:10.389,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:01:10.231,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:20:10.309,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:52:10.136,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:08:10.000,Good,19395.0 +R0:Z24WVP.0S10L,2024-01-02 22:40:00.001,Good,2300.074951171875 +R0:Z24WVP.0S10L,2024-01-02 10:22:00.001,Good,2346.9306640625 +PM20:PCO4SLU_000R4.3D0_T-23,2024-01-02 23:39:20.058,Good,5.300000190734863 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:35:31.661,Good,6514.685546875 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:34:10.228,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:39:10.043,Good,19375.0 +R0:Z24WVP.0S10L,2024-01-02 20:02:00.000,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 01:45:01.001,Good,2304.81982421875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:38:10.472,Good,19406.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:19:10.316,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:28:10.208,Good,19399.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:12:10.481,Good,19395.0 +R0:Z24WVP.0S10L,2024-01-02 18:54:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 19:48:56.048,Good,7073.50732421875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:38:10.214,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:06:10.336,Good,19405.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:19:10.497,Good,19399.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:35:10.480,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:44:10.247,Good,19380.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:42:10.046,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:40:10.497,Good,19397.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 09:47:55.430,Good,6.615234375 +R0:Z24WVP.0S10L,2024-01-02 12:36:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:41:15.646,Good,7240.17333984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 19:23:42.152,Good,7034.29150390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:31:30.460,Good,5975.47119140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:48:10.347,Good,19373.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:32:10.261,Good,19399.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:14:10.435,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:30:10.228,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:54:10.356,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 23:47:00.001,Good,2258.5576171875 +R0:Z24WVP.0S10L,2024-01-02 23:05:00.001,Good,2298.88916015625 +R0:Z24WVP.0S10L,2024-01-02 18:39:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 07:03:36.141,Good,6068.6083984375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:33:10.113,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:40:10.232,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:47:10.467,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:50:10.087,Good,19403.0 
+TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:59:10.357,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:04:10.452,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:05:10.307,Good,19394.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:03:10.279,Good,19395.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:11:10.407,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 14:25:00.001,Good,2265.081787109375 +R0:Z24WVP.0S10L,2024-01-02 01:17:00.001,Good,2306.006103515625 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:23:10.098,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:31:10.337,Good,19411.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:05:10.479,Good,19396.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 04:22:36.151,Good,6.43603515625 +R0:Z24WVP.0S10L,2024-01-02 19:30:00.014,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 07:22:00.001,Good,2310.158203125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 22:43:28.441,Good,7284.291015625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:33:10.245,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:24:10.199,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:54:10.428,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:34:10.156,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:13:10.270,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:33:10.295,Good,19397.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:40:10.232,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:39:10.294,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:36:10.294,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:18:10.275,Good,19404.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:47:04.123,Good,6848.017578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:05:22.981,Good,5906.84423828125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:22:10.076,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:34:10.499,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:46:10.139,Good,19377.0 +R0:Z24WVP.0S10L,2024-01-02 12:53:00.001,Good,2265.6748046875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 01:25:06.919,Good,5588.2177734375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:02:10.354,Good,19373.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:28:10.325,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:48:10.122,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:53:10.049,Good,19405.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:34:10.389,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:19:10.174,Good,19376.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 04:35:39.227,Good,6.4423828125 +R0:Z24WVP.0S10L,2024-01-02 14:45:00.001,Good,2266.26806640625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:42:10.034,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:07:10.035,Good,19380.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:15:10.449,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:48:10.347,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:11:10.376,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:46:10.091,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:55:10.339,Good,19404.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:38:44.198,Good,5705.8642578125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:21:10.452,Good,19379.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:20:10.382,Good,19379.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 16:10:10.095,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:35:10.297,Good,19410.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:42:10.486,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:32:10.169,Good,19395.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:04:10.068,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 04:32:10.413,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 
10:14:10.274,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 01:54:10.132,Good,19399.0 +R0:Z24WVP.0S10L,2024-01-02 20:54:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 02:02:00.001,Good,2304.81982421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:48:34.105,Good,6534.29345703125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:57:10.117,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:15:10.393,Good,19410.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:35:10.215,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:16:10.070,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:01:10.497,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:38:10.380,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:25:10.428,Good,19375.0 +R0:Z24WVP.0S10L,2024-01-02 14:54:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 12:15:00.001,Good,2264.488525390625 +R0:Z24WVP.0S10L,2024-01-02 09:36:00.001,Good,2312.53076171875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:24:27.269,Good,5960.765625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:30:56.563,Good,5818.609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:17:10.113,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:19:10.348,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:39:10.120,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:35:10.483,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:17:10.113,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:52:10.264,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:58:10.031,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:21:10.383,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:55:10.264,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 19:10:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 10:38:00.001,Good,2347.52392578125 +R0:Z24WVP.0S10L,2024-01-02 01:16:01.001,Good,2305.413330078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:25:10.042,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:11:10.233,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:36:10.463,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:51:10.216,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:25:10.252,Good,19410.0 +R0:Z24WVP.0S10L,2024-01-02 18:04:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 14:48:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 02:26:01.001,Good,2304.81982421875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:45:10.147,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:37:10.404,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:50:10.027,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:08:10.248,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:53:10.249,Good,19372.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:46:10.520,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:57:10.389,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:57:10.430,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 20:12:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 14:46:01.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:46:50.909,Good,6700.95947265625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:40:32.055,Good,6519.58740234375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 12:12:52.261,Good,6362.72509765625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:04:07.396,Good,5828.4130859375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:02:10.417,Good,19405.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 11:48:10.231,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:10:10.055,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:22:10.379,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:05:10.279,Good,19376.0 
+O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:18:49.267,Good,6.4658203125 +R0:Z24WVP.0S10L,2024-01-02 01:43:00.001,Good,2304.81982421875 +R0:Z24WVP.0S10L,2024-01-02 01:03:00.001,Good,2304.81982421875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:30:10.122,Good,19380.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:16:10.297,Good,19383.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:24:10.132,Good,19401.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:21:10.191,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:00:10.325,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:26:10.116,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:16:10.199,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:54:10.106,Good,19409.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 19:15:12.284,Good,6.810546875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:51:10.379,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:41:10.504,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:24:10.265,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:50:10.432,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:33:10.389,Good,19375.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 10:09:00.796,Good,6.625 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:15:48.607,Good,6.46435546875 +R0:Z24WVP.0S10L,2024-01-02 21:47:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 12:44:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:17:51.642,Good,6205.86279296875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:57:10.201,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:25:10.157,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:39:10.378,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:18:10.423,Good,19398.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:25:10.262,Good,19380.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:22:10.465,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 23:00:00.001,Good,2296.5166015625 +R0:Z24WVP.0S10L,2024-01-02 05:50:00.001,Good,2308.378662109375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:20:10.029,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:56:10.024,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:31:10.152,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:13:10.406,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:35:10.110,Good,19406.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:47:10.341,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 10:45:00.001,Good,2263.8955078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:38:10.281,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:53:10.052,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:10:10.491,Good,19408.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:51:10.090,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:05:10.291,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:54:10.181,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:59:10.079,Good,19402.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 06:19:03.191,Good,6.515625 +R0:Z24WVP.0S10L,2024-01-02 18:52:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 17:57:00.001,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 14:43:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 03:31:01.001,Good,2306.006103515625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:49:17.685,Good,7249.97705078125 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:57:10.292,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:36:10.106,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:38:10.212,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:25:10.262,Good,19380.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:39:10.032,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:50:10.168,Good,19396.0 
+O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:11:47.514,Good,6.46142578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:25:45.091,Good,6656.841796875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:40:10.199,Good,19406.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:30:10.243,Good,19410.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:24:10.225,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:45:10.330,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:05:10.348,Good,19399.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 06:44:09.960,Good,6.53076171875 +R0:Z24WVP.0S10L,2024-01-02 19:43:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 11:43:01.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 05:52:00.001,Good,2308.378662109375 +R0:Z24WVP.0S10L,2024-01-02 00:53:00.001,Good,2305.413330078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:55:19.247,Good,7254.87939453125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:37:10.382,Good,19373.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 13:13:10.228,Good,19410.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:56:10.434,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:58:10.254,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:45:10.464,Good,19376.0 +R0:Z24WVP.0S10L,2024-01-02 13:07:00.001,Good,2264.488525390625 +R0:Z24WVP.0S10L,2024-01-02 12:38:00.001,Good,2265.081787109375 +R0:Z24WVP.0S10L,2024-01-02 10:32:00.001,Good,2346.9306640625 +R0:Z24WVP.0S10L,2024-01-02 07:45:00.001,Good,2310.158203125 +R0:Z24WVP.0S10L,2024-01-02 02:42:00.001,Good,2304.81982421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:38:57.109,Good,6220.56884765625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:22:10.184,Good,19374.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:08:10.394,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:24:10.385,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:56:10.343,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 01:21:10.136,Good,19398.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 00:41:43.646,Good,6.39013671875 +R0:Z24WVP.0S10L,2024-01-02 03:55:00.001,Good,2306.006103515625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 09:10:04.230,Good,6245.07861328125 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:36:10.430,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:28:10.059,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:21:10.044,Good,19406.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:18:10.500,Good,19396.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:18:10.258,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 08:38:00.002,Good,2311.344482421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 17:32:14.792,Good,6892.13525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:30:14.921,Good,5843.119140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:11:10.233,Good,19379.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:06:10.388,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:10:10.302,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:25:10.032,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:45:10.419,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:17:10.151,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:22:10.018,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:09:10.247,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 23:40:00.001,Good,2301.8544921875 +R0:Z24WVP.0S10L,2024-01-02 13:45:00.001,Good,2265.081787109375 +R0:Z24WVP.0S10L,2024-01-02 07:19:00.001,Good,2310.158203125 +R0:Z24WVP.0S10L,2024-01-02 02:41:00.001,Good,2305.413330078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:29:46.609,Good,6676.44970703125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:33:57.828,Good,5823.51123046875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 
10:21:10.464,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:49:10.165,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:04:10.313,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:22:10.304,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:36:10.389,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 04:05:10.365,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:56:57.891,Good,6.5009765625 +R0:Z24WVP.0S10L,2024-01-02 16:49:00.001,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 15:38:00.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:51:18.376,Good,7245.0751953125 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:23:10.093,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:22:10.398,Good,19410.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:05:10.327,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:53:10.249,Good,19372.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:07:10.458,Good,19406.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:35:10.184,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 21:43:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 14:42:01.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 00:44:00.001,Good,2304.81982421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 10:09:19.567,Good,6274.490234375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:41:10.441,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:37:09.997,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:11:10.120,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:33:10.374,Good,19402.0 +R0:Z24WVP.0S10L,2024-01-02 23:45:00.001,Good,2275.7578125 +R0:Z24WVP.0S10L,2024-01-02 05:58:00.001,Good,2309.56494140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:37:10.172,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:02:10.081,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:02:10.081,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:42:10.034,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:50:10.139,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:17:10.123,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:55:57.659,Good,6.49951171875 +R0:Z24WVP.0S10L,2024-01-02 23:37:00.001,Good,2300.074951171875 +R0:Z24WVP.0S10L,2024-01-02 02:54:00.001,Good,2306.006103515625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 09:18:05.695,Good,6259.7841796875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:21:10.276,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:32:10.219,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:37:10.431,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:41:10.450,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:42:10.486,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:51:10.029,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 18:02:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 10:17:00.001,Good,2344.558349609375 +R0:Z24WVP.0S10L,2024-01-02 06:03:00.001,Good,2309.56494140625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:53:59.739,Good,6245.07861328125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:19:26.112,Good,5941.15771484375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 13:27:10.473,Good,19409.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:50:10.257,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:05:10.021,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:37:10.214,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:20:10.142,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:17:10.062,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 04:11:10.500,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 17:24:00.001,Good,2267.4541015625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:29:30.287,Good,5970.5693359375 
+_LT2EPL-9PM0.OROTENV3:,2024-01-02 13:44:10.013,Good,19409.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:29:10.029,Good,19401.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:08:10.053,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:47:10.271,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:23:10.068,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 10:45:10.004,Good,19403.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 20:26:59.616,Good,7122.52685546875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:34:10.422,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:20:10.225,Good,19401.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:51:10.236,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:59:10.286,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 21:33:46.754,Good,6.81005859375 +R0:Z24WVP.0S10L,2024-01-02 18:51:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 14:06:01.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:18:54.746,Good,5794.099609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:25:10.303,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:37:10.348,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:30:10.125,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:21:10.432,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:07:10.491,Good,19400.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:52:10.285,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 04:13:10.194,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 09:39:52.993,Good,6.61083984375 +R0:Z24WVP.0S10L,2024-01-02 14:36:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 10:24:00.001,Good,2342.779052734375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:27:56.333,Good,5818.609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:40:10.365,Good,19379.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:29:10.405,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:54:10.106,Good,19409.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:36:10.230,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:08:10.070,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:46:10.068,Good,19411.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:19:10.348,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:38:10.325,Good,19396.0 +R0:Z24WVP.0S10L,2024-01-02 13:15:00.001,Good,2264.488525390625 +R0:Z24WVP.0S10L,2024-01-02 09:49:00.001,Good,2345.151611328125 +R0:Z24WVP.0S10L,2024-01-02 06:30:00.001,Good,2308.971923828125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 18:50:34.408,Good,6990.17431640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 01:34:09.551,Good,5607.82568359375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:53:10.261,Good,19373.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:41:10.106,Good,19396.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 12:17:53.443,Good,6377.43115234375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:43:10.331,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:42:10.046,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:28:10.514,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:47:10.305,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:24:10.180,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:42:10.399,Good,19408.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:09:10.224,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:30:10.074,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:53:10.081,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 21:40:01.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 02:05:00.001,Good,2304.81982421875 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:38:10.214,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:46:10.111,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 
09:58:10.127,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:51:10.090,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:28:10.082,Good,19405.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 22:49:04.742,Good,6.79833984375 +R0:Z24WVP.0S10L,2024-01-02 07:17:00.001,Good,2310.158203125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:40:31.323,Good,6009.78515625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:00:49.864,Good,5759.78564453125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:55:10.104,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:49:10.315,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:07:10.167,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:47:10.469,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:53:10.240,Good,19402.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 17:55:53.258,Good,6.7783203125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 20:05:57.805,Good,7098.01708984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 13:47:18.272,Good,6455.8623046875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:17:54.710,Good,5808.80517578125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:30:10.082,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:11:10.145,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:10:10.123,Good,19410.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:46:10.198,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:34:10.017,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:44:10.292,Good,19396.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 18:04:55.376,Good,6.78125 +R0:Z24WVP.0S10L,2024-01-02 16:13:00.001,Good,2267.4541015625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:18:26.463,Good,6490.17578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:22:55.937,Good,5818.609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 13:34:10.499,Good,19408.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:14:10.381,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:07:10.462,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:53:10.115,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:33:10.229,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:13:10.494,Good,19369.0 +R0:Z24WVP.0S10L,2024-01-02 19:03:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 15:36:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 05:36:00.001,Good,2307.78564453125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 01:36:10.902,Good,5602.92333984375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:42:10.070,Good,19380.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:41:10.146,Good,19407.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:29:09.995,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:01:10.012,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:44:10.161,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:43:10.184,Good,19396.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:04:10.483,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:29:10.405,Good,19404.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:13:48.079,Good,6.462890625 +R0:Z24WVP.0S10L,2024-01-02 19:25:01.005,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 13:26:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 11:37:43.098,Good,6323.509765625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:40:59.941,Good,5823.51123046875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:46:10.520,Good,19409.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:47:10.175,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:15:10.369,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:55:10.308,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:06:10.388,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:06:10.090,Good,19375.0 
+TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:54:10.133,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:34:10.156,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 08:20:00.001,Good,2310.751220703125 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:38:10.521,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:31:10.184,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:33:10.264,Good,19378.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 07:39:23.513,Good,6.55712890625 +R0:Z24WVP.0S10L,2024-01-02 22:10:00.004,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 23:39:42.178,Good,7338.21240234375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:41:10.491,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:26:10.264,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:20:10.015,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:48:10.322,Good,19405.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 04:48:42.149,Good,6.4482421875 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 02:48:10.555,Good,6.40771484375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 20:35:00.708,Good,7147.03662109375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 19:19:41.408,Good,7039.19384765625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:04:50.985,Good,5774.49169921875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:13:10.358,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:10:10.359,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:52:10.501,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:26:10.299,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:18:10.305,Good,19406.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:03:10.098,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:55:10.207,Good,19380.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:32:10.250,Good,19402.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 23:38:41.723,Good,7333.310546875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:38:10.521,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:30:10.074,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:48:10.312,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:38:10.185,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:33:10.109,Good,19405.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 08:56:41.384,Good,6.59033203125 +R0:Z24WVP.0S10L,2024-01-02 06:20:00.000,Good,2308.971923828125 +R0:Z24WVP.0S10L,2024-01-02 01:26:00.001,Good,2305.413330078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:36:16.042,Good,5857.82470703125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:53:03.575,Good,5828.4130859375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:20:10.430,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 11:53:10.049,Good,19405.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:17:10.042,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:15:10.160,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:08:10.042,Good,19400.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:46:10.373,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:57:10.475,Good,19405.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:42:10.287,Good,19375.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 10:01:59.016,Good,6.62158203125 +R0:Z24WVP.0S10L,2024-01-02 15:28:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 15:17:01.001,Good,2266.861083984375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:34:10.228,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:17:10.443,Good,19382.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:58:10.441,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:14:10.029,Good,19403.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:01:21.345,Good,5906.84423828125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:54:10.321,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 
18:50:10.468,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:13:10.367,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:14:10.095,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:15:10.427,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 16:11:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 10:42:00.001,Good,2317.86865234375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 12:34:57.359,Good,6401.94091796875 +1N325T3MTOR-P0L29:9.T0,2024-01-02 01:17:10.377,Good,19399.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:26:10.455,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 17:03:00.001,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 08:50:00.001,Good,2311.344482421875 +R0:Z24WVP.0S10L,2024-01-02 02:23:00.001,Good,2305.413330078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 17:48:17.999,Good,6916.64501953125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:37:01.662,Good,6818.60595703125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:43:10.355,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:49:10.361,Good,19373.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 07:50:10.001,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:14:10.115,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:46:10.520,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:44:10.087,Good,19402.0 +R0:Z24WVP.0S10L,2024-01-02 15:37:00.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:44:49.831,Good,6700.95947265625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:08:10.397,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:44:10.249,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:06:10.086,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:17:10.151,Good,19402.0 +R0:Z24WVP.0S10L,2024-01-02 22:48:00.001,Good,2294.14404296875 +R0:Z24WVP.0S10L,2024-01-02 10:34:01.001,Good,2346.337646484375 +R0:Z24WVP.0S10L,2024-01-02 08:11:00.001,Good,2311.344482421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:57:05.956,Good,6852.919921875 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:11:10.231,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:43:10.450,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:43:10.450,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:16:10.119,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:23:10.486,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:19:10.217,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:05:10.481,Good,19375.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 15:57:25.699,Good,6.73828125 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:59:58.650,Good,6.50244140625 +R0:Z24WVP.0S10L,2024-01-02 00:43:00.001,Good,2305.413330078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:04:05.340,Good,7191.15380859375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:43:49.817,Good,6696.0576171875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:53:10.081,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:12:10.481,Good,19395.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:28:10.073,Good,19407.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:56:10.430,Good,19405.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 04:04:10.272,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 01:06:10.039,Good,19397.0 +R0:Z24WVP.0S10L,2024-01-02 12:34:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:28:40.945,Good,5715.66845703125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:39:10.051,Good,19411.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:45:10.038,Good,19400.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:56:10.387,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:54:10.321,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:15:10.412,Good,19375.0 
+1N325T3MTOR-P0L29:9.T0,2024-01-02 15:59:10.328,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 10:26:10.316,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:12:10.230,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:10:10.095,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:58:10.047,Good,19377.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 15:20:17.109,Good,6.72802734375 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 09:13:45.909,Good,6.59765625 +R0:Z24WVP.0S10L,2024-01-02 13:29:00.001,Good,2264.488525390625 +R0:Z24WVP.0S10L,2024-01-02 12:27:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 17:49:18.753,Good,6921.546875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 11:09:10.224,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:01:10.092,Good,19401.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:34:10.447,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:15:10.160,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:00:10.322,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:33:10.275,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:19:10.287,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 01:39:10.074,Good,19398.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 11:41:24.009,Good,6.6650390625 +R0:Z24WVP.0S10L,2024-01-02 17:56:00.001,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 03:56:00.001,Good,2307.1923828125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:52:10.012,Good,19405.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:11:10.205,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:23:10.425,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 12:40:10.199,Good,19406.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:53:10.250,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:49:10.055,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:39:10.192,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:20:10.225,Good,19401.0 +R0:Z24WVP.0S10L,2024-01-02 00:58:00.001,Good,2304.81982421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:33:47.029,Good,6661.74365234375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 10:23:22.774,Good,6284.2939453125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:43:10.451,Good,19407.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:44:10.395,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:38:10.185,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:53:10.368,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:37:10.314,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:51:10.285,Good,19397.0 +R0:Z24WVP.0S10L,2024-01-02 23:49:00.001,Good,2216.44677734375 +R0:Z24WVP.0S10L,2024-01-02 19:29:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 04:20:00.000,Good,2307.78564453125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 16:25:10.277,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:43:10.246,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:55:10.339,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:16:10.039,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:22:10.398,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:01:10.150,Good,19409.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:45:10.004,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:14:10.047,Good,19405.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 20:51:37.349,Good,6.81982421875 +R0:Z24WVP.0S10L,2024-01-02 21:02:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 17:33:00.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:51:34.876,Good,6544.09716796875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:03:22.788,Good,5911.74609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:02:10.022,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 
02:59:10.274,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:45:10.147,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:49:10.439,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:37:10.404,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:44:10.164,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:37:10.417,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 21:01:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 10:09:00.001,Good,2347.52392578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 23:58:44.589,Good,7348.01611328125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:05:10.440,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:18:10.365,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:50:10.483,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:21:10.432,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:58:10.014,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:36:10.327,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:28:10.514,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 03:58:10.496,Good,19403.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 19:20:41.454,Good,7044.095703125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 10:00:17.477,Good,6269.58837890625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 16:55:10.391,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 16:24:10.199,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:26:10.395,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:50:10.417,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:23:10.080,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:24:10.225,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 09:07:10.167,Good,19403.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 23:12:34.643,Good,7313.70263671875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:33:10.295,Good,19397.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:41:10.113,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:33:10.245,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:46:10.279,Good,19402.0 +R0:Z24WVP.0S10L,2024-01-02 21:18:00.001,Good,2267.4541015625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:08:56.169,Good,6764.6845703125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:08:10.139,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:31:10.337,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:44:10.404,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:08:10.286,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:30:10.069,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:43:10.451,Good,19407.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:16:10.111,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:08:10.397,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:42:10.486,Good,19404.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 03:43:26.560,Good,6.42138671875 +R0:Z24WVP.0S10L,2024-01-02 16:50:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 05:20:00.001,Good,2307.78564453125 +R0:Z24WVP.0S10L,2024-01-02 01:10:01.001,Good,2304.81982421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 20:07:57.901,Good,7102.9189453125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:28:09.999,Good,19379.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:41:10.018,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:46:10.423,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:55:10.327,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:18:10.088,Good,19401.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:36:10.495,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:14:10.453,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:51:10.379,Good,19377.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 05:35:52.723,Good,6.48291015625 
+-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 13:07:08.204,Good,6416.646484375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:19:10.031,Good,19381.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:02:10.357,Good,19373.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:45:10.464,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 10:57:10.430,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:18:10.315,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:25:10.137,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 22:30:00.001,Good,2297.702880859375 +R0:Z24WVP.0S10L,2024-01-02 19:20:01.005,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 14:30:00.001,Good,2265.081787109375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:19:52.152,Good,6205.86279296875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:05:10.480,Good,19374.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:38:10.185,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:16:10.203,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:16:10.055,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:21:10.191,Good,19396.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:05:55.853,Good,6759.78271484375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:44:58.008,Good,6235.2744140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:28:10.451,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:32:10.509,Good,19404.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:21:10.109,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:53:10.250,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:03:10.398,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:36:10.169,Good,19397.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 00:47:44.880,Good,6.3916015625 +R0:Z24WVP.0S10L,2024-01-02 21:58:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 11:41:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 12:06:51.126,Good,6348.01953125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:14:10.381,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 14:33:10.350,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:54:10.133,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:12:10.357,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:44:10.477,Good,19411.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 13:12:10.139,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:57:10.333,Good,19408.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 06:31:10.082,Good,19404.0 +R0:Z24WVP.0S10L,2024-01-02 15:12:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 15:09:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 11:36:42.912,Good,6318.607421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:21:39.440,Good,5710.7666015625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:07:10.148,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:36:10.137,Good,19401.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:40:10.234,Good,19400.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:01:10.327,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:56:10.144,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:52:10.136,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:42:10.236,Good,19375.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 09:54:57.274,Good,6.6181640625 +R0:Z24WVP.0S10L,2024-01-02 19:28:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 14:24:00.001,Good,2265.6748046875 +R0:Z24WVP.0S10L,2024-01-02 12:13:01.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:56:48.624,Good,5749.98193359375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:52:10.381,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 05:38:10.055,Good,19405.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:42:10.228,Good,19403.0 
+1N325T3MTOR-P0L29:9.T0,2024-01-02 22:49:10.479,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:48:10.463,Good,19381.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:06:10.336,Good,19405.0 +R0:Z24WVP.0S10L,2024-01-02 17:14:00.010,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 08:22:00.001,Good,2310.751220703125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 18:41:32.154,Good,6975.46826171875 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:28:10.177,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:23:10.401,Good,19379.0 +R0:Z24WVP.0S10L,2024-01-02 08:28:00.001,Good,2310.751220703125 +R0:Z24WVP.0S10L,2024-01-02 06:09:00.001,Good,2309.56494140625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 22:20:23.671,Good,7264.68310546875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:57:10.259,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:00:10.076,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:43:10.057,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:28:10.414,Good,19381.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:32:10.320,Good,19410.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 18:16:58.170,Good,6.78564453125 +R0:Z24WVP.0S10L,2024-01-02 19:05:00.002,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 04:30:00.001,Good,2307.1923828125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 11:25:36.517,Good,6313.70556640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:56:00.471,Good,6245.07861328125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:42:10.236,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:06:10.054,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:06:10.039,Good,19397.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:45:10.154,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:36:10.390,Good,19378.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 20:18:28.932,Good,6.830078125 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 04:25:37.035,Good,6.43798828125 +R0:Z24WVP.0S10L,2024-01-02 21:44:01.001,Good,2266.861083984375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:32:10.391,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 17:36:10.430,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:27:10.202,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:38:10.325,Good,19396.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:53:10.356,Good,19382.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:21:10.429,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:58:10.178,Good,19403.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 20:49:36.771,Good,6.8212890625 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 16:20:31.311,Good,6.74560546875 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 02:38:10.555,Good,6.40771484375 +R0:Z24WVP.0S10L,2024-01-02 22:38:00.001,Good,2301.8544921875 +R0:Z24WVP.0S10L,2024-01-02 21:00:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 18:33:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 08:08:01.001,Good,2311.344482421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:17:57.885,Good,6789.1943359375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:58:54.526,Good,6730.37109375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:31:15.014,Good,5848.02099609375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:03:35.090,Good,5691.15869140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 02:04:10.335,Good,19400.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:47:10.185,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:43:10.331,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:05:10.007,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 07:42:10.511,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 01:35:10.830,Good,19399.0 +R0:Z24WVP.0S10L,2024-01-02 14:31:00.001,Good,2266.26806640625 
+R0:Z24WVP.0S10L,2024-01-02 13:51:00.001,Good,2264.488525390625 +R0:Z24WVP.0S10L,2024-01-02 03:03:00.001,Good,2306.006103515625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:40:10.354,Good,19401.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 08:07:10.462,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 04:33:10.486,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:12:10.158,Good,19409.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 08:09:10.185,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:50:10.172,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:28:10.247,Good,19378.0 +R0:Z24WVP.0S10L,2024-01-02 17:31:00.000,Good,2267.4541015625 +R0:Z24WVP.0S10L,2024-01-02 08:29:00.001,Good,2311.344482421875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:11:07.186,Good,7200.9580078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 06:44:10.164,Good,19402.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:26:10.188,Good,19396.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:34:10.340,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:21:10.429,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:22:10.029,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:46:10.365,Good,19397.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:06:10.388,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:27:10.117,Good,19379.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:57:10.479,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:43:10.430,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:04:10.204,Good,19409.0 +R0:Z24WVP.0S10L,2024-01-02 22:43:00.001,Good,2299.48193359375 +R0:Z24WVP.0S10L,2024-01-02 21:11:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 13:46:17.703,Good,6450.9599609375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:28:10.414,Good,19380.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:14:10.016,Good,19384.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 13:30:10.275,Good,19409.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:59:10.057,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:36:10.072,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:18:10.195,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:12:10.208,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 18:12:10.208,Good,19375.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 11:15:10.041,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:06:10.473,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:41:10.441,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:17:10.032,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:53:10.203,Good,19376.0 +R0:Z24WVP.0S10L,2024-01-02 21:32:00.001,Good,2266.26806640625 +R0:Z24WVP.0S10L,2024-01-02 18:32:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 22:22:23.814,Good,7259.78125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:49:19.748,Good,5867.62841796875 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:54:10.133,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:48:10.312,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 09:14:10.055,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:57:10.117,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 22:02:10.069,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 12:42:10.399,Good,19408.0 +R0:Z24WVP.0S10L,2024-01-02 20:10:00.001,Good,2266.861083984375 +R0:Z24WVP.0S10L,2024-01-02 14:51:00.001,Good,2266.26806640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 10:45:28.351,Good,6299.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 05:03:06.771,Good,5833.31494140625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:08:10.425,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:15:10.412,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 16:11:10.167,Good,19376.0 
+_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:49:10.462,Good,19373.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 10:14:10.274,Good,19401.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:04:10.483,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:35:10.184,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:32:10.219,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:40:10.207,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:06:10.401,Good,19410.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 11:14:17.402,Good,6.65478515625 +R0:Z24WVP.0S10L,2024-01-02 09:37:00.001,Good,2311.937255859375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:56:10.312,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:47:10.271,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:53:10.157,Good,19378.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:56:10.250,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:19:10.376,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 19:02:10.026,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:01:10.129,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 03:23:10.045,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:16:10.242,Good,19394.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:39:10.378,Good,19402.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 21:14:08.404,Good,7200.9580078125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 08:34:56.025,Good,6220.56884765625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:33:30.708,Good,5985.275390625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:43:10.001,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:44:10.247,Good,19380.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 19:16:10.265,Good,19374.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 18:47:10.185,Good,19374.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 00:40:10.497,Good,19397.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 20:21:10.333,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:39:10.184,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:21:10.276,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 15:10:10.346,Good,19388.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 14:30:10.243,Good,19410.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 10:24:10.145,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 16:22:00.001,Good,2266.861083984375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 22:37:27.473,Good,7274.48681640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 12:43:00.288,Good,6406.8427734375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:42:10.247,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 11:47:10.138,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:26:10.073,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:00:10.422,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:44:10.477,Good,19411.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:02:10.169,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 20:35:10.368,Good,19374.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:46:10.011,Good,19375.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 19:36:10.327,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:33:10.253,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 18:33:10.253,Good,19373.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:06:10.054,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:00:10.012,Good,19402.0 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 18:41:04.113,Good,6.7958984375 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 10:44:09.973,Good,6.642578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 04:59:05.845,Good,5828.4130859375 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:59:10.033,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:58:10.317,Good,19379.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:02:10.333,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 
03:30:10.500,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 02:44:10.167,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:56:10.285,Good,19405.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 02:04:10.335,Good,19400.0 +R0:Z24WVP.0S10L,2024-01-02 12:09:00.001,Good,2264.488525390625 +1N325T3MTOR-P0L29:9.T0,2024-01-02 23:38:10.214,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 07:41:10.491,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:45:10.083,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:08:10.152,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 01:09:10.227,Good,19397.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:35:10.227,Good,19403.0 +R0:Z24WVP.0S10L,2024-01-02 23:43:00.001,Good,2295.330322265625 +R0:Z24WVP.0S10L,2024-01-02 23:35:00.001,Good,2297.702880859375 +R0:Z24WVP.0S10L,2024-01-02 12:12:00.001,Good,2265.081787109375 +R0:Z24WVP.0S10L,2024-01-02 05:33:00.001,Good,2308.378662109375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 15:07:39.464,Good,6583.3125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:39:10.372,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 09:23:10.497,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 16:29:10.143,Good,19378.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:15:10.019,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 00:43:10.184,Good,19396.0 +R0:Z24WVP.0S10L,2024-01-02 13:14:00.001,Good,2265.081787109375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 06:51:34.506,Good,6044.0986328125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 23:25:10.443,Good,19377.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 21:43:10.195,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 15:12:10.505,Good,19386.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 12:42:10.399,Good,19408.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 03:38:10.129,Good,19403.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 01:52:10.488,Good,19401.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 06:03:10.114,Good,19404.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 05:29:10.186,Good,19403.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 01:25:10.483,Good,19398.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 01:12:10.421,Good,19398.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:41:10.301,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 05:57:10.423,Good,19402.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 04:12:10.078,Good,19403.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 07:29:41.930,Good,6147.03955078125 +1N325T3MTOR-P0L29:9.T0,2024-01-02 22:12:10.357,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:28:10.469,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 10:34:10.422,Good,19403.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 17:01:10.327,Good,19376.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 11:54:10.205,Good,19404.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 08:47:10.339,Good,19404.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 16:06:55.961,Good,6754.880859375 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 14:24:27.995,Good,6490.17578125 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 07:33:43.150,Good,6156.84326171875 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 02:47:30.694,Good,5671.55078125 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 22:55:10.198,Good,19378.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:30:10.442,Good,19376.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:04:10.452,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 15:37:10.090,Good,19376.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 13:12:10.139,Good,19408.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 04:20:10.029,Good,19404.0 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 11:57:48.785,Good,6333.3134765625 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 20:46:10.070,Good,19375.0 +_LT2EPL-9PM0.OROTENV3:,2024-01-02 11:18:10.090,Good,19404.0 
+1N325T3MTOR-P0L29:9.T0,2024-01-02 18:17:10.032,Good,19374.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 16:38:10.380,Good,19377.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 14:34:10.348,Good,19412.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 00:22:10.264,Good,19395.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 23:14:10.381,Good,19377.0 +TT33-01M9Z2L9:P20.AIRO5N,2024-01-02 21:10:10.203,Good,19376.0 +value_range, 2024-01-02 03:49:45.000, Good, 1 +value_range, 2024-01-02 07:53:11.000, Good, 2 +value_range, 2024-01-02 11:56:42.000, Good, 3 +value_range, 2024-01-02 16:00:12.000, Good, 4 +value_range, 2024-01-02 20:03:46.000, Good, 5 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 17:29:47.361,Good,6.7666015625 +O:05RI0.2T2M6STN6_PP-I165AT,2024-01-02 11:08:16.131,Good,6.6533203125 +R0:Z24WVP.0S10L,2024-01-02 10:54:00.001,Good,2264.488525390625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 11:15:36.517,Good,6313.70556640625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 03:52:47.354,Good,5740.17822265625 +-4O7LSSAM_3EA02:2GT7E02I_R_MP,2024-01-02 01:56:15.905,Good,5627.43310546875 +FLATLINE_TEST,2024-01-02 22:50:10.417,Good,19379.0 +FLATLINE_TEST,2024-01-02 14:57:10.372,Good,0 +FLATLINE_TEST,2024-01-02 02:49:10.408,Good,0 +FLATLINE_TEST,2024-01-02 02:35:10.511,Good,0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 21:51:10.219,Good,19402.0 +1N325T3MTOR-P0L29:9.T0,2024-01-02 17:08:10.242,Good,19402.0 +MISSING_DATA,2024-01-02 00:08:10.000,Good,19379.0 +MISSING_DATA,2024-01-02 00:08:11.000,Good,1 +MISSING_DATA,2024-01-02 00:08:13.000,Good,1 +MISSING_DATA,2024-01-02 00:08:14.000,Good,1 +MISSING_DATA_PATTERN,2024-01-05 00:02:10.000,Good,19379.0 +MISSING_DATA_PATTERN,2024-01-05 00:02:11.000,Good,1 +MISSING_DATA_PATTERN,2024-01-05 00:02:13.000,Good,1 +MISSING_DATA_PATTERN,2024-01-05 00:02:14.000,Good,1 + diff --git a/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_input_validator.py b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_input_validator.py new file mode 100644 index 000000000..69eeba3fa --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/data_quality/test_input_validator.py @@ -0,0 +1,160 @@ +# Copyright 2022 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import pytest + +from pyspark.sql import SparkSession +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation import ( + MissingValueImputation, +) + + +@pytest.fixture(scope="session") +def spark_session(): + return SparkSession.builder.master("local[2]").appName("test").getOrCreate() + + +def test_input_validator_basic(spark_session: SparkSession): + test_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + column_expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + StructField("Tolerance", FloatType(), True), + ] + ) + + pyspark_type_schema = { + "TagName": StringType(), + "EventTime": TimestampType(), + "Status": StringType(), + "Value": float, + } + + test_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "3.0"), + ] + + dirty_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "abc"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "rtdip"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "def"), + ] + + test_df = spark_session.createDataFrame(test_data, schema=test_schema) + dirty_df = spark_session.createDataFrame(dirty_data, schema=test_schema) + + test_component = MissingValueImputation(spark_session, test_df) + dirty_component = MissingValueImputation(spark_session, dirty_df) + + # Check if the column exists + with pytest.raises(ValueError) as e: + test_component.validate(column_expected_schema) + assert "Column 'Tolerance' is missing in the DataFrame." in str(e.value) + + # Check for pyspark Datatypes + with pytest.raises(TypeError) as e: + test_component.validate(pyspark_type_schema) + assert ( + "Expected and actual types must be instances of pyspark.sql.types.DataType." + in str(e.value) + ) + + # Check for casting failures + with pytest.raises(ValueError) as e: + dirty_component.validate(expected_schema) + assert ( + "Error during casting column 'Value' to FloatType(): Column 'Value' cannot be cast to FloatType()." 
+ in str(e.value) + ) + + # Check for success + assert test_component.validate(expected_schema) == True + assert test_component.df.schema == expected_schema + + +def test_input_validator_with_null_strings(spark_session: SparkSession): + # Schema and test data + test_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_data_with_null_strings = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21.000", "Good", "None"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55.000", "Good", "none"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29.000", "Good", "Null"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 15:40:00.000", "Good", "null"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 19:50:00.000", "Good", ""), + ] + + test_df = spark_session.createDataFrame( + test_data_with_null_strings, schema=test_schema + ) + + test_component = MissingValueImputation(spark_session, test_df) + + # Validate the DataFrame + assert test_component.validate(expected_schema) == True + processed_df = test_component.df + + # Check that all values in "Value" were converted to None + value_column = processed_df.select("Value").collect() + + for row in value_column: + assert ( + row["Value"] is None + ), f"Value {row['Value']} was not correctly converted to None." diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_arima.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_arima.py new file mode 100644 index 000000000..7c6891cc1 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_arima.py @@ -0,0 +1,520 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +import pandas as pd +import pytest +import os + +from pyspark.sql import SparkSession +from pyspark.sql.dataframe import DataFrame +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) + +from src.sdk.python.rtdip_sdk._sdk_utils.pandas import ( + _prepare_pandas_to_convert_to_spark, +) + +from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.arima import ( + ArimaPrediction, +) +from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.auto_arima import ( + ArimaAutoPrediction, +) + +# Test cases to add: + +# = TEST COLUMN NAME FINDER = +# Non-existing columns +# Wrong columns given +# Correct columns given + +# = COLUMN-BASED = + +# = SOURCE-BASED = +# Pass additional future data -> should not be discarded + +# = PMD-Arima = +# Column-based +# Source-based + + +@pytest.fixture(scope="session") +def spark_session(): + # Additional config needed since older PySpark versions (<3.5) have trouble converting data with timestamps to pandas DataFrames + return ( + SparkSession.builder.master("local[2]") + .appName("test") + .config("spark.sql.execution.arrow.pyspark.enabled", "true") + .getOrCreate() + ) + + +@pytest.fixture(scope="session") +def historic_data(): + hist_data = [ + ("A2PS64V0J.:ZUX09R", "2024-01-01 03:29:21", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 07:32:55", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 11:36:29", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 15:39:03", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 19:42:37", "Good", "5.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-01 23:46:10", "Good", "6.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 03:49:45", "Good", "7.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 07:53:11", "Good", "8.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 11:56:42", "Good", "9.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 16:00:12", "Good", "10.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:13:46", "Good", "11.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 00:07:20", "Good", "12.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 04:10:50", "Good", "13.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 08:14:20", "Good", "14.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 12:18:02", "Good", "15.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 16:21:30", "Good", "16.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-03 20:25:10", "Good", "17.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 00:28:44", "Good", "18.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 04:32:18", "Good", "19.0"), + ("A2PS64V0J.:ZUX09R", "2024-01-04 08:35:52", "Good", "20.0"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:01:43", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:02:44", 
"Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:03:44", "Good", "4688.019"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:04:44", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:05:44", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:06:44", "Good", "4694.203"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:07:44", "Good", "4693.92"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:08:44", "Good", "4691.6475"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:09:44", "Good", "4688.722"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:10:44", "Good", "4686.481"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:11:46", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:12:46", "Good", "4688.637"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:13:46", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:14:46", "Good", "4691.4985"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:15:46", "Good", "4690.817"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:16:47", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:17:47", "Good", "4693.7354"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:18:47", "Good", "4696.372"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:19:48", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:20:48", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:21:48", "Good", "4684.8516"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:22:48", "Good", "4679.2305"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:23:48", "Good", "4675.784"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:24:48", "Good", "4675.998"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:25:50", "Good", "4681.358"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:26:50", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:27:50", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:28:50", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:29:50", "Good", "4691.056"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:30:50", "Good", "4694.813"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:31:51", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:32:52", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:33:52", "Good", "4685.6963"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:34:52", "Good", "4681.356"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:35:52", "Good", "4678.175"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:36:52", "Good", "4676.186"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:37:52", "Good", "4675.423"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:38:52", "Good", "4675.9185"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:39:52", "Good", "4677.707"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:40:52", "Good", "4680.8213"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:41:52", "Good", "4685.295"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:52", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:42:54", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:43:52", "Good", "4692.863"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:43:54", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:44:54", "Good", 
"4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:45:54", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:46:55", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:47:55", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:48:55", "Good", "4689.178"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:49:55", "Good", "4692.111"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:50:55", "Good", "4695.794"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:51:56", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:52:56", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:53:56", "Good", "4687.381"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:54:56", "Good", "4687.1104"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:55:57", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:56:58", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:57:58", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:58:58", "Good", "4693.161"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 00:59:59", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:00:59", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:01:59", "Good", "4688.2207"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:02:59", "Good", "4689.07"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:03:59", "Good", "4692.1904"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:05:01", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:06:01", "Good", "4699.3506"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:07:01", "Good", "4701.433"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:08:01", "Good", "4701.872"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:09:01", "Good", "4700.228"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:10:02", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:11:03", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:12:03", "Good", "4692.6973"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:13:06", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:14:06", "Good", "4695.113"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:15:06", "Good", "4691.5415"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:16:06", "Good", "4689.0054"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:17:07", "Good", "4691.1616"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:18:07", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:19:07", "Good", "4688.7515"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:20:07", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:21:07", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:22:07", "Good", "4700.935"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:23:07", "Good", "4687.808"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:24:07", "Good", "4675.1323"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:25:09", "Good", "4676.456"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:26:09", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:27:09", "Good", "4708.868"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:28:09", "Good", "4711.2476"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:29:09", "Good", 
"4707.2603"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:30:09", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:31:09", "Good", "4695.7764"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:32:09", "Good", "4692.5146"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:33:09", "Good", "4691.358"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:34:09", "Good", "4692.482"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:35:10", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:36:10", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:37:10", "Good", "4702.4126"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:38:10", "Good", "4700.763"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:39:10", "Good", "4697.9897"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:40:11", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:41:11", "Good", "4696.747"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:42:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:43:11", "Good", "4705.8677"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:44:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:45:11", "Good", "4695.9624"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:46:11", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:47:11", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:48:11", "Good", "4702.187"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:49:11", "Good", "4699.401"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:50:11", "Good", "4695.0015"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:51:11", "Good", "4691.3823"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:52:11", "Good", "4690.9385"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:53:13", "Good", "4696.0635"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:54:13", "Good", "4700.966"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:55:13", "Good", "4686.26"), + ("-4O7LSSAM_3EA02:2GT7E02I_R_MP", "2023-12-31 01:56:13", "Good", "4700.966"), + ] + return hist_data + + +@pytest.fixture(scope="session") +def source_based_synthetic_data(): + output_object = {} + + df1 = pd.DataFrame() + df2 = pd.DataFrame() + np.random.seed(0) + + arr_len = 100 + h_a_l = int(arr_len / 2) + df1["Value"] = np.random.rand(arr_len) + np.sin( + np.linspace(0, arr_len / 2, num=arr_len) + ) + df2["Value"] = ( + df1["Value"] * 2 + np.cos(np.linspace(0, arr_len / 2, num=arr_len)) + 5 + ) + df1["index"] = np.asarray( + pd.date_range(start="1/1/2024", end="2/1/2024", periods=arr_len) + ).astype(str) + df2["index"] = np.asarray( + pd.date_range(start="1/1/2024", end="2/1/2024", periods=arr_len) + ).astype(str) + df1["TagName"] = "PrimarySensor" + df2["TagName"] = "SecondarySensor" + df1["Status"] = "Good" + df2["Status"] = "Good" + + output_object["df1"] = df1 + output_object["df2"] = df2 + output_object["arr_len"] = arr_len + output_object["h_a_l"] = h_a_l + output_object["half_df1_full_df2"] = _prepare_pandas_to_convert_to_spark( + pd.concat([df1.head(h_a_l), df2]) + ) + output_object["full_df1_full_df2"] = _prepare_pandas_to_convert_to_spark( + pd.concat([df1, df2]) + ) + output_object["full_df1_half_df2"] = _prepare_pandas_to_convert_to_spark( + pd.concat([df1, df2.head(h_a_l)]) + ) + output_object["half_df1_half_df2"] = _prepare_pandas_to_convert_to_spark( + pd.concat([df1.head(h_a_l), df2.head(h_a_l)]) + ) + 
return output_object + + +@pytest.fixture(scope="session") +def column_based_synthetic_data(): + output_object = {} + + df1 = pd.DataFrame() + np.random.seed(0) + + arr_len = 100 + h_a_l = int(arr_len / 2) + idx_start = "1/1/2024" + idx_end = "2/1/2024" + + df1["PrimarySensor"] = np.random.rand(arr_len) + np.sin( + np.linspace(0, arr_len / 2, num=arr_len) + ) + df1["SecondarySensor"] = ( + df1["PrimarySensor"] * 2 + np.cos(np.linspace(0, arr_len / 2, num=arr_len)) + 5 + ) + df1["index"] = np.asarray( + pd.date_range(start=idx_start, end=idx_end, periods=arr_len) + ).astype(str) + + output_object["df"] = df1 + output_object["arr_len"] = arr_len + output_object["h_a_l"] = h_a_l + output_object["half_df1_full_df2"] = _prepare_pandas_to_convert_to_spark(df1.copy()) + output_object["half_df1_full_df2"].loc[h_a_l:, "PrimarySensor"] = None + output_object["full_df1_full_df2"] = _prepare_pandas_to_convert_to_spark(df1.copy()) + output_object["full_df1_half_df2"] = _prepare_pandas_to_convert_to_spark(df1.copy()) + output_object["full_df1_half_df2"].loc[h_a_l:, "SecondarySensor"] = None + output_object["half_df1_half_df2"] = _prepare_pandas_to_convert_to_spark( + df1.copy().head(h_a_l) + ) + return output_object + + +def test_nonexistent_column_arima(spark_session: SparkSession): + input_df = spark_session.createDataFrame( + [ + (1.0,), + (2.0,), + ], + ["Value"], + ) + + with pytest.raises(ValueError): + ArimaPrediction(input_df, to_extend_name="NonexistingColumn") + + +def test_invalid_size_arima(spark_session: SparkSession): + input_df = spark_session.createDataFrame( + [ + (1.0,), + (2.0,), + ], + ["Value"], + ) + + with pytest.raises(ValueError): + ArimaPrediction( + input_df, + to_extend_name="Value", + order=(3, 0, 0), + seasonal_order=(3, 0, 0, 62), + number_of_data_points_to_analyze=62, + ) + + +def test_single_column_prediction_arima(spark_session: SparkSession, historic_data): + schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + # convert last column to float + for idx, item in enumerate(historic_data): + historic_data[idx] = item[0:3] + (float(item[3]),) + + input_df = spark_session.createDataFrame(historic_data, schema=schema) + + h_a_l = int(input_df.count() / 2) + + arima_comp = ArimaPrediction( + input_df, + value_name="Value", + past_data_style=ArimaPrediction.InputStyle.SOURCE_BASED, + to_extend_name="-4O7LSSAM_3EA02:2GT7E02I_R_MP", + number_of_data_points_to_analyze=input_df.count(), + number_of_data_points_to_predict=h_a_l, + order=(3, 0, 0), + seasonal_order=(3, 0, 0, 62), + timestamp_name="EventTime", + source_name="TagName", + status_name="Status", + ) + forecasted_df = arima_comp.filter_data() + # print(forecasted_df.show(forecasted_df.count(), False)) + + assert isinstance(forecasted_df, DataFrame) + + assert input_df.columns == forecasted_df.columns + assert forecasted_df.count() == (input_df.count() + h_a_l) + + +def test_single_column_prediction_auto_arima( + spark_session: SparkSession, historic_data +): + + schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + # convert last column to float + for idx, item in enumerate(historic_data): + historic_data[idx] = item[0:3] + (float(item[3]),) + + input_df = 
spark_session.createDataFrame(historic_data, schema=schema) + + h_a_l = int(input_df.count() / 2) + + arima_comp = ArimaAutoPrediction( + past_data=input_df, + # past_data_style=ArimaPrediction.InputStyle.SOURCE_BASED, + # value_name="Value", + to_extend_name="-4O7LSSAM_3EA02:2GT7E02I_R_MP", + number_of_data_points_to_analyze=input_df.count(), + number_of_data_points_to_predict=h_a_l, + # timestamp_name="EventTime", + # source_name="TagName", + # status_name="Status", + seasonal=True, + ) + forecasted_df = arima_comp.filter_data() + # print(forecasted_df.show(forecasted_df.count(), False)) + + assert isinstance(forecasted_df, DataFrame) + + assert input_df.columns == forecasted_df.columns + assert forecasted_df.count() == (input_df.count() + h_a_l) + assert arima_comp.value_name == "Value" + assert arima_comp.past_data_style == ArimaPrediction.InputStyle.SOURCE_BASED + assert arima_comp.timestamp_name == "EventTime" + assert arima_comp.source_name == "TagName" + assert arima_comp.status_name == "Status" + + +def test_column_based_prediction_arima( + spark_session: SparkSession, column_based_synthetic_data +): + + schema = StructType( + [ + StructField("PrimarySource", StringType(), True), + StructField("SecondarySource", StringType(), True), + StructField("EventTime", StringType(), True), + ] + ) + + data = column_based_synthetic_data["half_df1_half_df2"] + + input_df = spark_session.createDataFrame(data, schema=schema) + + arima_comp = ArimaAutoPrediction( + past_data=input_df, + to_extend_name="PrimarySource", + number_of_data_points_to_analyze=input_df.count(), + number_of_data_points_to_predict=input_df.count(), + seasonal=True, + ) + forecasted_df = arima_comp.filter_data() + + # forecasted_df.show() + + assert isinstance(forecasted_df, DataFrame) + + assert input_df.columns == forecasted_df.columns + assert forecasted_df.count() == (input_df.count() + input_df.count()) + assert arima_comp.value_name == None + assert arima_comp.past_data_style == ArimaPrediction.InputStyle.COLUMN_BASED + assert arima_comp.timestamp_name == "EventTime" + assert arima_comp.source_name is None + assert arima_comp.status_name is None + + +def test_arima_large_data_set(spark_session: SparkSession): + test_path = os.path.dirname(__file__) + data_path = os.path.join(test_path, "../../data_quality/test_data.csv") + + input_df = spark_session.read.option("header", "true").csv(data_path) + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + print((input_df.count(), len(input_df.columns))) + + count_signal = input_df.filter('TagName = "R0:Z24WVP.0S10L"').count() + h_a_l = int(count_signal / 2) + + arima_comp = ArimaAutoPrediction( + input_df, + to_extend_name="R0:Z24WVP.0S10L", + number_of_data_points_to_analyze=count_signal, + number_of_data_points_to_predict=h_a_l, + ) + + result_df = arima_comp.filter_data() + + tolerance = 0.01 + + assert isinstance(result_df, DataFrame) + + assert result_df.count() == pytest.approx((input_df.count() + h_a_l), rel=tolerance) + + +def test_arima_wrong_datatype(spark_session: SparkSession): + + expected_schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] + ) + + test_df = spark_session.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", 
"invalid_data_type", "Good", "1.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "2.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "3.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "4.0"), + ("A2PS64V0J.:ZUX09R", "invalid_data_type", "Good", "5.0"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + count_signal = 5 + h_a_l = int(count_signal / 2) + + with pytest.raises(ValueError) as exc_info: + arima_comp = ArimaAutoPrediction( + test_df, + to_extend_name="A2PS64V0J.:ZUX09R", + number_of_data_points_to_analyze=count_signal, + number_of_data_points_to_predict=h_a_l, + ) + + arima_comp.validate(expected_schema) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_data_binning.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_data_binning.py new file mode 100644 index 000000000..f4f8fafee --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_data_binning.py @@ -0,0 +1,71 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import pytest +from pyspark.sql import SparkSession +from pyspark.ml.linalg import Vectors +from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.data_binning import ( + DataBinning, +) + + +@pytest.fixture(scope="session") +def spark(): + return ( + SparkSession.builder.master("local[*]") + .appName("Linear Regression Unit Test") + .getOrCreate() + ) + + +@pytest.fixture(scope="function") +def sample_data(spark): + data = [ + (Vectors.dense([1.0]),), + (Vectors.dense([1.2]),), + (Vectors.dense([1.5]),), + (Vectors.dense([5.0]),), + (Vectors.dense([5.2]),), + (Vectors.dense([9.8]),), + (Vectors.dense([10.0]),), + (Vectors.dense([10.2]),), + ] + + return spark.createDataFrame(data, ["features"]) + + +def test_data_binning_kmeans(sample_data): + binning = DataBinning(column_name="features", bins=3, output_column_name="bin") + + result_df = binning.train(sample_data).predict(sample_data) + + assert "bin" in result_df.columns + assert result_df.count() == sample_data.count() + + bin_values = result_df.select("bin").distinct().collect() + bin_numbers = [row.bin for row in bin_values] + assert all(0 <= bin_num < 3 for bin_num in bin_numbers) + + for row in result_df.collect(): + if row["features"] in [1.0, 1.2, 1.5]: + assert row["bin"] == 2 + elif row["features"] in [5.0, 5.2]: + assert row["bin"] == 1 + elif row["features"] in [9.8, 10.0, 10.2]: + assert row["bin"] == 0 + + +def test_data_binning_invalid_method(sample_data): + with pytest.raises(Exception) as exc_info: + DataBinning(column_name="features", bins=3, method="invalid_method") + assert "Unknown method" in str(exc_info.value) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_k_nearest_neighbors.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_k_nearest_neighbors.py new file mode 100644 index 000000000..95d91c4bf --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_k_nearest_neighbors.py @@ -0,0 +1,300 @@ +# Copyright 2025 RTDIP 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import pytest +from pyspark.sql import SparkSession +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) +from datetime import datetime +from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.k_nearest_neighbors import ( + KNearestNeighbors, +) +from pyspark.ml.feature import VectorAssembler, StandardScaler, StringIndexer +from pyspark.sql.functions import col + +# Schema definition (same as template) +SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] +) + + +@pytest.fixture(scope="session") +def spark(): + return ( + SparkSession.builder.master("local[*]").appName("KNN Unit Test").getOrCreate() + ) + + +@pytest.fixture(scope="function") +def sample_data(spark): + # Using similar data structure as template but with more varied values + data = [ + ( + "TAG1", + datetime.strptime("2024-01-02 20:03:46.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.34, + ), + ( + "TAG1", + datetime.strptime("2024-01-02 20:04:46.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.35, + ), + ( + "TAG2", + datetime.strptime("2024-01-02 20:05:46.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.45, + ), + ( + "TAG2", + datetime.strptime("2024-01-02 20:06:46.000", "%Y-%m-%d %H:%M:%S.%f"), + "Bad", + 0.55, + ), + ] + return spark.createDataFrame(data, schema=SCHEMA) + + +@pytest.fixture(scope="function") +def prepared_data(sample_data): + # Convert categorical variables, Index TagName and Status + tag_indexer = StringIndexer(inputCol="TagName", outputCol="TagIndex") + status_indexer = StringIndexer(inputCol="Status", outputCol="StatusIndex") + + df = tag_indexer.fit(sample_data).transform(sample_data) + df = status_indexer.fit(df).transform(df) + + assembler = VectorAssembler( + inputCols=["TagIndex", "StatusIndex", "Value"], outputCol="raw_features" + ) + df = assembler.transform(df) + + scaler = StandardScaler( + inputCol="raw_features", outputCol="features", withStd=True, withMean=True + ) + return scaler.fit(df).transform(df) + + +def test_knn_initialization(prepared_data): + """Test KNN initialization with various parameters""" + # Test valid initialization + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + k=3, + weighted=True, + distance_metric="combined", + ) + assert knn.k == 3 + assert knn.weighted is True + + # Test invalid distance metric + with pytest.raises(ValueError): + KNearestNeighbors( + features_col="features", + label_col="Value", + distance_metric="invalid_metric", + ) + + # Test missing timestamp column for temporal distance + with pytest.raises(ValueError): + KNearestNeighbors( + features_col="features", + label_col="Value", + # timestamp_col is compulsory for temporal distance + distance_metric="temporal", + ) + + +def test_data_splitting(prepared_data): + """Test the data 
splitting functionality""" + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + ) + + train_df, test_df = prepared_data.randomSplit([0.8, 0.2], seed=42) + + assert train_df.count() + test_df.count() == prepared_data.count() + assert train_df.count() > 0 + assert test_df.count() > 0 + + +def test_model_training(prepared_data): + """Test model training functionality""" + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + ) + + train_df, _ = prepared_data.randomSplit([0.8, 0.2], seed=42) + trained_model = knn.train(train_df) + + assert trained_model is not None + assert trained_model.train_features is not None + assert trained_model.train_labels is not None + + +def test_predictions(prepared_data): + """Test prediction functionality""" + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + weighted=True, + ) + + train_df, test_df = prepared_data.randomSplit([0.8, 0.2], seed=42) + knn.train(train_df) + predictions = knn.predict(test_df) + + assert "prediction" in predictions.columns + assert predictions.count() > 0 + assert all(pred is not None for pred in predictions.select("prediction").collect()) + + +def test_temporal_distance(prepared_data): + """Test temporal distance calculation""" + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + distance_metric="temporal", + ) + + train_df, test_df = prepared_data.randomSplit([0.8, 0.2], seed=42) + knn.train(train_df) + predictions = knn.predict(test_df) + + assert predictions.count() > 0 + assert "prediction" in predictions.columns + + +def test_combined_distance(prepared_data): + """Test combined distance calculation""" + knn = KNearestNeighbors( + features_col="features", + label_col="Value", + timestamp_col="EventTime", + distance_metric="combined", + temporal_weight=0.5, + ) + + train_df, test_df = prepared_data.randomSplit([0.8, 0.2], seed=42) + knn.train(train_df) + predictions = knn.predict(test_df) + + assert predictions.count() > 0 + assert "prediction" in predictions.columns + + +def test_invalid_data_handling(spark): + """Test handling of invalid data""" + invalid_data = [ + ("TAG1", "invalid_date", "Good", "invalid_value"), + ("TAG1", "2024-01-02 20:03:46.000", "Good", "NaN"), + ("TAG2", "2024-01-02 20:03:46.000", None, 123.45), + ] + + schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + + df = spark.createDataFrame(invalid_data, schema=schema) + + try: + df = df.withColumn("Value", col("Value").cast(FloatType())) + invalid_rows = df.filter(col("Value").isNull()) + valid_rows = df.filter(col("Value").isNotNull()) + + assert invalid_rows.count() > 0 + assert valid_rows.count() > 0 + except Exception as e: + pytest.fail(f"Unexpected error during invalid data handling: {e}") + + +def test_large_dataset(spark): + """Test KNN on a larger dataset""" + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../data_quality/test_data.csv") + + try: + df = spark.read.option("header", "true").csv(file_path) + df = df.withColumn("Value", col("Value").cast(FloatType())) + df = df.withColumn("EventTime", col("EventTime").cast(TimestampType())) + + prepared_df = prepare_data_for_knn(df) + + knn = KNearestNeighbors( + features_col="features", + 
label_col="Value", + timestamp_col="EventTime", + ) + + train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42) + knn.train(train_df) + predictions = knn.predict(test_df) + + assert predictions.count() > 0 + assert "prediction" in predictions.columns + except Exception as e: + pytest.fail(f"Failed to process large dataset: {e}") + + +def prepare_data_for_knn(df): + """Helper function to prepare data for KNN""" + + # Convert categorical variables + indexers = [ + StringIndexer(inputCol=col, outputCol=f"{col}Index") + for col in ["TagName", "Status"] + if col in df.columns + ] + + for indexer in indexers: + df = indexer.fit(df).transform(df) + + # Create feature vector + numeric_cols = [col for col in df.columns if df.schema[col].dataType == FloatType()] + index_cols = [col for col in df.columns if col.endswith("Index")] + feature_cols = numeric_cols + index_cols + + assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features") + df = assembler.transform(df) + + # Scale features + scaler = StandardScaler( + inputCol="raw_features", outputCol="features", withStd=True, withMean=True + ) + return scaler.fit(df).transform(df) diff --git a/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_linear_regression.py b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_linear_regression.py new file mode 100644 index 000000000..aa43830fc --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/test_linear_regression.py @@ -0,0 +1,321 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import pytest +from pyspark.sql import SparkSession +from pyspark.sql import Row +from pyspark.sql.types import ( + StructType, + StructField, + StringType, + TimestampType, + FloatType, +) +from datetime import datetime +from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.linear_regression import ( + LinearRegression, +) +from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.columns_to_vector import ( + ColumnsToVector, +) +from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.polynomial_features import ( + PolynomialFeatures, +) + +SCHEMA = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", TimestampType(), True), + StructField("Status", StringType(), True), + StructField("Value", FloatType(), True), + ] +) + + +@pytest.fixture(scope="session") +def spark(): + return ( + SparkSession.builder.master("local[*]") + .appName("Linear Regression Unit Test") + .getOrCreate() + ) + + +@pytest.fixture(scope="function") +def sample_data(spark): + data = [ + ( + "A2PS64V0J.:ZUX09R", + datetime.strptime("2024-01-02 20:03:46.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.3400000035762787, + ), + ( + "A2PS64V0J.:ZUX09R", + datetime.strptime("2024-01-02 16:00:12.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.15000000596046448, + ), + ( + "A2PS64V0J.:ZUX09R", + datetime.strptime("2024-01-02 11:56:42.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.12999999523162842, + ), + ( + "A2PS64V0J.:ZUX09R", + datetime.strptime("2024-01-02 07:53:11.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.11999999731779099, + ), + ( + "A2PS64V0J.:ZUX09R", + datetime.strptime("2024-01-02 03:49:45.000", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 0.12999999523162842, + ), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + datetime.strptime("2024-01-02 20:09:58.053", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 7107.82080078125, + ), + ( + "_LT2EPL-9PM0.OROTENV3:", + datetime.strptime("2024-01-02 12:27:10.518", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19407.0, + ), + ( + "_LT2EPL-9PM0.OROTENV3:", + datetime.strptime("2024-01-02 05:23:10.143", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19403.0, + ), + ( + "_LT2EPL-9PM0.OROTENV3:", + datetime.strptime("2024-01-02 01:31:10.086", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19399.0, + ), + ( + "1N325T3MTOR-P0L29:9.T0", + datetime.strptime("2024-01-02 23:41:10.358", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19376.0, + ), + ( + "TT33-01M9Z2L9:P20.AIRO5N", + datetime.strptime("2024-01-02 18:09:10.488", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19375.0, + ), + ( + "TT33-01M9Z2L9:P20.AIRO5N", + datetime.strptime("2024-01-02 16:15:10.492", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19376.0, + ), + ( + "TT33-01M9Z2L9:P20.AIRO5N", + datetime.strptime("2024-01-02 06:51:10.077", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 19403.0, + ), + ( + "O:05RI0.2T2M6STN6_PP-I165AT", + datetime.strptime("2024-01-02 07:42:24.227", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 6.55859375, + ), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + datetime.strptime("2024-01-02 06:08:23.777", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 5921.5498046875, + ), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + datetime.strptime("2024-01-02 05:14:10.896", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 5838.216796875, + ), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + datetime.strptime("2024-01-02 01:37:10.967", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 5607.82568359375, + ), + ( + "-4O7LSSAM_3EA02:2GT7E02I_R_MP", + datetime.strptime("2024-01-02 00:26:53.449", "%Y-%m-%d %H:%M:%S.%f"), + "Good", + 5563.7080078125, + ), + ] + + return 
spark.createDataFrame(data, schema=SCHEMA) + + +def test_columns_to_vector(sample_data): + df = sample_data + columns_to_vector = ColumnsToVector( + df=df, input_cols=["Value"], output_col="features" + ) + transformed_df = columns_to_vector.transform() + + assert "features" in transformed_df.columns + transformed_df.show() + + +def test_polynomial_features(sample_data): + df = sample_data + # Convert 'Value' to a vector using ColumnsToVector + columns_to_vector = ColumnsToVector( + df=df, input_cols=["Value"], output_col="features" + ) + vectorized_df = columns_to_vector.transform() + + polynomial_features = PolynomialFeatures( + df=vectorized_df, + input_col="features", + output_col="poly_features", + poly_degree=2, + ) + transformed_df = polynomial_features.transform() + assert ( + "poly_features" in transformed_df.columns + ), "Polynomial features column not created" + assert transformed_df.count() > 0, "Transformed DataFrame is empty" + + transformed_df.show() + + +def test_dataframe_validation(sample_data): + df = sample_data + + required_columns = ["TagName", "EventTime", "Status", "Value"] + for column in required_columns: + if column not in df.columns: + raise ValueError(f"Missing required column: {column}") + + try: + df.withColumn("Value", df["Value"].cast(FloatType())) + except Exception as e: + raise ValueError("Column 'Value' could not be converted to FloatType.") from e + + +def test_invalid_data_handling(spark): + + data = [ + ("A2PS64V0J.:ZUX09R", "invalid_date", "Good", "invalid_value"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", "NaN"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", None, 123.45), + ("A2PS64V0J.:ZUX09R", "2024-01-02 20:03:46.000", "Good", 123.45), + ] + + schema = StructType( + [ + StructField("TagName", StringType(), True), + StructField("EventTime", StringType(), True), + StructField("Status", StringType(), True), + StructField("Value", StringType(), True), + ] + ) + + df = spark.createDataFrame(data, schema=schema) + + try: + df = df.withColumn("Value", df["Value"].cast(FloatType())) + except Exception as e: + pytest.fail(f"Unexpected error during casting: {e}") + + invalid_rows = df.filter(df["Value"].isNull()) + valid_rows = df.filter(df["Value"].isNotNull()) + + assert invalid_rows.count() > 0, "No invalid rows detected when expected" + assert valid_rows.count() > 0, "All rows were invalid, which is unexpected" + + if valid_rows.count() > 0: + vectorized_df = ColumnsToVector( + df=valid_rows, input_cols=["Value"], output_col="features" + ).transform() + assert ( + "features" in vectorized_df.columns + ), "Vectorized column 'features' not created" + + +def test_invalid_prediction_without_training(sample_data): + df = sample_data + + vectorized_df = ColumnsToVector( + df=df, input_cols=["Value"], output_col="features" + ).transform() + + linear_regression = LinearRegression( + features_col="features", + label_col="Value", + prediction_col="prediction", + ) + + # Attempt prediction without training + with pytest.raises( + AttributeError, match="'LinearRegression' object has no attribute 'model'" + ): + linear_regression.predict(vectorized_df) + + +def test_prediction_on_large_dataset(spark): + base_path = os.path.dirname(__file__) + file_path = os.path.join(base_path, "../../data_quality/test_data.csv") + df = spark.read.option("header", "true").csv(file_path) + assert df.count() > 0, "Dataframe was not loaded correctly" + + assert df.count() > 0, "Dataframe was not loaded correctly" + assert "EventTime" in df.columns, "Missing 
'EventTime' column in dataframe" + assert "Value" in df.columns, "Missing 'Value' column in dataframe" + + df = df.withColumn("Value", df["Value"].cast("float")) + assert ( + df.select("Value").schema[0].dataType == FloatType() + ), "Value column was not cast to FloatType" + + vectorized_df = ColumnsToVector( + df=df, input_cols=["Value"], output_col="features" + ).transform() + + assert ( + "features" in vectorized_df.columns + ), "Vectorized column 'features' not created" + + linear_regression = LinearRegression( + features_col="features", + label_col="Value", + prediction_col="prediction", + ) + + train_df, test_df = linear_regression.split_data(vectorized_df, train_ratio=0.8) + assert train_df.count() > 0, "Training dataset is empty" + assert test_df.count() > 0, "Testing dataset is empty" + + model = linear_regression.train(train_df) + assert model is not None, "Model training failed" + + predictions = model.predict(test_df) + + assert predictions is not None, "Predictions dataframe is empty" + assert predictions.count() > 0, "No predictions were generated" + assert ( + "prediction" in predictions.columns + ), "Missing 'prediction' column in predictions dataframe" diff --git a/tests/sdk/python/rtdip_sdk/pipelines/logging/__init__.py b/tests/sdk/python/rtdip_sdk/pipelines/logging/__init__.py new file mode 100644 index 000000000..1832b01ae --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/logging/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/sdk/python/rtdip_sdk/pipelines/logging/test_log_collection.py b/tests/sdk/python/rtdip_sdk/pipelines/logging/test_log_collection.py new file mode 100644 index 000000000..103f09f01 --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/logging/test_log_collection.py @@ -0,0 +1,149 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
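The linear-regression tests above exercise one pipeline end to end: cast Value to float, pack it into a vector with ColumnsToVector, optionally expand it with PolynomialFeatures, then split, train and predict. For readers who want the happy path without the assertions, a compact sketch follows; it is limited to the calls shown in the tests, and the input path is a placeholder rather than a real dataset.

# Condensed end-to-end sketch of the linear-regression flow exercised above.
# Assumptions: the API surface matches the test calls, and
# "path/to/timeseries.csv" stands in for a dataset with EventTime and Value.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType
from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.columns_to_vector import (
    ColumnsToVector,
)
from src.sdk.python.rtdip_sdk.pipelines.transformers.spark.machine_learning.polynomial_features import (
    PolynomialFeatures,
)
from src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.linear_regression import (
    LinearRegression,
)

spark = SparkSession.builder.master("local[*]").appName("lr-sketch").getOrCreate()
df = spark.read.option("header", "true").csv("path/to/timeseries.csv")
df = df.withColumn("Value", col("Value").cast(FloatType()))

# 1. Pack the numeric column into a single vector column.
vectorized_df = ColumnsToVector(
    df=df, input_cols=["Value"], output_col="features"
).transform()

# 2. Optionally expand the feature vector with polynomial terms
#    (shown separately here, as in the tests; not fed into the regression below).
poly_df = PolynomialFeatures(
    df=vectorized_df, input_col="features", output_col="poly_features", poly_degree=2
).transform()

# 3. Split, train and predict.
lr = LinearRegression(
    features_col="features", label_col="Value", prediction_col="prediction"
)
train_df, test_df = lr.split_data(vectorized_df, train_ratio=0.8)
model = lr.train(train_df)
predictions = model.predict(test_df)  # DataFrame with a "prediction" column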
+import os + +import pytest + +from pandas import DataFrame +from pyspark.sql import SparkSession + +from src.sdk.python.rtdip_sdk.pipelines.logging.logger_manager import LoggerManager +from src.sdk.python.rtdip_sdk.pipelines.logging.spark.runtime_log_collector import ( + RuntimeLogCollector, +) +from src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval import ( + IdentifyMissingDataInterval, +) + +import logging + + +@pytest.fixture(scope="session") +def spark(): + spark = ( + SparkSession.builder.master("local[2]") + .appName("LogCollectionTest") + .getOrCreate() + ) + yield spark + spark.stop() + + +def test_logger_manager_basic_function(spark): + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:01:25.000", "Good", "0.150000006"), + ( + "A2PS64V0J.:ZUX09R", + "2024-01-02 00:01:41.000", + "Good", + "0.340000004", + ), # Missing interval (25s to 41s) + ], + ["TagName", "EventTime", "Status", "Value"], + ) + monitor = IdentifyMissingDataInterval( + df=df, + interval="10s", + tolerance="500ms", + ) + log_collector = RuntimeLogCollector(spark) + + assert monitor.logger_manager is log_collector.logger_manager + + +def test_df_output(spark, caplog): + log_collector = RuntimeLogCollector(spark) + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + + monitor = IdentifyMissingDataInterval( + df=df, + interval="10s", + tolerance="500ms", + ) + log_handler = log_collector._attach_dataframe_handler_to_logger( + "IdentifyMissingDataInterval" + ) + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"): + monitor.check() + + result_df = log_handler.get_logs_as_df() + + assert result_df.count() == 4 + + +def test_unique_dataframes(spark, caplog): + log_collector = RuntimeLogCollector(spark) + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + logger = LoggerManager().create_logger("Test_Logger") + monitor = IdentifyMissingDataInterval( + df=df, + interval="10s", + tolerance="500ms", + ) + log_handler_identify_missing_data_interval = ( + log_collector._attach_dataframe_handler_to_logger("IdentifyMissingDataInterval") + ) + + log_handler_test = log_collector._attach_dataframe_handler_to_logger("Test_Logger") + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"): + monitor.check() + + result_df = log_handler_identify_missing_data_interval.get_logs_as_df() + result_df_test = log_handler_test.get_logs_as_df() + + assert result_df.count() != result_df_test.count() + + +def test_file_logging(spark, caplog): + + log_collector = RuntimeLogCollector(spark) + df = spark.createDataFrame( + [ + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:00.000", "Good", "0.129999995"), + ("A2PS64V0J.:ZUX09R", "2024-01-02 00:00:10.000", "Good", "0.119999997"), + ], + ["TagName", "EventTime", "Status", "Value"], + ) + monitor = IdentifyMissingDataInterval( + df=df, + interval="10s", + tolerance="500ms", + ) + log_collector._attach_file_handler_to_loggers("logs.log", ".") + + with caplog.at_level(logging.INFO, logger="IdentifyMissingDataInterval"): + monitor.check() + + with 
open("./logs.log", "r") as f: + logs = f.readlines() + + assert len(logs) == 4 + if os.path.exists("./logs.log"): + os.remove("./logs.log") diff --git a/tests/sdk/python/rtdip_sdk/pipelines/logging/test_logger_manager.py b/tests/sdk/python/rtdip_sdk/pipelines/logging/test_logger_manager.py new file mode 100644 index 000000000..0b2e4e6cc --- /dev/null +++ b/tests/sdk/python/rtdip_sdk/pipelines/logging/test_logger_manager.py @@ -0,0 +1,31 @@ +# Copyright 2025 RTDIP +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import pytest +from src.sdk.python.rtdip_sdk.pipelines.logging.logger_manager import LoggerManager + + +def test_logger_manager_basic_function(): + logger_manager = LoggerManager() + logger1 = logger_manager.create_logger("logger1") + assert logger1 is logger_manager.get_logger("logger1") + + assert logger_manager.get_logger("logger2") is None + + +def test_singleton_functionality(): + logger_manager = LoggerManager() + logger_manager2 = LoggerManager() + + assert logger_manager is logger_manager2