[SPARK] Smart debug facet #3715
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #3715      +/-   ##
==========================================
+ Coverage   84.79%   86.12%   +1.32%
==========================================
  Files          35       57      +22
  Lines        1888     3661    +1773
==========================================
+ Hits         1601     3153    +1552
- Misses        287      508     +221
integration/spark/shared/src/main/java/io/openlineage/spark/api/DebugConfig.java
    int payloadSize =
        OpenLineageClientUtils.toJson(new DebugRunFacetWithStandardSerializer(facet))
                .getBytes(StandardCharsets.UTF_8)
                .length
            / 1024;

    if (payloadSize > facet.getPayloadSizeLimitInKilobytes()) {
      log.warn(
          "DebugRunFacetSerializer skipping serialization of DebugRunFacet due to payload size: {} kilobytes",
          payloadSize);
If you serialize first and check afterwards, you may already have blown up memory consumption by holding the JSON.
The debug facet does not contain Spark's LogicalPlan classes; it serialises only explicitly defined OpenLineage classes. Logical plan nodes are included, but only as strings built with:
String.format("%s@%s", node.nodeName(), node.hashCode())
So this is not meant to protect against serializing gigabyte-sized logical plan nodes; it is a limiter that makes sure the event size does not exceed backend boundaries.
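To make the size check in this thread concrete, here is a minimal sketch of the same arithmetic in isolation. `PayloadSizeGate` and its method names are hypothetical and not part of OpenLineage; only the UTF-8 byte count, the division by 1024, the `>` comparison, and the `"%s@%s"` label format come from the diff and the reply above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not an OpenLineage class: isolates the size gate discussed above. */
final class PayloadSizeGate {
  private PayloadSizeGate() {}

  /** Size of the UTF-8 encoding of the JSON in whole kilobytes (integer division rounds down). */
  static int sizeInKilobytes(String json) {
    return json.getBytes(StandardCharsets.UTF_8).length / 1024;
  }

  /** True when the payload is strictly larger than the limit, mirroring the `>` in the diff. */
  static boolean exceedsLimit(String json, int limitInKilobytes) {
    return sizeInKilobytes(json) > limitInKilobytes;
  }

  /** Compact "name@hash" label, the form the reply says is used for logical plan nodes. */
  static String nodeLabel(String nodeName, int hashCode) {
    return String.format("%s@%s", nodeName, hashCode);
  }
}
```

Because the division rounds down, a payload of 100 KB plus up to 1023 bytes still counts as 100 and passes a 100 KB limit in this sketch. Note also that the size is measured in encoded bytes, so non-ASCII characters weigh more than one byte each.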
    private final SparkConfigDebugFacet config;
    private final LogicalPlanDebugFacet logicalPlan;
    private final MetricsDebugFacet metrics;
    private final List<String> logs;
I don't understand how you fill this field. Wouldn't it depend heavily on the logger config and the Spark distribution?
This will not contain slf4j logs. It can only contain some extra information about the debug facet, such as:
- debug facet serialization size exceeds a certain size
- no input or no output datasets have been detected.
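As an illustration of the reply above, the `logs` list could be assembled from a few plain checks rather than from any logging framework. This is only a sketch under assumptions: `DebugLogNotes`, its parameters, and the message texts are hypothetical and do not reflect the actual implementation in the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: class name, parameters and messages are illustrative only. */
final class DebugLogNotes {
  private DebugLogNotes() {}

  /** Builds short notes about the debug facet itself; no slf4j output is captured here. */
  static List<String> build(
      boolean inputsDetected, boolean outputsDetected, int payloadKb, int limitKb) {
    List<String> logs = new ArrayList<>();
    if (!inputsDetected) {
      logs.add("No input datasets detected");
    }
    if (!outputsDetected) {
      logs.add("No output datasets detected");
    }
    if (payloadKb > limitKb) {
      logs.add(
          String.format(
              "Debug facet payload of %d KB exceeds limit of %d KB", payloadKb, limitKb));
    }
    // Callers get a read-only view so the notes cannot be altered after the fact.
    return Collections.unmodifiableList(logs);
  }
}
```

Since the notes are derived from values the integration already holds, the result does not depend on the logger configuration or the Spark distribution, which is the concern raised in the question.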
Signed-off-by: Pawel Leszczynski <[email protected]>
[https://github.com//issues/3711] Add the ability to automatically include the debug facet when no input or output datasets are detected. This can be useful to ensure lineage completeness. Please read the docs within the PR for more details on the feature.
Changes implemented:
- OpenLineageContext with OpenLineageRunStatus to store information on whether any inputs/outputs have been collected through all the events for the given run.
- DebugFacet on COMPLETE event when the configurable criteria are met: output dataset missing or any of the input/output is missing.
- logs field to debug facet determining if inputs or outputs are empty.
- DebugFacet json does not exceed 100KB, or other configurable value.
- spark.openlineage.debugFacet setting to turn on/off the debug facet.
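The decision described in the list above can be sketched as a small predicate over the run status. Everything here is an assumption made for illustration: `SmartDebugDecision`, the `Criteria` enum and its constants, `RunStatus`, and the `enabled` flag are hypothetical names, not the classes or configuration keys introduced by this PR.

```java
/** Hypothetical sketch; none of these names are taken from the OpenLineage codebase. */
final class SmartDebugDecision {
  private SmartDebugDecision() {}

  /** Which absence should trigger the debug facet (illustrative enum). */
  enum Criteria {
    OUTPUT_MISSING,
    ANY_MISSING
  }

  /** Remembers whether any event of the run carried inputs or outputs. */
  static final class RunStatus {
    private boolean inputsSeen;
    private boolean outputsSeen;

    /** Called once per emitted event; flags only ever switch from false to true. */
    void record(int inputCount, int outputCount) {
      inputsSeen |= inputCount > 0;
      outputsSeen |= outputCount > 0;
    }

    boolean hasInputs() {
      return inputsSeen;
    }

    boolean hasOutputs() {
      return outputsSeen;
    }
  }

  /** Attach the facet only on COMPLETE, only when enabled, and only if the criteria match. */
  static boolean shouldAttach(
      boolean enabled, String eventType, Criteria criteria, RunStatus status) {
    if (!enabled || !"COMPLETE".equals(eventType)) {
      return false;
    }
    switch (criteria) {
      case OUTPUT_MISSING:
        return !status.hasOutputs();
      case ANY_MISSING:
        return !status.hasInputs() || !status.hasOutputs();
      default:
        return false;
    }
  }
}
```

Tracking the flags across all events of a run matters because inputs and outputs may appear in an earlier event and be absent from the COMPLETE event itself; checking only the final event would then attach the facet unnecessarily.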