Description
Motivation
I wrote a pipeline validation tool early on in the development of KDP, with the intention of minimizing user error when creating pipelines. From the outset pipelines were meant to be DAGs, and thus there are some implicit 'rules' that must be followed when creating pipelines:
Pipeline 'rules':
- Exactly 1 root node (node with no incoming edges)
- At least 1 leaf node (node with no outgoing edges)
- Directed edges between nodes
- All nodes are reachable from the root
- No cycles (unless specially flagged)
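As a rough illustration of how the rules above could be checked in code, here is a minimal Python sketch. It is not the existing validator's implementation: the function name `validate_pipeline`, the `(source, target)` edge representation, and the `allow_cycles` flag (standing in for "specially flagged") are all assumptions made for the example.

```python
from collections import deque


def validate_pipeline(nodes, edges, allow_cycles=False):
    """Return a list of rule violations (empty if the pipeline looks valid).

    Hypothetical helper: `nodes` is an iterable of node names and `edges`
    is an iterable of (source, target) pairs, i.e. directed edges.
    """
    nodes = list(nodes)
    errors = []
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        if src not in successors or dst not in successors:
            errors.append(f"edge {src!r} -> {dst!r} references an unknown node")
            continue
        successors[src].append(dst)
        indegree[dst] += 1

    # Rule: exactly 1 root, at least 1 leaf.
    roots = [n for n in nodes if indegree[n] == 0]
    leaves = [n for n in nodes if not successors[n]]
    if len(roots) != 1:
        errors.append(f"expected exactly 1 root node, found {len(roots)}")
    if not leaves:
        errors.append("expected at least 1 leaf node, found 0")

    # Rule: every node is reachable from the root (breadth-first search).
    if len(roots) == 1:
        seen = {roots[0]}
        queue = deque(roots)
        while queue:
            for nxt in successors[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        unreachable = [n for n in nodes if n not in seen]
        if unreachable:
            errors.append(f"nodes unreachable from root: {unreachable}")

    # Rule: no cycles. Kahn's algorithm can only process every node
    # if the graph is acyclic.
    if not allow_cycles:
        remaining = dict(indegree)
        queue = deque(n for n in nodes if remaining[n] == 0)
        processed = 0
        while queue:
            node = queue.popleft()
            processed += 1
            for nxt in successors[node]:
                remaining[nxt] -= 1
                if remaining[nxt] == 0:
                    queue.append(nxt)
        if processed != len(nodes):
            errors.append("pipeline contains a cycle")
    return errors
```

Returning a list of messages rather than raising on the first failure means every violation can be reported at once.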
I created the validation tool because it would be too cumbersome for KDP users to check adherence by hand. It is currently a standalone package (and likely in need of updates), but it would make sense to integrate it into the operator to ensure that no invalid pipelines are deployed to the cluster.
Pipelines with obscure bugs such as hard-to-spot cycles could be costly for large processing jobs on expensive cloud resources, so stopping deployment of such pipelines at the operator (i.e. at the pre-deployment phase) is ideal.
Additional Details
The validator is currently a command-line utility, so its core components will need to be split into a Python package that the operator can pull in (probably at build time, like the base node does with the manager). We'd then want to run all the checks before deploying a pipeline.
If the validator encounters errors, we will need a sensible way to report them. Operator logs probably aren't sufficient... we will likely need some sort of annotation on the pipeline resource in k8s so that it is obvious that the pipeline has failed / been rejected for validation issues, and for what specific reason.
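One possible shape for the annotation idea, sketched in Python. The annotation keys and the function name `build_validation_patch` are hypothetical placeholders, not anything KDP defines today; the function only builds a merge-patch body and does not talk to the cluster.

```python
import json

# Hypothetical annotation keys; the operator may settle on different names.
STATUS_KEY = "kdp.example.com/validation-status"
ERRORS_KEY = "kdp.example.com/validation-errors"


def build_validation_patch(errors):
    """Build a merge-patch body recording validation results as annotations."""
    status = "rejected" if errors else "passed"
    return {
        "metadata": {
            "annotations": {
                STATUS_KEY: status,
                # Annotation values must be strings, so the list is JSON-encoded.
                ERRORS_KEY: json.dumps(list(errors)),
            }
        }
    }
```

Assuming the operator uses the official Kubernetes Python client, a body like this could be passed to `CustomObjectsApi.patch_namespaced_custom_object` for the pipeline resource, which would make the rejection and its reason visible via `kubectl describe`.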
Another note: because the operator can only see a pipeline after a valid custom resource has been pushed to the cluster, we don't need to worry about any of the JSON/YAML schema validation that is present in the command-line tool. K8s will reject the pipeline definition if it is invalidly formatted or doesn't match the CRD schema.