Add proposal for Shard Autoscaling #5961

ArthurSens · 2023-10-03T16:49:51Z

Description

Add proposal for automated sharding

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the boxes that apply.

CHANGE (fix or feature that would cause existing functionality to not work as expected)
FEATURE (non-breaking change which adds functionality)
BUGFIX (non-breaking change which fixes an issue)
ENHANCEMENT (non-breaking change which improves existing functionality)
NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

ArthurSens · 2023-10-03T16:50:59Z

@nicolastakashi your review would also be nice

Documentation/proposals/202310-automated-sharding.md

xiu

Quick pass, I like it!

Documentation/proposals/202310-automated-sharding.md

simonpasquier

🥳 this is a great start!

Documentation/proposals/202310-automated-sharding.md

simonpasquier · 2023-10-05T08:30:23Z

Documentation/proposals/202310-automated-sharding.md

+* Prometheus can be shutdown immediately without data loss.
+
+***Disadvantages:***
+* Loading TSDB Blocks into memory is expensive, requiring Prometheus-Operator to run with big Memory/CPU requests.


Would the TSDB blocks be loaded by the operator or by Prometheus?

I believe the operator since it can easily identify all Prometheus endpoints? (it has the RBAC permissions)

Thinking again about this strategy, this could be done in batches to reduce resource consumption 🤔

simonpasquier · 2023-10-05T08:34:10Z

Documentation/proposals/202310-automated-sharding.md

+***Disadvantages:***
+* Loading TSDB Blocks into memory is expensive, requiring Prometheus-Operator to run with big Memory/CPU requests.
+
+### Snapshot & Upload on shutdown


I think that it might be a viable alternative for folks using Prometheus agent + Thanos as their main metrics storage.

Prometheus Server with Thanos Sidecar as well, no?

You mean Thanos sidecar with object upload?

I'm not familiar enough but you might still want to keep the pods around for the retention period in case you have local alerting?

simonpasquier · 2023-10-05T08:36:12Z

Documentation/proposals/202310-automated-sharding.md

+
+When working with CRDs, common HPAs (e.g. Keda) depends on a Status subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). Unfortunately, Prometheus-Operator resources still don't implement this field. There was an attempt to implement this in https://github.com/prometheus-operator/prometheus-operator/pull/4735, but was reverted because the PR was scaling replicas instead of shards. Although scaling replicas can help with High-Availability, it is expensive and hard to manage since it duplicates scrapes while not reducing the load on top of Prometheus because all replicas still scrape the same targets. Sharding serves as a better autoscaling strategy since the load is spread into all instances instead of getting duplicated.
+
+With only this change, Prometheus Agents can already be horizontally scaled without problems, but for Prometheus Servers it gets a little more complicated.


I think that we still need to account for proper shutdown of the agents and ensure that the deletion happens only after the agent has forwarded all metrics to the remote write endpoints. It might be worth having a dedicated section on it. WDYT?

I had the impression that Prometheus Agent would still try to finish the remote-write queue when receiving SIGTERM. I'll need to investigate

Let's start with this assumption. We can adjust later if needed.

I'm not sure that Prometheus (server or agent) has extra synchronization between the scrape manager, rule manager and the remote writer. E.g. it might be possible to ingest samples into the WAL while the remote write queues have already stopped.
But it can be addressed after we have a first implementation. Maybe the scale down should be 2 steps?

1. update the scrape config so all targets move to the other shards.
2. tear down the shards in excess.
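The two-step scale-down suggested above can be sketched roughly as follows. This is only an illustration, not the operator's code: `shardFor` and `assign` are hypothetical helpers, FNV-1a is used as a stand-in for whatever hash the real `hashmod` relabeling applies, and the target addresses are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor returns the shard that owns a target when `shards` shards exist.
// Assumption: FNV-1a stands in for the hash used by the real relabeling rule.
func shardFor(target string, shards uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(target))
	return h.Sum64() % shards
}

// assign groups targets by the shard that owns them.
func assign(targets []string, shards uint64) map[uint64][]string {
	out := make(map[uint64][]string)
	for _, target := range targets {
		s := shardFor(target, shards)
		out[s] = append(out[s], target)
	}
	return out
}

func main() {
	// Hypothetical targets, currently spread over 3 shards (0, 1 and 2).
	targets := []string{"10.0.0.1:9100", "10.0.0.2:9100", "10.0.0.3:9100", "10.0.0.4:9100"}

	// Step 1: regenerate the scrape configuration for the smaller shard count,
	// so that every target is owned by a shard that will survive.
	after := assign(targets, 2)
	for shard, owned := range after {
		fmt.Printf("shard %d scrapes %v\n", shard, owned)
	}

	// Step 2: only now tear down the shard in excess (shard 2),
	// which no longer owns any target.
	fmt.Println("targets left on shard 2:", len(after[2]))
}
```

Because the modulus is applied last, no target can map to an index at or above the new shard count, which is what makes it safe to remove the excess shards once the new configuration is live.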

ArthurSens · 2023-10-05T18:07:57Z

Thanks everyone for all the reviews so far! I've addressed as many comments as I could, ready for another round :)

bwplotka · 2023-10-10T11:27:15Z

Related issue BTW that you will hit: prometheus/prometheus#12941

ArthurSens · 2023-10-31T18:06:14Z

Proposal updated with information about:

* Prometheus Agent scale down strategy
* How do we scale up Prometheus servers after a scale down

ArthurSens · 2023-12-20T21:47:23Z

Is this good to merge? Any doubts or change requests? :)

simonpasquier · 2023-12-22T13:12:21Z

Documentation/proposals/202310-automated-sharding.md

+* Eventual cardinality explosions have a big blast radius if a single Prometheus instance is responsible for scraping the majority of monitored applications.
+* Recovering from a crash takes several minutes due to the WAL replay.
+
+Meanwhile, another strategy would be to use Horizontal Pod Autoscalers (HPAs) with Prometheus statefulsets by increasing the number of replicas. However, arguably it makes things more complicated:


(suggestion) do we need to keep this part? It should be obvious that increasing the number of replicas (not shards) doesn't solve the scalability issue. I'd rather see a brief explanation about why the current sharding doesn't work either.

While the Prometheus and PrometheusAgent CRDs implement sharding for scrape configurations, there are still pending issues and missing features that prevent adoption.

simonpasquier · 2023-12-22T13:33:15Z

Documentation/proposals/202310-automated-sharding.md

+
+Prometheus Agents are different than servers since queries are not available in this mode. Their only responsibility is scraping metrics and pushing them via remote-write to a long-term storage backend, making the scale-down experience much easier to handle.
+
+When receiving the SIGTERM signal, the Prometheus Agent will gracefully handle the signal by finishing all remote-write queues before ending the process. [When terminating a Pod, Kubernetes sends the SIGTERM signal and, by default, waits 30 seconds for all containers in the pod to finish their processes](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination). Thirty seconds is usually plenty of time to finish the remote-write queues, but the PrometheusAgent CRD will be extended to allow changing the Pod's `terminationGracePeriodSeconds` field so bigger or unstable instances can be shut down without data loss.


The operator sets a default termination grace period of 600s, which should be more than enough. If really needed, users can use a strategic merge patch to adjust the value.

ah good reminder! I'll rephrase this part
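To make the interplay between SIGTERM and the grace period concrete, here is a small sketch. It does not reproduce how Prometheus shuts down its remote-write queues; `drainWithin` is a hypothetical helper that simply keeps sending until either the queue is empty or the grace period is over, which is the point at which Kubernetes would kill the container.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// drainWithin sends pending samples until none are left or the grace period
// expires, and returns how many samples were left unsent.
func drainWithin(grace time.Duration, pending []string, send func(string)) int {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	for i, sample := range pending {
		select {
		case <-ctx.Done():
			// Grace period over: the container would be killed at this point.
			return len(pending) - i
		default:
			send(sample)
		}
	}
	return 0
}

func main() {
	// 600s is the operator default mentioned in the comment above.
	grace := 600 * time.Second
	pending := []string{"sample-1", "sample-2", "sample-3"}
	left := drainWithin(grace, pending, func(s string) { fmt.Println("sent", s) })
	fmt.Println("unsent samples:", left)
}
```

Anything still unsent when the deadline passes is lost, which is why a longer grace period matters for big or unstable instances.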

simonpasquier · 2023-12-22T13:34:56Z

Documentation/proposals/202310-automated-sharding.md

+
+***Disadvantages:***
+* Loading TSDB Blocks into memory is expensive, requiring Prometheus-Operator to run with big Memory/CPU requests.
+* Complex and hard to coordinate workflow. (Require restarts to reload TSDB blocks, hard to identify possible corruptions)


I'm also not sure how the data would be transferred. Even with persistent storage, it would be intricate to attach the existing PVCs to other pods.

simonpasquier · 2023-12-22T13:36:31Z

Documentation/proposals/202310-automated-sharding.md

+
+## Snapshot & Upload on shutdown
+
+During scale-down, Prometheus-Operator could send an HTTP request to [Prometheus' Snapshot endpoint](https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot). Thanos sidecar could be extended to watch the snapshot directory and automatically upload snapshots to Object storage.


that's an interesting approach which could be complementary to the main proposal.

Totally agree, but for the sake of simplicity I'd prefer to leave this out of the main proposal for now

After we have the whole proposal working I'd move forward with this idea
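For reference, the snapshot call discussed in this thread is a plain HTTP POST against Prometheus' admin API, which has to be enabled on the server. The sketch below only builds the request; `snapshotRequest` and the host name are hypothetical, and nothing here implies the operator does this today.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// snapshotRequest builds the POST request for Prometheus' TSDB snapshot
// endpoint. baseURL is the address of one Prometheus instance.
func snapshotRequest(baseURL string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/api/v1/admin/tsdb/snapshot"
	return http.NewRequest(http.MethodPost, url, nil)
}

func main() {
	// Hypothetical address of the shard that is about to be removed.
	req, err := snapshotRequest("http://prometheus-shard-1:9090")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request (for example with `http.DefaultClient.Do(req)`) and waiting for the upload to complete would still have to happen before the pod is deleted.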

ArthurSens · 2023-12-24T10:28:21Z

Thanks for all the reviews everybody, proposal updated again!

simonpasquier · 2024-01-02T13:43:37Z

Documentation/proposals/202310-automated-sharding.md

+
+# Goals
+
+* Enable scaling of the Prometheus shards up and down via Horizontal Pod Autoscaler objects.


(nit) since we said that we don't prescribe an autoscaling implementation.

Suggested change

-* Enable scaling of the Prometheus shards up and down via Horizontal Pod Autoscaler objects.
+* Enable automatic scaling of the Prometheus shards up and down.

I had the feeling we weren't suggesting which HPA to use, but we're still focused on enabling the HPA use-case 🤔

ah ok, from your comment (#5961 (comment)) I think we had different points of view about what HPA means.

From my point of view, anything that is able to scale pods horizontally is an HPA. From your comment, I think you distinguish the native K8s HPA resource from other tools that are capable of doing the same thing. Is that correct?

I'm fine with updating the proposal to make a clearer distinction between them, just wanted to make sure I understand the problem here :)

You're correct. When I see "HPA", I tend to assume the Kubernetes Horizontal Pod Autoscaling controller.

simonpasquier · 2024-01-02T13:46:44Z

Documentation/proposals/202310-automated-sharding.md

+
+## Scale subresource
+
+When working with any resource, including CRDs, HPAs depends on a subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). This proposal suggests to implement the scale subresource for the Prometheus and PrometheusAgent CRDs. Instead of working on the "replicas" count, it will operate on the "shards" count because the purpose of scaling up (resp. down) is to distribute the same number of targets across more (resp. less) Prometheus instances.


Suggested change

-When working with any resource, including CRDs, HPAs depends on a subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). This proposal suggests to implement the scale subresource for the Prometheus and PrometheusAgent CRDs. Instead of working on the "replicas" count, it will operate on the "shards" count because the purpose of scaling up (resp. down) is to distribute the same number of targets across more (resp. less) Prometheus instances.
+When working with any resource, including custom resources, the autoscaler depends on a subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). This proposal suggests to implement the scale subresource for the Prometheus and PrometheusAgent CRDs. Instead of working on the "replicas" count, it will operate on the "shards" count because the purpose of scaling up (resp. down) is to distribute the same number of targets across more (resp. less) Prometheus instances.
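A rough sketch of what "operating on the shards count" means in practice: the number an autoscaler writes to the scale subresource lands in `shards`, while `replicas` is left alone. The types below are hypothetical, trimmed-down stand-ins and not the operator's API types.

```go
package main

import "fmt"

// PrometheusSpec is a hypothetical, trimmed-down stand-in for the CRD spec.
type PrometheusSpec struct {
	Replicas int32 // copies of each shard, for high availability
	Shards   int32 // partitions of the scrape targets
}

// Scale is a hypothetical stand-in for what an autoscaler writes to the
// scale subresource.
type Scale struct {
	Replicas int32
}

// applyScale maps the autoscaler's desired count onto shards and leaves the
// replica count untouched.
func applyScale(spec PrometheusSpec, s Scale) PrometheusSpec {
	spec.Shards = s.Replicas
	return spec
}

// totalPods returns the number of pods, given that each shard runs
// `Replicas` copies.
func totalPods(spec PrometheusSpec) int32 {
	return spec.Shards * spec.Replicas
}

func main() {
	spec := PrometheusSpec{Replicas: 2, Shards: 1}
	scaled := applyScale(spec, Scale{Replicas: 3})
	fmt.Printf("shards=%d replicas=%d pods=%d\n", scaled.Shards, scaled.Replicas, totalPods(scaled))
}
```

Scaling from one to three shards here triples the pod count, but each pod scrapes roughly a third of the targets instead of all of them.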

simonpasquier · 2024-01-02T13:52:10Z

Documentation/proposals/202310-automated-sharding.md

+
+# How
+
+Today, there are a few strategies to measure the load of Prometheus instances.


I might rephrase this part a bit and mention:

* the different autoscaling solutions that may be used: the native Kubernetes HorizontalPodAutoscaler (with resource, custom or external metrics) and Keda (https://keda.sh/docs/2.12/scalers/).
* the different indicators that may be used: CPU, RAM, rate of samples being ingested, ...
* the fact that the operator will be agnostic and should work with all options if it implements the scale subresource in the right way.
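Whatever indicator is chosen, the native HorizontalPodAutoscaler derives the desired count from the ratio between the current and the target metric value. A simplified sketch of that calculation, applied to a hypothetical "samples ingested per second per shard" metric, might look like this (the tolerance band and stabilization windows of the real controller are left out, and `desiredShards` is a made-up name):

```go
package main

import (
	"fmt"
	"math"
)

// desiredShards applies the basic autoscaling ratio:
// ceil(current * currentValue / targetValue).
func desiredShards(current int, currentValue, targetValue float64) int {
	return int(math.Ceil(float64(current) * (currentValue / targetValue)))
}

func main() {
	// Two shards each ingest 150k samples/s while the target is 100k per shard.
	fmt.Println("scale up to:", desiredShards(2, 150000, 100000))
	// Four shards each ingest 50k samples/s against the same target.
	fmt.Println("scale down to:", desiredShards(4, 50000, 100000))
}
```

The operator does not need to know any of this: as long as the scale subresource is implemented, the autoscaler only writes the resulting number.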

simonpasquier · 2024-01-02T14:01:12Z

Documentation/proposals/202310-automated-sharding.md

+
+When working with CRDs, common HPAs (e.g. Keda) depends on a Status subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). Unfortunately, Prometheus-Operator resources still don't implement this field. There was an attempt to implement this in https://github.com/prometheus-operator/prometheus-operator/pull/4735, but was reverted because the PR was scaling replicas instead of shards. Although scaling replicas can help with High-Availability, it is expensive and hard to manage since it duplicates scrapes while not reducing the load on top of Prometheus because all replicas still scrape the same targets. Sharding serves as a better autoscaling strategy since the load is spread into all instances instead of getting duplicated.
+
+With only this change, Prometheus Agents can already be horizontally scaled without problems, but for Prometheus Servers it gets a little more complicated.


I'm not sure that Prometheus (server or agent) has extra synchronization between the scrape manager, rule manager and the remote writer. E.g. it might be possible to ingest samples into the WAL while the remote write queues have already stopped.
But it can be addressed after we have a first implementation. Maybe the scale down should be 2 steps?

1. update the scrape config so all targets move to the other shards.
2. tear down the shards in excess.

Signed-off-by: Arthur Silva Sens <[email protected]>

simonpasquier

Just a few nits that I'm going to commit right away.

Documentation/proposals/202310-shard-autoscaling.md

* Add proposal for automated sharding
  Signed-off-by: Arthur Silva Sens <[email protected]>
* Update Documentation/proposals/202310-shard-autoscaling.md
* Update Documentation/proposals/202310-shard-autoscaling.md
* Update Documentation/proposals/202310-shard-autoscaling.md
* Update Documentation/proposals/202310-shard-autoscaling.md
* Update Documentation/proposals/202310-shard-autoscaling.md
* Update Documentation/proposals/202310-shard-autoscaling.md

---------

Signed-off-by: Arthur Silva Sens <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]> * chore: update Kubernetes to v1.29.2 Signed-off-by: Simon Pasquier <[email protected]> * chore: switch example app image Signed-off-by: Simon Pasquier <[email protected]> * Update go dependencies before release (prometheus-operator#6315) * Update go dependencies before release Signed-off-by: Arthur Silva Sens <[email protected]> * make generate Signed-off-by: Arthur Silva Sens <[email protected]> --------- Signed-off-by: Arthur Silva Sens <[email protected]> * docs: correct slashpai slack id Signed-off-by: Jayapriya Pai <[email protected]> * build(deps): bump github.com/prometheus/prometheus from 0.49.1 to 0.50.0 Bumps [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) from 0.49.1 to 0.50.0. - [Release notes](https://github.com/prometheus/prometheus/releases) - [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md) - [Commits](prometheus/prometheus@v0.49.1...v0.50.0) --- updated-dependencies: - dependency-name: github.com/prometheus/prometheus dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <[email protected]> * Update default Thanos version (prometheus-operator#6317) * Update default Thanos version Signed-off-by: Arthur Silva Sens <[email protected]> * Update unit tests depending on DefaultThanosVersion Signed-off-by: Arthur Silva Sens <[email protected]> --------- Signed-off-by: Arthur Silva Sens <[email protected]> * Update Default Prometheus version Signed-off-by: Arthur Silva Sens <[email protected]> * feat: adding scrape class (prometheus-operator#6199) * feat: adding scrape class Signed-off-by: Nicolas Takashi <[email protected]> * Update pkg/apis/monitoring/v1/prometheus_types.go Co-authored-by: Arthur Silva Sens <[email protected]> * Update pkg/prometheus/promcfg.go Co-authored-by: Arthur Silva Sens <[email protected]> * Update pkg/prometheus/store.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/resource_selector.go Co-authored-by: Arthur Silva Sens <[email protected]> * Update pkg/prometheus/store.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/resource_selector.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/resource_selector.go Co-authored-by: Arthur Silva Sens <[email protected]> * Update pkg/prometheus/promcfg.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/promcfg.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/server/operator.go Co-authored-by: Simon Pasquier <[email protected]> * Update pkg/prometheus/promcfg.go Co-authored-by: Simon Pasquier <[email protected]> * Update prometheus_types.go Co-authored-by: Simon Pasquier <[email protected]> --------- Signed-off-by: Nicolas Takashi <[email protected]> Co-authored-by: Arthur Silva Sens <[email protected]> Co-authored-by: Simon Pasquier <[email protected]> * fix: update kubernetes slack link Signed-off-by: Jayapriya Pai <[email protected]> * build(deps): bump github.com/prometheus/prometheus from 0.50.0 to 
0.50.1 Bumps [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) from 0.50.0 to 0.50.1. - [Release notes](https://github.com/prometheus/prometheus/releases) - [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md) - [Commits](prometheus/prometheus@v0.50.0...v0.50.1) --- updated-dependencies: - dependency-name: github.com/prometheus/prometheus dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * chore: move ProxyConfig type to v1 Related-to prometheus-operator#6301 Signed-off-by: Jayapriya Pai <[email protected]> * build(deps): bump github.com/prometheus/alertmanager Bumps [github.com/prometheus/alertmanager](https://github.com/prometheus/alertmanager) from 0.26.0 to 0.27.0. - [Release notes](https://github.com/prometheus/alertmanager/releases) - [Changelog](https://github.com/prometheus/alertmanager/blob/main/CHANGELOG.md) - [Commits](prometheus/alertmanager@v0.26.0...v0.27.0) --- updated-dependencies: - dependency-name: github.com/prometheus/alertmanager dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * update prometheus version Signed-off-by: dongjiang1989 <[email protected]> * build(deps): bump github.com/prometheus/client_golang Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.18.0 to 1.19.0. - [Release notes](https://github.com/prometheus/client_golang/releases) - [Changelog](https://github.com/prometheus/client_golang/blob/v1.19.0/CHANGELOG.md) - [Commits](prometheus/client_golang@v1.18.0...v1.19.0) --- updated-dependencies: - dependency-name: github.com/prometheus/client_golang dependency-type: direct:production update-type: version-update:semver-minor ... 
Signed-off-by: dependabot[bot] <[email protected]> * update alertmanager version Signed-off-by: dongjiang1989 <[email protected]> * [FIX] scrape class regression (prometheus-operator#6345) Signed-off-by: Nicolas Takashi <[email protected]> * build(deps): bump github.com/stretchr/testify from 1.8.4 to 1.9.0 Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.8.4 to 1.9.0. - [Release notes](https://github.com/stretchr/testify/releases) - [Commits](stretchr/testify@v1.8.4...v1.9.0) --- updated-dependencies: - dependency-name: github.com/stretchr/testify dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * AlertmanagerConfig CRD: fix MonthRange validation regex * AlertmanagerConfig CRD: improve MonthRange unit tests * docs: correct example of scrapeConfigSelector in scrapeConfig doc In the docs for scrapeConfig, the example of scrapeConfig in prometheus CR was incorrect. In prometheus CR, in scrapeConfigSelector, there should be matchLabels and then the scrapeConfig label. fixes prometheus-operator#6350 Signed-off-by: Dhruv Bindra <[email protected]> * chores: change string type to duration type (prometheus-operator#6337) * change string to duration --------- Signed-off-by: dongjiang1989 <[email protected]> * Prepare 0.72 release (prometheus-operator#6329) Signed-off-by: Arthur Silva Sens <[email protected]> * build(deps): bump golang.org/x/net from 0.21.0 to 0.22.0 Bumps [golang.org/x/net](https://github.com/golang/net) from 0.21.0 to 0.22.0. - [Commits](golang/net@v0.21.0...v0.22.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * build(deps): bump google.golang.org/protobuf from 1.32.0 to 1.33.0 Bumps google.golang.org/protobuf from 1.32.0 to 1.33.0. 
--- updated-dependencies: - dependency-name: google.golang.org/protobuf dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Bump prometheus/common Signed-off-by: Arthur Silva Sens <[email protected]> * feat: support --enable-feature argument in Alertmanager CRD (prometheus-operator#6152) * feat: support alertmanager --enable-feature argument this will expose more Alertmanager configuration parameters to users of the Alertmanager CRD. Signed-off-by: Yonatan Sasson <[email protected]> * feat: support sample_age_limit for QueueConfig (prometheus-operator#6326) * add SampleAgeLimit Signed-off-by: dongjiang1989 <[email protected]> * update by code review Signed-off-by: dongjiang1989 <[email protected]> * change string type to duration Signed-off-by: dongjiang1989 <[email protected]> * update make generate Signed-off-by: dongjiang1989 <[email protected]> * update promcfg_test unittest Signed-off-by: dongjiang1989 <[email protected]> * update some nits by code review Signed-off-by: dongjiang1989 <[email protected]> --------- Signed-off-by: dongjiang1989 <[email protected]> * feat: add bodySizeLimit to service and pod monitors (prometheus-operator#6349) * feat: add EnforcedBodySizeLimit to service and monitor * chore: bump to golangci-lint v1.56.2 (prometheus-operator#6384) * update golangci lint version --------- Signed-off-by: dongjiang1989 <[email protected]> * [fix] test * thanos: add support for web configuration to the ThanosRuler CRD (prometheus-operator#6278) * thanos: add support for web configuration to the ThanosRuler CRD This enable us to set tls for thanos ruler Fixes prometheus-operator#6157 * [CHORE] normalizing tls structs monitor objects Signed-off-by: Nicolas Takashi <[email protected]> * build(deps): bump google.golang.org/protobuf in /scripts Bumps google.golang.org/protobuf from 1.31.0 to 1.33.0. 
--- updated-dependencies: - dependency-name: google.golang.org/protobuf dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * build(deps): bump google.golang.org/protobuf in /pkg/client Bumps google.golang.org/protobuf from 1.32.0 to 1.33.0. --- updated-dependencies: - dependency-name: google.golang.org/protobuf dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * Documentation: adding link for other supported service discoveries prometheus-operator#6382 (prometheus-operator#6391) * build(deps): bump the k8s-libs group with 5 updates Bumps the k8s-libs group with 5 updates: | Package | From | To | | --- | --- | --- | | [k8s.io/api](https://github.com/kubernetes/api) | `0.29.2` | `0.29.3` | | [k8s.io/apiextensions-apiserver](https://github.com/kubernetes/apiextensions-apiserver) | `0.29.2` | `0.29.3` | | [k8s.io/apimachinery](https://github.com/kubernetes/apimachinery) | `0.29.2` | `0.29.3` | | [k8s.io/client-go](https://github.com/kubernetes/client-go) | `0.29.2` | `0.29.3` | | [k8s.io/component-base](https://github.com/kubernetes/component-base) | `0.29.2` | `0.29.3` | Updates `k8s.io/api` from 0.29.2 to 0.29.3 - [Commits](kubernetes/api@v0.29.2...v0.29.3) Updates `k8s.io/apiextensions-apiserver` from 0.29.2 to 0.29.3 - [Release notes](https://github.com/kubernetes/apiextensions-apiserver/releases) - [Commits](kubernetes/apiextensions-apiserver@v0.29.2...v0.29.3) Updates `k8s.io/apimachinery` from 0.29.2 to 0.29.3 - [Commits](kubernetes/apimachinery@v0.29.2...v0.29.3) Updates `k8s.io/client-go` from 0.29.2 to 0.29.3 - [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md) - [Commits](kubernetes/client-go@v0.29.2...v0.29.3) Updates `k8s.io/component-base` from 0.29.2 to 0.29.3 - [Commits](kubernetes/component-base@v0.29.2...v0.29.3) --- updated-dependencies: - dependency-name: k8s.io/api dependency-type: direct:production update-type: version-update:semver-patch dependency-group: 
k8s-libs - dependency-name: k8s.io/apiextensions-apiserver dependency-type: direct:production update-type: version-update:semver-patch dependency-group: k8s-libs - dependency-name: k8s.io/apimachinery dependency-type: direct:production update-type: version-update:semver-patch dependency-group: k8s-libs - dependency-name: k8s.io/client-go dependency-type: direct:production update-type: version-update:semver-patch dependency-group: k8s-libs - dependency-name: k8s.io/component-base dependency-type: direct:production update-type: version-update:semver-patch dependency-group: k8s-libs ... Signed-off-by: dependabot[bot] <[email protected]> * Regenerate documentation Signed-off-by: Simon Pasquier <[email protected]> * chore: add slashpai as v0.73 release shepherd Signed-off-by: Jayapriya Pai <[email protected]> * [fix] test * Add extra relabelings to scrape classes (prometheus-operator#6379) * [fix] - message * Controller id implementation to avoid errors with multiple operators (prometheus-operator#6319) Signed-off-by: Mario Fernandez <[email protected]> Co-authored-by: Simon Pasquier <[email protected]> * fix: enqueue in updating secret * Add extra context to warnings (prometheus-operator#6410) * Add extra context to warnings * fix spread operator * ScrapeConfig CRD: refactor ProxyConfig struct embedding (prometheus-operator#6401) * ScrapeConfig CRD: refactor ProxyConfig embedding to v1.ProxyConfig instead of *v1.ProxyConfig * Documentation: Remove experimental tag from sharding option in Prometheus CRD (prometheus-operator#6409) This commit changes the docs so that future prometheus operator users know the option is out of experimental. The sharding option is used for many years by multiple contributers. 
Co-authored-by: Gijs Entius <[email protected]> * Update MAINTAINERS.md (prometheus-operator#6413) Moving from Nicolas from triage to Maintainer * build(deps): bump github.com/prometheus/common from 0.50.0 to 0.51.0 Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.50.0 to 0.51.0. - [Release notes](https://github.com/prometheus/common/releases) - [Commits](prometheus/common@v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: github.com/prometheus/common dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * build(deps): bump dependabot/fetch-metadata from 1 to 2 (prometheus-operator#6420) Bumps [dependabot/fetch-metadata](https://github.com/dependabot/fetch-metadata) from 1 to 2. - [Release notes](https://github.com/dependabot/fetch-metadata/releases) - [Commits](dependabot/fetch-metadata@v1...v2) --- updated-dependencies: - dependency-name: dependabot/fetch-metadata dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Signed-off-by: deterclosed <[email protected]> chore: remove repetitive words Signed-off-by: deterclosed <[email protected]> * build(deps): bump github.com/prometheus/common from 0.51.0 to 0.51.1 Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.51.0 to 0.51.1. - [Release notes](https://github.com/prometheus/common/releases) - [Commits](prometheus/common@v0.51.0...v0.51.1) --- updated-dependencies: - dependency-name: github.com/prometheus/common dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> * Document PodMonitor, Probe and Thanos sidecar as stable These 3 features have been there for so long that we agreed to remove the experimental warning on them. 
This commit also makes the wording more consistent for all fields which are still considered experimental (either from the operator standpoint or from Prometheus standpoint). Signed-off-by: Simon Pasquier <[email protected]> * Fix: ScrapeConfigs Selection Issue Across Different Namespaces (prometheus-operator#6390) * Add testing for scrapeconfig and prometheus CR in different namespaces * fix: wrap panic for scheme * build(deps): bump github.com/distribution/reference from 0.5.0 to 0.6.0 Bumps [github.com/distribution/reference](https://github.com/distribution/reference) from 0.5.0 to 0.6.0. - [Release notes](https://github.com/distribution/reference/releases) - [Commits](distribution/reference@v0.5.0...v0.6.0) --- updated-dependencies: - dependency-name: github.com/distribution/reference dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> * Docs: Include Local Deployment in CONTRIBUTING.md (prometheus-operator#6388) * Update CONTRIBUTING.md to include local deployment section * Bump prometheus to 0.51.1 Signed-off-by: Arthur Silva Sens <[email protected]> * Kubelet: Add a flag to set node address priority (prometheus-operator#6377) * Add a flag to set node address priority Currently internal node addresses are prioritized over external addresses. This adds a flag to allow users to freely set node address priority (internal/external). This is helpful for use cases where node internal addresses exist but are not routable. 
Fixes prometheus-operator#3247 * Refactor operators event handler (prometheus-operator#6416) * chore: Refactor controller's event handler to reduce code duplication Signed-off-by: Mohammad Jamshidi <[email protected]> --------- Signed-off-by: Mohammad Jamshidi <[email protected]> * Fixed the link in Prometheus Agent page * feat(xds): Add support eureka service discovery to the ScrapeConfig CRD (prometheus-operator#6408) feat: add eureka sd config Signed-off-by: dongjiang1989 <[email protected]> * Update http_sd description for clarity (prometheus-operator#6454) * chore: bump go dependencies before release Signed-off-by: Jayapriya Pai <[email protected]> * chore: update default prometheus version Signed-off-by: Jayapriya Pai <[email protected]> * Update kakkoyun's affiliation * feat(xds): Add Kuma service discovery to the ScrapeConfig CRD (prometheus-operator#6327) * support kuma xds Signed-off-by: dongjiang1989 <[email protected]> * Add DockerSD support for ScrapeConfig CRD Add DockerSDConfig struct and array of DockerSDConfig to the ScrapeConfig struct Add code block placeholder to process DockerSDConfig Add Code-gen for the updated scrapeconfig with DockerSDConfig Revert "Add Code-gen for the updated scrapeconfig with DockerSDConfig" This reverts commit f7d2ff9. 
Edit DockerSDConfig struct Add Code-gen for the updated DockerSDConfig Add processing code block for DockerSDConfig in the ScrapeConfig CRD Update processing filters in DockerSDConfig Add tests for DockerSDConfig Add missing host field to DockerSDConfig struct and remove TODOs in promcfg.go Update promcfg.go to append host field to the DockerSDConfiguration Update autogen code and perform formatting fixes Update DockerSD tests to include Host field Add resource_selector validation and tests for Docker SD configs Update tests according to host variable, tests pass Add DockerFilter type for the filters field in DockerSDConfigs Add code-gen for DockerFilter type update Update promcfg test and test data for DockerFilter type Update DockerFilter Format code Update pkg/apis/monitoring/v1alpha1/scrapeconfig_types.go Co-authored-by: Jayapriya Pai <[email protected]> Add validation for host field Add relevant comments and remove unrelated debug code Code-gen and format code Revert "Change git mod file" This reverts commit 232816f. Change from pointer to ProxyConfig to variable reference Generate Code and Format Format code Refactor test cases for Docker SD One test case each for OAuth, BasicAuth and Authorization fields. Also includes other fields like TLSConfig, hostnetworkinghost etc. Format code * feat: added a check to determine if thanos support the '--prometheus.http-client' flag (prometheus-operator#6448) * feat: added a check to determine if thanos support the '--prometheus.http-client' flag * Check if controllers' CRDs are provided and manageable by operator (prometheus-operator#6351) * operator cmd: check if controllers' crds are supplied Only start each controller when its crd is provided, and fail the operator if no controllers start. 
Fixes prometheus-operator#6140 * Nit * Resolve reviews * chore: Add checks for selectors in KubernetesSDConfig (prometheus-operator#6177) chore: test added rfac: kubernetes sd role chore: cofig.Role to lowercase rfac: unit_test role_consts * fix: add proxyURL validation for smon,pmon and probe If a user specify a non-parsable proxyUrl it was not validated/rejected but will break reloading and restarting of Prometheus due to possible invalid syntax. This commit adds validation and rejects the invalid ones Signed-off-by: Jayapriya Pai <[email protected]> * feat: add support for Hetzner SD in ScrapeConfig CRD (prometheus-operator#6436) * ScrapeConfig CRD: add HetznerSDConfig API definition & include it under ScrapeConfig spec * feat(kuma): Add validation for kuma server (prometheus-operator#6465) Signed-off-by: dongjiang1989 <[email protected]> Co-authored-by: Jayapriya Pai <[email protected]> * relabel config: allow empty separator Allow empty separator in relabel config. This is corresponding to Prometheus' relabel config. Fixes prometheus-operator#5003 * chore: cut v0.73.0 Signed-off-by: Jayapriya Pai <[email protected]> * fix: log deprecated bearer token fields at debug level Signed-off-by: Simon Pasquier <[email protected]> * chore: cut v0.73.1 Signed-off-by: Jayapriya Pai <[email protected]> * fix: register k8s metrics controller-runtime also calls `metrics.Register()` during init and this function can be called only once. To ensure that the k8s client metrics get updated, the global variables need to be set again by the operator. 
https://github.com/kubernetes-sigs/controller-runtime/blob/67b27f27e514bd9ac4cf9a2d84dec089ece95bf7/pkg/metrics/client_go_adapter.go#L42-L55 https://github.com/kubernetes/client-go/blob/aa7909e7d7c0661792ba21b9e882f3cd6ad0ce53/tools/metrics/metrics.go#L129-L170 Signed-off-by: Simon Pasquier <[email protected]> * fix: ScrapeClass TLSConfig nil pointer exception (prometheus-operator#6507) Signed-off-by: Simon Pasquier <[email protected]> * chore:cut v0.73.2 Signed-off-by: Jayapriya Pai <[email protected]> Co-authored-by: Simon Pasquier <[email protected]> * Fix errors and go versions in build scripts Signed-off-by: Coleen Iona Quadros <[email protected]> * Run make --always-make format generate Signed-off-by: Coleen Iona Quadros <[email protected]> * lint Signed-off-by: Coleen Iona Quadros <[email protected]> * remove duplicate code Signed-off-by: Coleen Iona Quadros <[email protected]> --------- Signed-off-by: Arthur Silva Sens <[email protected]> Signed-off-by: Simon Pasquier <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Jayapriya Pai <[email protected]> Signed-off-by: Pranshu Srivastava <[email protected]> Signed-off-by: dongjiang1989 <[email protected]> Signed-off-by: adinhodovic <[email protected]> Signed-off-by: machine424 <[email protected]> Signed-off-by: Nicolas Takashi <[email protected]> Signed-off-by: Dhruv Bindra <[email protected]> Signed-off-by: Yonatan Sasson <[email protected]> Signed-off-by: Mario Fernandez <[email protected]> Signed-off-by: deterclosed <[email protected]> Signed-off-by: Mohammad Jamshidi <[email protected]> Signed-off-by: Coleen Iona Quadros <[email protected]> Co-authored-by: Simon Pasquier <[email protected]> Co-authored-by: Arthur Silva Sens <[email protected]> Co-authored-by: Hervé Nicol <[email protected]> Co-authored-by: Herve Nicol <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] 
<41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Jayapriya Pai <[email protected]> Co-authored-by: Pranshu Srivastava <[email protected]> Co-authored-by: dongjiang <[email protected]> Co-authored-by: Arthur Silva Sens <[email protected]> Co-authored-by: adinhodovic <[email protected]> Co-authored-by: Jimmy Zelinskie <[email protected]> Co-authored-by: Sam Kirsch <[email protected]> Co-authored-by: Michael Borens <[email protected]> Co-authored-by: Prajjwal <[email protected]> Co-authored-by: DeamonMV <[email protected]> Co-authored-by: Ayoub Mrini <[email protected]> Co-authored-by: machine424 <[email protected]> Co-authored-by: Nicolas Takashi <[email protected]> Co-authored-by: Mouad Elhaouari <[email protected]> Co-authored-by: Dhruv Bindra <[email protected]> Co-authored-by: Yonatan Sasson <[email protected]> Co-authored-by: Mohammad <[email protected]> Co-authored-by: Helia Barroso <[email protected]> Co-authored-by: Seriki Ayodele <[email protected]> Co-authored-by: Quentin Bisson <[email protected]> Co-authored-by: Mario Fernandez Herrero <[email protected]> Co-authored-by: Mouad Elhaouari <[email protected]> Co-authored-by: Gijs Entius <[email protected]> Co-authored-by: Gijs Entius <[email protected]> Co-authored-by: deterclosed <[email protected]> Co-authored-by: M Viswanath Sai <[email protected]> Co-authored-by: googs1025 <[email protected]> Co-authored-by: Ha Anh Vu <[email protected]> Co-authored-by: Ashwin <[email protected]> Co-authored-by: Pavan Gudiwada <[email protected]> Co-authored-by: Kemal Akkoyun <[email protected]> Co-authored-by: mviswanathsai <[email protected]> Co-authored-by: Matheus Sousa <[email protected]> Co-authored-by: yash <[email protected]> Co-authored-by: haanhvu <[email protected]>

ArthurSens requested a review from a team as a code owner October 3, 2023 16:49

pull-request-size bot added the size/L label Oct 3, 2023

ArthurSens force-pushed the sharding-proposal branch 2 times, most recently from 7c4c748 to 01d1c3f Compare October 3, 2023 17:29

ArthurSens mentioned this pull request Oct 3, 2023

Add scale subresource to Prometheus/PrometheusAgent #5962

Merged

5 tasks

nicolastakashi reviewed Oct 3, 2023

nicolastakashi reviewed Oct 3, 2023

nicolastakashi reviewed Oct 3, 2023

nicolastakashi reviewed Oct 4, 2023

xiu reviewed Oct 4, 2023

simonpasquier reviewed Oct 5, 2023

nicolastakashi reviewed Oct 5, 2023

ArthurSens changed the title ~~Add proposal for automated sharding~~ Add proposal for Shard Autoscaling Oct 5, 2023

ArthurSens force-pushed the sharding-proposal branch from 01d1c3f to f8c10c6 Compare October 5, 2023 18:06

ArthurSens force-pushed the sharding-proposal branch from f8c10c6 to cc5c9a3 Compare October 31, 2023 18:04

ArthurSens force-pushed the sharding-proposal branch 2 times, most recently from 3853fa1 to 8810509 Compare October 31, 2023 18:21

nicolastakashi approved these changes Dec 21, 2023

simonpasquier reviewed Dec 22, 2023

ArthurSens force-pushed the sharding-proposal branch from 8810509 to 2d95791 Compare December 24, 2023 10:28

ArthurSens force-pushed the sharding-proposal branch from 2d95791 to 3940680 Compare December 24, 2023 10:33

simonpasquier reviewed Jan 3, 2024

Add proposal for automated sharding

dd6afa5

Signed-off-by: Arthur Silva Sens <[email protected]>

ArthurSens force-pushed the sharding-proposal branch from 3940680 to dd6afa5 Compare January 3, 2024 20:07

simonpasquier approved these changes Jan 5, 2024

simonpasquier added 6 commits January 5, 2024 14:46

Update Documentation/proposals/202310-shard-autoscaling.md

6ac5e3b

Update Documentation/proposals/202310-shard-autoscaling.md

005af70

Update Documentation/proposals/202310-shard-autoscaling.md

337c1ad

Update Documentation/proposals/202310-shard-autoscaling.md

0a83c9d

Update Documentation/proposals/202310-shard-autoscaling.md

0e148d7

Update Documentation/proposals/202310-shard-autoscaling.md

73f5e05

simonpasquier enabled auto-merge (squash) January 5, 2024 13:47

ArthurSens disabled auto-merge January 6, 2024 19:40

ArthurSens merged commit 48d3604 into main Jan 6, 2024

ArthurSens deleted the sharding-proposal branch January 6, 2024 19:41

simonpasquier mentioned this pull request Apr 9, 2024

[meta] Enable zone-aware sharding for Prometheus/PrometheusAgent #6437

Open


		When working with CRDs, common horizontal autoscalers (e.g. KEDA) depend on a subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). Unfortunately, Prometheus-Operator resources don't implement this subresource yet. There was an attempt to implement it in https://github.com/prometheus-operator/prometheus-operator/pull/4735, but it was reverted because the PR scaled replicas instead of shards. Although scaling replicas can help with high availability, it is expensive and hard to manage: every replica scrapes the same targets, so scrapes are duplicated and the load on each Prometheus instance is not reduced. Sharding is a better autoscaling strategy because the load is spread across all instances instead of being duplicated.
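To make the replicas-versus-shards argument concrete, here is a minimal Go sketch. It assumes targets are assigned to shards by hashing the target address modulo the shard count. The FNV hash and the `shardFor` and `targetsPerShard` helpers are hypothetical stand-ins chosen for illustration, not the operator's actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a target address to a shard index in [0, shards).
// FNV-1a is an illustrative choice, not necessarily the hash Prometheus uses.
// shards must be greater than zero.
func shardFor(target string, shards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(target))
	return h.Sum32() % shards
}

// targetsPerShard counts how many targets each shard would scrape.
func targetsPerShard(targets []string, shards uint32) []int {
	counts := make([]int, shards)
	for _, t := range targets {
		counts[shardFor(t, shards)]++
	}
	return counts
}

func main() {
	targets := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		targets = append(targets, fmt.Sprintf("10.0.%d.%d:9100", i/250, i%250))
	}
	// Replicas: every replica scrapes all targets, so per-instance load is unchanged.
	fmt.Println("targets per replica:", len(targets))
	// Shards: each target belongs to exactly one shard, so the load is split.
	fmt.Println("targets per shard:", targetsPerShard(targets, 4))
}
```

Adding a replica leaves the per-instance target count at 1000, while adding a shard lowers it, which is why the proposal scales shards.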

		With only this change, Prometheus Agents can already be horizontally scaled without problems, but for Prometheus Servers it gets a little more complicated.


		Prometheus Agents differ from servers in that queries are not available in agent mode. Their only responsibility is scraping metrics and pushing them via remote-write to a long-term storage backend, which makes the scale-down experience much easier to handle.

		When receiving the SIGTERM signal, the Prometheus Agent will gracefully handle the signal by finishing all remote-write queues before ending the process. [When terminating a Pod, Kubernetes sends the SIGTERM signal and, by default, waits 30 seconds for all containers in the pod to finish their processes](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination). Thirty seconds is usually plenty of time to finish the remote-write queues, but the PrometheusAgent CRD will be extended to allow changing the Pod's `terminationGracePeriodSeconds` field so bigger or unstable instances can be shut down without data loss.
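
The interaction between the grace period and the remote-write queue can be sketched as follows. This is a toy model under stated assumptions, not the Prometheus Agent's shutdown code: `drainWithin`, the string-based queue, and the `send` callback are hypothetical names, and the grace period stands in for `terminationGracePeriodSeconds`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// drainWithin sends queued samples until the queue is empty or the grace
// period (in seconds, like terminationGracePeriodSeconds) has elapsed.
// It returns the number of samples that were not sent in time.
func drainWithin(graceSeconds int64, queue []string, send func(string)) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(graceSeconds)*time.Second)
	defer cancel()
	for i, sample := range queue {
		select {
		case <-ctx.Done():
			// Grace period is over: everything from index i onward is lost.
			return len(queue) - i
		default:
		}
		send(sample)
	}
	return 0
}

func main() {
	// Hypothetical pending samples at the moment SIGTERM arrives.
	queue := []string{"sample-1", "sample-2", "sample-3"}
	unsent := drainWithin(30, queue, func(s string) { fmt.Println("sent", s) })
	fmt.Println("unsent samples:", unsent)
}
```

A slow or unstable backend makes each `send` take longer, which is the case where a larger grace period avoids dropping the tail of the queue.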


		## Snapshot & Upload on shutdown

		During scale-down, Prometheus-Operator could send an HTTP request to [Prometheus' Snapshot endpoint](https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot). The Thanos sidecar could be extended to watch the snapshot directory and automatically upload snapshots to object storage.
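
A minimal sketch of the snapshot request, assuming the admin API is enabled on the Prometheus instance (`--web.enable-admin-api`). `takeSnapshot` and `fakePrometheus` are hypothetical helpers; the stand-in server only exists so the example runs without a real Prometheus, and its response mirrors the example shown in the Prometheus API documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// snapshotResponse is the JSON body returned by the snapshot endpoint.
type snapshotResponse struct {
	Status string `json:"status"`
	Data   struct {
		Name string `json:"name"`
	} `json:"data"`
}

// takeSnapshot asks Prometheus to snapshot its TSDB and returns the name of
// the snapshot directory created under the data directory.
func takeSnapshot(baseURL string) (string, error) {
	resp, err := http.Post(baseURL+"/api/v1/admin/tsdb/snapshot", "", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("snapshot request failed: %s", resp.Status)
	}
	var out snapshotResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if out.Status != "success" {
		return "", fmt.Errorf("unexpected status %q", out.Status)
	}
	return out.Data.Name, nil
}

// fakePrometheus is a stand-in server used only to make the sketch runnable.
func fakePrometheus() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/api/v1/admin/tsdb/snapshot" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"success","data":{"name":"20171210T211224Z-2be650b6d019eb54"}}`)
	}))
}

func main() {
	srv := fakePrometheus()
	defer srv.Close()
	name, err := takeSnapshot(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("snapshot directory:", name)
}
```

The returned name is what a sidecar would need to locate the snapshot on disk before uploading it.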


		# Goals

		* Enable scaling of the Prometheus shards up and down via Horizontal Pod Autoscaler objects.
		* Enable automatic scaling of the Prometheus shards up and down.


		## Scale subresource

		When working with any resource, including CRDs, HPAs depend on a subresource called [Scale](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource). This proposal suggests implementing the scale subresource for the Prometheus and PrometheusAgent CRDs. Instead of working on the "replicas" count, it will operate on the "shards" count, because the purpose of scaling up (resp. down) is to distribute the same number of targets across more (resp. fewer) Prometheus instances.
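
One way this could look in the API types is sketched below, using the kubebuilder `subresource:scale` marker. The types are simplified stand-ins for the real CRD structs; `spec.shards` and `spec.replicas` exist today, while the `.status.shards` and `.status.selector` paths and the `scaleTo` helper are assumptions made for illustration.

```go
package main

import "fmt"

// PrometheusSpec is a simplified stand-in for the real spec.
type PrometheusSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
	Shards   *int32 `json:"shards,omitempty"`
}

// PrometheusStatus holds the fields the scale subresource would read.
// Both field names are assumptions, not confirmed by the proposal text.
type PrometheusStatus struct {
	Shards   int32  `json:"shards"`
	Selector string `json:"selector"`
}

// Prometheus maps the scale subresource onto shards rather than replicas.
// +kubebuilder:subresource:scale:specpath=.spec.shards,statuspath=.status.shards,selectorpath=.status.selector
type Prometheus struct {
	Spec   PrometheusSpec   `json:"spec"`
	Status PrometheusStatus `json:"status"`
}

// scaleTo mimics what a write through /scale would change with the marker
// above: only spec.shards is updated, spec.replicas is left alone.
func scaleTo(p *Prometheus, shards int32) {
	p.Spec.Shards = &shards
}

func main() {
	replicas := int32(2)
	p := &Prometheus{Spec: PrometheusSpec{Replicas: &replicas}}
	scaleTo(p, 4)
	fmt.Println("replicas:", *p.Spec.Replicas, "shards:", *p.Spec.Shards)
}
```

With this mapping, an HPA that raises the scale target increases the number of shards while the per-shard replica count chosen for High-Availability stays as configured.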


		# How

		Today, there are a few strategies to measure the load of Prometheus instances.

ArthurSens commented Oct 5, 2023

bwplotka commented Oct 10, 2023

ArthurSens commented Oct 31, 2023

ArthurSens commented Dec 20, 2023

ArthurSens commented Dec 24, 2023