CNF-10142: Enable NROP metrics to be scraped securely by Prometheus #1007

rbaturov · 2024-09-12T10:47:03Z

This PR contains all the changes required to reintegrate NROP metrics with Prometheus. It introduces a kube-rbac-proxy sidecar to establish a secure communication channel. Most of the changes follow this guide:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

The only difference from the guide is that Prometheus authenticates with a bearer token (bearerTokenFile) instead of a client certificate (tls.crt and tls.key). The connection is still encrypted with TLS, but this is not mTLS.

A follow-up PR will implement the same for RTE metrics.
I will also open a PR adding an e2e test for this functionality to the CI.

To validate that this PR works correctly, follow these steps:

1. Build and push the operator image: `make docker-build docker-push`
2. Deploy the operator: `make deploy`
3. Open a shell in one of the Prometheus pods: `oc exec -it prometheus-k8s-0 -n openshift-monitoring -- /bin/bash`
4. From that shell, run:

        curl -v \
          --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt \
          -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          https://numaresources-controller-manager-metrics-service.numaresources.svc:8443/metrics
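For readers who want to script the same check, here is a minimal Go sketch of what the curl command does: trust only the service CA bundle, send the service account token as a bearer token, and fetch `/metrics`. The paths and URL are copied from the command above. The helper names `newTLSConfig` and `newScrapeRequest` are hypothetical and are not part of this PR.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// Paths and URL are taken from the curl command in the PR description.
const (
	caBundlePath = "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt"
	tokenPath    = "/var/run/secrets/kubernetes.io/serviceaccount/token"
	metricsURL   = "https://numaresources-controller-manager-metrics-service.numaresources.svc:8443/metrics"
)

// newTLSConfig (hypothetical helper) trusts only the CAs found in caPEM,
// which mirrors curl's --cacert. No client certificate is set, so this is
// server-authenticated TLS, not mTLS.
func newTLSConfig(caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid certificates found in CA bundle")
	}
	return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}, nil
}

// newScrapeRequest (hypothetical helper) builds a GET request that carries
// the token in the Authorization header.
func newScrapeRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(token))
	return req, nil
}

func run() error {
	caPEM, err := os.ReadFile(caBundlePath)
	if err != nil {
		return err
	}
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return err
	}
	tlsCfg, err := newTLSConfig(caPEM)
	if err != nil {
		return err
	}
	req, err := newScrapeRequest(metricsURL, string(token))
	if err != nil {
		return err
	}
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: tlsCfg},
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Print the scraped metrics, as curl would.
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Like the curl command, this only works from inside a pod that mounts both the CA bundle and a service account token allowed to read the metrics endpoint.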

ffromani · 2024-09-12T11:07:47Z

main.go

+	defaultWebhookPort    = 9443
+	defaultMetricsAddr    = ":8080"
 	defaultMetricsEnabled = true
-	defaultProbeAddr   = ":8081"
-	defaultNamespace   = "numaresources-operator"
+	defaultProbeAddr      = ":8081"
+	defaultNamespace      = "numaresources-operator"


where do we use these?

I only added defaultMetricsEnabled which is used here

ffromani

At a glance LGTM, but I'll do a proper review later on.

ffromani · 2024-09-12T11:41:43Z

let's merge #1008 first

ffromani · 2024-09-12T14:01:54Z

/retest

ffromani · 2024-09-12T14:18:07Z

please rebase on top of current main branch

rbaturov · 2024-09-16T05:35:03Z

/retest

ffromani

/hold
/lgtm

This needs to be tested d/s (downstream) before merge. Looks good, but we need to do our due diligence and I don't have enough bandwidth atm.

openshift-ci-robot · 2024-09-23T06:57:46Z

@rbaturov: This pull request references CNF-10142 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

ffromani · 2024-10-10T13:13:55Z

main.go

@@ -73,6 +73,7 @@ const (
 const (
 	defaultWebhookPort = 9443
 	defaultMetricsAddr = ":8080"
+	defaultMetricsEnabled = true


let's rename this to 'defaultMetricsSupport'

Updated code to consume this change
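As a rough illustration of how a boolean default like this is typically consumed, here is a minimal sketch that wires it into command-line flags. The flag names `--enable-metrics` and `--metrics-bind-address`, the `params` struct, and `parseParams` are hypothetical and may not match the operator's actual `main.go`. Only the constant names and values come from the diff and the review comment above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Constant names and values follow the diff above, with the rename
// suggested in review applied.
const (
	defaultMetricsAddr    = ":8080"
	defaultMetricsSupport = true
)

// params is a hypothetical container for the parsed settings.
type params struct {
	metricsAddr   string
	enableMetrics bool
}

// parseParams is a hypothetical helper; the flag names are assumptions.
func parseParams(args []string) (params, error) {
	var p params
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	fs.StringVar(&p.metricsAddr, "metrics-bind-address", defaultMetricsAddr,
		"address the metrics endpoint binds to")
	fs.BoolVar(&p.enableMetrics, "enable-metrics", defaultMetricsSupport,
		"whether to expose the metrics endpoint")
	if err := fs.Parse(args); err != nil {
		return params{}, err
	}
	return p, nil
}

func main() {
	p, err := parseParams(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("metrics enabled=%t addr=%q\n", p.enableMetrics, p.metricsAddr)
}
```

With the default set to `true`, metrics are exposed unless the operator is started with `--enable-metrics=false`.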

ffromani · 2024-10-10T13:15:15Z

config/prometheus/monitor.yaml

-  labels:
-    control-plane: controller-manager
-  name: controller-manager-metrics-monitor
+  name: controller-manager


this is a pretty generic name for the system namespace
we can get away with the generic name controller-manager ONLY if we are in a numaresources namespace

But the system namespace will always be replaced with numaresources.
Do you want me to revert to controller-manager-metrics-monitor?

ok, if this is gonna be sitting in the numaresources namespace, it's good

ffromani · 2024-10-10T13:16:36Z

config/prometheus/rbac.yaml

@@ -0,0 +1,31 @@
+# creates Role and RoleBinding for prometheus-k8s service account to access our namespace


do we need this only in CI or in production in general?

For CI, which doesn't have Prometheus installed, we don't need this.
However, for production (OCP), if we opt to allow Prometheus to scrape metrics by default, we should apply these RBAC rules.

ffromani · 2024-10-10T13:17:34Z

config/rbac/kustomization.yaml

+- auth_proxy_service.yaml
+- auth_proxy_role.yaml
+- auth_proxy_role_binding.yaml
+- auth_proxy_client_clusterrole.yaml


same question, are those for CI or for production?

`auth_proxy_role.yaml`, `auth_proxy_service.yaml` and `auth_proxy_role_binding.yaml` are mandatory for the sidecar to operate. This means these three are needed for CI (for the curl tests) and also for production.
Without them the sidecar won't become ready.
The Service is mandatory because its annotation triggers creation of the secret that the sidecar consumes as a volume.

However, `auth_proxy_client_clusterrole.yaml` is not used at all.

ok, so let's not add auth_proxy_client_clusterrole.yaml or unnecessary stuff in general

> `auth_proxy_role.yaml`, `auth_proxy_service.yaml` and `auth_proxy_role_binding.yaml` are mandatory for the sidecar to operate. This means these three are needed for CI (for the curl tests) and also for production. Without them the sidecar won't become ready. The Service is mandatory because its annotation triggers creation of the secret that the sidecar consumes as a volume.

Could you elaborate about "not be ready"? I'd expect the sidecars to be up and running, but not be accessible by anyone in the cluster without these RBAC rules.

If, for example, the Service is not created, the secret that we mount as a volume is not created either. This results in an error, and the sidecar never reaches the ready state.

ok, thanks. This is a possible problem for the backports, because it makes them more invasive than expected.

This commit consists of the following changes:

* Re-enabled the kube-rbac-proxy sidecar container to securely expose the /metrics endpoint for Prometheus scraping.
* Added a secret to enforce HTTPS-only access to the /metrics endpoint, restricted to the Prometheus service account.
* Modified the ServiceMonitor resource to enable Prometheus pods to scrape metrics.
* Added an annotation to the deployment Service, which is monitored by the Service CA operator. This operator generates the tls.key and tls.crt files inside the secret-kube-rbac-proxy-tls secret, which is used by the kube-rbac-proxy container.
* Added Role and RoleBinding resources to grant the necessary permissions to the Prometheus service account.

Most of this configuration was based on this guide: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

Signed-off-by: Ronny Baturov <[email protected]>

openshift-ci · 2024-10-11T07:42:59Z

New changes are detected. LGTM label has been removed.

openshift-ci · 2024-10-11T07:43:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rbaturov
Once this PR has been reviewed and has the lgtm label, please ask for approval from ffromani. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-10-11T09:53:37Z

@rbaturov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ci-install-e2e | `3450ec5` | link | true | `/test ci-install-e2e` |


openshift-ci bot requested review from shajmakh and swatisehgal September 12, 2024 10:47

ffromani reviewed Sep 12, 2024

rbaturov force-pushed the enable-prometheus branch 3 times, most recently from c26ea7d to 997fcbf Compare September 15, 2024 11:06

rbaturov force-pushed the enable-prometheus branch 5 times, most recently from cdda53a to 8abebb6 Compare September 17, 2024 12:27

ffromani reviewed Sep 18, 2024

openshift-ci bot added the do-not-merge/hold label Sep 18, 2024

openshift-ci bot assigned ffromani Sep 18, 2024

openshift-ci bot added the lgtm label Sep 18, 2024

rbaturov changed the title ~~Enable NROP metrics to be to scraped securely by Prometheus~~ CNF-10142: Enable NROP metrics to be to scraped securely by Prometheus Sep 23, 2024

This was referenced Oct 6, 2024

CNF-11234: Enable RTE metrics to be scraped securely by Prometheus #1035

Open

CNF-14833: E2E: metrics exposed securely #1037

Draft

ffromani reviewed Oct 10, 2024

rbaturov added 3 commits October 11, 2024 10:42

Enable metrics by default

3e92990

Signed-off-by: Ronny Baturov <[email protected]>

make generate bundle manifests

3450ec5

Signed-off-by: Ronny Baturov <[email protected]>

rbaturov force-pushed the enable-prometheus branch from 8abebb6 to 3450ec5 Compare October 11, 2024 07:42

openshift-ci bot removed the lgtm label Oct 11, 2024
