CNF-10142: Enable NROP metrics to be to scraped securely by Prometheus #1007

Open
wants to merge 3 commits into
base: main

Contributor

@rbaturov rbaturov commented Sep 12, 2024

This PR encompasses all the required changes to reintegrate NROP metrics with Prometheus. It introduces a kube-rbac-proxy sidecar to establish a secure communication channel. The majority of the changes in this PR follow the guidelines outlined in the following guide:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

The only difference between this implementation and the guide's is that we use a bearerTokenFile for Prometheus authentication instead of tls.crt and tls.key. This approach uses TLS but does not implement mTLS.

A follow-up PR will implement the same for RTE metrics.
I will also open a PR adding an e2e test for this functionality to the CI.

To validate that this PR is functioning correctly, please follow these steps:

  1. Build and push the operator image: make docker-build docker-push
  2. Run: make deploy
  3. Attach to one of the Prometheus pods: oc exec -it prometheus-k8s-0 -n openshift-monitoring -- /bin/bash
  4. Run:
curl -v \
--cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt \
-H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
https://numaresources-controller-manager-metrics-service.numaresources.svc:8443/metrics
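
For reference, the curl check above could also be expressed in Go. This is only a rough sketch, not code from this PR: the paths and URL are copied from the curl command, while the helper names clientWithCA and bearerRequest are made up for illustration. Like the curl command, it is only expected to work from inside a Prometheus pod, where the CA bundle and the service account token are mounted.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// Paths and URL taken from the curl command in the validation steps.
const (
	caBundlePath = "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt"
	tokenPath    = "/var/run/secrets/kubernetes.io/serviceaccount/token"
	metricsURL   = "https://numaresources-controller-manager-metrics-service.numaresources.svc:8443/metrics"
)

// clientWithCA (hypothetical helper) returns an HTTP client that trusts only
// the given PEM bundle, which is what curl's --cacert does.
func clientWithCA(caPEM []byte) (*http.Client, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}, nil
}

// bearerRequest (hypothetical helper) builds a GET request that carries the
// token in the Authorization header. No client certificate is involved, so
// this is TLS with token authentication, not mTLS.
func bearerRequest(url, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(token))
	return req, nil
}

// run performs the scrape once and copies the metrics to stdout.
func run() error {
	caPEM, err := os.ReadFile(caBundlePath)
	if err != nil {
		return err
	}
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return err
	}
	client, err := clientWithCA(caPEM)
	if err != nil {
		return err
	}
	req, err := bearerRequest(metricsURL, string(token))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, "metrics check failed:", err)
	}
}
```

A 200 response with metrics text suggests the sidecar accepted both the serving certificate chain and the token; an authorization error would point at the Role/RoleBinding for the Prometheus service account instead.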

main.go Outdated
Comment on lines 74 to 78
defaultWebhookPort = 9443
defaultMetricsAddr = ":8080"
defaultMetricsEnabled = true
defaultProbeAddr = ":8081"
defaultNamespace = "numaresources-operator"
Member

where do we use these?

Contributor Author

I only added defaultMetricsEnabled which is used here

Member

@ffromani ffromani left a comment

At a glance LGTM, but I'll do a proper review later on

@ffromani
Member

let's merge #1008 first

@ffromani
Member

/retest

@ffromani
Member

please rebase on top of current main branch

@rbaturov rbaturov force-pushed the enable-prometheus branch 3 times, most recently from c26ea7d to 997fcbf Compare September 15, 2024 11:06
@rbaturov
Contributor Author

/retest

@rbaturov rbaturov force-pushed the enable-prometheus branch 5 times, most recently from cdda53a to 8abebb6 Compare September 17, 2024 12:27
Member

@ffromani ffromani left a comment

/hold
/lgtm

Needs to be tested downstream (d/s) before merge. Looks good, but we need to do our due diligence and I don't have enough bandwidth atm.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 18, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2024
@rbaturov rbaturov changed the title Enable NROP metrics to be to scraped securely by Prometheus CNF-10142: Enable NROP metrics to be to scraped securely by Prometheus Sep 23, 2024
@openshift-ci-robot

@rbaturov: This pull request references CNF-10142 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

main.go Outdated
@@ -73,6 +73,7 @@ const (
const (
defaultWebhookPort = 9443
defaultMetricsAddr = ":8080"
defaultMetricsEnabled = true
Member

let's rename this to 'defaultMetricsSupport'

Contributor Author

Updated the code to use the new name.
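
To illustrate how a boolean default like this could be consumed, here is a minimal sketch. It is not the code from this PR: the flag names and the helpers effectiveMetricsAddr and parseMetricsFlags are hypothetical, and the only outside fact relied on is that controller-runtime treats a metrics bind address of "0" as "metrics server disabled".

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

const (
	defaultMetricsAddr    = ":8080"
	defaultMetricsSupport = true
)

// effectiveMetricsAddr (hypothetical helper) returns the bind address to hand
// to the manager. "0" is the value controller-runtime interprets as
// "do not start the metrics server".
func effectiveMetricsAddr(enabled bool, addr string) string {
	if !enabled {
		return "0"
	}
	return addr
}

// parseMetricsFlags (hypothetical helper) parses the two metrics flags.
// The flag names are assumptions chosen for this example only.
func parseMetricsFlags(args []string) (string, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	addr := fs.String("metrics-bind-address", defaultMetricsAddr, "address the metrics endpoint binds to")
	enabled := fs.Bool("enable-metrics", defaultMetricsSupport, "enable the metrics endpoint")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return effectiveMetricsAddr(*enabled, *addr), nil
}

func main() {
	addr, err := parseMetricsFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("metrics bind address:", addr)
}
```

With no arguments this keeps the default address; passing --enable-metrics=false yields "0", so the decision stays in one place regardless of what the constant is called.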

labels:
control-plane: controller-manager
name: controller-manager-metrics-monitor
name: controller-manager
Member

this is a pretty generic name for the system namespace
we can get away with the generic controller-manager name ONLY if we are in a numaresources namespace

Contributor Author

But the system namespace will always be replaced with numaresources.
Do you want to revert to controller-manager-metrics-monitor?

Member

ok, if this is gonna be sitting in the numaresources namespace, it's good

@@ -0,0 +1,31 @@
# creates Role and RoleBinding for prometheus-k8s service account to access our namespace
Member

do we need this only in CI or in production in general?

Contributor Author

For CI, which doesn't have Prometheus installed, we don't need this.
However, for production (OCP), if we opt to allow Prometheus to scrape metrics by default, we should apply these RBAC rules.

Comment on lines +15 to +18
- auth_proxy_service.yaml
- auth_proxy_role.yaml
- auth_proxy_role_binding.yaml
- auth_proxy_client_clusterrole.yaml
Member

same question, are those for CI or for production?

Contributor Author

@rbaturov rbaturov Oct 10, 2024

auth_proxy_role.yaml, auth_proxy_service.yaml and auth_proxy_role_binding.yaml are mandatory for the sidecar operation, meaning these three are needed for CI (for curl tests) and also for production.
Without them the sidecar won't be ready.
Deploying the Service is mandatory because its annotation is responsible for creating the secret that the sidecar consumes as a volume.

Contributor Author

However, the auth_proxy_client_clusterrole.yaml is not being used at all

Member

ok, so let's not add auth_proxy_client_clusterrole.yaml or unnecessary stuff in general

Member

> auth_proxy_role.yaml, auth_proxy_service.yaml and auth_proxy_role_binding.yaml are mandatory for the sidecar operation, meaning these three are needed for CI (for curl tests) and also for production. Without them the sidecar won't be ready. Deploying the Service is mandatory because its annotation is responsible for creating the secret that the sidecar consumes as a volume.

Could you elaborate on "won't be ready"? I'd expect the sidecars to be up and running, but not accessible by anyone in the cluster without these RBAC rules.

Contributor Author

If, for example, the Service is not created, the secret that we mount as a volume is not created either. This results in an error, and the sidecar never becomes ready.

Member

ok, thanks. This is a possible problem for the backports, because it makes them more invasive than expected.

Signed-off-by: Ronny Baturov <[email protected]>
This commit consists of the following changes:

* Re-enabled the kube-rbac-proxy sidecar container to securely expose the /metrics endpoint for Prometheus scraping.
* Added a secret to enforce HTTPS-only access to the /metrics endpoint, restricted to the Prometheus service account.
* Modified the ServiceMonitor resource to enable Prometheus pods to scrape metrics.
* Added an annotation to the deployment Service, which is monitored by the Service CA operator. This operator will generate the tls.key and tls.crt files inside the secret-kube-rbac-proxy-tls secret, which is used by the kube-rbac-proxy container.
* Added Role and RoleBinding resources to grant the necessary permissions to the Prometheus service account.

Most of this configuration was based on this guide:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/

Signed-off-by: Ronny Baturov <[email protected]>
Signed-off-by: Ronny Baturov <[email protected]>
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2024
Contributor

openshift-ci bot commented Oct 11, 2024

New changes are detected. LGTM label has been removed.

Contributor

openshift-ci bot commented Oct 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rbaturov
Once this PR has been reviewed and has the lgtm label, please ask for approval from ffromani. For more information see the Kubernetes Code Review Process.

Contributor

openshift-ci bot commented Oct 11, 2024

@rbaturov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/ci-install-e2e | 3450ec5 | link | true | /test ci-install-e2e

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.