-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNF-10142: Enable NROP metrics to be to scraped securely by Prometheus #1007
base: main
Are you sure you want to change the base?
Conversation
main.go
Outdated
defaultWebhookPort = 9443 | ||
defaultMetricsAddr = ":8080" | ||
defaultMetricsEnabled = true | ||
defaultProbeAddr = ":8081" | ||
defaultNamespace = "numaresources-operator" | ||
defaultProbeAddr = ":8081" | ||
defaultNamespace = "numaresources-operator" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added defaultMetricsEnabled
which is used here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at glance LGTM, but I'll have a proper review later on
let's merge #1008 before |
/retest |
please rebase on top of current main branch |
c26ea7d
to
997fcbf
Compare
/retest |
cdda53a
to
8abebb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
/lgtm
need to be tested d/s before merge. Looks good, but we need the due diligence and I don't have enough bandwidth atm.
@rbaturov: This pull request references CNF-10142 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
main.go
Outdated
@@ -73,6 +73,7 @@ const ( | |||
const ( | |||
defaultWebhookPort = 9443 | |||
defaultMetricsAddr = ":8080" | |||
defaultMetricsEnabled = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's rename 'defaultMetricsSupport'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated code to consume this change
labels: | ||
control-plane: controller-manager | ||
name: controller-manager-metrics-monitor | ||
name: controller-manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a pretty generic name for the system
namespace
we can getaway with controller-manager
generic name ONLY if we are in a numaresources namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the system
namespace will always be replaced with numaresources
.
Do you wish to return to controller-manager-metrics-monitor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, if this is gonna be sitting in the numaresources
namespace, it's good
@@ -0,0 +1,31 @@ | |||
# creates Role and RoleBinding for prometheus-k8s service account to access our namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this only in CI or in production in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For CI, which don't have prometheus installed we don't need this.
However, for production (OCP) if we opt to allow prometheus to scrape merics by default, we should apply these RBAC's.
- auth_proxy_service.yaml | ||
- auth_proxy_role.yaml | ||
- auth_proxy_role_binding.yaml | ||
- auth_proxy_client_clusterrole.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question, are those for CI or for production?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auth_proxy_role.yaml,
auth_proxy_service.yamland
auth_proxy_role_binding.yaml` are mandatory for the sidecar operation. Meaning these three needed for CI (for curl tests) but also for production.
without them sidecar won't be ready.
The service deployment is mandatory as its annotation responsible for creating a secret consumed as a volume by the sidecar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the auth_proxy_client_clusterrole.yaml
is not being used at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so let's not add auth_proxy_client_clusterrole.yaml
or unnecessary stuff in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auth_proxy_role.yaml,
auth_proxy_service.yamland
auth_proxy_role_binding.yaml` are mandatory for the sidecar operation. Meaning these three needed for CI (for curl tests) but also for production. without them sidecar won't be ready. The service deployment is mandatory as its annotation responsible for creating a secret consumed as a volume by the sidecar.
Could you elaborate about "not be ready"? I'd expect the sidecars to be up and running, but not be accessible by anyone in the cluster without these RBAC rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the service for example won't be created, the secret that we mount as a volume won't be created. Therefore, this would result with an error and the sidecar state won't be ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks. This is a possible problem for the backports, because makes them more invasive than expected.
Signed-off-by: Ronny Baturov <[email protected]>
This commit consist of the following changes: * Reenabled kube-rbac-proxy sidecar container to securely expose the /metrics endpoint for Prometheus scraping. * Added a secret to enforce HTTPS-only access to the /metrics endpoint, restricted to the Prometheus service account. * modified ServiceMonitor resource to enable Prometheus pods to scrape metrics. * Added an annotation to the deployment Service, which is monitored by the Service CA operator. This operator will generate the tls.key and tls.crt files inside the secret-kube-rbac-proxy-tls secret, which is used by the kube-rbac-proxy container. * Added Role and RoleBinding resources to grant the necessary permissions to the Prometheus service account. Most of this configuration was based on this guide: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/ Signed-off-by: Ronny Baturov <[email protected]>
Signed-off-by: Ronny Baturov <[email protected]>
8abebb6
to
3450ec5
Compare
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rbaturov The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@rbaturov: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR encompasses all the required changes to reintegrate NROP metrics with Prometheus. It introduces a kube-rbac-proxy sidecar to establish a secure communication channel. The majority of the changes in this PR follow the guidelines outlined in the following guide:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/
The only difference between this implementation and the guide's is that we use a
bearerTokenFile
for Prometheus authentication instead of tls.crt and tls.key. This approach uses TLS but does not implement mTLS.A follow-up PR will be issued to ensure we implement this for RTE metrics as well.
Moreover, will issue a PR adding an e2e test to the CI, for this functionality.
To validate that this PR is functioning correctly, please follow these steps:
make docker-build docker-push
)make deploy
oc exec -it prometheus-k8s-0 -n openshift-monitoring /bin/bash