Pod disruption schedule #1719
Comments
This issue is currently awaiting triage.
Is this for your job/task related pods? Would it be sufficient for you if the |
This would be for always-running job/task workers or singleton services.
Would you be able to share your "own controller" code?
@jukie what I'm wondering is whether this is a function of the lifetime of the pod in any way. What sort of workloads only want to be disrupted at a certain time, as opposed to reacting to some other signal in the cluster (like other pods going away)? I'm not sure I like encoding this sort of API surface onto the pod itself. It's very loosely defined, makes it easier to run into validation issues, and doesn't promote elasticity. On top of this, Karpenter has to reason about when it's fine to enqueue a disruption versus when it's fine to actually drain the pod. Let's say I had this schedule + duration: do I want to nominate the node for consolidation if it's not in its disruptable period? Or do I wait for it to be in a disruptable period before I nominate it? If so, and it leaves the disruptable period before the drain happens, I'm now left with a pod I can't evict until my TGP, which could mean higher overall cost.
@njtran I'll try to expand a bit on the long running task executor example - these don't execute as
For your other questions:
My PR (#1720) uses the existing logic that do-not-disrupt uses by updating podutil.IsEvictable() and podutil.IsDisruptable() to consider this new annotation. If the window is inactive it would lead to the same
Wouldn't the higher-cost scenario already be the default? Adding the ability to consider a disruptable window would lower overall cost by allowing nodes to be disrupted before TGP. It'd probably be a good idea to set a minimum window duration to avoid the scenario you describe, though.
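A minimal sketch of the check described above, in Python rather than Karpenter's Go, and not the code from the PR. It assumes a pod with the schedule annotation behaves like a do-not-disrupt pod while its window is inactive, and that a window shorter than some minimum is treated as never active. The annotation key `example.com/disruption-schedule`, the 30-minute floor, and the function name are all hypothetical; only `karpenter.sh/do-not-disrupt` comes from the issue.

```python
from datetime import timedelta

DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"
# Hypothetical key: the thread does not name the proposed annotation.
SCHEDULE_KEY = "example.com/disruption-schedule"
# Assumed floor for the window length; the thread only suggests that one exist.
MIN_WINDOW = timedelta(minutes=30)


def is_disruptable(
    annotations: dict[str, str],
    window_active: bool,
    window_length: timedelta,
) -> bool:
    """Decide whether a pod may be disrupted right now.

    window_active and window_length are computed elsewhere from the
    pod's schedule and duration.
    """
    # An explicit do-not-disrupt annotation always blocks disruption.
    if annotations.get(DO_NOT_DISRUPT) == "true":
        return False
    # Pods without a schedule are unaffected by the new logic.
    if SCHEDULE_KEY not in annotations:
        return True
    # Assumption: a window that is too short to drain in is never honoured.
    if window_length < MIN_WINDOW:
        return False
    # Otherwise the pod is blocked exactly while its window is inactive.
    return window_active
```

Rejecting too-short windows at validation time instead of silently blocking would be an equally reasonable design; the sketch only shows where such a floor would sit in the decision.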
I'd like to add support to this issue. My use case: we have services with varying tolerance for disruption, and we'd like to allow services to express their own requirements: "I can tolerate X restarts every Y days, and (optionally) only within Z timeframe." We'd prefer to limit the number of different nodepools we manage.
We're currently running 0.36 and are having to implement something similar to https://github.com/jukie/karpenter-deprovision-controller (where we strategically remove do-not-disrupt annotations). Our understanding is that in 1.0 we can take advantage of expiration + terminationGracePeriod to enforce a maximum node age, regardless of the do-not-disrupt annotation. But that would still mean that every service that uses the do-not-disrupt annotation is subject to the same frequency of disruption.
I think the proposal described in this issue would give us what we want. Any thoughts or recommendations?
@njtran any more thoughts on this one?
Description
What problem are you trying to solve?
I have some workloads that are sensitive to interruptions at certain points of the day and thus are using the karpenter.sh/do-not-disrupt annotation. I'd like the ability to allow disruptions to these pods at specific points via a cron-format schedule.
How important is this feature to you?
In order to allow reclaiming nodes for expiration or underutilization I'm currently running my own controller that watches DisruptionBlocked events and then removes the do-not-disrupt annotation if the pods are marked with another one indicating the schedule for when disruptions are allowed. I'd like something similar to be added upstream and get rid of my own controller.
Example schedule (cron format): 0 14 * * 6
Example window duration: 3h
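To illustrate how a cron schedule plus a duration could define a disruption window, here is a rough Python sketch using the example values from this issue (0 14 * * 6 and 3h). It is not the author's controller. The function names are made up for illustration, the cron matcher only understands `*`, single numbers, and comma lists, it ANDs the day-of-month and day-of-week fields (classic cron ORs them when both are restricted), and times are naive datetimes assumed to be in a single timezone such as UTC.

```python
import re
from datetime import datetime, timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '3h' or '1h30m' (h, m, s units only)."""
    parts = re.findall(r"(\d+)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def _field_matches(field: str, value: int) -> bool:
    # Simplified: '*' or a comma-separated list of integers.
    if field == "*":
        return True
    return value in {int(p) for p in field.split(",")}


def cron_matches(schedule: str, t: datetime) -> bool:
    """True if the 5-field cron expression fires at minute t."""
    minute, hour, dom, month, dow = schedule.split()
    return (
        _field_matches(minute, t.minute)
        and _field_matches(hour, t.hour)
        and _field_matches(dom, t.day)
        and _field_matches(month, t.month)
        # cron uses 0 = Sunday .. 6 = Saturday
        and _field_matches(dow, t.isoweekday() % 7)
    )


def in_disruption_window(schedule: str, duration: str, now: datetime) -> bool:
    """True if a schedule firing happened within `duration` before `now`."""
    window = parse_duration(duration)
    start = now.replace(second=0, microsecond=0)
    # Walk back minute by minute while a firing could still cover `now`.
    while now - start < window:
        if cron_matches(schedule, start):
            return True
        start -= timedelta(minutes=1)
    return False
```

With the example values, the window opens every Saturday at 14:00 and closes at 17:00 (exclusive). The minute-by-minute scan is slow for very long durations but keeps the sketch free of third-party cron libraries.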