Is your feature request related to a problem? Please describe.
When using external compactions, it is possible for a compaction job to contain an rfile that exceeds the resources assigned to a compactor.
If that compactor dies with an out-of-memory (OOM) error, there is no mechanism to indicate that the offending rfile should be moved to a different compaction queue that uses larger compactors.
Describe the solution you'd like
When the compaction-coordinator detects a failed compaction job, it should record an errored rfile entry with a failure count.
At a specific failure threshold, these files should be returned as meeting a "large" compaction criterion.
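As a rough illustration, a minimal sketch of that bookkeeping, assuming a hypothetical in-memory tracker and an assumed threshold of 3 (none of these names or values come from the issue; a real implementation would persist the counts, per the conditional-mutation note below):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch: per-rfile failure counts kept by the coordinator.
class FailedFileTracker {
  private static final int LARGE_QUEUE_THRESHOLD = 3; // assumed value
  private final Map<String,Integer> failureCounts = new ConcurrentHashMap<>();

  // Called when the coordinator detects a failed compaction job.
  void recordFailure(Set<String> jobRfiles) {
    jobRfiles.forEach(f -> failureCounts.merge(f, 1, Integer::sum));
  }

  // Files that have failed often enough to be treated as "large".
  Set<String> filesNeedingLargerCompactor() {
    return failureCounts.entrySet().stream()
        .filter(e -> e.getValue() >= LARGE_QUEUE_THRESHOLD)
        .map(Map.Entry::getKey)
        .collect(Collectors.toSet());
  }
}
```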
Describe alternatives you've considered
The compaction-coordinator could have knowledge of which queue has "larger" compactor resources and auto-submit these compactions there.
Additional context
Conditional mutations are probably needed for this to work.
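For instance, a conditional-mutation-based increment might look roughly like the sketch below. It uses Accumulo's real ConditionalWriter API, but the metadata layout (row = rfile path, family "file", qualifier "failures") and the table name are assumptions made here for illustration:

```java
import java.util.Map;

import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.ConditionalWriter;
import org.apache.accumulo.core.client.ConditionalWriter.Status;
import org.apache.accumulo.core.client.ConditionalWriterConfig;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Condition;
import org.apache.accumulo.core.data.ConditionalMutation;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

class FailureCountUpdater {

  // Atomically increment the failure count for an rfile. The condition
  // requires the stored count to be unchanged since it was read, so two
  // coordinator threads cannot overwrite each other's increments.
  static void incrementFailureCount(AccumuloClient client, String table, String rfile)
      throws Exception {
    try (ConditionalWriter writer =
        client.createConditionalWriter(table, new ConditionalWriterConfig())) {
      while (true) {
        String current = readCurrentCount(client, table, rfile);
        Condition cond = new Condition("file", "failures");
        if (current != null) {
          cond.setValue(current); // succeed only if the count is unchanged
        } // with no value set, the condition requires the column to be absent
        ConditionalMutation cm = new ConditionalMutation(rfile, cond);
        int next = (current == null ? 0 : Integer.parseInt(current)) + 1;
        cm.put("file", "failures", Integer.toString(next));
        Status status = writer.write(cm).getStatus();
        if (status == Status.ACCEPTED) {
          return; // increment applied atomically
        }
        if (status != Status.REJECTED) {
          throw new IllegalStateException("unexpected status: " + status);
        }
        // REJECTED: another update raced this one; re-read and retry.
      }
    }
  }

  // Assumed layout: row = rfile path, family "file", qualifier "failures".
  private static String readCurrentCount(AccumuloClient client, String table, String rfile)
      throws Exception {
    try (Scanner scanner = client.createScanner(table, Authorizations.EMPTY)) {
      scanner.setRange(Range.exact(rfile));
      scanner.fetchColumn(new Text("file"), new Text("failures"));
      for (Map.Entry<Key,Value> e : scanner) {
        return e.getValue().toString();
      }
      return null; // no failures recorded yet
    }
  }
}
```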
When a compactor fails, we could add a fraction like 1/(total failed compactors) to the per-file failure count. This would be a simple way to distinguish events related to a single tablet from systemic issues unrelated to any single tablet. For example, when the dead compaction detector runs, if it detects 200 dead compactors it could add 1/200 to each file that was involved. If it detects 2 failed compactors, it could add 1/2 to each file those compactors were processing. This would make the count increase faster when a few tablets are repeatedly killing compactors.
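A sketch of that weighting, with hypothetical names and an in-memory score map standing in for whatever the coordinator would actually persist:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the fractional-count idea (names hypothetical). When the dead
// compaction detector finds N dead compactors in one pass, each involved
// file's score grows by 1/N, so a lone repeatedly-failing tablet pushes its
// files toward the threshold much faster than a systemic outage does.
class FractionalFailureScores {
  private final Map<String,Double> scores = new ConcurrentHashMap<>();

  // filesByDeadCompactor: compactor id -> rfiles its job was processing.
  void recordDeadCompactors(Map<String,Set<String>> filesByDeadCompactor) {
    double increment = 1.0 / filesByDeadCompactor.size();
    for (Set<String> files : filesByDeadCompactor.values()) {
      for (String file : files) {
        scores.merge(file, increment, Double::sum);
      }
    }
  }

  boolean exceedsThreshold(String file, double threshold) {
    return scores.getOrDefault(file, 0.0) >= threshold;
  }
}
```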
This seems similar to the cascading tablet server death problem, where a tserver fails due to data in a scan, the scan moves to another tserver and subsequently kills that one, and so on. I don't know that we have code that handles this; I think the user has to look at logs/metrics to realize their server count is going down.