
Add failed compaction counts for specific rfiles. #5024

Open
ddanielr opened this issue Oct 30, 2024 · 2 comments
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@ddanielr (Contributor)

Is your feature request related to a problem? Please describe.
When using external compactions, it is possible for a compaction job to contain an rfile that exceeds the resources assigned to a compactor.
If that compactor dies from an out-of-memory (OOM) error, there is no mechanism to indicate that the specific rfile should be moved to a different compaction queue that uses larger compactors.

Describe the solution you'd like
When the compaction-coordinator detects a failed compaction job, it should add an errored rfile entry with a count.
Once a file reaches a specific failure threshold, it should be returned as a "large" compaction candidate.

Describe alternatives you've considered
The compaction-coordinator could have knowledge of which queue has "larger" compactor resources and auto-submit these compactions to it.

Additional context
Conditional mutations are probably needed for this to work.
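The per-file failure count and threshold described above could be sketched roughly as follows. This is an illustrative sketch only, not Accumulo code; the class, method names, and the threshold value are all hypothetical.

```python
from collections import defaultdict

# Hypothetical threshold: number of failures before a file is
# flagged for a "large" compaction queue.
LARGE_QUEUE_THRESHOLD = 3.0


class FailureTracker:
    """Tracks per-rfile failure counts for failed external compactions."""

    def __init__(self, threshold=LARGE_QUEUE_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(float)  # rfile name -> failure count

    def record_failure(self, rfiles):
        """Increment the failure count for each rfile in a failed job."""
        for f in rfiles:
            self.counts[f] += 1.0

    def files_needing_large_queue(self):
        """Return rfiles whose failure count has reached the threshold."""
        return [f for f, c in self.counts.items() if c >= self.threshold]
```

In a real implementation the counts would live in the metadata table and be updated with conditional mutations, as noted above, rather than in coordinator memory.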

@ddanielr ddanielr added the enhancement This issue describes a new feature, improvement, or optimization. label Oct 30, 2024
@keith-turner (Contributor)

keith-turner commented Oct 30, 2024

When a compactor fails we could add a fraction like 1/(total failed compactors) to the per-file failure count. This would be a simple way to distinguish events related to a single tablet from systemic issues unrelated to any one tablet. For example, when the dead compaction detector runs, if it detects 200 dead compactors it could add 1/200 to each file that was involved. If it detects 2 failed compactors, it could add 1/2 to each file those compactors were processing. This would make the count increase faster when a few tablets are repeatedly killing compactors.
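The fractional-count idea above could be sketched as follows; the function and data shapes are illustrative assumptions, not existing Accumulo APIs.

```python
def record_failures(counts, dead_compactors):
    """Apply fractional failure counts after a dead-compactor detection pass.

    counts: dict mapping rfile name -> accumulated failure count.
    dead_compactors: dict mapping compactor id -> list of rfiles it was
    processing when it died.

    Each file gets 1/N added, where N is the number of dead compactors
    detected in this pass. A systemic event (many compactors dying at
    once) therefore moves each file's count only slightly, while a single
    problem file repeatedly killing compactors accumulates quickly.
    """
    n = len(dead_compactors)
    if n == 0:
        return counts
    for rfiles in dead_compactors.values():
        for f in rfiles:
            counts[f] = counts.get(f, 0.0) + 1.0 / n
    return counts
```

With this weighting, a file involved in a 200-compactor outage gains only 0.005 per pass, while a file that single-handedly kills its compactor gains 1.0 per pass.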

@dlmarion (Contributor)

This seems similar to the cascading tablet server death problem, where a tserver fails due to data in a scan, the scan moves to another tserver and kills that one too, and so on. I don't know that we have code that handles this; I think the user has to look at logs/metrics to realize their server count is going down.
