
Add failed compaction counts for specific rfiles. #5024

Open
ddanielr opened this issue Oct 30, 2024 · 2 comments
Labels
enhancement This issue describes a new feature, improvement, or optimization.

Comments

@ddanielr (Contributor)

Is your feature request related to a problem? Please describe.
When using external compactions, it is possible for a compaction job to contain an rfile that exceeds the resources assigned to a compactor.
If that compactor dies from an out-of-memory (OOM) error, there is no mechanism to indicate that the specific rfile should be moved to a different compaction queue that uses larger compactors.

Describe the solution you'd like
When the compaction-coordinator detects a failed compaction job, it should add an errored rfile entry with a count.
Once a file reaches a specific failure threshold, it should be returned as a "large" compaction candidate.

Describe alternatives you've considered
The compaction-coordinator could have knowledge of which queue has "larger" compactor resources and auto-submit these compactions to it.

Additional context
Conditional mutations are probably needed for this to work.
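The per-file failure count and threshold described above could be sketched roughly as follows. This is an illustrative sketch only, not Accumulo code; the class, method names, and the threshold value are all hypothetical.

```python
from collections import defaultdict

# Hypothetical threshold: number of failures before a file is
# flagged for a "large" compaction queue.
LARGE_QUEUE_THRESHOLD = 3.0


class FailureTracker:
    """Tracks per-rfile failure counts for failed external compactions."""

    def __init__(self, threshold=LARGE_QUEUE_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(float)  # rfile name -> failure count

    def record_failure(self, rfiles):
        """Increment the failure count for each rfile in a failed job."""
        for f in rfiles:
            self.counts[f] += 1.0

    def files_needing_large_queue(self):
        """Return rfiles whose failure count has reached the threshold."""
        return [f for f, c in self.counts.items() if c >= self.threshold]
```

In a real implementation the counts would live in the metadata table and be updated with conditional mutations, as noted above, rather than in coordinator memory.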

@ddanielr ddanielr added the enhancement This issue describes a new feature, improvement, or optimization. label Oct 30, 2024
@keith-turner (Contributor)

keith-turner commented Oct 30, 2024

When a compactor fails we could add a fraction like 1/(total failed compactors) to the per-file failure count. This would be a simple way to distinguish events related to a single tablet from systemic issues unrelated to any one tablet. For example, when the dead compaction detector runs, if it detects 200 dead compactors it could add 1/200 to each file that was involved. If it detects 2 failed compactors, it could add 1/2 to each file those compactors were processing. This would make the count increase faster when a few tablets are repeatedly killing compactors.
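The fractional-count idea above could be sketched as follows; the function and data shapes are illustrative assumptions, not existing Accumulo APIs.

```python
def record_failures(counts, dead_compactors):
    """Apply fractional failure counts after a dead-compactor detection pass.

    counts: dict mapping rfile name -> accumulated failure count.
    dead_compactors: dict mapping compactor id -> list of rfiles it was
    processing when it died.

    Each file gets 1/N added, where N is the number of dead compactors
    detected in this pass. A systemic event (many compactors dying at
    once) therefore moves each file's count only slightly, while a single
    problem file repeatedly killing compactors accumulates quickly.
    """
    n = len(dead_compactors)
    if n == 0:
        return counts
    for rfiles in dead_compactors.values():
        for f in rfiles:
            counts[f] = counts.get(f, 0.0) + 1.0 / n
    return counts
```

With this weighting, a file involved in a 200-compactor outage gains only 0.005 per pass, while a file that single-handedly kills its compactor gains 1.0 per pass.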

@dlmarion (Contributor)

This seems similar to the cascading tablet server death problem, where a tserver fails due to data in a scan, the scan moves to another tserver and kills that one too, and so on. I don't know that we have code that handles this; I think the user has to look at logs/metrics to realize their server count is going down.
