Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce automated test crashes #2968

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

edwardneal
Copy link
Contributor

@edwardneal edwardneal commented Nov 3, 2024

This is an attempt to stabilise the automated tests slightly. They sometimes crash and need to be restarted; I've recently been able to reproduce the same behaviour in my local testing.

Several tests instantiate a number of threads, and these threads don't handle exceptions (such as failed test assertions.) If these assertions fail, the testhost exits.

I've converted these threads to tasks, specifying TaskCreationOptions.LongRunning. This will force the task scheduler to give the task its own thread, and it'll mean that the exceptions are propagated back to the caller. The test will thus fail, but the testhost will stay running.

This should hopefully mean that ManualTests consistently takes about 15-20 minutes to run, rather than occasionally taking 30-40 minutes.

The key advantage is that exceptions propagate properly.
If a thread throws an exception (as a result of a failed test assertion, or otherwise) then the test host crashes and must be restarted.
…tion behaviour testing.

This should also allow the test to run on both netcore and netfx.
@edwardneal
Copy link
Contributor Author

edwardneal commented Nov 3, 2024

While not strictly related to this, two comments on the ApiShould.TestSqlCommandCancel test: this is always going to be a somewhat flaky test.

The flaky part of the test methodology is as follows:

  • Create a SqlCommand
  • Start Thread A, which calls SqlCommand.ExecuteNonQuery/.ExecuteReader on it. Expect it to return an exception indicating that it's been cancelled
  • Start Thread B, which calls SqlCommand.Cancel

Besides using threads directly, the current code enables Thread A to run immediately, while Thread B sleeps for 0.5s. This introduces a race condition where sometimes the command could have finished executing before the cancellation runs. In such a case, the original Assert.Throws call would fail (because ExecuteNonQuery/ExecuteReader completes without throwing an InvalidOperationException.) There was also an edge case where ExecuteNonQuery/ExecuteReader might throw a SqlException rather than an InvalidOperationException. In both of these cases, the exceptions went unhandled, the thread crashed and the testhost restarted.

I've tried to rework this coordination slightly with an interlock and a tweak to the approach:

  1. Thread A now waits for a ManualResetEvent controlled by Thread B. Thread B will only set this MRE directly before calling SqlCommand.Cancel
  2. Thread B now hammers calls to SqlCommand.Cancel until it receives a signal on an MRE controlled by Thread A. Thread A sets this signal when the ExecuteNonQuery/ExecuteReader call has finished

This reduces the race condition, but doesn't eliminate it: calling ExecuteNonQuery and ExecuteReader implicitly resets the SqlCommand's cancellation status, so it's possible that thread B might call Cancel, then thread A might reset this. We can never guarantee that a single Cancel call (or X sequential Cancel calls) will be enough to actually cancel the command; this is why Thread B now calls Cancel in a loop.

A second comment on the same TestSqlCommandCancel test: following my changes, there's another couple of errors which look worrying:

Managed SNI, on Linux
  [xUnit.net 00:01:29.38]     Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [FAIL]
  [xUnit.net 00:01:29.39]       System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
  [xUnit.net 00:01:29.39]                  ↓ (pos 0)
  [xUnit.net 00:01:29.39]       Expected: "Operation cancelled by user."
  [xUnit.net 00:01:29.39]       Actual:   "The request failed to run because the bat"···
  [xUnit.net 00:01:29.39]                  ↑ (pos 0))
  [xUnit.net 00:01:29.39]       ---- Assert.Equal() Failure: Strings differ
  [xUnit.net 00:01:29.39]                  ↓ (pos 0)
  [xUnit.net 00:01:29.39]       Expected: "Operation cancelled by user."
  [xUnit.net 00:01:29.39]       Actual:   "The request failed to run because the bat"···
  [xUnit.net 00:01:29.39]                  ↑ (pos 0)
  [xUnit.net 00:01:29.39]       Stack Trace:
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.Wait()
  [xUnit.net 00:01:29.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2012,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value)
  [xUnit.net 00:01:29.39]            at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
  [xUnit.net 00:01:29.39]            at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
  [xUnit.net 00:01:29.39]         ----- Inner Stack Trace -----
  [xUnit.net 00:01:29.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3160,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteNonQuery(Object state)
  [xUnit.net 00:01:29.39]            at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
  [xUnit.net 00:01:29.39]         --- End of stack trace from previous location ---
  [xUnit.net 00:01:29.39]            at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
Native SNI, on Windows
  [xUnit.net 00:01:29.12]     Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [FAIL]
  [xUnit.net 00:01:29.12]       System.AggregateException : One or more errors occurred.
  [xUnit.net 00:01:29.12]       ---- Assert.Equal() Failure
  [xUnit.net 00:01:29.12]                 Γåô (pos 0)
  [xUnit.net 00:01:29.12]       Expected: Operation cancelled by user.
  [xUnit.net 00:01:29.12]       Actual:   A severe error occurred on the current co┬╖┬╖┬╖
  [xUnit.net 00:01:29.12]                 Γåæ (pos 0)
  [xUnit.net 00:01:29.13]       Stack Trace:
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Wait()
  [xUnit.net 00:01:29.13]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2013,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value)
  [xUnit.net 00:01:29.13]         ----- Inner Stack Trace -----
  [xUnit.net 00:01:29.13]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3138,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteReader(Object state)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Execute()

I think the managed SNI exception message is probably "The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy." If so, it's another instance of #26. The native SNI exception message looks like a bug though.


While this test can fail in two ways, the failure now shows up as a test failure rather than a crash & restart of testhost. I can revert my third commit (the changes to TestSqlCommandCancel) if it's better discussed in an issue.

@benrr101 benrr101 added the Area\Tests Issues that are targeted to tests or test projects label Nov 5, 2024
Copy link
Contributor

@benrr101 benrr101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edwardneal thanks so much for helping to make our automated testing more robust!! This is one of my biggest complaints in my day to day work, so anything that reduces the number of times I have to restart a build b/c of flaky tests is super appreciated.

I do have some concerns regarding the interlocking change, and a few other comments I'd like to see addressed before checking it in. But, I'll approve it asap once they are addressed

* Removed two unnecessary iterations from DatabaseHelper.
* Added explanatory comments to ApiShould.
* Switched to using Task.WaitAll rather than waiting for each Task in sequence.
@benrr101
Copy link
Contributor

@edwardneal Thanks again for working on this. I really really appreciate any help with improving our build process.

Once we get another review, I'll be happy to merge this 👍🚀

@benrr101 benrr101 added this to the 6.0-preview3 milestone Nov 14, 2024
cheenamalhotra
cheenamalhotra previously approved these changes Nov 15, 2024
@cheenamalhotra
Copy link
Member

It seems there are test failures:

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.5;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2134
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteReader") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2080
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

@cheenamalhotra
Copy link
Member

It also appears this test failed on Windows too in first round and passed on rerunning, so I'm a little skeptic of this change, though they look promising..

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [976 ms]
##[error]EXEC(0,0): Error Message: 
EXEC : error Message:  [D:\a\_work\1\s\build.proj]
     System.AggregateException : One or more errors occurred.
  ---- Assert.Equal() Failure
            Γåô (pos 0)
  Expected: Operation cancelled by user.
  Actual:   A severe error occurred on the current co┬╖┬╖┬╖
            Γåæ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2027

@edwardneal
Copy link
Contributor Author

edwardneal commented Nov 15, 2024

Thanks @cheenamalhotra, that's fair. The main objective of this PR is to prevent the exceptions thrown by the secondary thread from killing the testhost process. At the moment, a flaky test kills the process and CI restarts it; this adds 10-15 minutes to the build time. With this change, the flaky test fails, the process returns a non-zero exit code and CI restarts it.

I hadn't spotted the second restart criteria earlier, so this PR doesn't reduce the build time as I first thought. It does make some previously-failing tests visible though, and hopefully some future work can start moving progressively more tests into a test run which doesn't have that restart criteria, allowing us to shift the remaining ones towards restarting the individual test rather than the entire testhost.

The exception message which starts with "The request failed to run because the bat" is a simple enough fix, I'm pretty sure it's just another instance of efcore#29861 and will change the test to account for that. I'm more concerned about the exception with the corrupted message string though. That looks a little like the native SNI's SNIGetLastError method is overwriting or freeing the memory behind the SNI_Error.errorMessage member which it returns to TdsParser.

Edit: looks like there are a few additional possible exception messages which can report cancellation. That's a little frustrating - I'd hoped running it until it failed in VS would catch them all. I'll keep looking...

Cancellation can trigger one of several different errors, resulting in a flakier test.
Also ensure that the query always takes more than 150ms, ensuring that a quick query execution doesn't cause the test to fail.
Finally, make sure that we try to read everything from the SqlDataReader.
@cheenamalhotra cheenamalhotra removed this from the 6.0-preview3 milestone Nov 15, 2024
@cheenamalhotra cheenamalhotra dismissed their stale review November 15, 2024 23:15

Will review again new changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area\Tests Issues that are targeted to tests or test projects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants