Reduce automated test crashes #2968

edwardneal · 2024-11-03T09:56:41Z

This is an attempt to stabilise the automated tests slightly. They sometimes crash and need to be restarted; I've recently been able to reproduce the same behaviour in my local testing.

Several tests instantiate a number of threads, and these threads don't handle exceptions (such as failed test assertions.) If these assertions fail, the testhost exits.

I've converted these threads to tasks, specifying TaskCreationOptions.LongRunning. This will force the task scheduler to give the task its own thread, and it'll mean that the exceptions are propagated back to the caller. The test will thus fail, but the testhost will stay running.

This should hopefully mean that ManualTests consistently takes about 15-20 minutes to run, rather than occasionally taking 30-40 minutes.

The key advantage is that exceptions propagate properly. If a thread throws an exception (as a result of a failed test assertion, or otherwise) then the test host crashes and must be restarted.

…parameter.

…tion behaviour testing. This should also allow the test to run on both netcore and netfx.

edwardneal · 2024-11-03T18:29:36Z

While not strictly related to this, two comments on the ApiShould.TestSqlCommandCancel test: this is always going to be a somewhat flaky test.

The flaky part of the test methodology is as follows:

Create a SqlCommand
Start Thread A, which calls SqlCommand.ExecuteNonQuery/.ExecuteReader on it. Expect it to return an exception indicating that it's been cancelled
Start Thread B, which calls SqlCommand.Cancel

Besides using threads directly, the current code enables Thread A to run immediately, while Thread B sleeps for 0.5s. This introduces a race condition where sometimes the command could have finished executing before the cancellation runs. In such a case, the original Assert.Throws call would fail (because ExecuteNonQuery/ExecuteReader completes without throwing an InvalidOperationException.) There was also an edge case where ExecuteNonQuery/ExecuteReader might throw a SqlException rather than an InvalidOperationException. In both of these cases, the exceptions went unhandled, the thread crashed and the testhost restarted.

I've tried to rework this coordination slightly with an interlock and a tweak to the approach:

Thread A now waits for a ManualResetEvent controlled by Thread B. Thread B will only set this MRE directly before calling SqlCommand.Cancel
Thread B now hammers calls to SqlCommand.Cancel until it receives a signal on an MRE controlled by Thread A. Thread A sets this signal when the ExecuteNonQuery/ExecuteReader call has finished

This reduces the race condition, but doesn't eliminate it: calling ExecuteNonQuery and ExecuteReader implicitly resets the SqlCommand's cancellation status, so it's possible that thread B might call Cancel, then thread A might reset this. We can never guarantee that a single Cancel call (or X sequential Cancel calls) will be enough to actually cancel the command; this is why Thread B now calls Cancel in a loop.

A second comment on the same TestSqlCommandCancel test: following my changes, there's another couple of errors which look worrying:

Managed SNI, on Linux

  [xUnit.net 00:01:29.38]     Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [FAIL]
  [xUnit.net 00:01:29.39]       System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
  [xUnit.net 00:01:29.39]                  ↓ (pos 0)
  [xUnit.net 00:01:29.39]       Expected: "Operation cancelled by user."
  [xUnit.net 00:01:29.39]       Actual:   "The request failed to run because the bat"···
  [xUnit.net 00:01:29.39]                  ↑ (pos 0))
  [xUnit.net 00:01:29.39]       ---- Assert.Equal() Failure: Strings differ
  [xUnit.net 00:01:29.39]                  ↓ (pos 0)
  [xUnit.net 00:01:29.39]       Expected: "Operation cancelled by user."
  [xUnit.net 00:01:29.39]       Actual:   "The request failed to run because the bat"···
  [xUnit.net 00:01:29.39]                  ↑ (pos 0)
  [xUnit.net 00:01:29.39]       Stack Trace:
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.Wait()
  [xUnit.net 00:01:29.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2012,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value)
  [xUnit.net 00:01:29.39]            at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
  [xUnit.net 00:01:29.39]            at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
  [xUnit.net 00:01:29.39]         ----- Inner Stack Trace -----
  [xUnit.net 00:01:29.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3160,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteNonQuery(Object state)
  [xUnit.net 00:01:29.39]            at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
  [xUnit.net 00:01:29.39]         --- End of stack trace from previous location ---
  [xUnit.net 00:01:29.39]            at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
  [xUnit.net 00:01:29.39]            at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)

Native SNI, on Windows

  [xUnit.net 00:01:29.12]     Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [FAIL]
  [xUnit.net 00:01:29.12]       System.AggregateException : One or more errors occurred.
  [xUnit.net 00:01:29.12]       ---- Assert.Equal() Failure
  [xUnit.net 00:01:29.12]                 Γåô (pos 0)
  [xUnit.net 00:01:29.12]       Expected: Operation cancelled by user.
  [xUnit.net 00:01:29.12]       Actual:   A severe error occurred on the current co┬╖┬╖┬╖
  [xUnit.net 00:01:29.12]                 Γåæ (pos 0)
  [xUnit.net 00:01:29.13]       Stack Trace:
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Wait()
  [xUnit.net 00:01:29.13]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(2013,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value)
  [xUnit.net 00:01:29.13]         ----- Inner Stack Trace -----
  [xUnit.net 00:01:29.13]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs(3138,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.Thread_ExecuteReader(Object state)
  [xUnit.net 00:01:29.13]            at System.Threading.Tasks.Task.Execute()

I think the managed SNI exception message is probably "The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy." If so, it's another instance of #26. The native SNI exception message looks like a bug though.

While this test can fail in two ways, the failure now shows up as a test failure rather than a crash & restart of testhost. I can revert my third commit (the changes to TestSqlCommandCancel) if it's better discussed in an issue.

benrr101

@edwardneal thanks so much for helping to make our automated testing more robust!! This is one of my biggest complaints in my day to day work, so anything that reduces the number of times I have to restart a build b/c of flaky tests is super appreciated.

I do have some concerns regarding the interlocking change, and a few other comments I'd like to see addressed before checking it in. But, I'll approve it asap once they are addressed

src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/TestFixtures/DatabaseHelper.cs

src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/MirroringTest/ConnectionOnMirroringTest.cs

src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/ParameterTest/ParametersTest.cs

src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs

* Removed two unnecessary iterations from DatabaseHelper. * Added explanatory comments to ApiShould. * Switched to using Task.WaitAll rather than waiting for each Task in sequence.

benrr101 · 2024-11-14T22:16:29Z

@edwardneal Thanks again for working on this. I really really appreciate any help with improving our build process.

Once we get another review, I'll be happy to merge this 👍🚀

cheenamalhotra · 2024-11-15T01:35:46Z

It seems there are test failures:

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.5;Database=NORTHWIND;UID=sa"···, value: "ExecuteNonQuery") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2134
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"···, value: "ExecuteReader") [1 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     System.AggregateException : One or more errors occurred. (Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0))
  ---- Assert.Equal() Failure: Strings differ
             ↓ (pos 0)
  Expected: "Operation cancelled by user."
  Actual:   "The request failed to run because the bat"···
             ↑ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAllCore(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2080
     at InvokeStub_ApiShould.TestSqlCommandCancel(Object, Span`1)
     at System.Reflection.MethodBaseInvoker.InvokeWithFewArgs(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
...

cheenamalhotra · 2024-11-15T01:44:06Z

It also appears this test failed on Windows too in first round and passed on rerunning, so I'm a little skeptic of this change, though they look promising..

Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(connection: "Data Source=tcp:10.0.0.4;Database=NORTHWIND;UID=sa"..., value: "ExecuteReader") [976 ms]
##[error]EXEC(0,0): Error Message: 
EXEC : error Message:  [D:\a\_work\1\s\build.proj]
     System.AggregateException : One or more errors occurred.
  ---- Assert.Equal() Failure
            Γåô (pos 0)
  Expected: Operation cancelled by user.
  Actual:   A severe error occurred on the current co┬╖┬╖┬╖
            Γåæ (pos 0)
    Stack Trace:
       at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout, CancellationToken cancellationToken)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks, Int32 millisecondsTimeout)
     at System.Threading.Tasks.Task.WaitAll(Task[] tasks)
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AlwaysEncrypted.ApiShould.TestSqlCommandCancel(String connection, String value) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs:line 2027

edwardneal · 2024-11-15T14:31:03Z

Thanks @cheenamalhotra, that's fair. The main objective of this PR is to prevent the exceptions thrown by the secondary thread from killing the testhost process. At the moment, a flaky test kills the process and CI restarts it; this adds 10-15 minutes to the build time. With this change, the flaky test fails, the process returns a non-zero exit code and CI restarts it.

I hadn't spotted the second restart criteria earlier, so this PR doesn't reduce the build time as I first thought. It does make some previously-failing tests visible though, and hopefully some future work can start moving progressively more tests into a test run which doesn't have that restart criteria, allowing us to shift the remaining ones towards restarting the individual test rather than the entire testhost.

The exception message which starts with "The request failed to run because the bat" is a simple enough fix, I'm pretty sure it's just another instance of efcore#29861 and will change the test to account for that. I'm more concerned about the exception with the corrupted message string though. That looks a little like the native SNI's SNIGetLastError method is overwriting or freeing the memory behind the SNI_Error.errorMessage member which it returns to TdsParser.

Edit: looks like there are a few additional possible exception messages which can report cancellation. That's a little frustrating - I'd hoped running it until it failed in VS would catch them all. I'll keep looking...

Cancellation can trigger one of several different errors, resulting in a flakier test. Also ensure that the query always takes more than 150ms, ensuring that a quick query execution doesn't cause the test to fail. Finally, make sure that we try to read everything from the SqlDataReader.

Will review again new changes

edwardneal added 3 commits November 3, 2024 09:44

Converted Threads to long-running Tasks

848241d

The key advantage is that exceptions propagate properly. If a thread throws an exception (as a result of a failed test assertion, or otherwise) then the test host crashes and must be restarted.

Corrected the instantiation of the cancellation task - missing state …

4bd9b8a

…parameter.

Changes to TestSqlCommandCancel, eliminating timing-specific cancella…

7c766b4

…tion behaviour testing. This should also allow the test to run on both netcore and netfx.

benrr101 added the Area\Tests Issues that are targeted to tests or test projects label Nov 5, 2024

benrr101 requested changes Nov 5, 2024

View reviewed changes

Responding to code review.

ad1a572

* Removed two unnecessary iterations from DatabaseHelper. * Added explanatory comments to ApiShould. * Switched to using Task.WaitAll rather than waiting for each Task in sequence.

benrr101 approved these changes Nov 14, 2024

View reviewed changes

benrr101 added this to the 6.0-preview3 milestone Nov 14, 2024

cheenamalhotra previously approved these changes Nov 15, 2024

View reviewed changes

edwardneal added 3 commits November 15, 2024 20:30

Merge upstream

35d4607

Correcting previous merge

821ea46

cheenamalhotra self-requested a review November 15, 2024 22:11

cheenamalhotra removed this from the 6.0-preview3 milestone Nov 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce automated test crashes #2968

Reduce automated test crashes #2968

edwardneal commented Nov 3, 2024 •

edited

Loading

edwardneal commented Nov 3, 2024 •

edited

Loading

benrr101 left a comment

benrr101 commented Nov 14, 2024

cheenamalhotra commented Nov 15, 2024

cheenamalhotra commented Nov 15, 2024

edwardneal commented Nov 15, 2024 •

edited

Loading

Reduce automated test crashes #2968

Are you sure you want to change the base?

Reduce automated test crashes #2968

Conversation

edwardneal commented Nov 3, 2024 • edited Loading

edwardneal commented Nov 3, 2024 • edited Loading

benrr101 left a comment

Choose a reason for hiding this comment

benrr101 commented Nov 14, 2024

cheenamalhotra commented Nov 15, 2024

cheenamalhotra commented Nov 15, 2024

edwardneal commented Nov 15, 2024 • edited Loading

edwardneal commented Nov 3, 2024 •

edited

Loading

edwardneal commented Nov 3, 2024 •

edited

Loading

edwardneal commented Nov 15, 2024 •

edited

Loading