-
-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aligned allocation for faster SIMD #1446
Comments
Current CPUs aren't too fussed about unaligned loads; but they are more fussy about aligned stores; i.e. you should prefer aligned stores over aligned loads if you need to make a choice (this even applies to non-SIMD, as you'll loose the performance benefit of store forwarding if you then read back an unaligned store) You could do a best endevor alignment ; e.g. over request the size from the ArrayPool by 16/32 bytes, get the pointer of the
|
That's the exact approach I use in MagicScaler. I make an effort to grab a SIMD-aligned section of a rented array here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Core/PixelBuffer.cs#L179-L190 If the GC moves the array, which becomes unlikely once the pooled arrays get moved into gen2, the worst case is a single operation runs on unaligned memory. You'll also find my 'guardrails' feature in there, in which I mark the areas outside my aligned-offset buffer segment with a byte pattern and then check that those bytes weren't touched when the buffer is returned. It's a super quick way to check for overruns when you're going unsafe. |
Thanks @benaadams @saucecontrol for the suggestion! I also considered this idea, but discarded it, since we can not use explicit aligned Load/Store intrinsics without guaranteed alignment. I guess it's no longer important on today's CPU's? What about non-temporal stores which also need a guarantee on alignment? |
Are they useful here? They update main memory but don't update cache; so would be good for data that becomes idle after store, however I assume it would be often read back to flow into next stage so it would be good to have the data in cache? |
Our internal primitives usually process data in batches, for example:
Such an batch is typically called for a pixel row (which is a lot of data), then the buffer in From this I assumed they might be useful for us, but I'm not sure, most of this stuff is new for me. |
L2 Cache is 256KB+ per core and L3 is 9MB+; so would be a lot of data for a row? |
NT stores are one of those things where you really need to be sure you're smarter than the processor before you do it. Sort of the same as prefetch these days... the processor will do better than you at managing the cache 99% of the time. You're looking at latencies in the neighborhood of 400 cycles for an NT store, which only works out if the processor is filling the cache with something else as you're storing straight to memory. |
Ah right! I completely forgot for a moment that L2 & L3 is also a thing 😅 |
@saucecontrol btw |
I'm sure you would have heard about it by now if ImageSharp had any overruns 😆 It's more a time saver because it gives you instant feedback when you screw up and gives you a really solid idea where. With the array pool handing out bigger buffers than you ask for, those bugs can hide for a while. It's saved me a ton of debug time since I implemented it. |
Problem
Intel's optimization manual places strong emphasis on the importance of alignment of memory addresses SIMD registers are stored/loaded from. Significant performance penalties occur when loads/stores are crossing (64 byte) cache line boundaries.
Managed (array pool) arrays give us no guarantees on the alignment of the
&arrray[0]
base address, which means that all ImageSharp SIMD code suffers from these penalties globally.Idea
dotnet/runtime#27146 initially proposed a mechanism to control the alignment of objects on the pinned heap, but that feature did not get into the final implementation.
Considering that the earliest release where runtime support might be available is .NET 6.0, and it won't be available in erlier releases anyways, I'm proposing a library-specific solution:
AllocationOptions.AlignedPinned
The question I can't figure out
I want to make sure there are no unwanted side-effects for the GC. What buffers should we use with this trick, considering that the (mid to large) buffers we want to align will be always temporary? Note that max lifetime of these buffers is the run of an image codec or a processing operation, pinned allocations will be typically bound to these kind of big, expensive operations and won't happen within individual primitive steps.
Marshal.Alloc
?@tannergooding @saucecontrol any thoughts/concerns?
Trying to also summon @benaadams if it's not too impolite.
The text was updated successfully, but these errors were encountered: