Releases: ROCm/Tensile
Tensile 4.41.0 for ROCm 6.2.4
Tensile code for ROCm 6.2.4 did not change. The library was rebuilt for the updated ROCm 6.2.4 stack.
Tensile 4.40.0 for ROCm 6.1.2
Tensile code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.
Tensile 4.40.0 for ROCm 6.1.1
Tensile code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.
Tensile 4.41.0 for ROCm 6.2.2
Tensile code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
Tensile 4.41.0 for ROCm 6.2.1
Tensile code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.
Tensile 4.41.0 for ROCm 6.2.0
Additions
- new tuning script to summarize rocBLAS log file
- new environment variable to test fixed grid size with Stream-K kernels
- new Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
- add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
- add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported, instead of rejecting kernel generation)
- support NT flag for global load and store for gfx94x
- new Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
- add option for rotating buffers timing with cache eviction
- add predicate for arithmetic intensity
- add DirectToVgpr + packing for f8/f16 + TLU cases
- enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
- add test cases for DirectToVgpr + packing
- add batch support for Stream-K kernels and new test cases
- new tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
- enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
- support VectorWidthB (new parameter VectorWidthB)
- support VectorWidth + non SourceSwap
- add test cases for VectorWidthB, VectorWidth + non SourceSwap
- add code owners file
- new environment variables to dynamically adjust number of CUs used in Stream-K
- add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA and GlobalLoadVectorWidthB, effective with GlobalReadVectorWidth=-1)
- add xf32 option to rocblas-bench input creator
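The arithmetic intensity predicate listed above can be illustrated with a small sketch. This is not Tensile's implementation: the function names, the `threshold` parameter, and the traffic model (read A and B once, write D once, read C only when beta is nonzero) are assumptions chosen for illustration.

```python
def arithmetic_intensity(m, n, k, batch=1, bytes_per_element=2, beta_nonzero=False):
    """FLOPs per byte moved for D = alpha*A*B + beta*C (illustrative model only)."""
    flops = 2 * m * n * k * batch  # one multiply and one add per multiply-accumulate
    elements = m * k + k * n + m * n  # read A, read B, write D
    if beta_nonzero:
        elements += m * n  # C is also read when beta != 0
    bytes_moved = elements * batch * bytes_per_element
    return flops / bytes_moved


def meets_intensity_threshold(m, n, k, threshold, **kwargs):
    """Hypothetical predicate: true when the problem's intensity reaches `threshold`."""
    return arithmetic_intensity(m, n, k, **kwargs) >= threshold
```

Under this model a large square GEMM has a high intensity, while a skinny problem such as `n = 1` stays close to one FLOP per byte, which is the kind of distinction a predicate like this could use to route problems to different kernels.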
Optimizations
- initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py)
- optimize temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
- reverse MFMA order in inner loop for odd outer iteration
- optimize waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
- raise the maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define the maximum value of DepthU)
Changes
- update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
- limit build threads based on CPUs/RAM available on system (for tests)
- update required workspace size for Stream-K, skip kernel initialization when possible
- use fallback libraries for archs without optimized logic
- use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
- remove OCL tests
- disable HostLibraryTests
- reduce extended test time by removing extra parameters in the test config files
- disable InitAccVgprOpt for Stream-K
- skip sgemm 64bit offset tests for gfx94x
- skip DTV, DTL, LSU+MFMA tests for gfx908
- increase extended test timeout to 720 min
- update xfail test (1sum tests only failing on gfx90a)
- update lib logic convertor script
- test limiting CI threads for only gfx11
- WGM related kernargs are removed if they are not needed (WGM=-1,0,1)
- clean up unused old code, mostly related to the old client
- change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
- update efficiency script for new architecture and xf32 datatype
- re-enable negative values for WorkGroupMapping (asm kernel only)
- disable HW monitor for aquavanjaram941
- pre-apply offsets for strided batch kernels
- update tensile build with 16 threads
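Limiting build threads by both CPU count and available RAM, as mentioned in the list above, could look roughly like the following. This is a minimal sketch under stated assumptions: the function names and the `ram_per_job_bytes` estimate are hypothetical and are not taken from Tensile's build scripts.

```python
import os


def limit_build_threads(cpu_count, available_ram_bytes, ram_per_job_bytes, requested=None):
    """Return a job count bounded by CPUs, by RAM, and by an optional requested cap."""
    by_cpu = max(1, cpu_count)
    # Each job is assumed to need roughly ram_per_job_bytes (a hypothetical estimate).
    by_ram = max(1, available_ram_bytes // ram_per_job_bytes)
    limit = min(by_cpu, by_ram)
    if requested is not None:
        limit = min(limit, max(1, requested))
    return limit


def detect_total_ram_bytes():
    """Total physical RAM where these POSIX sysconf names exist; None otherwise."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        return None
```

For example, `limit_build_threads(os.cpu_count() or 1, ram, 2 * 2**30, requested=16)` would cap a build at 16 jobs while backing off further on machines with little memory, where `ram` comes from `detect_total_ram_bytes()` or another source.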
Fixes
- fix WorkspaceCheck implementation when used in rocBLAS
- ignore asm cap check for kernel arg preload for rocm6.0 and older
- fix Stream-K partials cache behavior
- fix MasterSolutionLibrary indexing for multiple architecture build
- fix memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size)
- fix BufferLoad=False with Stream-K
- fix mismatch issue with GlobalReadCoalesceGroup
- fix rocblas build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"])
- fix for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not a power of 2)
- fix inaccurate vgpr allocation for ClusterLocalRead
- fix mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2
- fix mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of InnerUnroll (iui) is 0)
- fix HostLibraryTests on gfx942 and gfx941
- fix LLVM crash issue
- fix for newer windows vcpkg msgpack and vcpkg version package name
- fix an error with DisableKernelPieces + 32bit ShadowLimit
Tensile 4.40.0 for ROCm 6.1.0
Additions
- new DisableKernelPieces values to invalidate local read, local write, and global read
- stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
- feature to allow testing stream-k grid multipliers
- debug output to check occupancy for Stream-K
- reject condition for FractionalLoad + DepthU!=power of 2
- new TENSILE_DB debugging value to dump the common kernel parameters
- predicate for APU libs
- new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
- new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
- new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
- initialization type and general batched options to the rocblas-bench input creator script
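The auto LdsPad fallback in the list above (use the automatic padding, but drop to LdsPad=0 if LDS would overflow) can be sketched as follows. The size formula, the function names, and the 64 KiB `max_lds_bytes` default are simplifying assumptions for illustration and do not reproduce Tensile's actual LDS layout calculation.

```python
def lds_bytes(macro_tile, depth_u, lds_pad, bytes_per_element):
    """Bytes for one operand: depth_u rows of (macro_tile + lds_pad) elements."""
    return (macro_tile + lds_pad) * depth_u * bytes_per_element


def resolve_auto_lds_pad(mt_a, mt_b, depth_u, bytes_per_element, auto_pad,
                         max_lds_bytes=65536):
    """Keep the automatically chosen pad if A and B fit in LDS, else fall back to 0."""
    def total(pad):
        return (lds_bytes(mt_a, depth_u, pad, bytes_per_element)
                + lds_bytes(mt_b, depth_u, pad, bytes_per_element))

    if total(auto_pad) > max_lds_bytes:
        return 0  # padding pushed the footprint over the limit
    return auto_pad
```

The sketch only decides between the automatic pad and zero; whether the unpadded footprint itself fits is a separate check.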
Optimizations
- enabled MFMA + LocalSplitU=4 for MT16x16
- enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
- optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
- auto LdsPad calculation for TileMajorLds + MI16x16
- auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
Changes
- cleared hipErrorNotFound error since it is an expected part of the search
- modified hipcc search path for Linux
- changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
- changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
- changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
- updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
- replaced std::regex with fnmatch()/PathMatchSpec as a workaround for a known std::regex stack overflow bug
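The shell-style wildcard matching that `fnmatch()` provides, mentioned in the last item above, can be illustrated with Python's standard `fnmatch` module, which follows similar pattern rules. This is an analogy only: the actual change is in Tensile's C++ code, and the helper name and file names used here are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_names(names, pattern):
    """Return the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical file names, used only to show the pattern syntax.
example_names = [
    "TensileLibrary_gfx90a.dat",
    "TensileLibrary_gfx942.dat",
    "Kernels_gfx90a.hsaco",
]
dat_files = filter_names(example_names, "TensileLibrary_*.dat")
```

Wildcard patterns such as `TensileLibrary_*.dat` cover the common file-selection cases without the full expressive power of a regular expression engine.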
Fixes
- hipcc compile: append flag parallel-jobs=4
- race condition in Stream-K that appeared with large grids and small sizes
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
- incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
- small fix for LdsPad optimization (LdsElement calculation)
Tensile 4.39.0 for ROCm 6.0.2
Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.
Tensile 4.39.0 for ROCm 6.0.0
Added
- Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
- Added/updated tuning scripts
- Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
- Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
- Added asmcap check for MFMA + const src
- Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
- Added a new parameter to increase miLatencyLeft
Optimizations
- Enabled InitAccVgprOpt for MatrixInstruction cases
- Implemented local read related parameter calculations with DirectToVgpr
- Adjusted miIssueLatency for gfx940
- Enabled dedicated vgpr allocation for local read + pack
- Optimized code initialization
- Optimized sgpr allocation
- Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
- Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling
Changed
- Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
- Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
- Removed unused CustomKernels and ReplacementKernels
- Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
- Removed unused code for DirectToLds
- Updated test cases for DTV + TransposeLDS=False
- Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
- Changed how to calculate latencyForLR for miLatency
- Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
- Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
- Supported multi-gpu for different architectures in lazy library loading
- Enabled dtree library for batch > 1
- Added problem scale feature for dtree selection
- Enabled ROCm SMI for gfx940/941
- Modified non-lazy load build to skip experimental logic
Fixed
- Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
- Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases
- Fixed merge error affecting i8 with wmma
- Fixed mismatch issue with DTLds + TSGR + TailLoop
- Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
- Fixed override for unloaded solutions when lazy loading
- Fixed some build errors (adding missing headers)
- Fixed boost link for a clean build on ubuntu22
- Fixed bug in forcestoresc1 arch selection
- Fixed compiler directive for gfx941 and gfx942
- Fixed formatting for DecisionTree_test.cpp
Tensile 4.38.0 for ROCm 5.7.1
Tensile code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.