Fun with half-precision floats, loops, vectorization-SIMD and shortcomings

Hi all,

I am testing the power of the Ada type system, new CPU features, and the compiler’s ability to optimize code. I have an AMD Zen 4 CPU, which supports AVX-512 and BFloat16 instructions. I wanted to see whether GCC could automatically generate data structures and optimizations using those newer features.

I have been using Compiler Explorer and GCC 15 to run these tests. You can find the code and setup here. The code is:

pragma Source_File_Name (Square, Body_File_Name => "example.adb");
pragma Ada_2022;

function Square(num : Integer) return Float is
    type My_Smol_Float is digits 6;  -- CHANGE ME!
    type Index is range 1..300;      -- CHANGE ME!

    Floaty1 : constant array (Index) of My_Smol_Float := (others => My_Smol_Float(num));
    Floaty2 : constant array (Index) of My_Smol_Float := (others => My_Smol_Float(num**2));
    Floaty_All :       array (Index) of My_Smol_Float;
    Accumul1, Accumul2, Accumul3 :      My_Smol_Float := 0.0;
begin
    for N of Floaty1 loop
        Accumul1 := @ + N;
    end loop;

    for I in Floaty1'Range loop
        Accumul2 := @ + Floaty1(I) + Floaty2(I);
    end loop;

    for I in Floaty1'Range loop
        Floaty_All(I) := Floaty1(I) * Floaty2(I);
    end loop;
    
    Accumul3 := Floaty_All'Reduce("*", 1.0);  -- 1.0 is the identity for "*"; 0.0 would force the result to zero

    return Float(Accumul1 + Accumul2 + Accumul3);
end Square;

I was testing the differences between optimization levels and options.
I was mainly looking at:

  • -O2 -gnatp
  • -O3 -gnatp
  • -O2 -march=znver4 -mtune=znver4 -gnatp
  • -O3 -march=znver4 -mtune=znver4 -gnatp -fopt-info-vec-all

The compiler output for the last set of options is the one I care about most. Here is what I found.

  1. GCC 15 fails at optimizing the 'Reduce operation. The -fopt-info-vec-all output does provide some hints, but I am not able to use them in any meaningful way, as I do not understand these things all that well.
  2. GCC is able to generate half floats for low-precision floating-point types, though I think it is not able to generate 8-bit or 4-bit floats. Nor is it able to generate BFloat16.
  3. GCC can stream half floats and easily optimizes and vectorizes loops for SIMD without a problem (even at -O2).
  4. GCC unrolls the loops into SIMD operations if the array does not hold too much data, which is nice.
  5. GCC, while able to recognise that 6 digits of precision fit a half float, does not generate any SIMD arithmetic instructions specialized for it. It generates vaddss, which is for single-precision floats, instead of using the more advanced AVX2-FMA or AVX512-FP16 arithmetic instructions (wiki info)…
    1. This is strange, as it clearly understands that they are half precision and that the arithmetic operations I am doing are available on my hardware… Is this one of those cases where Ada’s lack of undefined behaviour prevents some instructions from being generated? I know that GNAT will not emit some instructions if the hardware implementation has UB that is incompatible with Ada’s guarantees.
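One way to check which representation GNAT actually picked for a given `digits` declaration is to print the type’s standard attributes. A minimal sketch (the procedure name Show_Float_Repr is mine, not from the code above): an IEEE half float would report Size = 16 and Machine_Mantissa = 11, while a 32-bit single float reports 32 and 24.

```ada
pragma Ada_2022;
with Ada.Text_IO; use Ada.Text_IO;

--  Sketch: inspect the representation GNAT chose for a "digits" type.
procedure Show_Float_Repr is
    type My_Smol_Float is digits 6;  -- same declaration as in the example
begin
    Put_Line ("Size     =" & Integer'Image (My_Smol_Float'Size));
    Put_Line ("Mantissa =" & Integer'Image (My_Smol_Float'Machine_Mantissa));
end Show_Float_Repr;
```

Comparing these numbers against the generated instructions should make it unambiguous whether the scalar vaddss is operating on a genuinely 16-bit value or on one that was widened to single precision.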

Summary:

  • Point 5 is the one I was hoping the compiler would handle. That would create a huge performance gain automagically.
  • It is nice to see that GCC can automatically derive half floats from Ada code.
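Regarding point 1, an explicit loop may be worth comparing against 'Reduce in Compiler Explorer, since GCC generally vectorizes plain loops well. A sketch under my own assumptions (function name and data are mine; the product will overflow numerically, but the point is only the generated code):

```ada
pragma Ada_2022;

--  Sketch: the same reduction as 'Reduce ("*", 1.0), written as a
--  plain loop that GCC's vectorizer can attack directly.
function Product_Loop return Float is
    type My_Smol_Float is digits 6;
    type Index is range 1 .. 300;
    Floaty_All : constant array (Index) of My_Smol_Float := (others => 2.0);
    Accumul    : My_Smol_Float := 1.0;  -- identity for "*"
begin
    for N of Floaty_All loop
        Accumul := @ * N;
    end loop;
    return Float (Accumul);
end Product_Loop;
```

Whether this actually beats 'Reduce on GCC 15 is something to verify in the assembly; if it does, it would confirm the problem is in how 'Reduce is expanded rather than in the vectorizer itself.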

Next episode: try the GNAT-LLVM compiler :smiley:

Cheers,
Fer

EDIT: added link to some basic docs for AVX512 and deleted unneeded optimization flag.


Well… I just compiled GNAT-LLVM (based on LLVM 19), and it does not generate code anywhere near as optimized as GCC’s… The half float is just treated as a normal float, and it throws away quite a few of the optimizations. It also generates addss instructions instead of vaddss, and LLVM only uses the %xmmX registers while GCC also happily uses %ymmX and %zmmX… Oh well, it seems more compiler optimizations are needed on both sides.

If you look at my maths lib (nothing special, and unfinished), the matrix functions are unrolled and the compiler vectorises them nicely.
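For reference, the shape of matrix code that tends to vectorise well is just a plain triple loop over small constrained arrays. A hypothetical sketch, not the actual library code (names and sizes are mine):

```ada
pragma Ada_2022;

--  Sketch: a 4x4 matrix multiply as a triple loop.  With fixed bounds
--  the compiler can fully unroll and vectorise the inner loops.
procedure Mat_Mul_Demo is
    subtype Dim is Integer range 1 .. 4;
    type Matrix is array (Dim, Dim) of Float;

    function "*" (L, R : Matrix) return Matrix is
        Result : Matrix := (others => (others => 0.0));
    begin
        for I in Dim loop
            for J in Dim loop
                for K in Dim loop
                    Result (I, J) := @ + L (I, K) * R (K, J);
                end loop;
            end loop;
        end loop;
        return Result;
    end "*";

    A : constant Matrix := (others => (others => 1.0));
    B : constant Matrix := (others => (others => 2.0));
    C : constant Matrix := A * B;
begin
    pragma Assert (C (1, 1) = 8.0);  -- 4 terms of 1.0 * 2.0
end Mat_Mul_Demo;
```

With -gnatp and fixed bounds, there are no run-time checks left to block the vectorizer, which is presumably why the unrolled SIMD code comes out cleanly.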