Fun with half-precision floats, loops, vectorization-SIMD and shortcomings

Hi all,

I am testing the power of the Ada type system, new CPU features, and the compiler’s ability to optimize code. I have an AMD Zen 4 CPU, which supports AVX-512 and BFloat16 instructions. I wanted to see whether GCC could automatically generate data structures and optimizations using those newer features.

I have been using Compiler Explorer and GCC 15 to run these tests. You can find the code and setup here. The code is:

pragma Source_File_Name (Square, Body_File_Name => "example.adb");
pragma Ada_2022;

function Square(num : Integer) return Float is
    type My_Smol_Float is digits 6;  -- CHANGE ME!
    type Index is range 1..300;      -- CHANGE ME!

    Floaty1 : constant array (Index) of My_Smol_Float := (others => My_Smol_Float(num));
    Floaty2 : constant array (Index) of My_Smol_Float := (others => My_Smol_Float(num**2));
    Floaty_All :       array (Index) of My_Smol_Float;
    Accumul1, Accumul2, Accumul3 :      My_Smol_Float := 0.0;
begin
    for N of Floaty1 loop
        Accumul1 := @ + N;
    end loop;

    for I in Floaty1'Range loop
        Accumul2 := @ + Floaty1(I) + Floaty2(I);
    end loop;

    for I in Floaty1'Range loop
        Floaty_All(I) := Floaty1(I) * Floaty2(I);
    end loop;
    
    Accumul3 := Floaty_All'Reduce("*", 1.0);  -- 1.0 is the identity for "*"; 0.0 would force the result to zero

    return Float(Accumul1 + Accumul2 + Accumul3);
end Square;

I was testing the differences between optimization levels and options.
I was mainly looking at:

  • -O2 -gnatp
  • -O3 -gnatp
  • -O2 -march=znver4 -mtune=znver4 -gnatp
  • -O3 -march=znver4 -mtune=znver4 -gnatp -fopt-info-vec-all

The compiler output for the last set of options is the one I care about most. Here is what I found.

  1. GCC 15 fails at optimizing the 'Reduce operation. The -fopt-info-vec-all output does provide some hints, but I am not able to use them in any meaningful way, as I do not understand these things all that well.
  2. GCC is able to generate half floats for low-precision floating-point types, though I think it is not able to generate 8-bit or 4-bit floats. Nor is it able to generate BFloat16.
  3. GCC can stream half floats and easily optimizes and vectorizes loops for SIMD without a problem (even at -O2).
  4. GCC unrolls the loops into SIMD operations if the array does not hold too much data, which is nice.
  5. GCC, while able to recognise that 6 digits of precision fit a half float, does not generate any SIMD arithmetic instructions specialized for it. It generates vaddss, which is for single-precision floats, instead of using the more advanced AVX2-FMA or AVX512-FP16 arithmetic instructions (wiki info)…
    1. This is strange, as it clearly understands that they are half precision and that the arithmetic operations I am doing are available on my hardware… Is this one of those cases where Ada’s lack of undefined behaviour prevents some instructions from being generated? I know that GNAT will not emit some instructions if the hardware implementation has UB that is incompatible with Ada’s guarantees.
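One way to check which representation GNAT actually picked for a given `digits` declaration is to print the type’s standard attributes. A minimal sketch (the procedure name Show_Float_Repr is mine, not from the code above): an IEEE half float would report Size = 16 and Machine_Mantissa = 11, while a 32-bit single float reports 32 and 24.

```ada
pragma Ada_2022;
with Ada.Text_IO; use Ada.Text_IO;

--  Sketch: inspect the representation GNAT chose for a "digits" type.
procedure Show_Float_Repr is
    type My_Smol_Float is digits 6;  -- same declaration as in the example
begin
    Put_Line ("Size     =" & Integer'Image (My_Smol_Float'Size));
    Put_Line ("Mantissa =" & Integer'Image (My_Smol_Float'Machine_Mantissa));
end Show_Float_Repr;
```

Comparing these numbers against the generated instructions should make it unambiguous whether the scalar vaddss is operating on a genuinely 16-bit value or on one that was widened to single precision.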

Summary:

  • Point 5 is the one I was hoping the compiler would handle. That would create a huge performance gain automagically.
  • It is nice to see that GCC can automatically derive half floats from Ada code.
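Regarding point 1, an explicit loop may be worth comparing against 'Reduce in Compiler Explorer, since GCC generally vectorizes plain loops well. A sketch under my own assumptions (function name and data are mine; the product will overflow numerically, but the point is only the generated code):

```ada
pragma Ada_2022;

--  Sketch: the same reduction as 'Reduce ("*", 1.0), written as a
--  plain loop that GCC's vectorizer can attack directly.
function Product_Loop return Float is
    type My_Smol_Float is digits 6;
    type Index is range 1 .. 300;
    Floaty_All : constant array (Index) of My_Smol_Float := (others => 2.0);
    Accumul    : My_Smol_Float := 1.0;  -- identity for "*"
begin
    for N of Floaty_All loop
        Accumul := @ * N;
    end loop;
    return Float (Accumul);
end Product_Loop;
```

Whether this actually beats 'Reduce on GCC 15 is something to verify in the assembly; if it does, it would confirm the problem is in how 'Reduce is expanded rather than in the vectorizer itself.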

Next episode: try the GNAT-LLVM compiler :smiley:

Cheers,
Fer

EDIT: added link to some basic docs for AVX512 and deleted unneeded optimization flag.


Well… I just compiled GNAT-LLVM (based on LLVM 19), and it does not generate code anywhere near as optimized as GCC’s… The half float is just treated as a normal float, and it throws away quite a few of the optimizations. It also generates addss instructions instead of vaddss, and LLVM only uses the %xmmX registers while GCC also happily uses %ymmX and %zmmX… Oh well, it seems more compiler optimizations are needed on both sides.

If you look at my maths lib (nothing special, and unfinished), the matrix functions are unrolled and the compiler vectorises them nicely.
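For reference, the shape of matrix code that tends to vectorise well is just a plain triple loop over small constrained arrays. A hypothetical sketch, not the actual library code (names and sizes are mine):

```ada
pragma Ada_2022;

--  Sketch: a 4x4 matrix multiply as a triple loop.  With fixed bounds
--  the compiler can fully unroll and vectorise the inner loops.
procedure Mat_Mul_Demo is
    subtype Dim is Integer range 1 .. 4;
    type Matrix is array (Dim, Dim) of Float;

    function "*" (L, R : Matrix) return Matrix is
        Result : Matrix := (others => (others => 0.0));
    begin
        for I in Dim loop
            for J in Dim loop
                for K in Dim loop
                    Result (I, J) := @ + L (I, K) * R (K, J);
                end loop;
            end loop;
        end loop;
        return Result;
    end "*";

    A : constant Matrix := (others => (others => 1.0));
    B : constant Matrix := (others => (others => 2.0));
    C : constant Matrix := A * B;
begin
    pragma Assert (C (1, 1) = 8.0);  -- 4 terms of 1.0 * 2.0
end Mat_Mul_Demo;
```

With -gnatp and fixed bounds, there are no run-time checks left to block the vectorizer, which is presumably why the unrolled SIMD code comes out cleanly.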