LLM-generated code and UTF-8 attacks

kevlar700 · November 17, 2025, 1:19am

Considering how easily LLM server systems can be manipulated (I am told sandbox escaping is very easy), and given the concern over UTF-8 compiler attacks that can embed hidden code exploits, even AI clients like Copilot that only suggest code without local binary execution may be attack vectors targeting particular companies or individuals.

Is it a good mitigation to use -gnatWb to reject any non-ASCII code?

OneWingedShark · November 17, 2025, 2:59am

Meh, Byron already addressed that: all processing is done with UTF-32 (Wide_Wide_Character):

Converting all input to UTF-32, here;
Using the definition from the LRM to ensure only valid identifiers, here;
Ensuring only “emittable” items are put into the token-stream here.

TL;DR — Correctness mitigates errors, simplicity (e.g. converting everything to Wide_Wide_Character) reduces the “surface area” which would be required (i.e. multiple processors).

dmitry-kazakov · November 17, 2025, 8:17am

It is too early for Fool’s Day jokes.

kevlar700 · November 17, 2025, 8:38am

Are Bidi control characters the only attack vector, and are they handled by default in GCC/GNAT?

“There are a variety of ways to exploit the adversarial encoding of source code. The underlying principle is the same in each:
use Bidi control characters to create a syntactically valid reordering of source code characters in the target language.
In the following section, we propose three general types of exploits that work across multiple languages. We do not claim that this list is exhaustive.”

dmitry-kazakov · November 17, 2025, 8:56am

See ARM 2.3. Bidirectional control characters have the Unicode general category Cf (other_format). As such they cannot be part of an Ada identifier.

And of course this has nothing to do with the encoding at all, be it UTF-8, UTF-16 or UCS-4 (Wide_Wide_Character).
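The category claim is easy to check outside the compiler. The following is a minimal Python sketch, assuming the standard unicodedata module; the table of control characters and the helper names are illustrative and not taken from the thread. It looks up the general category of each bidirectional control and flags any Cf character in a piece of text, wherever it occurs.

```python
import unicodedata

# Bidirectional control characters, keyed by code point (illustrative table).
BIDI_CONTROLS = {
    0x061C: "ALM",
    0x200E: "LRM", 0x200F: "RLM",
    0x202A: "LRE", 0x202B: "RLE", 0x202C: "PDF",
    0x202D: "LRO", 0x202E: "RLO",
    0x2066: "LRI", 0x2067: "RLI", 0x2068: "FSI", 0x2069: "PDI",
}

def bidi_categories():
    """Map each control's abbreviation to its Unicode general category."""
    return {name: unicodedata.category(chr(cp))
            for cp, name in BIDI_CONTROLS.items()}

def format_characters(text):
    """Return (index, 'U+XXXX') for every character of category Cf in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]
```

Because the check works on decoded characters, the result is the same whether the file was stored as UTF-8 or UTF-16.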

Of course one can argue that theoretically Wide_Wide_Character could give some more opportunities for manipulation because it is larger than the set of valid code points and lacks the checks that UTF encodings have. But that is rubbish too.

The actual reason why only ASCII-7 should be used is in the homographic glyphs. They remain so even after normalization. E.g. a (U+0061) and а (U+0430).
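As a minimal Python sketch of that point (the helper names are hypothetical, not from the thread): none of the four Unicode normalization forms maps the Cyrillic letter to the Latin one, so two identifiers that render the same stay distinct, while a plain ASCII test does tell them apart.

```python
import unicodedata

LATIN_A = "\u0061"     # LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A

def normalized_forms(ch):
    """Return the character under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, ch)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

def is_ascii_only(identifier):
    """True when every character of the identifier is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in identifier)

# Two identifiers that look alike on screen but differ in one character.
latin_name = "Sh" + LATIN_A + "dows"
mixed_name = "Sh" + CYRILLIC_A + "dows"
```

Here `latin_name == mixed_name` is false even after normalizing both, and only `is_ascii_only(latin_name)` holds.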

kevlar700 · November 17, 2025, 12:55pm

Can you limit all code to ASCII only?

kevlar700 · November 17, 2025, 1:14pm

I guess the fact that the pi symbol is part of the Ada standard now may be a problem for e.g. No_Wide_Characters and -gnatin, though I guess it could possibly just be avoided (assuming it’s just a name of a constant).

dmitry-kazakov · November 17, 2025, 4:44pm

Yes, I can in Ada and I do.

kevlar700 · November 17, 2025, 4:56pm

I guess I’ll look into Simple Components then. Would you mind telling me how you do so? Thank you.

dmitry-kazakov · November 17, 2025, 5:21pm

Identifiers and comments are in English.

String literals can be entered as:

Strings_Edit.UTF8.Image (16#E4#) -- a umlaut

or

Character'Val (16#C3#) & Character'Val (16#A4#)
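The octets in the second form are the UTF-8 encoding of the code point passed to Image in the first. A minimal Python sketch that derives them, printed in Ada based-literal notation (the helper name utf8_octets is hypothetical and not part of Strings_Edit):

```python
def utf8_octets(code_point):
    """Return the UTF-8 octets of a code point as Ada based literals."""
    return [f"16#{octet:02X}#" for octet in chr(code_point).encode("utf-8")]
```

For U+00E4 this yields two octets, 16#C3# and 16#A4#. Code points below 16#80# encode as a single octet identical to their ASCII value.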

Needless to say, I use no Wide_Wide_Character ever.

Wide_Character is used exclusively for Windows bindings as a UTF-16 unit.

If you think you need code points for parsing, matching patterns, etc., you are wrong. Here is a full parser of Ada expressions:

No Wide_Wide_Characters.

kevlar700 · November 20, 2025, 11:06am

I found I already had the following on most of my code bases, though for embedded reasons.

pragma Restrictions (No_Wide_Characters);

I added -gnatin to my Alire configuration, but the results are inconsistent: sometimes the program compiles, and sometimes the compiler picks up the illegal characters. It may depend on whether the constant is actually compiled in, since using it elsewhere in the code sometimes causes it to be picked up, but I don’t seem to be able to trigger it reliably.

At times I seem to be able to compile the following, which is something I would like to avoid:

Shadows : constant Character := Ada.Characters.Latin_9.Pound_Sign;
Shаdows : constant Character := Ada.Characters.Latin_9.Ampersand;

I also found this

dmitry-kazakov · November 20, 2025, 11:48am

Do not do that. The proper way is to consider Character an octet and String UTF-8 encoded:

with Strings_Edit.UTF8;  use Strings_Edit.UTF8;
...
Pound     : constant String := Image (16#A3#); -- Length 2 (two UTF-8 octets)
Ampersand : constant String := Image (16#26#); -- Length 1

You never need to convert anything in UTF-8.

kevlar700 · November 20, 2025, 12:07pm

Right, always use Latin-1 for UTF-8 compatibility, and only for ASCII-7. What I am actually trying to accomplish (not that it’s really a big deal, as it would likely get noticed) is to rule out any potential trickery, like hiding away a Boolean. Even if I don’t use AI, or always code by hand from its suggestions, others will, and it seems nefarious actors can control what it provides without much difficulty and without any detection by the user. I want code review to be as straightforward as possible, without any shadowing tricks. I don’t want two symbols that look the same in code to possibly be different.

Shadows :
Shаdows :

One with a Cyrillic a as you mentioned
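For review purposes, a check independent of the compiler can make such a character visible. Below is a minimal Python sketch, assuming the source has already been decoded to text; the function name and the sample declarations are illustrative only. It reports the position and Unicode name of every non-ASCII character.

```python
import unicodedata

def non_ascii_report(source):
    """List (line, column, 'U+XXXX NAME') for every non-ASCII character."""
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                findings.append((line_no, col, f"U+{ord(ch):04X} {name}"))
    return findings

# Illustrative input: the second declaration uses a Cyrillic letter.
SAMPLE = (
    "Shadows : constant Character := 'a';\n"
    "Sh\u0430dows : constant Character := 'b';\n"
)
```

Calling `non_ascii_report(SAMPLE)` flags line 2, column 3 as CYRILLIC SMALL LETTER A. A script like this could be run before review or in CI so that the result does not depend on which compiler switches a given build happens to use.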

kevlar700 · December 3, 2025, 3:07pm

So with the following in alire.toml, either -gnatin works reliably or the shadow constants are banned outright.

[build-switches]
"*".Source_Encoding = "Compiler_Default"