Please give a bit more context on how this is related, as the whole topic is about not dealing with [Wide_]Wide_Strings, but instead using compact UTF-8 encoded strings.
The -gnatWe switch with e=8 selects UTF-8 encoding. For example, you can write this:
["E4"] : Integer := 1;
begin
ä := 2;
Right, and I hope you agree that using [“E4”] as an identifier is not particularly user friendly. Using it as a portable way to define e.g. π seems reasonable, though.
Talking about Unicode identifiers, you already mentioned the issue that visually identical characters can actually be separate symbols to the compiler. This is only an issue when exploited in an intentionally malicious way (e.g. by LLMs). But Ada also has the inverse “problem”: visually different characters are treated as identical due to case insensitivity. This always reminds me of the (IMHO bad) design choice of DOS/Windows to treat file names case-insensitively.
Personally I would like to use Greek characters for identifiers in code, as it avoids ugly names like “Omega_Hat”, but Ada makes this particularly difficult, as in most formulas Ω and ω are certainly not the same.
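A sketch of the clash (assuming -gnatW8 so that Greek identifiers are accepted; the values are made up):

   Ω : constant Long_Float := 600.0;  -- say, an impedance
   ω : constant Long_Float := 314.2;  -- rejected: same identifier as Ω after case folding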
In most formulae division is represented by a horizontal line, while the operands are determined by the line length!
Allow me to disagree.
Case insensitivity is a good thing: while “STOP NOW”, “stop now”, and “Stop now” are all written differently, they mean the same thing. This makes certain naming schemes impossible (e.g. int f(Object object)) and forces a bit more thought into naming, but that extra thought is often an opportunity to make your code more readable and maintainable.
You’re going about it backwards.
Put the formula in the comment, then use sensible names; if things get cluttered, use a renames.
-- …Ω = …
Resistance := …
IOW, don’t write Funky_Symbol, write Meaningful_Name and let the comments show the correspondence.
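For instance, a minimal sketch of that pattern (Voltage and Current are hypothetical names):

   --  R = V / I   (in the paper: Ω)
   Resistance : constant Long_Float := Voltage / Current;
   --  If a dense formula gets cluttered, introduce a short local alias:
   R : Long_Float renames Resistance;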
As for “bad design”: you can make subprograms that are ambiguous without named parameters, thus forcing callers to use named-parameter association:
package Example is
   type Target_Heat (<>) is private;
   function Make (Fahrenheit : Integer) return Target_Heat;
   function Make (Celsius : Integer) return Target_Heat;
private
   type Target_Heat is new Integer;  -- completion added so the sketch compiles
end Example;
This might be appropriate in order to force some care in how someone handles the proffered type, illustrated here with Target_Heat.
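With that profile a positional call is ambiguous, so callers are forced to name the parameter (a usage sketch; the value is arbitrary):

   T : Example.Target_Heat := Example.Make (Fahrenheit => 451);
   --  Example.Make (451) would be rejected as ambiguous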
Not so much that as that I believe I should be able to store UTF-8 at all from within a source file. I want the compiler to read everything between the quotes as bytes, not interpret the values and check them for a constraint error, which is what it’s doing. If Ada had a UTF-8 type that allowed me to do that, I would be fine using it, but it does not.
The problem with your test is that you’re testing output, not the internal storage. Use the -S flag to see the assembly source:
.file "main.adb"
.text
.section .rodata
.align 8
.LC0:
.long 1
.long 12
.text
.align 2
.globl _ada_main
.type _ada_main, @function
_ada_main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
leaq s.0(%rip), %rax
leaq .LC0(%rip), %rdx
movq %rax, %rcx
movq %rdx, %rax
movq %rcx, %rdi
movq %rax, %rsi
call ada__wide_text_io__put_line__2@PLT
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size _ada_main, .-_ada_main
.section .rodata
.align 16
.type s.0, @object
.size s.0, 24
s.0:
.value 12371
.value 124
.value 12435
.value 124
.value 12395
.value 124
.value 12385
.value 124
.value 12399
.value 124
.value 9731
.value 124
.ident "GCC: (GNU) 15.2.0"
.section .note.GNU-stack,"",@progbits
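For reference, a source along these lines appears to reproduce that listing; this is a reconstruction from the .value data above (which decodes to the twelve wide characters こ|ん|に|ち|は|☃|), so treat it as an assumption:

with Ada.Wide_Text_IO;
procedure Main is
begin
   --  Twelve Wide_Characters; matches the .long 12 bound and the s.0 data
   Ada.Wide_Text_IO.Put_Line ("こ|ん|に|ち|は|☃|");
end Main;

The listing itself can be produced with something like gcc -c -S -gnatW8 main.adb, which writes main.s instead of an object file.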
Under s.0, you can see that its size is 24 bytes and that each element is 16 bits wide (that is to say, where you see 124 it’s actually writing 7C 00). What’s happening is that if you don’t specify -gnatW8 (I tried not specifying it, but something on my machine is turning the switch on), the compiler isn’t testing the value of each character in the string. When you enable -gnatW8, it tests the code points to ensure they’re not above Latin-1 instead of being agnostic about what sits between the quotes. This is where the challenge begins: for modern code that’s strictly Ada, we need a way to write something like a string that isn’t constraint-tested. Ada is determined to store the UTF-8 characters by their decoded value instead of seeing an individual character as being made up of multiple bytes. Now, I understand that you shouldn’t be able to store one in a single Character, but the fact of the matter is we’re dealing with strings, so there should be a way to do this, and there doesn’t seem to be.
Right. That’s the error. The problem is there’s no normal way to actually do this, outside of finding a way to make it not decode (any guesses?) or breaking the characters down into individual bytes, which is unreadable in the source. I get that String probably shouldn’t be my answer, but Wide_String and Wide_Wide_String definitely aren’t it either. It seems strange that if I had a single target language for an application I write, I would be unable to support it if that language required characters outside of Latin-1. Stranger yet, if I wanted to use “string files” or something like that, I would have to load them at run time rather than baking them into the code, unless I used C strings. In theory I could try hacking away at the -S output of a package to create some macros for as, but that’s not clean either. The modern way of doing things actually requires C functions (gettext), which raises the question of why you’re using Ada if you have to import C functions for Hello World in your native language (my native tongue is English, but that’s irrelevant to the topic from which this one spawned: why Ada isn’t getting enough attention).
The plot twist in all of this is that if you were to ask what features of Ada got prioritized over storing UTF-8 strings, you’d have to answer: using UTF-8 as identifiers in code. So I can use UTF-8 in my code, so long as it’s not actually going into the executable.
Case insensitivity incentivizes thoughtful name choices, and it is simply comfortable to never have to worry about case.
That few other languages choose to do this is a travesty!
You are thoroughly confused. No language can do that. You need some escape sequences or byte counts as in the FORTRAN Hollerith literals. In Ada, " is escaped by doubling it. C++ has a whole lot of escape sequences.
Unlike many languages, Ada does not specify the encoding of the source. It can work with a limited character set. Any assumption that the source is UTF-8, UTF-16, or KOI8 is wrong.
As for Wide characters, you have another problem on top: endianness. Wide [Wide] blobs are non-portable.
I have difficulty understanding what you are talking about. You are confusing literals, String, Wide_String, and I/O. Can you simply tell us what you want?
Raw I/O is done in Ada with streams. They operate in terms of Stream_Element, which is roughly an octet. Text_IO is formatted I/O, which implies certain processing. But again, it seems that you do not understand what you actually want to write and where. Sorry.
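For instance, a minimal sketch of raw octet output through Stream_IO (the file name and the bytes are illustrative):

with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO; use Ada.Streams.Stream_IO;
procedure Raw_Out is
   F     : File_Type;
   Bytes : constant Stream_Element_Array (1 .. 2) :=
     (16#D0#, 16#9F#);  -- the two UTF-8 octets of 'П'
begin
   Create (F, Out_File, "raw.bin");
   Write (F, Bytes);  -- octets go out verbatim, no recoding or range checks
   Close (F);
end Raw_Out;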
I mean, yeah, since " is a character, some form of escaping would be necessary, at least for that particular character. This is not debated, and you’re splitting hairs.
I want typing hello world, using UTF-8 storage, in any given language in a single Ada source file to not be a pipe dream. Saying that it’s an encoding issue, or that it works if I load it from a file, is beating around the bush. In a modern programming language in a modern environment, it is normal to expect that a UTF-8 representation in source is also storable as UTF-8 in the resulting program, given that it’s the modern standard. I understand that using “the old tools” of the language might break compatibility, so we may need new ones (a new type or something), but trying to waltz around the fact that it isn’t possible is just avoiding the problem and pretending it isn’t one by claiming a normal thing isn’t normal. I could understand if we were talking about SJIS or something, but there’s no reasonable way that I can find to use the standard encoding. Which, I get: Ada isn’t supposed to specify the encoding. Thing is, it actually is doing that, because it’s applying constraints based on switches, and in my case automatically.
We agree here, but guess what Ada is currently demanding.
Oh, I know exactly what I want. I want something that is both portable and readable, which UTF-8 is outside of microcontrollers. Your solution is, I assume, reading from an external file. Which, curiously, is going to use what format? If one wanted to stick to strictly Ada, I’d have to load things like the response to “file not found” at run time… I guess that wouldn’t be very safe of Ada. So, compile time, right? Oh, now I’m surely using assembly macros (might actually be portable) or C (and converting). Think this might be a bit much? Oh, oh, I could manually convert to one-byte sequences, but that’s not very readable either.
This is what happens without -gnatW8. When you declare:
S : Wide_String := "こんにちは☃";
then in the UTF-8 source こんにちは☃ is UTF-8 encoded. As a literal, the compiler takes it literally, from " to ". Therefore S’Length = 18. Each octet of the UTF-8 representation produces one Wide_Character in the range 0..255; logically, each octet is simply widened as if it were a Latin-1 character.
When you specify -gnatW8, the literal is recoded into UCS-2. Naturally, it must then be a legal UTF-8 sequence representable in UCS-2. In that case S’Length = 6.
Use
for I in S'Range loop
   Put_Line (Integer'Image (Wide_Character'Pos (S (I))));
end loop;
to see what is going on.
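For the curious: without -gnatW8 that loop prints 18 octet values (こ alone contributes 227, 129, 147), while with -gnatW8 it prints the 6 decoded code points (12371, 12435, 12395, 12385, 12399, 9731).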
You can also use -S to see what’s going on, and I’m seeing that. The problem is that -gnatW8 counter-intuitively enables these checks. The bigger problem is that if I manually invoke gnat, it pulls -gnatW8 from some mysterious place. Worse yet, when I was at work it was auto-applying -gnatW8, and now that I’m home it’s not. So somehow the terminal I use sets some environment variable that gnat is checking (and it isn’t $LANG). Whatever it is, it shouldn’t be gnat-specific, as I’m switching terminal emulators.
But as a side note, it is worth asking why the -gnatW8 option, which makes Ada aware of UTF-8 to the degree of allowing it in identifiers, enables a check that also forces conversion to UTF-16 (UCS-2) or UTF-32 (UCS-4). Unless I’m missing some datatype or method.
It is not portable because Windows is UTF-16. I said that if you want to write a good Ada program, stick to ASCII-7 and do not use the Wide [Wide] stuff.
If the question is specifically “Can I have a UTF-8 encoded Ada source?”, the answer is yes, you can:
with Ada.Text_IO; use Ada.Text_IO;
procedure Main is
begin
   Put_Line ("Привет!");
end Main;
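Compiled this way (no -gnatW8), the Cyrillic octets of the literal pass into the String unchanged and Put_Line writes them out verbatim, so a UTF-8 terminal renders Привет! correctly.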
Do not use -gnatW8, because it would decode the UTF-8 literal and then attempt to store the resulting code points as Latin-1, which for Cyrillic is naturally impossible. I posted links to the AdaCore documentation on the subject.
Read the documentation! -gnatW8 is not what you think!
Do not use Alire. Install a native toolchain that does not mangle the compiler settings.
I doubt it.
-gnatW8 is an AdaCore hack. It does not make Ada “aware” of UTF-8; it recodes literals. A wide literal whose octets are UTF-8 is garbage: you must recode it to UCS-2 to make it work. That is what -gnatW8 does.
AdaCore has many extensions and alterations. Some of them are good, some not so much.
Incorrect: Windows switched from UTF-16 to UTF-8 years ago. (Later than we would like, but it happened.)
This is much clearer. Not that I intend to code in Japanese, but then I assume that -gnatiw is what enables the identifiers?
And to break the ice: frankly, I can’t tell whether you’re trying to be hostile or not. Some of your messages come off absolutely so; others completely change tone. At some points it seems like we have a communication barrier. The hostile points have made me prone to skimming over your posts and links in favor of the non-hostile posts.
No, it is -gnatif (see below). For Unicode you need -gnatW8.
When the source is UTF-8 encoded, there are no characters other than Latin-1 in there as far as GNAT is concerned. If you try to compile:
П : Integer := 1;
you will get an illegal-character error, because the code will contain 16#D0# 16#9F#. You can still compile it using the -gnatif switch, which allows 8-bit characters in identifiers. In that case the variable name becomes Ð followed by something, which is not what you expect.
For GNAT to have full Unicode identifiers you need a Unicode source, and a UTF-8 file is a Latin-1 source as far as GNAT is concerned. You must recode it to UCS-2 in order to have П as a single character in there. This is how the compiler works. I do it differently in my Ada parser: I consider the source UTF-8 encoded. AFAIK other people do it by converting everything to Wide_Wide_String first, with the same effect. But GNAT does it in this, may I say, strange way.

Now, to work around this, there is the -gnatW8 switch, which recodes UTF-8 to UCS on the fly in identifiers and literals and allows Unicode letters in identifiers. The problem with this approach is that UTF-8 string literals are mutually incompatible with wide literals and identifiers: -gnatW8 kills the first; without it you are restricted to ASCII-7 identifiers and no wide literals. Which causes total confusion among users.
You cannot have both:
S1 : String      := "ü";  -- No -gnatW8
S2 : Wide_String := "ü";  -- With -gnatW8
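For what it’s worth, one possible way to get both, sketched under the assumption that Ada 2012’s standard Ada.Strings.UTF_Encoding package is acceptable: keep the wide literal and encode it to UTF-8 octets at the point of use.

with Ada.Strings.UTF_Encoding.Wide_Strings;
...
--  With -gnatW8: the literal is decoded to Wide_String, then re-encoded
--  to UTF-8 octets held in a plain String (UTF_8_String is a subtype of String)
S1 : constant String := Ada.Strings.UTF_Encoding.Wide_Strings.Encode ("ü");
S2 : constant Wide_String := "ü";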
P.S. I am not at AdaCore. Maybe they have something else up their sleeve.
P.P.S. I understand that advice to read documentation is the ultimate form of hostility these days. I promise to restrain myself in the future!
Technically it is still UTF-16 under the hood; Microsoft makes their intent clear here: Use UTF-8 code pages in Windows apps - Windows apps | Microsoft Learn
You’re supposed to be using UTF-8 for compatibility. Also, the Win32 API is legacy. Mark this day on your calendar: I actually have a compliment for Microsoft. The one and only thing they do right is maintaining legacy support.
They really should fix this, but thank you for the suggestion. I’ll be trying it to see how it works in the event I actually have to deal with Unicode code. I got a PM with another interesting suggestion I plan to try out as well, one that might be more to your liking and cleaner, but I have to see whether or not there is a runtime cost.
Hardly. To the contrary, pointing to the document, and to where in the document something is covered, is quite polite. Your name looks Eastern European or even Northeast Asian to me, but I have learned over time not to judge people solely on that. I do know, however, that if that is the case, those languages tend to handle tone very differently.
What you are trying to do is unreasonable. UTF-8 is a sequence of bytes, not a string. If you want to define a sequence of bytes, you can easily do that, but not as a string literal. If you want a string literal, then you have to use a string type with components of an appropriate character type.
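For example, the octets can be spelled out explicitly, independent of source encoding and compiler switches (a sketch; P_UTF8 is an illustrative name, and 16#D0# 16#9F# is the UTF-8 encoding of П):

--  A byte sequence built by hand rather than written as a string literal
P_UTF8 : constant String := Character'Val (16#D0#) & Character'Val (16#9F#);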
The ARM defines Ada source as Unicode. So something like
Ada.Text_IO.Put_Line ("こんにちは");
should always be invalid and rejected by all compilers, regardless of what compiler options you use. That GNAT has options that allow this to compile is technically a compiler error (though I don’t know how a compiler that optionally accepts Latin-1 source could detect the violation).
If you want to persist in writing invalid Ada, relying on a compiler error to get it compiled, then you’ll have to become very familiar with the compiler’s options.
Why?
Yes, but that does not tell us anything about the encoding of that source.
If the source is UTF-8 then this applies:
4.2 (10/5)
“The evaluation of a string_literal that is a primary and has an expected type that is a string type, yields an array value containing the value of each character of the sequence of characters of the string_literal, as defined in [2.6]”
therefore
S : String := "ü";
should produce a Latin-1 encoded string of length 1 containing ü. Again, if the source were UTF-8.
But GNAT considers the source Latin-1, and thus each octet of the UTF-8 encoded ü becomes an independent “graphic character”, so the length of S is 2.
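A quick way to observe this (assuming a UTF-8 encoded source compiled without -gnatW8):

with Ada.Text_IO;
procedure Check is
   S : constant String := "ü";
begin
   Ada.Text_IO.Put_Line (Integer'Image (S'Length));  -- prints 2: one per octet
end Check;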
Actually, you need a switch to make this illegal! Without it you can have any 8-bit Character in a string. The interpretation of Character depends on the environment, and the environment is UTF-8 unless you configure it otherwise.