Well, from the reading it’s clear to me that while the intent of String was not to hold UTF-8, UTF-8 support is also intended. FWIW, UTF-8 is ultimately the standard everyone goes by, and it is so for a reason. It’s certainly not ideal or perfect, but it is ultimately what the world settled on. And it’s clear to me that the ARG intends to ultimately support what is being called “unreasonable.” The question is “how?” But keep in mind, I came to that conclusion without reading your link. That said, it’s clear from the link that my conclusion was correct.
You are reading it wrong. String can hold anything. The issue is with string literals. Make this mental picture: the literal “Anisöl” is the same for all string types, just like 123 is the same for all integer types. Ada does not have the C-esque mess where you must specify the type of the literal.
Now the literal must be converted to the target type. In our example that is String. Now the encoding of the source starts to be relevant. Is the literal a sequence of 6 Unicode characters? Is it a sequence of 7 Latin-1 characters? The text editor renders them the same. The Ada standard requires that each “character” of the literal be converted to one “character” of the target type. This requirement precludes UTF-8 encoding of literals.
It is not strings, it is literals.
And there is no problem using it in Ada. It is just that Ada follows Unicode while other languages do not. In Ada a Unicode character is a character; in other languages it is not. A language purist could say: OK, do not use String, use Stream_Element_Array. That is exactly what a UTF-8 encoded string is. You can have it all, but no literals. Or you can say: OK, I go full Unicode and use Wide_Wide_String; then you will have literals but no UTF-8.
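The two options can be sketched side by side; the standard `Encode` function from `Ada.Strings.UTF_Encoding.Wide_Wide_Strings` bridges them. A minimal sketch (the byte counts in the final comment assume this particular literal):

```ada
--  Trade-off sketch: Wide_Wide_String gives you literals;
--  UTF_8_String gives compact storage but no direct literals,
--  so one is produced from the other with the standard Encode.
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Two_Options is
   use Ada.Strings.UTF_Encoding;

   --  Option 1: full Unicode with literals, 4 bytes per character.
   W : constant Wide_Wide_String := "Anisöl";

   --  Option 2: UTF-8 storage; no literal, so encode one.
   U : constant UTF_8_String := Wide_Wide_Strings.Encode (W);
begin
   null;  --  W'Length = 6 code points, U'Length = 7 bytes ("ö" takes two)
end Two_Options;
```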
The choice is yours. I do not care about literals in Elder Futhark, I stay ASCII-7.
I disagree with that. If you need random access and you target a desktop computer, where memory consumption is not an issue, then converting from UTF-8 to Wide_Wide_String makes string handling a lot easier. You can then use the slices you mentioned earlier.
Example:
declare
   Test_1 : constant Wide_Wide_String := "😂ABCäöü Hallo ÄÖÜß😛";
   Hallo  : Wide_Wide_String renames Test_1 (9 .. 13);
begin
   null;
end;
Try that with Ada.Strings.UTF_Encoding.UTF_8_String and all the elegance of Ada slices is gone. As well as the renames.
Ada slices work perfectly well with UTF-8 encoded strings. Surely magic numbers are poor programming. What exactly do 9 and 13 mean?
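For what it’s worth, slices do work on a UTF-8 encoded string, but the bounds are then byte offsets rather than code point positions, which is where the arguing about the numbers comes from. A small sketch (the index arithmetic in the comment is specific to this example string):

```ada
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Slice_Demo is
   use Ada.Strings.UTF_Encoding;

   U : constant UTF_8_String :=
     Wide_Wide_Strings.Encode ("äöü Hallo");
   --  "ä", "ö" and "ü" are two bytes each in UTF-8, so "Hallo"
   --  occupies bytes 8 .. 12 even though it starts at code point 5.
   Hallo : UTF_8_String renames U (8 .. 12);
begin
   null;
end Slice_Demo;
```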
The point is, and always was, that no random access to code points is needed in a sane program. Strings need not be “handled.” If you feel that you need it, then reconsider the algorithm. All strings must be processed piece by piece, sequentially.
This is sample code. They mean nothing. You are deflecting to superfluous details.
Maybe not needed but it’s easier. I’m not developing for academic rigour. I just want to get the job done. For me Ada is only a hobby.
My program, my rules. You are not to decide what “must” be done in my programs.
Sure, but then your argument is not “it is easier” with the assumed quantifier for all, but “it worked for me.” Fair enough.
Right, sort of. See, it’s reading the literal and checking it after conversion at the character level. Being aware of UTF-8, it sees that individual characters are larger than one byte, and rejects them. If it doesn’t convert (which removes the opportunity to check the values before placing them in the string), it “just works.”
I think the thing you may or may not be missing is that, in practical terms, it’s impossible to check the values coming in any other way. The fact that you can read UTF-8 in from a file or other source post-compilation is actually an error, because the values aren’t being checked (but literals at least give them the opportunity to do that). They simultaneously intend to support UTF-8 and other encodings where a character may take up more space than the storage unit, but they insist on constraint checking, hence the conversation here:
The solution for every other language has been “we don’t check like that, so we don’t care.” Similarly, I think the solution for Ada could be an “unchecked_string” (similar to Unchecked_Conversion) and call it a day like every other language, but they insist on actually interpreting the characters and checking their validity and size. The fact that it only affects literals is incidental to the fact that it’s impossible to check anything else. Heaven forbid they discover that some people out there might be trying to use UTF-16. Right now, that discussion seems to be leaning towards making a “UTF-8 string” or redefining String to be as permissive as we’re already using it, so people can use things like -gnatW8, since we can already tell that, short of some non-existent case of “malicious Unicode,” it has not been demonstrated to be a problem except when Ada tries to intervene. Frankly, it should be obvious at this point that Ada cannot protect us from malicious Unicode when it can only check literals, and they’re shooting themselves in the foot in the attempt to take the foot-shooting gun from the coders.
Correct, but that is just ignoring the fact that literals are used for a reason. Sometimes you have only one target language (I’d argue this is most often the case with smaller-scale programs designed for a specific work environment). The other problem is “how do you load in strings externally?” if you’re refusing to use literals. Breaking the strings down to individual character values violates a core idea of Ada: readability. Ultimately, we have to store the strings in something that Ada isn’t checking, to bypass Ada’s internal checks, or use one of the ridiculous wide types. Right now, everything (including the unreadable byte-array method) is about sidestepping the value checks.
This is all fine until this internal handling starts using up too much space (unlikely outside of embedded) or it somehow gets sent externally. His point about wide strings he’s not conveying clearly, but he does have a huge point: you cannot expect UTF-16LE and UTF-16BE to behave with one another, same with UTF-32LE and UTF-32BE, which is your suggestion here. IIRC, the problem you’re going to encounter is with Macs, because historically they have switched between LE and BE processors. ARM and x86 are LE, so situations where this is a problem are rare, but that’s not to say it doesn’t exist or won’t exist. This is a ticking time bomb, because no one is going to run a conversion of strings to network byte order for every file or network operation.
Something important for Ada is seeing down this road and preventing it. By using Wide_String or Wide_Wide_String you’re implicitly stating that you consent to this risk.
None of this is true. Ada does not prevent users from using emoji, nor from using the same amount of memory to store the same information as other languages.
You seem to be confusing how you store and output data in an Ada program with how the output device presents those data.
Ada is and always has been a general-purpose language.
No, it absolutely does. Ada intentionally restricts how information is coded into it. Giving you access to adding more of these restrictions is precisely how Ada works. Without the -gnatW8 switch, the compiler allows it, because it’s set by default to honor legacy code and to assume every byte represents one and only one character. When you break that illusion by enabling character interpretation via -gnatWx, it begins to do what it was designed to do: check your input. That’s why this discussion is here, and all the talk about UTF-8 string support.
Individual characters have no size. You confuse a value with its representation. Compare: 123 has no size; depending on representation it can be any number of bytes. Similarly, UTF-8 is a method of representation of “characters” = Unicode code point values.
There are no run-time checks. The Unicode character ü is syntactically illegal in a String literal.
No, it would crash the program, obviously. The proposal seems to make string types descendants of a common ancestor. I didn’t read it, because the Ada type system cannot handle both the encoding view and the “character” view anyway.
String is permissive as hell. It is all a syntax issue.
By reading them from a file?
The problem is that -gnatW8 corrupts the Ada I/O subsystem as well as the stream attributes, by replacing them with encoding/decoding for wide and wide-wide characters. Wide characters were nonsense from the very beginning. There is no such thing. There are no wide sources. I/O would have to write a wide character as 16-bit words. Nobody ever needed that! So AdaCore invented a hack switch that recodes UCS into UTF-8. As I said many times before: do not use this mess! Read strings as other languages do.
Read what I wrote before. There are no checks. The issue is that the definition of a string as an array of “characters” is incompatible with the very notion of variable-length encoding.
In other languages strings are arrays of bytes. “char” in C++ is as much a character as “class” is a class…
In a properly designed language a string would be an array of representation units with an array-of-characters interface on top. No type system I know is capable of that.
Using the “User-Defined Literals” feature from Ada 2022 and Ada.Iterator_Interfaces, it should be possible to create a type (possibly tagged) that combines both: internal storage as UTF-8 and iteration by code points. One minor issue with the internal UTF-8 representation is of course that you cannot easily substitute code points in place, so search-and-replace of code points needs to create a copy or change the size of the internal storage. But again, that’s expected with UTF-8.
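A minimal sketch of that idea, assuming Ada 2022. The names (`UTF8_Text`, `From_Literal`, `Encoded`) are made up for illustration; the internal storage is simply an Unbounded_String holding the UTF-8 bytes, and the code-point iterator is omitted for brevity:

```ada
--  Sketch: a type whose literals are checked at compile time as
--  Wide_Wide_String but stored internally as UTF-8 (Ada 2022).
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Unbounded;                      use Ada.Strings.Unbounded;

package UTF8_Text is
   type UTF8_String is private
     with String_Literal => From_Literal;

   function From_Literal (Text : Wide_Wide_String) return UTF8_String;
   function Encoded (S : UTF8_String) return UTF_8_String;
private
   type UTF8_String is record
      Data : Unbounded_String;  --  the UTF-8 encoded bytes
   end record;
end UTF8_Text;

package body UTF8_Text is
   function From_Literal (Text : Wide_Wide_String) return UTF8_String is
     (Data => To_Unbounded_String
                (Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text)));

   function Encoded (S : UTF8_String) return UTF_8_String is
     (To_String (S.Data));
end UTF8_Text;

--  Usage:  S : UTF8_Text.UTF8_String := "Привет";  --  stored as UTF-8
```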
The representation of the construct will not be UTF-8.
While technically correct, the practical side is that most characters are only representable by values that cannot fit into a certain size. The value implies a minimum size required to store it.
Right. The key point here is that Ada does restrict what values can go into each variable type (an implicit and necessary constraint being overflow prevention). So when it actually is aware of UTF-8, it sees characters by their value and consequently sees what size of storage they won’t fit in. However, this cannot be done at run time.
Technically not. As you pointed out, once UTF-8 is discovered, GNAT checks the values and sees that characters with a value above 16#FF# cannot fit in a single byte. This is a fundamental aspect of Ada: when it can, it checks what’s going into each variable. The problem with the whole thing is that a character value (regardless of encoding, because at some point these values will exceed 16#FF# regardless of whether or not the sign bit is reserved for code pages) will exceed the single-character storage. One of the hard stipulations is that String is an array of characters whose values are 0-255. UTF-8 was designed to fit into systems that don’t check this, because characters are made of multiple bytes that, to an ignorant compiler, look like several characters (for every one UTF-8 character).
If I were to take a 4-byte integer and try to shove it into a short array or an array of individual 1-byte integers, we’d have the same problem. I can make it work by breaking it up, but it’s a single int being unchecked-converted (manually, by changing syntax) into the array (and probably in the wrong endian order). That’s effectively what we’re doing to make UTF-8 work. There’s no way to tell Ada “yo, we know what we’re doing, actually, and we have a special format for this value that allows it to fit properly into an array of a smaller type without it being a problem.”
It’s actually failing a check: it’s seeing that the value is out of range, which is precisely what it’s reporting. But this is splitting hairs, because the range is 0-255, the maximum for one byte. We lack the facilities to say we have a multi-byte encoding that fits something into consecutive values of a smaller type by treating it as an array.
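That “multi-byte encoding” declaration is exactly what the leading byte of each UTF-8 sequence carries: its high bits state how many of the following bytes belong to the same character. A sketch of the rule (the function name is made up, and validation of the continuation bytes themselves is omitted):

```ada
--  How many bytes the UTF-8 sequence starting at First occupies,
--  derived purely from the leading byte's bit pattern.
function Sequence_Length (First : Character) return Positive is
   B : constant Natural := Character'Pos (First);
begin
   if B < 16#80# then
      return 1;                --  0xxxxxxx : plain ASCII
   elsif B < 16#C0# then
      raise Constraint_Error;  --  10xxxxxx : continuation byte, not a start
   elsif B < 16#E0# then
      return 2;                --  110xxxxx
   elsif B < 16#F0# then
      return 3;                --  1110xxxx
   elsif B < 16#F8# then
      return 4;                --  11110xxx
   else
      raise Constraint_Error;  --  invalid leading byte
   end if;
end Sequence_Length;
```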
Honestly, what we need is a “trust me, bro” facility applied to this, which Ada refuses to provide. What’s really being highlighted here is that the expectation of protecting the coder from themselves creates an inherent conflict when standards require it. These types of standards are created sparingly (so in this case, such a keyword would only really be useful for UTF-8). The ARG issue conversation highlights that the reason this doesn’t already exist in Ada is largely that they want the compiler (GNAT, inevitably) to check that the values are valid UTF-8 values (instead of requesting impossible code pages or leaving dangling code pages without entries), and they know that’s a lot of work. Instead of just accepting that maybe that’s not their job, since it would prevent any expansion in the future (which we all like to say wouldn’t happen, but supposedly we wouldn’t need more than 1 MB of RAM either), and since it’s really not their job.
No. Unchecked conversion takes the representation as-is.
Converting a Unicode wide literal to a UTF-8 encoded String is a proper conversion. You can have that (with -gnatW8):
with Strings_Edit.UTF8.Handling; use Strings_Edit.UTF8.Handling;
S : String := To_UTF8 (Wide_String'("Привет"));
or, shorter than in most languages:
function "+" (S : Wide_String) return String renames To_UTF8;
S : String := +"Привет";
We already have a UTF-8 string: Ada.Strings.UTF_Encoding.UTF_8_String. I use that regularly — as I use the string type most suitable to the problem at hand. There is also an Ada.Strings.UTF_Encoding.UTF_String and an Ada.Strings.UTF_Encoding.UTF_16_Wide_String.
Before sending or saving them I convert them back to UTF-8. And at that point you can add a BOM (also part of Ada.Strings.UTF_Encoding), a byte order mark, so the encoding is clear. But GNAT’s implementation of Ada.Text_IO does most of that anyway.
We are probably fine with 32-bit Wide_Wide_Characters until the extraterrestrials arrive and we have to incorporate all the languages of the Local Group.
That’s for compilation only. When your program is running you can load any text you like and then use Ada.Strings.UTF_Encoding.Conversions to convert it into any format that is convenient for you.
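For example, bytes that arrived as UTF-8 can be recoded to UTF-16 for an API that wants wide characters, using nothing but the standard child package. A sketch (the object names are made up):

```ada
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions; use Ada.Strings.UTF_Encoding.Conversions;

procedure Recode is
   --  Bytes as they might have been read from a file or socket.
   Raw : constant UTF_8_String := "Hallo";

   --  Recode to UTF-16 for an API expecting wide characters...
   W16 : constant UTF_16_Wide_String := Convert (Raw);

   --  ...and back to UTF-8 before writing it out again.
   U8  : constant UTF_8_String := Convert (W16);
begin
   null;
end Recode;
```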
Not quite. UTF_8_String is nothing but a “renaming” of String. See ARM A.4.11 (6/3):
subtype UTF_8_String is String;
Unfortunately, the standard library lacks the vital operations of scanning a UTF-8 string forward and backward, the only operations one actually needs when dealing with UTF-8.
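Such scanning is short enough to write by hand. A sketch of the two missing primitives, assuming the input is already valid UTF-8 (no validation; the names `Next` and `Previous` are made up):

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

package UTF8_Scan is
   --  Advance Index past the code point starting at S (Index).
   procedure Next (S : UTF_8_String; Index : in out Positive);
   --  Move Index back to the start of the previous code point.
   procedure Previous (S : UTF_8_String; Index : in out Positive);
end UTF8_Scan;

package body UTF8_Scan is
   procedure Next (S : UTF_8_String; Index : in out Positive) is
      B : constant Natural := Character'Pos (S (Index));
   begin
      if    B < 16#80# then Index := Index + 1;
      elsif B < 16#E0# then Index := Index + 2;
      elsif B < 16#F0# then Index := Index + 3;
      else                  Index := Index + 4;
      end if;
   end Next;

   procedure Previous (S : UTF_8_String; Index : in out Positive) is
   begin
      loop
         Index := Index - 1;
         --  Continuation bytes are in 16#80# .. 16#BF#; skip them.
         exit when Character'Pos (S (Index)) not in 16#80# .. 16#BF#;
      end loop;
   end Previous;
end UTF8_Scan;
```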