UTF-8, Identifiers, and Dealing with Strings in Ada

I also had a lot of problems with string handling in Ada during AoC. I guess I am just not used to “the Ada way” of parsing and filtering.

I can only agree with what you wrote about embedded! Going over the SVD-generated files, it becomes clear how Ada's design was made to be used for embedded.

There is no “Ada way of parsing.” There are inexplicable ways people tend to do parsing, in the most artificial and complicated manner. It must absolutely include multiple stages of meaningless actions, producing intermediate results of no use, and of course it is never done in a direct imperative way: get a thing, advance, repeat. No, it must be some intricate combination of functions, scanners, filters, and tokenizers, with parameters tuned in some magical way to produce results. When Ada offers a straightforward, elementary approach, people simply fail to grasp the idea. That must be “the Ada way!” Though the same approach worked perfectly in C before Ada. In C, which does not even have strings! Parsing without handling strings…

It is not the Ada way; it is just a sane way.

Dealing with strings in Ada is simple as long as one remembers that strings are arrays and that Ada supports array slicing.
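For example, here is a minimal sketch (the names are invented) of pulling a key/value pair apart with slices:

with Ada.Text_IO;       use Ada.Text_IO;
with Ada.Strings.Fixed; use Ada.Strings.Fixed;

procedure Slice_Demo is
   Line : constant String := "key=value";
   Eq   : constant Natural := Index (Line, "=");  -- position of '='
begin
   Put_Line (Line (Line'First .. Eq - 1));  -- prints "key"
   Put_Line (Line (Eq + 1 .. Line'Last));   -- prints "value"
end Slice_Demo;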


Well, not quite. The fact that Ada offers so many kinds of strings can be confusing. Usually you only need fixed strings and unbounded strings, but having to convert between these two fairly often is something one has to get used to.

The point about slicing is important. That is one of the reasons why unbounded strings are not needed: you never copy anything when parsing; you can always pass a source slice to a subprogram. Ada slicing keeps indices invariant, which further simplifies parsing, because you do not need to shift indices back and forth; you can pass them in and out together with the slices as-is.
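A minimal sketch of that style (names invented): the slice handed to the subprogram keeps the indices of the enclosing string, so a position found inside the slice is directly valid in the whole source.

with Ada.Text_IO; use Ada.Text_IO;

procedure Slices_Keep_Indices is
   --  Returns the index of the first non-blank character of Source.
   --  A slice keeps the indices of the enclosing string, so the
   --  result is valid in the whole source, whatever slice is passed.
   function Skip_Blanks (Source : String) return Positive is
   begin
      for I in Source'Range loop
         if Source (I) /= ' ' then
            return I;
         end if;
      end loop;
      return Source'Last + 1;
   end Skip_Blanks;

   Text    : constant String := "count:   42";
   Pointer : Positive := 7;  -- just past "count:"
begin
   Pointer := Skip_Blanks (Text (Pointer .. Text'Last));
   Put_Line (Text (Pointer .. Text'Last));  -- prints "42"
end Slices_Keep_Indices;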

Speaking of strings, I’m told Ada can’t handle UTF-8 without first holding it as Wide_Wide_String and converting to one of the other string types. I would have imagined that would be simple, and I’m surprised that, while I can use something like 猫 as an identifier in the language itself, string literals aren’t so simple. If I have a program with a lot of instructions for the user, that can add up over time.

You were lied to. On the contrary, you should never use Wide_Wide_String; forget it exists. Historically, Ada predates Unicode. When Unicode came, a quick and dirty decision was made and Wide_String (UCS-2) was introduced. BTW, Microsoft made the same mistake in Windows. Then there was no way back, and so Wide_Wide_String (UCS-4) came. Microsoft cared less and, instead of keeping it compatible, simply declared: oops, it is UTF-16, not UCS-2. Why they did not simply go all the way back to UTF-8 is beyond me. But we can! Now simply ignore all that mess and consider Ada String UTF-8.

All text processing algorithms work in UTF-8 without changes. For conversions to legacy code pages, code points, sets, maps, and normalization, see String processing for Ada.
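For instance, plain byte-wise search already does the right thing on UTF-8. A sketch (it assumes the compiler passes the literal's UTF-8 bytes through unchanged, as GNAT's default Latin-1 source mode does):

with Ada.Text_IO;       use Ada.Text_IO;
with Ada.Strings.Fixed; use Ada.Strings.Fixed;

procedure UTF8_Search is
   --  Both strings are plain String values holding UTF-8 bytes.
   Text : constant String := "Paul Erdős, 1913";
   Pos  : constant Natural := Index (Text, "Erdős");
begin
   --  Byte-wise substring search is correct on UTF-8: a multi-byte
   --  sequence can never match starting in the middle of another one.
   Put_Line (Text (Pos .. Text'Last));  -- prints "Erdős, 1913"
end UTF8_Search;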

Keep your program source ASCII-7. UTF-8 literals are no problem with the above; however, they are not a good idea if you intend localization.


Hmmm, there is at least one (somewhat recent, a year ago?) solution on Rosetta Code which seemingly works great if the assumption is that all content is 7-bit ASCII. The output of this program looks like some non-Earth-like language when it hits anything related to “Erdős” or the like. So, was the author just missing a GNAT flag for UTF-8? Or didn’t document the need for it? The author of the Rosetta Code task in question did not use your Simple Components.

RBE

I never understood the reasons the GNAT flag for UTF-8 was introduced, and especially the inconsistent way it was implemented (at the linking phase).

As for I/O: surely your program must deal with it, but this has nothing to do with the program source. Normally Text_IO happily ignores encodings unless you mess with the flags. And you can always use stream I/O, which passes text through as-is.
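A minimal sketch of that pass-through style, which copies UTF-8 input to output without decoding anything:

with Ada.Text_IO; use Ada.Text_IO;

procedure Pass_Through is
begin
   --  Whatever bytes come in (UTF-8 included) go out unchanged;
   --  no decoding or encoding happens anywhere.
   while not End_Of_File loop
      Put_Line (Get_Line);
   end loop;
end Pass_Through;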

Under Windows you need to set the console to the UTF-8 code page if you want to read UTF-8. Under Linux it works out of the box.

Note that Wide_Wide_String I/O works nowhere: not with the console, utilities, DBs, or GUIs.

If I put UTF-8 in string literals, it complains about incompatible characters.

[kohlrak@pizzabox meow]$ alr build
ⓘ Building meow=0.1.0-dev/meow.gpr...
Compile
   [Ada]          meow.adb
meow.adb:3:27: error: literal out of range of type Standard.Character

   compilation of meow.adb failed

gprbuild: *** compilation phase failed
error: Command ["gprbuild", "-s", "-j0", "-p", "-P", "/home/kohlrak/projects/ADA/meow/meow.gpr"] exited with code 4
error: Compilation failed.
[kohlrak@pizzabox meow]$ cat src/meow.adb 
procedure Meow is
        木 : String := "Tree";
        Tree : String := "木";
begin
        null;
end Meow;

Note that the line it errors on is 3, which is the string that contains 木 as UTF-8 text; the -gnatW8 flag is enabled, since it is fine with line 2, where the UTF-8 is used in the identifier.

The solution given to me thus far was to use an encoding function, but supposedly the call cannot be “optimized away,” and supposedly it will inevitably, at run time, even with -Os, take a Wide_Wide_String and THEN store it as UTF-8. Part of me thinks I’m being lied to, but experience tells me that since it would be an external function from the Ada library (thus linked code), it will probably do exactly that.

procedure Meow is
  Tree_English : constant String := "Tree";
  --  The three UTF-8 bytes of 木 (U+6728), built without a non-ASCII literal
  Tree_Dunno   : constant String := Character'Val (16#E6#) & Character'Val (16#9C#) & Character'Val (16#A8#);
begin
   null;
end Meow;
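For reference, the encoding-function approach referred to above presumably means the standard Ada.Strings.UTF_Encoding. A sketch (the Wide_Wide_String literal still needs the UTF-8 source flag, and the conversion does run at elaboration time rather than compile time):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Meow_Encoded is
   use Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   --  The literal is a Wide_Wide_String; Encode turns it into a
   --  UTF-8 String (UTF_8_String is a subtype of String).
   Tree : constant String := Encode ("木");
begin
   null;
end Meow_Encoded;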

I didn’t write the Rosetta Code task I mentioned. When I run a similar Rosetta Code task written in Lua and in Crystal (using the exact same terminal, same settings, same OS), the output comes out fine for all characters in the I/O, not just the 7-bit ASCII. So it is something with the compiler or with the code or with both that fails the mission.

RBE

UTF-8 is an encoding of Unicode code points. The general rule for encodings is that you decode them on input and encode on output. So you should decode UTF-8 input into code points. You can choose how to represent code points; for example, as a modular type (excellent for Caesar ciphers). If you want to represent them with a character type, then the language provides Wide_Wide_Character for that.
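A sketch of that decode/process/encode round trip, using the standard Ada.Strings.UTF_Encoding and a toy Caesar-style shift (illustration only; no overflow handling):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

--  Decode on input, work on code points, encode on output.
function Shift (Input : String; By : Natural) return String is
   Points : Wide_Wide_String := Decode (Input);  -- UTF-8 -> code points
begin
   for C of Points loop  -- a toy Caesar shift over code points
      C := Wide_Wide_Character'Val (Wide_Wide_Character'Pos (C) + By);
   end loop;
   return Encode (Points);  -- code points -> UTF-8
end Shift;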

(Note that UTF-8 uses 4 bytes to encode some code points. All code points can be represented in 21 bits, and it’s possible to create a variable-length encoding that never uses more than 3 bytes to encode a code point; see Universal Text File for an example. I’ve never understood this choice for UTF-8.)

Did I say you should output ASCII-7? You should use only ASCII-7 in the program source. See the difference?

As for I/O, you do not need to decode or encode anything. Ignore wrong advice. Read UTF-8 as-is. Process it as String. Write the result out. Since you stay UTF-8, it will be UTF-8. End of story.

I mean, sure, that works, but that’s absolutely not readable at that point.

UTF-8 has a major advantage over wide-wide and wide: if you mix texts, or the code points don’t require as much space, you use less space. Another major advantage is the same thing when the program is language-agnostic (meaning it’s translated and can display in Chinese, Japanese, or English).

Sorry, but to me 木 is absolutely unreadable. Is it a navigation sign?

One point of having ASCII-7 is not to let obscure character sets into the source. As for literals, you do not need literals in the source. Localization is done in a different way.

I believe that the author used ONLY 7-bit ASCII in the source code of the Rosetta Code task.

RBE

I agree to an extent, but I think the type system of Ada makes it less likely to be a target. Largely because Ada emphasizes types, whereas on Arduino you’re probably not going to be doing much with types, and if you do, you’re going to be trying to avoid type checking, because you’re going to be trying to implicitly extend things.

I have to disagree. Actually using the Ada type system simplifies HW access a lot compared to C. You’re right that for low-power embedded systems it makes sense to get rid of unnecessary type checks, but this is exactly what Ada can help with if you use proper restricted types. Then even the compiler knows where checks are required and where it is safe by design.
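A sketch of what that looks like (the register layout and address here are entirely made up): a representation clause pins each field to its exact bits, and the compiler then knows which values are even representable.

with System.Storage_Elements;

package Hypothetical_UART is
   --  A made-up peripheral register, only to illustrate the technique.
   type Divisor_Type is range 0 .. 2**12 - 1;

   type Control is record
      Enable  : Boolean;
      Error   : Boolean;
      Divisor : Divisor_Type;
   end record;

   --  Pin every field to its exact bits in the 16-bit register.
   for Control use record
      Enable  at 0 range 0 .. 0;
      Error   at 0 range 1 .. 1;
      Divisor at 0 range 4 .. 15;
   end record;
   for Control'Size use 16;

   Reg : Control with
     Volatile,
     Address => System.Storage_Elements.To_Address (16#4001_3800#);
end Hypothetical_UART;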

One, you need to make it really easy to get into, which means you’re going to want a very simple library setup pretty quickly. And you’re going to want to make it very configurable for special new hardware and shields, as well as probably design something so it can be easily ported to STM32 and the like, because a lot of people who do Arduino will often upgrade to an STM32 or another microcontroller.

This is exactly what Alire and e.g. the HAL crate provide. Compared with Arduino and PlatformIO, Ada was much simpler to set up for a new target. And especially for STM32 there is already a lot of support available.

What is missing right now (IMHO) is a coding guideline for simple embedded designs to follow, e.g. a commitment to a style of error handling (exceptions vs. return codes) and similar things.

This might not be suitable for high-integrity production systems, where you would not like to rely on a myriad of community crates, but it helps to get developers onboarded and familiar with the language.

Not everyone codes exclusively in English. And sure, localization can be done with external tools and string tables, but you also have people who code for only one language, and that might not be English. You don’t find it entirely counter-intuitive that identifiers can be like this but not actual strings? So someone who is coding for a CJK audience is permitted to code in their native language so long as they don’t document in that language?

Perhaps, but Ada seems to emphasize using types to solve problems. It’s not the intuitive choice that jumps out at you.

That’s fair. I haven’t looked through what crates are available, but I don’t know about it, and that’s key. AdaCore has a lengthy book on the topic and, in the intro at least, it doesn’t mention this crate. If we’re talking about marketing and the reach of Ada, you have to understand it’s more than just what’s available. Most people are going to look at what is presented to them first on any given topic, not spend hours finding out whether something is there. They want to know it’s worth spending that time before actually spending it. And let’s face it, what people see first is going to be AdaCore’s stuff. The plus side is, they have every intention of being an advertiser of Ada, so this is semi-easy to address.

You can use Unicode literals in strings, but not with the String type; see:
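For example, a minimal sketch (with GNAT the source needs the UTF-8 wide-character flag, -gnatW8, for the literal to be read correctly):

procedure WW_Literal is
   --  The same literal that is rejected for String above is legal
   --  for Wide_Wide_String, where every character is a code point.
   Tree : constant Wide_Wide_String := "木";
begin
   null;
end WW_Literal;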

@dmitry-kazakov I get your point about simply using String for I/O, so we don’t need to encode/decode, but how do you prevent splitting a code point after the first byte with String?