UTF-8, Identifiers and dealing with Strings in Ada

Because it is highly non-portable.

No idea what you mean. But a search is not needed:

   R : Integer := 0;
   J : Integer := I;

   --  Accumulate decimal digits starting at S (I)
   while J <= S'Last loop
      case S (J) is
         when '0'..'9' =>
            R := R * 10 + (Character'Pos (S (J)) - Character'Pos ('0'));
            J := J + 1;
         when others =>
            exit;
      end case;
   end loop;
   if I = J then  --  No digit was consumed
      raise End_Error with "no number";
   end if;
   I := J;  --  Advance past the number

Exactly, you can do it in a right and a wrong way. What puzzles me is that most people consistently do it the wrong way. It seems that the idea of imperative programming has become incomprehensible.

separate (Parsers.Generic_Ada_Parser) 
   procedure Get_Identifier
             (  Code     : in out Lexers.Lexer_Source_Type;
                Line     : String;
                Pointer  : Integer;
                Argument : out Tokens.Argument_Token
             )  is
   Index     : Integer := Pointer;
   Start     : Integer;
   Malformed : Boolean := False;
   Underline : Boolean := False;
   Symbol    : UTF8_Code_Point;
begin
   while Index <= Line'Last loop
      Start := Index;
      Get (Line, Index, Symbol);
      case Category (Symbol) is
         when Mn | Mc | Nd | Cf | Letter | Nl =>
            Underline := False;
         when Pc =>
            Malformed := Malformed or Underline;
            Underline := True;
         when others =>
            Index := Start;
            exit;
      end case;
   end loop;
   Malformed := Malformed or Underline;
   Set_Pointer (Code, Index);
   Argument.Location := Link (Code);
   Argument.Value := new Identifier (Index - Pointer);
   declare
      This : Identifier renames Identifier (Argument.Value.all);
   begin
      This.Malformed := Malformed;
      This.Value     := Line (Pointer..Index - 1);
   end;
exception
   when Data_Error =>
      Set_Pointer (Code, Index);
      Set_Pointer (Code, Index);
      Raise_Exception
      (  Parsers.Syntax_Error'Identity,
         Encoding_Error & Image (Link (Code))
      );
end Get_Identifier;

It works with all sources: Text_IO, String, Stream etc. Dynamic allocation is done in an arena pool. No heap is involved. It is called from the parser after recognizing an Ada identifier start as defined in the UTF8.Categorization package.

You do not need it in UTF-8 either. Not to mention that in UTF-8 you don’t encode/decode anything. When did you see a RADIX-50 source last time?

This is a fascinating discussion. Unfortunately, I still don’t know how to modify an existing Rosetta Code task written by someone else in Ada which fails to properly deal with input data that is not 7-bit ASCII, or what mechanism is needed to compile it properly. When this Ada program encounters input that is not 7-bit ASCII, it outputs ugly stuff. It is not my terminal or my environment that is the problem, as the task solutions written in other languages (Crystal, Arturo, Raku, etc.) work just fine. So it is either a coding fault or an incorrect compiler/linker setting or both. I deliberately did not specify which task this is nor who the author is in order not to embarrass; I am impressed by the author’s work. However, it does not properly handle characters outside of 7-bit ASCII. Now I notice that the author of that unspecified task has joined in the discussion.

RBE

And which is that task, so we can analyze it?

Yours! Rosetta Code/Find unimplemented tasks - Rosetta Code

Well, I thought it was yours…I’m confused now.

RBE

Finally figured it out

Ok, so it must be something on my end, but I don’t know what. I’m on EndeavourOS (Arch based).

Vim defaults to UTF-8 on a UTF-8 system.

00000000: 7072 6167 6d61 2057 6964 655f 4368 6172  pragma Wide_Char
00000010: 6163 7465 725f 456e 636f 6469 6e67 2028  acter_Encoding (
00000020: 2055 5446 3820 293b 0a0a 7769 7468 2041   UTF8 );..with A
00000030: 6461 2e54 4558 545f 494f 3b0a 0a70 726f  da.TEXT_IO;..pro
00000040: 6365 6475 7265 204d 656f 7720 6973 0a62  cedure Meow is.b
00000050: 6567 696e 0a09 4164 612e 5445 5854 5f49  egin..Ada.TEXT_I
00000060: 4f2e 7075 745f 6c69 6e65 2822 e381 93e3  O.put_line("....
00000070: 8293 e381 abe3 81a1 e381 af22 293b 0a65  ...........");.e
00000080: 6e64 204d 656f 773b 0a                   nd Meow;.

No BOM that I can see. I tried gcc-ada as well. I’m starting to wonder if it’s because my system is UTF-8. So the question becomes “why this behavior if it detects UTF-8?” This is definitely starting to look like an issue on my end, but I have no idea where to look.

By this logic no program should be allowed to use English either, right? Let’s be absolutely clear here.

To clarify: I can see the high probability of malicious intent if one were to use a character that looks identical to a 7-bit ASCII character (but is not) when it is part of a predominantly 7-bit ASCII string in a URL. This is spoofing/phishing designed to trap unsuspecting people into going to a fake web site that looks authentic. So I understand (some of) the risks involved when using predominantly 7-bit ASCII in an English setting even if it is not a web site. So, not to confuse myself, I’ll refrain from using identifiers in Ada (other languages as well) that are not 100% 7-bit ASCII. However, the author of the Rosetta Code task (and myself) have not used characters outside of 7-bit ASCII in the source code, even in strings or comments.

RBE

Honestly, at this point I think it’s worth mentioning that this is enough of a problem in every language that there should be a tool that identifies when a particular character is used in one file and a similar-looking one is used in another file, or when they’re both used in the same file. This would be useful even outside of security contexts, for languages like Japanese where it’s not unusual for half-width, full-width and regular Latin characters to end up all mixed together by switching IMEs. This problem extends well beyond Ada, so I don’t think this should be considered an Ada problem for Ada to solve.
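As a rough illustration of what such a check could look like, here is a minimal sketch in Ada. It only handles one case (ASCII characters mixed with their fullwidth counterparts, U+FF01 .. U+FF5E) and assumes the input is valid UTF-8. The names Mixed_Width_Check and Mixes_Widths are made up for this example; a real tool would need the Unicode confusables data.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Mixed_Width_Check is
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Hypothetical helper: True when Line (UTF-8 encoded) contains both
   --  printable ASCII characters and fullwidth forms (U+FF01 .. U+FF5E).
   function Mixes_Widths (Line : String) return Boolean is
      Has_ASCII : Boolean := False;
      Has_Full  : Boolean := False;
      Code      : Natural;
   begin
      for C of UTF.Decode (Line) loop
         Code := Wide_Wide_Character'Pos (C);
         if Code in 16#21# .. 16#7E# then
            Has_ASCII := True;
         elsif Code in 16#FF01# .. 16#FF5E# then
            Has_Full := True;
         end if;
      end loop;
      return Has_ASCII and Has_Full;
   end Mixes_Widths;

   --  'A' followed by U+FF21 (fullwidth A), whose UTF-8 bytes are
   --  EF BC A1.  Written as byte values so this source stays ASCII.
   Sample : constant String :=
     'A' & Character'Val (16#EF#)
         & Character'Val (16#BC#)
         & Character'Val (16#A1#);
begin
   Ada.Text_IO.Put_Line (Boolean'Image (Mixes_Widths (Sample)));
   Ada.Text_IO.Put_Line (Boolean'Image (Mixes_Widths ("plain ASCII")));
end Mixed_Width_Check;
```

This should print TRUE for the mixed sample and FALSE for the plain ASCII string. Decode raises Encoding_Error on malformed UTF-8, which a real tool would want to catch and report.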

I don’t know how accurate ChatGPT is on this, or whether it’s hallucinating, but according to it the Ada Reference Manual supposedly allows certain special extensions on platforms that strongly demand them. Apparently one such example is macOS, which requires UTF-8; since Windows and Linux do not require it, even if they frequently default to it, macOS is the only platform to use this exception. If this is the case and ChatGPT is right, this problem is a whole lot bigger than what I started with. It would imply that code written and compiled on a Mac is not guaranteed to work on other platforms where one would expect it to.

I’m not sure what you mean by “take” here, but if you mean “read”, I’m not aware of any way to read UTF-8 as anything other than UTF-8. Converting it to Unicode code points is another step.

Interesting. If I add pragma Wide_Character_Encoding ( UTF8 );, then I get the same error. Note that in my example there was no pragma present. Also, if -gnatW8 is present or when I add a BOM, I get the error message. There is also -gnatiw, but it does not seem to make a difference.

See also Character Set Control - GNAT User's Guide and UTF-8 encoding in GNAT | ada-lang.io, an Ada community site and the recent discussion about Unicode strings

So if GNAT is not aware that the input is UTF-8, and therefore also not aware that there is such a thing as code points, it will treat UTF-8 strings “literally”, using as many bytes for the string as the source file has. When you tell GNAT to “detect” UTF-8 encoded code points, it will refuse to store UTF-8 code points in a String. If you instead use:

pragma Wide_Character_Encoding ( UTF8 );
with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;
procedure Main is
begin
  Put_Line("こんにちは");
end Main;

It compiles and outputs UTF-8 encoded strings (not UTF-16, as you might expect)!

./main | hexdump
0000000 81e3 e393 9382 81e3 e3ab a181 81e3 0aaf

It is about ASCII-7 sources not about any particular natural language.

I meant placing directly into the source as opposed to reading from file or stream.

My phone won’t let me directly highlight all your text to hit quote, so I have to post separately. I’m still at work, so I’ll have to check how it works on my system when I get home. I, too, would expect that to lead to 2-byte characters. If you mix in Latin characters, does it add nulls to align?

That has nothing to do with Ada. It is AWS. I do not use AWS and so I do not know how it handles HTML/XML content. But I am sure it can handle Unicode.

You are probably using the wrong calls.

P.S. When nothing else helps, read the documentation!

Character and string literals in Ada are typed, and the compiler determines the type from context. Ada.Text_IO.Put_Line is defined as

procedure Put_Line (Item : in  String);

(ARM A.10.1), so the compiler expects a literal of type String. Type String is defined as

type String is array (Positive range <>) of Character;

(ARM 3.6.3), and type Character is Latin-1. So the compiler is expecting a value of type String, which this string literal is clearly not.

Note the different uses of “string”:

  • string: a general concept
  • string type: a language concept
  • type String: a predefined string type

These are similar to the uses of “integer”, “integer type”, and “type Integer”.
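A small sketch of the "typed literal" point, using only ASCII so that it compiles the same way whatever the source encoding settings are. The procedure name Literal_Types is just for illustration.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Text_IO;

procedure Literal_Types is
begin
   --  Same characters, two different types: the parameter profile of
   --  each Put_Line decides how the literal is resolved.
   Ada.Text_IO.Put_Line ("abc");            --  type String
   Ada.Wide_Wide_Text_IO.Put_Line ("abc");  --  type Wide_Wide_String

   --  Type Character has 256 values, positions 0 .. 255 (Latin-1).
   Ada.Text_IO.Put_Line
     (Integer'Image (Character'Pos (Character'Last)));
end Literal_Types;
```

The last line prints 255, which is why a code point such as U+3053 has no place in a literal of type String.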

BUT …

If your editor saves your code as UTF-8, then the sequence of bytes in your file will be 16#22# (‘“‘), followed by the bytes of the UTF-8 encoding of your string, followed by 16#22# (‘“‘). Your compiler, if it’s not decoding the UTF-8, might interpret that as a string literal containing the Latin-1 characters corresponding to those bytes, and accept the string literal. Your program would then output those Latin-1 characters, but of course what is actually output is the representation of those characters, which is a sequence of bytes. If your output device expects UTF-8, it will decode those bytes and show the corresponding code points.
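To make that concrete, here is a minimal sketch that builds the same kind of value by hand. It assumes GNAT with its default settings (no -gnatW8 and no Wide_Character_Encoding pragma); as far as I know, with -gnatW8 on the main unit Text_IO may re-encode upper-half characters on output, which would change the result. The name Raw_Bytes is made up for the example.

```ada
with Ada.Text_IO;

procedure Raw_Bytes is
   --  E3 81 93 is the UTF-8 encoding of U+3053.  As far as type String
   --  is concerned these are simply three Latin-1 characters.
   S : constant String :=
     Character'Val (16#E3#)
     & Character'Val (16#81#)
     & Character'Val (16#93#);
begin
   Ada.Text_IO.Put_Line (Integer'Image (S'Length));  --  3, not 1
   Ada.Text_IO.Put_Line (S);
end Raw_Bytes;
```

Under those assumptions the second line emits the three bytes unchanged, and a UTF-8 terminal displays them as a single character even though the program only ever saw three values of type Character.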

How you get a compiler to treat the UTF-8 source code as Latin-1 is compiler dependent. Your compiler appears to be decoding the UTF-8, resulting in an invalid literal for type String; the compiler used by ThyMYthOS appears not to. You will either need to tell your compiler to interpret the source as Latin-1, convert your string literal so that UTF-8 decoding leaves it as Latin-1, or call a subprogram that uses a different string type.

(Kazakov’s approach of restricting source code to ASCII results in it being interpreted the same regardless of whether or not your compiler decodes UTF-8. This enhances portability, but makes the code more difficult to understand.)
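One way to combine ASCII-only source with a different string type is sketched below: the text is written as code points in a Wide_Wide_String and encoded to UTF-8 with the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings. The procedure name Hello_UTF8 is illustrative, and the output step again assumes a build without -gnatW8 so that Text_IO passes the bytes through unchanged.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Hello_UTF8 is
   --  U+3053 U+3093 U+306B U+3061 U+306F, written as code points so
   --  the source file contains nothing but ASCII.
   Greeting : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#3053#),
      Wide_Wide_Character'Val (16#3093#),
      Wide_Wide_Character'Val (16#306B#),
      Wide_Wide_Character'Val (16#3061#),
      Wide_Wide_Character'Val (16#306F#));

   --  UTF_8_String is a subtype of String holding the encoded bytes.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Greeting);
begin
   Ada.Text_IO.Put_Line (Encoded);
end Hello_UTF8;
```

The readability cost mentioned above is obvious here: the greeting is no longer visible in the source, only its code points.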

From a purely ARM point of view, which says nothing about how code is stored, the error is using a non-Latin-1 character in a string literal that will be interpreted as being of type String. Even if your compiler will accept this, it is still an error.

As I pointed out in this post, you are successfully putting UTF-8 in your source and your compiler is decoding it to Unicode. Your error is expecting this to be a valid literal of type String.

@Kohlrak’s initial claim was that the compiler should indeed allow storing UTF-8 encoded strings in type String, which it does when you don’t tell the compiler that the source is UTF-8. Now your point is that String does not allow storing Unicode code points (which is obvious, as the Characters in a String only have 8 bits), but the type UTF_8_String also does not allow assignment of string literals with UTF-8 multi-byte sequences when -gnatW8 is used. So telling the compiler that we’re working with UTF-8 encoded files actually has the adverse effect of making compact UTF-8 string literals in source code harder to work with.

@Kohlrak no zeros are inserted (why would they be?):

pragma Wide_Character_Encoding ( UTF8 );
with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;
procedure Main is
  S : Wide_String := "こ|ん|に|ち|は|☃|";
begin
  Put_Line(S);
end Main;

Results in:

./main|hexdump -C
00000000  e3 81 93 7c e3 82 93 7c  e3 81 ab 7c e3 81 a1 7c  |...|...|...|...||
00000010  e3 81 af 7c e2 98 83 7c  0a                       |...|...|.|

See

about wide characters encoding.

Read this first. It will help you better understand what to expect when dealing with Wide_Wide_Strings and the GNAT hackery around them.