UTF-8, Identifiers and dealing with Strings in Ada

What about this:

    реа : constant := 1;
    реa : constant := 2;
    рeа : constant := 3;

This is legal Ada (post-95) code. Can you guess what is wrong/right here? :grinning:

If you are storing a message to output in encoded form, then it makes sense to store it in encoded form. The decode-on-input/encode-on-output rule is for encoded data that is input and processed to produce encoded output.

I like the example

Αnd : Boolean;
Аnd : Boolean;
...
if Αnd And Аnd then

This is clearly poor language design.


You never get this. UTF-8 was designed so that even a search for a delimiter character can never land you inside a code point. E.g. Ada.Strings.Fixed.Index works correctly. When you compare/skip/move UTF-8 substrings, the string index always stays at a code point start.
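To make that concrete, here is a minimal sketch (procedure and variable names are mine, not from the thread) of Ada.Strings.Fixed.Index finding an ASCII delimiter in a UTF-8 encoded String. Because UTF-8 lead bytes and continuation bytes occupy disjoint ranges, a valid pattern can only match at a code point boundary:

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Index_Demo is
   --  A UTF-8 encoded line stored in a plain String (this assumes the
   --  source file is saved as UTF-8 and compiled without -gnatW8).
   Line : constant String := "Width : Integer; -- 幅";
   --  Index returns the byte position of the first "--"; it cannot stop
   --  inside the multi-byte sequence at the end of the line.
   Dash : constant Natural := Ada.Strings.Fixed.Index (Line, "--");
begin
   Ada.Text_IO.Put_Line (Line (Line'First .. Dash - 1));
end Index_Demo;
```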

That’s the opposite of the real problem, though. You can pass -gnatW8 and get that result, but heaven forbid you write the content of a string that way.

I agree, but Ada doesn’t have a way to take UTF-8 in as UTF-8. It takes it in as Wide_Wide_String, from what I can tell.

All Ada I/O packages take UTF-8: Ada.Text_IO, Ada.Streams.Stream_IO, Ada.Text_IO.Text_Streams. More precisely, they simply do not interpret the content. Line formatting sequences are exactly the same in ASCII and UTF-8. So everything works.

You can try this:

with Text_IO; use Text_IO;
procedure Main is
begin
   Put (">");
   Put_Line (Get_Line);
end Main;

to see that all Unicode input gets smoothly in and out.

Under the Windows console you need to switch to UTF-8 first:

chcp 65001

On the contrary, Wide_Wide I/O is basically non-existent. Unless you write a UCS-4 file yourself, there are no other sources of it. The Ada Wide_Wide packages exist strictly for completeness. They have no use.

See, that’s the way I would expect, but you have a small error: if I put UTF-8 text in the string part, where one would actually expect to be able to use UTF-8, it says it’s out of range. But where one would not normally expect to be able to use UTF-8, that is to say naming a variable or the program itself, I can actually use it. That is to say, I can use Japanese in my code, as long as the resultant program doesn’t actually use Japanese as output, unless I import the string as a C string or manually drop the hexadecimal in as a character array or something to that extent. I can literally use こんにちは as the name of the entry point procedure, but I can’t directly print “こんにちは” as a hello world program. I’d have to do something obnoxiously complex if I were a Japanese person trying to learn this language. Whereas in other languages I would have to use English for any identifiers, but I could totally pass the UTF-8 text as a string literal to any output function.

Well, reading UTF-8 into a String retains the fact that it’s UTF-8 (the data in memory is still UTF-8 encoded). But your program has to be aware of that as soon as you start processing the data. So if you slice the String (@dmitry-kazakov that’s what I was talking about earlier), you could end up splitting a code point:

UTF_8_Text : String := Get_Line;
Process_First_Two_Chars (UTF_8_Text (UTF_8_Text'First .. UTF_8_Text'First + 2));

So String is not very efficient in terms of CPU cycles if you want to process the text by iterating over code points (e.g. getting the number of code points in the text), but it is certainly memory efficient. @Kohlrak, in general you cannot have both: CPU efficiency and memory efficiency.
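For what it’s worth, counting code points in a UTF-8 String is a single linear pass. A minimal sketch (names are mine) that counts every byte that is not a continuation byte:

```ada
with Ada.Text_IO;

procedure Count_Points is
   --  Count code points in a UTF-8 String: every byte outside the
   --  continuation-byte range 16#80# .. 16#BF# starts a code point.
   function Code_Points (S : String) return Natural is
      Count : Natural := 0;
   begin
      for C of S loop
         if Character'Pos (C) not in 16#80# .. 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Code_Points;

   --  "あZ" as explicit UTF-8 bytes (E3 81 82 5A), so the example does
   --  not depend on the encoding of the source file.
   Text : constant String :=
     Character'Val (16#E3#) & Character'Val (16#81#) &
     Character'Val (16#82#) & "Z";
begin
   Ada.Text_IO.Put_Line (Natural'Image (Code_Points (Text)));  --  prints " 2"
end Count_Points;
```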

That is correct, however what i would like to be able to do is the following:

with Ada.TEXT_IO;
procedure Main is 
begin
  Ada.TEXT_IO.put_line("こんにちは");
end Main;

Like, he totally has a point if I was doing string processing and trying to go character by character. However, if I’m either not processing strings at all (simply output), or if I’m willing to search for strings (say, if I was doing find and replace), it’s not more CPU hungry.

And indeed, memory efficiency and CPU efficiency used to be different things, but one of the issues today is cache. There’s such a waste of cache these days that they end up becoming the same thing. (However, it would take a lot of text to make an actual difference speed-wise.)

But even at the surface level, efficiency aside, compare that program above I wrote to the following and ask yourself which of the two should logically work first in any given language.

with Ada.TEXT_IO;
procedure Main is
こんにちは : String := "Hello";
begin
  Ada.TEXT_IO.put_line(こんにちは);
end Main;

I think the obvious answer is the first one, not this one.

I find this a bit perplexing, because I was recently processing a text file where the only way I managed to read a file w/UTF-8 data, process it, and output it correctly was to use Ada.Wide_Wide_Text_IO.

This was a month or two ago, maybe more. I couldn’t tell you why nothing else worked; I just remember it wasn’t working with Text_IO, so I began reading the RM and first tried Wide_Text_IO, then, when that didn’t work and I saw it mentioned in a StackOverflow discussion, Wide_Wide_Text_IO, at which point things worked just fine.

The point is that meaningful algorithms require no modifications. Meaningless ones like “taking two characters” do not count. The example you gave would not work in Wide_Wide_String either. The counter-example is: d◌̣◌̇d = 2 characters, 4 code points.
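A small sketch of that distinction (names are mine): even in a Wide_Wide_String, where each element is exactly one code point, 'Length does not give the number of user-perceived characters once combining marks are involved:

```ada
with Ada.Text_IO;

procedure Combining_Demo is
   --  One user-perceived character built from three code points:
   --  U+0064 (d), U+0323 (combining dot below), U+0307 (combining dot above).
   D_With_Dots : constant Wide_Wide_String :=
     ('d',
      Wide_Wide_Character'Val (16#0323#),
      Wide_Wide_Character'Val (16#0307#));
begin
   --  'Length counts code points, not characters:
   Ada.Text_IO.Put_Line (Natural'Image (D_With_Dots'Length));  --  prints " 3"
end Combining_Demo;
```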

Poor Unicode design.
Poor editor design.

I really wish that Unicode had adopted a more structured format, perhaps something like:

Type Language is ( De, En, Fr, … );
Type Text( Lang : Language; Length : Natural ) is record
   Data : String(1..Length) := (others => NUL_Equivalent);
end record;

Then you could define things by-language, having a clear separation.
(It would probably make it easier on implementers, too, as you could go about implementing/supporting Unicode by-language.)

The big difficulty when dealing with Unicode is that things are much more complex. Just look at how things change in this context when asking “what is the length of the string?” — there are at least three answers: the number of bytes, the number of glyphs, and the number of Unicode code points.
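The first two of those answers can be sketched in standard Ada (names are mine; Ada.Strings.UTF_Encoding is part of the standard library, while the glyph/grapheme count has no standard-library answer):

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Three_Lengths is
   use Ada.Strings.UTF_Encoding;
   --  "é" as explicit UTF-8 bytes (C3 A9), so the example does not
   --  depend on the encoding of the source file.
   E_Acute : constant UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);
   Bytes   : constant UTF_8_String := "caf" & E_Acute;
begin
   --  Byte count of the encoded form:
   Ada.Text_IO.Put_Line
     ("Bytes:" & Natural'Image (Bytes'Length));                    --  5
   --  Code point count after decoding to Wide_Wide_String:
   Ada.Text_IO.Put_Line
     ("Code points:" &
      Natural'Image (Wide_Wide_Strings.Decode (Bytes)'Length));    --  4
   --  The glyph (grapheme cluster) count would need Unicode
   --  segmentation rules, which the standard library does not provide.
end Three_Lengths;
```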

pragma Wide_Character_Encoding ( UTF8 );

@dmitry-kazakov is wrong in his assessment; Wide_Wide_Character’s use is to normalize (and simplify) Unicode processing: each Unicode code point corresponds to a single element of the string. — I use Wide_Wide_String in Byron both to reduce the complexity of the incoming program text (everything is converted to Wide_Wide_String and operated on from there), and to ensure correctness (I take the definitions in the LRM pretty directly: see here).


and you should not.

If you replace that with a concrete task: I want to extract an integer from the string at index I and then advance I to the first string index following the number. You will discover that it works perfectly in UTF-8 without any modification.
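Something like this hypothetical helper (names are mine) runs unchanged on UTF-8 input, because ASCII digit bytes never occur inside a multi-byte sequence:

```ada
with Ada.Text_IO;

procedure Scan_Number is
   --  Skip to the next run of ASCII digits starting at I, read it as an
   --  Integer, and advance I past it. Raises Constraint_Error if there
   --  are no digits at or after I.
   procedure Get_Integer
     (S : String; I : in out Positive; Value : out Integer)
   is
      First : Positive;
   begin
      while I <= S'Last and then S (I) not in '0' .. '9' loop
         I := I + 1;
      end loop;
      First := I;
      while I <= S'Last and then S (I) in '0' .. '9' loop
         I := I + 1;
      end loop;
      Value := Integer'Value (S (First .. I - 1));
   end Get_Integer;

   --  "¥" as explicit UTF-8 bytes (C2 A5), independent of source encoding.
   Yen  : constant String := Character'Val (16#C2#) & Character'Val (16#A5#);
   Text : constant String := Yen & "42, thanks";
   I    : Positive := Text'First;
   V    : Integer;
begin
   Get_Integer (Text, I, V);
   Ada.Text_IO.Put_Line (Integer'Image (V));  --  prints " 42"
end Scan_Number;
```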

No, you don’t. It is a meaningless task. It is funny that you talk about efficiency while using memory- and CPU-consuming things you would not even need in production code. But you can do that in UTF-8 without any modifications! UTF-8 was designed this way.

I wrote several compilers for domain-specific languages, Ada parsers, pattern matching, text-based protocols and never used Wide_Wide_String. You just do not need it.

Why do you claim that it does not work?

Using utf-8-test - Ada - OneCompiler and also:

cat > main.adb << EOF
with Ada.TEXT_IO;
procedure Main is 
begin
  Ada.TEXT_IO.put_line("こんにちは");
end Main;
EOF
gnatmake main.adb
./main

will output

gcc -c main.adb
gnatbind -x main.ali
gnatlink main.ali
こんにちは

(at least on my Mac).

That happens because Alire activates -gnatW8, but when using plain vanilla GNAT/gprbuild, you get what you expect.

Why Alire uses that switch? See Feedback on a possible Alire Unicode policy · alire-project/alire · Discussion #1334 · GitHub

Why GNAT works like it does regarding input/output when the `-gnatW8` switch is active, I don’t get. There should be a switch that only changes the interpretation of the source code as UTF-8 and allows UTF-8 identifiers, while keeping the non-interpretation of String when doing I/O (including storing a UTF-8 literal into a String).

Incorrect. Try it.

Why?

That is true, but that’s now in effect a search by string: you’re disregarding that which you’re not looking for, rather than going by character index. And this is by design, too: UTF-8 is more or less a drop-in replacement method to get code that wasn’t wide to basically work with wide text.

Depends on what you’re doing and how you’re doing it. Take something like PHP, for example: you’d want to be looking for “<?php”. I find it strange you would suggest this doesn’t happen in production code.

[kohlrak@kohlrak-gaming src]$ gnatmake meow.adb 
gcc -c meow.adb
meow.adb:7:31: error: literal out of range of type Standard.Character
gnatmake: "meow.adb" compilation error

It would seem that if it’s compiling for you, there’s a way to make it work that eludes me. It would seem our build environments are somehow different. I can confirm it’s NOT -gnatW8, because I’ve been passing that.

There’s something there, but it’s not this: I’ve even manually invoked gnat compile:

[kohlrak@kohlrak-gaming src]$ gnat -v compile meow.adb 
/home/kohlrak/.local/share/alire/toolchains/gnat_native_15.2.1_4640d4b3/bin/gnatmake -f -u -c meow.adb
gcc -c meow.adb
meow.adb:7:31: error: literal out of range of type Standard.Character
gnatmake: "meow.adb" compilation error

But as seen from ThyMYthOS, there is an option that allows it to compile.

I’m not saying you can’t do it in other methods. I’m saying that Wide_Wide_Character/Wide_Wide_String make things simpler — See how everything just flows for defining an identifier:

Pragma Ada_2012;
Pragma Assertion_Policy( Check );

With
Ada.Wide_Wide_Characters.Unicode;

-- The Pure_Types package both defines pure-types, such as identifier-strings,
-- which can be proven in SPARK.
Package Byron.Internals.SPARK.Pure_Types
with Pure, Elaborate_Body, SPARK_Mode => On is

--------------------------------------------------------------------------------
-- PACKAGE/SUBPROGRAM RENAMES                                                 --
--------------------------------------------------------------------------------

    Package WCU renames Ada.Wide_Wide_Characters.Unicode;
    Use Type WCU.Category;
    Function Get_Category (Input : Wide_Wide_Character) return WCU.Category
      renames WCU.Get_Category;

--------------------------------------------------------------------------------
-- INTERNAL STRINGS                                                           --
--------------------------------------------------------------------------------

    -- The String-type used within the internals of the compiler.
    Subtype Internal_String is Wide_Wide_String;

--------------------------------------------------------------------------------
-- IDENTIFIERS                                                                --
--------------------------------------------------------------------------------

    -- LRM 2.3 (3/2)
    --	identifier_start ::=	letter_uppercase
    --			|	letter_lowercase
    --			|	letter_titlecase
    --			|	letter_modifier
    --			|	letter_other
    --			|	number_letter
    Subtype Identifier_Start is WCU.Category
      with Static_Predicate => Identifier_Start in
        WCU.Lu | WCU.Ll | WCU.Lt | WCU.Lm | WCU.Lo | WCU.Nl;

    -- LRM 2.3 (3.1/3)
    --	identifier_extend ::=	mark_non_spacing
    --			|	mark_spacing_combining
    --			|	number_decimal
    --			|	punctuation_connector
    Subtype Identifier_Extend is WCU.Category
      with Static_Predicate => Identifier_Extend in
	WCU.Mn | WCU.Mc | WCU.Nd | WCU.PC;

    -- Intermediate subtype asserting the contents of the string are members of
    -- either Identifier_Start OR Identifier_Extend.
    Type Valid_ID_Chars is new Internal_String
      with Dynamic_Predicate =>
		(For All C of Valid_ID_Chars =>
		    Get_Category(C) in Identifier_Start or
		    Get_Category(C) in Identifier_Extend),
           Predicate_Failure => Raise Constraint_Error with
			"Invalid character in Identifier.";

    -- LRM 2.3 (2/2)
    --	identifier ::=	identifier_start {identifier_start | identifier_extend}
    Type Identifier is new Valid_ID_Chars
      with Dynamic_Predicate =>
	(Identifier'Length in Positive)
      and then (Identifier'First < Positive'Last)
      and then (Get_Category(Identifier(Identifier'First)) in Identifier_Start)
      and then
    -- LRM 2.3 (4/3)
    --	An identifier shall not contain two consecutive characters in category
    --	punctuation_connector, or end with a character in that category.
	((for all Index in Identifier'First..Identifier'Last-1 =>
	   (if Get_Category(Identifier(Index)) in WCU.Pc
               then Get_Category(Identifier(1+Index))   not in WCU.Pc))
          and Get_Category(Identifier(Identifier'Last)) not in WCU.Pc
         ),
      Predicate_Failure => Raise Constraint_Error with "Invalid Identifier.";

Private


End Byron.Internals.SPARK.Pure_Types;

Everything is defined according to the LRM, and SPARK can prove it (granted, I had to make an intermediate type for the prover to be able to prove it). Also, when starting the processing, since Wide_Wide_Character/Wide_Wide_String are [capable of being] supersets of all the ASCII/Unicode inputs, normalizing (on input) to Wide_Wide_String vastly simplifies the downstream processing: I neither have to keep state, nor require specialized processing for handling multi-unit code points, because all the complexity is frontloaded into discovering the type of text the input file is and converting that to Wide_Wide_String:

Pragma Ada_2012;
Pragma Assertion_Policy( Check );

with
System,
Interfaces,
Unchecked_Conversion,
Ada.Strings.Fixed,
Ada.Strings.Wide_Wide_Fixed,
Ada.Strings.Unbounded,
Ada.Strings.UTF_Encoding.Conversions,
Ada.Strings.UTF_Encoding.Wide_Wide_Strings,
Ada.Characters.Conversions,
ADA.IO_EXCEPTIONS;

Function Readington (File : Byron.Internals.Types.Stream_Class) return Wide_Wide_String is

   -- Returns the contents of the stream.
   Function Contents return String is
      use Ada.Strings.Unbounded;
      Temp   : Character := Character'First;
      Result : Unbounded_String;
   Begin
      READ_DATA:
      Begin
         loop
            Character'Read(File, Temp);
            Append(New_Item => Temp, Source => Result);
         end loop;
      exception
         When ADA.IO_EXCEPTIONS.END_ERROR =>
            return To_String( Result );
      End READ_DATA;
   End Contents;

   -- Converts BYTE to CHARACTER.
   Function Convert is new Unchecked_Conversion(
      Source => Interfaces.Unsigned_8,
      Target => Character
     );

   function Decode(Item         : Ada.Strings.UTF_Encoding.UTF_String;
                   Input_Scheme : Ada.Strings.UTF_Encoding.Encoding_Scheme
                  ) return Wide_Wide_String
     renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Decode;

   -- Quick rename for conversion.
   Function "+" (Right : Interfaces.Unsigned_8) return Character renames Convert;

   Procedure Endian_Conversion( Item : in out Wide_Wide_String ) is
      Type Bytes is record
         HH, HL, LH, LL : Interfaces.Unsigned_8;
      end record
      with Pack, Size => 32, Object_Size => 32;

      Pragma Assert( Bytes'Object_Size = Wide_Wide_String'Component_Size );

      Type String is Array(Item'Range) of Bytes;

      Data : String with Import, Address => Item'Address;
   Begin
      For E : Bytes of Data loop
         E:= ( E.LL,E.LH, E.HL,E.HH );
      end loop;
   End Endian_Conversion;

   -- Type for forcing the alignment.
   Type Aligned_String is New String
   with Alignment => 4;

   -- Data is the file's contents; we convert/assign to the Aligned_String type
   -- to ensure the proper alignment, then we overlay with the normal string
   -- which has an alignment of 1 (meaning any byte), and so any address for an
   -- Aligned_String is compatible with the address for a normal string.
   Aligned_Data : constant Aligned_String  := Aligned_String( Contents );
   Data         : String(Aligned_Data'Range)
     with Import, Address => Aligned_Data'Address;

   -- The literal BOM string that Notepad++ saved.
   BOM_1    : constant String  := (+239,+187,+191);

   -- The BOM string that UTF_Encoding uses.
   BOM_2    : String  renames Ada.Strings.UTF_Encoding.BOM_8;

   -- This BOM indicates the 16-bit big-endian format.
   BOM_3    : String  renames Ada.Strings.UTF_Encoding.BOM_16BE;

   -- This BOM indicates the 16-bit little-endian format.
   BOM_4    : String  renames Ada.Strings.UTF_Encoding.BOM_16LE;

   -- This BOM indicates the 32-bit big-endian format.
   BOM_5    : constant String := (+0, +0, +16#FE#, +16#FF#);

   -- This BOM indicates the 32-bit little-endian format.
   BOM_6    : constant String := (+16#FF#, +16#FE#, +0, +0);

   Check_String : String(1..4)
     with Import, Address => Data'Address;

   Function Wide_Wide_Data(Convert_Endian : Boolean) return Wide_Wide_String is
      Pragma Assert( Data'Length mod 4 = 0,
                     "Data length not evenly divisible by 4." );
      Result_Data : Wide_Wide_String(1..Data'Length/4)
        with Import, Address => Aligned_Data'Address;
   begin
      Return Result : Wide_Wide_String:= Result_Data(2..Result_Data'Last) do
         if Convert_Endian then
            Endian_Conversion(Result);
         end if;
      end return;
   end Wide_Wide_Data;


   Use
     System,
     Ada.Strings.Fixed,
     Ada.Strings.Wide_Wide_Fixed,
     Ada.Strings.UTF_Encoding.Conversions,
     Ada.Strings.UTF_Encoding;
Begin
   -- Return the data, less any extant leading BOM.
   -- This must be done in descending BOM sizes, otherwise the proper BOM
   -- will not be detected.
   if Index(Check_String, BOM_6) = 1 then
      Return Wide_Wide_Data(System.Default_Bit_Order /= Low_Order_First);
   elsif Index(Check_String, BOM_5) = 1 then
      Return Wide_Wide_Data(System.Default_Bit_Order = Low_Order_First);
   elsif Index(Check_String, BOM_4) = 1 then
      Return Decode(Data, UTF_16LE);
   elsif Index(Check_String, BOM_3) = 1 then
      Return Decode(Data, UTF_16BE);
   elsif Index(Check_String, BOM_2) = 1 then
      return Decode(Data(BOM_2'Length+1..Data'Length), UTF_8);
   elsif Index(Check_String, BOM_1) = 1 then
      return Decode(Data(BOM_1'Length+1..Data'Length), UTF_8);
   else
      return Ada.Characters.Conversions.To_Wide_Wide_String( Data );
   end if;
End Readington;

As you can see.

?
I have. It works perfectly fine.
(If you’re on windows, you have to set the codepage of the commandline if you’re using “the terminal”. I believe someone posted it upthread.)

…did you save the text as UTF8? Your editor might be using something else.
(If it’s GPS, Right-Click the text → Properties; ensure Unicode UTF-8 is selected.)


I can reproduce this if the file has a BOM. In that case, the `-gnatW8` mode is automatically enabled. If that is your case, remove the BOM to get it working. I’ve used Emacs and `M-x set-buffer-file-coding-system RET utf-8-with-signature` to insert the BOM and `M-x set-buffer-file-coding-system RET utf-8` to remove it.

You’ve probably used an editor that automatically inserts a BOM.
