[AdaCL] GetOpt updated to wide string support

With the latest AdaCL release the AdaCL.Command_Line.GetOpt package was updated to use Wide_Wide_String instead of String.

This makes it easier to parse command lines containing Unicode characters, both in file names and in the command line options.

In addition, AdaCL.Command_Line.GetOpt is object oriented and reentrant, so two tasks can parse the command line concurrently. For example, AdaCL.Trace makes use of this feature to parse the command line independently from the main program.

For more details see: AdaCL.Command_Line.GetOpt and download with Alire.


Parsing is easier and more efficient in UTF-8. Wide_Wide_String wastes memory and performance. Switch back to String encoded as UTF-8.

The previous version of AdaCL.Command_Line.GetOpt used strictly Latin_1 encoding. There is no UTF-8 to go back to.

You might be thinking of AdaCL.Command_Line.Orto, the command line parser from Björn Persson. However, the version I have is still Ada 95, depends on Charles, and I have not got around to changing that:

…AdaCL/adacl_eastrings/src/adacl-command_line-orto-signed_integer_parameters.ads:22:32: error: file "charles.ads" not found

It’s on my todo list, but since it’s not my own code it’s a major endeavour for me. Orto will of course continue to use AdaCL.EAstrings.

However, easier? Sure about that?

Test_1 : String := "ÄÜÖ";             -- What is Test_1'Length?
Test_2 : Character := Test_1 (2);     -- What is now in Test_2?
Test_3 : Character := 'Ä';            -- Doesn't even compile

Especially Test_3 is a major pain as you don’t have a character type any more. Even if you go for UTF-8 Strings you still need Wide_Wide_Character to handle all possible characters.

Needless to say, the Wide_Wide alternative works flawlessly.

Test_4 : Wide_Wide_String    := "ÄÖÜ";          -- Test_4'Length is 3
Test_5 : Wide_Wide_Character := Test_4 (2);     -- Test_5 contains an Ö
Test_6 : Wide_Wide_Character := 'Ä';            -- Compiles flawlessly.

I also don’t think that AdaCL.EAstrings are in any way faster than Wide_Wide_Strings.

There has been quite the discussion on the Telegram channel about this, and we are kind of warming up to Wide_Wide_Character over there because of its ease of use.

  1. Never ever use non-ASCII-7 characters in the source. It is highly non-portable.
  2. Never use arbitrary string indices. Indices must be obtained through operations, like skipping over a compared token.
  3. A Unicode character is not a thing. There are Unicode code points, but you do not need them most of the time.
   Test_1 : String := Character'Val (16#C3#) & Character'Val (16#84#) &   -- "Ä" in UTF-8
                      Character'Val (16#C3#) & Character'Val (16#9C#) &   -- "Ü"
                      Character'Val (16#C3#) & Character'Val (16#96#);    -- "Ö"
   Test_2 : Character := Test_1 (2); -- Never use arbitrary indices or octets
   Test_3 : String := Character'Val (16#C3#) & Character'Val (16#84#);    -- one glyph is a String, not a Character

Remember that a Character in a UTF-8 encoded String is an octet, a representation item; you should not care about it.

Working on the level of code points is wasting resources. You just do not need that for parsing. And parsing is far faster in UTF-8 because you need to process much less data. All operations necessary for parsing work in terms of a UTF-8 encoded String (a small sketch follows the list):

  • Comparing
  • Binary search against a table of tokens
  • Search a hash table
  • Scanning the string forward and backward up to a delimiter
  • UTF-8 is the native encoding for Linux and most tools. No conversion required.
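For example, scanning forward up to a delimiter needs no decoding at all. A minimal sketch (Next_Token is just an illustrative name, not taken from Simple Components or AdaCL): the ASCII delimiter can never occur inside a multi-octet UTF-8 sequence, so advancing per octet is safe.

function Next_Token
  (Source    : String;
   Pointer   : in out Positive;
   Delimiter : Character := ' ') return String
is
   Start : constant Positive := Pointer;
begin
   --  Advance octet by octet until the delimiter or the end of Source
   while Pointer <= Source'Last and then Source (Pointer) /= Delimiter loop
      Pointer := Pointer + 1;
   end loop;
   return Source (Start .. Pointer - 1);
end Next_Token;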

You can find an implementation of parsers in Simple Components: Simple components for Ada

I know, at least in theory, how to handle UTF-8. However, there is one aspect you are not taking into account: programmer efficiency.

Out of the box Ada has only a few packages with UTF-8 support:

  • Ada.Strings.UTF_Encoding
  • Ada.Strings.UTF_Encoding.Conversions
  • Ada.Strings.UTF_Encoding.Strings
  • Ada.Strings.UTF_Encoding.Wide_Strings
  • Ada.Strings.UTF_Encoding.Wide_Wide_Strings

These support only Encode and Decode functionality. There is none of the following:

  • Ada.Float_UTF_Text_IO
  • Ada.Integer_UTF_Text_IO
  • Ada.Strings.UTF_Bounded
  • Ada.Strings.UTF_Bounded.UTF_Equal_Case_Insensitive
  • Ada.Strings.UTF_Bounded.UTF_Hash
  • Ada.Strings.UTF_Bounded.UTF_Hash_Case_Insensitive
  • Ada.Strings.UTF_Equal_Case_Insensitive
  • Ada.Strings.UTF_Fixed
  • Ada.Strings.UTF_Fixed.UTF_Equal_Case_Insensitive
  • Ada.Strings.UTF_Fixed.UTF_Hash
  • Ada.Strings.UTF_Fixed.UTF_Hash_Case_Insensitive
  • Ada.Strings.UTF_Hash
  • Ada.Strings.UTF_Hash_Case_Insensitive
  • Ada.Strings.UTF_Maps
  • Ada.Strings.UTF_Maps.UTF_Constants
  • Ada.Strings.UTF_Unbounded
  • Ada.Strings.UTF_Unbounded.UTF_Equal_Case_Insensitive
  • Ada.Strings.UTF_Unbounded.UTF_Hash
  • Ada.Strings.UTF_Unbounded.UTF_Hash_Case_Insensitive
  • Ada.UTF_Characters
  • Ada.UTF_Characters.Handling
  • Ada.UTF_Command_Line
  • Ada.UTF_Directories
  • Ada.UTF_Directories.Hierarchical_File_Names
  • Ada.UTF_Directories.Information
  • Ada.UTF_Environment_Variables
  • Ada.UTF_Text_IO
  • Ada.UTF_Text_IO.Complex_IO
  • Ada.UTF_Text_IO.Editing
  • Ada.UTF_Text_IO.Text_Streams
  • Ada.UTF_Text_IO.UTF_Bounded_IO
  • Ada.UTF_Text_IO.UTF_Unbounded_IO

As such Ada, out of the box and with reasonable effort, supports only the following development paradigm for UTF-8:

function Do_Something (Value : String) return String is
   use Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
   use Ada.Strings.UTF_Encoding;

   Temp : Wide_Wide_String := Decode (Value, UTF_8) & ' ';
begin
   --  Do something with Temp using the many Wide_Wide_String
   --  packages. Example: insert an 'Ä' in the middle. The trailing
   --  space appended above makes room for the extra character so the
   --  lengths on both sides of the assignment still match.

   Temp := @ (@'First .. (@'First + @'Last) / 2) & 'Ä'
         & @ ((@'First + @'Last) / 2 + 1 .. @'Last - 1);

   return Encode (Temp, UTF_8);
end Do_Something;

And I see nothing wrong with it. I don’t even think it is significantly slower than working directly on a UTF-8 String. It is, however, significantly easier to implement.

Even that one example line would need two loops to implement: one to find out how many characters are in the string and one to find the middle character, and only then can you make the insertion. If there was any more to do, it would become a major endeavour.
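To illustrate, here is a sketch of that same insertion done directly on a UTF-8 encoded String (Insert_In_Middle is a made-up name and the octet tests assume valid UTF-8 input): one loop to count the characters, a second to find the octet index of the middle character, and only then the splice.

function Insert_In_Middle (Value : String) return String is
   Count  : Natural  := 0;
   Seen   : Natural  := 0;
   Middle : Positive := Value'Last + 1;
begin
   for C of Value loop                      --  Loop 1: count the characters
      if C not in Character'Val (16#80#) .. Character'Val (16#BF#) then
         Count := Count + 1;                --  lead octet, not a continuation octet
      end if;
   end loop;

   for I in Value'Range loop                --  Loop 2: octet index of the middle character
      if Value (I) not in Character'Val (16#80#) .. Character'Val (16#BF#) then
         Seen := Seen + 1;
         if Seen = Count / 2 + 1 then
            Middle := I;
            exit;
         end if;
      end if;
   end loop;

   return Value (Value'First .. Middle - 1)
        & Character'Val (16#C3#) & Character'Val (16#84#)   --  "Ä" in UTF-8
        & Value (Middle .. Value'Last);
end Insert_In_Middle;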

Do note that AdaCL is an object oriented library meant for desktop computers (macOS, Windows, Linux). It makes liberal use of tagged types, heap memory via smart pointers, unbounded strings and collections in all of its components.

AdaCL is not suitable or meant for embedded or real-time programming where 10 kB of temporary memory or 1000 CPU cycles would make a difference.

However, I will try to upgrade Björn Persson’s EAstrings, which properly support both UTF-8 and UTF-16, to Ada 2022. There seems to be a need for it.

Bounded and Unbounded strings (never use them in parsing!) work perfectly well in UTF-8. The safest way of doing I/O is to read/write streams, which guarantees no recoding ever happens, which is the goal. If, for some obscure reason, the data are not in UTF-8 but in, say, Windows-1256, recode as you read, manually.
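A minimal sketch of such stream I/O (Read_File is just an illustrative name; it assumes the whole file fits into memory): the octets go from the file into the String unchanged, no recoding anywhere.

with Ada.Streams.Stream_IO;

function Read_File (Name : String) return String is
   use Ada.Streams.Stream_IO;
   File : File_Type;
begin
   Open (File, In_File, Name);
   declare
      Content : String (1 .. Natural (Size (File)));
   begin
      String'Read (Stream (File), Content);   --  raw octets, no recoding
      Close (File);
      return Content;
   end;
end Read_File;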

Maps and sets are unrelated to encoding and are defined in terms of code points. The standard library provides them, AFAIK.

Character handling and categorization (case, space, letter etc.) you can find here: String processing for Ada. It is not specific to the encoding. Regardless of whether you use UCS-4 (Wide_Wide_String) or UTF-8, if you want, say, Is_Alphanumeric or To_Lowercase, you need the Unicode classification as defined in UnicodeData.txt.
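Once such a classification exists, the lookup itself is one call regardless of the representation. A tiny illustration with the classification the standard library ships for Wide_Wide_Character (Ada.Wide_Wide_Characters.Handling; Starts_With_Letter is a made-up name); with a UTF-8 String you would decode the glyph first:

with Ada.Wide_Wide_Characters.Handling;

function Starts_With_Letter (Text : Wide_Wide_String) return Boolean is
   use Ada.Wide_Wide_Characters.Handling;
begin
   --  Is_Letter follows the Unicode letter categories
   return Text'Length > 0 and then Is_Letter (Text (Text'First));
end Starts_With_Letter;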

As for your code sample, it indicates what is wrong with the approach. You should simply never need to insert or remove Unicode glyphs into/from a string. Length as the number of glyphs is a quite obscure thing in Unicode (Unicode composites are hell). UCS-4 does not give you “length” either, nor does it allow insert/remove, so there is no advantage over UTF-8 here. Just do not do that.

BTW, UCS-4 has another huge problem: endianness! UTF-8 is free of that.

The good news is that no sane parsing algorithm needs any of that. All text processing can be performed by advancing the string index. The index always points at the beginning of a glyph (and a token). The representation of glyphs is irrelevant so long as they start at the beginning of an octet. RADIX-50 is a counterexample, but we do not use it anymore… :slightly_smiling_face:
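A sketch of that index advancement (Next_Glyph is just an illustrative name): the index steps past the lead octet and any continuation octets, so it always lands at the beginning of the next glyph.

procedure Next_Glyph (Source : String; Index : in out Positive) is
begin
   Index := Index + 1;                      --  step past the lead octet
   while Index <= Source'Last
     and then Source (Index) in Character'Val (16#80#) .. Character'Val (16#BF#)
   loop
      Index := Index + 1;                   --  skip continuation octets
   end loop;
end Next_Glyph;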

Regarding embedded, yes, using 4 octets where only one or two are needed is certainly a bad idea. Especially considering running Linux, which is all UTF-8. You would have to recode everything that goes in or out!