Case conversions

You might have expected GNAT’s Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants.Lower_Case_Map to support a wide (!) range of characters; however, it only supports a much-reduced subset: Latin-1, I think.

This might be explained by there being (as of 2025-02-21) 1858 uppercase and 2258 lowercase Unicode characters, according to Wikipedia (Google’s AI denies that the numbers differ if you ask it why); extending to anything even approaching the full range of possibilities would, of course, be a mammoth and ongoing task.

Or is there some requirement/documentation that I’ve missed?

And does Ada.Wide_Wide_Characters.Handling.To_Lower cover all the cases?

The RM says that it does what Wide_Characters.Handling.To_Lower does, except that it uses wide wide characters.

The Wide_Character version says:

Hopefully that helps?

In UnicodeData.txt the corresponding lines are:

1858;MONGOLIAN LETTER TODO GAA;Lo;0;L;;;;;N;;;;;

2258;CORRESPONDS TO;Sm;0;ON;;;;;N;;;;;

1858 has no lower case counterpart. The categorization field is Lo = other letter. Compare it with

054A;ARMENIAN CAPITAL LETTER PEH;Lu;0;L;;;;;N;;;;057A;

Here the categorization is Lu = uppercase letter and the next to last field 057A gives the lower case letter.
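To make the field positions concrete, here is a minimal sketch that pulls the relevant fields out of the two lines quoted above. The helper name Field is made up for illustration; the 0-based field numbers (2 = general category, 12 = simple uppercase, 13 = simple lowercase) follow the UnicodeData.txt layout as I understand it.

```ada
with Ada.Text_IO;

procedure Unicode_Fields is

   Peh : constant String :=
     "054A;ARMENIAN CAPITAL LETTER PEH;Lu;0;L;;;;;N;;;;057A;";
   Gaa : constant String :=
     "1858;MONGOLIAN LETTER TODO GAA;Lo;0;L;;;;;N;;;;;";

   --  Hypothetical helper: return the Index-th semicolon-separated
   --  field of Line, counting from 0.
   function Field (Line : String; Index : Natural) return String is
      First : Positive := Line'First;
      Seen  : Natural  := 0;
   begin
      for I in Line'Range loop
         if Line (I) = ';' then
            if Seen = Index then
               return Line (First .. I - 1);
            end if;
            Seen  := Seen + 1;
            First := I + 1;
         end if;
      end loop;
      --  Either the last field (no trailing semicolon) or out of range.
      return (if Seen = Index then Line (First .. Line'Last) else "");
   end Field;

   procedure Show (Line : String) is
   begin
      Ada.Text_IO.Put_Line
        (Field (Line, 0) & ": category=" & Field (Line, 2)
         & " upper=[" & Field (Line, 12) & "]"
         & " lower=[" & Field (Line, 13) & "]");
   end Show;

begin
   Show (Peh);  --  054A: category=Lu upper=[] lower=[057A]
   Show (Gaa);  --  1858: category=Lo upper=[] lower=[]
end Unicode_Fields;
```

An empty lower field means the character has no simple lowercase mapping, which is the case for 1858.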

I tested 16#054A# with the Wide and Wide_Wide To_Lower and with UTF-8 (Simple Components); all three work fine on Debian (GCC 12).
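For anyone who wants to repeat the check, here is a rough sketch that puts the two standard-library routes side by side for U+054A. It uses only the standard packages; the procedure name is arbitrary. As I read RM A.4.8, the mapping constants in Wide_Wide_Constants are the identity outside the Character range, which would explain the Latin-1-only behaviour of Lower_Case_Map, but please check that against your own RM and compiler.

```ada
with Ada.Text_IO;
with Ada.Strings.Wide_Wide_Maps;
with Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants;
with Ada.Wide_Wide_Characters.Handling;

procedure To_Lower_Check is

   --  U+054A ARMENIAN CAPITAL LETTER PEH; UnicodeData.txt lists
   --  U+057A as its simple lowercase mapping.
   Upper : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#054A#);

   --  Route 1: the character-handling package.
   Via_Handling : constant Wide_Wide_Character :=
     Ada.Wide_Wide_Characters.Handling.To_Lower (Upper);

   --  Route 2: the predefined mapping constant.
   Via_Map : constant Wide_Wide_Character :=
     Ada.Strings.Wide_Wide_Maps.Value
       (Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants.Lower_Case_Map,
        Upper);

begin
   --  Code points are printed in decimal; 16#057A# is 1402 and
   --  16#054A# is 1354.
   Ada.Text_IO.Put_Line
     ("Handling.To_Lower:"
      & Integer'Image (Wide_Wide_Character'Pos (Via_Handling)));
   Ada.Text_IO.Put_Line
     ("Lower_Case_Map   :"
      & Integer'Image (Wide_Wide_Character'Pos (Via_Map)));
end To_Lower_Check;
```

If the first line shows 1402 and the second 1354, then Handling.To_Lower is applying the Unicode mapping while Lower_Case_Map leaves the character unchanged.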

VSS supports simple and full default Unicode case conversion.

Sorry, those numbers (2258/1858) were the count of characters with lower/upper case declared.

For Ḇ, ḇ we have

1E06;LATIN CAPITAL LETTER B WITH LINE BELOW;Lu;0;L;0042 0331;;;;N;;;;1E07;
1E07;LATIN SMALL LETTER B WITH LINE BELOW;Ll;0;L;0062 0331;;;;N;;;1E06;;1E06

which seem to me to have the same kind of mapping as B, b aside from the character decomposition mapping:

0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042

and I’d say they are simple (single-character) case mappings as defined here.

As @jere noted above, the ARM refers to “the Simple Lowercase Mapping as defined by documents referenced in Clause 2 of ISO/IEC 10646:2020” which I have to say isn’t a lot of help. Certainly Google doesn’t tell us which documents.

Thanks very much: VSS solved the problem for me.

One thing: what’s the difference between Lowercase and Simple_Lowercase?

Another thing (sorry): is this going to be expensive? It would get called for every identifier in a source text.

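   --  Needs "with" clauses for VSS.Strings, VSS.Strings.Conversions
   --  and VSS.Transformers.Casing.
   function To_Lowercase (S : Wide_Wide_String) return Wide_Wide_String is
      use VSS.Strings;
      use VSS.Strings.Conversions;
      use VSS.Transformers.Casing;
      --  First conversion: Wide_Wide_String to a Virtual_String.
      VS : constant Virtual_String := To_Virtual_String (S);
      --  Simple (one character to one character) lowercase mapping.
      LVS : constant Virtual_String :=
        Transform (Self => VSS.Transformers.Casing.To_Simple_Lowercase,
                   Item => VS);
   begin
      --  Second conversion, back to Wide_Wide_String. The result reuses
      --  S'Range on the assumption that simple mapping preserves length.
      return Result : Wide_Wide_String (S'Range) do
         Set_Wide_Wide_String (LVS, Into => Result);
      end return;
   end To_Lowercase;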

Yes, simple mapping is what sane people would understand by case conversion. :grinning:

Full mapping covers crazy stuff like German ß ↔ SS, which BTW is now officially wrong, as the German orthography was changed to avoid the mess and ambiguities like in Schusssicherung.

I am a bit puzzled about what you expected. Ada.Wide_Wide_Characters.Handling.To_Lower looks perfectly OK to me. Full mapping is a fool’s errand, IMO. :grinning:

Lower case U?

Yes, it is expensive: it does two encoding conversions plus a case conversion of variable-length encoded text. However, I recommend investigating performance in the context of your particular application. There might be ways to improve it in your code or in VSS.

Also, for identifiers in the general case, identifier caseless folding should be used instead of just case conversion. It is even more expensive, but covers many programming languages and representations of source code.

PS. From my experience with ALS and related tools, only a few percent of the time is spent inside VSS. It might or might not be the same for your application.
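As a starting point for measuring, here is a rough timing sketch. It is only an assumption about how one might measure: it times the standard Ada.Wide_Wide_Characters.Handling.To_Lower on a made-up sample identifier, and the call inside the loop can be swapped for whatever lowercasing function you actually use. The sample string and iteration count are arbitrary.

```ada
with Ada.Real_Time; use Ada.Real_Time;
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;

procedure Time_Lowering is

   --  Arbitrary sample identifier and repetition count.
   Sample     : constant Wide_Wide_String := "Some_Mixed_Case_Identifier";
   Iterations : constant := 1_000_000;

   Total : Natural := 0;
   Start : constant Time := Clock;

begin
   for I in 1 .. Iterations loop
      declare
         --  Replace this call with the function under test.
         Lowered : constant Wide_Wide_String :=
           Ada.Wide_Wide_Characters.Handling.To_Lower (Sample);
      begin
         --  Use the result so the work is not trivially discarded.
         Total := Total + Lowered'Length;
      end;
   end loop;

   Ada.Text_IO.Put_Line
     ("Elapsed:" & Duration'Image (To_Duration (Clock - Start))
      & " s for" & Integer'Image (Iterations) & " calls;"
      & Integer'Image (Total) & " characters processed");
end Time_Lowering;
```

A micro-benchmark like this only gives a ballpark figure; profiling the real application on real source text is what tells you whether the conversion matters.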

Both To_Simple_Lowercase and To_Lowercase work for me with text containing ẞ (LATIN CAPITAL LETTER SHARP S).

However, when I tried To_Identifier_Caseless I got a Program_Error in VSS.Implementation.UTF8_Normalization.Unchecked_Replace - is this because that hasn’t been updated in v25.0.0?

The reform happened around 2017. In UnicodeData.txt the small ß has no capital form, while the capital ẞ has a small form. :grinning:

Can you create a reproducer?

I could do, but the issue in v25.0.0/Unchecked_Replace arises because it hadn’t been implemented yet, and its body was commented out aside from a raise Program_Error.

I tried with VSS commit 2811167, and it turns out that the Identifier_Caseless conversion expands "ÀḆĈḒËẞ" to something with 7 characters, and "ẞ" to 2 characters ("ss").

I’ll post an issue if you like, but I suspect this is deliberate.

Thanks, an issue is not necessary if it works in the development version of VSS. I remember an issue like this, and wanted to check that it had been fixed and that this was not some new issue.