Case conversions

simonjwright · February 21, 2025, 11:07pm

You might have expected GNAT’s Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants.Lower_Case_Map to support a wide (!) range of characters; however, it only supports a much reduced subset. I think it’s Latin-1.

Might be explained by there being (at 2025-02-21) 1858 uppercase and 2258 lowercase Unicode characters, according to Wikipedia (Google’s AI denies that the numbers differ if you ask it why); extending to anything even approaching the full range of possibilities would, of course, be a mammoth & ongoing task.

Or is there some requirement/documentation that I’ve missed?

mgrojo · February 21, 2025, 11:43pm

And does Ada.Wide_Wide_Characters.Handling.To_Lower cover all the cases?

jere · February 22, 2025, 3:17am

It says in the RM that it does what Wide_Characters.Handling.To_Lower does, except that it uses wide wide characters.

The Wide_Character version says:

Hopefully that helps?

dmitry-kazakov · February 22, 2025, 10:29am

In UnicodeData.txt the corresponding lines are:

1858;MONGOLIAN LETTER TODO GAA;Lo;0;L;;;;;N;;;;;

2258;CORRESPONDS TO;Sm;0;ON;;;;;N;;;;;

1858 has no lower case counterpart. The categorization field is Lo = other letter. Compare it with

054A;ARMENIAN CAPITAL LETTER PEH;Lu;0;L;;;;;N;;;;057A;

Here the categorization is Lu = uppercase letter and the next to last field 057A gives the lower case letter.
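To make the field positions concrete, here is a minimal sketch that splits one UnicodeData.txt line on semicolons and prints the general category (field 2, counting from 0) and the simple lowercase mapping (field 13). The procedure name Show_Case_Fields and the helper Field are hypothetical names made up for this illustration; only Ada.Text_IO comes from the standard library.

```ada
with Ada.Text_IO;

procedure Show_Case_Fields is

   --  Hypothetical helper: return field number N (counting from 0) of a
   --  semicolon-separated line, or "" if the line has fewer fields.
   function Field (Line : String; N : Natural) return String is
      First : Positive := Line'First;
      Count : Natural  := 0;
   begin
      for I in Line'Range loop
         if Line (I) = ';' then
            if Count = N then
               return Line (First .. I - 1);
            end if;
            Count := Count + 1;
            First := I + 1;
         end if;
      end loop;
      --  The last field has no trailing separator.
      return (if Count = N then Line (First .. Line'Last) else "");
   end Field;

   --  Line quoted above for U+054A.
   Peh : constant String :=
     "054A;ARMENIAN CAPITAL LETTER PEH;Lu;0;L;;;;;N;;;;057A;";

begin
   Ada.Text_IO.Put_Line ("category  : " & Field (Peh, 2));
   Ada.Text_IO.Put_Line ("lowercase : " & Field (Peh, 13));
end Show_Case_Fields;
```

For the 054A line this prints Lu and 057A; for the 1858 line, field 13 is empty, which is the "no lower case counterpart" case.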

I tested 16#054A# with the Wide and Wide_Wide versions of To_Lower and with UTF-8 (Simple Components); all three work fine on Debian (GCC 12).
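A check along these lines can be sketched as follows. It prints, for three code points taken from the UnicodeData.txt lines quoted in this thread, the result of Ada.Wide_Wide_Characters.Handling.To_Lower next to the result of looking the character up in the standard Lower_Case_Map (RM name Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants.Lower_Case_Map). The procedure name Compare_Lower and the choice of sample code points are assumptions for illustration; the output depends on the compiler and runtime, so none is shown.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;
with Ada.Strings.Wide_Wide_Maps;
with Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants;

procedure Compare_Lower is

   package Handling renames Ada.Wide_Wide_Characters.Handling;
   package Maps renames Ada.Strings.Wide_Wide_Maps;
   package Constants renames
     Ada.Strings.Wide_Wide_Maps.Wide_Wide_Constants;

   --  Sample code points (illustrative choice):
   --  LATIN CAPITAL LETTER B, ARMENIAN CAPITAL LETTER PEH,
   --  LATIN CAPITAL LETTER B WITH LINE BELOW.
   Samples : constant array (1 .. 3) of Natural :=
     (16#0042#, 16#054A#, 16#1E06#);

begin
   for Code of Samples loop
      declare
         C : constant Wide_Wide_Character :=
           Wide_Wide_Character'Val (Code);
      begin
         --  Code points are printed in decimal.
         Ada.Text_IO.Put_Line
           ("code" & Natural'Image (Code)
            & ": To_Lower =>"
            & Natural'Image
                (Wide_Wide_Character'Pos (Handling.To_Lower (C)))
            & ", Lower_Case_Map =>"
            & Natural'Image
                (Wide_Wide_Character'Pos
                   (Maps.Value (Constants.Lower_Case_Map, C))));
      end;
   end loop;
end Compare_Lower;
```

If the two columns differ for a code point outside Latin-1, that would be consistent with the map covering only a reduced subset while To_Lower follows the simple lowercase mapping.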

godunko · February 22, 2025, 12:46pm

VSS supports simple and full default Unicode case conversion.

simonjwright · February 22, 2025, 8:28pm

Sorry, those numbers (2258/1858) were the count of characters with lower/upper case declared.

For Ḇ, ḇ we have

1E06;LATIN CAPITAL LETTER B WITH LINE BELOW;Lu;0;L;0042 0331;;;;N;;;;1E07;
1E07;LATIN SMALL LETTER B WITH LINE BELOW;Ll;0;L;0062 0331;;;;N;;;1E06;;1E06

which seem to me to have the same kind of mapping as B, b aside from the character decomposition mapping:

0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042

and I’d say they are simple (single-character) case mappings as defined here.

As @jere noted above, the ARM refers to “the Simple Lowercase Mapping as defined by documents referenced in Clause 2 of ISO/IEC 10646:2020” which I have to say isn’t a lot of help. Certainly Google doesn’t tell us which documents.

simonjwright · February 22, 2025, 8:41pm

Thanks very much: VSS solved the problem for me.

One thing: what’s the difference between Lowercase and Simple_Lowercase?

Another thing (sorry): is this going to be expensive? It would get called for every identifier in a source text.

   function To_Lowercase (S : Wide_Wide_String) return Wide_Wide_String is
      use VSS.Strings;
      use VSS.Strings.Conversions;
      use VSS.Transformers.Casing;
      VS : constant Virtual_String := To_Virtual_String (S);
      LVS : constant Virtual_String :=
        Transform (Self => VSS.Transformers.Casing.To_Simple_Lowercase,
                   Item => VS);
   begin
      return Result : Wide_Wide_String (S'Range) do
         Set_Wide_Wide_String (LVS, Into => Result);
      end return;
   end To_Lowercase;

dmitry-kazakov · February 22, 2025, 8:58pm

Yes, simple mapping is what sane people would understand by case conversion.

Full mapping covers crazy stuff like German ß ↔ SS, which BTW is now officially wrong as the German orthography was changed to avoid the mess and ambiguities like in Schusssicherung.

I am a bit puzzled about what you expected. Ada.Wide_Wide_Characters.Handling.To_Lower looks perfectly OK to me. Full mapping is a fool’s errand, IMO.

Lower case U?

godunko · February 23, 2025, 8:00am

Yes, it is expensive: it does two encoding conversions and a case conversion of variable-length encoded text. However, I recommend investigating performance in the context of your particular application. There might be ways to improve it in your code or in VSS.

Also, for identifiers in the general case, identifier caseless folding should be used instead of just case conversion. It is even more expensive, but covers many programming languages and representations of source code.

github.com/AdaCore/VSS

source/text/vss-transformers-caseless.ads

2811167bc


      
          --
          --   To_Canonical_Caseless     : constant Abstract_Transformer'Class;
          --   --  Convert text to uppercase using default full case conversion.
          --
          --   To_Compatibility_Caseless : constant Abstract_Transformer'Class;
          --   --  Convert text to lowercase using default simple case conversion.
          --
          --   To_Identifier_Caseless    : constant Abstract_Transformer'Class;
          --   --  Convert text to uppercase using default simple case conversion.
          
          type Identifier_Caseless_Transformer is
            limited new Abstract_Transformer with null record;
          --  @private
          
          overriding function Transform
            (Self : Identifier_Caseless_Transformer;
             Item : VSS.Strings.Virtual_String'Class)
             return VSS.Strings.Virtual_String;
          --  @private
          
          overriding procedure Transform

PS. In my experience with ALS and related tools, only a few percent of the time is spent inside VSS. That may or may not hold for your application.

simonjwright · February 23, 2025, 11:42am

Both To_Simple_Lowercase and To_Lowercase work for me with text containing ẞ (LATIN CAPITAL LETTER SHARP S).

However, when I tried To_Identifier_Caseless I got a Program_Error in VSS.Implementation.UTF8_Normalization.Unchecked_Replace - is this because that hasn’t been updated in v25.0.0?

dmitry-kazakov · February 23, 2025, 12:02pm

The reform happened around 2017. In UnicodeData.txt the small ß has no capital form, while the capital ẞ has a small form.

godunko · February 23, 2025, 12:23pm

Can you create a reproducer?

simonjwright · February 23, 2025, 5:01pm

I could do, but the issue in v25.0.0/Unchecked_Replace arises because it hadn’t been implemented yet, and its body was commented out aside from a raise Program_Error.

I tried with VSS commit 2811167, and it turns out that the Identifier_Caseless conversion expands "ÀḆĈḒËẞ" to something with 7 characters, and "ẞ" to 2 characters ("ss").

I’ll post an issue if you like, but I suspect this is deliberate.

godunko · February 23, 2025, 5:16pm

Thanks, an issue is not necessary if it works in the development version of VSS. I remember an issue like this, and wanted to check that it was fixed and that this is not some new issue.