You might have expected GNAT’s Ada.Wide_Wide_Strings.Maps.Constants.Lower_Case_Map to support a wide (!) range of characters; however, it only supports a much-reduced subset - I think it’s Latin-1.
Might be explained by there being (at 2025-02-21) 1858 uppercase and 2258 lowercase Unicode characters, according to Wikipedia (Google’s AI denies that the numbers differ if you ask it why); extending to anything even approaching the full range of possibilities would, of course, be a mammoth & ongoing task.
Or is there some requirement/documentation that I’ve missed?
Sorry, those numbers (2258/1858) were the count of characters with lower/upper case declared.
For Ḇ, ḇ we have
1E06;LATIN CAPITAL LETTER B WITH LINE BELOW;Lu;0;L;0042 0331;;;;N;;;;1E07;
1E07;LATIN SMALL LETTER B WITH LINE BELOW;Ll;0;L;0062 0331;;;;N;;;1E06;;1E06
which seem to me to have the same kind of mapping as B, b aside from the character decomposition mapping:
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
and I’d say they are simple (single-character) case mappings as defined here.
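To make the comparison concrete, here is a minimal sketch that pulls the three simple case mapping fields out of the UnicodeData.txt lines quoted above. Per the UnicodeData.txt layout, semicolon-separated fields 12, 13 and 14 (counting from 0) hold the simple uppercase, lowercase and titlecase mappings. The names Show_Simple_Case_Fields, Field and Show are made up for this illustration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Simple_Case_Fields is

   --  Hypothetical helper: return the N-th (0-based) semicolon-separated
   --  field of Line, or "" if the line has fewer fields.
   function Field (Line : String; N : Natural) return String is
      Start : Positive := Line'First;
      Count : Natural  := 0;
   begin
      for I in Line'Range loop
         if Line (I) = ';' then
            if Count = N then
               return Line (Start .. I - 1);
            end if;
            Count := Count + 1;
            Start := I + 1;
         end if;
      end loop;

      --  The last field is not terminated by a semicolon.
      if Count = N then
         return Line (Start .. Line'Last);
      else
         return "";
      end if;
   end Field;

   --  The two UnicodeData.txt entries quoted in the post.
   Upper_B : constant String :=
     "1E06;LATIN CAPITAL LETTER B WITH LINE BELOW;Lu;0;L;"
     & "0042 0331;;;;N;;;;1E07;";
   Lower_B : constant String :=
     "1E07;LATIN SMALL LETTER B WITH LINE BELOW;Ll;0;L;"
     & "0062 0331;;;;N;;;1E06;;1E06";

   --  Print the code point and its three simple case mapping fields.
   procedure Show (Line : String) is
   begin
      Put_Line (Field (Line, 0)
                & ": upper=[" & Field (Line, 12)
                & "] lower=[" & Field (Line, 13)
                & "] title=[" & Field (Line, 14) & "]");
   end Show;

begin
   Show (Upper_B);
   Show (Lower_B);
end Show_Simple_Case_Fields;
```

For these two lines it should report a lowercase mapping of 1E07 for U+1E06, and uppercase and titlecase mappings of 1E06 for U+1E07, i.e. the same shape as the B/b pair.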
As @jere noted above, the ARM refers to “the Simple Lowercase Mapping as defined by documents referenced in Clause 2 of ISO/IEC 10646:2020” which I have to say isn’t a lot of help. Certainly Google doesn’t tell us which documents.
One thing: what’s the difference between Lowercase and Simple_Lowercase?
Another thing (sorry): is this going to be expensive? It would get called for every identifier in a source text.
   function To_Lowercase (S : Wide_Wide_String) return Wide_Wide_String is
      use VSS.Strings;
      use VSS.Strings.Conversions;
      use VSS.Transformers.Casing;
      VS : constant Virtual_String := To_Virtual_String (S);
      LVS : constant Virtual_String :=
        Transform (Self => VSS.Transformers.Casing.To_Simple_Lowercase,
                   Item => VS);
   begin
      return Result : Wide_Wide_String (S'Range) do
         Set_Wide_Wide_String (LVS, Into => Result);
      end return;
   end To_Lowercase;
Yes, simple mapping is what sane people would understand under case conversion.
Full mapping covers crazy stuff like German ß ↔ SS, which BTW is now officially wrong as the German orthography was changed to avoid the mess and ambiguities like in Schusssicherung.
I am a bit puzzled about what you expected. Ada.Wide_Wide_Characters.Handling.To_Lower looks perfectly OK to me. Full mapping is a fool’s errand, IMO.
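For anyone who wants to check this on their own compiler, a minimal sketch using only the standard library follows. It applies the character-level To_Lower to three sample code points and prints the before/after values in hex. The procedure name Show_To_Lower and the choice of samples are just for illustration, and what gets printed for the non-Latin-1 characters depends on the runtime's tables, so no output is claimed here.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;

procedure Show_To_Lower is
   use Ada.Wide_Wide_Characters.Handling;

   --  Code points are spelled numerically so the example does not depend
   --  on the source file encoding.
   Input : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#0042#),   --  LATIN CAPITAL LETTER B
      Wide_Wide_Character'Val (16#1E06#),   --  ... B WITH LINE BELOW
      Wide_Wide_Character'Val (16#1E9E#));  --  LATIN CAPITAL LETTER SHARP S

   package Hex_IO is new Ada.Text_IO.Integer_IO (Natural);
begin
   for C of Input loop
      --  Print "16#...# -> 16#...#" for each character.
      Hex_IO.Put (Item => Wide_Wide_Character'Pos (C),
                  Width => 0, Base => 16);
      Ada.Text_IO.Put (" -> ");
      Hex_IO.Put (Item => Wide_Wide_Character'Pos (To_Lower (C)),
                  Width => 0, Base => 16);
      Ada.Text_IO.New_Line;
   end loop;
end Show_To_Lower;
```

Because this works one character at a time, it can only ever produce a simple (one-to-one) mapping; anything that changes the length of the string is out of its reach by construction.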
Lower case U?
Yes, it is expensive: it does two encoding conversions plus a case conversion of variable-length encoded text. However, I recommend investigating performance in the context of your particular application. There might be ways to improve it in your code or in VSS.
Also, for identifiers in the general case, identifier caseless folding should be used instead of plain case conversion. It is even more expensive, but covers many programming languages and representations of source code.
PS. From my experience with ALS and related tools, only a few percent of the time is spent inside VSS. It may or may not be the same for your application.
Both To_Simple_Lowercase and To_Lowercase work for me with text containing ẞ (LATIN CAPITAL LETTER SHARP S).
However, when I tried To_Identifier_Caseless I got a Program_Error in VSS.Implementation.UTF8_Normalization.Unchecked_Replace - is this because that hasn’t been updated in v25.0.0?
I could do, but the issue in v25.0.0/Unchecked_Replace arises because it hadn’t been implemented yet, and its body was commented out aside from a raise Program_Error.
I tried with VSS commit 2811167, and it turns out that the Identifier_Caseless conversion expands "ÀḆĈḒËẞ" to something with 7 characters, and "ẞ" to 2 characters ("ss").
I’ll post an issue if you like, but I suspect this is deliberate.
Thanks, an issue is not necessary if it works in the development version of VSS. I remember an issue like this, and wanted to check that it was fixed and that this is not some new issue.