@dmitry-kazakov While this may be true on a theoretical level, that’s not how anything in the standard library works. I wish you would stop misleading newcomers this way. Yes, Wide_Wide_String uses four bytes per character rather than the more compact UTF-8 encoding. The year is 2025. My phone has 8GB of RAM. You won’t run out of memory doing simple text processing. It’s fine.
@JC001 is correct. You can treat a String as an Ada.Strings.UTF_Encoding.UTF_8_String (it is a subtype of String, so no conversion is needed), then use the Encode and Decode functions in Ada.Strings.UTF_Encoding.Wide_Wide_Strings to convert to/from Wide_Wide_String. The package Ada.Strings.Wide_Wide_Fixed contains Wide_Wide_String variants of the same functions as Ada.Strings.Fixed.
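A minimal sketch of that round trip, in case it helps. The procedure name and the sample text are made up for illustration, and the input is spelled out as raw UTF-8 bytes so the example does not depend on the source file encoding. It only prints ASCII, so console settings should not matter.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Fixed;

procedure Round_Trip is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  "Grüße, Welt" as raw UTF-8 bytes (ü = C3 BC, ß = C3 9F).
   Input : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "Gr" & Character'Val (16#C3#) & Character'Val (16#BC#)
     & Character'Val (16#C3#) & Character'Val (16#9F#) & "e, Welt";

   --  One Wide_Wide_Character per code point.
   Text : constant Wide_Wide_String := Conv.Decode (Input);

   --  Searching works on code points, not bytes.
   Comma : constant Natural :=
     Ada.Strings.Wide_Wide_Fixed.Index (Text, ",");

   --  Back to UTF-8 (no byte order mark by default).
   Output : constant String := Conv.Encode (Text);
begin
   Ada.Text_IO.Put_Line ("UTF-8 bytes:" & Natural'Image (Input'Length));
   Ada.Text_IO.Put_Line ("Code points:" & Natural'Image (Text'Length));
   Ada.Text_IO.Put_Line ("Comma found: " & Boolean'Image (Comma /= 0));
   Ada.Text_IO.Put_Line ("Round trip OK: " & Boolean'Image (Output = Input));
end Round_Trip;
```

The input is 13 bytes long but decodes to 11 code points, which is the whole point of decoding before you index or slice.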
Wide_Wide_String is an array of Wide_Wide_Character, each of which holds a single Unicode code point in 32 bits. The Unicode standard only assigns a number to each character or modifier, but its code space currently ends at 16#10FFFF#, so every code point fits with plenty of room to spare, and it is unlikely that code points beyond that range will be standardized in the near future.
I find it tedious to type Wide_Wide_ repeatedly and hunt down the functions I need in the various Ada.Strings.* packages. I recommend writing a small package to wrap the operations you need. This way, you don’t have to type so much, can use the package where you need it, and can experiment with other string representations in the future if desired. I did that over here.
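To give an idea of what such a wrapper might look like, here is a rough sketch. It is not the package linked above; the names Text, T, From_UTF_8, To_UTF_8, Trim and Find are all hypothetical, and the package is nested in a procedure only so that the example fits in one file.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Fixed;

procedure Wrapper_Demo is

   --  Hypothetical wrapper; in a real project this would be a
   --  library-level package in its own files.
   package Text is
      subtype T is Wide_Wide_String;

      function From_UTF_8 (Item : String) return T;
      function To_UTF_8 (Item : T) return String;
      function Trim (Item : T) return T;
      function Find (Source, Pattern : T) return Natural;
   end Text;

   package body Text is
      package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
      package Fixed renames Ada.Strings.Wide_Wide_Fixed;

      function From_UTF_8 (Item : String) return T is
        (Conv.Decode (Item));

      function To_UTF_8 (Item : T) return String is
        (Conv.Encode (Item));

      --  Strip leading and trailing blanks.
      function Trim (Item : T) return T is
        (Fixed.Trim (Item, Ada.Strings.Both));

      --  Index of the first occurrence of Pattern, or 0 if absent.
      function Find (Source, Pattern : T) return Natural is
        (Fixed.Index (Source, Pattern));
   end Text;

   Line : constant Text.T := Text.From_UTF_8 ("  hello, world  ");
begin
   --  Prints "hello, world" without the surrounding blanks.
   Ada.Text_IO.Put_Line (Text.To_UTF_8 (Text.Trim (Line)));
end Wrapper_Demo;
```

Because callers only see Text.T, swapping the representation later (say, to a UTF-8 String) would mostly mean changing the subtype and the four function bodies.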
If working with libadalang you have to cope with Wide_Wide_*. I didn’t find it that bad; in ada_caser, out of 253 lines in ada_caser-processing.adb, the string Wide_Wide occurred on 10.
There are no standard library functions for parsing CSV files.
Wasting processor time and memory is up to you. But there are literally no Wide_Wide_Character encoded sources. Whatsoever. Ada’s Wide and Wide_Wide I/O are artefacts of the Ada standard library. Reading a UTF-8 file and recoding it into UCS-4 (Wide_Wide) is stupid because a properly written text processing program remains exactly the same in both UTF-8 and UCS-4.
For an example of how CSV parsing is done, see Strings_Edit.UTF8.Categorization_Generator here, which parses the unicode.txt file. The file is in CSV format.
A general parser can be found in Simple Components. It abstracts the source and allows parsing strings, files and streams using the same code.
Though Wide_String Text_IO files are supported among other types of sources, the parser never uses Wide_Character because the implementation of the source recodes text to UTF-8 as it is read.
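To illustrate the claim that the processing code stays the same on UTF-8, here is a small sketch that splits a line on commas without decoding anything. It is not taken from Strings_Edit or Simple Components; the procedure name and sample data are invented, and it ignores quoted fields. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 16#80# or above, so an ASCII comma can never appear inside one. It assumes GNAT's default settings (no -gnatW8), where, as far as I know, Ada.Text_IO passes the bytes through unchanged.

```ada
with Ada.Text_IO;
with Ada.Strings.Fixed;

procedure Split_UTF8 is
   --  UTF-8 byte sequences for the two non-ASCII letters used below.
   U_Uml : constant String :=
     Character'Val (16#C3#) & Character'Val (16#BC#);  --  U+00FC
   O_Uml : constant String :=
     Character'Val (16#C3#) & Character'Val (16#B6#);  --  U+00F6

   --  "Zürich,Köln,Malmö" as UTF-8 bytes.
   Line : constant String :=
     "Z" & U_Uml & "rich,K" & O_Uml & "ln,Malm" & O_Uml;

   First : Positive := Line'First;
   Comma : Natural;
begin
   loop
      --  Guard against searching past the end (line ending in a comma).
      Comma := (if First > Line'Last then 0
                else Ada.Strings.Fixed.Index (Line, ",", From => First));
      exit when Comma = 0;

      --  The slice is still valid UTF-8: we only cut at ASCII bytes.
      Ada.Text_IO.Put_Line (Line (First .. Comma - 1));
      First := Comma + 1;
   end loop;

   --  Last field (empty if the line ended with a comma).
   Ada.Text_IO.Put_Line (Line (First .. Line'Last));
end Split_UTF8;
```

The same loop written against Wide_Wide_String and Ada.Strings.Wide_Wide_Fixed would look identical apart from the types.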
Your code would not change if libadalang used UTF-8 for identifiers.
BTW, since you are using Ada.Wide_Wide_Text_IO for diagnostics (line 59), I wonder how it would work with non-ASCII-7 characters.
I think it would not without manipulating the console. E.g. this one:
    with Ada.Wide_Wide_Text_IO;
    procedure Test is
    begin
       Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#)); -- a-umlaut (ä)
    end Test;
does not work under either Debian or Windows.
On macOS it works just fine (so long as you compile with -gnatW8).
Very interesting. A compiler flag (-gnatW8, I never used it before) changes the run-time behaviour!
I experimented a bit with it:
    with Ada.Wide_Wide_Text_IO;
    procedure F is
    begin
       Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#));
    end F;
and
    with Ada.Wide_Wide_Text_IO;
    with F;
    procedure Test is
    begin
       Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#));
       F;
    end Test;
Now, compiling f.adb and test.adb like this:
    gcc -c f.adb -gnatW8
    gcc -c test.adb
    gnatmake test.adb
gives Latin-1 encoding,
    gcc -c f.adb
    gcc -c test.adb -gnatW8
    gnatmake test.adb
results in UTF-8 encoding. The documentation suggests that -gnatW8 specified for the main program wins the battle. But I am not sure.
In any case using Wide_Wide_* I/O outside the main program, e.g. in a library, has totally unpredictable behaviour. One more reason to ditch Wide_Wide stuff.
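One way to sidestep the flag dependence, sketched under the assumption that you want UTF-8 on standard output: encode explicitly and write the bytes through the stream interface of Ada.Text_IO. The helper name Put_Line_UTF_8 is hypothetical. I have not checked this on every platform, so treat it as a starting point rather than a guarantee.

```ada
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Put_UTF8 is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Hypothetical helper: the encoding is chosen here, in the code,
   --  rather than by whichever -gnatW switch the main program got.
   procedure Put_Line_UTF_8 (Item : Wide_Wide_String) is
      Stdout : constant Ada.Text_IO.Text_Streams.Stream_Access :=
        Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);
   begin
      --  'Write emits the characters only, without array bounds.
      String'Write (Stdout, Conv.Encode (Item));
      Character'Write (Stdout, ASCII.LF);
   end Put_Line_UTF_8;
begin
   Put_Line_UTF_8 ("" & Wide_Wide_Character'Val (16#00E4#));
end Put_UTF8;
```

Whether the console then displays the character correctly still depends on the terminal being set to UTF-8, but at least the bytes written should no longer depend on how each unit was compiled.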