How to read and manipulate table data

JeremyGrosser · March 16, 2025, 4:34pm

@dmitry-kazakov While this may be true on a theoretical level, that’s not how anything in the standard library works. I wish you would stop misleading newcomers this way. Yes, Wide_Wide_String uses four bytes per character rather than the more compact UTF-8 encoding. The year is 2025. My phone has 8GB of RAM. You won’t run out of memory doing simple text processing. It’s fine.

@JC001 is correct. Ada.Strings.UTF_Encoding.UTF_8_String is a subtype of String, so no conversion is needed: pass your String to the Decode function in Ada.Strings.UTF_Encoding.Wide_Wide_Strings to get a Wide_Wide_String, and use Encode to go back. The package Ada.Strings.Wide_Wide_Fixed contains Wide_Wide_String variants of the same functions as Ada.Strings.Fixed.
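A minimal sketch of that round trip, for illustration only. The procedure name Decode_Demo and the sample text are made up; the non-ASCII character is spelled out as bytes so the example does not depend on the source file encoding.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Fixed;

procedure Decode_Demo is
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  "Käse" in UTF-8: the a-umlaut (U+00E4) is the two bytes C3 A4.
   Raw : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "K" & Character'Val (16#C3#) & Character'Val (16#A4#) & "se";

   --  Decode yields one Wide_Wide_Character per code point.
   Text : constant Wide_Wide_String := UTF.Decode (Raw);

   --  Wide_Wide_Fixed mirrors Ada.Strings.Fixed; find the substring "se".
   Pos : constant Natural := Ada.Strings.Wide_Wide_Fixed.Index (Text, "se");

   --  Encode converts back to UTF-8 bytes.
   Back : constant String := UTF.Encode (Text);
begin
   Ada.Text_IO.Put_Line ("Bytes:" & Natural'Image (Raw'Length));
   Ada.Text_IO.Put_Line ("Code points:" & Natural'Image (Text'Length));
   Ada.Text_IO.Put_Line ("Index of se:" & Natural'Image (Pos));
   Ada.Text_IO.Put_Line ("Round trip equal: " & Boolean'Image (Back = Raw));
end Decode_Demo;
```

The byte count (5) and the code point count (4) differ because the umlaut takes two bytes in UTF-8 but a single Wide_Wide_Character after decoding.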

Wide_Wide_String is an array of Wide_Wide_Character, each of which is represented in memory as a 32-bit value holding one Unicode code point. The Unicode standard only assigns a number to each character or modifier; it does not guarantee anything else about that number. Unicode currently limits code points to 16#10FFFF#, which fits easily in 32 bits, and it is unlikely that larger code points will be standardized in the near future.

I find it tedious to type Wide_Wide_ repeatedly and hunt down the functions I need in the various Ada.Strings.* packages. I recommend writing a small package to wrap the operations you need. This way, you don’t have to type so much, can use the package where you need it, and can experiment with other string representations in the future if desired. I did that over here.
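As a rough sketch of what such a wrapper could look like (this is not the package linked above; the names Text, T, From_UTF_8, To_UTF_8 and Find are hypothetical), here it is as a nested package so the example compiles as a single file:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Fixed;

procedure Text_Demo is

   --  Hypothetical wrapper: callers never spell out Wide_Wide_ themselves.
   package Text is
      subtype T is Wide_Wide_String;
      function From_UTF_8 (S : String) return T;
      function To_UTF_8 (S : T) return String;
      function Find (Source, Pattern : T) return Natural;
   end Text;

   package body Text is
      package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

      function From_UTF_8 (S : String) return T is
      begin
         return UTF.Decode (S);
      end From_UTF_8;

      function To_UTF_8 (S : T) return String is
      begin
         return UTF.Encode (S);
      end To_UTF_8;

      function Find (Source, Pattern : T) return Natural is
      begin
         return Ada.Strings.Wide_Wide_Fixed.Index (Source, Pattern);
      end Find;
   end Text;

   --  "naïve" in UTF-8: the i-diaeresis (U+00EF) is the two bytes C3 AF.
   Raw  : constant String :=
     "na" & Character'Val (16#C3#) & Character'Val (16#AF#) & "ve";
   Cell : constant Text.T := Text.From_UTF_8 (Raw);
begin
   Ada.Text_IO.Put_Line ("Code points:" & Natural'Image (Cell'Length));
   Ada.Text_IO.Put_Line ("Index of ve:" & Natural'Image (Text.Find (Cell, "ve")));
   Ada.Text_IO.Put_Line
     ("Round trip equal: " & Boolean'Image (Text.To_UTF_8 (Cell) = Raw));
end Text_Demo;
```

Because the representation is hidden behind the subtype T, swapping in a different string type later would only touch the wrapper, not its callers.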

simonjwright · March 16, 2025, 4:51pm

If working with libadalang you have to cope with Wide_Wide_*. I didn’t find it that bad; in ada_caser, out of 253 lines in ada_caser-processing.adb, the string Wide_Wide occurred on 10.

dmitry-kazakov · March 16, 2025, 5:46pm

There are no standard library functions for parsing CSV files.

Wasting processor time and memory is up to you. But there are literally no Wide_Wide_Character encoded sources. Whatsoever. Ada’s Wide and Wide_Wide I/O are artefacts of the Ada standard library. Reading a UTF-8 file and recoding it into UCS-4 (Wide_Wide) is stupid because a properly written text processing program remains exactly the same in both UTF-8 and UCS-4.
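To illustrate that last claim with a minimal sketch (not taken from the libraries mentioned in this thread; the procedure name Split_Demo and the sample row are made up): splitting a row on commas works directly on the UTF-8 bytes, because every byte of a multi-byte UTF-8 sequence is 16#80# or above and so can never be mistaken for an ASCII comma. The sketch assumes simple fields with no quoting or embedded commas.

```ada
with Ada.Text_IO;
with Ada.Strings.Fixed;

procedure Split_Demo is
   --  One CSV row in UTF-8: "Zürich,8001,CH" (u-umlaut is bytes C3 BC).
   Row : constant String :=
     "Z" & Character'Val (16#C3#) & Character'Val (16#BC#) & "rich,8001,CH";

   First : Positive := Row'First;  --  start of the current field
   Comma : Natural;                --  position of the next comma, 0 if none
begin
   loop
      --  Searching a slice returns an index in Row's own index range,
      --  and a null slice (after a trailing comma) simply yields 0.
      Comma := Ada.Strings.Fixed.Index (Row (First .. Row'Last), ",");

      if Comma = 0 then
         --  Last field: everything up to the end of the row.
         Ada.Text_IO.Put_Line (Row (First .. Row'Last));
         exit;
      end if;

      Ada.Text_IO.Put_Line (Row (First .. Comma - 1));
      First := Comma + 1;
   end loop;
end Split_Demo;
```

This prints the three fields one per line; how the non-ASCII bytes of the first field appear on the terminal depends on compiler switches and console settings.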

For an example of how CSV parsing is done, see Strings_Edit.UTF8.Categorization_Generator here, which parses the unicode.txt file. The file is in CSV format.

A general parser can be found in Simple Components. It abstracts the source and allows parsing strings, files, and streams using the same code.

Though Wide_String Text_IO files are supported among other types of sources, the parser never uses Wide_Character because the implementation of the source recodes text to UTF-8 as it is read.

dmitry-kazakov · March 16, 2025, 6:10pm

Your code would not change if libadalang used UTF-8 for identifiers.

BTW, since you are using Ada.Wide_Wide_Text_IO for diagnostics (line 59), I wonder how it would work with non-ASCII-7 characters.

I think it would not without manipulating the console. E.g. this one:

with Ada.Wide_Wide_Text_IO;
procedure Test is
begin
   Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#)); -- A-umlaut
end Test;

does not work under either Debian or Windows.

simonjwright · March 16, 2025, 9:56pm

On macOS it works just fine (so long as you compile with -gnatW8).

dmitry-kazakov · March 17, 2025, 8:55am

Very interesting. A compiler flag (-gnatW8, I have never used it before) changes the run-time behaviour!

I experimented a bit with it:

with Ada.Wide_Wide_Text_IO;

procedure F is
begin
   Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#));
end F;

and

with Ada.Wide_Wide_Text_IO;
with F;
procedure Test is
begin
   Ada.Wide_Wide_Text_IO.Put_Line ("" & Wide_Wide_Character'Val (16#00E4#));
   F;
end Test;

Now, compiling f.adb and test.adb like this:

gcc -c f.adb -gnatW8
gcc -c test.adb
gnatmake test.adb

gives Latin-1 encoding,

gcc -c f.adb
gcc -c test.adb  -gnatW8
gnatmake test.adb

results in UTF-8 encoding. The documentation suggests that -gnatW8 specified for the main program wins the battle. But I am not sure.

In any case using Wide_Wide_* I/O outside the main program, e.g. in a library, has totally unpredictable behaviour. One more reason to ditch Wide_Wide stuff.