You are just mixing parts of UTF-8 encoded characters with your comma. ‘à’ is C3 A0 and the comma is 2C. I guess you are outputting C3 2C A0 2C or something like that.
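The byte mixing can be sketched in Python (used here purely to show the octets; the Ada program produces the equivalent bytes):

```python
# Interleaving a comma between the raw bytes of UTF-8 'à' (C3 A0)
# splits the two-byte sequence, producing C3 2C A0 2C.
raw = "à".encode("utf-8")                        # b'\xc3\xa0'
mixed = b"".join(bytes([b]) + b"," for b in raw)
print(mixed.hex())                               # → 'c32ca02c'
# Neither C3 nor A0 is a valid UTF-8 sequence on its own, so a
# UTF-8 terminal renders only the commas.
```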
Correct.
To complete the picture: if you use -gnatW8, the compiler will consider the source as UTF-8 encoded. Since there are only Latin-1 characters in the literal, the code will compile; the code point of a Latin-1 character fits into a single 8-bit Character. If you tried Greek characters instead, it would fail to compile.
So with -gnatW8 Put will convert each Character to UTF-8 and the output will be as expected:
abcdefghijklmnopqrstuvwxyzàâæçèéêëîïôùûüÿ
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,à,â,æ,ç,è,é,ê,ë,î,ï,ô,ù,û,ü,ÿ,
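The point about code points can be checked directly (Python here, just to show the numbers):

```python
# A Latin-1 character's code point fits into a single 8-bit
# Character; a Greek letter's code point does not.
print(hex(ord("à")))   # → 0xe0   (fits in 8 bits)
print(hex(ord("Ε")))   # → 0x395  (too big for one Character,
                       #   so -gnatW8 rejects it in a String literal)
```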
I converted the file to iso-8859-1, recompiled without any flags and got:
abcdefghijklmnopqrstuvwxyz
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,,,,,
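What happens byte-wise can be sketched in Python (an illustration, not the actual toolchain): in ISO-8859-1 the accented letters are single bytes like E0, which are invalid UTF-8, so a UTF-8 terminal drops or replaces them.

```python
# Latin-1 'à' is the single byte E0 -- not a valid UTF-8 sequence,
# which is why the terminal shows empty slots between the commas.
latin = "à".encode("latin-1")
print(latin.hex())                                   # → 'e0'
print(repr(latin.decode("utf-8", errors="ignore")))  # → ''
```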
Is it misguided to complain that this is not OK?
Not without -gnatW8:
with Ada.Text_IO;

procedure Elliniki is
   Gre : constant String := "Ελληνική";
begin
   for C of Gre loop
      Ada.Text_IO.Put (C);
   end loop;
   Ada.Text_IO.New_Line;
   for C of Gre loop
      Ada.Text_IO.Put (C & ',');
   end loop;
end Elliniki;
Result:
$ gnatmake -gnatwae elliniki.adb
gcc -c -gnatwae elliniki.adb
gnatbind -x elliniki.ali
gnatlink elliniki.ali
$ ./elliniki
Ελληνική
,,,,,,,,
Right, without -gnatW8 the program compiles fine. With -gnatW8 you would get a compile error.
But remember, without -gnatW8 you should treat String as a raw byte sequence. Without knowing the actual encoding of the literal, the compiler just treats it as regular Latin-1 (UTF-8 byte sequences are valid Latin-1 sequences). When outputting your string, you are responsible for breaking it at UTF-8 sequence boundaries, not relying on the compiler to do it for you.
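Breaking at sequence boundaries means grouping each leading byte with its continuation bytes (10xxxxxx, i.e. 0x80..0xBF). A minimal sketch in Python (`utf8_chunks` is a hypothetical helper, not part of any library):

```python
def utf8_chunks(raw: bytes):
    """Split a raw byte string at UTF-8 character boundaries."""
    chunks = []
    for b in raw:
        if chunks and 0x80 <= b <= 0xBF:   # continuation byte: 10xxxxxx
            chunks[-1] += bytes([b])
        else:                              # leading byte starts a new char
            chunks.append(bytes([b]))
    return chunks

raw = "Ελληνική".encode("utf-8")           # 8 characters, 16 bytes
chunks = utf8_chunks(raw)
print(len(chunks))                         # → 8
print(b",".join(chunks).decode("utf-8"))   # → Ε,λ,λ,η,ν,ι,κ,ή
```

Inserting the comma between chunks instead of between bytes is exactly what the loop over Character elements fails to do.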
I prefer Dmitry’s method. A good UTF-8 handling library (treating String as a raw byte sequence and doing proper Unicode operations on those bytes) would be a better approach than messing around with -gnatW8 and Wide_String/Wide_Wide_String.
For people who really want Unicode identifiers, there is a workaround: use -gnatif. It may not be exactly the same as -gnatW8: your editor will display the identifiers as proper UTF-8, while the compiler just treats them as Latin-1 byte sequences, and there is no upper/lower case conversion for the upper half of the Latin-1 codes (as a side effect, we get a kind of case-sensitive identifiers for non-ASCII names, if my understanding is correct). But from the viewer’s point of view, the identifier is displayed properly in your editor.
Name : constant String := "Cześć";
I copied this into GPS; when trying to save, GPS complained:
This buffer contains UTF-8 characters which could not be translated to ISO-8859-1.
So I copied and stored this in the standard Win11 text editor, opened the file with GPS as Latin 1. GPS displays:
Name : constant String := "CzeÅÄ";
Compile and run with GPS, output in GPS:
Cześć
Run this in PowerShell, output:
Cze┼ø─ç
Very strange!
This is a correct rendering of the UTF-8 encoded “Cześć” as Latin-1 (ISO-8859-1). GPS shows you the “truth” (octets), but we want to see a “lie” (code points).
The GPS output tab is a Gtk widget that renders UTF-8. So the output is correct (a lie from the octet point of view).
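The octet view can be reproduced directly (Python, just to show the mapping):

```python
# Decoding the UTF-8 octets of "Cześć" as Latin-1: C5 -> Å, C4 -> Ä,
# while 9B and 87 map to invisible C1 control characters --
# hence the editor displays "CzeÅÄ".
raw = "Cześć".encode("utf-8")
print(raw.hex())              # → '437a65c59bc487'
print(raw.decode("latin-1"))  # 'CzeÅ\x9bÄ\x87'
```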
Because PowerShell does not use UTF-8 by default. To set it to UTF-8 do this first:
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
and you will see Cześć again.
If you want to develop applications, use VSS. If you want to train yourself with censorship, continue to use the predefined string types and packages. It is easy.
PS. And VSS will force you to use -gnatW8 if you want to have non-ASCII characters in string literals.
Funny how VSS uses Wide_Wide_String as a “literal transport” under the hood.
VSS lib has been split. Use now vss_text instead.
The language uses the Wide_Wide_String type for the argument of the user-defined function that converts a string literal into an object of Virtual_String.
You can use UTF_8_IO to simplify these steps.
You are confusing terms. Both UTF-8 and UCS-4/UTF-32 are encodings. Wide_Wide_String is a UTF-32 encoded sequence of code points. It is not the same as:
type Code_Point is range 0..16#10FFFF#;
type Code_Point_Array is array (Positive range <>) of Code_Point;
Wide_String was introduced when the matter was not yet settled. It was a mistake. Microsoft made the same mistake and realised later that there was no way it could work, so they switched to UTF-16. But Ada kept on! So it added Wide_Wide_String on top, although not very wholeheartedly: Unbounded_String and Ada.Directories were left behind. Nobody sane would suggest pulling this unfunny wide-wide joke through the whole language. Well, UCS-4 and UTF-32 are the same … so far. But who knows, maybe some of us will live to see Wide_Wide_Wide_String?
Everyone does UTF-8 these days, so most languages consider their strings (or whatever they have) UTF-8 encoded. Newbies get confused looking at the mess of Ada string types, because nobody told them that Ada’s String is exactly the same.
My impression is that there are few people who aren’t confused by this.
Ada’s “string stack” is a perfect combination of complexity, legacy and system-dependent oddities, enough to gatekeep even the most eager beginner like me.
Compared to other languages, Ada’s standard library isn’t too high-level to begin with, but the string-related packages are pure machinery.
It doesn’t help that there is barely any agreement on how to deal with this mess.
Personally I have been following the advice by @JC001 and sticking to ASCII (not even Latin_1).
It just works.
Yes, because Latin-1 is affected by this wide-madness.
Furthermore, there are two distinct cases:
- Unicode identifiers
- Unicode literals
GNAT carefully entangles them, so it is basically impossible to figure out how to have both while keeping literals “normal.” Normal here means like all the world does it, which translated into Ada terms reads:
Latin-1 String must be considered UTF-8 encoded.
Nobody, nowhere, even after half a crate of beer, expects a C char to hold a Cyrillic letter. Yet Ada people hold it to be a good idea: yes, sure, replace Character with Wide_Wide_Character, encode, recode, decode, transcode, and in the end, after adding some cryptic switches, it may work, or maybe not.
In short. Right, keep all ASCII-7 and you will have no problem with Ada strings ever.
This is because GNAT Studio (GPS) has its own setting for the editor’s encoding (not related to the project file or compiler switches). By default it is ISO-8859-1 (not UTF-8!). I’ve written about it in the article. Did you read it?
You may consider using UXStrings lib and associated Text_IO.
You’ll be able to specify the input and output encodings, even line endings, depending on your terminal settings:
procedure Create
  (File   : in out File_Type;
   Mode   : in File_Mode       := Out_File;
   Name   : in UXString        := Null_UXString;
   Scheme : in Encoding_Scheme := Latin_1;
   Ending : Line_Ending        := CRLF_Ending);

procedure Open
  (File   : in out File_Type;
   Mode   : in File_Mode;
   Name   : in UXString;
   Scheme : in Encoding_Scheme := Latin_1;
   Ending : Line_Ending        := CRLF_Ending);