Question about Wide_Wide_String and Unicode code points

I am relatively new to character encodings other than ASCII (I primarily work in areas where ASCII is sufficient), so I thought I would mess around with the various UTF formats and Unicode and see what I could understand. I am also trying to tie my understanding to Ada types in the standard library, so here goes:

I (think) I understand that if I have the set of Unicode code points (16#000000# to 16#10FFFF#), then that translates pretty much directly into a set of Wide_Wide_Characters (and thus a Wide_Wide_String)? So, essentially, code point U+1000 is equivalent to Wide_Wide_Character'Val (16#1000#)?
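For concreteness, here is a minimal sketch of what I mean (U+1000 is just an arbitrary example code point, and the 'Last line is only there to show how far the type extends):

with Ada.Text_IO; use Ada.Text_IO;

procedure Code_Point_Check is
   --  An arbitrary example code point, U+1000 (MYANMAR LETTER KA).
   CP : constant := 16#1000#;
   --  'Val maps the code point to the character value...
   C  : constant Wide_Wide_Character := Wide_Wide_Character'Val (CP);
begin
   --  ...and 'Pos maps it back again.
   Put_Line ("Round trip OK : "
             & Boolean'Image (Wide_Wide_Character'Pos (C) = CP));
   --  The type itself extends well past the Unicode maximum of 16#10_FFFF#.
   Put_Line ("WWC'Last pos  :"
             & Long_Long_Integer'Image
                 (Wide_Wide_Character'Pos (Wide_Wide_Character'Last)));
end Code_Point_Check;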

If so, I noticed the Ada standard says Wide_Wide_Character can represent a much larger set of code points than Unicode specifies:

Ada 2022 RM, package Standard:

-- The declaration of type Wide_Wide_Character is based on the full
-- ISO/IEC 10646:2020 character set. The first 65536 positions have the
-- same contents as type Wide_Character. See 3.5.2.

type Wide_Wide_Character is (nul, soh ... Hex_7FFFFFFE, Hex_7FFFFFFF);

vs. 16#10FFFF# in Unicode

Meaning Wide_Wide_Character covers a much larger range than ISO/IEC 10646:2020 specifies, and is not limited to that range, despite the reference to 10646?

Sorry, trying to wrap my head around this a bit.

Wide_Wide_Character can represent “integer codes” in the range 0 .. 16#7FFF_FFFF#. That is its representation in memory.

Unicode assigns semantics to “integer codes” in the range 0 .. 16#10_FFFF#. Each code point represents a single piece of information. For the ASCII range and English there is a direct relationship between a human-recognized character and a single code point, and thus a single Wide_Wide_Character in the Wide_Wide_String. That is not the case for almost all other languages: a single human-recognized character can be a sequence of code points, and thus a slice of the Wide_Wide_String.
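For example (a minimal, untested sketch; “é” is just a convenient illustration, written once as the precomposed code point U+00E9 and once as U+0065 followed by the combining accent U+0301):

with Ada.Text_IO; use Ada.Text_IO;

procedure Grapheme_Demo is
   --  Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
   Precomposed : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#00E9#));

   --  Decomposed form: U+0065 LATIN SMALL LETTER E
   --  followed by U+0301 COMBINING ACUTE ACCENT.
   Decomposed : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#0065#),
      Wide_Wide_Character'Val (16#0301#));
begin
   --  Both render as the same human-recognized character,
   --  yet the lengths differ and the strings compare unequal.
   Put_Line ("Precomposed length:" & Integer'Image (Precomposed'Length));  --  1
   Put_Line ("Decomposed length :" & Integer'Image (Decomposed'Length));   --  2
   Put_Line ("Equal?             " & Boolean'Image (Precomposed = Decomposed));
end Grapheme_Demo;

This is why anything that counts or splits “characters” needs to work on grapheme clusters, not on individual Wide_Wide_Character values.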

So, I suggest using VSS instead of Wide_Wide_*. VSS hides all of this complexity and provides an API to manipulate text in whatever units the application needs.

I see it the same way. They’re not the same, but in practice they are (if your only concern is Unicode). The UTF-32 encoding is just the plain sequence of code points, stored as-is in a Wide_Wide_String, and Unicode code points match the Wide_Wide_Character'Pos values.

But, for example, UTF-16 is no longer the same as a Wide_String the moment a code point appears that does not fit in a single 16-bit code unit (it becomes a surrogate pair). And obviously the same happens for String and UTF-8.
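A small sketch of that difference, using the standard Ada.Strings.UTF_Encoding.Wide_Wide_Strings package (U+1F600 is just a convenient example of a code point above U+FFFF):

with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO; use Ada.Text_IO;

procedure Encoding_Demo is
   package WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  One code point above U+FFFF: U+1F600 (an emoji).
   Smiley : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1F600#));

   --  UTF-8 lives in a String, UTF-16 in a Wide_String.
   As_UTF_8  : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     WWS.Encode (Smiley);
   As_UTF_16 : constant Ada.Strings.UTF_Encoding.UTF_16_Wide_String :=
     WWS.Encode (Smiley);
begin
   Put_Line ("Code points      :" & Integer'Image (Smiley'Length));     --  1
   Put_Line ("UTF-8 code units :" & Integer'Image (As_UTF_8'Length));   --  4
   Put_Line ("UTF-16 code units:" & Integer'Image (As_UTF_16'Length));  --  2 (surrogate pair)
end Encoding_Demo;

One Wide_Wide_Character, but four String elements in UTF-8 and two Wide_String elements in UTF-16, so indexing those strings no longer indexes code points.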

There was some RFC with a review of the Ada types from a Unicode guy (Leroy Robin?), but I fail to locate it.

I suggest you read a very good blog post about how Unicode works.

To manage Unicode strings in Ada, there is also the UXStrings library.

Ada’s Unicode support is really half arsed and should be replaced.

Thanks everyone for the responses!

Ok thanks! That reassures me. I appreciate the response!

Yep yep, I am definitely keeping that in mind. That’s why I was specifically relating code points vs. WWC. I’m not at the point where I am worrying about human-recognizable characters yet though. Still just learning the basics of encoding, but ty for the info.

I’m currently just working through making my own in an attempt to learn it better, but I’ll keep VSS in mind.

Thanks for the links! I’ll spend some time on that first one for sure. I am actually very aware of UXStrings (I use Gnoga a lot and it uses UXStrings). I was just trying to learn more, and I generally learn better when I make things from scratch.

Well hopefully someday in the future they will improve it!

If interested, keep an eye on “Some kind of Univ_String to provide better approach to support full Unicode?” (Issue #40, Ada-Rapporteur-Group/User-Community-Input on GitHub).

Edit to add: this definitely looks like one of those situations where a new standard to fix them all just leaves you with standards := @ + 1;