Unicode strings

With Ada 2022’s support for user-defined type literals, Unicode strings should be easier to work with. I worry that everyone is going to implement their own slightly different string type and we’ll end up with a bunch of incompatible libraries in Alire.

It looks like AdaCore is building VSS (derived from Matreshka) while also maintaining XML/Ada, which has its own Unicode package and types.

As a community I think we should define a common interface, or risk spending the rest of our lives converting string representations.

The Ada Rapporteur Group, which is responsible for the evolution of the Ada standard, has discussed some notion of a “Universal String” type, but there is no specific proposal yet (AdaCore’s VSS is a clear potential model).

While on the topic of Ada evolution, if you want to suggest enhancements to the Ada standard, please visit https://arg.adaic.org, and go to the Community Input page there.

I agree with Jeremy. We should have a standard way of managing Unicode strings.

“Unicode” is too general of a term. In particular, I want a standard and easy way of dealing with UTF-8 strings.

Unicode is the character set, UTF-8 is just one possible encoding of that set. These are related, but separate things.

Wide_Wide_String already represents an array of Wide_Wide_Characters, which map to the UCS-4 (UTF-32) code space. Currently, all Unicode characters can be represented in UTF-32, but that may not be the case in the future.

It’s possible that adding the new literal aspects to Wide_Wide_String would be good enough, but I think a Vector of Wide_Wide_Character is a more useful representation for mutable strings.
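
To make the comparison concrete, here is a minimal sketch of a mutable string built on a Vector of Wide_Wide_Character that also accepts string literals through the Ada 2022 String_Literal aspect. The names Mutable_Strings, Mutable_String and From_Literal are made up for illustration and do not come from any existing library; the code assumes a compiler with Ada 2022 support (the pragma Ada_2022 line is GNAT-specific).

```ada
pragma Ada_2022;  --  GNAT-specific way to enable Ada 2022 features

with Ada.Containers.Vectors;
with Ada.Text_IO;

procedure Literal_Sketch is

   --  Hypothetical package, for illustration only.
   package Mutable_Strings is
      type Mutable_String is private
        with String_Literal => From_Literal;

      function From_Literal
        (Source : Wide_Wide_String) return Mutable_String;

      procedure Append
        (S : in out Mutable_String; C : Wide_Wide_Character);

      function Length (S : Mutable_String) return Natural;
   private
      package Character_Vectors is new Ada.Containers.Vectors
        (Index_Type => Positive, Element_Type => Wide_Wide_Character);

      type Mutable_String is record
         Data : Character_Vectors.Vector;
      end record;
   end Mutable_Strings;

   package body Mutable_Strings is
      function From_Literal
        (Source : Wide_Wide_String) return Mutable_String
      is
         Result : Mutable_String;
      begin
         --  Copy each code point of the literal into the vector.
         for C of Source loop
            Result.Data.Append (C);
         end loop;
         return Result;
      end From_Literal;

      procedure Append
        (S : in out Mutable_String; C : Wide_Wide_Character) is
      begin
         S.Data.Append (C);
      end Append;

      function Length (S : Mutable_String) return Natural is
      begin
         return Natural (S.Data.Length);
      end Length;
   end Mutable_Strings;

   --  The literal is converted by calling From_Literal.
   Greeting : Mutable_Strings.Mutable_String := "Hello";
begin
   Mutable_Strings.Append (Greeting, '!');
   Ada.Text_IO.Put_Line
     ("Length:" & Natural'Image (Mutable_Strings.Length (Greeting)));
end Literal_Sketch;
```

The point of the sketch is only that the literal aspect and the storage choice are independent: the same aspect could sit on a type that stores UTF-8 internally.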

“Unicode string” doesn’t tell me anything useful when reading a program.

Usually what happens is the system puts the encoding as part of the type system (U8String, U16String, etc.), so you can draw a line in the sand with the type system to ensure strings have had their encoding checked and then subsystems work on usually one form.

I very much disagree with using UTF-32 for manipulation since it’s very inefficient and would often require conversions to and from, since many systems use UTF-8 encoding.

I agree that each encoding should be a separate type, but they should be converted to an abstract Unicode type for manipulation.

Let’s say you want to split a string on whitespace. There are at least 25 different whitespace characters in Unicode, encoded as 1 … 3 bytes in UTF-8. If the string is stored as UTF-8, then every character needs to be decoded on every call to Split. If the string is stored as UTF-32, then Split simply needs to scan an array. It’s a time/memory tradeoff.
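
As a rough illustration of the UTF-32 side of that tradeoff, the sketch below counts whitespace-separated words by scanning a Wide_Wide_String one element at a time. Is_White_Space and Word_Count are hypothetical helpers; the classification relies on Ada.Wide_Wide_Characters.Handling, and how closely it matches Unicode's own whitespace list depends on the character tables shipped with the compiler.

```ada
with Ada.Text_IO;
with Ada.Wide_Wide_Characters.Handling;

procedure Count_Words is
   use Ada.Wide_Wide_Characters.Handling;

   Tab : constant Wide_Wide_Character := Wide_Wide_Character'Val (9);

   --  Hypothetical helper: space separators, line terminators and tab.
   function Is_White_Space (C : Wide_Wide_Character) return Boolean is
     (Is_Space (C) or else Is_Line_Terminator (C) or else C = Tab);

   --  Every element is one code point, so no decoding is needed here.
   function Word_Count (S : Wide_Wide_String) return Natural is
      Count   : Natural := 0;
      In_Word : Boolean := False;
   begin
      for C of S loop
         if Is_White_Space (C) then
            In_Word := False;
         elsif not In_Word then
            In_Word := True;
            Count   := Count + 1;
         end if;
      end loop;
      return Count;
   end Word_Count;

   --  U+3000 (ideographic space) takes 3 bytes in UTF-8, 1 element here.
   Sample : constant Wide_Wide_String :=
     "one" & Wide_Wide_Character'Val (16#3000#) & "two" & Tab & "three";
begin
   Ada.Text_IO.Put_Line
     ("Words:" & Natural'Image (Word_Count (Sample)));
end Count_Words;
```

A UTF-8 based version would need the same loop plus a decoding step for each character, which is where the extra time goes.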

I imagine a spec that defines functions like

function To_Unicode (X : UTF_8_String) return Unicode;

Would have two implementations: One that optimizes for memory size by storing the UTF-8 representation internally and another that converts to UTF-32 for more efficient scanning and manipulation. I think both approaches are valid for different applications, but they should implement the same spec.
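
A minimal sketch of what such a spec could look like, assuming a private type so that the representation stays hidden. Everything apart from To_Unicode and UTF_8_String is invented for illustration (Unicode_Strings, To_UTF_8, Length), and only the UTF-32 variant is shown; a memory-optimized variant would keep the same visible part and store the UTF-8 bytes in the private part instead.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Unbounded;
with Ada.Text_IO;

procedure Unicode_Spec_Sketch is
   use Ada.Strings.UTF_Encoding;

   package WWU renames Ada.Strings.Wide_Wide_Unbounded;

   --  Hypothetical spec: clients only see the visible part.
   package Unicode_Strings is
      type Unicode is private;

      function To_Unicode (X : UTF_8_String) return Unicode;
      function To_UTF_8 (X : Unicode) return UTF_8_String;
      function Length (X : Unicode) return Natural;
   private
      --  This variant stores UTF-32 for cheap scanning.
      type Unicode is record
         Data : WWU.Unbounded_Wide_Wide_String;
      end record;
   end Unicode_Strings;

   package body Unicode_Strings is
      function To_Unicode (X : UTF_8_String) return Unicode is
         Decoded : constant Wide_Wide_String :=
           Wide_Wide_Strings.Decode (X);
      begin
         return (Data => WWU.To_Unbounded_Wide_Wide_String (Decoded));
      end To_Unicode;

      function To_UTF_8 (X : Unicode) return UTF_8_String is
      begin
         return Wide_Wide_Strings.Encode
           (WWU.To_Wide_Wide_String (X.Data));
      end To_UTF_8;

      function Length (X : Unicode) return Natural is
      begin
         return WWU.Length (X.Data);
      end Length;
   end Unicode_Strings;

   --  "café" written as raw UTF-8 bytes (C3 A9 encodes U+00E9).
   Cafe : constant Unicode_Strings.Unicode :=
     Unicode_Strings.To_Unicode
       ("caf" & Character'Val (16#C3#) & Character'Val (16#A9#));
begin
   Ada.Text_IO.Put_Line
     ("Characters:" & Natural'Image (Unicode_Strings.Length (Cafe)));
   Ada.Text_IO.Put_Line
     ("UTF-8 bytes:"
      & Natural'Image (Unicode_Strings.To_UTF_8 (Cafe)'Length));
end Unicode_Spec_Sketch;
```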

A type like UTF_8_String plus a Dynamic_Predicate (maybe called Valid_UTF_8_String) would be a good addition, alongside an array of subprograms that operate on individual Unicode characters. However, the complexity of implementation, and the likelihood of mistakes ending up in the language specification’s description of Unicode, concern me a bit. A related issue is that it’s very hard to parse the necessary metadata from Unicode’s official metadata releases (at least as far as I can tell: a number of C projects attempted to do the work themselves and decided to go for things like pango to reduce the burden).
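
For what it’s worth, the predicate idea can be sketched with what the standard library already has. Is_Valid_UTF_8 below is a hypothetical helper that simply tries to decode the string and treats Encoding_Error as "invalid"; how strict that check is depends on the runtime’s Decode. The example uses membership tests because those evaluate the predicate regardless of the assertion policy in effect.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Valid_UTF_8_Sketch is
   use Ada.Strings.UTF_Encoding;

   --  Hypothetical helper: valid means "Decode does not raise".
   function Is_Valid_UTF_8 (S : String) return Boolean is
   begin
      declare
         Ignored : constant Wide_Wide_String :=
           Wide_Wide_Strings.Decode (S);
      begin
         return True;
      end;
   exception
      when Encoding_Error =>
         return False;
   end Is_Valid_UTF_8;

   subtype Valid_UTF_8_String is UTF_8_String
     with Dynamic_Predicate => Is_Valid_UTF_8 (Valid_UTF_8_String);

   --  "café" as UTF-8 bytes, and a string ending in a stray 16#FF# byte.
   Good : constant String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);
   Bad  : constant String := "caf" & Character'Val (16#FF#);
begin
   Ada.Text_IO.Put_Line
     ("Good is valid: " & Boolean'Image (Good in Valid_UTF_8_String));
   Ada.Text_IO.Put_Line
     ("Bad is valid: " & Boolean'Image (Bad in Valid_UTF_8_String));
end Valid_UTF_8_Sketch;
```

Decoding the whole string on every check is expensive, so a real library would probably validate once at the boundary and carry the result in the type.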

Edit: To what degree should Ada provide and store information about Unicode characters, for example? I think it is clear that it wouldn’t really be practical to store an exhaustive list of characters, since they are updated once annually (or thereabouts). However, would it be reasonable to contain information like relative glyph width or category? Such functions might end up being better served by existing libraries than the Standard Library, but I’m sure there are compelling reasons to include it in the stdlib as well.

The UTF_8_String type is already defined in Ada.Strings.UTF_Encoding, but that package only supports converting Wide_Wide_String to/from the various encodings.
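
A short sketch of what that package gives you today: a round trip between Wide_Wide_String and UTF_8_String using Encode and Decode from Ada.Strings.UTF_Encoding.Wide_Wide_Strings. The sample text is spelled with explicit code points so the result does not depend on the source file encoding.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF_8_Round_Trip is
   use Ada.Strings.UTF_Encoding;

   --  "naïve": U+00EF needs two bytes in UTF-8.
   Text : constant Wide_Wide_String :=
     "na" & Wide_Wide_Character'Val (16#EF#) & "ve";

   Encoded : constant UTF_8_String :=
     Wide_Wide_Strings.Encode (Text);
   Decoded : constant Wide_Wide_String :=
     Wide_Wide_Strings.Decode (Encoded);
begin
   --  5 code points become 6 bytes once encoded.
   Ada.Text_IO.Put_Line
     ("Code points:" & Integer'Image (Text'Length));
   Ada.Text_IO.Put_Line
     ("UTF-8 bytes:" & Integer'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("Round trip OK: " & Boolean'Image (Decoded = Text));
end UTF_8_Round_Trip;
```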

For metadata and string processing, I agree that this shouldn’t be in the standard library, but handled by a library available from Alire. I’ve looked into the UCD (Unicode Character Database) dataset a bit and to my eyes it looks like something that belongs in a database like SQLite, but that may be too large of a dependency for many applications.

I think an easy first step would be to come up with an interface for Unicode string processing, then implement it as a binding around libunistring. Once we’ve got a reasonable API ironed out, we can start reimplementing those functions in Ada to remove the dependency on the C library.

The best way forward for the community is to migrate to VSS. Right now it provides many Unicode features and isolates the application from any particular external encoding. Its API is designed to provide reasonable performance, and the implementation can store data in any encoding, which minimizes the overhead of converting between external and internal representations when that matters.

Reinventing a new library on top of the String type will just add a new set of places for human mistakes when processing text. :frowning: That will introduce hard-to-detect security vulnerabilities…

You can find the interface for Unicode string processing in vss-strings.ads and other files in the text/ directory. Currently, Append, Insert, Delete, Replace, Split, Starts_With, To_Upper, and iteration by character, grapheme cluster, lines, etc. are implemented. There is no direct character indexing; cursors are used instead. A cursor knows the character index and the UTF-8 and UTF-16 offsets. While the cursor is of a limited type, you can take a marker from it (of a non-limited type) to store it in a component or container.

Also, it is very important to have good IO support. String processing is almost useless without input-output.

I started messing with a UTF-8 encoded Unicode lib (GitHub - Lucretia/uca: Unicode Components for Ada) a while back. The idea was for it to be scalable, which if done by a compiler could be handled with pragma restrictions. By scalable I mean only dragging in what you need, so you could have Unicode strings in embedded, and in games you can bring in the character db (or parts of it) to enable localisation, etc.

I never liked Matreshka’s implementation as it’s really complicated and uses about 3 layers, when you just don’t need that.

Static strings → dynamic strings using the static strings
iterators for code points (forward is done) → grapheme clusters (essentially what are now characters in unicode, i.e. collection of code points to define 1 char) → words → bidi → etc.
IO
Images (now doable in Ada 2022)

I did build a UTF-8 stream decoder as part of a general Unicode library (in progress); it is written in SPARK and verified. Pretty useful for I/O. It’s an AURA (alt package manager) package.

I was glad to find this thread! Does anyone know how VSS compares to UXStrings? If I plan on writing Gnoga applications, is that the route I should be going?

I found some discussion on it about 2 years ago[1], though it still looks actively maintained as per a recent release[2].

UXStrings looks like it’s Ada 2012 and VSS I believe takes advantage of Ada 2022, so there’s that.

[1] https://groups.google.com/g/comp.lang.ada/c/_LzfSuDXWgQ/m/nglZyrdXAQAJ
[2] [ANN] Release of UXStrings 5.0

Ada already has support for Unicode via Wide_Wide_[Character | String] and Ada.Strings.Wide_Wide_Unbounded, although they use 32 bits/character, and Unicode only needs 21. These are good enough for general use. Applications with specific requirements might need to define their own Unicode_Character and Unicode_String types.

Much of the discussion seems to be about encodings of Unicode, such as UTF-8. But a general S/W-engineering principle is that encodings are external, and one only uses decoded data internally, converting on input and output. UTF-8 is a sequence of bytes, which is then interpreted into a [sequence of] Wide_Wide_Character. All that seems to be needed is a UTF_8_IO pkg taking and returning Wide_Wide_[Character | String].
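
A minimal sketch of such a pkg, layered on Ada.Text_IO and Ada.Strings.UTF_Encoding. The package name comes from the paragraph above; the two subprograms and their profiles are assumptions for illustration. It also assumes that Ada.Text_IO passes Character values through as raw bytes, which is common but implementation-dependent.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF_8_IO_Sketch is

   --  Hypothetical package: decoded text inside, UTF-8 at the boundary.
   package UTF_8_IO is
      procedure Put_Line (Item : Wide_Wide_String);
      function Get_Line return Wide_Wide_String;
   end UTF_8_IO;

   package body UTF_8_IO is
      package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

      procedure Put_Line (Item : Wide_Wide_String) is
         Bytes : constant String := Conv.Encode (Item);
      begin
         Ada.Text_IO.Put_Line (Bytes);
      end Put_Line;

      --  Raises Encoding_Error if the input line is not valid UTF-8.
      function Get_Line return Wide_Wide_String is
         Bytes : constant String := Ada.Text_IO.Get_Line;
      begin
         return Conv.Decode (Bytes);
      end Get_Line;
   end UTF_8_IO;

begin
   --  "Grün" with U+00FC given as an explicit code point.
   UTF_8_IO.Put_Line ("Gr" & Wide_Wide_Character'Val (16#FC#) & "n");
end UTF_8_IO_Sketch;
```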

UXStrings uses Ada 2022. It’s required for the string literal aspect I believe.

I don’t know how they compare. I’ve only used UXStrings so far but not extensively.
