Safety of returning an array with Address elsewhere

jcmoyer · December 11, 2024, 10:02pm

I was surprised to find out that GNAT allows you to return an array whose storage is elsewhere, and it seems to work as you’d expect:

function Read_All_Text (Filename : String) return String is
   --  Assume the following line reads the file into a new'd
   --  Stream_Element_Array and we never free it.
   Buffer : Stream_Element_Array_Ptr :=
     Read_All_Bytes (Filename);

   Result : String (1 .. Buffer'Length)
     with Address => Buffer.all'Address;
begin
   return Result;
end Read_All_Text;

Curious what you all think about this piece of code. To me it feels very icky because arrays don’t normally have reference semantics. Is it even well-defined? If so, can it blow up in my face in any way?

For context I’m dealing with files that are tens to hundreds of megabytes. Copying the buffer is undesirable and returning it as a String normally (without the Address clause) overflows the stack. Specifying the address in this way conveniently sidesteps both issues…

I know about memory mapped IO and it may be better to use here, but I’m curious if there are any other alternatives. One other thing I tried was an unchecked conversion from access Stream_Element_Array to access String but the bounds don’t convert correctly.

dmitry-kazakov · December 11, 2024, 10:31pm

Your code copies Result. Note also that the correct use is

   for Result'Address use Buffer.all'Address;
   pragma Import (Ada, Result);

Import Ada is necessary to prevent Result being initialized.

As for the way of making I/O, the canonical method is:

procedure Read_All_Text
          (  File   : File_Type;
             Buffer : in out Stream_Element_Array;
             Last   : out Stream_Element_Offset
          );

Last is the index of the last array element written. When nothing is read, Last = Buffer'First - 1. When the whole buffer is filled, Last = Buffer'Last and you probably need to read another chunk later.
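
As a rough sketch of that pattern, the loop below uses the standard Ada.Streams.Stream_IO.Read, which has the same Buffer/Last shape. The file name and the 4096-element chunk size are placeholders, not something taken from this thread.

```ada
with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Text_IO;

procedure Chunked_Read is
   package SIO renames Ada.Streams.Stream_IO;

   File   : SIO.File_Type;
   Buffer : Stream_Element_Array (1 .. 4096);  --  placeholder chunk size
   Last   : Stream_Element_Offset;
   Total  : Stream_Element_Count := 0;
begin
   SIO.Open (File, SIO.In_File, "input.dat");  --  placeholder file name

   loop
      SIO.Read (File, Buffer, Last);
      --  Buffer (Buffer'First .. Last) holds the elements read by this call.
      Total := Total + (Last - Buffer'First + 1);
      --  A short read means the end of the file was reached.
      exit when Last < Buffer'Last;
   end loop;

   SIO.Close (File);
   Ada.Text_IO.Put_Line
     ("Elements read:" & Stream_Element_Count'Image (Total));
end Chunked_Read;
```

If the file length is an exact multiple of the chunk size, the final call reads nothing and returns Last = Buffer'First - 1, which also ends the loop.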

Do not convert access types. Either convert arrays or, if you are sure, map them:

subtype Actual_String is String (1..Buffer'Length); -- Flat array
String_View : Actual_String;
for String_View'Address use Buffer.all'Address;
pragma Import (Ada, String_View);

Array address is the address of its first element.

jcmoyer · December 11, 2024, 11:32pm

Can you explain where it’s being copied? If it were being copied on the stack back to the caller I would expect it to still overflow the stack, but 100mb+ files that were overflowing the stack before no longer overflow it.

Compiling with -fstack-usage (snippet) reports for a 100mb file:

main.adb:11:4:read_all_bytes	160	static
main.adb:7:1:example	544	static
main.adb:24:4:read_all_text	160	static
<built-in>:example	64	static

Thank you for all the advice.

jere · December 12, 2024, 12:09am

When you return an object, it is generally copied so the return Result; line generates a copy to pass back to the outside world.

Consider if you try something similar with a limited type which is forbidden to be copied:

    type LString is limited record
        Value : String(1..11);
    end record;
    
    function Read_All_Text (Filename : String) return LString is
       --  Buffer stands in for the heap buffer of the original example;
       --  Result overlays it via the Address aspect.
       Buffer : LString := (Value => "Hello World");
       Result : LString
         with Address => Buffer'Address;
    begin
       return Result; -- Line 16
    end Read_All_Text;

You get the following error:

jdoodle.adb:16:15: error: (Ada 2005) cannot copy object of a limited type (RM-2005 6.5(5.5/2))
jdoodle.adb:16:15: error: return by reference not permitted in Ada 2005
jdoodle.adb:16:15: error: consider switching to return of access type

When you return a result like that, the compiler treats it as a copy. There are rules for build-in-place, but they require the object to live outside the current stack frame (on the caller’s stack, on the secondary stack with GNAT, on the heap with Janus/Ada, etc.). A variable declared in the current frame cannot be built in place, so the compiler is forced to copy it instead.
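
For contrast, here is a minimal sketch of a form that is accepted for the same limited type: an extended return, where the result object is declared by the return statement itself rather than as a local variable. The names Make and Limited_Demo are made up for illustration.

```ada
with Ada.Text_IO;

procedure Limited_Demo is
   type LString is limited record
      Value : String (1 .. 11);
   end record;

   function Make return LString is
   begin
      --  Result is the return object itself, so no copy is needed.
      return Result : LString do
         Result.Value := "Hello World";
      end return;
   end Make;

   --  A limited object may be initialized by a function call.
   Item : constant LString := Make;
begin
   Ada.Text_IO.Put_Line (Item.Value);
end Limited_Demo;
```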

stevelitt · December 12, 2024, 12:22am

In what cases would you want to use limited? When would it be an advantage, and what would the advantage be?

jere · December 12, 2024, 12:27am

I was more showing the limited type so the user could see the compiler is trying to copy at that line.

Some (but not all) situations where you might consider limited types as an option. They are useful when working with:

Tasks and synchronized objects
Composite types with access fields that you don’t want copied at all
Wrapper types for hardware devices that can’t be copied in a practical way, like a serial port
Safer self-referential types using the Rosen Technique

Additionally limited types have to be built in place, so they can sometimes lead to faster / more efficient code. That said limited types can be really tough to work with. There’s no real language supplied container support for them and since they cannot be copied, they require more fancy programming to work with in large quantities.

jcmoyer · December 12, 2024, 1:39am

This is a good example, thanks.

I did some more experiments and observed in a debugger a malloc for 100mb, followed by an fread for 100mb, followed by a memcpy for 100mb. So you guys are definitely right about there being a copy.

Then I stumbled across System.Secondary_Stack.SS_Info, and if I use it to print the secondary stack size it grows by 100mb after reading the file. Mystery solved: the original snippet allocates 100mb on the heap and 100mb on the secondary stack. Under very specific circumstances, it uses the primary stack instead and overflows it (I could only reproduce it with -O0 and no Address clause though). I will have to rethink this API.

jere · December 12, 2024, 1:58am

Usually the heap can handle things that size. You might consider either an Unbounded_String or an Indefinite_Holder using String as the element type and return that type. Your other option is to see if you can adjust the allowed stack size to be much larger (hit or miss if that is practically possible).

Indefinite_Holders: The Generic Package Containers.Indefinite_Holders
Unbounded_Strings: Unbounded-Length String Handling

If interested, this has some info on the primary and secondary stacks: 3. The Primary and Secondary Stacks — GNAT User's Guide Supplement for Cross Platforms 26.0w documentation

OneWingedShark · December 12, 2024, 2:30am

As stated, you are actually copying Result… but, if you want to do something similar-in-intent, Ada has a construct called “extended return” which allows you to set up a result, then perform operations:

-- We are passing in a string, which we assume is one word and all lowercase.
Function To_Title_Case(Input : String) return String is
Begin
  -- First, we copy the input...
  Return Result : String := Input do
     if Result'Length not in Positive then
       return; -- Do nothing on the empty-string.
     else
       declare
          -- We can use RENAMES to identify a single element.
          Initial : Character renames Result( Result'First );
          -- And we can limit scope-visibility to the inner block.
          Use Ada.Characters.Handling;
       begin
          -- Operating on the renamed object, operates on the object.
          Initial:= To_Upper( Initial );
       end;
     end if;
  End return;
End To_Title_Case;

A little bit “overengineered”, but I thought that you should know about things like renames and using use on inner blocks.

OneWingedShark · December 12, 2024, 2:55am

Limited is VERY useful for things like hardware-interfacing: you can’t make a new actual hardware-clock, so copying on the hardware-clock interfacing-object doesn’t make sense.

Limited is good for control, as well: by forbidding assignment and giving the public view unknown discriminants, you can force clients to obtain an object through a function you provide:

Package Example is
  Type Storage_Access_ID(<>) is limited private;
  Function Get_ID return Storage_Access_ID;
  No_IDs_Left : Exception;
Private
   Type Index is range 1..16;
   Keyset : Array(Index) of Boolean:= (Others => True);
   Type Storage_Access_ID is limited record
      Key : Index;
   end record;
End Example;

Package body Example is
  Function Get_ID return Storage_Access_ID is
  Begin
    For Value in Keyset'Range loop
      if Keyset(Value) then
        -- The full view has no discriminants, so initialize with an aggregate.
        Return Result : Storage_Access_ID := ( Key => Value ) do
          Keyset(Value):= False;
        End return;
      end if;
    End loop;

    Raise No_IDs_Left;
  End Get_ID;
End Example;

Here we have a model of a simple access control system: the only way to obtain a key is to call Get_ID, which removes the first available key and flags it as taken, throwing an exception if there is none to give.

jcmoyer · December 12, 2024, 3:21am

Ideally I would like to avoid copying the original byte buffer and return a view of it as an array of characters. I tried Indefinite_Holders but it allocated even more (2*100M from the secondary stack), and Unbounded_Strings would also need a copy. Is there any way to do this? The only way I can think of is to do the String overlay in the caller of Read_All_Text, but that has poor ergonomics.

jere · December 12, 2024, 3:37am

If you can hold the original buffer as an aliased object, you can return an anonymous access to the buffer, but you are limited to where that anonymous access can be stored (it wants to protect against dangling).

Alternatively, you can use a reference counted access type. Those don’t make copies of the data, they just make copies of the reference and pass ownership around as needed, using counts on the references to know when to deallocate the data. The GNATCOLL library has a ref counted access type: 18. Refcount: Reference counting — GNATColl 22.0w documentation

Side question: when you were using the indefinite holder, how were you trying to reference the data in it? If you try to pass out the holder directly it’ll make a copy, but instead you can use the Reference function to get a reference to the string.

jcmoyer · December 12, 2024, 4:01am

What is the incantation to convert type Stream_Element_Array_Ptr is access all Stream_Element_Array; to access String? I tried unchecked conversion but it doesn’t preserve the bounds.

My bad, I used .Element. I couldn’t figure out how to get the length of the string without it, because Holder.Reference'Length is ambiguous.

jere · December 12, 2024, 4:23am

I’m not honestly sure. It’s really tricky to do correctly. Hopefully some other folks have some better input on this.

I feel like, instead of creating a Stream_Element_Array buffer, you could read the file directly into an aliased String or a heap-allocated String using stream operations; then you don’t have to convert. You can then pass around a not null access constant String or a not null access String to get views (you’ll need to use .all to dereference the access value).

You can qualify it to remove the ambiguity:
String'(Holder.Reference)'Length

I generally make a wrapper function for things like this:

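-- Constant_Reference is used because Holder is an "in" parameter here;
-- Reference needs a variable.
function Length(Holder : Holders.Holder) return Natural 
   is (String'(Holder.Constant_Reference)'Length);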

Then just
Length(Holder) gets the length

You can also locally use renames:
View : String renames Holder.Reference; after which you can use View'Length

Just wanted to touch on this. Whenever you see secondary stack usage, that’s usually from functions that return unconstrained types like String (which is a copy operation). Indefinite holders create the string on the heap, so any secondary stack usage comes from calls like Element (which you mentioned above). The holder itself doesn’t directly use the secondary stack. Hopefully that makes sense. I’m not the best explainer.
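
To make the Element versus Reference difference concrete, here is a small sketch. The package name String_Holders and the object name Box are made up for illustration.

```ada
with Ada.Containers.Indefinite_Holders;
with Ada.Text_IO;

procedure Holder_Demo is
   package String_Holders is
     new Ada.Containers.Indefinite_Holders (String);

   Box : String_Holders.Holder := String_Holders.To_Holder ("Hello World");

   --  Element returns a copy of the stored string.
   Copy : constant String := Box.Element;

   --  Renaming the reference gives a view of the stored string itself.
   View : String renames Box.Reference;
begin
   --  Writing through the view changes the string held in Box.
   View (View'First) := 'J';

   Ada.Text_IO.Put_Line (Copy);         --  the earlier copy, unchanged
   Ada.Text_IO.Put_Line (Box.Element);  --  a fresh copy that shows the edit
   Ada.Text_IO.Put_Line (Natural'Image (View'Length));
end Holder_Demo;
```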

jere · December 12, 2024, 5:02am

I wanted to come back to this. When you read the file in, you can read it directly into heap allocated string memory using Streams. See the following example:

with Ada.Text_IO;
with Ada.Streams.Stream_IO;
with Ada.Containers.Indefinite_Holders;

procedure Main is

   type String_Ptr is access String;

   function Read_File(Filename : String) return String_Ptr is
      use Ada.Streams.Stream_IO;
      File : File_Type;
   begin

      Open(File, In_File, Filename);

      -- allocate heap space based on file size
      return Result : String_Ptr := new String(1..Natural(Size(File))) do 

         -- Read file into heap string
         String'Read(Stream(File), Result.all);
         Close(File);

      end return;

   end Read_File;

   File_As_String_Ptr : String_Ptr := Read_File("src/main.adb");

begin
   Ada.Text_IO.Put(File_As_String_Ptr.all);
end Main;

Then you don’t need to convert. I don’t have anything bigger than 104k that I can easily test, but this should avoid a lot of the secondary stack and stack in general. But you do need to keep track of that string pointer and deallocate it when you are finished with it.
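
For the deallocation part, a minimal sketch using the standard Ada.Unchecked_Deallocation generic. The literal string is only a stand-in for file contents.

```ada
with Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure Free_Demo is
   type String_Ptr is access String;

   procedure Free is new Ada.Unchecked_Deallocation (String, String_Ptr);

   Text : String_Ptr := new String'("stand-in for file contents");
begin
   Ada.Text_IO.Put_Line (Text.all);

   --  Releases the heap string and sets Text to null.
   Free (Text);
   pragma Assert (Text = null);
end Free_Demo;
```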

ebriot · December 12, 2024, 7:37am

The natural approach I would use here to avoid copying is either a generic, or passing a callback as a parameter. I’ll show an example of the latter:

procedure Read_File
  (Filename : String;
   Process  : not null access procedure (Content : String))
is
   Buffer : Stream_Element_Array_Ptr := Read_All_Bytes (Filename);
   Result : String (1 .. Buffer'Length)
     with Import, Address => Buffer.all'Address;
begin
   Process (Result);
   Free (Buffer);
end Read_File;
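
A self-contained sketch of how a caller might use this callback approach. The Read_All_Bytes body here is only a guess at what the original does, Count_Lines is a made-up callback, and the file name is a placeholder. Convention => Ada is spelled out on the overlay to stay on the safe side.

```ada
with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO;
with Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure Callback_Demo is
   type Stream_Element_Array_Ptr is access Stream_Element_Array;

   procedure Free is new Ada.Unchecked_Deallocation
     (Stream_Element_Array, Stream_Element_Array_Ptr);

   Line_Count : Natural := 0;

   --  Assumed stand-in for the thread's Read_All_Bytes.
   function Read_All_Bytes
     (Filename : String) return Stream_Element_Array_Ptr
   is
      package SIO renames Ada.Streams.Stream_IO;
      File : SIO.File_Type;
      Last : Stream_Element_Offset;
   begin
      SIO.Open (File, SIO.In_File, Filename);
      return Result : Stream_Element_Array_Ptr :=
        new Stream_Element_Array
          (1 .. Stream_Element_Offset (SIO.Size (File)))
      do
         SIO.Read (File, Result.all, Last);
         SIO.Close (File);
      end return;
   end Read_All_Bytes;

   procedure Read_File
     (Filename : String;
      Process  : not null access procedure (Content : String))
   is
      Buffer : Stream_Element_Array_Ptr := Read_All_Bytes (Filename);
      --  Overlay: Result views the same bytes as Buffer.all, no copy.
      Result : String (1 .. Buffer'Length)
        with Import, Convention => Ada, Address => Buffer.all'Address;
   begin
      Process (Result);
      Free (Buffer);
   end Read_File;

   --  Made-up callback: counts line feeds in the file contents.
   procedure Count_Lines (Content : String) is
   begin
      for C of Content loop
         if C = ASCII.LF then
            Line_Count := Line_Count + 1;
         end if;
      end loop;
   end Count_Lines;
begin
   Read_File ("src/main.adb", Count_Lines'Access);  --  placeholder path
   Ada.Text_IO.Put_Line ("Lines:" & Natural'Image (Line_Count));
end Callback_Demo;
```

The String view only exists while Read_File is running, so the callback should not keep a pointer to Content after it returns.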

dmitry-kazakov · December 12, 2024, 8:34am

Simple. Integer and Stream_Element_Offset have different sizes on the target, e.g. 32 and 64 bits respectively. When the access type is converted, the String’s bounds are misinterpreted.

jcmoyer · December 12, 2024, 4:57pm

Aha, I didn’t realize you could read directly into the String. I skimmed over Stream operations when learning but I’ll take a look now. That seems like the ideal solution here.

I’ve run into this ambiguity problem a lot, so this syntax is very helpful. Thanks.

Yeah I figured. I will try to be more careful about this in the future.

Also a good approach I hadn’t considered, thanks.

I’m shocked it’s really that easy. Indeed, the conversion works if I define a Character array type with 64-bit bounds. Thanks.

OneWingedShark · December 12, 2024, 5:57pm

Another way to deal with this is declare + renames:

Procedure Something( Input : Some_Data_Holder ) is
Begin
  declare
    Item : Some_Data renames Element( Input );
  begin
    -- processing.
  end;
End Something;

jere · December 12, 2024, 9:37pm

Though since the OP is trying to avoid the copies, they should rename the Reference operation instead of Element. I like this method as well!