I was surprised to find out that GNAT allows you to return an array whose storage is elsewhere, and it seems to work as you’d expect:
function Read_All_Text (Filename : String) return String is
-- Assume the following line reads the file into a new'd
-- Stream_Element_Array and we never free it.
Buffer : Stream_Element_Array_Ptr :=
Read_All_Bytes (Filename);
Result : String (1 .. Buffer'Length)
with Address => Buffer.all'Address;
begin
return Result;
end Read_All_Text;
Curious what you all think about this piece of code. To me it feels very icky because arrays don’t normally have reference semantics. Is it even well-defined? If so, can it blow up in my face in any way?
For context I’m dealing with files that are tens to hundreds of megabytes. Copying the buffer is undesirable and returning it as a String normally (without the Address clause) overflows the stack. Specifying the address in this way conveniently sidesteps both issues…
I know about memory mapped IO and it may be better to use here, but I’m curious if there are any other alternatives. One other thing I tried was an unchecked conversion from access Stream_Element_Array to access String but the bounds don’t convert correctly.
Your code copies Result. Note also that the correct use is
for Result'Address use Buffer.all'Address;
pragma Import (Ada, Result);
Import Ada is necessary to prevent Result being initialized.
As for how to do the I/O, the canonical signature is:
procedure Read_All_Text
( File : File_Type;
Buffer : in out Stream_Element_Array;
Last : out Stream_Element_Offset
);
Last points to the last array element written. When nothing is read, Last = Buffer’First - 1. When the whole buffer has been filled, Last = Buffer’Last and you probably need to read another chunk later.
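A minimal sketch of how a caller might loop over that kind of interface, using the standard Ada.Streams.Stream_IO.Read, which has the same Buffer/Last shape. The file name "input.bin" and the 4096-element chunk size are assumptions made up for illustration, and the loop only counts bytes instead of processing them.

```ada
with Ada.Streams;           use Ada.Streams;
with Ada.Streams.Stream_IO; use Ada.Streams.Stream_IO;
with Ada.Text_IO;

procedure Chunked_Read is
   File   : File_Type;
   Buffer : Stream_Element_Array (1 .. 4096);  --  assumed chunk size
   Last   : Stream_Element_Offset;
   Total  : Stream_Element_Count := 0;
begin
   Open (File, In_File, "input.bin");  --  hypothetical file name
   loop
      Read (File, Buffer, Last);
      --  Buffer (Buffer'First .. Last) holds the bytes of this chunk;
      --  the slice is empty when Last = Buffer'First - 1.
      Total := Total + (Last - Buffer'First + 1);
      --  A partially filled buffer means the end of the file was reached.
      exit when Last < Buffer'Last;
   end loop;
   Close (File);
   Ada.Text_IO.Put_Line ("Bytes read:" & Stream_Element_Count'Image (Total));
end Chunked_Read;
```

If the file length is an exact multiple of the chunk size, the final Read returns an empty chunk and the loop still terminates.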
Do not convert access types. Either convert arrays or, if you are sure, map them:
subtype Actual_String is String (1..Buffer'Length); -- Flat array
String_View : Actual_String;
for String_View'Address use Buffer.all'Address;
pragma Import (Ada, String_View);
Array address is the address of its first element.
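To make the mapping concrete, here is a small self-contained sketch of the same overlay on a local buffer instead of a heap-allocated one. The names Overlay_Demo and View are made up for illustration, and it assumes Character and Stream_Element are both 8 bits, which holds on common targets but is not guaranteed by the language.

```ada
with Ada.Streams; use Ada.Streams;
with Ada.Text_IO;

procedure Overlay_Demo is
   --  ASCII codes for "Hello".
   Buffer : aliased Stream_Element_Array (1 .. 5) :=
     (72, 101, 108, 108, 111);

   --  View shares Buffer's storage; no bytes are copied.
   View : String (1 .. Buffer'Length);
   for View'Address use Buffer'Address;
   pragma Import (Ada, View);  --  suppress any initialization of View
begin
   Ada.Text_IO.Put_Line (View);
   --  A write through Buffer goes to the same storage that View names.
   Buffer (1) := 74;  --  ASCII code for 'J'
   Ada.Text_IO.Put_Line (View);
end Overlay_Demo;
```

The overlay is only valid while Buffer exists, which is why returning such a view from a function forces a copy.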
Can you explain where it’s being copied? If it were being copied on the stack back to the caller I would expect it to still overflow the stack, but 100mb+ files that were overflowing the stack before no longer overflow it.
Compiling with -fstack-usage (snippet) reports for a 100mb file:
When you return an object, it is generally copied, so the return Result; line generates a copy to pass back to the outside world.
Consider if you try something similar with a limited type which is forbidden to be copied:
type LString is limited record
Value : String(1..11);
end record;
function Read_All_Text (Filename : String) return LString is
-- Here Buffer is an ordinary local object and Result overlays it,
-- so returning Result would require copying a limited object.
Buffer : LString := (Value => "Hello World");
Result : LString
with Address => Buffer'Address;
begin
return Result; -- Line 16
end Read_All_Text;
You get the following error:
jdoodle.adb:16:15: error: (Ada 2005) cannot copy object of a limited type (RM-2005 6.5(5.5/2))
jdoodle.adb:16:15: error: return by reference not permitted in Ada 2005
jdoodle.adb:16:15: error: consider switching to return of access type
When you return a result like that, the compiler interprets it as a copy. There are some rules for build-in-place, but those involve creating the variable outside your current stack frame (on the caller’s stack, on the secondary stack if using GNAT, on the heap if using Janus/Ada, etc.). The compiler won’t let you build in place using a variable in the current call frame, so it forces a copy instead.
I was more showing the limited type so the user could see the compiler is trying to copy at that line.
Here are some (but not all) situations where you might consider limited types. They are useful when working with:
Tasks and synchronized objects
Composite types with access fields that you don’t want copied at all
Wrapper types for hardware devices that can’t be copied in a practical way, like a serial port
Safer self-referential types using the Rosen technique
Additionally, limited types have to be built in place, so they can sometimes lead to faster, more efficient code. That said, limited types can be really tough to work with. There’s no real language-supplied container support for them, and since they cannot be copied, they require fancier programming to work with in large quantities.
I did some more experiments and observed in a debugger a malloc for 100mb, followed by an fread for 100mb, followed by a memcpy for 100mb. So you guys are definitely right about there being a copy.
Then I stumbled across System.Secondary_Stack.SS_Info, and if I use it to print the secondary stack size it grows by 100mb after reading the file. Mystery solved: the original snippet allocates 100mb on the heap and 100mb on the secondary stack. Under very specific circumstances, it uses the primary stack instead and overflows it (I could only reproduce it with -O0 and no Address clause though). I will have to rethink this API.
Usually the heap can handle things that size. You might consider either an Unbounded_String or an Indefinite_Holder using String as the element type and return that type. Your other option is to see if you can adjust the allowed stack size to be much larger (hit or miss if that is practically possible).
As stated, you are actually copying Result… but, if you want to do something similar-in-intent, Ada has a construct called “extended return” which allows you to set up a result, then perform operations:
-- We are passing in a string, which we assume is one word and all lowercase.
Function To_Title_Case(Input : String) return String is
Begin
-- First, we copy the input...
Return Result : String := Input do
if Result'Length not in Positive then
return; -- Do nothing on the empty-string.
else
declare
-- We can use RENAMES to identify a single element.
Initial : Character renames Result( Result'First );
-- And we can limit scope-visibility to the inner block.
Use Ada.Characters.Handling;
begin
-- Operating on the renamed object, operates on the object.
Initial:= To_Upper( Initial );
end;
end if;
End return;
End To_Title_Case;
A little bit “overengineered”, but I thought that you should know about things like renames and using use on inner blocks.
Limited is VERY useful for things like hardware-interfacing: you can’t make a new actual hardware-clock, so copying on the hardware-clock interfacing-object doesn’t make sense.
Limited is good for control as well: by forbidding assignment and giving the public view unknown discriminants, you can force clients to obtain an object through a function you provide:
Package Example is
Type Storage_Access_ID(<>) is limited private;
Function Get_ID return Storage_Access_ID;
No_IDs_Left : Exception;
Private
Type Index is range 1..16;
Keyset : Array(Index) of Boolean:= (Others => True);
Type Storage_Access_ID is limited record
Key : Index;
end record;
End Example;
Package Body Example is
Function Get_ID return Storage_Access_ID is
Begin
For Value in Keyset'Range loop
if Keyset(Value) then
Return Result : Storage_Access_ID := ( Key => Value ) do
Keyset(Value):= False;
End return;
end if;
End loop;
Raise No_IDs_Left;
End Get_ID;
End Example;
Here we have a model of a simple access control system: the only way to obtain a key is to call Get_ID, which hands out the first available key and flags it as taken, raising No_IDs_Left if there is none to give.
Ideally I would like to avoid copying the original byte buffer and return a view of it as an array of characters. I tried Indefinite_Holders but it allocated even more (2*100M from the secondary stack), and Unbounded_Strings would also need a copy. Is there any way to do this? The only way I can think of is to do the String overlay in the caller of Read_All_Text, but that has poor ergonomics.
If you can hold the original buffer as an aliased object, you can return an anonymous access to the buffer, but you are limited to where that anonymous access can be stored (it wants to protect against dangling).
Alternatively, you can use a reference-counted access type. Those don’t make copies of the data; they just make copies of the reference and pass ownership around as needed, using counts on the references to know when to deallocate the data. The GNATCOLL library has a ref-counted access type; see “Refcount: Reference counting” in the GNATColl documentation.
Side question: when you were using the indefinite holder, how were you trying to reference the data in it? If you try to pass out the holder directly it’ll make a copy, but instead you can use the Reference function to get a reference to the string.
What is the incantation to convert type Stream_Element_Array_Ptr is access all Stream_Element_Array; to access String? I tried unchecked conversion but it doesn’t preserve the bounds.
My bad, I used .Element. I couldn’t figure out how to get the length of the string without it, because Holder.Reference'Length is ambiguous.
I’m honestly not sure. It’s really tricky to do correctly. Hopefully some other folks have better input on this.
I feel like, instead of creating a buffer of Stream_Element_Array, you could read the file directly into an aliased String or a heap-allocated String using Stream operations; then you don’t have to convert. You can then pass around a not null access constant String or a not null access String to get views (you’ll need to use .all to dereference the access type).
You can qualify it to remove the ambiguity: String'(Holder.Reference)'Length
I generally make a wrapper function for things like this:
function Length(Holder : Holders.Holder) return Natural
is (String'(Holder.Reference)'Length);
Then just Length(Holder) gets the length
You can also locally use renames: View : String renames Holder.Reference;, then you can do View'Length
Just wanted to touch on this. Whenever you see secondary stack usage, that’s usually from functions that return unconstrained types like String (which is a copy operation). Indefinite holders create the string on the heap, so any secondary stack usage comes from calls like Element (which you mentioned above). The holder itself doesn’t directly use the secondary stack. Hopefully that makes sense. I’m not the best explainer.
I wanted to come back to this. When you read the file in, you can read it directly into heap allocated string memory using Streams. See the following example:
with Ada.Text_IO;
with Ada.Streams.Stream_IO;
with Ada.Containers.Indefinite_Holders;
procedure Main is
type String_Ptr is access String;
function Read_File(Filename : String) return String_Ptr is
use Ada.Streams.Stream_IO;
File : File_Type;
begin
Open(File, In_File, Filename);
-- allocate heap space based on file size
return Result : String_Ptr := new String(1..Natural(Size(File))) do
-- Read file into heap string
String'Read(Stream(File), Result.all);
Close(File);
end return;
end Read_File;
File_As_String_Ptr : String_Ptr := Read_File("src/main.adb");
begin
Ada.Text_IO.Put(File_As_String_Ptr.all);
end Main;
Then you don’t need to convert. I don’t have anything bigger than 104k that I can easily test, but this should avoid most of the secondary stack and primary stack usage. But you do need to keep track of that string pointer and deallocate it when you are finished with it.
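For the deallocation part, a minimal sketch using the standard Ada.Unchecked_Deallocation generic. The names Free_Demo, Free, and Text are made up for illustration, and String_Ptr is declared locally so the example stands on its own.

```ada
with Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure Free_Demo is
   type String_Ptr is access String;

   --  Instantiate the standard deallocation generic for this access type.
   procedure Free is new Ada.Unchecked_Deallocation (String, String_Ptr);

   Text : String_Ptr := new String'("Hello World");
begin
   Ada.Text_IO.Put_Line (Text.all);
   Free (Text);
   --  Free sets its argument to null after releasing the storage.
   pragma Assert (Text = null);
end Free_Demo;
```

Any other copies of the pointer are left dangling after Free, so this only stays safe if a single owner is responsible for the string.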
The natural approach I would use here to avoid copying is either a generic, or passing a callback as a parameter. I’ll show an example of the latter:
procedure Read_File
(Filename : String;
Process : not null access procedure (Content : String)) is
Buffer : Stream_Element_Array_Ptr := Read_All_Bytes (Filename);
-- Overlay the buffer as a String without copying or initializing it.
Result : String (1 .. Buffer'Length)
with Import, Address => Buffer.all'Address;
begin
Process (Result);
-- Free is assumed to be an Ada.Unchecked_Deallocation instance.
Free (Buffer);
end Read_File;
Simple. The sizes of Integer and Stream_Element_Offset are different on the target, e.g. 32 and 64 bits respectively. When the access type is converted, the String’s bounds are interpreted wrongly.
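A sketch of what matching the bounds could look like, assuming a GNAT-style representation where an access-to-unconstrained-array value carries the bounds along with the data. Byte_String and To_Byte_String are hypothetical names, and the result of Ada.Unchecked_Conversion between access types is implementation-dependent, so treat this as an experiment and not as portable code.

```ada
with Ada.Streams; use Ada.Streams;
with Ada.Text_IO;
with Ada.Unchecked_Conversion;

procedure Wide_Bounds_Demo is
   type Stream_Element_Array_Ptr is access all Stream_Element_Array;

   --  Hypothetical type: a Character array indexed by Stream_Element_Offset,
   --  so its bounds have the same size as those of Stream_Element_Array.
   type Byte_String is array (Stream_Element_Offset range <>) of Character;
   type Byte_String_Ptr is access all Byte_String;

   --  Implementation-dependent: relies on both access types sharing
   --  one representation for the pointer and the bounds.
   function To_Byte_String is new Ada.Unchecked_Conversion
     (Stream_Element_Array_Ptr, Byte_String_Ptr);

   Buffer : constant Stream_Element_Array_Ptr :=
     new Stream_Element_Array'(1 => 72, 2 => 105);  --  ASCII "Hi"
   Text   : constant Byte_String_Ptr := To_Byte_String (Buffer);
begin
   for C of Text.all loop
      Ada.Text_IO.Put (C);
   end loop;
   Ada.Text_IO.New_Line;
end Wide_Bounds_Demo;
```

The catch is that Byte_String is not String, so anything that expects a String still needs a conversion or an overlay.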
Aha, I didn’t realize you could read directly into the String. I skimmed over Stream operations when learning but I’ll take a look now. That seems like the ideal solution here.
I’ve run into this ambiguity problem a lot, so this syntax is very helpful. Thanks.
Yeah I figured. I will try to be more careful about this in the future.
Also a good approach I hadn’t considered, thanks.
I’m shocked it’s really that easy. Indeed, the conversion works if I define a Character array type with 64-bit bounds. Thanks.