How to read and manipulate table data

I’ve written a lot of Python code that reads and manipulates DataFrames with the pandas library, but I’m not sure what the best way is to work with CSV or table data in Ada. I’ve started by creating an array of structures, where each entry in the array contains the table header information (e.g. name, age, height, account number). To clarify, by table header information I mean the row of values that labels the columns of a CSV file.

1 Like

It’ll depend on exactly what you want to do with it, but in the past, I have used the Index function in Ada.Strings.Fixed to parse out a line looking for commas and splitting out the parts in between. I start by declaring a vector to hold the split parts:

    with Ada.Containers.Indefinite_Vectors;  -- needed for the instantiation

    package Vectors is new Ada.Containers.Indefinite_Vectors
        (Index_Type   => Positive,
         Element_Type => String);
    subtype String_List is Vectors.Vector;

Then I iterate through the line read from the file and split the string based on commas:

    function Split(Line : String) return String_List is
        Result : String_List;
        First  : Positive := Line'First; -- Set to the start of the string
        Last   : Natural;
    begin
        loop
            -- Look for the next comma in the string.
            -- If a comma is found, store its index in Last.
            -- If no comma is found, Last has a value of 0.
            -- First is updated on each iteration.
            Last := Ada.Strings.Fixed.Index(Line(First .. Line'Last), ",");
            exit when Last = 0; -- Leave if no comma is found

            Result.Append(Line(First .. Last - 1)); -- Append the item
            First := Last + 1; -- Set to the index after the comma for the next search
        end loop;
        Result.Append(Line(First .. Line'Last)); -- Append the last item
        return Result;
    end Split;

After that you have a list of all the cell strings from the line and can start converting them. Some strings may be empty, so check that their length is greater than 0 to see whether there was anything between the commas. You can iterate through the string list or index it like an array:

    Items : constant String_List := Split(Get_Line);
begin
    -- Iterating through them
    for Item of Items loop
        Put_Line(Item);
    end loop;

    -- Or indexing directly
    Put_Line("2nd item is " & Items(2));

Full example (online compiler): PltCRI - Online Ada Compiler & Debugging Tool - Ideone.com

2 Likes

You can

  • create an enumerated type with the topics of the header row: type Topic is (name, age, ...)
  • parse the header line for mapping table columns to topics (mapping : array (Topic) of Positive)
  • if you need it, store the values in an array of type Row_Type is array (Topic) of Real
  • if you need it, store the entire table in a Vector of Row_Type.

This approach is practical if you need only some of the columns and the column order occasionally changes.
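A minimal sketch of that layout, with assumed topics and an assumed `Real` type (the header names and the sample values are placeholders, not from any real file):

```ada
with Ada.Containers.Vectors;
with Ada.Text_IO; use Ada.Text_IO;

procedure Topic_Demo is
   type Real is digits 15;

   --  Topics for an assumed header row; extend as needed
   type Topic is (Name, Age, Height);
   type Row_Type is array (Topic) of Real;

   package Row_Vectors is new Ada.Containers.Vectors
     (Index_Type => Positive, Element_Type => Row_Type);

   Mapping : array (Topic) of Positive;  --  topic -> column number
   Table   : Row_Vectors.Vector;
begin
   --  Build the mapping from a hypothetical header "age,name,height".
   --  Topic'Value is case-insensitive, so header cells map directly.
   Mapping (Topic'Value ("age"))    := 1;
   Mapping (Topic'Value ("name"))   := 2;
   Mapping (Topic'Value ("height")) := 3;

   --  Store one parsed row; the numbers stand in for converted cells
   declare
      Cells : constant array (1 .. 3) of Real := (30.0, 0.0, 1.82);
      Row   : Row_Type;
   begin
      for T in Topic loop
         Row (T) := Cells (Mapping (T));
      end loop;
      Table.Append (Row);
   end;

   Put_Line ("Rows stored:" & Ada.Containers.Count_Type'Image (Table.Length));
end Topic_Demo;
```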

I use a CSV package (copy here) that allows for reading the items in a random order.

1 Like

I typically begin processing CSV files by reading them line by line and parsing the lines with PragmARC.Line_Fields. This also works for other common separators, such as semicolons and spaces.

1 Like

You do not need any data structures to work with CSV. That is the advantage of the format.
The strings editing library (a part of Simple Components) was designed for exactly that. You just read a line and then get the columns one by one using the appropriate target data type, advancing the line index and skipping blanks and separators.
As an example, you can take a look at how UnicodeData.txt is handled (it is CSV with a semicolon as the separator). The file is read in Strings_Edit.UTF8.Categorization_Generator to generate a Unicode categorization map.

Yes, that is the way.

This is incorrect.
It will not work on values like "Smith, John".
To process a CSV file properly you need to actually parse it.

You have to be careful; this doesn’t work with text-fields like:

Oh say can you see,
  by the dawn's early light,

Not really. See.

You are right that Ada.Strings.Fixed.Index should never ever be used as a tokenizer; that is a no-no.

Why? Space is a separator, comma is a blank.

The algorithm is:

loop
   read line
   for I in 1 .. N loop
      skip blanks
      get field I
      skip blanks
      if I /= N then
         get separator
      end if;
   end loop;
   skip blanks
   check line end
end loop;
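A rough Ada rendering of that loop, assuming a semicolon separator and a known field count. Rather than the Strings_Edit routines, this sketch uses the string-based Get from Ada.Float_Text_IO, which already skips leading blanks; the sample line is made up:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Float_Text_IO;

procedure Read_Fields is
   Line  : constant String := "1.5; 3.1415; 2.0";  --  assumed input
   N     : constant := 3;                          --  known field count
   Pos   : Positive := Line'First;
   Value : Float;
   Last  : Positive;
begin
   for I in 1 .. N loop
      --  Get skips leading blanks, reads the number, and reports the
      --  index of its last character in Last (slice indices are kept)
      Ada.Float_Text_IO.Get (Line (Pos .. Line'Last), Value, Last);
      Put_Line ("Field" & Integer'Image (I) & " =" & Float'Image (Value));
      Pos := Last + 1;
      if I /= N then
         --  Skip blanks, then consume the separator
         while Pos <= Line'Last and then Line (Pos) = ' ' loop
            Pos := Pos + 1;
         end loop;
         pragma Assert (Line (Pos) = ';');
         Pos := Pos + 1;
      end if;
   end loop;
end Read_Fields;
```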

Nah, that’s pretty easy to work around. If you are parsing known data, you just merge the strings ("Smith" and " John" rejoined with a comma between them, for example). If you want a more general solution, you do a second pass, merging all cells between the one starting with an unescaped quote and the one ending with an unescaped quote. I’ve never had any trouble with it.

This is awful and a good example why not to split into fields.

    "Smith\,\ John"
    1,5, 3,1415   -- means 1.5 and 3.1415 in Europe

You should get a quoted string just like you would get a number. The syntax diagrams Wirth used in the Pascal User Manual are the way to describe how to parse such things, all in a single pass, with no backtracking.
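As a sketch of getting a quoted string in a single pass (the function name and interface are my own inventions; doubled quotes inside a quoted field stand for a literal quote, as in RFC 4180):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Quoted_Demo is
   --  Returns the next field starting at Pos, advancing Pos past it.
   --  A leading " starts a quoted field; "" inside stands for a literal ".
   function Get_Field (Line : String; Pos : in out Positive) return String is
      Result : String (1 .. Line'Length);
      Len    : Natural := 0;
   begin
      if Pos <= Line'Last and then Line (Pos) = '"' then
         Pos := Pos + 1;                     --  skip the opening quote
         while Pos <= Line'Last loop
            if Line (Pos) = '"' then
               if Pos < Line'Last and then Line (Pos + 1) = '"' then
                  Len := Len + 1;            --  doubled quote: literal "
                  Result (Len) := '"';
                  Pos := Pos + 2;
               else
                  Pos := Pos + 1;            --  closing quote
                  exit;
               end if;
            else
               Len := Len + 1;
               Result (Len) := Line (Pos);
               Pos := Pos + 1;
            end if;
         end loop;
      else                                   --  unquoted: up to the comma
         while Pos <= Line'Last and then Line (Pos) /= ',' loop
            Len := Len + 1;
            Result (Len) := Line (Pos);
            Pos := Pos + 1;
         end loop;
      end if;
      return Result (1 .. Len);
   end Get_Field;

   Line : constant String := """Smith, John"",42";
   Pos  : Positive := Line'First;
begin
   Put_Line (Get_Field (Line, Pos));  --  prints Smith, John
end Quoted_Demo;
```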

PragmARC.Line_Fields handles quoted fields properly.

For your second example, I presume you are referring to fields with embedded line terminators. This is true, but since in 50 years I have never encountered such fields, I don’t consider it a problem.

You are correct about embedded line-terminators.
I have encountered them in CSV, and even generated them. The first programming project I was put on after graduating & getting a job was a program that processed medical/insurance records… using PHP.

This particular problem came up when I had to implement a file import/export function for CSV; I’d used the internal parse function but it wouldn’t work on the production machine. I went through everything I could think of, and nothing. So I wrote up a CSV parser, tested/debugged it, and deployed it to take the place of that PHP-function. (Turns out the version of PHP on the other machine was different, and the parse-CSV function was added between those minor versions.)

1 Like

PragmARC.Line_Fields doesn’t seem to handle UTF-8 encoding. When I tried to change the arguments from “String” to “Wide_String” I got errors.

Will Ada.Strings.Fixed work with wide strings? I need to parse strings containing UTF-8 characters.

For Wide_Strings you want to use the wide version: Ada.Strings.Wide_Fixed. I don’t know much about how it interfaces with UTF-8, though; I assume it doesn’t out of the box.

1 Like

I figured it out. I can share my solution if you think that would be helpful. It was kind of a pain, but I learned a lot about Ada in the process.

I figured this out. Is there a way I can push my corrections for PragmARC.Line_Fields to github?

You can try raising a PR at Jeffrey’s github: GitHub - jrcarter/PragmARC: The PragmAda Reusable Components

If you’re trying to parse UTF-8 encoded data, then you’re on your own. Encoded data should be decoded before processing. Wide_String is not a good choice for representing UTF-8 encoded data.

Modifying Line_Fields to work with (unencoded) [Wide_]Wide_String should be trivial. It could even be generic.

1 Like

UTF-8 was designed specifically with the goal that such parsing algorithms remain the same on UTF-8 encoded text.

One can remove Wide_String and Wide_Wide_String from the language and notice no difference.