How to read and manipulate table data

I’ve written a lot of Python code that reads and manipulates DataFrames with the pandas library, but I’m not sure what the best way is to work with CSV or table data in Ada. I’ve started by creating an array of structures, where each entry in the array contains the table header information (e.g. name, age, height, account number). To clarify, by table header information I mean the row of values that labels the columns of a CSV file.

1 Like

It’ll depend on exactly what you want to do with it, but in the past, I have used the Index function in Ada.Strings.Fixed to parse out a line looking for commas and splitting out the parts in between. I start by declaring a vector to hold the split parts:

    with Ada.Containers.Indefinite_Vectors;  -- needed for the instantiation

    package Vectors is new Ada.Containers.Indefinite_Vectors
        (Index_Type   => Positive,
         Element_Type => String);
    subtype String_List is Vectors.Vector;

Then I iterate through the line read from the file and split the string based on commas:

    function Split(Line : String) return String_List is
        Result : String_List;
        First  : Positive := Line'First; -- Set to the start of the string
        Last   : Natural;
    begin
        loop
            -- Look for the next comma in the string.
            -- If a comma is found, store its index in Last.
            -- If no comma is found, Last has a value of 0.
            -- First is updated on each iteration.
            Last := Ada.Strings.Fixed.Index(Line(First .. Line'Last), ",");
            exit when Last = 0; -- Leave if no comma is found

            Result.Append(Line(First .. Last - 1)); -- Append the item
            First := Last + 1; -- Set to the index after the comma for the next search
        end loop;
        Result.Append(Line(First .. Line'Last)); -- Append the last item
        return Result;
    end Split;

After that you have a list of all the cell strings from the line and can start converting them. Some strings may be empty, so check that their length is greater than 0 to see whether there was anything between the commas. You can iterate through the string list or index it like an array:

    Items : constant String_List := Split(Get_Line);
begin
    -- Iterating through them
    for Item of Items loop
        Put_Line(Item);
    end loop;

    -- Or indexing directly
    Put_Line("2nd item is " & Items(2));

Full example (online compiler): PltCRI - Online Ada Compiler & Debugging Tool - Ideone.com

2 Likes

You can

  • create an enumerated type with the topics of the header row: type Topic is (name, age, ...)
  • parse the header line for mapping table columns to topics (mapping : array (Topic) of Positive)
  • if you need it, store the values in an array of type Row_Type is array (Topic) of Real
  • if you need it, store the entire table in a Vector of Row_Type.

This approach is practical if you need only some of the columns and the column order occasionally changes.
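A minimal sketch of that layout, with assumed topics and an assumed `Real` type (the header names and the sample values are placeholders, not from any real file):

```ada
with Ada.Containers.Vectors;
with Ada.Text_IO; use Ada.Text_IO;

procedure Topic_Demo is
   type Real is digits 15;

   --  Topics for an assumed header row; extend as needed
   type Topic is (Name, Age, Height);
   type Row_Type is array (Topic) of Real;

   package Row_Vectors is new Ada.Containers.Vectors
     (Index_Type => Positive, Element_Type => Row_Type);

   Mapping : array (Topic) of Positive;  --  topic -> column number
   Table   : Row_Vectors.Vector;
begin
   --  Build the mapping from a hypothetical header "age,name,height".
   --  Topic'Value is case-insensitive, so header cells map directly.
   Mapping (Topic'Value ("age"))    := 1;
   Mapping (Topic'Value ("name"))   := 2;
   Mapping (Topic'Value ("height")) := 3;

   --  Store one parsed row; the numbers stand in for converted cells
   declare
      Cells : constant array (1 .. 3) of Real := (30.0, 0.0, 1.82);
      Row   : Row_Type;
   begin
      for T in Topic loop
         Row (T) := Cells (Mapping (T));
      end loop;
      Table.Append (Row);
   end;

   Put_Line ("Rows stored:" & Ada.Containers.Count_Type'Image (Table.Length));
end Topic_Demo;
```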

I use a CSV package (copy here) that allows for reading the items in a random order.

1 Like

I typically begin processing CSV files by reading them line by line and parsing the lines with PragmARC.Line_Fields. This also works for other common separators, such as semicolons and spaces.

1 Like

You do not need any data structures to work with CSV. That is the advantage of the format.
The strings editing library (a part of Simple Components) was designed for exactly that. You just read a line and then get the columns one by one using the appropriate target data type, advancing the line index and skipping blanks and separators.
As an example, you can take a look at how UnicodeData.txt is handled (it is CSV with a semicolon as the separator). The file is read in Strings_Edit.UTF8.Categorization_Generator to generate a Unicode categorization map.

Yes, that is the way.

This is incorrect.
It will not work on values like "Smith, John".
To process a CSV file properly you need to actually parse it.

You have to be careful; this doesn’t work with text-fields like:

Oh say can you see,
  by the dawn's early light,

Not really. See.

You are right that Ada.Strings.Fixed.Index should never ever be used as a tokenizer; that is a no-no.

Why? Space is a separator, comma is a blank.

The algorithm is:

loop
   read line
   for I in 1 .. N loop
      skip blanks
      get field I
      skip blanks
      if I /= N then
         get separator
      end if;
   end loop;
   skip blanks
   check line end
end loop;
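A rough Ada rendering of that loop, assuming a semicolon separator and a known field count. Rather than the Strings_Edit routines, this sketch uses the string-based Get from Ada.Float_Text_IO, which already skips leading blanks; the sample line is made up:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Float_Text_IO;

procedure Read_Fields is
   Line  : constant String := "1.5; 3.1415; 2.0";  --  assumed input
   N     : constant := 3;                          --  known field count
   Pos   : Positive := Line'First;
   Value : Float;
   Last  : Positive;
begin
   for I in 1 .. N loop
      --  Get skips leading blanks, reads the number, and reports the
      --  index of its last character in Last (slice indices are kept)
      Ada.Float_Text_IO.Get (Line (Pos .. Line'Last), Value, Last);
      Put_Line ("Field" & Integer'Image (I) & " =" & Float'Image (Value));
      Pos := Last + 1;
      if I /= N then
         --  Skip blanks, then consume the separator
         while Pos <= Line'Last and then Line (Pos) = ' ' loop
            Pos := Pos + 1;
         end loop;
         pragma Assert (Line (Pos) = ';');
         Pos := Pos + 1;
      end if;
   end loop;
end Read_Fields;
```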

Nah, that’s pretty easy to work around. If you are parsing known data, you just merge the strings ("Smith" and " John" rejoined with a comma between them, for example). If you want a more general solution, you do a second pass, merging all cells between the one starting with an unescaped quote and the one ending with an unescaped quote. I’ve never had any trouble with it.

This is awful and a good example why not to split into fields.

    "Smith\,\ John"
    1,5, 3,1415   -- means 1.5 and 3.1415 in Europe

You should get a quoted string just like you would get a number. The syntax diagrams Wirth used in the Pascal User Manual are the way to describe how to parse such things, all in a single pass, with no backtracking.
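As a sketch of getting a quoted string in a single pass (the function name and interface are my own inventions; doubled quotes inside a quoted field stand for a literal quote, as in RFC 4180):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Quoted_Demo is
   --  Returns the next field starting at Pos, advancing Pos past it.
   --  A leading " starts a quoted field; "" inside stands for a literal ".
   function Get_Field (Line : String; Pos : in out Positive) return String is
      Result : String (1 .. Line'Length);
      Len    : Natural := 0;
   begin
      if Pos <= Line'Last and then Line (Pos) = '"' then
         Pos := Pos + 1;                     --  skip the opening quote
         while Pos <= Line'Last loop
            if Line (Pos) = '"' then
               if Pos < Line'Last and then Line (Pos + 1) = '"' then
                  Len := Len + 1;            --  doubled quote: literal "
                  Result (Len) := '"';
                  Pos := Pos + 2;
               else
                  Pos := Pos + 1;            --  closing quote
                  exit;
               end if;
            else
               Len := Len + 1;
               Result (Len) := Line (Pos);
               Pos := Pos + 1;
            end if;
         end loop;
      else                                   --  unquoted: up to the comma
         while Pos <= Line'Last and then Line (Pos) /= ',' loop
            Len := Len + 1;
            Result (Len) := Line (Pos);
            Pos := Pos + 1;
         end loop;
      end if;
      return Result (1 .. Len);
   end Get_Field;

   Line : constant String := """Smith, John"",42";
   Pos  : Positive := Line'First;
begin
   Put_Line (Get_Field (Line, Pos));  --  prints Smith, John
end Quoted_Demo;
```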

PragmARC.Line_Fields handles quoted fields properly.

For your second example, I presume you are referring to fields with embedded line terminators. This is true, but since in 50 years I have never encountered such fields, I don’t consider it a problem.

You are correct about embedded line-terminators.
I have encountered them in CSV, and even generated them. The first programming project I was put on after graduating & getting a job was a program that processed medical/insurance records… using PHP.

This particular problem came up when I had to implement a file import/export function for CSV; I’d used the internal parse function but it wouldn’t work on the production machine. I went through everything I could think of, and nothing. So I wrote up a CSV parser, tested/debugged it, and deployed it to take the place of that PHP-function. (Turns out the version of PHP on the other machine was different, and the parse-CSV function was added between those minor versions.)

1 Like

PragmARC.Line_Fields doesn’t seem to handle UTF-8 encoding. When I tried to change the arguments from “String” to “Wide_String” I got errors.

Will Ada.Strings.Fixed work with wide strings? I need to parse strings containing UTF-8 characters.

For Wide_Strings you want to use the wide version: Ada.Strings.Wide_Fixed. I don’t know much about how it interfaces with UTF-8, though; I assume it doesn’t out of the box.

1 Like

I figured it out. I can share my solution if you think that would be helpful. It was kind of a pain, but I learned a lot about Ada in the process.

I figured this out. Is there a way I can push my corrections for PragmARC.Line_Fields to github?

You can try raising a PR at Jeffrey’s github: GitHub - jrcarter/PragmARC: The PragmAda Reusable Components

If you’re trying to parse UTF-8 encoded data, then you’re on your own. Encoded data should be decoded before processing. Wide_String is not a good choice for representing UTF-8 encoded data.

Modifying Line_Fields to work with (unencoded) [Wide_]Wide_String should be trivial. It could even be generic.

1 Like

UTF-8 was designed specifically with the goal that such parsing algorithms remain the same on UTF-8 encoded text.

One can remove Wide_String and Wide_Wide_String from the language and notice no difference.