MySQL data extraction

Hello!

I would like to export data from a measurement system’s MySQL database to files.
Each measurement has a serial number, and the software performing the measurement saves the data of each measurement in a new table named after the serial number.
During the measurement, the signal of a vibration sensor is digitized and saved in the table as a blob, and other parameters of the measurement are also saved in JSON format.
The file structure would look like this: serialnumber_stepnumber.raw and serialnumber_stepnumber.json for each row of the table.
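
Composing the per-row output base name would be something like this (just an illustrative sketch; Base_Name is a made-up helper name):

with Ada.Strings.Fixed;

--  Sketch: build "serialnumber_stepnumber" from the table columns,
--  e.g. Base_Name ("1201000001", 3) & ".raw" gives "1201000001_3.raw"
function Base_Name (Serial_No : String; Step_No : Natural) return String is
   use Ada.Strings;
begin
   return Serial_No & "_" & Fixed.Trim (Natural'Image (Step_No), Left);
end Base_Name;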

I decided to solve this task in Ada because it seems like a good practice project! :slight_smile:
I develop and use the program on Ubuntu 24.04, but later it should also run on Windows.

The DDL looks like this:

CREATE TABLE `1201000001` (
id int(10) unsigned NOT NULL AUTO_INCREMENT,
audio longblob,
distance blob,
step_no int(10) unsigned DEFAULT NULL,
serial_no varchar(20) CHARACTER SET latin1 DEFAULT NULL,
date_time datetime DEFAULT NULL,
PRIMARY KEY (id)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=latin2 COLLATE=latin2_general_ci ROW_FORMAT=DYNAMIC;

Description of each field:

  • id: primary key, it does not carry any additional information
  • audio: stores the sensors’ measurement results, essentially a custom-format audio file stored with zlib compression; accordingly it starts with the zlib header 0x78 0x9C
  • distance: contains the bytes of a 32-bit big-endian floating-point number and stores the sensor distance (see the small decoding sketch after this list)
  • step_no: a measurement consists of several steps/cycles, this field stores the ID of the current step/cycle
  • serial_no: the serial number of the measurement, the same as the table name
  • date_time: timestamp of the measurement
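
For the ‘distance’ field I imagine something like this simple conversion (only a sketch, assuming the host Float is a 32-bit IEEE-754 type):

with Ada.Streams;              use Ada.Streams;
with Ada.Unchecked_Conversion;
with Interfaces;               use Interfaces;

--  Sketch: decode the 4 big-endian bytes of the 'distance' column into
--  a Float (assumes 32-bit IEEE-754 Float on the host).
function To_Distance (Raw : Stream_Element_Array) return Float is
   function To_Float is new Ada.Unchecked_Conversion (Unsigned_32, Float);
   Bits : Unsigned_32 := 0;
begin
   for B of Raw loop
      Bits := Shift_Left (Bits, 8) or Unsigned_32 (B);  --  MSB first
   end loop;
   return To_Float (Bits);
end To_Distance;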

Since the size of the ‘audio’ field can be up to several hundred MBytes, some kind of stream-based solution should be used during zlib unpacking.

What library do you recommend for the MySQL connection? From what I have seen, GNATColl does not have MySQL support. Ada-ado is very nice, but I think it’s overkill for this project. Simple Components has an ODBC binding, but I’m a little averse to ODBC :slight_smile:
In practice, what would you use for such a typical ETL project?

Thanks in advance for your help!

ODBC is OK, but storing measurement data in a relational database is not. Depending on the postprocessing, you would be better off with a tailored custom persistence solution. E.g. our datalogger stored up to a few gigabytes of 10 kHz channels that we needed to show as graphs and search for certain conditions. That is not a job for a puny relational database.

Yes, you’re right!

However, this is an inherited legacy system and we want to extract several years of data from it. It is not a critical system and the task is not urgent. That’s why I took it on for learning purposes.

I downloaded and studied the current package of Simple Components, but using ODBC is beyond my skills.
Where can I find a simple example that connects to a database and runs a simple query?
That would be enough for a start, I could start from there.

Also, one more question: because of the streamed processing (the blob field in the database is zlib data), I have started studying the zlib-ada library (a C binding) and Ada.Streams.Stream_Element_Array. Can a query with the Simple Components ODBC library return a blob value in a way that can be streamed?

I know these are very amateur questions, but I have to start somehow.

ODBC is just like any other RDB API, e.g. MySQL.

  1. Create a connection to the DB (see the Simple Components for Ada documentation).
  2. Create a command/statement (same documentation).
  3. Prepare
  4. Bind parameters
  5. Execute (many times if prepared)
  6. Fetch results.

The bindings are thick and take care of freeing resources and closing handles.

Here is an example from the documentation that does exactly this, assuming that a connection is established:

declare
   Command : aliased ODBC_Command (Connection'Access);
   Stream  : String_Stream (100);
begin
   Prepare
   (  Command,
      String'("SELECT * FROM test_table WHERE x1=10")
   );
   Execute (Command);
   loop
      Fetch (Command);
      Get_Data (Command, Stream, 2, Never);   -- Write data into the stream
      Put_Line ("Column 2 = " & Get (Stream)); -- Get stream contents
      Rewind (Stream); -- Erase data written into the stream
   end loop;
exception
   when End_Error => -- No more rows
      null;
end;

I forgot to mention:

Thanks for the info!
@zertovitch One question: does Zip-Ada support the zlib format? Looking at the sources, it clearly knows the DEFLATE algorithm (LZ77 + Huffman coding), but can Zip-Ada unpack a data stream that only has a zlib header (rather than a complete Zip structure)?
I ask because in the documentation and the public packages I only see Zip/PKZip format stream/file processing.
Since it is a native Ada implementation it would be my preferred option, but the question is how much “hacking” would be required and whether it is worth it.

I did not use Zip-Ada programmatically, so I cannot tell. Contact the library maintainer, Gautier de Montmollin. Announcements of the library releases are posted here; that is how I know it exists: Zip-Ada version 60.

Ping @zertovitch for the previous question

Zip-Ada doesn’t support zlib, but this one does (also in pure Ada): Alire - Z_compression

This will be a very long response. I apologize in advance. A while ago I added a rudimentary Ada binding to libmysqlclient.so to the Linux Simple I/O Library for my own purposes. I make no claim of completeness, but it is enough for me to push sensor data to a MySQL database and perform basic queries. My binding only handles string data items.

After reading your post, I decided to see what would happen with blob data items. First I created some blobs and stuffed them into a MySQL database table:

# Create some blobs, and *make them readable by the MySQL server process*:

dd if=/dev/random of=blob1.bin bs=1M count=1
dd if=/dev/random of=blob2.bin bs=1M count=1
dd if=/dev/random of=blob3.bin bs=1M count=1

md5sum -b blob1.bin >blob1.md5
md5sum -b blob2.bin >blob2.md5
md5sum -b blob3.bin >blob3.md5

chmod 444 blob*.bin
mv blob*.bin /tmp
sudo chown mysql:mysql /tmp/blob*

ls -l /tmp/blob*
-r--r--r-- 1 mysql mysql 1048576 Jan 15 13:44 /tmp/blob1.bin
-r--r--r-- 1 mysql mysql 1048576 Jan 15 13:44 /tmp/blob2.bin
-r--r--r-- 1 mysql mysql 1048576 Jan 15 13:44 /tmp/blob3.bin

# Create a simple table of blobs, as *database root user*:

mysql -p -u root
MariaDB [(none)]> use test1
MariaDB [test1]> create table Blobs(ID int, data longblob not null, primary key(ID));
Query OK, 0 rows affected (0.013 sec)

MariaDB [test1]> show tables ;
+-----------------+
| Tables_in_test1 |
+-----------------+
| Blobs           |
| LoRaMessages    |
| Temperature     |
+-----------------+
3 rows in set (0.001 sec)

MariaDB [test1]> show fields from Blobs ;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| ID    | int(11)  | NO   | PRI | NULL    |       |
| data  | longblob | NO   |     | NULL    |       |
+-------+----------+------+-----+---------+-------+
2 rows in set (0.001 sec)

MariaDB [test1]> insert into Blobs values(1, load_file('/tmp/blob1.bin'));
Query OK, 1 row affected (0.086 sec)

MariaDB [test1]> insert into Blobs values(2, load_file('/tmp/blob2.bin'));
Query OK, 1 row affected (0.017 sec)

MariaDB [test1]> insert into Blobs values(3, load_file('/tmp/blob3.bin'));
Query OK, 1 row affected (0.018 sec)

MariaDB [test1]> select ID from Blobs ;
+----+
| ID |
+----+
|  1 |
|  2 |
|  3 |
+----+
3 rows in set (0.001 sec)

MariaDB [test1]> 
Bye

# Verify that a normal user can read the Blobs table:

mysql -p
MariaDB [(none)]> use test1
MariaDB [test1]> select ID from Blobs ;
+----+
| ID |
+----+
|  1 |
|  2 |
|  3 |
+----+
3 rows in set (0.001 sec)

MariaDB [test1]> quit
Bye

Then I wrote the following Ada program to reconstruct a blob file from the database. It lacks error checking and isn’t particularly efficient:

WITH Ada.Command_Line;
WITH Ada.Sequential_IO;
WITH Ada.Strings.Fixed;

WITH MySQL.libmysqlclient;

PROCEDURE test_mysql_fetch_blob IS

  FUNCTION Trim
   (Source : IN String;
    Side   : IN Ada.Strings.Trim_End :=
      Ada.Strings.Both) RETURN String RENAMES Ada.Strings.Fixed.Trim;

  TYPE Byte IS MOD 256;

  -- Convert a hex string to a byte

  FUNCTION ToByte(s : string) RETURN Byte IS

  BEGIN
    RETURN Byte'Value("16#" & s & "#");
  END ToByte;

  PACKAGE ByteIO IS NEW Ada.Sequential_IO(Byte);

  server : MySQL.libmysqlclient.Server;
  blobID : Natural;
  dstfile : ByteIO.File_Type;

BEGIN
  -- Connect to the database server

  server := MySQL.libmysqlclient.Create
   (Ada.Command_Line.Argument(1),
    Ada.Command_Line.Argument(2),
    Ada.Command_Line.Argument(3),
    Ada.Command_line.Argument(4));

  blobID := Natural'Value(Ada.Command_line.Argument(5));

  -- Issue query

  server.Dispatch("select hex(data) from Blobs where ID=" & Trim(blobID'Image));

  -- Process results

  server.FetchResults;
  server.FetchRow;

  ByteIO.Create(dstfile, name => "blob" & Trim(blobID'Image) & ".bin");

  DECLARE
    blobhex  : String := server.FetchColumn(1);
    blobsize : Natural := blobhex'Length/2;
  BEGIN
    FOR i IN 1 .. blobsize LOOP
      ByteIO.Write(dstfile, ToByte(blobhex((2*i-1) .. 2*i)));
    END LOOP;
  END;

  ByteIO.Close(dstfile);
 
  server.Disconnect;
END test_mysql_fetch_blob;

Finally, I ran the program to reconstruct the three blob files:

# Reconstruct blob files from database:

./test_mysql_fetch_blob localhost pmunts redacted test1 1
./test_mysql_fetch_blob localhost pmunts redacted test1 2
./test_mysql_fetch_blob localhost pmunts redacted test1 3

ls -l *.bin
-rw-r--r-- 1 pmunts pmunts 1048576 Jan 15 15:17 blob1.bin
-rw-r--r-- 1 pmunts pmunts 1048576 Jan 15 15:17 blob2.bin
-rw-r--r-- 1 pmunts pmunts 1048576 Jan 15 15:17 blob3.bin

md5sum -c *.md5
blob1.bin: OK
blob2.bin: OK
blob3.bin: OK

So, my rudimentary client binding can retrieve blobs from a database simply by using the MySQL hex function to fetch a blob as a (very long) hex string. This scheme should work for any MySQL client binding.

Converting a blob to an in-memory binary object and dealing with compression are left as an exercise for the reader.
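
For the in-memory half of that exercise, a rough starting point (Hex_To_Bytes is just an illustrative name, with no validation or error checking) might be:

WITH Ada.Streams; USE Ada.Streams;

-- Convert a hex string, as returned by the MySQL hex() function, into a
-- Stream_Element_Array held in memory.

FUNCTION Hex_To_Bytes(hexstr : String) RETURN Stream_Element_Array IS
  result : Stream_Element_Array(1 .. Stream_Element_Offset(hexstr'Length/2));
BEGIN
  FOR i IN result'Range LOOP
    result(i) := Stream_Element'Value
     ("16#" & hexstr(hexstr'First + 2*Natural(i - 1) ..
                     hexstr'First + 2*Natural(i - 1) + 1) & "#");
  END LOOP;
  RETURN result;
END Hex_To_Bytes;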

Addendum #1: I repeated my tests with 500 MB blobs. They failed until I added

[mysqld]
max_allowed_packet=1G

to /etc/mysql/conf.d/mysql.cnf on my Debian 13 (Trixie) Linux database server. This is a brute-force solution that works because my server has 16 GB of RAM. I’m actually a little surprised that Ada handles 1 GB strings.

Addendum #2: Next I tried running test_mysql_fetch_blob with 500 MB blobs on a Raspberry Pi 5 with 16 GB RAM running 64-bit Raspberry Pi OS and connected to the same gigabit LAN as the MySQL server:

time ./test_mysql_fetch_blob shiloh.munts.net pmunts redacted test1 1 

real	0m57.958s
user	0m37.778s
sys	0m1.352s

md5sum -c blob1.md5
blob1.bin: OK

Finally, I tried running test_mysql_fetch_blob on a Raspberry Pi 5 with 4 GB RAM running MuntsOS Embedded Linux and connected to the MySQL server via WiFi. It failed with an out-of-memory kernel fault. Trying the same on one with 8 GB RAM succeeded:

tarsus# time ./test_mysql_fetch_blob shiloh.munts.net pmunts redacted test1 1
real	4m 10.68s
user	0m 52.95s
sys	0m 3.56s
tarsus# md5sum -c blob1.md5
blob1.bin: OK

Test runs on the 4 GB Raspberry Pi 5 lasted 1 to 3 minutes, suggesting that 4 GB RAM is almost enough.

@dmitry-kazakov @zertovitch @Heziode Thanks for the answer and the ideas, I’m working on the solution and I think I’ll definitely have more questions!

@pmunts Thank you for the thorough description! I’m especially glad you brought MuntsOS Embedded Linux to my attention! I also develop for RPi (C/C++), although I haven’t tried Ada on arm64 yet.

I’m still very new to Ada programming. I’ve tried it several times because I really like its “philosophy” (much more so than Rust’s), but there was never a real project worth developing with it, so daily tasks slowly pushed it into the background. I have to say that it’s very difficult to switch from C/C++ to the Ada approach. At least it’s difficult for me… :slight_smile:

I do not understand this. If you look at ODBC or any other DB API in C, they are very low level. A lot must be done manually; for example, you must explicitly allocate and free any resources you use, and it gets quite tricky with cursors etc. In Ada it is exactly the same calls as in C, just half as many. You could translate DB-communication C code into Ada in five minutes, provided you understand the C code. There is no difference in the Ada approach, as it is exactly the same: prepare, bind, execute, fetch, repeat.

I am puzzled.

Z_Compression only does compression. The OP seems to be looking for decompression.

Alas, it looks like zlib is the only choice. Here is a binding to it: ZLib for Ada thick binding (SourceForge.net). It looks quite nicely done. It does not have streams out of the box, but that would be easy to add; there is an example that could be modified for it.
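
For reference, a chunk-by-chunk inflate loop over that binding could look roughly like the sketch below. The Next_Chunk and Consume formals are placeholders for reading the compressed blob and writing the .raw file; the Inflate_Init/Translate/Stream_End/Close calls are from the thick binding as far as I remember its interface, so check them against the actual package spec:

with Ada.Streams;

generic
   --  Placeholders, not part of any real API: Next_Chunk delivers the next
   --  piece of compressed blob data (Last < Data'First when exhausted),
   --  Consume receives the decompressed output, e.g. writes the .raw file.
   with procedure Next_Chunk
     (Data : out Ada.Streams.Stream_Element_Array;
      Last : out Ada.Streams.Stream_Element_Offset);
   with procedure Consume (Data : Ada.Streams.Stream_Element_Array);
procedure Inflate_Blob;

with ZLib;
procedure Inflate_Blob is
   use Ada.Streams;
   Filter   : ZLib.Filter_Type;
   In_Buf   : Stream_Element_Array (1 .. 4_096);
   Out_Buf  : Stream_Element_Array (1 .. 16_384);
   In_Last  : Stream_Element_Offset;
   Used     : Stream_Element_Offset;
   Out_Last : Stream_Element_Offset;
begin
   ZLib.Inflate_Init (Filter);  --  default header mode expects a zlib header
   Feed : loop
      Next_Chunk (In_Buf, In_Last);
      exit Feed when In_Last < In_Buf'First;  --  no more compressed data
      declare
         From : Stream_Element_Offset := In_Buf'First;
      begin
         while From <= In_Last loop  --  one input chunk may yield several outputs
            ZLib.Translate
              (Filter, In_Buf (From .. In_Last), Used,
               Out_Buf, Out_Last, ZLib.No_Flush);
            Consume (Out_Buf (Out_Buf'First .. Out_Last));
            exit Feed when ZLib.Stream_End (Filter);
            From := Used + 1;
         end loop;
      end;
   end loop Feed;
   ZLib.Close (Filter);
end Inflate_Blob;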

Sorry, I was a bit ambiguous: I didn’t mean the quoted text about the ODBC binding! I was writing in general about the huge difference between Ada’s high-level strongly typed and type-oriented approach and C’s “portable assembly”.

Z_Compression was derived from the Deflate algorithm in Zip-Ada. There is also an implementation of decompression (Inflate) in Zip-Ada which could be extracted as well.

That would be very nice. I can think of many projects where an Ada compression library could be used. Under Windows there is an option to compress/decompress files at the filesystem level. I tested it for log file chunks; it is awfully slow and the compression ratio is bad. In the end we used an external library. An Ada solution with a stream interface would be a dream.

At work we use Zip-Ada for zipping log files.

The process writes to its log, and when the file reaches a certain number of lines it renames the log file to name+timestamp.log, then spawns an Ada process that gets the file name as input and zips it via Zip-Ada (a sketch of the spawning side is below). This could be done as a task, of course.

The process opens a new log and continues writing to it.
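
The spawning side is essentially just this (a sketch; “zip_log” is a placeholder for the helper program name, and in practice its full path may be needed):

with Ada.Directories;
with GNAT.OS_Lib; use GNAT.OS_Lib;

--  Sketch of the rotation step: rename the active log file with a
--  timestamp suffix, then hand it to an external zipper process.
procedure Rotate (Log_Name, Timestamp : String) is
   Rotated : constant String := Log_Name & "-" & Timestamp & ".log";
   Args    : Argument_List := (1 => new String'(Rotated));
   Pid     : Process_Id;
begin
   Ada.Directories.Rename (Old_Name => Log_Name, New_Name => Rotated);
   Pid := Non_Blocking_Spawn (Program_Name => "zip_log", Args => Args);
   Free (Args (1));
   if Pid = Invalid_Pid then
      raise Program_Error with "could not spawn zip_log";
   end if;
end Rotate;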

Yes. We chunk the log when it grows over a certain size or when the logging interval exceeds some duration.