Generic Image Decoder (GID) version 13

GID can be found via the following links:

New in version 13:

  • Quality of progressive JPEG and overall performance of JPEG decoding have been improved.

  • There are also two new tools shipped with GID (and of course using it):

    • comp_img: an image comparison tool (the result ranges from 0, identical images, to 1, maximal difference such as black vs. white)

    • benchmark: a performance comparison between GID and ImageMagick, an open-source library.

Enjoy!

5 Likes

In case you’re wondering what the results might look like…

A little spoiler/teaser (nota bene, statistically unrepresentative):

GIF GIF89a,  Subformat 7
    Images in this category  . . . . . . . . . . . . . . . . . : 1
    Average color difference score . . . . . . . . . . . . . . : 0.00000E+00
    Average duration GID [1] . . . . . . . . . . . . . . . . . : 0.683026139
    Average duration ImageMagick (external call) . . . . . . . : 2.124468994
    Average duration ImageMagick (as if internally called) [2] : 2.097257994
    [1]  vs. [2]: GID is 3.07054E+00 times faster
4 Likes

I read the blog report, and while it’s well-done and the argument about generics is compelling, do you know if there’s another C/C++ decoder you could compare to? Something that isn’t part of a larger project?

Also, and I think I know the answer to this, but just to be sure: what about the resulting size of the program? Won’t the generics make it much larger? Does that matter?

It depends on the format.
For instance, many open-source libraries use libjpeg and libpng for JPEG and PNG, or literal translations (typically to Pascal).
So you should get similar results in the end.
For JPEG Baseline, a rare C decoder that does not call libjpeg behind the scenes is NanoJPEG; its translation was the basis of GID’s decoder.
That’s for open-source software.
You will probably find more efficient libraries in proprietary software.

Yes, the proliferation of generics (especially multi-level generics, which GNAT always macro-expands) leads to much larger object code.
However, when I do

gprbuild -P gid -XGID_Build_Mode=Smallest mini

I get, with gcc 11, a 764 KiB executable for Windows x64. The size isn’t that dramatic, is it?
The “mini” application converts images and animations of any of the BMP, GIF, JPEG, PNG, PNM, QOI, and TGA formats.

2 Likes

GID is presented as “Unconditionally portable code: OS-, CPU-, compiler- independent”. On the other hand, you claim that GID is fast largely because of its extensive use of generics, but only generic instantiation by macro expansion has that effect. There are compilers that do not do generic instantiation that way. Striving for compiler independence while making implementation decisions based on the internal details of (a) certain compiler(s) seems somewhat contradictory to me.

Gautier, thank you for your work! It’s really great to see such attention to portability while also striving to make it work fast on the most popular compiler. It is really a great feat!

1 Like

Short answer:

  • Macro-expanded generic instantiation is a well-known way of implementing instantiations, documented by compiler makers. Some compilers even have switches or pragmata for choosing that method (another well-known method is shared generics).
  • Portability of GID is not hindered by the method the compiler is using for instantiating generics.
  • Execution performance depends on what the compiler has to offer, and this covers not only generics but also various optimization options.

Now, the longer answer:
A compiler may not provide macro-expanded generic instantiations. It may also not do common subexpression elimination, loop unrolling, removal of unnecessary range checks, and so on.
GID (like any other source code) cannot do anything about someone choosing a compiler that produces relatively slow machine code.
Perhaps the chosen compiler has other qualities (for instance, it may support exotic targets, or produce very compact machine code), and that’s fine.
If you expect that GID (or any source code) will magically improve the way the compiler is doing its job regarding execution performance, I’m afraid you’re going to be disappointed…
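
To make the macro-expansion idea concrete, here is a minimal, self-contained Ada sketch (a toy example, not GID code):

   with Ada.Text_IO;

   procedure Generic_Demo is
      generic
         Shift : Natural;  --  fixed at instantiation time
      function Scale (X : Integer) return Integer;

      function Scale (X : Integer) return Integer is
      begin
         --  With macro-expanded instantiation, each instance below gets its
         --  own specialized copy of this body, in which 2 ** Shift is a
         --  compile-time constant.  With shared generics, one body serves
         --  all instances and Shift is fetched indirectly at run time.
         return X * 2 ** Shift;
      end Scale;

      function Scale_By_4  is new Scale (Shift => 2);
      function Scale_By_16 is new Scale (Shift => 4);
   begin
      Ada.Text_IO.Put_Line (Integer'Image (Scale_By_4 (10)));   --  40
      Ada.Text_IO.Put_Line (Integer'Image (Scale_By_16 (10)));  --  160
   end Generic_Demo;

Either way, the Ada source is exactly the same; only the generated machine code differs, which is why the portability of GID is not affected.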

Back to macro-expanded generic instantiations. I see three possibilities, if you want GID to run “on the fast side”:

  1. Your compiler provides macro-expanded generic instantiations and enables them by default (perhaps that is even the only method it supports, like GNAT and ObjectAda). Then it’s all fine.

  2. Your compiler provides macro-expanded generic instantiations, but does not enable them by default. Then you may want to search the compiler’s manual for a way to enable them.
    I am not aware of a standard way of controlling that from within the Ada sources.
    DEC Ada had an Inline_Generic pragma (see the small sketch after this list). GNAT recognizes that pragma but does nothing special with it, since it already applies that method.
    Other compilers (those of this case 2) may not recognize that pragma, although they are the ones that would need it. So your remaining option, in that case, is to look for the right switch in the user manual.

  3. Your compiler does not provide macro-expanded generic instantiations at all.
    In this case, well… sorry! It is a limitation of the compiler (like the other possible limitations mentioned above) and you have to live with it (or maybe not). It doesn’t mean that you won’t be able to compile GID with that compiler. But no Ada source will make that feature (macro expansion of generics) appear magically.
    The compiler maker(s) may want (or perhaps not) to work a little harder on that topic. But that is their job.
    You may also have the opportunity to switch to another compiler. That’s a nice thing about Ada: there is a compiler market, with some competition.
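
For case 2, the pragma mentioned above would look roughly like this; treat it as a hedged sketch (DEC Ada heritage syntax, as recognized by GNAT), since whether it changes anything at all is entirely compiler-specific:

   procedure Swap_Demo is
      generic
         type Element is private;
      procedure Swap (A, B : in out Element);

      pragma Inline_Generic (Swap);
      --  Request macro-expanded instances of Swap (DEC Ada heritage).
      --  GNAT accepts this pragma but ignores it, since it macro-expands
      --  all generics anyway; other compilers may honor, ignore, or
      --  reject it.

      procedure Swap (A, B : in out Element) is
         T : constant Element := A;
      begin
         A := B;
         B := T;
      end Swap;

      procedure Swap_Integers is new Swap (Element => Integer);

      X : Integer := 1;
      Y : Integer := 2;
   begin
      Swap_Integers (X, Y);  --  afterwards: X = 2, Y = 1
   end Swap_Demo;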

2 Likes

GID provides great functionality, and I use it. (The Read procedure of Image_IO is just a wrapper around GID for the only use I’ve had for it: reading from a file.) And I also strive for portability. But I’ve never tried to write portable code that was optimized for a specific compiler. It just seems like an unusual choice with competing priorities.

Obviously I have a different perspective…

Optimization and portability often sound contradictory in our minds, because we think of things like assembler snippets targeted at certain CPUs.
But it doesn’t always need to be the case.

An example is the BLAS routines, which are highly portable and contain hand-made loop unrolling (although, in a new implementation of linear algebra computations, I would rather let the compiler do that job).
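
For illustration, here is the kind of hand-made unrolling meant above, sketched in Ada rather than the original Fortran (it is not taken from any actual BLAS source, and it assumes both vectors share the same bounds):

   with Ada.Text_IO;

   procedure Unroll_Demo is
      type Real_Vector is array (Positive range <>) of Float;

      --  Dot product with a manually unrolled main loop (4 elements per
      --  pass) plus a clean-up loop for the remaining 0 .. 3 elements.
      function Dot (X, Y : Real_Vector) return Float is
         Sum : Float    := 0.0;
         I   : Positive := X'First;
      begin
         while I + 3 <= X'Last loop
            Sum := Sum
                 + X (I)     * Y (I)     + X (I + 1) * Y (I + 1)
                 + X (I + 2) * Y (I + 2) + X (I + 3) * Y (I + 3);
            I := I + 4;
         end loop;
         while I <= X'Last loop
            Sum := Sum + X (I) * Y (I);
            I := I + 1;
         end loop;
         return Sum;
      end Dot;

      V : constant Real_Vector (1 .. 5) := (1.0, 2.0, 3.0, 4.0, 5.0);
   begin
      Ada.Text_IO.Put_Line (Float'Image (Dot (V, V)));  --  5.50000E+01
   end Unroll_Demo;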

Another example is precisely GID.
The reason is that the use of generics as “smart macros” has an algorithmic effect: tests on invariants are moved out of the innermost loops, to a much outer level of the processing (see the sketch below).
It will work with all compilers providing macro-expansion.
In the worst case (your compiler doesn’t do macro-expanded instantiations), there is no speed improvement.
But this is unlikely to happen, given how the compiler landscape has evolved.
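
A simplified, hypothetical sketch of that effect (this is not GID’s actual code; all names are invented for illustration). In the first procedure, the invariant is tested once per pixel; in the generic one, it is fixed at instantiation, so a macro-expanding compiler specializes each instance and the per-pixel test disappears:

   with Ada.Text_IO;

   procedure Invariant_Demo is
      type Byte is mod 2 ** 8;
      type Row is array (Positive range <>) of Byte;

      --  Run-time parameter: the invariant is re-tested inside the loop.
      procedure Process_Dynamic (Pixels : in out Row; Grayscale : Boolean) is
      begin
         for I in Pixels'Range loop
            if Grayscale then
               Pixels (I) := Pixels (I) / 2;
            else
               Pixels (I) := 255 - Pixels (I);
            end if;
         end loop;
      end Process_Dynamic;

      --  Generic parameter: the invariant is known per instance, so the
      --  branch can be resolved at compile time in each specialized copy.
      generic
         Grayscale : Boolean;
      procedure Process (Pixels : in out Row);

      procedure Process (Pixels : in out Row) is
      begin
         for I in Pixels'Range loop
            if Grayscale then
               Pixels (I) := Pixels (I) / 2;
            else
               Pixels (I) := 255 - Pixels (I);
            end if;
         end loop;
      end Process;

      procedure Process_Gray  is new Process (Grayscale => True);
      procedure Process_Color is new Process (Grayscale => False);

      R : Row (1 .. 4) := (10, 20, 30, 40);
   begin
      Process_Gray (R);
      Process_Color (R);
      Process_Dynamic (R, Grayscale => True);
      Ada.Text_IO.Put_Line (Byte'Image (R (1)));  --  125
   end Invariant_Demo;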

  • The Ada Reference Manual states (since Ada 95!): “Generic units can be used to perform the role that macros sometimes play in other languages”.

  • Macro-expansion of generics is not an obscure internal detail of a certain compiler.
    This is a standard technique discussed, at least, in the AARM95 (the Annotated Ada 95 Reference Manual, available from the Ada Resource Association).

  • GNAT and ObjectAda use macro-expansion exclusively. So we already have the “big players” covered, don’t we?

  • If you meet a compiler that doesn’t provide it, at least optionally, again, it’s not the end of the world: it will still be able to compile GID, and the produced machine code will be able to decode images.
    The image decoding will not benefit from the performance bonus brought by macro-expansion in this context, but portability is not affected at all.

4 Likes

The use of canonical models to describe language constructs is common. Fixed-point types, for example, are often described using a canonical model of scaled integers. Other implementations are possible and acceptable. The same is true of macro expansion as a canonical model for generic instantiation; one should not read anything more into it.

The trouble with manual “optimizations” such as loop unrolling or inlining is that sometimes they are not optimizations, and sometimes they are the opposite. I have seen unrolled loops that were slower than the loop, and where inlining a subprogram resulted in noticeably slower execution. Under other circumstances, these same changes may have been faster. Since an Ada compiler was able to produce faster code than assembler hand optimized by a team of experts over 30 years ago, and presumably optimizers have improved since then, the smart decision is to leave optimization to the optimizer. Then you will typically get the best speed with all compilers for all platforms. Perversely, manual optimizations can sometimes prevent optimizations that would otherwise be possible.

The important factors are correctness and clarity. Correctness includes meeting timing requirements, but typically reusable libraries have no timing requirements; only the applications in which they are used have them. Since the developer of such a library does not know what compilers or platforms they will be used on, or for what applications, it is generally useless to try to optimize them. More important is making sure their specification is easy to understand and use.

This seems especially true of a decoding library. Decoding is usually associated with input; I have difficulty thinking of a use of GID that is not. Since input is significantly slower than the in-memory operations involved in decoding, even if you can reduce the decoding time by orders of magnitude, it’s unlikely to make a significant difference.

“[M]ost of the CPU time … was spent processing I/O requests … [which] so dominate the run time that … no other part of the program matters at all.” – Software Tools

1 Like

In the world of SSDs, I/O time is much less of a factor than it was 30 years ago (sub-10 µs read times these days vs. 10 ms in the old days). Some EEPROMs (FeRAM, for example) have ns or ps read times. Unfortunately, processors haven’t improved as fast as I/O has. I’ve had several projects where my I/O was blocked behind heavy computation, because the I/O is super fast. I don’t have experience in image decoding though, so I don’t know which way the pendulum swings here.

That doesn’t imply one should abandon designing for correctness first, but I think zertovitch has already designed for correctness first and is working on optimizing for a common platform, which is a good idea. I didn’t get the impression he was aiming to sacrifice correctness in his efforts.

1 Like

Always check your assumptions!
For instance, if I replace the calls to Load_Frame with Load_Frame_Dynamic_Parameters in GID.Decoding_PNG (actually disabling the optimization by generics), the benchmark program is 54% slower on the PNG images. And that run time includes fixed costs like input and output!

I agree that I/O is much faster than in 1976. I do a lot of I/O from USB hard disks, which are still quite slow. And hard disks are still common, and still slow. (EEPROMs are memory, not input.) SSDs continue to have limited numbers of writes. So assuming really fast I/O does not seem reasonable yet.

People have developed NVRAM that is as fast as volatile RAM and as durable as a hard disk. Presumably when it becomes competitive we will see systems where everything is memory, and there is no equivalent to disk I/O.

That sounds impressive. My tests are less so. I timed the call to Image_IO.Read in Pure_RGB. Read opens the file, reads it using GID, and closes it. I tested it with 2 PNG files supplied with GID: png_interlaced_bush.png and png_sparse_10k_x_10k.png. I then changed Load_Frame to only call Load_Frame_Dynamic_Parameters in GID.Decoding_PNG and reran them.

For png_interlaced_bush.png, the difference was 11.6%; for png_sparse_10k_x_10k.png, 28.8%. In both cases, the difference was so small compared to the overall running time for the program that I didn’t care about it.
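
For what it’s worth, a measurement like the one above can be done with Ada.Real_Time; the Decode_File procedure below is only a hypothetical stand-in for the call being timed (Image_IO.Read in the test described above):

   with Ada.Real_Time;  use Ada.Real_Time;
   with Ada.Text_IO;

   procedure Time_Decode is
      --  Hypothetical placeholder for the call being measured.
      procedure Decode_File (Name : String) is null;

      Start, Stop : Time;
   begin
      Start := Clock;
      Decode_File ("png_interlaced_bush.png");
      Stop := Clock;
      Ada.Text_IO.Put_Line
        ("Elapsed:" & Duration'Image (To_Duration (Stop - Start)) & " s");
   end Time_Decode;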

As an aside, I noticed in GID.Decoding_PNG the comment “The image itself is contained in the IDAT chunk(s).” If an image is in multiple chunks, may they be processed in parallel?

Just as a point of interest, they definitely can serve as input in various applications. I’ve worked on tons of projects where the only non-volatile memory was indeed EEPROM. I’ve even seen file systems implemented over them for things like data loggers. Keep in mind there are a lot of embedded applications out there that have unique requirements that standard hard drives or SD cards cannot handle.

It’s been a long time since I used PROMs, so maybe things have changed. But when I did, they required a special device to write your data to them; then they were Read-Only Memory (hence ROM) in your system at some range of addresses. We defined deferred constants with Address clauses to access the data.

It’s kind of a weird naming loophole, if you will. The cells are not technically writable and are only readable, but in an EEPROM you can erase them and reinitialize them (via a command, no external device needed). It’s a very loose use of the acronym PROM, but it caught on and became popular (EE means electrically erasable). In the old days this meant you had to be careful how many times you wrote to them, as you had limited erase cycles before the internals stopped reliably storing data. One example: a chip I work on has a small 256-byte EEPROM to store some configuration/settings data. You can only erase each byte up to 10,000 times or so before it is no longer reliable to use.

However, in the last 10-15 years, some new, faster and more reliable technologies have caught on, and a few of them have leapt into the EEPROM market (such as FeRAM). At a technical level they are not PROM (more like RAM), but what a lot of designers did was start building ICs that act like EEPROM, using standard EEPROM interfaces and standard EEPROM packages/pin-outs. Now quite a few are even marketed as EEPROMs. It’s caught on, and over time the term EEPROM has started to change from an acronym for a specific physical implementation to one that refers to the EEPROM interface standards used (kind of black-boxing the implementation). At work, we’ve actually X-rayed and opened up a few of the packages we got and found our EEPROMs were not actually even PROMs internally.

So the acronym’s meaning has morphed over time, at least in the embedded world.

Side note, one of the filesystems I saw was actually implemented over PROM technology. It used a mix of software algorithms to avoid erasing memory when possible to virtually extend the erase lifetime of the chips.

The image processing that follows the decoding can be arbitrarily long, so the image decoding time can be made as unimportant as desired relative to the overall run time.
However, in many cases, the processing is very short (even shorter than writing a new file) and consists in loading the images’ bitmaps into RAM (GID allows you to do it directly during the decoding).
In those cases, gains in decoding time are meaningful.

Regarding the PNG chunks: the IDAT chunks are consecutive slices of a single compressed (zlib/deflate) stream, so they have to be processed sequentially.