I’ve created a GitHub repository describing my experiments in using Ada’s tasking tools to run for-loops across multiple CPU cores. There are a handful of example codes using rendezvous calls and protected objects. I’m posting the link here as it might be useful for others wanting to play this game.
You might also like to look at LWT, the Light Weight Threading library, from the ParaSail project. This is a much more serious attempt at this game than my little effort.
Thanks for posting this! It’s great to see this sort of use of parallelism.
Also thanks for the nice references to the LWT library. One goal of the LWT library is to avoid creating “heavy weight” tasks more frequently than necessary, and instead to reuse them. Your timings show that creating and finalizing tasks is probably not a significant time sink, so long as you do enough work with them. The other goal of the LWT library was to support the parallel execution of a heterogeneous collection of light-weight threads. It sounds like that is not too important in your application area, since you are mainly focused on parallel loops rather than something that might need a more general divide-and-conquer approach.
I have always been missing some benchmarks of Ada’s tasking system and its (performance) behaviour, and also of how it compares to other common parallelism models! LWT, afaik, uses OpenMP behind the scenes, which to me is the parallelism standard (for a single CPU chip). It is great to see that tasks perform quite well! So thank you very, very much for the tests.
Best regards,
Fer
P.S.: as a LaTeX aficionado, and seeing that you also use Cadabra, you just got a new follower ^^
Actually, LWT uses either OpenMP or a home-grown work-stealing scheduler (which is often faster than OpenMP). LWT has a “plug-in” architecture so you can insert other light-weight-thread schedulers underneath, by "with"ing the package that defines the scheduler. If you don’t plug in any of them, LWT just runs the threads sequentially.
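To give a feel for how the “plug it in by with-ing it” idea can work in Ada, here is a tiny toy sketch of the pattern (this is not the actual LWT code, whose plug-in interface is richer; the names Schedulers, Work_Stealing_Plugin and Plugin_Demo are made up for the illustration): a plug-in package registers itself when it is elaborated, so adding or removing a single “with” clause decides which scheduler is active.

```ada
--  Toy sketch only: NOT the LWT API.  Each unit would go in its own
--  file under GNAT.  The plug-in registers itself during elaboration,
--  which happens before the main procedure runs.

package Schedulers is
   type Kind is (Sequential, Work_Stealing);
   Active : Kind := Sequential;   --  default when nothing is plugged in
end Schedulers;

package Work_Stealing_Plugin is
   pragma Elaborate_Body;         --  force a body so registration runs
end Work_Stealing_Plugin;

with Schedulers;
package body Work_Stealing_Plugin is
begin
   Schedulers.Active := Schedulers.Work_Stealing;
end Work_Stealing_Plugin;

with Work_Stealing_Plugin;        --  this "with" alone selects the scheduler
with Schedulers;
with Ada.Text_IO;
procedure Plugin_Demo is
begin
   Ada.Text_IO.Put_Line (Schedulers.Kind'Image (Schedulers.Active));
end Plugin_Demo;
```

With the “with Work_Stealing_Plugin;” clause present the demo prints WORK_STEALING; remove it and the default Sequential remains, which is essentially the behaviour described above when no scheduler is plugged in.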
I have to admit that my understanding of LWT is rather limited. So my apologies if I’ve made some errors or omissions in my comments about LWT. I’m quite happy to make any changes that you think might be appropriate. Also, thanks for the nice comments in your first post :).
If I have correctly understood your code and results, there is no difference in performance between standard Ada tasking and whatever fancy library. Is that so?
I would also be thankful for any analysis of the implementations since, naively thinking, if the fancy library is backed by OS threads, then why on earth should anybody expect a result different from Ada tasking?
Some background: I implemented a job service in Ada in Simple Components and used it for parallel arbitrary-precision arithmetic. In particular, multiplication and modular exponentiation allow parallelization into multiple jobs. The benchmarking results were rather disappointing: a single-tasking Montgomery implementation beats the parallel algorithms (8 tasks) by a wide margin.
Interestingly, a lock-free implementation of the job queue performs worse than one based on Ada protected objects.
So I am very interested in a deeper dive into the issue.
Sorry for the late reply. I’m pleased that you found my stuff useful. Thank you. I was also surprised to see, on an Ada site, that you’re a fan of LaTeX and Cadabra :). You might find a few of my other GitHub repos useful (one is a tutorial on Cadabra, and another allows Cadabra code and results to be embedded in a LaTeX source).
My examples do not use any fancy libraries (other than the two examples that use the LWT library). The codes are very simple, using just standard rendezvous calls and protected objects. I’ve tried to explain my design in pdf/for-loop-tasking.pdf. That’s a summary of my own understanding of multitasking in Ada. I’m sure most of what’s in there is well known to most people in this group. There really is nothing new in my codes. They were written for one very particular case that I had (large-scale processing of arrays of floating-point data).
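For anyone who just wants the flavour of the idea without opening the repo, here is a stripped-down sketch of the rendezvous pattern (it is not the actual repo code, and names such as Parallel_Loop_Sketch and Worker are invented for the illustration): each worker task is handed the bounds of its slice of the array through an entry call, and the main program joins the workers simply by leaving the block that declares them.

```ada
--  Minimal sketch of the rendezvous pattern (not the repo code itself):
--  the array is carved into contiguous slices and each worker receives
--  its slice bounds through an entry call, then works independently.

with Ada.Text_IO;

procedure Parallel_Loop_Sketch is

   type Real_Array is array (Positive range <>) of Long_Float;

   Data : Real_Array (1 .. 100_000) := (others => 1.0);

   Num_Workers : constant := 4;
   Chunk       : constant Positive := Data'Length / Num_Workers;

   task type Worker is
      entry Start (First, Last : in Positive);
   end Worker;

   task body Worker is
      Lo, Hi : Positive;
   begin
      accept Start (First, Last : in Positive) do
         Lo := First;
         Hi := Last;
      end Start;
      --  The "loop body": here just a trivial per-element update.
      for I in Lo .. Hi loop
         Data (I) := 2.0 * Data (I) + 1.0;
      end loop;
   end Worker;

begin
   declare
      Workers : array (1 .. Num_Workers) of Worker;
   begin
      for W in Workers'Range loop
         Workers (W).Start
           (First => Data'First + (W - 1) * Chunk,
            Last  => (if W = Num_Workers then Data'Last
                      else Data'First + W * Chunk - 1));
      end loop;
      --  Leaving this block waits for every Worker task to terminate,
      --  so the updated Data is safe to use afterwards.
   end;
   Ada.Text_IO.Put_Line ("All slices processed.");
end Parallel_Loop_Sketch;
```

Handing each task a contiguous slice keeps the workers on disjoint regions of the array, so no further locking is needed for the element updates.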
Thanks for the explanation. My case is not much different; the jobs deal with large integer arrays. Instead of rendezvous I use a protected job queue. Each task runs an infinite loop that takes a job from the queue and then dispatches to the job’s Execute operation. Synchronization is performed by a protected entry call on the job’s status (completed/failed).
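In stripped-down form the scheme looks roughly like the sketch below (it is not the actual Simple Components code; Job_Queue_Sketch, Job_Status, Add_Job and so on are names invented for the illustration): workers loop over a protected queue, dispatch to each job’s Execute, and the submitter blocks on the job’s protected status entry.

```ada
--  Sketch of a protected job queue with dispatching jobs (not the
--  Simple Components implementation).  A null job is used here as a
--  shut-down signal, only so that the example terminates cleanly.
with Ada.Text_IO;
with Ada.Containers.Doubly_Linked_Lists;

procedure Job_Queue_Sketch is

   --  Per-job status that the submitter can wait on.
   protected type Job_Status is
      entry Wait_Done;                        --  blocks until Mark is called
      procedure Mark (Success : in Boolean);
   private
      Finished : Boolean := False;
      OK       : Boolean := False;
   end Job_Status;

   protected body Job_Status is
      entry Wait_Done when Finished is
      begin
         null;
      end Wait_Done;
      procedure Mark (Success : in Boolean) is
      begin
         Finished := True;
         OK       := Success;
      end Mark;
   end Job_Status;

   --  Abstract job: derive from it and override Execute.
   type Job is abstract tagged limited record
      Status : Job_Status;
   end record;
   procedure Execute (Self : in out Job) is abstract;
   type Job_Ptr is access all Job'Class;

   package Job_Lists is new Ada.Containers.Doubly_Linked_Lists (Job_Ptr);

   --  The protected job queue; Get blocks until a job is available.
   protected Queue is
      procedure Put (J : in Job_Ptr);
      entry Get (J : out Job_Ptr);
   private
      Pending : Job_Lists.List;
   end Queue;

   protected body Queue is
      procedure Put (J : in Job_Ptr) is
      begin
         Pending.Append (J);
      end Put;
      entry Get (J : out Job_Ptr) when not Pending.Is_Empty is
      begin
         J := Pending.First_Element;
         Pending.Delete_First;
      end Get;
   end Queue;

   --  Workers: take a job, dispatch to Execute, record the outcome.
   task type Worker;
   task body Worker is
      J : Job_Ptr;
   begin
      loop
         Queue.Get (J);
         exit when J = null;                  --  shut-down signal
         begin
            J.Execute;                        --  dispatching call
            J.Status.Mark (Success => True);
         exception
            when others =>
               J.Status.Mark (Success => False);
         end;
      end loop;
   end Worker;

   Workers : array (1 .. 4) of Worker;

   --  A trivial concrete job, standing in for real big-number work.
   type Add_Job is new Job with record
      A, B, Result : Long_Integer := 0;
   end record;
   overriding procedure Execute (Self : in out Add_Job);
   overriding procedure Execute (Self : in out Add_Job) is
   begin
      Self.Result := Self.A + Self.B;
   end Execute;

   My_Job : aliased Add_Job;

begin
   My_Job.A := 2;
   My_Job.B := 40;
   Queue.Put (My_Job'Access);
   My_Job.Status.Wait_Done;                   --  synchronise on completion
   Ada.Text_IO.Put_Line ("Result =" & Long_Integer'Image (My_Job.Result));
   for W in Workers'Range loop                --  one null job per worker
      Queue.Put (null);
   end loop;
end Job_Queue_Sketch;
```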
I had hoped that the term “light-weight threading” had some substance, i.e. not just user-managed threads, but something really faster.
Yes, modern OpenMP and OpenACC allow for device offload. I wish that the Parallel directive in Ada had been more tightly integrated with those “frameworks”… There is no reason why it cannot be, but it is a lot of work. From what I saw in LWT, it just calls OpenMP procedures directly (importing the __omp_XXX entry points); it does not abstract itself over the OpenMP model. Therefore, the OpenMP model that LWT uses is a very, very limited form of OpenMP.
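Just to show the kind of direct binding I mean, here is a minimal sketch (these are standard query routines of the OpenMP runtime, not LWT’s actual declarations, and the unit name OMP_Binding_Sketch is made up): the runtime routines are plain C functions, so Ada can import and call them directly.

```ada
--  Minimal sketch of calling the OpenMP runtime from Ada via Import.
--  Needs the OpenMP runtime on the link line, e.g. something like
--  gnatmake omp_binding_sketch.adb -largs -fopenmp

with Ada.Text_IO;
with Interfaces.C;

procedure OMP_Binding_Sketch is
   use Interfaces.C;

   --  Two entry points from the OpenMP runtime library (libgomp/libomp).
   function omp_get_max_threads return int
     with Import, Convention => C, External_Name => "omp_get_max_threads";

   function omp_get_thread_num return int
     with Import, Convention => C, External_Name => "omp_get_thread_num";

begin
   Ada.Text_IO.Put_Line
     ("max threads =" & int'Image (omp_get_max_threads)
      & ", current thread =" & int'Image (omp_get_thread_num));
end OMP_Binding_Sketch;
```

Importing entry points like these only covers the library side of OpenMP; the directive/pragma side needs compiler support, which is exactly why this style of usage stays a limited form of OpenMP.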