Performance: thread placement, NUMA, AMD, etc.

Jure Pečar asked 4 years ago

Hi,
I did some analysis on my models imported from XFLR5, and using MKL is indeed a great improvement. Solving the matrix is now done in seconds, but preparing the matrix is still single-threaded and takes (at least in my case) 10-20x as long as the solve. Could you make the matrix preparation parallel as well? I'd recommend using OpenMP. I'm willing to help with this if you give me access to the source.
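To illustrate what I mean, here is a minimal OpenMP sketch in C. I haven't seen the source, so the loop structure is an assumption, and compute_influence() is a hypothetical stand-in for the actual panel-influence kernel:

    #include <stddef.h>

    /* Hypothetical kernel: influence coefficient of panel j on panel i. */
    extern double compute_influence(int i, int j);

    void build_matrix(double *A, int n)
    {
        /* Assuming each coefficient depends only on (i, j), the rows are
           independent and the outer loop parallelizes trivially. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                A[(size_t)i * n + j] = compute_influence(i, j);
    }

Compile with -fopenmp (gcc/clang) or the equivalent flag of your compiler.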
Next, I noticed that on systems with more than one NUMA domain, the threads are by default all placed in the same domain. I'll investigate which MKL environment variables must be set to spread them across all sockets and make the best use of the available memory bandwidth. This also requires that the data structures be initialized accordingly; without looking at the source, I'll run some tests to determine whether this is the case.
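For the data-structure side, this is the usual first-touch pattern, sketched below under the assumption of Linux's default first-touch page placement (alloc_matrix_first_touch() is my own illustrative name):

    #include <stdlib.h>

    /* Allocate the matrix and first-touch it in parallel, with the same
       static schedule as the compute loops, so that each memory page lands
       in the NUMA domain of the thread that will later use it. Spreading
       the threads themselves across sockets is then a matter of e.g.
       OMP_PLACES=cores OMP_PROC_BIND=spread, or of running the program
       under "numactl --interleave=all" without touching the code. */
    double *alloc_matrix_first_touch(int n)
    {
        double *A = malloc((size_t)n * n * sizeof *A);
        if (!A) return NULL;

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                A[(size_t)i * n + j] = 0.0;

        return A;
    }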
Lastly, you can point people with AMD CPUs to this page: https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/ … MKL works just fine on them once you bypass its CPU detection logic.
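For reference, the workaround described in that article boils down to one environment variable; setting it before the first MKL call should be equivalent to exporting it in the shell (force_mkl_avx2_path() is my name, and the switch was reportedly removed in MKL 2020 Update 1, so it only affects older MKL builds):

    #include <stdlib.h>

    /* Undocumented switch: forces the AVX2 code path on AMD CPUs in
       older MKL builds; ignored by MKL 2020 Update 1 and later. */
    void force_mkl_avx2_path(void)
    {
        setenv("MKL_DEBUG_CPU_TYPE", "5", 1);  /* POSIX setenv */
    }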
Thanks for the great product and I hope it gets developed further.

André Deperrois (techwinder) Staff replied 4 years ago

Hi Jure,
Thanks for the feedback. I’m a bit surprised, however, because the construction of the matrix and of the RHS vectors is parallelized. A stupid question, but are you sure that multi-threading is activated in the preferences?
The question about the AMD processors raised my curiosity. Now that I have a machine with an AMD processor, I thought I’d give it a try. The results are good, although not quite as good as might have been expected given the higher core count and processor frequency. See the third graph on this page:
https://flow5.tech/docs/flow5_doc/Performance.html
So either the incompressible overhead tasks now account for most of the analysis time, or the MKL library is slightly less performant on AMD than on Intel processors. I’m not sure which explanation is correct; however, this doesn’t seem very important, because it is now possible to work comfortably with large matrix sizes on either configuration.
Finally, thanks for the link; however, I don’t think I’ll experiment with the undocumented configuration flags, because the intent is to prioritize application stability over full optimization.
André