
SIMDebian: Debian Partial Fork with Radical ISA Baseline

Hi folks,

For most programs the "-march=native" option is not expected to bring any
significant performance improvement. However, for some scientific applications
this does not hold. When I was creating the TensorFlow Debian package, I
observed a significant performance gap between generic code and kabylake
(Intel 7XXX Series) code[1].

The performance improvement stems mainly from the Eigen library (a header-only
numerical linear algebra library). Because Eigen is header-only, its kernels
are compiled with whatever flags the including package is built with. Here is
a simple example[2] demonstrating the performance gap[3] between different ISA
baselines. (Elapsed time is roughly measured with "perf stat ...".)
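
To check which baseline a given build actually received, one can look at the
compiler's predefined macros. The following is only a minimal sketch, assuming
GCC or Clang on x86 (both define macros such as __AVX2__ when -march enables
the corresponding instruction set). The helper name enabled_simd_sets is made
up for illustration; it is not part of Eigen, dpkg or SIMDebian.

```c++
#include <string>

// Hypothetical helper (not from Eigen or SIMDebian): lists the SIMD
// extensions the compiler was allowed to use for this translation unit,
// based on the macros GCC/Clang predefine for the selected -march.
std::string enabled_simd_sets()
{
	std::string s;
#ifdef __SSE2__
	s += "SSE2 ";
#endif
#ifdef __SSE4_2__
	s += "SSE4.2 ";
#endif
#ifdef __AVX__
	s += "AVX ";
#endif
#ifdef __AVX2__
	s += "AVX2 ";
#endif
#ifdef __FMA__
	s += "FMA ";
#endif
#ifdef __AVX512F__
	s += "AVX512F ";
#endif
	if (s.empty())
		s = "none";
	return s;
}
```

Printing the result from a program built with -march=haswell should list AVX2
and FMA, while a plain -march=x86-64 build should stop at SSE2.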

Having seen such interesting results, I immediately created a Debian partial
fork named SIMDebian (SIMD + Debian)[0]. It makes great sense for some
applications due to the significant performance gain brought by SIMD code.
Currently this partial fork is still at a very early stage, and it needs:

  * More experience with software that benefits a lot from SIMD code
    (e.g. Which packages would potentially benefit from SIMD code?)
  * Suggestions and comments
    (e.g. Is such a partial fork really useful and valuable?)
  * More people interested in this

SIMDebian is only a PARTIAL fork, which means that it only takes care of
packages that would obviously benefit from SIMD code, because no performance
gain is expected for the majority of packages in the Debian archive.

Generally speaking, in order to bump the ISA baseline for a given package, one
could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules.
However, SIMDebian employs a more economical approach to this end: forking
dpkg[5] and injecting the -march=xxx flag into the system default flag list.
With the resulting dpkg package, most Debian packages can be rebuilt with a
bumped ISA baseline without any code modification.

I think the Debian Science team is interested in this partial fork as well. In
the past there was a closely related GSoC project[4] (if my fuzzy memory is
right, the topic that led to the creation of that GSoC project was raised by
me). However, for some reason (I forgot which) it didn't start.

This is the first time I have tried to fork Debian, and I have no experience
in running a fork. I need comments, especially from the Debian Science Team.
Any response/pointer would be much appreciated!

P.S. SIMDebian has an alias: SIGILLbian (SIGILL + Debian).

[0] https://github.com/SIMDebian/SIMDebian

[1] https://github.com/SIMDebian/SIMDebian/blob/master/benchmarks/tensorflow.md

[2] ```c++
#include <iostream>
#include <Eigen/Dense>
using namespace std;

#define N 4096
int main(void)
{
	auto A = Eigen::MatrixXd::Random(N, N);
	auto B = Eigen::MatrixXd::Random(N, N);
	auto C = A * B;
	//cout << A << endl << B << endl << C << endl;
	// Access one coefficient so that the product is actually evaluated.
	(void) C(0,0);
	return 0;
}
```

[3] ``` (command-line) (perf-stat-elapsed-time)
CPU: Intel I5-7440HQ

g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake \
	-DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
1.275162977 (seconds)

g++ a.cc -I/usr/include/eigen3 -O2 \
	-DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt

g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake -fopenmp

g++ a.cc -I/usr/include/eigen3 -O3 -march=skylake -fopenmp

g++ a.cc -I/usr/include/eigen3 -O2 -march=haswell -fopenmp

g++ a.cc -I/usr/include/eigen3 -O2 -march=sandybridge -fopenmp

g++ a.cc -I/usr/include/eigen3 -O2 -march=nehalem -fopenmp

g++ a.cc -I/usr/include/eigen3 -O2 -march=x86-64 -fopenmp

However, please note that Eigen's fastest result is still noticeably slower
than OpenBLAS (as used by Julia below), even when Eigen calls MKL:

~ ❯❯❯ julia -e 'A = rand(Float64, 4096, 4096); A*A; @time A*A;'
  1.011168 seconds (6 allocations: 128.000 MiB, 2.69% gc time)

BLAS optimization is another story. Omitted here.
```

[4] https://wiki.debian.org/SummerOfCode2017/Projects/Benchmarking

[5] https://github.com/SIMDebian/dpkg
    Currently this fork targets "haswell" because of the availability of
    AVX2. Only a minor modification to my patch is required to bump the
    baseline further, e.g. to icelake (AVX512).
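
    Binaries built for a bumped baseline die with SIGILL on older CPUs, so
    it can be useful to test the machine first. The following is only a
    minimal sketch, assuming GCC or Clang (which provide
    __builtin_cpu_supports on x86). The helper name
    cpu_meets_haswell_baseline is hypothetical, and the three features
    checked are a representative subset of what -march=haswell enables,
    not the complete list.

```c++
// Hypothetical helper (not part of dpkg or SIMDebian): returns true when
// the CPU running this process reports AVX2, FMA and BMI2, three of the
// extensions a "haswell" baseline build may use.
// Assumes GCC or Clang; on non-x86 targets it conservatively returns false.
bool cpu_meets_haswell_baseline()
{
#if defined(__x86_64__) || defined(__i386__)
	// Populate the CPU feature data before querying it.
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2")
	    && __builtin_cpu_supports("fma")
	    && __builtin_cpu_supports("bmi2");
#else
	return false;
#endif
}
```

    Such a check would have to live in code built for the generic baseline
    (e.g. a small launcher or a maintainer script), since a binary that was
    itself built with -march=haswell may crash before it reaches the check.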