IBM POWER9 SIMD support? (Was: SIMDebian: ...)




Hi Mo,

I love your footnotes.

My understanding is that IBM's "POWER9" CPUs

    1.) have SIMD instructions[1] and

    2.) are used by the new, and very cool, open
        *hardware* Talos II workstations[2], which

    3.) already run Debian.

FYI,
Kingsley

[1] POWER9
    https://en.wikipedia.org/wiki/POWER9

[2] Introducing Talos II
    https://www.raptorcs.com/TALOSII/

On 02/08/2019 16:25, Mo Zhou wrote:
> Hi folks,
> 
> For most programs the "-march=native" option is not expected to bring any
> significant performance improvement. However, for some scientific applications
> this does not hold. When I was creating the TensorFlow Debian
> package, I observed a significant performance gap between generic code and
> kabylake (Intel 7XXX Series) code[1].
> 
> The significant improvement in performance basically stems from the Eigen
> library (a header-only numerical linear algebra library). Here is a simple
> example[2] demonstrating the performance gap[3] between different ISA
> baselines. (Elapsed time is roughly measured with "perf stat ...".)
> 
> Having seen such interesting results, I immediately created a partial Debian
> fork named SIMDebian (SIMD + Debian)[0]. It makes great sense for some
> applications due to the significant performance gain brought by SIMD code.
> Currently this partial fork is still at a very early stage, and it needs
> 
>   * More experience with software that benefits a lot from SIMD code
>     (e.g. Which packages would potentially benefit from SIMD code?)
>   * Suggestions and comments
>     (e.g. Is such a partial fork really useful and valuable?)
>   * More people interested in this
> 
> SIMDebian is only a PARTIAL fork, which means that it only takes care of
> packages that would obviously benefit from SIMD code, because no performance
> gain is expected for the majority of packages in the Debian archive.
> 
> Generally speaking, in order to bump the ISA baseline for a given package, one
> could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules.
> However, SIMDebian employs a more economical approach to this end: forking
> dpkg[5] and injecting the -march=xxx flag into the system default flag list.
> With the resulting dpkg package, most Debian packages could be rebuilt with a
> bumped ISA baseline without any code modification.
> 
> I think the Debian Science team is interested in this partial fork as well. In
> the past there was a closely related GSoC project[4] (if I remember correctly,
> I raised the topic that led to the creation of that GSoC project). However, for
> some reason (I forgot it) it didn't start.
> 
> This is the first time I have tried to fork Debian, and I have no experience
> in running a fork. I especially need comments from the Debian Science Team.
> Any response/pointer would be much appreciated!
> 
> P.S. SIMDebian has an alias: SIGILLbian (SIGILL + Debian).
> -------------------------------------------------------------------------------
> 
> [0] https://github.com/SIMDebian/SIMDebian
> 
> [1] https://github.com/SIMDebian/SIMDebian/blob/master/benchmarks/tensorflow.md
> 
> [2] ```c++
> #include <iostream>
> #include <Eigen/Dense>
> using namespace std;
> 
> #define N 4096
> int main(void)
> {
> 	auto A = Eigen::MatrixXd::Random(N, N);
> 	auto B = Eigen::MatrixXd::Random(N, N);
> 	auto C = A * B;
> 	//cout << A << endl << B << endl << C << endl;
> 	(void) C(0,0);
> 	return 0;
> }
> ```
> 
> [3] ``` (command-line) (perf-stat-elapsed-time)
> CPU: Intel I5-7440HQ
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake \
> 	-DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.275162977 (seconds)
> 
> g++ a.cc -I/usr/include/eigen3 -O2 \
> 	-DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
> 1.382608279
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake -fopenmp
> 1.460047514
> 
> g++ a.cc -I/usr/include/eigen3 -O3 -march=skylake -fopenmp
> 1.313478657
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=haswell -fopenmp
> 1.334523068
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=sandybridge -fopenmp
> 1.988947143
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=nehalem -fopenmp
> 3.099827038
> 
> g++ a.cc -I/usr/include/eigen3 -O2 -march=x86-64 -fopenmp
> 3.106337852
> 
> However, please note that Eigen's fastest result is still much slower
> than OpenBLAS, even if Eigen called MKL:
> 
> ~ ❯❯❯ julia -e 'A = rand(Float64, 4096, 4096); A*A; @time A*A;'
>   1.011168 seconds (6 allocations: 128.000 MiB, 2.69% gc time)
> 
> BLAS optimization is another story. Omitted here.
> ```
> 
> [4] https://wiki.debian.org/SummerOfCode2017/Projects/Benchmarking
> 
> [5] https://github.com/SIMDebian/dpkg
>     Currently this fork targets "haswell" due to the availability of AVX2.
>     Only a minor modification to my patch is required to further bump the
>     baseline to e.g. icelake (AVX512).
> 

-- 
Time is the fire in which we all burn.