
Re: Bits from /me: A humble draft policy on "deep learning v.s. freedom"




Hi PICCA,

On 2019-05-24 12:01, PICCA Frederic-Emmanuel wrote:
> What about ibm power9 with pocl ?
> 
> it seems that this is better than the latest NVIDIA GPU.

Training neural networks is typically dominated by linear
operations such as general matrix-matrix multiplication (GEMM)
and convolution.

I know nothing about pocl, but it is hard for a CPU to beat
a GPU at these highly parallelizable linear operations.
Try a 4096x4096 matrix multiplication and you will easily see
the difference.
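
To give a feel for the size of that workload, here is a rough count of
the arithmetic in one such product. It is only a sketch: it assumes the
textbook algorithm (n multiplications and n - 1 additions per output
element), and the function name matmul_flops is made up for this
illustration. Real BLAS libraries may organize the work differently.

```python
def matmul_flops(n: int) -> int:
    """Floating-point operations for a dense n x n by n x n product,
    assuming the textbook algorithm: each of the n*n output elements
    needs n multiplications and n - 1 additions."""
    return n * n * (2 * n - 1)


if __name__ == "__main__":
    n = 4096
    flops = matmul_flops(n)
    # Roughly 2 * n**3, i.e. about 137 billion operations for n = 4096.
    print(f"{n}x{n} matmul: {flops:,} floating-point operations")
```

All of these operations are independent multiply-adds over different
output elements, which is why the job spreads so well across many
parallel arithmetic units.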

E.g. my CPU is an i5-7440HQ (a mid-range mobile CPU) and my GPU is
an Nvidia 940MX (junk).
In the session below, the junk GPU (CUDA) comes out more than 100x
faster than my CPU (MKL).

~ ❯❯❯ optirun ipython3
Python 3.7.3 (default, Apr  3 2019, 05:39:12) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch as th

In [2]: x = th.rand(4096, 4096)

In [3]: %time x@x
CPU times: user 1.65 s, sys: 38.7 ms, total: 1.69 s
Wall time: 449 ms
Out[3]: 
tensor([[1015.7596, 1004.2767, 1001.6245,  ..., 1026.8447,  996.3105,
         1002.7847],
        [1047.8833, 1014.3856, 1020.8246,  ..., 1055.3224, 1021.6126,
         1031.0334],
        [1049.3168, 1027.7637, 1030.9961,  ..., 1054.3218, 1015.3804,
         1031.6709],
        ...,
        [1039.6516, 1024.6678, 1021.1326,  ..., 1047.0674, 1015.1402,
         1029.5969],
        [1020.1988,  994.0073, 1005.5823,  ..., 1015.6786,  990.2491,
         1008.1358],
        [1022.9388,  991.9886,  990.4608,  ..., 1013.9000,  998.8676,
         1007.8554]])

In [4]: x = x.cuda()

In [5]: %time x@x
CPU times: user 1.1 ms, sys: 174 µs, total: 1.27 ms
Wall time: 2.67 ms
Out[5]: 
tensor([[1015.7591, 1004.2764, 1001.6254,  ..., 1026.8447,  996.3105,
         1002.7841],
        [1047.8838, 1014.3846, 1020.8243,  ..., 1055.3209, 1021.6123,
         1031.0328],
        [1049.3174, 1027.7644, 1030.9971,  ..., 1054.3210, 1015.3800,
         1031.6727],
        ...,
        [1039.6511, 1024.6686, 1021.1323,  ..., 1047.0674, 1015.1404,
         1029.5974],
        [1020.1982,  994.0067, 1005.5826,  ..., 1015.6784,  990.2482,
         1008.1347],
        [1022.9395,  991.9879,  990.4588,  ..., 1013.9014,  998.8687,
         1007.8544]], device='cuda:0')
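
One caveat when reproducing this: PyTorch launches CUDA kernels
asynchronously, so timing an unsynchronized x@x on the GPU may mostly
measure the launch rather than the computation. Taken at face value,
about 2 * 4096^3 operations in 2.67 ms would be roughly 51 TFLOP/s,
which looks too high for a low-end mobile GPU. Below is a minimal
sketch of a timing helper that waits for the device before and after
each run. The names wall_time and compare_cpu_gpu are made up for this
example, and the number of repeats is an arbitrary choice.

```python
import time


def wall_time(fn, sync=None, repeats=3):
    """Best wall-clock time in seconds of fn() over `repeats` runs.

    If `sync` is given, it is called before the clock starts and again
    before it stops, so pending asynchronous work is included.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()  # finish any earlier queued work first
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait until this run has really completed
        best = min(best, time.perf_counter() - start)
    return best


def compare_cpu_gpu(n=4096):
    """Time an n x n matmul on the CPU and, if available, on CUDA."""
    import torch as th  # imported here so wall_time has no dependencies

    x = th.rand(n, n)
    print("cpu :", wall_time(lambda: x @ x), "s")
    if th.cuda.is_available():
        y = x.cuda()
        print("cuda:", wall_time(lambda: y @ y, sync=th.cuda.synchronize), "s")
```

Calling compare_cpu_gpu() on a machine with PyTorch installed prints
both timings. The GPU should still come out ahead on this workload,
though the synchronized ratio may be smaller than the one shown above.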