c - 64-bit code generated by GCC is 3 times slower than 32-bit

I have noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. This is a minimal test case.

    #include <stdlib.h>

    typedef unsigned char UINT8;

    void
    stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x

And this is how it runs:

    $ cc --version
    cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
    $ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
    user 14.166s
    $ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
    user 5.018s

As you can see, the 32-bit version runs around 3 times faster (I have tested on both 32-bit and 64-bit Ubuntu; the result is the same).

Even stranger, the performance depends on the C standard selected:

    $ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
    user 15.825s
    $ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
    user 5.090s

How can this be? How can I work around this to speed up the 64-bit version generated by GCC?

   Update 1  
I compared the assembly produced by the fast 32-bit builds (default and gnu99) with the slow one (c99) and found the following:

    .L5:
        movzbl  (%ebx,%eax), %edx    # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
        movl    %edx, (%esp)         # D.1543,
        fildl   (%esp)               #
        fmuls   (%esi,%eax,4)        # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
        addl    $1, %eax             #, x
        cmpl    %ecx, %eax           # xsize, x
        faddp   %st, %st(1)          #,
        fstps   12(%esp)             #
        flds    12(%esp)             #
        jne     .L5                  #,

The fast case has no fstps and flds instructions. So in the slow case GCC stores the value to memory and loads it back on every iteration. I tried declaring the variable as register float, but that does not help.

   Update 2  
I have tested GCC 4.9, and it seems to generate optimal code for 64-bit. Also, -ffast-math (suggested by @jp) fixes the -m32 -std=c99 case for both GCC versions. I am still looking for a solution for 64-bit on GCC 4.8, because 4.8 is currently more common than 4.9.

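There is a partial dependency stall in the code generated by older versions of GCC:

    movzbl   (%rsi,%rax), %r8d
    cvtsi2ss %r8d, %xmm0    ;; all upper bits in %xmm0 are a false dependency

cvtsi2ss writes only the low 32 bits of %xmm0 and leaves the upper bits unchanged, so each conversion has to wait for the previous value of the register. The dependency can be broken by zeroing the register first with xorps.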

    #ifdef __SSE__
    /* Zero the destination first so that cvtsi2ss does not depend on the
       previous contents of the register. */
    float __attribute__((always_inline)) i2f(int v) {
        float x;
        __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v));
        return x;
    }
    #else
    /* Without SSE, fall back to a plain cast. */
    float __attribute__((always_inline)) i2f(int v) { return (float) v; }
    #endif

    void stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x < xsize; x++) {
                ss += i2f(lineIn[x]) * kk[x];
            }
            lineOut[xx] = (UINT8) ss;
        }
    }

Results:

    $ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
    ./test64  4.07s user 0.00s system 99% cpu 4.070 total
    $ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
    ./test32  3.94s user 0.00s system 99% cpu 3.938 total

