c - 64-bit code generated by GCC is 3 times slower than 32-bit


I've found that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. This is a minimal test case.

    #include <stdlib.h>

    typedef unsigned char UINT8;

    void stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x …

And here is how it runs:

    $ cc --version
    cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
    $ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
    user    14.166s
    $ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
    user    5.018s

As you can see, the 32-bit version runs around 3 times faster (I tested on both 32-bit and 64-bit Ubuntu, with the same result).

Even stranger, the performance depends on the C standard selected:

    $ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
    user    15.825s
    $ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
    user    5.090s

How can this be? How can I work around this to speed up the 64-bit code generated by GCC?
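The test case listing above is cut off in this copy, so here is a minimal sketch of a reproducible kernel plus driver. The body of `stretch` follows the complete version of the function shown later in the thread (without the `i2f` helper). The `run_stretch` driver, its test data, and its return value are assumptions made for illustration and are not from the original post.

```c
#include <stdlib.h>

typedef unsigned char UINT8;

/* Kernel under test: for every output byte, accumulate a weighted sum
   of the whole input line in a float and truncate it to a byte. */
void stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
{
    int xx, x;

    for (xx = 0; xx < xsize; xx++) {
        float ss = 0.0f;
        for (x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}

/* Hypothetical driver (not from the post): runs the kernel `repeats`
   times and returns the sum of the output bytes, or -1 if an
   allocation fails. */
long run_stretch(int xsize, int repeats)
{
    UINT8 *lineIn = malloc(xsize);
    UINT8 *lineOut = calloc(xsize, 1);
    float *kk = malloc(xsize * sizeof *kk);
    long sum = -1;
    int i;

    if (lineIn && lineOut && kk) {
        for (i = 0; i < xsize; i++) {
            lineIn[i] = 1;            /* arbitrary test data */
            kk[i] = 1.0f / xsize;     /* weights sum to roughly 1 */
        }
        for (i = 0; i < repeats; i++)
            stretch(lineOut, lineIn, xsize, kk);
        sum = 0;
        for (i = 0; i < xsize; i++)
            sum += lineOut[i];
    }
    free(lineIn);
    free(lineOut);
    free(kk);
    return sum;
}
```

Calling `run_stretch` from a `main` with a large `xsize` and timing the `-m32` and `-m64` builds should approximate the measurement described above. The returned sum is only there so the compiler cannot discard the work.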

Update 1

I compared the assembly generated for the fast 32-bit builds (default and gnu99) with the slow one (c99) and found the following in the slow case:

    .L5:
        movzbl  (%ebx,%eax), %edx    # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
        movl    %edx, (%esp)         # D.1543,
        fildl   (%esp)               #
        fmuls   (%esi,%eax,4)        # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
        addl    $1, %eax             #, x
        cmpl    %ecx, %eax           # xsize, x
        faddp   %st, %st(1)          #,
        fstps   12(%esp)             #
        flds    12(%esp)             #
        jne     .L5                  #,

The fast case has no fstps and flds instructions. So in the slow case GCC stores the value to memory and loads it back on every iteration. I tried declaring the variable as register float, but it does not help.
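A likely explanation, which the post itself does not state: the x87 unit computes in extended precision, and with a strict conformance option such as -std=c99 GCC defaults to -fexcess-precision=standard, so the accumulator has to be rounded to float after every assignment. On x87 that rounding is done with a store and reload. The sketch below only illustrates the two semantics in portable C. The function names are hypothetical, and `volatile` is used here merely to force the per-step store and reload.

```c
/* Rounds the running sum to single precision after every addition.
   The volatile qualifier forces a store and reload of the accumulator,
   which mirrors what the fstps/flds pair does in the slow listing. */
float dot_round_each_step(const unsigned char *in, const float *kk, int n)
{
    volatile float ss = 0.0f;
    int x;

    for (x = 0; x < n; x++)
        ss = ss + in[x] * kk[x];
    return ss;
}

/* Keeps the running sum in a wider type and rounds only once at the
   end, similar in spirit to leaving the value on the x87 stack. */
float dot_round_once(const unsigned char *in, const float *kk, int n)
{
    double ss = 0.0;
    int x;

    for (x = 0; x < n; x++)
        ss += (double) in[x] * kk[x];
    return (float) ss;
}
```

For inputs whose partial sums are exactly representable in single precision, both functions return the same value. They can differ in the last bits when intermediate sums need more precision than a float provides.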

Update 2

I have tested GCC 4.9, and it seems to generate optimal code for 64-bit. Also, -ffast-math (suggested by @jp) fixes the -m32 -std=c99 case for both GCC versions. I am still looking for a solution for 64-bit on GCC 4.8, because it is still more widely used than 4.9.

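There is a partial dependency stall in the code generated by older versions of GCC:

    movzbl   (%rsi,%rax), %r8d
    cvtsi2ss %r8d, %xmm0    ;; all upper bits of %xmm0 are a false dependency

cvtsi2ss writes only the low 32 bits of its destination, so the instruction has to wait for whatever last wrote %xmm0. The dependency can be broken by zeroing the register with xorps first.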

    #ifdef __SSE__
    /* Zero the destination with xorps first so that cvtsi2ss does not
       depend on the previous contents of the XMM register. */
    static inline float __attribute__((always_inline)) i2f(int v)
    {
        float x;
        __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v));
        return x;
    }
    #else
    /* Fallback without SSE: a plain conversion. */
    static inline float __attribute__((always_inline)) i2f(int v)
    {
        return (float) v;
    }
    #endif

    void stretch(UINT8 *lineOut, UINT8 *lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x < xsize; x++) {
                ss += i2f(lineIn[x]) * kk[x];
            }
            lineOut[xx] = (UINT8) ss;
        }
    }

Results:

    $ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
    ./test64  4.07s user 0.00s system 99% cpu 4.070 total
    $ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
    ./test32  3.94s user 0.00s system 99% cpu 3.938 total
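As an alternative to inline assembly, the same idea can be sketched with SSE intrinsics from `<xmmintrin.h>`: convert into an explicitly zeroed vector so the result does not depend on earlier register contents. This variant is not from the thread and has not been benchmarked here. Whether the compiler actually emits an xorps before cvtsi2ss is an assumption that should be checked in the generated assembly. The names `i2f_intrin` and `weighted_sum` are hypothetical.

```c
#if defined(__SSE__)
#include <xmmintrin.h>

/* Convert an int to float through a freshly zeroed XMM value, so the
   conversion has no input other than v. */
static inline float i2f_intrin(int v)
{
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), v));
}
#else
/* Fallback without SSE: a plain conversion. */
static inline float i2f_intrin(int v)
{
    return (float) v;
}
#endif

/* Inner loop of the kernel, using the helper for the conversion. */
float weighted_sum(const unsigned char *lineIn, const float *kk, int xsize)
{
    float ss = 0.0f;
    int x;

    for (x = 0; x < xsize; x++)
        ss += i2f_intrin(lineIn[x]) * kk[x];
    return ss;
}
```

Compiling with `-O2 -S` and inspecting the loop is the simplest way to confirm whether the zeroing instruction is present for a given compiler version.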
