[Re:] Leaks about the specifications of the AMD FX processors

#100 P.H. (senior member), in reply to hugo chávez #98

Reply, 2011-07-17 23:16:14 #100
If you look at it in such a simplified way, then yes, for 256-bit AVX that is all there is. But not everything is black and white. As the article you linked says:
"When Intel introduced SSE2 in the P4, each 128-bit instruction was cracked into two 64-bit uops, and the throughput did not substantially improve. This created a chicken and egg problem: Intel wanted developers to use SSE2 (since the P4 was not designed to execute x87 particularly fast), but developers do not want to rewrite or recompile code for a marginal gain.
Sandy Bridge can sustain a full 16 single precision FLOP/cycle or 8 double precision FLOP/cycle – double the capabilities of Nehalem. This guarantees that software which uses AVX will actually see a substantial performance advantage on Sandy Bridge and should spur faster adoption. Intel seems to have learned from the lessons of SSE2 and hopefully, the uptake for AVX amongst the software community will be far swifter."
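The FLOP/cycle figures in the quote can be reproduced with a bit of arithmetic. This is only a sketch: the function name and parameters are made up for illustration, and it assumes one add pipe and one multiply pipe, each retiring one full-width vector operation per cycle.

```python
def peak_flops_per_cycle(vector_bits, element_bits, add_pipes=1, mul_pipes=1):
    """Peak FLOP/cycle if every pipe retires one full-width vector op per cycle.

    Hypothetical helper, not taken from any vendor documentation.
    """
    lanes = vector_bits // element_bits      # elements per vector register
    return lanes * (add_pipes + mul_pipes)   # one FLOP per lane per pipe


# 256-bit vectors (the Sandy Bridge AVX case in the quote)
print(peak_flops_per_cycle(256, 32))  # single precision: 8 lanes * 2 pipes = 16
print(peak_flops_per_cycle(256, 64))  # double precision: 4 lanes * 2 pipes = 8

# 128-bit vectors with the same pipe layout give half of that
print(peak_flops_per_cycle(128, 32))  # 8
print(peak_flops_per_cycle(128, 64))  # 4
```

The 128-bit results are half of the 256-bit ones, which matches the "double the capabilities of Nehalem" statement under these assumptions.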
Both sides started from a 128-bit FPU with separate FADD and FMUL execution units. Each had to decide where to invest the very large number of extra transistors (and the extra power they consume):
- AMD put the emphasis on 128-bit execution and on existing programs: they placed two nearly identical FADD+FMUL execution units in the FPU, exactly as the K7-K10 family has 3 nearly identical ALU+AGU pairs. As a result it makes no difference what the ratio of FADD-type to FMUL-type instructions in a program is (until now it mattered a great deal). They topped this off by making register-to-register value copies take 0 clock cycles: the register file handles them on its own (4 per clock, if the information is correct). Most of these copies become unnecessary under AVX, but SSEx code contains quite a lot of them, because each operation overwrites one of its operands.
It does not execute AVX programs very efficiently, but it speeds up SSEx-based ones considerably.
- Intel stayed with the 1 FADD + 1 FMUL execution unit layout and equipped it with 256-bit execution units, reusing the existing integer datapath for this and adding a bit of power saving (quoted from here):
Floating point warm-up effect
The latencies and throughputs of floating point vector operations is varying according to the processor load. The ideal latency is 3 clock cycles for a floating point vector addition and 5 clock cycles for a vector multiplication regardless of the vector size. The ideal throughput is one vector addition and one vector multiplication per clock cycle. These ideal numbers are obtained only after a warm-up period of several hundred floating point instructions.
The processor is in a cold state when it has not seen any floating point instructions for a while. The latency for 256-bit vector additions and multiplications is initially two clocks longer than the ideal number, then one clock longer, and after several hundred floating point instructions the processor goes to the warm state where latencies are 3 and 5 clocks respectively. The throughput is half the ideal value for 256-bit vector operations in the cold state. 128-bit vector operations are less affected by this warm-up effect. The latency of 128-bit vector additions and multiplications is at most one clock cycle longer than the ideal value, and the throughput is not reduced in the cold state.
The cold state does not affect division, move, shuffle, Boolean and other vector instructions.
There is no official explanation for this warm-up effect yet, but my guess is that the processor can turn off some of the most expensive execution resources to save power, and turn on these resources only when the load is heavy. Another possible explanation is that half the execution resources are initially allocated to the other thread running in the same core.
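The difference between the two approaches can be illustrated with a toy issue model. This is a rough sketch, not a description of either real chip: the function names are hypothetical, it counts only FPU issue slots, and it ignores latency, dependencies, decode limits and memory. It also assumes that on the 128-bit design a 256-bit instruction occupies two pipe slots.

```python
import math


def cycles_dedicated_pipes(n_add, n_mul):
    """One FADD pipe plus one FMUL pipe: the busier pipe sets the cycle count."""
    return max(n_add, n_mul)


def cycles_symmetric_pipes(n_add, n_mul, slots_per_instr=1):
    """Two pipes that each accept either operation type.

    slots_per_instr=2 models a 256-bit instruction split into two
    128-bit halves (an assumption of this sketch).
    """
    return math.ceil((n_add + n_mul) * slots_per_instr / 2)


# 100 vector instructions, balanced mix, 128-bit code
print(cycles_dedicated_pipes(50, 50))      # 50
print(cycles_symmetric_pipes(50, 50))      # 50

# 100 vector instructions, add-heavy mix, 128-bit code
print(cycles_dedicated_pipes(90, 10))      # 90
print(cycles_symmetric_pipes(90, 10))      # 50

# 100 balanced 256-bit instructions: full-width dedicated pipes
# versus 128-bit symmetric pipes that need two slots per instruction
print(cycles_dedicated_pipes(50, 50))      # 50
print(cycles_symmetric_pipes(50, 50, 2))   # 100
```

In this model the symmetric layout is insensitive to the add/multiply ratio for 128-bit code, while the dedicated layout only reaches its peak with a balanced mix but keeps that peak at 256 bits.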
Both get the most out of what 32 nm allows, since both nearly double the physical size of the FPU. AMD is in an easier position here: it designed the earlier, K8-based FPUs so that every 128-bit instruction is translated into, and executed as, 2x 64-bit operations. So when it widened the FPU to 128 bits, the FPU "emptied out": for the same execution speed it receives half as many internal uops. They are now filling it up with the second thread.
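The "emptied out" FPU argument is simple uop counting. A minimal sketch, assuming each instruction is split into as many uops as it takes to cover its width with the datapath; the helper name and the instruction count are invented for illustration:

```python
def fpu_uops(n_instructions, instr_bits, datapath_bits):
    """Internal uops needed if each instruction is split to fit the datapath."""
    uops_per_instr = -(-instr_bits // datapath_bits)  # integer ceiling division
    return n_instructions * uops_per_instr


n = 1000  # hypothetical number of 128-bit SSE instructions in one thread

k8_style = fpu_uops(n, 128, 64)    # every instruction becomes 2x 64-bit uops
widened = fpu_uops(n, 128, 128)    # one uop per instruction
two_threads = 2 * widened          # a second thread feeding the same FPU

print(k8_style)     # 2000
print(widened)      # 1000
print(two_threads)  # 2000
```

Under this assumption, two threads on the widened FPU generate as many uops as one thread did on the 64-bit design, which is the slack the second thread is meant to fill.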