X86 Assembly/SSE
SSE stands for Streaming SIMD Extensions. It is essentially the floating-point equivalent of the MMX instructions. The SSE registers are 128 bits, and can be used to perform operations on a variety of data sizes and types. Unlike MMX, the SSE registers do not overlap with the floating point stack.
Contents |
[edit] Registers
SSE, introduced by Intel in 1999 with the Pentium III, creates eight new 128-bit registers:
XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7
Originally, an SSE register could only be used as four 32-bit single precision floating point numbers (the equivalent of a float in C). SSE2 expanded the capabilities of the XMM registers, so they can now be used as:
2 64-bit floating points (double precision) 2 64-bit integers 4 32-bit floating points (single-precision) 4 32-bit integers 8 16-bit integers 16 8-bit characters (bytes)
[edit] Example
The following program (using NASM syntax) performs a few SIMD operations on some numbers.
global _start section .data v1: dd 1.1, 2.2, 3.3, 4.4 ;first set of 4 numbers v2: dd 5.5, 6.6, 7.7, 8.8 ;second set section .bss v3: resd 4 ;result section .text _start: movups xmm0, [v1] ;load v1 into xmm0 movups xmm1, [v2] ;load v2 into xmm1 addps xmm0, xmm1 ;add v2 to v1 mulps xmm0, xmm1 ;multiply v1 by v2 subps xmm0, xmm1 ;subtract v2 from v1 movups [v3], xmm0 ;store v1 in v3 ;end program
The result values should be:
30.800 51.480 77.000 107.360
[edit] SSE Instruction Set
There are literally hundreds of SSE instructions, some of which are capable of much more than simple SIMD arithmetic. For more in-depth references take a look at the resources chapter of this book.
You may notice that many floating point SSE instructions end with something like PS or SD. These suffixes differentiate between different versions of the operation. The first letter describes whether the instruction should be Packed or Scalar. Packed operations are applied to every member of the register, while scalar operations are applied to only the first value. For example, in pseudo-code, a packed add would be executed as:
v1[0] = v1[0] + v2[0] v1[1] = v1[1] + v2[1] v1[2] = v1[2] + v2[2] v1[3] = v1[3] + v2[3]
While a scalar add would only be:
v1[0] = v1[0] + v2[0]
The second letter refers to the data size: either Single or Double. This simply tells the processor whether to use the register as four 32-bit floats or two 64-bit doubles, respectively.
[edit] SSE: Added with Pentium III
Floating-point Instructions:
ADDPS, ADDSS, CMPPS, CMPSS, COMISS, CVTPI2PS, CVTPS2PI, CVTSI2SS, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, DIVPS, DIVSS, LDMXCSR, MAXPS, MAXSS, MINPS, MINSS, MOVAPS, MOVHLPS, MOVHPS, MOVLHPS, MOVLPS, MOVMSKPS, MOVNTPS, MOVSS, MOVUPS, MULPS, MULSS, RCPPS, RCPSS, RSQRTPS, RSQRTSS, SHUFPS, SQRTPS, SQRTSS, STMXCSR, SUBPS, SUBSS, UCOMISS, UNPCKHPS, UNPCKLPS
Integer Instructions:
ANDNPS, ANDPS, ORPS, PAVGB, PAVGW, PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, XORPS
[edit] SSE2: Added with Pentium 4
Floating-point Instructions:
ADDPD, ADDSD, ANDNPD, ANDPD, CMPPD, CMPSD*, COMISD, CVTDQ2PD, CVTDQ2PS, CVTPD2DQ, CVTPD2PI, CVTPD2PS, CVTPI2PD, CVTPS2DQ, CVTPS2PD, CVTSD2SI, CVTSD2SS, CVTSI2SD, CVTSS2SD, CVTTPD2DQ, CVTTPD2PI, CVTTPS2DQ, CVTTSD2SI, DIVPD, DIVSD, MAXPD, MAXSD, MINPD, MINSD, MOVAPD, MOVHPD, MOVLPD, MOVMSKPD, MOVSD*, MOVUPD, MULPD, MULSD, ORPD, SHUFPD, SQRTPD, SQRTSD, SUBPD, SUBSD, UCOMISD, UNPCKHPD, UNPCKLPD, XORPD
- * CMPSD and MOVSD have the same name as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former refer to scalar double-precision floating-points whereas the latter refer to doubleword strings.
Integer Instructions:
MOVDQ2Q, MOVDQA, MOVDQU, MOVQ2DQ, PADDQ, PSUBQ, PMULUDQ, PSHUFHW, PSHUFLW, PSHUFD, PSLLDQ, PSRLDQ, PUNPCKHQDQ, PUNPCKLQDQ
[edit] SSE3: Added with later Pentium 4
ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD, HSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
[edit] SSSE3: Added with Xeon 5100 and early Core 2
PSIGNW, PSIGND, PSIGNB, PSHUFB, PMULHRSW, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PHADDW, PHADDSW, PHADDD, PALIGNR, PABSW, PABSD, PABSB
[edit] SSE4
[edit] SSE4.1: Added with later Core 2
MPSADBW, PHMINPOSUW, PMULLD, PMULDQ, DPPS, DPPD, BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW, PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD, ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD, INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD, PEXTRQ, PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ, PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA
[edit] SSE4a: Added with Phenom
LZCNT, POPCNT, EXTRQ, INSERTQ, MOVNTSD, MOVNTSS
[edit] SSE4.2: Added with Nehalem
CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ