
Optimizing for SIMD Floating-point Applications
Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding mode from round-nearest, to truncate/chop, and then back to round-nearest to resume computation. Frequent changes to the MXCSR register should be
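The difference between the two conversion styles can be sketched with SSE intrinsics. This is a minimal illustration rather than code from the manual: the function names convert_mxcsr, convert_truncate, and convert_truncate_via_mxcsr are hypothetical, and the sketch assumes MXCSR holds its default round-nearest setting on entry. _mm_cvtss_si32 maps to cvtss2si, which follows MXCSR, while _mm_cvttss_si32 maps to cvttss2si, which always truncates.

```c
#include <xmmintrin.h>  /* SSE intrinsics and the _MM_*_ROUNDING_MODE macros */

/* Hypothetical helper: converts using whatever rounding mode MXCSR
   currently selects (cvtss2si). */
static int convert_mxcsr(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: truncation is encoded in the instruction itself
   (cvttss2si), so MXCSR is neither read for the mode nor written. */
static int convert_truncate(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* Hypothetical helper showing the pattern that the truncating
   instructions make unnecessary: MXCSR is written twice per call. */
static int convert_truncate_via_mxcsr(float f)
{
    volatile float v = f;  /* discourages compile-time folding of the conversion */
    unsigned int saved = _MM_GET_ROUNDING_MODE();
    int r;

    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);  /* switch to truncate/chop */
    r = _mm_cvtss_si32(_mm_set_ss(v));
    _MM_SET_ROUNDING_MODE(saved);                  /* restore the previous mode */
    return r;
}
```

Under the round-nearest assumption, convert_mxcsr(2.7f) yields 3, while convert_truncate(2.7f) and convert_truncate_via_mxcsr(2.7f) both yield 2. Only the last one touches MXCSR to get that result.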
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps
#include <xmmintrin.h>  // SSE intrinsics

// Vertex_soa is assumed to provide 16-byte-aligned float arrays x, y, z, w
// of four elements each; _mm_load_ps and _mm_store_ps require that alignment.
void horiz_add_intrin(Vertex_soa *in, float *out)
{
    __m128 tmm0, tmm1, tmm2, tmm3, tmm4, tmm5, tmm6;  // Temporary variables

    tmm0 = _mm_load_ps(in->x);                 // tmm0 = A1 A2 A3 A4
    tmm1 = _mm_load_ps(in->y);                 // tmm1 = B1 B2 B3 B4
    tmm2 = _mm_load_ps(in->z);                 // tmm2 = C1 C2 C3 C4
    tmm3 = _mm_load_ps(in->w);                 // tmm3 = D1 D2 D3 D4

    tmm5 = tmm0;                               // tmm5 = A1 A2 A3 A4
    tmm5 = _mm_movelh_ps(tmm5, tmm1);          // tmm5 = A1 A2 B1 B2
    tmm1 = _mm_movehl_ps(tmm1, tmm0);          // tmm1 = A3 A4 B3 B4
    tmm5 = _mm_add_ps(tmm5, tmm1);             // tmm5 = A1+A3 A2+A4 B1+B3 B2+B4

    tmm4 = tmm2;                               // tmm4 = C1 C2 C3 C4
    tmm2 = _mm_movelh_ps(tmm2, tmm3);          // tmm2 = C1 C2 D1 D2
    tmm3 = _mm_movehl_ps(tmm3, tmm4);          // tmm3 = C3 C4 D3 D4
    tmm3 = _mm_add_ps(tmm3, tmm2);             // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4

    // Gather the even-indexed and odd-indexed partial sums, then add them.
    tmm6 = _mm_shuffle_ps(tmm5, tmm3, 0x88);   // tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
    tmm5 = _mm_shuffle_ps(tmm5, tmm3, 0xDD);   // tmm5 = A2+A4 B2+B4 C2+C4 D2+D4
    tmm6 = _mm_add_ps(tmm6, tmm5);             // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                               //        C1+C2+C3+C4 D1+D2+D3+D4
    _mm_store_ps(out, tmm6);
}
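For comparison, the result that Example 5-10 is meant to produce can be written as a scalar sketch. The struct Vertex_soa_ref and the function horiz_add_scalar are hypothetical names introduced here for illustration only. The layout (four floats per component) is an assumption inferred from the four loads in the example, not a definition taken from the manual. The additions are grouped as (e1+e3)+(e2+e4) to mirror the order in which the SIMD version combines elements.

```c
#include <stddef.h>

/* Hypothetical layout, assumed from the loads in Example 5-10:
   each component array holds four values (A1..A4, B1..B4, ...). */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
} Vertex_soa_ref;

/* Scalar reference: out[0] = sum of x, out[1] = sum of y,
   out[2] = sum of z, out[3] = sum of w. */
static void horiz_add_scalar(const Vertex_soa_ref *in, float *out)
{
    const float *rows[4] = { in->x, in->y, in->z, in->w };
    size_t r;

    for (r = 0; r < 4; ++r) {
        /* Same pairing as the SIMD sequence: (e1+e3) + (e2+e4). */
        out[r] = (rows[r][0] + rows[r][2]) + (rows[r][1] + rows[r][3]);
    }
}
```

A reference of this kind is mainly useful for checking a SIMD routine's output element by element on small test inputs.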
IA-32 Intel Architecture Optimization Reference Manual, Order Number 248966-013US, April 2006