SIMD library
Types
-
Represents a SIMD matrix type with 12 (3x4) packed single-precision (32-bit) floating-point elements. Depending on the implementation, the matrix may be stored in two 256-bit registers (AVX) or four 128-bit registers (SSEn / Neon).
-
Represents a SIMD matrix type with 16 (4x4) packed single-precision (32-bit) floating-point elements. Depending on the implementation, the matrix may be stored in two 256-bit registers (AVX) or four 128-bit registers (SSEn / Neon).
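The register counts for the 4x4 matrix follow directly from the element count. As a rough illustration, the sketch below models the vector and matrix as plain arrays and checks the size arithmetic at compile time. The names float4_model and float4x4_model are hypothetical stand-ins, not the library's actual definitions, and the sketch assumes a 32-bit float.

```cpp
#include <cstddef>

// Hypothetical scalar stand-ins for the library's SIMD types. The real
// types wrap platform registers and are not defined like this.
struct float4_model {
    alignas(16) float v[4];   // four 32-bit elements, one 128-bit register
};

struct float4x4_model {
    float4_model r[4];        // four rows of four elements
};

constexpr std::size_t bits(std::size_t bytes) { return bytes * 8; }

// 4 elements * 32 bits = 128 bits: one SSE / Neon register.
static_assert(bits(sizeof(float4_model)) == 128, "one 128-bit register");
// 16 elements * 32 bits = 512 bits: four 128-bit or two 256-bit registers.
static_assert(bits(sizeof(float4x4_model)) == 4 * 128, "four 128-bit registers");
static_assert(bits(sizeof(float4x4_model)) == 2 * 256, "two 256-bit registers");
```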
Functions
-
Reinterprets a SIMD variable of type int4 as type float4 without changing its data.
-
Reinterprets a SIMD variable of type float4 as type int4 without changing its data.
-
float4 load_f2(f32 const *mem_addr)
Loads 64 bits (2 packed single-precision (32-bit) floating-point elements) from memory into the first two elements of dst. The remaining two elements are set to 0.
-
float4 load_f4(f32 const *mem_addr)
Loads 128 bits (4 packed single-precision (32-bit) floating-point elements) from memory into dst.
-
void store_f2(f32 *mem_addr, float4 a)
Stores the lower 2 single-precision (32-bit) floating-point elements from a into memory.
-
void store_f4(f32 *mem_addr, float4 a)
Stores 128 bits (4 packed single-precision (32-bit) floating-point elements) from a into memory.
-
float4 set_f4(f32 e0, f32 e1, f32 e2, f32 e3)
Sets packed single-precision (32-bit) floating-point elements in dst with the supplied values.
-
int4 set_i4(i32 e0, i32 e1, i32 e2, i32 e3)
Sets packed 32-bit integers in dst with the supplied values.
-
Returns a vector of type float4 with all elements set to zero.
-
Broadcasts the single-precision (32-bit) floating-point value e0 to all elements of dst.
-
Stores the first single-precision (32-bit) floating-point element from a into dst.
-
float4 setw_f4(float4 a, f32 b)
Replaces the fourth element of a with b, and stores the result in dst.
-
Broadcasts the first element of a to every element of dst.
-
Broadcasts the second element of a to every element of dst.
-
Broadcasts the third element of a to every element of dst.
-
Broadcasts the fourth element of a to every element of dst.
-
int4 cmpeq_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for equality, and stores the results in dst.
-
int4 cmpneq_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for inequality, and stores the results in dst.
-
int4 cmpgt_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for greater-than, and stores the results in dst.
-
int4 cmplt_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for less-than, and stores the results in dst.
-
int4 cmpge_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for greater-than-or-equal, and stores the results in dst.
-
int4 cmple_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b for less-than-or-equal, and stores the results in dst.
-
Converts the comparison result mask to one 32-bit integer.
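The reference does not spell out the mask layout, so the following scalar sketch is only one plausible reading. It assumes the SSE convention, in which a comparison sets every bit of a lane that passes, and the conversion packs the sign bit of each lane into the low four bits of an integer (as the SSE intrinsic _mm_movemask_ps does). The names lanes_f, lanes_i, cmpgt_model and mask_to_int_model are hypothetical and not part of the library.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using lanes_f = std::array<float, 4>;          // hypothetical stand-in for float4
using lanes_i = std::array<std::uint32_t, 4>;  // hypothetical stand-in for int4

// Scalar model of a greater-than comparison: a lane that passes becomes
// all ones, a lane that fails becomes zero (assumed SSE-style behavior).
lanes_i cmpgt_model(const lanes_f& a, const lanes_f& b) {
    lanes_i m{};
    for (std::size_t i = 0; i < 4; ++i) {
        m[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    }
    return m;
}

// Scalar model of the mask conversion: bit i of the result is the top
// bit of lane i, so the result is in the range 0..15.
int mask_to_int_model(const lanes_i& m) {
    int r = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        r |= static_cast<int>((m[i] >> 31) & 1u) << i;
    }
    return r;
}
```

Under these assumptions, a result of 15 means every lane passed the comparison and 0 means none did, which is the usual way such a mask is used for branching.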
-
float4 add_f4(float4 a, float4 b)
Adds packed single-precision (32-bit) floating-point elements in a and b, and stores the results in dst.
-
float4 sub_f4(float4 a, float4 b)
Subtracts packed single-precision (32-bit) floating-point elements in b from the corresponding elements in a, and stores the results in dst.
-
float4 mul_f4(float4 a, float4 b)
Multiplies packed single-precision (32-bit) floating-point elements in a and b, and stores the results in dst.
-
float4 div_f4(float4 a, float4 b)
Divides packed single-precision (32-bit) floating-point elements in a by packed elements in b, and stores the results in dst.
-
float4 scale_f4(float4 a, f32 b)
Scales packed single-precision (32-bit) floating-point elements in a by the single-precision (32-bit) floating-point value b, and stores the results in dst.
-
float4 muladd_f4(float4 a, float4 b, float4 c)
Multiplies packed single-precision (32-bit) floating-point elements in a and b, adds the intermediate result to packed elements in c, and stores the results in dst.
-
float4 negmuladd_f4(float4 a, float4 b, float4 c)
Multiplies packed single-precision (32-bit) floating-point elements in a and b, adds the negated intermediate result to packed elements in c, and stores the results in dst.
-
float4 scaleadd_f4(float4 a, f32 b, float4 c)
Scales packed single-precision (32-bit) floating-point elements in a by the single-precision (32-bit) floating-point value b, adds the intermediate result to packed elements in c, and stores the results in dst.
-
Computes the square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst.
-
Computes the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst. SSE specific: The maximum relative error for this approximation is less than 1.5*2^-12.
-
Computes the reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst.
-
float4 max_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b, and stores the packed maximum values in dst. SSE specific: dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) maximum value when inputs are NaN or signed-zero values.
-
float4 min_f4(float4 a, float4 b)
Compares packed single-precision (32-bit) floating-point elements in a and b, and stores the packed minimum values in dst. SSE specific: dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) minimum value when inputs are NaN or signed-zero values.
-
Computes the bitwise AND of every bit in a and b, and stores the results in dst.
-
Computes the bitwise OR of every bit in a and b, and stores the results in dst.
-
f32 dot2_f4(float4 a, float4 b)
Computes the dot product of the first two elements of a and b, and stores the result in dst.
-
f32 dot3_f4(float4 a, float4 b)
Computes the dot product of the first three elements of a and b, and stores the result in dst.
-
f32 dot4_f4(float4 a, float4 b)
Computes the dot product of all elements of a and b, and stores the result in dst.
-
float4 dot2v_f4(float4 a, float4 b)
Computes the dot product of the first two elements of a and b, and stores the result in each element of dst.
-
float4 dot3v_f4(float4 a, float4 b)
Computes the dot product of the first three elements of a and b, and stores the result in each element of dst.
-
float4 dot4v_f4(float4 a, float4 b)
Computes the dot product of all elements of a and b, and stores the result in each element of dst.
-
float4 cross2_f4(float4 a, float4 b)
Computes the cross product on the first two elements of a and b, and stores the result in dst.
-
float4 cross3_f4(float4 a, float4 b)
Computes the cross product on the first three elements of a and b, and stores the result in each element of dst.
-
float4 cross4_f4(float4 a, float4 b, float4 c)
Computes the cross product on elements of a, b and c, and stores the result in each element of dst.
-
float4 normalize2_f4(float4 a)
Normalizes the first two elements of a, and stores the result in dst.
-
float4 normalize3_f4(float4 a)
Normalizes the first three elements of a, and stores the result in dst.
-
float4 normalize4_f4(float4 a)
Normalizes all elements of a, and stores the result in dst.
-
float4 reflect2_f4(float4 i, float4 n)
Performs a reflection based on the first two elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
-
float4 reflect3_f4(float4 i, float4 n)
Performs a reflection based on the first three elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
-
float4 reflect4_f4(float4 i, float4 n)
Performs a reflection based on all elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
-
float4 refract2_f4(float4 i, float4 n, f32 index)
Performs a refraction based on the first two elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
-
float4 refract3_f4(float4 i, float4 n, f32 index)
Performs a refraction based on the first three elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
-
float4 refract4_f4(float4 i, float4 n, f32 index)
Performs a refraction based on all elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
-
float4 lerp_f4(float4 a, float4 b, f32 t)
Computes linear interpolation on packed single-precision (32-bit) floating-point elements in a and b using the single-precision (32-bit) floating-point value t, and stores the results in dst.
-
float4 lerpv_f4(float4 a, float4 b, float4 t)
Computes linear interpolation on packed single-precision (32-bit) floating-point elements in a and b using the corresponding packed single-precision (32-bit) floating-point element in t, and stores the results in dst.
-
float4 barycentric_f4(float4 a, float4 b, float4 c, f32 f, f32 g)
Computes barycentric interpolation on packed single-precision (32-bit) floating-point elements in a, b and c using the single-precision (32-bit) floating-point values f and g, and stores the results in dst.
-
float4 catmull_rom_f4(float4 a, float4 b, float4 c, float4 d, f32 t)
Computes Catmull-Rom spline interpolation on packed single-precision (32-bit) floating-point elements in a, b, c and d using the single-precision (32-bit) floating-point value t, and stores the results in dst.
-
float4 hermite_f4(float4 v0, float4 t0, float4 v1, float4 t1, f32 t)
Computes Hermite spline interpolation on packed single-precision (32-bit) floating-point elements in v0, t0, v1 and t1 using the single-precision (32-bit) floating-point value t, and stores the results in dst.
-
Shuffles single-precision (32-bit) floating-point elements in a based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
-
float4 select_f4(float4 a, float4 b)
Performs a per-component selection between a and b based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
-
float4 permute2_f4(float4 a, float4 b)
Shuffles single-precision (32-bit) floating-point elements in a and b based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
-
float3x4 load_f3x4(f32 const *mem_addr)
Loads 12 packed single-precision (32-bit) floating-point elements from mem_addr into dst. The highest 4 packed single-precision (32-bit) floating-point elements of dst are left uninitialized.
-
float4x4 load_f4x4(f32 const *mem_addr)
Loads 16 packed single-precision (32-bit) floating-point elements from mem_addr into dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
-
float4x4 setf4_f4x4(float4 r0, float4 r1, float4 r2, float4 r3)
Creates one 4x4 matrix from four vectors, and stores the result in dst.
-
void store_f3x4(f32 *mem_addr, float3x4 m)
Stores the first 12 packed single-precision (32-bit) floating-point elements from m into memory at mem_addr. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
-
void store_f4x4(f32 *mem_addr, float4x4 m)
Stores packed single-precision (32-bit) floating-point elements from m into memory at mem_addr. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
-
Returns a matrix of type float4x4 with all elements set to zero.
-
float3x4 matmul_f3x3(float3x4 a, float3x4 b)
Performs 3x3 matrix multiplication on a and b, and stores the result in dst.
-
float4x4 matmul_f4x4(float4x4 a, float4x4 b)
Performs 4x4 matrix multiplication on a and b, and stores the result in dst.
-
float4x4 transpose_f4x4(float4x4 a)
Transposes the matrix a, and stores the result in dst.
-
f32 determinant_f3x3(float3x4 a)
Computes the determinant of the 3x3 matrix a, and stores the result in dst.
-
float4 determinantv_f3x3(float3x4 a)
Computes the determinant of the 3x3 matrix a, and stores the result in every element of dst.
-
float3x4 inverse_f3x3(float3x4 a, f32 *out_determinant)
Computes the determinant and the inverse matrix of a, stores the determinant in out_determinant, and stores the inverse matrix in dst.
-
f32 determinant_f4x4(float4x4 a)
Computes the determinant of the matrix a, and stores the result in dst.
-
float4 determinantv_f4x4(float4x4 a)
Computes the determinant of the matrix a, and stores the result in every element of dst.
-
float4x4 inverse_f4x4(float4x4 a, f32 *out_determinant)
Computes the determinant and the inverse matrix of a, stores the determinant in out_determinant, and stores the inverse matrix in dst.
-
Rounds each component of a to the nearest even integer.
-
Computes the per-component angle modulo 2PI for a, and stores the results in dst. The angle is expressed in radians. The result is wrapped into the range [-PI, PI].
-
Computes the sine of packed single-precision (32-bit) floating-point elements in a, expressed in radians, and stores the results in dst.
-
Computes the cosine of packed single-precision (32-bit) floating-point elements in a, expressed in radians, and stores the results in dst.
-
float4 sincos_f4(float4 &out_cos, float4 a)
Computes the sine and cosine of packed single-precision (32-bit) floating-point elements in a, expressed in radians, and stores the sine in dst and the cosine in out_cos.
-
float4 mulquat_f4(float4 a, float4 b)
Multiplies two quaternions a and b, and stores the result in dst.
-
float4 quatinverse_f4(float4 a)
Inverts the quaternion a, and stores the result in dst.
-
float4 quatnormalangle_f4(float4 n, f32 a)
Computes a rotation quaternion from the given normal n and angle a, and stores the resulting quaternion in dst.
-
float4 quateulerangles_f4(float4 a)
Computes a rotation quaternion from a vector containing the Euler angles (pitch, yaw, and roll), and stores the result in dst.
-
float4 quatlerp_f4(float4 a, float4 b, f32 t)
Interpolates between two unit quaternions a and b using linear interpolation, and stores the result in dst.
-
float4 quatslerp_f4(float4 a, float4 b, f32 t)
Interpolates between two unit quaternions a and b using spherical linear interpolation, and stores the result in dst.
-
float3x4 transform2d_f3x4(float4 translation, f32 rotation, float4 scaling)
Builds a 2D affine matrix from translation, rotation and scaling.
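The reference does not state the composition order or the matrix convention, so the following scalar sketch is only an illustration under stated assumptions: scaling is applied first, then rotation, then translation, and points are treated as row vectors multiplied on the left of the matrix, with the translation stored in the third row. The names mat3_model, transform2d_model and transform_point_model are hypothetical and not part of the library.

```cpp
#include <array>
#include <cmath>

// Hypothetical row-major 3x3 affine matrix used in place of float3x4.
using mat3_model = std::array<std::array<float, 3>, 3>;

// Builds M = S * R * T for the row-vector convention p' = p * M
// (an assumed convention; the library may use a different one).
mat3_model transform2d_model(float tx, float ty, float rotation,
                             float sx, float sy) {
    const float c = std::cos(rotation);
    const float s = std::sin(rotation);
    mat3_model m{};
    m[0] = {sx * c, sx * s, 0.0f};    // scaled, rotated X basis
    m[1] = {-sy * s, sy * c, 0.0f};   // scaled, rotated Y basis
    m[2] = {tx, ty, 1.0f};            // translation
    return m;
}

// Transforms the point (x, y, 1) by m and returns the resulting (x, y).
std::array<float, 2> transform_point_model(const mat3_model& m,
                                           float x, float y) {
    return {x * m[0][0] + y * m[1][0] + m[2][0],
            x * m[0][1] + y * m[1][1] + m[2][1]};
}
```

With these assumptions, scaling by (2, 2), rotating by a quarter turn and translating by (1, 1) sends the point (1, 0) to approximately (1, 3): the scale gives (2, 0), the rotation gives (0, 2), and the translation moves it to (1, 3).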
-
float3x4 transform2d_translation_f3x4(float4 translation)
Builds a 2D affine matrix from translation.
-
float3x4 transform2d_rotation_f3x4(f32 rotation)
Builds a 2D affine matrix from rotation.
-
float3x4 transform2d_scaling_f3x4(float4 scaling)
Builds a 2D affine matrix from scaling.
-
float4x4 transform3d_f4x4(float4 translation, float4 rotation_quaternion, float4 scaling)
Builds a 3D affine matrix from translation, rotation and scaling.
-
float4x4 transform3d_translation_f4x4(float4 translation)
Builds a 3D affine matrix from translation.
-
float4x4 transform3d_rotation_quaternion_f4x4(float4 quaternion)
Builds a 3D affine matrix from rotation represented by one quaternion.
-
float4x4 transform3d_rotation_x_f4x4(f32 rotation)
Builds a 3D affine matrix from rotation about the X axis.
-
float4x4 transform3d_rotation_y_f4x4(f32 rotation)
Builds a 3D affine matrix from rotation about the Y axis.
-
float4x4 transform3d_rotation_z_f4x4(f32 rotation)
Builds a 3D affine matrix from rotation about the Z axis.
-
float4x4 transform3d_rotation_normal_angle_f4x4(float4 normal, f32 angle)
Builds a 3D affine matrix from rotation represented by rotation axis and rotation angle.
-
float4x4 transform3d_rotation_euler_angles_f4x4(float4 pitch_yaw_roll)
Builds a 3D affine matrix from rotation represented by Euler angles (pitch, yaw, roll).
-
float4x4 transform3d_scaling_f4x4(float4 scaling)
Builds a 3D affine matrix from scaling.
-
float4x4 transform3d_look_to_f4x4(float4 eye, float4 eyedir, float4 updir)
Creates an affine matrix that transforms points and directions from world space to view space. eyedir and updir must be normalized.
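As an illustration of what a look-to matrix contains, here is a scalar sketch. The handedness and matrix convention are not stated in the reference, so the sketch assumes a left-handed view space (the camera looks down +Z) and row vectors multiplied on the left of the matrix, with the translation in the fourth row. All names (vec3_model, mat4_model, look_to_model and the helpers) are hypothetical and not part of the library.

```cpp
#include <array>
#include <cmath>

using vec3_model = std::array<float, 3>;                 // hypothetical 3D vector
using mat4_model = std::array<std::array<float, 4>, 4>;  // hypothetical row-major 4x4

float dot3_model(const vec3_model& a, const vec3_model& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

vec3_model cross3_model(const vec3_model& a, const vec3_model& b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}

vec3_model normalize3_model(const vec3_model& a) {
    const float len = std::sqrt(dot3_model(a, a));
    return {a[0] / len, a[1] / len, a[2] / len};
}

// Assumed left-handed, row-vector convention: view = world * M.
// eyedir and updir are expected to be normalized and not parallel.
mat4_model look_to_model(const vec3_model& eye, const vec3_model& eyedir,
                         const vec3_model& updir) {
    const vec3_model zaxis = eyedir;
    const vec3_model xaxis = normalize3_model(cross3_model(updir, zaxis));
    const vec3_model yaxis = cross3_model(zaxis, xaxis);
    mat4_model m{};
    for (int i = 0; i < 3; ++i) {
        // Columns hold the camera axes, so p * M projects p onto each axis.
        m[i] = {xaxis[i], yaxis[i], zaxis[i], 0.0f};
    }
    // The fourth row moves the eye position to the view-space origin.
    m[3] = {-dot3_model(eye, xaxis), -dot3_model(eye, yaxis),
            -dot3_model(eye, zaxis), 1.0f};
    return m;
}
```

Under these assumptions, a camera at (0, 0, -5) looking along +Z with +Y up yields an identity rotation part and a fourth row of (0, 0, 5, 1), so the world origin ends up 5 units in front of the camera.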