SIMD library

Types

Luna::Simd::float3x4

Represents a SIMD matrix type with 12 (3x4) packed single-precision (32-bit) floating-point elements. Depending on the implementation, the matrix may be stored in two 256-bit registers (AVX) or four 128-bit registers (SSEn / Neon).
Luna::Simd::float4x4

Represents a SIMD matrix type with 16 (4x4) packed single-precision (32-bit) floating-point elements. Depending on the implementation, the matrix may be stored in two 256-bit registers (AVX) or four 128-bit registers (SSEn / Neon).

Functions

float4 casti_f4(int4 a)

Reinterprets a SIMD variable of type int4 as type float4 without changing the data of the variable.
int4 castf_i4(float4 a)

Reinterprets a SIMD variable of type float4 as type int4 without changing the data of the variable.
float4 load_f2(f32 const *mem_addr)

Loads 64 bits (composed of 2 packed single-precision (32-bit) floating-point elements) from memory into the first two elements of dst. The remaining two elements are set to 0.
float4 load_f4(f32 const *mem_addr)

Loads 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into dst.
void store_f2(f32 *mem_addr, float4 a)

Stores the lower 2 single-precision (32-bit) floating-point elements from a into memory.
void store_f4(f32 *mem_addr, float4 a)

Stores 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory.
float4 set_f4(f32 e0, f32 e1, f32 e2, f32 e3)

Sets packed single-precision (32-bit) floating-point elements in dst with the supplied values.
int4 set_i4(i32 e0, i32 e1, i32 e2, i32 e3)

Sets packed 32-bit integers in dst with the supplied values.
float4 setzero_f4()

Returns a vector of type float4 with all elements set to zero.
float4 dup_f4(f32 e0)

Broadcasts single-precision (32-bit) floating-point value e0 to all elements of dst.
f32 getx_f4(float4 a)

Stores the first single-precision (32-bit) floating-point element from a into dst.
float4 setw_f4(float4 a, f32 b)

Replaces the fourth element of a with b, and stores the results in dst.
float4 dupx_f4(float4 a)

Broadcasts the first element of a to every element of dst.
float4 dupy_f4(float4 a)

Broadcasts the second element of a to every element of dst.
float4 dupz_f4(float4 a)

Broadcasts the third element of a to every element of dst.
float4 dupw_f4(float4 a)

Broadcasts the fourth element of a to every element of dst.
int4 cmpeq_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for equality, and stores the results in dst.
int4 cmpneq_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for not-equal, and stores the results in dst.
int4 cmpgt_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for greater-than, and stores the results in dst.
int4 cmplt_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for less-than, and stores the results in dst.
int4 cmpge_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for greater-than-or-equal, and stores the results in dst.
int4 cmple_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b for less-than-or-equal, and stores the results in dst.
i32 maskint_i4(int4 a)

Converts the comparison result mask to one 32-bit integer.
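The comparison functions and maskint_i4 are usually used together to branch on a per-lane test. The sketch below models that flow in plain scalar C++. The names F4, I4, cmpgt_ref and maskint_ref are hypothetical stand-ins, not library identifiers, and the bit layout (bit i of the result reflects lane i, as the SSE movemask instruction does) is an assumption that this reference does not state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-ins for float4 / int4; not the library's types.
using F4 = std::array<float, 4>;
using I4 = std::array<std::int32_t, 4>;

// Models cmpgt_f4: a lane becomes all ones (-1) when a[i] > b[i], else 0.
inline I4 cmpgt_ref(const F4& a, const F4& b) {
    I4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = (a[i] > b[i]) ? -1 : 0;
    return r;
}

// Models maskint_i4 under the ASSUMPTION that bit i of the result is set
// when lane i of the mask has its sign bit set.
inline std::int32_t maskint_ref(const I4& m) {
    std::int32_t bits = 0;
    for (std::size_t i = 0; i < 4; ++i)
        if (m[i] < 0) bits |= (1 << i);
    return bits;
}
```

Under this assumption, a result of 0 means no lane passed the comparison and 15 means all four lanes passed.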
float4 add_f4(float4 a, float4 b)

Adds packed single-precision (32-bit) floating-point elements in a and b, and stores the results in dst.
float4 sub_f4(float4 a, float4 b)

Subtracts packed single-precision (32-bit) floating-point elements in a and b, and stores the results in dst.
float4 mul_f4(float4 a, float4 b)

Multiplies packed single-precision (32-bit) floating-point elements in a and b, and stores the results in dst.
float4 div_f4(float4 a, float4 b)

Divides packed single-precision (32-bit) floating-point elements in a by packed elements in b, and stores the results in dst.
float4 scale_f4(float4 a, f32 b)

Scales packed single-precision (32-bit) floating-point elements in a using one single-precision (32-bit) floating-point element b, and stores the results in dst.
float4 muladd_f4(float4 a, float4 b, float4 c)

Multiplies packed single-precision (32-bit) floating-point elements in a and b, adds the intermediate result to packed elements in c, and stores the results in dst.
float4 negmuladd_f4(float4 a, float4 b, float4 c)

Multiplies packed single-precision (32-bit) floating-point elements in a and b, adds the negated intermediate result to packed elements in c, and stores the results in dst.
float4 scaleadd_f4(float4 a, f32 b, float4 c)

Scales packed single-precision (32-bit) floating-point elements in a using one single-precision (32-bit) floating-point element b, adds the intermediate result to packed elements in c, and stores the results in dst.
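The three fused-style operations differ only in sign and in whether the multiplier is a vector or a scalar. A minimal scalar sketch of the per-lane arithmetic follows. F4 and the *_ref functions are hypothetical names used for illustration only. A hardware fused multiply-add may round once instead of twice, so the last bit can differ from this model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Models muladd_f4: dst[i] = a[i] * b[i] + c[i]
inline F4 muladd_ref(const F4& a, const F4& b, const F4& c) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * b[i] + c[i];
    return r;
}

// Models negmuladd_f4: dst[i] = c[i] - a[i] * b[i]
inline F4 negmuladd_ref(const F4& a, const F4& b, const F4& c) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = c[i] - a[i] * b[i];
    return r;
}

// Models scaleadd_f4: dst[i] = a[i] * s + c[i]
inline F4 scaleadd_ref(const F4& a, float s, const F4& c) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * s + c[i];
    return r;
}
```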
float4 sqrt_f4(float4 a)

Computes the square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst.
float4 rsqrtest_f4(float4 a)

Computes the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst. SSE specific: The maximum relative error for this approximation is less than 1.5*2^-12.
float4 rsqrt_f4(float4 a)

Computes the reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and stores the results in dst.
float4 max_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b, and stores packed maximum values in dst. SSE specific: dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) maximum value when inputs are NaN or signed-zero values.
float4 min_f4(float4 a, float4 b)

Compares packed single-precision (32-bit) floating-point elements in a and b, and stores packed minimum values in dst. SSE specific: dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) minimum value when inputs are NaN or signed-zero values.
int4 and_i4(int4 a, int4 b)

Computes the bitwise AND of every bit in a and b, and stores the results in dst.
int4 or_i4(int4 a, int4 b)

Computes the bitwise OR of every bit in a and b, and stores the results in dst.
f32 dot2_f4(float4 a, float4 b)

Computes the dot product on the first two elements of a and b, and stores the result in dst.
f32 dot3_f4(float4 a, float4 b)

Computes the dot product on the first three elements of a and b, and stores the result in dst.
f32 dot4_f4(float4 a, float4 b)

Computes the dot product on elements of a and b, and stores the result in dst.
float4 dot2v_f4(float4 a, float4 b)

Computes the dot product on the first two elements of a and b, and stores the result in each element of dst.
float4 dot3v_f4(float4 a, float4 b)

Computes the dot product on the first three elements of a and b, and stores the result in each element of dst.
float4 dot4v_f4(float4 a, float4 b)

Computes the dot product on elements of a and b, and stores the result in each element of dst.
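The six dot-product functions differ in how many leading elements take part and in whether the result is returned as a scalar or broadcast to every lane. A scalar sketch of both variants is shown below. F4, dot_ref and dotv_ref are hypothetical names, not library identifiers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Models dot2_f4 / dot3_f4 / dot4_f4 with count = 2, 3 or 4:
// only the first `count` lanes contribute to the sum.
inline float dot_ref(const F4& a, const F4& b, std::size_t count) {
    assert(count <= 4);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += a[i] * b[i];
    return sum;
}

// Models dot2v_f4 / dot3v_f4 / dot4v_f4: the same sum, copied to all lanes.
inline F4 dotv_ref(const F4& a, const F4& b, std::size_t count) {
    const float d = dot_ref(a, b, count);
    return F4{d, d, d, d};
}
```

The broadcast form is convenient when the result feeds straight into further vector arithmetic, since it avoids moving the value to a scalar register and back.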
float4 cross2_f4(float4 a, float4 b)

Computes the cross product on the first two elements of a and b, and stores the result in dst.
float4 cross3_f4(float4 a, float4 b)

Computes the cross product on the first three elements of a and b, and stores the resulting vector in dst.
float4 cross4_f4(float4 a, float4 b, float4 c)

Computes the cross product on elements of a, b and c, and stores the resulting vector in dst.
float4 normalize2_f4(float4 a)

Normalizes the first two elements of a, and stores the result in dst.
float4 normalize3_f4(float4 a)

Normalizes the first three elements of a, and stores the result in dst.
float4 normalize4_f4(float4 a)

Normalizes elements of a, and stores the result in dst.
float4 reflect2_f4(float4 i, float4 n)

Performs a reflection operation based on the first two elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
float4 reflect3_f4(float4 i, float4 n)

Performs a reflection operation based on the first three elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
float4 reflect4_f4(float4 i, float4 n)

Performs a reflection operation based on elements of i (incident vector) and n (normal vector), and stores the reflected vector in dst.
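The reflection functions above do not state their formula. The sketch below assumes the convention common to shading languages, r = i - 2 * dot(i, n) * n, with n normalized. F4 and reflect_ref are hypothetical names, and copying the unused lanes from i is a choice made for this sketch only, since the reference does not say what the library leaves there.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Models reflect2_f4 / reflect3_f4 / reflect4_f4 with count = 2, 3 or 4,
// ASSUMING r = i - 2 * dot(i, n) * n and a unit-length n.
inline F4 reflect_ref(const F4& i, const F4& n, std::size_t count) {
    assert(count <= 4);
    float d = 0.0f;
    for (std::size_t k = 0; k < count; ++k) d += i[k] * n[k];
    F4 r = i;  // lanes >= count are passed through (sketch-only choice)
    for (std::size_t k = 0; k < count; ++k) r[k] = i[k] - 2.0f * d * n[k];
    return r;
}
```

For example, an incident direction (1, -1, 0) hitting a surface with normal (0, 1, 0) reflects to (1, 1, 0) under this formula.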
float4 refract2_f4(float4 i, float4 n, f32 index)

Performs a refraction operation based on the first two elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
float4 refract3_f4(float4 i, float4 n, f32 index)

Performs a refraction operation based on the first three elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
float4 refract4_f4(float4 i, float4 n, f32 index)

Performs a refraction operation based on elements of i (incident vector), n (normal vector) and the scalar value index (refraction index), and stores the refracted vector in dst.
float4 lerp_f4(float4 a, float4 b, f32 t)

Computes linear interpolation on packed single-precision (32-bit) floating-point elements in a and b using the single-precision (32-bit) floating-point value t, and stores the results in dst.
float4 lerpv_f4(float4 a, float4 b, float4 t)

Computes linear interpolation on packed single-precision (32-bit) floating-point elements in a and b using the corresponding packed single-precision (32-bit) floating-point element in t, and stores the results in dst.
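The two interpolation functions differ only in whether t is shared by all lanes or supplied per lane. A scalar sketch follows, assuming the usual form a + t * (b - a). F4, lerp_ref and lerpv_ref are hypothetical names, not library identifiers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Models lerp_f4: one t for every lane, dst[i] = a[i] + t * (b[i] - a[i]).
inline F4 lerp_ref(const F4& a, const F4& b, float t) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + t * (b[i] - a[i]);
    return r;
}

// Models lerpv_f4: lane i uses its own factor t[i].
inline F4 lerpv_ref(const F4& a, const F4& b, const F4& t) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] + t[i] * (b[i] - a[i]);
    return r;
}
```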
float4 barycentric_f4(float4 a, float4 b, float4 c, f32 f, f32 g)

Computes barycentric interpolation on packed single-precision (32-bit) floating-point elements in a, b and c using the single-precision (32-bit) floating-point values f and g, and stores the results in dst.
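Barycentric interpolation can be read as starting at a and moving by f along the edge toward b and by g along the edge toward c. The sketch below assumes the form a + f * (b - a) + g * (c - a), which this reference does not spell out. F4 and barycentric_ref are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Models barycentric_f4 under the ASSUMED formula
// dst[i] = a[i] + f * (b[i] - a[i]) + g * (c[i] - a[i]).
inline F4 barycentric_ref(const F4& a, const F4& b, const F4& c,
                          float f, float g) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = a[i] + f * (b[i] - a[i]) + g * (c[i] - a[i]);
    return r;
}
```

With this form, (f, g) = (0, 0) yields a, (1, 0) yields b, and (0, 1) yields c. Points inside the triangle have f >= 0, g >= 0 and f + g <= 1.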
float4 catmull_rom_f4(float4 a, float4 b, float4 c, float4 d, f32 t)

Computes Catmull-Rom spline interpolation on packed single-precision (32-bit) floating-point elements in a, b, c and d using the single-precision (32-bit) floating-point value t, and stores the results in dst.
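A scalar sketch of the uniform Catmull-Rom basis is given below for orientation. It assumes the common convention in which the curve runs from b (at t = 0) to c (at t = 1), with a and d acting as neighbouring control points. Whether the library uses exactly this parameterization is not stated here. F4 and catmull_rom_ref are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Uniform Catmull-Rom spline (ASSUMED convention: segment from b to c).
inline F4 catmull_rom_ref(const F4& a, const F4& b, const F4& c,
                          const F4& d, float t) {
    const float t2 = t * t;
    const float t3 = t2 * t;
    // Basis weights; they sum to 1 for every t.
    const float wa = 0.5f * (-t3 + 2.0f * t2 - t);
    const float wb = 0.5f * (3.0f * t3 - 5.0f * t2 + 2.0f);
    const float wc = 0.5f * (-3.0f * t3 + 4.0f * t2 + t);
    const float wd = 0.5f * (t3 - t2);
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = wa * a[i] + wb * b[i] + wc * c[i] + wd * d[i];
    return r;
}
```

For equally spaced collinear control points the curve reduces to a straight line, which makes a convenient sanity check.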
float4 hermite_f4(float4 v0, float4 t0, float4 v1, float4 t1, f32 t)

Computes Hermite spline interpolation on packed single-precision (32-bit) floating-point elements in v0, t0, v1 and t1 using the single-precision (32-bit) floating-point value t, and stores the results in dst.
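The cubic Hermite basis blends two positions (v0, v1) and two tangents (t0, t1). The sketch below uses the standard basis polynomials and is an illustration only. F4 and hermite_ref are hypothetical names, and any tangent scaling the library might apply is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Cubic Hermite interpolation with the standard basis:
//   h00 = 2t^3 - 3t^2 + 1   (weight of v0)
//   h10 =  t^3 - 2t^2 + t   (weight of t0)
//   h01 = -2t^3 + 3t^2      (weight of v1)
//   h11 =  t^3 -  t^2       (weight of t1)
inline F4 hermite_ref(const F4& v0, const F4& t0, const F4& v1,
                      const F4& t1, float t) {
    const float s2 = t * t;
    const float s3 = s2 * t;
    const float h00 = 2.0f * s3 - 3.0f * s2 + 1.0f;
    const float h10 = s3 - 2.0f * s2 + t;
    const float h01 = -2.0f * s3 + 3.0f * s2;
    const float h11 = s3 - s2;
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = h00 * v0[i] + h10 * t0[i] + h01 * v1[i] + h11 * t1[i];
    return r;
}
```

The curve passes through v0 at t = 0 and v1 at t = 1. When both tangents equal v1 - v0, it degenerates to linear interpolation.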
float4 permute_f4(float4 a)

Shuffles single-precision (32-bit) floating-point elements in a based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
float4 select_f4(float4 a, float4 b)

Performs a per-component selection between a and b based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
float4 permute2_f4(float4 a, float4 b)

Shuffles single-precision (32-bit) floating-point elements in a and b based on the control parameters _SelectX, _SelectY, _SelectZ and _SelectW, and stores the results in dst.
float3x4 load_f3x4(f32 const *mem_addr)

Loads 12 packed single-precision (32-bit) floating-point elements from mem_addr to dst. The highest 4 packed single-precision (32-bit) floating-point elements are uninitialized.
float4x4 load_f4x4(f32 const *mem_addr)

Loads 16 packed single-precision (32-bit) floating-point elements from mem_addr to dst. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
float4x4 setf4_f4x4(float4 r0, float4 r1, float4 r2, float4 r3)

Creates one 4x4 matrix by loading four vectors, and stores the result in dst.
void store_f3x4(f32 *mem_addr, float3x4 m)

Stores the first 12 packed single-precision (32-bit) floating-point elements from m to memory at mem_addr. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
void store_f4x4(f32 *mem_addr, float4x4 m)

Stores packed single-precision (32-bit) floating-point elements from m to memory at mem_addr. mem_addr must be aligned on a 16-byte boundary or a general-protection exception may be generated.
float4x4 setzero_f4x4()

Returns a matrix of type float4x4 with all elements set to zero.
float3x4 matmul_f3x3(float3x4 a, float3x4 b)

Performs 3x3 matrix multiplication on a and b, and stores the result in dst.
float4x4 matmul_f4x4(float4x4 a, float4x4 b)

Performs 4x4 matrix multiplication on a and b, and stores the result in dst.
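Matrix multiplication is not commutative, so the argument order matters. The scalar sketch below computes the ordinary product a * b on a row-major array of 16 floats. Mat4, identity_ref and matmul_ref are hypothetical names, and the row-major layout is an assumption made for the sketch, not a statement about how float4x4 is stored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for float4x4: 16 floats, row-major (assumed layout).
using Mat4 = std::array<float, 16>;

inline Mat4 identity_ref() {
    Mat4 m{};  // value-initialized to all zeros
    for (std::size_t i = 0; i < 4; ++i) m[i * 4 + i] = 1.0f;
    return m;
}

// c[i][j] = sum over k of a[i][k] * b[k][j]
inline Mat4 matmul_ref(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k)
                sum += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = sum;
        }
    }
    return c;
}
```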
float4x4 transpose_f4x4(float4x4 a)

Performs matrix transpose on a, and stores the result in dst.
f32 determinant_f3x3(float3x4 a)

Computes the determinant of the 3x3 matrix a, and stores the result in dst.
float4 determinantv_f3x3(float3x4 a)

Computes the determinant of the 3x3 matrix a, and stores the result in every element of dst.
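For reference, the 3x3 determinant can be written as a cofactor expansion along the first row. The sketch below works on a plain row-major array of 9 floats. Mat3, F4, determinant3_ref and determinantv3_ref are hypothetical names, and the sketch does not model how float3x4 pads each row.

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<float, 9>;  // hypothetical row-major 3x3 matrix
using F4 = std::array<float, 4>;    // hypothetical stand-in for float4

// Cofactor expansion along the first row (models determinant_f3x3).
inline float determinant3_ref(const Mat3& m) {
    return m[0] * (m[4] * m[8] - m[5] * m[7])
         - m[1] * (m[3] * m[8] - m[5] * m[6])
         + m[2] * (m[3] * m[7] - m[4] * m[6]);
}

// Same value copied to all four lanes (models determinantv_f3x3).
inline F4 determinantv3_ref(const Mat3& m) {
    const float d = determinant3_ref(m);
    return F4{d, d, d, d};
}
```

A determinant of zero means the matrix has no inverse, which is worth checking before relying on the output of an inverse routine.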
float3x4 inverse_f3x3(float3x4 a, f32 *out_determinant)

Computes the determinant and the inverse matrix of a, stores the determinant in out_determinant, and stores the inverse matrix in dst.
f32 determinant_f4x4(float4x4 a)

Calculates the determinant of matrix a, and stores the result in dst.
float4 determinantv_f4x4(float4x4 a)

Calculates the determinant of matrix a, and stores the result in every element of dst.
float4x4 inverse_f4x4(float4x4 a, f32 *out_determinant)

Computes the determinant and the inverse matrix of a, stores the determinant in out_determinant, and stores the inverse matrix in dst.
float4 round_f4(float4 a)

Rounds each component of a to the nearest integer, with ties rounded to the nearest even integer, and stores the results in dst.
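Rounding halfway cases to the even neighbour differs from the "round half away from zero" behaviour of std::round. The sketch below models the even-tie rule with std::nearbyint, which follows the current floating-point rounding mode (round-to-nearest-even by default). F4 and round_ref are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

// Per-lane rounding to the nearest integer, ties to even.
// Relies on the default rounding mode (FE_TONEAREST) being active.
inline F4 round_ref(const F4& a) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = std::nearbyint(a[i]);
    return r;
}
```

Under this rule 0.5 rounds to 0, 1.5 and 2.5 both round to 2, and -1.5 rounds to -2.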
float4 modangle_f4(float4 a)

Computes the per-component angle modulo 2PI for a, and stores the results in dst. The angle is expressed in radians. The result lies in the range [-PI, PI].
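One common way to wrap an angle into [-PI, PI] is to subtract the nearest whole multiple of 2PI. The sketch below uses that approach. It is an assumption about the method, not a description of the library's implementation, and F4, kTwoPi and modangle_ref are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using F4 = std::array<float, 4>;  // hypothetical stand-in for float4

constexpr float kTwoPi = 6.28318530718f;

// Wraps each lane by removing the nearest whole number of turns:
// r = a - 2PI * round(a / 2PI)
inline F4 modangle_ref(const F4& a) {
    F4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = a[i] - kTwoPi * std::nearbyint(a[i] / kTwoPi);
    return r;
}
```

Accuracy degrades for very large inputs, because a single-precision value far from zero cannot resolve small fractions of a turn.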
float4 sin_f4(float4 a)

Computes the sine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and stores the results in dst.
float4 cos_f4(float4 a)

Computes the cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and stores the results in dst.
float4 sincos_f4(float4 &out_cos, float4 a)

Computes the sine and cosine of packed single-precision (32-bit) floating-point elements in a expressed in radians, and stores the results in dst and out_cos.
float4 mulquat_f4(float4 a, float4 b)

Multiplies two quaternions a and b, and stores the result in dst.
float4 quatinverse_f4(float4 a)

Inverts the quaternion a, and stores the result in dst.
float4 quatnormalangle_f4(float4 n, f32 a)

Computes a rotation quaternion from the given normal n and angle a, and stores the resulting quaternion in dst.
float4 quateulerangles_f4(float4 a)

Computes a rotation quaternion based on a vector containing the Euler angles (pitch, yaw, and roll), and stores the result in dst.
float4 quatlerp_f4(float4 a, float4 b, f32 t)

Interpolates between two unit quaternions a and b using linear interpolation, and stores the result in dst.
float4 quatslerp_f4(float4 a, float4 b, f32 t)

Interpolates between two unit quaternions a and b using spherical linear interpolation, and stores the result in dst.
float3x4 transform2d_f3x4(float4 translation, f32 rotation, float4 scaling)

Builds a 2D affine matrix from translation, rotation and scaling.
float3x4 transform2d_translation_f3x4(float4 translation)

Builds a 2D affine matrix from translation.
float3x4 transform2d_rotation_f3x4(f32 rotation)

Builds a 2D affine matrix from rotation.
float3x4 transform2d_scaling_f3x4(float4 scaling)

Builds a 2D affine matrix from scaling.
float4x4 transform3d_f4x4(float4 translation, float4 rotation_quaternion, float4 scaling)

Builds a 3D affine matrix from translation, rotation and scaling.
float4x4 transform3d_translation_f4x4(float4 translation)

Builds a 3D affine matrix from translation.
float4x4 transform3d_rotation_quaternion_f4x4(float4 quaternion)

Builds a 3D affine matrix from rotation represented by one quaternion.
float4x4 transform3d_rotation_x_f4x4(f32 rotation)

Builds a 3D affine matrix from rotation about the X axis.
float4x4 transform3d_rotation_y_f4x4(f32 rotation)

Builds a 3D affine matrix from rotation about the Y axis.
float4x4 transform3d_rotation_z_f4x4(f32 rotation)

Builds a 3D affine matrix from rotation about the Z axis.
float4x4 transform3d_rotation_normal_angle_f4x4(float4 normal, f32 angle)

Builds a 3D affine matrix from rotation represented by a rotation axis and a rotation angle.
float4x4 transform3d_rotation_euler_angles_f4x4(float4 pitch_yaw_roll)

Builds a 3D affine matrix from rotation represented by Euler angles (pitch, yaw, roll).
float4x4 transform3d_scaling_f4x4(float4 scaling)

Builds a 3D affine matrix from scaling.
float4x4 transform3d_look_to_f4x4(float4 eye, float4 eyedir, float4 updir)

Creates an affine matrix that transforms points and directions in world space to view space. eyedir and updir must be normalized.