From c7b72a029e28726474c7b28ed410e63b538d85d5 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Tue, 7 Jul 2015 11:20:07 -0700 Subject: [PATCH 01/25] Lay groundwork for SIMD. --- text/0000-simd-infrastructure.md | 411 +++++++++++++++++++++++++++++++ 1 file changed, 411 insertions(+) create mode 100644 text/0000-simd-infrastructure.md diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md new file mode 100644 index 00000000000..7876de442dd --- /dev/null +++ b/text/0000-simd-infrastructure.md @@ -0,0 +1,411 @@ +- Feature Name: simd_basics +- Start Date: 2015-06-02 +- RFC PR: (leave this empty) +- Rust Issue: (leave this empty) + +# Summary + +Lay the ground work for building powerful SIMD functionality. + +# Motivation + +SIMD (Single-Instruction Multiple-Data) is an important part of +performant modern applications. Most CPUs used for that sort of task +provide dedicated hardware and instructions for operating on multiple +values in a single instruction, and exposing this is an important part +of being a low-level language. + +This RFC lays the ground-work for building nice SIMD functionality, +but doesn't fill everything out. The goal here is to provide the raw +types and access to the raw instructions on each platform. + +# Detailed design + +The design comes in three parts: + +- types +- operations +- platform detection + +The general idea is to avoid bad performance cliffs, so that an +intrinsic call in Rust maps to preferably one CPU instruction, or, if +not, the "optimal" sequence required to do the given operation +anyway. This means exposing a *lot* of platform specific details, +since platforms behave very differently: both across architecture +families (x86, x86-64, ARM, MIPS, ...), and even within a family +(x86-64's Skylake, Haswell, Nehalem, ...). 
+ +There is definitely a common core of SIMD functionality shared across +many platforms, but this RFC doesn't try to extract that, it is just +building tools that can be wrapped into a more uniform API later. + +## Background: Where does this code go? + +This RFC is focused on building stable, powerful SIMD functionality in +external crates, not `std`. This makes it much easier to support +functionality only "occasionally" available with Rust's preexisting +`cfg` system. If it were in `std`, there would need to be some highly +delayed `cfg` system so that functions that only work with AVX-2 +support: + +- don't break compilation on systems that don't support it, but +- are still usable on systems that do support it. + +## Types & traits + +A type designed to be used as a SIMD vector is indicated by the +`repr(simd)` attribute. A type marked as such will be compiled to +behave like a SIMD register (as well as the target platform can +support it). + +The types/traits will be defined as follows: + +```rust +#[repr(simd)] +struct Simd2(T, T); +#[repr(simd)] +struct Simd4(T, T, T, T); +#[repr(simd)] +struct Simd8(T, T, T, T, T, T, T, T); +#[repr(simd)] +struct Simd16(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); +#[repr(simd)] +struct Simd32(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, + T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); +#[repr(simd)] +struct Simd64(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, + T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, + T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, + T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); + +trait SimdVector { + type Elem: SimdPrim; + type Bool: SimdVector::Bool>; +} + +impl for Simd2 { + type Elem = T; + type Bool = Simd2; +} +impl for Simd4 { + type Elem = T; + type Bool = Simd4; +} +// ... 
+impl for Simd64 { + type Elem = T; + type Bool = Simd64; +} + +#[simd_prim_trait] +trait SimdPrim { + type Bool: SimdPrim; +} + +// boolean types, see below +struct bool8i(...); +struct bool16i(...); +struct bool32i(...); +struct bool64i(...); +struct bool32f(...); +struct bool64f(...); + +// specifying what types are SIMD-able. +impl SimdPrim for u8 { type Bool = bool8i; } +impl SimdPrim for i8 { type Bool = bool8i; } +impl SimdPrim for u16 { type Bool = bool16i; } +// ... +impl SimdPrim for i64 { type Bool = bool64i; } + +impl SimdPrim for f32 { type Bool = bool32f; } +impl SimdPrim for f64 { type Bool = bool64f; } + +impl SimdPrim for bool8i { type Bool = bool8i; } +// ... +impl SimdPrim for bool64i { type Bool = bool64i; } + +impl SimdPrim for bool32f { type Bool = bool32f; } +impl SimdPrim for bool64f { type Bool = bool64f; } +``` + +It is illegal to take an internal reference to the fields of a +`repr(simd)` type. + +### `repr(simd)` + +The `simd` `repr` can be attached to a struct and will cause such a +struct to be compiled to a SIMD vector. It is required that the +monomorphised vector consist of only a single "primitive" type, +repeated some number of times. The restrictions on the element type +are exactly the same restrictions as `#[simd_primitive_trait]` traits +impose on their implementing types. + +The `repr(simd)` may not enforce that the trait bound exists/does the +right thing at the type checking level for generic `repr(simd)` +types. As such, it will be possible to get the code-generator to error +out (ala the old `transmute` size errosr), however, this shouldn't +cause problems in practice: libraries wrapping this functionality +would layer type-safety on top (i.e. the `SimdPrim` trait). + +### `simd_primitive_trait` + +Traits marked with the `simd_primitive_trait` attribute are special: +types implementing it are those that can be stored in SIMD +vectors. 
Initially, only primitives and single-field structs that +store `SimdPrim` types will be allowed to implement it. + +This is explicitly not a lang item: it is legal to have multiple +distinct traits in a compilation. The attribute just adds the +restriction and possibly tweaks type's internal representation (as +such, it's legal for a single type to implement multiple traits with +the attribute, if a bit pointless). + +### Booleans + +SIMD booleans are non-trivial. Many conventional APIs e.g. SSE, and +NEON, use "wide booleans": a large number of bits set to all-zeros +(false) or all-ones, e.g. equality between `Simd4(0_u32, 1, 2, 3)` and +`Simd4(0_u32, 0, 2, 3)` gives (on the CPU) `Simd4(!0_u32, 0, !0, +!0)`. Hence, the boolean types need to have width. It's tempting to +just use the integer types of the appropriate width, but this falls +down for two reasons: + +1. booleans aren't always this format +2. the source of the boolean matters + +The second is easiest: CPUs are complicated beasts, and the hardware +that handles floating point vector operations may be very different to +the hardware that handles integer ones: instructions use different +execution units. It can take several cycles to transfer data between +them. Encoding the provenance/execution unit of the value in the type +makes costs explicit. + +The first is much harder to solve. Some architectures/instruction sets +model booleans as single bits. For example, equality between +`Simd4(0_u32, 1, 2, 3)` and `Simd(0_u32, 0, 2, 3)` gives `1 + 4 + 8 == +0b1101`. One example is AVX-512 which essentially replaces all of the +older SSE through AVX2 boolean-returning instructions with versions +that return those. Using separate types for booleans (and restricting +their API) allows for some serious magic: `Simd4` becomes +`u4`. (This is where the reference-restriction above comes in.) 
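The two boolean representations described above can be modelled with plain arrays. This is only a scalar sketch under stated assumptions: `wide_eq` and `to_bitmask` are hypothetical helper names, not part of this RFC or of any vendor API, and ordinary `[u32; 4]` arrays stand in for `repr(simd)` vectors.

```rust
/// Lane-wise equality producing "wide" booleans:
/// all-ones (`!0`) for true, all-zeros for false.
/// (Hypothetical helper; arrays stand in for SIMD vectors.)
fn wide_eq(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        out[i] = if a[i] == b[i] { !0 } else { 0 };
    }
    out
}

/// Pack wide booleans into one bit per lane, lane 0 in the lowest bit,
/// mirroring the single-bit mask representation.
fn to_bitmask(wide: [u32; 4]) -> u8 {
    let mut mask = 0u8;
    for i in 0..4 {
        if wide[i] != 0 {
            mask |= 1 << i;
        }
    }
    mask
}
```

With the inputs used in the text, `wide_eq([0, 1, 2, 3], [0, 0, 2, 3])` gives `[!0, 0, !0, !0]`, and `to_bitmask` of that result gives `0b1101` (that is, `1 + 4 + 8`).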
+ +## Operations + +CPU vendors usually offer "standard" C headers for their CPU specific +operations, such as [`arm_neon.h`][armneon] and [the `...mmintrin.h` headers for +x86(-64)][x86]. + +[armneon]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf +[x86]: https://software.intel.com/sites/landingpage/IntrinsicsGuide + +All of these would be exposed as (eventually) stable intrinsics with +names very similar to those that the vendor suggests (only difference +would be some form of manual namespacing, e.g. prefixing with the CPU +target), loadable via an `extern` block with an appropriate ABI. + +```rust +extern "rust-intrinsic" { + fn x86_mm_abs_epi16(a: Simd8) -> Simd8; + // ... +} +``` + +These all use entirely concrete types, and this is the core interface +to these intrinsics: essentially it is just allowing code to exactly +specify a CPU instruction to use. These intrinsics only actually work +on a subset of the CPUs that Rust targets, and are only be available +for `extern`ing on those targets. The signatures are typechecked, but +in a "duck-typed" manner: it will just ensure that the types are SIMD +vectors with the appropriate length and element type, it will not +enforce a specific nominal type. + +There would additionally be a small set of cross-platform operations +that are either generally efficiently supported everywhere or are +extremely useful. These won't necessarily map to a single instruction, +but will be shimmed as efficiently as possible. + +- shuffles and extracting/inserting elements +- comparisons + +Lastly, arithmetic and conversions are supported via built-in operators. + +### Shuffles & element operations + +One of the most powerful features of SIMD is the ability to rearrange +data within vectors, giving super-linear speed-ups sometimes. As such, +shuffles are exposed generally: intrinsics that represent arbitrary +shuffles. 
+ +This may violate the "one instruction per instrinsic" principal +depending on the shuffle, but rearranging SIMD vectors is extremely +useful, and providing a direct intrinsic lets the compiler (a) do the +programmers work in synthesising the optimal (short) sequence of +instructions to get a given shuffle and (b) track data through +shuffles without having to understand all the details of every +platform specific intrinsic for shuffling. + +```rust +extern "rust-intrinsic" { + fn simd_shuffle2(v: T, w: T, i0: u32, i1: u32) -> Simd2; + fn simd_shuffle4(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4; + fn simd_shuffle8(v: T, w: T, + i0: u32, i1: u32, i2: u32, i3: u32, + i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8; + fn simd_shuffle16(v: T, w: T, + i0: u32, i1: u32, i2: u32, i3: u32, + i4: u32, i5: u32, i6: u32, i7: u32 + i8: u32, i9: u32, i10: u32, i11: u32, + i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16; +} +``` + +This approach has some downsides: `simd_shuffle32` (e.g. `Simd32` +on AVX, and `Simd32` on AVX-512) and especially `simd_shuffle64` +(e.g. `Simd64` on AVX-512) are unwieldy. These have similar type +"safety"/code-generation errors to the vectors themselves. + +These operations are semantically: + +```rust +// vector of double length +let z = concat(v, w); + +return [z[i0], z[i1], z[i2], ...] +``` + +The indices `iN` have to be compile time constants. + +Similarly, intrinsics for inserting/extracting elements into/out of +vectors are provided, to allow modelling the SIMD vectors as actual +CPU registers as much as possible: + +```rust +extern "rust-intrinsic" { + fn simd_insert(v: T, i0: u32, elem: T::Elem) -> T; + fn simd_extract(v: T, i0: u32) -> T::Elem; +} +``` + +The `i0` indices do not have to be constant. These are equivalent to +`v[i0] = elem` and `v[i0]` respectively. 
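The shuffle and element semantics given above can be written out as a scalar model. This is a sketch, not the intrinsics themselves: `shuffle4_model`, `insert_model` and `extract_model` are hypothetical names, arrays stand in for SIMD vectors, and the inputs are assumed to have four lanes. Out-of-range indices panic here, which is a choice made for the sketch only.

```rust
/// Scalar model of a four-lane shuffle: index into the concatenation
/// of `v` and `w` (a "vector of double length").
fn shuffle4_model<T: Copy>(v: [T; 4], w: [T; 4], idx: [u32; 4]) -> [T; 4] {
    let z: [T; 8] = [v[0], v[1], v[2], v[3], w[0], w[1], w[2], w[3]];
    [
        z[idx[0] as usize],
        z[idx[1] as usize],
        z[idx[2] as usize],
        z[idx[3] as usize],
    ]
}

/// Scalar model of inserting an element: equivalent to `v[i0] = elem`,
/// returning the updated vector.
fn insert_model<T: Copy>(mut v: [T; 4], i0: u32, elem: T) -> [T; 4] {
    v[i0 as usize] = elem;
    v
}

/// Scalar model of extracting an element: equivalent to `v[i0]`.
fn extract_model<T: Copy>(v: [T; 4], i0: u32) -> T {
    v[i0 as usize]
}
```

For example, `shuffle4_model([10, 11, 12, 13], [20, 21, 22, 23], [0, 4, 1, 5])` interleaves the low halves of the two inputs and gives `[10, 20, 11, 21]`.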
+ +### Comparisons + +Comparisons are implemented via intrinsics, because the current +comparison operator infrastructure doesn't easily lend itself to +return vectors, as required. + +A library could give signatures like: + +```rust +extern "rust-intrinsic" { + fn simd_eq(v: T, w: T) -> T::Bool; + fn simd_ne(v: T, w: T) -> T::Bool; + fn simd_lt(v: T, w: T) -> T::Bool; + fn simd_le(v: T, w: T) -> T::Bool; + fn simd_gt(v: T, w: T) -> T::Bool; + fn simd_ge(v: T, w: T) -> T::Bool; +} +``` + + +### Built-in functionality + +Any type marked `repr(simd)` automatically has the `+`, `-` and `*` +operators work. The `/` operator works for floating point, and the +`<<` and `>>` ones work for integers. + +SIMD vectors can be converted with `as`. As with intrinsics, this is +"duck-typed" it is possible to cast a vector type `V` to a type `W` if +their lengths match and their elements are castable (i.e. are +primitives), there's no enforcement of nominal types. + +All of these are never checked: explicit SIMD is essentially only +required for speed, and checking inflates one instruction to 5 or +more. + +## Platform Detection + +The availability of efficient SIMD functionality is very fine-grained, +and our current `cfg(target_arch = "...")` is not precise enough. This +RFC proposes a `target_feature` `cfg`, that would be set to the +features of the architecture that are known to be supported by the +exact target e.g. + +- a default x86-64 compilation would essentially only set + `target_feature = "sse"` and `target_feature = "sse2"` +- compiling with `-C target-feature="+sse4.2"` would set + `target_feature = "sse4.2"`, `target_feature = "sse.4.1"`, ..., + `target_feature = "sse"`. +- compiling with `-C target-cpu=native` on a modern CPU might set + `target_feature = "avx2"`, `target_feature = "avx"`, ... + +(There are other non-SIMD features that might have `target_feature`s +set too, such as `popcnt` and `rdrnd` on x86/x86-64.) 
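As a rough illustration of how such a `cfg` can be queried, the sketch below uses the built-in `cfg!` macro to report which path a given build would select. It assumes a compiler in which `target_feature` is available as a `cfg` key; `selected_path` is a hypothetical function name, and the feature list is illustrative rather than exhaustive.

```rust
/// Report which SIMD code path this build would select, based on the
/// target features enabled at compile time.
/// (Hypothetical helper; the feature names checked are illustrative.)
fn selected_path() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse4.2") {
        "sse4.2"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else if cfg!(target_feature = "neon") {
        "neon"
    } else {
        "fallback"
    }
}
```

Because the decision is made at compile time, rebuilding with a flag such as `-C target-feature=+sse4.2` can change the string that `selected_path()` returns.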
+ +With a `cfg_if_else!` macro that expands to the first `cfg` that is +satisfied (ala [@alexcrichton's cascade][cascade]), code might look +like: + +[cascade]: https://github.com/alexcrichton/backtrace-rs/blob/03703031babfa87cbe2c723ad6752131819dc554/src/macros.rs + +```rust +cfg_if_else! { + if #[cfg(target_feature = "avx")] { + fn foo() { /* use AVX things */ } + } else if #[cfg(target_feature = "sse4.1")] { + fn foo() { /* use SSE4.1 things */ } + } else if #[cfg(target_feature = "sse2")] { + fn foo() { /* use SSE2 things */ } + } else if #[cfg(target_feature = "neon")] { + fn foo() { /* use NEON things */ } + } else { + fn foo() { /* universal fallback */ } + } +} +``` + +# Extensions + +- scatter/gather operations allow (partially) operating on a SIMD + vector of pointers. This would require extending `SimdPrim` to also + allow pointer types. +- allow (and ignore for everything but type checking) zero-sized types + in `repr(simd)` structs, to allow tagging them with markers + +# Alternatives + +- The SIMD on-route-to-stable intrinsics could have their own ABI +- Intrinsics could instead by namespaced by ABI, `extern + "x86-intrinsic"`, `extern "arm-intrinsic"`. +- There could be more syntactic support for shuffles, either with true + syntax, or with a syntax extension. The latter might look like: + `shuffle![x, y, i0, i1, i2, i3, i4, ...]`. However, this requires + that shuffles are restricted to a single type only (i.e. `Simd4` + can be shuffled to `Simd4` but nothing else), or some sort of + type synthesis. The compiler has to somehow work out the return + value: + + ```rust + let x: Simd4 = ...; + let y: Simd4 = ...; + + // reverse all the elements. + let z = shuffle![x, y, 7, 6, 5, 4, 3, 2, 1, 0]; + ``` + + Presumably `z` should be `Simd8`, but it's not obvious how the + compiler can know this. The `repr(simd)` approach means there may be + more than one SIMD-vector type with the `Simd8` shape (or, in + fact, there may be zero). 
+- Instead of platform detection, there could be feature detection + (e.g. "platform supports something equivalent to x86's `DPPS`"), but + there probably aren't enough cross-platform commonalities for this + to be worth it. (Each "feature" would essentially be a platform + specific `cfg` anyway.) +- Check vector operators in debug mode just like the scalar versions. + +# Unresolved questions + +- Should integer vectors get `/` and `%` automatically? Most CPUs + don't support them for vectors. From 272ce9b458b32c1e6a2f4e349e5a2b4458673107 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 8 Jul 2015 10:58:49 -0700 Subject: [PATCH 02/25] First round of changes: remove extraneous traits etc. --- text/0000-simd-infrastructure.md | 176 +++++++++---------------------- 1 file changed, 50 insertions(+), 126 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 7876de442dd..6fb50f0cb5a 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -19,6 +19,24 @@ This RFC lays the ground-work for building nice SIMD functionality, but doesn't fill everything out. The goal here is to provide the raw types and access to the raw instructions on each platform. +## Where does this code go? Aka. why not in `std`? + +This RFC is focused on building stable, powerful SIMD functionality in +external crates, not `std`. + +This makes it much easier to support functionality only "occasionally" +available with Rust's preexisting `cfg` system. If it were in `std`, +there would need to be some highly delayed `cfg` system so that +functions that only work with (say) AVX-2 support: + +- don't break compilation on systems that don't support it, but +- are still usable on systems that do support it. + +With an external crate, we can leverage `cargo`'s existing build +infrastructure: compiling with some target features will rebuild with +those features enabled. 
+ + # Detailed design The design comes in three parts: @@ -39,113 +57,42 @@ There is definitely a common core of SIMD functionality shared across many platforms, but this RFC doesn't try to extract that, it is just building tools that can be wrapped into a more uniform API later. -## Background: Where does this code go? - -This RFC is focused on building stable, powerful SIMD functionality in -external crates, not `std`. This makes it much easier to support -functionality only "occasionally" available with Rust's preexisting -`cfg` system. If it were in `std`, there would need to be some highly -delayed `cfg` system so that functions that only work with AVX-2 -support: - -- don't break compilation on systems that don't support it, but -- are still usable on systems that do support it. ## Types & traits -A type designed to be used as a SIMD vector is indicated by the -`repr(simd)` attribute. A type marked as such will be compiled to -behave like a SIMD register (as well as the target platform can -support it). - -The types/traits will be defined as follows: +There are two new attributes: `repr(simd)` and `simd_primitive_trait` ```rust #[repr(simd)] -struct Simd2(T, T); -#[repr(simd)] -struct Simd4(T, T, T, T); -#[repr(simd)] -struct Simd8(T, T, T, T, T, T, T, T); -#[repr(simd)] -struct Simd16(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); -#[repr(simd)] -struct Simd32(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); -#[repr(simd)] -struct Simd64(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T); - -trait SimdVector { - type Elem: SimdPrim; - type Bool: SimdVector::Bool>; -} - -impl for Simd2 { - type Elem = T; - type Bool = Simd2; -} -impl for Simd4 { - type Elem = T; - type Bool = Simd4; -} -// ... 
-impl for Simd64 { - type Elem = T; - type Bool = Simd64; -} +struct f32x4(f32, f32, f23, f23); -#[simd_prim_trait] -trait SimdPrim { - type Bool: SimdPrim; -} +#[repr(simd)] +struct Simd2(T, T); -// boolean types, see below -struct bool8i(...); -struct bool16i(...); -struct bool32i(...); -struct bool64i(...); -struct bool32f(...); -struct bool64f(...); - -// specifying what types are SIMD-able. -impl SimdPrim for u8 { type Bool = bool8i; } -impl SimdPrim for i8 { type Bool = bool8i; } -impl SimdPrim for u16 { type Bool = bool16i; } -// ... -impl SimdPrim for i64 { type Bool = bool64i; } - -impl SimdPrim for f32 { type Bool = bool32f; } -impl SimdPrim for f64 { type Bool = bool64f; } - -impl SimdPrim for bool8i { type Bool = bool8i; } -// ... -impl SimdPrim for bool64i { type Bool = bool64i; } - -impl SimdPrim for bool32f { type Bool = bool32f; } -impl SimdPrim for bool64f { type Bool = bool64f; } +#[simd_primitive_trait] +trait SimdPrim {} ``` -It is illegal to take an internal reference to the fields of a -`repr(simd)` type. - ### `repr(simd)` The `simd` `repr` can be attached to a struct and will cause such a -struct to be compiled to a SIMD vector. It is required that the -monomorphised vector consist of only a single "primitive" type, -repeated some number of times. The restrictions on the element type -are exactly the same restrictions as `#[simd_primitive_trait]` traits -impose on their implementing types. +struct to be compiled to a SIMD vector. It can be generic, but it is +required that any fully monomorphised instance of the type consist of +only a single "primitive" type, repeated some number of times. The +restrictions on the element type are exactly the same restrictions as +`#[simd_primitive_trait]` traits impose on their implementing types. The `repr(simd)` may not enforce that the trait bound exists/does the right thing at the type checking level for generic `repr(simd)` types. 
As such, it will be possible to get the code-generator to error out (ala the old `transmute` size errosr), however, this shouldn't cause problems in practice: libraries wrapping this functionality -would layer type-safety on top (i.e. the `SimdPrim` trait). +would layer type-safety on top (i.e. generic `repr(simd)` types would +use the `SimdPrim` trait as a bound). + +It is illegal to take an internal reference to the fields of a +`repr(simd)` type, because the representation of booleans may require +to change, so that booleans are bit-packed. ### `simd_primitive_trait` @@ -160,35 +107,6 @@ restriction and possibly tweaks type's internal representation (as such, it's legal for a single type to implement multiple traits with the attribute, if a bit pointless). -### Booleans - -SIMD booleans are non-trivial. Many conventional APIs e.g. SSE, and -NEON, use "wide booleans": a large number of bits set to all-zeros -(false) or all-ones, e.g. equality between `Simd4(0_u32, 1, 2, 3)` and -`Simd4(0_u32, 0, 2, 3)` gives (on the CPU) `Simd4(!0_u32, 0, !0, -!0)`. Hence, the boolean types need to have width. It's tempting to -just use the integer types of the appropriate width, but this falls -down for two reasons: - -1. booleans aren't always this format -2. the source of the boolean matters - -The second is easiest: CPUs are complicated beasts, and the hardware -that handles floating point vector operations may be very different to -the hardware that handles integer ones: instructions use different -execution units. It can take several cycles to transfer data between -them. Encoding the provenance/execution unit of the value in the type -makes costs explicit. - -The first is much harder to solve. Some architectures/instruction sets -model booleans as single bits. For example, equality between -`Simd4(0_u32, 1, 2, 3)` and `Simd(0_u32, 0, 2, 3)` gives `1 + 4 + 8 == -0b1101`. 
One example is AVX-512 which essentially replaces all of the -older SSE through AVX2 boolean-returning instructions with versions -that return those. Using separate types for booleans (and restricting -their API) allows for some serious magic: `Simd4` becomes -`u4`. (This is where the reference-restriction above comes in.) - ## Operations CPU vendors usually offer "standard" C headers for their CPU specific @@ -295,19 +213,23 @@ Comparisons are implemented via intrinsics, because the current comparison operator infrastructure doesn't easily lend itself to return vectors, as required. -A library could give signatures like: +The raw signatures would look like: ```rust extern "rust-intrinsic" { - fn simd_eq(v: T, w: T) -> T::Bool; - fn simd_ne(v: T, w: T) -> T::Bool; - fn simd_lt(v: T, w: T) -> T::Bool; - fn simd_le(v: T, w: T) -> T::Bool; - fn simd_gt(v: T, w: T) -> T::Bool; - fn simd_ge(v: T, w: T) -> T::Bool; + fn simd_eq(v: T, w: T) -> U; + fn simd_ne(v: T, w: T) -> U; + fn simd_lt(v: T, w: T) -> U; + fn simd_le(v: T, w: T) -> U; + fn simd_gt(v: T, w: T) -> U; + fn simd_ge(v: T, w: T) -> U; } ``` +However, these will be type checked, to ensure that `T` and `U` are +the same length, and that `U` is appropriately shaped for a boolean. A +library actually importing them might use some trait bounds to get +actual type-safety. ### Built-in functionality @@ -340,8 +262,10 @@ exact target e.g. - compiling with `-C target-cpu=native` on a modern CPU might set `target_feature = "avx2"`, `target_feature = "avx"`, ... -(There are other non-SIMD features that might have `target_feature`s -set too, such as `popcnt` and `rdrnd` on x86/x86-64.) +The possible values of `target_feature` will be a selected whitelist, +not necessarily just everything LLVM understands. There are other +non-SIMD features that might have `target_feature`s set too, such as +`popcnt` and `rdrnd` on x86/x86-64.) 
With a `cfg_if_else!` macro that expands to the first `cfg` that is satisfied (ala [@alexcrichton's cascade][cascade]), code might look From 5893b163931a7f27fe724a6cd456ef38f8cd76c0 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 8 Jul 2015 14:19:10 -0700 Subject: [PATCH 03/25] Second round of changes: minor tweaks. --- text/0000-simd-infrastructure.md | 88 ++++++++++++++++++++------------ 1 file changed, 54 insertions(+), 34 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 6fb50f0cb5a..a33f3b27068 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -1,4 +1,4 @@ -- Feature Name: simd_basics +- Feature Name: simd_basics, cfg_target_feature - Start Date: 2015-06-02 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) @@ -25,12 +25,12 @@ This RFC is focused on building stable, powerful SIMD functionality in external crates, not `std`. This makes it much easier to support functionality only "occasionally" -available with Rust's preexisting `cfg` system. If it were in `std`, -there would need to be some highly delayed `cfg` system so that -functions that only work with (say) AVX-2 support: - -- don't break compilation on systems that don't support it, but -- are still usable on systems that do support it. +available with Rust's preexisting `cfg` system. There's no way for +`std` to conditionally provide an API based on the target features +used for the final artifact. Building `std` in every configuration is +certainly untenable. Hence, if it were to be in `std`, there would +need to be some highly delayed `cfg` system to support that sort of +conditional API exposure. With an external crate, we can leverage `cargo`'s existing build infrastructure: compiling with some target features will rebuild with @@ -39,11 +39,11 @@ those features enabled. 
# Detailed design -The design comes in three parts: +The design comes in three parts, all on the path to stabilisation: -- types -- operations -- platform detection +- types (`feature(simd_basics)`) +- operations (`feature(simd_basics)`) +- platform detection (`feature(cfg_target_feature)`) The general idea is to avoid bad performance cliffs, so that an intrinsic call in Rust maps to preferably one CPU instruction, or, if @@ -92,7 +92,9 @@ use the `SimdPrim` trait as a bound). It is illegal to take an internal reference to the fields of a `repr(simd)` type, because the representation of booleans may require -to change, so that booleans are bit-packed. +to change, so that booleans are bit-packed. The official external +library providing SIMD support will have private fields so this will +not be generally observable. ### `simd_primitive_trait` @@ -107,6 +109,13 @@ restriction and possibly tweaks type's internal representation (as such, it's legal for a single type to implement multiple traits with the attribute, if a bit pointless). +This trait exists to allow new-type wrappers around primitives to also +be usable in a SIMD context. However, this only works in limited +scenarios (i.e. when the type wraps a single primitive) and so needs +to be an explicit part of every type's API: type authors opt-in to +being designed-for-SIMD. If it was implicit, changes to private fields +may break downstream code. + ## Operations CPU vendors usually offer "standard" C headers for their CPU specific @@ -116,10 +125,13 @@ x86(-64)][x86]. [armneon]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf [x86]: https://software.intel.com/sites/landingpage/IntrinsicsGuide -All of these would be exposed as (eventually) stable intrinsics with -names very similar to those that the vendor suggests (only difference -would be some form of manual namespacing, e.g. prefixing with the CPU -target), loadable via an `extern` block with an appropriate ABI. 
+All of these would be exposed as compiler intrinsics with names very +similar to those that the vendor suggests (only difference would be +some form of manual namespacing, e.g. prefixing with the CPU target), +loadable via an `extern` block with an appropriate ABI. This subset of +intrinsics would be on the path to stabilisation (that is, one can +"import" them with `extern` in stable code), and would not be exported +by `std`. ```rust extern "rust-intrinsic" { @@ -164,19 +176,24 @@ platform specific intrinsic for shuffling. ```rust extern "rust-intrinsic" { - fn simd_shuffle2(v: T, w: T, i0: u32, i1: u32) -> Simd2; - fn simd_shuffle4(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4; - fn simd_shuffle8(v: T, w: T, - i0: u32, i1: u32, i2: u32, i3: u32, - i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8; - fn simd_shuffle16(v: T, w: T, - i0: u32, i1: u32, i2: u32, i3: u32, - i4: u32, i5: u32, i6: u32, i7: u32 - i8: u32, i9: u32, i10: u32, i11: u32, - i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16; + fn simd_shuffle2(v: T, w: T, i0: u32, i1: u32) -> Simd2; + fn simd_shuffle4(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Sidm4; + fn simd_shuffle8(v: T, w: T, + i0: u32, i1: u32, i2: u32, i3: u32, + i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8; + fn simd_shuffle16(v: T, w: T, + i0: u32, i1: u32, i2: u32, i3: u32, + i4: u32, i5: u32, i6: u32, i7: u32 + i8: u32, i9: u32, i10: u32, i11: u32, + i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16; } ``` +The raw definitions are only checked for validity at monomorphisation +time, ensure that `T` is a SIMD vector, `U` is the element type of `T` +etc. Libraries can use traits to ensure that these will be enforced by +the type checker too. + This approach has some downsides: `simd_shuffle32` (e.g. `Simd32` on AVX, and `Simd32` on AVX-512) and especially `simd_shuffle64` (e.g. `Simd64` on AVX-512) are unwieldy. These have similar type @@ -191,7 +208,8 @@ let z = concat(v, w); return [z[i0], z[i1], z[i2], ...] 
``` -The indices `iN` have to be compile time constants. +The indices `iN` have to be compile time constants. Out of bounds +indices yield unspecified results. Similarly, intrinsics for inserting/extracting elements into/out of vectors are provided, to allow modelling the SIMD vectors as actual @@ -199,13 +217,14 @@ CPU registers as much as possible: ```rust extern "rust-intrinsic" { - fn simd_insert(v: T, i0: u32, elem: T::Elem) -> T; - fn simd_extract(v: T, i0: u32) -> T::Elem; + fn simd_insert(v: T, i0: u32, elem: Elem) -> T; + fn simd_extract(v: T, i0: u32) -> Elem; } ``` The `i0` indices do not have to be constant. These are equivalent to -`v[i0] = elem` and `v[i0]` respectively. +`v[i0] = elem` and `v[i0]` respectively. They are type checked +similarly to the shuffles. ### Comparisons @@ -226,10 +245,10 @@ extern "rust-intrinsic" { } ``` -However, these will be type checked, to ensure that `T` and `U` are -the same length, and that `U` is appropriately shaped for a boolean. A -library actually importing them might use some trait bounds to get -actual type-safety. +These are type checked during code-generation similarly to the +shuffles. Ensuring that `T` and `U` has the same length, and that `U` +is appropriately "boolean"-y. Libraries can use traits to ensure that +these will be enforced by the type checker too. ### Built-in functionality @@ -333,3 +352,4 @@ cfg_if_else! { - Should integer vectors get `/` and `%` automatically? Most CPUs don't support them for vectors. +- How should out-of-bounds shuffle and insert/extract indices be handled? From f9e48d11f9f9c8c7977da2cc371103e94783c33c Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Thu, 9 Jul 2015 09:28:07 -0700 Subject: [PATCH 04/25] Clarify/fix typos. 
--- text/0000-simd-infrastructure.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index a33f3b27068..a132e856d6d 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -64,7 +64,7 @@ There are two new attributes: `repr(simd)` and `simd_primitive_trait` ```rust #[repr(simd)] -struct f32x4(f32, f32, f23, f23); +struct f32x4(f32, f32, f32, f32); #[repr(simd)] struct Simd2(T, T); @@ -261,9 +261,10 @@ SIMD vectors can be converted with `as`. As with intrinsics, this is their lengths match and their elements are castable (i.e. are primitives), there's no enforcement of nominal types. -All of these are never checked: explicit SIMD is essentially only -required for speed, and checking inflates one instruction to 5 or -more. +All of these operators and conversions are never checked (in the sense +of the arithmetic overflow checks of `-C debug-assertions`): explicit +SIMD is essentially only required for speed, and checking inflates one +instruction to 5 or more. ## Platform Detection From e2fc223c99bea7aadf39f0859e5e7f0244175e5b Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Thu, 9 Jul 2015 09:59:58 -0700 Subject: [PATCH 05/25] Note that fixed-length arrays could be repr(simd)'d. --- text/0000-simd-infrastructure.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index a132e856d6d..af5fb0b0d15 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -348,6 +348,11 @@ cfg_if_else! { to be worth it. (Each "feature" would essentially be a platform specific `cfg` anyway.) - Check vector operators in debug mode just like the scalar versions. +- Make fixed length arrays `repr(simd)`-able (via just flattening), so + that, say, `#[repr(simd)] struct u32x4([u32; 4]);` and + `#[repr(simd)] struct f64x8([f64; 4], [f64; 4]);` etc works. 
This + will be most useful if/when we allow generic-lengths, `#[repr(simd)] + struct Simd([T; n]);` # Unresolved questions From efeafdb770728c3fe57a9e34da31421f38740943 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Thu, 9 Jul 2015 17:10:37 -0700 Subject: [PATCH 06/25] Remove the simd_primitive_trait attribute. Not really necessary: the type safety it offers can be provided by libraries. --- text/0000-simd-infrastructure.md | 53 ++++++++++---------------------- 1 file changed, 16 insertions(+), 37 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index af5fb0b0d15..a2a6e9884ed 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -58,9 +58,9 @@ many platforms, but this RFC doesn't try to extract that, it is just building tools that can be wrapped into a more uniform API later. -## Types & traits +## Types -There are two new attributes: `repr(simd)` and `simd_primitive_trait` +There is a new attributes: `repr(simd)`. ```rust #[repr(simd)] @@ -68,54 +68,30 @@ struct f32x4(f32, f32, f32, f32); #[repr(simd)] struct Simd2(T, T); - -#[simd_primitive_trait] -trait SimdPrim {} ``` -### `repr(simd)` - The `simd` `repr` can be attached to a struct and will cause such a struct to be compiled to a SIMD vector. It can be generic, but it is required that any fully monomorphised instance of the type consist of -only a single "primitive" type, repeated some number of times. The -restrictions on the element type are exactly the same restrictions as -`#[simd_primitive_trait]` traits impose on their implementing types. +only a single "primitive" type, repeated some number of times. Types +are flattened, so, for `struct Bar(u64);`, `Simd2` has the same +representation as `Simd2`. -The `repr(simd)` may not enforce that the trait bound exists/does the +The `repr(simd)` may not enforce that any trait bounds exists/does the right thing at the type checking level for generic `repr(simd)` types. 
As such, it will be possible to get the code-generator to error -out (ala the old `transmute` size errosr), however, this shouldn't +out (ala the old `transmute` size errors), however, this shouldn't cause problems in practice: libraries wrapping this functionality would layer type-safety on top (i.e. generic `repr(simd)` types would -use the `SimdPrim` trait as a bound). +use some `unsafe` trait as a bound that is designed to only be +implemented by types that will work). It is illegal to take an internal reference to the fields of a `repr(simd)` type, because the representation of booleans may require -to change, so that booleans are bit-packed. The official external +modification, so that booleans are bit-packed. The official external library providing SIMD support will have private fields so this will not be generally observable. -### `simd_primitive_trait` - -Traits marked with the `simd_primitive_trait` attribute are special: -types implementing it are those that can be stored in SIMD -vectors. Initially, only primitives and single-field structs that -store `SimdPrim` types will be allowed to implement it. - -This is explicitly not a lang item: it is legal to have multiple -distinct traits in a compilation. The attribute just adds the -restriction and possibly tweaks type's internal representation (as -such, it's legal for a single type to implement multiple traits with -the attribute, if a bit pointless). - -This trait exists to allow new-type wrappers around primitives to also -be usable in a SIMD context. However, this only works in limited -scenarios (i.e. when the type wraps a single primitive) and so needs -to be an explicit part of every type's API: type authors opt-in to -being designed-for-SIMD. If it was implicit, changes to private fields -may break downstream code. - ## Operations CPU vendors usually offer "standard" C headers for their CPU specific @@ -312,8 +288,8 @@ cfg_if_else! 
{ # Extensions - scatter/gather operations allow (partially) operating on a SIMD - vector of pointers. This would require extending `SimdPrim` to also - allow pointer types. + vector of pointers. This would require allowing + pointers(/references?) in `repr(simd)` types. - allow (and ignore for everything but type checking) zero-sized types in `repr(simd)` structs, to allow tagging them with markers @@ -353,9 +329,12 @@ cfg_if_else! { `#[repr(simd)] struct f64x8([f64; 4], [f64; 4]);` etc works. This will be most useful if/when we allow generic-lengths, `#[repr(simd)] struct Simd([T; n]);` +- have 100% guaranteed type-safety for generic `#[repr(simd)]` types + and the generic intrinsics. This would probably require a relatively + complicated set of traits (with compiler integration). # Unresolved questions - Should integer vectors get `/` and `%` automatically? Most CPUs - don't support them for vectors. + don't support them for vectors. However - How should out-of-bounds shuffle and insert/extract indices be handled? From a7c409b3291758eab122d9bef034dfc2f255fb0e Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Fri, 10 Jul 2015 10:43:14 -0700 Subject: [PATCH 07/25] Mention alignment changes due to repr(simd). --- text/0000-simd-infrastructure.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index a2a6e9884ed..f815bb4b9c2 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -92,6 +92,10 @@ modification, so that booleans are bit-packed. The official external library providing SIMD support will have private fields so this will not be generally observable. +Adding `repr(simd)` to a type may increase its minimum/preferred +alignment, based on platform behaviour. (E.g. x86 wants its 128-bit +SSE vectors to be 128-bit aligned.) 
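The alignment increase described in the paragraph above can be observed on present-day stable Rust through `std::arch::x86_64::__m128`, a standard-library type that post-dates this RFC. Treating it as representative of a `repr(simd)` struct is an assumption of this sketch, and the helper names `sse_vector_align` and `plain_array_align` are hypothetical.

```rust
use std::mem::align_of;

/// Alignment of the standard library's 128-bit SSE vector type.
/// Only meaningful when compiling for x86-64.
#[cfg(target_arch = "x86_64")]
fn sse_vector_align() -> Option<usize> {
    Some(align_of::<std::arch::x86_64::__m128>())
}

/// On other architectures this sketch has nothing to report.
#[cfg(not(target_arch = "x86_64"))]
fn sse_vector_align() -> Option<usize> {
    None
}

/// Alignment of a plain array carrying the same four `f32` lanes.
/// An array is only as aligned as its element type.
fn plain_array_align() -> usize {
    align_of::<[f32; 4]>()
}
```

On x86-64 the vector type reports 16 bytes (128 bits), while the plain array keeps the alignment of `f32`, which is the kind of difference the paragraph warns about.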
+ ## Operations CPU vendors usually offer "standard" C headers for their CPU specific From f4e2ecfbe09c0798433f49785a625e8d54ac9b04 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Fri, 10 Jul 2015 10:48:55 -0700 Subject: [PATCH 08/25] Note pre-RFC discussion. --- text/0000-simd-infrastructure.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index f815bb4b9c2..0c6af575682 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -19,6 +19,9 @@ This RFC lays the ground-work for building nice SIMD functionality, but doesn't fill everything out. The goal here is to provide the raw types and access to the raw instructions on each platform. +(An earlier variant of this RFC was discussed as a +[pre-RFC](https://internals.rust-lang.org/t/pre-rfc-simd-groundwork/2343).) + ## Where does this code go? Aka. why not in `std`? This RFC is focused on building stable, powerful SIMD functionality in From 1132ede301fae89809ce2932faa9ef8cdad103b5 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Mon, 13 Jul 2015 13:39:43 -0700 Subject: [PATCH 09/25] Clarify how the intrinsics' structural typing works. --- text/0000-simd-infrastructure.md | 33 +++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 0c6af575682..4da9d3926cf 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -132,6 +132,33 @@ in a "duck-typed" manner: it will just ensure that the types are SIMD vectors with the appropriate length and element type, it will not enforce a specific nominal type. +NB. The structural typing is just for the declaration: if a SIMD intrinsic +is declared to take a type `X`, it must always be called with `X`, +even if other types are structurally equal to `X`. 
Also, within a +signature, SIMD types that must be structurally equal must be nominal +equal. I.e. if the `add_...` all refer to the same intrinsic to add a +SIMD vector of bytes, + +```rust +// (same length) +struct A(u8, u8, ..., u8); +struct B(u8, u8, ..., u8); + +extern "rust-intrinsic" { + fn add_aaa(x: A, y: A) -> A; // ok + fn add_bbb(x: B, y: B) -> B; // ok + fn add_aab(x: A, y: A) -> B; // error, expected B, found A + fn add_bab(x: B, y: A) -> B; // error, expected A, found B +} + +fn double_a(x: A) -> A { + add_aaa(x, x) +} +fn double_b(x: B) -> B { + add_aaa(x, x) // error, expected A, found B +} +``` + There would additionally be a small set of cross-platform operations that are either generally efficiently supported everywhere or are extremely useful. These won't necessarily map to a single instruction, @@ -173,9 +200,9 @@ extern "rust-intrinsic" { ``` The raw definitions are only checked for validity at monomorphisation -time, ensure that `T` is a SIMD vector, `U` is the element type of `T` -etc. Libraries can use traits to ensure that these will be enforced by -the type checker too. +time, ensure that `T` is a SIMD vector, `Elem` is the element type of +`T` etc. Libraries can use traits to ensure that these will be +enforced by the type checker too. This approach has some downsides: `simd_shuffle32` (e.g. `Simd32` on AVX, and `Simd32` on AVX-512) and especially `simd_shuffle64` From 8317ea49ce509af257722fcb0be27bf23de54377 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Mon, 13 Jul 2015 13:41:13 -0700 Subject: [PATCH 10/25] Add arithmetic intrinsics alternative. --- text/0000-simd-infrastructure.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 4da9d3926cf..a93fcf3d319 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -366,6 +366,9 @@ cfg_if_else! 
{ - have 100% guaranteed type-safety for generic `#[repr(simd)]` types and the generic intrinsics. This would probably require a relatively complicated set of traits (with compiler integration). +- use generic intrinsics like shuffles for the arithmetic operations, + instead of providing the operations implicitly. + # Unresolved questions From c6ed18ac09c44e01d1076c8b1baf9fbc1877068a Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Mon, 13 Jul 2015 13:53:56 -0700 Subject: [PATCH 11/25] Write down an answer to "why not `asm!`?". --- text/0000-simd-infrastructure.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index a93fcf3d319..cc3063a4494 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -276,6 +276,33 @@ of the arithmetic overflow checks of `-C debug-assertions`): explicit SIMD is essentially only required for speed, and checking inflates one instruction to 5 or more. +### Why not inline asm? + +One alternative to providing intrinsics is to instead just use +inline-asm to expose each CPU instruction. However, this approach has +essentially only one benefit (avoiding defining the intrinsics), but +several downsides, e.g. + +- assembly is generally a black-box to optimisers, inhibiting + optimisations, like algebraic simplification/transformation, +- programmers would have to manually synthesise the right sequence of + operations to achieve a given shuffle, while having a generic + shuffle intrinsic lets the compiler do it (NB. the intention is that + the programmer will still have access to the platform specific + operations for when the compiler synthesis isn't quite right), +- inline assembly is not currently stable in + Rust and there's not a strong push for it to be so in the immediate + future (although this could change). 
+ +Benefits of manual assembly writing, like instruction scheduling and +register allocation don't apply to the (generally) one-instruction +`asm!` blocks that replace the intrinsics (they need to be designed so +that the compiler has full control over register allocation, or else +the result will be strictly worse). Those possible advantages of hand +written assembly over intrinsics only come in to play when writing +longer blocks of raw assembly, i.e. some inner loop might be faster +when written as a single chunk of asm rather than as intrinsics. + ## Platform Detection The availability of efficient SIMD functionality is very fine-grained, From 67f78ec728913f56c9e969b2226d854724594ad1 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Tue, 14 Jul 2015 09:57:03 -0700 Subject: [PATCH 12/25] point to cfg-if. --- text/0000-simd-infrastructure.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index cc3063a4494..41c7d8bd2e3 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -324,11 +324,11 @@ not necessarily just everything LLVM understands. There are other non-SIMD features that might have `target_feature`s set too, such as `popcnt` and `rdrnd` on x86/x86-64.) -With a `cfg_if_else!` macro that expands to the first `cfg` that is -satisfied (ala [@alexcrichton's cascade][cascade]), code might look +With a `cfg_if!` macro that expands to the first `cfg` that is +satisfied (ala [@alexcrichton's `cfg-if`][cfg-if]), code might look like: -[cascade]: https://github.com/alexcrichton/backtrace-rs/blob/03703031babfa87cbe2c723ad6752131819dc554/src/macros.rs +[cfg-if]: https://crates.io/crates/cfg-if ```rust cfg_if_else! { From f71c4b32e27215dd40c0c0483a883c825c575ad0 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Mon, 3 Aug 2015 11:30:30 -0700 Subject: [PATCH 13/25] Use intrinsics for arithmetic instead of built-in operators. 
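With arithmetic exposed as intrinsics rather than built-in operators, a wrapping library would presumably restore `+` itself through `std::ops::Add`. A minimal sketch of that layering follows, under stated assumptions: `F32x4` is a hypothetical vector type (a real one would carry `#[repr(simd)]`), and `simd_add_model` is a scalar stand-in for the proposed `simd_add` intrinsic, assumed here to add lane by lane.

```rust
use std::ops::Add;

/// Hypothetical 4-lane vector type; a real library would mark it `#[repr(simd)]`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32x4([f32; 4]);

/// Scalar stand-in for the proposed `simd_add` intrinsic
/// (assumption: plain lane-wise addition).
fn simd_add_model(x: F32x4, y: F32x4) -> F32x4 {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = x.0[i] + y.0[i];
    }
    F32x4(out)
}

/// The library, not the compiler, decides that `+` means lane-wise addition.
impl Add for F32x4 {
    type Output = F32x4;
    fn add(self, rhs: F32x4) -> F32x4 {
        simd_add_model(self, rhs)
    }
}
```

The operator surface stays a library decision under this arrangement, so a crate could equally choose not to implement an operator (integer division, say) for a given vector type.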
--- text/0000-simd-infrastructure.md | 35 +++++++++++++++++++------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 41c7d8bd2e3..7fedaabed8f 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -166,8 +166,8 @@ but will be shimmed as efficiently as possible. - shuffles and extracting/inserting elements - comparisons - -Lastly, arithmetic and conversions are supported via built-in operators. +- arithmetic +- conversions ### Shuffles & element operations @@ -260,21 +260,28 @@ shuffles. Ensuring that `T` and `U` has the same length, and that `U` is appropriately "boolean"-y. Libraries can use traits to ensure that these will be enforced by the type checker too. -### Built-in functionality +### Arithmetic + +Intrinsics will be provided for arithmetic operations like addition +and multiplication. + +```rust +extern { + fn simd_add(x: T, y: T) -> T; + fn simd_mul(x: T, y: T) -> T; + // ... +} +``` -Any type marked `repr(simd)` automatically has the `+`, `-` and `*` -operators work. The `/` operator works for floating point, and the -`<<` and `>>` ones work for integers. +These will have codegen time checks that the element type is correct: -SIMD vectors can be converted with `as`. As with intrinsics, this is -"duck-typed" it is possible to cast a vector type `V` to a type `W` if -their lengths match and their elements are castable (i.e. are -primitives), there's no enforcement of nominal types. +- `add`, `sub`, `mul`: any float or integer type +- `div`: any float type +- `and`, `or`, `xor`, `shl` (shift left), `shr` (shift right): any + integer type -All of these operators and conversions are never checked (in the sense -of the arithmetic overflow checks of `-C debug-assertions`): explicit -SIMD is essentially only required for speed, and checking inflates one -instruction to 5 or more. 
+(The integer types are `i8`, ..., `i64`, `u8`, ..., `u64` and the +float types are `f32` and `f64`.) ### Why not inline asm? From 8b2ec8c16f8b9eafd4d0bd7fb20a187a0399028a Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Mon, 3 Aug 2015 11:33:18 -0700 Subject: [PATCH 14/25] Accidentally: - an extra word. - a subject-verb agreement. - an ly. - a plural. --- text/0000-simd-infrastructure.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 7fedaabed8f..cdfd8057c34 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -63,7 +63,7 @@ building tools that can be wrapped into a more uniform API later. ## Types -There is a new attributes: `repr(simd)`. +There is a new attribute: `repr(simd)`. ```rust #[repr(simd)] @@ -135,7 +135,7 @@ enforce a specific nominal type. NB. The structural typing is just for the declaration: if a SIMD intrinsic is declared to take a type `X`, it must always be called with `X`, even if other types are structurally equal to `X`. Also, within a -signature, SIMD types that must be structurally equal must be nominal +signature, SIMD types that must be structurally equal must be nominally equal. I.e. if the `add_...` all refer to the same intrinsic to add a SIMD vector of bytes, @@ -256,7 +256,7 @@ extern "rust-intrinsic" { ``` These are type checked during code-generation similarly to the -shuffles. Ensuring that `T` and `U` has the same length, and that `U` +shuffles: ensuring that `T` and `U` have the same length, and that `U` is appropriately "boolean"-y. Libraries can use traits to ensure that these will be enforced by the type checker too. @@ -406,6 +406,6 @@ cfg_if_else! { # Unresolved questions -- Should integer vectors get `/` and `%` automatically? Most CPUs - don't support them for vectors. However +- Should integer vectors get division automatically? Most CPUs + don't support them for vectors. 
- How should out-of-bounds shuffle and insert/extract indices be handled? From 47f6ae9c5f3c935b1d943da5366fbd4531a89e35 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Thu, 6 Aug 2015 16:01:19 -0700 Subject: [PATCH 15/25] Use the platform-intrinsic ABI instead of rust-intrinsic. --- text/0000-simd-infrastructure.md | 29 ++++++++++++++++++++--------- 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index cdfd8057c34..40aba05464c 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -1,4 +1,4 @@ -- Feature Name: simd_basics, cfg_target_feature +- Feature Name: simd_basics, platform_intrinsics, cfg_target_feature - Start Date: 2015-06-02 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) @@ -45,7 +45,7 @@ those features enabled. The design comes in three parts, all on the path to stabilisation: - types (`feature(simd_basics)`) -- operations (`feature(simd_basics)`) +- operations (`feature(platform_intrinsics)`) - platform detection (`feature(cfg_target_feature)`) The general idea is to avoid bad performance cliffs, so that an @@ -116,8 +116,10 @@ intrinsics would be on the path to stabilisation (that is, one can "import" them with `extern` in stable code), and would not be exported by `std`. +Example: + ```rust -extern "rust-intrinsic" { +extern "platform-intrinsic" { fn x86_mm_abs_epi16(a: Simd8) -> Simd8; // ... } @@ -144,7 +146,7 @@ SIMD vector of bytes, struct A(u8, u8, ..., u8); struct B(u8, u8, ..., u8); -extern "rust-intrinsic" { +extern "platform-intrinsic" { fn add_aaa(x: A, y: A) -> A; // ok fn add_bbb(x: B, y: B) -> B; // ok fn add_aab(x: A, y: A) -> B; // error, expected B, found A @@ -169,6 +171,16 @@ but will be shimmed as efficiently as possible. 
- arithmetic - conversions +All of these intrinsics are imported via an `extern` directive similar +to the process for pre-existing intrinsics like `transmute`, however, +the SIMD operations are provided under a special ABI: +`platform-intrinsic`. Use of this ABI (and hence the intrinsics) is +initially feature-gated under the `platform_intrinsics` feature +name. Why `platform-intrinsic` rather than say `simd-intrinsic`? There +are non-SIMD platform-specific instructions that may be nice to expose +(for example, Intel defines an `_addcarry_u32` intrinsic corresponding +to the `ADC` instruction). + ### Shuffles & element operations One of the most powerful features of SIMD is the ability to rearrange @@ -185,7 +197,7 @@ shuffles without having to understand all the details of every platform specific intrinsic for shuffling. ```rust -extern "rust-intrinsic" { +extern "platform-intrinsic" { fn simd_shuffle2(v: T, w: T, i0: u32, i1: u32) -> Simd2; fn simd_shuffle4(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Sidm4; fn simd_shuffle8(v: T, w: T, @@ -226,7 +238,7 @@ vectors are provided, to allow modelling the SIMD vectors as actual CPU registers as much as possible: ```rust -extern "rust-intrinsic" { +extern "platform-intrinsic" { fn simd_insert(v: T, i0: u32, elem: Elem) -> T; fn simd_extract(v: T, i0: u32) -> Elem; } @@ -245,7 +257,7 @@ return vectors, as required. The raw signatures would look like: ```rust -extern "rust-intrinsic" { +extern "platform-intrinsic" { fn simd_eq(v: T, w: T) -> U; fn simd_ne(v: T, w: T) -> U; fn simd_lt(v: T, w: T) -> U; @@ -266,7 +278,7 @@ Intrinsics will be provided for arithmetic operations like addition and multiplication. ```rust -extern { +extern "platform-intrinsic" { fn simd_add(x: T, y: T) -> T; fn simd_mul(x: T, y: T) -> T; // ... @@ -363,7 +375,6 @@ cfg_if_else! 
{ # Alternatives -- The SIMD on-route-to-stable intrinsics could have their own ABI - Intrinsics could instead by namespaced by ABI, `extern "x86-intrinsic"`, `extern "arm-intrinsic"`. - There could be more syntactic support for shuffles, either with true From 4a4e6aefe43b0912b233c29d621e879c978c3875 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Thu, 6 Aug 2015 16:03:06 -0700 Subject: [PATCH 16/25] feature(simd_basics) -> feature(repr_simd) This feature gate now only applies to the attribute, so it might as well be more specific. --- text/0000-simd-infrastructure.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 40aba05464c..a3527afe8dd 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -1,4 +1,4 @@ -- Feature Name: simd_basics, platform_intrinsics, cfg_target_feature +- Feature Name: repr_simd, platform_intrinsics, cfg_target_feature - Start Date: 2015-06-02 - RFC PR: (leave this empty) - Rust Issue: (leave this empty) @@ -44,7 +44,7 @@ those features enabled. The design comes in three parts, all on the path to stabilisation: -- types (`feature(simd_basics)`) +- types (`feature(repr_simd)`) - operations (`feature(platform_intrinsics)`) - platform detection (`feature(cfg_target_feature)`) From c4bf5e186b339308e073a2e726e8f62e5e80b539 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 11:48:29 -0700 Subject: [PATCH 17/25] Remove struct flattening. This is non-trivial (for me) to implement, and ended up not being that useful, i.e. it wasn't needed to make useful things. 
--- text/0000-simd-infrastructure.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index a3527afe8dd..1c6b8cbf74b 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -76,9 +76,7 @@ struct Simd2(T, T); The `simd` `repr` can be attached to a struct and will cause such a struct to be compiled to a SIMD vector. It can be generic, but it is required that any fully monomorphised instance of the type consist of -only a single "primitive" type, repeated some number of times. Types -are flattened, so, for `struct Bar(u64);`, `Simd2` has the same -representation as `Simd2`. +only a single "primitive" type, repeated some number of times. The `repr(simd)` may not enforce that any trait bounds exists/does the right thing at the type checking level for generic `repr(simd)` From 653267060bb2e8307c6eac27d8ecd02a9d136370 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 11:50:16 -0700 Subject: [PATCH 18/25] Change shuffles to use arrays of indices. This is *far* more scalable than having an argument for each value. Thanks to @pnkfelix for the suggestion. --- text/0000-simd-infrastructure.md | 24 ++++++++---------------- 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 1c6b8cbf74b..0b6fb616e1d 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -196,16 +196,10 @@ platform specific intrinsic for shuffling. 
```rust extern "platform-intrinsic" { - fn simd_shuffle2(v: T, w: T, i0: u32, i1: u32) -> Simd2; - fn simd_shuffle4(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Sidm4; - fn simd_shuffle8(v: T, w: T, - i0: u32, i1: u32, i2: u32, i3: u32, - i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8; - fn simd_shuffle16(v: T, w: T, - i0: u32, i1: u32, i2: u32, i3: u32, - i4: u32, i5: u32, i6: u32, i7: u32 - i8: u32, i9: u32, i10: u32, i11: u32, - i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16; + fn simd_shuffle2(v: T, w: T, idx: [i32; 2]) -> Simd2; + fn simd_shuffle4(v: T, w: T, idx: [i32; 4]) -> Sidm4; + fn simd_shuffle8(v: T, w: T, idx: [i32; 8]) -> Simd8; + fn simd_shuffle16(v: T, w: T, idx: [i32; 16]) -> Simd16; } ``` @@ -214,10 +208,8 @@ time, ensure that `T` is a SIMD vector, `Elem` is the element type of `T` etc. Libraries can use traits to ensure that these will be enforced by the type checker too. -This approach has some downsides: `simd_shuffle32` (e.g. `Simd32` -on AVX, and `Simd32` on AVX-512) and especially `simd_shuffle64` -(e.g. `Simd64` on AVX-512) are unwieldy. These have similar type -"safety"/code-generation errors to the vectors themselves. +This approach has similar type "safety"/code-generation errors to the +vectors themselves. These operations are semantically: @@ -225,10 +217,10 @@ These operations are semantically: // vector of double length let z = concat(v, w); -return [z[i0], z[i1], z[i2], ...] +return [z[idx[0]], z[idx[1]], z[idx[2]], ...] ``` -The indices `iN` have to be compile time constants. Out of bounds +The index array `idx` has to be compile time constants. Out of bounds indices yield unspecified results. Similarly, intrinsics for inserting/extracting elements into/out of From 8e3a0deba208e4b42216cf210570755f79893aca Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 11:54:37 -0700 Subject: [PATCH 19/25] shuffles don't rely on generic types for return values. 
This has less type safety, but doesn't require generic simd types to exist: #[repr(simd)] struct Simd2(T, T); --- text/0000-simd-infrastructure.md | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 0b6fb616e1d..7bc38d2f90d 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -196,17 +196,17 @@ platform specific intrinsic for shuffling. ```rust extern "platform-intrinsic" { - fn simd_shuffle2(v: T, w: T, idx: [i32; 2]) -> Simd2; - fn simd_shuffle4(v: T, w: T, idx: [i32; 4]) -> Sidm4; - fn simd_shuffle8(v: T, w: T, idx: [i32; 8]) -> Simd8; - fn simd_shuffle16(v: T, w: T, idx: [i32; 16]) -> Simd16; + fn simd_shuffle2(v: T, w: T, idx: [i32; 2]) -> U; + fn simd_shuffle4(v: T, w: T, idx: [i32; 4]) -> U; + fn simd_shuffle8(v: T, w: T, idx: [i32; 8]) -> U; + fn simd_shuffle16(v: T, w: T, idx: [i32; 16]) -> U; } ``` The raw definitions are only checked for validity at monomorphisation -time, ensure that `T` is a SIMD vector, `Elem` is the element type of -`T` etc. Libraries can use traits to ensure that these will be -enforced by the type checker too. +time, ensuring that `T` and `U` are SIMD vectors with the same element +type, `U` has the appropriate length etc. Libraries can use traits to +ensure that these will be enforced by the type checker too. This approach has similar type "safety"/code-generation errors to the vectors themselves. @@ -362,6 +362,17 @@ cfg_if_else! { pointers(/references?) in `repr(simd)` types. - allow (and ignore for everything but type checking) zero-sized types in `repr(simd)` structs, to allow tagging them with markers +- the shuffle intrinsics could be made more relaxed in their type + checking (i.e. 
not require that they return their second type + parameter), to allow more type safety when combined with generic + simd types: + + #[repr(simd)] struct Simd2(T, T); + extern "platform-intrinsic" { + fn simd_shuffle2(x: T, y: T, idx: [u32; 2]) -> Simd2; + } + + This should be a backwards-compatible generalisation. # Alternatives @@ -404,7 +415,6 @@ cfg_if_else! { - use generic intrinsics like shuffles for the arithmetic operations, instead of providing the operations implicitly. - # Unresolved questions - Should integer vectors get division automatically? Most CPUs From 54b0927ea11ae544549edaa9a79b5ca1b6225a91 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 11:59:12 -0700 Subject: [PATCH 20/25] Intrinsics-for-operations is now the RFC, not an alternative. Also, the comparison comment no longer makes sense. --- text/0000-simd-infrastructure.md | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index 7bc38d2f90d..d4d8558099a 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -240,11 +240,8 @@ similarly to the shuffles. ### Comparisons -Comparisons are implemented via intrinsics, because the current -comparison operator infrastructure doesn't easily lend itself to -return vectors, as required. - -The raw signatures would look like: +Comparisons are implemented via intrinsics. The raw signatures would +look like: ```rust extern "platform-intrinsic" { @@ -412,8 +409,6 @@ cfg_if_else! { - have 100% guaranteed type-safety for generic `#[repr(simd)]` types and the generic intrinsics. This would probably require a relatively complicated set of traits (with compiler integration). -- use generic intrinsics like shuffles for the arithmetic operations, - instead of providing the operations implicitly. 
# Unresolved questions From 9e31ad3327327eb51849623f5f5bf7cc2afb58c0 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 14:25:05 -0700 Subject: [PATCH 21/25] Out of bounds indices are errors (backwards compat to relax). --- text/0000-simd-infrastructure.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index d4d8558099a..dde310e1f16 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -221,7 +221,7 @@ return [z[idx[0]], z[idx[1]], z[idx[2]], ...] ``` The index array `idx` has to be compile time constants. Out of bounds -indices yield unspecified results. +indices yield errors. Similarly, intrinsics for inserting/extracting elements into/out of vectors are provided, to allow modelling the SIMD vectors as actual From 91a2b360ae7b4c6e820b57d937f6012ffc448f73 Mon Sep 17 00:00:00 2001 From: Huon Wilson Date: Wed, 12 Aug 2015 14:27:05 -0700 Subject: [PATCH 22/25] Only invalid to *call* intrinsics on bad platforms. It's valid to `extern` them, though. --- text/0000-simd-infrastructure.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md index dde310e1f16..f0d1f5eec04 100644 --- a/text/0000-simd-infrastructure.md +++ b/text/0000-simd-infrastructure.md @@ -126,11 +126,11 @@ extern "platform-intrinsic" { These all use entirely concrete types, and this is the core interface to these intrinsics: essentially it is just allowing code to exactly specify a CPU instruction to use. These intrinsics only actually work -on a subset of the CPUs that Rust targets, and are only be available -for `extern`ing on those targets. The signatures are typechecked, but -in a "duck-typed" manner: it will just ensure that the types are SIMD -vectors with the appropriate length and element type, it will not -enforce a specific nominal type. 
+on a subset of the CPUs that Rust targets, and will result in compile
+time errors if they are called on platforms that do not support
+them. The signatures are typechecked, but in a "duck-typed" manner: it
+will just ensure that the types are SIMD vectors with the appropriate
+length and element type, it will not enforce a specific nominal type.
 
 NB. The structural typing is just for the declaration: if a SIMD intrinsic
 is declared to take a type `X`, it must always be called with `X`,

From 60931df73c8733fecd662dff900a5f172f01eee4 Mon Sep 17 00:00:00 2001
From: Huon Wilson
Date: Wed, 12 Aug 2015 14:27:55 -0700
Subject: [PATCH 23/25] There can be more shuffles.

---
 text/0000-simd-infrastructure.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index f0d1f5eec04..f1ad633420d 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -200,6 +200,7 @@ extern "platform-intrinsic" {
     fn simd_shuffle4(v: T, w: T, idx: [i32; 4]) -> U;
     fn simd_shuffle8(v: T, w: T, idx: [i32; 8]) -> U;
     fn simd_shuffle16(v: T, w: T, idx: [i32; 16]) -> U;
+    // ...
 }
 ```
 

From 135ba7d7b4809ce0656ddce668c9c1f3f26b20a8 Mon Sep 17 00:00:00 2001
From: Huon Wilson
Date: Fri, 14 Aug 2015 09:55:19 -0700
Subject: [PATCH 24/25] Internal references are legal.

Automatic crazy boolean bit-packing is crazy.
---
 text/0000-simd-infrastructure.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index f1ad633420d..b3474958311 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -87,12 +87,6 @@ would layer type-safety on top (i.e. generic `repr(simd)` types would
 use some `unsafe` trait as a bound that is designed to only be
 implemented by types that will work).
 
-It is illegal to take an internal reference to the fields of a
-`repr(simd)` type, because the representation of booleans may require
-modification, so that booleans are bit-packed. The official external
-library providing SIMD support will have private fields so this will
-not be generally observable.
-
 Adding `repr(simd)` to a type may increase its minimum/preferred
 alignment, based on platform behaviour. (E.g. x86 wants its 128-bit
 SSE vectors to be 128-bit aligned.)

From 67fea6e98fecb0a05adb419ddc0cf504e5e0ba04 Mon Sep 17 00:00:00 2001
From: Huon Wilson
Date: Fri, 14 Aug 2015 10:09:45 -0700
Subject: [PATCH 25/25] Type-level integer/values alternatives for shuffles.

---
 text/0000-simd-infrastructure.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index b3474958311..4c8974fdfce 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -390,6 +390,17 @@ cfg_if_else! {
   compiler can know this. The `repr(simd)` approach means there may be
   more than one SIMD-vector type with the `Simd8` shape (or, in fact,
   there may be zero).
+- With type-level integers, there could be one shuffle intrinsic:
+
+        fn simd_shuffle(x: T, y: T, idx: [u32; N]) -> U;
+
+  NB. It is possible to add this as an additional intrinsic (possibly
+  deprecating the `simd_shuffleNNN` forms) later.
+- Type-level values can be applied more generally: since the shuffle
+  indices have to be compile time constants, the shuffle could be
+
+        fn simd_shuffle(x: T, y: T) -> U;
+
 - Instead of platform detection, there could be feature detection
   (e.g. "platform supports something equivalent to x86's `DPPS`"), but
   there probably aren't enough cross-platform commonalities for this