From c7b72a029e28726474c7b28ed410e63b538d85d5 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Tue, 7 Jul 2015 11:20:07 -0700
Subject: [PATCH 01/25] Lay groundwork for SIMD.

---
 text/0000-simd-infrastructure.md | 411 +++++++++++++++++++++++++++++++
 1 file changed, 411 insertions(+)
 create mode 100644 text/0000-simd-infrastructure.md

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
new file mode 100644
index 00000000000..7876de442dd
--- /dev/null
+++ b/text/0000-simd-infrastructure.md
@@ -0,0 +1,411 @@
+- Feature Name: simd_basics
+- Start Date: 2015-06-02
+- RFC PR: (leave this empty)
+- Rust Issue: (leave this empty)
+
+# Summary
+
+Lay the groundwork for building powerful SIMD functionality.
+
+# Motivation
+
+SIMD (Single-Instruction Multiple-Data) is an important part of
+performant modern applications. Most CPUs used for that sort of task
+provide dedicated hardware and instructions for operating on multiple
+values in a single instruction, and exposing this is an important part
+of being a low-level language.
+
+This RFC lays the ground-work for building nice SIMD functionality,
+but doesn't fill everything out. The goal here is to provide the raw
+types and access to the raw instructions on each platform.
+
+# Detailed design
+
+The design comes in three parts:
+
+- types
+- operations
+- platform detection
+
+The general idea is to avoid bad performance cliffs, so that an
+intrinsic call in Rust maps to preferably one CPU instruction, or, if
+not, the "optimal" sequence required to do the given operation
+anyway. This means exposing a *lot* of platform specific details,
+since platforms behave very differently: both across architecture
+families (x86, x86-64, ARM, MIPS, ...), and even within a family
+(x86-64's Skylake, Haswell, Nehalem, ...).
+
+There is definitely a common core of SIMD functionality shared across
+many platforms, but this RFC doesn't try to extract that, it is just
+building tools that can be wrapped into a more uniform API later.
+
+## Background: Where does this code go?
+
+This RFC is focused on building stable, powerful SIMD functionality in
+external crates, not `std`. This makes it much easier to support
+functionality only "occasionally" available with Rust's preexisting
+`cfg` system. If it were in `std`, there would need to be some highly
+delayed `cfg` system so that functions that only work with AVX-2
+support:
+
+- don't break compilation on systems that don't support it, but
+- are still usable on systems that do support it.
+
+## Types & traits
+
+A type designed to be used as a SIMD vector is indicated by the
+`repr(simd)` attribute. A type marked as such will be compiled to
+behave like a SIMD register (as well as the target platform can
+support it).
+
+The types/traits will be defined as follows:
+
+```rust
+#[repr(simd)]
+struct Simd2<T: SimdPrim>(T, T);
+#[repr(simd)]
+struct Simd4<T: SimdPrim>(T, T, T, T);
+#[repr(simd)]
+struct Simd8<T: SimdPrim>(T, T, T, T, T, T, T, T);
+#[repr(simd)]
+struct Simd16<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
+#[repr(simd)]
+struct Simd32<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
+                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
+#[repr(simd)]
+struct Simd64<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
+                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
+                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
+                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
+
+trait SimdVector {
+    type Elem: SimdPrim;
+    type Bool: SimdVector<Elem = <Self::Elem as SimdPrim>::Bool>;
+}
+
+impl<T: SimdPrim> for Simd2<T> {
+    type Elem = T;
+    type Bool = Simd2<T::Bool>;
+}
+impl<T: SimdPrim> for Simd4<T> {
+    type Elem = T;
+    type Bool = Simd4<T::Bool>;
+}
+// ...
+impl<T: SimdPrim> for Simd64<T> {
+    type Elem = T;
+    type Bool = Simd64<T::Bool>;
+}
+
+#[simd_prim_trait]
+trait SimdPrim {
+    type Bool: SimdPrim;
+}
+
+// boolean types, see below
+struct bool8i(...);
+struct bool16i(...);
+struct bool32i(...);
+struct bool64i(...);
+struct bool32f(...);
+struct bool64f(...);
+
+// specifying what types are SIMD-able.
+impl SimdPrim for u8 { type Bool = bool8i; }
+impl SimdPrim for i8 { type Bool = bool8i; }
+impl SimdPrim for u16 { type Bool = bool16i; }
+// ...
+impl SimdPrim for i64 { type Bool = bool64i; }
+
+impl SimdPrim for f32 { type Bool = bool32f; }
+impl SimdPrim for f64 { type Bool = bool64f; }
+
+impl SimdPrim for bool8i { type Bool = bool8i; }
+// ...
+impl SimdPrim for bool64i { type Bool = bool64i; }
+
+impl SimdPrim for bool32f { type Bool = bool32f; }
+impl SimdPrim for bool64f { type Bool = bool64f; }
+```
+
+It is illegal to take an internal reference to the fields of a
+`repr(simd)` type.
+
+### `repr(simd)`
+
+The `simd` `repr` can be attached to a struct and will cause such a
+struct to be compiled to a SIMD vector. It is required that the
+monomorphised vector consist of only a single "primitive" type,
+repeated some number of times. The restrictions on the element type
+are exactly the same restrictions as `#[simd_primitive_trait]` traits
+impose on their implementing types.
+
+The `repr(simd)` may not enforce that the trait bound exists/does the
+right thing at the type checking level for generic `repr(simd)`
+types. As such, it will be possible to get the code-generator to error
+out (ala the old `transmute` size errosr), however, this shouldn't
+cause problems in practice: libraries wrapping this functionality
+would layer type-safety on top (i.e. the `SimdPrim` trait).
+
+### `simd_primitive_trait`
+
+Traits marked with the `simd_primitive_trait` attribute are special:
+types implementing it are those that can be stored in SIMD
+vectors. Initially, only primitives and single-field structs that
+store `SimdPrim` types will be allowed to implement it.
+
+This is explicitly not a lang item: it is legal to have multiple
+distinct traits in a compilation. The attribute just adds the
+restriction and possibly tweaks type's internal representation (as
+such, it's legal for a single type to implement multiple traits with
+the attribute, if a bit pointless).
+
+### Booleans
+
+SIMD booleans are non-trivial. Many conventional APIs e.g. SSE, and
+NEON, use "wide booleans": a large number of bits set to all-zeros
+(false) or all-ones, e.g. equality between `Simd4(0_u32, 1, 2, 3)` and
+`Simd4(0_u32, 0, 2, 3)` gives (on the CPU) `Simd4(!0_u32, 0, !0,
+!0)`. Hence, the boolean types need to have width. It's tempting to
+just use the integer types of the appropriate width, but this falls
+down for two reasons:
+
+1. booleans aren't always this format
+2. the source of the boolean matters
+
+The second is easiest: CPUs are complicated beasts, and the hardware
+that handles floating point vector operations may be very different to
+the hardware that handles integer ones: instructions use different
+execution units. It can take several cycles to transfer data between
+them. Encoding the provenance/execution unit of the value in the type
+makes costs explicit.
+
+The first is much harder to solve. Some architectures/instruction sets
+model booleans as single bits. For example, equality between
+`Simd4(0_u32, 1, 2, 3)` and `Simd(0_u32, 0, 2, 3)` gives `1 + 4 + 8 ==
+0b1101`. One example is AVX-512 which essentially replaces all of the
+older SSE through AVX2 boolean-returning instructions with versions
+that return those. Using separate types for booleans (and restricting
+their API) allows for some serious magic: `Simd4<bool32>` becomes
+`u4`. (This is where the reference-restriction above comes in.)
+
+## Operations
+
+CPU vendors usually offer "standard" C headers for their CPU specific
+operations, such as [`arm_neon.h`][armneon] and [the `...mmintrin.h` headers for
+x86(-64)][x86].
+
+[armneon]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
+[x86]: https://software.intel.com/sites/landingpage/IntrinsicsGuide
+
+All of these would be exposed as (eventually) stable intrinsics with
+names very similar to those that the vendor suggests (only difference
+would be some form of manual namespacing, e.g. prefixing with the CPU
+target), loadable via an `extern` block with an appropriate ABI.
+
+```rust
+extern "rust-intrinsic" {
+    fn x86_mm_abs_epi16(a: Simd8<i16>) -> Simd8<i16>;
+    // ...
+}
+```
+
+These all use entirely concrete types, and this is the core interface
+to these intrinsics: essentially it is just allowing code to exactly
+specify a CPU instruction to use. These intrinsics only actually work
+on a subset of the CPUs that Rust targets, and are only available
+for `extern`ing on those targets. The signatures are typechecked, but
+in a "duck-typed" manner: it will just ensure that the types are SIMD
+vectors with the appropriate length and element type, it will not
+enforce a specific nominal type.
+
+There would additionally be a small set of cross-platform operations
+that are either generally efficiently supported everywhere or are
+extremely useful. These won't necessarily map to a single instruction,
+but will be shimmed as efficiently as possible.
+
+- shuffles and extracting/inserting elements
+- comparisons
+
+Lastly, arithmetic and conversions are supported via built-in operators.
+
+### Shuffles & element operations
+
+One of the most powerful features of SIMD is the ability to rearrange
+data within vectors, giving super-linear speed-ups sometimes. As such,
+shuffles are exposed generally: intrinsics that represent arbitrary
+shuffles.
+
+This may violate the "one instruction per intrinsic" principle
+depending on the shuffle, but rearranging SIMD vectors is extremely
+useful, and providing a direct intrinsic lets the compiler (a) do the
+programmer's work in synthesising the optimal (short) sequence of
+instructions to get a given shuffle and (b) track data through
+shuffles without having to understand all the details of every
+platform specific intrinsic for shuffling.
+
+```rust
+extern "rust-intrinsic" {
+    fn simd_shuffle2<T: SimdVector>(v: T, w: T, i0: u32, i1: u32) -> Simd2<T::Elem>;
+    fn simd_shuffle4<T: SimdVector>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4<T::Elem>;
+    fn simd_shuffle8<T: SimdVector>(v: T, w: T,
+                                    i0: u32, i1: u32, i2: u32, i3: u32,
+                                    i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8<T::Elem>;
+    fn simd_shuffle16<T: SimdVector>(v: T, w: T,
+                                     i0: u32, i1: u32, i2: u32, i3: u32,
+                                     i4: u32, i5: u32, i6: u32, i7: u32
+                                     i8: u32, i9: u32, i10: u32, i11: u32,
+                                     i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16<T::Elem>;
+}
+```
+
+This approach has some downsides: `simd_shuffle32` (e.g. `Simd32<u8>`
+on AVX, and `Simd32<u16>` on AVX-512) and especially `simd_shuffle64`
+(e.g. `Simd64<u8>` on AVX-512) are unwieldy. These have similar type
+"safety"/code-generation errors to the vectors themselves.
+
+These operations are semantically:
+
+```rust
+// vector of double length
+let z = concat(v, w);
+
+return [z[i0], z[i1], z[i2], ...]
+```
+
+The indices `iN` have to be compile time constants.
+
+Similarly, intrinsics for inserting/extracting elements into/out of
+vectors are provided, to allow modelling the SIMD vectors as actual
+CPU registers as much as possible:
+
+```rust
+extern "rust-intrinsic" {
+    fn simd_insert<T: SimdVector>(v: T, i0: u32, elem: T::Elem) -> T;
+    fn simd_extract<T: SimdVector>(v: T, i0: u32) -> T::Elem;
+}
+```
+
+The `i0` indices do not have to be constant. These are equivalent to
+`v[i0] = elem` and `v[i0]` respectively.
+
+### Comparisons
+
+Comparisons are implemented via intrinsics, because the current
+comparison operator infrastructure doesn't easily lend itself to
+return vectors, as required.
+
+A library could give signatures like:
+
+```rust
+extern "rust-intrinsic" {
+    fn simd_eq<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_ne<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_lt<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_le<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_gt<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_ge<T: SimdVector>(v: T, w: T) -> T::Bool;
+}
+```
+
+
+### Built-in functionality
+
+Any type marked `repr(simd)` automatically has the `+`, `-` and `*`
+operators work. The `/` operator works for floating point, and the
+`<<` and `>>` ones work for integers.
+
+SIMD vectors can be converted with `as`. As with intrinsics, this is
+"duck-typed": it is possible to cast a vector type `V` to a type `W` if
+their lengths match and their elements are castable (i.e. are
+primitives); there's no enforcement of nominal types.
+
+All of these are never checked: explicit SIMD is essentially only
+required for speed, and checking inflates one instruction to 5 or
+more.
+
+## Platform Detection
+
+The availability of efficient SIMD functionality is very fine-grained,
+and our current `cfg(target_arch = "...")` is not precise enough. This
+RFC proposes a `target_feature` `cfg`, that would be set to the
+features of the architecture that are known to be supported by the
+exact target e.g.
+
+- a default x86-64 compilation would essentially only set
+  `target_feature = "sse"` and `target_feature = "sse2"`
+- compiling with `-C target-feature="+sse4.2"` would set
+  `target_feature = "sse4.2"`, `target_feature = "sse4.1"`, ...,
+  `target_feature = "sse"`.
+- compiling with `-C target-cpu=native` on a modern CPU might set
+  `target_feature = "avx2"`, `target_feature = "avx"`, ...
+
+(There are other non-SIMD features that might have `target_feature`s
+set too, such as `popcnt` and `rdrnd` on x86/x86-64.)
+
+With a `cfg_if_else!` macro that expands to the first `cfg` that is
+satisfied (ala [@alexcrichton's cascade][cascade]), code might look
+like:
+
+[cascade]: https://github.com/alexcrichton/backtrace-rs/blob/03703031babfa87cbe2c723ad6752131819dc554/src/macros.rs
+
+```rust
+cfg_if_else! {
+    if #[cfg(target_feature = "avx")] {
+        fn foo() { /* use AVX things */ }
+    } else if #[cfg(target_feature = "sse4.1")] {
+        fn foo() { /* use SSE4.1 things */ }
+    } else if #[cfg(target_feature = "sse2")] {
+        fn foo() { /* use SSE2 things */ }
+    } else if #[cfg(target_feature = "neon")] {
+        fn foo() { /* use NEON things */ }
+    } else {
+        fn foo() { /* universal fallback */ }
+    }
+}
+```
+
+# Extensions
+
+- scatter/gather operations allow (partially) operating on a SIMD
+  vector of pointers. This would require extending `SimdPrim` to also
+  allow pointer types.
+- allow (and ignore for everything but type checking) zero-sized types
+  in `repr(simd)` structs, to allow tagging them with markers
+
+# Alternatives
+
+- The SIMD on-route-to-stable intrinsics could have their own ABI
+- Intrinsics could instead be namespaced by ABI, `extern
+  "x86-intrinsic"`, `extern "arm-intrinsic"`.
+- There could be more syntactic support for shuffles, either with true
+  syntax, or with a syntax extension. The latter might look like:
+  `shuffle![x, y, i0, i1, i2, i3, i4, ...]`. However, this requires
+  that shuffles are restricted to a single type only (i.e. `Simd4<T>`
+  can be shuffled to `Simd4<T>` but nothing else), or some sort of
+  type synthesis. The compiler has to somehow work out the return
+  value:
+
+  ```rust
+  let x: Simd4<u32> = ...;
+  let y: Simd4<u32> = ...;
+
+  // reverse all the elements.
+  let z = shuffle![x, y, 7, 6, 5, 4, 3, 2, 1, 0];
+  ```
+
+  Presumably `z` should be `Simd8<u32>`, but it's not obvious how the
+  compiler can know this. The `repr(simd)` approach means there may be
+  more than one SIMD-vector type with the `Simd8<u32>` shape (or, in
+  fact, there may be zero).
+- Instead of platform detection, there could be feature detection
+  (e.g. "platform supports something equivalent to x86's `DPPS`"), but
+  there probably aren't enough cross-platform commonalities for this
+  to be worth it. (Each "feature" would essentially be a platform
+  specific `cfg` anyway.)
+- Check vector operators in debug mode just like the scalar versions.
+
+# Unresolved questions
+
+- Should integer vectors get `/` and `%` automatically? Most CPUs
+  don't support them for vectors.

From 272ce9b458b32c1e6a2f4e349e5a2b4458673107 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 8 Jul 2015 10:58:49 -0700
Subject: [PATCH 02/25] First round of changes: remove extraneous traits etc.

---
 text/0000-simd-infrastructure.md | 176 +++++++++----------------------
 1 file changed, 50 insertions(+), 126 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 7876de442dd..6fb50f0cb5a 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -19,6 +19,24 @@ This RFC lays the ground-work for building nice SIMD functionality,
 but doesn't fill everything out. The goal here is to provide the raw
 types and access to the raw instructions on each platform.
 
+## Where does this code go? Aka. why not in `std`?
+
+This RFC is focused on building stable, powerful SIMD functionality in
+external crates, not `std`.
+
+This makes it much easier to support functionality only "occasionally"
+available with Rust's preexisting `cfg` system. If it were in `std`,
+there would need to be some highly delayed `cfg` system so that
+functions that only work with (say) AVX-2 support:
+
+- don't break compilation on systems that don't support it, but
+- are still usable on systems that do support it.
+
+With an external crate, we can leverage `cargo`'s existing build
+infrastructure: compiling with some target features will rebuild with
+those features enabled.
+
+
 # Detailed design
 
 The design comes in three parts:
@@ -39,113 +57,42 @@ There is definitely a common core of SIMD functionality shared across
 many platforms, but this RFC doesn't try to extract that, it is just
 building tools that can be wrapped into a more uniform API later.
 
-## Background: Where does this code go?
-
-This RFC is focused on building stable, powerful SIMD functionality in
-external crates, not `std`. This makes it much easier to support
-functionality only "occasionally" available with Rust's preexisting
-`cfg` system. If it were in `std`, there would need to be some highly
-delayed `cfg` system so that functions that only work with AVX-2
-support:
-
-- don't break compilation on systems that don't support it, but
-- are still usable on systems that do support it.
 
 ## Types & traits
 
-A type designed to be used as a SIMD vector is indicated by the
-`repr(simd)` attribute. A type marked as such will be compiled to
-behave like a SIMD register (as well as the target platform can
-support it).
-
-The types/traits will be defined as follows:
+There are two new attributes: `repr(simd)` and `simd_primitive_trait`
 
 ```rust
 #[repr(simd)]
-struct Simd2<T: SimdPrim>(T, T);
-#[repr(simd)]
-struct Simd4<T: SimdPrim>(T, T, T, T);
-#[repr(simd)]
-struct Simd8<T: SimdPrim>(T, T, T, T, T, T, T, T);
-#[repr(simd)]
-struct Simd16<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
-#[repr(simd)]
-struct Simd32<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
-                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
-#[repr(simd)]
-struct Simd64<T: SimdPrim>(T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
-                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
-                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T,
-                           T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T);
-
-trait SimdVector {
-    type Elem: SimdPrim;
-    type Bool: SimdVector<Elem = <Self::Elem as SimdPrim>::Bool>;
-}
-
-impl<T: SimdPrim> for Simd2<T> {
-    type Elem = T;
-    type Bool = Simd2<T::Bool>;
-}
-impl<T: SimdPrim> for Simd4<T> {
-    type Elem = T;
-    type Bool = Simd4<T::Bool>;
-}
-// ...
-impl<T: SimdPrim> for Simd64<T> {
-    type Elem = T;
-    type Bool = Simd64<T::Bool>;
-}
+struct f32x4(f32, f32, f32, f32);
 
-#[simd_prim_trait]
-trait SimdPrim {
-    type Bool: SimdPrim;
-}
+#[repr(simd)]
+struct Simd2<T>(T, T);
 
-// boolean types, see below
-struct bool8i(...);
-struct bool16i(...);
-struct bool32i(...);
-struct bool64i(...);
-struct bool32f(...);
-struct bool64f(...);
-
-// specifying what types are SIMD-able.
-impl SimdPrim for u8 { type Bool = bool8i; }
-impl SimdPrim for i8 { type Bool = bool8i; }
-impl SimdPrim for u16 { type Bool = bool16i; }
-// ...
-impl SimdPrim for i64 { type Bool = bool64i; }
-
-impl SimdPrim for f32 { type Bool = bool32f; }
-impl SimdPrim for f64 { type Bool = bool64f; }
-
-impl SimdPrim for bool8i { type Bool = bool8i; }
-// ...
-impl SimdPrim for bool64i { type Bool = bool64i; }
-
-impl SimdPrim for bool32f { type Bool = bool32f; }
-impl SimdPrim for bool64f { type Bool = bool64f; }
+#[simd_primitive_trait]
+trait SimdPrim {}
 ```
 
-It is illegal to take an internal reference to the fields of a
-`repr(simd)` type.
-
 ### `repr(simd)`
 
 The `simd` `repr` can be attached to a struct and will cause such a
-struct to be compiled to a SIMD vector. It is required that the
-monomorphised vector consist of only a single "primitive" type,
-repeated some number of times. The restrictions on the element type
-are exactly the same restrictions as `#[simd_primitive_trait]` traits
-impose on their implementing types.
+struct to be compiled to a SIMD vector. It can be generic, but it is
+required that any fully monomorphised instance of the type consist of
+only a single "primitive" type, repeated some number of times. The
+restrictions on the element type are exactly the same restrictions as
+`#[simd_primitive_trait]` traits impose on their implementing types.
 
 The `repr(simd)` may not enforce that the trait bound exists/does the
 right thing at the type checking level for generic `repr(simd)`
 types. As such, it will be possible to get the code-generator to error
 out (ala the old `transmute` size errosr), however, this shouldn't
 cause problems in practice: libraries wrapping this functionality
-would layer type-safety on top (i.e. the `SimdPrim` trait).
+would layer type-safety on top (i.e. generic `repr(simd)` types would
+use the `SimdPrim` trait as a bound).
+
+It is illegal to take an internal reference to the fields of a
+`repr(simd)` type, because the representation of booleans may require
+to change, so that booleans are bit-packed.
 
 ### `simd_primitive_trait`
 
@@ -160,35 +107,6 @@ restriction and possibly tweaks type's internal representation (as
 such, it's legal for a single type to implement multiple traits with
 the attribute, if a bit pointless).
 
-### Booleans
-
-SIMD booleans are non-trivial. Many conventional APIs e.g. SSE, and
-NEON, use "wide booleans": a large number of bits set to all-zeros
-(false) or all-ones, e.g. equality between `Simd4(0_u32, 1, 2, 3)` and
-`Simd4(0_u32, 0, 2, 3)` gives (on the CPU) `Simd4(!0_u32, 0, !0,
-!0)`. Hence, the boolean types need to have width. It's tempting to
-just use the integer types of the appropriate width, but this falls
-down for two reasons:
-
-1. booleans aren't always this format
-2. the source of the boolean matters
-
-The second is easiest: CPUs are complicated beasts, and the hardware
-that handles floating point vector operations may be very different to
-the hardware that handles integer ones: instructions use different
-execution units. It can take several cycles to transfer data between
-them. Encoding the provenance/execution unit of the value in the type
-makes costs explicit.
-
-The first is much harder to solve. Some architectures/instruction sets
-model booleans as single bits. For example, equality between
-`Simd4(0_u32, 1, 2, 3)` and `Simd(0_u32, 0, 2, 3)` gives `1 + 4 + 8 ==
-0b1101`. One example is AVX-512 which essentially replaces all of the
-older SSE through AVX2 boolean-returning instructions with versions
-that return those. Using separate types for booleans (and restricting
-their API) allows for some serious magic: `Simd4<bool32>` becomes
-`u4`. (This is where the reference-restriction above comes in.)
-
 ## Operations
 
 CPU vendors usually offer "standard" C headers for their CPU specific
@@ -295,19 +213,23 @@ Comparisons are implemented via intrinsics, because the current
 comparison operator infrastructure doesn't easily lend itself to
 return vectors, as required.
 
-A library could give signatures like:
+The raw signatures would look like:
 
 ```rust
 extern "rust-intrinsic" {
-    fn simd_eq<T: SimdVector>(v: T, w: T) -> T::Bool;
-    fn simd_ne<T: SimdVector>(v: T, w: T) -> T::Bool;
-    fn simd_lt<T: SimdVector>(v: T, w: T) -> T::Bool;
-    fn simd_le<T: SimdVector>(v: T, w: T) -> T::Bool;
-    fn simd_gt<T: SimdVector>(v: T, w: T) -> T::Bool;
-    fn simd_ge<T: SimdVector>(v: T, w: T) -> T::Bool;
+    fn simd_eq<T, U>(v: T, w: T) -> U;
+    fn simd_ne<T, U>(v: T, w: T) -> U;
+    fn simd_lt<T, U>(v: T, w: T) -> U;
+    fn simd_le<T, U>(v: T, w: T) -> U;
+    fn simd_gt<T, U>(v: T, w: T) -> U;
+    fn simd_ge<T, U>(v: T, w: T) -> U;
 }
 ```
 
+However, these will be type checked, to ensure that `T` and `U` are
+the same length, and that `U` is appropriately shaped for a boolean. A
+library actually importing them might use some trait bounds to get
+actual type-safety.
 
 ### Built-in functionality
 
@@ -340,8 +262,10 @@ exact target e.g.
 - compiling with `-C target-cpu=native` on a modern CPU might set
   `target_feature = "avx2"`, `target_feature = "avx"`, ...
 
-(There are other non-SIMD features that might have `target_feature`s
-set too, such as `popcnt` and `rdrnd` on x86/x86-64.)
+The possible values of `target_feature` will be a selected whitelist,
+not necessarily just everything LLVM understands. There are other
+non-SIMD features that might have `target_feature`s set too, such as
+`popcnt` and `rdrnd` on x86/x86-64.
 
 With a `cfg_if_else!` macro that expands to the first `cfg` that is
 satisfied (ala [@alexcrichton's cascade][cascade]), code might look

From 5893b163931a7f27fe724a6cd456ef38f8cd76c0 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 8 Jul 2015 14:19:10 -0700
Subject: [PATCH 03/25] Second round of changes: minor tweaks.

---
 text/0000-simd-infrastructure.md | 88 ++++++++++++++++++++------------
 1 file changed, 54 insertions(+), 34 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 6fb50f0cb5a..a33f3b27068 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -1,4 +1,4 @@
-- Feature Name: simd_basics
+- Feature Name: simd_basics, cfg_target_feature
 - Start Date: 2015-06-02
 - RFC PR: (leave this empty)
 - Rust Issue: (leave this empty)
@@ -25,12 +25,12 @@ This RFC is focused on building stable, powerful SIMD functionality in
 external crates, not `std`.
 
 This makes it much easier to support functionality only "occasionally"
-available with Rust's preexisting `cfg` system. If it were in `std`,
-there would need to be some highly delayed `cfg` system so that
-functions that only work with (say) AVX-2 support:
-
-- don't break compilation on systems that don't support it, but
-- are still usable on systems that do support it.
+available with Rust's preexisting `cfg` system. There's no way for
+`std` to conditionally provide an API based on the target features
+used for the final artifact. Building `std` in every configuration is
+certainly untenable. Hence, if it were to be in `std`, there would
+need to be some highly delayed `cfg` system to support that sort of
+conditional API exposure.
 
 With an external crate, we can leverage `cargo`'s existing build
 infrastructure: compiling with some target features will rebuild with
@@ -39,11 +39,11 @@ those features enabled.
 
 # Detailed design
 
-The design comes in three parts:
+The design comes in three parts, all on the path to stabilisation:
 
-- types
-- operations
-- platform detection
+- types (`feature(simd_basics)`)
+- operations (`feature(simd_basics)`)
+- platform detection (`feature(cfg_target_feature)`)
 
 The general idea is to avoid bad performance cliffs, so that an
 intrinsic call in Rust maps to preferably one CPU instruction, or, if
@@ -92,7 +92,9 @@ use the `SimdPrim` trait as a bound).
 
 It is illegal to take an internal reference to the fields of a
 `repr(simd)` type, because the representation of booleans may require
-to change, so that booleans are bit-packed.
+to change, so that booleans are bit-packed. The official external
+library providing SIMD support will have private fields so this will
+not be generally observable.
 
 ### `simd_primitive_trait`
 
@@ -107,6 +109,13 @@ restriction and possibly tweaks type's internal representation (as
 such, it's legal for a single type to implement multiple traits with
 the attribute, if a bit pointless).
 
+This trait exists to allow new-type wrappers around primitives to also
+be usable in a SIMD context. However, this only works in limited
+scenarios (i.e. when the type wraps a single primitive) and so needs
+to be an explicit part of every type's API: type authors opt-in to
+being designed-for-SIMD. If it were implicit, changes to private fields
+could break downstream code.
+
 ## Operations
 
 CPU vendors usually offer "standard" C headers for their CPU specific
@@ -116,10 +125,13 @@ x86(-64)][x86].
 [armneon]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
 [x86]: https://software.intel.com/sites/landingpage/IntrinsicsGuide
 
-All of these would be exposed as (eventually) stable intrinsics with
-names very similar to those that the vendor suggests (only difference
-would be some form of manual namespacing, e.g. prefixing with the CPU
-target), loadable via an `extern` block with an appropriate ABI.
+All of these would be exposed as compiler intrinsics with names very
+similar to those that the vendor suggests (only difference would be
+some form of manual namespacing, e.g. prefixing with the CPU target),
+loadable via an `extern` block with an appropriate ABI. This subset of
+intrinsics would be on the path to stabilisation (that is, one can
+"import" them with `extern` in stable code), and would not be exported
+by `std`.
 
 ```rust
 extern "rust-intrinsic" {
@@ -164,19 +176,24 @@ platform specific intrinsic for shuffling.
 
 ```rust
 extern "rust-intrinsic" {
-    fn simd_shuffle2<T: SimdVector>(v: T, w: T, i0: u32, i1: u32) -> Simd2<T::Elem>;
-    fn simd_shuffle4<T: SimdVector>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4<T::Elem>;
-    fn simd_shuffle8<T: SimdVector>(v: T, w: T,
-                                    i0: u32, i1: u32, i2: u32, i3: u32,
-                                    i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8<T::Elem>;
-    fn simd_shuffle16<T: SimdVector>(v: T, w: T,
-                                     i0: u32, i1: u32, i2: u32, i3: u32,
-                                     i4: u32, i5: u32, i6: u32, i7: u32
-                                     i8: u32, i9: u32, i10: u32, i11: u32,
-                                     i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16<T::Elem>;
+    fn simd_shuffle2<T, Elem>(v: T, w: T, i0: u32, i1: u32) -> Simd2<Elem>;
+    fn simd_shuffle4<T, Elem>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Simd4<Elem>;
+    fn simd_shuffle8<T, Elem>(v: T, w: T,
+                              i0: u32, i1: u32, i2: u32, i3: u32,
+                              i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8<Elem>;
+    fn simd_shuffle16<T, Elem>(v: T, w: T,
+                               i0: u32, i1: u32, i2: u32, i3: u32,
+                               i4: u32, i5: u32, i6: u32, i7: u32,
+                               i8: u32, i9: u32, i10: u32, i11: u32,
+                               i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16<Elem>;
 }
 ```
 
+The raw definitions are only checked for validity at monomorphisation
+time, ensure that `T` is a SIMD vector, `U` is the element type of `T`
+etc. Libraries can use traits to ensure that these will be enforced by
+the type checker too.
+
 This approach has some downsides: `simd_shuffle32` (e.g. `Simd32<u8>`
 on AVX, and `Simd32<u16>` on AVX-512) and especially `simd_shuffle64`
 (e.g. `Simd64<u8>` on AVX-512) are unwieldy. These have similar type
@@ -191,7 +208,8 @@ let z = concat(v, w);
 return [z[i0], z[i1], z[i2], ...]
 ```
 
-The indices `iN` have to be compile time constants.
+The indices `iN` have to be compile time constants. Out of bounds
+indices yield unspecified results.
 
 Similarly, intrinsics for inserting/extracting elements into/out of
 vectors are provided, to allow modelling the SIMD vectors as actual
@@ -199,13 +217,14 @@ CPU registers as much as possible:
 
 ```rust
 extern "rust-intrinsic" {
-    fn simd_insert<T: SimdVector>(v: T, i0: u32, elem: T::Elem) -> T;
-    fn simd_extract<T: SimdVector>(v: T, i0: u32) -> T::Elem;
+    fn simd_insert<T, Elem>(v: T, i0: u32, elem: Elem) -> T;
+    fn simd_extract<T, Elem>(v: T, i0: u32) -> Elem;
 }
 ```
 
 The `i0` indices do not have to be constant. These are equivalent to
-`v[i0] = elem` and `v[i0]` respectively.
+`v[i0] = elem` and `v[i0]` respectively. They are type checked
+similarly to the shuffles.
 
 ### Comparisons
 
@@ -226,10 +245,10 @@ extern "rust-intrinsic" {
 }
 ```
 
-However, these will be type checked, to ensure that `T` and `U` are
-the same length, and that `U` is appropriately shaped for a boolean. A
-library actually importing them might use some trait bounds to get
-actual type-safety.
+These are type checked during code-generation similarly to the
+shuffles. Ensuring that `T` and `U` has the same length, and that `U`
+is appropriately "boolean"-y. Libraries can use traits to ensure that
+these will be enforced by the type checker too.
 
 ### Built-in functionality
 
@@ -333,3 +352,4 @@ cfg_if_else! {
 
 - Should integer vectors get `/` and `%` automatically? Most CPUs
   don't support them for vectors.
+- How should out-of-bounds shuffle and insert/extract indices be handled?

From f9e48d11f9f9c8c7977da2cc371103e94783c33c Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Thu, 9 Jul 2015 09:28:07 -0700
Subject: [PATCH 04/25] Clarify/fix typos.

---
 text/0000-simd-infrastructure.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index a33f3b27068..a132e856d6d 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -64,7 +64,7 @@ There are two new attributes: `repr(simd)` and `simd_primitive_trait`
 
 ```rust
 #[repr(simd)]
-struct f32x4(f32, f32, f23, f23);
+struct f32x4(f32, f32, f32, f32);
 
 #[repr(simd)]
 struct Simd2<T>(T, T);
@@ -261,9 +261,10 @@ SIMD vectors can be converted with `as`. As with intrinsics, this is
 their lengths match and their elements are castable (i.e. are
 primitives), there's no enforcement of nominal types.
 
-All of these are never checked: explicit SIMD is essentially only
-required for speed, and checking inflates one instruction to 5 or
-more.
+All of these operators and conversions are never checked (in the sense
+of the arithmetic overflow checks of `-C debug-assertions`): explicit
+SIMD is essentially only required for speed, and checking inflates one
+instruction to 5 or more.
 
 ## Platform Detection
 

From e2fc223c99bea7aadf39f0859e5e7f0244175e5b Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Thu, 9 Jul 2015 09:59:58 -0700
Subject: [PATCH 05/25] Note that fixed-length arrays could be repr(simd)'d.

---
 text/0000-simd-infrastructure.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index a132e856d6d..af5fb0b0d15 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -348,6 +348,11 @@ cfg_if_else! {
   to be worth it. (Each "feature" would essentially be a platform
   specific `cfg` anyway.)
 - Check vector operators in debug mode just like the scalar versions.
+- Make fixed length arrays `repr(simd)`-able (via just flattening), so
+  that, say, `#[repr(simd)] struct u32x4([u32; 4]);` and
+  `#[repr(simd)] struct f64x8([f64; 4], [f64; 4]);` etc works. This
+  will be most useful if/when we allow generic-lengths, `#[repr(simd)]
+  struct Simd<T, n>([T; n]);`
 
 # Unresolved questions
 

From efeafdb770728c3fe57a9e34da31421f38740943 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Thu, 9 Jul 2015 17:10:37 -0700
Subject: [PATCH 06/25] Remove the simd_primitive_trait attribute.

Not really necessary: the type safety it offers can be provided by libraries.
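
As a rough illustration of the library-side replacement, a wrapping crate
could bound its generic vector types with an `unsafe` marker trait. The
names `SimdElement`, `Simd2` and `splat2` below are hypothetical, and
`#[repr(simd)]` is left off so the sketch compiles without the feature gate:

```rust
/// Hypothetical marker trait: implementors promise the type is a machine
/// primitive that may be stored in a SIMD vector.
pub unsafe trait SimdElement: Copy {}

// Only a few primitive impls are listed; a real library would cover all of them.
unsafe impl SimdElement for u8 {}
unsafe impl SimdElement for i32 {}
unsafe impl SimdElement for u64 {}
unsafe impl SimdElement for f32 {}
unsafe impl SimdElement for f64 {}

// A real library would also mark this `#[repr(simd)]` (feature-gated,
// so omitted in this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Simd2<T: SimdElement>(pub T, pub T);

/// Build a vector with every lane set to `x`.
pub fn splat2<T: SimdElement>(x: T) -> Simd2<T> {
    Simd2(x, x)
}
```

With the bound in place, something like `Simd2<String>` is rejected by the
type checker rather than at code generation.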
---
 text/0000-simd-infrastructure.md | 53 ++++++++++----------------------
 1 file changed, 16 insertions(+), 37 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index af5fb0b0d15..a2a6e9884ed 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -58,9 +58,9 @@ many platforms, but this RFC doesn't try to extract that, it is just
 building tools that can be wrapped into a more uniform API later.
 
 
-## Types & traits
+## Types
 
-There are two new attributes: `repr(simd)` and `simd_primitive_trait`
+There is a new attributes: `repr(simd)`.
 
 ```rust
 #[repr(simd)]
@@ -68,54 +68,30 @@ struct f32x4(f32, f32, f32, f32);
 
 #[repr(simd)]
 struct Simd2<T>(T, T);
-
-#[simd_primitive_trait]
-trait SimdPrim {}
 ```
 
-### `repr(simd)`
-
 The `simd` `repr` can be attached to a struct and will cause such a
 struct to be compiled to a SIMD vector. It can be generic, but it is
 required that any fully monomorphised instance of the type consist of
-only a single "primitive" type, repeated some number of times. The
-restrictions on the element type are exactly the same restrictions as
-`#[simd_primitive_trait]` traits impose on their implementing types.
+only a single "primitive" type, repeated some number of times. Types
+are flattened, so, for `struct Bar(u64);`, `Simd2<Bar>` has the same
+representation as `Simd2<u64>`.
 
-The `repr(simd)` may not enforce that the trait bound exists/does the
+The `repr(simd)` may not enforce that any trait bounds exists/does the
 right thing at the type checking level for generic `repr(simd)`
 types. As such, it will be possible to get the code-generator to error
-out (ala the old `transmute` size errosr), however, this shouldn't
+out (ala the old `transmute` size errors), however, this shouldn't
 cause problems in practice: libraries wrapping this functionality
 would layer type-safety on top (i.e. generic `repr(simd)` types would
-use the `SimdPrim` trait as a bound).
+use some `unsafe` trait as a bound that is designed to only be
+implemented by types that will work).
 
 It is illegal to take an internal reference to the fields of a
 `repr(simd)` type, because the representation of booleans may require
-to change, so that booleans are bit-packed. The official external
+modification, so that booleans are bit-packed. The official external
 library providing SIMD support will have private fields so this will
 not be generally observable.
 
-### `simd_primitive_trait`
-
-Traits marked with the `simd_primitive_trait` attribute are special:
-types implementing it are those that can be stored in SIMD
-vectors. Initially, only primitives and single-field structs that
-store `SimdPrim` types will be allowed to implement it.
-
-This is explicitly not a lang item: it is legal to have multiple
-distinct traits in a compilation. The attribute just adds the
-restriction and possibly tweaks type's internal representation (as
-such, it's legal for a single type to implement multiple traits with
-the attribute, if a bit pointless).
-
-This trait exists to allow new-type wrappers around primitives to also
-be usable in a SIMD context. However, this only works in limited
-scenarios (i.e. when the type wraps a single primitive) and so needs
-to be an explicit part of every type's API: type authors opt-in to
-being designed-for-SIMD. If it was implicit, changes to private fields
-may break downstream code.
-
 ## Operations
 
 CPU vendors usually offer "standard" C headers for their CPU specific
@@ -312,8 +288,8 @@ cfg_if_else! {
 # Extensions
 
 - scatter/gather operations allow (partially) operating on a SIMD
-  vector of pointers. This would require extending `SimdPrim` to also
-  allow pointer types.
+  vector of pointers. This would require allowing
+  pointers(/references?) in `repr(simd)` types.
 - allow (and ignore for everything but type checking) zero-sized types
   in `repr(simd)` structs, to allow tagging them with markers
 
@@ -353,9 +329,12 @@ cfg_if_else! {
   `#[repr(simd)] struct f64x8([f64; 4], [f64; 4]);` etc works. This
   will be most useful if/when we allow generic-lengths, `#[repr(simd)]
   struct Simd<T, n>([T; n]);`
+- have 100% guaranteed type-safety for generic `#[repr(simd)]` types
+  and the generic intrinsics. This would probably require a relatively
+  complicated set of traits (with compiler integration).
 
 # Unresolved questions
 
 - Should integer vectors get `/` and `%` automatically? Most CPUs
-  don't support them for vectors.
+  don't support them for vectors. However
 - How should out-of-bounds shuffle and insert/extract indices be handled?

From a7c409b3291758eab122d9bef034dfc2f255fb0e Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Fri, 10 Jul 2015 10:43:14 -0700
Subject: [PATCH 07/25] Mention alignment changes due to repr(simd).

---
 text/0000-simd-infrastructure.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index a2a6e9884ed..f815bb4b9c2 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -92,6 +92,10 @@ modification, so that booleans are bit-packed. The official external
 library providing SIMD support will have private fields so this will
 not be generally observable.
 
+Adding `repr(simd)` to a type may increase its minimum/preferred
+alignment, based on platform behaviour. (E.g. x86 wants its 128-bit
+SSE vectors to be 128-bit aligned.)
+
 ## Operations
 
 CPU vendors usually offer "standard" C headers for their CPU specific

From f4e2ecfbe09c0798433f49785a625e8d54ac9b04 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Fri, 10 Jul 2015 10:48:55 -0700
Subject: [PATCH 08/25] Note pre-RFC discussion.

---
 text/0000-simd-infrastructure.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index f815bb4b9c2..0c6af575682 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -19,6 +19,9 @@ This RFC lays the ground-work for building nice SIMD functionality,
 but doesn't fill everything out. The goal here is to provide the raw
 types and access to the raw instructions on each platform.
 
+(An earlier variant of this RFC was discussed as a
+[pre-RFC](https://internals.rust-lang.org/t/pre-rfc-simd-groundwork/2343).)
+
 ## Where does this code go? Aka. why not in `std`?
 
 This RFC is focused on building stable, powerful SIMD functionality in

From 1132ede301fae89809ce2932faa9ef8cdad103b5 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Mon, 13 Jul 2015 13:39:43 -0700
Subject: [PATCH 09/25] Clarify how the intrinsics' structural typing works.

---
 text/0000-simd-infrastructure.md | 33 +++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 0c6af575682..4da9d3926cf 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -132,6 +132,33 @@ in a "duck-typed" manner: it will just ensure that the types are SIMD
 vectors with the appropriate length and element type, it will not
 enforce a specific nominal type.
 
+NB. The structural typing is just for the declaration: if a SIMD intrinsic
+is declared to take a type `X`, it must always be called with `X`,
+even if other types are structurally equal to `X`. Also, within a
+signature, SIMD types that must be structurally equal must be nominal
+equal. I.e. if the `add_...` all refer to the same intrinsic to add a
+SIMD vector of bytes,
+
+```rust
+// (same length)
+struct A(u8, u8, ..., u8);
+struct B(u8, u8, ..., u8);
+
+extern "rust-intrinsic" {
+    fn add_aaa(x: A, y: A) -> A; // ok
+    fn add_bbb(x: B, y: B) -> B; // ok
+    fn add_aab(x: A, y: A) -> B; // error, expected B, found A
+    fn add_bab(x: B, y: A) -> B; // error, expected A, found B
+}
+
+fn double_a(x: A) -> A {
+    add_aaa(x, x)
+}
+fn double_b(x: B) -> B {
+    add_aaa(x, x) // error, expected A, found B
+}
+```
+
 There would additionally be a small set of cross-platform operations
 that are either generally efficiently supported everywhere or are
 extremely useful. These won't necessarily map to a single instruction,
@@ -173,9 +200,9 @@ extern "rust-intrinsic" {
 ```
 
 The raw definitions are only checked for validity at monomorphisation
-time, ensure that `T` is a SIMD vector, `U` is the element type of `T`
-etc. Libraries can use traits to ensure that these will be enforced by
-the type checker too.
+time, ensure that `T` is a SIMD vector, `Elem` is the element type of
+`T` etc. Libraries can use traits to ensure that these will be
+enforced by the type checker too.
 
 This approach has some downsides: `simd_shuffle32` (e.g. `Simd32<u8>`
 on AVX, and `Simd32<u16>` on AVX-512) and especially `simd_shuffle64`

From 8317ea49ce509af257722fcb0be27bf23de54377 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Mon, 13 Jul 2015 13:41:13 -0700
Subject: [PATCH 10/25] Add arithmetic intrinsics alternative.

---
 text/0000-simd-infrastructure.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 4da9d3926cf..a93fcf3d319 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -366,6 +366,9 @@ cfg_if_else! {
 - have 100% guaranteed type-safety for generic `#[repr(simd)]` types
   and the generic intrinsics. This would probably require a relatively
   complicated set of traits (with compiler integration).
+- use generic intrinsics like shuffles for the arithmetic operations,
+  instead of providing the operations implicitly.
+
 
 # Unresolved questions
 

From c6ed18ac09c44e01d1076c8b1baf9fbc1877068a Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Mon, 13 Jul 2015 13:53:56 -0700
Subject: [PATCH 11/25] Write down an answer to "why not `asm!`?".

---
 text/0000-simd-infrastructure.md | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index a93fcf3d319..cc3063a4494 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -276,6 +276,33 @@ of the arithmetic overflow checks of `-C debug-assertions`): explicit
 SIMD is essentially only required for speed, and checking inflates one
 instruction to 5 or more.
 
+### Why not inline asm?
+
+One alternative to providing intrinsics is to instead just use
+inline-asm to expose each CPU instruction. However, this approach has
+essentially only one benefit (avoiding defining the intrinsics), but
+several downsides, e.g.
+
+- assembly is generally a black-box to optimisers, inhibiting
+  optimisations, like algebraic simplification/transformation,
+- programmers would have to manually synthesise the right sequence of
+  operations to achieve a given shuffle, while having a generic
+  shuffle intrinsic lets the compiler do it (NB. the intention is that
+  the programmer will still have access to the platform specific
+  operations for when the compiler synthesis isn't quite right),
+- inline assembly is not currently stable in
+  Rust and there's not a strong push for it to be so in the immediate
+  future (although this could change).
+
+Benefits of manual assembly writing, like instruction scheduling and
+register allocation don't apply to the (generally) one-instruction
+`asm!` blocks that replace the intrinsics (they need to be designed so
+that the compiler has full control over register allocation, or else
+the result will be strictly worse). Those possible advantages of hand
+written assembly over intrinsics only come in to play when writing
+longer blocks of raw assembly, i.e. some inner loop might be faster
+when written as a single chunk of asm rather than as intrinsics.
+
 ## Platform Detection
 
 The availability of efficient SIMD functionality is very fine-grained,

From 67f78ec728913f56c9e969b2226d854724594ad1 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Tue, 14 Jul 2015 09:57:03 -0700
Subject: [PATCH 12/25] point to cfg-if.

---
 text/0000-simd-infrastructure.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index cc3063a4494..41c7d8bd2e3 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -324,11 +324,11 @@ not necessarily just everything LLVM understands. There are other
 non-SIMD features that might have `target_feature`s set too, such as
 `popcnt` and `rdrnd` on x86/x86-64.)
 
-With a `cfg_if_else!` macro that expands to the first `cfg` that is
-satisfied (ala [@alexcrichton's cascade][cascade]), code might look
+With a `cfg_if!` macro that expands to the first `cfg` that is
+satisfied (ala [@alexcrichton's `cfg-if`][cfg-if]), code might look
 like:
 
-[cascade]: https://github.com/alexcrichton/backtrace-rs/blob/03703031babfa87cbe2c723ad6752131819dc554/src/macros.rs
+[cfg-if]: https://crates.io/crates/cfg-if
 
 ```rust
 cfg_if_else! {

From f71c4b32e27215dd40c0c0483a883c825c575ad0 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Mon, 3 Aug 2015 11:30:30 -0700
Subject: [PATCH 13/25] Use intrinsics for arithmetic instead of built-in
 operators.

---
 text/0000-simd-infrastructure.md | 35 +++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 41c7d8bd2e3..7fedaabed8f 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -166,8 +166,8 @@ but will be shimmed as efficiently as possible.
 
 - shuffles and extracting/inserting elements
 - comparisons
-
-Lastly, arithmetic and conversions are supported via built-in operators.
+- arithmetic
+- conversions
 
 ### Shuffles & element operations
 
@@ -260,21 +260,28 @@ shuffles. Ensuring that `T` and `U` has the same length, and that `U`
 is appropriately "boolean"-y. Libraries can use traits to ensure that
 these will be enforced by the type checker too.
 
-### Built-in functionality
+### Arithmetic
+
+Intrinsics will be provided for arithmetic operations like addition
+and multiplication.
+
+```rust
+extern {
+    fn simd_add<T>(x: T, y: T) -> T;
+    fn simd_mul<T>(x: T, y: T) -> T;
+    // ...
+}
+```
 
-Any type marked `repr(simd)` automatically has the `+`, `-` and `*`
-operators work. The `/` operator works for floating point, and the
-`<<` and `>>` ones work for integers.
+These will have codegen time checks that the element type is correct:
 
-SIMD vectors can be converted with `as`. As with intrinsics, this is
-"duck-typed" it is possible to cast a vector type `V` to a type `W` if
-their lengths match and their elements are castable (i.e. are
-primitives), there's no enforcement of nominal types.
+- `add`, `sub`, `mul`: any float or integer type
+- `div`: any float type
+- `and`, `or`, `xor`, `shl` (shift left), `shr` (shift right): any
+  integer type
 
-All of these operators and conversions are never checked (in the sense
-of the arithmetic overflow checks of `-C debug-assertions`): explicit
-SIMD is essentially only required for speed, and checking inflates one
-instruction to 5 or more.
+(The integer types are `i8`, ..., `i64`, `u8`, ..., `u64` and the
+float types are `f32` and `f64`.)
 
 ### Why not inline asm?
 

From 8b2ec8c16f8b9eafd4d0bd7fb20a187a0399028a Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Mon, 3 Aug 2015 11:33:18 -0700
Subject: [PATCH 14/25] Accidentally:

- an extra word.
- a subject-verb agreement.
- an ly.
- a plural.
---
 text/0000-simd-infrastructure.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 7fedaabed8f..cdfd8057c34 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -63,7 +63,7 @@ building tools that can be wrapped into a more uniform API later.
 
 ## Types
 
-There is a new attributes: `repr(simd)`.
+There is a new attribute: `repr(simd)`.
 
 ```rust
 #[repr(simd)]
@@ -135,7 +135,7 @@ enforce a specific nominal type.
 NB. The structural typing is just for the declaration: if a SIMD intrinsic
 is declared to take a type `X`, it must always be called with `X`,
 even if other types are structurally equal to `X`. Also, within a
-signature, SIMD types that must be structurally equal must be nominal
+signature, SIMD types that must be structurally equal must be nominally
 equal. I.e. if the `add_...` all refer to the same intrinsic to add a
 SIMD vector of bytes,
 
@@ -256,7 +256,7 @@ extern "rust-intrinsic" {
 ```
 
 These are type checked during code-generation similarly to the
-shuffles. Ensuring that `T` and `U` has the same length, and that `U`
+shuffles: ensuring that `T` and `U` have the same length, and that `U`
 is appropriately "boolean"-y. Libraries can use traits to ensure that
 these will be enforced by the type checker too.
 
@@ -406,6 +406,6 @@ cfg_if_else! {
 
 # Unresolved questions
 
-- Should integer vectors get `/` and `%` automatically? Most CPUs
-  don't support them for vectors. However
+- Should integer vectors get division automatically? Most CPUs
+  don't support them for vectors.
 - How should out-of-bounds shuffle and insert/extract indices be handled?

From 47f6ae9c5f3c935b1d943da5366fbd4531a89e35 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Thu, 6 Aug 2015 16:01:19 -0700
Subject: [PATCH 15/25] Use the platform-intrinsic ABI instead of
 rust-intrinsic.

---
 text/0000-simd-infrastructure.md | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index cdfd8057c34..40aba05464c 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -1,4 +1,4 @@
-- Feature Name: simd_basics, cfg_target_feature
+- Feature Name: simd_basics, platform_intrinsics, cfg_target_feature
 - Start Date: 2015-06-02
 - RFC PR: (leave this empty)
 - Rust Issue: (leave this empty)
@@ -45,7 +45,7 @@ those features enabled.
 The design comes in three parts, all on the path to stabilisation:
 
 - types (`feature(simd_basics)`)
-- operations (`feature(simd_basics)`)
+- operations (`feature(platform_intrinsics)`)
 - platform detection (`feature(cfg_target_feature)`)
 
 The general idea is to avoid bad performance cliffs, so that an
@@ -116,8 +116,10 @@ intrinsics would be on the path to stabilisation (that is, one can
 "import" them with `extern` in stable code), and would not be exported
 by `std`.
 
+Example:
+
 ```rust
-extern "rust-intrinsic" {
+extern "platform-intrinsic" {
     fn x86_mm_abs_epi16(a: Simd8<i16>) -> Simd8<i16>;
     // ...
 }
@@ -144,7 +146,7 @@ SIMD vector of bytes,
 struct A(u8, u8, ..., u8);
 struct B(u8, u8, ..., u8);
 
-extern "rust-intrinsic" {
+extern "platform-intrinsic" {
     fn add_aaa(x: A, y: A) -> A; // ok
     fn add_bbb(x: B, y: B) -> B; // ok
     fn add_aab(x: A, y: A) -> B; // error, expected B, found A
@@ -169,6 +171,16 @@ but will be shimmed as efficiently as possible.
 - arithmetic
 - conversions
 
+All of these intrinsics are imported via an `extern` directive similar
+to the process for pre-existing intrinsics like `transmute`, however,
+the SIMD operations are provided under a special ABI:
+`platform-intrinsic`. Use of this ABI (and hence the intrinsics) is
+initially feature-gated under the `platform_intrinsics` feature
+name. Why `platform-intrinsic` rather than say `simd-intrinsic`? There
+are non-SIMD platform-specific instructions that may be nice to expose
+(for example, Intel defines an `_addcarry_u32` intrinsic corresponding
+to the `ADC` instruction).
+
 ### Shuffles & element operations
 
 One of the most powerful features of SIMD is the ability to rearrange
@@ -185,7 +197,7 @@ shuffles without having to understand all the details of every
 platform specific intrinsic for shuffling.
 
 ```rust
-extern "rust-intrinsic" {
+extern "platform-intrinsic" {
     fn simd_shuffle2<T, Elem>(v: T, w: T, i0: u32, i1: u32) -> Simd2<Elem>;
     fn simd_shuffle4<T, Elem>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Sidm4<Elem>;
     fn simd_shuffle8<T, Elem>(v: T, w: T,
@@ -226,7 +238,7 @@ vectors are provided, to allow modelling the SIMD vectors as actual
 CPU registers as much as possible:
 
 ```rust
-extern "rust-intrinsic" {
+extern "platform-intrinsic" {
     fn simd_insert<T, Elem>(v: T, i0: u32, elem: Elem) -> T;
     fn simd_extract<T, Elem>(v: T, i0: u32) -> Elem;
 }
@@ -245,7 +257,7 @@ return vectors, as required.
 The raw signatures would look like:
 
 ```rust
-extern "rust-intrinsic" {
+extern "platform-intrinsic" {
     fn simd_eq<T, U>(v: T, w: T) -> U;
     fn simd_ne<T, U>(v: T, w: T) -> U;
     fn simd_lt<T, U>(v: T, w: T) -> U;
@@ -266,7 +278,7 @@ Intrinsics will be provided for arithmetic operations like addition
 and multiplication.
 
 ```rust
-extern {
+extern "platform-intrinsic" {
     fn simd_add<T>(x: T, y: T) -> T;
     fn simd_mul<T>(x: T, y: T) -> T;
     // ...
@@ -363,7 +375,6 @@ cfg_if_else! {
 
 # Alternatives
 
-- The SIMD on-route-to-stable intrinsics could have their own ABI
 - Intrinsics could instead by namespaced by ABI, `extern
   "x86-intrinsic"`, `extern "arm-intrinsic"`.
 - There could be more syntactic support for shuffles, either with true

From 4a4e6aefe43b0912b233c29d621e879c978c3875 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Thu, 6 Aug 2015 16:03:06 -0700
Subject: [PATCH 16/25] feature(simd_basics) -> feature(repr_simd)

This feature gate now only applies to the attribute, so it might as well
be more specific.
---
 text/0000-simd-infrastructure.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 40aba05464c..a3527afe8dd 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -1,4 +1,4 @@
-- Feature Name: simd_basics, platform_intrinsics, cfg_target_feature
+- Feature Name: repr_simd, platform_intrinsics, cfg_target_feature
 - Start Date: 2015-06-02
 - RFC PR: (leave this empty)
 - Rust Issue: (leave this empty)
@@ -44,7 +44,7 @@ those features enabled.
 
 The design comes in three parts, all on the path to stabilisation:
 
-- types (`feature(simd_basics)`)
+- types (`feature(repr_simd)`)
 - operations (`feature(platform_intrinsics)`)
 - platform detection (`feature(cfg_target_feature)`)
 

From c4bf5e186b339308e073a2e726e8f62e5e80b539 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 11:48:29 -0700
Subject: [PATCH 17/25] Remove struct flattening.

This is non-trivial (for me) to implement, and ended up not being that
useful, i.e. it wasn't needed to make useful things.
---
 text/0000-simd-infrastructure.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index a3527afe8dd..1c6b8cbf74b 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -76,9 +76,7 @@ struct Simd2<T>(T, T);
 The `simd` `repr` can be attached to a struct and will cause such a
 struct to be compiled to a SIMD vector. It can be generic, but it is
 required that any fully monomorphised instance of the type consist of
-only a single "primitive" type, repeated some number of times. Types
-are flattened, so, for `struct Bar(u64);`, `Simd2<Bar>` has the same
-representation as `Simd2<u64>`.
+only a single "primitive" type, repeated some number of times.
 
 The `repr(simd)` may not enforce that any trait bounds exists/does the
 right thing at the type checking level for generic `repr(simd)`

From 653267060bb2e8307c6eac27d8ecd02a9d136370 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 11:50:16 -0700
Subject: [PATCH 18/25] Change shuffles to use arrays of indices.

This is *far* more scalable than having an argument for each value.

Thanks to @pnkfelix for the suggestion.
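
A scalar model of the intended semantics, for reference. This is only a
sketch: `shuffle4_model` is a hypothetical helper working on plain arrays,
not one of the proposed intrinsics, and it panics on out-of-bounds indices
where the RFC leaves the result unspecified:

```rust
/// Scalar model of a 4-lane shuffle taking an array of indices.
pub fn shuffle4_model<T: Copy>(v: [T; 4], w: [T; 4], idx: [i32; 4]) -> [T; 4] {
    // z = concat(v, w): a vector of double length.
    let z = [v[0], v[1], v[2], v[3], w[0], w[1], w[2], w[3]];
    // Each output lane selects one lane of z. A negative or too-large
    // index panics in this model.
    [
        z[idx[0] as usize],
        z[idx[1] as usize],
        z[idx[2] as usize],
        z[idx[3] as usize],
    ]
}
```

For example, `shuffle4_model(v, w, [0, 4, 1, 5])` interleaves the low
halves of `v` and `w`.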
---
 text/0000-simd-infrastructure.md | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 1c6b8cbf74b..0b6fb616e1d 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -196,16 +196,10 @@ platform specific intrinsic for shuffling.
 
 ```rust
 extern "platform-intrinsic" {
-    fn simd_shuffle2<T, Elem>(v: T, w: T, i0: u32, i1: u32) -> Simd2<Elem>;
-    fn simd_shuffle4<T, Elem>(v: T, w: T, i0: u32, i1: u32, i2: u32, i3: u32) -> Sidm4<Elem>;
-    fn simd_shuffle8<T, Elem>(v: T, w: T,
-                              i0: u32, i1: u32, i2: u32, i3: u32,
-                              i4: u32, i5: u32, i6: u32, i7: u32) -> Simd8<Elem>;
-    fn simd_shuffle16<T, Elem>(v: T, w: T,
-                               i0: u32, i1: u32, i2: u32, i3: u32,
-                               i4: u32, i5: u32, i6: u32, i7: u32
-                               i8: u32, i9: u32, i10: u32, i11: u32,
-                               i12: u32, i13: u32, i14: u32, i15: u32) -> Simd16<Elem>;
+    fn simd_shuffle2<T, Elem>(v: T, w: T, idx: [i32; 2]) -> Simd2<Elem>;
+    fn simd_shuffle4<T, Elem>(v: T, w: T, idx: [i32; 4]) -> Sidm4<Elem>;
+    fn simd_shuffle8<T, Elem>(v: T, w: T, idx: [i32; 8]) -> Simd8<Elem>;
+    fn simd_shuffle16<T, Elem>(v: T, w: T, idx: [i32; 16]) -> Simd16<Elem>;
 }
 ```
 
@@ -214,10 +208,8 @@ time, ensure that `T` is a SIMD vector, `Elem` is the element type of
 `T` etc. Libraries can use traits to ensure that these will be
 enforced by the type checker too.
 
-This approach has some downsides: `simd_shuffle32` (e.g. `Simd32<u8>`
-on AVX, and `Simd32<u16>` on AVX-512) and especially `simd_shuffle64`
-(e.g. `Simd64<u8>` on AVX-512) are unwieldy. These have similar type
-"safety"/code-generation errors to the vectors themselves.
+This approach has similar type "safety"/code-generation errors to the
+vectors themselves.
 
 These operations are semantically:
 
@@ -225,10 +217,10 @@ These operations are semantically:
 // vector of double length
 let z = concat(v, w);
 
-return [z[i0], z[i1], z[i2], ...]
+return [z[idx[0]], z[idx[1]], z[idx[2]], ...]
 ```
 
-The indices `iN` have to be compile time constants. Out of bounds
+The index array `idx` has to be a compile time constant. Out of bounds
 indices yield unspecified results.
 
 Similarly, intrinsics for inserting/extracting elements into/out of

From 8e3a0deba208e4b42216cf210570755f79893aca Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 11:54:37 -0700
Subject: [PATCH 19/25] shuffles don't rely on generic types for return values.

This has less type safety, but doesn't require generic simd types to
exist:

    #[repr(simd)] struct Simd2<T>(T, T);
---
 text/0000-simd-infrastructure.md | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)
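
Note (not part of the commit): a plain-Rust model of what the relaxed
signature permits, offered as a rough sketch rather than the intrinsic
itself. `model_shuffle` and its const parameters `IN`/`OUT` are
hypothetical names; arrays stand in for `repr(simd)` types. The point
illustrated is that the output length follows the index array, not the
input vectors.

```rust
use std::array;

/// Hypothetical scalar model of the shuffle semantics in the RFC text:
/// `z = concat(v, w)`, then `out[k] = z[idx[k]]`.
/// Arrays are an assumption standing in for `repr(simd)` vectors.
fn model_shuffle<T: Copy, const IN: usize, const OUT: usize>(
    v: [T; IN],
    w: [T; IN],
    idx: [u32; OUT],
) -> [T; OUT] {
    array::from_fn(|k| {
        let i = idx[k] as usize;
        // Indices 0..IN select from `v`, IN..2*IN select from `w`.
        // Anything larger panics in this model.
        if i < IN { v[i] } else { w[i - IN] }
    })
}
```

For instance, `model_shuffle([1, 2, 3, 4], [5, 6, 7, 8], [0, 7])` picks
the first element of `v` and the last element of `w`, giving a
two-element result from four-element inputs.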

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 0b6fb616e1d..7bc38d2f90d 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -196,17 +196,17 @@ platform specific intrinsic for shuffling.
 
 ```rust
 extern "platform-intrinsic" {
-    fn simd_shuffle2<T, Elem>(v: T, w: T, idx: [i32; 2]) -> Simd2<Elem>;
-    fn simd_shuffle4<T, Elem>(v: T, w: T, idx: [i32; 4]) -> Sidm4<Elem>;
-    fn simd_shuffle8<T, Elem>(v: T, w: T, idx: [i32; 8]) -> Simd8<Elem>;
-    fn simd_shuffle16<T, Elem>(v: T, w: T, idx: [i32; 16]) -> Simd16<Elem>;
+    fn simd_shuffle2<T, U>(v: T, w: T, idx: [i32; 2]) -> U;
+    fn simd_shuffle4<T, U>(v: T, w: T, idx: [i32; 4]) -> U;
+    fn simd_shuffle8<T, U>(v: T, w: T, idx: [i32; 8]) -> U;
+    fn simd_shuffle16<T, U>(v: T, w: T, idx: [i32; 16]) -> U;
 }
 ```
 
 The raw definitions are only checked for validity at monomorphisation
-time, ensure that `T` is a SIMD vector, `Elem` is the element type of
-`T` etc. Libraries can use traits to ensure that these will be
-enforced by the type checker too.
+time, ensure that `T` and `U` are SIMD vectors with the same element
+type, `U` has the appropriate length etc. Libraries can use traits to
+ensure that these will be enforced by the type checker too.
 
 This approach has similar type "safety"/code-generation errors to the
 vectors themselves.
@@ -362,6 +362,17 @@ cfg_if_else! {
   pointers(/references?) in `repr(simd)` types.
 - allow (and ignore for everything but type checking) zero-sized types
   in `repr(simd)` structs, to allow tagging them with markers
+- the shuffle intrinsics could be made more relaxed in their type
+  checking (i.e. not require that they return their second type
+  parameter), to allow more type safety when combined with generic
+  simd types:
+
+      #[repr(simd)] struct Simd2<T>(T, T);
+      extern "platform-intrinsic" {
+          fn simd_shuffle2<T, U>(x: T, y: T, idx: [u32; 2]) -> Simd2<U>;
+      }
+
+  This should be a backwards-compatible generalisation.
 
 # Alternatives
 
@@ -404,7 +415,6 @@ cfg_if_else! {
 - use generic intrinsics like shuffles for the arithmetic operations,
   instead of providing the operations implicitly.
 
-
 # Unresolved questions
 
 - Should integer vectors get division automatically? Most CPUs

From 54b0927ea11ae544549edaa9a79b5ca1b6225a91 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 11:59:12 -0700
Subject: [PATCH 20/25] Intrinsics-for-operations is now the RFC, not an
 alternative.

Also, the comparison comment no longer makes sense.
---
 text/0000-simd-infrastructure.md | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index 7bc38d2f90d..d4d8558099a 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -240,11 +240,8 @@ similarly to the shuffles.
 
 ### Comparisons
 
-Comparisons are implemented via intrinsics, because the current
-comparison operator infrastructure doesn't easily lend itself to
-return vectors, as required.
-
-The raw signatures would look like:
+Comparisons are implemented via intrinsics. The raw signatures would
+look like:
 
 ```rust
 extern "platform-intrinsic" {
@@ -412,8 +409,6 @@ cfg_if_else! {
 - have 100% guaranteed type-safety for generic `#[repr(simd)]` types
   and the generic intrinsics. This would probably require a relatively
   complicated set of traits (with compiler integration).
-- use generic intrinsics like shuffles for the arithmetic operations,
-  instead of providing the operations implicitly.
 
 # Unresolved questions
 

From 9e31ad3327327eb51849623f5f5bf7cc2afb58c0 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 14:25:05 -0700
Subject: [PATCH 21/25] Out of bounds indices are errors (backwards compat to
 relax).

---
 text/0000-simd-infrastructure.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
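
Note (not part of the commit): one way to picture "out of bounds
indices yield errors" when the indices are compile time constants.
This is only a sketch in ordinary Rust; `indices_in_bounds` and `IDX`
are hypothetical names, and the real check would live in the compiler,
not in user code.

```rust
/// Hypothetical helper: true if every index refers to an element of
/// `concat(v, w)`, where `v` and `w` each have `input_len` elements.
const fn indices_in_bounds<const N: usize>(idx: [u32; N], input_len: usize) -> bool {
    let mut k = 0;
    while k < N {
        if idx[k] as usize >= 2 * input_len {
            return false;
        }
        k += 1;
    }
    true
}

// Example index array for shuffling two 4-element vectors.
const IDX: [u32; 4] = [0, 5, 2, 7];

// Evaluated during compilation: an out-of-range entry in `IDX`
// would make this assertion fail and stop the build.
const _: () = assert!(indices_in_bounds(IDX, 4));
```

Rejecting bad indices up front keeps the door open to defining their
behaviour later, which is the backwards-compatibility argument in the
subject line.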

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index d4d8558099a..dde310e1f16 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -221,7 +221,7 @@ return [z[idx[0]], z[idx[1]], z[idx[2]], ...]
 ```
 
 The index array `idx` has to be compile time constants. Out of bounds
-indices yield unspecified results.
+indices yield errors.
 
 Similarly, intrinsics for inserting/extracting elements into/out of
 vectors are provided, to allow modelling the SIMD vectors as actual

From 91a2b360ae7b4c6e820b57d937f6012ffc448f73 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 14:27:05 -0700
Subject: [PATCH 22/25] Only invalid to *call* intrinsics on bad platforms.

It's valid to `extern` them, though.
---
 text/0000-simd-infrastructure.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index dde310e1f16..f0d1f5eec04 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -126,11 +126,11 @@ extern "platform-intrinsic" {
 These all use entirely concrete types, and this is the core interface
 to these intrinsics: essentially it is just allowing code to exactly
 specify a CPU instruction to use. These intrinsics only actually work
-on a subset of the CPUs that Rust targets, and are only be available
-for `extern`ing on those targets. The signatures are typechecked, but
-in a "duck-typed" manner: it will just ensure that the types are SIMD
-vectors with the appropriate length and element type, it will not
-enforce a specific nominal type.
+on a subset of the CPUs that Rust targets, and will result in compile
+time errors if they are called on platforms that do not support
+them. The signatures are typechecked, but in a "duck-typed" manner: it
+will just ensure that the types are SIMD vectors with the appropriate
+length and element type, it will not enforce a specific nominal type.
 
 NB. The structural typing is just for the declaration: if a SIMD intrinsic
 is declared to take a type `X`, it must always be called with `X`,

From 60931df73c8733fecd662dff900a5f172f01eee4 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Wed, 12 Aug 2015 14:27:55 -0700
Subject: [PATCH 23/25] There can be more shuffles.

---
 text/0000-simd-infrastructure.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index f0d1f5eec04..f1ad633420d 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -200,6 +200,7 @@ extern "platform-intrinsic" {
     fn simd_shuffle4<T, U>(v: T, w: T, idx: [i32; 4]) -> U;
     fn simd_shuffle8<T, U>(v: T, w: T, idx: [i32; 8]) -> U;
     fn simd_shuffle16<T, U>(v: T, w: T, idx: [i32; 16]) -> U;
+    // ...
 }
 ```
 

From 135ba7d7b4809ce0656ddce668c9c1f3f26b20a8 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Fri, 14 Aug 2015 09:55:19 -0700
Subject: [PATCH 24/25] Internal references are legal.

Automatic crazy boolean bit-packing is crazy.
---
 text/0000-simd-infrastructure.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index f1ad633420d..b3474958311 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -87,12 +87,6 @@ would layer type-safety on top (i.e. generic `repr(simd)` types would
 use some `unsafe` trait as a bound that is designed to only be
 implemented by types that will work).
 
-It is illegal to take an internal reference to the fields of a
-`repr(simd)` type, because the representation of booleans may require
-modification, so that booleans are bit-packed. The official external
-library providing SIMD support will have private fields so this will
-not be generally observable.
-
 Adding `repr(simd)` to a type may increase its minimum/preferred
 alignment, based on platform behaviour. (E.g. x86 wants its 128-bit
 SSE vectors to be 128-bit aligned.)

From 67fea6e98fecb0a05adb419ddc0cf504e5e0ba04 Mon Sep 17 00:00:00 2001
From: Huon Wilson <dbau.pp+github@gmail.com>
Date: Fri, 14 Aug 2015 10:09:45 -0700
Subject: [PATCH 25/25] Type-level integer/values alternatives for shuffles.

---
 text/0000-simd-infrastructure.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/text/0000-simd-infrastructure.md b/text/0000-simd-infrastructure.md
index b3474958311..4c8974fdfce 100644
--- a/text/0000-simd-infrastructure.md
+++ b/text/0000-simd-infrastructure.md
@@ -390,6 +390,17 @@ cfg_if_else! {
   compiler can know this. The `repr(simd)` approach means there may be
   more than one SIMD-vector type with the `Simd8<u32>` shape (or, in
   fact, there may be zero).
+- With type-level integers, there could be one shuffle intrinsic:
+
+      fn simd_shuffle<T, U, const N: usize>(x: T, y: T, idx: [u32; N]) -> U;
+
+  NB. It is possible to add this as an additional intrinsic (possibly
+  deprecating the `simd_shuffleNNN` forms) later.
+- Type-level values can be applied more generally: since the shuffle
+  indices have to be compile time constants, the shuffle could be
+
+      fn simd_shuffle<T, U, const N: usize, const IDX: [u32; N]>(x: T, y: T) -> U;
+
 - Instead of platform detection, there could be feature detection
   (e.g. "platform supports something equivalent to x86's `DPPS`"), but
   there probably aren't enough cross-platform commonalities for this