simd intrinsics with mask: accept unsigned integer masks, and fix some of the errors
It's not at all clear why the mask would have to be signed; it is interpreted bitwise anyway. The backend should just make sure that works no matter the surface-level type; our LLVM backend already does this correctly. The note that "the mask may be widened, which only has the correct behavior for signed integers" explains... nothing. Why can't the code do the widening correctly? If necessary, just cast to the signed type first...
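To make the signedness point concrete, here is a minimal standalone sketch (plain Rust, not the intrinsics themselves) showing that a mask lane carries the same bits whether it is typed as signed or unsigned, and that widening just needs to go through the signed type:

```rust
fn main() {
    // The same bit pattern, viewed as signed or unsigned.
    let signed_lane: i8 = -1; // all bits set: the lane is enabled
    let unsigned_lane: u8 = u8::MAX; // identical bits, different surface-level type
    assert_eq!(signed_lane as u8, unsigned_lane);

    // Widening an unsigned lane naively would zero-extend; casting to the
    // signed type first makes the widening sign-extend, so an all-ones lane
    // stays all-ones at the wider width.
    let widened = (unsigned_lane as i8) as i16;
    assert_eq!(widened, -1i16);
}
```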
Also, while we are at it, fix the errors. For `simd_masked_load`/`simd_masked_store`, the errors talked about the "third argument" when they meant the first argument (the mask is the first argument there). They also used the wrong type for `expected_element`.
I have extremely low confidence in the GCC part of this PR.
See [discussion on Zulip](https://rust-lang.zulipchat.com/#narrow/channel/257879-project-portable-simd/topic/On.20the.20sign.20of.20masks)
avoid overflow when generating debuginfo for expanding recursive types
Fixes #135093, fixes #121538, fixes #107362, fixes #100618, fixes #115994.
The overflow happens because expanding recursive types keeps creating new nested types when recursing into subfields.
I fixed that by returning an empty stub node when recursion is detected during expansion.
Autodiff batching2
~I will rebase it once my first PR landed.~ done.
This autodiff batch mode is more similar to scalar autodiff, since it still takes only one shadow argument.
However, that argument is expected to be `width` times larger.
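As a rough illustration (the signatures in the comments below are illustrative only, not the generated API):

```rust
// Original function: 12 scalar inputs, one scalar output.
fn f(x: &[f32; 12]) -> f32 {
    x.iter().sum()
}

// Scalar reverse mode (illustrative signature only): one shadow, same length as `x`.
//     df(x: &[f32; 12], dx: &mut [f32; 12], ...)
//
// Batched mode with width = 4 (this PR): still a single shadow argument,
// but it is expected to be 4 times as long.
//     df_batched(x: &[f32; 12], dx: &mut [f32; 48], ...)
fn main() {
    let x = [1.0f32; 12];
    println!("{}", f(&x));
}
```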
r? `@oli-obk`
Tracking:
- https://github.com/rust-lang/rust/issues/124509
Prepend temp files with per-invocation random string to avoid temp filename conflicts
https://github.com/rust-lang/rust/issues/139407 uncovered a very subtle unsoundness involving incremental codegen, failing compilation sessions (due to assembler errors), and the "prefer hard linking over copying files" strategy we use in the compiler for file management.
Specifically, imagine we're building a single file 3 times, all with `-Csave-temps -Cincremental=...`. Let's call the object file we're building for the codegen unit for `main` "`XXX.o`" just for clarity since it's probably some gigantic hash name:
```rust
#[inline(never)]
#[cfg(any(rpass1, rpass3))]
fn a() -> i32 {
    0
}

#[cfg(any(cfail2))]
fn a() -> i32 {
    1
}

fn main() {
    evil::evil();
    assert_eq!(a(), 0);
}

mod evil {
    #[cfg(any(rpass1, rpass3))]
    pub fn evil() {
        unsafe {
            std::arch::asm!("/* */");
        }
    }

    #[cfg(any(cfail2))]
    pub fn evil() {
        unsafe {
            std::arch::asm!("missing");
        }
    }
}
```
Session 1 (`rpass1`):
* Type-check, borrow-check, etc.
* Serialize the dep graph to the incremental working directory `.../s-...-working/`.
* Codegen the object file to a temp file `XXX.rcgu.o`, which is spit out in the cwd.
* Hard-link[^1] `XXX.rcgu.o` to the incremental working directory `.../s-...-working/XXX.o`.
* The save-temps option means we don't delete `XXX.rcgu.o`.
* Link the binary and stuff.
* Finalize[^2] the working incremental session by renaming `.../s-...-working` to `s-...-asjkdhsjakd` (some other finalized incr comp session dir name).
Session 2 (`cfail2`):
* Load artifacts from the previous *finalized* incremental session, namely the dep graph.
* Type-check, borrow-check, etc. since the file has changed, so most dep graph nodes are red.
* Serialize the dep graph to the incremental working directory `.../s-...-working/`.
* Codegen the object file to a temp file `XXX.rcgu.o`. **HERE IS THE PROBLEM**: the hard link still points at the same inode as `XXX.o` from the first session, so this also modifies the `XXX.o` in the previously finalized session directory (a standalone demonstration of this aliasing follows the walkthrough below).
* Codegen emits an error because `missing` is not an instruction, so we abort before finalizing the incremental session. Specifically, this means that the *previous* session (session 1) remains the last finalized session.
Session 3 (`rpass3`):
* Load artifacts from the previous *finalized* incremental session, namely the dep graph. NOTE that this is from session 1.
* All the dep graph nodes are green since we are basically replaying session 1.
* Codegen the object file `XXX.o`, which is detected as *reused* from session 1 since the dep nodes were green. That means we **reuse** the `XXX.o` that was dirtied in session 2.
* Link the binary and stuff.
This results in a binary which reuses some of the build artifacts from session 2, but thinks it's from session 1.
At this point, I hope it's clear that the incremental results from session 1 were dirtied by session 2, but we reuse them as if session 1 were the previous (finalized) incremental session we ran. This is at best really buggy, and at worst **unsound**.
This isn't limited to `-C save-temps`, since other combinations of flags may also keep temporary files around (hard-linked) in the working directory (like `-C debuginfo=1 -C split-debuginfo=unpacked` on darwin, for example).
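To make the hard-link aliasing concrete, here is a minimal standalone sketch (plain `std::fs`, nothing rustc-specific; the filenames are just illustrative):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // "Session 1": produce the temp object file and hard-link it into the
    // (eventually finalized) session directory.
    fs::write("XXX.rcgu.o", b"session 1 contents")?;
    fs::hard_link("XXX.rcgu.o", "XXX.o")?;

    // "Session 2": reusing the same temp filename rewrites the same inode...
    fs::write("XXX.rcgu.o", b"session 2 contents")?;

    // ...so the artifact that session 1 "finalized" has silently changed too.
    assert_eq!(fs::read("XXX.o")?, b"session 2 contents");
    Ok(())
}
```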
---
This PR implements a fix, which is to prepend temp filenames with a random string that is generated per invocation of rustc. This string is not *deterministic*, but temporary files are transient anyway, so I don't believe this is a problem.
That means temp files are now named something like `{crate-name}.{cgu}.{invocation_temp}.rcgu.o`, where `{invocation_temp}` is the new temporary string we generate per invocation of rustc.
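A rough sketch of the idea (the helper names and the RNG choice here are mine, not the actual rustc implementation):

```rust
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

// One random-ish string per compiler invocation. The real implementation
// presumably uses a proper RNG; pid + nanoseconds is only for this sketch.
fn new_invocation_temp() -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos();
    format!("{:08x}{:08x}", process::id(), nanos)
}

// `{crate-name}.{cgu}.{invocation_temp}.rcgu.o`, per the naming scheme above.
fn temp_object_name(crate_name: &str, cgu: &str, invocation_temp: &str) -> String {
    format!("{crate_name}.{cgu}.{invocation_temp}.rcgu.o")
}

fn main() {
    let invocation_temp = new_invocation_temp(); // generated once per invocation
    println!("{}", temp_object_name("mycrate", "cgu0", &invocation_temp));
}
```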
Fixes https://github.com/rust-lang/rust/issues/139407
[^1]: 175dcc7773/compiler/rustc_fs_util/src/lib.rs (L60)
[^2]: 175dcc7773/compiler/rustc_incremental/src/persist/fs.rs (L1-L40)
add `core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn}`
fixes https://github.com/rust-lang/rust/issues/137372
adds `core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn}`, which, in contrast to their non-`dyn` counterparts, allow a non-const index. Many platforms (but notably not x86_64 or aarch64) have dedicated instructions for this operation, which stdarch can emit with this change.
Future work is to also make the `Index` operation on the `Simd` type emit this operation, but the intrinsic can't be used directly. We'll need some MIR shenanigans for that.
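A hedged usage sketch (nightly-only; the exact signatures, in particular the `u32` index type, are assumptions on my part):

```rust
#![feature(core_intrinsics, repr_simd)]
#![allow(internal_features, non_camel_case_types)]

use core::intrinsics::simd::{simd_extract_dyn, simd_insert_dyn};

#[repr(simd)]
#[derive(Copy, Clone)]
struct f32x4([f32; 4]);

fn main() {
    let v = f32x4([1.0, 2.0, 3.0, 4.0]);
    // The index is a runtime value; the non-dyn intrinsics require a constant here.
    let idx: u32 = 2;
    unsafe {
        // Assumed shape: extract lane `idx`, then write 9.0 back into the same lane.
        let lane: f32 = simd_extract_dyn(v, idx);
        assert_eq!(lane, 3.0);
        let w = simd_insert_dyn(v, idx, 9.0f32);
        let lane: f32 = simd_extract_dyn(w, idx);
        assert_eq!(lane, 9.0);
    }
}
```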
r? `@ghost`
add sret handling for scalar autodiff
r? `@oli-obk`
Fixing one of the todo's which I left in my previous batching PR.
This one handles sret for scalar autodiff. `sret` mostly shows up when we try to return a lot of scalar floats.
People often start testing autodiff with toy functions that just use a few scalars as inputs and outputs, and those were the most likely to be affected by this issue. So this fix should hopefully make learning/teaching a bit easier.
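For context, here is a minimal example of the kind of signature that typically lowers to an `sret` (struct-return-via-hidden-pointer) ABI; the struct and function are purely illustrative:

```rust
// Four f64 fields is already too large to come back in registers on common
// 64-bit targets, so the caller passes a hidden out-pointer (`sret`).
#[derive(Debug)]
struct Outputs {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
}

fn toy(x: f64) -> Outputs {
    Outputs { a: x, b: x * x, c: x.sin(), d: x.cos() }
}

fn main() {
    println!("{:?}", toy(0.5));
}
```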
Tracking:
- https://github.com/rust-lang/rust/issues/124509
coverage: Build the CGU's global file table as late as possible
Embedding coverage metadata in the output binary is a delicate dance, because per-function records need to embed references to the per-CGU filename table, but we only want to include files in that table if they are successfully used by at least one function.
The way that we build the file tables has changed a few times over the last few years. This particular change is motivated by experimental work on properly supporting macro-expansion regions, which adds some additional constraints that our previous implementation wasn't equipped to deal with.
LLVM is very strict about not allowing unused entries in local file tables. Currently that's not much of an issue, because we assume one source file per function, but to support expansion regions we need the flexibility to avoid committing to the use of a file until we're completely sure that we are able and willing to produce at least one coverage mapping region for it. In particular, when preparing a function's covfun record, we need the flexibility to decide at a late stage that a particular file isn't needed/usable after all.
(It's OK for the *global* file table to contain unused entries, but we would still prefer to avoid that if possible, and this implementation also achieves that.)
Autodiff batching
Enzyme supports batching, which is especially known from the ML side when training neural networks.
There we would normally have a training loop, where in each iteration we would pass in some data (e.g. an image), and a target vector. Based on how close we are with our prediction we compute our loss, and then use backpropagation to compute the gradients and update our weights.
That's quite inefficient, so what you normally do instead is pass in a batch of 8/16/... images and targets and compute the gradients for all of them at once, which allows better optimizations.
Enzyme supports batching in two ways. The first one (which I implemented here) just accepts a batch size `N`, and then each Dual/Duplicated argument has not one but `N` shadow arguments. So instead of
```rs
for i in 0..100 {
    df(x[i], y[i], 1234);
}
```
You can now do
```rs
for i in (0..100).step_by(4) {
    df(x[i+0], x[i+1], x[i+2], x[i+3], y[i+0], y[i+1], y[i+2], y[i+3], 1234);
}
```
which will give the same results, but allows better compiler optimizations. See the testcase for details.
There is a second variant, where we can mark certain arguments and, instead of having to pass in N shadow arguments, Enzyme assumes that the argument is N times longer. I.e. instead of accepting 4 slices with 12 floats each, we would accept one slice with 48 floats. I'll implement this over the next few days.
I will also add more tests for both modes.
For anyone preferring a more interactive explanation, here's a video of Tim's LLVM dev talk, where he presents his work: https://www.youtube.com/watch?v=edvaLAL5RqU
I'll also add some other docs to the dev guide and user docs in another PR.
r? ghost
Tracking:
- https://github.com/rust-lang/rust/issues/124509
- https://github.com/rust-lang/rust/issues/135283
Rename `is_like_osx` to `is_like_darwin`
Replace `is_like_osx` with `is_like_darwin`, which more closely describes reality (OS X is the pre-2016 name for macOS, and is by now quite outdated; Darwin is the overall name for the OS underlying Apple's macOS, iOS, etc.).
``@rustbot`` label O-apple
r? compiler
Add the new `amx` target features and the `movrs` target feature
Adds 5 new `amx` target features included in LLVM 20. These are gated behind `x86_amx_intrinsics` (#126622):
- `amx-avx512`
- `amx-fp8`
- `amx-movrs`
- `amx-tf32`
- `amx-transpose`
Adds the `movrs` target feature (from #137976).
`@rustbot` label O-x86_64 O-x86_32 T-compiler A-target-feature
r? `@Amanieu`
Avoid wrapping constant allocations in packed structs when not necessary
This way LLVM will set the string-merging flag if the allocation is a nul-terminated string, reducing binary sizes.
try-job: armhf-gnu
cg_llvm: Reduce the visibility of types, modules and `use` declarations in `rustc_codegen_llvm`.
Final part of #135502
Reduces the visibility of types, modules and `use` declarations in `rustc_codegen_llvm` to private or `pub(crate)` where possible, and marks unused fields and enum variants with `#[expect(dead_code)]`.
r? Zalathar
Lower BinOp::Cmp to llvm.{s,u}cmp.* intrinsics
Lowers `mir::BinOp::Cmp` (`three_way_compare` intrinsic) to the corresponding LLVM `llvm.{s,u}cmp.i8.*` intrinsics.
These are the intrinsics mentioned in https://github.com/rust-lang/rust/pull/118310, which are now available in LLVM 19.
I couldn't find any follow-up PRs/discussions about this, please let me know if I missed something.
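For reference, a small example of the operation being lowered (ordinary stable Rust; the comment about the emitted intrinsic reflects this PR's intent and is not something the snippet itself checks):

```rust
use std::cmp::Ordering;

// Integer `Ord::cmp` is MIR's `BinOp::Cmp` / the `three_way_compare`
// intrinsic; with this change it can lower to a single `llvm.scmp.i8.i32`
// (or `llvm.ucmp.*` for unsigned types) instead of a compare/select sequence.
fn three_way(a: i32, b: i32) -> Ordering {
    a.cmp(&b)
}

fn main() {
    assert_eq!(three_way(1, 2), Ordering::Less);
    assert_eq!(three_way(2, 2), Ordering::Equal);
    assert_eq!(three_way(3, 2), Ordering::Greater);
}
```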
r? `@scottmcm`
For expansion region support, we will want to be able to convert and check spans before creating a corresponding local file ID.

If we create local file IDs eagerly, but some expansion turns out to have no successfully-converted spans, LLVM will complain about that expansion's file ID having no regions.