Whyte Consolidated · Digital Infrastructure Capital

Article · June 11, 2026 · Digital infrastructure

20× on Apple Silicon.
And not one line of consensus math changed.

BTX's proof-of-work is a real 512×512 matrix multiplication over the Mersenne-prime field 2³¹−1 — not a hash grind. That makes GPU acceleration both tempting and dangerous: get the arithmetic wrong by a single bit and you fork yourself off the network. We spent a day profiling the Metal backend on an Apple M5 MacBook Air and came away with a 20.8× speedup — and a stronger conviction that the next win is measurement, not a faster kernel.

The Metal path hit 6,303 nonces/sec against a 303 nonces/sec CPU baseline, with zero GPU-to-CPU fallbacks and zero digest mismatches. The most surprising result wasn't the headline number; it was that batch size 1 won. This is a field report on what we measured, why the intuitive answers were wrong, and why the highest-leverage change is a per-host calibration layer rather than a consensus rewrite.

Apple M5 · 10-core GPU · exact M31 Metal kernels · batch 1 won · calibrate, don't guess

Whyte Consolidated Research · 2026-06-11 · 10 min read

By the numbers · Apple M5, live-like mainnet shape

Metric	Value	Detail
CPU baseline	303.67	10 solver threads, nonces/sec
Metal, shared-base	6,303.11	batch 1, 20.8× the CPU
Metal, variable-base	4,520.08	batch 16, 14.9× the CPU
Fallbacks / mismatches	0 / 0	across every benchmarked run

1 · A proof-of-work that's literally a matrix multiply

The GPU is allowed to be fast. It is never allowed to be approximately right.

Most proof-of-work chains hash. BTX instead makes miners compute a structured matrix product and commit to its digest. Per candidate, the CPU solver derives a seed from the header, generates a low-rank noise pair, builds the perturbed matrices A' and B', runs the canonical MatMul, and compares the digest against the difficulty target — all in exact M31 integer arithmetic.
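To make "exact M31 integer arithmetic" concrete, here is a minimal sketch of reduction modulo 2³¹−1 and a small matrix product over that field. The function names are ours for illustration and are not taken from the BTX codebase; the real solver works at n=512 and adds seed derivation, noise, and digesting that this sketch leaves out.

```python
P = (1 << 31) - 1  # the Mersenne prime 2^31 - 1


def m31_reduce(x: int) -> int:
    """Reduce 0 <= x < 2^62 modulo P without a division."""
    # 2^31 is congruent to 1 mod P, so the high bits fold onto the low bits.
    x = (x & P) + (x >> 31)  # now x < 2^32
    x = (x & P) + (x >> 31)  # now x <= P
    return 0 if x == P else x


def m31_matmul(a, b):
    """Exact matrix product over the field; every step is integer-only."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = 0
            for t in range(inner):
                # Reduce after each multiply-add so values stay small.
                acc = m31_reduce(acc + m31_reduce(row[t] * b[t][j]))
            out_row.append(acc)
        out.append(out_row)
    return out
```

The same two-fold reduction carries over to 64-bit unsigned integers in a fixed-width language, because the largest product of two field elements is below 2⁶².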

On Apple builds the low-rank products can use Accelerate's split-16 DGEMM for speed. But the code spot-checks every result against scalar M31 math and permanently disables the Accelerate path for the process the instant it sees divergence. That paranoia is the whole game. The CPU path is the correctness oracle: deterministic, portable, and tied directly to consensus verification. Every GPU optimization is judged against it.
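The spot-check-and-latch pattern can be sketched as follows. This is an illustration under assumptions, not the BTX implementation: we read "split-16" as splitting each field element into 16-bit limbs so that double-precision sums stay exact, we stand in for Accelerate with plain Python floats, and the class name, sample count, and sampling policy are hypothetical.

```python
import random

P = (1 << 31) - 1


def scalar_dot(a, b):
    """Oracle: exact integer dot product reduced modulo P."""
    return sum(x * y for x, y in zip(a, b)) % P


def split16_dot(a, b):
    """Fast-path stand-in: float64 sums over 16-bit limbs, recombined exactly."""
    a, b = list(a), list(b)
    a_lo = [float(x & 0xFFFF) for x in a]
    a_hi = [float(x >> 16) for x in a]
    b_lo = [float(y & 0xFFFF) for y in b]
    b_hi = [float(y >> 16) for y in b]
    # Every limb product is below 2^32, so sums over short vectors stay
    # below 2^53 and float64 represents them exactly.
    ll = sum(x * y for x, y in zip(a_lo, b_lo))
    lh = sum(x * y for x, y in zip(a_lo, b_hi))
    hl = sum(x * y for x, y in zip(a_hi, b_lo))
    hh = sum(x * y for x, y in zip(a_hi, b_hi))
    return ((int(hh) << 32) + ((int(lh) + int(hl)) << 16) + int(ll)) % P


class CheckedFastPath:
    """Use a fast dot product until one spot-check disagrees with the oracle."""

    def __init__(self, fast_dot, samples=4, seed=0):
        self.fast_dot = fast_dot
        self.samples = samples
        self.enabled = True  # once False, it never becomes True again
        self._rng = random.Random(seed)

    def matmul(self, u, v):
        cols = list(zip(*v))
        if self.enabled:
            out = [[self.fast_dot(row, col) for col in cols] for row in u]
            for _ in range(self.samples):
                i = self._rng.randrange(len(u))
                j = self._rng.randrange(len(cols))
                if out[i][j] != scalar_dot(u[i], cols[j]):
                    self.enabled = False  # latch: abandon the fast path
                    break
            if self.enabled:
                return out
        # Fallback and permanent path after a divergence: the oracle itself.
        return [[scalar_dot(row, col) for col in cols] for row in u]
```

The design point is the one-way latch: a single disagreement is treated as evidence that the fast path cannot be trusted on this host, so the result is recomputed exactly and the fast path stays off.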

The boundary you cannot move

✓ M31 field arithmetic over 2³¹−1: exact, integer, no exceptions.
✓ Header serialization, sigma derivation, seed_a and seed_b.
✓ Noise-factor and compression-vector derivation.
✓ Canonical transcript order and product-committed word order.
✓ CPU pre-hash recheck before digest batching.
✓ CPU confirmation of candidate blocks before a block is stamped.
✓ Clean CPU fallback, or fail-closed, when a backend is unavailable.

Forbidden shortcuts

✕ No FP16, tensor cores, or approximate matmul for consensus mining.
✕ No lossy nonce prefilters.
✕ No reuse of shared A/B base matrices after the nonce-seed activation height.
✕ No promoting a fused GPU-hash path unless it's byte-exact against CPU fixtures.

2 · Two regimes, two opposite optimization problems

Same binary, same GPU, opposite winning batch size.

BTX has two mining regimes that look similar and optimize oppositely. Pre nonce-seed (shared base): many candidates share one A/B base-matrix pair, so you upload the base once and reuse it across a huge fan of nonces. Memory traffic is cheap; you want to keep the GPU fed.

Post nonce-seed (variable base): after the v2 activation at block 125,000, every passing candidate derives its own seed_a and seed_b, so every candidate gets its own base matrices. No more reuse. Now the bottleneck is generating and staging per-candidate matrices — and the optimal batch shape moves.

Regime	Best setting	Nonces/sec	vs CPU
CPU baseline, 10 solver threads	--backend cpu	303.67	1.0×
Pre nonce-seed, shared-base Metal	--batch-size 1	6,303.11	20.8×
Post nonce-seed, variable-base Metal	--batch-size 16	4,520.08	14.9×

Two regimes, two different winning batch sizes on the same machine. That single fact is the entire argument for calibration.

3 · Why batch size 1 won

The contention counter explained what throughput alone hid.

Intuition says bigger batches mean better GPU utilization. On this hardware, for shared-base product mining, intuition was wrong. Watch the rightmost column — the buffer-pool wait events:

Batch	Nonces/sec	Pool waits
1	6,303.11	0
2	6,016.16	0
16	5,424.49	3,056
32	5,321.49	4,006

At batch 1 there's no contention. Crank the batch up and you generate thousands of pool waits before you ever recover the cost in throughput. On a 10-core GPU with a finite buffer pool, larger batches bought pressure, not parallelism.

The control runs reinforced a second lesson — let the GPU generate its own inputs. Turning GPU input generation off barely moved median throughput (6,116 vs 6,303) but produced 5,790 pool waits where there had been none. Even a throughput-neutral knob has an obvious right default once you watch the contention counters.

One more trap worth its own line: direct Metal microbenchmarks need explicit pool slots. With the default single slot a microbench measured 303 digests/sec; with 8 slots the same kernel hit 613 — double. Benchmark parallel Metal work without sizing the pool and you're benchmarking contention, not the kernel.
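A pool-wait counter of the kind discussed in this section can be sketched with a semaphore. This is a generic illustration, not the Metal backend's pool: the class and function names are hypothetical, and `time.sleep` stands in for GPU work that holds a buffer slot.

```python
import threading
import time


class CountingPool:
    """Fixed number of buffer slots; counts acquisitions that had to wait."""

    def __init__(self, slots):
        self._sem = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.waits = 0

    def acquire(self):
        # Try without blocking first; a failure means the pool is exhausted.
        if self._sem.acquire(blocking=False):
            return
        with self._lock:
            self.waits += 1
        self._sem.acquire()  # now block until a slot is released

    def release(self):
        self._sem.release()


def run_jobs(pool, workers, jobs_per_worker, hold_seconds=0.002):
    """Run concurrent workers against the pool and return the wait count."""

    def worker():
        for _ in range(jobs_per_worker):
            pool.acquire()
            try:
                time.sleep(hold_seconds)  # stand-in for work using the slot
            finally:
                pool.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool.waits
```

When the pool has at least as many slots as concurrent workers, the counter stays at zero; with fewer slots, part of what a benchmark times is waiting for a slot rather than the kernel.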

Diagram · Shared-base vs variable-base, and where the cost moves

The two regimes pull in opposite directions. Shared-base mining reuses one matrix pair and is fastest at batch 1; variable-base mining regenerates matrices per candidate and is fastest at a moderate batch of 16. The winning setting is a property of the host and the regime, not a constant.

4 · The benchmark harness

Live-like mainnet shape — and one sandbox gotcha.

Everything ran at the live-like mainnet shape: n=512, b=16, r=8, epsilon_bits=18, nbits=0x1e063c74, with parallel, solver threads, and prepare workers all at 10.

build/bin/btx-matmul-solve-bench \
  --backend metal \
  --iterations 3 --tries 2048 \
  --parallel 10 --solver-threads 10 --prepare-workers 10 \
  --async 1 --gpu-inputs 1 \
  --block-height 125000 --nonce-seed-height 125000 \
  --batch-size 16

One gotcha worth its own line: sandboxed macOS processes couldn't see the Metal GPU. Every GPU benchmark had to run outside the sandbox. If your acceleration layer silently falls back to CPU under a sandbox, your “GPU numbers” are quietly lying to you, which is exactly why zero fallbacks is asserted as a metric in the test gate, not assumed.

The parity suites back the speed claims: 30 cases in the accelerated-solver tests and 31 in the Metal tests, all passing, covering CPU-vs-Metal digest and product-digest parity, mainnet 512/16/8 parity, batch-digest parity, nonce-seed variable-base parity, and concurrent-request pool contention reporting. Every benchmarked solver run reported zero fallbacks and zero digest mismatches — the only reason the speed numbers are allowed to count.
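Treating zero fallbacks and zero mismatches as asserted metrics might look like the sketch below. The report structure and field names are hypothetical and are not the benchmark tool's actual output format; the point is only that a throughput number is rejected unless the run stayed on the requested backend.

```python
from dataclasses import dataclass


@dataclass
class RunReport:
    """Hypothetical summary of one benchmark run; field names are illustrative."""
    backend_requested: str
    backend_used: str
    nonces_per_sec: float
    fallbacks: int
    digest_mismatches: int


def gate_problems(report):
    """List every reason this run's throughput should not be trusted."""
    problems = []
    if report.backend_used != report.backend_requested:
        problems.append(
            f"requested {report.backend_requested}, ran on {report.backend_used}"
        )
    if report.fallbacks != 0:
        problems.append(f"{report.fallbacks} fallbacks")
    if report.digest_mismatches != 0:
        problems.append(f"{report.digest_mismatches} digest mismatches")
    return problems


def trusted_throughput(report):
    """Return nonces/sec only when the run passed every check; raise otherwise."""
    problems = gate_problems(report)
    if problems:
        raise ValueError("untrusted benchmark: " + "; ".join(problems))
    return report.nonces_per_sec
```

A run that was silently demoted to CPU would fail this gate instead of being recorded as a slow GPU result.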

5 · The recommendation

Measure the machine. Don't guess it.

The best optimization here isn't a faster kernel. It's admitting that no fixed heuristic is right across hardware and regimes, and building a layer that figures it out per host.

Priority 1: per-host calibration. A short calibration mode sweeps a bounded set of parameter tuples and caches the winner, keyed by BTX version, backend, device name, GPU core count or CUDA SM profile, shape and digest mode, and activation regime. On this M5, the cached defaults would land at batch 1 with 8 pool slots for shared-base, and batch 16 for variable-base. It's low-risk because it only selects among existing, parity-gated code paths: no new math, no new trust assumptions.
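A calibration layer of this shape could be sketched as a bounded sweep plus a keyed cache. Everything named here is an assumption for illustration: the function names, the key encoding, the version and digest-mode strings, and the selection policy (highest throughput, fewer pool waits on ties). The measurement function simply replays the shared-base table from this article instead of running a benchmark.

```python
import json


def cache_key(version, backend, device, gpu_cores, shape, digest_mode, regime):
    """Serialize the identifying fields into one stable string key."""
    return json.dumps([version, backend, device, gpu_cores, shape, digest_mode, regime])


def calibrate(key, candidates, measure, cache):
    """Return cached parameters for key, or sweep candidates and cache the winner."""
    if key in cache:
        return cache[key]
    best_params, best_score = None, None
    for params in candidates:
        nonces_per_sec, pool_waits = measure(params)
        score = (nonces_per_sec, -pool_waits)  # throughput first, then fewer waits
        if best_score is None or score > best_score:
            best_params, best_score = params, score
    cache[key] = best_params
    return best_params


# Stand-in measurement: replays the shared-base results reported above.
ARTICLE_TABLE = {
    1: (6303.11, 0),
    2: (6016.16, 0),
    16: (5424.49, 3056),
    32: (5321.49, 4006),
}


def replay(params):
    return ARTICLE_TABLE[params["batch_size"]]


cache = {}
key = cache_key("example-version", "metal", "Apple M5", 10, "512/16/8",
                "product", "shared-base")
winner = calibrate(key, [{"batch_size": b} for b in (1, 2, 16, 32)], replay, cache)
print(winner)  # {'batch_size': 1}
```

Because the regime is part of the key, the variable-base result would be cached separately, so the two regimes never share a setting by accident.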

Priority 2 — a device-resident nonce-seed pipeline. The post-activation regime pays for host readback and re-staging. Compact the GPU pre-hash scan into a pass list instead of a flag per nonce, keep seed and sigma data device-side for passing candidates, feed per-candidate generation from device-owned buffers, and return only (nonce, digest, pass/fail) to the CPU for confirmation. Attack the staging, not the multiply.
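The difference between a flag per nonce and a compacted pass list can be sketched with the usual scan-and-scatter form of stream compaction. This is a serial Python illustration of the data flow only; on a GPU the scan and the scatter would run in parallel, and the function names here are ours.

```python
def exclusive_scan(flags):
    """Return each element's output offset and the total number of passes."""
    offsets, total = [], 0
    for flag in flags:
        offsets.append(total)
        total += flag
    return offsets, total


def compact_pass_list(nonces, flags):
    """Scatter only the passing nonces into a dense list."""
    offsets, count = exclusive_scan(flags)
    passed = [0] * count
    for i, flag in enumerate(flags):
        if flag:
            passed[offsets[i]] = nonces[i]
    return passed


flags = [0, 1, 0, 1, 1]  # one pre-hash result per scanned nonce
print(compact_pass_list([10, 11, 12, 13, 14], flags))  # [11, 13, 14]
```

With a flag array, the consumer reads one entry per scanned nonce; with a pass list, the amount of data handed to the next stage scales with the number of passing candidates instead.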

Priorities 3–5 are a true candidate-grid Metal kernel (candidate index as an extra grid dimension), CUDA multi-GPU nonce-seed sharding (the biggest CUDA-only gap, since nonce-seed mining still concentrates on the first device), and GPU-side product finalization, all gated behind expanded byte-exact tests and CPU confirmation before a block is stamped. CUDA couldn't be benchmarked locally: a MacBook has no NVIDIA GPU, so those numbers need a real multi-GPU Linux host.
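As a rough illustration of what nonce sharding across devices means, the sketch below splits a nonce range into contiguous, near-equal shards. The function name and device labels are hypothetical, and this says nothing about how the CUDA backend would actually schedule work; it only shows the partitioning idea.

```python
def shard_nonces(start, count, devices):
    """Split [start, start + count) into one contiguous shard per device."""
    base, extra = divmod(count, len(devices))
    shards, cursor = [], start
    for index, device in enumerate(devices):
        size = base + (1 if index < extra else 0)  # spread the remainder
        shards.append((device, cursor, size))
        cursor += size
    return shards


print(shard_nonces(0, 10, ["gpu0", "gpu1", "gpu2"]))
# [('gpu0', 0, 4), ('gpu1', 4, 3), ('gpu2', 7, 3)]
```

Each shard is disjoint and the shards cover the whole range, so every device gets work and no nonce is tried twice.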

6 · What generalizes beyond BTX

Five lessons for any fast path that must match a slow oracle.

Keep a slow oracle and never delete it. The CPU path isn't legacy — it's the definition of correct. The GPU is an optimization of it.
Watch contention counters, not just throughput. Pool-wait events explained every counterintuitive result here. Median nonces/sec alone would have sent us the wrong way on batch size.
Don't trust one batch size across regimes. Shared-base wanted 1; variable-base wanted 16. Same binary, same GPU, opposite answers.
Verify the GPU is even being used. A sandbox silently demoted the benchmark to CPU. Zero fallbacks isn't an assumption; it's a metric you assert in the test gate.
Calibrate instead of constant-folding. The cheapest, safest win was measuring the host and caching the result. Hardware is too varied for a magic number.

No consensus-math changes. No replacing M31 integer arithmetic. The next step is a calibration-backed backend policy plus device-resident nonce-seed batching — and, on real NVIDIA hardware, multi-GPU sharding. Sometimes the highest-leverage GPU optimization is to stop guessing and start measuring.

FAQ

MatMul GPU optimization — questions

Does this change BTX consensus or the mining math?

No. The whole point of the research is that none of it touches consensus. The CPU solver stays the correctness oracle, the M31 integer arithmetic over 2³¹−1 is unchanged, and every GPU result is checked bit-for-bit against the CPU before a block is stamped. The gains come from staging, batching, and per-host tuning, not from changing what counts as a valid block.

Why was batch size 1 the fastest on the M5?

For shared-base mining, larger batches generated thousands of buffer-pool wait events before they recovered the cost in GPU utilization. On a 10-core Metal GPU with a finite buffer pool, batch 1 hit 6,303 nonces/sec with zero pool waits, while batch 32 dropped to 5,321 with over 4,000 waits. Bigger isn't faster when the pool is the bottleneck.

Why is the recommendation 'calibrate' instead of a fixed setting?

Because the best setting changes between mining regimes and across hardware. Shared-base mining wanted batch 1; variable-base (post nonce-seed) mining wanted batch 16 on the same machine. A short calibration pass measures the host and caches the winning parameters keyed by device, GPU core count, shape, and activation regime, so no single hard-coded constant is left guessing for someone else's GPU.

Could CUDA be benchmarked too?

Not locally: a MacBook has no NVIDIA GPU. The CUDA path is build-gated (BTX_ENABLE_CUDA_EXPERIMENTAL, sm_80 or newer) and its fallback and parity scaffolding is tested, but real performance numbers need a multi-GPU Linux host. The biggest CUDA-only gap is multi-GPU nonce-seed sharding, which still concentrates on the first selected device.

What's the single most reusable lesson here?

When your fast path has to match a slow oracle bit-for-bit, your optimization budget goes into staging, batching, and measurement, not clever arithmetic. Keep the slow oracle forever, watch contention counters and not just throughput, and verify the GPU is even being used: a sandbox silently demoted the benchmark to CPU until it was run outside the sandbox.
