Two days after BTX activated v2 MatMul seeds at block 125,000, the project shipped 0.32.3 — a point release with no new activation height, no new consensus rules, and one very specific job: make the GPU backends fast on the v2 path. The 0.32.2 implementation was correct but unbatched. 0.32.3 replaces it with a production-grade batched CUDA implementation, lands a matching architecture on Metal, restores the pool-side tuning hook, and ships a handful of follow-up fixes the activation week surfaced.
The headline number is the CUDA throughput change at block 125,000: roughly 14.1k nonces per second on the v2 path before this release, and roughly 2.45M nonces per second after — call it 170×. That is not a magic optimisation; it is the gap between a hand-rolled first-cut implementation and one that does the pre-hash scan on the GPU, batches the digests, and only ships consensus-passing candidates back across the PCIe boundary.
Whyte Consolidated Research · 2026-06-09 · 9 min read
0.32.3 is not a consensus change. There is no new activation height; the rules at block 125,000 from the v2 MatMul activation stand unchanged. Validation behaviour is unchanged. Wallet behaviour is unchanged. What 0.32.3 ships is a production-grade GPU implementation of the v2 path, a matching Metal architecture, a handful of operator-side quality-of-life improvements, and packaging fixes that the static-release path was missing.
That said, it is described as a mandatory upgrade for anyone running an earlier 0.32.x build, and the reason matters. The 0.32.2 GPU path was correct under the new consensus rules (it would find valid blocks and validate other miners' blocks), but it was effectively unoptimised for v2. A miner still on 0.32.2 while its peers run 0.32.3 is leaving an enormous amount of throughput on the table, and the blocks that throughput would have found go to the rest of the network instead.
The release fits the rhythm of the previous two weeks. 0.31 activated C-002 shielded proof and FIPS-205 enforcement at block 123,000. 0.32 activated v2 MatMul at block 125,000. 0.32.3 lands two days later, no activation required, doing exactly the kind of tightening a production network needs in the days after a consensus change.
The reason 0.32.2's CUDA path was slow on v2 is not exotic. Under the v2 rules every nonce produces its own seed and its own matrices, so a naïve implementation generates one set on the host, ships it across the PCIe boundary, runs the multiplication on the GPU, ships the digest back, and asks the host to check it. Each stage waits on the one before it, so throughput is bounded by the full per-nonce round trip, and on a fast modern GPU it is the transfers and host-side work in that round trip, not the multiplication, that dominate the cost.
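The serial bound can be made concrete with a small cost model. This is a C++ sketch with made-up stage timings: the struct `RoundTrip` and every number fed into it are hypothetical, not measurements from either release. It only shows why adding up the stages of a per-nonce round trip caps throughput no matter how fast the multiplication is.

```cpp
#include <cassert>

// Hypothetical per-nonce stage costs in microseconds. Illustrative values
// only; these are not measurements of BTX 0.32.2 or 0.32.3.
struct RoundTrip {
    double host_prepare_us;  // derive seed and matrices on the host
    double upload_us;        // host -> device copy over PCIe
    double kernel_us;        // MatMul and digest on the GPU
    double download_us;      // device -> host copy of the digest
    double host_check_us;    // consensus check on the host
};

// Total wall-clock time one nonce spends in the serial round trip.
double round_trip_us(const RoundTrip& rt) {
    const double total = rt.host_prepare_us + rt.upload_us + rt.kernel_us +
                         rt.download_us + rt.host_check_us;
    assert(total > 0.0);
    return total;
}

// Unbatched: each nonce waits for the previous one, so the stages add up
// and throughput is the inverse of the whole round trip.
double unbatched_nonces_per_second(const RoundTrip& rt) {
    return 1e6 / round_trip_us(rt);
}

// Fraction of each round trip during which the GPU is actually computing.
double kernel_share(const RoundTrip& rt) {
    return rt.kernel_us / round_trip_us(rt);
}
```

With `RoundTrip{30, 20, 2, 20, 8}`, the round trip is 80 µs. The model gives 12,500 nonces per second, and the GPU computes for 2.5% of that time. Halving `kernel_us` only lifts the figure to about 12,658 nonces per second. That illustrates why collapsing the round trip matters more than speeding up the multiplication.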
0.32.3 collapses the round trip. The pre-hash scan that selects candidate nonces runs on the GPU directly, against a batch of seeds derived in place. The matrix multiplications and digests for the entire batch happen on the GPU in one wave. Only the candidates whose digests already pass the consensus check are shipped back to the host, which is a small fraction of the batch. The host stops being a serialised bottleneck and starts being what it should always have been on a GPU mining loop: a coordinator that hands off work and counts blocks.
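The batched flow above can be sketched as a C++ stand-in for what a GPU kernel would do per thread. The assumptions are stated up front. `mix64` (the SplitMix64 finaliser) and `toy_digest` replace BTX's real seed derivation and MatMul digest. The digest is shrunk to 64 bits. `scan_batch` is a hypothetical name. What the sketch does reproduce is the shape of the flow: seeds derived in place, every nonce in the batch processed in one pass, and only passing candidates returned.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// SplitMix64 finaliser, used as a toy mixer. It stands in for BTX's real
// seed derivation and MatMul digest, which this sketch does not implement.
std::uint64_t mix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hypothetical per-nonce work: derive a seed in place, then a toy digest.
std::uint64_t toy_digest(std::uint64_t header_hash, std::uint64_t nonce) {
    const std::uint64_t seed = mix64(header_hash ^ nonce);  // per-nonce seed
    return mix64(seed);  // stand-in for matrices + MatMul + digest
}

// One "wave": every nonce in [first, first + batch) is processed (on a GPU,
// one thread per nonce), and only nonces whose digest is at or below
// `target` are kept. The returned vector is all that would cross back to
// the host.
std::vector<std::uint64_t> scan_batch(std::uint64_t header_hash, std::uint64_t first,
                                      std::uint32_t batch, std::uint64_t target) {
    assert(batch > 0);
    std::vector<std::uint64_t> candidates;
    for (std::uint32_t i = 0; i < batch; ++i) {
        const std::uint64_t nonce = first + i;
        if (toy_digest(header_hash, nonce) <= target) {
            candidates.push_back(nonce);
        }
    }
    return candidates;
}
```

On a GPU, the loop body runs once per thread, and `push_back` becomes an atomic increment into a preallocated output buffer. Because that buffer holds only passing nonces, the device-to-host copy stays small however large the batch is.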
The performance the release notes attach to that change, roughly 14.1k to 2.45M nonces per second on the v2 path, is a jump of a little over two orders of magnitude. That is the kind of gain a rewrite of this shape produces when the original was correct but unbatched. It is the difference between “does it work” and “does it run.” 0.32.2 answered the first. 0.32.3 answers the second.
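Here is a quick arithmetic check, using only the two figures from the release notes. It gives the ratio and the average time budget per nonce. That budget is the inverse of throughput, not the latency of any single nonce.

```latex
\frac{2.45 \times 10^{6}\ \text{nonces/s}}{1.41 \times 10^{4}\ \text{nonces/s}} \approx 173.8,
\qquad
\frac{1}{1.41 \times 10^{4}\ \text{s}^{-1}} \approx 70.9\ \mu\text{s},
\qquad
\frac{1}{2.45 \times 10^{6}\ \text{s}^{-1}} \approx 0.41\ \mu\text{s}
```

So the “170×” in the headline is a slight round-down.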
The Metal backend on Apple Silicon has been the small-operator story since launch: a single Mac mini contributing real work to a pool, profiled at length in “Mining pools, AI agents, and the Mac mini that earns.” 0.32.3 ports to Metal the same pre-hash scan and variable-base digest batching that CUDA gets. It also routes the miner loop through the GPU batch path, so the host side of the loop on macOS is structurally identical to the host side on Linux + CUDA.
Device detection improves at the same time. The Metal backend now reports the GPU core count of the host — useful for tuning the per-batch workload and for telling the difference between an M-series base chip and an M-series Max or Ultra at provision time. The release notes do not attach a specific throughput number to the Metal change the way they do for CUDA, but the architecture is the same architecture and the bottleneck pattern is the same bottleneck pattern. Operators should expect a meaningful jump on the Metal side too.
The SOLVE_BATCH_SIZE tuning knob carries forward from 0.32.2 and is the right surface to re-explore on Apple Silicon under 0.32.3. The default that worked best on the 0.32.2 unbatched path is unlikely to be the default that maximises throughput on the new batched architecture — a fact the v2 MatMul piece flagged at the time of activation and that 0.32.3 makes worth acting on.
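A retune can be as simple as benchmarking a handful of candidate values and keeping the fastest. The C++ sketch below assumes a caller-supplied `measure` function. That function is hypothetical: in practice it would be a fixed-length mining run at that `SOLVE_BATCH_SIZE` that reports observed nonces per second. `sweep_batch_sizes` and `SweepResult` are illustrative names, not part of BTX.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct SweepResult {
    std::uint32_t batch_size;
    double nonces_per_second;
};

// Benchmark each candidate batch size and keep the fastest. `measure` is a
// hypothetical caller-supplied benchmark, e.g. a fixed-length mining run at
// that SOLVE_BATCH_SIZE returning observed nonces per second.
SweepResult sweep_batch_sizes(const std::vector<std::uint32_t>& candidates,
                              const std::function<double(std::uint32_t)>& measure) {
    assert(!candidates.empty());
    SweepResult best{candidates.front(), measure(candidates.front())};
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const double rate = measure(candidates[i]);
        if (rate > best.nonces_per_second) best = {candidates[i], rate};
    }
    return best;
}
```

Doubling sizes (for example 32 through 1024) brackets the optimum quickly, and a finer sweep around the winner can follow. Each measurement window should be long enough for short-term noise to average out.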
The new CUDA path is configurable through two environment variables. The first, BTX_MATMUL_NONCE_SEED_BATCH_SIZE, controls how many candidate nonces the GPU batches together before checking digests. Larger batches amortise launch overhead and memory transfers; very large batches eventually saturate the GPU's available memory or trip warp scheduling limits. The release ships with batch sizing tuned for production hosts; the env override is the dial for operators whose hardware does not match the production-host profile.
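Here is a minimal C++ sketch of how an override like `BTX_MATMUL_NONCE_SEED_BATCH_SIZE` is typically honoured. The variable name comes from the release notes. The default (4096), the upper clamp, and the reject-and-fall-back policy are assumptions for illustration, not BTX's actual behaviour.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parse an operator override for the nonce-seed batch size. Valid values are
// plain positive decimal integers; anything else falls back to `fallback`,
// and values above `max_batch` are clamped. Policy and limits are assumptions.
std::uint32_t resolve_batch_size(const char* env_value, std::uint32_t fallback,
                                 std::uint32_t max_batch) {
    assert(fallback > 0 && fallback <= max_batch);
    // Unset, empty, or not starting with a digit (rejects "-5", " 5", "abc").
    if (env_value == nullptr || *env_value < '0' || *env_value > '9') return fallback;
    errno = 0;
    char* end = nullptr;
    const unsigned long long parsed = std::strtoull(env_value, &end, 10);
    // Reject overflow, trailing junk ("12abc"), and zero.
    if (errno != 0 || *end != '\0' || parsed == 0) return fallback;
    if (parsed > max_batch) return max_batch;
    return static_cast<std::uint32_t>(parsed);
}

// Read the variable named in the release notes. The default and clamp below
// are hypothetical placeholders, not BTX's shipped values.
std::uint32_t batch_size_from_environment() {
    constexpr std::uint32_t kDefaultBatch = 4096;
    constexpr std::uint32_t kMaxBatch = 1u << 20;
    return resolve_batch_size(std::getenv("BTX_MATMUL_NONCE_SEED_BATCH_SIZE"),
                              kDefaultBatch, kMaxBatch);
}
```

Under these assumptions, unset, empty, zero, negative, or non-numeric values fall back to the default. Oversized values are clamped rather than rejected. Either way, a typo cannot silently push the batch to zero.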
The second, BTX_MATMUL_CUDA_NONCE_SEED_MEMORY_PERCENT, caps the share of GPU memory the nonce-seed path is allowed to consume. The default keeps the path neighbourly on a GPU also being used for inference, training, or any other workload. An operator running a dedicated mining GPU can raise the cap and trade safety margin for additional batch capacity.
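The memory cap and the batch size interact: the effective batch is bounded by whatever fits inside the allowed share. The C++ sketch below makes that explicit. `bytes_per_nonce` is hypothetical, because the release notes do not publish the per-nonce footprint. They also do not say whether the real `BTX_MATMUL_CUDA_NONCE_SEED_MEMORY_PERCENT` applies to free or total device memory, so `gpu_bytes` is left generic. In a CUDA build, either figure could come from `cudaMemGetInfo`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Largest batch that fits in `percent` of `gpu_bytes`. `gpu_bytes` may be free
// or total device memory (the release notes do not say which the real cap
// uses); `bytes_per_nonce` is a hypothetical per-nonce footprint.
std::uint64_t max_batch_for_memory(std::uint64_t gpu_bytes, std::uint32_t percent,
                                   std::uint64_t bytes_per_nonce) {
    assert(bytes_per_nonce > 0);
    const std::uint64_t share = std::min<std::uint32_t>(percent, 100);  // >100% is meaningless
    const std::uint64_t budget = gpu_bytes / 100 * share;  // divide first so the product cannot overflow
    return budget / bytes_per_nonce;
}

// The effective batch is the smaller of what was requested and what fits.
std::uint64_t effective_batch(std::uint64_t requested, std::uint64_t gpu_bytes,
                              std::uint32_t percent, std::uint64_t bytes_per_nonce) {
    return std::min(requested, max_batch_for_memory(gpu_bytes, percent, bytes_per_nonce));
}
```

Take 10 GB of device memory, a 25% share, and a hypothetical 1 MB per nonce. The cap then allows 2,500 nonces per batch. A requested batch of 4,096 is trimmed to 2,500, while a request for 1,024 passes through unchanged.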
On Metal, SOLVE_BATCH_SIZE continues to be the primary knob, and it is now meaningfully more interesting than it was under the 0.32.2 implementation. The release notes suggest giving both backends a brief retune pass after upgrading. They point to the chain-guard RPC introduced in 0.31 as the safety check during that work: at every point, the node reports whether to continue, pause, or catch_up.
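Gating a retune pass on the chain-guard answer can be sketched in C++ as below. Only the three values (`continue`, `pause`, `catch_up`) come from the article. The enum, the function names, and the fail-safe treatment of unrecognised replies are assumptions. Fetching the value over RPC is deliberately left out, because this piece does not name the RPC method or its response fields.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// The three recommendations the chain-guard RPC reports, per the article.
// How the string is fetched (RPC name, JSON field) is out of scope; the
// caller is assumed to have extracted it already.
enum class GuardAdvice { Continue, Pause, CatchUp };

std::optional<GuardAdvice> parse_guard_advice(std::string_view s) {
    if (s == "continue") return GuardAdvice::Continue;
    if (s == "pause") return GuardAdvice::Pause;
    if (s == "catch_up") return GuardAdvice::CatchUp;
    return std::nullopt;  // unrecognised value: callers should fail safe
}

// A retune step (one benchmark run at a new batch size) only proceeds when
// the node explicitly says "continue"; anything else stops the sweep.
bool may_run_retune_step(std::string_view advice) {
    const std::optional<GuardAdvice> parsed = parse_guard_advice(advice);
    return parsed.has_value() && *parsed == GuardAdvice::Continue;
}
```

Treating anything other than an explicit `continue` as a stop keeps benchmark numbers from being taken while the node is asking the miner to pause or catch up.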
Post-activation nonce-seed path with batched pre-hash scans on GPU. Returns only consensus-passing candidates to the host. Operator-tunable batch size and memory share via env vars. Older-GPU opt-in for compute-capability-gated hardware.
Matching architecture — pre-hash scan, variable-base digest batching, miner-loop routing through the GPU batch path. Device detection now reports GPU core count, useful for tuning. SOLVE_BATCH_SIZE from 0.32.2 carries forward.
Continues to use the batched solver introduced in 0.32.1 — each thread takes a slice of candidate nonces, each with its own seed. No change in correctness; benefits from the surrounding diagnostics work.
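The per-thread slicing described above can be sketched with standard C++ threads. `partition_nonces` and `count_hits` are illustrative names. The `work` callable stands in for the per-nonce seed derivation and solve, which this sketch does not implement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct NonceSlice {
    std::uint64_t first;  // first nonce in the slice
    std::uint64_t count;  // number of nonces in the slice
};

// Split [start, start + total) into `threads` contiguous slices whose sizes
// differ by at most one, so no nonce is skipped or tried twice.
std::vector<NonceSlice> partition_nonces(std::uint64_t start, std::uint64_t total,
                                         unsigned threads) {
    assert(threads > 0);
    std::vector<NonceSlice> slices;
    const std::uint64_t base = total / threads;
    const std::uint64_t extra = total % threads;
    std::uint64_t next = start;
    for (unsigned t = 0; t < threads; ++t) {
        const std::uint64_t count = base + (t < extra ? 1 : 0);
        slices.push_back({next, count});
        next += count;
    }
    return slices;
}

// Run `work(nonce)` over every nonce, one std::thread per slice, and count
// how many nonces it accepted. `work` must be safe to call concurrently.
template <typename Work>
std::uint64_t count_hits(const std::vector<NonceSlice>& slices, Work work) {
    std::vector<std::uint64_t> hits(slices.size(), 0);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < slices.size(); ++t) {
        pool.emplace_back([&, t] {
            const NonceSlice& s = slices[t];
            for (std::uint64_t n = s.first; n < s.first + s.count; ++n) {
                if (work(n)) ++hits[t];  // each thread writes only its own slot
            }
        });
    }
    for (std::thread& th : pool) th.join();
    std::uint64_t total = 0;
    for (std::uint64_t h : hits) total += h;
    return total;
}
```

Each thread writes only its own counter slot, so no locking is needed; the totals are summed after `join`.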
None of these are headline material on their own. Each one removes friction the activation week made visible.
Pool operators get back the early-exit targeting hook for MatMul solvers. Affects digest early-exit only — not what the network accepts.
Failure reasons now visible in logs with concrete error categories. Troubleshooting guide added at doc/btx-cuda-mining-troubleshooting.md.
BTX_CUDA_ALLOW_OLDER_GPUS=1 enables compute-capability-gated GPUs. Supported but slower; the optimised path is tuned for current generations.
New RPC for shielded-state repair observability — nullifier accumulator state, persisted-state freshness, restart-time behaviour. Maintenance tool, not a wallet primitive.
getblock verbosity-2 fee reporting fix. PSBT handling fix. Small but load-bearing for downstream tooling that depends on them.
Static release binaries now bundle libzmq.a. Removes the most common packaging gap for operators wiring ZMQ-driven monitoring against a stock binary.
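One plausible reading of the pool-side early-exit hook is that it gives the solver an easier share target for deciding which digests to report early. Block validity would still depend only on the network target. The C++ sketch below models that reading under two simplifications. First, digests are 64-bit instead of full-width hashes. Second, the share target is assumed never to be harder than the network target. `Targets`, `Outcome`, and `classify` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Two thresholds on the same digest, modelled as a 64-bit number where
// smaller is better (a simplification of a full-width hash comparison).
// `share_target` is the pool's easier target supplied through the hook;
// `network_target` is what consensus actually enforces.
struct Targets {
    std::uint64_t share_target;
    std::uint64_t network_target;
};

enum class Outcome { Discard, PoolShare, Block };

// The share target decides which digests the solver reports early; whether
// a reported digest is a block is still decided by the network target alone.
Outcome classify(std::uint64_t digest, const Targets& t) {
    assert(t.network_target <= t.share_target);  // sketch assumption: pool target never harder
    if (digest <= t.network_target) return Outcome::Block;
    if (digest <= t.share_target) return Outcome::PoolShare;
    return Outcome::Discard;
}
```

Moving `share_target` changes how many nonces come back as pool shares. It cannot turn a digest above `network_target` into a block, which is the distinction the release notes draw.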
Look at the trailing seven days. 0.31 activated at block 123,000 with a hardened shielded proof format, FIPS-205 SLH-DSA enforcement, and the mining chain-guard RPC. 0.32 activated at block 125,000 with v2 MatMul seeds that made every nonce attempt require fresh matrices. 0.32.3 lands two days later, no activation, doing exactly the operational tightening the v2 path needed at production speed. None of the three required ordinary wallet users to do anything. All three landed on slack the launch wire had reserved on purpose.
The thing to take away from a cadence like that is not the individual changes. It is the demonstration that the upgrade path works. Consider a network that can ship a consensus change, an activation, and a production-speed catch-up release inside a week. It did so without breaking anyone and without forcing wallets, exchanges, or custodians to scramble. That network has the right plumbing for the larger changes still to come: wBTX as a wrapped-asset spec, the bridge layer, hybrid post-quantum transport, and signed auto-update. The slack-and-activate pattern keeps working the way it was designed to.
0.32.3 is a tuning release. It is also, structurally, evidence that the activation rhythm is real.
0.32.3 does not change the rules. It catches the GPU backends up to where the rules need them to be — CUDA jumps from a working-but-unbatched 14.1k nonces per second on the v2 path to a production-grade 2.45M, Metal lands the matching pre-hash + batched digest architecture, the pool-side targeting hook returns, the diagnostics get honest, the older-GPU opt-in arrives, and the static binaries finally include ZMQ.
For operators on an earlier 0.32.x build, the upgrade is mandatory in the sense that anyone who skips it is voluntarily mining at a meaningful disadvantage to the rest of the network. For wallet users the release is invisible. For the network it is the third meaningful piece of work to land in a single week, and the second of the three that did not require ordinary users to know it happened. That is the cadence a production settlement chain wants.
This article is a plain-language summary of BTX 0.32.3, the post-activation tuning release that follows the block-125,000 v2 MatMul activation.
For informational purposes only. Not financial, investment, or legal advice. Technical claims reflect the BTX v0.32.3 release notes and may be subject to further iteration. Systems, protocols, and tokens referenced are described for context and are not endorsements. Mining outcomes depend on hardware, difficulty, and market conditions, and are not guaranteed.