collatz-m4m6 is an analysis tool based on the m4/m6 pair predicate decomposition of the Collatz-type map T(n) = (xn+1)/2^d.
In v0.2.0, we introduced a Vec<u64> packed-bit representation and
a Kogge-Stone parallel prefix scan, achieving a significant structural improvement
over the sequential scan of the initial v0.1.
However, in measured range verification (sweep) speed, u128 direct arithmetic achieves approximately 440 million nums/s, while processing via pair scan is roughly two orders of magnitude slower. This report analyzes the structural causes of this speed gap and discusses the computational significance and future prospects of pair predicate decomposition.
Benchmark environment: Intel Core i7-12650H (Alder Lake, 10C/16T), 64GB DDR5. Measurements cover four conditions: Phase 1 (u128 direct arithmetic) ON/OFF crossed with GPK statistics collection ON/OFF.
| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase1 |
|---|---|---|---|---|
| u128 only | 0.113s | ~442,000,000 | OFF | ON |
| u128 + GPK statistics | 2.243s | ~22,300,000 | ON | ON |
| Packed only | 9.516s | ~5,250,000 | OFF | OFF |
| Packed + GPK statistics | 12.546s | ~3,990,000 | ON | OFF |
The speed gap between u128 only (0.113s) and Packed only (9.516s) is approximately 84x. For 3n+1, nearly all numbers complete within u128 Phase 1, so this gap directly manifests. The difference between GPK ON/OFF for u128 is approximately 20x.
The 5n+1 map produces apparently divergent trajectories, so some numbers never reach a stopping time. The same four conditions were measured.
| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase1 |
|---|---|---|---|---|
| u128 only | 1.667s | ~30,000 | OFF | ON |
| u128 + GPK statistics | 3.555s | ~14,100 | ON | ON |
| Packed only | 2.118s | ~23,600 | OFF | OFF |
| Packed + GPK statistics | 3.995s | ~12,500 | ON | OFF |
For 5n+1, divergence causes bit length to grow, overflowing u128, making the benefit of Phase 1 limited. The gap between u128 only (1.667s) and Packed only (2.118s) narrows to 1.27x.
The following shows a comparison of instruction counts per step, normalized to 128-bit width.
| Operation | Instructions / 128bit | Memory Access | Branching |
|---|---|---|---|
| u128 (MUL+ADD+TZCNT+SHR) | 4 | None (register-only) | None |
| BigUint (multi-precision multiply) | ~6 | Vec<u64> heap | Loop control |
| Packed scan (Kogge-Stone) | ~25 | Vec<u64> ×2 heap | Word loop |
u128 arithmetic is executed directly by the CPU's hardware multiplier circuit. Intel Alder Lake's integer multiplication pipeline completes 64bit×64bit in 3–4 clock cycles, storing the 128-bit result in a register pair. This operation involves no memory access whatsoever and no branches that stall the pipeline.
This is a hardware optimization born from semiconductor design investments on the scale of tens of billions of dollars, and it is fundamentally impossible for software-based bit operations to compete on the same level.
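The four-operation u128 step (MUL + ADD + TZCNT + SHR) can be sketched directly; this is a minimal illustration, not the crate's actual Phase 1 loop, and the caller is assumed to guarantee that x*n + 1 does not overflow:

```rust
// One step of T(n) = (x*n + 1) / 2^d in pure register arithmetic:
// MUL, ADD, TZCNT, SHR -- no memory access, no data-dependent branch.
fn step_u128(n: u128, x: u128) -> u128 {
    let m = x * n + 1;          // MUL + ADD (caller ensures no overflow)
    let d = m.trailing_zeros(); // TZCNT
    m >> d                      // SHR: divide out all factors of 2
}
```

For odd n, x*n + 1 is always even, so d >= 1 and the value shrinks by at least one bit after the multiply.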
The Collatz-type map is T(n) = (xn+1)/2d, where x is a small constant such as 3, 5, or 9. This corresponds to the multiplication pattern of "large integer × small constant," which is the simplest case in multi-precision multiplication. BigUint × constant requires only 2 instructions (MUL + ADC) per limb and completes in O(n). Pair scan is also O(n), so the asymptotic complexity is identical, but BigUint always wins by a constant factor.
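The "large integer × small constant" pattern can be sketched as a limb loop (a hypothetical helper illustrating the pattern, not BigUint's internals); each limb costs exactly one widening multiply plus a carry addition, the MUL + ADC pair:

```rust
// Multiply a little-endian Vec<u64> big integer by a small constant in place.
// Per limb: one 64x64->128 widening MUL, then the previous carry is added in.
fn mul_small(limbs: &mut Vec<u64>, x: u64) {
    let mut carry: u64 = 0;
    for limb in limbs.iter_mut() {
        let prod = (*limb as u128) * (x as u128) + (carry as u128);
        *limb = prod as u64;         // low 64 bits stay in this limb
        carry = (prod >> 64) as u64; // high 64 bits ripple to the next limb
    }
    if carry != 0 {
        limbs.push(carry); // the number grew by one limb
    }
}
```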
In 3n+1 range verification, the vast majority of inputs in the u64 range reach their stopping time by dropping below the initial value in an average of 3–5 steps. During this process, values almost never exceed u128 (128 bits), so very few numbers transition to pair scan (Phase 2). In other words, making sweep faster means making the u128 loop faster, and improvements to packed scan contribute almost nothing to sweep speed.
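The Phase 1 loop described above can be sketched as follows (hypothetical names, not the crate's API; the overflow guard marks the point where the real implementation would hand the number off to pair scan as Phase 2):

```rust
// Count steps of T(n) = (x*n + 1) / 2^d until n drops below its start value
// (the stopping time). Returns None where u128 would overflow, i.e. where
// the real sweep would transition to Phase 2.
fn stopping_time_u128(n0: u64, x: u128) -> Option<u32> {
    let mut n = n0 as u128;
    let mut steps = 0u32;
    while n >= n0 as u128 && n > 1 {
        // Guard: would x*n + 1 overflow u128?
        if n > (u128::MAX - 1) / x {
            return None; // hand off to Phase 2 (pair scan)
        }
        n = x * n + 1;
        n >>= n.trailing_zeros(); // divide out all factors of 2
        steps += 1;
    }
    Some(steps)
}
```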
From v0.1 to v0.2, the pair scan itself has been significantly improved. However, this improvement is not in sweep speed but in the efficiency of structural analysis.
| Version | Internal Representation | Carry Resolution | Improvement |
|---|---|---|---|
| v0.1 initial | Vec<u8> per bit | Sequential, 1 bit at a time | Baseline |
| v0.1 fast path | u128 + BigUint | CPU multiplier | Sweep made practical |
| v0.2 current | Vec<u64> (64 pairs/word) | Kogge-Stone (6 stages) | 8x memory improvement |
Through packed representation, memory efficiency improved 8x and intra-word carry resolution improved approximately 10x. However, the hot path of sweep is u128 Phase 1, and these improvements only take effect when Phase 2 is reached.
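Within one 64-bit word, the six-stage Kogge-Stone carry resolution can be sketched as a plain software adder (an illustration of the technique over generate/propagate vectors, not the crate's packed-pair layout):

```rust
// Add two u64 values (wrapping) by resolving all carries with a
// Kogge-Stone parallel prefix: 6 doubling stages cover 64 positions.
fn kogge_stone_add(a: u64, b: u64) -> u64 {
    let mut g = a & b; // positions that generate a carry
    let mut p = a ^ b; // positions that propagate a carry
    let mut shift = 1;
    while shift < 64 {
        g |= p & (g << shift); // carry arrives if generated within reach
        p &= p << shift;       // propagate chains merge pairwise
        shift <<= 1;           // 1, 2, 4, 8, 16, 32 -- six stages
    }
    // carry into position i is the group-generate of position i-1
    (a ^ b) ^ (g << 1)
}
```

Each stage doubles the distance a carry can travel, which is exactly why log2(64) = 6 stages suffice, matching the table above.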
A GPK ON/OFF toggle mechanism was introduced so that when statistics are not needed,
the design completely skips Vec<u64> allocation, store, popcount, and max_carry_chain loops.
As a result, 3n+1 sweep with GPK OFF runs on u128 arithmetic alone, achieving 442M nums/s.
On the other hand, with GPK ON, the bit-by-bit loop in accumulate_gpk_u128 becomes the bottleneck,
reducing throughput to 22M nums/s.
This approximately 20x difference shows that the cost of GPK statistics collection itself overwhelmingly dominates the arithmetic core.
Because pair predicate decomposition does not use multiplication, it holds a structural advantage in asymptotic complexity. When bit width becomes sufficiently large, a crossover theoretically exists where the linear cost of pair scan becomes advantageous over the superlinear cost of direct multiplication.
| Bit Width | Direct Multiplication | Pair Scan | Notes |
|---|---|---|---|
| ~128 | Native MUL O(1) | Bit operations ~30 instructions | u128 dominant |
| ~256 | Multi-precision MUL 4 times + addition | Kogge-Stone 1 word | BigUint advantageous |
| ~1,000 | Karatsuba O(n^1.58) | O(n) | Asymptotically pair advantageous |
| ~10,000+ | FFT O(n log n) | O(n), small constant | Pair scan dominant |
Note: The Collatz-type map involves "large number × small constant," and BigUint × constant also completes in O(n). Therefore, unlike general multi-precision multiplication crossover tables, when restricted to constant × n multiplication, the asymptotic complexity is identical (both O(n)). In measured benchmarks, BigUint is approximately 100x faster even at 32,768 bits, and the speed crossover in the context of the Collatz map is practically unreachable.
Pair predicate decomposition transparently decomposes each step of the Collatz map not as multiplication but as carry propagation in an adder. Its byproduct, the GPK classification (Generate/Propagate/Kill), quantifies carry generation, propagation, and annihilation at each pair position, making the growth/shrinkage mechanism of numbers statistically observable.
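The GPK classification uses the standard adder predicates: for an addition a + b, a bit position is Generate if both operand bits are 1, Propagate if exactly one is, and Kill if neither is. A tally over a masked width can be sketched as (a hypothetical helper for illustration, not the tool's accumulator):

```rust
// Classify every bit position of the addition a + b as
// Generate (a&b), Propagate (a^b), or Kill (!(a|b)),
// restricted to `width` low bits, and tally with popcount.
fn gpk_counts(a: u128, b: u128, width: u32) -> (u32, u32, u32) {
    let mask = if width == 128 {
        u128::MAX
    } else {
        (1u128 << width) - 1
    };
    let g = (a & b) & mask;
    let p = (a ^ b) & mask;
    let k = !(a | b) & mask;
    (g.count_ones(), p.count_ones(), k.count_ones())
}
```

Since G, P, and K partition the bit positions, the three counts always sum to `width`, which is why the G/P/K percentages below sum to 100%.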
The following shows GPK statistics obtained from verifying all 4.99 billion odd numbers from 3 to 9,999,999,999 (≈ 2^33) for the 3n+1 map.
- G (carry generate) = 113,183,623,064 (38.16%)
- P (carry propagate) = 81,071,432,343 (27.34%)
- K (carry kill) = 102,328,662,110 (34.50%)
- Total pairs = 296,583,717,517
- Total steps = 17,463,252,353
- All numbers converged (maximum stopping time 282, n = 2,788,008,987)
Carry chain length distribution:
| Chain Length | Occurrences | Chain Length | Occurrences |
|---|---|---|---|
| 1 | 265,878,320 | 14 | 38,274,610 |
| 2 | 1,842,752,124 | 15 | 22,412,468 |
| 3 | 3,494,167,747 | 16 | 12,790,339 |
| 4 | 3,634,880,654 | 17 | 12,412,547 |
| 5 | 2,871,874,977 | 18 | 2,519,910 |
| 6 | 1,991,272,449 | 19 | 415,078 |
| 7 | 1,293,719,531 | 20 | 69,568 |
| 8 | 814,361,343 | 21 | 11,275 |
| 9 | 501,167,658 | 22 | 1,579 |
| 10 | 304,639,101 | 23 | 337 |
| 11 | 183,966,569 | 24 | 32 |
| 12 | 110,232,559 | 25 | 6 |
| 13 | 65,431,568 | 26 | 4 |
Despite G% > K%, all numbers converge: each G contributes at most 1 bit of growth, while the division by 2^d removes, on average, more bits than G adds. The carry chain length distribution peaks at length 4, with a maximum of 26. This indicates that carry propagation remains local and that long-range propagation is exponentially rare.
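Under the natural definition — a chain is a carry generated at one position and propagated through consecutive propagate positions — the maximum chain length of a single addition can be sketched as follows (a hypothetical helper; the assumption that this matches the definition behind the max_carry_chain statistic is mine):

```rust
// Longest carry chain in one addition, given the generate (g) and
// propagate (p) bit vectors. Each iteration advances every live chain
// one position; a chain survives only through propagate positions.
fn max_carry_chain(g: u64, p: u64) -> u32 {
    let mut best = 0;
    let mut carry = g; // positions where a carry currently exists
    let mut len = 0;
    while carry != 0 {
        len += 1;
        best = best.max(len);
        carry = (carry << 1) & p; // step into the next propagate position
    }
    best
}
```

The loop runs once per unit of chain length, so the geometric tail in the table above directly bounds how many iterations this scan typically needs.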
These findings — the locality of carry propagation, the stability of the G/K ratio, and the geometric distribution of chain lengths — are structural information that cannot be directly obtained from a multiplication-based approach, and represent the intrinsic value of pair predicate decomposition.
The Kogge-Stone algorithm was originally a design technique for hardware adders, and if pair predicate decomposition is implemented on FPGA/ASIC, one step of the Collatz map can be completed in O(log n) adder depth without going through a multiplier. The advantage of not being constrained to a fixed width also provides natural scalability in circuit design.
However, the claim of this paper is not "achieving the fastest Collatz computation." Hardware implementation is presented as one example of the theoretical consistency by which this method maps directly to adder structures, and the computational significance of pair predicate decomposition should not be reduced to performance comparisons.
The clear practical advantage of pair predicate decomposition lies in the cost of obtaining GPK statistics. From the measured benchmarks (Section 2), we compare the speed difference with and without GPK statistics collection.
| Path | GPK OFF | GPK ON | GPK Overhead |
|---|---|---|---|
| u128 arithmetic | 442M nums/s | 22M nums/s | ~20x slowdown |
| Packed scan | 5.25M nums/s | 3.99M nums/s | ~1.3x slowdown |
With u128 direct arithmetic, obtaining GPK statistics requires performing bit decomposition through a separate path from the multiplication, causing an approximately 20x speed reduction. In contrast, with pair scan, GPK is naturally obtained as a byproduct of the scan process, so the additional cost remains at only 1.3x.
The effective comparison in scenarios where GPK statistics are needed (i.e., the classification verification of this paper) is:
- u128 + GPK: 22M nums/s
- Packed + GPK: 4M nums/s
- Speed gap: ~5.5x (significantly reduced from the 84x gap when GPK is not needed)
Pair scan is not "slow but reveals structure"; rather, it is a method that "incurs almost no computational speed loss when observing structure."
What this paper demonstrates is the structural properties of the map, not software benchmark rankings. Pair predicate decomposition transparently decomposes the Collatz map as carry propagation in an adder, providing GPK statistics as a quantitative analysis tool that comes as a byproduct of computation. This is the raison d'être of this tool and its value as a verification platform for the classification paper.
We organize future prospects in two directions: improving pair scan speed and leveraging the structural value of pair predicate decomposition.
| Approach | Overview | Expected Effect | Application Domain |
|---|---|---|---|
| SIMD (AVX2/512) | Simultaneous processing of multiple words at 256/512-bit width | Theoretical 4-8x | Packed scan core |
| Inter-word parallelization | Apply Kogge-Stone across words | Effective for large numbers | 1000 bits or more |
| Sieve | Skip residue classes proven to converge by GPK classification | Mathematical reduction | Sweep overall |
| SIMD multi-number parallel | Simultaneously scan 4 u64 numbers with AVX2 | Theoretical 4x | Sweep (u64 range) |
| FPGA/ASIC implementation | Implement pair predicate decomposition as adder circuit | 1 clock/step | Demonstration of theoretical consistency |
Through residue class analysis based on GPK classification, if a theorem can be derived stating "this residue class is always K-dominant, therefore convergence is guaranteed," then numbers that can be mathematically skipped can be excluded in advance. This is the only pathway where pair predicate decomposition directly contributes to speed, and the direction that most strongly justifies the tool's existence.
AVX2 (256 bits) can process 4 words simultaneously, and AVX-512 (512 bits) can process 8 words simultaneously. SIMD-izing the intra-word operations of Kogge-Stone could theoretically yield a 4–8x improvement. However, as long as the hot path of sweep is u128, packed scan acceleration is limited to tracing large numbers.
In the current implementation, inter-word carry is propagated sequentially
(carry_out = carry_after >> 63 → to the next word).
Applying Kogge-Stone across words would improve carry resolution for k words to O(log k).
This is effective for large numbers of 1000 bits or more, but does not affect the u64 input range of sweep.
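The sequential inter-word ripple described above can be sketched like this (an illustration with a plain little-endian u64 limb array, not the crate's packed-pair words; the final carry-out is dropped for brevity):

```rust
// Add two equal-length limb arrays. Each word is resolved internally by
// the CPU adder; only the single top carry ripples into the next word,
// making the word loop inherently sequential (O(k) for k words).
fn add_words(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = Vec::with_capacity(a.len());
    let mut carry = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        let (s1, c1) = x.overflowing_add(y);
        let (s2, c2) = s1.overflowing_add(carry);
        out.push(s2);
        carry = (c1 | c2) as u64; // at most one of the two adds can carry
    }
    out
}
```

Applying Kogge-Stone at the word level would replace this O(k) ripple with an O(log k) prefix over per-word generate/propagate flags, as noted above.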
Implementing pair predicate decomposition as an adder circuit on FPGA and demonstrating 1-clock/step operation without a hardware multiplier would serve as verification of this method's circuit-level consistency. By placing Kogge-Stone prefix trees in parallel and fixing the wiring for reference patterns, a circuit can be constructed that completes an arbitrary-width Collatz step in the number of adder stages (O(log n)).
The measured speed of collatz-m4m6 v0.2.0 is roughly two orders of magnitude slower than u128 direct arithmetic. However, this gap is the result of an asymmetric comparison between "CPU hardware multiplier vs. software bit operations," and it does not negate the algorithmic value of pair predicate decomposition.
What pair predicate decomposition provides is the transparent decomposition of the Collatz map as carry propagation in an adder, offering GPK statistics as a quantitative analysis tool. In scenarios where GPK statistics are needed, the overhead of pair scan remains at 1.3x, holding a practical advantage over the 20x overhead of u128 arithmetic.
The predicate framework and speed are independent matters. The value of this tool lies in visualizing the structure of the Collatz map and providing a verification platform for the classification paper. Speed is a means to that end, not the end itself.