collatz-m4m6 Performance Analysis Report

Structural analysis of the speed gap between pair predicate decomposition scan and u128 direct arithmetic
2026-02-08  |  v0.2.0  |  Rust (edition 2021)

1. Summary

collatz-m4m6 is an analysis tool based on the m4/m6 pair predicate decomposition of the Collatz-type map T(n) = (xn+1)/2^d. In v0.2.0, we introduced a Vec<u64> packed-bit representation and a Kogge-Stone parallel prefix scan, achieving a significant structural improvement over the sequential scan of the initial v0.1.

However, in measured range-verification (sweep) speed, u128 direct arithmetic achieves approximately 440 million nums/s, while processing via pair scan is nearly two orders of magnitude slower. This report analyzes the structural causes of this speed gap and discusses the computational significance and future prospects of pair predicate decomposition.

2. Measured Benchmarks

2.1 3n+1 Range Verification (50M odd numbers, n ≤ 99,999,999, max_steps=1000)

Intel Core i7-12650H (Alder Lake, 10C/16T), 64GB DDR5 environment. Measured under 4 conditions: Phase 1 (u128 direct arithmetic) ON/OFF and GPK statistics collection ON/OFF.

| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase 1 |
|---|---|---|---|---|
| u128 only | 0.113 s | ~442,000,000 | OFF | ON |
| u128 + GPK statistics | 2.243 s | ~22,300,000 | ON | ON |
| Packed only | 9.516 s | ~5,250,000 | OFF | OFF |
| Packed + GPK statistics | 12.546 s | ~3,990,000 | ON | OFF |

The speed gap between u128 only (0.113s) and Packed only (9.516s) is approximately 84x. For 3n+1, nearly all numbers complete within u128 Phase 1, so this gap directly manifests. The difference between GPK ON/OFF for u128 is approximately 20x.

2.2 5n+1 Range Verification (50K odd numbers, n ≤ 99,999, max_steps=1000)

5n+1 contains divergent sequences, so there exist numbers that do not reach a stopping time. Measured under the same 4 conditions.

| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase 1 |
|---|---|---|---|---|
| u128 only | 1.667 s | ~30,000 | OFF | ON |
| u128 + GPK statistics | 3.555 s | ~14,100 | ON | ON |
| Packed only | 2.118 s | ~23,600 | OFF | OFF |
| Packed + GPK statistics | 3.995 s | ~12,500 | ON | OFF |

For 5n+1, divergence causes bit length to grow, overflowing u128, making the benefit of Phase 1 limited. The gap between u128 only (1.667s) and Packed only (2.118s) narrows to 1.27x.

2.3 Comparison of Per-Step Instruction Counts

The following shows a comparison of instruction counts per step, normalized to 128-bit width.

| Operation | Instructions / 128 bits | Memory Access | Branching |
|---|---|---|---|
| u128 (MUL+ADD+TZCNT+SHR) | 4 | None (register-only) | None |
| BigUint (multi-precision multiply) | ~6 | Vec<u64> heap | Loop control |
| Packed scan (Kogge-Stone) | ~25 | Vec<u64> ×2 heap | Word loop |

3. Structural Reasons Why u128 Is Overwhelmingly Faster

3.1 The Hardware Multiplier Barrier

u128 arithmetic is executed directly by the CPU's hardware multiplier. Intel Alder Lake's integer multiplication pipeline completes a 64-bit × 64-bit multiply in 3–4 clock cycles, storing the 128-bit result in a register pair. This operation involves no memory access whatsoever and no branches that stall the pipeline.

This is a hardware optimization born from semiconductor design investments on the scale of tens of billions of dollars, and it is fundamentally impossible for software-based bit operations to compete on the same level.

3.2 The Special Nature of the Collatz Map: Large Number × Small Constant

The Collatz-type map is T(n) = (xn+1)/2^d, where x is a small constant such as 3, 5, or 9. This corresponds to the multiplication pattern of "large integer × small constant," which is the simplest case in multi-precision multiplication. BigUint × constant requires only 2 instructions (MUL + ADC) per limb and completes in O(n). Pair scan is also O(n), so the asymptotic complexity is identical, but BigUint always wins by a constant factor.
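The 2-instructions-per-limb pattern can be sketched as follows. This is a minimal illustration, not the crate's actual code; `mul_small` is a hypothetical helper name, and the limb order (little-endian Vec<u64>) is an assumption.

```rust
// Sketch: multiply a little-endian Vec<u64> magnitude by a small constant k.
// Each limb costs one 64x64 -> 128-bit MUL plus one carry add, so the whole
// product completes in O(n) with a tiny constant factor.
fn mul_small(limbs: &[u64], k: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(limbs.len() + 1);
    let mut carry: u128 = 0;
    for &limb in limbs {
        // Widen to u128 so the high half of the product becomes the next carry.
        let t = (limb as u128) * (k as u128) + carry;
        out.push(t as u64);
        carry = t >> 64;
    }
    if carry != 0 {
        out.push(carry as u64);
    }
    out
}

fn main() {
    // (2^64 - 1) * 3 = 2^65 + 2^64 - 3  ->  limbs [2^64 - 3, 2]
    assert_eq!(mul_small(&[u64::MAX], 3), vec![u64::MAX - 2, 2]);
}
```

The carry chain here is a pure register dependency, which is why a constant-multiplier BigUint path keeps its constant-factor lead over any software bit-level scan.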

3.3 The Reality of Sweep: Nearly All Numbers Complete Within u128

In 3n+1 range verification, the vast majority of inputs in the u64 range reach their stopping time by dropping below the initial value in an average of 3–5 steps. During this process, values almost never exceed u128 (128 bits), so very few numbers transition to pair scan (Phase 2). In other words, making sweep faster means making the u128 loop faster, and improvements to packed scan contribute almost nothing to sweep speed.
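The Phase 1 hot loop described above can be sketched in pure u128 arithmetic. Function and parameter names here (`stopping_time_u128`, `max_steps`) are illustrative rather than the tool's actual API; the overflow bail-out stands in for the handoff to Phase 2.

```rust
// Sketch: one sweep entry, entirely in u128 registers (MUL+ADD+TZCNT+SHR).
// Returns the stopping time (first step at which n drops below its start
// value), or None if 3n+1 would overflow u128 (escalate to Phase 2) or
// max_steps is exhausted.
fn stopping_time_u128(n0: u128, max_steps: u32) -> Option<u32> {
    let mut n = n0;
    for step in 1..=max_steps {
        let m = n.checked_mul(3)?.checked_add(1)?; // overflow => leave Phase 1
        let d = m.trailing_zeros();                // TZCNT
        n = m >> d;                                // SHR: divide by 2^d
        if n < n0 {
            return Some(step);                     // stopping time reached
        }
    }
    None
}

fn main() {
    // 7 -> 11 -> 17 -> 13 -> 5 < 7: stopping time 4.
    assert_eq!(stopping_time_u128(7, 1000), Some(4));
}
```

Because the loop touches only registers and its sole branch is the cheap `n < n0` comparison, this path dominates sweep throughput, consistent with the 442M nums/s figure above.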

4. Optimization History

From v0.1 to v0.2, the pair scan itself has been significantly improved. However, this improvement is not in sweep speed but in the efficiency of structural analysis.

| Version | Internal Representation | Carry Resolution | Improvement |
|---|---|---|---|
| v0.1 initial | Vec<u8> per bit | Sequential, 1 bit at a time | Baseline |
| v0.1 fast path | u128 + BigUint | CPU multiplier | Sweep made practical |
| v0.2 current | Vec<u64> (64 pairs/word) | Kogge-Stone (6 stages) | 8x memory improvement |

Through packed representation, memory efficiency improved 8x and intra-word carry resolution improved approximately 10x. However, the hot path of sweep is u128 Phase 1, and these improvements only take effect when Phase 2 is reached.
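The 6-stage intra-word carry resolution can be sketched with the classic software form of the Kogge-Stone prefix scan. This is a minimal standalone version (standard textbook formulation, not the crate's implementation): all 64 carries of one word are resolved in log2(64) = 6 shift-mask stages instead of a 64-step ripple.

```rust
// Kogge-Stone parallel prefix over per-bit generate (g) and propagate (p)
// words. After 6 doubling stages, bit i of g holds the carry OUT of bit i.
fn kogge_stone_carries(mut g: u64, mut p: u64) -> u64 {
    for sh in [1, 2, 4, 8, 16, 32] {
        g |= p & (g << sh); // a generated carry extends through propagate runs
        p &= p << sh;       // a propagate run survives only if both halves do
    }
    g
}

// Sanity check: reconstruct a full 64-bit addition from the resolved carries.
fn add_via_gp(a: u64, b: u64) -> u64 {
    let (g, p) = (a & b, a ^ b);
    let carries_in = kogge_stone_carries(g, p) << 1; // carry-in to each bit
    p ^ carries_in // sum = a XOR b XOR incoming carries (wrapping at 64 bits)
}

fn main() {
    assert_eq!(add_via_gp(0xFFFF, 1), 0x1_0000);
    assert_eq!(add_via_gp(12345, 67890), 80235);
}
```

Each stage is a handful of register operations, which is where the ~25 instructions per 128 bits in the Section 2.3 table come from.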

4.1 Optimization of GPK Statistics Collection

A GPK ON/OFF toggle was introduced so that when statistics are not needed, the design completely skips the Vec<u64> allocation, store, popcount, and max_carry_chain loops. As a result, a 3n+1 sweep with GPK OFF runs on u128 arithmetic alone, achieving 442M nums/s. With GPK ON, by contrast, the bit-by-bit loop in accumulate_gpk_u128 becomes the bottleneck, reducing throughput to 22M nums/s. This approximately 20x difference shows that the cost of GPK statistics collection itself far outweighs that of the arithmetic core.

5. Crossover Point Analysis

Because pair predicate decomposition does not use multiplication, it holds a structural advantage in asymptotic complexity. When bit width becomes sufficiently large, a crossover theoretically exists where the linear cost of pair scan becomes advantageous over the superlinear cost of direct multiplication.

| Bit Width | Direct Multiplication | Pair Scan | Notes |
|---|---|---|---|
| ~128 | Native MUL, O(1) | Bit operations, ~30 instructions | u128 dominant |
| ~256 | Multi-precision MUL ×4 + addition | Kogge-Stone, 1 word | BigUint advantageous |
| ~1,000 | Karatsuba, O(n^1.58) | O(n) | Asymptotically pair advantageous |
| ~10,000+ | FFT, O(n log n) | O(n), small constant | Pair scan dominant |

Note: The Collatz-type map involves "large number × small constant," and BigUint × constant also completes in O(n). Therefore, unlike general multi-precision multiplication crossover tables, when restricted to constant × n multiplication, the asymptotic complexity is identical (both O(n)). In measured benchmarks, BigUint is approximately 100x faster even at 32,768 bits, and the speed crossover in the context of the Collatz map is practically unreachable.

6. The Essential Value of Pair Predicate Decomposition

6.1 Structure Visualization: GPK Statistics from 2^33-Scale Verification

Pair predicate decomposition transparently decomposes each step of the Collatz map not as multiplication but as carry propagation in an adder. Its byproduct, the GPK classification (Generate/Propagate/Kill), quantifies carry generation, propagation, and annihilation at each pair position, making the growth/shrinkage mechanism of numbers statistically observable.
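The G/P/K classification of a single 3n+1 step can be sketched directly from its adder view: 3n+1 is the addition n + (2n+1), and each bit pair of the two operands falls into exactly one class. The function name `gpk_counts` is illustrative, not the crate's API.

```rust
// Sketch: classify every bit pair of one 3n+1 step (viewed as n + (2n+1))
// as Generate, Propagate, or Kill, and count each class.
fn gpk_counts(n: u64) -> (u32, u32, u32) {
    let a = n;
    let b = (n << 1) | 1; // 2n+1; 2n is always even, so OR-ing in 1 is exact
    let g = a & b;        // both bits set: a carry is generated here
    let p = a ^ b;        // exactly one bit set: an incoming carry propagates
    let k = !(a | b);     // neither bit set: an incoming carry is killed
    (g.count_ones(), p.count_ones(), k.count_ones())
}

fn main() {
    // n = 7: a = 0b0111, b = 0b1111 -> G at bits 0..2, P at bit 3, K above.
    assert_eq!(gpk_counts(7), (3, 1, 60));
}
```

By construction the three counts always sum to the word width, so accumulating them over a whole verification run yields exactly the kind of G/P/K totals reported below.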

The following shows GPK statistics obtained from verifying all 4.99 billion odd numbers from 3 to 9,999,999,999 (≈ 2^33) for the 3n+1 map.

G (carry generate)  = 113,183,623,064  (38.16%)
P (carry propagate) =  81,071,432,343  (27.34%)
K (carry kill)      = 102,328,662,110  (34.50%)
Total pairs         = 296,583,717,517
Total steps         =  17,463,252,353
All numbers converged (maximum stopping time 282, n = 2,788,008,987)

Carry chain length distribution:

| Chain Length | Occurrences | Chain Length | Occurrences |
|---|---|---|---|
| 1 | 265,878,320 | 14 | 38,274,610 |
| 2 | 1,842,752,124 | 15 | 22,412,468 |
| 3 | 3,494,167,747 | 16 | 12,790,339 |
| 4 | 3,634,880,654 | 17 | 12,412,547 |
| 5 | 2,871,874,977 | 18 | 2,519,910 |
| 6 | 1,991,272,449 | 19 | 415,078 |
| 7 | 1,293,719,531 | 20 | 69,568 |
| 8 | 814,361,343 | 21 | 11,275 |
| 9 | 501,167,658 | 22 | 1,579 |
| 10 | 304,639,101 | 23 | 337 |
| 11 | 183,966,569 | 24 | 32 |
| 12 | 110,232,559 | 25 | 6 |
| 13 | 65,431,568 | 26 | 4 |

Despite G% > K%, all numbers converge: each G occurrence contributes at most 1 bit of growth, while the division by 2^d removes, on average, more bits than G adds. The carry chain length distribution peaks at 4–5, with a maximum of 26. This indicates that carry propagation remains local and that long-range propagation is exponentially rare.

These findings — the locality of carry propagation, the stability of the G/K ratio, and the geometric distribution of chain lengths — are structural information that cannot be directly obtained from a multiplication-based approach, and represent the intrinsic value of pair predicate decomposition.

6.2 Hardware Implementation Potential

The Kogge-Stone algorithm was originally a design technique for hardware adders, and if pair predicate decomposition is implemented on FPGA/ASIC, one step of the Collatz map can be completed in O(log n) adder depth without going through a multiplier. The advantage of not being constrained to a fixed width also provides natural scalability in circuit design.

However, the claim of this paper is not "achieving the fastest Collatz computation." Hardware implementation is presented as one example of the theoretical consistency by which this method maps directly to adder structures, and the computational significance of pair predicate decomposition should not be reduced to performance comparisons.

6.3 Practical Advantage in GPK Statistics Collection

The clear practical advantage of pair predicate decomposition lies in the cost of obtaining GPK statistics. From the measured benchmarks (Section 2), we compare the speed difference with and without GPK statistics collection.

| Method | GPK OFF | GPK ON | GPK Overhead |
|---|---|---|---|
| u128 arithmetic | 442M nums/s | 22M nums/s | ~20x slowdown |
| Packed scan | 5.25M nums/s | 3.99M nums/s | ~1.3x slowdown |

With u128 direct arithmetic, obtaining GPK statistics requires performing bit decomposition through a separate path from the multiplication, causing an approximately 20x speed reduction. In contrast, with pair scan, GPK is naturally obtained as a byproduct of the scan process, so the additional cost remains at only 1.3x.

The effective comparison in scenarios where GPK statistics are needed (i.e., the classification verification of this paper) is:

u128 + GPK:  22M nums/s
Packed + GPK: 4M nums/s
Speed gap: ~5.5x (significantly reduced from the 84x gap when GPK is not needed)

Pair scan is not "slow but reveals structure"; rather, it is a method that "incurs almost no computational speed loss when observing structure."

What this paper demonstrates is the structural properties of the map, not software benchmark rankings. Pair predicate decomposition transparently decomposes the Collatz map as carry propagation in an adder, providing GPK statistics as a quantitative analysis tool that comes as a byproduct of computation. This is the raison d'être of this tool and its value as a verification platform for the classification paper.

7. Future Prospects

We organize future prospects in two directions: improving pair scan speed and leveraging the structural value of pair predicate decomposition.

| Approach | Overview | Expected Effect | Application Domain |
|---|---|---|---|
| SIMD (AVX2/512) | Simultaneous processing of multiple words at 256/512-bit width | Theoretical 4–8x | Packed scan core |
| Inter-word parallelization | Apply Kogge-Stone across words | Effective for large numbers | 1,000 bits or more |
| Sieve | Skip residue classes proven to converge by GPK classification | Mathematical reduction | Sweep overall |
| SIMD multi-number parallel | Simultaneously scan 4 u64 numbers with AVX2 | Theoretical 4x | Sweep (u64 range) |
| FPGA/ASIC implementation | Implement pair predicate decomposition as adder circuit | 1 clock/step | Demonstration of theoretical consistency |

7.1 Most Promising Candidate for Sweep Acceleration: Sieve

Through residue class analysis based on GPK classification, if a theorem can be derived stating "this residue class is always K-dominant, therefore convergence is guaranteed," then numbers that can be mathematically skipped can be excluded in advance. This is the only pathway where pair predicate decomposition directly contributes to speed, and the direction that most strongly justifies the tool's existence.

7.2 Packed Scan Acceleration via SIMD

AVX2 (256 bits) can process 4 words simultaneously, and AVX-512 (512 bits) can process 8 words simultaneously. SIMD-izing the intra-word operations of Kogge-Stone could theoretically yield a 4–8x improvement. However, as long as the hot path of sweep is u128, packed scan acceleration is limited to tracing large numbers.

7.3 Parallelization of Inter-Word Carry

In the current implementation, inter-word carry is propagated sequentially (carry_out = carry_after >> 63 → to the next word). Applying Kogge-Stone across words would improve carry resolution for k words to O(log k). This is effective for large numbers of 1000 bits or more, but does not affect the u64 input range of sweep.
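The serial dependency being described can be made concrete with a minimal sketch of word-sequential addition (illustrative code, not the actual implementation: carries within each word may be resolved in parallel, but the carry out of word i must arrive before word i+1 is final).

```rust
// Sketch: little-endian multi-word addition with sequential inter-word carry.
// Returns (sum words, final carry out). The `carry` variable is the serial
// dependency that inter-word Kogge-Stone would reduce to O(log k) depth.
fn add_words_sequential(a: &[u64], b: &[u64]) -> (Vec<u64>, u64) {
    let mut out = Vec::with_capacity(a.len());
    let mut carry = 0u64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (s1, c1) = x.overflowing_add(y);
        let (s2, c2) = s1.overflowing_add(carry);
        out.push(s2);
        carry = (c1 | c2) as u64; // must ripple into the next word
    }
    (out, carry)
}

fn main() {
    // The overflow of word 0 ripples into word 1.
    let (s, c) = add_words_sequential(&[u64::MAX, 0], &[1, 0]);
    assert_eq!((s, c), (vec![0, 1], 0));
}
```

Replacing this loop with a cross-word prefix scan only pays off once the word count k is large, which matches the "1,000 bits or more" domain given in the table above.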

7.4 FPGA Implementation

Implementing pair predicate decomposition as an adder circuit on FPGA and demonstrating 1-clock/step operation without a hardware multiplier would serve as verification of this method's circuit-level consistency. By placing Kogge-Stone prefix trees in parallel and fixing the wiring for reference patterns, a circuit can be constructed that completes an arbitrary-width Collatz step in the number of adder stages (O(log n)).

8. Conclusion

The measured sweep speed of collatz-m4m6 v0.2.0's pair scan is nearly two orders of magnitude (~84x) slower than u128 direct arithmetic. However, this gap is the result of an asymmetric comparison between "CPU hardware multiplier vs. software bit operations," and it does not negate the algorithmic value of pair predicate decomposition.

What pair predicate decomposition provides is the transparent decomposition of the Collatz map as carry propagation in an adder, offering GPK statistics as a quantitative analysis tool. In scenarios where GPK statistics are needed, the overhead of pair scan remains at 1.3x, holding a practical advantage over the 20x overhead of u128 arithmetic.

The predicate framework and speed are independent matters. The value of this tool lies in visualizing the structure of the Collatz map and providing a verification platform for the classification paper. Speed is a means to that end, not the end itself.