collatz-m4m6 Performance Analysis Report

Structural analysis of the speed gap between pair predicate decomposition scan and u128 direct arithmetic
2026-02-08  |  v0.2.0  |  Rust (edition 2021)

1. Summary

collatz-m4m6 is an analysis tool based on the m4/m6 pair predicate decomposition of the Collatz-type map T(n) = (xn+1)/2^d. In v0.2.0, we introduced a Vec<u64> packed-bit representation and a Kogge-Stone parallel prefix scan, achieving a significant structural improvement over the sequential scan of the initial v0.1.

However, in measured range-verification (sweep) speed, u128 direct arithmetic achieves approximately 440 million nums/s, while processing via pair scan is nearly two orders of magnitude slower. This report analyzes the structural causes of this speed gap and discusses the computational significance and future prospects of pair predicate decomposition.

2. Measured Benchmarks

2.1 3n+1 Range Verification (50M odd numbers, n ≤ 99,999,999, max_steps=1000)

Intel Core i7-12650H (Alder Lake, 10C/16T), 64GB DDR5 environment. Measured under 4 conditions: Phase 1 (u128 direct arithmetic) ON/OFF and GPK statistics collection ON/OFF.

| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase 1 |
|---|---|---|---|---|
| u128 only | 0.113 s | ~442,000,000 | OFF | ON |
| u128 + GPK statistics | 2.243 s | ~22,300,000 | ON | ON |
| Packed only | 9.516 s | ~5,250,000 | OFF | OFF |
| Packed + GPK statistics | 12.546 s | ~3,990,000 | ON | OFF |

The speed gap between u128 only (0.113s) and Packed only (9.516s) is approximately 84x. For 3n+1, nearly all numbers complete within u128 Phase 1, so this gap directly manifests. The difference between GPK ON/OFF for u128 is approximately 20x.

2.2 5n+1 Range Verification (50K odd numbers, n ≤ 99,999, max_steps=1000)

5n+1 contains divergent sequences, so there exist numbers that do not reach a stopping time. Measured under the same 4 conditions.

| Condition | Elapsed Time | Speed (nums/s) | GPK | Phase 1 |
|---|---|---|---|---|
| u128 only | 1.667 s | ~30,000 | OFF | ON |
| u128 + GPK statistics | 3.555 s | ~14,100 | ON | ON |
| Packed only | 2.118 s | ~23,600 | OFF | OFF |
| Packed + GPK statistics | 3.995 s | ~12,500 | ON | OFF |

For 5n+1, divergence causes bit length to grow, overflowing u128, making the benefit of Phase 1 limited. The gap between u128 only (1.667s) and Packed only (2.118s) narrows to 1.27x.

2.3 Comparison of Per-Step Instruction Counts

The following shows a comparison of instruction counts per step, normalized to 128-bit width.

| Operation | Instructions / 128 bits | Memory Access | Branching |
|---|---|---|---|
| u128 (MUL+ADD+TZCNT+SHR) | 4 | None (register-only) | None |
| BigUint (multi-precision multiply) | ~6 | Vec<u64> heap | Loop control |
| Packed scan (Kogge-Stone) | ~25 | Vec<u64> ×2 heap | Word loop |

3. Structural Reasons Why u128 Is Overwhelmingly Faster

3.1 The Hardware Multiplier Barrier

u128 arithmetic is executed directly by the CPU's hardware multiplier. Intel Alder Lake's integer multiplication pipeline completes a 64-bit × 64-bit multiply in 3–4 clock cycles, storing the 128-bit result in a register pair. This operation involves no memory access whatsoever and no branches that stall the pipeline.

This is a hardware optimization born from semiconductor design investments on the scale of tens of billions of dollars, and it is fundamentally impossible for software-based bit operations to compete on the same level.

3.2 The Special Nature of the Collatz Map: Large Number × Small Constant

The Collatz-type map is T(n) = (xn+1)/2^d, where x is a small constant such as 3, 5, or 9. This corresponds to the multiplication pattern of "large integer × small constant," which is the simplest case in multi-precision multiplication. BigUint × constant requires only 2 instructions (MUL + ADC) per limb and completes in O(n). Pair scan is also O(n), so the asymptotic complexity is identical, but BigUint always wins by a constant factor.
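The 2-instructions-per-limb pattern can be sketched as follows. This is a minimal illustration, not the crate's actual code; `mul_small` is a hypothetical helper name, and the limb order (little-endian Vec<u64>) is an assumption.

```rust
// Sketch: multiply a little-endian Vec<u64> magnitude by a small constant k.
// Each limb costs one 64x64 -> 128-bit MUL plus one carry add, so the whole
// product completes in O(n) with a tiny constant factor.
fn mul_small(limbs: &[u64], k: u64) -> Vec<u64> {
    let mut out = Vec::with_capacity(limbs.len() + 1);
    let mut carry: u128 = 0;
    for &limb in limbs {
        // Widen to u128 so the high half of the product becomes the next carry.
        let t = (limb as u128) * (k as u128) + carry;
        out.push(t as u64);
        carry = t >> 64;
    }
    if carry != 0 {
        out.push(carry as u64);
    }
    out
}

fn main() {
    // (2^64 - 1) * 3 = 2^65 + 2^64 - 3  ->  limbs [2^64 - 3, 2]
    assert_eq!(mul_small(&[u64::MAX], 3), vec![u64::MAX - 2, 2]);
}
```

The carry chain here is a pure register dependency, which is why a constant-multiplier BigUint path keeps its constant-factor lead over any software bit-level scan.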

3.3 The Reality of Sweep: Nearly All Numbers Complete Within u128

In 3n+1 range verification, the vast majority of inputs in the u64 range reach their stopping time by dropping below the initial value in an average of 3–5 steps. During this process, values almost never exceed u128 (128 bits), so very few numbers transition to pair scan (Phase 2). In other words, making sweep faster means making the u128 loop faster, and improvements to packed scan contribute almost nothing to sweep speed.
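The Phase 1 hot loop described above can be sketched in pure u128 arithmetic. Function and parameter names here (`stopping_time_u128`, `max_steps`) are illustrative rather than the tool's actual API; the overflow bail-out stands in for the handoff to Phase 2.

```rust
// Sketch: one sweep entry, entirely in u128 registers (MUL+ADD+TZCNT+SHR).
// Returns the stopping time (first step at which n drops below its start
// value), or None if 3n+1 would overflow u128 (escalate to Phase 2) or
// max_steps is exhausted.
fn stopping_time_u128(n0: u128, max_steps: u32) -> Option<u32> {
    let mut n = n0;
    for step in 1..=max_steps {
        let m = n.checked_mul(3)?.checked_add(1)?; // overflow => leave Phase 1
        let d = m.trailing_zeros();                // TZCNT
        n = m >> d;                                // SHR: divide by 2^d
        if n < n0 {
            return Some(step);                     // stopping time reached
        }
    }
    None
}

fn main() {
    // 7 -> 11 -> 17 -> 13 -> 5 < 7: stopping time 4.
    assert_eq!(stopping_time_u128(7, 1000), Some(4));
}
```

Because the loop touches only registers and its sole branch is the cheap `n < n0` comparison, this path dominates sweep throughput, consistent with the 442M nums/s figure above.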

4. Optimization History

From v0.1 to v0.2, the pair scan itself has been significantly improved. However, this improvement is not in sweep speed but in the efficiency of structural analysis.

| Version | Internal Representation | Carry Resolution | Improvement |
|---|---|---|---|
| v0.1 initial | Vec<u8> per bit | Sequential, 1 bit at a time | Baseline |
| v0.1 fast path | u128 + BigUint | CPU multiplier | Sweep made practical |
| v0.2 current | Vec<u64> (64 pairs/word) | Kogge-Stone (6 stages) | 8x memory improvement |

Through packed representation, memory efficiency improved 8x and intra-word carry resolution improved approximately 10x. However, the hot path of sweep is u128 Phase 1, and these improvements only take effect when Phase 2 is reached.
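The 6-stage intra-word carry resolution can be sketched with the classic software form of the Kogge-Stone prefix scan. This is a minimal standalone version (standard textbook formulation, not the crate's implementation): all 64 carries of one word are resolved in log2(64) = 6 shift-mask stages instead of a 64-step ripple.

```rust
// Kogge-Stone parallel prefix over per-bit generate (g) and propagate (p)
// words. After 6 doubling stages, bit i of g holds the carry OUT of bit i.
fn kogge_stone_carries(mut g: u64, mut p: u64) -> u64 {
    for sh in [1, 2, 4, 8, 16, 32] {
        g |= p & (g << sh); // a generated carry extends through propagate runs
        p &= p << sh;       // a propagate run survives only if both halves do
    }
    g
}

// Sanity check: reconstruct a full 64-bit addition from the resolved carries.
fn add_via_gp(a: u64, b: u64) -> u64 {
    let (g, p) = (a & b, a ^ b);
    let carries_in = kogge_stone_carries(g, p) << 1; // carry-in to each bit
    p ^ carries_in // sum = a XOR b XOR incoming carries (wrapping at 64 bits)
}

fn main() {
    assert_eq!(add_via_gp(0xFFFF, 1), 0x1_0000);
    assert_eq!(add_via_gp(12345, 67890), 80235);
}
```

Each stage is a handful of register operations, which is where the ~25 instructions per 128 bits in the Section 2.3 table come from.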

4.1 Optimization of GPK Statistics Collection

A GPK ON/OFF toggle was introduced so that when statistics are not needed, the design completely skips the Vec<u64> allocation, store, popcount, and max_carry_chain loops. As a result, a 3n+1 sweep with GPK OFF runs on u128 arithmetic alone, achieving 442M nums/s. With GPK ON, by contrast, the bit-by-bit loop in accumulate_gpk_u128 becomes the bottleneck, reducing throughput to 22M nums/s. This approximately 20x difference shows that the cost of GPK statistics collection itself far outweighs that of the arithmetic core.

5. Crossover Point Analysis

Because pair predicate decomposition does not use multiplication, it holds a structural advantage in asymptotic complexity. When bit width becomes sufficiently large, a crossover theoretically exists where the linear cost of pair scan becomes advantageous over the superlinear cost of direct multiplication.

| Bit Width | Direct Multiplication | Pair Scan | Notes |
|---|---|---|---|
| ~128 | Native MUL, O(1) | Bit operations, ~30 instructions | u128 dominant |
| ~256 | Multi-precision MUL ×4 + addition | Kogge-Stone, 1 word | BigUint advantageous |
| ~1,000 | Karatsuba, O(n^1.58) | O(n) | Asymptotically pair advantageous |
| ~10,000+ | FFT, O(n log n) | O(n), small constant | Pair scan dominant |

Note: The Collatz-type map involves "large number × small constant," and BigUint × constant also completes in O(n). Therefore, unlike general multi-precision multiplication crossover tables, when restricted to constant × n multiplication, the asymptotic complexity is identical (both O(n)). In measured benchmarks, BigUint is approximately 100x faster even at 32,768 bits, and the speed crossover in the context of the Collatz map is practically unreachable.

6. The Essential Value of Pair Predicate Decomposition

6.1 Structure Visualization: GPK Statistics from 2^33-Scale Verification

Pair predicate decomposition transparently decomposes each step of the Collatz map not as multiplication but as carry propagation in an adder. Its byproduct, the GPK classification (Generate/Propagate/Kill), quantifies carry generation, propagation, and annihilation at each pair position, making the growth/shrinkage mechanism of numbers statistically observable.
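The G/P/K classification of a single 3n+1 step can be sketched directly from its adder view: 3n+1 is the addition n + (2n+1), and each bit pair of the two operands falls into exactly one class. The function name `gpk_counts` is illustrative, not the crate's API.

```rust
// Sketch: classify every bit pair of one 3n+1 step (viewed as n + (2n+1))
// as Generate, Propagate, or Kill, and count each class.
fn gpk_counts(n: u64) -> (u32, u32, u32) {
    let a = n;
    let b = (n << 1) | 1; // 2n+1; 2n is always even, so OR-ing in 1 is exact
    let g = a & b;        // both bits set: a carry is generated here
    let p = a ^ b;        // exactly one bit set: an incoming carry propagates
    let k = !(a | b);     // neither bit set: an incoming carry is killed
    (g.count_ones(), p.count_ones(), k.count_ones())
}

fn main() {
    // n = 7: a = 0b0111, b = 0b1111 -> G at bits 0..2, P at bit 3, K above.
    assert_eq!(gpk_counts(7), (3, 1, 60));
}
```

By construction the three counts always sum to the word width, so accumulating them over a whole verification run yields exactly the kind of G/P/K totals reported below.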

The following shows GPK statistics obtained from verifying all 4.99 billion odd numbers from 3 to 9,999,999,999 (≈ 2^33) for the 3n+1 map.

G (carry generate)  = 113,183,623,064  (38.16%)
P (carry propagate) =  81,071,432,343  (27.34%)
K (carry kill)      = 102,328,662,110  (34.50%)
Total pairs         = 296,583,717,517
Total steps         =  17,463,252,353
All numbers converged (maximum stopping time 282, n = 2,788,008,987)

Carry chain length distribution:

| Chain Length | Occurrences | Chain Length | Occurrences |
|---|---|---|---|
| 1 | 265,878,320 | 14 | 38,274,610 |
| 2 | 1,842,752,124 | 15 | 22,412,468 |
| 3 | 3,494,167,747 | 16 | 12,790,339 |
| 4 | 3,634,880,654 | 17 | 12,412,547 |
| 5 | 2,871,874,977 | 18 | 2,519,910 |
| 6 | 1,991,272,449 | 19 | 415,078 |
| 7 | 1,293,719,531 | 20 | 69,568 |
| 8 | 814,361,343 | 21 | 11,275 |
| 9 | 501,167,658 | 22 | 1,579 |
| 10 | 304,639,101 | 23 | 337 |
| 11 | 183,966,569 | 24 | 32 |
| 12 | 110,232,559 | 25 | 6 |
| 13 | 65,431,568 | 26 | 4 |

Despite G% > K%, all numbers converge: each G occurrence contributes at most 1 bit of growth, while the division by 2^d removes, on average, more bits than G adds. The carry chain length distribution peaks at 4–5, with a maximum of 26. This indicates that carry propagation remains local and that long-range propagation is exponentially rare.

These findings — the locality of carry propagation, the stability of the G/K ratio, and the geometric distribution of chain lengths — are structural information that cannot be directly obtained from a multiplication-based approach, and represent the intrinsic value of pair predicate decomposition.

6.2 Hardware Implementation Potential

The Kogge-Stone algorithm was originally a design technique for hardware adders, and if pair predicate decomposition is implemented on FPGA/ASIC, one step of the Collatz map can be completed in O(log n) adder depth without going through a multiplier. The advantage of not being constrained to a fixed width also provides natural scalability in circuit design.

However, the claim of this paper is not "achieving the fastest Collatz computation." Hardware implementation is presented as one example of the theoretical consistency by which this method maps directly to adder structures, and the computational significance of pair predicate decomposition should not be reduced to performance comparisons.

6.3 Practical Advantage in GPK Statistics Collection

The clear practical advantage of pair predicate decomposition lies in the cost of obtaining GPK statistics. From the measured benchmarks (Section 2), we compare the speed difference with and without GPK statistics collection.

| Method | GPK OFF | GPK ON | GPK Overhead |
|---|---|---|---|
| u128 arithmetic | 442M nums/s | 22M nums/s | ~20x slowdown |
| Packed scan | 5.25M nums/s | 3.99M nums/s | ~1.3x slowdown |

With u128 direct arithmetic, obtaining GPK statistics requires performing bit decomposition through a separate path from the multiplication, causing an approximately 20x speed reduction. In contrast, with pair scan, GPK is naturally obtained as a byproduct of the scan process, so the additional cost remains at only 1.3x.

The effective comparison in scenarios where GPK statistics are needed (i.e., the classification verification of this paper) is:

u128 + GPK:  22M nums/s
Packed + GPK: 4M nums/s
Speed gap: ~5.5x (significantly reduced from the 84x gap when GPK is not needed)

Pair scan is not "slow but reveals structure"; rather, it is a method that "incurs almost no computational speed loss when observing structure."

What this paper demonstrates is the structural properties of the map, not software benchmark rankings. Pair predicate decomposition transparently decomposes the Collatz map as carry propagation in an adder, providing GPK statistics as a quantitative analysis tool that comes as a byproduct of computation. This is the raison d'être of this tool and its value as a verification platform for the classification paper.

7. Future Prospects

We organize future prospects in two directions: improving pair scan speed and leveraging the structural value of pair predicate decomposition.

| Approach | Overview | Expected Effect | Application Domain |
|---|---|---|---|
| SIMD (AVX2/512) | Simultaneous processing of multiple words at 256/512-bit width | Theoretical 4–8x | Packed scan core |
| Inter-word parallelization | Apply Kogge-Stone across words | Effective for large numbers | 1,000 bits or more |
| Sieve | Skip residue classes proven to converge by GPK classification | Mathematical reduction | Sweep overall |
| SIMD multi-number parallel | Simultaneously scan 4 u64 numbers with AVX2 | Theoretical 4x | Sweep (u64 range) |
| FPGA/ASIC implementation | Implement pair predicate decomposition as adder circuit | 1 clock/step | Demonstration of theoretical consistency |

7.1 Most Promising Candidate for Sweep Acceleration: Sieve

Through residue class analysis based on GPK classification, if a theorem can be derived stating "this residue class is always K-dominant, therefore convergence is guaranteed," then numbers that can be mathematically skipped can be excluded in advance. This is the only pathway where pair predicate decomposition directly contributes to speed, and the direction that most strongly justifies the tool's existence.

7.2 Packed Scan Acceleration via SIMD

AVX2 (256 bits) can process 4 words simultaneously, and AVX-512 (512 bits) can process 8 words simultaneously. SIMD-izing the intra-word operations of Kogge-Stone could theoretically yield a 4–8x improvement. However, as long as the hot path of sweep is u128, packed scan acceleration is limited to tracing large numbers.

7.3 Parallelization of Inter-Word Carry

In the current implementation, inter-word carry is propagated sequentially (carry_out = carry_after >> 63 → to the next word). Applying Kogge-Stone across words would improve carry resolution for k words to O(log k). This is effective for large numbers of 1000 bits or more, but does not affect the u64 input range of sweep.
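The serial dependency being described can be made concrete with a minimal sketch of word-sequential addition (illustrative code, not the actual implementation: carries within each word may be resolved in parallel, but the carry out of word i must arrive before word i+1 is final).

```rust
// Sketch: little-endian multi-word addition with sequential inter-word carry.
// Returns (sum words, final carry out). The `carry` variable is the serial
// dependency that inter-word Kogge-Stone would reduce to O(log k) depth.
fn add_words_sequential(a: &[u64], b: &[u64]) -> (Vec<u64>, u64) {
    let mut out = Vec::with_capacity(a.len());
    let mut carry = 0u64;
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (s1, c1) = x.overflowing_add(y);
        let (s2, c2) = s1.overflowing_add(carry);
        out.push(s2);
        carry = (c1 | c2) as u64; // must ripple into the next word
    }
    (out, carry)
}

fn main() {
    // The overflow of word 0 ripples into word 1.
    let (s, c) = add_words_sequential(&[u64::MAX, 0], &[1, 0]);
    assert_eq!((s, c), (vec![0, 1], 0));
}
```

Replacing this loop with a cross-word prefix scan only pays off once the word count k is large, which matches the "1,000 bits or more" domain given in the table above.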

7.4 FPGA Implementation

Implementing pair predicate decomposition as an adder circuit on FPGA and demonstrating 1-clock/step operation without a hardware multiplier would serve as verification of this method's circuit-level consistency. By placing Kogge-Stone prefix trees in parallel and fixing the wiring for reference patterns, a circuit can be constructed that completes an arbitrary-width Collatz step in the number of adder stages (O(log n)).

8. Conclusion

The measured sweep speed of collatz-m4m6 v0.2.0's pair scan is nearly two orders of magnitude (~84x) slower than u128 direct arithmetic. However, this gap is the result of an asymmetric comparison between "CPU hardware multiplier vs. software bit operations," and it does not negate the algorithmic value of pair predicate decomposition.

What pair predicate decomposition provides is the transparent decomposition of the Collatz map as carry propagation in an adder, offering GPK statistics as a quantitative analysis tool. In scenarios where GPK statistics are needed, the overhead of pair scan remains at 1.3x, holding a practical advantage over the 20x overhead of u128 arithmetic.

The predicate framework and speed are independent matters. The value of this tool lies in visualizing the structure of the Collatz map and providing a verification platform for the classification paper. Speed is a means to that end, not the end itself.