
Kimina-Prover: Applying Test-time RL Search on Large Formal Reasoning Models

Published July 10, 2025

Haiming Wang, Mert Unsal, Xiaohan Lin, MantasBaksys, Junqi Liu, Marco Dos Santos, Flood Sung, Ying, Zhu Zekai, lujianqiao, Hugues de Saxcé, Thibaut Barroyer, Ebony Zhang, Bolton Bailey, Frederick Pu, Jonas Bayer, Marina Vinyes, Zhengying Liu, LI Jia
Numina & Kimi Team

Models | Demo | GitHub

Figure 1: Performance comparison of theorem proving models on the miniF2F-test dataset.

We're excited to announce the release of Kimina-Prover-72B, our state-of-the-art theorem proving model trained with the Kimi k1.5[1] RL pipeline, built on Qwen2.5-72B[2]. Alongside it, we are also releasing two distilled variants: Kimina-Prover-Distill-8B and Kimina-Prover-Distill-1.7B (based on Qwen3-8B and Qwen3-1.7B[3], respectively). Our key innovations include:

Test-Time Reinforcement Learning Search: a trainable agentic proving framework that enables the model to recursively discover, combine, and apply multiple lemmas to construct complex proofs, building on a novel lemma-enabled pattern.

Error-Fixing Capability: Kimina-Prover can read and interpret Lean's error messages and propose targeted fixes, demonstrating significantly higher sample efficiency than regenerating proofs from scratch.

These advances enable Kimina-Prover to solve challenging mathematical problems and surpass prior methods. As shown in Figure 1, on the widely used miniF2F benchmark, Kimina-Prover achieves a state-of-the-art pass rate of 92.2%.

Introduction

We focus on automated theorem proving (ATP) in the Lean 4 language, aiming to automate the construction of formal mathematical proofs. Recent advances in neural theorem proving have significantly improved the ability of AI systems to assist with or automate this process. Notable progress includes AlphaProof[4] from Google DeepMind, which demonstrated strong performance on problems at the level of the International Mathematical Olympiad. Open-source systems such as DeepSeek-Prover-V2[5], which incorporates reinforcement learning, have also achieved state-of-the-art results. In addition, neuro-symbolic agentic approaches like DSP+[6] have shown that competitive performance is possible without large-scale training by leveraging off-the-shelf models in a modular framework.

Our earlier work, Kimina-Prover Preview[7], introduced a large language model for formal theorem proving in Lean, establishing a new performance baseline on the miniF2F benchmark. Trained with a large-scale reinforcement learning pipeline, the model adopted a reasoning-driven exploration paradigm and demonstrated that larger models can serve as stronger formal reasoners. Its structured reasoning pattern enabled efficient proof search and emulated human-like problem-solving strategies. Following this initial success, we continued to improve the model through further iterations of reinforcement learning. However, single-step reasoning remains insufficient for solving complex problems that require long, multi-stage proofs.
To address this limitation, we introduce the Test-Time Reinforcement Learning (TTRL) search framework, which enables the model to discover, combine, and reuse multiple intermediate lemmas autonomously. This framework supports deeper, long-horizon reasoning by decomposing hard problems into reusable subcomponents. A key ingredient of TTRL search is the lemma-enabled pattern, which allows the model to identify and apply intermediate lemmas as part of its proof construction process. This structured reuse of intermediate results significantly expands the model's problem-solving capability beyond single-step generation. To further enhance robustness, we also integrate an error-fixing mechanism that interprets Lean's error messages and proposes targeted corrections. This allows the model to refine its outputs through iterative feedback, improving proof reliability and overall sample efficiency.

A New State-of-the-Art Model

Model                      pass@1   pass@32   pass@1024
Kimina-Prover-1.7B         46.7     73.4      —
DSP+                       52.5     71.3      80.7
DeepSeek-Prover-V2-7B      58.6     75.6      79.9
Kimina-Prover-8B           61.1     78.3      —
DeepSeek-Prover-V2-671B    61.9     82.4      86.6
Kimina-Prover-72B          63.9     84.0      87.7

Table 1: Performance comparison of theorem proving models on the miniF2F-test dataset under equivalent sampling budgets. Kimina-Prover-72B achieves state-of-the-art performance across all evaluated settings.

The combination of the proposed techniques yields significant improvements in formal theorem proving performance. On the miniF2F benchmark, Kimina-Prover achieves a pass rate of 84.0% with pass@32 and 86.4% with the addition of a single round of error correction. With pass@1024, the pass rate reaches 87.7%. Applying the full Test-Time Reinforcement Learning (TTRL) search framework yields a final pass rate of 92.2%, with an estimated sampling budget of approximately 42,000 attempts as an upper bound. However, this budget can be substantially reduced in future versions, as a large portion of the current sampling is spent on proving unhelpful or redundant lemmas.

Notably, these results indicate a shift in the scaling behavior of the prover. While earlier versions exhibited approximately linear improvements (in log scale) with increasing sampling budget, the current system shows diminishing returns beyond pass@1024. This suggests that further gains are less dependent on increased sampling and instead require more sophisticated search strategies, such as those introduced by TTRL.

Methodology

Lemma-enabled pattern

The lemma-enabled pattern is designed to equip the model with the ability to identify and utilize useful lemmas provided in the input. To support this capability, during reinforcement learning (RL) training, a random subset of one to three formal lemmas is prepended to the problem context, exposing the prover to potentially useful intermediate results that can aid in constructing the final proof. These lemmas are prepared through a two-step pipeline: (1) a general-purpose LLM generates candidate lemmas in natural language; (2) these are then translated into formal statements using our Kimina-Autoformalizer-7B.

Initial observations showed that the model had a low tendency to incorporate the provided lemmas. To address this, we introduced a preference-based reward shaping strategy within the RL framework. When a theorem could be proven through multiple trajectories, solutions that successfully leveraged a provided lemma were assigned a higher reward, while those that did not were penalized.
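To make the mechanism concrete, the sketch below shows, under simplifying assumptions, how lemmas can be prepended to a prompt and how the correctness reward can be reshaped. The prompt layout, the substring check for lemma usage, and the bonus/penalty constants are illustrative placeholders, not the exact components or values of our pipeline.

```python
import random

def build_prompt(problem: str, lemma_pool: dict[str, str]) -> tuple[str, list[str]]:
    """Prepend a random subset of 1-3 formal lemmas (name -> Lean statement) to the problem."""
    names = random.sample(list(lemma_pool), k=random.randint(1, min(3, len(lemma_pool))))
    header = "\n\n".join(lemma_pool[n] for n in names)
    return f"{header}\n\n{problem}", names

def shaped_reward(proof: str, lean_ok: bool, provided: list[str],
                  lemma_proof_exists: bool,
                  bonus: float = 0.2, penalty: float = 0.1) -> float:
    """Binary correctness reward, reshaped to prefer trajectories that reuse a provided lemma.

    `lemma_proof_exists` marks theorems for which at least one sampled trajectory already
    proved the goal via a provided lemma; only then is ignoring the lemmas penalized.
    """
    if not lean_ok:                       # proof rejected by the Lean 4 compiler
        return 0.0
    if any(name in proof for name in provided):
        return 1.0 + bonus                # correct proof that leverages a provided lemma
    return 1.0 - penalty if lemma_proof_exists else 1.0
```

A substring match is only a stand-in here; any reliable detector of lemma usage in the verified proof would fill the same role.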
This reward-shaping approach proved effective, increasing the lemma utilization rate to a stable 30–40% after training. Importantly, this method not only encouraged lemma usage but also promoted selectivity: the model learned to apply lemmas strategically when they were useful, while ignoring irrelevant ones, demonstrating more efficient and human-like reasoning behavior.

TTRL search

Figure 2: Diagram of the Test-Time Reinforcement Learning (TTRL) Search framework. It is composed of three main components: RL training, sublemma generation, and negation filtering. Sublemmas are generated and formalized, then integrated into the training loop with dynamic scoring and pruning. A negation filter ensures logical consistency by discarding invalid lemmas.

The lemma-enabled pattern allowed the model to incorporate pre-generated lemmas as intermediate steps in constructing proofs. However, we observed that randomly sampling and inserting lemmas was insufficient for solving complex problems requiring highly structured and deeply nested reasoning. To address this limitation, we developed the Test-Time Reinforcement Learning Search framework: a trainable, agentic approach that systematically organizes, filters, and composes a large set of candidate lemmas to build complete proofs. This framework transitions the process from random exploration to a more strategic, goal-directed search.

As illustrated in Figure 2, we define the search scope of each problem as the problem itself along with its associated candidate lemmas. TTRL search tracks a lemma utilization score within each search scope to measure how frequently and effectively each lemma contributes to the final proof. At the beginning of each RL training iteration, for each problem (i.e., search scope), we construct K = 10 input variants by prepending different combinations of lemmas. To balance exploration and exploitation, 60% of the inputs are built using the top-ranking lemmas (those with the highest utilization scores), focusing the model on the most promising proof paths. The remaining 40% are formed by including these top lemmas alongside one to four randomly selected additional lemmas, encouraging exploration of novel and potentially useful lemma combinations. To ensure quality, a filtering mechanism prunes lemmas that consistently fail to contribute meaningfully: any lemma that fails to achieve a utilization score of at least τ = 0.10 after 50 insertion attempts is removed from the search pool.

A key feature of TTRL is its recursive search mechanism. The search scope is not limited to the original theorem but is also maintained for each lemma, allowing the framework to recursively decompose the problem into smaller subproblems. A parallel sublemma generation process runs throughout, and whenever a theorem or lemma fails to find a proof after N = 128 attempts, new candidate sublemmas are generated. This recursive strategy enables test-time scaling of reasoning depth, significantly extending the model's effective problem-solving capability.

To maintain logical soundness, we address a failure mode wherein incorrectly formalized lemmas could lead to trivial or invalid proofs. In such cases, the model may exploit inconsistencies to construct seemingly valid but unsound solutions. To prevent this, we introduce a negation proving process: for each newly generated lemma, we attempt to prove its logical negation. If the negated statement is provable, the original lemma is logically inconsistent and is immediately discarded.
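A minimal sketch of this consistency check follows; `negate` and `try_prove` are assumed interfaces to a statement negator and to the prover plus Lean compiler, and the attempt budget is illustrative rather than the value used in our pipeline.

```python
def negation_filter(candidate_lemmas: list[str], try_prove, negate,
                    attempts: int = 32) -> list[str]:
    """Keep only candidate lemmas whose logical negation we fail to prove.

    `negate(stmt)` is assumed to return the formal statement with its conclusion
    wrapped in `Not (...)`; `try_prove(stmt, attempts)` samples proofs and returns
    True if any of them passes the Lean 4 compiler. Both are stand-ins.
    """
    sound = []
    for stmt in candidate_lemmas:
        if try_prove(negate(stmt), attempts):
            continue            # the negation holds: the lemma is inconsistent, drop it
        sound.append(stmt)      # no proof of the negation found: keep it in the search pool
    return sound
```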
This filtering step ensures the reliability and soundness of the overall proof construction process.

Enabling Error-Fixing in Kimina-Prover

A critical limitation of recent advanced theorem proving models is their inability to correct proofs based on feedback from the proof assistant, a capability that human users routinely leverage. To address this gap, we developed a dedicated framework to integrate error-fixing capabilities into Kimina-Prover.

SFT Data Generation for Error Correction. General-purpose large language models exhibit a low success rate when interpreting Lean's error messages and proposing valid corrections. To overcome this, we constructed a specialized Supervised Fine-Tuning (SFT) dataset tailored for error correction. This dataset consists of triplets of the form (incorrect proof, Lean feedback, correct proof). To enrich the supervision signal, we prompted Claude 3.7 Sonnet[8] to synthesize step-by-step reasoning chains explaining how the incorrect proof can be transformed into the correct one using the provided feedback. The result is a high-quality dataset that includes not only the initial and corrected proofs but also the intermediate reasoning, facilitating more effective learning.

Batched Failure Replay Strategy. Integrating error correction directly into the reinforcement learning (RL) loop initially proved ineffective due to the low success rate (~1%) of the SFT model in fixing errors, resulting in sparse rewards and unstable training. To address this, we designed the Batched Failure Replay strategy. Rather than attempting to correct errors immediately during an RL iteration, all failed proof attempts from iteration N are collected. In the subsequent iteration N+1, the training batch is composed of a fixed-size union of these prior failures (e.g., 500 samples) and standard problems from the prompt set (e.g., another 500 samples). This ensures consistent and high-volume exposure to error correction tasks in each training step, enabling the model to gradually learn effective error-handling behaviors in a stable and data-efficient manner.

                 16+16 attempt-and-fix   32×1 brute-force   32+32 attempt-and-fix
Kimina-Prover    35.6                    28.8               44.1

Table 2: Performance comparison between error-fixing and brute-force strategies on a selected subset of 59 miniF2F-test problems with the lowest win rates. Under equal sample budgets, the error-fixing strategy (16+16) outperforms the brute-force baseline (32×1), demonstrating improved sample efficiency.

This training approach led to measurable improvements in the model's ability to recover from failure. The model advanced from correcting basic syntax errors to resolving complex logical mistakes and, eventually, discovering alternative proof strategies when initial attempts failed. Crucially, this capability improved sample efficiency. As shown in Table 2, we compared different strategies under fixed computational budgets. The 16+16 attempt-and-fix strategy, consisting of 16 initial proof attempts each followed by one error-fixing attempt, achieved a success rate of 35.6%, outperforming the 32×1 brute-force baseline, which reached 28.8% with 32 independent attempts. Further increasing the sample budget to 32+32 with error correction yielded a success rate of 44.1%. These results demonstrate that enabling the model to correct its own errors is a more efficient use of computational resources than repeated trial-and-error.
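The evaluation loop behind the attempt-and-fix budgets in Table 2 can be sketched as follows. `generate_proof`, `fix_proof`, and `lean_check` stand in for the prover and the Lean 4 compiler interface; they are assumptions for illustration, not the released API.

```python
def attempt_and_fix(problem: str, generate_proof, fix_proof, lean_check,
                    n_attempts: int = 16):
    """Sketch of the 16+16 attempt-and-fix strategy: each failed proof attempt is
    followed by exactly one error-fixing retry, for a total budget of 2 * n_attempts."""
    for _ in range(n_attempts):
        proof = generate_proof(problem)
        ok, errors = lean_check(proof)
        if ok:
            return proof
        # Feed Lean's error messages back to the model for a single targeted fix.
        repaired = fix_proof(problem, proof, errors)
        ok, _ = lean_check(repaired)
        if ok:
            return repaired
    return None  # unsolved within the 16+16 budget
```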
Other improvements

In addition to our core TTRL search and error-fixing frameworks, we developed several other novel techniques to enhance the model's learning process and problem-solving capabilities.

Prompt Set Curation and Iteration. Effective data curation is critical for reinforcement learning (RL) efficiency. Our initial prompt set of over 300K problems was refined to a competition-focused subset of approximately 90K, with an emphasis on problems from sources such as the olympiads-ref subset of NuminaMath 1.5[9]. During RL, we applied dynamic filtering: problems with consistently high success rates were removed to maintain challenge, while persistently difficult problems were decomposed into simpler subproblems or variations using a general-purpose LLM. The final prompt set combines auto-formalized statements for broad coverage, human-annotated problems for quality, and augmented variations for targeted skill development. The curated prompt set will be open-sourced.

Continuous Pre-training on Lean Data. To address the limited Lean proficiency of most base models, we applied Continuous Pre-training (CPT). A 6-billion-token CPT dataset was constructed from multiple sources: 260M tokens from online platforms such as GitHub, 5.5B tokens of compiler-validated rollout data from our RL pipeline, and additional structured data in state–tactic–state and state–tactic–error formats to enhance diversity. This structured data supplements the base model with additional domain knowledge.

Random Proof Cut Data Augmentation. To better utilize high-quality, human-annotated proofs that lack intermediate reasoning steps, we developed the Random Proof Cut technique. This method augments such data using two variations:

Proof truncation: the final part of a proof is removed, and the model must complete it.

Proof infilling: a randomly selected internal block is replaced with the placeholder token sorry, and the model must reconstruct the missing steps.

This strategy allows the model to learn how to extend existing lines of reasoning and to fill in logical gaps, effectively integrating human-generated content into the training process.

Non-proof Problem Solving. Many problems ask for a final answer rather than a proof, and so cannot be naturally framed as a traditional proof task. Inspired by the CombiBench[10] evaluation methodology, we introduced a non-proof problem solving capability to unify answer generation with proof construction. In this setting, the model is given a problem with the final answer field left blank. It performs a two-stage process within a single inference: it first deduces the correct answer and then generates a complete formal proof to justify that answer. This approach ensures the model's reasoning is explicitly conditioned on reaching the correct solution.
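As a toy illustration of this input format (a made-up statement, not one drawn from our prompt set or from CombiBench), the answer definition is left as sorry and the model is expected to fill it in and then prove the accompanying theorem within a single generation:

```lean
import Mathlib

-- The answer field is left blank in the input; the model first deduces the value
-- (here it would be 5) and then proves the theorem that characterizes it.
abbrev hypothetical_answer : ℕ := sorry

theorem hypothetical_answer_spec : 2 * hypothetical_answer + 1 = 11 := by
  sorry
```

A successful completion replaces both sorry placeholders: the answer itself and a full proof of the specification theorem.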
Sample proof cases

Here we provide some proof cases found by our newest Kimina-Prover model. The complete list of solved miniF2F problems is available on GitHub.

TTRL search proof example

The system recursively decomposed the problem into four layers of lemmas using the TTRL search approach, ultimately generating a 520-line formal proof. During this recursive proving process, at one stage only a single lemma, sum_cexp_div_pow_ne_zero, remained to be solved. While this conclusion seems straightforward to prove informally, it still requires many steps to prove rigorously in Lean 4. Our system recursively generated two extra layers of lemmas and obtained a 260-line formal proof for sum_cexp_div_pow_ne_zero. This proof is valid under Mathlib v4.15.

Sample proof for imo 1969 p2

Proof structure

Figure 3: Proof dependency graph for the theorem imo_1969_p2, illustrating the structure of lemma references and calls. Each node represents a lemma or theorem, with directed edges indicating dependency relationships (i.e., one lemma calling another). Lemma names are randomly generated during proof construction; for clarity, we manually renamed them post hoc to reflect their mathematical content more accurately.

Formal code

```lean
import Mathlib

open Finset in
theorem norm_le_sum_norm_of_sum_eq_zero {N : ℕ} (hN : N > 0) (z : ℕ → ℂ)
    (hsum : ∑ i ∈ range N, z i = 0) :
    ‖z 0‖ ≤ ∑ i ∈ Icc 1 (N - 1), ‖z i‖ := by
  by_cases h : N > 1
  · have h6 : ∑ i in range N, z i = z 0 + ∑ i in Icc 1 (N - 1), z i := by
      have h10 : range N = {0} ∪ Icc 1 (N - 1) := by
        ext x
        simp
        omega
      rw [h10]
      rw [sum_union]
      · simp
      · apply disjoint_left.mpr
        intro x hx1 hx2
        simp at hx1
        simp [hx1] at hx2
    have h3 : z 0 = - (∑ i ∈ Icc 1 (N - 1), z i) := by
      have h4 : ∑ i in range N, z i = 0 := hsum
      rw [h6] at h4
      calc z 0 = (z 0 + ∑ i in Icc 1 (N - 1), z i) - ∑ i in Icc 1 (N - 1), z i := by ring
        _ = (0 : ℂ) - ∑ i in Icc 1 (N - 1), z i := by rw [h4]
        _ = - (∑ i in Icc 1 (N - 1), z i) := by ring
    have h11 : ‖z 0‖ = ‖(∑ i ∈ Icc 1 (N - 1), z i : ℂ)‖ := by
      rw [h3]
      simp [norm_neg]
    have h12 : ‖(∑ i ∈ Icc 1 (N - 1), z i : ℂ)‖ ≤ ∑ i ∈ Icc 1 (N - 1), ‖z i‖ := by
      apply norm_sum_le
    linarith [h11, h12]
  · have h5 : N = 1 := by omega
    rw [h5]
    have h6 : ∑ i in range 1, z i = 0 := by
      have h7 : N = 1 := h5
      have h8 : ∑ i in range N, z i = 0 := hsum
      rw [show N = 1 by omega] at h8
      simpa using h8
    have h7 : z 0 = 0 := by
      simp at h6
      all_goals try tauto
    rw [show z 0 = (0 : ℂ) by exact h7]
    simp

open Finset in
theorem sum_pow_one_half_from_two_to_N {N : ℕ} (hN : 0 < N) :
    ∑ j ∈ Icc 1 (N - 1), (1 / 2 : ℝ) ^ (j + 1) = 1 / 2 - 1 / 2 ^ N := by
  have h2 : ∀ (k : ℕ), ∑ j in Icc 1 (k), (1 / 2 : ℝ) ^ (j + 1) = 1 / 2 - 1 / 2 ^ (k + 1 : ℕ) := by
    intro k
    induction k with
    | zero => simp
    | succ k ihk =>
      rw [sum_Icc_succ_top (by linarith)]
      rw [ihk]
      field_simp at *
      ring_nf
  have h4 : ∑ j in Icc 1 (N - 1), (1 / 2 : ℝ) ^ (j + 1) = 1 / 2 - 1 / 2 ^ (N : ℕ) := by
    have h5 : ∑ j in Icc 1 (N - 1), (1 / 2 : ℝ) ^ (j + 1) = 1 / 2 - 1 / 2 ^ ((N - 1 : ℕ) + 1 : ℕ) := by
      apply h2 (N - 1)
    rw [h5]
    have h6 : (N - 1 : ℕ) + 1 = N := by omega
    rw [h6]
  exact h4

open Complex Finset in
theorem sum_eq_zero_of_sq_add_sq_eq_geom_seq_false {N : ℕ} (hN : 0 < N) (x y : Fin N → ℝ)
    (hxsum : ∑ i : Fin N, x i = 0) (hysum : ∑ i : Fin N, y i = 0)
    (hxy : ∀ j
```