Exploring Continuous Learning: Reasoning Bank + Recursive Language Models
TLDR
A tiny Reasoning Bank of success and failure reasoning strategies on PubMedQA nudged accuracy from 73% → 77%. The lift was strongest when both success and failure strategies were present, reaching 84% in that subset. No fine‑tuning, just retrieval of distilled memories.
Motivation
2025 gave us million-token context windows, and with them a new failure mode: context rot. Retrieval is as important as ever (I’m avoiding “RAG” here, as I agree with Jeff Huber’s point that we should just call it retrieval), and I’m interested in exploring memory and reasoning, rather than just stuffing prompts with relevant chunks.
So as of late October 2025, expanding the context window of LLMs, and in some ways continual learning for LLMs, are trending topics. I’m still reading and learning, and I’d bucket the recent papers I’ve come across (thanks to https://www.alphaxiv.org/) into three categories (these are my working buckets as I learn and build more, not meant to be an academic taxonomy):
- architectural changes - eg Native Sparse Attention for long-horizon efficiency
- parametric updates - eg online SFT/RL such as Sparse Memory Finetuning, which allows continuous knowledge updates without impacting generalization
- “outer loop” engineering - building architecture (or scaffolding?) around the LLM, eg LightMem, Recursive Language Models, Reasoning Bank, and many others
I was especially interested in Recursive Language Models and Reasoning Bank, because they seem to me to align well with the bitter lesson:
“One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are *search and learning*.”
In the case of Recursive Language Models, we give the AI agent full autonomy to search in whatever way it deems fit. We are not crafting tools with specific instructions on how to use them, which clog up context. Instead, we give the LLM an execution environment where it can write whatever Python code it deems necessary and spawn additional LLM processes to gather the required information.
For Reasoning Bank, we let the LLM reflect on and learn from its own reasoning traces, on both successes and, more importantly, incorrect predictions. Unlike many parameter-update methods, which often depend on preference winners or scalar rewards, a memory layer lets the LLM reflect on where it went wrong and try to distill the “why”.
For this post, we will focus on the PubMedQA dataset, whose contexts are fairly short, so unsurprisingly adding recursion did not help performance. We’ll save the RLM + Reasoning Bank combo for another post with experiments on the BrowseComp Plus dataset.
Core idea around Reasoning Bank
When humans learn complex tasks, we benefit from two types of examples:
- Success patterns: “Here’s how to solve this correctly”
- Failure patterns: “I got this wrong, here’s what I should watch out for next time”
Most AI systems only learn from successes. But what if the LLM could learn from both successes and failures? And, perhaps even more importantly, reflect on why it made a mistake in the first place, so the learnings are generalizable?
Experiment
- Dataset: PubMedQA
- Task: Answer yes/no/maybe based on research abstracts
- LLM: Gemini 2.5 Flash
- Memory Reasoning Bank: 200 training examples → 144 success patterns + 56 failure-patterns
- Retrieval of Memory: Semantic similarity matching via embeddings (see the retrieval sketch after this list)
- Train/Test: Randomly sample, build (“train”) the Reasoning Bank on 200 examples, then run inference on 100 held-out examples
- Comparison: Vanilla LLM prediction: 73% accuracy, LLM+MRB: 77% accuracy. Across 10 independent random splits/seeds, the memory‑augmented setup delivered a consistent ~4‑point lift; small but stable, not a one‑off.
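For concreteness, here is a minimal sketch of the retrieval step, assuming each stored memory carries a precomputed embedding and that a hypothetical embed() helper wraps whatever embedding model is used (these names are illustrative, not from the ReasoningBank paper):

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical helper: call your embedding model of choice and return a vector."""
    raise NotImplementedError

def retrieve_memories(query: str, memories: list, k: int = 1) -> list:
    """Return the top-k memories ranked by cosine similarity to the query.

    Assumes each memory object exposes a precomputed `embedding` vector,
    computed once when the memory was distilled and stored.
    """
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for mem in memories:
        m = np.asarray(mem.embedding, dtype=float)
        scored.append((float(np.dot(q, m / np.linalg.norm(m))), mem))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem for _, mem in scored[:k]]

# Success patterns and failure-patterns are retrieved separately; the injection
# code further down uses the single best match of each (strategies[0] / anti_patterns[0]).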
Learning from Failure
A failure-pattern is a structured analysis of a reasoning and prediction failure.
Extraction
When the LLM produces an incorrect answer, an LLM Judge extracts a failure-pattern with these components:
- Error Identification: What specific reasoning error occurred?
- Warning Signals: What indicators should have been noticed?
- Correct Approach: How should the reasoning have proceeded?
- Generalizable Lessons: What broader takeaways apply to similar questions?
LLM Judge Prompt:
Your task: analyze this failed reasoning trace and extract one failure-pattern
(common reasoning error) as a cautionary example.
Question: {query}
Reasoning trace: {reasoning}
Outcome: failure
Ground truth: {correct_answer}
Model answer: {wrong_answer}
Extract a failure-pattern with:
1. Title: Short name for the error (e.g., "Overgeneralization from Limited Evidence")
2. Description: One-sentence summary of when this error occurs
3. Content: 4-component analysis:
- ERROR: What reasoning mistake was made
- WARNING SIGNALS: What should have raised concerns
- CORRECT APPROACH: How to reason properly
- LESSONS: Generalizable takeaways
4. Tags: Domain/task filters (["pubmedqa", "yes_no_qa", ...])
5. Outcome: "failure"
Example failure-pattern Output:
JSON Structure:
{
  "title": "Overgeneralization from Limited Evidence",
  "description": "Use as warning when tendency to dismiss feasibility based solely on preliminary/limited data",
  "content": "...",
  "tags": ["pubmedqa", "yes_no_qa", "feasibility_study"],
  "outcome": "failure"
}
Content Field (formatted):
ERROR: Dismissed feasibility of intervention based on limited pilot evidence,
failing to recognize that feasibility studies are designed to test practicality,
not efficacy.
WARNING SIGNALS:
- Question asks about feasibility, not efficacy
- Study explicitly states "feasibility study"
- Evidence shows completion rates and participant feedback
CORRECT APPROACH:
1. Distinguish feasibility questions from efficacy questions
2. Recognize that feasibility focuses on practical implementation
3. Evaluate completion rates and participant acceptance
4. Answer 'yes' for feasibility even if efficacy unclear
LESSONS: Don't apply efficacy standards to feasibility questions.
Limited evidence can still demonstrate feasibility.
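To show how this extraction step might be wired up, here is a hedged sketch: it assumes a hypothetical call_llm() wrapper around Gemini 2.5 Flash, that the judge prompt above is kept as a format string named JUDGE_PROMPT_TEMPLATE (my own name), and that the judge is constrained to reply with the JSON object shown above:

import json
from dataclasses import dataclass, field

# The judge prompt shown above, stored as a format string with {query},
# {reasoning}, {correct_answer} and {wrong_answer} placeholders.
JUDGE_PROMPT_TEMPLATE = "..."

@dataclass
class MemoryItem:
    """One Reasoning Bank entry: a success pattern or a failure-pattern."""
    title: str
    description: str
    content: str
    tags: list[str] = field(default_factory=list)
    outcome: str = "failure"               # "success" or "failure"
    embedding: list[float] | None = None   # filled in when stored, used for retrieval

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to Gemini 2.5 Flash (or any LLM) and return its text."""
    raise NotImplementedError

def extract_failure_pattern(query: str, reasoning: str,
                            correct_answer: str, wrong_answer: str) -> MemoryItem:
    """Run the judge prompt on a failed trace and parse the returned JSON into a memory."""
    judge_prompt = JUDGE_PROMPT_TEMPLATE.format(
        query=query, reasoning=reasoning,
        correct_answer=correct_answer, wrong_answer=wrong_answer,
    )
    raw = call_llm(judge_prompt)
    data = json.loads(raw)          # assumes the judge returns strict JSON
    return MemoryItem(**data)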
Injecting Reasoning Memories
Purpose: Inject both success patterns AND failure-patterns with distinct formatting
Code Implementation:
# Inject success pattern (if retrieved)
if strategies and len(strategies) > 0:
    prompt += "\n\n" + "="*60 + "\n"
    prompt += "RELEVANT STRATEGY (from similar past questions):\n"
    prompt += "="*60 + "\n\n"
    strategy = strategies[0]
    prompt += f"{strategy.title}\n"
    prompt += f"When to use: {strategy.description}\n"
    prompt += f"Approach:\n{strategy.content}\n\n"

# Inject failure-pattern (if retrieved) with warning formatting
if anti_patterns and len(anti_patterns) > 0:
    prompt += "="*60 + "\n"
    prompt += "⚠️ ANTI-PATTERN TO AVOID:\n"
    prompt += "="*60 + "\n\n"
    anti = anti_patterns[0]
    prompt += f"{anti.title}\n"
    prompt += f"Common mistake: {anti.description}\n"
    prompt += f"Analysis:\n{anti.content}\n\n"

# Closing guidance
if strategies or anti_patterns:
    prompt += "="*60 + "\n"
    prompt += "Consider these patterns when answering.\n"
    prompt += "="*60 + "\n\n"
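For reuse later in the inference loop, the same logic could be wrapped in a small helper (a sketch; the function name inject_memories is my own, and the body mirrors the snippet above):

def inject_memories(prompt: str, strategies: list, anti_patterns: list) -> str:
    """Append retrieved success-pattern and failure-pattern blocks to the base prompt."""
    bar = "=" * 60
    if strategies:
        s = strategies[0]
        prompt += (f"\n\n{bar}\nRELEVANT STRATEGY (from similar past questions):\n{bar}\n\n"
                   f"{s.title}\nWhen to use: {s.description}\nApproach:\n{s.content}\n\n")
    if anti_patterns:
        a = anti_patterns[0]
        prompt += (f"{bar}\n⚠️ ANTI-PATTERN TO AVOID:\n{bar}\n\n"
                   f"{a.title}\nCommon mistake: {a.description}\nAnalysis:\n{a.content}\n\n")
    if strategies or anti_patterns:
        prompt += f"{bar}\nConsider these patterns when answering.\n{bar}\n\n"
    return prompt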
Prompt with injected Reasoning Memories:
Based on the provided medical research context, answer the following
question with 'yes', 'no', or 'maybe':
Question: Aromatase inhibitor-related musculoskeletal symptoms: is preventing
them with exercise feasible?
Context:
[... medical abstract about feasibility study ...]
============================================================
RELEVANT STRATEGY (from similar past questions):
============================================================
Evaluate Feasibility Study Outcomes
When to use: Use when assessing practical implementation feasibility
Approach:
1. Check completion rates and participant retention
2. Evaluate acceptability metrics
3. Distinguish feasibility from efficacy
4. Answer based on implementation success
============================================================
ANTI-PATTERN TO AVOID
============================================================
Overgeneralization from Limited Evidence
Common mistake: Dismissing feasibility based on limited pilot data
Analysis:
ERROR: Dismissed feasibility of intervention based on limited pilot
evidence, failing to recognize that feasibility studies test practicality,
not efficacy.
WARNING SIGNALS:
- Question asks about feasibility, not efficacy
- Study explicitly states "feasibility study"
- Evidence shows completion rates and participant feedback
CORRECT APPROACH:
1. Distinguish feasibility questions from efficacy questions
2. Recognize that feasibility focuses on practical implementation
3. Evaluate completion rates and participant acceptance
4. Answer 'yes' for feasibility even if efficacy unclear
LESSONS: Don't apply efficacy standards to feasibility questions.
Limited evidence can still demonstrate feasibility.
============================================================
Consider these patterns when answering.
============================================================
Please analyze the context carefully and provide your answer as one
word: yes, no, or maybe.
This structure allows flexibility in the presence or absence of memories. One important finding: ~19% of test examples retrieved both a success pattern and a failure-pattern, and that subset reached 84% accuracy (vs the 73% baseline for vanilla LLM predictions).
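For reference, that subset number comes from slicing results by which memory types were retrieved; a minimal sketch of the breakdown, assuming each prediction is logged as a dict with correct, had_strategy, and had_anti_pattern fields (hypothetical names):

from collections import defaultdict

def accuracy_by_memory_condition(results: list) -> dict:
    """Group logged predictions by which memory types were retrieved and report accuracy per group."""
    buckets = defaultdict(list)
    for r in results:
        if r["had_strategy"] and r["had_anti_pattern"]:
            key = "both"
        elif r["had_strategy"]:
            key = "strategy_only"
        elif r["had_anti_pattern"]:
            key = "anti_pattern_only"
        else:
            key = "no_memory"
        buckets[key].append(1 if r["correct"] else 0)
    return {k: sum(v) / len(v) for k, v in buckets.items()}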
Inference flow
Input
  │
  ▼
Retriever ──► Top‑k Memories
  │              ├─ Strategy: "Evaluate Feasibility (completion, retention...)"
  │              └─ Anti‑pattern: "Overgeneralize from limited evidence"
  │
  ▼
Prompt Composer
  │   (inject strategy/anti‑pattern blocks)
  ▼
LLM ⇢ answer (yes/no/maybe)
  │
  ▼
Judge (optional) ──► new memory (strategy or anti‑pattern)
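Putting the pieces together, here is a hedged end-to-end sketch of that flow, reusing the hypothetical helpers from the earlier sketches (retrieve_memories, inject_memories, call_llm, MemoryItem); the one-word answer normalization at the end is my own simplification:

def answer_question(question: str, context: str, memory_bank: list) -> str:
    """Retrieve memories, compose the prompt, query the LLM, and normalize to yes/no/maybe."""
    # 1. Retrieve the best-matching success pattern and failure-pattern separately
    strategies = retrieve_memories(
        question, [m for m in memory_bank if m.outcome == "success"], k=1)
    anti_patterns = retrieve_memories(
        question, [m for m in memory_bank if m.outcome == "failure"], k=1)

    # 2. Compose the base prompt and inject the memory blocks
    prompt = (
        "Based on the provided medical research context, answer the following "
        "question with 'yes', 'no', or 'maybe':\n\n"
        f"Question: {question}\n\nContext:\n{context}\n"
    )
    prompt = inject_memories(prompt, strategies, anti_patterns)
    prompt += ("Please analyze the context carefully and provide your answer "
               "as one word: yes, no, or maybe.")

    # 3. Call the LLM and normalize its reply to one of the three labels
    raw = call_llm(prompt).strip().lower()
    for label in ("yes", "no", "maybe"):
        if label in raw.split():
            return label
    return "maybe"  # fallback when the model does not reply with a single word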
Limitations & Next Steps
This is post 1 of N on continual learning; I just wanted to document my learnings so far.
Current Limitations
- only tested on PubMedQA for now
- while everything else about the experiment aligns well with the bitter lesson’s thesis of leaning into search and learning, the retrieval step still relies on a “hand-crafted” embedding-similarity heuristic
What’s Next
- the reasoning bank was built (“trained”, but not in the classical ML sense) on ~200 examples; what happens if we scale it to more?
- as success and failure patterns accumulate, is there a way to “consolidate” the memories (as in LightMem) so that the reasoning traces and failure reflections become even more generalizable and helpful?
- can the “training” step of building up the reasoning bank also be more iterative, so that example 10 references reasoning traces developed from examples 1-9, and potentially even updates those memories given some sort of ground-truth label?
- I’m curious how this approach compares to using GEPA; I intentionally did not tune the prompt (thanks Claude) for this experiment
- very curious to try the RLM + MRB combo on BrowseComp Plus, as reasoning memories about how the LLM recursively found solutions in complex, long-context tasks could be incredibly valuable
This work was inspired by ReasoningBank and Recursive Language Models, and builds on the PubMedQA dataset.
Views are strictly my own. Experiments based only on public datasets.
Published: October 27, 2025