Building an AI Learning Engine That Improves With Every Conversation
An AI learning engine is a system that automatically improves its responses over time by capturing user feedback, discovering effective patterns, and adapting its behavior without retraining the underlying model. Skopx's learning engine has improved user satisfaction scores from 3.6/5 to 4.3/5 over three months by learning 847 patterns from 12,000+ feedback signals, all without fine-tuning or retraining Claude. This post details the architecture, the algorithms, and the hard lessons learned.
What Is a Learning Engine for AI Applications?
A learning engine for AI applications is a meta-system that sits on top of a language model and modifies how that model is prompted, what context it receives, and which tools it uses, based on accumulated evidence of what works. Unlike model fine-tuning, which requires expensive retraining on curated datasets, a learning engine operates at the prompt engineering layer. It discovers that "users in this organization prefer SQL query results as tables rather than prose" and adjusts future prompts accordingly.
This approach was inspired by research in automated experimentation and its core loop: experiment, measure, keep or discard, and accumulate. We applied this principle to every aspect of the AI response pipeline: prompt strategies, query styles, tool selection, response formats, and insight thresholds.
How Does the Feedback Collection System Work?
Feedback enters the system through two channels: explicit feedback (thumbs up/down buttons on every response) and implicit signals (copy events, follow-up questions, time spent reading, and insight acknowledgments).
Explicit feedback is high-signal but sparse: only 8% of responses receive a thumbs up or down. Implicit signals are noisier but abundant, so we weight the two channels differently: explicit thumbs up = +1.0, explicit thumbs down = -1.0, copy event = +0.3 (the user found value worth copying), follow-up question within 30 seconds = -0.2 (the answer was incomplete), and time reading > 60 seconds on a short answer = -0.1 (the user was likely confused).
```typescript
// Feedback signal aggregation
interface FeedbackSignal {
  responseId: string;
  signalType: 'thumbs_up' | 'thumbs_down' | 'copy' | 'follow_up' | 'dwell_time';
  weight: number; // resolved weight, stored alongside the signal for provenance
  timestamp: Date;
  userId: string;
  metadata: Record<string, unknown>;
}

function aggregateSignals(signals: FeedbackSignal[]): number {
  const weights: Record<FeedbackSignal['signalType'], number> = {
    thumbs_up: 1.0,
    thumbs_down: -1.0,
    copy: 0.3,
    follow_up: -0.2,
    dwell_time: -0.1, // only counted for unexpectedly long dwell
  };
  return signals.reduce((score, s) => score + weights[s.signalType], 0);
}
```
All feedback is stored in the response_feedback table in Supabase PostgreSQL with full provenance: which response it refers to, what prompt strategy was used, which tools were invoked, and the complete context of the interaction.
How Does Pattern Discovery Work?
The pattern discovery system runs every 6 hours as a learning run. It analyzes recent feedback to identify what prompt strategies, response formats, and tool selections correlate with positive outcomes.
The algorithm clusters recent responses by their characteristics (prompt template, tool set, response format, question type) and computes satisfaction scores for each cluster. When a cluster consistently outperforms the baseline, the system extracts the distinguishing characteristics as a learned pattern.
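The clustering step can be sketched roughly as follows. This is a hypothetical simplification of the learning run, assuming each response already carries an aggregated feedback score and the categorical characteristics described above; `minClusterSize` and `minLift` are illustrative thresholds, not the production values.

```typescript
// Hypothetical sketch: group responses by identical characteristic sets,
// then surface clusters that beat the overall baseline satisfaction.
interface ScoredResponse {
  characteristics: {
    promptTemplate: string;
    toolSet: string;
    format: string;
    questionType: string;
  };
  feedbackScore: number; // aggregated signal for this response
}

function findOutperformingClusters(
  responses: ScoredResponse[],
  minClusterSize = 20,
  minLift = 0.15
): Map<string, number> {
  const baseline =
    responses.reduce((sum, r) => sum + r.feedbackScore, 0) / responses.length;

  // Group responses that share the exact same characteristic set.
  const clusters = new Map<string, number[]>();
  for (const r of responses) {
    const key = JSON.stringify(r.characteristics);
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key)!.push(r.feedbackScore);
  }

  // Keep clusters that are large enough and beat the baseline by minLift.
  const winners = new Map<string, number>();
  for (const [key, scores] of clusters) {
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    if (scores.length >= minClusterSize && mean - baseline >= minLift) {
      winners.set(key, mean);
    }
  }
  return winners;
}
```

A winning cluster's shared characteristics are then extracted as the candidate pattern.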
For example, the system might discover that responses to database questions that include a SQL code block alongside the prose explanation receive 40% more positive feedback than prose-only responses. This becomes a pattern of type response_format with the rule "include SQL code blocks for database questions."
Pattern scoring uses an exponential moving average (EMA) with debiasing, the same technique used in the Adam optimizer. New patterns start with a warmup period during which their influence gradually increases. This prevents a pattern discovered from a single positive data point from immediately influencing all responses.
```typescript
// EMA with debiasing for pattern scoring.
// pattern.score holds the current EMA; pattern.useCount is the number of
// feedback signals folded in so far.
function updatePatternScore(
  pattern: LearnedPattern,
  newSignal: number
): number {
  const alpha = computeAlpha(pattern.useCount); // 0.15 → 0.05 warmup
  const rawEma = alpha * newSignal + (1 - alpha) * pattern.score;

  // Debiasing (as in the Adam optimizer)
  const biasCorrection = 1 - Math.pow(1 - alpha, pattern.useCount + 1);
  const debiasedScore = rawEma / biasCorrection;

  // Cautious update: reduce weight for contradictory signals
  if (Math.sign(newSignal) !== Math.sign(pattern.score) && pattern.useCount > 5) {
    return pattern.score + 0.5 * (debiasedScore - pattern.score);
  }
  return debiasedScore;
}

function computeAlpha(useCount: number): number {
  // Warmup: start responsive (0.15), become stable (0.05)
  const warmupSteps = 30;
  if (useCount < warmupSteps) {
    return 0.15 - (0.10 * useCount) / warmupSteps;
  }
  return 0.05;
}
```
What Types of Patterns Does the System Learn?
The learning engine tracks eight pattern types.
- Prompt strategy. Which system prompt variations produce better responses for different question types.
- Query style. SQL generation preferences (CTEs vs subqueries, explicit JOINs vs implicit).
- Tool preference. Which tools to invoke for different question categories.
- Response format. Tables vs prose, code blocks vs inline code, length preferences.
- Domain knowledge. Organization-specific terminology and concepts.
- Insight threshold. Per-metric anomaly detection sensitivity.
- Follow-up style. How to suggest follow-up questions (specific vs exploratory).
- Visualization preference. Chart types and data presentation styles.
Each pattern has a confidence score (0-1), a use count, a satisfaction score (EMA), and a minimum usage threshold before it becomes active. Patterns must accumulate at least 10 feedback signals before influencing responses, preventing premature pattern adoption.
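Put together, a stored pattern might look like the sketch below. The field names and the activation check are illustrative, not the actual schema; only the eight pattern types, the 0-1 confidence range, and the 10-signal minimum come from the text above.

```typescript
// Hypothetical shape of a stored pattern and its activation gate.
interface LearnedPattern {
  id: string;
  type:
    | 'prompt_strategy'
    | 'query_style'
    | 'tool_preference'
    | 'response_format'
    | 'domain_knowledge'
    | 'insight_threshold'
    | 'follow_up_style'
    | 'visualization_preference';
  confidence: number;   // 0-1
  useCount: number;     // how often the pattern has been applied
  score: number;        // satisfaction EMA
  feedbackCount: number;
  minFeedback: number;  // minimum signals before activation (10 here)
}

// A pattern only influences responses once it has enough evidence behind it.
function isActive(p: LearnedPattern): boolean {
  return p.feedbackCount >= p.minFeedback && p.confidence >= 0.3;
}
```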
How Does the System Prevent Degradation?
The most dangerous failure mode of a learning system is regression: learning a bad pattern that degrades quality for many users. We implement three safety mechanisms.
Crash detection. Every learning run computes the overall satisfaction score across all responses. If satisfaction drops more than 30% compared to the previous run's score, the system triggers a "crash": it reverts the most recently activated patterns and alerts the engineering team. This is analogous to NaN loss detection in ML training.
Pattern warmdown. When a pattern's satisfaction score drops below the minimum threshold, it enters a warmdown phase where its influence is gradually reduced over 5 learning runs rather than being immediately disabled. This prevents oscillation where a pattern is repeatedly enabled and disabled based on noisy short-term feedback.
Simplicity criterion. Patterns that have not been used in 30 days or have a confidence score below 0.3 are pruned automatically. We prefer fewer, higher-quality patterns over many marginal ones. This is inspired by the principle that deletion is often more valuable than addition in mature systems.
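The first two mechanisms reduce to two small checks, sketched here under stated assumptions: the function names and the revert hook are illustrative, and only the 30% drop threshold and the 5-run warmdown come from the text above.

```typescript
// Illustrative sketch of crash detection and warmdown weighting.
const CRASH_DROP_THRESHOLD = 0.30;
const WARMDOWN_RUNS = 5;

// True when overall satisfaction fell more than 30% versus the prior run.
function detectCrash(previousScore: number, currentScore: number): boolean {
  if (previousScore <= 0) return false; // no baseline yet
  return (previousScore - currentScore) / previousScore > CRASH_DROP_THRESHOLD;
}

// Linearly reduce a flagged pattern's influence to zero over 5 learning runs,
// rather than disabling it outright.
function warmdownWeight(runsSinceFlagged: number): number {
  return Math.max(0, 1 - runsSinceFlagged / WARMDOWN_RUNS);
}
```

The gradual warmdown is what prevents oscillation: a pattern penalized by a few noisy signals loses influence slowly enough that later evidence can still rehabilitate it.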
How Are Patterns Applied at Query Time?
When a user sends a question, the learning engine selects applicable patterns using a context-matching algorithm. Each pattern has a set of activation conditions (question type, data source, user preferences) and a weight based on its confidence score.
Selected patterns are injected into the prompt as additional instructions. For example, a high-confidence response format pattern might add: "For database questions from this organization, always include the SQL query in a code block, present results as a markdown table, and limit to 10 rows unless the user requests more."
We use a budget system for pattern injection: total pattern instructions cannot exceed 500 tokens, forcing the system to prioritize the highest-confidence, most relevant patterns. This is inspired by context packing techniques from ML training, where you maximize useful information within a fixed budget.
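A greedy version of the budget packing can be sketched as follows. The `estimateTokens` heuristic (roughly 4 characters per token) is a stand-in for a real tokenizer, and the greedy-by-confidence ordering is an assumption about the selection strategy.

```typescript
// Hypothetical sketch of budget-constrained pattern injection.
interface PatternInstruction {
  text: string;       // the instruction injected into the prompt
  confidence: number; // pattern confidence, 0-1
}

// Crude token estimate (~4 chars/token); a real tokenizer would go here.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function packPatterns(
  patterns: PatternInstruction[],
  budget = 500
): string[] {
  const selected: string[] = [];
  let used = 0;
  // Greedy: highest-confidence patterns first; skip anything that overflows.
  for (const p of [...patterns].sort((a, b) => b.confidence - a.confidence)) {
    const cost = estimateTokens(p.text);
    if (used + cost > budget) continue;
    selected.push(p.text);
    used += cost;
  }
  return selected;
}
```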
What Are the Results?
After three months of production use, the learning engine has produced measurable improvements.
| Metric | Before Learning | After Learning | Change |
|---|---|---|---|
| User satisfaction (1-5) | 3.6 | 4.3 | +19% |
| Thumbs-up rate | 12% | 21% | +75% |
| Follow-up question rate | 34% | 22% | -35% |
| Average response relevance | 3.8 / 5 | 4.4 / 5 | +16% |
| Patterns discovered | 0 | 847 | n/a |
| Active patterns (confidence > 0.5) | 0 | 312 | n/a |
| Crash events | n/a | 2 | Both recovered |
The two crash events were caused by a pattern that over-optimized for brevity (short responses got more thumbs-up, but users then asked more follow-up questions because the answers were incomplete). Crash detection reverted the pattern within one learning cycle (6 hours), and the system self-corrected.
How Does Per-Tenant Learning Work?
Patterns are learned per-organization, not globally. What works for a data engineering team (detailed SQL explanations, technical jargon) does not work for a product management team (high-level summaries, business metrics). Per-tenant learning ensures each organization's AI experience converges to their specific preferences.
The cost of per-tenant learning is negligible: pattern discovery uses Claude Haiku at approximately $0.001 per pattern, and all scoring and adaptation logic runs locally with zero AI cost. The primary cost is storage: 847 patterns across all tenants consume approximately 2MB of database storage.
Key Takeaways
Building a learning engine that genuinely improves AI quality requires four components: multi-channel feedback collection (explicit + implicit signals), statistical pattern discovery with EMA scoring and debiasing, safety mechanisms (crash detection, warmdown, simplicity pruning), and budget-constrained pattern application. The most important design decision was using cautious updates with momentum scheduling: responding quickly to clear signals while dampening noisy or contradictory feedback. The system has improved satisfaction by 19% with zero model retraining, demonstrating that prompt-layer learning is a viable and cost-effective alternative to fine-tuning.
Sarah Chen
Contributing writer at Skopx