
Scaling AI Analytics to Handle 1 Million Queries Per Day

Mike Johnson
March 5, 2026
11 min read

Scaling AI analytics is the engineering challenge of serving millions of AI-powered queries per day while maintaining sub-second latency, high accuracy, and reasonable cost. At Skopx, we scaled from 1,000 queries per day to our architecture target of 1 million by addressing four bottlenecks: LLM API throughput, database connection pooling, vector search performance, and intelligent caching. This post covers each bottleneck and the solutions we implemented on our Next.js and Supabase stack hosted on Railway.

What Are the Scaling Bottlenecks in AI Analytics?

AI analytics platforms have a fundamentally different scaling profile from traditional web applications. A typical web request reads from a database and returns JSON; total compute time stays under 50ms. An AI analytics request involves LLM API calls (200-800ms), vector search (50-150ms), database queries (50-300ms), and response synthesis (200-500ms). Each request consumes 10-50x more compute than a traditional API call.

The dominant cost is LLM API usage. At 1 million queries per day with an average cost of $0.003 per query (using a mix of Haiku and Sonnet), the daily LLM cost alone is $3,000. Without optimization, this makes the business model unsustainable. Our goal was to reduce per-query cost to under $0.001 while maintaining quality.

How Does the Caching Strategy Work?

Caching is the single most impactful optimization. We implement three caching layers: semantic query cache, schema cache, and embedding cache.

Semantic Query Cache. Many users ask similar questions in different words. "How many active users?" and "What's our active user count?" should return the same cached result. We hash the question's embedding vector (rounded to 4 decimal places) to create a semantic cache key. If a new question's embedding is within cosine distance 0.05 of a cached question, we return the cached result. This provides a cache hit rate of 34%, meaning one-third of all queries are served without any LLM call or database query.

// Semantic query cache. embedQuestion, roundedEmbeddingHash, redis,
// and vectorCache are service-level helpers defined elsewhere.
interface CachedResult {
  answer: string;
  citations: Citation[];
  timestamp: Date;
  ttl: number; // seconds
  hitCount: number;
}

async function checkSemanticCache(
  question: string,
  orgId: string
): Promise<CachedResult | null> {
  const embedding = await embedQuestion(question);
  const cacheKey = roundedEmbeddingHash(embedding, 4);

  // Check exact hash match first (sub-1ms)
  const exact = await redis.get(`cache:${orgId}:${cacheKey}`);
  if (exact) return JSON.parse(exact);

  // Check semantic similarity for near-matches (5-10ms)
  const nearMatch = await vectorCache.search(embedding, {
    filter: { orgId },
    threshold: 0.95, // cosine similarity
    limit: 1
  });

  if (nearMatch.length > 0) {
    return JSON.parse(nearMatch[0].metadata.result);
  }

  return null;
}

Schema Cache. Database schema introspection is expensive: querying information_schema takes 50-200ms. We cache schema metadata in memory with a 15-minute TTL. Schema changes are rare (most databases change schema less than once per day), so this cache has a 99.8% hit rate.
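The schema cache can be sketched as a simple TTL map. This is a minimal illustration, not our production internals; getSchema and the introspection callback are hypothetical names, and the real introspection query is abstracted behind the callback.

```typescript
// In-memory schema cache with a 15-minute TTL, keyed by connection ID.
// The introspect callback stands in for the real information_schema query.
type SchemaMetadata = { tables: Record<string, string[]> };

const SCHEMA_TTL_MS = 15 * 60 * 1000;

interface SchemaEntry {
  schema: SchemaMetadata;
  fetchedAt: number;
}

const schemaCache = new Map<string, SchemaEntry>();

async function getSchema(
  connectionId: string,
  introspect: (id: string) => Promise<SchemaMetadata>,
  now: () => number = Date.now
): Promise<SchemaMetadata> {
  const entry = schemaCache.get(connectionId);
  if (entry && now() - entry.fetchedAt < SCHEMA_TTL_MS) {
    return entry.schema; // cache hit: no introspection query runs
  }
  const schema = await introspect(connectionId); // 50-200ms on a miss
  schemaCache.set(connectionId, { schema, fetchedAt: now() });
  return schema;
}
```

Because schema changes are rare, the 15-minute staleness window is an acceptable trade for skipping the introspection round-trip on nearly every request.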

Embedding Cache. Pre-computed embeddings for the user's codebase and documents are stored in ChromaDB and refreshed incrementally on changes. Query-time embedding (converting the user's question to a vector) is cached for 5 minutes, since users often refine the same question multiple times.

How Does Connection Pooling Work at Scale?

Each user can connect multiple external databases to Skopx. At 10,000 concurrent users with an average of 2 connected databases each, we need to manage 20,000+ database connections. Direct connections to this many databases would exhaust file descriptors, memory, and network sockets.

We use a tiered connection pooling strategy. For Supabase databases (89% of our connections), we route through Supabase's session pooler at pooler.supabase.com port 5432, which handles connection multiplexing on their end. For direct PostgreSQL connections, we maintain a connection pool with a maximum of 5 connections per database, with an idle connection timeout of 30 seconds and a maximum connection lifetime of 10 minutes.
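The tiered routing decision can be sketched as a pure config function. The ".supabase.co" host check and the field names are illustrative; a real pool library (e.g. node-postgres) would consume an equivalent configuration.

```typescript
// Tiered pooling: Supabase-hosted databases route through Supabase's shared
// session pooler; direct PostgreSQL gets a small per-database pool.
interface PoolConfig {
  host: string;
  port: number;
  max: number;            // connection cap for this pool
  idleTimeoutMs: number;  // close connections idle longer than this
  maxLifetimeMs: number;  // recycle connections older than this
}

function poolConfigFor(db: { host: string; port: number }): PoolConfig {
  if (db.host.endsWith(".supabase.co")) {
    // Multiplexing happens on Supabase's side, so we point at the pooler.
    return {
      host: "pooler.supabase.com",
      port: 5432,
      max: 5,
      idleTimeoutMs: 30_000,
      maxLifetimeMs: 10 * 60_000,
    };
  }
  return {
    host: db.host,
    port: db.port,
    max: 5,                     // max 5 direct connections per database
    idleTimeoutMs: 30_000,      // 30-second idle timeout
    maxLifetimeMs: 10 * 60_000, // 10-minute max connection lifetime
  };
}
```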

The connection pool is managed per-process on Railway. With 4 application processes, each process manages its own pool. We use consistent hashing on the user ID to route requests to the same process, which improves cache hit rates and reduces pool fragmentation.
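In sketch form, the routing is a stable hash of the user ID modulo the process count. This is a minimal stand-in for the idea: a true consistent-hash ring would limit reshuffling when the process count changes, and the FNV-1a choice here is illustrative.

```typescript
// Stable routing of users to app processes: FNV-1a hash of the user ID
// modulo the process count. The same user always lands on the same process,
// so its in-process caches and pools stay warm.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function processFor(userId: string, processCount = 4): number {
  return fnv1a(userId) % processCount;
}
```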

Connection Architecture:
User Request → Railway Load Balancer → App Process (1 of 4)
  → Supabase Pooler (for Supabase DBs) → PostgreSQL
  → Direct Pool (max 5 per DB) → External PostgreSQL
  → ChromaDB (embedded, per-process)
  → Claude API (HTTP, connection reuse)

How Do You Optimize LLM Costs?

LLM API costs are the dominant expense. We reduce them through four techniques.

Model routing. Not every query needs the most expensive model. We classify queries into three tiers: simple queries (direct lookups, single-table queries) use Claude 3.5 Haiku at $0.0003 per query; moderate queries (multi-table joins, basic analysis) use Claude 3.5 Sonnet at $0.003 per query; and complex queries (cross-source federation, deep architectural analysis) use Claude 3.5 Sonnet with extended thinking at $0.008 per query. The classifier itself runs on Haiku and costs $0.0001 per classification.

In practice, 55% of queries are simple, 35% are moderate, and 10% are complex. This distribution plus semantic caching brings the average per-query cost to $0.0008, below our $0.001 target.
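The routing table can be expressed as data. This is a sketch: the model labels are shorthand for the models named above, the avgCostUsd values are the per-query averages we quote rather than exact API pricing, and the tier is assumed to come from the Haiku classifier.

```typescript
// Three-tier model routing table. The tier classification itself is done
// upstream by a cheap Haiku call; this just maps tier -> model choice.
type Tier = "simple" | "moderate" | "complex";

interface Route {
  model: string;
  extendedThinking?: boolean;
  avgCostUsd: number; // observed average cost per query for this tier
}

const MODEL_ROUTES: Record<Tier, Route> = {
  simple:   { model: "claude-3.5-haiku",  avgCostUsd: 0.0003 },
  moderate: { model: "claude-3.5-sonnet", avgCostUsd: 0.003 },
  complex:  { model: "claude-3.5-sonnet", extendedThinking: true, avgCostUsd: 0.008 },
};

function routeQuery(tier: Tier): Route {
  return MODEL_ROUTES[tier];
}
```

Keeping the routing as data rather than branching logic makes it easy to audit when the per-query cost alert fires.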

Prompt compression. Schema metadata, few-shot examples, and system instructions are compressed using a technique we call "structured abbreviation." Instead of sending full CREATE TABLE statements, we send a compact notation: users(id:uuid, email:text, created_at:ts, org_id:uuid→orgs.id). This reduces schema tokens by 60% with no measurable accuracy loss.
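A minimal sketch of the structured abbreviation, producing the compact notation shown above. The type-alias table and helper names are illustrative; the production version covers many more PostgreSQL types.

```typescript
// Structured abbreviation: each table becomes one compact line instead of a
// full CREATE TABLE statement. Foreign keys use the arrow notation.
interface Column {
  name: string;
  type: string;
  references?: string; // e.g. "orgs.id" for a foreign key
}

const TYPE_ABBREV: Record<string, string> = {
  timestamptz: "ts",
  timestamp: "ts",
  "character varying": "text",
};

function compressTable(table: string, columns: Column[]): string {
  const cols = columns.map((c) => {
    const type = TYPE_ABBREV[c.type] ?? c.type;
    const fk = c.references ? `→${c.references}` : "";
    return `${c.name}:${type}${fk}`;
  });
  return `${table}(${cols.join(", ")})`;
}
```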

Response streaming. Streaming does not reduce cost, but it reduces perceived latency. Users see the first tokens within 200ms, even for complex queries that take 3-5 seconds total. This dramatically improves the user experience at scale.

Batch processing for insights. The insight generation pipeline runs on a cron schedule (every 15 minutes) rather than in real-time. This allows us to batch multiple metrics into a single LLM call, reducing per-insight cost from $0.003 to $0.001.
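The batching step can be sketched as grouping pending metrics and rendering each group into a single prompt. The metric shape, batch size, and prompt wording are illustrative, not our exact pipeline.

```typescript
// Group pending metrics so one LLM call covers a whole batch instead of
// one call per metric.
interface Metric {
  name: string;
  value: number;
  change: number; // percent change since last window
}

function batchMetrics<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

function insightPrompt(batch: Metric[]): string {
  const lines = batch.map(
    (m) => `- ${m.name}: ${m.value} (${m.change >= 0 ? "+" : ""}${m.change}%)`
  );
  return `Summarize notable changes in these metrics:\n${lines.join("\n")}`;
}
```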

How Does Horizontal Scaling Work on Railway?

Railway supports horizontal scaling through multiple service instances behind a load balancer. Our application is stateless: all persistent state lives in Supabase PostgreSQL, and ChromaDB runs as an embedded database within each process with synchronized data.

We scale based on two signals: request queue depth (scale up when the queue exceeds 50 requests) and P95 latency (scale up when P95 exceeds 3 seconds). Scale-down is more conservative: we wait for 10 minutes of low utilization before removing an instance, to avoid thrashing during periodic traffic spikes.

Each Railway instance handles approximately 50 concurrent requests, limited primarily by the memory required for ChromaDB indexes (approximately 512MB per instance for a typical workload). At 1 million queries per day with an average request duration of 1.5 seconds, we need approximately 18 concurrent requests on average, but peak traffic (3x average) requires 54 concurrent requests, which means two Railway instances.
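The capacity math above is Little's law (concurrency = arrival rate x request duration), worked through here with the same figures:

```typescript
// Capacity estimate via Little's law: in-flight requests = qps * duration.
const QUERIES_PER_DAY = 1_000_000;
const AVG_DURATION_S = 1.5;
const PEAK_MULTIPLIER = 3;
const CONCURRENCY_PER_INSTANCE = 50;

const qps = QUERIES_PER_DAY / 86_400;                  // ~11.6 queries/second
const avgConcurrent = Math.ceil(qps * AVG_DURATION_S); // ~18 requests in flight
const peakConcurrent = avgConcurrent * PEAK_MULTIPLIER; // 54 at a 3x peak
const instances = Math.ceil(peakConcurrent / CONCURRENCY_PER_INSTANCE); // 2
```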

How Do You Monitor Performance at Scale?

We track four key metrics: P50/P95/P99 latency, cache hit rate, LLM cost per query, and error rate. These are emitted as structured logs from every request and aggregated in real-time.

Alert thresholds: P95 latency above 5 seconds triggers investigation. Cache hit rate below 25% triggers cache configuration review. Per-query cost above $0.002 triggers model routing audit. Error rate above 2% triggers immediate incident response.
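Those thresholds can be expressed as a small evaluator the aggregator runs over each metrics window. The window shape and action names are illustrative; only the threshold values come from the alerting rules above.

```typescript
// Evaluate a metrics window against the alert thresholds described above.
interface MetricsWindow {
  p95LatencyMs: number;
  cacheHitRate: number;    // 0..1
  costPerQueryUsd: number;
  errorRate: number;       // 0..1
}

function triggeredAlerts(w: MetricsWindow): string[] {
  const alerts: string[] = [];
  if (w.p95LatencyMs > 5_000) alerts.push("investigate-latency");
  if (w.cacheHitRate < 0.25) alerts.push("review-cache-config");
  if (w.costPerQueryUsd > 0.002) alerts.push("audit-model-routing");
  if (w.errorRate > 0.02) alerts.push("incident-response");
  return alerts;
}
```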

We also track "quality at scale": whether accuracy degrades as traffic increases. Weekly sampling of 200 responses across different traffic levels has shown no statistically significant accuracy degradation between low-traffic and high-traffic periods, confirming that our caching and model routing strategies maintain quality.

Key Takeaways

Scaling AI analytics requires optimizing across four dimensions simultaneously: LLM costs (model routing and semantic caching reduce per-query cost by 73%), database connections (pooling through Supabase and tiered connection management), vector search (embedded ChromaDB with incremental updates), and compute (stateless horizontal scaling on Railway). The single highest-impact optimization is semantic query caching, which eliminates LLM calls entirely for 34% of queries. The key architectural principle is that AI workloads are compute-heavy but stateless, which makes horizontal scaling straightforward once you externalize all state to the database layer.

Mike Johnson

Contributing writer at Skopx
