Back to Resources
Technical

How We Built Cross-Database Query Federation at Skopx

Sarah Chen
February 18, 2026
11 min read

How We Built Cross-Database Query Federation at Skopx

Cross-database query federation is the ability to answer questions that span multiple databases, APIs, and data sources in a single query, without requiring data to be copied into a central warehouse. At Skopx, users connect PostgreSQL databases, GitHub repositories, Jira boards, Slack channels, and more, then ask questions like "Which customers mentioned in Slack also have open Jira tickets and declining usage in our database?" The federation engine resolves this across three sources in under 4 seconds.

What Is Query Federation?

Query federation is a technique that executes parts of a query against different data sources and combines the results at the application layer. Unlike ETL (Extract, Transform, Load), which copies all data into a single warehouse, federation queries data in place. This eliminates data staleness, reduces storage costs, and avoids the security complexity of maintaining copies of sensitive data in a central location.

The trade-off is performance, federated queries are slower than queries against pre-joined warehouse tables because they involve multiple network round trips. Our architecture accepts this trade-off because freshness and security matter more than sub-second latency for analytical questions.

How Does the Federation Architecture Work?

Our federation engine has four components: the query planner, source adapters, the join engine, and the result synthesizer.

Query Planner. When a user asks a cross-source question, Claude decomposes it into sub-queries targeted at specific sources. For the example above, the planner generates: (1) search Slack for customer mentions, (2) query Jira for open tickets by those customers, (3) query PostgreSQL for usage trends for those customers.

The planner determines execution order based on result set size estimation. We start with the source likely to return the fewest results (the "driving source") to minimize downstream query fan-out. In practice, the driving source is usually the most constrained, a Slack search for a specific keyword returns fewer results than a full Jira board scan.

// Query plan decomposition
interface FederatedQueryPlan {
  steps: QueryStep[];
  joinKeys: JoinKey[];
  executionOrder: number[];
}

interface QueryStep {
  sourceId: string;
  sourceType: 'postgresql' | 'github' | 'jira' | 'slack' | 'notion';
  query: string;        // source-native query
  outputSchema: Field[];
  estimatedRows: number;
}

async function planFederatedQuery(
  question: string,
  sources: ConnectedSource[]
): Promise<FederatedQueryPlan> {
  const plan = await claude.generate({
    prompt: buildPlannerPrompt(question, sources),
    schema: FederatedQueryPlanSchema
  });

  // Order by estimated result size (smallest first)
  plan.executionOrder = plan.steps
    .map((s, i) => ({ index: i, estimate: s.estimatedRows }))
    .sort((a, b) => a.estimate - b.estimate)
    .map(s => s.index);

  return plan;
}

Source Adapters. Each data source has a typed adapter that translates the planner's abstract query into source-native operations. The PostgreSQL adapter generates SQL. The GitHub adapter uses the GraphQL API. The Jira adapter uses JQL. The Slack adapter uses the search API. Each adapter returns results in a normalized format with typed columns and metadata.

Join Engine. The join engine combines results from multiple sources using the join keys identified by the query planner. Joins are performed in-memory at the application layer. For small result sets (under 10,000 rows per source), this is fast, median join time is 12ms. For larger result sets, we push filtering to the source adapters to reduce the data transferred.

The join engine supports four join types: inner join (only matching records), left join (all records from the driving source), entity resolution join (fuzzy matching on names/emails using string similarity), and temporal join (matching events within a time window).

Result Synthesizer. After joining, Claude synthesizes the combined data into a natural language answer with citations to each source. The synthesizer includes specific numbers, links to source records, and confidence indicators.

How Does Entity Resolution Work Across Sources?

The hardest part of federation is entity resolution, determining that "Acme Corp" in your database, "acme-corp" in GitHub, and "Acme Corporation" in Jira all refer to the same entity. We use a multi-signal matching approach.

First, we check for exact ID matches, if a Jira ticket references a GitHub PR number, that is a definitive link. Second, we use normalized string matching with Levenshtein distance for entity names, with a similarity threshold of 0.85. Third, we use email addresses as universal identifiers when available, a user's email in Slack matches their email in Jira.

Our EntityGraph stores resolved entity relationships persistently, so entity resolution improves over time. After a user confirms that "Acme Corp" and "Acme Corporation" are the same entity, future queries use this mapping automatically. The entity resolution accuracy is 94% for previously-seen entities and 78% for novel entities.

What Are the Latency Characteristics?

Federated query latency depends on the number of sources and the slowest source response time. Typical breakdown:

StageLatency
Query planning (Claude)400-800ms
Source queries (parallel)200-2000ms
Join engine5-50ms
Result synthesis (Claude)300-600ms
Total1.2-3.8s

Source query latency varies widely. PostgreSQL queries typically return in 50-200ms. GitHub GraphQL API calls take 200-500ms. Jira API calls take 300-800ms (Jira's API is consistently the slowest). Slack search takes 400-1200ms.

We execute source queries in parallel when there are no data dependencies between them. For the example query ("customers mentioned in Slack with open Jira tickets and declining usage"), the Slack search runs first (driving source), then Jira and PostgreSQL queries run in parallel using the customer list from Slack.

How Do You Handle Source Failures?

In a federated system, any source can fail, an API rate limit, a network timeout, or a database connection error. Our failure handling strategy is graceful degradation rather than total failure.

If a non-essential source fails, we return partial results with a clear indication of what is missing. "I found 3 customers mentioned in Slack with declining database usage, but I could not check Jira for open tickets because the Jira API returned a rate limit error. Here are the results from the available sources."

If the driving source fails, the entire query fails because we cannot determine the entity set to join against. In this case, we suggest an alternative query formulation that uses a different driving source.

We implement retry with exponential backoff for transient errors (rate limits, timeouts) and circuit breakers for persistent failures. The circuit breaker opens after 3 consecutive failures to a source and closes after a successful health check 5 minutes later.

How Do You Maintain Security Across Sources?

Each source query executes with the user's own credentials. When a user connects a GitHub repository, they authenticate with their GitHub OAuth token, and that token (encrypted with AES-256-CBC) is what we use for API calls. This means the user can only access repositories, tickets, and channels they have permission to access in the original tool.

We never elevate permissions during federation. If a user cannot see a private Slack channel in Slack itself, they cannot see its contents through Skopx. This per-source permission model is simpler and more secure than trying to maintain a separate authorization layer.

Key Takeaways

Cross-database query federation enables powerful cross-source analytics without centralized data warehousing. The critical architecture decisions are: AI-powered query decomposition for flexible sub-query generation, smallest-first execution ordering to minimize fan-out, parallel source queries for latency optimization, and graceful degradation for partial failures. Entity resolution across sources remains the hardest technical challenge, fuzzy name matching gets you to 78% accuracy, but persistent entity mappings with user confirmation push this to 94% over time.

Share this article

Sarah Chen

Contributing writer at Skopx

Stay Updated

Get the latest insights on AI-powered code intelligence delivered to your inbox.