Source-to-Staging Bridge

The manual gap between Airbyte/Fivetran and dbt that every data engineer fills by hand, every single time

Highest Frequency Gap Severity Drops 9→7 Automatable

The Problem

VP of Sales says "I need HubSpot data in Looker by Friday." The junior DE enables Airbyte's HubSpot connector. It syncs 47 streams into BigQuery, creating tables like:

raw_hubspot.deals
raw_hubspot.deals_properties_hs_deal_amount
raw_hubspot.deals_associations_company_ids
raw_hubspot.contacts
raw_hubspot.contacts_properties_hs_email
... (42 more tables)

Now the DE needs to turn these into clean dbt staging models. This means:

  • Figuring out which of 47 tables are actually needed
  • Renaming deals_properties_hs_deal_amount to deal_amount
  • Casting string amounts to numeric (SUM(amount) fails on strings)
  • Joining owner IDs to names (otherwise dashboard shows "12345" not "Sarah Chen")
  • Writing schema.yml with column descriptions and tests

This takes 4-8 hours of manual work. Every time. For every new source.

Evidence from Multi-Agent Simulation

9→7
Severity reduction with tooling
4-8h
Manual mapping time per source
2-3x/mo
New source integration frequency
47
Tables from a single HubSpot sync
Simulation Finding — HubSpot Scenario

This was the only scenario where severity dropped from 9 to 7 with tooling. In all other scenarios (schema drift, performance, migration), severity stayed at 8-9 regardless of tools. This means the problem is genuinely solvable with automation.

Observer Identified Breakpoints
  • "The transition point between the Ingestion layer (Airbyte) and the Transformation layer (dbt) is a 'black box' for the engineer"
  • "Mapping Airbyte's auto-generated, flattened column names to dbt staging models" — estimated 2-4 hours, automation potential: HIGH
  • "Accepting that 'hitting the button' in Airbyte is a high-risk, scary event"

Before and After

Without Bridge (current state)

-- Manual: 4-8 hours per source
-- Step 1: Figure out which tables exist
SELECT table_name FROM
  information_schema.tables
  WHERE dataset_id = 'raw_hubspot';
-- (returns 47 tables, most unknown)

-- Step 2: Inspect each table schema
-- Step 3: Guess which columns matter
-- Step 4: Write staging model by hand
SELECT
  id AS deal_id,
  -- Wait, is it 'amount' or
  -- 'properties_hs_deal_amount'?
  -- And it's a STRING??
  CAST(properties_hs_deal_amount
    AS NUMERIC) AS deal_amount,
  -- How do I get owner name...?
  properties_hubspot_owner_id
    AS owner_id  -- shows "12345"
FROM raw_hubspot.deals

With Bridge (auto-generated)

-- Auto-generated in 30 seconds
-- Source: raw_hubspot.deals
-- Relevant streams detected: 5 of 47

SELECT
  id AS deal_id,
  CAST(properties_hs_deal_amount
    AS NUMERIC) AS deal_amount,
  properties_dealstage AS deal_stage,
  properties_closedate AS closed_at,
  o.owner_name,
  a.company_id
FROM raw_hubspot.deals d
LEFT JOIN raw_hubspot.owners o
  ON d.properties_hubspot_owner_id
   = o.owner_id
LEFT JOIN raw_hubspot
  .deals_associations_company_ids a
  ON d.id = a.deal_id

-- schema.yml also generated with:
--   column descriptions
--   not_null tests on deal_id
--   accepted_values on deal_stage
--   positive value test on deal_amount

Why This Gap Exists

Airbyte's job ends at "data landed in warehouse." It doesn't know what dbt needs.

dbt's job starts at "here's my source table." It doesn't know what Airbyte created.

The gap in between — understanding the raw schema, cleaning names, fixing types, adding joins, generating tests — is 100% manual. This is true for Fivetran too.


Existing ToolWhat It DoesWhat It Doesn't Do
AirbyteSyncs data to warehouseDoesn't generate dbt models
Fivetran dbt packagesPre-built models for some sourcesOnly covers ~30 sources; doesn't handle custom properties
dbt codegenGenerates base model from sourceNo smart renaming, no type detection, no join inference
Source-to-Staging BridgeReads raw schema + sample data → generates complete staging model with smart naming, type casting, join inference, and tests

Product Design

Trigger

New raw table detected in BigQuery/Snowflake landing schema (via information_schema polling or event trigger).

Workflow

  1. Schema scan: Read table schema + sample 100 rows
  2. Stream triage: Classify tables as core/association/metadata/irrelevant
  3. Smart rename: deals_properties_hs_deal_amountdeal_amount
  4. Type fix: Detect string-encoded numbers, dates, booleans → generate CAST
  5. Join inference: Detect foreign keys (owner_id → owners table, association tables)
  6. Test generation: not_null on PKs, positive on amounts, accepted_values on enums
  7. Output: staging model SQL + schema.yml + PR-ready commit

Market Sizing

50K+
Companies using Airbyte/Fivetran + dbt
2-3x
New sources per month per team
$300-600
Engineering cost per manual integration

Revenue Model

TierPriceFeatures
Free (Agent Skill)$0Generate staging models on-demand via CLI
Pro$49/moContinuous monitoring for new tables + auto-PR
Team$199/moMulti-source, naming convention enforcement, cost estimation
Enterprise$499/moCustom connectors, PII detection, compliance templates

Year 1 Projection

Competitive Advantage