Source-to-Staging Bridge — Business Case Report

The Problem

VP of Sales says "I need HubSpot data in Looker by Friday." The junior DE enables Airbyte's HubSpot connector. It syncs 47 streams into BigQuery, creating tables like:

raw_hubspot.deals
raw_hubspot.deals_properties_hs_deal_amount
raw_hubspot.deals_associations_company_ids
raw_hubspot.contacts
raw_hubspot.contacts_properties_hs_email
... (42 more tables)

Now the DE needs to turn these into clean dbt staging models. This means:

Figuring out which of 47 tables are actually needed
Renaming deals_properties_hs_deal_amount to deal_amount
Casting string amounts to numeric (SUM(amount) fails on strings)
Joining owner IDs to names (otherwise dashboard shows "12345" not "Sarah Chen")
Writing schema.yml with column descriptions and tests

This takes 4-8 hours of manual work. Every time. For every new source.

Evidence from Multi-Agent Simulation

9→7

Severity reduction with tooling

4-8h

Manual mapping time per source

2-3x/mo

New source integration frequency

Tables from a single HubSpot sync

Simulation Finding — HubSpot Scenario

This was the only scenario where severity dropped from 9 to 7 with tooling. In all other scenarios (schema drift, performance, migration), severity stayed at 8-9 regardless of tools. This means the problem is genuinely solvable with automation.

Observer Identified Breakpoints

"The transition point between the Ingestion layer (Airbyte) and the Transformation layer (dbt) is a 'black box' for the engineer"
"Mapping Airbyte's auto-generated, flattened column names to dbt staging models" — estimated 2-4 hours, automation potential: HIGH
"Accepting that 'hitting the button' in Airbyte is a high-risk, scary event"

Before and After

Without Bridge (current state)

-- Manual: 4-8 hours per source
-- Step 1: Figure out which tables exist
SELECT table_name FROM
  information_schema.tables
  WHERE dataset_id = 'raw_hubspot';
-- (returns 47 tables, most unknown)

-- Step 2: Inspect each table schema
-- Step 3: Guess which columns matter
-- Step 4: Write staging model by hand
SELECT
  id AS deal_id,
  -- Wait, is it 'amount' or
  -- 'properties_hs_deal_amount'?
  -- And it's a STRING??
  CAST(properties_hs_deal_amount
    AS NUMERIC) AS deal_amount,
  -- How do I get owner name...?
  properties_hubspot_owner_id
    AS owner_id  -- shows "12345"
FROM raw_hubspot.deals

With Bridge (auto-generated)

-- Auto-generated in 30 seconds
-- Source: raw_hubspot.deals
-- Relevant streams detected: 5 of 47

SELECT
  id AS deal_id,
  CAST(properties_hs_deal_amount
    AS NUMERIC) AS deal_amount,
  properties_dealstage AS deal_stage,
  properties_closedate AS closed_at,
  o.owner_name,
  a.company_id
FROM raw_hubspot.deals d
LEFT JOIN raw_hubspot.owners o
  ON d.properties_hubspot_owner_id
   = o.owner_id
LEFT JOIN raw_hubspot
  .deals_associations_company_ids a
  ON d.id = a.deal_id

-- schema.yml also generated with:
--   column descriptions
--   not_null tests on deal_id
--   accepted_values on deal_stage
--   positive value test on deal_amount

Why This Gap Exists

Airbyte's job ends at "data landed in warehouse." It doesn't know what dbt needs.

dbt's job starts at "here's my source table." It doesn't know what Airbyte created.

The gap in between — understanding the raw schema, cleaning names, fixing types, adding joins, generating tests — is 100% manual. This is true for Fivetran too.

Existing Tool	What It Does	What It Doesn't Do
Airbyte	Syncs data to warehouse	Doesn't generate dbt models
Fivetran dbt packages	Pre-built models for some sources	Only covers ~30 sources; doesn't handle custom properties
dbt codegen	Generates base model from source	No smart renaming, no type detection, no join inference
Source-to-Staging Bridge	Reads raw schema + sample data → generates complete staging model with smart naming, type casting, join inference, and tests

Product Design

Trigger

New raw table detected in BigQuery/Snowflake landing schema (via information_schema polling or event trigger).

Workflow

Schema scan: Read table schema + sample 100 rows
Stream triage: Classify tables as core/association/metadata/irrelevant
Smart rename: deals_properties_hs_deal_amount → deal_amount
Type fix: Detect string-encoded numbers, dates, booleans → generate CAST
Join inference: Detect foreign keys (owner_id → owners table, association tables)
Test generation: not_null on PKs, positive on amounts, accepted_values on enums
Output: staging model SQL + schema.yml + PR-ready commit

Market Sizing

50K+

Companies using Airbyte/Fivetran + dbt

2-3x

New sources per month per team

$300-600

Engineering cost per manual integration

Revenue Model

Tier	Price	Features
Free (Agent Skill)	$0	Generate staging models on-demand via CLI
Pro	$49/mo	Continuous monitoring for new tables + auto-PR
Team	$199/mo	Multi-source, naming convention enforcement, cost estimation
Enterprise	$499/mo	Custom connectors, PII detection, compliance templates

Year 1 Projection

Free tier: 5000+ installs (marketplace + SkillsMP distribution)
Pro conversion: 5% = 250 customers × $49/mo = $147K ARR
Team/Enterprise: 20 customers × $300/mo avg = $72K ARR
Total: ~$220K ARR Year 1

Competitive Advantage

vs dbt codegen: codegen generates a 1:1 column dump. We generate smart, production-ready models with joins, types, and tests.
vs Fivetran dbt packages: Only covers ~30 popular sources. We work with ANY Airbyte/Fivetran connector including custom ones.
vs manual work: 30 seconds vs 4-8 hours. No contest.
Distribution: Published as an agent skill — zero friction to try. Works in Claude Code, Codex CLI, Gemini CLI.