How do you handle source system schema changes without breaking downstream pipelines?

Schema change handling is designed into the pipeline architecture, not bolted on after a break. For ELT pipelines, raw data lands in the warehouse in its source form, when a source adds a column, the raw table gains a column and the transformation models decide whether to use it. For ETL pipelines, we build schema validation checks that alert when a source deviates from expected structure before the load runs. Critical pipelines include automated schema drift detection so the team knows before a report breaks.

Should we use a managed ETL tool like Fivetran or Airbyte, or build custom pipelines?

Managed ETL tools (Fivetran, Airbyte) are the right starting point when your source systems are standard SaaS tools with supported connectors and your transformation requirements are straightforward. They handle connector maintenance and scheduling so you don't have to. Custom pipeline development makes sense when your source systems are custom databases or internal APIs without supported connectors, when data volume or transformation complexity exceeds what managed tools handle economically, or when you have compliance requirements that restrict data passing through third-party infrastructure. We'll give you an honest assessment of which approach fits before scoping anything.

How long does ETL pipeline development take?

A pipeline connecting 3 to 5 source systems with standard transformations and warehouse delivery typically takes 6 to 10 weeks. A more complex build with custom database connectors, CDC replication, and a full dbt transformation layer typically takes 10 to 16 weeks. Timeline depends on the number of sources, data quality issues in those sources, and transformation complexity. We assess all three during a scoping phase before committing to a timeline.

What happens when a source system is unavailable and the pipeline fails to extract?

Pipeline failure handling is a first-class part of the design. When extraction from a source fails, the pipeline logs the failure, skips the load step (preserving the last successful data state in the warehouse), and triggers an alert. Partial loads, where some records were extracted before a failure, are handled with transactional load patterns that commit only complete batches. Retry logic with backoff handles transient source unavailability without operator intervention.

ETL Pipeline Development | Snowflake BigQuery

Data that lives in source systems is not data you can use. ETL pipelines are what move it to where decisions get made.

ETL (extract, transform, load) and ELT (extract, load, transform) pipelines are the infrastructure that connects source systems, ERP, CRM, WMS, SaaS tools, databases, to your data warehouse or analytics layer. Without them, every report is a manual export and every business question requires someone to spend two days joining spreadsheets.

We design and build ETL/ELT pipelines that run on a defined schedule, handle source system changes gracefully, and deliver clean, consistent data to your warehouse or downstream consumers. Architecture, development, testing, and monitoring, scoped and priced as one engagement.

See our work

Batch ETL pipelines connecting ERP, CRM, SaaS tools, and custom databases
ELT architecture on Snowflake, BigQuery, or Redshift, raw data preserved, transformations versioned
Incremental load patterns that move only changed records rather than full table dumps
Pipeline monitoring with alerting when source schema changes or row counts deviate from expected

Recent outcomes

Voice AI · Research

Text-based interviews converted to automated phone calls

6× deeper insights

AI Automation · Ops

Manual invoice OCR across 40+ gas stations

20k+ txns day one

Loyalty · Retail

SuperValu & Centra loyalty platform with receipt validation

1,062 users in 4 weeks

SaaS · Logistics

Multi-carrier shipping hub for Indonesian eCommerce

2,000+ shipments yr 1

4.9 / 5 on ClutchSee all work

Recognition

Sound familiar?

How many hours per week does someone spend manually exporting data from one system and importing it into another?
When a source system changes its schema, does your pipeline fail silently or alert you before bad data reaches reports?

In short

RaftLabs builds ETL and ELT pipelines that connect ERP, CRM, SaaS tools, and databases to data warehouses on Snowflake, BigQuery, or Redshift. Batch and incremental pipelines, schema change handling, and pipeline monitoring. A pipeline covering 3 to 5 source systems with standard transformations typically costs $15,000 to $40,000 and delivers in 6 to 10 weeks. Custom database connectors and CDC pipelines run 10 to 16 weeks.

Trusted by

ETL and ELT pipelines are the infrastructure layer between your source systems and every report, dashboard, or ML model that depends on that data. Without a pipeline layer, analysts pull manual exports, join spreadsheets, and build one-off scripts that break when a source system changes. The result is reporting that lags, numbers that don't reconcile, and a data team that spends most of its time on data preparation rather than analysis.

Building that pipeline layer is an engineering project with real architecture decisions: which extraction pattern fits each source, how transformations are versioned and tested, what happens when a source is unavailable, and how the team gets alerted when something goes wrong. We scope that as a single engagement, architecture, development, testing, and monitoring, and deliver it at a fixed cost agreed before development starts.

Capabilities

What we build

Batch ETL pipeline development

Scheduled extraction from ERP, CRM, WMS, flat files, databases, and SaaS APIs with the orchestration infrastructure that makes each run reliable, observable, and recoverable when something goes wrong. Apache Airflow as the orchestration platform for pipelines requiring complex dependency graphs, custom operators, and rich monitoring, DAGs define the task dependencies, execution schedule, and retry behaviour; the Airflow UI provides run history, task-level logs, and re-run capability for failed runs without reprocessing already-loaded data. Prefect as an alternative for teams that prefer Python-native workflow definitions with less operational overhead than self-hosted Airflow, Prefect Cloud for managed orchestration at small to medium pipeline scale. Incremental load patterns for tables too large to re-extract fully on each run: watermark-based incremental extraction using the source table's updated_at or created_at column to fetch only records modified since the last successful run, with the high-water mark stored in a pipeline state table. Full-table extraction reserved for small lookup tables and dimension tables where insert-only semantics are not guaranteed and deletions need to be detected. SAP extraction via the SAP NetWeaver RFC SDK, calling BAPIs (Business Application Programming Interfaces) for master data and transaction data extraction, the standard SAP extraction approach that does not require direct database access to the SAP schema. Oracle E-Business Suite extraction via database views and the Oracle JDBC driver. Retry logic with exponential backoff for transient failures: a pipeline that fails due to network timeout retries at 5 minutes, 15 minutes, and 1 hour before alerting, reducing false-alarm escalations while surfacing genuine failures quickly.

ELT architecture on cloud warehouses

Raw data landing in Snowflake, BigQuery, or Redshift in its source form before any transformation runs in the warehouse, the ELT pattern that preserves full source fidelity and allows transformations to be rebuilt without re-extracting from source systems. Snowflake for teams prioritising near-zero storage cost on compressed columnar data, automatic multi-cluster scaling under heavy concurrent query load, and Time Travel for accessing historical data states for up to 90 days. BigQuery for Google Cloud environments and teams with variable query patterns, serverless query pricing (per-byte scanned) works well for intermittent workloads, with reservations purchased for predictable high-volume workloads. Redshift for AWS-native organisations with large volumes of structured data and teams that want tight integration with the AWS ecosystem. dbt (data build tool) as the transformation layer: SQL models in version-controlled .sql files with a {{ ref() }} dependency graph that dbt uses to determine execution order and compile a lineage DAG. Three-layer architecture standard in dbt projects: staging models (stg_*) rename and lightly cast source columns without business logic; intermediate models (int_*) join and aggregate across sources; mart models (fct_* and dim_*) produce the business-ready tables analysts query directly. Incremental materialisations for large fact tables: the is_incremental() Jinja block filters the transformation to process only new or updated records rather than the full table on each run, reducing warehouse compute cost by 80-95% on tables above 100 million rows. Column lineage documentation auto-generated from dbt model definitions so analysts can trace a dashboard metric back to the source column and the transformation logic that derived it.

SaaS and API data ingestion

Connectors for the SaaS tools that hold business data and expose it via REST or GraphQL APIs, built with the rate limiting, pagination, and incremental pull patterns that make high-volume production ingestion reliable. Salesforce extraction via the Bulk API 2.0 for large object volumes (Accounts, Contacts, Opportunities, Activities): the Bulk API handles 150 million records/day per API call without count against the REST API governor limits, processes jobs asynchronously so the connector doesn't block, and supports incremental queries via SOQL WHERE LastModifiedDate > :lastRunTimestamp for changed-record-only extraction. HubSpot CRM extraction via the HubSpot Contacts, Deals, Companies, and Engagements APIs with cursor-based pagination using the vidOffset or after cursor token to walk through paginated responses without missing records on reruns. Stripe financial data via the Stripe Payments, Subscriptions, and Invoices list endpoints with created[gt] timestamp filters for incremental extraction. Shopify extraction via the Admin REST API with the updated_at_min parameter for incremental order and product catalogue extraction, and the GraphQL Bulk Operations API for full catalogue extracts that exceed REST pagination limits. Google Ads and Meta Ads performance data via their respective reporting APIs for marketing spend and campaign performance. Authentication token management: OAuth 2.0 refresh token rotation stored in a secrets manager (AWS Secrets Manager, GCP Secret Manager) with automatic refresh before expiry, preventing pipeline failures from expired tokens. Custom REST API connectors built from an OpenAPI specification or API documentation for internal systems without Fivetran or Airbyte connector support, typically 3 to 5 days engineering per connector depending on API complexity and authentication mechanism.

Database replication and CDC

Change Data Capture (CDC) replication from MySQL, PostgreSQL, SQL Server, and Oracle that captures every insert, update, and delete at the row level as it happens in the source database, without full table scans, without polling queries that load the production database, and without the delete blindness that makes incremental-by-timestamp extraction incorrect for tables where rows are deleted. Debezium as the CDC engine for open-source, production-proven database log reading: Debezium connects to the database's transaction log (MySQL binlog, PostgreSQL WAL via logical replication slot, SQL Server CDC tables, Oracle LogMiner) and streams change events as structured JSON messages to a Kafka topic or directly to the target system. The production database sees only the standard replication client connection, CDC does not add queries to the production workload beyond what a read replica connection would. Kafka Connect sink connectors for warehouse delivery: Confluent JDBC Sink Connector or BigQuery/Snowflake-specific connectors consume the Debezium change stream and write upserts to the target table at configurable batch intervals. Full row history preservation in the warehouse using Type 2 slowly changing dimension (SCD2) pattern: each row update appended as a new record with effective_from, effective_to, and is_current columns rather than overwriting the previous value, the audit trail for tables where row history matters for compliance or analytics. Use cases where CDC is the correct choice over watermark-based incremental extraction: tables with no reliable updated_at column; tables where deletes are frequent and must be propagated; tables too large to re-extract in a batch window; replication latency requirements below 1 minute; and event sourcing patterns where every state transition is a business event that downstream consumers need to process.

Pipeline monitoring and alerting

Run logging, row count validation, schema change detection, and freshness SLA monitoring across all pipeline jobs, the observability layer that means a pipeline failure is discovered by the monitoring system, not by an analyst who noticed a dashboard stopped updating. Row count validation at every pipeline stage: the number of records extracted, the number after transformation (checking for unexpected row inflation or loss from join operations), and the number successfully loaded compared to the expected range based on historical run statistics, a run that delivers 10% of the normal record count triggers an alert rather than silently loading partial data. Schema change detection comparing the source schema against the expected schema on every extraction run: new columns logged and optionally auto-added to the raw table; removed or renamed columns that would break transformation models trigger an alert before the load step runs, preventing broken downstream dbt models from silently producing null values. Data freshness SLA monitoring: each pipeline has a configured maximum acceptable lag (e.g., the sales pipeline must complete by 06:00 daily); a pipeline that hasn't delivered fresh data by the SLA deadline triggers a Slack or email alert to the on-call data engineer. Great Expectations as the data quality framework for custom assertion rules beyond row counts: assertions on value ranges (order amounts between $0 and $1M), referential integrity (every order in the fact table has a matching customer in the dimension table), and completeness (no more than 5% null in required columns) run as dbt test blocks on every model refresh and fail the pipeline if assertions breach. Central Airflow or Prefect monitoring dashboard: run status, last success timestamp, duration trend, record counts, and alert history across all jobs in one view, the data engineering team's operational health view without navigating multiple tools.

Data transformation and modelling

dbt SQL models that define your business entities, metrics, and reporting constructs in version-controlled SQL with full lineage documentation, replacing the ad-hoc analyst queries and undocumented spreadsheet formulas that make data definitions inconsistent across teams. Model architecture follows the staging-intermediate-mart convention: stg_salesforce__accounts.sql normalises and casts the raw Salesforce accounts table with no business logic; int_customers.sql joins across the CRM and billing system to produce a unified customer entity; fct_revenue.sql calculates recognised revenue per month per customer using the business logic agreed with Finance, each model has a single clearly defined purpose and can be changed without breaking layers above or below it. Column-level documentation in schema.yml files: every output column in every mart model described with its definition, the source column it derives from, and the business rule applied, generated as a dbt Docs site that analysts browse without asking the data team what a field means. dbt tests on every model: not_null and unique tests on primary keys to catch extraction or join bugs that produce duplicate or missing records; relationships tests between fact and dimension tables to catch referential integrity violations; custom test_between assertions for business rules (revenue must be positive, discount percentage must be between 0 and 100). dbt source freshness checks validate that source tables in the warehouse have been updated within the expected window before running downstream models, so a dbt run that begins before an upstream pipeline has completed fails fast at the source check rather than silently producing stale mart data. Model performance optimisation: query profiling on slow-running models with warehouse-specific optimisations (Snowflake clustering keys, BigQuery partition pruning, Redshift sort and dist keys applied based on query patterns) to keep mart refresh time below the freshness SLA.

Have a data pipeline project?

Tell us your source systems, what data you need to move, and what breaks today when a pipeline fails. We'll scope the architecture and give you a fixed cost.

Talk about your data pipeline project

Data Engineering Services, full data engineering capability overview
Data Warehouse Development, warehouse design and build on Snowflake, BigQuery, and Redshift
Real-Time Data Pipelines, streaming pipelines for data that needs to be current
Data Quality Management, validation, monitoring, and anomaly detection

Predictive Analytics, ML models built on top of your cleaned data layer
Business Intelligence, dashboards and reporting on top of your data warehouse
Cloud Migration, move your databases and data infrastructure to cloud

How it works

From first call to shipped product: how every build runs.

The same four steps on every engagement. A 6-week voice AI deployment runs the same shape as a 16-week enterprise build.

Week 1
01
Discover
We spend the first week understanding the problem, not presenting a solution. Discovery session, interviews with the people closest to the work, workflow mapping, and a technical audit of what you already have. You leave knowing exactly what's broken and why previous attempts didn't fix it.
Weeks 2–3
02
Design
Low-fidelity wireframes before any code is written. You see the product before we build it. Scope, timeline, and fixed price locked at this stage. No surprises after work starts.
Weeks 4–12
03
Build
Bi-weekly agile sprints. Weekly progress calls. Direct access to the team and project management tools. Working software at the end of every sprint. Not a big-bang delivery at the finish line.
Weeks 12–16
04
Ship
Production deployment, QA sign-off, load testing, and team handover. You own the full codebase from day one. We stay on for post-launch iteration and support. Nothing gets thrown over the wall.

Frequently asked questions

: Schema change handling is designed into the pipeline architecture, not bolted on after a break. For ELT pipelines, raw data lands in the warehouse in its source form, when a source adds a column, the raw table gains a column and the transformation models decide whether to use it. For ETL pipelines, we build schema validation checks that alert when a source deviates from expected structure before the load runs. Critical pipelines include automated schema drift detection so the team knows before a report breaks.
: Managed ETL tools (Fivetran, Airbyte) are the right starting point when your source systems are standard SaaS tools with supported connectors and your transformation requirements are straightforward. They handle connector maintenance and scheduling so you don't have to. Custom pipeline development makes sense when your source systems are custom databases or internal APIs without supported connectors, when data volume or transformation complexity exceeds what managed tools handle economically, or when you have compliance requirements that restrict data passing through third-party infrastructure. We'll give you an honest assessment of which approach fits before scoping anything.
: A pipeline connecting 3 to 5 source systems with standard transformations and warehouse delivery typically takes 6 to 10 weeks. A more complex build with custom database connectors, CDC replication, and a full dbt transformation layer typically takes 10 to 16 weeks. Timeline depends on the number of sources, data quality issues in those sources, and transformation complexity. We assess all three during a scoping phase before committing to a timeline.
: Pipeline failure handling is a first-class part of the design. When extraction from a source fails, the pipeline logs the failure, skips the load step (preserving the last successful data state in the warehouse), and triggers an alert. Partial loads, where some records were extracted before a failure, are handled with transactional load patterns that commit only complete batches. Retry logic with backoff handles transient source unavailability without operator intervention.

Work with us

Tell us what you need. We'll tell you what it would take.

We scope ETL Pipeline Development in 30 minutes. You walk away with a clear cost, timeline, and approach. No commitment required.

Scope and cost agreed before work starts. No surprises. No obligation.
Working prototype within 3 weeks of kickoff.
Pay by milestone. You see progress before each invoice.
60-day post-launch warranty. Bug fixes, UI tweaks, and deployment support. No retainer.
All conversations are NDA-protected.

Data that lives in source systems is not data you can use. ETL pipelines are what move it to where decisions get made.

Sound familiar?

What we build

Batch ETL pipeline development

ELT architecture on cloud warehouses

SaaS and API data ingestion

Database replication and CDC

Pipeline monitoring and alerting

Data transformation and modelling

Have a data pipeline project?

Related data engineering services

Related services

From first call to shipped product: how every build runs.

Discover

Design

Build

Ship

Frequently asked questions

Tell us what you need. We'll tell you what it would take.