Test-Driven Development with AI: A Proven Workflow That Saved 2.5 Days

Key Takeaways

  • RaftLabs built RaftStack CLI to standardize repo setup and cut project initialization time from several days to under 30 minutes.

  • Manual copying of configs caused configuration drift, inconsistent tooling, and high onboarding and maintenance costs across projects.

  • RaftStack CLI codifies standards like linting, formatting, TypeScript configs, Git hooks, branch naming, and Asana-linked commits into a simple interactive command.

  • All configurations are modular, idempotent, versioned, and centrally maintained so updates roll out consistently to every project.

  • RaftLabs extended the same standardization mindset to AI by enforcing a spec-driven, test-first workflow with Claude Code.

  • Their AI workflow follows a strict sequence of specification, test generation, minimal implementation, and refactoring with strong context isolation.

  • A CLAUDE.md file in each repo encodes TDD rules, testing conventions, and quality standards so all developers use AI in a consistent way.

  • This approach reduces AI output variance, improves test coverage, speeds up delivery, and shifts code reviews from style issues to logic and architecture.

  • The main lessons are to start from real pain points, make the standard path the easiest, constrain AI with clear workflows, and keep evolving versioned standards.

Every engineering team that scales beyond a handful of developers faces the same challenge: maintaining consistency. What starts as a simple problem, setting up a new project, gradually evolves into a multi-day endeavor filled with copy-pasting configuration files, hunting down the "right" version of a linting rule, and reconciling conflicting standards across repositories. At RaftLabs, we watched this problem compound as our team grew, and we decided to solve it systematically.

This article documents our journey of building RaftStack CLI, an internal tool that reduced our project setup time from 2-3 days to under 30 minutes, and how we extended this standardization philosophy to AI-assisted development through a spec-driven, test-first workflow with Claude Code. The result: not just faster setup, but consistently high-quality code output across every developer on our team, regardless of their individual coding style or experience level.

Whether you're a tech lead struggling with inconsistent codebases, an engineering manager looking to scale your team's practices, or a developer curious about effective AI integration in development workflows, this deep-dive offers practical insights you can apply to your own organization.

The Problem: Why Repository Setup Became a Bottleneck

Before diving into solutions, let's examine the problem in detail. Understanding the full scope of what we were dealing with helps explain why we chose the approach we did.

The Manual Setup Reality

When a new project kicked off at RaftLabs, the setup process looked something like this: a developer would create a fresh repository, then begin the tedious process of copying configuration files from an existing "reference" project. This included ESLint configurations, Prettier settings, TypeScript configurations, Husky hooks for pre-commit validation, commitlint rules, and various other tooling configurations.

On paper, this sounds like a one-hour task. In practice, it consistently consumed 2-3 days of developer time as every project had subtle differences. The "reference" project might be using an older ESLint version. Its TypeScript config might include settings specific to that project's architecture. The Husky hooks might reference scripts that don't exist in the new project's package.json.

Developers would copy files, run into errors, debug, modify configurations, and iterate until things worked. This process was not only time-consuming but also introduced configuration drift: each new project ended up with slightly different settings, making it harder for developers to context-switch between projects.

The Consistency Crisis

After a year of this process, we audited our repositories and found a troubling pattern: no two projects had identical configurations. Some used tabs, others used spaces. Some enforced strict TypeScript null checks, others didn't. Commit message formats varied wildly. Some projects had comprehensive pre-commit hooks, others had none.

This inconsistency created real problems:

  • Onboarding friction: New developers had to learn the specific conventions of each project

  • Code review inefficiency: Reviewers spent time on style issues rather than logic

  • Cross-project contributions: Developers hesitated to contribute to unfamiliar repositories

  • Tooling maintenance: Updates had to be applied manually to each project

The Traceability Gap

We also noticed a persistent problem with commit traceability. Although we used Asana for project management, there was no enforced connection between commits and tasks. Developers would sometimes add task references manually, but there was no consistency. When investigating bugs or reviewing features months later, connecting code changes to the original requirements often required detective work.

The Solution: RaftStack CLI

We realized that the only way to solve this systematically was to codify our standards into tooling. Rather than documenting conventions and hoping developers would follow them, we would build a CLI that applies configurations automatically and consistently.

Design Philosophy

RaftStack CLI was built around several core principles:

  1. Declarative over imperative: Developers choose what they want; the CLI handles how to implement it
  2. Modular application: Each configuration can be applied independently
  3. Idempotent operations: Running the CLI multiple times produces the same result
  4. Minimal assumptions: Works with any npm-based project structure

Core Features

The CLI provides several key capabilities that address our specific pain points:

Asana Commit Linking enforces that every commit includes a reference to an Asana task. This creates automatic traceability between code changes and project management. When investigating a bug six months later, any developer can immediately see the context behind a commit. The format is standardized: [ASANA-12345] Your commit message here.
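
To make the enforcement concrete, here is a minimal sketch of the kind of check a commit-msg Git hook could run; the file name, script, and exact rules are illustrative assumptions, not RaftStack's actual implementation:

// scripts/check-commit-msg.ts (hypothetical) - invoked from a commit-msg hook,
// which receives the path to the commit message file as its first argument.
import { readFileSync } from "node:fs";

const messageFile = process.argv[2];
const message = readFileSync(messageFile, "utf8").trim();

// Require commits to start with an Asana reference, e.g. "[ASANA-12345] Fix login redirect"
const asanaPattern = /^\[ASANA-\d+\]\s.+/;

if (!asanaPattern.test(message)) {
  console.error('Commit rejected: message must start with "[ASANA-<task-id>] ..."');
  process.exit(1); // a non-zero exit aborts the commit
}

Because the check runs locally at commit time, invalid messages never reach the repository or the CI pipeline.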

Pre-push Validation runs build processes and validations before code is pushed to the remote repository. This catches broken builds before they enter the CI pipeline, reducing failed builds and the associated context-switching costs.

Branch Naming Conventions enforce consistent branch names that encode information about the type of work being done. This makes repository navigation intuitive and supports automated workflows that key off branch names.

Code Quality Tools apply our standard ESLint, Prettier, and TypeScript configurations. These are versioned and maintained centrally, so updates propagate to all projects simultaneously.
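
A common way to achieve that central maintenance, and the pattern we sketch here as an assumption rather than the literal RaftStack internals, is to publish the shared rules as a versioned npm package that each project's ESLint flat config extends:

// eslint.config.mjs - a sketch; "@raftlabs/eslint-config" is a hypothetical package name.
import sharedConfig from "@raftlabs/eslint-config";

export default [
  ...sharedConfig, // the centrally maintained, versioned ruleset
  {
    // project-specific overrides stay small and explicit
    ignores: ["dist/**", "coverage/**"],
  },
];

Rolling out an update then means bumping one dependency version, and rolling back means pinning the previous one.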

Implementation Approach

Rather than building everything from scratch, RaftStack CLI orchestrates existing tools. It leverages Husky for Git hooks, commitlint for message validation, and standard linting tools for code quality. The CLI's value is in the curation and configuration of these tools, not in reimplementing their functionality.
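
As a rough illustration of what that orchestration looks like, a single idempotent setup step might write a Husky commit-msg hook that delegates to commitlint; the paths and file content below are illustrative, not the CLI's actual code:

// install-commit-msg-hook.ts (hypothetical) - one idempotent setup step.
import { chmodSync, existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";

const hookPath = ".husky/commit-msg";
// Standard commitlint invocation for a commit-msg hook; "$1" is the commit message file.
const hookContent = 'npx --no -- commitlint --edit "$1"\n';

mkdirSync(".husky", { recursive: true });

// Idempotent: re-running the CLI leaves an already-correct hook untouched.
if (!existsSync(hookPath) || readFileSync(hookPath, "utf8") !== hookContent) {
  writeFileSync(hookPath, hookContent);
  chmodSync(hookPath, 0o755); // Git hooks must be executable
}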

Installation is straightforward for any npm-based project:

npx @raftlabs/raftstack

The CLI presents an interactive menu where developers select which configurations to apply. Each selection is independent, allowing teams to adopt RaftStack incrementally.

Extending Standardization to AI-Assisted Development

With repository setup standardized, we turned our attention to a newer challenge: the wildly inconsistent results we were seeing from AI-assisted coding.

The Chaos of Unstructured AI Usage

Like many engineering organizations, we encouraged developers to use AI tools—specifically Claude Code—to accelerate development. The theory was sound: AI assistance should help developers write code faster and with fewer bugs. The reality was more complicated.

Without any standardized workflow, AI usage was essentially random. Each developer had their own approach:

  • Some would ask Claude to write entire features in one shot, resulting in 1000+ line files with no clear structure

  • Others would write code manually and only use AI for debugging

  • Some prompted Claude extensively with context; others provided minimal direction

  • Test coverage varied from comprehensive to non-existent

The result was predictable: code quality varied dramatically depending on who wrote it and how they happened to use AI that day. Code reviews became exercises in rewriting rather than refinement. The promise of AI-accelerated development was undermined by the inconsistency of its outputs.

The Root Cause: Implementation-First Thinking

We analyzed patterns in our AI-assisted code and identified a fundamental problem: Claude Code, like most developers, defaults to implementation-first thinking. When asked to build a feature, it writes the implementation immediately. Tests, if added at all, come afterward and often only cover the happy path.

This approach has several problems:

  • Edge cases are ignored: The AI focuses on making the basic functionality work

  • Tests validate implementation rather than requirements: Tests written after code often just assert what the code does, not what it should do

  • No design pressure: Without tests driving design, code tends toward monolithic, hard-to-test structures

  • Context pollution: When implementation exists in the context window, the AI's test writing is influenced by implementation details

The Spec-Driven TDD Workflow for Claude Code

Our solution was to create a structured workflow that forces test-driven development when using Claude Code. The key insight: AI assistance is most effective when it has clear, verifiable targets. Tests provide exactly that.

The Red-Green-Refactor Cycle with AI

Traditional TDD follows a simple cycle: write a failing test (red), write minimal code to make it pass (green), then refactor. We adapted this for AI-assisted development:

Phase 1: Specification — Before any coding begins, we create a specification document that defines what we're building. This includes requirements, acceptance criteria, and explicitly enumerated edge cases. This specification becomes the "contract" that constrains all subsequent work.

Phase 2: Test Writing (Red) — With the specification complete, we prompt Claude to write comprehensive tests. Critically, this happens before any implementation exists. Claude is explicitly told that implementation doesn't exist yet, preventing it from making assumptions about how the code will work. We run these tests to confirm they fail as expected.
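
To make the phases concrete, imagine a deliberately small, hypothetical spec for an applyDiscount(price, percentage) utility: it must return the discounted price rounded to two decimals and throw for negative prices or percentages outside 0-100. Tests written from that spec alone, before any implementation exists, might look like this sketch (Vitest is shown as a placeholder; use whatever framework your project standardizes on):

// applyDiscount.test.ts - written from the spec only; the implementation does not exist yet.
import { describe, expect, it } from "vitest";
import { applyDiscount } from "./applyDiscount";

describe("applyDiscount", () => {
  it("applies a percentage discount and rounds to two decimals", () => {
    expect(applyDiscount(19.99, 10)).toBe(17.99);
  });

  it("returns the original price for a 0% discount", () => {
    expect(applyDiscount(50, 0)).toBe(50);
  });

  it("returns 0 for a 100% discount", () => {
    expect(applyDiscount(50, 100)).toBe(0);
  });

  // Edge cases come from the spec, not from any existing code
  it("throws for negative prices", () => {
    expect(() => applyDiscount(-1, 10)).toThrow();
  });

  it("throws for percentages outside 0-100", () => {
    expect(() => applyDiscount(50, 101)).toThrow();
    expect(() => applyDiscount(50, -5)).toThrow();
  });
});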

Phase 3: Implementation (Green) — Only after tests are written and committed do we ask Claude to implement the feature. The instruction is specific: write the minimal code necessary to make the tests pass. This constraint prevents over-engineering and ensures the implementation is driven by requirements, not assumptions.
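
Continuing the hypothetical example, the minimal implementation the tests demand is only a few lines, with no speculative options or extra features:

// applyDiscount.ts - the least code needed to make the tests above pass.
export function applyDiscount(price: number, percentage: number): number {
  if (price < 0) {
    throw new Error("price must be non-negative");
  }
  if (percentage < 0 || percentage > 100) {
    throw new Error("percentage must be between 0 and 100");
  }
  const discounted = price * (1 - percentage / 100);
  return Math.round(discounted * 100) / 100; // round to two decimals
}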

Phase 4: Refactoring — With passing tests providing a safety net, we can refactor confidently. Claude can suggest improvements, and we can verify that refactoring doesn't break functionality by running the test suite.

Context Isolation: The Key to Consistent Results

A critical aspect of our workflow is context isolation. When Claude is writing tests, there should not be implementation code in the context window. When it's writing implementation, it should focus only on making the tests pass, not on what a "good" implementation might look like independent of the tests.

We achieve this through structured prompting and, in more sophisticated setups, through Claude Code's sub-agent capabilities. Each phase gets exactly the context it needs and nothing more. This prevents the "bleeding" of implementation thinking into test writing and vice versa.

The practical benefit is significant: two developers using this workflow on the same feature will produce remarkably similar code. The tests constrain the solution space, and the minimal-implementation requirement prevents stylistic divergence.

CLAUDE.md Configuration

We codify our TDD workflow in the project's CLAUDE.md file, which Claude Code reads automatically. This file includes:

  • Testing conventions and frameworks used in the project

  • Explicit instruction to follow TDD methodology

  • Quality standards for code structure and documentation

  • Commands for running tests and validating changes

By baking these instructions into the project itself, we ensure that any developer using Claude Code in the repository follows the same workflow automatically. This is the same philosophy as RaftStack CLI: encode standards in tooling rather than documentation.
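
As a rough idea of what this file contains, here is an abbreviated, illustrative sketch rather than our full configuration; the framework and command names are placeholders:

Project conventions for Claude Code (excerpt)

Workflow:
- Follow strict TDD: specification, then failing tests, then minimal implementation, then refactoring.
- Never write implementation code before the tests for that unit exist and fail.
- Keep changes small: one testable unit per commit.

Testing:
- Use Vitest; test files live next to the source as *.test.ts.
- Run the suite with: npm test

Quality:
- Every change must pass linting and type checks: npm run lint && npm run typecheck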

Three Approaches to Development: A Side-by-Side Comparison

To understand the value of our AI-driven TDD approach, it helps to see how it compares to traditional development methods. Here's how the same feature development plays out under three different workflows:

Scenario: Building a User Authentication Module

Traditional Development (No TDD, No AI)

  • Developer reads requirements

  • Writes implementation code based on understanding

  • Manually tests in development environment

  • Discovers edge cases through bug reports

  • Writes tests after bugs are found (if at all)

  • Refactors when code becomes unmaintainable

Typical Timeline: 2-3 days for initial implementation, ongoing bug fixes

Common Issues: Missed edge cases, inconsistent error handling, difficult to refactor, bugs discovered in production

Traditional TDD (No AI)

  • Developer writes specification

  • Manually writes comprehensive test cases

  • Writes minimal implementation to pass tests

  • Refactors with test safety net

  • Tests catch edge cases before production

Typical Timeline: 3-4 days (testing overhead upfront)

Common Issues: Perceived slowness, developer fatigue from writing tests, inconsistent test quality across team members

AI-Driven TDD (Our Approach)

  • Developer writes specification

  • Claude generates comprehensive test suite based on spec

  • Developer reviews and refines tests

  • Claude writes minimal implementation to pass tests

  • Refactor with AI suggestions and test validation

Typical Timeline: 1-2 days (faster test writing, consistent quality)

Benefits: Speed of AI + quality of TDD, consistent outputs across team, comprehensive edge case coverage, reduced developer fatigue

How Different AI Coding Assistants Fit This Workflow

Not all AI tools work the same way with test-driven development. Here's how the major options compare when used with our TDD methodology:

Feature | Claude Code (Our Choice) | GitHub Copilot | Cursor | ChatGPT/GPT-4
Context Awareness | Excellent - reads project files including CLAUDE.md | Good - inline suggestions based on current file | Excellent - full codebase context | Limited - requires manual context copying
Test-First Support | Native - can be explicitly instructed to write tests before implementation | Weak - tends to suggest implementation first | Moderate - can follow instructions but requires prompting | Good - follows instructions but requires conversation management
Spec Understanding | Excellent - can work from detailed specifications | Limited - works best with code context | Good - can interpret specifications | Excellent - strong at understanding requirements
Context Isolation | Strong - sub-agent capabilities allow separate contexts for test/implementation phases | N/A - no concept of phases | Moderate - requires manual prompt engineering | Moderate - requires careful conversation threading
Consistency Across Team | High - CLAUDE.md ensures same behavior for all developers | Variable - depends on individual usage patterns | Moderate - configurable but requires setup | Low - each developer has separate conversations
Local Development Integration | Excellent - CLI tool with full terminal access | Excellent - IDE integration | Excellent - IDE integration | Poor - separate web interface
Cost | Usage-based API pricing | Subscription per developer | Subscription per developer | Subscription or API pricing

Why We Chose Claude Code:

The critical factor for us was consistency and workflow enforcement. Claude Code's ability to read and follow the CLAUDE.md configuration file means every developer on our team gets the same TDD workflow automatically. The sub-agent capabilities also make context isolation cleaner, since we can explicitly separate the test-writing phase from the implementation phase.

That said, the principles of our TDD workflow can be adapted to other tools. The key is enforcing the spec → test → implement → refactor sequence through disciplined prompting, regardless of which AI assistant you use.

Results and Impact

The combination of RaftStack CLI and our TDD workflow has produced measurable improvements across several dimensions.

Quantitative Improvements

Metric | Before | After
Project setup time | 2-3 days | < 30 minutes
Configuration consistency | ~15% identical | 100% identical
Commit traceability | ~40% linked | 100% linked
Code review style comments | ~35% of comments | < 5% of comments
AI-generated code quality variance | High (developer-dependent) | Low (consistent)

Qualitative Observations

Beyond the numbers, we've observed several qualitative improvements:

Faster onboarding: New developers can start contributing to any project immediately. They don't need to learn project-specific conventions because there aren't any—every project uses the same standards.

Confident refactoring: With comprehensive test coverage driven by TDD, developers refactor code confidently. The test suite catches regressions immediately, enabling more aggressive improvement of legacy code.

Better code reviews: Reviewers can focus on logic, architecture, and edge cases rather than formatting and style. Reviews are faster and more valuable.

Predictable AI assistance: Claude Code has become a reliable tool rather than a lottery. Developers know what to expect and can plan their work accordingly. This predictability has increased adoption and trust in AI-assisted development.

Lessons Learned and Recommendations

Our journey to standardization wasn't without challenges. Here's what we learned along the way.

Start with Pain Points, Not Perfection

We didn't try to standardize everything at once. RaftStack started with commit message validation because that was our biggest immediate pain point. As developers experienced the benefit, adoption of additional features followed naturally. If we'd launched with a comprehensive "you must use all these tools" mandate, we would have faced resistance.

Make the Right Way the Easy Way

The CLI succeeds because following standards is easier than not following them. Developers don't have to remember conventions or look up documentation. They run one command and get the correct setup. The same applies to our TDD workflow—the structured approach actually reduces cognitive load compared to ad-hoc AI prompting.

Version Your Standards

Configurations in RaftStack are versioned. When we update ESLint rules or add new pre-commit hooks, we can roll out changes incrementally and roll back if needed. This gives us confidence to iterate on our standards without fear of breaking existing projects.

AI Workflows Need Constraints

The counterintuitive lesson from our AI work: more constraints lead to better outputs. Giving Claude free rein produces unpredictable results. Constraining it with tests, specifications, and explicit phase boundaries produces consistent, high-quality code. Think of it like giving directions—"go somewhere interesting" produces random results; "go to the coffee shop on Main Street" gets you exactly where you need to be.

5 Mistakes We Made Implementing AI-Driven TDD (So You Don't Have To)

Building this workflow wasn't a straight path. We encountered several pitfalls along the way. Here are the mistakes that cost us the most time and how we ultimately solved them:

Mistake #1: Letting Claude See Implementation While Writing Tests

What Happened: In our early experiments, we would ask Claude to write tests for existing code. This seemed efficient—we had code that needed test coverage, so why not generate tests for it?

The Problem: Tests written this way invariably validated what the code did, not what it should do. If the implementation had bugs, the tests codified those bugs. If the implementation missed edge cases, the tests missed them too. We ended up with high test coverage but low actual validation.

The Solution: Strict context isolation. When writing tests, Claude's context includes only the specification and requirements—never the implementation. This forces the tests to be based on what the feature should do according to requirements, not what some existing code happens to do.

Warning Sign: If your tests consistently pass on the first run without any code changes, they're probably just validating existing behavior rather than driving development.

Mistake #2: Writing Vague Specifications

What Happened: We assumed that Claude's intelligence would compensate for underspecified requirements. We'd write brief specs like "build a user login endpoint" and expect Claude to infer all the details.

The Problem: Claude filled in the blanks based on common patterns, which wasn't always what we needed. Different developers got different implementations for the same vague spec. Edge cases were handled inconsistently. Security requirements were sometimes overlooked.

The Solution: Explicit, enumerated specifications. We now document expected behavior, edge cases, error conditions, and security requirements in detail before any code is written. Yes, this takes time upfront, but it saves far more time in revision and debugging.

Warning Sign: If you find yourself frequently saying "that's not what I meant" during code review, your specifications aren't detailed enough.

Mistake #3: Skipping the Test Review Step

What Happened: Since Claude was generating comprehensive test suites quickly, we initially just ran the tests to make sure they failed (red phase), then moved straight to implementation.

The Problem: AI-generated tests sometimes include redundant test cases, miss important edge cases that weren't explicit in the spec, or use testing patterns inconsistent with the rest of the codebase. Skipping the review meant these issues compounded over time.

The Solution: Mandatory human review of generated tests before proceeding to implementation. A developer must read through the test suite and ask: "Do these tests validate our actual requirements? Are there gaps? Are they testing at the right level of abstraction?" This review typically takes 5-10 minutes but catches issues that would take hours to debug later.

Warning Sign: If you discover bugs in production that your test suite didn't catch, review your test quality control process.

Mistake #4: Allowing "One-Shot" Feature Implementation

What Happened: Developers would sometimes prompt Claude with "build the entire user authentication system" and let it generate hundreds of lines of code in one go.

The Problem: Large, single-shot implementations inevitably required significant refactoring. Claude would make architectural decisions that didn't align with our patterns, create monolithic files, or implement features we didn't actually need. The code worked but wasn't maintainable.

The Solution: Enforce iterative, small-batch development. Break features into small, testable units. Write tests for one unit, implement it, commit, then move to the next. This creates natural checkpoints and prevents architectural drift.

Warning Sign: If your Git commits routinely include 500+ lines of changes, you're batching work in units that are too large.

Mistake #5: Treating AI Output as Final Code

What Happened: Early on, we sometimes treated Claude's implementation as production-ready code that just needed to pass tests.

The Problem: AI-generated code often works but lacks the nuance that comes from deep context about the system. Variable names might be generic, error messages unclear, or performance characteristics suboptimal. The code passed tests but wasn't excellent.

The Solution: Always include a refactoring phase. With tests providing safety, developers refine the AI-generated implementation: improve naming, add comments where complexity is necessary, optimize performance-critical paths, and ensure consistency with existing codebase patterns. This is where developer expertise adds the most value.

Warning Sign: If your code reviews consistently require significant changes to AI-generated code, you're not allocating enough time for the refactoring phase.

Anti-Patterns to Avoid When Prompting AI

Beyond these specific mistakes, we've identified several anti-patterns that undermine effective AI-assisted TDD:

The "Magic Prompt" Fallacy: Spending excessive time crafting the perfect prompt. Effective prompting is important, but the workflow (spec → test → implement → refactor) matters more than prompt optimization. A mediocre prompt within a good workflow outperforms a perfect prompt with no workflow.

Context Dumping: Copying your entire codebase into the prompt hoping the AI will understand everything. This actually degrades performance—too much irrelevant context confuses the model. Provide focused, relevant context only.

Over-Correction: When Claude produces something that doesn't quite match your intent, some developers write increasingly specific prompts trying to steer the output. Often it's faster to just edit the code directly. Know when to switch from prompting to coding.

Test After Debugging: Writing implementation, debugging it manually until it works, then asking Claude to generate tests for the debugged code. This completely defeats the purpose of TDD. The tests should drive the debugging.

Inconsistent Workflow: Following the TDD workflow on some features but not others, or following it strictly on Monday but abandoning it under deadline pressure on Friday. Inconsistency produces inconsistent results. The workflow only works when it's habitual.

How to Know If Your AI Workflow Isn't Working

Watch for these warning signs that your AI-assisted development process needs adjustment:

Code Review Gridlock: If reviews consistently require major rewrites of AI-generated code, your specifications aren't detailed enough, or your prompting strategy isn't aligned with your quality standards.

Test Suite Doesn't Catch Bugs: If bugs slip into production that should have been caught by tests, your test generation process needs improvement. Review whether specs explicitly covered the failure case.

Inconsistent Code Quality: If quality varies significantly between features or developers, your workflow isn't standardized enough. This is exactly the problem we built RaftStack to solve.

Developer Frustration: If developers complain that "AI slows me down" or "I could have written this faster myself," they probably aren't following the workflow correctly or the workflow needs refinement for your context.

Degrading Over Time: If code quality was initially good but has degraded over weeks or months, developers are likely taking shortcuts and skipping workflow steps under time pressure. This requires process reinforcement, not just individual correction.

The good news: all of these issues are solvable with workflow adjustments, better tooling, or additional training. The key is recognizing them early rather than letting them compound.

Getting Started with RaftStack

If you're interested in implementing similar standardization in your organization, RaftStack is available as a public npm package. You can try it immediately on any npm-based project:

npx @raftlabs/raftstack

The CLI will guide you through available configurations. For teams looking to customize or extend RaftStack for their own needs, the package is designed to be extended.

For the TDD workflow, we recommend starting with these steps:

  1. Create a CLAUDE.md file in your project root with explicit TDD instructions
  2. Document your testing conventions so Claude knows which frameworks and patterns to use
  3. Train your team on the spec → test → implement → refactor workflow
  4. Review early outputs carefully to establish expectations and refine prompting strategies

Conclusion

The journey from chaotic, inconsistent repositories to standardized, predictable development workflows wasn't completed overnight. It required identifying real pain points, building tooling that makes the right way the easy way, and continuously iterating based on team feedback.

RaftStack CLI solved our immediate repository standardization problems, cutting setup time dramatically and eliminating configuration drift. But the more significant insight was recognizing that the same principles could be applied to AI-assisted development. By constraining Claude Code with structured workflows, specifications, and test-first methodology, we transformed it from an unpredictable assistant into a reliable tool that produces consistent, high-quality code.

The combination of these approaches creates a multiplier effect. Developers spend less time on setup and more time on valuable work. Code reviews focus on substance rather than style. AI assistance accelerates development without sacrificing quality. And perhaps most importantly, every developer on the team—regardless of experience level or personal coding style—produces code that meets the same high standard.

If your team is struggling with similar challenges, we encourage you to try RaftStack and experiment with structured AI workflows. The investment in tooling and process pays dividends in every project that follows. In an industry obsessed with moving fast, sometimes the fastest path forward is taking time to build the right foundation.

Ready to Standardize Your Development Workflow?

RaftLabs helps engineering teams build scalable, consistent development practices. Whether you need help implementing tooling like RaftStack, establishing AI-assisted development workflows, or building custom solutions for your team's specific challenges, we'd love to discuss how we can help.


