
Test-Driven Development with AI: Proven Workflow That Saved 2.5 Days
By Aravind Jaimon
Key Takeaways
- RaftLabs built RaftStack CLI to standardize repo setup and cut project initialization time from several days to under 30 minutes.
- Manual copying of configs caused configuration drift, inconsistent tooling, and high onboarding and maintenance costs across projects.
- RaftStack CLI codifies standards like linting, formatting, TypeScript configs, Git hooks, branch naming, and Asana-linked commits into a single interactive command.
- All configurations are modular, idempotent, versioned, and centrally maintained, so updates roll out consistently to every project.
- RaftLabs extended the same standardization mindset to AI by enforcing a spec-driven, test-first workflow with Claude Code.
- The AI workflow follows a strict sequence of specification, test generation, minimal implementation, and refactoring, with strong context isolation.
- A CLAUDE.md file in each repo encodes TDD rules, testing conventions, and quality standards so all developers use AI in a consistent way.
- This approach reduces AI output variance, improves test coverage, speeds up delivery, and shifts code reviews from style issues to logic and architecture.
- The main lessons: start from real pain points, make the standard path the easiest, constrain AI with clear workflows, and keep evolving versioned standards.
Every engineering team that scales beyond a handful of developers faces the same challenge: maintaining consistency. What starts as a simple problem, setting up a new project, gradually evolves into a multi-day endeavor filled with copy-pasting configuration files, hunting down the "right" version of a linting rule, and reconciling conflicting standards across repositories. At RaftLabs, we watched this problem compound as our team grew, and we decided to solve it systematically.
This article documents how we built RaftStack CLI, an internal tool that reduced our project setup time from 2-3 days to under 30 minutes, and how we extended the same standardization philosophy to AI-assisted development through a spec-driven, test-first workflow with Claude Code. The result: not just faster setup, but consistently high-quality code output from every developer on our team, regardless of their individual coding style or experience level.
Whether you're a tech lead struggling with inconsistent codebases, an engineering manager looking to scale your team's practices, or a developer curious about effective AI integration in development workflows, this deep-dive offers practical insights you can apply to your own organization.
The Problem: Why Repository Setup Became a Bottleneck
Before diving into solutions, let's examine the problem in detail. Understanding the full scope of what we were dealing with helps explain why we chose the approach we did.
The Manual Setup Reality
When a new project kicked off at RaftLabs, the setup process looked something like this: a developer would create a fresh repository, then begin the tedious process of copying configuration files from an existing "reference" project. This included ESLint configurations, Prettier settings, TypeScript configurations, Husky hooks for pre-commit validation, commitlint rules, and various other tooling configurations.
On paper, this sounds like a one-hour task. In practice, it consistently consumed 2-3 days of developer time as every project had subtle differences. The "reference" project might be using an older ESLint version. Its TypeScript config might include settings specific to that project's architecture. The Husky hooks might reference scripts that don't exist in the new project's package.json.
Developers would copy files, run into errors, debug, modify configurations, and iterate until things worked. This process was not only time-consuming but also introduced configuration drift: each new project ended up with slightly different settings, making it harder for developers to context-switch between projects.
The Consistency Crisis
After a year of this process, we audited our repositories and found a troubling pattern: no two projects had identical configurations. Some used tabs, others used spaces. Some enforced strict TypeScript null checks, others didn't. Commit message formats varied wildly. Some projects had comprehensive pre-commit hooks, others had none.
This inconsistency created real problems:
- Onboarding friction: New developers had to learn the specific conventions of each project
- Code review inefficiency: Reviewers spent time on style issues rather than logic
- Cross-project contributions: Developers hesitated to contribute to unfamiliar repositories
- Tooling maintenance: Updates had to be applied manually to each project
The Traceability Gap
We also noticed a persistent problem with commit traceability. Despite using Asana for project management, there was no enforced connection between commits and tasks. Developers would sometimes add task references manually, but there was no consistency. When investigating bugs or reviewing features months later, connecting code changes to the original requirements often required detective work.
The Solution: RaftStack CLI
We realized that the only way to solve this systematically was to codify our standards into tooling. Rather than documenting conventions and hoping developers would follow them, we would build a CLI that applies configurations automatically and consistently.
Design Philosophy
RaftStack CLI was built around several core principles:
- Declarative over imperative: Developers choose what they want; the CLI handles how to implement it
- Modular application: Each configuration can be applied independently
- Idempotent operations: Running the CLI multiple times produces the same result
- Minimal assumptions: Works with any npm-based project structure
Core Features
The CLI provides several key capabilities that address our specific pain points:
Asana Commit Linking enforces that every commit includes a reference to an Asana task. This creates automatic traceability between code changes and project management. When investigating a bug six months later, any developer can immediately understand the context behind a commit. The format is standardized: `[ASANA-12345] Your commit message here`.
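To make enforcement concrete, a check of this kind can run from a Husky commit-msg hook. The sketch below is illustrative only, not RaftStack's actual implementation; the regex simply mirrors the format shown above:

```ts
// check-commit-msg.ts - illustrative sketch, not the actual RaftStack implementation.
// Git's commit-msg hook receives the path to the commit message file, which the
// Husky hook forwards to this script (e.g. `npx tsx check-commit-msg.ts "$1"`).
import { readFileSync } from "node:fs";

const messageFile = process.argv[2];
const message = readFileSync(messageFile, "utf8").trim();

// Mirrors the standardized format: [ASANA-12345] Your commit message here
const pattern = /^\[ASANA-\d+\]\s.+/;

if (!pattern.test(message)) {
  console.error(
    'Commit rejected: start the message with an Asana reference, e.g. "[ASANA-12345] Fix login redirect"'
  );
  process.exit(1);
}
```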
Pre-push Validation runs build processes and validations before code is pushed to the remote repository. This catches broken builds before they enter the CI pipeline, reducing failed builds and the associated context-switching costs.
Branch Naming Conventions enforce consistent branch names that encode information about the type of work being done. This makes repository navigation intuitive and supports automated workflows that key off branch names.
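A branch-name check can be wired into the same hook infrastructure, for example from a pre-push hook alongside the build step. The convention below is hypothetical; substitute whatever pattern your team enforces:

```ts
// check-branch-name.ts - illustrative sketch with a hypothetical naming convention.
import { execSync } from "node:child_process";

const branch = execSync("git rev-parse --abbrev-ref HEAD", { encoding: "utf8" }).trim();

// Hypothetical convention: <type>/asana-<task-id>-<short-description>
const pattern = /^(feature|bugfix|hotfix|chore)\/asana-\d+-[a-z0-9-]+$/;
const longLived = ["main", "develop"];

if (!longLived.includes(branch) && !pattern.test(branch)) {
  console.error(
    `Branch "${branch}" does not match the convention, e.g. feature/asana-12345-login-redirect`
  );
  process.exit(1);
}
```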
Code Quality Tools apply our standard ESLint, Prettier, and TypeScript configurations. These are versioned and maintained centrally, so updates propagate to all projects simultaneously.
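In practice, central maintenance means each project consumes a versioned preset rather than copying rules around. A sketch using ESLint's flat config (the shared package name is hypothetical and assumed to export a config array):

```js
// eslint.config.mjs - sketch of consuming a centrally versioned preset.
import raftlabsPreset from "@raftlabs/eslint-config"; // hypothetical shared package

export default [
  // Team-wide rules live in the preset; bumping its version updates every project at once.
  ...raftlabsPreset,
  // Project-specific overrides stay small and explicit.
  {
    rules: {},
  },
];
```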
Implementation Approach
Rather than building everything from scratch, RaftStack CLI orchestrates existing tools. It leverages Husky for Git hooks, commitlint for message validation, and standard linting tools for code quality. The CLI's value is in the curation and configuration of these tools, not in reimplementing their functionality.
Installation is straightforward for any npm-based project:
`npx @raftlabs/raftstack`
The CLI presents an interactive menu where developers select which configurations to apply. Each selection is independent, allowing teams to adopt RaftStack incrementally.
Extending Standardization to AI-Assisted Development
With repository setup standardized, we turned our attention to a newer challenge: the wildly inconsistent results we were seeing from AI-assisted coding.
The Chaos of Unstructured AI Usage
Like many engineering organizations, we encouraged developers to use AI tools—specifically Claude Code—to accelerate development. The theory was sound: AI assistance should help developers write code faster and with fewer bugs. The reality was more complicated.
Without any standardized workflow, AI usage was essentially random. Each developer had their own approach:
- Some would ask Claude to write entire features in one shot, resulting in 1000+ line files with no clear structure
- Others would write code manually and only use AI for debugging
- Some prompted Claude extensively with context; others provided minimal direction
- Test coverage varied from comprehensive to non-existent
The result was predictable: code quality varied dramatically depending on who wrote it and how they happened to use AI that day. Code reviews became exercises in rewriting rather than refinement. The promise of AI-accelerated development was undermined by the inconsistency of its outputs.
The Root Cause: Implementation-First Thinking
We analyzed patterns in our AI-assisted code and identified a fundamental problem: Claude Code, like most developers, defaults to implementation-first thinking. When asked to build a feature, it writes the implementation immediately. Tests, if added at all, come afterward and often only cover the happy path.
This approach has several problems:
- Edge cases are ignored: The AI focuses on making the basic functionality work
- Tests validate implementation rather than requirements: Tests written after code often just assert what the code does, not what it should do
- No design pressure: Without tests driving design, code tends toward monolithic, hard-to-test structures
- Context pollution: When implementation exists in the context window, the AI's test writing is influenced by implementation details
The Spec-Driven TDD Workflow for Claude Code
Our solution was to create a structured workflow that forces test-driven development when using Claude Code. The key insight: AI assistance is most effective when it has clear, verifiable targets. Tests provide exactly that.
The Red-Green-Refactor Cycle with AI
Traditional TDD follows a simple cycle: write a failing test (red), write minimal code to make it pass (green), then refactor. We adapted this for AI-assisted development:
Phase 1: Specification — Before any coding begins, we create a specification document that defines what we're building. This includes requirements, acceptance criteria, and explicitly enumerated edge cases. This specification becomes the "contract" that constrains all subsequent work.
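A small slice of such a spec might look like the following, a simplified, hypothetical example we will carry through the next two phases:

```markdown
## Feature: Password validation

### Requirements
- Passwords must be at least 12 characters long.
- Validation returns a structured result rather than throwing.

### Edge cases
- An empty string is rejected with reason "too_short".
- A password of exactly 12 characters is accepted.

### Out of scope
- Strength scoring and breached-password checks.
```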
Phase 2: Test Writing (Red) — With the specification complete, we prompt Claude to write comprehensive tests. Critically, this happens before any implementation exists. Claude is explicitly told that implementation doesn't exist yet, preventing it from making assumptions about how the code will work. We run these tests to confirm they fail as expected.
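Continuing the sketch above, the red phase might produce a test file like this (assuming Vitest; the `validatePassword` module does not exist yet, so the suite fails at import, which is exactly what we want to confirm):

```ts
// validate-password.test.ts - tests written from the spec, before any implementation exists.
import { describe, expect, it } from "vitest";
import { validatePassword } from "./validate-password"; // does not exist yet: red phase

describe("validatePassword", () => {
  it("accepts a password at the minimum length boundary (12 characters)", () => {
    expect(validatePassword("Str0ng-pass!")).toEqual({ valid: true });
  });

  it("rejects passwords shorter than 12 characters", () => {
    expect(validatePassword("Sh0rt!")).toEqual({ valid: false, reason: "too_short" });
  });

  it("rejects an empty string", () => {
    expect(validatePassword("")).toEqual({ valid: false, reason: "too_short" });
  });
});
```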
Phase 3: Implementation (Green) — Only after tests are written and committed do we ask Claude to implement the feature. The instruction is specific: write the minimal code necessary to make the tests pass. This constraint prevents over-engineering and ensures the implementation is driven by requirements, not assumptions.
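For the same sketch, the green phase produces only what the tests demand and nothing more:

```ts
// validate-password.ts - minimal implementation whose only job is to make the tests pass.
export type PasswordCheckResult =
  | { valid: true }
  | { valid: false; reason: "too_short" };

export function validatePassword(password: string): PasswordCheckResult {
  if (password.length < 12) {
    return { valid: false, reason: "too_short" };
  }
  return { valid: true };
}
```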
Phase 4: Refactoring — With passing tests providing a safety net, we can refactor confidently. Claude can suggest improvements, and we can verify that refactoring doesn't break functionality by running the test suite.
Context Isolation: The Key to Consistent Results
A critical aspect of our workflow is context isolation. When Claude is writing tests, there should be no implementation code in the context window. When it is writing the implementation, it should focus only on making the tests pass, not on what a "good" implementation might look like independent of them.
We achieve this through structured prompting and, in more sophisticated setups, through Claude Code's sub-agent capabilities. Each phase gets exactly the context it needs and nothing more. This prevents the "bleeding" of implementation thinking into test writing and vice versa.
The practical benefit is significant: two developers using this workflow on the same feature will produce remarkably similar code. The tests constrain the solution space, and the minimal-implementation requirement prevents stylistic divergence.
CLAUDE.md Configuration
We codify our TDD workflow in the project's CLAUDE.md file, which Claude Code reads automatically. This file includes:
- Testing conventions and frameworks used in the project
- Explicit instruction to follow TDD methodology
- Quality standards for code structure and documentation
- Commands for running tests and validating changes
By baking these instructions into the project itself, we ensure that any developer using Claude Code in the repository follows the same workflow automatically. This is the same philosophy as RaftStack CLI: encode standards in tooling rather than documentation.
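A trimmed-down CLAUDE.md along these lines might look like the following. The frameworks and commands shown are placeholders; use whatever your project actually runs:

```markdown
# CLAUDE.md (illustrative excerpt)

## Workflow
- Follow strict TDD: write failing tests from the spec before any implementation.
- Write the minimal implementation needed to make the tests pass, then refactor.
- Never weaken or delete tests to make an implementation pass; flag the mismatch instead.

## Testing
- Unit tests use Vitest and live next to the module as `*.test.ts`.
- Every exported function needs tests for its documented edge cases.

## Commands
- `npm test` runs the unit test suite.
- `npm run lint` runs ESLint and Prettier checks.
```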
Three Approaches to Development: A Side-by-Side Comparison
To understand the value of our AI-driven TDD approach, it helps to see how it compares to traditional development methods. Here's how the same feature development plays out under three different workflows:
Scenario: Building a User Authentication Module
Traditional Development (No TDD, No AI)
- Developer reads requirements
- Writes implementation code based on understanding
- Manually tests in development environment
- Discovers edge cases through bug reports
- Writes tests after bugs are found (if at all)
- Refactors when code becomes unmaintainable
Typical Timeline: 2-3 days for initial implementation, ongoing bug fixes
Common Issues: Missed edge cases, inconsistent error handling, difficult to refactor, bugs discovered in production
Traditional TDD (No AI)
- Developer writes specification
- Manually writes comprehensive test cases
- Writes minimal implementation to pass tests
- Refactors with test safety net
- Tests catch edge cases before production
Typical Timeline: 3-4 days (testing overhead upfront)
Common Issues: Perceived slowness, developer fatigue from writing tests, inconsistent test quality across team members
AI-Driven TDD (Our Approach)
- Developer writes specification
- Claude generates comprehensive test suite based on spec
- Developer reviews and refines tests
- Claude writes minimal implementation to pass tests
- Refactor with AI suggestions and test validation
Typical Timeline: 1-2 days (faster test writing, consistent quality)
Benefits: Speed of AI + quality of TDD, consistent outputs across team, comprehensive edge case coverage, reduced developer fatigue
How Different AI Coding Assistants Fit This Workflow
Not all AI tools work the same way with test-driven development. Here's how the major options compare when used with our TDD methodology:
| Feature | Claude Code (Our Choice) | GitHub Copilot | Cursor | ChatGPT/GPT-4 |
|---|---|---|---|---|
| Context Awareness | Excellent - reads project files including CLAUDE.md | Good - inline suggestions based on current file | Excellent - full codebase context | Limited - requires manual context copying |
| Test-First Support | Native - can be explicitly instructed to write tests before implementation | Weak - tends to suggest implementation first | Moderate - can follow instructions but requires prompting | Good - follows instructions but requires conversation management |
| Spec Understanding | Excellent - can work from detailed specifications | Limited - works best with code context | Good - can interpret specifications | Excellent - strong at understanding requirements |
| Context Isolation | Strong - sub-agent capabilities allow separate contexts for test/implementation phases | N/A - no concept of phases | Moderate - requires manual prompt engineering | Moderate - requires careful conversation threading |
| Consistency Across Team | High - CLAUDE.md ensures same behavior for all developers | Variable - depends on individual usage patterns | Moderate - configurable but requires setup | Low - each developer has separate conversations |
| Local Development Integration | Excellent - CLI tool with full terminal access | Excellent - IDE integration | Excellent - IDE integration | Poor - separate web interface |
| Cost | Usage-based API pricing | Subscription per developer | Subscription per developer | Subscription or API pricing |
Why We Chose Claude Code:
The critical factor for us was consistency and workflow enforcement. Claude Code's ability to read and follow the CLAUDE.md configuration file means every developer on our team gets the same TDD workflow automatically. Its sub-agent capabilities also make context isolation cleaner: we can explicitly separate the test-writing phase from the implementation phase.
That said, the principles of our TDD workflow can be adapted to other tools. The key is enforcing the spec → test → implement → refactor sequence through disciplined prompting, regardless of which AI assistant you use.
Results and Impact
The combination of RaftStack CLI and our TDD workflow has produced measurable improvements across several dimensions.
Quantitative Improvements
| Metric | Before | After |
|---|---|---|
| Project setup time | 2-3 days | < 30 minutes |
| Configuration consistency | ~15% identical | 100% identical |
| Commit traceability | ~40% linked | 100% linked |
| Code review style comments | ~35% of comments | < 5% of comments |
| AI-generated code quality variance | High (developer-dependent) | Low (consistent) |
Qualitative Observations
Beyond the numbers, we've observed several qualitative improvements:
Faster onboarding: New developers can start contributing to any project immediately. They don't need to learn project-specific conventions because there aren't any—every project uses the same standards.
Confident refactoring: With comprehensive test coverage driven by TDD, developers refactor code confidently. The test suite catches regressions immediately, enabling more aggressive improvement of legacy code.
Better code reviews: Reviewers can focus on logic, architecture, and edge cases rather than formatting and style. Reviews are faster and more valuable.
Predictable AI assistance: Claude Code has become a reliable tool rather than a lottery. Developers know what to expect and can plan their work accordingly. This predictability has increased adoption and trust in AI-assisted development.
Lessons Learned and Recommendations
Our journey to standardization wasn't without challenges. Here's what we learned along the way.
Start with Pain Points, Not Perfection
We didn't try to standardize everything at once. RaftStack started with commit message validation because that was our biggest immediate pain point. As developers experienced the benefit, adoption of additional features followed naturally. If we'd launched with a comprehensive "you must use all these tools" mandate, we would have faced resistance.
Make the Right Way the Easy Way
The CLI succeeds because following standards is easier than not following them. Developers don't have to remember conventions or look up documentation. They run one command and get the correct setup. The same applies to our TDD workflow—the structured approach actually reduces cognitive load compared to ad-hoc AI prompting.
Version Your Standards
Configurations in RaftStack are versioned. When we update ESLint rules or add new pre-commit hooks, we can roll out changes incrementally and roll back if needed. This gives us confidence to iterate on our standards without fear of breaking existing projects.
AI Workflows Need Constraints
The counterintuitive lesson from our AI work: more constraints lead to better outputs. Giving Claude free rein produces unpredictable results. Constraining it with tests, specifications, and explicit phase boundaries produces consistent, high-quality code. Think of it like giving directions—"go somewhere interesting" produces random results; "go to the coffee shop on Main Street" gets you exactly where you need to be.
5 Mistakes We Made Implementing AI-Driven TDD (So You Don't Have To)
Building this workflow wasn't a straight path. We encountered several pitfalls along the way. Here are the mistakes that cost us the most time and how we ultimately solved them:
Mistake #1: Letting Claude See Implementation While Writing Tests
What Happened: In our early experiments, we would ask Claude to write tests for existing code. This seemed efficient—we had code that needed test coverage, so why not generate tests for it?
The Problem: Tests written this way invariably validated what the code did, not what it should do. If the implementation had bugs, the tests codified those bugs. If the implementation missed edge cases, the tests missed them too. We ended up with high test coverage but low actual validation.
The Solution: Strict context isolation. When writing tests, Claude's context includes only the specification and requirements—never the implementation. This forces the tests to be based on what the feature should do according to requirements, not what some existing code happens to do.
Warning Sign: If your tests consistently pass on the first run without any code changes, they're probably just validating existing behavior rather than driving development.
Mistake #2: Writing Vague Specifications
What Happened: We assumed that Claude's intelligence would compensate for underspecified requirements. We'd write brief specs like "build a user login endpoint" and expect Claude to infer all the details.
The Problem: Claude filled in the blanks based on common patterns, which wasn't always what we needed. Different developers got different implementations for the same vague spec. Edge cases were handled inconsistently. Security requirements were sometimes overlooked.
The Solution: Explicit, enumerated specifications. We now document expected behavior, edge cases, error conditions, and security requirements in detail before any code is written. Yes, this takes time upfront, but it saves far more time in revision and debugging.
Warning Sign: If you find yourself frequently saying "that's not what I meant" during code review, your specifications aren't detailed enough.
Mistake #3: Skipping the Test Review Step
What Happened: Since Claude was generating comprehensive test suites quickly, we initially just ran the tests to make sure they failed (red phase), then moved straight to implementation.
The Problem: AI-generated tests sometimes include redundant test cases, miss important edge cases that weren't explicit in the spec, or use testing patterns inconsistent with the rest of the codebase. Skipping the review meant these issues compounded over time.
The Solution: Mandatory human review of generated tests before proceeding to implementation. A developer must read through the test suite and ask: "Do these tests validate our actual requirements? Are there gaps? Are they testing at the right level of abstraction?" This review typically takes 5-10 minutes but catches issues that would take hours to debug later.
Warning Sign: If you discover bugs in production that your test suite didn't catch, review your test quality control process.
Mistake #4: Allowing "One-Shot" Feature Implementation
What Happened: Developers would sometimes prompt Claude with "build the entire user authentication system" and let it generate hundreds of lines of code in one go.
The Problem: Large, single-shot implementations inevitably required significant refactoring. Claude would make architectural decisions that didn't align with our patterns, create monolithic files, or implement features we didn't actually need. The code worked but wasn't maintainable.
The Solution: Enforce iterative, small-batch development. Break features into small, testable units. Write tests for one unit, implement it, commit, then move to the next. This creates natural checkpoints and prevents architectural drift.
Warning Sign: If your Git commits routinely include 500+ lines of changes, your batches are too large.
Mistake #5: Treating AI Output as Final Code
What Happened: Early on, we sometimes treated Claude's implementation as production-ready code that just needed to pass tests.
The Problem: AI-generated code often works but lacks the nuance that comes from deep context about the system. Variable names might be generic, error messages unclear, or performance characteristics suboptimal. The code passed tests but wasn't excellent.
The Solution: Always include a refactoring phase. With tests providing safety, developers refine the AI-generated implementation: improve naming, add comments where complexity is necessary, optimize performance-critical paths, and ensure consistency with existing codebase patterns. This is where developer expertise adds the most value.
Warning Sign: If your code reviews consistently require significant changes to AI-generated code, you're not allocating enough time for the refactoring phase.
Anti-Patterns to Avoid When Prompting AI
Beyond these specific mistakes, we've identified several anti-patterns that undermine effective AI-assisted TDD:
The "Magic Prompt" Fallacy: Spending excessive time crafting the perfect prompt. Effective prompting is important, but the workflow (spec → test → implement → refactor) matters more than prompt optimization. A mediocre prompt within a good workflow outperforms a perfect prompt with no workflow.
Context Dumping: Copying your entire codebase into the prompt hoping the AI will understand everything. This actually degrades performance—too much irrelevant context confuses the model. Provide focused, relevant context only.
Over-Correction: When Claude produces something that doesn't quite match your intent, some developers write increasingly specific prompts trying to steer the output. Often it's faster to just edit the code directly. Know when to switch from prompting to coding.
Test After Debugging: Writing implementation, debugging it manually until it works, then asking Claude to generate tests for the debugged code. This completely defeats the purpose of TDD. The tests should drive the debugging.
Inconsistent Workflow: Following the TDD workflow on some features but not others, or following it strictly on Monday but abandoning it under deadline pressure on Friday. Inconsistency produces inconsistent results. The workflow only works when it's habitual.
How to Know If Your AI Workflow Isn't Working
Watch for these warning signs that your AI-assisted development process needs adjustment:
Code Review Gridlock: If reviews consistently require major rewrites of AI-generated code, your specifications aren't detailed enough, or your prompting strategy isn't aligned with your quality standards.
Test Suite Doesn't Catch Bugs: If bugs slip into production that should have been caught by tests, your test generation process needs improvement. Review whether specs explicitly covered the failure case.
Inconsistent Code Quality: If quality varies significantly between features or developers, your workflow isn't standardized enough. This is exactly the problem we built RaftStack to solve.
Developer Frustration: If developers complain that "AI slows me down" or "I could have written this faster myself," they probably aren't following the workflow correctly or the workflow needs refinement for your context.
Degrading Over Time: If code quality was initially good but has degraded over weeks or months, developers are likely taking shortcuts and skipping workflow steps under time pressure. This requires process reinforcement, not just individual correction.
The good news: all of these issues are solvable with workflow adjustments, better tooling, or additional training. The key is recognizing them early rather than letting them compound.
Getting Started with RaftStack
If you're interested in implementing similar standardization in your organization, RaftStack is available as a public npm package. You can try it immediately on any npm-based project:
`npx @raftlabs/raftstack`
The CLI will guide you through the available configurations. For teams looking to adapt RaftStack to their own needs, the package is designed to be extended.
For the TDD workflow, we recommend starting with these steps:
- Create a CLAUDE.md file in your project root with explicit TDD instructions
- Document your testing conventions so Claude knows which frameworks and patterns to use
- Train your team on the spec → test → implement → refactor workflow
- Review early outputs carefully to establish expectations and refine prompting strategies
Conclusion
The journey from chaotic, inconsistent repositories to standardized, predictable development workflows wasn't completed overnight. It required identifying real pain points, building tooling that makes the right way the easy way, and continuously iterating based on team feedback.
RaftStack CLI solved our immediate repository standardization problems, cutting setup time dramatically and eliminating configuration drift. But the more significant insight was recognizing that the same principles could be applied to AI-assisted development. By constraining Claude Code with structured workflows, specifications, and test-first methodology, we transformed it from an unpredictable assistant into a reliable tool that produces consistent, high-quality code.
The combination of these approaches creates a multiplier effect. Developers spend less time on setup and more time on valuable work. Code reviews focus on substance rather than style. AI assistance accelerates development without sacrificing quality. And perhaps most importantly, every developer on the team—regardless of experience level or personal coding style—produces code that meets the same high standard.
If your team is struggling with similar challenges, we encourage you to try RaftStack and experiment with structured AI workflows. The investment in tooling and process pays dividends in every project that follows. In an industry obsessed with moving fast, sometimes the fastest path forward is taking time to build the right foundation.
Ready to Standardize Your Development Workflow?
RaftLabs helps engineering teams build scalable, consistent development practices. Whether you need help implementing tooling like RaftStack, establishing AI-assisted development workflows, or building custom solutions for your team's specific challenges, we'd love to discuss how we can help.



