What is LLM fine-tuning and when do I need it?

Fine-tuning is the process of continuing to train a pre-trained language model on your specific data so it adapts to your task, domain, and output requirements. Use fine-tuning when prompt engineering alone cannot produce consistent output format, when you need significant inference cost reduction at scale (a fine-tuned smaller model can outperform a larger model with a long system prompt), or when domain-specific vocabulary significantly degrades base model performance. Fine-tuning is not always the right answer; start with prompt engineering and RAG first.

What models can be fine-tuned?

The OpenAI fine-tuning API supports GPT-4o mini and GPT-3.5 Turbo (hosted fine-tuning, no infrastructure required). Open-source models include Llama 3 (8B, 70B), Mistral 7B, Phi-3, and Gemma (these require GPU infrastructure for training). Google Gemini fine-tuning is available via Vertex AI. The right model depends on your budget, data privacy requirements, and accuracy needs. Open-source models eliminate per-token API fees and run on your own infrastructure.

How much training data do I need for fine-tuning?

For OpenAI fine-tuning, 50 to 100 high-quality examples is the minimum; 500 to 1,000 is recommended for reliable improvement; 5,000 or more for significant domain adaptation. Quality matters more than quantity. For open-source model fine-tuning using LoRA or QLoRA adapters, expect 1,000 to 50,000 examples depending on the degree of adaptation required. We assess your existing data and help curate or generate training examples if your dataset is thin.

What is LoRA fine-tuning?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than the full model. It is cheaper in compute and memory than full fine-tuning while achieving comparable results for most tasks. QLoRA extends this with quantisation for even lower memory requirements. LoRA is the standard approach for fine-tuning open-source models on modest GPU infrastructure. We use LoRA and QLoRA for open-source model fine-tuning and full fine-tuning only when the task requires it.

How do you evaluate whether fine-tuning actually improved the model?

We establish a benchmark before fine-tuning. A representative set of inputs with expected outputs is evaluated on your task-specific metrics (accuracy, format compliance, domain terminology usage, output length consistency). The fine-tuned model is evaluated against this benchmark on a held-out test set. We only recommend production deployment when benchmark improvement is statistically significant. Fine-tuning that does not improve over the baseline prompt-engineered base model is not worth the cost.

What does LLM fine-tuning cost?

Fine-tuning project cost covers training data curation, fine-tuning run costs, evaluation, and deployment. For OpenAI fine-tuning (GPT-4o mini or GPT-3.5), the OpenAI training API costs are low ($1 to $10 for typical datasets); the project cost is primarily in data curation and evaluation work ($8,000 to $25,000). For open-source model fine-tuning with infrastructure setup, expect $20,000 to $60,000 including GPU compute, deployment infrastructure, and evaluation framework.

LLM Fine-Tuning Services

General-purpose language models are trained to be useful to everyone. Fine-tuning makes them specifically useful to you, adapting their behaviour, vocabulary, tone, and output format to your domain, your data, and your product requirements.
We fine-tune language models on your datasets to improve accuracy on your specific tasks, reduce prompt length and inference cost, and produce outputs that match your brand voice and format requirements without extensive prompt engineering.

Fine-tuning on OpenAI, Llama 3, Mistral, and Phi models
Domain adaptation, output format alignment, and tone calibration
Training data curation, model evaluation, and production deployment
Cost and latency analysis: fine-tuning vs. RAG vs. prompt engineering for your use case

Recent outcomes

LLM fine-tuning · Clinical documentation platform

Fine-tuned Llama 3 8B on clinical note data; model now correctly uses ICD-10 terminology and terse note style without prompt engineering.

88% inference cost reduction vs GPT-4o

Output format alignment · B2B SaaS workflow tool

Fine-tuned GPT-4o mini on 800 labeled examples; JSON schema compliance rate reached 99.4% in production.

99.4% schema compliance rate

Conversational AI · Operational workflows

Fine-tuned chatbot handling routine queries end-to-end without human intervention, deployed in 12 weeks.

70% of queries handled without human intervention

4.9 / 5 on Clutch

Sound familiar?

Spending significant tokens on system prompts trying to get the model to behave consistently?
Base model producing outputs in the wrong format or style despite detailed prompting?

In short

RaftLabs builds LLM fine-tuning solutions for businesses in the US, UK, and Australia. We fine-tune OpenAI, Llama 3, and Mistral models on client data, with 100+ AI products shipped. Teams achieve up to 88% inference cost reduction on well-scoped tasks.

AI development, by the numbers

AI products shipped in 24 months: 20+

Kick-off to production-ready AI product: 12 weeks

Client rating on Clutch: 4.9/5

Years shipping software and AI products: 9+

When to fine-tune vs. prompt vs. RAG

Most teams reach for fine-tuning too early. The decision tree:

Try prompt engineering first. A well-structured system prompt with few-shot examples solves most output format and consistency problems without any training data.

Add RAG if the model needs your knowledge. When the model needs to answer questions about your specific documents, products, or data, retrieval-augmented generation gives it that knowledge without fine-tuning.

Fine-tune when prompt engineering cannot achieve a consistent output format despite detailed instructions, when inference cost at your expected volume makes large model usage uneconomical, or when domain-specific terminology significantly degrades base model accuracy.
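The decision tree above can be sketched as a small function. This is only an illustration of the ordering described here; the function name and the four boolean inputs are hypothetical, and a real assessment weighs these factors rather than treating them as strict yes/no flags.

```python
def recommend_approach(prompt_meets_format, needs_private_knowledge,
                       cost_blocks_large_model, domain_terms_hurt_accuracy):
    """Return the ordered steps suggested by the decision tree (hypothetical helper)."""
    # Prompt engineering is always the starting point.
    steps = ["prompt engineering"]
    # RAG is added when the model must answer from your own documents or data.
    if needs_private_knowledge:
        steps.append("RAG")
    # Fine-tuning is the escalation step for format, cost, or terminology gaps.
    if (not prompt_meets_format) or cost_blocks_large_model or domain_terms_hurt_accuracy:
        steps.append("fine-tuning")
    return steps


print(recommend_approach(True, True, False, False))
print(recommend_approach(False, False, True, False))
```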

We will tell you which path is right for your use case, including if fine-tuning is not the answer.

Services

What we do

Training data curation

The quality of your fine-tuning data determines the quality of the fine-tuned model, and most projects that fail at fine-tuning fail because of training data problems, not model architecture problems. 50-100 high-quality examples can produce measurable improvement for format and style alignment tasks; domain vocabulary adaptation typically requires 500-2,000 examples; substantial domain knowledge adaptation requires 5,000+. Quality assessment comes before any training run: we audit your candidate training data for consistency (does every example follow the same input/output format?), coverage (do examples cover the range of inputs the model will see in production?), and correctness (are the expected outputs actually correct, or do they propagate errors from your historical data?). Training data format is designed for the target model: OpenAI fine-tuning uses a messages format with system/user/assistant turns; Mistral instruct models use the [INST]...[/INST] instruction template; Llama 3 instruct models use their own chat template built on special header tokens. The format matters because the model was trained with specific delimiters that the fine-tuning data must match. When your existing examples are thin, we generate synthetic data with GPT-4o: it is prompted with your domain knowledge and a few seed examples to produce additional training pairs, which are reviewed and filtered before use. This approach produces usable volume without expensive human labelling. Deduplication and near-duplicate removal using MinHash or embedding similarity prevents the fine-tuned model from overfitting to repeated patterns in the training set. The data is split 80/10/10 into train, validation, and test sets, with the test set held completely separate from training to measure generalisation rather than memorisation.
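A minimal sketch of the preparation steps described above, assuming the system/user/assistant messages layout and one JSON object per line. The helper names are hypothetical, and the exact-match deduplication here is a simple stand-in for the MinHash or embedding-based near-duplicate step, not an implementation of it.

```python
import json
import random


def to_chat_record(system, user, assistant):
    """One training example in the system/user/assistant messages layout."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}


def dedupe_exact(records):
    """Drop records that are identical after lowercasing and collapsing spaces."""
    seen, unique = set(), []
    for rec in records:
        key = " ".join(json.dumps(rec, sort_keys=True).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split_80_10_10(records, seed=0):
    """Shuffle with a fixed seed, then split into train/validation/test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def to_jsonl(records):
    """Serialise records as JSON Lines text, one example per line."""
    return "\n".join(json.dumps(r) for r in records)


# Illustrative data only: ten distinct examples plus one repeat.
examples = [to_chat_record("Extract the label.", f"input {i}", f"label {i}")
            for i in range(10)]
examples.append(to_chat_record("Extract the label.", "input 0", "label 0"))
train, val, test = split_80_10_10(dedupe_exact(examples))
print(len(train), len(val), len(test))
```

Fixing the shuffle seed keeps the held-out test set stable between runs, which matters when the same test set is reused to compare model versions.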

Domain adaptation

Fine-tuning a general-purpose model on your domain vocabulary, technical terminology, abbreviation conventions, and document structure targets the accuracy gap that appears when a base model encounters specialised language that was underrepresented in its pre-training data. Medical domain adaptation: clinical note abbreviations (SOB, HTN, DM2, CVA), ICD-10 and CPT code terminology, and the terse style of clinical documentation all diverge from how a general model was trained to produce text. A fine-tuned clinical model correctly uses "patient presents with" where a general model might write "the individual reported experiencing." Legal domain adaptation: contract clause terminology, defined terms conventions, citation formats (e.g., distinguishing between a defined term in all-caps and a general reference), and the specific formulaic language of different contract types that base models frequently paraphrase rather than reproduce precisely. Financial domain adaptation: earnings call transcription correction, financial statement extraction where specific GAAP/IFRS line items have precise meanings, and the structured table formats that financial reports require. Technical documentation: API endpoint names, version-specific syntax, and the specific formatting conventions of your documentation format that a general model reproduces incorrectly. A domain evaluation benchmark is established before fine-tuning begins: 200-500 held-out domain-specific examples with expected outputs, evaluated on terminology accuracy rate, abbreviation expansion correctness, and task-specific format compliance, so that improvement is measured against the baseline and not just by loss curves on the training set.

Output format alignment

When your application requires structured JSON output, a specific response structure, or constrained output styles that the base model produces inconsistently despite detailed prompting instructions, fine-tuning encodes the format directly into the model's learned behaviour rather than relying on prompt instructions that the model may or may not follow on any given generation. Structured JSON extraction: a fine-tuned model that has seen hundreds of examples of your specific JSON schema reliably produces valid, schema-conformant output where a base model might produce valid JSON with wrong field names, different nesting, or hallucinated fields not in your schema. OpenAI's response_format: { type: "json_schema" } parameter is the right first approach for JSON enforcement without fine-tuning. Fine-tuning for format is the escalation step when structured output mode still produces wrong field values or when inference cost from a larger model is the driver. Brand voice and tone adaptation: a customer-facing chatbot that should always respond in a specific register (warm but professional, direct without being terse, never using industry jargon) is difficult to keep consistent via prompt instructions when the system prompt competes with user input. Fine-tuning the tone into the model's weights produces more consistent behaviour across diverse user inputs. Label classification format: a model that must output one of a fixed set of classification labels without wrapping them in explanation text learns the exact output format through fine-tuning more reliably than through few-shot examples that compete with the model's default conversational output style. Format compliance rate is the primary evaluation metric: the percentage of test-set outputs that exactly match the required format without post-processing corrections, with a target above 99% for production classification and extraction tasks.
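As a rough illustration of the format compliance rate described above, the sketch below counts an output as compliant only if it parses as a JSON object with exactly the expected keys and value types. The schema (invoice_id, total, currency) and the function names are hypothetical; a production check would more likely validate against your real JSON Schema.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"invoice_id": str, "total": float, "currency": str}


def is_compliant(raw_output, required=REQUIRED_FIELDS):
    """True only for a JSON object with exactly the required keys and types.

    The type check is strict: an integer total such as 12 fails, 12.0 passes.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # explanation text or malformed JSON
    if not isinstance(data, dict) or set(data) != set(required):
        return False  # wrong, missing, or extra (hallucinated) fields
    return all(isinstance(data[key], kind) for key, kind in required.items())


def compliance_rate(outputs):
    """Fraction of outputs that pass is_compliant without post-processing."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o) for o in outputs) / len(outputs)


sample_outputs = [
    '{"invoice_id": "A1", "total": 12.5, "currency": "USD"}',
    '{"invoice_id": "A2", "amount": 3.0, "currency": "USD"}',
    'Sure! Here is the JSON: {}',
    '{"invoice_id": "A3", "total": 7.25, "currency": "EUR", "note": "x"}',
]
print(compliance_rate(sample_outputs))  # 0.25: only the first output passes
```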

Inference cost reduction

A fine-tuned smaller model can match the output quality of a larger base model on a well-defined, specific task. This is the principle behind the inference cost reduction use case for fine-tuning. A GPT-4o production deployment at $5/1M output tokens processing 10 million tokens per month costs $50/month in inference; a fine-tuned GPT-4o mini deployment at $0.60/1M output tokens for the same task costs $6/month, an 88% inference cost reduction if the fine-tuned model matches GPT-4o's quality on that specific task. The cost reduction is real only when the task is narrow enough that a smaller model fine-tuned on it can match the larger model's performance. A complex reasoning task that genuinely requires GPT-4o's reasoning capability will not be successfully distilled into GPT-4o mini regardless of fine-tuning data volume. Task suitability is assessed before committing to the cost reduction approach: we run the task against GPT-4o mini without fine-tuning to establish a baseline gap, then model whether fine-tuning is likely to close that gap based on task complexity and our experience with comparable tasks. Open-source model cost reduction: a fine-tuned Llama 3 8B running on a single A100 GPU at ~$2/hour can process 50,000+ tokens per minute on simple tasks, eliminating per-token API fees at sufficient volume. These economics make open-source fine-tuning attractive above ~500,000 tokens per day. Distillation approach: the larger model's outputs on a representative input set are used as training data for the smaller model, transferring the larger model's task-specific behaviour rather than curating human-labelled training data. This is typically faster and cheaper than human labelling at equivalent quality when the larger model's outputs are reliable enough to serve as targets.
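The arithmetic behind the figures above can be reproduced in a few lines. This sketch uses only the prices and volumes quoted in this section; the function names are hypothetical, and the GPU figure assumes the card is kept fully busy at the quoted throughput, which real workloads may not achieve.

```python
def monthly_cost(tokens_per_month, price_per_million):
    """API inference cost in dollars for a month of output tokens."""
    return tokens_per_month / 1_000_000 * price_per_million


def percent_reduction(baseline, candidate):
    """Cost saved by the candidate, as a percentage of the baseline."""
    return (baseline - candidate) / baseline * 100


def gpu_cost_per_million(hourly_rate, tokens_per_minute):
    """Effective dollars per 1M tokens for a self-hosted GPU at full utilisation."""
    tokens_per_hour = tokens_per_minute * 60
    return hourly_rate / tokens_per_hour * 1_000_000


large = monthly_cost(10_000_000, 5.00)   # GPT-4o at $5/1M output tokens
small = monthly_cost(10_000_000, 0.60)   # fine-tuned GPT-4o mini at $0.60/1M
print(round(large, 2), round(small, 2))              # 50.0 6.0
print(round(percent_reduction(large, small), 1))     # 88.0

# A100 at ~$2/hour processing 50,000 tokens per minute, fully utilised.
print(round(gpu_cost_per_million(2.00, 50_000), 2))  # 0.67
```

The self-hosted figure depends heavily on utilisation: an idle GPU still bills by the hour, so the effective per-token cost rises as throughput falls.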

Open-source model fine-tuning

We fine-tune Llama 3 (8B and 70B), Mistral 7B and Mixtral 8x7B, Phi-3, and Gemma models using LoRA/QLoRA adapters, the parameter-efficient approach that trains a small set of low-rank adapter matrices rather than all model weights, reducing GPU memory requirements by 4-8x compared to full fine-tuning while achieving comparable accuracy for most adaptation tasks. QLoRA (Quantized LoRA) is the standard approach for resource-constrained training: the base model is loaded in 4-bit NF4 quantisation (reducing VRAM from ~28GB to ~6GB for a 7B model), with LoRA adapters trained in higher precision on top. This enables Llama 3 8B fine-tuning on a single 24GB GPU that costs ~$0.50/hour on Lambda Labs or RunPod rather than requiring an expensive multi-GPU cluster. Training infrastructure is set up on AWS (p3.2xlarge with V100, or p4d.24xlarge for larger models), GCP (A2 instances with A100s), or reserved GPU compute from Lambda Labs or RunPod, depending on your infrastructure preferences and training timeline. Hugging Face Transformers with the PEFT (Parameter-Efficient Fine-Tuning) library is the standard training stack: well-maintained, compatible with all major open-source model families, and straightforward to reproduce across GPU environments. vLLM is the production inference server for fine-tuned open-source models: it achieves 10-30x higher throughput than naive transformer inference through PagedAttention, continuous batching, and CUDA graph optimisation, which is the difference between a model that handles 10 requests/second and one that handles 200 requests/second on the same GPU. Air-gapped deployment is available for regulated environments (healthcare, finance, government) where data cannot leave your infrastructure: the fine-tuned model and inference server are deployed in your VPC with no outbound network access required for inference, and the LoRA adapter files are stored in your own S3 or equivalent object storage.
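To make the parameter-efficiency claim concrete, the sketch below counts trainable parameters for one weight matrix under LoRA and estimates weight-only memory at different precisions. The 4096 x 4096 layer shape and rank 16 are illustrative assumptions, and the memory figures cover model weights only, so they sit below the ~6GB quoted above, which would also include runtime overhead that this sketch does not model.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Illustrative layer: a single 4096 x 4096 projection, LoRA rank 16.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=16)
print(full, lora, full // lora)  # 16777216 131072 128

# Weight-only footprint of a 7B-parameter model.
print(weight_memory_gb(7_000_000_000, 32))  # 28.0 at 32-bit
print(weight_memory_gb(7_000_000_000, 4))   # 3.5 at 4-bit
```

For this layer the adapter trains 128 times fewer parameters than a full update, which is where most of the optimiser-state and gradient memory saving comes from.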

Evaluation and regression testing

A fine-tuned model without an evaluation framework is a liability: you cannot know if it improved over baseline, and you cannot catch regression when models or prompts are updated. The evaluation benchmark is constructed before any fine-tuning begins: a representative test set of 200-500 input/output pairs covering the full range of inputs the model will see in production, held out completely from training and used only for evaluation. Task-specific metrics are designed for your use case: exact match rate for classification and label extraction tasks (the percentage of outputs that exactly match the expected label); ROUGE-L for text generation tasks where exact match is too strict but overlap with a reference output still matters; JSON schema validation rate for structured extraction tasks; format compliance rate for constrained output tasks. Before/after comparison on the held-out test set: baseline metrics for the unmodified base model with your best prompt, metrics for the fine-tuned model, and statistical significance testing to confirm the improvement is real and not within the noise range of the evaluation set size. LLM-as-judge evaluation uses GPT-4o as a rater for tasks where automated metrics cannot capture quality (brand voice consistency, response helpfulness, factual accuracy on domain knowledge); the RAGAS evaluation framework covers retrieval-augmented tasks. Automated regression testing in CI: the evaluation benchmark runs against a new model version before production promotion using a GitHub Action or equivalent CI step, with the PR blocked if evaluation metrics drop below the configured threshold. LangSmith or Langfuse for production evaluation: a sample of live production inputs and outputs is evaluated by the LLM judge on a daily basis to catch the quality drift that happens when the production input distribution shifts away from the training distribution.
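A minimal sketch of the before/after comparison and CI gate described above, using only the standard library. The paired bootstrap is one reasonable choice of significance check, not necessarily the one used on a given project, and the function names, resample count, and example data are assumptions for illustration.

```python
import random


def exact_match_rate(predictions, expected):
    """Fraction of predictions equal to the expected label after trimming."""
    pairs = list(zip(predictions, expected))
    return sum(p.strip() == e.strip() for p, e in pairs) / len(pairs)


def bootstrap_diff_ci(base_hits, tuned_hits, n_resamples=2000, seed=0):
    """Approximate 95% interval for (tuned accuracy - baseline accuracy).

    base_hits and tuned_hits are 0/1 lists aligned by test example, so each
    resample draws the same examples for both models (a paired bootstrap).
    """
    rng = random.Random(seed)
    n = len(base_hits)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(tuned_hits[i] - base_hits[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


def passes_gate(metric, threshold):
    """CI gate: True when the metric meets the configured threshold."""
    return metric >= threshold


# Illustrative results on a 100-example held-out set (made-up numbers).
baseline = [1] * 60 + [0] * 40   # baseline correct on 60 of 100
tuned = [1] * 90 + [0] * 10      # fine-tuned correct on 90 of 100
low, high = bootstrap_diff_ci(baseline, tuned)
print(low > 0)  # an interval entirely above zero suggests a real improvement
```

In a CI step, a script along these lines would exit with a non-zero status when passes_gate returns False, which is what blocks the pull request.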

How we work

From scope to shipped

Every project follows the same four phases. Scope is locked and price is fixed before development starts.

01 · Discover and scope · Week 1

We map the problem, the task, and your data. You leave week 1 with a written scope: which model, which fine-tuning approach (LoRA, full, distillation), training data requirements, and a fixed-price quote. No training runs start without your sign-off.

02 · Data curation and preparation · Weeks 2-3

We audit your candidate data, design the training format for the target model, and curate or generate examples to the required volume. Quality decisions made here determine 80% of the final model quality.

03 · Fine-tune, evaluate, and iterate · Weeks 4-8

Training runs with evaluation against the benchmark after each iteration. We report accuracy, format compliance, and cost metrics. You see the numbers before we recommend a production deployment.

04 · Deploy and monitor · Weeks 8-12+

Production deployment with inference infrastructure, monitoring, and automated regression tests in CI. 8 weeks of post-launch support included in every project.

Why us

Why teams choose RaftLabs

Senior engineers build what they scope
The engineers who assess your fine-tuning problem also run the training and deploy the model. No bait-and-switch, no offshore handoff after the contract is signed. The team you meet in week 1 ships in week 12.
Fixed price before development starts
We scope the work, calculate the cost, and lock it in writing before any training starts. A scope change is a change request: priced, then agreed or dropped. It is never quietly absorbed into the project only to appear on the final invoice.
9 years and 100+ products shipped
Clients include Vodafone, T-Mobile, Aldi, Nike, Cisco, and Lockheed Martin. Our track record spans AI, SaaS, mobile, automation, and enterprise platforms in healthcare, fintech, logistics, and hospitality.
Compliance built in from the start
GDPR, HIPAA, SOC 2: compliance requirements are scoped in week 1, not retrofitted before launch. Air-gapped fine-tuning and deployment are available for regulated industries where data cannot leave your infrastructure.

Not sure if fine-tuning is the right path?

Tell us the use case, your current prompt approach, and where the base model falls short. We'll tell you whether fine-tuning is the answer, or whether there's a faster fix.

Talk about your fine-tuning project

Work with us

Tell us what you need. We'll tell you what it would take.

We scope LLM fine-tuning projects in 30 minutes. You walk away with a clear cost, timeline, and approach. No commitment required.

Scope and cost agreed before work starts. No surprises. No obligation.
Working prototype within 3 weeks of kickoff.
Pay by milestone. You see progress before each invoice.
60-day post-launch warranty. Bug fixes, UI tweaks, and deployment support. No retainer.
All conversations are NDA-protected.

Go deeper

LLM fine-tuning guide
RAG vs fine-tuning for business AI
How to choose your AI technology stack
Free AI cost estimator
Browse our AI case studies
