General-purpose language models are trained to be useful to everyone. Fine-tuning makes them specifically useful to you -- adapting their behaviour, vocabulary, tone, and output format to your domain, your data, and your product requirements.
We fine-tune language models on your datasets to improve accuracy on your specific tasks, reduce prompt length and inference cost, and produce outputs that match your brand voice and format requirements without extensive prompt engineering.
Fine-tuning on OpenAI, Llama 3, Mistral, and Phi models
Domain adaptation, output format alignment, and tone calibration
Training data curation, model evaluation, and production deployment
Cost and latency analysis -- fine-tuning vs. RAG vs. prompt engineering for your use case
RaftLabs provides LLM fine-tuning services for teams that need language models adapted to their specific domain, output format, or tone requirements. We handle training data curation, fine-tuning on OpenAI models (GPT-4o mini, GPT-3.5 Turbo) and open-source models (Llama 3, Mistral, Phi), evaluation against task-specific benchmarks, and production deployment. Fine-tuning is the right choice when prompt engineering alone can't achieve consistent output quality, when inference cost reduction is required at scale, or when domain-specific vocabulary significantly affects model performance.
When to fine-tune vs. prompt vs. RAG
Most teams reach for fine-tuning too early. The decision tree:
Try prompt engineering first. A well-structured system prompt with few-shot examples solves most output format and consistency problems without any training data.
Add RAG if the model needs your knowledge. When the model needs to answer questions about your specific documents, products, or data, retrieval-augmented generation gives it that knowledge without fine-tuning.
Fine-tune when prompt engineering cannot achieve a consistent output format despite detailed instructions, when inference cost at your expected volume makes a large model uneconomical, or when domain-specific terminology significantly degrades base model accuracy.
We will tell you which path is right for your use case -- including if fine-tuning is not the answer.
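The decision tree above can be sketched as a small function. This is only an illustration of the ordering described here: the `UseCase` fields and the `recommend` function are hypothetical names, and in practice each flag is a judgement made after testing, not a simple boolean.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    # Hypothetical flags you would assess for your own project.
    needs_private_knowledge: bool      # must the model answer from your documents or data?
    format_inconsistent: bool          # output format drifts despite detailed instructions
    cost_prohibitive_at_volume: bool   # large-model inference is uneconomical at your volume
    domain_terms_hurt_accuracy: bool   # specialised vocabulary degrades base-model accuracy


def recommend(case: UseCase) -> list[str]:
    """Return the approaches to try, in the order the decision tree describes."""
    steps = ["prompt engineering"]  # always the first thing to try
    if case.needs_private_knowledge:
        steps.append("RAG")
    if (case.format_inconsistent
            or case.cost_prohibitive_at_volume
            or case.domain_terms_hurt_accuracy):
        steps.append("fine-tuning")
    return steps


# A support bot that must answer from product docs but has no other issues.
print(recommend(UseCase(True, False, False, False)))
```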
Services
What we do
Training data curation
The quality of your fine-tuning data determines the quality of the fine-tuned model -- and most projects that fail at fine-tuning fail because of training data problems, not model architecture problems. 50-100 high-quality examples can produce measurable improvement for format and style alignment tasks; domain vocabulary adaptation typically requires 500-2,000 examples; substantial domain knowledge adaptation requires 5,000+. Quality assessment before any training run: we audit your candidate training data for consistency (does every example follow the same input/output format?), coverage (do examples cover the range of inputs the model will see in production?), and correctness (are the expected outputs actually correct, or do they propagate errors from your historical data?). Training data format design for the target model: OpenAI fine-tuning uses a messages format with system/user/assistant turns; Mistral instruct models use the [INST]...[/INST] instruction template; Llama 3 instruct models use their own chat template built from header tokens (<|start_header_id|>, <|end_header_id|>, <|eot_id|>). The format matters because an instruction-tuned base model was trained with specific delimiters that the fine-tuning data must match. Synthetic data generation when your existing examples are thin: GPT-4o, prompted with your domain knowledge and a few seed examples, generates additional training pairs, which are reviewed and filtered before use -- an approach that produces usable volume without expensive human labelling. Deduplication and near-duplicate removal using MinHash or embedding similarity to prevent the fine-tuned model from overfitting to repeated patterns in the training set. Train/validation/test split: an 80/10/10 split with the test set held completely separate from training, so that evaluation measures generalisation rather than memorisation.
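A minimal sketch of three of the steps above, using only the Python standard library: writing an example in the chat messages format as one JSONL line, removing near-duplicates, and making the 80/10/10 split. All function names are hypothetical. The near-duplicate check here is an exact Jaccard comparison over word shingles, which is the quantity MinHash approximates; it is not MinHash itself and would be too slow for large datasets.

```python
import json
import random
import re


def to_messages_record(system: str, user: str, assistant: str) -> str:
    """Serialise one example as a JSONL line in the chat messages format."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    })


def shingles(text: str, n: int = 3) -> set:
    """Word n-grams of the lower-cased text, used to compare examples."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first example of any group whose similarity reaches the threshold."""
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


def split_80_10_10(items: list, seed: int = 0):
    """Shuffle deterministically, then cut into train/validation/test."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Deduplicating before splitting matters: if near-identical examples land on both sides of the split, the test set stops measuring generalisation.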
Domain adaptation
Fine-tuning a general-purpose model on your domain vocabulary, technical terminology, abbreviation conventions, and document structure -- targeting the accuracy gap that appears when a base model encounters specialised language that was underrepresented in its pre-training data. Medical domain adaptation: clinical note abbreviations (SOB, HTN, DM2, CVA), ICD-10 and CPT code terminology, and the terse, telegraphic style of clinical documentation all diverge from the text a general model was trained to produce -- a fine-tuned clinical model correctly uses "patient presents with" where a general model might write "the individual reported experiencing." Legal domain adaptation: contract clause terminology, defined-term conventions (e.g., distinguishing a defined term in all-caps from a general reference), citation formats, and the formulaic language of different contract types that base models frequently paraphrase rather than reproduce precisely. Financial domain adaptation: earnings call transcription correction, financial statement extraction where specific GAAP/IFRS line items have precise meanings, and the structured table formats that financial reports require. Technical documentation: API endpoint names, version-specific syntax, and the formatting conventions of your documentation that a general model reproduces incorrectly. Domain evaluation benchmark established before fine-tuning begins: 200-500 held-out domain-specific examples with expected outputs, evaluated on terminology accuracy rate, abbreviation expansion correctness, and task-specific format compliance -- measuring improvement against baseline, not just loss curves on the training set.
Output format alignment
When your application requires structured JSON output, specific response structure, or constrained output styles that the base model produces inconsistently despite detailed prompting instructions, fine-tuning encodes the format directly into the model's learned behaviour rather than relying on prompt instructions that the model may or may not follow on any given generation. Structured JSON extraction: a fine-tuned model that has seen hundreds of examples of your specific JSON schema reliably produces valid, schema-conformant output where a base model might produce valid JSON with wrong field names, nested differently, or with hallucinated fields not in your schema. OpenAI's response_format: { type: "json_schema" } parameter is the right first approach for JSON enforcement without fine-tuning -- fine-tuning for format is the escalation step when structured output mode still produces wrong field values or when inference cost from a larger model is the driver. Brand voice and tone adaptation: a customer-facing chatbot that should always respond in a specific register (warm but professional, direct without being terse, never using industry jargon) is difficult to maintain consistently via prompt instructions when the system prompt competes with user input -- fine-tuning the tone into the model's weights produces more consistent behaviour across diverse user inputs. Label classification format: a model that must output one of a fixed set of classification labels without wrapping them in explanation text learns the exact output format through fine-tuning more reliably than through few-shot examples that compete with the model's default conversational output style. Format compliance rate as the primary evaluation metric: the percentage of test-set outputs that exactly match the required format without post-processing corrections -- target above 99% for production classification and extraction tasks.
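As an illustration of the format compliance metric described above, the sketch below counts an output as compliant only if it parses as JSON, has exactly the expected fields with the expected types, and uses an allowed label. The schema, the label set, and the function names are hypothetical stand-ins for your own.

```python
import json

# Hypothetical schema for a ticket-classification task.
ALLOWED_LABELS = {"billing", "technical", "account"}
REQUIRED_FIELDS = {"label": str, "confidence": float}


def is_compliant(raw_output: str) -> bool:
    """True only if the output needs no post-processing to be used."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # explanation text around the JSON, or invalid JSON
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return False  # missing, renamed, or hallucinated fields
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected_type):
            return False
    return data["label"] in ALLOWED_LABELS


def compliance_rate(outputs: list[str]) -> float:
    """Share of outputs that match the required format exactly."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(is_compliant(o) for o in outputs) / len(outputs)


outputs = [
    '{"label": "billing", "confidence": 0.9}',
    'Sure! {"label": "billing", "confidence": 0.9}',
    '{"category": "billing", "confidence": 0.9}',
    '{"label": "billing", "confidence": 0.9, "reason": "invoice"}',
]
print(f"format compliance: {compliance_rate(outputs):.0%}")
```

Only the first of the four sample outputs passes: the others wrap the JSON in text, rename a field, or add a field that is not in the schema.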
Inference cost reduction
A fine-tuned smaller model can match the output quality of a larger base model on a well-defined, specific task -- the principle behind the inference cost reduction use case for fine-tuning. As an illustration: a GPT-4o production deployment priced at $5/1M output tokens and processing 10 million output tokens per month costs $50/month in inference; a fine-tuned GPT-4o mini deployment priced at $0.60/1M output tokens for the same task costs $6/month -- an 88% inference cost reduction if the fine-tuned model matches GPT-4o's quality on that specific task. Per-token prices change over time, and fine-tuned models are billed at their own rates, so the comparison should be rerun with current pricing. The cost reduction is real only when the task is narrow enough that a smaller model fine-tuned on it can match the larger model's performance -- a complex reasoning task that genuinely requires GPT-4o's reasoning capability will not be successfully distilled into GPT-4o mini regardless of fine-tuning data volume. Task suitability assessment before committing to the cost reduction approach: we run the task against GPT-4o mini without fine-tuning to establish a baseline gap, then estimate whether fine-tuning is likely to close that gap based on task complexity and our experience with comparable tasks. Open-source model cost reduction: a fine-tuned Llama 3 8B running on a single A100 GPU at ~$2/hour can process 50,000+ tokens per minute on simple tasks, replacing per-token API fees with a fixed hourly GPU cost -- the economics that make open-source fine-tuning attractive above ~500,000 tokens per day. Distillation approach: the larger model's outputs on a representative input set are used as training data for the smaller model, transferring the larger model's task-specific behaviour rather than curating human-labelled training data -- typically faster and cheaper than human labelling at equivalent quality when the larger model's outputs are reliable enough to serve as targets.
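The arithmetic behind these figures is simple enough to rerun with your own volumes and current prices. The sketch below uses only the numbers quoted above; the function names are hypothetical, and the GPU calculation assumes the GPU is kept busy for every hour it is billed, which is the best case.

```python
def monthly_api_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Inference cost in dollars for a given monthly output-token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def gpu_cost_per_million(hourly_rate: float, tokens_per_minute: int) -> float:
    """Effective $/1M tokens for a self-hosted GPU at full utilisation."""
    tokens_per_hour = tokens_per_minute * 60
    return hourly_rate / tokens_per_hour * 1_000_000


# Figures quoted in the text above; replace them with current pricing.
large = monthly_api_cost(10_000_000, 5.00)
small = monthly_api_cost(10_000_000, 0.60)
reduction = (large - small) / large
print(f"${large:.2f} vs ${small:.2f} per month, {reduction:.0%} lower")

# A ~$2/hour GPU processing 50,000 tokens per minute.
print(f"self-hosted: ${gpu_cost_per_million(2.00, 50_000):.2f} per 1M tokens")
```

Idle hours raise the effective self-hosted cost per token, so the break-even volume depends heavily on how many hours the GPU is billed for.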
Open-source model fine-tuning
Fine-tuning Llama 3 (8B and 70B), Mistral 7B and Mixtral 8x7B, Phi-3, and Gemma models using LoRA/QLoRA adapters -- the parameter-efficient approach that trains a small set of low-rank adapter matrices rather than all model weights, reducing GPU memory requirements by 4-8x compared to full fine-tuning while achieving comparable accuracy for most adaptation tasks. QLoRA (Quantized LoRA) as the standard approach for resource-constrained training: the base model is loaded in 4-bit NF4 quantisation (reducing VRAM from ~28GB in 32-bit precision to ~6GB for a 7B model), with LoRA adapters trained in 16-bit precision on top -- enabling Llama 3 8B fine-tuning on a single 24GB GPU that costs ~$0.50/hour on Lambda Labs or RunPod rather than requiring an expensive multi-GPU cluster. Training infrastructure setup on AWS (p3.2xlarge with a V100, or p4d.24xlarge for larger models), GCP (A2 instances with A100s), or reserved GPU compute from Lambda Labs or RunPod, depending on your infrastructure preferences and training timeline. Hugging Face Transformers with the PEFT (Parameter-Efficient Fine-Tuning) library as the standard training stack: well-maintained, compatible with all major open-source model families, and straightforward to reproduce across GPU environments. vLLM as the production inference server for fine-tuned open-source models: vLLM can achieve 10-30x higher throughput than naive transformer inference through PagedAttention, continuous batching, and CUDA graph optimisation -- the difference between a model that handles 10 requests/second and one that handles 200 requests/second on the same GPU. Air-gapped deployment for regulated environments (healthcare, finance, government) where data cannot leave your infrastructure: the fine-tuned model and inference server are deployed in your VPC with no outbound network access required for inference, and the LoRA adapter files are stored in your own S3 or equivalent object storage.
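The memory figures above follow from the number of bytes stored per weight. The sketch below computes the memory needed for the weights alone; the function name is hypothetical. It is an assumption on our part that the gap between the 3.5 GB of 4-bit weights and the ~6GB quoted above is overhead such as quantisation constants, adapter weights, and activations, which this calculation does not model.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# A 7B-parameter model at three common storage precisions.
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit NF4", 4)]:
    print(f"7B weights in {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Training needs more memory than these figures because gradients, optimiser state, and activations come on top, which is why freezing the quantised base model and training only small adapters makes a single 24GB GPU workable.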
Evaluation and regression testing
A fine-tuned model without an evaluation framework is a liability -- you cannot know whether it improved over baseline, and you cannot catch regressions when models or prompts are updated. Evaluation benchmark construction before any fine-tuning begins: a representative test set of 200-500 input/output pairs covering the full range of inputs the model will see in production, held out completely from training and used only for evaluation. Task-specific metrics designed for your use case: exact match rate for classification and label extraction tasks (the percentage of outputs that exactly match the expected label); ROUGE-L for text generation tasks where exact match is too strict but overlap with a reference output still matters; JSON schema validation rate for structured extraction tasks; format compliance rate for constrained output tasks. Before/after comparison on the held-out test set: baseline metrics for the unmodified base model with your best prompt, metrics for the fine-tuned model, and statistical significance testing to confirm the improvement is real and not within the noise range of the evaluation set size. LLM-as-judge evaluation using GPT-4o as a rater for tasks where automated metrics cannot capture quality (brand voice consistency, response helpfulness, factual accuracy on domain knowledge), and the RAGAS evaluation framework for retrieval-augmented tasks. Automated regression testing in CI: the evaluation benchmark runs against a new model version before production promotion using a GitHub Action or equivalent CI step, with the PR blocked if evaluation metrics drop below the configured threshold. LangSmith or Langfuse for production evaluation: a sample of live production inputs and outputs is evaluated by the LLM judge on a daily basis to catch the quality drift that happens when the production input distribution shifts away from the training distribution.
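Two of the metrics above and the CI threshold check can be sketched in plain Python. This is a simplified illustration: the function names are hypothetical, tokenisation is whitespace splitting, and the ROUGE-L score here is the F1 of longest-common-subsequence precision and recall, without the stemming or weighting options that published ROUGE implementations offer.

```python
def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Share of predictions equal to the expected label, ignoring outer whitespace."""
    pairs = list(zip(predictions, references))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def passes_gate(metrics: dict, thresholds: dict) -> bool:
    """True if every metric meets its configured minimum."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())


metrics = {
    "exact_match": exact_match_rate(["billing", "technical"], ["billing", "account"]),
    "rouge_l": rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"),
}
print(metrics, passes_gate(metrics, {"exact_match": 0.9, "rouge_l": 0.8}))
```

In a CI step, the script would exit with a non-zero status when `passes_gate` returns False, which is what blocks the PR.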
Not sure if fine-tuning is the right path?
Tell us the use case, your current prompt approach, and where the base model falls short. We'll tell you whether fine-tuning is the answer -- or whether there's a faster fix.
Fine-tuning is the process of continuing to train a pre-trained language model on your specific data so it adapts to your task, domain, and output requirements. Use fine-tuning when prompt engineering alone cannot produce a consistent output format (the model keeps ignoring your instructions), when you need significant inference cost reduction at scale (on a narrow task, a fine-tuned smaller model can match or outperform a larger model with a long system prompt), or when domain-specific vocabulary significantly degrades base model performance. Fine-tuning is not always the right answer -- start with prompt engineering and RAG.
OpenAI fine-tuning API: GPT-4o mini and GPT-3.5 Turbo (hosted fine-tuning, no infrastructure required). Open-source models: Llama 3 (8B, 70B), Mistral 7B, Phi-3, and Gemma (require GPU infrastructure for training). Google Gemini fine-tuning via Vertex AI. The right model depends on: your budget (open-source eliminates per-token costs), your data privacy requirements (open-source runs on your infrastructure), and your accuracy requirements (larger models generally fine-tune to higher accuracy but cost more to run).
For OpenAI fine-tuning: 50-100 high-quality examples is the minimum; 500-1,000 is recommended for reliable improvement; 5,000+ for significant domain adaptation. Quality matters more than quantity -- 100 carefully curated examples can outperform 10,000 inconsistent ones. For open-source model fine-tuning (full fine-tuning or LoRA/QLoRA adapters): 1,000-50,000 examples depending on the degree of adaptation required. We assess your existing data and help curate or generate training examples if your dataset is thin.
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than the full model. It is dramatically cheaper in compute and memory than full fine-tuning while achieving comparable results for most tasks. QLoRA extends this with quantisation for even lower memory requirements. LoRA is the standard approach for fine-tuning open-source models on modest GPU infrastructure. We use LoRA/QLoRA for open-source model fine-tuning and full fine-tuning only when the task requires it.
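To see why the adapters are small, consider a single weight matrix. LoRA leaves the original d_out x d_in matrix frozen and trains two low-rank factors, of shapes d_out x r and r x d_in. The sketch below counts the trainable parameters for one layer; the 4096 x 4096 shape and rank 16 are illustrative assumptions, not a recommendation for any particular model.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative layer shape and rank (assumptions, not tied to a specific model).
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, rank=16)
print(f"full layer: {full:,} params, adapter: {adapter:,} params "
      f"({adapter / full:.2%} of the layer)")
```

With these numbers the adapter holds fewer than 1% of the layer's parameters, which is where the compute and memory savings come from.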
We establish a benchmark before fine-tuning -- a representative set of inputs with expected outputs, evaluated on your task-specific metrics (accuracy, format compliance, domain terminology usage, output length consistency). The fine-tuned model is evaluated against this benchmark on a held-out test set. We only recommend proceeding to production deployment when benchmark improvement is statistically significant. Fine-tuning that does not improve over the baseline prompt-engineered base model is not worth the cost.
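One way to check significance when both models are scored on the same test examples is an exact McNemar test, which looks only at the examples where the two models disagree. This is a sketch of one possible choice of test, not a description of a fixed procedure; the function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(baseline_correct: list[bool], tuned_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-example correctness."""
    pairs = list(zip(baseline_correct, tuned_correct))
    improved = sum(t and not b for b, t in pairs)   # baseline wrong, tuned right
    regressed = sum(b and not t for b, t in pairs)  # baseline right, tuned wrong
    n = improved + regressed
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    k = min(improved, regressed)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Baseline fails 10 of 15 examples; the fine-tuned model gets all 15 right.
baseline = [False] * 10 + [True] * 5
tuned = [True] * 15
print(f"p = {mcnemar_exact_p(baseline, tuned):.4f}")
```

A small p-value indicates the improvement is unlikely to be noise from the size of the evaluation set; the significance level to require is a choice to fix before looking at results.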
Fine-tuning project cost covers training data curation, fine-tuning run costs, evaluation, and deployment. For OpenAI fine-tuning (GPT-4o mini or GPT-3.5 Turbo), the OpenAI training API costs are low ($1-$10 for typical datasets) -- the project cost is primarily in data curation and evaluation work ($8,000-$25,000). For open-source model fine-tuning with infrastructure setup, the project cost is $20,000-$60,000, including GPU compute, deployment infrastructure, and the evaluation framework.
Work with us
Tell us what you need. We'll tell you what it would take.
We scope LLM fine-tuning projects in 30 minutes. You walk away with a clear cost, timeline, and approach. No commitment required.
Scope and cost agreed before work starts. No surprises. No obligation.
Working prototype within 3 weeks of kickoff.
Pay by milestone. You see progress before each invoice.
60-day post-launch warranty. Bug fixes, UI tweaks, and deployment support. No retainer.