In early 2025, a lab called DeepSeek released models that matched the reasoning of far pricier frontier systems for a training budget that looked like a rounding error, and the AI world collectively lost its composure. One word kept turning up in every explainer: distillation. A year later the topic landed back on the front page when Anthropic reported that several labs had been covertly copying its Claude model at industrial scale, generating over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts.
So what is the technique sitting behind both stories, and does it belong anywhere near your roadmap? This guide covers model distillation from the ground up: what it is, how it works, where it earns its keep, the one thing it genuinely cannot do, and whether it is even legal to distill ChatGPT or Claude. If you are considering it for a product, our AI model distillation services team can help you plan the project and understand the costs before you commit any budget.
What Is Model Distillation?
Model distillation is the process of transferring the knowledge of a large, capable model (the teacher) into a smaller, cheaper model (the student), so the student approaches the teacher’s quality on a specific task while running faster and costing far less. The student rarely matches the teacher across everything it can do. On a narrow domain, though, it can get remarkably close. The idea is not new. It was formalized by Geoffrey Hinton and colleagues in their 2015 paper Distilling the Knowledge in a Neural Network, and it has been a quiet workhorse of practical machine learning ever since.
The classic proof point is DistilBERT. BERT was a landmark Google language model that powered a wave of search and text-understanding tools, and DistilBERT is a compressed version of it built by the team at Hugging Face. According to Sanh et al., 2019, DistilBERT kept about 97% of BERT’s language understanding while being roughly 40% smaller and 60% faster. That single result captures the whole appeal of AI model distillation: keep most of the smarts, drop most of the cost.
How Does Model Distillation Work?
You take a strong model, use it to produce high-quality answers, and then train a smaller model to imitate those answers until the small one behaves like a compact copy of the big one for your use case. The student is not memorizing a lookup table. It is learning the teacher’s patterns of reasoning and response on the task you care about. Most modern LLM distillation uses one of three mechanisms, and the difference matters when you plan a project.
Response-Based (Data) Distillation
This is the common approach for language models, and it is the one most platforms automate. You run the teacher across a large set of prompts, capture its outputs, and fine-tune the student on those input-output pairs. The student learns to reproduce the teacher’s behavior directly. It is exactly how DeepSeek built its distilled models: the team compiled a dataset of 800,000 examples generated by its own R1 model, then used that data to fine-tune existing open models such as Qwen and Llama.
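To make the data-collection step concrete, here is a minimal sketch of it in Python. It is an illustration under stated assumptions, not anyone's production pipeline: toy_teacher is a hypothetical stand-in for a real teacher model call, and the prompt/completion field names are simply one common layout for fine-tuning data.

```python
import json


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher model call (an assumption)."""
    label = "billing" if "invoice" in prompt.lower() else "general"
    return f"Category: {label}"


def build_distillation_set(prompts, teacher):
    """Run the teacher over every prompt and keep the input-output pairs."""
    records = []
    for prompt in prompts:
        answer = teacher(prompt).strip()
        if not answer:
            # Skip empty generations rather than train the student on them.
            continue
        records.append({"prompt": prompt, "completion": answer})
    return records


def to_jsonl(records) -> str:
    """Serialize one JSON object per line, a common fine-tuning input layout."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


prompts = ["Where is my invoice for March?", "How do I reset my password?"]
dataset = build_distillation_set(prompts, toy_teacher)
print(to_jsonl(dataset))
```

In a real project the teacher call would hit a model you are permitted to train on, and the resulting file would feed a standard supervised fine-tuning run for the student.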
Soft-Label (Logit) Distillation
Instead of training only on the teacher’s final text, the student trains on the teacher’s full probability distribution over possible tokens, the so-called soft labels. That signal carries more information, since it tells the student not just that the answer was A but that B was almost as likely.
The catch is access. Closed APIs like ChatGPT do not expose their internal probabilities, so soft-label distillation is mostly available when you control the teacher or use an open one (an openly licensed model whose internals you can inspect and run yourself, such as Llama or Qwen).
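A small sketch shows what training on soft labels means numerically. It assumes you can read raw logits from both models, uses made-up logit values for three candidate tokens, and follows the temperature-scaled formulation from the Hinton et al. paper. Real training code would compute this inside a deep learning framework rather than in plain Python.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Distance between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor follows Hinton et al. (2015), keeping gradient
    # magnitudes comparable when the temperature changes.
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


# Illustrative logits for tokens A, B, C: the teacher thinks B is almost
# as likely as A, while the student is overconfident in A.
teacher_logits = [4.0, 3.8, 0.5]
student_logits = [2.0, 0.0, 0.0]

probs = softmax(teacher_logits)
loss = soft_label_loss(teacher_logits, student_logits)
print("teacher probabilities:", [round(p, 3) for p in probs])
print("soft-label loss:", round(loss, 4))
```

The loss drops to zero only when the student reproduces the whole distribution, including the near-miss on B, which is the extra information a text-only transcript throws away.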
Feature-Based Distillation
This variant goes a layer deeper than the final answer or its probabilities and looks inside the teacher itself. Rather than copying only what the teacher says, the student learns to match the teacher’s intermediate representations, the internal patterns the model forms in its hidden layers as it works toward an output. A rough analogy is having the student reproduce the teacher’s working, not just its final answer. Like soft-label distillation, it needs access to the teacher’s internals rather than just its text, so it suits open or self-owned models. It was also one of the training signals behind DistilBERT, whose training included a loss that aligns the student’s hidden states with the teacher’s.
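One way to picture matching intermediate representations is a loss that rewards the student's hidden vectors for pointing in the same direction as the teacher's. The sketch below is a simplified illustration with invented two-dimensional vectors. It assumes both models expose one hidden vector per token with the same width; when the widths differ, a learned projection is typically added first.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_loss(teacher_hidden, student_hidden):
    """Mean of (1 - cosine similarity) across token positions.

    0.0 means every student vector points the same way as the teacher's;
    2.0 means every one points in the opposite direction.
    """
    per_token = [
        1.0 - cosine_similarity(t, s)
        for t, s in zip(teacher_hidden, student_hidden)
    ]
    return sum(per_token) / len(per_token)


# Invented hidden states for a two-token input (illustrative values only).
teacher_hidden = [[1.0, 0.0], [0.5, 0.5]]
aligned_student = [[2.0, 0.0], [1.0, 1.0]]   # same directions, other scale
drifted_student = [[0.0, 1.0], [0.5, 0.5]]   # first token is orthogonal

print("aligned:", feature_loss(teacher_hidden, aligned_student))
print("drifted:", feature_loss(teacher_hidden, drifted_student))
```

Because the comparison is about direction, a student vector with a different scale can still score perfectly, which is one reason a cosine-style loss is a convenient choice for this kind of alignment.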
Model Distillation Techniques and Types
The model distillation techniques you will meet sort along three simple questions: what signal the student learns from, how the teacher and student relate during training, and how broad the target task is. The first question, the training signal, is the one we covered in the mechanisms above: response-based learns from final outputs, logit-based from soft probabilities, and feature-based from the teacher’s hidden representations. The other two questions produce the types worth knowing before you brief an engineering team, and getting the combination right is where experienced large language model development separates a clean win from a wasted quarter.
Offline Distillation
This is the standard, everyday setup. The teacher is already trained and stays frozen while the student learns from its outputs. Because the teacher never changes, the process is simple to run and easy to reason about, which is why most production projects start here.
Online Distillation
Here the teacher and student train at the same time rather than one after the other. The teacher keeps improving alongside the student, which can produce a stronger result, but it is more complex and more expensive to coordinate. Teams usually reach for it only when offline distillation leaves performance on the table.
Self-Distillation
In self-distillation, a model acts as its own teacher and passes its knowledge to a smaller or later version of itself. It sounds circular, but it is a practical way to compress a model or to clean up its behavior without bringing in a separate, larger teacher. The approach shows up when a team wants a leaner version of a model it already owns.
Task-Specific Distillation
This targets one narrow job, such as support-ticket triage, document classification, or structured data extraction. Because the student only needs to be good at a single task, it can be small, cheap, and very fast. This is the most cost-effective form of distillation and the one most businesses should consider first.
General Distillation
This aims to transfer broad capability rather than a single skill, producing a smaller model that stays competent across many tasks. It is the heavier lift, since it demands far more data and compute, and it is the route DeepSeek took with its general-purpose reasoning models. Most companies do not need to go this far, but it is the right call when one model has to handle a wide range of work.
Model Distillation vs Fine-Tuning: What Is the Difference?
People often use the two terms interchangeably, and the confusion is understandable because the final step is the same in both. The cleanest distinction is the source of the training signal. In ordinary LLM fine-tuning, the model learns from human-written gold labels you provide. In distillation, the labels come from another model, the teacher, which generates the data the student learns from.
In other words, distillation is a way of producing training data, and fine-tuning is the act of training on it. A distillation pipeline almost always ends in a supervised fine-tuning run, which is why the line blurs in practice. The reason to care about the difference is cost and ownership. Fine-tuning on your own labeled data is about teaching style or format. Distillation is about cheaply inheriting a much larger model’s capability on a defined task.
Why Teams Are Betting on Model Distillation
The economics are the entire point, and they are getting more compelling by the quarter. On its managed model distillation service, AWS reports that distilled models can run up to 500% faster and cost up to 75% less than the originals, with less than 2% accuracy loss for use cases like retrieval-augmented generation. You also get smaller models that fit on cheaper hardware or run on-device, lower energy use, and, when you self-host, full control over the weights with no per-token API fees.
The broader cost curve points the same direction. According to Stanford HAI’s 2025 AI Index Report, driven by increasingly capable small models, the inference cost for a GPT-3.5-level system fell more than 280-fold between November 2022 and October 2024, roughly from $20 to $0.07 per million tokens. The market is already shifting accordingly. Gartner predicts that by 2027 organizations will use small, task-specific AI models at least three times more than general-purpose large language models.
Quality on narrow tasks is the part that surprises people. The DeepSeek-R1 paper reports its distilled 32B student scoring 94.3% on the competition-level MATH-500 benchmark against its 671B-parameter teacher’s 97.3%, a small gap from a model roughly twenty times smaller. Specialization is becoming the norm rather than the exception, and Gartner expects that by 2027 more than 50% of the GenAI models enterprises use will be specific to an industry or business function, up from about 1% in 2023.
What Model Distillation Cannot Do
A distilled model is a specialist, not a generalist, and pretending otherwise is the fastest way to a disappointing pilot. The student rarely exceeds the teacher, and it inherits the teacher’s blind spots and biases along with its strengths. If the teacher is wrong about something, the student will be confidently wrong about the same thing, so bias checks belong in the plan from day one.
Narrow generalization is the subtler trap. A student tuned for one domain can quietly degrade elsewhere, and one recent study found that every DeepSeek-R1-Distill-Qwen checkpoint scored below the size-matched baseline on a constraint-solving benchmark, even though those same models excel at math and code. Distillation buys task performance, not universal improvement. There is also iterative cost creep to watch, because generating synthetic data from a high-end teacher means paying that teacher’s premium token rates, and repeated training cycles add up even when the final model runs cheaply.
Is Model Distillation Legal? Can You Distill ChatGPT or Claude?
Distillation itself is a legitimate, well-established technique. The legal question is entirely about whose outputs you train on and under what terms. Distilling within a provider’s own platform is not just allowed, it is a supported feature: OpenAI offers model distillation in its API so you can use a larger GPT model like GPT-4o to fine-tune a smaller one like GPT-4o mini, all in one place.
Trying to distill ChatGPT from the outside is a different matter. OpenAI’s terms prohibit using its outputs to train competing models, and that restriction is why DeepSeek drew scrutiny in early 2025. The same goes for any attempt to distill Claude against the rules. Anthropic reported in February 2026 that three labs had run industrial-scale distillation attacks against Claude, and the company now runs detection systems that flag distillation-style traffic and coordinated account activity.
The honest takeaway is that you do not get a closed model’s weights or probabilities, and you cannot legally clone a competitor’s API to build a rival. The realistic routes are distilling inside a provider’s platform or using an openly licensed teacher such as Llama, Qwen, or DeepSeek, where you own the entire pipeline. We walk through the practical mechanics in our companion piece, how to distill an LLM step-by-step.
What Does It Actually Cost?
A small, task-specific project is cheaper than most people expect. Take a realistic example: generate a few thousand training samples from a strong teacher, fine-tune a compact student, then serve it for a month of moderate traffic. The one-time setup tends to land in the low tens of dollars when you use a managed platform, and the ongoing inference bill for the distilled specialist can run a few dollars a month.
The same monthly query volume on the original frontier model would cost many times more, which is the entire argument for distilling in the first place. Self-hosting changes the shape of the bill rather than just the size of it: you trade per-token API fees for GPU hosting, which wins at high volume and loses at low volume.
Before you commit a budget, run the math with your own numbers. Pull the current per-token and GPU-hour rates straight from the provider’s pricing page, since published rates shift monthly and different providers quote different figures. Estimate your real monthly volume (roughly, your expected number of queries times the tokens each one uses), then price out both options side by side: the frontier teacher on a pay-per-token API versus the distilled student, whether you serve it serverless or self-host it. That comparison, not any single headline number, tells you whether distillation pays off at your scale.
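The side-by-side comparison described above fits in a few lines of Python. Every number in this sketch is a placeholder chosen for round arithmetic, not a quoted price from any provider, so swap in the rates from the pricing pages you actually use. The function names are ours, invented for illustration.

```python
def monthly_tokens(queries_per_month, tokens_per_query):
    """Estimated monthly volume: queries times tokens used per query."""
    return queries_per_month * tokens_per_query


def api_cost(tokens, price_per_million_tokens):
    """Bill for a pay-per-token API at the given rate."""
    return tokens / 1_000_000 * price_per_million_tokens


def self_hosted_cost(gpu_hours, price_per_gpu_hour):
    """Bill for renting a GPU, independent of how many tokens it serves."""
    return gpu_hours * price_per_gpu_hour


def break_even_tokens(monthly_gpu_cost, price_per_million_tokens):
    """Monthly token volume at which a fixed GPU bill equals the API bill."""
    return monthly_gpu_cost / price_per_million_tokens * 1_000_000


# Placeholder inputs (assumptions, not real prices).
TEACHER_PRICE = 10.0    # USD per million tokens, frontier teacher API
STUDENT_PRICE = 0.5     # USD per million tokens, distilled student, serverless
GPU_HOUR_PRICE = 1.25   # USD per GPU-hour for self-hosting the student
HOURS_PER_MONTH = 730   # one always-on GPU

tokens = monthly_tokens(queries_per_month=200_000, tokens_per_query=1_500)
teacher_bill = api_cost(tokens, TEACHER_PRICE)
student_bill = api_cost(tokens, STUDENT_PRICE)
hosted_bill = self_hosted_cost(HOURS_PER_MONTH, GPU_HOUR_PRICE)
crossover = break_even_tokens(hosted_bill, STUDENT_PRICE)

print(f"monthly tokens:         {tokens:,}")
print(f"teacher via API:        ${teacher_bill:,.2f}")
print(f"student, serverless:    ${student_bill:,.2f}")
print(f"student, self-hosted:   ${hosted_bill:,.2f}")
print(f"self-hosting wins above {crossover:,.0f} tokens per month")
```

The crossover figure is the useful output: below it, the serverless student is cheaper than an always-on GPU, and above it, self-hosting starts to pay for itself, assuming a single GPU can actually handle that load.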
Build Your Model Distillation Roadmap With Redwerk
If distillation looks like a fit, the next step is a plan grounded in your data and your numbers. As an AI development company, Redwerk helps businesses select the right model and the proper tech stack for model distillation. Through a professional discovery phase, you get a roadmap tailored to your industry and use cases, along with a clear picture of the cost and effort involved, and we provide an estimate before any work begins.
We apply the fundamental engineering principles and security best practices honed through decades of building custom software for businesses across North America and Europe, including Fortune 500 companies like Siemens, J.B. Hunt, and Universal Music Group. That means a distilled model that performs on your task, holds up in production, and does not quietly leak the IP or compliance risks that distillation can introduce when it is done carelessly. Tell us about your use case, and we will help you decide whether to distill, fine-tune, or take a different route entirely.