Engineering · April 2026

Fine-Tuning AI to Sound Like a Specific Person

We took a generic open-weight model, trained it on one person’s writing for six hours on a laptop, and produced outputs that evaluators could not reliably distinguish from the original. The method has direct applications for any sales team still sending AI-generated emails that read like AI-generated emails.

Foundation models write competent prose. They also write prose that sounds like every other foundation model. For sales and executive communication, this is a liability. Buyers have learned to recognize the cadence: the em-dash reveals, the three-part lists, the words “delve” and “landscape” and “navigate.” Generic AI copy triggers the same mental filter as the word “Viagra” did in 2006. It gets deleted.

We wanted to test whether Supervised Fine-Tuning (SFT) on a small, curated dataset could close the gap between generic model output and a specific person's voice. Not improve it marginally. Close it to the point where humans cannot tell the difference.

The target was Box CEO Aaron Levie, chosen because his writing style is distinct enough to serve as a clean signal: short, declarative, full of industry-specific analogies, consistently forward-looking. If SFT can capture this voice, it can capture your top sales rep's voice or your CEO's investor update tone.

This matters more than most teams realize. In our research on lean sales teams, the companies generating 2.8x more revenue per employee were not just using AI—they were using AI that did not sound like AI. The efficiency gap widens when outbound actually gets replies.

01 The Experiment

Why SFT and not prompting

The obvious question: why not just use a detailed system prompt? We tried. A well-crafted prompt gets you 60–70% of the way there for the first few outputs. Then the model drifts. It reverts to its training distribution. The hedging returns, the dramatic analogies creep back, the responses bloat. Prompt engineering is a runtime constraint. SFT changes the weights. A prompted model is always fighting its default behavior. A fine-tuned model has a new default.

02 Setup

Everything ran locally on consumer Apple Silicon with unified memory. No cloud GPUs, no API calls during training. The framework was MLX, optimized for Apple hardware.

Parameters: 14B
Peak memory: 17.8 GB
Training iterations: 750
Total training time: 6 hrs

Base model: DeepSeek-R1-Distill-Qwen-14B-4bit. Strong reasoning capabilities out of the box. Writes like a helpful assistant out of the box, which is precisely the problem we are solving.

The training dataset was several hundred input-output pairs. Each input was a piece of tech news or industry commentary (a “Quoted Tweet” with links redacted for training hygiene). Each output was Levie's actual historical response. The system prompt enforced constraints before training began: no hashtags, no emojis, declarative tone, extrapolation from present signals to future implications.
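The post does not publish its exact file schema, but the shape of a training pair is simple to sketch. Below is a hypothetical example in the prompt/completion JSONL style commonly used for LoRA fine-tuning; the field names, system text, and sample tweet are illustrative, not the actual dataset.

```python
import json

# Hypothetical serialization of one stimulus/response pair.
# The system constraints are baked into every prompt before training.
SYSTEM = (
    "Respond to the quoted tweet in the target author's voice. "
    "No hashtags, no emojis. Declarative tone. "
    "Extrapolate from present signals to future implications."
)

def make_example(quoted_tweet: str, actual_response: str) -> str:
    """Serialize one training pair as a JSONL line."""
    record = {
        "prompt": f"{SYSTEM}\n\nQuoted Tweet: {quoted_tweet}",
        "completion": actual_response,
    }
    return json.dumps(record)

line = make_example(
    "Engineering job openings are at a three-year high.",
    "This is the demand spike you'd expect when a new technology is adopted at scale.",
)
parsed = json.loads(line)
print(parsed["completion"])
```

Each line of the JSONL file is one such record; the completion is always the author's real historical response, never synthetic text.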

Fine-tuning used LoRA (Low-Rank Adaptation), which updates a small subset of the model's weights rather than the full parameter set. This is what makes training on a laptop feasible. The full 14B model has billions of parameters. LoRA touches a fraction of them, enough to shift stylistic behavior without destroying the base model's reasoning ability.
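To see why LoRA makes laptop training feasible, here is a minimal numeric sketch of the low-rank update. The dimensions, rank, and scaling are illustrative defaults, not the run's actual configuration.

```python
import numpy as np

# LoRA idea: instead of updating a full weight matrix W (d_out x d_in),
# train two small matrices B (d_out x r) and A (r x d_in) and add their
# scaled product to the frozen base weight.
d_out, d_in, rank, alpha = 4096, 4096, 16, 32

W = np.random.randn(d_out, d_in)          # frozen base weight
A = np.random.randn(rank, d_in) * 0.01    # trainable
B = np.zeros((d_out, rank))               # trainable, zero-init so the
                                          # adapter starts as a no-op

def adapted(x):
    # Forward pass with the low-rank update: (W + (alpha/rank) * B @ A) @ x
    return W @ x + (alpha / rank) * (B @ (A @ x))

full_params = d_out * d_in
lora_params = rank * (d_out + d_in)
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.4%}")
```

At rank 16 on a 4096x4096 layer, the trainable adapter is under 1% of the full matrix, which is why gradients and optimizer state fit comfortably alongside a 4-bit base model in unified memory.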

What the training data looked like

Each training example paired a piece of external content (the stimulus) with the target response (what Levie actually wrote). We deliberately included examples where the response was short (one sentence) and examples where it ran long (multiple paragraphs), so the model could learn when brevity was the right move. This variation mattered. Early training runs on a more uniform dataset produced a model that was always terse, regardless of whether the topic deserved longer treatment.
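One way to sanity-check that variation before training is to look at the spread of target lengths. A toy check, with made-up responses standing in for the real dataset:

```python
# Curation check (illustrative): verify the dataset mixes short and
# long targets so the model can learn when brevity is appropriate.
responses = [
    "Compression is the product.",                                 # one-liner
    "We don't know if there would have been more open roles if not for "
    "AI. But demand for talent is unlike anything we've seen.",    # longer
    "The best way to understand something is to build it.",
]

lengths = sorted(len(r.split()) for r in responses)
spread = lengths[-1] / lengths[0]
print(f"shortest: {lengths[0]} words, longest: {lengths[-1]} words, "
      f"spread: {spread:.1f}x")
# A spread near 1x signals a uniform dataset, which tends to produce a
# model that is always terse or always verbose.
```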

03 Blind Test

Before the evaluation scores: raw output. The three quoted tweets below come from the evaluation set. For each, one response was Levie's actual writing; the other was generated by the fine-tuned model responding to the same tweet.

These three were selected because the evaluator scored the SFT response 5/5 on freedom from AI-isms.

Quoted Tweet: Even the best developer tools mostly still don't let you sign up for an account via API. This is a big miss in the claude code age because it means that claude can't sign up on its own. Putting all your account management functions in your API should be tablestakes now.
Quoted Tweet: It is becoming clearer that Jevons paradox applies to competent human software engineers. If AI makes them more efficient and more productive, demand for their work will increase.
Quoted Tweet: The software industry is apparently dying but job postings for software engineers are rapidly rising!

Evaluators could not reliably pick the AI in pairs like these, and that is the point. The SFT model did not memorize phrases. It learned the framework: how the target thinks about a topic, what level of specificity to operate at, when to stop writing.

Look at the first example. The human wrote a long, detailed response. The AI wrote a single sentence that distilled the same idea. Both are valid Levie-style outputs. The model learned that sometimes the right move is compression, sometimes expansion. That is learned judgment, not prompt engineering.

04 Evaluation

We evaluated 30 identical scenarios across three sources: the DeepSeek base model (un-tuned), the SFT model (fine-tuned), and the human original. Each response was scored 1–5 on six dimensions by an LLM-as-judge pipeline.

The scoring rubric for “Freedom from AI-isms” used our open-source Stop Slop prompt instructions. The evaluator penalized throat-clearing openers, business jargon, performative emphasis, and formulaic binary contrasts. The remaining dimensions: Brevity measured whether the response gets to the insight or wastes words summarizing the prompt. Declarative Confidence measured whether it speaks as an authority or hides behind hedges. Industry Wit checked for natural use of tech-insider shorthand. Forward-Looking Vision assessed whether the response extrapolates from present signals or merely restates the news. Grammar was the control.
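One dimension of that rubric can be sketched mechanically. The real judge is an LLM prompted with the Stop Slop instructions; the word list and scoring rule below are simplifying assumptions for illustration, not the actual rubric.

```python
import re
from statistics import mean

# Illustrative stand-in for the "Freedom from AI-isms" dimension.
# The marker list is an assumption, not the full Stop Slop rubric.
AI_ISMS = {"delve", "landscape", "testament", "navigate", "synergistic"}

def ai_ism_freedom(response: str) -> int:
    """5 = no markers; lose a point per distinct AI-ism, floor of 1."""
    tokens = set(re.findall(r"[a-z]+", response.lower()))
    return max(1, 5 - len(AI_ISMS & tokens))

responses = [
    "Let's delve into the evolving landscape of enterprise AI.",
    "Demand for engineers is rising. That is the whole story.",
]
print(mean(ai_ism_freedom(r) for r in responses))
```

The actual pipeline applies six such dimensions across 30 scenarios and averages per dimension, producing the profiles charted below.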

Base Model vs. Fine-Tuned SFT
[Radar chart comparing DeepSeek Base and Fine-Tuned SFT across Brevity, Confidence, Wit, Vision, AI-ism Freedom, and Grammar.]

Fine-Tuned SFT vs. Human Original
[Radar chart comparing Fine-Tuned SFT and Human Original across Brevity, Confidence, Wit, Vision, and Grammar.]

The base model (gray) scores high on grammar but falls short on brevity and wit. SFT reshapes the profile toward the human baseline (green). The shapes nearly overlap.

The style gap, not the intelligence gap

Both models scored 5.0 or near-perfect on grammar. The base model is not failing because it lacks intelligence. It is failing because it lacks taste. It writes at length when it should be brief. It reaches for dramatic analogies when it should be dry. Teams often assume fine-tuning makes the model “smarter.” It does not. The base model already knows enough. SFT teaches it which capabilities to deploy and which to suppress.

Freedom from AI-isms: Fine-Tuned SFT 4.33, DeepSeek Base 4.10

Industry Wit: DeepSeek Base 3.90, Fine-Tuned SFT 3.57, Human Original 3.37

When a higher score is worse

The wit metric illustrates something counterintuitive. The base model scored 3.90, higher than both the SFT model (3.57) and the human (3.37). Reading the raw outputs explains why. The base model reaches for dramatic metaphors in nearly every response: “AI is the new gravity,” “this is a simulation lab for the modern workforce.” It is performing wit rather than exercising it. The SFT model uses analogies sparingly, the way someone who actually works in the industry would. A real person does not try to coin a phrase in every response.

Prompt engineering cannot reliably calibrate this. You can write “be witty but not too witty” and the model will interpret it differently each time. SFT gives the model hundreds of examples of what the right amount looks like. It learns the boundary by seeing where the target stops.

Validation Loss

[Chart: validation loss during training, dropping steeply before settling into a narrow band.]

What the loss curve tells us

The validation loss drops from 2.5 to under 0.5 in the first 100 iterations, then oscillates in a narrow band. The steep initial drop is the model learning macro-level constraints: stop padding, stop hedging, stop restating the question. These patterns are consistent across the training data, so they are learned fast. The plateau is where harder, more contextual behaviors emerge: when to be brief versus elaborate, when to deploy an analogy versus state facts plainly.

Practical implication: you can get a usable model in under two hours. Getting from usable to polished takes the remaining four. For internal email drafts that a human reviews, the two-hour model is sufficient. For autonomous outbound at scale, the extra training time pays for itself.
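A plateau check along these lines can decide when the extra polish is no longer worth the compute. The loss values and thresholds below are made up, shaped like the curve described above:

```python
# Illustrative early-stopping check: stop once validation loss has
# settled into its band. Values mimic the described curve: steep drop,
# then oscillation in a narrow range.
losses = [2.5, 1.4, 0.8, 0.55, 0.48, 0.46, 0.47, 0.45, 0.46, 0.45]

def plateaued(history, window=4, tolerance=0.05):
    """True when the best loss hasn't improved by more than
    `tolerance` over the last `window` evaluations."""
    if len(history) <= window:
        return False
    best_before = min(history[:-window])
    best_recent = min(history[-window:])
    return best_before - best_recent <= tolerance

print(plateaued(losses))  # the curve has settled into its band
```

For the human-reviewed use case, you might stop at the first plateau; for autonomous outbound, keep training through it, since the harder contextual behaviors are still improving even while the loss barely moves.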

05 What Generic Output Looks Like vs. Fine-Tuned Output

Same input to both models. The highlighted phrases in the base model output are patterns that buyers and spam filters recognize immediately.

Prompt: a quoted tweet noting engineering job openings at a three-year high, discussing the paradox of AI increasing demand.

DeepSeek-R1 Base

“Engineering job openings are at the highest levels we've seen in over 3 years—67,000 globally, 26,000 in the U.S. alone. This isn't just a blip; it's a signal. AI is the new gravity pulling more engineers into the orbit of tech. We're seeing a shift as significant as when cloud computing reshaped IT—only this time, it's AI rewriting the rules of software development. The future of work is being rewritten by agents, not just coders.”

Fine-Tuned SFT

“We don't know if there would have been more open roles if not for AI. But what we do know is that we're in a period of massive demand for talent unlike any we've seen, driven by rapid technology adoption. This is the demand spike you'd expect when a new technology agent like AI agents are being adopted at scale, and companies finally start to fully integrate them into their workflows.”

Prompt: a tweet about a PC game that simulates building and managing a data center, suggesting hyperscalers should learn from this “edutainment.”

DeepSeek-R1 Base

“Data Center sim is lowkey genius—because it's not just a game, it's a simulation lab for the modern workforce. Imagine if every engineer could tinker with server racks, cooling systems, and load balancing without breaking budgets or uptime. This is the kind of edutainment that turns abstract cloud concepts into visceral experiences. Hyperscalers take note: the future of learning isn't in textbooks, it's in immersive environments like this. The next generation of infrastructure pros are already playing it.”

Fine-Tuned SFT

“The best way to understand something is to build it. The ability to play with real concepts in a safe space is why we love games. This concept of ‘edutainment’ is one of the most powerful ways to learn—and it's been missing from our most critical fields of work.”

The base model manufactures significance. It parrots slang from the prompt (“lowkey genius”), stacks dramatic fragments, tells the reader how important the moment is. The SFT model makes an observation, provides context, and stops.

There is a subtlety in the second example. The base model mirrors the prompt's language. The SFT model takes the concept and reframes it in its own words. One echoes. The other has a point of view. In outbound email, the difference between these two modes is the difference between a delete and a reply.

06 When SFT Makes Sense and When It Does Not

SFT is not always the right tool. It has a specific cost structure: a curated dataset (hundreds of examples minimum), a few hours of compute, and someone who can evaluate whether the output is better. For some use cases this investment pays off immediately. For others, prompt engineering is sufficient.

SFT works well when the target voice is distinct and consistent, the output volume is high enough to justify training costs, and the audience can detect generic AI. Executive-level outbound, investor communications, brand-voice content at scale. These are cases where “sounds like AI” versus “sounds like our VP of Sales” directly affects pipeline.

SFT is overkill when the content is internal and low-stakes, when a human will heavily edit every output, or when the target voice is not meaningfully different from standard professional prose. If your CEO writes in the same style as a ChatGPT default, fine-tuning will not produce a noticeable improvement. SFT captures distinctiveness. If there is no distinctiveness to capture, you are training on noise.

There is also a data quality threshold. Fifty carefully selected examples from a person's best writing will outperform three hundred examples that include off-brand or low-effort outputs. SFT on poor examples produces a model that confidently reproduces bad habits. Curating the dataset is the most labor-intensive part of the process, and the part that determines the ceiling of the results.
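A curation pass might look like the sketch below. The heuristics (no hashtags, a minimum length) are assumptions for illustration, not the authors' actual filter; in practice the expensive part is human judgment about what counts as the person's best writing.

```python
# Illustrative curation filter: drop examples unlikely to represent the
# target's best writing before they can teach the model bad habits.
def keep(example: dict) -> bool:
    text = example["completion"]
    words = text.split()
    if not words:
        return False
    if "#" in text or len(words) < 3:   # off-brand: hashtags, stubs
        return False
    return True

raw = [
    {"completion": "Compression is the product."},
    {"completion": "#winning"},
    {"completion": "ok"},
    {"completion": "The best way to understand something is to build it."},
]
curated = [ex for ex in raw if keep(ex)]
print(len(raw), "->", len(curated))
```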

The teams we see succeeding with this approach share a pattern with the workforce contraction data we published recently: they are not hiring more people to write more emails. They are investing in tooling that lets fewer people write better ones. It also tracks with the selectivity we see in the funding data: winners get backed faster, and capital efficiency is now a precondition for the next round, not a nice-to-have. The category-level Series A data shows the same selectivity sharpening by sector.

07 Implications for Sales Teams

01 Generic sequences are depreciating
Merging a prospect's recent funding round into a template is data insertion, not personalization. Buyers recognize the pattern. Fine-tuned models produce contextual outreach that matches the tone and brevity of a senior operator, which is harder to dismiss because it reads like a person wrote it.
02 Top-performer knowledge can be codified
Sales orgs run on a power law. A few reps close most of the revenue. SFT provides a mechanism to capture how those reps handle objections, frame ROI, and write follow-ups, then distribute that behavior across the team through internal agents.
03 Local training keeps proprietary data private
Sales playbooks, competitive battlecards, closed-won email threads. These are among a company's most valuable assets. Training locally on Apple Silicon (or in a single-tenant environment) means none of this data passes through a third-party API.
04 SFT is onboarding. DPO is coaching.
SFT shows the model what good looks like. Direct Preference Optimization (DPO) is the ongoing feedback loop. When an AE reviews an AI draft, deletes a generic opener, and rewrites the CTA, that edit becomes training data. The rejected version and the chosen version feed back into the model. It improves continuously.
05 AI-isms are the new spam triggers
Words like “delve,” “landscape,” “testament,” and “synergistic” are being filtered the same way “FREE MONEY” was filtered in 2005. Email deliverability filters and human pattern recognition are both learning to route un-tuned AI output to trash. Suppressing these markers through fine-tuning is a deliverability requirement, not a style preference.
06 The moat is the training data, not the model
Open-weight models from DeepSeek, Meta, and others are freely available and capable. The competitive advantage is not which model you use. It is the quality and volume of the proprietary examples you have to fine-tune it with: your best reps' emails, your executives' investor updates, your winning deal threads.
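The SFT-to-DPO handoff described in point 04 amounts to turning every human edit into a preference pair. A sketch, using the prompt/chosen/rejected schema common in DPO tooling; the field names and sample text are illustrative, not a documented spec from this post.

```python
# Turn an AE's rewrite of an AI draft into a DPO preference record.
def edit_to_preference(prompt: str, ai_draft: str, human_rewrite: str) -> dict:
    return {
        "prompt": prompt,
        "chosen": human_rewrite,   # what the rep actually sent
        "rejected": ai_draft,      # the generic draft they discarded
    }

pair = edit_to_preference(
    prompt="Follow up after the demo with ACME Corp.",
    ai_draft="I hope this email finds you well! Just circling back...",
    human_rewrite="Two things stood out in Thursday's demo. Here's how we'd handle both.",
)
print(pair["chosen"])
```

Accumulate these pairs from daily review workflows and the model gets a steady stream of coaching data without anyone labeling examples by hand.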

08 Methodology

Base model: DeepSeek-R1-Distill-Qwen-14B-4bit. Fine-tuning: LoRA via MLX on Apple Silicon (unified memory). Training set: Several hundred quoted-tweet / response pairs from a single author. Evaluation: 30 scenarios scored 1–5 across six dimensions by LLM-as-judge pipeline. “Freedom from AI-isms” scored against open-source Stop Slop rubric. All training and inference ran locally. No data was sent to external APIs during training.