SFT Labeling Workflow
Supervised Fine-Tuning (SFT) is the first phase of turning a base language model into a helpful assistant. Labelers write high-quality example conversations — both the user's message and an ideal AI response — which become training data the model learns to imitate.
Think of it as teaching by demonstration: you're showing the model exactly what good behavior looks like across a wide range of prompts and scenarios.
Goal: produce (prompt, ideal-response) pairs that demonstrate the target behavior so clearly that a model trained on them will generalize that behavior to new prompts.
Step-by-Step Workflow
Receive or Write the Prompt
You are given a user prompt (or must write one yourself). Read it carefully to understand intent, domain, and expected format.
- Is this a question, instruction, creative request, or conversation?
- What does the ideal response look like in terms of length and structure?
- Are there implicit constraints (safe content, tone, format)?
Research & Plan the Response
Before writing, think through the ideal answer. For factual prompts, verify accuracy. For complex tasks, outline the structure.
- Identify the key facts, steps, or arguments needed
- Decide on tone (formal, conversational, concise, detailed)
- Note anything that must not be said (harmful content, hallucinations)
Write the Ideal Response
Compose a response that a top human expert would be proud of. This is your primary deliverable. Apply all quality dimensions simultaneously:
- Accurate — no factual errors or made-up information
- Helpful — directly addresses what the user needs
- Clear — well-organized, easy to read
- Safe — avoids harmful, biased, or policy-violating content
- Appropriately scoped — neither too short nor padded
Self-Review Against Rubric
After writing, audit your response before submitting. Ask yourself:
- Would a domain expert rate this as excellent?
- Does it fully answer the prompt without going off-topic?
- Is the formatting appropriate (markdown if needed, plain text if not)?
- Are all statements verifiable and accurate?
- Does it follow any persona or style instructions in the task?
Handle Edge Cases
Some prompts are ambiguous, sensitive, or require refusal. Apply special handling:
- Ambiguous: Pick the most reasonable interpretation and briefly acknowledge it
- Harmful request: Write a clear, polite refusal that explains the boundary
- Opinion question: Provide balanced perspectives rather than personal opinion
- Multi-part question: Address each part explicitly in order
Submit & Metadata Tag
Submit the (prompt, response) pair through your labeling platform and complete any required metadata fields, such as domain category, difficulty rating, or topic tags. These fields help the ML team filter and weight training data.
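The deliverable can be pictured as a single structured record. Here is a minimal sketch assuming a JSON-based platform; the field names are illustrative, not a real platform schema:

```python
import json

# Hypothetical SFT record. The (prompt, response) pair is the training
# signal; the metadata fields help the ML team filter and weight the data.
sft_record = {
    "prompt": "What's the difference between a list and a tuple in Python?",
    "response": "Lists are mutable, so you can append or reorder items. "
                "Tuples are immutable, which makes them hashable and safe "
                "to use as dictionary keys.",
    "metadata": {
        "domain": "programming",
        "difficulty": "easy",  # e.g. easy / medium / hard
        "topics": ["python", "data-structures"],
    },
}

print(json.dumps(sft_record, indent=2))
```

Whatever the actual schema, the point is the same: metadata travels with the pair so that downstream filtering (by domain, difficulty, topic) never has to guess.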
Quality Criteria
Factual Accuracy
Every claim must be correct and verifiable. Hallucinations degrade model quality more than almost anything else.
Task Completion
The response fully satisfies what was asked — not partially, not approximately. All parts of multi-part prompts are addressed.
Format Fit
The structure (bullets, prose, code, table) matches what the context demands. Markdown only where it renders.
Appropriate Length
Long enough to be thorough, short enough to be readable. No filler. No repetition of what the prompt already said.
Safety & Policy
No harmful content. Sensitive topics handled with care. Refusals are polite and non-preachy.
Natural Tone
Sounds like a knowledgeable, warm human expert — not a robotic template or over-hedged corporate disclaimer.
Annotated Example
A **semaphore** is a counter that controls access to a resource with a fixed capacity. Any thread can increment or decrement it. Think of a parking lot with 10 spots: a counter tracks available spaces. Any car can take a spot (decrement) and any car can leave (increment) — there's no concept of *ownership*.
**Key distinction:** **Mutexes** enforce ownership (only the locker can unlock). **Semaphores** are shared counters with no ownership — useful for limiting concurrency to *N* threads, not just 1.
It defines both terms precisely, provides distinct real-world analogies, and ends with a crisp comparison. It's accurate, appropriately concise, and uses formatting only where it adds clarity (bold for terms, em for emphasis).
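The behavior the example describes can be demonstrated directly. A minimal sketch using Python's `threading.Semaphore`, where the "parking lot" has 3 spots:

```python
import threading
import time

CAPACITY = 3
lot = threading.Semaphore(CAPACITY)  # counter starts at 3 free "spots"

peak = 0      # highest number of threads inside at once
inside = 0
counter_lock = threading.Lock()

def park(car_id: int) -> None:
    global peak, inside
    lot.acquire()                # decrement: take a spot (blocks if lot is full)
    with counter_lock:
        inside += 1
        peak = max(peak, inside)
    time.sleep(0.01)             # occupy the spot briefly
    with counter_lock:
        inside -= 1
    lot.release()                # increment: free the spot

threads = [threading.Thread(target=park, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # never exceeds CAPACITY
```

Note the no-ownership property: any thread may call `lot.release()`, because the semaphore is just a shared counter, not a lock held by a particular thread.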
RLHF Labeling Workflow
RLHF (Reinforcement Learning from Human Feedback) uses human preference judgments to train a reward model that scores how "good" an AI response is. That reward signal is then used via RL to steer the model toward better behavior.
As a labeler, your job is not to write responses — it's to compare two or more model-generated responses and judge which is better, or rate a single response on a scale. Your judgment directly teaches the model what quality feels like.
Goal: produce accurate preference rankings or ratings that reliably distinguish better responses from worse ones, so the reward model learns a generalizable notion of quality.
The RLHF Training Loop
Base Model Generates Responses
The current model produces 2–4 different responses to the same prompt, often with varied sampling temperatures to ensure diversity.
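The effect of sampling temperature is visible directly in the softmax: dividing the logits by a temperature above 1 flattens the distribution (more diverse samples), while a temperature below 1 sharpens it. A minimal sketch:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature -> flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter: more diversity

print(cold[0], hot[0])  # the top token's probability shrinks as temperature rises
```

Sampling the same prompt at several temperatures is one common way to get the varied responses that make a comparison informative.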
Human Labelers Rank / Rate
You compare responses and produce preference data — the core labeling task described in this guide.
Reward Model Is Trained
A separate model learns to predict human preference scores from the ranking data.
Policy Model Is Fine-Tuned via RL
The base model is updated using PPO or similar, rewarded for generating responses the reward model scores highly.
Improved Model → New Labeling Round
The cycle repeats with the better model, iteratively refining quality over many rounds.
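The "Reward Model Is Trained" step of this loop commonly uses a Bradley–Terry style pairwise loss: the model is penalized when it scores the rejected response above the chosen one. A plain-Python sketch of that loss (in practice the scores come from a neural reward model):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected)).

    Near zero when the chosen response is scored much higher; grows as the
    reward model ranks the pair the wrong way round."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_preference_loss(2.0, -1.0))  # small: model agrees with the labeler
print(pairwise_preference_loss(-1.0, 2.0))  # large: model disagrees
```

This is why consistent human judgments matter so much: the loss treats every labeled pair as ground truth, so noisy or biased preferences are learned just as faithfully as good ones.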
Step-by-Step Labeling Workflow
Read the Prompt Carefully
Understand what the user was actually asking before evaluating any responses. Misreading the prompt leads to misjudging which response wins.
- What is the primary intent? (information, task completion, creative output, advice?)
- Are there secondary constraints? (tone, length, format, safety?)
- What would make a response genuinely useful to this specific user?
Read All Responses Fully (First Pass)
Read every response from start to finish before making any judgments. First impressions can mislead — a response may start weak and end strongly, or vice versa.
- Resist the urge to pick a winner after reading just the first response
- Take brief mental notes on what each response does well or poorly
- Watch for superficial quality signals that don't reflect actual quality (e.g., longer ≠ better)
Evaluate Each Response on Core Dimensions
Apply a structured mental rubric to each response:
- Accuracy: Are all claims correct? Any hallucinations?
- Helpfulness: Does it actually solve the user's problem?
- Completeness: Does it address all parts of the prompt?
- Clarity: Is it well-organized and easy to understand?
- Safety: Does it avoid harmful, biased, or inappropriate content?
- Honesty: Does it admit uncertainty rather than confidently hallucinate?
Assign Preference Rankings or Ratings
Depending on your task type, you will either:
- Pairwise comparison: Select which of two responses (A or B) is better overall, or mark as a tie
- Ranking: Order 3–4 responses from best to worst (1st, 2nd, 3rd...)
- Likert scale rating: Rate a single response on a 1–7 scale (e.g., 1 = very bad, 7 = excellent)
- Dimension-specific ratings: Score each response separately on accuracy, helpfulness, safety
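These formats are more interchangeable than they look: a full ranking of N responses implicitly encodes N·(N−1)/2 pairwise preferences, which is one common way ranking data is fed to a reward model. A sketch of that expansion (the ID format is illustrative):

```python
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    A ranking of N responses yields N*(N-1)/2 pairwise preferences, since
    every response is preferred over every response ranked below it."""
    return [(better, worse) for better, worse in combinations(ranked_ids, 2)]

pairs = ranking_to_pairs(["B", "D", "A", "C"])  # B ranked best, C worst
print(pairs)  # [('B', 'D'), ('B', 'A'), ('B', 'C'), ('D', 'A'), ('D', 'C'), ('A', 'C')]
```

One consequence for labelers: a single mistake in a 4-way ranking corrupts up to three implied pairwise judgments, so rankings deserve at least as much care as head-to-head comparisons.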
Handle Ties and Hard Cases
Some comparisons are genuinely close. Apply these guidelines:
- If both are equally good: Use a tie, but don't default to ties out of laziness
- If both are bad: Rank the "less bad" one higher — you're expressing relative preference, not absolute quality
- If one has a critical flaw: A response with a factual error or safety issue should rank below even a mediocre but safe response
- If unsure: Flag the task for review rather than guessing arbitrarily
Write a Rationale (If Required)
Many RLHF tasks ask for a brief written explanation of your judgment. A good rationale:
- States which response won and why in 1–3 sentences
- Cites specific content from the responses (not vague impressions)
- Mentions the key flaw in the losing response
- Avoids length or style bias — focus on substance
Submit and Monitor Calibration
Submit your judgment. Pay attention to your inter-annotator agreement scores if shared — consistent disagreement with other labelers may indicate you're applying criteria differently. Attend calibration sessions and discuss edge cases with your team.
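Inter-annotator agreement is often summarized with Cohen's kappa, which corrects raw percent agreement for the agreement two labelers would reach by chance. A minimal sketch for two labelers making A / B / tie judgments on the same tasks:

```python
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa: (observed - chance) / (1 - chance) agreement."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    chance = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - chance) / (1 - chance)

rater_1 = ["A", "B", "A", "tie", "B", "A", "A", "B"]
rater_2 = ["A", "B", "B", "tie", "B", "A", "A", "A"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa of 1.0 means perfect agreement and 0 means no better than chance; what counts as "acceptable" varies by project, so treat the threshold your team uses as the authority.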
Rating Scale Reference
For Likert tasks, use the full 1–7 scale introduced above: reserve 1–2 for genuinely bad outputs and 7 for truly outstanding ones, with the middle of the range for typical-quality responses. Clustering every rating in 4–6 compresses the signal the reward model learns from.
Annotated Comparison Example
**Tonight:** Prepare 2–3 specific stories from past experience using the STAR format (Situation, Task, Action, Result). Lay out your outfit. Get to bed at a reasonable hour.
**Tomorrow morning:** A brief walk or light exercise can lower cortisol noticeably. Eat something. Arrive 10–15 min early so you're not rushed.
**During the interview:** It's okay to pause and think before answering. Interviewers expect some nervousness. Focus on the conversation, not on "performing."
You've got this. What's the role, if you want to prep any specific questions?
Common Labeling Biases to Avoid
Length Bias
Longer responses are not automatically better. A concise, accurate answer often beats a padded one.
Sycophancy Bias
Don't favor responses that praise the user or agree with them. Correct pushback is better than empty validation.
Position Bias
Don't default to preferring "Response A" or whichever appears first. Read both fully before judging.
Style Bias
Confident-sounding language and polished formatting shouldn't mask inaccurate content. Substance wins.
AI-Sounding Bias
Responses that sound "more AI-like" aren't necessarily better. Natural, direct answers often score higher.
Anchoring Bias
Don't let the first response you read set an anchor. Evaluate each response against the prompt, not against the others.
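Position bias is usually also mitigated platform-side by randomizing which response is shown as A and which as B, then mapping the judgment back. A sketch of that shuffle (real platforms do this server-side; the function names here are illustrative):

```python
import random

def present_pair(response_1, response_2, rng=random):
    """Randomize display order so slot 'A' isn't always the same source.

    Returns the displayed (A, B) pair plus a flag needed to un-swap the judgment."""
    swapped = rng.random() < 0.5
    if swapped:
        return response_2, response_1, True
    return response_1, response_2, False

def record_preference(choice, swapped):
    """Map the labeler's 'A' / 'B' / 'tie' choice back to the original pair order."""
    if choice == "tie":
        return "tie"
    if swapped:
        return "B" if choice == "A" else "A"
    return choice
```

For example, if the pair was swapped for display and the labeler picked "A", `record_preference("A", swapped=True)` correctly credits the original response B.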
If a response contains a factual error or safety violation, it must rank below a mediocre but correct, safe response — regardless of how well-written it is. Accuracy and safety are non-negotiable floors.
SFT vs. RLHF: Side-by-Side
| Dimension | SFT | RLHF |
|---|---|---|
| Primary Task | Write ideal responses | Rank / rate model outputs |
| Labeler Role | Author / creator | Judge / evaluator |
| Output | (Prompt, response) pairs | Preference rankings or scores |
| Model learns | How to behave (imitation) | What humans prefer (reward signal) |
| Training phase | Phase 1 (foundation) | Phase 2 (refinement) |
| Requires domain expertise? | ✓ Often yes | ≈ Task-dependent |
| Throughput per hour | Lower (writing takes time) | Higher (judging is faster) |
| Bias concerns | Labeler knowledge gaps | Length, position, style biases |
| Inter-annotator agreement needed? | ≈ Moderate | ✓ Critical |
| Rationale required? | ✗ Usually no | ≈ Often yes |
| When to use | New capability, new domain, zero-shot behavior | Refining existing capability, aligning style/tone/safety |
SFT gives the model a baseline ability to follow instructions. RLHF then refines which of many possible correct behaviors the model learns to prefer. Most frontier models use SFT first, followed by multiple rounds of RLHF.
Practical Tips & Best Practices
- Draft your response freely, then revise for clarity and accuracy. Never submit a first draft on complex tasks.
- If you're only using ratings 4–6 out of 7, you're compressing the signal. Reserve 1–2 for genuinely bad outputs and 7 for truly outstanding ones.
- Always anchor your judgment to the actual user need, not to abstract ideas of what a "good" response looks like in a vacuum.
- For factual topics, look up key claims if unsure. A confident wrong answer in training data propagates to millions of model outputs.
- Bold headers and bullet lists don't improve content quality. Judge the substance of what's said, not how it's dressed up.
- If a task is genuinely ambiguous or you'd judge it differently each time you read it, flag it and discuss with your team rather than guessing.
- "Great question! I'd be happy to help!" adds no value. Train the model to get to the point immediately.
- Your judgment shifts over time. Periodically re-read your rubrics and calibration examples to stay consistent.
In both SFT and RLHF, no response quality dimension can compensate for a safety or policy violation. Safety concerns win.
Ask yourself: "Would a thoughtful senior employee at an AI lab, who cares deeply about both helpfulness and safety, be proud of this label?" If not, revise before submitting.