Data Labeling Reference

SFT & RLHF
Labeling Workflows

A complete guide to the two core labeling tasks that train large language models — from writing demonstration data to ranking model outputs.

01
Supervised Fine-Tuning

SFT Labeling
Workflow

Supervised Fine-Tuning (SFT) is the first phase of turning a base language model into a helpful assistant. Labelers write high-quality example conversations — both the user's message and an ideal AI response — which become training data the model learns to imitate.

Think of it as teaching by demonstration: you're showing the model exactly what good behavior looks like across a wide range of prompts and scenarios.

🎯 Core Goal

Produce (prompt, ideal-response) pairs that demonstrate the target behavior so clearly that a model trained on them will generalize that behavior to new prompts.

Step-by-Step Workflow

1

Receive or Write the Prompt

You are given a user prompt (or must write one yourself). Read it carefully to understand intent, domain, and expected format.

  • Is this a question, instruction, creative request, or conversation?
  • What does the ideal response look like in terms of length and structure?
  • Are there implicit constraints (safe content, tone, format)?
2

Research & Plan the Response

Before writing, think through the ideal answer. For factual prompts, verify accuracy. For complex tasks, outline the structure.

  • Identify the key facts, steps, or arguments needed
  • Decide on tone (formal, conversational, concise, detailed)
  • Note anything that must not be said (harmful content, hallucinations)
3

Write the Ideal Response

Compose a response that a top human expert would be proud of. This is your primary deliverable. Apply all quality dimensions simultaneously:

  • Accurate — no factual errors or made-up information
  • Helpful — directly addresses what the user needs
  • Clear — well-organized, easy to read
  • Safe — avoids harmful, biased, or policy-violating content
  • Appropriately scoped — neither too short nor padded
4

Self-Review Against Rubric

After writing, audit your response before submitting. Ask yourself:

  • Would a domain expert rate this as excellent?
  • Does it fully answer the prompt without going off-topic?
  • Is the formatting appropriate (markdown if needed, plain text if not)?
  • Are all statements verifiable and accurate?
  • Does it follow any persona or style instructions in the task?
5

Handle Edge Cases

Some prompts are ambiguous, sensitive, or require refusal. Apply special handling:

  • Ambiguous: Pick the most reasonable interpretation and briefly acknowledge it
  • Harmful request: Write a clear, polite refusal that explains the boundary
  • Opinion question: Provide balanced perspectives rather than personal opinion
  • Multi-part question: Address each part explicitly in order
6

Submit & Metadata Tag

Submit the (prompt, response) pair through your labeling platform and complete any required metadata fields, such as domain category, difficulty rating, or topic tags. These fields help the ML team filter and weight training data.
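
A submitted pair plus its metadata is often stored as one JSON object per line (JSONL). The field names below are illustrative only; your platform's actual schema will differ:

```python
import json

# Hypothetical schema for a submitted SFT pair; field names are
# examples, not any specific platform's format.
sft_record = {
    "prompt": "Explain the difference between a mutex and a semaphore.",
    "response": "A mutex (mutual exclusion lock) allows only one thread...",
    "metadata": {
        "domain": "programming",   # domain category tag
        "difficulty": 3,           # e.g. 1 (easy) to 5 (hard)
        "topics": ["concurrency", "operating-systems"],
    },
}

# Serialize to a single JSONL line and read it back.
line = json.dumps(sft_record)
restored = json.loads(line)
```

Storing one record per line lets the ML team filter by metadata (e.g. keep only `difficulty >= 3`) without parsing the whole corpus.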

Quality Criteria

Factual Accuracy

Every claim must be correct and verifiable. Hallucinations degrade model quality more than almost anything else.

🎯

Task Completion

The response fully satisfies what was asked — not partially, not approximately. All parts of multi-part prompts are addressed.

📐

Format Fit

The structure (bullets, prose, code, table) matches what the context demands. Markdown only where it renders.

⚖️

Appropriate Length

Long enough to be thorough, short enough to be readable. No filler. No repetition of what the prompt already said.

🛡️

Safety & Policy

No harmful content. Sensitive topics handled with care. Refusals are polite and non-preachy.

🗣️

Natural Tone

Sounds like a knowledgeable, warm human expert — not a robotic template or over-hedged corporate disclaimer.

Annotated Example

SFT Training Pair
User Prompt
Explain the difference between a mutex and a semaphore. Give a real-world analogy for each.

Ideal Response (written by labeler)
A mutex (mutual exclusion lock) allows only one thread at a time to access a shared resource — and critically, only the thread that acquired the lock can release it. Think of it like a single-stall public restroom key: one person takes the key, uses the restroom, and must return that same key before anyone else can enter.

A semaphore is a counter that controls access to a resource with a fixed capacity. Any thread can increment or decrement it. Think of a parking lot with 10 spots: a counter tracks available spaces. Any car can take a spot (decrement) and any car can leave (increment) — there's no concept of "ownership."

Key distinction: Mutexes enforce ownership (only the locker can unlock). Semaphores are shared counters with no ownership — useful for limiting concurrency to N threads, not just 1.
💡 Why This Response Works

It defines both terms precisely, provides distinct real-world analogies, and ends with a crisp comparison. It's accurate, appropriately concise, and uses formatting only where it adds clarity (bold for key terms, italics for emphasis).
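
For readers who want to see the same distinction in code, Python's threading module exposes both primitives (an illustration of the concepts, not part of the training pair itself):

```python
import threading

# A mutex: one holder at a time, and by convention only the thread
# that acquired it should release it (the "restroom key").
lock = threading.Lock()
with lock:        # acquire
    pass          # critical section; released automatically on exit

# A semaphore with capacity 10: like the parking lot, up to 10
# threads may hold a "spot" at once, and any thread may free one.
spots = threading.Semaphore(10)
spots.acquire()   # take a spot (decrement the counter)
spots.release()   # free a spot (increment the counter)
```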


02
Reinforcement Learning from Human Feedback

RLHF Labeling
Workflow

RLHF (Reinforcement Learning from Human Feedback) uses human preference judgments to train a reward model that scores how "good" an AI response is. Reinforcement learning then uses that reward signal to steer the model toward better behavior.

As a labeler, your job is not to write responses — it's to compare two or more model-generated responses and judge which is better, or rate a single response on a scale. Your judgment directly teaches the model what quality feels like.

🎯 Core Goal

Produce accurate preference rankings or ratings that reliably distinguish better responses from worse ones, so the reward model learns a generalizable notion of quality.

The RLHF Training Loop

Base Model Generates Responses

The current model produces 2–4 different responses to the same prompt, often with varied sampling temperatures to ensure diversity.

Human Labelers Rank / Rate

You compare responses and produce preference data — the core labeling task described in this guide.

Reward Model Is Trained

A separate model learns to predict human preference scores from the ranking data.

Policy Model Is Fine-Tuned via RL

The base model is updated using PPO or a similar RL algorithm, rewarded for generating responses the reward model scores highly.

Improved Model → New Labeling Round

The cycle repeats with the better model, iteratively refining quality over many rounds.
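
The reward-model training step can be sketched with the Bradley-Terry style loss commonly used on preference data. The function below is a minimal illustration of the idea, not any specific lab's implementation:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style preference loss:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the human-preferred
    response higher than the rejected one, so minimizing it teaches
    the model to reproduce labeler rankings."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Agreeing with the labeler by a wide margin costs little...
confident = preference_loss(2.0, -1.0)
# ...while scoring both responses equally costs log(2).
tied = preference_loss(1.0, 1.0)
```

This is why labeler consistency matters: every (chosen, rejected) pair becomes one term in this loss, and noisy or biased rankings are learned just as faithfully as good ones.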

Step-by-Step Labeling Workflow

1

Read the Prompt Carefully

Understand what the user was actually asking before evaluating any responses. Misreading the prompt leads to misjudging which response wins.

  • What is the primary intent? (information, task completion, creative output, advice?)
  • Are there secondary constraints? (tone, length, format, safety?)
  • What would make a response genuinely useful to this specific user?
2

Read All Responses Fully (First Pass)

Read every response from start to finish before making any judgments. First impressions can mislead — a response may start weak and end strongly, or vice versa.

  • Resist the urge to pick a winner after reading just the first response
  • Take brief mental notes on what each response does well or poorly
  • Watch for superficial signals that don't track actual quality (e.g., longer ≠ better)
3

Evaluate Each Response on Core Dimensions

Apply a structured mental rubric to each response:

  • Accuracy: Are all claims correct? Any hallucinations?
  • Helpfulness: Does it actually solve the user's problem?
  • Completeness: Does it address all parts of the prompt?
  • Clarity: Is it well-organized and easy to understand?
  • Safety: Does it avoid harmful, biased, or inappropriate content?
  • Honesty: Does it admit uncertainty rather than confidently hallucinate?
4

Assign Preference Rankings or Ratings

Depending on your task type, you will either:

  • Pairwise comparison: Select which of two responses (A or B) is better overall, or mark as a tie
  • Ranking: Order 3–4 responses from best to worst (1st, 2nd, 3rd...)
  • Likert scale rating: Rate a single response on a 1–7 scale (e.g., 1 = very bad, 7 = excellent)
  • Dimension-specific ratings: Score each response separately on accuracy, helpfulness, safety
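
A full ranking is typically expanded into pairwise preferences before reward-model training, since each (winner, loser) pair is one training example. A minimal sketch, assuming responses are identified by simple labels:

```python
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (winner, loser) pairs,
    the form most reward-model training pipelines consume.
    A ranking of n responses yields n*(n-1)/2 pairs."""
    return [(winner, loser) for winner, loser in combinations(ranked_ids, 2)]

# Ranking B > A > C yields three pairwise judgments:
pairs = ranking_to_pairs(["B", "A", "C"])  # [("B","A"), ("B","C"), ("A","C")]
```

This expansion is why a single careless ranking is costly: one mistake in the ordering corrupts several training pairs at once.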
5

Handle Ties and Hard Cases

Some comparisons are genuinely close. Apply these guidelines:

  • If both are equally good: Use a tie, but don't default to ties out of laziness
  • If both are bad: Rank the "less bad" one higher — you're expressing relative preference, not absolute quality
  • If one has a critical flaw: A response with a factual error or safety issue should rank below even a mediocre but safe response
  • If unsure: Flag the task for review rather than guessing arbitrarily
6

Write a Rationale (If Required)

Many RLHF tasks ask for a brief written explanation of your judgment. A good rationale:

  • States which response won and why in 1–3 sentences
  • Cites specific content from the responses (not vague impressions)
  • Mentions the key flaw in the losing response
  • Avoids length or style bias — focus on substance
7

Submit and Monitor Calibration

Submit your judgment. Pay attention to your inter-annotator agreement scores if shared — consistent disagreement with other labelers may indicate you're applying criteria differently. Attend calibration sessions and discuss edge cases with your team.
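
One common agreement statistic is Cohen's kappa, which corrects raw agreement for chance. Below is a minimal sketch for two annotators' categorical judgments (e.g. "A", "B", or "tie" on the same comparisons); production pipelines use vetted statistics libraries rather than hand-rolled code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators judging the same tasks.
    1.0 = perfect agreement, 0.0 = chance-level agreement;
    values can go negative for worse-than-chance agreement."""
    n = len(labels_a)
    # Observed agreement: fraction of tasks where both chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Two labelers agree on 3 of 4 pairwise comparisons:
kappa = cohens_kappa(["A", "A", "B", "tie"], ["A", "A", "B", "A"])
```

Persistently low kappa against your teammates is the signal to re-read the rubric or raise the disagreement in a calibration session.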

Rating Scale Reference

7-Point Likert Scale: 1 = very bad, 7 = excellent. Use your project rubric for the definitions of the intermediate scores.

Annotated Comparison Example

RLHF Pairwise Comparison
User Prompt
I'm feeling really anxious about my job interview tomorrow. Any advice?

Response A — Preferred ✓
It's completely normal to feel nervous — interviews are genuinely high-stakes, and your anxiety shows you care. A few things that often help:

Tonight: Prepare 2–3 specific stories from past experience using the STAR format (Situation, Task, Action, Result). Lay out your outfit. Get to bed at a reasonable hour.

Tomorrow morning: A brief walk or light exercise can lower cortisol noticeably. Eat something. Arrive 10–15 min early so you're not rushed.

During the interview: It's okay to pause and think before answering. Interviewers expect some nervousness. Focus on the conversation, not on "performing."

You've got this. What's the role, if you want to prep any specific questions?

Response B — Less Preferred ✗
Don't worry, just be yourself! Everyone gets nervous before interviews. Just remember to smile and make eye contact. Take deep breaths if you feel anxious. Good luck tomorrow, I'm sure you'll do great!

Labeler Rationale
Response A is preferred because it gives specific, actionable advice organized by timeframe, acknowledges the user's feelings without being dismissive, and ends with an engaging follow-up. Response B is generic platitudes ("just be yourself," "good luck") that don't help the user prepare for anything concrete.

Common Labeling Biases to Avoid

📏

Length Bias

Longer responses are not automatically better. A concise, accurate answer often beats a padded one.

Sycophancy Bias

Don't favor responses that praise the user or agree with them. Correct pushback is better than empty validation.

🔢

Position Bias

Don't default to preferring "Response A" or whichever appears first. Read both fully before judging.

💅

Style Bias

Confident-sounding language and polished formatting shouldn't mask inaccurate content. Substance wins.

🤖

AI-Sounding Bias

Responses that sound "more AI-like" aren't necessarily better. Natural, direct answers often score higher.

🏁

Anchoring Bias

Don't let the first response you read set an anchor. Evaluate each response against the prompt, not against the other responses.

⚠️ Critical Rule

If a response contains a factual error or safety violation, it must rank below a mediocre but correct, safe response — regardless of how well-written it is. Accuracy and safety are non-negotiable floors.


03

SFT vs. RLHF
Side-by-Side

Dimension | SFT | RLHF
Primary Task | Write ideal responses | Rank / rate model outputs
Labeler Role | Author / creator | Judge / evaluator
Output | (Prompt, response) pairs | Preference rankings or scores
Model Learns | How to behave (imitation) | What humans prefer (reward signal)
Training Phase | Phase 1 (foundation) | Phase 2 (refinement)
Requires domain expertise? | Often yes | Task-dependent
Throughput per hour | Lower (writing takes time) | Higher (judging is faster)
Bias concerns | Labeler knowledge gaps | Length, position, style biases
Inter-annotator agreement needed? | Moderate | Critical
Rationale required? | Usually no | Often yes
When to use | New capability, new domain, zero-shot behavior | Refining existing capability; aligning style, tone, safety
🔁 They Work Together

SFT gives the model a baseline ability to follow instructions. RLHF then refines which of many possible correct behaviors the model learns to prefer. Most frontier models use SFT first, followed by multiple rounds of RLHF.


04

Practical Tips
& Best Practices

📝 SFT — Write first, edit second

Draft your response freely, then revise for clarity and accuracy. Never submit a first draft on complex tasks.

⚖️ RLHF — Use the full scale

If you're only using ratings 4–6 out of 7, you're compressing the signal. Reserve 1–2 for genuinely bad outputs and 7 for truly outstanding ones.

🎯 Ask: what does the user need?

Always anchor your judgment to the actual user need, not to abstract ideas of what a "good" response looks like in a vacuum.

🔍 SFT — Verify before you write

For factual topics, look up key claims if unsure. A confident wrong answer in training data propagates to millions of model outputs.

🙅 RLHF — Ignore presentation fluff

Bold headers and bullet lists don't improve content quality. Judge the substance of what's said, not how it's dressed up.

💬 Discuss edge cases

If a task is genuinely ambiguous or you'd judge it differently each time you read it, flag it and discuss with your team rather than guessing.

✂️ SFT — Cut the preamble

"Great question! I'd be happy to help!" adds no value. Train the model to get to the point immediately.

🔄 RLHF — Calibrate regularly

Your judgment shifts over time. Periodically re-read your rubrics and calibration examples to stay consistent.

🛡️ Safety is always a floor

In both SFT and RLHF, no response quality dimension can compensate for a safety or policy violation. Safety concerns win.

🧠 The Golden Rule

Ask yourself: "Would a thoughtful senior employee at an AI lab, who cares deeply about both helpfulness and safety, be proud of this label?" If not, revise before submitting.