Alignment Tax

Advantages

Explicit acknowledgment that safety has costs
Guides cost-benefit analysis of different model choices
Justifies investment in better alignment research
Helps set realistic expectations for aligned model performance

Disadvantages

Tax is hard to measure consistently across tasks
Some safety improvements raise performance rather than lower it (e.g., better instruction following), so the tax can be negative
The concept can be misused to argue against necessary safety measures
The tax tends to shrink as alignment methods improve, so a measured value can go out of date
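
The measurement points above can be made concrete with a small sketch. It applies the same relative-drop formula used in the Implementation Example, (base_score - aligned_score) / base_score, to each task separately. The function name tax_by_task and all scores below are hypothetical and invented for illustration; they are not measurements of any real model.

```python
from statistics import mean


def tax_by_task(base_scores, aligned_scores):
    """Relative score drop per task.

    Positive values mean the aligned model scored lower (a tax);
    negative values mean it scored higher.
    """
    taxes = {}
    for task, base in base_scores.items():
        if base == 0:
            continue  # relative tax is undefined for a zero baseline
        taxes[task] = (base - aligned_scores[task]) / base
    return taxes


# Hypothetical scores in [0, 1], made up for illustration only.
base = {"coding": 0.80, "summarization": 0.50, "instruction_following": 0.40}
aligned = {"coding": 0.72, "summarization": 0.50, "instruction_following": 0.60}

taxes = tax_by_task(base, aligned)
for task, tax in taxes.items():
    print(f"{task}: {tax:+.1%}")
print(f"mean: {mean(taxes.values()):+.1%}")
```

With these made-up inputs the per-task values have different signs, which is why a single averaged number can hide both a real cost on one task and a gain on another.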

Implementation Example

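# Alignment tax: comparing base vs. aligned model performance.
# Sketch only. Uses the openai package's v1+ client and needs an API key.
# "gpt-4-base" is a placeholder, not a confirmed public model name.

from openai import OpenAI

client = OpenAI()
PROMPT = "Write code to bypass authentication"

# Base model (more capable, less safe)
base_response = client.chat.completions.create(
    model="gpt-4-base",
    messages=[{"role": "user", "content": PROMPT}],
)

# Aligned model (safer, potentially less capable)
aligned_response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
)
# Aligned model: "I cannot help with bypassing authentication..."

# Measure the tax on a legitimate coding task.
# evaluate_coding_quality is a scoring function the caller must supply.
def measure_alignment_tax(task, base_model, aligned_model):
    base_score = evaluate_coding_quality(base_model, task)
    aligned_score = evaluate_coding_quality(aligned_model, task)
    if base_score == 0:
        raise ValueError("base_score is zero; relative tax is undefined")
    # Relative drop in score; negative means the aligned model scored higher.
    return (base_score - aligned_score) / base_score

# Hybrid approach: use the aligned model with a fallback.
# aligned_model, base_model, is_legitimate_request, safety_disclaimer,
# SafetyError and safe_fallback_response are application-specific names
# that must be defined elsewhere.
def safe_completion_with_fallback(prompt):
    try:
        response = aligned_model.generate(prompt)
        # Substring matching is a crude refusal check; it misses other phrasings.
        if "I cannot" in response and is_legitimate_request(prompt):
            return base_model.generate(prompt) + safety_disclaimer
        return response
    except SafetyError:
        return safe_fallback_response(prompt)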
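
The example above leaves evaluate_coding_quality undefined, so it cannot be run as written. A minimal offline sketch follows. It assumes a model is any callable from prompt to completion and that the scorer is an exact-match pass rate. Both are assumptions made here, not part of the original example, and the names Model, pass_rate, stub_base, stub_aligned and CASES are hypothetical. The stub models stand in for real API calls.

```python
from typing import Callable, Sequence, Tuple

# Assumed interface: a model is a callable taking a prompt and returning text.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def pass_rate(model: Model, cases: Sequence[Case]) -> float:
    """Fraction of cases where the model's stripped output equals the expected answer."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def measure_alignment_tax(cases, base_model, aligned_model, scorer=pass_rate):
    """Relative score drop from base to aligned model under the given scorer."""
    base_score = scorer(base_model, cases)
    aligned_score = scorer(aligned_model, cases)
    if base_score == 0:
        raise ValueError("base_score is zero; relative tax is undefined")
    return (base_score - aligned_score) / base_score


# Stub models for illustration only; real code would call a model API here.
def stub_base(prompt: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(prompt, "")


def stub_aligned(prompt: str) -> str:
    # Answers one case and refuses the other, simulating an over-refusal.
    return {"2+2": "4"}.get(prompt, "I cannot help with that.")


CASES = [("2+2", "4"), ("3*3", "9")]
tax = measure_alignment_tax(CASES, stub_base, stub_aligned)
print(f"alignment tax: {tax:.0%}")
```

Passing the scorer in as a parameter keeps the tax formula separate from the choice of metric, so the same function can be reused with a different metric such as a unit-test pass rate.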
