AI Safety, Alignment & Responsible AI

Essential reading on AI alignment, fairness, interpretability and responsible development.

14 items

Alignment Research Fieldguide

Comprehensive safety overview

www.alignment-research.org

Superhuman Review of 200k ChatGPT Evaluations

RLHF methodology

arxiv.org

Constitutional AI: Harmlessness from AI Feedback

Alignment from AI feedback (RLAIF)

arxiv.org

Evaluating and Mitigating Gender Bias in Language Models

Fairness evaluation

arxiv.org

Scaling Language Models: Methods, Analysis & Insights

Scaling considerations

jmlr.org

The Alignment Problem: Machine Learning and Human Values

Brian Christian's book-length framing of the field

www.alignmentbook.com

Interpretability and Explainability in AI

Anthropic's interpretability research

www.anthropic.com

Factuality in Large Language Models

Hallucination and truthfulness

arxiv.org

Emergent Deception and Emergent Honesty

Emergent behavioral risks

arxiv.org

A Comprehensive Survey on Safety Evaluation

Safety benchmarks

arxiv.org

AI Safety Research Landscape

Map of active research areas

aisafety.world

Concrete Problems in AI Safety

Foundational AI safety paper by Amodei et al.

arxiv.org

Center for AI Safety

Research nonprofit focused on reducing societal-scale AI risks

www.safe.ai

NIST AI Risk Management Framework

U.S. government framework for managing AI risks

www.nist.gov