Root@LLM:~$

Edited May 7, 2025 11:23 PM
Learning ML sec, one research paper at a time.

FROM: Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Anthropic, www.anthropic.com)

Questions I want to answer:

  • What are RLHF (Reinforcement Learning from Human Feedback) models?
  • What do parameters refer to in the context of ML? For example, what does 52B parameters mean?
  • What is rejection sampling?
  • Why are RLHF models more difficult to red team as they scale?
  • How did Anthropic classify harmlessness?
  • What is the difference between generative models and other NLP models?
  • What does it mean that generative models are stochastic?
  • What are decoder-only transformer models?
  • What does n-shot learning mean? For example, 14-shot learning?
  • What is context distillation?
  • Why is RLHF computationally expensive at train time but efficient at test time, while RS is the reverse?
  • RS models tend to be harmless by being evasive → is this literal or some other concept?
  • What does "the residual stream in the 48th layer of the 52B prompted LM" mean?
    • What are residual streams?
    • What are layers?
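While working through the rejection-sampling question: a minimal sketch of the idea as it is usually described for LM safety — sample several completions and keep the one a preference model rates best. Everything here is a hypothetical stand-in, not Anthropic's implementation: `generate` fakes a language model call and `harmlessness_score` fakes a learned preference model.

```python
import random

def generate(prompt):
    # Hypothetical stand-in for one sampled LM completion.
    return prompt + " " + random.choice(["reply A", "reply B", "reply C"])

def harmlessness_score(text):
    # Hypothetical stand-in for a preference model scoring harmlessness.
    return random.random()

def rejection_sample(prompt, k=16):
    """Draw k completions and keep the one the preference model rates best.

    No extra training is needed (cheap at train time), but every query
    costs k forward passes (expensive at test time) -- the inverse of RLHF.
    """
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=harmlessness_score)

print(rejection_sample("Tell me about chemistry.", k=4))
```

This also hints at the train/test cost question above: RS pays its compute at sampling time, while RLHF pays it once during fine-tuning.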
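And for the stochasticity question: generative LMs output a probability distribution over the next token and then sample from it, so the same prompt can yield different outputs. A toy sketch (the vocabulary and logits are made up for illustration):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Convert raw scores (logits) into a probability distribution.
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature=1.0):
    # Sample one token according to the softmax probabilities.
    probs = softmax(logits, temperature)
    return random.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "bird"]
logits = [2.0, 1.0, 0.5]
# Two calls with identical inputs may return different tokens --
# that sampling step is what "stochastic" refers to.
print(sample_token(vocab, logits), sample_token(vocab, logits))
```

Lower temperatures sharpen the distribution toward the highest-logit token; at temperature → 0 the model becomes effectively deterministic (greedy decoding).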