Learning ML sec, one research paper at a time.
FROM:
Questions I want to answer:
- What are RLHF (Reinforcement Learning from Human Feedback) models?
- What do "parameters" refer to in the context of ML? For example, what does 52B parameters mean? (see the parameter-count sketch after this list)
- What is rejection sampling? (see the rejection-sampling sketch after this list)
- Why are RLHF models more difficult to red-team as they scale?
- How did Anthropic classify harmlessness?
- What is the difference between generative models and other NLP models?
- What does it mean that generative models are stochastic?
- What are decoder-only transformer models?
- What does n-shot learning mean? For example, 14-shot learning? (see the few-shot prompting sketch after this list)
- What is context distillation?
- Why is RLHF computationally expensive at train time but efficient at test time, while RS is the reverse?
- RS models tend to be harmless by being evasive → is this meant literally, or is it some other concept?
- What does "the residual stream in the 48th layer of the 52B prompted LM" mean?
- What are residual streams? (see the residual-stream sketch after this list)
- What are layers?
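
Working notes toward the parameter question: as I understand it, "parameters" are just the learned numbers in the model (weight matrices and biases), so "52B parameters" means roughly 52 billion of them. A tiny sketch, assuming PyTorch is installed; the layer sizes here are arbitrary and have nothing to do with the 52B model in the paper:

```python
import torch.nn as nn

# A tiny two-layer network; every entry of every weight matrix and bias
# vector counts as one "parameter".
model = nn.Sequential(
    nn.Linear(512, 2048),  # 512*2048 weights + 2048 biases
    nn.ReLU(),
    nn.Linear(2048, 512),  # 2048*512 weights + 512 biases
)

total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # ~2.1M here; a 52B model has ~52,000,000,000
```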
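
Working notes toward the rejection-sampling question: my current understanding is that it means best-of-n sampling, i.e. draw several completions and keep the one a preference/harmlessness model scores highest, so the extra compute happens at sampling (test) time rather than at training time. A minimal sketch with made-up placeholder functions (`generate` and `score_harmlessness` are stand-ins, not anything from the paper):

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for drawing one completion from a language model."""
    return random.choice([
        "I can't help with that.",
        "Here is some general safety advice...",
        "Here is a detailed answer...",
    ])

def score_harmlessness(prompt: str, completion: str) -> float:
    """Placeholder for a preference model's harmlessness score (higher = safer)."""
    return random.random()

def rejection_sample(prompt: str, n: int = 16) -> str:
    """Best-of-n rejection sampling: draw n candidates, keep the top-scoring one.
    All the extra work happens at test time, which is (I think) why RS is cheap
    at train time but expensive at sampling time, the reverse of RLHF."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_harmlessness(prompt, c))

if __name__ == "__main__":
    print(rejection_sample("How do I stay safe online?"))
```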
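
Working notes toward the n-shot question: "n-shot" seems to mean putting n worked examples into the prompt before the real query (so 14-shot = 14 examples), with no weight updates at all. A toy sketch with made-up translation examples:

```python
# Toy few-shot (n-shot) prompt construction: the "shots" are example
# question/answer pairs prepended to the prompt; the model's weights never change.
examples = [
    ("Translate to French: cat", "chat"),
    ("Translate to French: dog", "chien"),
]  # 2-shot; a 14-shot prompt would simply include 14 such pairs

def build_n_shot_prompt(examples, query: str) -> str:
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {query}\nA:"

print(build_n_shot_prompt(examples, "Translate to French: bird"))
```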
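
Working notes toward the residual-stream and layer questions: my rough picture is that a decoder-only transformer is a stack of layers (blocks), each block adds its output onto a running hidden-state vector instead of replacing it, and that running vector is the "residual stream"; "the residual stream in the 48th layer" would then be that vector as it passes through block 48. A schematic sketch, assuming PyTorch (`ToyBlock` is a stand-in, not a real attention-plus-MLP transformer layer):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer layer (attention and MLP omitted for brevity)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())

    def forward(self, residual_stream: torch.Tensor) -> torch.Tensor:
        # Each layer ADDS its contribution to the residual stream rather than
        # overwriting it; the running sum is what later layers read from.
        return residual_stream + self.mlp(residual_stream)

d_model, n_layers = 64, 4
blocks = nn.ModuleList(ToyBlock(d_model) for _ in range(n_layers))

x = torch.randn(1, d_model)               # embedding of one token
for i, block in enumerate(blocks, 1):
    x = block(x)                           # x is the residual stream after layer i
    print(f"residual stream after layer {i}: shape {tuple(x.shape)}")
```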