Root@LLM:~$

Learning ML sec, one research paper at a time.

FROM:

Questions I want to answer:

  • What are RLHF (Reinforcement Learning from Human Feedback) models?
  • What do parameters refer to in the context of ML? For example, what does 52B parameters mean? (rough count sketched after this list)
  • What is rejection sampling? (sketched after this list)
  • Why are RLHF models more difficult to red team as they scale?
  • How did Anthropic classify harmlessness?
  • What is the difference between generative models and other NLP models?
  • What does it mean that generative models are stochastic?
  • What are decoder-only transformer models?
  • What does n-shot learning mean? For example, 14-shot learning? (prompt example after this list)
  • What is context distillation?
  • Why is RLHF computationally expensive at train time but efficient at test time, while rejection sampling (RS) is the reverse?
  • RS models tend to be harmless by being evasive → is this meant literally, or is it some other concept?
  • What does “the residual stream in the 48th layer of the 52B prompted LM” mean? (toy sketch after this list)
    • What are residual streams?
    • What are layers?
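
A rough back-of-the-envelope take on the parameter-count question: parameters are the learnable weights of the network, and in a decoder-only transformer most of them live in the attention and MLP weight matrices. The sizes below are made-up assumptions (not the actual architecture of the 52B model in the paper), chosen so the total happens to land near 52B.

```python
# Rough parameter count for a decoder-only transformer.
# All sizes are illustrative assumptions, not the real 52B model's config.
# Layer norms and biases are ignored (they contribute comparatively little).

def transformer_param_count(d_model, n_layers, vocab_size, d_ff=None):
    """Approximate learnable-weight count for a GPT-style decoder."""
    d_ff = d_ff or 4 * d_model              # common convention for MLP width
    per_layer = (
        4 * d_model * d_model               # Q, K, V, and output projections
        + 2 * d_model * d_ff                # MLP up- and down-projections
    )
    embeddings = vocab_size * d_model       # token embedding matrix
    return n_layers * per_layer + embeddings

# Hypothetical config that lands in the ~52B range:
print(f"{transformer_param_count(d_model=8192, n_layers=64, vocab_size=50_000):,}")
```

So "52B parameters" just means roughly 52 billion such learnable numbers, which is why model scale is usually quoted this way.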
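A minimal sketch of rejection sampling (best-of-k), under the assumption that it means: draw k completions from the base model and keep the one a preference/reward model scores highest. `generate` and `reward_score` are hypothetical stand-ins, replaced with toy functions here so the snippet runs on its own.

```python
import random

random.seed(0)

def generate(prompt: str) -> str:
    # Stand-in for sampling one completion from a language model.
    return random.choice([
        "Let me explain that step by step...",
        "I can't help with that.",
        "Here is a short answer...",
    ])

def reward_score(prompt: str, completion: str) -> float:
    # Stand-in preference model: penalize the evasive reply, favor longer ones.
    return len(completion) - (100 if "can't help" in completion else 0)

def best_of_k(prompt: str, k: int = 16) -> str:
    """Rejection sampling: sample k completions, keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: reward_score(prompt, c))

print(best_of_k("Explain rejection sampling."))
```

This also hints at the train/test cost question: a rejection-sampling setup pays for k forward passes on every query at test time, while an RLHF model pays its extra cost once during training and then samples a single completion as usual.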
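On n-shot learning: as I understand it, "n-shot" just means the prompt contains n worked examples before the actual query, with no weight updates. The sentiment task below is an invented example; a "14-shot" prompt is the same construction with 14 examples.

```python
# Toy n-shot prompt construction. The task and examples are made up.
EXAMPLES = [
    ("The movie was wonderful.", "positive"),
    ("I wasted two hours of my life.", "negative"),
    ("The acting was superb and the plot gripping.", "positive"),
]

def n_shot_prompt(query: str, n: int) -> str:
    """Build a prompt with n in-context examples followed by the real query."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in EXAMPLES[:n]]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(n_shot_prompt("The soundtrack was forgettable.", n=2))  # a 2-shot prompt
# 0-shot would be n=0 (no examples); 14-shot would use 14 examples.
```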
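On residual streams and layers, a toy sketch (with the attention and MLP sublayers collapsed into one random linear map, which is a heavy simplification): each "layer" is one repeated transformer block, and the "residual stream" is the running hidden-state vector that every block adds its output back into. "The residual stream in the 48th layer" would then be the value of that vector after block 48.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 64, 48              # made-up sizes, not the real model's

def toy_block(x, W):
    """Stand-in for one decoder layer (attention + MLP, heavily simplified)."""
    return np.tanh(x @ W)

x = rng.normal(size=d_model)            # residual stream right after embedding
stream_snapshots = []
for layer in range(n_layers):
    W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    x = x + toy_block(x, W)             # residual connection: add, don't replace
    stream_snapshots.append(x.copy())

# "The residual stream in the 48th layer" ≈ the hidden state after block 48.
print(stream_snapshots[47][:5])
```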