Reinforcement Learning - Basic Concepts
Ref: https://www.udemy.com/course/aws-ai-practitioner-certified/learn/lecture/44886629
- 🔧 A type of ML where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards
- Over many simulations, the agent gradually learns from its mistakes and successes (learning is reinforced through feedback)
- Applications: gaming (chess, Go), robotics, finance, healthcare, autonomous vehicles…
- Key concepts:
  - Agent – the learner or decision-maker
  - Environment – the external system the agent interacts with
  - Action – the choices made by the agent
  - Reward – the feedback from the environment based on the agent’s actions
  - State – the current situation of the environment
  - Policy – the strategy the agent uses to determine actions based on the state
- Learning Process (similar to a state machine)
  - Agent observes current State of the Environment
  - Agent selects an Action based on its Policy
  - Environment transitions to a new State and provides a Reward
  - Agent updates its Policy to improve future decisions
- Goal: Maximize cumulative reward over time
- Example: RL agent exiting a maze → -1 per step (encourages taking as few steps as possible), -10 for hitting a wall (encourages avoiding walls), +100 for reaching the exit (rewards ending at the exit) → outcome: the robot learns to navigate efficiently over time (see the Q-learning sketch after this list)
- See the YouTube channel AI Warehouse for another example
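As a rough illustration of the loop above (observe State → choose Action via Policy → receive Reward → update Policy), here is a minimal tabular Q-learning sketch in Python for the maze example. Only the reward scheme (-1 per step, -10 for a wall, +100 for the exit) comes from the example; the grid layout, the Q-table approach, and the hyperparameters (ALPHA, GAMMA, EPSILON) are illustrative assumptions, not part of the course material.

```python
import random

# Tiny grid maze: 0 = free cell, 1 = wall (illustrative layout)
GRID = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
START, EXIT = (0, 0), (2, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
Q = {}  # Q[(state, action)] -> estimated cumulative reward


def step(state, action):
    """Environment: apply an Action, return (new State, Reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == 1:
        return state, -10, False       # hit a wall: penalty, stay in place
    if (nr, nc) == EXIT:
        return (nr, nc), 100, True     # reached the exit
    return (nr, nc), -1, False         # every other step costs -1


def choose_action(state):
    """Policy: epsilon-greedy over the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))


for episode in range(500):             # many simulations
    state = START
    for _ in range(100):               # cap episode length
        action = choose_action(state)                   # Agent selects an Action (Policy)
        next_state, reward, done = step(state, action)  # Environment: new State + Reward
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        # Agent updates its Policy (here: the Q-table) to improve future decisions
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        state = next_state
        if done:
            break

# Follow the learned greedy policy from the start
state, path = START, [START]
while state != EXIT and len(path) < 20:
    state, _, _ = step(state, max(ACTIONS, key=lambda a: Q.get((state, a), 0.0)))
    path.append(state)
print(path)  # typically [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)]
```

Epsilon-greedy exploration is one simple way to let the agent balance trying new actions against exploiting what it has already learned, which is why the robot's performance improves over many episodes.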
Reinforcement Learning from Human Feedback (RLHF)
Ref: https://www.udemy.com/course/aws-ai-practitioner-certified/learn/lecture/45375323
- 🔧 Use human feedback in a reward function to help ML models self-learn more efficiently
- Aligns model output with human goals, wants and needs
- Used throughout GenAI applications to significantly improve model performance
- e.g. grading text translations not just as “technically correct” but on how “human” they sound
RLHF process (example: internal company knowledge chatbot)
- Data collection
  - A set of human-generated prompts and responses is created
    - e.g. “Where is the location of the HR department in Boston?”
- Supervised fine-tuning of a model
  - Fine-tune an existing model with internal knowledge
  - The fine-tuned model then creates responses for the human-generated prompts
  - Responses are mathematically compared to the human-generated answers
- Build a separate reward model
  - Humans indicate which response they prefer for the same prompt (Answer 1 > Answer 2)
  - The reward model can now estimate how a human would rate a response to a given prompt
- Optimize the original model with the reward model
  - Use the reward model as a reward function for RL (see the sketch below)
  - This part can be fully automated
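A minimal PyTorch sketch of how the separate reward model is commonly trained, and of where the final optimization step plugs in. It uses a pairwise preference (Bradley-Terry) loss so that the answer humans preferred scores higher than the rejected one; the tiny network, tensor sizes, and the random stand-ins for encoded prompt/response pairs are assumptions for illustration, not the course's implementation.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a scalar preference score."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Human preference data: for each prompt, Answer 1 was preferred over Answer 2.
# Random vectors stand in here for the encoded (prompt + answer) pairs.
chosen = torch.randn(32, 128)    # batch of preferred answers
rejected = torch.randn(32, 128)  # batch of rejected answers

for _ in range(100):
    # Pairwise preference loss: push score(chosen) above score(rejected)
    loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Final step (fully automatable): the trained reward_model scores new responses,
# and that score serves as the reward signal in an RL loop (e.g. PPO) that
# fine-tunes the original chatbot model.
```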