A REPORT
On
“Reinforcement Learning: From Theory to Practice”
Submitted to
KIIT Deemed to be University
In Partial Fulfilment of the Requirement for the Award of
BACHELOR’S DEGREE IN
Computer Science and Engineering
Submitted By:-
Aryan Chakravorty
22054123
B.Tech-C.S.E.
Submitted To:-
Dr. Subhra Priyadarshini Biswal
School of Computer Engineering
SCHOOL OF COMPUTER ENGINEERING
KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA - 751024
April 2025
Acknowledgement
I would like to express my deepest gratitude to Dr. Subhra Priyadarshini Biswal, my project
guide, for their invaluable guidance, encouragement, and support throughout the preparation of
"Reinforcement Learning: From Theory to Practice". Their expertise and constructive feedback were
instrumental in overcoming challenges and achieving the objectives of this project.
I am also grateful to the faculty members of the School of Computer Engineering, KIIT Deemed
to be University, for providing me with the knowledge and resources necessary to complete this
project successfully. Their lectures and mentorship have been a source of inspiration throughout
my academic journey.
I extend my heartfelt thanks to my peers and colleagues for their constant motivation. Their
insights and suggestions helped me refine my approach and improve the overall quality of this
report.
Finally, I would like to thank my family for their unwavering support and encouragement.
Their belief in me has been a driving force behind my success.
This work would not have been possible without the collective efforts, guidance, and support of
all these individuals. I am truly grateful for their contributions.
Table of Contents
Executive Summary
Chapter 1: Introduction to Reinforcement Learning
1.1 Defining Reinforcement Learning
1.2 The Core Learning Paradigm
1.3 A Comparative Overview: RL vs. Other Machine Learning Types
Chapter 2: The Foundations of Reinforcement Learning
2.1 The Agent-Environment Interface
2.2 The Goal: Maximizing Cumulative Reward
2.3 The Mathematical Framework: Markov Decision Processes (MDPs)
Chapter 3: Core Concepts and Challenges
3.1 The Agent's Brain: Policies and Value Functions
3.2 The Heart of RL: The Bellman Equations
3.3 The Fundamental Dilemma: Exploration vs. Exploitation
Chapter 4: A Taxonomy of Reinforcement Learning Algorithms
4.1 Model-Based vs. Model-Free Approaches
4.2 Value-Based Methods: Q-Learning
4.3 Policy-Based Methods
4.4 Actor-Critic Methods
Chapter 5: The Deep Reinforcement Learning Revolution
5.1 The Limits of Traditional RL
5.2 The Breakthrough: Deep Q-Networks (DQN)
5.3 Landmark Achievements: Atari and AlphaGo
Chapter 6: Real-World Applications of Reinforcement Learning
6.1 Robotics and Autonomous Control
6.2 Recommender Systems and Personalization
6.3 Finance and Algorithmic Trading
6.4 Resource Management
6.5 AI Alignment: Reinforcement Learning from Human Feedback (RLHF)
Chapter 7: Challenges and Future Directions
7.1 Key Challenges in Modern RL
7.2 The Future of Reinforcement Learning
Chapter 8: Conclusion
References
Executive Summary
Reinforcement Learning (RL) is a paradigm of machine learning where an intelligent agent learns
to make optimal decisions through trial and error. Unlike supervised learning, which requires a
labeled dataset, or unsupervised learning, which finds patterns in unlabeled data, RL agents learn
from interacting with an environment, guided only by a scalar reward signal. The agent's sole
objective is to develop a strategy, or policy, that maximizes its cumulative reward over time.
This report provides a comprehensive overview of the field, starting with the foundational
concepts of agents, environments, states, actions, and rewards. It delves into the mathematical
framework of Markov Decision Processes (MDPs) and the cornerstone Bellman equations, which
provide the theoretical basis for nearly all RL algorithms. A central challenge in RL, the
exploration vs. exploitation trade-off, is discussed, highlighting the agent's need to balance acting
on current knowledge with seeking new information.
The report surveys the primary categories of RL algorithms, including value-based, policy-based,
and actor-critic methods. A significant focus is placed on the Deep Reinforcement Learning
(DRL) revolution, where deep neural networks are used as function approximators, enabling RL
to solve problems of immense scale and complexity. Landmark achievements like DeepMind's
successes with Atari games and AlphaGo are detailed as key inflection points for the field.
Finally, the report explores the growing landscape of real-world applications, from robotics and
recommender systems to financial trading and the critical role of RL from Human Feedback
(RLHF) in aligning large language models. It concludes by examining the primary challenges
facing the field—such as sample inefficiency and safety—and looks ahead to future research
directions, solidifying RL's position as a cornerstone of modern artificial intelligence.
Introduction to Reinforcement Learning
Defining Reinforcement Learning
Reinforcement Learning (RL) is a goal-oriented learning paradigm based on behavioral psychology. It is
concerned with how an intelligent agent ought to take actions in an environment in order to maximize some
notion of cumulative reward. The learning process is interactive and driven by trial and error; the agent discovers
which actions yield the most reward by trying them, rather than by being explicitly told which actions to take.
The Core Learning Paradigm
The intuition behind RL can be easily understood through a simple analogy: training a dog.
The dog is the agent.
The room it is in, along with its trainer, is the environment.
When the trainer gives a command like "sit," this represents a state.
The dog's decision to sit or stand is its action.
If the dog performs the correct action, it receives a treat, which is a positive reward.
The dog does not understand the abstract concept of "sitting." It simply learns, through repeated interaction, that
performing a specific muscle movement (action) in a particular context (state) leads to a desirable outcome
(reward). Over time, it develops a strategy, or policy, to maximize the number of treats it receives. RL formalizes
this intuitive process into a computational framework.
A Comparative Overview: RL vs. Other Machine Learning Types
To fully appreciate RL's unique position, it is useful to contrast it with the other two primary machine learning
paradigms: supervised and unsupervised learning.
Paradigm: Supervised Learning
Data Input: Labeled dataset (X, Y).
Learning Goal: Learn a mapping function f(X) = Y.
Feedback Mechanism: Instructive. Direct feedback on every prediction with the correct label.

Paradigm: Unsupervised Learning
Data Input: Unlabeled dataset (X).
Learning Goal: Discover underlying structures, patterns, or clusters.
Feedback Mechanism: None. The algorithm explores the data's intrinsic structure.

Paradigm: Reinforcement Learning
Data Input: No predefined dataset; data is generated via interaction.
Learning Goal: Learn a policy π(S) = A to maximize long-term reward.
Feedback Mechanism: Evaluative. A scalar reward signal indicates how "good" an action was, but not which action was best.
The key differentiator is the nature of the feedback. Supervised learning relies on a "teacher" who provides the
correct answers. Reinforcement learning relies on a "critic" who scores the agent's actions without revealing the
optimal action. This makes RL exceptionally well-suited for problems involving sequential decision-making and
long-term planning, where the notion of a single "correct" label for a given state does not exist.
The Foundations of Reinforcement Learning
The Agent-Environment Interface
All RL problems are framed as an interaction between an agent and an environment. This
interaction occurs in a sequence of discrete time steps, t = 0, 1, 2, ..., and involves the following elements:
Agent: The learner and decision-maker.
Environment: Everything outside the agent; the world with which it interacts.
State (St): A representation of the environment at time t.
Action (At): A decision made by the agent based on the state St.
Reward (Rt+1): A scalar feedback signal received by the agent after taking action At in
state St.
The process unfolds in a continuous loop:
The agent observes the current state St.
Based on St, it selects an action At.
The environment transitions to a new state St+1.
The environment provides the agent with a reward Rt+1.
The cycle repeats.
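This loop can be written down directly in code. The Python sketch below uses a toy, hypothetical environment (a five-cell corridor with a reward at one end) and an agent that acts at random; the class and method names are illustrative and not taken from any specific RL library.

import random

class GridEnvironment:
    # A tiny illustrative environment: reach position 4 on a line of 5 cells.
    def reset(self):
        self.position = 0
        return self.position                          # initial state S_0

    def step(self, action):
        # action: -1 (move left) or +1 (move right)
        self.position = max(0, min(4, self.position + action))
        reward = 1.0 if self.position == 4 else 0.0   # reward R_{t+1}
        done = self.position == 4                     # the episode ends at the goal
        return self.position, reward, done            # next state S_{t+1}, reward, termination flag

env = GridEnvironment()
state = env.reset()
done = False
while not done:
    action = random.choice([-1, +1])                  # the agent observes S_t and selects A_t (randomly here)
    state, reward, done = env.step(action)            # the environment returns S_{t+1} and R_{t+1}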
The Goal: Maximizing Cumulative Reward
A myopic agent that only tries to maximize its immediate reward may fail to achieve the best
long-term outcome. For instance, in chess, sacrificing a pawn (negative immediate reward) might
be necessary to win the game (high future reward). Therefore, the agent's goal is to maximize the
cumulative reward, known as the return.
The return at time t, denoted Gt, is the sum of all future rewards. To handle tasks that may run
indefinitely (continuing tasks) and to prioritize nearer rewards, a discount factor γ (where 0 ≤ γ ≤ 1)
is introduced.
The discounted return is defined as:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
The discount factor determines the present value of future rewards. A γ close to 0 leads to a
"short-sighted" agent, while a γ close to 1 leads to a "far-sighted" agent that strives for long-term
gain.
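As a concrete reading of this formula, the short Python sketch below computes a discounted return from a finite list of rewards; the reward values are made up for illustration.

def discounted_return(rewards, gamma):
    # Compute G_t = sum over k of gamma^k * R_{t+k+1} for a finite reward sequence.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards R_{t+1}, R_{t+2}, R_{t+3} = 1, 0, 2 (illustrative values)
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62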
Policy-Based Methods
Instead of learning a value function, policy-based methods directly learn the parameters of a
policy πθ(a|s) that maximizes the expected return. They typically work by calculating the
gradient of the expected return with respect to the policy parameters θ and updating the
parameters using gradient ascent. A classic example is the REINFORCE algorithm.
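A minimal sketch of a REINFORCE-style update for a tabular softmax policy, written in plain NumPy; the environment interface is omitted and the episode is assumed to be a list of (state, action, reward) tuples. This is an illustration of the idea under those assumptions, not the exact formulation from any specific source.

import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))       # preferences for a tabular softmax policy pi_theta(a|s)

def policy(state):
    # Softmax over the action preferences of this state.
    prefs = theta[state]
    exp_prefs = np.exp(prefs - prefs.max())
    return exp_prefs / exp_prefs.sum()

def reinforce_update(episode, gamma=0.99, lr=0.1):
    # episode: list of (state, action, reward) tuples from one full rollout.
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g                # return G_t accumulated from the end of the episode
        grad_log = -policy(state)             # gradient of log pi(a|s) w.r.t. this state's preferences
        grad_log[action] += 1.0
        theta[state] += lr * g * grad_log     # gradient ascent on the expected return

# Example: a made-up two-step episode ending with reward 1.
reinforce_update([(0, 1, 0.0), (1, 0, 1.0)])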
Actor-Critic Methods
Actor-critic methods are a hybrid approach that combines the strengths of value-based and
policy-based methods. They consist of two components:
The Actor: A policy that controls how the agent behaves.
The Critic: A value function that measures how good the actions taken by the actor are.
The critic evaluates the actor's actions, and the actor updates its policy in the direction suggested
by the critic. This allows for more stable and efficient learning than pure policy-based methods.
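The interplay between the two components can be sketched in the same tabular style, assuming the critic is a table of state values and the one-step TD error serves as its evaluation signal; all sizes and learning rates below are illustrative.

import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy preferences
v = np.zeros(n_states)                    # critic: estimated state values V(s)

def actor_critic_update(s, a, r, s_next, done, gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    # One update from a single transition (s, a, r, s_next).
    td_target = r if done else r + gamma * v[s_next]
    td_error = td_target - v[s]           # the critic's judgement of how good the action turned out
    v[s] += lr_critic * td_error          # critic: move V(s) toward the TD target

    prefs = theta[s]
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log   # actor: step in the direction the critic suggests

# Example transition with made-up values: in state 0, action 1 gave reward 0.5 and led to state 2.
actor_critic_update(0, 1, 0.5, 2, done=False)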
The Deep Reinforcement Learning Revolution
The Limits of Traditional RL
Traditional algorithms like Q-learning rely on tabular representations for value functions or
policies. This approach is only feasible for problems with small, discrete state and action spaces.
For problems with high-dimensional state spaces (like processing images from a camera) or
continuous state spaces (like robot joint angles), these tables become intractably large. This is
known as the "curse of dimensionality."
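To make the tabular setting concrete, the sketch below shows the standard Q-learning update for a small discrete problem; the state and action counts are arbitrary illustrative values.

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # one table entry per (state, action) pair

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition with made-up values.
q_learning_update(s=0, a=2, r=1.0, s_next=3)

Every additional state variable multiplies the number of rows this table needs, which is why a table indexed by raw camera images or continuous joint angles quickly becomes infeasible.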
The Breakthrough: Deep Q-Networks (DQN)
The modern era of RL began in 2013 when researchers at DeepMind combined Q-learning with
deep neural networks, creating the Deep Q-Network (DQN). Instead of a Q-table, a neural
network is used as a function approximator to estimate the Q-value function: Q(s, a; θ) ≈ Q*(s, a).
The network takes the state (e.g., raw pixels from a game screen) as input and outputs the
Q-values for all possible actions. Two key innovations made this stable:
1. Experience Replay: The agent stores its experiences in a replay buffer and learns by sampling
random mini-batches from it. This breaks the correlation between consecutive samples,
improving stability.
2. Target Network: A separate, slowly updated target network is used to generate the TD targets,
preventing the optimization from chasing a moving target and diverging.
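The structure of these two ideas is shown in the PyTorch sketch below, assuming a small fully connected Q-network for a hypothetical task with a 4-dimensional state and 2 actions (rather than the convolutional network over Atari pixels used in the original work); all names, sizes, and hyperparameters are illustrative.

import random
from collections import deque
import torch
import torch.nn as nn

# Q-network and target network (same architecture; the target is a frozen copy).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())        # the target network starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=100_000)                  # experience replay memory
gamma = 0.99
# During interaction the agent stores each transition:
#   replay_buffer.append((state, action, reward, next_state, done))

def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)   # random mini-batch breaks correlation between consecutive samples
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # TD targets come from the frozen target network
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few thousand environment steps the target network is refreshed:
#   target_net.load_state_dict(q_net.state_dict())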
Landmark Achievements: Atari and AlphaGo
Atari Games (2015): The DQN algorithm was tested on a suite of 49 classic Atari 2600
games. Using only the raw screen pixels as input, the same general-purpose agent (identical
architecture and hyperparameters, trained separately on each game) learned to play them,
achieving human-level or better performance on more than half. It famously
discovered novel strategies, such as the tunneling technique in the game Breakout.
AlphaGo (2016): DeepMind developed a more sophisticated DRL system to tackle the
ancient game of Go, a problem with a state-space complexity far exceeding that of chess.
AlphaGo combined deep neural networks with Monte Carlo Tree Search. In a historic match,
it defeated 18-time world champion Lee Sedol, a feat that experts believed was at least a
decade away. This demonstrated that DRL could master tasks requiring deep, intuitive
strategy.
Real-World Applications of Reinforcement
Learning
The successes in game-playing have catalyzed the application of RL to complex, real-world
problems.
Robotics and Autonomous Control
RL is used to train robots for tasks that are difficult to hand-engineer, such as bipedal walking,
object manipulation in cluttered environments, and autonomous drone navigation. The agent
learns a control policy directly from trial and error, often in simulation before being transferred to
a physical robot.
Recommender Systems and Personalization
Platforms like YouTube, Netflix, and TikTok use RL to personalize content feeds. The problem is
framed as an MDP where the "state" is the user's history, the "action" is which video to
recommend, and the "reward" is user engagement (e.g., watch time, likes). RL allows these
systems to optimize for long-term user satisfaction.
Finance and Algorithmic Trading
In finance, RL is applied to problems like optimal trade execution, portfolio management, and
developing high-frequency trading strategies. The agent learns a policy to buy, sell, or hold assets
based on market state information to maximize profit.
Resource Management
RL has been successfully used to optimize resource allocation in dynamic environments. A
notable example is Google's use of DRL to manage the cooling systems of its data centers, which
resulted in a significant reduction in energy consumption. Other applications include traffic light
signal control and managing communication networks.
AI Alignment: Reinforcement Learning from Human Feedback (RLHF)
Perhaps the most impactful recent application of RL is in aligning Large Language Models
(LLMs). RLHF is a technique used to fine-tune models like ChatGPT. The process involves:
1. Generating multiple responses from the LLM to a prompt.
2. Having a human rank these responses from best to worst.
3. Using this ranking data to train a "reward model" that learns to predict human preferences.
4. Using this reward model as the reward function to fine-tune the LLM's policy with RL,
optimizing it to produce responses that humans prefer.
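Step 3 is typically implemented with a pairwise preference loss: the reward model is trained so that the response the human preferred receives a higher score than the rejected one (the approach described by Ouyang et al., 2022). The PyTorch sketch below abstracts the reward model as a small network over fixed-size response embeddings; the architecture and dimensions are placeholders for illustration.

import torch
import torch.nn as nn

# Toy reward model that scores a fixed-size response embedding with a single scalar.
# In practice the reward model is itself a large transformer; the 768-dimensional
# embeddings used here are placeholders.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)

def preference_loss(chosen_emb, rejected_emb):
    # Pairwise ranking loss: push the score of the human-preferred response
    # above the score of the rejected one.
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One training step on a batch of (preferred, rejected) response embeddings (random here).
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(chosen, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()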
Challenges and Future Directions
Key Challenges in Modern RL
Despite its successes, RL faces several significant challenges:
Sample Inefficiency: RL algorithms often require millions or even billions of interactions
with the environment to learn an effective policy, making them costly and slow for real-world
applications.
Reward Design: The performance of an RL agent is highly sensitive to the design of its
reward function. A poorly designed reward can lead to "reward hacking," where the agent
finds loopholes to maximize the reward signal without achieving the intended goal.
The Sim-to-Real Gap: For physical systems like robots, it is often safer and cheaper to train
in simulation. However, policies trained in a simulator frequently fail when transferred to the
real world due to subtle differences in dynamics.
Safety and Reliability: Ensuring that an agent does not take catastrophic actions during its
exploration phase or in unforeseen situations is a critical and unsolved problem.
The Future of Reinforcement Learning
Research in RL is rapidly advancing to address these challenges. Key future directions include:
1. Multi-Agent Reinforcement Learning (MARL): Studying how multiple agents can learn to
interact, either cooperatively (e.g., a fleet of autonomous delivery drones) or competitively
(e.g., strategic game-playing).
2. Offline RL: Developing methods to learn effective policies from large, static datasets of
previously collected experiences, without requiring further interaction with the environment.
3. Generalization and Meta-Learning: Creating agents that can leverage knowledge from
previously solved tasks to learn new tasks more quickly, a capability often referred to as
"learning to learn."
Conclusion
Reinforcement Learning has firmly established itself as a fundamental pillar of modern artificial
intelligence. Evolving from its theoretical roots in psychology and optimal control, it has been
supercharged by the power of deep learning to become a framework capable of solving
previously intractable problems in sequential decision-making. The landmark successes in
complex games have demonstrated its potential, and its ongoing integration into real-world
systems—from robotics to the very core of large language models—highlights its practical utility.
While significant challenges in efficiency, safety, and generalization remain, the field is a hotbed
of innovation. As research continues to advance, Reinforcement Learning holds the promise of
creating more adaptive, intelligent, and autonomous systems capable of tackling some of the most
complex dynamic optimization problems facing science and industry.
References
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.