A dummy implementation of a reinforcement learning problem with a dummy environment and a simple agent.
Just run:-
$ python3 simple_rl_agent_environment.py
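The script's exact contents aren't shown here, but a dummy environment/agent pair along these lines illustrates the idea (the class and method names below are illustrative, not necessarily the ones used in simple_rl_agent_environment.py):

```python
import random


class DummyEnvironment:
    """Toy environment: gives a fixed reward each step for a fixed number of steps."""

    def __init__(self, steps_left: int = 10):
        self.steps_left = steps_left

    def get_observation(self):
        return [0.0, 0.0, 0.0]  # placeholder observation

    def get_actions(self):
        return [0, 1]  # two dummy actions

    def is_done(self) -> bool:
        return self.steps_left == 0

    def step(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return 1.0  # constant reward, just to exercise the interaction loop


class SimpleAgent:
    """Agent that samples a random action and accumulates the reward it receives."""

    def __init__(self):
        self.total_reward = 0.0

    def act(self, env: DummyEnvironment) -> None:
        _ = env.get_observation()  # a real agent would condition on this
        action = random.choice(env.get_actions())
        self.total_reward += env.step(action)


if __name__ == "__main__":
    env, agent = DummyEnvironment(), SimpleAgent()
    while not env.is_done():
        agent.act(env)
    print(f"Total reward collected: {agent.total_reward:.2f}")
```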
Install conda and create a virtual environment in which to install gymnasium and the corresponding gymnasium environment packages. Use Python 3.9 to avoid compatibility issues when using the different ALE environments.
$ conda create --name deep_rl_python_3.9 python=3.9
$ conda activate deep_rl_python_3.9
$ pip install -r requirements.txt
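The requirements file should cover at least the libraries the scripts below rely on (gymnasium with ALE support, PyTorch, PyTorch Ignite, and tensorboard). The listing below is only an illustration of what such a file might contain, not the repository's actual requirements.txt:

```text
gymnasium
ale-py
pygame
numpy
torch
pytorch-ignite
tensorboard
```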
Now run the following scripts:-
simple_gymnasium_rl_agent.py: Shows how to use the gymnasium library to code up your first RL agent.
$ python3 simple_gymnasium_rl_agent.py
Sample Output with rendered graphics:-

opt/anaconda3/envs/deep_rl/lib/python3.12/site-packages/pygame/pkgdata.py:25: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import resource_stream, resource_exists
2025-08-04 15:54:00.478 python3[36605:14153423] +[IMKClient subclass]: chose IMKClient_Modern
2025-08-04 15:54:00.478 python3[36605:14153423] +[IMKInputSession subclass]: chose IMKInputSession_Modern
Selected random action: 1
Current Episode finished. Steps executed = 12
Selected random action: 1
Current Episode finished. Steps executed = 37
Selected random action: 1
Current Episode finished. Steps executed = 53
Selected random action: 1
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 86
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 113
Selected random action: 0
Selected random action: 0
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 157
...
...
Current Episode finished. Steps executed = 966
Selected random action: 1
Selected random action: 0
Selected random action: 0
Current Episode finished. Steps executed = 983
Selected random action: 0
Current Episode finished. Steps executed = 995
Selected random action: 1
RL Agent interaction complete after 1000 steps with total reward = 1000.00
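The interaction shown above comes from a loop that samples a random action each step, accumulates the reward, and resets whenever an episode ends. A minimal sketch follows; it assumes CartPole-v1 (consistent with the two actions and the reward of 1 per step in the log), and its logging is simplified relative to the script's:

```python
import gymnasium as gym

# Assumption: CartPole-v1 with human (pygame) rendering; the actual script may differ.
env = gym.make("CartPole-v1", render_mode="human")
obs, info = env.reset()

total_reward, steps = 0.0, 0
while steps < 1000:
    action = env.action_space.sample()  # random action, as in the log above
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
    if terminated or truncated:
        print(f"Current Episode finished. Steps executed = {steps}")
        obs, info = env.reset()

print(f"RL Agent interaction complete after {steps} steps "
      f"with total reward = {total_reward:.2f}")
env.close()
```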
gan_atari_image_synthesis.py: A GAN-based model used to generate Atari images using the ALE environment for different Atari games, specifically 10 popular games: Breakout, Alien, Atlantis, Robotank, Pitfall, VideoCube, VideoCheckers, BattleZone, Qbert, and KungFuMaster.
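The exact data pipeline isn't shown here, but training frames can be collected from those ten environments with a random-play loop along these lines (function and variable names are illustrative, and the script's actual preprocessing may differ):

```python
import random

import gymnasium as gym
import ale_py  # registers the ALE/... ids; newer gymnasium may also need gym.register_envs(ale_py)
import torch
import torch.nn.functional as F

GAME_IDS = [
    "ALE/Breakout-v5", "ALE/Alien-v5", "ALE/Atlantis-v5", "ALE/Robotank-v5",
    "ALE/Pitfall-v5", "ALE/VideoCube-v5", "ALE/VideoCheckers-v5",
    "ALE/BattleZone-v5", "ALE/Qbert-v5", "ALE/KungFuMaster-v5",
]


def frame_batches(batch_size: int = 16, image_size: int = 128):
    """Yield batches of random-play RGB frames resized to (3, image_size, image_size)."""
    envs = [gym.make(game_id) for game_id in GAME_IDS]
    for env in envs:
        env.reset(seed=0)
    batch = []
    while True:
        env = random.choice(envs)
        obs, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            env.reset()
        frame = torch.from_numpy(obs).permute(2, 0, 1).float() / 255.0  # HWC -> CHW, [0, 1]
        frame = F.interpolate(frame.unsqueeze(0), size=(image_size, image_size),
                              mode="bilinear", align_corners=False).squeeze(0)
        batch.append(frame)
        if len(batch) == batch_size:
            yield torch.stack(batch)
            batch = []
```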
Using PyTorch Ignite for the training code [Preferred Approach]:-
$ python gan_atari_image_synthesis.py --use_ignite
Using normal PyTorch training code:-
$ python gan_atari_image_synthesis.py
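With PyTorch Ignite, the alternating generator/discriminator update is typically wrapped in an ignite Engine whose process function performs one training step. The sketch below assumes a DCGAN-style generator that consumes a (batch, latent_dim, 1, 1) noise tensor and a discriminator that emits one logit per image; it illustrates the Ignite approach rather than reproducing the script's actual code:

```python
import torch
from ignite.engine import Engine, Events


def make_gan_trainer(generator, discriminator, g_opt, d_opt,
                     latent_dim: int = 100, device: str = "cuda"):
    """Wrap one GAN training step in an Ignite Engine (illustrative sketch)."""
    bce = torch.nn.BCEWithLogitsLoss()

    def train_step(engine, real_batch):
        real = real_batch.to(device)
        n = real.size(0)
        ones, zeros = torch.ones(n, 1, device=device), torch.zeros(n, 1, device=device)

        # Discriminator update: real frames vs. generated ("fake") frames.
        noise = torch.randn(n, latent_dim, 1, 1, device=device)
        fake = generator(noise)
        d_opt.zero_grad()
        d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
        d_loss.backward()
        d_opt.step()

        # Generator update: try to make the discriminator label fakes as real.
        g_opt.zero_grad()
        g_loss = bce(discriminator(fake), ones)
        g_loss.backward()
        g_opt.step()

        return {"Generator_Loss": g_loss.item(), "Discriminator_Loss": d_loss.item()}

    trainer = Engine(train_step)

    @trainer.on(Events.ITERATION_COMPLETED(every=50))
    def log_losses(engine):
        print(f"Iteration {engine.state.iteration}: {engine.state.output}")

    return trainer


# Usage (illustrative): trainer.run(frame_batches(), epoch_length=1000, max_epochs=100)
```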
Sample images are saved to tensorboard, where real corresponds to frames from the actual Atari games and fake corresponds to images generated by the network after some number of training iterations.
Sample output:-
Running in PRODUCTION mode...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Running the code on device: cuda
Image Shape: (3, 128, 128)
List of 10 Environments:-
1: ALE/Breakout-v5
2: ALE/Alien-v5
3: ALE/Atlantis-v5
4: ALE/Robotank-v5
5: ALE/Pitfall-v5
6: ALE/VideoCube-v5
7: ALE/VideoCheckers-v5
8: ALE/BattleZone-v5
9: ALE/Qbert-v5
10: ALE/KungFuMaster-v5
Trained the GAN for Atari games image synthesis for 50 iterations in 0 hours, 1 mins, and 10 secs: Generator_Loss=14.673180 Discriminator_Loss=1.260259
Trained the GAN for Atari games image synthesis for 100 iterations in 0 hours, 2 mins, and 20 secs: Generator_Loss=35.574182 Discriminator_Loss=0.087913
Trained the GAN for Atari games image synthesis for 150 iterations in 0 hours, 3 mins, and 30 secs: Generator_Loss=32.809126 Discriminator_Loss=0.298351
Trained the GAN for Atari games image synthesis for 200 iterations in 0 hours, 4 mins, and 41 secs: Generator_Loss=79.532125 Discriminator_Loss=0.052655
Trained the GAN for Atari games image synthesis for 250 iterations in 0 hours, 5 mins, and 51 secs: Generator_Loss=82.223944 Discriminator_Loss=0.000000
...
Saved GAN generated and real images in tensorboard at iteration number 500
...
...
Trained the GAN for Atari games image synthesis for 30250 iterations in 11 hours, 6 mins, and 53 secs: Generator_Loss=9.519476 Discriminator_Loss=0.552391
Trained the GAN for Atari games image synthesis for 30300 iterations in 11 hours, 8 mins, and 10 secs: Generator_Loss=10.392010 Discriminator_Loss=0.168032
Trained the GAN for Atari games image synthesis for 30350 iterations in 11 hours, 9 mins, and 20 secs: Generator_Loss=10.334563 Discriminator_Loss=0.066586
...
Saved GAN generated and real images in tensorboard at iteration number 100000
Finished training the network: 100000 iterations in 34 hours, 12 mins, and 41 secs.
cross_entropy_cartpole_frozenlake_rl_agent.py: Solves the CartPole-v1 & FrozenLake-v1 game environments with the Cross-Entropy method - a model-free, policy-based, on-policy method (a code sketch of the training loop follows the list below).
a) Model-free -> Does not build a model of the environment to predict the next action or reward.
b) Policy-based π(a|s) -> Builds a probability distribution over actions, with observations as input. This differs from value-based methods, which evaluate every action and pick the one that gives the best expected reward.
c) On-policy -> Uses observations gathered by the current policy that is being updated; it does not reuse observations from previous episodes.
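The core loop of the cross-entropy method plays a batch of episodes with the current policy, keeps only the "elite" episodes above a reward percentile, and trains the policy to imitate the elite actions. A minimal sketch, assuming a CartPole-style float observation vector and a CPU policy (hyperparameter names are illustrative):

```python
import numpy as np
import torch


def cross_entropy_iteration(env, policy, optimizer, batch_size=16, percentile=70):
    """One iteration of the cross-entropy method (sketch)."""
    softmax = torch.nn.Softmax(dim=1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # 1) Play a batch of episodes with the current policy.
    episodes = []  # (total_reward, observations, actions)
    for _ in range(batch_size):
        obs, _ = env.reset()
        obs_list, act_list, total_reward, done = [], [], 0.0, False
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            probs = softmax(policy(obs_t)).detach().numpy()[0]
            action = int(np.random.choice(len(probs), p=probs))
            next_obs, reward, terminated, truncated, _ = env.step(action)
            obs_list.append(obs)
            act_list.append(action)
            total_reward += reward
            obs, done = next_obs, terminated or truncated
        episodes.append((total_reward, obs_list, act_list))

    # 2) Keep only the "elite" episodes whose reward clears the percentile threshold.
    rewards = [r for r, _, _ in episodes]
    threshold = float(np.percentile(rewards, percentile))
    elite_obs, elite_acts = [], []
    for r, obs_list, act_list in episodes:
        if r >= threshold:
            elite_obs.extend(obs_list)
            elite_acts.extend(act_list)

    # 3) Train the policy to imitate the elite actions with a cross-entropy loss.
    optimizer.zero_grad()
    logits = policy(torch.as_tensor(np.asarray(elite_obs), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(elite_acts, dtype=torch.long))
    loss.backward()
    optimizer.step()
    return loss.item(), float(np.mean(rewards)), threshold
```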
i) CartPole-v1

$ python cross_entropy_cartpole_frozenlake_rl_agent.py --env CartPole
Playing the game environment: CartPole-v1
env_obs_space_shape = (4,) action.size = 2
Running the code on device: cuda
Policy Network:-
Policy(
  (network): Sequential(
    (0): Linear(in_features=4, out_features=256, bias=True)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=512, bias=True)
    (4): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (5): ReLU()
    (6): Linear(in_features=512, out_features=2, bias=True)
  )
)
Iteration no 1: Avg Loss from filtered 162 episodes = 0.6875171661376953 Reward Threshold = 20.5 Reward Mean = 20.1875 Elapsed Time: 0 hours, 0 mins, and 7 secs
Iteration no 2: Avg Loss from filtered 414 episodes = 0.6084024310112 Reward Threshold = 49.0 Reward Mean = 45.9375 Elapsed Time: 0 hours, 0 mins, and 22 secs
Iteration no 3: Avg Loss from filtered 458 episodes = 0.5593582391738892 Reward Threshold = 70.5 Reward Mean = 65.875 Elapsed Time: 0 hours, 0 mins, and 44 secs
Iteration no 4: Avg Loss from filtered 461 episodes = 0.5255176424980164 Reward Threshold = 64.0 Reward Mean = 60.125 Elapsed Time: 0 hours, 1 mins, and 3 secs
Iteration no 5: Avg Loss from filtered 771 episodes = 0.5293067097663879 Reward Threshold = 120.0 Reward Mean = 99.5625 Elapsed Time: 0 hours, 1 mins, and 36 secs
...
...
Iteration no 33: Avg Loss from filtered 2236 episodes = 0.4848739206790924 Reward Threshold = 414.0 Reward Mean = 369.875 Elapsed Time: 0 hours, 31 mins, and 19 secs
Iteration no 34: Avg Loss from filtered 2495 episodes = 0.45061811804771423 Reward Threshold = 493.0 Reward Mean = 428.0 Elapsed Time: 0 hours, 33 mins, and 38 secs
Iteration no 35: Avg Loss from filtered 3500 episodes = 0.4489649534225464 Reward Threshold = 500.0 Reward Mean = 445.6875 Elapsed Time: 0 hours, 36 mins, and 2 secs
Iteration no 36: Avg Loss from filtered 7500 episodes = 0.439805805683136 Reward Threshold = 500.0 Reward Mean = 498.25 Elapsed Time: 0 hours, 38 mins, and 43 secs
Finished solving CartPole-v1 env in 36 iterations with final episode loss = 0.4398. Elapsed Time: 0 hours, 38 mins, and 43 secs.
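The Policy network printed in the log above corresponds to a module roughly like the following (reconstructed from the printout; the constructor signature is an assumption):

```python
import torch.nn as nn


class Policy(nn.Module):
    """Policy network matching the architecture printed in the log above."""

    def __init__(self, obs_size: int, n_actions: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_size, 256),
            nn.LayerNorm(256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # raw action logits; softmax is applied outside
        )

    def forward(self, x):
        return self.network(x)


# Policy(obs_size=4, n_actions=2) reproduces the CartPole-v1 printout;
# Policy(obs_size=16, n_actions=4) reproduces the FrozenLake-v1 printout further below.
```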
ii) FrozenLake-v1

$ python cross_entropy_cartpole_frozenlake_rl_agent.py --env FrozenLake
Playing the game environment: FrozenLake-v1
env_obs_space_shape = (16,) action.size = 4
Running the code on device: mps
Policy Network:-
Policy(
  (network): Sequential(
    (0): Linear(in_features=16, out_features=256, bias=True)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=512, bias=True)
    (4): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (5): ReLU()
    (6): Linear(in_features=512, out_features=4, bias=True)
  )
)
Iteration no 1: Avg Loss from 3 filtered elite episodes with 28 episode steps = 1.3304394483566284 Reward Threshold = 0.531441 Reward Mean = 0.011906761345496104 Elapsed Time: 0 hours, 3 mins, and 30 secs
Iteration no 2: Avg Loss from 4 filtered elite episodes with 34 episode steps = 1.1078670024871826 Reward Threshold = 0.531441 Reward Mean = 0.01671958383057874 Elapsed Time: 0 hours, 6 mins, and 33 secs
Iteration no 3: Avg Loss from 6 filtered elite episodes with 54 episode steps = 0.9129630327224731 Reward Threshold = 0.531441 Reward Mean = 0.023868519799030968 Elapsed Time: 0 hours, 9 mins, and 10 secs
Iteration no 4: Avg Loss from 9 filtered elite episodes with 87 episode steps = 0.8517952561378479 Reward Threshold = 0.531441 Reward Mean = 0.03243154459454831 Elapsed Time: 0 hours, 11 mins, and 57 secs
Iteration no 5: Avg Loss from 12 filtered elite episodes with 131 episode steps = 0.8823782801628113 Reward Threshold = 0.531441 Reward Mean = 0.03823237425524295 Elapsed Time: 0 hours, 14 mins, and 47 secs
...
...
Iteration no 19: Avg Loss from 46 filtered elite episodes with 476 episode steps = 0.7461742758750916 Reward Threshold = 0.531441 Reward Mean = 0.11597856840308171 Elapsed Time: 0 hours, 53 mins, and 40 secs
Iteration no 20: Avg Loss from 51 filtered elite episodes with 530 episode steps = 0.7389541268348694 Reward Threshold = 0.531441 Reward Mean = 0.1248935337542685 Elapsed Time: 0 hours, 56 mins, and 16 secs
Iteration no 21: Avg Loss from 56 filtered elite episodes with 594 episode steps = 0.7564355731010437 Reward Threshold = 0.531441 Reward Mean = 0.1304198625031866 Elapsed Time: 0 hours, 58 mins, and 56 secs
Iteration no 22: Avg Loss from 59 filtered elite episodes with 630 episode steps = 0.7613645792007446 Reward Threshold = 0.531441 Reward Mean = 0.13234486306288937 Elapsed Time: 1 hours, 1 mins, and 28 secs
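FrozenLake-v1 natively returns a discrete state id, so the env_obs_space_shape = (16,) in the log implies the observation is one-hot encoded before being fed to the policy. A wrapper along these lines would produce that shape (a sketch; the script's actual wrapper may differ):

```python
import gymnasium as gym
import numpy as np


class OneHotObservation(gym.ObservationWrapper):
    """Turn FrozenLake's discrete state id into a one-hot float vector of length 16."""

    def __init__(self, env):
        super().__init__(env)
        n = env.observation_space.n
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(n,), dtype=np.float32)

    def observation(self, obs):
        one_hot = np.zeros(self.observation_space.shape, dtype=np.float32)
        one_hot[obs] = 1.0
        return one_hot


# Usage (illustrative):
# env = OneHotObservation(gym.make("FrozenLake-v1"))
```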