A dummy implementation of a reinforcement learning problem with a dummy environment and a simple agent.
Just run:-
$ python3 simple_rl_agent_environment.py
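The script's exact contents aren't shown here, but a dummy environment/agent pair along these lines illustrates the idea (the class and method names below are illustrative, not necessarily the ones used in simple_rl_agent_environment.py):

```python
import random


class DummyEnvironment:
    """Toy environment: gives a fixed reward each step for a fixed number of steps."""

    def __init__(self, steps_left: int = 10):
        self.steps_left = steps_left

    def get_observation(self):
        return [0.0, 0.0, 0.0]  # placeholder observation

    def get_actions(self):
        return [0, 1]  # two dummy actions

    def is_done(self) -> bool:
        return self.steps_left == 0

    def step(self, action: int) -> float:
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return 1.0  # constant reward, just to exercise the interaction loop


class SimpleAgent:
    """Agent that samples a random action and accumulates the reward it receives."""

    def __init__(self):
        self.total_reward = 0.0

    def act(self, env: DummyEnvironment) -> None:
        _ = env.get_observation()  # a real agent would condition on this
        action = random.choice(env.get_actions())
        self.total_reward += env.step(action)


if __name__ == "__main__":
    env, agent = DummyEnvironment(), SimpleAgent()
    while not env.is_done():
        agent.act(env)
    print(f"Total reward collected: {agent.total_reward:.2f}")
```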
Install conda and create a virtual environment in which to install gymnasium and the corresponding gymnasium environment packages. Use Python 3.9 to avoid compatibility issues when using the different ALE environments.
$ conda create --name deep_rl_python_3.9 python=3.9
$ conda activate deep_rl_python_3.9
$ pip install -r requirements.txt
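The requirements file should cover at least the libraries the scripts below rely on (gymnasium with ALE support, PyTorch, PyTorch Ignite, and tensorboard). The listing below is only an illustration of what such a file might contain, not the repository's actual requirements.txt:

```text
gymnasium
ale-py
pygame
numpy
torch
pytorch-ignite
tensorboard
```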
Now run the following scripts:-
simple_gymnasium_rl_agent.py: Shows how to use the gymnasium library to code up your first RL agent.
$ python3 simple_gymnasium_rl_agent.py
Sample Output with rendered graphics:-

opt/anaconda3/envs/deep_rl/lib/python3.12/site-packages/pygame/pkgdata.py:25: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import resource_stream, resource_exists
2025-08-04 15:54:00.478 python3[36605:14153423] +[IMKClient subclass]: chose IMKClient_Modern
2025-08-04 15:54:00.478 python3[36605:14153423] +[IMKInputSession subclass]: chose IMKInputSession_Modern
Selected random action: 1
Current Episode finished. Steps executed = 12
Selected random action: 1
Current Episode finished. Steps executed = 37
Selected random action: 1
Current Episode finished. Steps executed = 53
Selected random action: 1
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 86
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 113
Selected random action: 0
Selected random action: 0
Selected random action: 0
Selected random action: 1
Current Episode finished. Steps executed = 157
...
...
Current Episode finished. Steps executed = 966
Selected random action: 1
Selected random action: 0
Selected random action: 0
Current Episode finished. Steps executed = 983
Selected random action: 0
Current Episode finished. Steps executed = 995
Selected random action: 1
RL Agent interaction complete after 1000 steps with total reward = 1000.00
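The interaction shown above comes from a loop that samples a random action each step, accumulates the reward, and resets whenever an episode ends. A minimal sketch follows; it assumes CartPole-v1 (consistent with the two actions and the reward of 1 per step in the log), and its logging is simplified relative to the script's:

```python
import gymnasium as gym

# Assumption: CartPole-v1 with human (pygame) rendering; the actual script may differ.
env = gym.make("CartPole-v1", render_mode="human")
obs, info = env.reset()

total_reward, steps = 0.0, 0
while steps < 1000:
    action = env.action_space.sample()  # random action, as in the log above
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
    if terminated or truncated:
        print(f"Current Episode finished. Steps executed = {steps}")
        obs, info = env.reset()

print(f"RL Agent interaction complete after {steps} steps "
      f"with total reward = {total_reward:.2f}")
env.close()
```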
gan_atari_image_synthesis.py: A GAN-based model used to generate Atari images using the ALE environment for different Atari games, specifically 10 popular games: Breakout, Alien, Atlantis, Robotank, Pitfall, VideoCube, VideoCheckers, BattleZone, Qbert, and KungFuMaster.
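The exact data pipeline isn't shown here, but training frames can be collected from those ten environments with a random-play loop along these lines (function and variable names are illustrative, and the script's actual preprocessing may differ):

```python
import random

import gymnasium as gym
import ale_py  # registers the ALE/... ids; newer gymnasium may also need gym.register_envs(ale_py)
import torch
import torch.nn.functional as F

GAME_IDS = [
    "ALE/Breakout-v5", "ALE/Alien-v5", "ALE/Atlantis-v5", "ALE/Robotank-v5",
    "ALE/Pitfall-v5", "ALE/VideoCube-v5", "ALE/VideoCheckers-v5",
    "ALE/BattleZone-v5", "ALE/Qbert-v5", "ALE/KungFuMaster-v5",
]


def frame_batches(batch_size: int = 16, image_size: int = 128):
    """Yield batches of random-play RGB frames resized to (3, image_size, image_size)."""
    envs = [gym.make(game_id) for game_id in GAME_IDS]
    for env in envs:
        env.reset(seed=0)
    batch = []
    while True:
        env = random.choice(envs)
        obs, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            env.reset()
        frame = torch.from_numpy(obs).permute(2, 0, 1).float() / 255.0  # HWC -> CHW, [0, 1]
        frame = F.interpolate(frame.unsqueeze(0), size=(image_size, image_size),
                              mode="bilinear", align_corners=False).squeeze(0)
        batch.append(frame)
        if len(batch) == batch_size:
            yield torch.stack(batch)
            batch = []
```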
Using PyTorch Ignite for the training code [Preferred Approach]:-
$ python gan_atari_image_synthesis.py --use_ignite
Using normal PyTorch training code:-
$ python gan_atari_image_synthesis.py
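With PyTorch Ignite, the alternating generator/discriminator update is typically wrapped in an ignite Engine whose process function performs one training step. The sketch below assumes a DCGAN-style generator that consumes a (batch, latent_dim, 1, 1) noise tensor and a discriminator that emits one logit per image; it illustrates the Ignite approach rather than reproducing the script's actual code:

```python
import torch
from ignite.engine import Engine, Events


def make_gan_trainer(generator, discriminator, g_opt, d_opt,
                     latent_dim: int = 100, device: str = "cuda"):
    """Wrap one GAN training step in an Ignite Engine (illustrative sketch)."""
    bce = torch.nn.BCEWithLogitsLoss()

    def train_step(engine, real_batch):
        real = real_batch.to(device)
        n = real.size(0)
        ones, zeros = torch.ones(n, 1, device=device), torch.zeros(n, 1, device=device)

        # Discriminator update: real frames vs. generated ("fake") frames.
        noise = torch.randn(n, latent_dim, 1, 1, device=device)
        fake = generator(noise)
        d_opt.zero_grad()
        d_loss = bce(discriminator(real), ones) + bce(discriminator(fake.detach()), zeros)
        d_loss.backward()
        d_opt.step()

        # Generator update: try to make the discriminator label fakes as real.
        g_opt.zero_grad()
        g_loss = bce(discriminator(fake), ones)
        g_loss.backward()
        g_opt.step()

        return {"Generator_Loss": g_loss.item(), "Discriminator_Loss": d_loss.item()}

    trainer = Engine(train_step)

    @trainer.on(Events.ITERATION_COMPLETED(every=50))
    def log_losses(engine):
        print(f"Iteration {engine.state.iteration}: {engine.state.output}")

    return trainer


# Usage (illustrative): trainer.run(frame_batches(), epoch_length=1000, max_epochs=100)
```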
Sample images are saved to tensorboard, where real corresponds to frames from the actual Atari games and fake corresponds to images generated by the network after some number of training iterations.
Sample output:-
Running in PRODUCTION mode...
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Running the code on device: cuda
Image Shape: (3, 128, 128)
List of 10 Environments:-
1: ALE/Breakout-v5
2: ALE/Alien-v5
3: ALE/Atlantis-v5
4: ALE/Robotank-v5
5: ALE/Pitfall-v5
6: ALE/VideoCube-v5
7: ALE/VideoCheckers-v5
8: ALE/BattleZone-v5
9: ALE/Qbert-v5
10: ALE/KungFuMaster-v5
Trained the GAN for Atari games image synthesis for 50 iterations in 0 hours, 1 mins, and 10 secs: Generator_Loss=14.673180 Discriminator_Loss=1.260259
Trained the GAN for Atari games image synthesis for 100 iterations in 0 hours, 2 mins, and 20 secs: Generator_Loss=35.574182 Discriminator_Loss=0.087913
Trained the GAN for Atari games image synthesis for 150 iterations in 0 hours, 3 mins, and 30 secs: Generator_Loss=32.809126 Discriminator_Loss=0.298351
Trained the GAN for Atari games image synthesis for 200 iterations in 0 hours, 4 mins, and 41 secs: Generator_Loss=79.532125 Discriminator_Loss=0.052655
Trained the GAN for Atari games image synthesis for 250 iterations in 0 hours, 5 mins, and 51 secs: Generator_Loss=82.223944 Discriminator_Loss=0.000000
...
Saved GAN generated and real images in tensorboard at iteration number 500
...
...
Trained the GAN for Atari games image synthesis for 30250 iterations in 11 hours, 6 mins, and 53 secs: Generator_Loss=9.519476 Discriminator_Loss=0.552391
Trained the GAN for Atari games image synthesis for 30300 iterations in 11 hours, 8 mins, and 10 secs: Generator_Loss=10.392010 Discriminator_Loss=0.168032
Trained the GAN for Atari games image synthesis for 30350 iterations in 11 hours, 9 mins, and 20 secs: Generator_Loss=10.334563 Discriminator_Loss=0.066586
...
Saved GAN generated and real images in tensorboard at iteration number 100000
Finished training the network: 100000 iterations in 34 hours, 12 mins, and 41 secs.
cross_entropy_cartpole_frozenlake_rl_agent.py: Solves the CartPole-v1 & FrozenLake-v1 game environments with the Cross-Entropy method - a model-free, policy-based, on-policy method (a code sketch of the training loop follows the list below).
a) Model-free -> Does not build a model of the environment to predict the next action or reward.
b) Policy-based π(a|s) -> Builds a probability distribution over actions, with observations as input. This differs from value-based methods, which evaluate every action and pick the one that gives the best expected reward.
c) On-policy -> Uses observations gathered by the current policy that is being updated; it does not reuse observations from previous episodes.
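The core loop of the cross-entropy method plays a batch of episodes with the current policy, keeps only the "elite" episodes above a reward percentile, and trains the policy to imitate the elite actions. A minimal sketch, assuming a CartPole-style float observation vector and a CPU policy (hyperparameter names are illustrative):

```python
import numpy as np
import torch


def cross_entropy_iteration(env, policy, optimizer, batch_size=16, percentile=70):
    """One iteration of the cross-entropy method (sketch)."""
    softmax = torch.nn.Softmax(dim=1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # 1) Play a batch of episodes with the current policy.
    episodes = []  # (total_reward, observations, actions)
    for _ in range(batch_size):
        obs, _ = env.reset()
        obs_list, act_list, total_reward, done = [], [], 0.0, False
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            probs = softmax(policy(obs_t)).detach().numpy()[0]
            action = int(np.random.choice(len(probs), p=probs))
            next_obs, reward, terminated, truncated, _ = env.step(action)
            obs_list.append(obs)
            act_list.append(action)
            total_reward += reward
            obs, done = next_obs, terminated or truncated
        episodes.append((total_reward, obs_list, act_list))

    # 2) Keep only the "elite" episodes whose reward clears the percentile threshold.
    rewards = [r for r, _, _ in episodes]
    threshold = float(np.percentile(rewards, percentile))
    elite_obs, elite_acts = [], []
    for r, obs_list, act_list in episodes:
        if r >= threshold:
            elite_obs.extend(obs_list)
            elite_acts.extend(act_list)

    # 3) Train the policy to imitate the elite actions with a cross-entropy loss.
    optimizer.zero_grad()
    logits = policy(torch.as_tensor(np.asarray(elite_obs), dtype=torch.float32))
    loss = loss_fn(logits, torch.as_tensor(elite_acts, dtype=torch.long))
    loss.backward()
    optimizer.step()
    return loss.item(), float(np.mean(rewards)), threshold
```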
i) CartPole-v1

$ python cross_entropy_cartpole_frozenlake_rl_agent.py --env CartPole
Playing the game environment: CartPole-v1
env_obs_space_shape = (4,) action.size = 2
Running the code on device: cuda
Policy Network:-
Policy(
  (network): Sequential(
    (0): Linear(in_features=4, out_features=256, bias=True)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=512, bias=True)
    (4): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (5): ReLU()
    (6): Linear(in_features=512, out_features=2, bias=True)
  )
)
Iteration no 1: Avg Loss from filtered 162 episodes = 0.6875171661376953 Reward Threshold = 20.5 Reward Mean = 20.1875 Elapsed Time: 0 hours, 0 mins, and 7 secs
Iteration no 2: Avg Loss from filtered 414 episodes = 0.6084024310112 Reward Threshold = 49.0 Reward Mean = 45.9375 Elapsed Time: 0 hours, 0 mins, and 22 secs
Iteration no 3: Avg Loss from filtered 458 episodes = 0.5593582391738892 Reward Threshold = 70.5 Reward Mean = 65.875 Elapsed Time: 0 hours, 0 mins, and 44 secs
Iteration no 4: Avg Loss from filtered 461 episodes = 0.5255176424980164 Reward Threshold = 64.0 Reward Mean = 60.125 Elapsed Time: 0 hours, 1 mins, and 3 secs
Iteration no 5: Avg Loss from filtered 771 episodes = 0.5293067097663879 Reward Threshold = 120.0 Reward Mean = 99.5625 Elapsed Time: 0 hours, 1 mins, and 36 secs
...
...
Iteration no 33: Avg Loss from filtered 2236 episodes = 0.4848739206790924 Reward Threshold = 414.0 Reward Mean = 369.875 Elapsed Time: 0 hours, 31 mins, and 19 secs
Iteration no 34: Avg Loss from filtered 2495 episodes = 0.45061811804771423 Reward Threshold = 493.0 Reward Mean = 428.0 Elapsed Time: 0 hours, 33 mins, and 38 secs
Iteration no 35: Avg Loss from filtered 3500 episodes = 0.4489649534225464 Reward Threshold = 500.0 Reward Mean = 445.6875 Elapsed Time: 0 hours, 36 mins, and 2 secs
Iteration no 36: Avg Loss from filtered 7500 episodes = 0.439805805683136 Reward Threshold = 500.0 Reward Mean = 498.25 Elapsed Time: 0 hours, 38 mins, and 43 secs
Finished solving CartPole-v1 env in 36 iterations with final episode loss = 0.4398. Elapsed Time: 0 hours, 38 mins, and 43 secs.
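The Policy network printed in the log above corresponds to a module roughly like the following (reconstructed from the printout; the constructor signature is an assumption):

```python
import torch.nn as nn


class Policy(nn.Module):
    """Policy network matching the architecture printed in the log above."""

    def __init__(self, obs_size: int, n_actions: int):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(obs_size, 256),
            nn.LayerNorm(256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.LayerNorm(512),
            nn.ReLU(),
            nn.Linear(512, n_actions),  # raw action logits; softmax is applied outside
        )

    def forward(self, x):
        return self.network(x)


# Policy(obs_size=4, n_actions=2) reproduces the CartPole-v1 printout;
# Policy(obs_size=16, n_actions=4) reproduces the FrozenLake-v1 printout further below.
```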
ii) FrozenLake-v1

$ python cross_entropy_cartpole_frozenlake_rl_agent.py --env FrozenLake
Playing the game environment: FrozenLake-v1
env_obs_space_shape = (16,) action.size = 4
Running the code on device: mps
Policy Network:-
Policy(
  (network): Sequential(
    (0): Linear(in_features=16, out_features=256, bias=True)
    (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (2): ReLU()
    (3): Linear(in_features=256, out_features=512, bias=True)
    (4): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (5): ReLU()
    (6): Linear(in_features=512, out_features=4, bias=True)
  )
)
Iteration no 1: Avg Loss from 3 filtered elite episodes with 28 episode steps = 1.3304394483566284 Reward Threshold = 0.531441 Reward Mean = 0.011906761345496104 Elapsed Time: 0 hours, 3 mins, and 30 secs
Iteration no 2: Avg Loss from 4 filtered elite episodes with 34 episode steps = 1.1078670024871826 Reward Threshold = 0.531441 Reward Mean = 0.01671958383057874 Elapsed Time: 0 hours, 6 mins, and 33 secs
Iteration no 3: Avg Loss from 6 filtered elite episodes with 54 episode steps = 0.9129630327224731 Reward Threshold = 0.531441 Reward Mean = 0.023868519799030968 Elapsed Time: 0 hours, 9 mins, and 10 secs
Iteration no 4: Avg Loss from 9 filtered elite episodes with 87 episode steps = 0.8517952561378479 Reward Threshold = 0.531441 Reward Mean = 0.03243154459454831 Elapsed Time: 0 hours, 11 mins, and 57 secs
Iteration no 5: Avg Loss from 12 filtered elite episodes with 131 episode steps = 0.8823782801628113 Reward Threshold = 0.531441 Reward Mean = 0.03823237425524295 Elapsed Time: 0 hours, 14 mins, and 47 secs
...
...
Iteration no 19: Avg Loss from 46 filtered elite episodes with 476 episode steps = 0.7461742758750916 Reward Threshold = 0.531441 Reward Mean = 0.11597856840308171 Elapsed Time: 0 hours, 53 mins, and 40 secs
Iteration no 20: Avg Loss from 51 filtered elite episodes with 530 episode steps = 0.7389541268348694 Reward Threshold = 0.531441 Reward Mean = 0.1248935337542685 Elapsed Time: 0 hours, 56 mins, and 16 secs
Iteration no 21: Avg Loss from 56 filtered elite episodes with 594 episode steps = 0.7564355731010437 Reward Threshold = 0.531441 Reward Mean = 0.1304198625031866 Elapsed Time: 0 hours, 58 mins, and 56 secs
Iteration no 22: Avg Loss from 59 filtered elite episodes with 630 episode steps = 0.7613645792007446 Reward Threshold = 0.531441 Reward Mean = 0.13234486306288937 Elapsed Time: 1 hours, 1 mins, and 28 secs
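FrozenLake-v1 natively returns a discrete state id, so the env_obs_space_shape = (16,) in the log implies the observation is one-hot encoded before being fed to the policy. A wrapper along these lines would produce that shape (a sketch; the script's actual wrapper may differ):

```python
import gymnasium as gym
import numpy as np


class OneHotObservation(gym.ObservationWrapper):
    """Turn FrozenLake's discrete state id into a one-hot float vector of length 16."""

    def __init__(self, env):
        super().__init__(env)
        n = env.observation_space.n
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(n,), dtype=np.float32)

    def observation(self, obs):
        one_hot = np.zeros(self.observation_space.shape, dtype=np.float32)
        one_hot[obs] = 1.0
        return one_hot


# Usage (illustrative):
# env = OneHotObservation(gym.make("FrozenLake-v1"))
```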