Simplified Code Examples

The simplified code examples are intended for users who are new to the field of Deep Reinforcement Learning. Thanks to the flexible configuration, even newcomers can train a wide variety of algorithms in different environments without changing any code. This allows quick testing and experimentation, with the option to easily adjust important settings when needed.

The following examples explain how the settings can be adjusted.

Train Agents

Run Default Code Example

To run the code example, execute:

python code_examples/simplified_code_examples/run.py

This will execute the default code example, running PPO 1 on the OpenAI Gym 2 environment CartPole-v0.

Train Different Agent

To change the default code example and train another agent, there are two ways to adapt the configuration. Either you open the overall config and change the default agent parameter in cfg/conf.yaml to another agent, e.g. Soft Actor-Critic 3 (SAC), as sketched below. Or you simply override the default configuration with an additional terminal input that sets the new agent, e.g. to train SAC on the default CartPole-v0 environment:

python code_examples/simplified_code_examples/run.py agent=sac
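For the first option, the defaults in cfg/conf.yaml determine which agent, environment, and scheme configs are composed into the final configuration. The following is only a rough sketch of how such a defaults list could look, assuming a Hydra-style config; the exact keys and layout in the repository may differ:

# cfg/conf.yaml (illustrative sketch, check the actual file)
defaults:
  - agent: sac          # was ppo; selects cfg/agent/sac.yaml
  - environment: gym    # selects cfg/environment/gym.yaml
  - scheme: default     # selects cfg/scheme/default.yaml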

For the agents you can train, see the section Available Algorithms in the documentation.

Train On Different Environment

If you want to train on a different environment, you can change it similarly to the agent in two ways: either in the default conf.yaml file or via the terminal input, e.g. training SAC on the PyBullet 4 environments:

python code_examples/simplified_code_examples/run.py agent=sac environment=pybullet

Here the default task is set to AntBulletEnv-v0. If you want to change that, just add the corresponding environment ID to the input. For example, to train on HalfCheetahBulletEnv-v0:

python code_examples/simplified_code_examples/run.py agent=sac environment=pybullet environment.task=HalfCheetahBulletEnv-v0
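The same environment.task override works analogously for the other environment configs. For example, assuming the gym config exposes the task key in the same way, you could train PPO on a different Gym task such as LunarLander-v2:

python code_examples/simplified_code_examples/run.py agent=ppo environment=gym environment.task=LunarLander-v2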

For the environments on which you can train the PyTorchRL agents, see the section Available Environments in the documentation.

Train On Your Custom Environment

Will be updated soon!

Advanced Training Config Changes

In this section we cover the options if, on top of the agent and training environment, you also want to adapt the training scheme and agent details such as architecture or storage.

Change Agent Details

In case you want to change the default parameters of the selected agent, have a look at the config of your specific agent to see which hyperparameters exist and what their default values are. In the case of PPO, check:

code_examples/simplified_code_examples/cfg/agent/ppo.yaml
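The hyperparameters in this file are grouped under a ppo_config block that can be overridden from the terminal. A rough sketch of what it might contain is shown below; only lr is confirmed by the override example that follows, the remaining names and values are illustrative assumptions:

# cfg/agent/ppo.yaml (illustrative sketch, check the actual file for the real keys)
ppo_config:
  lr: 3.0e-4       # learning rate, overridable as agent.ppo_config.lr
  gamma: 0.99      # assumed name for the discount factor
  num_epochs: 4    # assumed name for the number of PPO update epochs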

If you decide to change, for example, the learning rate for PPO, you can do it the following way:

python code_examples/simplified_code_examples/run.py agent=ppo agent.ppo_config.lr=1.0e-2

Similarly, you can change any other hyperparameter of PPO or of the other agents in PyTorchRL.

Change Agent Actor Architecture

Similarly to the agent hyperparameters, you can also change the overall architecture of the actors, meaning you can add additional layers to the policy network of PPO or switch to a recurrent policy altogether. You can see all parameters that can be changed at:

code_examples/simplified_code_examples/cfg/agent/actor

Inside you find a yaml file for off-policy algorithms like DDPG, TD3, and SAC, and an on-policy file for algorithms like PPO. That said, if you decide to change the PPO policy to a recurrent neural network, you can do so with:

python code_examples/simplified_code_examples/run.py agent=ppo agent.actor.recurrent_nets=True
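Overrides can also be combined in a single call, as in the environment examples above. For instance, to train a recurrent PPO policy on the PyBullet environments:

python code_examples/simplified_code_examples/run.py agent=ppo environment=pybullet agent.actor.recurrent_nets=True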

Change Agent Storage

Currently, changes regarding the storage types need to be made directly in the config files, but this will be changed and updated in the future!
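For reference, the storage configs live next to the actor configs (see the config structure below), with one yaml file per buffer type (gae_buffer.yaml, replay_buffer.yaml, her_buffer.yaml):

code_examples/simplified_code_examples/cfg/agent/storage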

Change Training Scheme

In this section we show how you can change the training scheme to scale up your experiments. Will be updated soon!

Config

This section visualizes the overall config structure in case you do not want to adapt your training run parameters via terminal inputs and would rather specify new default parameters.

Overall Config Structure

cfg
│   README.md
│   conf.yaml
│
└───agent
│   │   ppo.yaml
│   │   ddpg.yaml
│   │   td3.yaml
│   │   sac.yaml
│   │   mpo.yaml
│   │
│   └───actor
│   │      off_policy.yaml
│   │      on_policy.yaml
│   │
│   └───storage
│          gae_buffer.yaml
│          replay_buffer.yaml
│          her_buffer.yaml
│
└───scheme
│      a3c.yaml
│      apex.yaml
│      ddppo.yaml
│      default.yaml
│      impala.yaml
│      r2d2.yaml
│      rapid.yaml
│
└───environment
       atari.yaml
       causalworld.yaml
       crafter.yaml
       gym.yaml
       mujoco.yaml
       pybullet.yaml

Available Algorithms

In this section you can see all possible algorithms that can be utilized with the simplified code examples.

Off-Policy Algorithms

  • Deep Deterministic Policy Gradient 5 (DDPG) in the config used as ddpg

  • Twin Delayed DDPG 6 (TD3) in the config used as td3

  • Soft Actor-Critic 3 (SAC) in the config used as sac

  • Maximum a Posteriori Policy Optimisation 7 (MPO) in the config used as mpo

On-Policy Algorithms

  • Proximal Policy Optimisation 1 (PPO) in the config used as ppo
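
Any of the config names above can be passed to the agent override shown earlier, for example to train TD3 instead of the default PPO:

python code_examples/simplified_code_examples/run.py agent=td3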


1

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

2

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, 2016. URL: http://arxiv.org/abs/1606.01540, arXiv:1606.01540.

3

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, 2018. URL: http://arxiv.org/abs/1801.01290, arXiv:1801.01290.

4

Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.

5

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. 2015. URL: https://arxiv.org/abs/1509.02971, doi:10.48550/ARXIV.1509.02971.

6

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. 2018. URL: https://arxiv.org/abs/1802.09477, doi:10.48550/ARXIV.1802.09477.

7

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. 2018. URL: https://arxiv.org/abs/1806.06920, doi:10.48550/ARXIV.1806.06920.