On-policy

Advantage Actor Critic (A2C)

class pytorchrl.agent.algorithms.on_policy.a2c.A2C(device, envs, actor, lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Algorithm class to execute A2C, from Mnih et al. 2016 (https://arxiv.org/pdf/1602.01783.pdf).

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_v (float) – Value network learning rate.
lr_pi (float) – Policy network learning rate.
gamma (float) – Discount factor parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Number of actor updates between test evaluations.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

acting_step(obs, rhs, done, deterministic=False)[source]

A2C acting function.

Parameters

obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; if False, randomly sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional A2C predictions, value score and action log probability.
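The difference between sampling and taking the mode, and between action and clipped_action, can be sketched as below. This is a minimal illustration in plain Python for a discrete distribution and a box-bounded continuous action; `act` and `clip_action` are hypothetical helpers, not part of the library, which operates on torch tensors.

```python
import random


def act(probs, deterministic=False, rng=random):
    """Pick an action index from a discrete distribution (hypothetical helper)."""
    if deterministic:
        # Mode of the distribution: the most probable action.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Stochastic acting: sample proportionally to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def clip_action(action, low, high):
    """Clamp each action dimension into [low, high] (hypothetical helper)."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


greedy = act([0.1, 0.7, 0.2], deterministic=True)
sampled = act([0.1, 0.7, 0.2], deterministic=False, rng=random.Random(0))
clipped = clip_action([1.5, -3.0], low=[-1.0, -1.0], high=[1.0, 1.0])
```

Keeping the unclipped action alongside the clipped one matters because log probabilities are defined for the action that was actually sampled, while the environment only accepts values inside its action space.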

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute A2C loss.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is required if they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current A2C iteration information.
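Splitting the update into compute_gradients and apply_gradients lets one node compute gradients and another apply them. The sketch below shows that split on a hypothetical one-parameter model with plain SGD; `ToyLearner` and its internals are assumptions for illustration and do not reflect how the library stores gradients.

```python
class ToyLearner:
    """Hypothetical model fitting y = w * x with mean squared error."""

    def __init__(self, w=0.0, lr=0.1):
        self.w = w
        self.lr = lr
        self.grads = [0.0]

    def compute_gradients(self, batch):
        """Compute loss and gradients, but leave the parameter untouched."""
        xs, ys = batch
        n = len(xs)
        loss = sum((self.w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad_w = sum(2.0 * (self.w * x - y) * x for x, y in zip(xs, ys)) / n
        self.grads = [grad_w]
        info = {"loss": loss}
        return list(self.grads), info

    def apply_gradients(self, gradients=None):
        """Optionally overwrite local gradients, then take one SGD step."""
        if gradients is not None:
            self.grads = list(gradients)
        self.w -= self.lr * self.grads[0]


worker = ToyLearner()   # computes gradients from its own data
learner = ToyLearner()  # receives gradients and updates its weights
grads, info = worker.compute_gradients(([1.0, 2.0], [2.0, 4.0]))
learner.apply_gradients(grads)
```

Only the learner's parameter changes here; the worker would typically receive fresh weights afterwards through something like set_weights.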

compute_loss(data)[source]

Calculate A2C loss.

Parameters: data (dict) – Data batch dict containing all required tensors to compute A2C loss.
Returns: loss – A2C loss.
Return type: torch.tensor
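For orientation, a generic advantage actor-critic loss can be sketched as follows. This is not the library's implementation: it uses plain Python floats instead of tensors, and the function names, the way done flags cut the bootstrap, and the unweighted loss terms are all assumptions made for the example.

```python
def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrap from last_value and accumulate rewards backwards in time."""
    returns = []
    running = last_value
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A done flag cuts the bootstrap so returns do not leak across episodes.
        running = reward + gamma * running * (1.0 - done)
        returns.append(running)
    return returns[::-1]


def a2c_loss_terms(log_probs, values, returns):
    """Return (policy_loss, value_loss), each averaged over the batch."""
    n = len(returns)
    advantages = [ret - val for ret, val in zip(returns, values)]
    # Policy term: raise the log probability of actions with positive advantage.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Critic term: mean squared error between predicted values and returns.
    value_loss = sum(adv ** 2 for adv in advantages) / n
    return policy_loss, value_loss


returns = discounted_returns(
    [1.0, 1.0, 1.0], [0.0, 0.0, 1.0], last_value=5.0, gamma=0.5)
policy_loss, value_loss = a2c_loss_terms(
    [-0.5, -0.5, -0.5], [1.0, 1.0, 1.0], returns)
```

In a tensor implementation the advantages would be detached before entering the policy term, and the two terms would be optimized with the lr_pi and lr_v learning rates respectively.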

classmethod create_factory(lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]

Returns a function to create new A2C instances.

Parameters

lr_v (float) – Value network learning rate.
lr_pi (float) – Policy network learning rate.
gamma (float) – Discount factor parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Number of actor updates between test evaluations.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new A2C class instance.
algo_name (str) – Name of the algorithm.
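The factory pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: `ToyAlgo` and the argument order of the returned `create_algo_instance` function are assumptions made for the example.

```python
class ToyAlgo:
    """Hypothetical stand-in for an algorithm class such as A2C."""

    def __init__(self, device, envs, actor, gamma=0.99):
        self.device = device
        self.envs = envs
        self.actor = actor
        self.gamma = gamma

    @classmethod
    def create_factory(cls, gamma=0.99):
        """Capture hyperparameters now; build the instance later, in the worker."""

        def create_algo_instance(device, actor, envs):
            # Assumed argument order, for illustration only.
            return cls(device=device, envs=envs, actor=actor, gamma=gamma)

        algo_name = cls.__name__
        return create_algo_instance, algo_name


create_algo_instance, algo_name = ToyAlgo.create_factory(gamma=0.95)
algo = create_algo_instance(device="cpu", actor=None, envs=None)
```

Returning a closure rather than an instance means the hyperparameters travel with the function, so each worker can build its own algorithm object once its device, actor and environments exist.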

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of each mini batch.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
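The behaviour described above amounts to a guarded attribute update. A minimal sketch, assuming the parameter is stored as a plain attribute (the library may store it differently); `ToyAlgorithm` is hypothetical:

```python
class ToyAlgorithm:
    """Hypothetical algorithm holding one tunable parameter."""

    def __init__(self):
        self.gamma = 0.99

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Only overwrite attributes that already exist; unknown names are ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyAlgorithm()
algo.update_algorithm_parameter("gamma", 0.9)
algo.update_algorithm_parameter("not_a_parameter", 123)
```

The guard keeps a typo in parameter_name from silently creating a new, unused attribute.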

property update_every: Returns the number of data samples collected between network update stages.

Proximal Policy Optimization (PPO)

class pytorchrl.agent.algorithms.on_policy.ppo.PPO(device, envs, actor, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Proximal Policy Optimization algorithm class.

Algorithm class to execute PPO, from Schulman et al. (https://arxiv.org/abs/1707.06347). Algorithms are modules generally required by multiple workers, so PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own PPO module.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Whether to clip the value loss to prevent the value estimate from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> from pytorchrl.agent.algorithms.on_policy.ppo import PPO
>>> create_algo, algo_name = PPO.create_factory(
...     lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2,
...     entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5,
...     num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)
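How num_epochs and num_mini_batch interact can be sketched as below: the collected buffer is reused for several epochs, and each epoch is split into shuffled mini batches. `iterate_minibatches` is a hypothetical helper written for this illustration, not a library function, and the shuffling scheme is an assumption.

```python
import random


def iterate_minibatches(buffer, num_epochs, num_mini_batch, rng):
    """Yield num_epochs * num_mini_batch shuffled mini batches from buffer."""
    mini_batch_size = len(buffer) // num_mini_batch
    for _ in range(num_epochs):
        # Reshuffle every epoch so mini batches differ between passes.
        indices = list(range(len(buffer)))
        rng.shuffle(indices)
        stop = mini_batch_size * num_mini_batch
        for start in range(0, stop, mini_batch_size):
            yield [buffer[i] for i in indices[start:start + mini_batch_size]]


batches = list(iterate_minibatches(
    list(range(8)), num_epochs=4, num_mini_batch=4, rng=random.Random(0)))
```

With 8 samples, 4 epochs and 4 mini batches per epoch, this yields 16 gradient updates of 2 samples each before new data is collected.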

acting_step(obs, rhs, done, deterministic=False)[source]

PPO acting function.

Parameters

obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; if False, randomly sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional PPO predictions, value score and action log probability.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute PPO loss.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is required if they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current PPO iteration information.

compute_loss(data)[source]

Compute PPO loss from data batch.

Parameters

data (dict) – Data batch dict containing all required tensors to compute PPO loss.

Returns

value_loss (torch.tensor) – value term of PPO loss.
action_loss (torch.tensor) – policy term of PPO loss.
dist_entropy (torch.tensor) – entropy term of PPO loss.
loss (torch.tensor) – PPO loss.
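The four returned terms can be related with a generic PPO loss sketch. It follows the clipped surrogate objective from the PPO paper using plain Python floats; the clipped value loss variant and the way the terms are combined are common conventions assumed here, not confirmed details of this library, and `ppo_loss_terms` is a hypothetical name.

```python
import math


def ppo_loss_terms(new_log_probs, old_log_probs, advantages, values,
                   old_values, returns, entropies, clip_param=0.2,
                   value_loss_coef=0.5, entropy_coef=0.01,
                   use_clipped_value_loss=True):
    """Return (value_loss, action_loss, dist_entropy, loss) for one batch."""
    n = len(advantages)
    action_loss = 0.0
    value_loss = 0.0
    for i in range(n):
        # Probability ratio between the new and the data-collecting policy.
        ratio = math.exp(new_log_probs[i] - old_log_probs[i])
        clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
        # Pessimistic (minimum) surrogate, negated because we minimize.
        action_loss -= min(ratio * advantages[i], clipped_ratio * advantages[i])
        unclipped = (values[i] - returns[i]) ** 2
        if use_clipped_value_loss:
            # Limit how far the new value may move from the old prediction.
            delta = min(max(values[i] - old_values[i], -clip_param), clip_param)
            clipped = (old_values[i] + delta - returns[i]) ** 2
            value_loss += 0.5 * max(unclipped, clipped)
        else:
            value_loss += 0.5 * unclipped
    action_loss /= n
    value_loss /= n
    dist_entropy = sum(entropies) / n
    # Assumed combination: entropy is subtracted to encourage exploration.
    loss = value_loss * value_loss_coef + action_loss - entropy_coef * dist_entropy
    return value_loss, action_loss, dist_entropy, loss


value_loss, action_loss, dist_entropy, loss = ppo_loss_terms(
    new_log_probs=[math.log(1.5)], old_log_probs=[0.0], advantages=[1.0],
    values=[1.0], old_values=[0.5], returns=[2.0], entropies=[0.5])
```

In this single-sample example the ratio 1.5 exceeds 1 + clip_param, so the policy term uses the clipped ratio 1.2, and the value term uses the clipped prediction 0.7 because it gives the larger error.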

classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]

Returns a function to create new PPO instances.

Parameters

lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor update.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Whether to clip the value loss to prevent the value estimate from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new PPO class instance.
algo_name (str) – Name of the algorithm.
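The max_grad_norm parameter listed above bounds the size of each update. One common way to do this is clipping by global norm, sketched below in plain Python in the spirit of torch.nn.utils.clip_grad_norm_; whether the library uses exactly this scheme is an assumption, and `clip_grad_norm` here is a hypothetical helper.

```python
import math


def clip_grad_norm(grads, max_grad_norm, eps=1e-6):
    """Rescale all gradients so their joint L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up: gradients already inside the bound are left unchanged.
    scale = min(1.0, max_grad_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, total_norm = clip_grad_norm([[3.0], [4.0]], max_grad_norm=0.5)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
unchanged, _ = clip_grad_norm([[0.1]], max_grad_norm=0.5)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.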

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of each mini batch.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

Proximal Policy Optimization (PPO) with Random Network Distillation (RND)

class pytorchrl.agent.algorithms.on_policy.rnd_ppo.RND_PPO(envs, actor, device, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=2.0, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=False, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Exploration by Random Network Distillation with Proximal Policy Optimization algorithm class.

Algorithm class to execute RND PPO, from Burda et al., 2018 (https://arxiv.org/abs/1810.12894). Algorithms are modules generally required by multiple workers, so RND_PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own RND_PPO module.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Whether to clip the value loss to prevent the value estimate from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.
ext_adv_coeff (float) – Extrinsic advantage coefficient.
int_adv_coeff (float) – Intrinsic advantage coefficient.
predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.
pre_normalization_steps (int) – Number of obs running average normalization steps to take before starting to train.
pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).
intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.
intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.
intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.

Examples

>>> from pytorchrl.agent.algorithms.on_policy.rnd_ppo import RND_PPO
>>> create_algo, algo_name = RND_PPO.create_factory(
...     lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2,
...     entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5,
...     num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)
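The idea behind the target and predictor networks, and behind ext_adv_coeff and int_adv_coeff, can be sketched as follows. A fixed random target maps observations to features, a predictor is trained to match it, and the prediction error serves as the intrinsic reward. The tiny linear "networks" and all function names are assumptions for illustration; the library uses torch nn.Module networks.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a network (hypothetical)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def forward(weights, obs):
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def intrinsic_reward(target, predictor, obs):
    """Mean squared error between predictor and frozen target features."""
    t, p = forward(target, obs), forward(predictor, obs)
    return sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)


def train_predictor(target, predictor, obs, lr=0.05):
    """One gradient step on the predictor only; the target never changes."""
    t, p = forward(target, obs), forward(predictor, obs)
    for k, row in enumerate(predictor):
        err = p[k] - t[k]
        for j in range(len(row)):
            row[j] -= lr * 2.0 * err * obs[j] / len(t)


def combine_advantages(adv_ext, adv_int, ext_adv_coeff=2.0, int_adv_coeff=1.0):
    """Weighted sum of extrinsic and intrinsic advantages."""
    return [ext_adv_coeff * e + int_adv_coeff * i
            for e, i in zip(adv_ext, adv_int)]


rng = random.Random(0)
target = make_linear(3, 2, rng)
predictor = make_linear(3, 2, rng)
obs = [1.0, -0.5, 0.25]
reward_before = intrinsic_reward(target, predictor, obs)
for _ in range(50):
    train_predictor(target, predictor, obs)
reward_after = intrinsic_reward(target, predictor, obs)
mixed = combine_advantages([1.0, 0.0], [0.5, 0.25])
```

The reward for a repeatedly visited observation shrinks as the predictor catches up with the target, which is what pushes the agent towards states it has seen less often.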

acting_step(obs, rhs, done, deterministic=False)[source]

RND PPO acting function.

Parameters

obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; if False, randomly sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional PPO predictions, value score and action log probability.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute PPO loss.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is required if they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current PPO iteration information.

compute_loss(data)[source]

Compute PPO loss from data batch.

Parameters

data (dict) – Data batch dict containing all required tensors to compute PPO loss.

Returns

value_loss (torch.tensor) – value term of PPO loss.
action_loss (torch.tensor) – policy term of PPO loss.
dist_entropy (torch.tensor) – entropy term of PPO loss.
loss (torch.tensor) – PPO loss.

classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=True, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]

Returns a function to create new RND PPO instances.

Parameters

lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor update.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Whether to clip the value loss to prevent the value estimate from shifting too fast.
gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.
ext_adv_coeff (float) – Extrinsic advantage coefficient.
int_adv_coeff (float) – Intrinsic advantage coefficient.
predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.
pre_normalization_steps (int) – Number of obs running average normalization steps to take before starting to train.
pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).
intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.
intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.
intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new RND_PPO class instance.
algo_name (str) – Name of the algorithm.
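The pre_normalization_steps and pre_normalization_length parameters listed above describe a warm-up phase that fills running observation statistics before training starts. A minimal sketch of such running statistics, using Welford's online algorithm on scalar observations, is shown below; `RunningMeanStd` and the loop structure are assumptions for illustration and not the library's code.

```python
class RunningMeanStd:
    """Online mean and standard deviation via Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.count) ** 0.5 if self.count > 0 else 1.0

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / (self.std + eps)


# Toy warm-up: 2 pre-normalization steps of 4 environment steps each.
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
pre_normalization_steps, pre_normalization_length = 2, 4
stats = RunningMeanStd()
for step in range(pre_normalization_steps):
    for t in range(pre_normalization_length):
        stats.update(observations[step * pre_normalization_length + t])
normalized = stats.normalize(9.0)
```

Warming up the statistics first avoids feeding the randomly initialized target network with unnormalized observations, whose scale would otherwise dominate the early intrinsic rewards.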

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of each mini batch.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.