On-policy

Advantage Actor Critic (A2C)

class pytorchrl.agent.algorithms.on_policy.a2c.A2C(device, envs, actor, lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Algorithm class to execute A2C, from Mnih et al. 2016 (https://arxiv.org/pdf/1602.01783.pdf).

Parameters
  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor_critic class instance.

  • lr_v (float) – Value network learning rate.

  • lr_pi (float) – Policy network learning rate.

  • gamma (float) – Discount factor parameter.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Number of actor updates between test evaluations.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
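As a rough illustration of what gamma controls, the sketch below computes discounted returns and simple advantage estimates for one episode. It is not pytorchrl code: `discounted_returns` and `advantages` are hypothetical helpers, and the library's own return computation may differ (for example by bootstrapping from the value network).

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of a single episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, values):
    """Advantage estimate A_t = G_t - V(s_t), used to weight the policy gradient."""
    return [g - v for g, v in zip(returns, values)]


if __name__ == "__main__":
    rets = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
    print(rets, advantages(rets, [1.0, 1.0, 1.0]))
```

A gamma close to 1 makes distant rewards count almost as much as immediate ones; smaller values shorten the effective horizon.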

acting_step(obs, rhs, done, deterministic=False)[source]

A2C acting function.

Parameters
  • obs (torch.tensor) – Current world observation.

  • rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted distribution; if False, sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • other (dict) – Additional A2C predictions, value score and action log probability.
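The difference between deterministic and sampled actions, and between action and clipped_action, can be sketched in plain Python. This is only an illustration under simplifying assumptions (a categorical distribution given as a list of probabilities, and a box-shaped continuous action space); `select_action` and `clip_action` are hypothetical names, not pytorchrl functions.

```python
import random


def select_action(probs, deterministic=False, rng=None):
    """Pick an index from a categorical distribution given as a list of probabilities."""
    if deterministic:
        # Mode of the distribution: the most probable action.
        return max(range(len(probs)), key=lambda i: probs[i])
    rng = rng or random.Random()
    # Otherwise draw one action index according to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def clip_action(action, low, high):
    """Clamp each component of a continuous action to the interval [low, high]."""
    return [min(max(a, low), high) for a in action]


if __name__ == "__main__":
    print(select_action([0.1, 0.7, 0.2], deterministic=True))
    print(clip_action([-2.0, 0.3, 5.0], low=-1.0, high=1.0))
```

Deterministic selection is typically what one wants during test episodes, while sampling keeps exploration alive during training.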

apply_gradients(gradients=None)[source]

Take an optimization step, first setting the new gradients if any are provided.

Parameters

gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – Data batch containing all required tensors to compute the A2C loss.

  • grads_to_cpu (bool) – Whether to move the gradient tensors to CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor gradients.

  • info (dict) – Dict containing current A2C iteration information.
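Splitting the update into compute_gradients and apply_gradients lets several workers compute gradients independently and have them combined before a single optimization step. The following is a minimal sketch of that pattern with a toy one-weight model; `ToyLearner` and `average_gradients` are hypothetical and only mirror the method names documented here.

```python
class ToyLearner:
    """Hypothetical stand-in with one scalar weight w and loss mean((w*x - y)^2)."""

    def __init__(self, w=0.0, lr=0.1):
        self.w = w
        self.lr = lr
        self.grads = [0.0]

    def compute_gradients(self, batch):
        # Compute loss and gradient with respect to w; the weight is not updated here.
        pairs = list(zip(batch["x"], batch["y"]))
        loss = sum((self.w * x - y) ** 2 for x, y in pairs) / len(pairs)
        grad = sum(2.0 * (self.w * x - y) * x for x, y in pairs) / len(pairs)
        self.grads = [grad]
        return list(self.grads), {"loss": loss}

    def apply_gradients(self, gradients=None):
        # Overwrite the local gradients if external ones are provided, then step.
        if gradients is not None:
            self.grads = list(gradients)
        self.w -= self.lr * self.grads[0]


def average_gradients(grad_lists):
    """Element-wise mean over the gradient lists returned by several workers."""
    return [sum(gs) / len(gs) for gs in zip(*grad_lists)]


if __name__ == "__main__":
    a, b = ToyLearner(), ToyLearner()
    grads_a, _ = a.compute_gradients({"x": [1.0], "y": [1.0]})
    grads_b, _ = b.compute_gradients({"x": [1.0], "y": [3.0]})
    a.apply_gradients(average_gradients([grads_a, grads_b]))
    print(a.w)
```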

compute_loss(data)[source]

Calculate the A2C loss.

Parameters

data (dict) – Data batch dict containing all required tensors to compute A2C loss.

Returns

loss – A2C loss.

Return type

torch.tensor

classmethod create_factory(lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]

Returns a function to create new A2C instances.

Parameters
  • lr_v (float) – Value network learning rate.

  • lr_pi (float) – Policy network learning rate.

  • gamma (float) – Discount factor parameter.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Number of actor updates between test evaluations.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

  • create_algo_instance (func) – Function that creates a new A2C class instance.

  • algo_name (str) – Name of the algorithm.
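The factory pattern described above can be sketched as follows: hyperparameters are fixed when the factory is created, and runtime objects (device, actor, environments) are supplied later by whoever instantiates the algorithm. `ToyAlgo` is hypothetical, and the argument order of `create_algo_instance` is an assumption made for illustration.

```python
class ToyAlgo:
    """Hypothetical algorithm class illustrating the factory pattern only."""

    def __init__(self, device, envs, actor, lr=1e-4, gamma=0.99):
        self.device, self.envs, self.actor = device, envs, actor
        self.lr, self.gamma = lr, gamma

    @classmethod
    def create_factory(cls, lr=1e-4, gamma=0.99):
        def create_algo_instance(device, actor, envs):
            # Hyperparameters are captured by the closure; runtime objects arrive here.
            return cls(device=device, envs=envs, actor=actor, lr=lr, gamma=gamma)

        return create_algo_instance, "ToyAlgo"


if __name__ == "__main__":
    create_algo, algo_name = ToyAlgo.create_factory(lr=0.01)
    algo = create_algo("cpu", actor="actor-placeholder", envs="envs-placeholder")
    print(algo_name, algo.lr, algo.gamma)
```

Because the returned function carries no heavyweight state, it can be handed to each worker so that every worker builds its own algorithm instance.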

property gamma

Returns discount factor gamma.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.

property num_test_episodes

Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.
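A minimal sketch of the behaviour described above, assuming the parameters are stored as plain instance attributes (how pytorchrl stores them internally is not stated here). `ToyParams` is a hypothetical class.

```python
class ToyParams:
    """Hypothetical holder of algorithm hyperparameters as plain attributes."""

    def __init__(self, gamma=0.99, test_every=5000):
        self.gamma = gamma
        self.test_every = test_every

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Only existing attributes are changed; unknown names are silently ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


if __name__ == "__main__":
    params = ToyParams()
    params.update_algorithm_parameter("gamma", 0.95)
    params.update_algorithm_parameter("not_a_parameter", 1)
    print(params.gamma, hasattr(params, "not_a_parameter"))
```

This kind of setter is convenient for schedules, for example annealing a coefficient from an outer training loop.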

property update_every

Returns the number of data samples collected between network update stages.

Proximal Policy Optimization (PPO)

class pytorchrl.agent.algorithms.on_policy.ppo.PPO(device, envs, actor, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Proximal Policy Optimization algorithm class.

Algorithm class to execute PPO, from Schulman et al. (https://arxiv.org/abs/1707.06347). Algorithms are modules generally required by multiple workers, so PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own PPO module.

Parameters
  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor class instance.

  • lr (float) – Optimizer learning rate.

  • eps (float) – Optimizer epsilon parameter.

  • num_epochs (int) – Number of PPO epochs.

  • gamma (float) – Discount factor parameter.

  • clip_param (float) – PPO clipping parameter.

  • num_mini_batch (int) – Number of batches to create from collected data for actor updates.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • test_every (int) – Number of network updates between test evaluations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • entropy_coef (float) – PPO entropy coefficient parameter.

  • value_loss_coef (float) – PPO value coefficient parameter.

  • use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> from pytorchrl.agent.algorithms.on_policy.ppo import PPO
>>> create_algo, algo_name = PPO.create_factory(
...     lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2,
...     entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5,
...     num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)

acting_step(obs, rhs, done, deterministic=False)[source]

PPO acting function.

Parameters
  • obs (torch.tensor) – Current world observation.

  • rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted distribution; if False, sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • other (dict) – Additional PPO predictions, value score and action log probability.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting the new gradients if any are provided.

Parameters

gradients (list of tensors) – List of actor gradients.
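The max_grad_norm parameter bounds the size of the gradients used in an optimization step. The sketch below shows norm-based clipping on a flat list of numbers; it is comparable in spirit to `torch.nn.utils.clip_grad_norm_`, but `clip_grad_norm` here is a hypothetical pure-Python helper, and at which point pytorchrl applies clipping is not stated in this reference.

```python
import math


def clip_grad_norm(grads, max_grad_norm=0.5):
    """Rescale gradient values so that their L2 norm is at most max_grad_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        # Already small enough: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    clipped, norm_before = clip_grad_norm([3.0, 4.0], max_grad_norm=0.5)
    print(clipped, norm_before)
```

Clipping by norm preserves the direction of the gradient and only shrinks its length, which is why it is a common safeguard against occasional very large updates.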

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – Data batch containing all required tensors to compute the PPO loss.

  • grads_to_cpu (bool) – Whether to move the gradient tensors to CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor gradients.

  • info (dict) – Dict containing current PPO iteration information.

compute_loss(data)[source]

Compute PPO loss from data batch.

Parameters

data (dict) – Data batch dict containing all required tensors to compute PPO loss.

Returns

  • value_loss (torch.tensor) – value term of PPO loss.

  • action_loss (torch.tensor) – policy term of PPO loss.

  • dist_entropy (torch.tensor) – entropy term of PPO loss.

  • loss (torch.tensor) – PPO loss.
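How the four returned quantities relate can be sketched with the standard PPO objective from Schulman et al. This is an illustration on plain Python lists, assuming the usual combination loss = value_loss_coef * value_loss + action_loss - entropy_coef * dist_entropy; `ppo_loss` and its argument names are hypothetical, and the library's tensor implementation may differ in detail.

```python
import math


def ppo_loss(new_logp, old_logp, adv, values, old_values, returns, entropy,
             clip_param=0.2, value_loss_coef=0.5, entropy_coef=0.01,
             use_clipped_value_loss=True):
    """Scalar PPO loss over a batch given as equal-length lists of floats."""
    n = len(adv)
    action_loss = 0.0
    value_loss = 0.0
    for i in range(n):
        # Probability ratio between the new and the data-collecting policy.
        ratio = math.exp(new_logp[i] - old_logp[i])
        clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
        # Pessimistic (minimum) surrogate, negated because the loss is minimized.
        action_loss -= min(ratio * adv[i], clipped_ratio * adv[i]) / n
        v_err = (values[i] - returns[i]) ** 2
        if use_clipped_value_loss:
            # Keep the new value within clip_param of the old prediction.
            delta = min(max(values[i] - old_values[i], -clip_param), clip_param)
            v_err = max(v_err, (old_values[i] + delta - returns[i]) ** 2)
        value_loss += 0.5 * v_err / n
    dist_entropy = sum(entropy) / n
    loss = value_loss_coef * value_loss + action_loss - entropy_coef * dist_entropy
    return value_loss, action_loss, dist_entropy, loss


if __name__ == "__main__":
    print(ppo_loss([0.0], [0.0], [1.0], [0.0], [0.0], [1.0], [0.5]))
```

The entropy term enters with a negative sign, so a larger entropy_coef rewards more exploratory policies.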

classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]

Returns a function to create new PPO instances.

Parameters
  • lr (float) – Optimizer learning rate.

  • eps (float) – Optimizer epsilon parameter.

  • num_epochs (int) – Number of PPO epochs.

  • gamma (float) – Discount factor parameter.

  • clip_param (float) – PPO clipping parameter.

  • num_mini_batch (int) – Number of batches to create from collected data for actor update.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • test_every (int) – Number of network updates between test evaluations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • entropy_coef (float) – PPO entropy coefficient parameter.

  • value_loss_coef (float) – PPO value coefficient parameter.

  • use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

  • create_algo_instance (func) – Function that creates a new PPO class instance.

  • algo_name (str) – Name of the algorithm.

property gamma

Returns discount factor gamma.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.
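Together, num_epochs and num_mini_batch determine how many gradient updates follow each data collection phase. A minimal sketch of that iteration scheme, assuming the buffer is reshuffled every epoch and split into equally sized mini batches (`iterate_mini_batches` is a hypothetical helper, not the pytorchrl storage API):

```python
import random


def iterate_mini_batches(buffer_size, num_epochs=4, num_mini_batch=1, seed=0):
    """Yield (epoch, indices) pairs; each epoch reshuffles and splits the buffer."""
    if num_mini_batch < 1 or buffer_size < num_mini_batch:
        raise ValueError("buffer_size must be at least num_mini_batch (>= 1)")
    rng = random.Random(seed)
    mini_batch_size = buffer_size // num_mini_batch
    for epoch in range(num_epochs):
        indices = list(range(buffer_size))
        rng.shuffle(indices)
        # Any remainder that does not fill a whole mini batch is dropped.
        for start in range(0, mini_batch_size * num_mini_batch, mini_batch_size):
            yield epoch, indices[start:start + mini_batch_size]


if __name__ == "__main__":
    for epoch, idx in iterate_mini_batches(8, num_epochs=2, num_mini_batch=4):
        print(epoch, idx)
```

With these settings each data collection phase is followed by num_epochs * num_mini_batch updates.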

property num_test_episodes

Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.

property update_every

Returns the number of data samples collected between network update stages.

Proximal Policy Optimization (PPO) with Random Network Distillation (RND)

class pytorchrl.agent.algorithms.on_policy.rnd_ppo.RND_PPO(envs, actor, device, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=2.0, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=False, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Exploration by Random Network Distillation with Proximal Policy Optimization algorithm class.

Algorithm class to execute RND PPO, from Burda et al., 2018 (https://arxiv.org/abs/1810.12894). Algorithms are modules generally required by multiple workers, so RND_PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own RND_PPO module.

Parameters
  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor class instance.

  • lr (float) – Optimizer learning rate.

  • eps (float) – Optimizer epsilon parameter.

  • num_epochs (int) – Number of PPO epochs.

  • gamma (float) – Discount factor parameter.

  • clip_param (float) – PPO clipping parameter.

  • num_mini_batch (int) – Number of batches to create from collected data for actor updates.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • test_every (int) – Number of network updates between test evaluations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • entropy_coef (float) – PPO entropy coefficient parameter.

  • value_loss_coef (float) – PPO value coefficient parameter.

  • use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

  • gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.

  • ext_adv_coeff (float) – Extrinsic advantage coefficient.

  • int_adv_coeff (float) – Intrinsic advantage coefficient.

  • predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.

  • pre_normalization_steps (int) – Number of observation running-average normalization steps to take before training starts.

  • pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).

  • intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.

  • intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.

  • intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.
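The role of the target and predictor networks can be illustrated with a toy version of Random Network Distillation: a frozen, randomly initialized target maps observations to features, a predictor is trained to match it, and the prediction error serves as the intrinsic reward. The sketch below uses small linear maps in plain Python instead of nn.Module networks; all function names are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (list of rows) standing in for a small network."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def forward(weights, obs):
    """Apply the linear map to one observation."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def intrinsic_reward(target, predictor, obs):
    """Mean squared error between predictor and frozen target outputs."""
    t, p = forward(target, obs), forward(predictor, obs)
    return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)


def train_predictor(target, predictor, obs, lr=0.05):
    """One SGD step on the predictor only; the target is never updated."""
    t, p = forward(target, obs), forward(predictor, obs)
    for row, pi, ti in zip(predictor, p, t):
        for j, x in enumerate(obs):
            row[j] -= lr * 2.0 * (pi - ti) * x / len(t)


if __name__ == "__main__":
    rng = random.Random(0)
    target, predictor = make_linear(2, 3, rng), make_linear(2, 3, rng)
    obs = [1.0, 0.5]
    print("novel:", intrinsic_reward(target, predictor, obs))
    for _ in range(50):
        train_predictor(target, predictor, obs)
    print("familiar:", intrinsic_reward(target, predictor, obs))
```

Observations the predictor has seen often yield a small error and therefore a small bonus, while novel observations yield a larger one.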

Examples

>>> from pytorchrl.agent.algorithms.on_policy.rnd_ppo import RND_PPO
>>> create_algo, algo_name = RND_PPO.create_factory(
...     lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2,
...     entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5,
...     num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)

acting_step(obs, rhs, done, deterministic=False)[source]

RND PPO acting function.

Parameters
  • obs (torch.tensor) – Current world observation.

  • rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted distribution; if False, sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).

  • other (dict) – Additional PPO predictions, value score and action log probability.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting the new gradients if any are provided.

Parameters

gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – Data batch containing all required tensors to compute the PPO loss.

  • grads_to_cpu (bool) – Whether to move the gradient tensors to CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor gradients.

  • info (dict) – Dict containing current PPO iteration information.

compute_loss(data)[source]

Compute PPO loss from data batch.

Parameters

data (dict) – Data batch dict containing all required tensors to compute PPO loss.

Returns

  • value_loss (torch.tensor) – value term of PPO loss.

  • action_loss (torch.tensor) – policy term of PPO loss.

  • dist_entropy (torch.tensor) – entropy term of PPO loss.

  • loss (torch.tensor) – PPO loss.
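In RND, the policy term is driven by an advantage that mixes extrinsic and intrinsic estimates, weighted by ext_adv_coeff and int_adv_coeff. A minimal sketch, assuming the linear combination described by Burda et al. (2018); `combine_advantages` is a hypothetical helper and the exact place where pytorchrl performs this step is not stated here.

```python
def combine_advantages(adv_ext, adv_int, ext_adv_coeff=2.0, int_adv_coeff=1.0):
    """Weighted sum of extrinsic and intrinsic advantages, one value per step."""
    return [ext_adv_coeff * e + int_adv_coeff * i for e, i in zip(adv_ext, adv_int)]


if __name__ == "__main__":
    print(combine_advantages([1.0, -0.5], [0.2, 0.4]))
```

Raising int_adv_coeff relative to ext_adv_coeff makes the agent favor novelty over task reward.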

classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=True, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]

Returns a function to create new RND PPO instances.

Parameters
  • lr (float) – Optimizer learning rate.

  • eps (float) – Optimizer epsilon parameter.

  • num_epochs (int) – Number of PPO epochs.

  • gamma (float) – Discount factor parameter.

  • clip_param (float) – PPO clipping parameter.

  • num_mini_batch (int) – Number of batches to create from collected data for actor update.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • test_every (int) – Number of network updates between test evaluations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • entropy_coef (float) – PPO entropy coefficient parameter.

  • value_loss_coef (float) – PPO value coefficient parameter.

  • use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.

  • gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.

  • ext_adv_coeff (float) – Extrinsic advantage coefficient.

  • int_adv_coeff (float) – Intrinsic advantage coefficient.

  • predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.

  • pre_normalization_steps (int) – Number of observation running-average normalization steps to take before training starts.

  • pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).

  • intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.

  • intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.

  • intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.

  • policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

  • create_algo_instance (func) – Function that creates a new RND_PPO class instance.

  • algo_name (str) – Name of the algorithm.
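The pre_normalization_steps and pre_normalization_length parameters refer to warming up running observation statistics before training. One way such statistics can be tracked is Welford's online algorithm, sketched below for a scalar stream. `RunningMeanStd` is a hypothetical class, and the clipping range of 5 standard deviations is an assumption borrowed from the RND paper rather than from this reference.

```python
import math


class RunningMeanStd:
    """Track mean and standard deviation of a scalar stream (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the current mean.

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 1.0 before any data has been seen.
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 1.0

    def normalize(self, x, clip=5.0, eps=1e-8):
        """Standardize x with the running statistics and clip the result."""
        z = (x - self.mean) / (self.std + eps)
        return min(max(z, -clip), clip)


if __name__ == "__main__":
    stats = RunningMeanStd()
    # Warm-up phase: only statistics are updated, no training happens yet.
    for x in [1.0, 2.0, 3.0, 4.0]:
        stats.update(x)
    print(stats.mean, stats.std, stats.normalize(100.0))
```

Under this reading, the warm-up would consume roughly pre_normalization_steps * pre_normalization_length environment steps before the first update.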

property gamma

Returns discount factor gamma.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.

property num_test_episodes

Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.

property update_every

Returns the number of data samples collected between network update stages.