pytorchrl.agent.algorithms.on_policy package
Submodules
pytorchrl.agent.algorithms.on_policy.a2c module
- class pytorchrl.agent.algorithms.on_policy.a2c.A2C(device, envs, actor, lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Algorithm class to execute A2C, from Mnih et al. 2016 (https://arxiv.org/pdf/1602.01783.pdf).
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_v (float) – Value network learning rate.
lr_pi (float) – Policy network learning rate.
gamma (float) – Discount factor parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Number of actor updates between test evaluations.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
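To make the role of max_grad_norm concrete, the sketch below shows global-norm gradient clipping in plain Python. It assumes the parameter acts as an upper bound on the joint L2 norm of all gradients, in the manner of torch.nn.utils.clip_grad_norm_; the helper name clip_by_global_norm is hypothetical and is not part of pytorchrl.

```python
import math


def clip_by_global_norm(grads, max_grad_norm):
    """Scale all gradients so their joint L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only shrink: gradients already inside the threshold are left untouched.
    scale = min(1.0, max_grad_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


# Two "parameter tensors" whose joint norm is 5.0 (hypothetical values).
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, max_grad_norm=0.5)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Clipping rescales every gradient by the same factor, so the update direction is preserved and only its magnitude is bounded.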
- acting_step(obs, rhs, done, deterministic=False)[source]
A2C acting function.
- Parameters
obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional A2C predictions, value score and action log probability.
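The effect of the deterministic flag can be illustrated with a small categorical distribution. This is a minimal sketch in plain Python, not the library's acting code; select_action and the probability values are hypothetical.

```python
import random


def select_action(probs, deterministic=False, rng=random):
    """Pick an action index from a categorical distribution (list of probabilities)."""
    if deterministic:
        # Mode of the distribution: the most probable action.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Otherwise draw one sample according to the probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
greedy = select_action(probs, deterministic=True)
sampled = select_action(probs, deterministic=False, rng=random.Random(0))
```

Deterministic selection is typically used during test episodes, while sampling keeps exploration active during data collection.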
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – Data batch containing all required tensors to compute A2C loss.
grads_to_cpu (bool) – If True, move gradient tensors to the CPU before returning them; required when gradients will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current A2C iteration information.
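The split between compute_gradients and apply_gradients lets gradients be computed on one worker and applied on another, for instance after averaging. The toy class below sketches that workflow under simplifying assumptions: ToyGradAlgo is hypothetical, fits a single scalar weight, and uses plain floats instead of tensors.

```python
class ToyGradAlgo:
    """Hypothetical stand-in: fits one weight w to minimise mean (w * x - y) ** 2."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = [0.0]
        self.grads = [0.0]

    def compute_gradients(self, batch):
        # Loss and gradient are computed, but the weights are not modified.
        w = self.weights[0]
        pairs = list(zip(batch["x"], batch["y"]))
        loss = sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        self.grads = [grad]
        return list(self.grads), {"loss": loss}

    def apply_gradients(self, gradients=None):
        # Externally supplied gradients replace the locally stored ones.
        if gradients is not None:
            self.grads = list(gradients)
        self.weights = [w - self.lr * g for w, g in zip(self.weights, self.grads)]


batch = {"x": [1.0, 2.0], "y": [2.0, 4.0]}
worker_a, worker_b = ToyGradAlgo(), ToyGradAlgo()
grads_a, info = worker_a.compute_gradients(batch)
grads_b, _ = worker_b.compute_gradients(batch)
# Average the gradients of both workers, then take a single optimization step.
mean_grads = [(a + b) / 2 for a, b in zip(grads_a, grads_b)]
worker_a.apply_gradients(mean_grads)
```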
- compute_loss(data)[source]
Calculate the A2C loss.
- Parameters
data (dict) – Data batch dict containing all required tensors to compute A2C loss.
- Returns
loss – A2C loss.
- Return type
torch.tensor
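As a rough illustration of what an A2C loss contains, the sketch below combines a policy-gradient term weighted by the advantage with a squared-error value term. It is a simplified, assumption-laden version: it works on Python lists, and value_coef is a hypothetical weighting that is not a parameter of this class (which exposes separate lr_v and lr_pi instead).

```python
def a2c_loss(log_probs, values, returns, value_coef=0.5):
    """Toy A2C loss over a batch given as plain Python lists."""
    n = len(returns)
    advantages = [r - v for r, v in zip(returns, values)]
    # Policy term: advantages are treated as constants weighting the log-probabilities.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Value term: mean squared error between returns and value estimates.
    value_loss = sum(adv ** 2 for adv in advantages) / n
    return policy_loss + value_coef * value_loss, policy_loss, value_loss


loss, policy_loss, value_loss = a2c_loss(
    log_probs=[-1.0, -2.0], values=[0.5, 1.0], returns=[1.5, 0.0])
```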
- classmethod create_factory(lr_v=0.0001, lr_pi=0.0001, gamma=0.99, test_every=5000, max_grad_norm=0.5, num_test_episodes=5, policy_loss_addons=[])[source]
Returns a function to create new A2C instances.
- Parameters
lr_v (float) – Value network learning rate.
lr_pi (float) – Policy network learning rate.
gamma (float) – Discount factor parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Number of actor updates between test evaluations.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new A2C class instance.
algo_name (str) – Name of the algorithm.
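The factory pattern described above can be sketched with a closure: hyperparameters are fixed when create_factory is called, and the worker-specific objects are supplied later. ToyFactoryAlgo and the argument order of the inner function are assumptions made for illustration, so the call uses keyword arguments.

```python
class ToyFactoryAlgo:
    """Hypothetical algorithm class; not part of pytorchrl."""

    def __init__(self, device, envs, actor, lr=1e-4, gamma=0.99):
        self.device, self.envs, self.actor = device, envs, actor
        self.lr, self.gamma = lr, gamma

    @classmethod
    def create_factory(cls, lr=1e-4, gamma=0.99):
        # Hyperparameters are captured now; device, envs and actor arrive later,
        # when each worker calls the returned function.
        def create_algo_instance(device, actor, envs):
            return cls(device=device, envs=envs, actor=actor, lr=lr, gamma=gamma)

        return create_algo_instance, "ToyFactoryAlgo"


create_algo, algo_name = ToyFactoryAlgo.create_factory(lr=0.01)
# Placeholder strings stand in for a real device, actor and vector of environments.
algo = create_algo(device="cpu", actor="actor-placeholder", envs="envs-placeholder")
```

Returning a function rather than an instance means every worker can build its own copy of the algorithm on its own device.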
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of samples in each mini batch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
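The guarded update described above amounts to an attribute check followed by an assignment. A minimal sketch, assuming a hypothetical ToyParamAlgo class with a plain gamma attribute:

```python
class ToyParamAlgo:
    """Hypothetical algorithm with a single tunable attribute."""

    def __init__(self):
        self.gamma = 0.99

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Unknown names are ignored rather than creating new attributes.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyParamAlgo()
algo.update_algorithm_parameter("gamma", 0.95)           # existing attribute: updated
algo.update_algorithm_parameter("not_an_attribute", 1)   # unknown name: no effect
```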
- property update_every
Returns the number of data samples collected between network update stages.
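For reference, the discount factor gamma weights future rewards when returns are accumulated. The sketch below computes discounted returns backwards over a short trajectory; it assumes dones[t] is 1.0 when step t ends an episode, and discounted_returns is a hypothetical helper rather than the library's storage code.

```python
def discounted_returns(rewards, dones, gamma=0.99, bootstrap_value=0.0):
    """Accumulate discounted returns from the last step back to the first."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        # A done flag cuts the return so it does not leak across episodes.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], [0.0, 0.0, 1.0], gamma=0.5)
```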
pytorchrl.agent.algorithms.on_policy.ppo module
- class pytorchrl.agent.algorithms.on_policy.ppo.PPO(device, envs, actor, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Proximal Policy Optimization algorithm class.
Algorithm class to execute PPO, from Schulman et al. (https://arxiv.org/abs/1707.06347). Algorithms are modules generally required by multiple workers, so PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own PPO module.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
Examples
>>> create_algo, algo_name = PPO.create_factory(lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2, entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5, num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)
- acting_step(obs, rhs, done, deterministic=False)[source]
PPO acting function.
- Parameters
obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional PPO predictions, value score and action log probability.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – Data batch containing all required tensors to compute PPO loss.
grads_to_cpu (bool) – If True, move gradient tensors to the CPU before returning them; required when gradients will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current PPO iteration information.
- compute_loss(data)[source]
Compute PPO loss from data batch.
- Parameters
data (dict) – Data batch dict containing all required tensors to compute PPO loss.
- Returns
value_loss (torch.tensor) – value term of PPO loss.
action_loss (torch.tensor) – policy term of PPO loss.
dist_entropy (torch.tensor) – entropy term of PPO loss.
loss (torch.tensor) – PPO loss.
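The four returned terms can be related to one another with a small numeric sketch of the clipped surrogate objective from Schulman et al. This is an illustration on Python lists, not the library's implementation; the value-clipping form and the 0.5 factor on the value term are common conventions assumed here, and ppo_loss is a hypothetical helper.

```python
import math


def ppo_loss(new_logp, old_logp, adv, values, old_values, returns, entropy,
             clip_param=0.2, value_loss_coef=0.5, entropy_coef=0.01,
             use_clipped_value_loss=True):
    """Toy PPO loss; returns (value_loss, action_loss, dist_entropy, loss)."""
    n = len(adv)
    action_loss = 0.0
    value_loss = 0.0
    for lp, olp, a, v, ov, r in zip(new_logp, old_logp, adv, values, old_values, returns):
        ratio = math.exp(lp - olp)
        clipped_ratio = max(1.0 - clip_param, min(1.0 + clip_param, ratio))
        # Pessimistic (minimum) surrogate objective, negated to obtain a loss.
        action_loss += -min(ratio * a, clipped_ratio * a) / n
        if use_clipped_value_loss:
            # Keep the new value within clip_param of the old prediction.
            v_clipped = ov + max(-clip_param, min(clip_param, v - ov))
            value_loss += 0.5 * max((v - r) ** 2, (v_clipped - r) ** 2) / n
        else:
            value_loss += 0.5 * (r - v) ** 2 / n
    dist_entropy = sum(entropy) / n
    loss = value_loss_coef * value_loss + action_loss - entropy_coef * dist_entropy
    return value_loss, action_loss, dist_entropy, loss


value_loss, action_loss, dist_entropy, loss = ppo_loss(
    new_logp=[math.log(1.5)], old_logp=[0.0], adv=[2.0],
    values=[1.0], old_values=[0.5], returns=[2.0], entropy=[1.0])
```

In this example the probability ratio of 1.5 exceeds 1 + clip_param, so the clipped term (1.2 times the advantage) determines the policy loss.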
- classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, use_clipped_value_loss=True, policy_loss_addons=[])[source]
Returns a function to create new PPO instances.
- Parameters
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor update.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new PPO class instance.
algo_name (str) – Name of the algorithm.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of samples in each mini batch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.
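The interaction between num_epochs and num_mini_batch can be sketched as a nested loop: the collected data is reshuffled once per epoch and split into num_mini_batch chunks. The generator below is a hypothetical illustration; it assumes floor division for the mini batch size, so any remainder samples are dropped.

```python
import random


def iterate_minibatches(num_samples, num_epochs, num_mini_batch, rng):
    """Yield (epoch, indices) pairs covering the buffer once per epoch."""
    mini_batch_size = num_samples // num_mini_batch
    for epoch in range(num_epochs):
        indices = list(range(num_samples))
        rng.shuffle(indices)  # new random split on every pass over the data
        for start in range(0, mini_batch_size * num_mini_batch, mini_batch_size):
            yield epoch, indices[start:start + mini_batch_size]


batches = list(iterate_minibatches(
    num_samples=8, num_epochs=4, num_mini_batch=4, rng=random.Random(0)))
first_epoch = sorted(i for epoch, idx in batches if epoch == 0 for i in idx)
```

With these values each data sample contributes to four updates in total, one per epoch.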
pytorchrl.agent.algorithms.on_policy.rnd_ppo module
- class pytorchrl.agent.algorithms.on_policy.rnd_ppo.RND_PPO(envs, actor, device, lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=2.0, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=False, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Exploration by Random Network Distillation with Proximal Policy Optimization algorithm class.
Algorithm class to execute RND PPO, from Burda et al., 2018 (https://arxiv.org/abs/1810.12894). Algorithms are modules generally required by multiple workers, so RND_PPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own RND_PPO module.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.
ext_adv_coeff (float) – Extrinsic advantage coefficient.
int_adv_coeff (float) – Intrinsic advantage coefficient.
predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.
pre_normalization_steps (int) – Number of obs running average normalization steps to take before starting to train.
pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).
intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.
intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.
intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.
Examples
>>> create_algo, algo_name = RND_PPO.create_factory(lr=0.01, eps=1e-5, num_epochs=4, clip_param=0.2, entropy_coef=0.01, value_loss_coef=0.5, max_grad_norm=0.5, num_mini_batch=4, use_clipped_value_loss=True, gamma=0.99)
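The core idea of Random Network Distillation is that the intrinsic reward is the error of a trained predictor network against a fixed, randomly initialised target network, so unfamiliar observations yield larger rewards. The sketch below illustrates this with random linear maps in place of neural networks, and shows one plausible way ext_adv_coeff and int_adv_coeff could weight the two advantage streams. All helper names are hypothetical, and the weighted sum is an assumption about how the coefficients are used.

```python
import random


def make_linear_net(in_dim, out_dim, rng):
    """Random linear map standing in for a neural network."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def forward(net, obs):
    return [sum(w * x for w, x in zip(row, obs)) for row in net]


def intrinsic_reward(target, predictor, obs):
    # Mean squared prediction error against the frozen random target.
    t, p = forward(target, obs), forward(predictor, obs)
    return sum((ti - pi) ** 2 for ti, pi in zip(t, p)) / len(t)


def combine_advantages(adv_ext, adv_int, ext_adv_coeff=2.0, int_adv_coeff=1.0):
    # Assumed combination: weighted sum of extrinsic and intrinsic advantages.
    return [ext_adv_coeff * e + int_adv_coeff * i for e, i in zip(adv_ext, adv_int)]


rng = random.Random(0)
target = make_linear_net(3, 2, rng)       # fixed, never trained
predictor = make_linear_net(3, 2, rng)    # would be trained to imitate the target
obs = [0.5, -1.0, 2.0]
novel = intrinsic_reward(target, predictor, obs)
familiar = intrinsic_reward(target, target, obs)  # a perfect predictor earns no reward
adv = combine_advantages([1.0, -0.5], [0.2, 0.4])
```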
- acting_step(obs, rhs, done, deterministic=False)[source]
PPO acting function.
- Parameters
obs (torch.tensor) – Current world observation.
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional PPO predictions, value score and action log probability.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – Data batch containing all required tensors to compute PPO loss.
grads_to_cpu (bool) – If True, move gradient tensors to the CPU before returning them; required when gradients will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current PPO iteration information.
- compute_loss(data)[source]
Compute PPO loss from data batch.
- Parameters
data (dict) – Data batch dict containing all required tensors to compute PPO loss.
- Returns
value_loss (torch.tensor) – value term of PPO loss.
action_loss (torch.tensor) – policy term of PPO loss.
dist_entropy (torch.tensor) – entropy term of PPO loss.
loss (torch.tensor) – PPO loss.
- classmethod create_factory(lr=0.0001, eps=1e-08, gamma=0.99, num_epochs=4, clip_param=0.2, num_mini_batch=1, test_every=1000, max_grad_norm=0.5, entropy_coef=0.01, value_loss_coef=0.5, num_test_episodes=5, gamma_intrinsic=0.99, ext_adv_coeff=2.0, int_adv_coeff=1.0, predictor_proportion=2.0, pre_normalization_steps=50, pre_normalization_length=128, use_clipped_value_loss=True, intrinsic_rewards_network=None, intrinsic_rewards_target_network_kwargs={}, intrinsic_rewards_predictor_network_kwargs={}, policy_loss_addons=[])[source]
Returns a function to create new RND PPO instances.
- Parameters
lr (float) – Optimizer learning rate.
eps (float) – Optimizer epsilon parameter.
num_epochs (int) – Number of PPO epochs.
gamma (float) – Discount factor parameter.
clip_param (float) – PPO clipping parameter.
num_mini_batch (int) – Number of batches to create from collected data for actor update.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of network updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
entropy_coef (float) – PPO entropy coefficient parameter.
value_loss_coef (float) – PPO value coefficient parameter.
use_clipped_value_loss (bool) – Prevent value loss from shifting too fast.
gamma_intrinsic (float) – Discount factor parameter for intrinsic rewards.
ext_adv_coeff (float) – Extrinsic advantage coefficient.
int_adv_coeff (float) – Intrinsic advantage coefficient.
predictor_proportion (float) – Proportion of buffer sample to use to train the predictor network.
pre_normalization_steps (int) – Number of obs running average normalization steps to take before starting to train.
pre_normalization_length (int) – Length of each pre-normalization step (in environment steps).
intrinsic_rewards_network (nn.Module) – PyTorch nn.Module used for target and predictor networks.
intrinsic_rewards_target_network_kwargs (dict) – Keyword arguments for the target network.
intrinsic_rewards_predictor_network_kwargs (dict) – Keyword arguments for the predictor network.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new RND_PPO class instance.
algo_name (str) – Name of the algorithm.
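The pre_normalization_steps and pre_normalization_length parameters describe a warm-up phase in which observation statistics are accumulated before training starts. A minimal sketch of such a running average, using Welford's online update on scalar observations; RunningMeanStd here is a hypothetical class and the observation stream is made up for illustration.

```python
class RunningMeanStd:
    """Running mean and standard deviation of a scalar stream (Welford's method)."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Online update of the mean and the sum of squared deviations.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.count) ** 0.5 if self.count > 0 else 1.0

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / (self.std + eps)


stats = RunningMeanStd()
pre_normalization_steps, pre_normalization_length = 2, 2
stream = iter([1.0, 3.0, 5.0, 7.0])  # stand-in for observations from the environments
for _ in range(pre_normalization_steps):
    for _ in range(pre_normalization_length):
        stats.update(next(stream))
```

Warming up the statistics first avoids normalizing early observations with a mean and deviation estimated from only a handful of samples.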
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of samples in each mini batch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.