Off-policy
Double Deep Q-Learning (DDQN)
- class pytorchrl.agent.algorithms.off_policy.ddqn.DDQN(device, envs, actor, lr=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=5000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, initial_epsilon=1.0, epsilon_decay=0.999, target_update_interval=1, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Deep Q Learning algorithm class.
Algorithm class to execute DQN with a target network, from Mnih et al. (https://www.nature.com/articles/nature14236?wm=book_wap_0005).
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (ActorCritic) – actor class instance.
lr (float) – learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter.
num_updates (int) – Num consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Regularity of target nets updates with respect to actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor updates.
max_grad_norm (float) – Gradient clipping parameter.
initial_epsilon (float) – initial value for DQN epsilon parameter.
epsilon_decay (float) – Exponential decay rate for epsilon parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
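The interaction between initial_epsilon and epsilon_decay can be sketched in plain Python. This is a minimal illustration of exponentially decayed epsilon-greedy exploration, not the library's code: the helper names are hypothetical, and whether DDQN applies the decay once per acting step or once per update is an assumption left open here.

```python
import random


def decayed_epsilon(initial_epsilon, epsilon_decay, num_decays):
    """Exploration rate after applying the decay factor `num_decays` times."""
    return initial_epsilon * (epsilon_decay ** num_decays)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With the default values above, epsilon falls to roughly e^-1 after 1000 decays.
eps = decayed_epsilon(initial_epsilon=1.0, epsilon_decay=0.999, num_decays=1000)

# epsilon=0.0 never explores, so the action with the largest Q-value is returned.
action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
```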
- acting_step(obs, rhs, done, deterministic=False)[source]
DDQN acting function.
- Parameters
obs (torch.tensor) – Current world observation
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (dict) – Actor recurrent hidden states.
other (dict) – Additional DDQN predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided. Update also target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – Data batch containing all required tensors to compute DQN loss.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is needed if they will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current DQN iteration information.
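The max_grad_norm parameter suggests that gradients are clipped by their global norm before being returned or applied. The following is a dependency-free sketch of that idea, assuming the usual global-norm rule (rescale all gradients by max_grad_norm / total_norm when the total norm exceeds the limit); the function name and the nested-list representation of gradients are illustrative only.

```python
import math


def clip_by_global_norm(grads, max_grad_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_grad_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale is capped at 1.0 so small gradients are left untouched.
    scale = min(1.0, max_grad_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


# Two parameter tensors (flattened) with a joint norm of 5.0, clipped to 0.5.
clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_grad_norm=0.5)
```

Clipping the joint norm rather than each tensor separately preserves the direction of the overall update.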
- compute_loss(batch, n_step=1, per_weights=1)[source]
Calculate DDQN loss
- Parameters
batch (dict) – Data batch dict containing all required tensors to compute DDQN loss.
- Returns
loss (torch.tensor) – DDQN loss.
errors (torch.tensor) – TD errors.
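For orientation, the Double DQN target that gives the algorithm its name can be written out for a single transition. This sketch assumes the standard formulation (the online network selects the next action, the target network evaluates it); the function names are hypothetical and the library's actual loss may differ in details such as n-step returns or the loss function applied to the TD errors.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Bootstrap target for one transition.

    q_online_next / q_target_next: Q-values of every action in the next state,
    predicted by the online and the target network respectively.
    """
    # The online network chooses the action ...
    greedy_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network provides its value.
    bootstrap = q_target_next[greedy_action]
    return reward + gamma * (1.0 - done) * bootstrap


def td_error(q_taken, target):
    """Temporal-difference error for the action actually taken."""
    return target - q_taken


y = double_dqn_target(
    reward=1.0, done=0.0, gamma=0.99,
    q_online_next=[0.2, 0.9, 0.1],   # online net prefers action 1
    q_target_next=[0.5, 0.3, 0.8],   # target net values action 1 at 0.3
)
error = td_error(q_taken=1.0, target=y)
```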
- classmethod create_factory(lr=0.0001, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, max_grad_norm=0.5, mini_batch_size=64, num_test_episodes=5, epsilon_decay=0.999, initial_epsilon=1.0, target_update_interval=1, policy_loss_addons=[])[source]
Returns a function to create new DDQN instances.
- Parameters
lr (float) – learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter.
num_updates (int) – Num consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor updates.
max_grad_norm (float) – Gradient clipping parameter.
initial_epsilon (float) – initial value for DQN epsilon parameter.
epsilon_decay (float) – Exponential decay rate for epsilon parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new DDQN class instance.
algo_name (str) – Name of the algorithm.
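The factory pattern described here (hyperparameters are fixed first, the instance is built later by whichever worker needs it) can be illustrated with a toy class. ToyAlgo is a hypothetical stand-in, not part of pytorchrl, and the argument list of the returned function is an assumption made for the sake of the example.

```python
class ToyAlgo:
    """Hypothetical stand-in for an algorithm class."""

    def __init__(self, device, envs, actor, lr=1e-4, gamma=0.99):
        self.device = device
        self.envs = envs
        self.actor = actor
        self.lr = lr
        self.gamma = gamma

    @classmethod
    def create_factory(cls, lr=1e-4, gamma=0.99):
        """Capture hyperparameters now, defer construction to the caller."""

        def create_algo_instance(device, actor, envs):
            # Runtime objects (device, actor, envs) are supplied by the worker.
            return cls(device=device, envs=envs, actor=actor, lr=lr, gamma=gamma)

        return create_algo_instance, "ToyAlgo"


create_algo, algo_name = ToyAlgo.create_factory(lr=3e-4)
algo = create_algo(device="cpu", actor=None, envs=None)
```

Because the returned function is a closure over plain hyperparameters, it can be handed to several workers, each of which builds its own instance.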
- set_weights(weights)[source]
Update actor critic with the given weights. Update also target networks.
- Parameters
weights (dict of tensors) – Dict containing actor weights to be set.
Deep Deterministic Policy Gradient (DDPG)
- class pytorchrl.agent.algorithms.off_policy.ddpg.DDPG(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Deep Deterministic Policy Gradient algorithm class.
Algorithm class to execute DDPG, from Timothy P. Lillicrap et al., Continuous Control with Deep Reinforcement Learning (https://arxiv.org/pdf/1509.02971.pdf). Algorithms are modules generally required by multiple workers, so DDPG.create_factory(…) returns a function that can be passed on to workers to instantiate their own DDPG module.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – DDPG polyak averaging parameter.
num_updates (int) – Num consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor_critic Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor_critic updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
Examples
>>> create_algo = DDPG.create_factory( lr_q=1e-4, lr_pi=1e-4, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, mini_batch_size=64, num_test_episodes=0, target_update_interval=1)
- acting_step(obs, rhs, done, deterministic=False)[source]
DDPG acting function.
- Parameters
obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
other (dict) – Additional DDPG predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided. Update also target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
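The target network update mentioned above is governed by polyak and target_update_interval. A minimal sketch, assuming the common convention in which polyak is the fraction of the old target weights that is kept; the function names and the flat-list representation of parameters are illustrative, and the library's convention should be checked against its source.

```python
def polyak_update(target_params, online_params, polyak):
    """Move target parameters a small step towards the online parameters."""
    return [polyak * t + (1.0 - polyak) * o
            for t, o in zip(target_params, online_params)]


def maybe_update_targets(update_count, target_update_interval,
                         target_params, online_params, polyak):
    """Apply the polyak update only every `target_update_interval` optimizer steps."""
    if update_count % target_update_interval == 0:
        return polyak_update(target_params, online_params, polyak)
    return target_params


target = polyak_update(target_params=[0.0], online_params=[1.0], polyak=0.995)
```

Under this convention a value close to 1.0 gives slowly moving targets, which is what stabilizes the bootstrapped Q-learning targets.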
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – data batch containing all required tensors to compute DDPG losses.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is needed if they will be sent to another node.
- Returns
grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current DDPG iteration information.
- compute_loss_pi(data, per_weights=1)[source]
Calculate DDPG policy loss.
- Parameters
data (dict) – Data batch dict containing all required tensors to compute DDPG losses.
- Returns
loss_pi – DDPG policy loss.
- Return type
torch.tensor
- compute_loss_q(data, n_step=1, per_weights=1)[source]
Calculate DDPG Q-nets loss
- Parameters
data (dict) – Data batch dict containing all required tensors to compute DDPG losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
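The truncated n-step return referred to by n_step can be computed as follows for a single trajectory fragment. This is a sketch under the usual definition (discounted sum of the first n rewards plus a discounted bootstrap value, with no bootstrap once an episode ends); the function name and argument layout are hypothetical, and how the library stores n-step data in its batches is not shown here.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma):
    """Truncated n-step return, with n = len(rewards).

    rewards[k], dones[k]: reward and termination flag of the k-th step.
    bootstrap_value: value estimate of the state reached after n steps.
    """
    target, discount = 0.0, 1.0
    for reward, done in zip(rewards, dones):
        target += discount * reward
        discount *= gamma
        if done:
            # The episode ended inside the window: nothing to bootstrap from.
            return target
    return target + discount * bootstrap_value


# 3-step return: 1 + 0.9 + 0.81 + 0.729 * 10
g3 = n_step_target([1.0, 1.0, 1.0], [0, 0, 0], bootstrap_value=10.0, gamma=0.9)

# Episode terminates at the second step: 1 + 0.9
g_term = n_step_target([1.0, 1.0, 1.0], [0, 1, 0], bootstrap_value=10.0, gamma=0.9)
```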
- classmethod create_factory(lr_q=0.001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]
Returns a function to create new DDPG instances.
- Parameters
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – DDPG polyak averaging parameter.
num_updates (int) – Num consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor_critic Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor_critic updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new DDPG class instance.
algo_name (str) – Name of the algorithm.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of mini batches per epoch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
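The behaviour described for update_algorithm_parameter amounts to a guarded attribute assignment. A minimal sketch with a hypothetical ToyParams class; the real class exposes several values as properties, so its internal attribute names may differ from the public ones used here.

```python
class ToyParams:
    """Hypothetical object holding two tunable hyperparameters."""

    def __init__(self):
        self.gamma = 0.99
        self.update_every = 50

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        """Overwrite an existing attribute; silently ignore unknown names."""
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyParams()
algo.update_algorithm_parameter("gamma", 0.95)      # existing attribute: updated
algo.update_algorithm_parameter("not_there", 123)   # unknown name: no effect
```

Ignoring unknown names keeps a scheduler that broadcasts parameter changes to several algorithm types from creating stray attributes.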
- property update_every
Returns the number of data samples collected between network update stages.
Twin Delayed Deep Deterministic (TD3)
- class pytorchrl.agent.algorithms.off_policy.td3.TD3(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Twin Delayed Deep Deterministic Policy Gradient algorithm class.
Algorithm class to execute TD3, from Scott Fujimoto et al., Addressing Function Approximation Error in Actor-Critic Methods (https://arxiv.org/pdf/1802.09477.pdf). Algorithms are modules generally required by multiple workers, so TD3.create_factory(…) returns a function that can be passed on to workers to instantiate their own TD3 module.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – TD3 polyak averaging parameter.
num_updates (int) – Num consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor_critic Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor_critic updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
Examples
>>> create_algo = TD3.create_factory( lr_q=1e-3, lr_pi=1e-3, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, mini_batch_size=100, num_test_episodes=0, target_update_interval=2)
- acting_step(obs, rhs, done, deterministic=False)[source]
TD3 acting function.
- Parameters
obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
other (dict) – Additional TD3 predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided. Update also target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – data batch containing all required tensors to compute TD3 losses.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is needed if they will be sent to another node.
- Returns
grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current TD3 iteration information.
- compute_loss_pi(data, per_weights=1)[source]
Calculate TD3 policy loss.
- Parameters
data (dict) – Data batch dict containing all required tensors to compute TD3 losses.
- Returns
loss_pi – TD3 policy loss.
- Return type
torch.tensor
- compute_loss_q(data, n_step=1, per_weights=1)[source]
Calculate TD3 Q-nets loss
- Parameters
data (dict) – Data batch dict containing all required tensors to compute TD3 losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
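The two Q-net losses and the role of per_weights can be sketched without tensors. This example assumes the clipped double-Q target from the TD3 paper (the minimum of the two target critics) and a squared-error loss scaled per sample by the PER importance weights; the function names are hypothetical, and target policy smoothing as well as the exact way the library combines loss_q1 and loss_q2 are left out.

```python
def td3_target(reward, done, gamma, q1_next, q2_next):
    """Clipped double-Q target: bootstrap from the smaller target-critic estimate."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


def weighted_mse(predictions, targets, per_weights):
    """Mean squared TD error with per-sample importance sampling weights.

    Returns the scalar loss and the raw TD errors (usable as new PER priorities).
    """
    errors = [t - p for p, t in zip(predictions, targets)]
    loss = sum(w * e * e for w, e in zip(per_weights, errors)) / len(errors)
    return loss, errors


y = td3_target(reward=1.0, done=0.0, gamma=0.99, q1_next=2.0, q2_next=3.0)

# The same targets are used for both critics; only their predictions differ.
loss_q1, errors = weighted_mse([1.0, 2.0], [2.0, 2.0], per_weights=[0.5, 1.0])
```

Passing 1.0 for every weight recovers the plain mean squared error, which matches the "or 1.0" default described above.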
- classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, mini_batch_size=100, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]
Returns a function to create new TD3 instances.
- Parameters
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – TD3 polyak averaging parameter.
num_updates (int) – Num consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor_critic Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor_critic updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new TD3 class instance.
algo_name (str) – Name of the algorithm.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of mini batches per epoch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.
Soft Actor Critic (SAC)
- class pytorchrl.agent.algorithms.off_policy.sac.SAC(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, lr_alpha=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, initial_alpha=1.0, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Soft Actor Critic algorithm class.
Algorithm class to execute SAC, from Haarnoja et al. (https://arxiv.org/abs/1812.05905). Algorithms are modules generally required by multiple workers, so SAC.create_factory(…) returns a function that can be passed on to workers to instantiate their own SAC module.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
lr_alpha (float) – Alpha optimizer learning rate.
gamma (float) – Discount factor parameter.
initial_alpha (float) – Initial entropy coefficient value (temperature).
polyak (float) – SAC polyak averaging parameter.
num_updates (int) – Num consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
Examples
>>> create_algo = SAC.create_factory( lr_q=1e-4, lr_pi=1e-4, lr_alpha=1e-4, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, mini_batch_size=64, initial_alpha=1.0, num_test_episodes=0, target_update_interval=1)
- acting_step(obs, rhs, done, deterministic=False)[source]
SAC acting function.
- Parameters
obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
other (dict) – Additional SAC predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided. Update also target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – data batch containing all required tensors to compute SAC losses.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is needed if they will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current SAC iteration information.
- compute_loss_alpha(log_probs, per_weights=1)[source]
Calculate SAC entropy loss.
- Parameters
log_probs (torch.tensor) – Log probability of predicted next action.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
alpha_loss – SAC entropy loss.
- Return type
torch.tensor
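As a point of reference, the temperature objective from Haarnoja et al. can be evaluated on plain floats. This sketch assumes the form J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]; target_entropy is a hypothetical argument (often set to minus the action dimension), and the library may instead optimize log(alpha) or handle discrete actions differently.

```python
def alpha_loss(alpha, log_probs, target_entropy, per_weights=None):
    """Entropy-coefficient loss averaged over a batch of action log-probabilities."""
    if per_weights is None:
        per_weights = [1.0] * len(log_probs)
    terms = [-alpha * (lp + target_entropy) * w
             for lp, w in zip(log_probs, per_weights)]
    return sum(terms) / len(terms)


# Policy entropy (mean of -log_prob = 2.0) is above the target of -2.0, so the
# loss grows with alpha and gradient descent would shrink the temperature.
loss = alpha_loss(alpha=1.0, log_probs=[-1.0, -3.0], target_entropy=-2.0)
```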
- compute_loss_pi(batch, per_weights=1)[source]
Calculate SAC policy loss.
- Parameters
batch (dict) – Data batch dict containing all required tensors to compute SAC losses.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_pi (torch.tensor) – SAC policy loss.
logp_pi (torch.tensor) – Log probability of predicted next action.
- compute_loss_q(batch, n_step=1, per_weights=1)[source]
Calculate SAC Q-nets loss
- Parameters
batch (dict) – Data batch dict containing all required tensors to compute SAC losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
- classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, lr_alpha=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, initial_alpha=1.0, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]
Returns a function to create new SAC instances.
- Parameters
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
lr_alpha (float) – Alpha optimizer learning rate.
gamma (float) – Discount factor parameter.
initial_alpha (float) – Initial entropy coefficient value.
polyak (float) – SAC polyak averaging parameter.
num_updates (int) – Num consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor updates.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new SAC class instance.
algo_name (str) – Name of the algorithm.
- property discrete_version
Returns True if action_space is discrete.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the number of mini batches per epoch.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.
Maximum a Posteriori Policy Optimization (MPO)
- class pytorchrl.agent.algorithms.off_policy.mpo.MPO(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=1.0, num_updates=1, update_every=50, test_every=1000, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, dual_constraint=0.1, kl_mean_constraint=0.01, kl_var_constraint=0.0001, kl_constraint=0.01, alpha_scale=10.0, alpha_mean_scale=1.0, alpha_var_scale=100.0, alpha_mean_max=0.1, alpha_var_max=10.0, alpha_max=1.0, mstep_iterations=5, sample_action_num=64, max_grad_norm=0.1, policy_loss_addons=[])[source]
Bases: pytorchrl.agent.algorithms.base.Algorithm
Maximum a Posteriori Policy Optimization algorithm class.
Algorithm class to execute MPO, from A Abdolmaleki et al. (https://arxiv.org/abs/1806.06920). Algorithms are modules generally required by multiple workers, so MPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own MPO module.
This code has been adapted from https://github.com/daisatojp/mpo.
- Parameters
device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – MPO polyak averaging parameter.
num_updates (int) – Num consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Num of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – regularity of target nets updates with respect to actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations in actor updates.
dual_constraint (float) – Hard constraint of the dual formulation in the E-step corresponding to [2] p.4 ε.
kl_mean_constraint (float) – Hard constraint of the mean in the M-step corresponding to [2] p.6 ε_μ for continuous action space.
kl_var_constraint (float) – Hard constraint of the covariance in the M-step corresponding to [2] p.6 ε_Σ for continuous action space.
kl_constraint (float) – Hard constraint in the M-step corresponding to [2] p.6 ε_π for discrete action space.
alpha_scale (float) – Scaling factor of the Lagrangian multiplier in the M-step for discrete action spaces.
alpha_max (float) – Upper bound used for clipping the Lagrangian multiplier in discrete action spaces.
alpha_mean_scale (float) – Mean scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_var_scale (float) – Variance scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_mean_max (float) – Upper bound used for clipping the mean Lagrangian multiplier in continuous action spaces.
alpha_var_max (float) – Upper bound used for clipping the variance Lagrangian multiplier in continuous action spaces.
mstep_iterations (int) – The number of iterations of the M-step
sample_action_num (int) – For continuous action spaces, number of samples used to compute expected Q scores.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
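The role of dual_constraint and sample_action_num in the E-step can be illustrated with the temperature dual from the MPO paper. This is a sketch under the assumption that the dual has the form g(eta) = eta * epsilon + eta * mean_s log(mean_a exp(Q(s, a) / eta)); the function names are hypothetical, Q-values are given as nested lists (one row per state, one column per sampled action), and the minimization over eta is not shown.

```python
import math


def dual_function(eta, q_values, dual_constraint):
    """E-step dual g(eta), evaluated with a max-shift for numerical stability."""
    per_state = []
    for q_row in q_values:
        q_max = max(q_row)
        mean_exp = sum(math.exp((q - q_max) / eta) for q in q_row) / len(q_row)
        # eta * log(mean(exp(Q / eta))) == q_max + eta * log(mean(exp((Q - q_max) / eta)))
        per_state.append(q_max + eta * math.log(mean_exp))
    return eta * dual_constraint + sum(per_state) / len(per_state)


def action_weights(eta, q_row):
    """Softmax of Q / eta over the sampled actions of one state."""
    q_max = max(q_row)
    exps = [math.exp((q - q_max) / eta) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]


g = dual_function(eta=1.0, q_values=[[2.0, 2.0]], dual_constraint=0.1)
weights = action_weights(eta=1.0, q_row=[0.0, 1.0])
```

A smaller eta concentrates the weights on the highest-valued sampled actions, while dual_constraint limits how far the reweighted distribution may move from the current policy.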
Examples
>>> create_algo = MPO.create_factory( lr_q=1e-4, lr_pi=1e-4, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, mini_batch_size=64, num_test_episodes=0, target_update_interval=1)
- acting_step(obs, rhs, done, deterministic=False)[source]
MPO acting function.
- Parameters
obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise randomly sample an action from it.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if policy is not a RNN, rhs will contain zeroes).
other (dict) – Additional MPO predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, previously setting new gradients if provided. Update also target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute loss and compute gradients but don’t do optimization step, return gradients instead.
- Parameters
batch (dict) – data batch containing all required tensors to compute MPO losses.
grads_to_cpu (bool) – Whether to move gradient tensors to CPU, which is needed if they will be sent to another node.
- Returns
grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current MPO iteration information.
- compute_loss_pi(batch, per_weights=1)[source]
Calculate MPO policy loss.
- Parameters
batch (dict) – Data batch dict containing all required tensors to compute MPO losses.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_policy – MPO policy loss.
- Return type
torch.tensor
- compute_loss_q(batch, n_step=1, per_weights=1)[source]
Calculate MPO Q-nets loss
- Parameters
batch (dict) – Data batch dict containing all required tensors to compute MPO losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.
- Returns
loss_q (torch.tensor) – Q-net loss.
errors (torch.tensor) – TD errors.
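A truncated n-step return and the resulting TD error can be sketched as follows. This assumes the standard target, the discounted sum of n rewards plus a discounted bootstrap value from a target Q-net; the helper names are hypothetical and the exact tensors the library reads from batch are not shown here.

```python
def n_step_target(rewards, bootstrap_q, done, gamma=0.99):
    """Discounted sum of n rewards plus a bootstrapped value after n steps.

    `done` is 1.0 if the episode ended within the n steps, which removes
    the bootstrap term.
    """
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * (1.0 - done) * bootstrap_q


def td_error(q_pred, target):
    """Difference between the regression target and the current estimate."""
    return target - q_pred


target = n_step_target([1.0, 1.0, 1.0], bootstrap_q=10.0, done=0.0, gamma=0.9)
error = td_error(q_pred=9.0, target=target)
loss_q = error ** 2  # squared error for a single sample
print(target, error, loss_q)
```

With PER, TD errors of this kind are commonly used to refresh sample priorities, which may be why they are returned next to the loss.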
- classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, dual_constraint=0.1, kl_mean_constraint=0.01, kl_var_constraint=0.0001, kl_constraint=0.01, alpha_scale=10.0, alpha_mean_scale=1.0, alpha_var_scale=100.0, alpha_mean_max=0.1, alpha_var_max=10.0, alpha_max=1.0, mstep_iterations=5, sample_action_num=64, max_grad_norm=0.1, policy_loss_addons=[])[source]
Returns a function to create new MPO instances.
- Parameters
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter for target network updates.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Frequency of actor updates, in number of environment steps.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Frequency of target network updates, in number of actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Frequency of test evaluations, in number of actor updates.
dual_constraint (float) – Hard constraint of the dual formulation in the E-step corresponding to [2] p.4 ε.
kl_mean_constraint (float) – Hard constraint of the mean in the M-step corresponding to [2] p.6 ε_μ for continuous action space.
kl_var_constraint (float) – Hard constraint of the covariance in the M-step corresponding to [2] p.6 ε_Σ for continuous action space.
kl_constraint (float) – Hard constraint in the M-step corresponding to [2] p.6 ε_π for discrete action space.
alpha_scale (float) – Scaling factor of the Lagrangian multiplier in the M-step for discrete action spaces.
alpha_max (float) – Upper bound used for clipping the Lagrangian multiplier in discrete action spaces.
alpha_mean_scale (float) – Mean scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_var_scale (float) – Variance scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_mean_max (float) – Upper bound used for clipping the mean Lagrangian multiplier in continuous action spaces.
alpha_var_max (float) – Upper bound used for clipping the variance Lagrangian multiplier in continuous action spaces.
mstep_iterations (int) – The number of iterations of the M-step.
sample_action_num (int) – For continuous action spaces, number of samples used to compute expected Q scores.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
- Returns
create_algo_instance (func) – Function that creates a new MPO class instance.
algo_name (str) – Name of the algorithm.
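How kl_constraint, alpha_scale and alpha_max might interact in the M-step for discrete action spaces can be sketched as below. This assumes a common MPO formulation in which the Lagrangian multiplier takes a scaled step on the constraint violation and is then clipped; `update_alpha` is a hypothetical helper, not a pytorchrl function.

```python
def update_alpha(alpha, kl, kl_constraint=0.01, alpha_scale=10.0, alpha_max=1.0):
    """One update of the Lagrangian multiplier for the KL constraint.

    alpha grows when the measured KL exceeds kl_constraint, shrinks when
    it is below, and is clipped into [0, alpha_max].
    """
    alpha = alpha - alpha_scale * (kl_constraint - kl)
    return min(max(alpha, 0.0), alpha_max)


print(update_alpha(0.0, kl=0.03))   # constraint violated: alpha increases
print(update_alpha(0.95, kl=0.05))  # large violation: alpha clipped at alpha_max
print(update_alpha(0.05, kl=0.0))   # constraint satisfied: alpha clipped at 0.0
```

Under the same assumption, the continuous case would keep two such multipliers, one for the mean and one for the variance, using the corresponding alpha_mean_* and alpha_var_* parameters.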
- property discrete_version
Returns True if action_space is discrete.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the size of the mini batches used in network updates.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
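The described behaviour can be sketched with `hasattr` and `setattr`. `ToyAlgo` is a hypothetical stand-in for the algorithm class, and unknown names are assumed to be silently ignored, which this reference does not state explicitly.

```python
class ToyAlgo:
    """Toy algorithm holding a couple of tunable parameters."""

    def __init__(self):
        self.gamma = 0.99
        self.update_every = 50

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        """Change the attribute only if it already exists on the instance."""
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyAlgo()
algo.update_algorithm_parameter("gamma", 0.95)          # existing attribute: updated
algo.update_algorithm_parameter("not_a_parameter", 1)   # unknown name: ignored
print(algo.gamma, hasattr(algo, "not_a_parameter"))
```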
- property update_every
Returns the number of data samples collected between network update stages.
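The interplay of start_steps, update_every and num_updates can be sketched as a simple schedule. This is an assumption about how the parameters combine, based only on their descriptions; the actual scheduling logic lives outside this class and `update_schedule` is a hypothetical helper.

```python
def update_schedule(total_steps, start_steps=1000, update_every=50, num_updates=50):
    """List (environment_step, number_of_updates) pairs for a training run.

    No updates happen before `start_steps`; afterwards `num_updates`
    consecutive updates run every `update_every` environment steps.
    """
    events = []
    for step in range(1, total_steps + 1):
        if step >= start_steps and step % update_every == 0:
            events.append((step, num_updates))
    return events


print(update_schedule(200, start_steps=100, update_every=50, num_updates=2))
```

When num_updates equals update_every, as in the defaults above, this schedule gives roughly one network update per collected environment step.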