Off-policy

Double Deep Q-Learning (DDQN)

class pytorchrl.agent.algorithms.off_policy.ddqn.DDQN(device, envs, actor, lr=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=5000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, initial_epsilon=1.0, epsilon_decay=0.999, target_update_interval=1, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Double Deep Q-Learning algorithm class.

Algorithm class to execute DQN with a target network, from Mnih et al. (https://www.nature.com/articles/nature14236?wm=book_wap_0005).

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (ActorCritic) – Actor class instance.
lr (float) – Learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Number of actor Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
initial_epsilon (float) – Initial value of the DQN epsilon parameter.
epsilon_decay (float) – Exponential decay rate for the epsilon parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

acting_step(obs, rhs, done, deterministic=False)[source]

DDQN acting function.

Parameters

obs (torch.tensor) – Current world observation
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (dict) – Actor recurrent hidden states.
other (dict) – Additional DDQN predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update the target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute DQN loss.
grads_to_cpu (bool) – If True, move the gradient tensors to CPU, which is needed when they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current DQN iteration information.
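
The max_grad_norm parameter is described only as a gradient clipping parameter. A common reading is clipping by global L2 norm, as torch.nn.utils.clip_grad_norm_ does. The sketch below illustrates that idea in plain Python under that assumption; the helper name clip_by_global_norm is hypothetical and is not part of pytorchrl.

```python
import math


def clip_by_global_norm(grads, max_grad_norm=0.5):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_grad_norm.

    grads is a list of lists of floats, standing in for per-parameter gradient tensors.
    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Small constant avoids division by zero when all gradients are zero.
    scale = min(1.0, max_grad_norm / (total_norm + 1e-6))
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_grad_norm=0.5)
```

Gradients whose norm is already below the threshold pass through unchanged, so the clipping only affects unusually large updates.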

compute_loss(batch, n_step=1, per_weights=1)[source]

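Calculate DDQN loss.

Parameters

batch (dict) – Data batch dict containing all required tensors to compute DDQN loss.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.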

Returns

loss (torch.tensor) – DDQN loss.
errors (torch.tensor) – TD errors.
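
As a rough illustration of what a Double DQN loss is built from, the sketch below computes the standard Double DQN target for a single transition: the online network selects the next action and the target network evaluates it. This is a framework-free sketch, not the pytorchrl implementation; the function names and the convention that reward already holds the accumulated n-step reward are assumptions.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99, n_step=1):
    """Standard Double DQN target for one transition.

    q_online_next / q_target_next are lists of Q-values for the next observation,
    one entry per discrete action, from the online and target networks.
    """
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... while action evaluation uses the target network.
    bootstrap = q_target_next[best_action]
    return reward + (gamma ** n_step) * (1.0 - done) * bootstrap


def td_error(q_taken, target):
    """Temporal-difference error for the action actually taken."""
    return target - q_taken


target = double_dqn_target(1.0, 0.0, [0.2, 0.8], [0.5, 0.3], gamma=0.5)
error = td_error(q_taken=1.0, target=target)
```

When done is 1.0 the bootstrap term vanishes and the target reduces to the reward.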

classmethod create_factory(lr=0.0001, gamma=0.99, polyak=0.995, num_updates=50, update_every=50, test_every=5000, start_steps=20000, max_grad_norm=0.5, mini_batch_size=64, num_test_episodes=5, epsilon_decay=0.999, initial_epsilon=1.0, target_update_interval=1, policy_loss_addons=[])[source]

Returns a function to create new DDQN instances.

Parameters

lr (float) – Learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Number of actor Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
initial_epsilon (float) – Initial value of the DQN epsilon parameter.
epsilon_decay (float) – Exponential decay rate for the epsilon parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new DDQN class instance.
algo_name (str) – Name of the algorithm.
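
The factory pattern described here can be sketched with a toy class: hyperparameters are captured in a closure, and each worker later calls the returned function with its own device, actor, and environments. ToyFactoryAlgo and the argument order of the inner function are hypothetical; only the two return values follow the description above.

```python
class ToyFactoryAlgo:
    """Hypothetical stand-in for an algorithm class; not part of pytorchrl."""

    def __init__(self, device, envs, actor, lr=1e-4, gamma=0.99):
        self.device = device
        self.envs = envs
        self.actor = actor
        self.lr = lr
        self.gamma = gamma

    @classmethod
    def create_factory(cls, lr=1e-4, gamma=0.99):
        """Capture hyperparameters now, build the instance later inside each worker."""

        def create_algo_instance(device, actor, envs):
            # Assumed signature: the worker supplies device, actor and envs.
            return cls(device=device, envs=envs, actor=actor, lr=lr, gamma=gamma)

        return create_algo_instance, "ToyFactoryAlgo"


create_algo, algo_name = ToyFactoryAlgo.create_factory(lr=3e-4)
algo = create_algo("cpu", actor=None, envs=None)
```

Deferring construction this way lets every worker hold its own algorithm instance while sharing a single hyperparameter configuration.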

set_weights(weights)[source]

Update actor critic with the given weights. Update also target networks.

Parameters: weights (dict of tensors) – Dict containing actor weights to be set.

update_algo_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
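
The behavior described above amounts to a guarded setattr. A minimal sketch, using a hypothetical ToyParamAlgo class rather than the real DDQN:

```python
class ToyParamAlgo:
    """Hypothetical class illustrating a guarded attribute update."""

    def __init__(self):
        self.gamma = 0.99

    def update_algo_parameter(self, parameter_name, new_parameter_value):
        # Only touch attributes that already exist; unknown names are ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyParamAlgo()
algo.update_algo_parameter("gamma", 0.95)       # existing attribute: updated
algo.update_algo_parameter("not_a_param", 1.0)  # unknown attribute: no effect
```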

update_epsilon()[source]
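
No description is given for update_epsilon. Given the initial_epsilon and epsilon_decay parameters, one plausible reading is a multiplicative decay applied to an epsilon-greedy exploration rate. The sketch below shows that scheme under this assumption; decay_epsilon, epsilon_greedy, and min_epsilon are hypothetical names.

```python
import random


def decay_epsilon(epsilon, epsilon_decay=0.999, min_epsilon=0.0):
    """Multiply epsilon by the decay rate, never dropping below min_epsilon."""
    return max(min_epsilon, epsilon * epsilon_decay)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


epsilon = 1.0  # plays the role of initial_epsilon
for _ in range(1000):
    epsilon = decay_epsilon(epsilon)
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon)
```

With a decay rate of 0.999, epsilon falls to roughly 0.37 after 1000 updates, so exploration fades gradually rather than abruptly.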

update_target_networks()[source]: Update actor critic target networks with polyak averaging
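
Polyak averaging moves each target parameter a small step toward its online counterpart. The sketch below uses plain Python floats and assumes the common convention in which polyak weights the old target value; whether pytorchrl uses exactly this convention is an assumption.

```python
def polyak_update(target_params, online_params, polyak=0.995):
    """Return new target parameters: polyak * target + (1 - polyak) * online."""
    return [
        polyak * t + (1.0 - polyak) * o
        for t, o in zip(target_params, online_params)
    ]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = polyak_update(target, online, polyak=0.995)
```

Values of polyak close to 1 make the target network change slowly, which stabilizes the bootstrapped targets.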

Deep Deterministic Policy Gradient (DDPG)

class pytorchrl.agent.algorithms.off_policy.ddpg.DDPG(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Deep Deterministic Policy Gradient algorithm class.

Algorithm class to execute DDPG, from Timothy P. Lillicrap et al., "Continuous Control with Deep Reinforcement Learning" (https://arxiv.org/pdf/1509.02971.pdf). Algorithms are modules generally required by multiple workers, so DDPG.create_factory(…) returns a function that can be passed on to workers to instantiate their own DDPG module.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – DDPG polyak averaging parameter.
num_updates (int) – Number of consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – Number of actor_critic Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor_critic updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> create_algo, algo_name = DDPG.create_factory(
        lr_q=1e-4, lr_pi=1e-4, gamma=0.99, polyak=0.995,
        num_updates=50, update_every=50, test_every=5000, start_steps=20000,
        mini_batch_size=64, num_test_episodes=0, target_update_interval=1)

acting_step(obs, rhs, done, deterministic=False)[source]

DDPG acting function.

Parameters

obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional DDPG predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update the target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute DDPG losses.
grads_to_cpu (bool) – If True, move the gradient tensors to CPU, which is needed when they will be sent to another node.

Returns

grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current DDPG iteration information.

compute_loss_pi(data, per_weights=1)[source]

Calculate DDPG policy loss.

Parameters: data (dict) – Data batch dict containing all required tensors to compute DDPG losses.
Returns: loss_pi – DDPG policy loss.
Return type: torch.tensor

compute_loss_q(data, n_step=1, per_weights=1)[source]

Calculate DDPG Q-nets loss

Parameters

data (dict) – Data batch dict containing all required tensors to compute DDPG losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
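
The n_step parameter refers to a truncated n-step return. As a sketch of the usual definition (not necessarily how pytorchrl assembles it from the batch), the target sums n discounted rewards and then bootstraps from a Q-value estimate at the state reached after n steps. The function names are hypothetical.

```python
def n_step_return(rewards, gamma=0.99):
    """Discounted sum of the first n rewards: sum_k gamma**k * r_k."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret


def n_step_target(rewards, bootstrap_q, done, gamma=0.99):
    """Truncated n-step target: n discounted rewards plus a discounted bootstrap value."""
    n = len(rewards)
    return n_step_return(rewards, gamma) + (gamma ** n) * (1.0 - done) * bootstrap_q


# Three rewards of 1.0, bootstrap value 10.0, gamma = 0.5:
# 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
target = n_step_target([1.0, 1.0, 1.0], bootstrap_q=10.0, done=0.0, gamma=0.5)
```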

classmethod create_factory(lr_q=0.001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]

Returns a function to create new DDPG instances.

Parameters

lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – DDPG polyak averaging parameter.
num_updates (int) – Number of consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – Number of actor_critic Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor_critic updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new DDPG class instance.
algo_name (str) – Name of the algorithm.

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of the mini batches used in each update.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

update_target_networks()[source]: Update actor critic target networks with polyak averaging

Twin Delayed Deep Deterministic (TD3)

class pytorchrl.agent.algorithms.off_policy.td3.TD3(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Twin Delayed Deep Deterministic Policy Gradient algorithm class. Algorithm class to execute TD3, from Scott Fujimoto et al. Addressing Function Approximation Error in Actor-Critic Methods (https://arxiv.org/pdf/1802.09477.pdf).

Algorithms are modules generally required by multiple workers, so TD3.create_factory(…) returns a function that can be passed on to workers to instantiate their own TD3 module.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – TD3 polyak averaging parameter.
num_updates (int) – Number of consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – Number of actor_critic Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor_critic updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> create_algo, algo_name = TD3.create_factory(
        lr_q=1e-3, lr_pi=1e-3, gamma=0.99, polyak=0.995,
        num_updates=50, update_every=50, test_every=5000, start_steps=20000,
        mini_batch_size=100, num_test_episodes=0, target_update_interval=2)

acting_step(obs, rhs, done, deterministic=False)[source]

TD3 acting function.

Parameters

obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional TD3 predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update the target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute TD3 losses.
grads_to_cpu (bool) – If True, move the gradient tensors to CPU, which is needed when they will be sent to another node.

Returns

grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current TD3 iteration information.

compute_loss_pi(data, per_weights=1)[source]

Calculate TD3 policy loss.

Parameters: data (dict) – Data batch dict containing all required tensors to compute TD3 losses.
Returns: loss_pi – TD3 policy loss.
Return type: torch.tensor

compute_loss_q(data, n_step=1, per_weights=1)[source]

Calculate TD3 Q-nets loss

Parameters

data (dict) – Data batch dict containing all required tensors to compute TD3 losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
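
Two ideas from the TD3 paper cited above shape the Q-net target: taking the minimum of the two target Q-values (clipped double-Q) and adding clipped noise to the target action (target policy smoothing). The sketch below shows both on scalars. It is not the pytorchrl implementation, and noise_clip, low, and high are hypothetical parameters.

```python
def smooth_target_action(action, noise, noise_clip=0.5, low=-1.0, high=1.0):
    """Add clipped noise to the target action, then clamp to the action bounds."""
    clipped_noise = max(-noise_clip, min(noise_clip, noise))
    return max(low, min(high, action + clipped_noise))


def td3_target(reward, done, q1_next, q2_next, gamma=0.99, n_step=1):
    """Bootstrap from the smaller of the two target Q-values to curb overestimation."""
    return reward + (gamma ** n_step) * (1.0 - done) * min(q1_next, q2_next)


next_action = smooth_target_action(action=0.8, noise=2.0)  # noise clipped to 0.5, action clamped to 1.0
target = td3_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0, gamma=0.5)
```

Both Q-nets are typically regressed toward the same target, which matches the loss_q1 and loss_q2 entries listed above.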

classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, mini_batch_size=100, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]

Returns a function to create new TD3 instances.

Parameters

lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – TD3 polyak averaging parameter.
num_updates (int) – Number of consecutive actor_critic updates before data collection continues.
update_every (int) – Number of environment steps between actor_critic updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor_critic update batches.
target_update_interval (float) – Number of actor_critic Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor_critic updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new TD3 class instance.
algo_name (str) – Name of the algorithm.

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of the mini batches used in each update.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

update_target_networks()[source]: Update actor critic target networks with polyak averaging

Soft Actor Critic (SAC)

class pytorchrl.agent.algorithms.off_policy.sac.SAC(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, lr_alpha=0.0001, gamma=0.99, polyak=0.995, num_updates=1, update_every=50, test_every=1000, max_grad_norm=0.5, initial_alpha=1.0, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Soft Actor Critic algorithm class.

Algorithm class to execute SAC, from Haarnoja et al. (https://arxiv.org/abs/1812.05905). Algorithms are modules generally required by multiple workers, so SAC.create_factory(…) returns a function that can be passed on to workers to instantiate their own SAC module.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
lr_alpha (float) – Alpha optimizer learning rate.
gamma (float) – Discount factor parameter.
initial_alpha (float) – Initial entropy coefficient value (temperature).
polyak (float) – SAC polyak averaging parameter.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Number of actor Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> create_algo, algo_name = SAC.create_factory(
        lr_q=1e-4, lr_pi=1e-4, lr_alpha=1e-4, gamma=0.99, polyak=0.995,
        num_updates=50, update_every=50, test_every=5000, start_steps=20000,
        mini_batch_size=64, initial_alpha=1.0, num_test_episodes=0, target_update_interval=1)

acting_step(obs, rhs, done, deterministic=False)[source]

SAC acting function.

Parameters

obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional SAC predictions, which are not used in other algorithms.
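
The difference between action and clipped_action, and the effect of the deterministic flag, can be illustrated with a one-dimensional Gaussian policy. This is a toy sketch with hypothetical helper names and assumed action bounds of [-1, 1]; the real policy distribution and bounds come from the actor and the environment.

```python
import random


def select_action(mean, std, deterministic=False, rng=random):
    """Return the Gaussian mode (its mean) or a random sample from it."""
    if deterministic:
        return mean
    return rng.gauss(mean, std)


def clip_action(action, low=-1.0, high=1.0):
    """Clamp a scalar action to the (assumed) action-space bounds."""
    return max(low, min(high, action))


action = select_action(mean=1.7, std=0.2, deterministic=True)
clipped_action = clip_action(action)  # 1.7 lies outside [-1, 1], so it is clamped to 1.0
```

Returning both values lets the unclipped action be stored for learning while the clipped one is sent to the environment.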

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update the target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

calculate_target_entropy()[source]: Calculate SAC target entropy
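
How the target entropy is computed is not stated here. Two heuristics are common in SAC implementations: the negative action dimensionality for continuous actions, and a fraction of the maximum entropy log(num_actions) for discrete actions. The sketch below shows both as an assumption about typical practice, not as a description of this method; the function names and the 0.98 scale are hypothetical defaults.

```python
import math


def continuous_target_entropy(action_dim):
    """Common heuristic for continuous actions: minus the number of action dimensions."""
    return -float(action_dim)


def discrete_target_entropy(num_actions, scale=0.98):
    """Common heuristic for discrete actions: a fraction of the uniform-policy entropy."""
    return scale * math.log(num_actions)


target_continuous = continuous_target_entropy(action_dim=6)
target_discrete = discrete_target_entropy(num_actions=4)
```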

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute SAC losses.
grads_to_cpu (bool) – If True, move the gradient tensors to CPU, which is needed when they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current SAC iteration information.

compute_loss_alpha(log_probs, per_weights=1)[source]

Calculate SAC entropy loss.

Parameters

log_probs (torch.tensor) – Log probability of predicted next action.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

alpha_loss – SAC entropy loss.

Return type

torch.tensor
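
A widely used form of the SAC temperature loss optimizes log_alpha so that the policy entropy tracks a target entropy. The sketch below writes that form on plain floats; it assumes a log_alpha parameterization and is not confirmed to match this method line for line.

```python
def alpha_loss(log_alpha, log_probs, target_entropy, per_weights=1.0):
    """Mean of -w * log_alpha * (log_prob + target_entropy) over the batch."""
    if isinstance(per_weights, (int, float)):
        per_weights = [float(per_weights)] * len(log_probs)
    terms = [
        -w * log_alpha * (lp + target_entropy)
        for lp, w in zip(log_probs, per_weights)
    ]
    return sum(terms) / len(terms)


# Terms are 0.0 and 1.0, so the mean loss is 0.5.
loss = alpha_loss(log_alpha=0.5, log_probs=[-1.0, -3.0], target_entropy=1.0)
```

When the policy entropy (the negative log probability) exceeds the target, gradient descent on this loss lowers log_alpha; when entropy falls below the target, it raises log_alpha.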

compute_loss_pi(batch, per_weights=1)[source]

Calculate SAC policy loss.

Parameters

batch (dict) – Data batch dict containing all required tensors to compute SAC losses.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_pi (torch.tensor) – SAC policy loss.
logp_pi (torch.tensor) – Log probability of predicted next action.

compute_loss_q(batch, n_step=1, per_weights=1)[source]

Calculate SAC Q-nets loss

Parameters

batch (dict) – Data batch dict containing all required tensors to compute SAC losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_q1 (torch.tensor) – Q1-net loss.
loss_q2 (torch.tensor) – Q2-net loss.
loss_q (torch.tensor) – Weighted average of loss_q1 and loss_q2.
errors (torch.tensor) – TD errors.
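
The per_weights argument and the returned errors fit the usual Prioritized Experience Replay pattern: importance sampling weights scale each sample's squared TD error, and the raw errors are returned so the buffer can update priorities. A minimal sketch under that assumption, with a hypothetical function name:

```python
def per_weighted_mse(q_values, targets, per_weights=1.0):
    """Importance-weighted mean squared TD error, plus the per-sample TD errors."""
    n = len(q_values)
    if isinstance(per_weights, (int, float)):
        # A scalar weight (the default 1.0) means uniform sampling.
        per_weights = [float(per_weights)] * n
    errors = [t - q for q, t in zip(q_values, targets)]
    loss = sum(w * e * e for w, e in zip(per_weights, errors)) / n
    return loss, errors


loss, errors = per_weighted_mse([1.0, 2.0], [2.0, 4.0], per_weights=[1.0, 0.5])
```

With per_weights left at 1.0 the expression reduces to the ordinary mean squared error.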

classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, lr_alpha=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, max_grad_norm=0.5, initial_alpha=1.0, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, policy_loss_addons=[])[source]

Returns a function to create new SAC instances.

Parameters

lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
lr_alpha (float) – Alpha optimizer learning rate.
gamma (float) – Discount factor parameter.
initial_alpha (float) – Initial entropy coefficient value.
polyak (float) – SAC polyak averaging parameter.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Number of actor Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor updates between test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Returns

create_algo_instance (func) – Function that creates a new SAC class instance.
algo_name (str) – Name of the algorithm.

property discrete_version: Returns True if action_space is discrete.

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of the mini batches used in each update.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

update_target_networks()[source]: Update actor critic target networks with polyak averaging

Maximum a Posteriori Policy Optimization (MPO)

class pytorchrl.agent.algorithms.off_policy.mpo.MPO(device, envs, actor, lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=1.0, num_updates=1, update_every=50, test_every=1000, start_steps=20000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1, dual_constraint=0.1, kl_mean_constraint=0.01, kl_var_constraint=0.0001, kl_constraint=0.01, alpha_scale=10.0, alpha_mean_scale=1.0, alpha_var_scale=100.0, alpha_mean_max=0.1, alpha_var_max=10.0, alpha_max=1.0, mstep_iterations=5, sample_action_num=64, max_grad_norm=0.1, policy_loss_addons=[])[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Maximum a Posteriori Policy Optimization algorithm class.

Algorithm class to execute MPO, from A. Abdolmaleki et al. (https://arxiv.org/abs/1806.06920). Algorithms are modules generally required by multiple workers, so MPO.create_factory(…) returns a function that can be passed on to workers to instantiate their own MPO module.

This code has been adapted from https://github.com/daisatojp/mpo.

Parameters

device (torch.device) – CPU or specific GPU where class computations will take place.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor_critic class instance.
lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – MPO polyak averaging parameter.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Number of environment steps between actor updates.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Number of actor Adam updates between target network updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Number of actor updates between test evaluations.
dual_constraint (float) – Hard constraint of the dual formulation in the E-step corresponding to [2] p.4 ε.
kl_mean_constraint (float) – Hard constraint of the mean in the M-step corresponding to [2] p.6 ε_μ for continuous action space.
kl_var_constraint (float) – Hard constraint of the covariance in the M-step corresponding to [2] p.6 ε_Σ for continuous action space.
kl_constraint (float) – Hard constraint in the M-step corresponding to [2] p.6 ε_π for discrete action space.
alpha_scale (float) – Scaling factor of the Lagrangian multiplier in the M-step for discrete action spaces.
alpha_max (float) – Upper bound used for clipping the Lagrangian multiplier in discrete action spaces.
alpha_mean_scale (float) – Mean scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_var_scale (float) – Variance scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_mean_max (float) – Upper bound used for clipping the mean Lagrangian multiplier in continuous action spaces.
alpha_var_max (float) – Upper bound used for clipping the variance Lagrangian multiplier in continuous action spaces.
mstep_iterations (int) – Number of iterations of the M-step.
sample_action_num (int) – For continuous action spaces, number of samples used to compute expected Q scores.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.

Examples

>>> create_algo, algo_name = MPO.create_factory(
        lr_q=1e-4, lr_pi=1e-4, gamma=0.99, polyak=0.995,
        num_updates=50, update_every=50, test_every=5000, start_steps=20000,
        mini_batch_size=64, num_test_episodes=0, target_update_interval=1)

acting_step(obs, rhs, done, deterministic=False)[source]

MPO acting function.

Parameters

obs (torch.tensor) – Current world observation
rhs (torch.tensor) – RNN recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – If True, take the mode of the predicted distribution; otherwise sample an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (torch.tensor) – Policy recurrent hidden state (if the policy is not an RNN, rhs will contain zeroes).
other (dict) – Additional MPO predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update the target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute MPO losses.
grads_to_cpu (bool) – Whether to move the gradient tensors to the CPU, which is required if they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current MPO iteration information.
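
Separating gradient computation from gradient application lets one process compute gradients and another apply them. The sketch below shows that split on a toy one-parameter model in plain Python. It is an illustration under stated assumptions, not the MPO implementation: the free functions, the `weights` dict and the learning rate are hypothetical.

```python
def compute_gradients(weights, batch):
    """Return gradients of a mean squared error loss for y = w * x, plus an info dict."""
    w = weights["w"]
    n = len(batch)
    # d/dw mean((w * x - y)^2) = mean(2 * (w * x - y) * x)
    grad_w = sum(2.0 * (w * x - y) * x for x, y in batch) / n
    loss = sum((w * x - y) ** 2 for x, y in batch) / n
    return {"w": grad_w}, {"loss": loss}


def apply_gradients(weights, gradients, lr=0.1):
    """Take one gradient descent step using externally computed gradients."""
    return {name: value - lr * gradients[name] for name, value in weights.items()}


weights = {"w": 0.0}
batch = [(1.0, 2.0), (2.0, 4.0)]  # samples of y = 2 * x
grads, info = compute_gradients(weights, batch)  # could run on a remote worker
new_weights = apply_gradients(weights, grads)    # could run on a central learner
```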

compute_loss_pi(batch, per_weights=1)[source]

Calculate MPO policy loss.

Parameters

batch (dict) – Data batch dict containing all required tensors to compute MPO losses.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_policy – MPO policy loss.

Return type

torch.tensor
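
As background for the policy loss, the MPO E-step reweights sampled actions by a softmax of their Q-values with temperature η, and the M-step fits the policy to those weights. The sketch below shows that weighting in plain Python. It follows the general MPO formulation and is only an assumption about how this class computes its loss; `e_step_weights`, `weighted_policy_loss` and `eta` are hypothetical names.

```python
import math


def e_step_weights(q_values, eta):
    """Softmax of Q-values over sampled actions with temperature eta."""
    scaled = [q / eta for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_policy_loss(weights, log_probs):
    """Negative weighted log-likelihood of the sampled actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


q_values = [1.0, 2.0, 3.0]  # Q-values of three sampled actions in one state
weights = e_step_weights(q_values, eta=1.0)
log_probs = [math.log(0.2), math.log(0.3), math.log(0.5)]
loss = weighted_policy_loss(weights, log_probs)
```

A smaller temperature concentrates the weights on the highest-valued actions, while a larger one spreads them more evenly.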

compute_loss_q(batch, n_step=1, per_weights=1)[source]

Calculate MPO Q-nets loss.

Parameters

batch (dict) – Data batch dict containing all required tensors to compute MPO losses.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
per_weights – Prioritized Experience Replay (PER) importance sampling weights or 1.0.

Returns

loss_q (torch.tensor) – Q-net loss.
errors (torch.tensor) – TD errors.
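
The roles of n_step and per_weights can be illustrated with a truncated n-step target and a weighted squared TD error. This is a minimal plain-Python sketch under assumptions: the function names are hypothetical, and the exact target and reduction used by the library may differ.

```python
def n_step_target(rewards, bootstrap_value, gamma, done=False):
    """Truncated n-step return: discounted rewards plus a bootstrapped tail."""
    n = len(rewards)
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not done:
        # Bootstrap from the value estimate n steps ahead.
        ret += (gamma ** n) * bootstrap_value
    return ret


def weighted_q_loss(q_preds, targets, per_weights):
    """Mean of squared TD errors, each scaled by its importance sampling weight."""
    errors = [t - q for q, t in zip(q_preds, targets)]
    loss = sum(w * e ** 2 for w, e in zip(per_weights, errors)) / len(errors)
    return loss, errors


target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
loss, errors = weighted_q_loss([2.0, 3.0], [target, target], per_weights=[1.0, 0.5])
```

The returned TD errors are what a prioritized replay buffer would typically use to refresh sample priorities.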

classmethod create_factory(lr_q=0.0001, lr_pi=0.0001, gamma=0.99, polyak=0.995, num_updates=50, test_every=5000, update_every=50, start_steps=1000, mini_batch_size=64, num_test_episodes=5, target_update_interval=1.0, dual_constraint=0.1, kl_mean_constraint=0.01, kl_var_constraint=0.0001, kl_constraint=0.01, alpha_scale=10.0, alpha_mean_scale=1.0, alpha_var_scale=100.0, alpha_mean_max=0.1, alpha_var_max=10.0, alpha_max=1.0, mstep_iterations=5, sample_action_num=64, max_grad_norm=0.1, policy_loss_addons=[])[source]

Returns a function to create new MPO instances.

Parameters

lr_pi (float) – Policy optimizer learning rate.
lr_q (float) – Q-nets optimizer learning rate.
gamma (float) – Discount factor parameter.
polyak (float) – Polyak averaging parameter for target network updates.
num_updates (int) – Number of consecutive actor updates before data collection continues.
update_every (int) – Interval between actor update stages, in number of environment steps.
start_steps (int) – Number of initial random environment steps before learning starts.
mini_batch_size (int) – Size of actor update batches.
target_update_interval (float) – Interval between target network updates, in number of actor Adam updates.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Interval between test evaluations, in number of actor updates.
dual_constraint (float) – Hard constraint of the dual formulation in the E-step corresponding to [2] p.4 ε.
kl_mean_constraint (float) – Hard constraint of the mean in the M-step corresponding to [2] p.6 ε_μ for continuous action space.
kl_var_constraint (float) – Hard constraint of the covariance in the M-step corresponding to [2] p.6 ε_Σ for continuous action space.
kl_constraint (float) – Hard constraint in the M-step corresponding to [2] p.6 ε_π for discrete action space.
alpha_scale (float) – Scaling factor of the Lagrangian multiplier in the M-step for discrete action spaces.
alpha_max (float) – Upper bound used for clipping the Lagrangian multiplier in discrete action spaces.
alpha_mean_scale (float) – Mean scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_var_scale (float) – Variance scaling factor of the Lagrangian multiplier in the M-step for continuous action spaces.
alpha_mean_max (float) – Upper bound used for clipping the mean Lagrangian multiplier in continuous action spaces.
alpha_var_max (float) – Upper bound used for clipping the variance Lagrangian multiplier in continuous action spaces.
mstep_iterations (int) – Number of iterations of the M-step.
sample_action_num (int) – For continuous action spaces, number of samples used to compute expected Q scores.
max_grad_norm (float) – Gradient clipping parameter.
policy_loss_addons (list) – List of PolicyLossAddOn components adding loss terms to the algorithm policy loss.
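
The scale and max parameters above suggest a Lagrangian multiplier that is adjusted according to how far the measured KL divergence is from its constraint and then clipped. The sketch below shows one common form of such an update in plain Python. The update rule, the lower bound of 0 and the name `update_multiplier` are assumptions for illustration, not a statement of what this class does.

```python
def update_multiplier(alpha, kl, kl_constraint, alpha_scale, alpha_max):
    """One dual step on a Lagrangian multiplier, clipped to [0, alpha_max]."""
    # alpha grows when the measured KL exceeds the constraint and shrinks otherwise.
    alpha = alpha - alpha_scale * (kl_constraint - kl)
    return min(max(alpha, 0.0), alpha_max)


alpha = 0.5
grown = update_multiplier(alpha, kl=0.03, kl_constraint=0.01, alpha_scale=10.0, alpha_max=1.0)
capped = update_multiplier(alpha, kl=0.2, kl_constraint=0.01, alpha_scale=10.0, alpha_max=1.0)
floored = update_multiplier(alpha, kl=0.0, kl_constraint=0.1, alpha_scale=10.0, alpha_max=1.0)
```

Under this form, a larger scale makes the multiplier react faster to constraint violations, and the max value bounds how strongly the KL term can dominate the policy loss.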

Returns

create_algo_instance (func) – Function that creates a new MPO class instance.
algo_name (str) – Name of the algorithm.
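
The factory pattern described here binds hyperparameters first and receives runtime objects later. The following is a self-contained sketch with a hypothetical `ToyAlgo` class. The arguments of the inner function (`device`, `actor`, `envs`) are assumptions, since the documentation above does not list them.

```python
class ToyAlgo:
    """Hypothetical stand-in for an algorithm class; not part of pytorchrl."""

    def __init__(self, device, envs, actor, gamma):
        self.device = device
        self.envs = envs
        self.actor = actor
        self.gamma = gamma

    @classmethod
    def create_factory(cls, gamma=0.99):
        def create_algo_instance(device, actor, envs):
            # Hyperparameters are captured by the closure;
            # runtime objects are supplied when the instance is built.
            return cls(device=device, envs=envs, actor=actor, gamma=gamma)

        return create_algo_instance, "ToyAlgo"


create_algo_instance, algo_name = ToyAlgo.create_factory(gamma=0.9)
algo = create_algo_instance(device="cpu", actor=None, envs=None)
```

Deferring construction this way lets each worker build its own instance on its own device from the same hyperparameters.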

property discrete_version: Returns True if action_space is discrete.

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the size of the mini batches used in network updates.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update the actor with the given weights. Also update the target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, set its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
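
The behaviour described above amounts to a guarded attribute assignment. A minimal sketch on a hypothetical `ToyParams` class, assuming unknown names are silently ignored:

```python
class ToyParams:
    """Hypothetical holder of algorithm attributes, for illustration only."""

    def __init__(self):
        self.gamma = 0.99

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Only existing attributes are changed; unknown names are ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyParams()
algo.update_algorithm_parameter("gamma", 0.95)  # existing attribute: updated
algo.update_algorithm_parameter("unknown", 1)   # not an attribute: no effect
```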

property update_every: Returns the number of data samples collected between network update stages.
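
How start_steps, update_every and num_updates might interact can be sketched as a simple schedule. The boundary conditions used here (updates begin once start_steps is reached, on steps divisible by update_every) are assumptions, and `updates_due` is a hypothetical helper.

```python
def updates_due(step, start_steps, update_every, num_updates):
    """Number of network updates to run after environment step `step` (1-indexed)."""
    if step < start_steps or step % update_every != 0:
        return 0
    return num_updates


schedule = [
    updates_due(s, start_steps=100, update_every=50, num_updates=50)
    for s in range(1, 201)
]
total_updates = sum(schedule)  # update stages fire at steps 100, 150 and 200
```

Setting num_updates equal to update_every, as in the defaults above, keeps roughly one network update per collected environment step.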

update_target_networks()[source]: Update the actor-critic target networks with Polyak averaging.
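
Polyak averaging moves each target weight a small step toward the corresponding online weight. The sketch below uses plain Python lists and assumes the common convention in which polyak is the fraction of the old target that is kept; `polyak_update` is a hypothetical name.

```python
def polyak_update(target, online, polyak):
    """Move each target weight a fraction (1 - polyak) toward the online weight."""
    return [polyak * t + (1.0 - polyak) * o for t, o in zip(target, online)]


target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = polyak_update(target, online, polyak=0.5)
```

With a value close to 1, such as the default 0.995, the target networks change slowly, which tends to stabilise the bootstrapped Q targets.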