Model-Based

Model Predictive Control (MPC) Random Shooting (RS)

class pytorchrl.agent.algorithms.model_based.mpc_rs.MPC_RS(lr, envs, actor, device, mb_epochs, start_steps, update_every, action_noise, max_grad_norm, mini_batch_size, num_test_episodes, test_every)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Random Shooting (RS) class. Trains a model of the environment and uses RS to select actions.

Parameters

lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
max_grad_norm (float) – Gradient clipping parameter.
mini_batch_size (int) – Size of actor update batches.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using random shooting to plan actions.

Parameters

obs (torch.tensor) – Current world observation.
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to take the mode of the predicted distribution instead of randomly sampling an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.
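
The planning step described above can be pictured with a minimal NumPy sketch of random shooting. It is an illustration under stated assumptions, not the MPC_RS implementation: the helper names (random_shooting, toy_evaluate), the uniform sampling, and the 1-D toy dynamics are all hypothetical.

```python
import numpy as np


def toy_evaluate(state, candidates):
    """Hypothetical stand-in for a learned model: s' = s + a, reward = -s'^2."""
    # Positions visited by each candidate sequence, shape (n_candidates, horizon).
    positions = state + np.cumsum(candidates[..., 0], axis=1)
    return -(positions ** 2).sum(axis=1)


def random_shooting(state, evaluate, n_candidates, horizon, action_dim, lb, ub, rng):
    """Return the first action of the best randomly sampled action sequence."""
    # Sample candidate action sequences uniformly within the action bounds.
    candidates = rng.uniform(lb, ub, size=(n_candidates, horizon, action_dim))
    returns = evaluate(state, candidates)  # shape (n_candidates,)
    best = int(np.argmax(returns))
    # MPC executes only the first action, then replans at the next step.
    return candidates[best, 0], returns[best]


rng = np.random.default_rng(0)
action, best_return = random_shooting(
    state=2.0, evaluate=toy_evaluate, n_candidates=1000,
    horizon=5, action_dim=1, lb=-1.0, ub=1.0, rng=rng,
)
```

Only the first action of the best sequence is returned, which mirrors the receding-horizon idea behind MPC: the remaining planned actions are discarded and planning is repeated from the next observation.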

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute dynamics model losses.
grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required when they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current dynamics model iteration information.
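
The split between compute_gradients and apply_gradients can be sketched in PyTorch as follows. This is a hedged illustration only: the linear stand-in model, the MSE loss, the Adam optimizer, and the batch keys "inputs" and "labels" are assumptions made for the example and are not taken from the library.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 2)  # hypothetical stand-in for the dynamics model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def compute_gradients(batch, grads_to_cpu=True, max_grad_norm=0.5):
    """Compute the loss and gradients, but do not update the weights."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch["inputs"]), batch["labels"])
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    grads = [p.grad.detach().clone() for p in model.parameters()]
    if grads_to_cpu:
        # Gradients sent to another node are moved to the CPU first.
        grads = [g.cpu() for g in grads]
    return grads, {"loss": loss.item()}


def apply_gradients(gradients=None):
    """Optionally overwrite the stored gradients, then take an optimizer step."""
    if gradients is not None:
        for p, g in zip(model.parameters(), gradients):
            p.grad = g.to(p.device)
    optimizer.step()


batch = {"inputs": torch.randn(8, 3), "labels": torch.randn(8, 2)}
before = [p.detach().clone() for p in model.parameters()]
grads, info = compute_gradients(batch)
apply_gradients(grads)
```

Keeping the two steps separate lets one worker compute gradients while another applies them, which is presumably why the gradients can be moved to the CPU before being returned.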

compute_returns(states: torch.Tensor, actions: torch.Tensor, model: torch.nn.modules.module.Module)[source]

Calculates the trajectory returns.

Parameters

states (torch.Tensor) – Trajectory states.
actions (torch.Tensor) – Trajectory actions.
model (dynamics Model) – Dynamics model used to predict the next states and rewards.

Returns

returns – Trajectory returns of the RS MPC.

Return type

torch.Tensor
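
As a rough picture of what such a return calculation involves, the sketch below rolls a batch of action sequences through a model and sums the predicted rewards. It uses NumPy rather than torch for brevity, and the toy_model function and its (next_states, rewards) return convention are assumptions for illustration.

```python
import numpy as np


def toy_model(states, actions):
    """Hypothetical dynamics model: returns (next_states, rewards)."""
    next_states = states + actions
    rewards = -np.sum(next_states ** 2, axis=1)
    return next_states, rewards


def compute_returns(states, actions, model):
    """Sum predicted rewards along each action sequence.

    states: (n_candidates, obs_dim); actions: (n_candidates, horizon, act_dim).
    """
    returns = np.zeros(states.shape[0])
    for t in range(actions.shape[1]):
        # Feed the predicted next states back into the model at every step.
        states, rewards = model(states, actions[:, t])
        returns += rewards
    return returns


states = np.array([[1.0], [1.0]])
actions = np.array([[[-1.0], [0.0]],
                    [[1.0], [1.0]]])
returns = compute_returns(states, actions, toy_model)
```

In this toy setting the first sequence moves to the origin and stays there, while the second moves away from it, so the first sequence receives the higher return.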

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, test_every=10, max_grad_norm=0.5, num_test_episodes=3)[source]

Returns a function to create a new MPC_RS instance.

Parameters

lr (float) – Dynamics model learning rate.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
mb_epochs (int) – Training epochs for the dynamics model.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
test_every (int) – Regularity of test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

create_algo_instance (func) – Function that creates a new MPC_RS class instance.
algo_name (str) – Name of the algorithm.
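
The factory pattern described here can be sketched with a small stand-in class. ToyMPC is hypothetical, and the argument order of the returned function (device, actor, envs) is an assumption for illustration, not a documented signature.

```python
class ToyMPC:
    """Hypothetical stand-in for an algorithm class such as MPC_RS."""

    def __init__(self, lr, envs, actor, device, mb_epochs):
        self.lr, self.envs, self.actor = lr, envs, actor
        self.device, self.mb_epochs = device, mb_epochs

    @classmethod
    def create_factory(cls, lr, mb_epochs):
        # Hyperparameters are captured now; runtime objects are supplied later.
        def create_algo_instance(device, actor, envs):
            return cls(lr=lr, envs=envs, actor=actor, device=device, mb_epochs=mb_epochs)

        return create_algo_instance, "ToyMPC"


create_algo_instance, algo_name = ToyMPC.create_factory(lr=1e-3, mb_epochs=5)
algo = create_algo_instance(device="cpu", actor="actor-placeholder", envs="envs-placeholder")
```

Deferring construction this way lets the hyperparameters be fixed once while each worker builds its own instance with its own device, actor, and environments.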

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the mini batch size.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update the actor with the given weights. Also update target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

training_step(batch)[source]

Does the forward pass and loss calculation of the dynamics model given the training data.

Parameters: batch (dict) – Training data with inputs and labels
Returns: The training loss.
Return type: torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
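
The behaviour described above amounts to a guarded attribute update. A minimal sketch, using a hypothetical ToyAlgo class rather than the library's own code:

```python
class ToyAlgo:
    """Hypothetical algorithm object with a single tunable attribute."""

    def __init__(self):
        self.update_every = 50

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Only existing attributes are changed; unknown names are ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyAlgo()
algo.update_algorithm_parameter("update_every", 100)
algo.update_algorithm_parameter("not_an_attribute", 1)
```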

property update_every: Returns the number of data samples collected between network update stages.

Model Predictive Control (MPC) Cross-Entropy Method (CEM)

class pytorchrl.agent.algorithms.model_based.mpc_cem.MPC_CEM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Cross-Entropy Method (CEM) class. Trains a model of the environment and uses CEM to select actions.

Parameters

lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
ub (float) – Actions upper bound.
lb (float) – Actions lower bound.
k_best (int) – Number of best action proposals per iteration.
epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.
update_alpha – Action distribution mean soft update parameter.
iter_update_steps – Number of optimizing action sampling iterations.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using CEM to plan actions.

Parameters

obs (torch.tensor) – Current world observation.
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to take the mode of the predicted distribution instead of randomly sampling an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute dynamics model losses.
grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required when they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current dynamics model iteration information.

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]

Returns a function to create a new MPC_CEM instance.

Parameters

lr (float) – Dynamics model learning rate.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
ub (float) – Actions upper bound.
lb (float) – Actions lower bound.
k_best (int) – Number of best action proposals per iteration.
epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.
update_alpha – Action distribution mean soft update parameter.
iter_update_steps – Number of optimizing action sampling iterations.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

create_algo_instance (func) – Function that creates a new MPC_CEM class instance.
algo_name (str) – Name of the algorithm.

property gamma: Returns discount factor gamma.

property mini_batch_size: Returns the mini batch size.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

select_k_best(rewards, action_hist)[source]

Selects the k action trajectories that led to the highest rewards.

Parameters

rewards (np.array) – Rewards per rollout.
action_hist (np.array) – Action history for all rollouts.

Returns

k_best_rewards (np.array) – Rewards of the k action trajectories with the highest reward values.
elite_actions (np.array) – Best action histories
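
Elite selection of this kind can be sketched in a few lines of NumPy. In this illustration k_best is passed explicitly, whereas the documented method takes only rewards and action_hist, and the array shapes are assumptions.

```python
import numpy as np


def select_k_best(rewards, action_hist, k_best):
    """Pick the k rollouts with the highest reward, best first.

    rewards: (n_rollouts,); action_hist: (n_rollouts, horizon, act_dim).
    """
    # argsort is ascending, so take the last k indices and reverse them.
    idx = np.argsort(rewards)[-k_best:][::-1]
    return rewards[idx], action_hist[idx]


rewards = np.array([1.0, 5.0, 3.0, 4.0])
# Each rollout's single action equals its index, to make the selection visible.
action_hist = np.arange(4, dtype=float).reshape(4, 1, 1)
k_best_rewards, elite_actions = select_k_best(rewards, action_hist, k_best=2)
```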

set_weights(actor_weights)[source]

Update the actor with the given weights. Also update target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

training_step(batch)[source]

Does the forward pass and loss calculation of the dynamics model given the training data.

Parameters: batch (dict) – Training data with inputs and labels
Returns: The training loss.
Return type: torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

update_gaussians(old_mu, old_var, best_actions)[source]

Updates the mean (mu) and variance (var) of the Gaussian action sampling distribution.

Parameters

old_mu (np.array) – Old mean value.
old_var (np.array) – Old variance value.
best_actions (np.array) – Action history that led to the highest reward.

Returns

mu (np.array) – Updated mean values
var (np.array) – Updated variance values
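
One common way to perform such an update in CEM is to refit the Gaussian to the elite actions and blend the result with the old statistics. The sketch below assumes a convex combination controlled by update_alpha, applied to both mean and variance; the library's exact update rule may differ.

```python
import numpy as np


def update_gaussians(old_mu, old_var, best_actions, update_alpha=0.0):
    """Refit mean and variance to the elite actions, with an assumed soft update.

    best_actions: (k_best, horizon, act_dim); old_mu, old_var: (horizon, act_dim).
    """
    elite_mu = best_actions.mean(axis=0)
    elite_var = best_actions.var(axis=0)
    # update_alpha = 0 discards the old statistics; values near 1 change them slowly.
    mu = update_alpha * old_mu + (1.0 - update_alpha) * elite_mu
    var = update_alpha * old_var + (1.0 - update_alpha) * elite_var
    return mu, var


old_mu = np.zeros((1, 1))
old_var = np.full((1, 1), 4.0)
best_actions = np.array([[[1.0]], [[3.0]]])  # elite mean 2.0, elite variance 1.0
mu, var = update_gaussians(old_mu, old_var, best_actions, update_alpha=0.5)
```

With update_alpha=0.5 the new statistics sit halfway between the old values and those of the elites.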

Model Predictive Control (MPC) Planning with Deep Dynamics Models (PDDM)

class pytorchrl.agent.algorithms.model_based.mpc_pddm.MPC_PDDM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Planning with Deep Dynamics Models (PDDM) class. Trains a model of the environment and uses PDDM to select actions.

Parameters

lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
gamma (float) – Reward-weighting factor.
beta (float) – Action filtering coefficient.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using PDDM to plan actions.

Parameters

obs (torch.tensor) – Current world observation.
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to take the mode of the predicted distribution instead of randomly sampling an action from it.

Returns

action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Take an optimization step, first setting new gradients if provided. Also update target networks.

Parameters: gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Compute the loss and its gradients without taking an optimization step; return the gradients instead.

Parameters

batch (dict) – Data batch containing all required tensors to compute dynamics model losses.
grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required when they will be sent to another node.

Returns

grads (list of tensors) – List of actor gradients.
info (dict) – Dict containing current dynamics model iteration information.

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]

Returns a function to create a new MPC_PDDM instance.

Parameters

lr (float) – Dynamics model learning rate.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
gamma (float) – Reward-weighting factor.
beta (float) – Action filtering coefficient.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

create_algo_instance (func) – Function that creates a new MPC_PDDM class instance.
algo_name (str) – Name of the algorithm.

property gamma: Returns discount factor gamma.

get_pred_trajectories(states, model)[source]

Calculates the returns when planning given a state and a model.

Parameters

states (torch.Tensor) – Initial states that are used for the planning.
model (dynamics model nn.Module) – The dynamics model that is used to predict the next state and reward.

Returns

actions (np.array) – Action history of the sampled trajectories used for planning.
returns (np.array) – Returns of the action trajectories.

property mini_batch_size: Returns the mini batch size.

property num_epochs: Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch: Returns the number of mini batches per epoch.

property num_test_episodes: Returns the number of episodes to complete when testing.

sample_actions(past_action)[source]

Samples action trajectories.

Parameters: past_action (np.array) – Previous action mean value.
Returns: actions – Sampled action trajectories.
Return type: np.array
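
PDDM-style planners typically perturb a mean action sequence with time-correlated noise, smoothed by a filtering coefficient such as beta. The sketch below follows that idea under stated assumptions: the extra arguments (n_candidates, beta, noise_std, lb, ub, rng), the treatment of past_action as a (horizon, act_dim) mean sequence, and the exact filter form are hypothetical and not taken from the library.

```python
import numpy as np


def sample_actions(past_action, n_candidates, beta, noise_std, lb, ub, rng):
    """Sample action sequences around a mean using filtered Gaussian noise.

    past_action: mean action sequence, shape (horizon, act_dim).
    """
    horizon, act_dim = past_action.shape
    raw = rng.normal(0.0, noise_std, size=(n_candidates, horizon, act_dim))
    noise = np.zeros_like(raw)
    prev = np.zeros((n_candidates, act_dim))
    for t in range(horizon):
        # Low-pass filter: beta = 1 keeps raw noise, smaller beta smooths it over time.
        prev = beta * raw[:, t] + (1.0 - beta) * prev
        noise[:, t] = prev
    # Keep the sampled actions inside the action bounds.
    return np.clip(past_action + noise, lb, ub)


rng = np.random.default_rng(0)
past_action = np.zeros((4, 2))
actions = sample_actions(past_action, n_candidates=16, beta=0.5,
                         noise_std=0.3, lb=-1.0, ub=1.0, rng=rng)
```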

set_weights(actor_weights)[source]

Update the actor with the given weights. Also update target networks.

Parameters: actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps: Returns the number of steps to collect with initial random policy.

property test_every: Number of network updates between test evaluations.

training_step(batch)[source]

Does the forward pass and loss calculation of the dynamics model given the training data.

Parameters: batch (dict) – Training data with inputs and labels
Returns: The training loss.
Return type: torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.

property update_every: Returns the number of data samples collected between network update stages.

update_mu(action_hist, returns)[source]

Updates the mean value for the action sampling distribution.

Parameters

action_hist (np.array) – Action history of the planned trajectories.
returns (np.array) – Returns of the planned trajectories.

Returns

mu – Updated mean value.

Return type

np.array
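
A reward-weighted mean update of this kind can be sketched as a softmax-weighted average of the sampled action sequences, with gamma acting as the reward-weighting factor. The explicit gamma argument and the array shapes are assumptions for illustration; the library's exact formula may differ.

```python
import numpy as np


def update_mu(action_hist, returns, gamma=1.0):
    """Average action sequences, weighting each by exp(gamma * return).

    action_hist: (n_candidates, horizon, act_dim); returns: (n_candidates,).
    """
    # Subtracting the maximum return leaves the weights unchanged after
    # normalization but avoids overflow in the exponential.
    weights = np.exp(gamma * (returns - returns.max()))
    weights /= weights.sum()
    # Contract the candidate axis: result has shape (horizon, act_dim).
    return np.tensordot(weights, action_hist, axes=1)


action_hist = np.array([[[0.0]], [[2.0]]])
returns = np.array([0.0, 0.0])
mu = update_mu(action_hist, returns, gamma=1.0)
```

With equal returns the update reduces to a plain average; as gamma or the gap between returns grows, the mean moves toward the best trajectory.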