Model-Based

Model Predictive Control (MPC) Random Shooting (RS)

class pytorchrl.agent.algorithms.model_based.mpc_rs.MPC_RS(lr, envs, actor, device, mb_epochs, start_steps, update_every, action_noise, max_grad_norm, mini_batch_size, num_test_episodes, test_every)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Random Shooting (RS) class. Trains a model of the environment and uses RS to select actions.

Parameters
  • lr (float) – Dynamics model learning rate.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – actor class instance.

  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • max_grad_norm (float) – Gradient clipping parameter.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

  • test_every (int) – Regularity of test evaluations.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using random shooting to plan the next action.

Parameters
  • obs (torch.tensor) – Current world observation

  • rhs (dict) – RNN recurrent hidden states.

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted action distribution; otherwise sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (batch) – Actor recurrent hidden state.

  • other (dict) – Additional MPC predictions, which are not used in other algorithms.
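
The planning idea behind random shooting can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: `toy_model`, `random_shooting`, and the `n_candidates`, `horizon`, `lb` and `ub` arguments are hypothetical names introduced here. The sketch samples random action sequences, scores each one with a model, and keeps the first action of the best sequence.

```python
import random


def toy_model(state, action):
    """Hypothetical 1-D dynamics model: returns (next_state, reward)."""
    next_state = state + action
    reward = -abs(next_state)  # reward is highest when the state reaches 0
    return next_state, reward


def random_shooting(state, model, n_candidates=200, horizon=5,
                    lb=-1.0, ub=1.0, rng=None):
    """Return the first action of the best sampled sequence and its return."""
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        # Sample one candidate action sequence uniformly within the bounds.
        actions = [rng.uniform(lb, ub) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s, r = model(s, a)  # roll the model forward one step
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action, best_return


action, predicted_return = random_shooting(state=2.0, model=toy_model)
```

Only the first action is executed; in an MPC loop the whole search would be repeated from the next observation.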

apply_gradients(gradients=None)[source]

Takes an optimization step, first setting new gradients if provided. Also updates target networks.

Parameters

gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Computes the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – data batch containing all required tensors to compute dynamics model losses.

  • grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor_critic gradients.

  • info (dict) – Dict containing current dynamics model iteration information.
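
The split between computing gradients and applying them can be illustrated with a toy model. The sketch below is a hypothetical stand-in (`TinyLinearModel` and the batch keys `"x"` and `"y"` are assumptions made for this example); it mirrors only the calling pattern described above, not the actual dynamics model or optimizer.

```python
class TinyLinearModel:
    """Hypothetical one-parameter model y = w * x trained with squared error."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr
        self._grads = None

    def compute_gradients(self, batch):
        """Compute the loss and its gradient without changing the weights."""
        xs, ys = batch["x"], batch["y"]
        n = len(xs)
        loss = sum((self.w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad_w = sum(2 * (self.w * x - y) * x for x, y in zip(xs, ys)) / n
        self._grads = [grad_w]
        return [grad_w], {"loss": loss}

    def apply_gradients(self, gradients=None):
        """Take one gradient step, using externally supplied gradients if given."""
        if gradients is not None:
            self._grads = gradients
        self.w -= self.lr * self._grads[0]


model = TinyLinearModel()
batch = {"x": [1.0, 2.0], "y": [2.0, 4.0]}
grads, info = model.compute_gradients(batch)
w_before = model.w  # weights are untouched until apply_gradients is called
model.apply_gradients(grads)
```

Keeping the two steps separate lets gradients be computed on one worker and applied on another.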

compute_returns(states: torch.Tensor, actions: torch.Tensor, model: torch.nn.modules.module.Module)[source]

Calculates the trajectory returns.

Parameters
  • states (torch.Tensor) – Trajectory states

  • actions (torch.Tensor) – Trajectory actions

  • model (dynamics Model) – Calculates the next states and rewards

Returns

returns – Trajectory returns of the RS MPC

Return type

torch.Tensor

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, test_every=10, max_grad_norm=0.5, num_test_episodes=3)[source]

Returns a function to create a new Model-Based MPC instance.

Parameters
  • lr (float) – Dynamics model learning rate.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Regularity of test evaluations.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

  • create_algo_instance (func) – Function that creates a new MPC_RS class instance.

  • algo_name (str) – Name of the algorithm.
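
The factory pattern described here can be sketched as a closure that captures the hyperparameters and defers construction. Everything in this sketch is a toy stand-in: the `(device, actor, envs)` signature of the inner function, the dict it returns, and the `"MPC_RS"` name string are assumptions for illustration, not the library code.

```python
def create_factory(lr, start_steps, update_every, mb_epochs):
    """Toy stand-in for a create_factory classmethod (not the library code)."""

    def create_algo_instance(device, actor, envs):
        # Hypothetical signature: runtime objects arrive only at build time.
        return {
            "lr": lr,
            "start_steps": start_steps,
            "update_every": update_every,
            "mb_epochs": mb_epochs,
            "device": device,
            "actor": actor,
            "envs": envs,
        }

    algo_name = "MPC_RS"  # illustrative name string
    return create_algo_instance, algo_name


create_algo_instance, algo_name = create_factory(
    lr=1e-3, start_steps=1000, update_every=50, mb_epochs=5)
algo = create_algo_instance(device="cpu", actor=None, envs=None)
```

Deferring construction this way means the hyperparameters can be fixed once and the instance built later, wherever the device, actor and environments are available.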

property gamma

Returns discount factor gamma.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.

property num_test_episodes

Returns the number of episodes to complete when testing.

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

training_step(batch)[source]

Runs the forward pass of the dynamics model on the training data and computes the loss.

Parameters

batch (dict) – Training data with inputs and labels

Returns

The training loss.

Return type

torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, changes its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.
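
The behaviour described above (change the value only if the attribute exists) can be sketched with `hasattr` and `setattr`. `ToyAlgo` is a hypothetical class used only for this illustration; the library may handle properties and private attributes differently.

```python
class ToyAlgo:
    """Hypothetical algorithm object with two tunable attributes."""

    def __init__(self):
        self.update_every = 50
        self.test_every = 10

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        """Change an existing attribute; silently ignore unknown names."""
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


algo = ToyAlgo()
algo.update_algorithm_parameter("update_every", 100)    # existing attribute
algo.update_algorithm_parameter("not_a_parameter", 1)   # ignored
```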

property update_every

Returns the number of data samples collected between network update stages.

Model Predictive Control (MPC) Cross-Entropy Method (CEM)

class pytorchrl.agent.algorithms.model_based.mpc_cem.MPC_CEM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Cross-Entropy Method (CEM) class. Trains a model of the environment and uses CEM to select actions.

Parameters
  • lr (float) – Dynamics model learning rate.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – actor class instance.

  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • ub (float) – Actions upper bound.

  • lb (float) – Actions lower bound.

  • k_best (int) – Number of best action proposals per iteration.

  • epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.

  • update_alpha – Action distribution mean soft update parameter.

  • iter_update_steps – Number of optimizing action sampling iterations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Regularity of test evaluations.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using CEM to plan the next action.

Parameters
  • obs (torch.tensor) – Current world observation

  • rhs (dict) – RNN recurrent hidden states.

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted action distribution; otherwise sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (batch) – Actor recurrent hidden state.

  • other (dict) – Additional MPC predictions, which are not used in other algorithms.
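
A generic cross-entropy method planner can be sketched as follows. This is a textbook-style illustration, not the library's implementation: `toy_model`, `rollout_return`, `cem_plan`, `horizon`, `n_samples` and `seed` are hypothetical, while `k_best`, `epsilon`, `iter_update_steps`, `lb` and `ub` are used in the spirit of the parameters documented for this class.

```python
import random


def toy_model(state, action):
    """Hypothetical 1-D dynamics model: returns (next_state, reward)."""
    next_state = state + action
    return next_state, -abs(next_state)


def rollout_return(state, actions, model):
    """Sum of predicted rewards along one action sequence."""
    total = 0.0
    for a in actions:
        state, r = model(state, a)
        total += r
    return total


def cem_plan(state, model, horizon=4, n_samples=64, k_best=5,
             iter_update_steps=3, epsilon=0.001, lb=-1.0, ub=1.0, seed=0):
    rng = random.Random(seed)
    mu = [0.0] * horizon   # per-step mean of the sampling distribution
    var = [1.0] * horizon  # per-step variance of the sampling distribution
    for _ in range(iter_update_steps):
        if max(var) < epsilon:  # stop early once the distribution collapsed
            break
        # Sample clipped Gaussian action sequences around the current mean.
        samples = [
            [min(ub, max(lb, rng.gauss(mu[t], var[t] ** 0.5)))
             for t in range(horizon)]
            for _ in range(n_samples)
        ]
        # Keep the k_best sequences with the highest predicted return.
        ranked = sorted(samples,
                        key=lambda seq: rollout_return(state, seq, model),
                        reverse=True)
        elites = ranked[:k_best]
        # Refit mean and variance to the elite set.
        for t in range(horizon):
            column = [seq[t] for seq in elites]
            mu[t] = sum(column) / k_best
            var[t] = sum((a - mu[t]) ** 2 for a in column) / k_best
    return mu[0]  # MPC executes only the first planned action


first_action = cem_plan(state=2.0, model=toy_model)
```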

apply_gradients(gradients=None)[source]

Takes an optimization step, first setting new gradients if provided. Also updates target networks.

Parameters

gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Computes the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – data batch containing all required tensors to compute dynamics model losses.

  • grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor_critic gradients.

  • info (dict) – Dict containing current dynamics model iteration information.

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]

Returns a function to create a new Model-Based MPC instance.

Parameters
  • lr (float) – Dynamics model learning rate.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • ub (float) – Actions upper bound.

  • lb (float) – Actions lower bound.

  • k_best (int) – Number of best action proposals per iteration.

  • epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.

  • update_alpha – Action distribution mean soft update parameter.

  • iter_update_steps – Number of optimizing action sampling iterations.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Regularity of test evaluations.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

  • create_algo_instance (func) – Function that creates a new MPC_CEM class instance.

  • algo_name (str) – Name of the algorithm.

property gamma

Returns discount factor gamma.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.

property num_test_episodes

Returns the number of episodes to complete when testing.

select_k_best(rewards, action_hist)[source]

Selects k action trajectories that led to the highest reward.

Parameters
  • rewards (np.array) – Rewards per rollout

  • action_hist (np.array) – Action history for all rollouts

Returns

  • k_best_rewards (np.array) – Rewards of the k action trajectories with the highest reward

  • elite_actions (np.array) – Best action histories
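
A minimal sketch of top-k selection, written with plain lists instead of NumPy arrays so that it runs without dependencies. The explicit `k_best` argument is an assumption made for this standalone example; in the documented method it is presumably taken from the instance.

```python
def select_k_best(rewards, action_hist, k_best=2):
    """Return the k highest rewards and their matching action sequences."""
    # Indices of the rollouts, sorted from highest to lowest reward.
    order = sorted(range(len(rewards)),
                   key=lambda i: rewards[i], reverse=True)[:k_best]
    k_best_rewards = [rewards[i] for i in order]
    elite_actions = [action_hist[i] for i in order]
    return k_best_rewards, elite_actions


rewards = [1.0, 5.0, -2.0, 3.0]
action_hist = [[0.1, 0.2], [0.9, 0.8], [-0.5, -0.4], [0.3, 0.6]]
k_best_rewards, elite_actions = select_k_best(rewards, action_hist, k_best=2)
```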

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

training_step(batch)[source]

Runs the forward pass of the dynamics model on the training data and computes the loss.

Parameters

batch (dict) – Training data with inputs and labels

Returns

The training loss.

Return type

torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, changes its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.

property update_every

Returns the number of data samples collected between network update stages.

update_gaussians(old_mu, old_var, best_actions)[source]

Updates the mean (mu) and variance (var) of the Gaussian action sampling distribution.

Parameters
  • old_mu (np.array) – Old mean value

  • old_var (np.array) – Old variance value

  • best_actions (np.array) – Action history that led to the highest reward

Returns

  • mu (np.array) – Updated mean values

  • var (np.array) – Updated variance values
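
One common way to update the sampling distribution in CEM is a soft update that blends the old statistics with those of the elite actions. The sketch below assumes that form, with `update_alpha` as the blending weight; the library's exact formula is not shown in this reference and may differ. Plain lists stand in for NumPy arrays.

```python
def update_gaussians(old_mu, old_var, best_actions, update_alpha=0.0):
    """Blend old mean/variance with the elite-set statistics, per time step."""
    k = len(best_actions)
    mu, var = [], []
    for t in range(len(old_mu)):
        column = [seq[t] for seq in best_actions]  # elite actions at step t
        elite_mean = sum(column) / k
        elite_var = sum((a - elite_mean) ** 2 for a in column) / k
        # update_alpha = 0.0 keeps only the elite statistics.
        mu.append(update_alpha * old_mu[t] + (1 - update_alpha) * elite_mean)
        var.append(update_alpha * old_var[t] + (1 - update_alpha) * elite_var)
    return mu, var


mu, var = update_gaussians(
    old_mu=[0.0, 0.0],
    old_var=[1.0, 1.0],
    best_actions=[[1.0, 0.0], [3.0, 0.0]],
    update_alpha=0.5,
)
```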

Model Predictive Control (MPC) Planning with Deep Dynamics Models (PDDM)

class pytorchrl.agent.algorithms.model_based.mpc_pddm.MPC_PDDM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]

Bases: pytorchrl.agent.algorithms.base.Algorithm

Model-Based MPC Planning with Deep Dynamics Models (PDDM) class. Trains a model of the environment and uses PDDM to select actions.

Parameters
  • lr (float) – Dynamics model learning rate.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – actor class instance.

  • device (torch.device) – CPU or specific GPU where class computations will take place.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • gamma (float) – Reward-weighting factor.

  • beta (float) – Action filtering coefficient.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Regularity of test evaluations.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

acting_step(obs, rhs, done, deterministic=False)[source]

Runs the MPC search, using PDDM to plan the next action.

Parameters
  • obs (torch.tensor) – Current world observation

  • rhs (dict) – RNN recurrent hidden states.

  • done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.

  • deterministic (bool) – If True, take the mode of the predicted action distribution; otherwise sample an action from it.

Returns

  • action (torch.tensor) – Predicted next action.

  • clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).

  • rhs (batch) – Actor recurrent hidden state.

  • other (dict) – Additional MPC predictions, which are not used in other algorithms.

apply_gradients(gradients=None)[source]

Takes an optimization step, first setting new gradients if provided. Also updates target networks.

Parameters

gradients (list of tensors) – List of actor gradients.

compute_gradients(batch, grads_to_cpu=True)[source]

Computes the loss and its gradients without taking an optimization step; the gradients are returned instead.

Parameters
  • batch (dict) – data batch containing all required tensors to compute dynamics model losses.

  • grads_to_cpu (bool) – Whether to move the gradients to the CPU, which is required if they will be sent to another node.

Returns

  • grads (list of tensors) – List of actor_critic gradients.

  • info (dict) – Dict containing current dynamics model iteration information.

classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]

Returns a function to create a new Model-Based MPC instance.

Parameters
  • lr (float) – Dynamics model learning rate.

  • mb_epochs (int) – Training epochs for the dynamics model.

  • start_steps (int) – Number of steps collected with initial random policy.

  • update_every (int) – Amount of data collected in between dynamics model updates.

  • action_noise – Exploration noise.

  • mini_batch_size (int) – Size of actor update batches.

  • gamma (float) – Reward-weighting factor.

  • beta (float) – Action filtering coefficient.

  • max_grad_norm (float) – Gradient clipping parameter.

  • test_every (int) – Regularity of test evaluations.

  • num_test_episodes (int) – Number of episodes to complete in each test phase.

Returns

  • create_algo_instance (func) – Function that creates a new MPC_PDDM class instance.

  • algo_name (str) – Name of the algorithm.

property gamma

Returns discount factor gamma.

get_pred_trajectories(states, model)[source]

Calculates the returns when planning given a state and a model.

Parameters
  • states (torch.Tensor) – Initial states that are used for the planning.

  • model (dynamics model nn.Module) – The dynamics model that is used to predict the next state and reward.

Returns

  • actions (np.array) – Action history of the sampled trajectories used for planning.

  • returns (np.array) – Returns of the action trajectories.

property mini_batch_size

Returns the size of the mini batches used in each update.

property num_epochs

Returns the number of times the whole buffer is re-used before data collection proceeds.

property num_mini_batch

Returns the number of mini batches per epoch.

property num_test_episodes

Returns the number of episodes to complete when testing.

sample_actions(past_action)[source]

Samples action trajectories.

Parameters

past_action (np.array) – Previous action mean value.

Returns

actions – Sampled action trajectories.

Return type

np.array
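
In the PDDM formulation, candidate actions are drawn around a mean sequence using time-correlated noise, where `beta` controls how strongly each noise sample is smoothed with the previous one. The sketch below follows that idea under stated assumptions: it takes a full mean sequence `mu` rather than `past_action`, and `n_samples`, `noise_std`, `lb`, `ub` and `seed` are hypothetical arguments. It is not the library's implementation.

```python
import random


def sample_actions(mu, n_samples=4, beta=0.5, noise_std=1.0,
                   lb=-1.0, ub=1.0, seed=0):
    """Sample action sequences around mu using filtered Gaussian noise."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_samples):
        filtered = 0.0  # filtered noise from the previous time step
        seq = []
        for mean_t in mu:
            raw = rng.gauss(0.0, noise_std)
            # Smooth the noise over time: higher beta means less smoothing.
            filtered = beta * raw + (1.0 - beta) * filtered
            seq.append(min(ub, max(lb, mean_t + filtered)))
        trajectories.append(seq)
    return trajectories


trajectories = sample_actions(mu=[0.0, 0.0, 0.0], n_samples=4)
```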

set_weights(actor_weights)[source]

Update actor with the given weights. Update also target networks.

Parameters

actor_weights (dict of tensors) – Dict containing actor weights to be set.

property start_steps

Returns the number of steps to collect with initial random policy.

property test_every

Number of network updates between test evaluations.

training_step(batch)[source]

Runs the forward pass of the dynamics model on the training data and computes the loss.

Parameters

batch (dict) – Training data with inputs and labels

Returns

The training loss.

Return type

torch.Tensor

update_algorithm_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, changes its value to new_parameter_value.

Parameters
  • parameter_name (str) – Worker.algo attribute name

  • new_parameter_value (int or float) – New value for parameter_name.

property update_every

Returns the number of data samples collected between network update stages.

update_mu(action_hist, returns)[source]

Updates the mean value for the action sampling distribution.

Parameters
  • action_hist (np.array) – Action history of the planned trajectories.

  • returns (np.array) – Returns of the planned trajectories.

Returns

mu – Updated mean value.

Return type

np.array
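
A reward-weighted mean update in the style of PDDM can be sketched as a softmax-weighted average of the sampled action sequences, with `gamma` as the reward-weighting factor. This is an assumed form for illustration (the explicit `gamma` argument and the use of plain lists are choices made here); the library's exact computation may differ.

```python
import math


def update_mu(action_hist, returns, gamma=1.0):
    """Average the action sequences, weighted by exp(gamma * return)."""
    max_ret = max(returns)
    # Subtracting the maximum return keeps the exponentials numerically stable.
    weights = [math.exp(gamma * (r - max_ret)) for r in returns]
    total = sum(weights)
    horizon = len(action_hist[0])
    return [
        sum(w * seq[t] for w, seq in zip(weights, action_hist)) / total
        for t in range(horizon)
    ]


# Equal returns give equal weights, so the result is the plain average.
mu = update_mu([[1.0, 0.0], [-1.0, 0.0]], returns=[0.0, 0.0])
```

A larger `gamma` concentrates the weight on the highest-return trajectories, while a value near zero approaches a uniform average.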