Model-Based
Model Predictive Control (MPC) Random Shooting (RS)
- class pytorchrl.agent.algorithms.model_based.mpc_rs.MPC_RS(lr, envs, actor, device, mb_epochs, start_steps, update_every, action_noise, max_grad_norm, mini_batch_size, num_test_episodes, test_every)[source]
Bases:
pytorchrl.agent.algorithms.base.Algorithm
Model-Based MPC Random Shooting (RS) class. Trains a model of the environment and uses RS to select actions.
- Parameters
lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
max_grad_norm (float) – Gradient clipping parameter.
mini_batch_size (int) – Size of actor update batches.
num_test_episodes (int) – Number of episodes to complete in each test phase.
test_every (int) – Regularity of test evaluations.
- acting_step(obs, rhs, done, deterministic=False)[source]
Does the MPC search with random shooting action planning process.
- Parameters
obs (torch.tensor) – Current world observation
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to sample the action randomly from the predicted distribution or take its mode.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.
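The planning idea behind random shooting can be sketched as follows. This is a minimal NumPy illustration, not the pytorchrl implementation: `toy_dynamics`, `random_shooting`, `n_candidates`, `horizon`, `lb` and `ub` are hypothetical names, and a hand-written dynamics function stands in for the learned model.

```python
import numpy as np

def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned model: returns (next_state, reward)."""
    next_state = state + action
    reward = -np.abs(next_state).sum(axis=-1)  # reward is highest at the origin
    return next_state, reward

def random_shooting(state, n_candidates=256, horizon=5, lb=-1.0, ub=1.0, seed=0):
    """Sample random action sequences, roll them out, return the best first action."""
    rng = np.random.default_rng(seed)
    action_dim = state.shape[-1]
    # One candidate action sequence per row: (n_candidates, horizon, action_dim).
    actions = rng.uniform(lb, ub, size=(n_candidates, horizon, action_dim))
    # Every candidate starts from the same current state.
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        states, rewards = toy_dynamics(states, actions[:, t])
        returns += rewards
    best = int(np.argmax(returns))
    # MPC executes only the first action of the best sequence, then replans.
    return actions[best, 0], returns[best]

action, best_return = random_shooting(np.array([2.0, -2.0]))
```

Only the first action of the winning sequence is executed; the search is repeated from the next observation.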
- apply_gradients(gradients=None)[source]
Take an optimization step, first setting new gradients if provided. Also update target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute the loss and its gradients without taking an optimization step; return the gradients instead.
- Parameters
batch (dict) – Data batch containing all tensors required to compute the dynamics model losses.
grads_to_cpu (bool) – If the gradient tensors will be sent to another node, they need to be on the CPU.
- Returns
grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current dynamics model iteration information.
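The split between computing gradients and applying them can be sketched in plain PyTorch. This is an illustration under assumptions, not the library code: the linear `model`, the MSE loss, the batch keys `inputs` and `targets`, and the module-level `optimizer` are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)  # hypothetical stand-in for the dynamics model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_grad_norm = 0.5

def compute_gradients(batch, grads_to_cpu=True):
    """Backpropagate the loss and return the gradients without stepping the optimizer."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch["inputs"]), batch["targets"])
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    grads = [p.grad.detach().clone() for p in model.parameters()]
    if grads_to_cpu:
        # CPU tensors can be shipped to another worker or node.
        grads = [g.cpu() for g in grads]
    return grads, {"loss": loss.item()}

def apply_gradients(gradients=None):
    """Optionally overwrite the stored gradients, then take one optimizer step."""
    if gradients is not None:
        for p, g in zip(model.parameters(), gradients):
            p.grad = g.to(p.device)
    optimizer.step()

batch = {"inputs": torch.randn(8, 3), "targets": torch.randn(8, 2)}
weights_before = [p.detach().clone() for p in model.parameters()]
grads, info = compute_gradients(batch)
weights_after_compute = [p.detach().clone() for p in model.parameters()]
apply_gradients(grads)
```

Keeping the two steps separate lets one worker compute gradients while another applies them, which is presumably why `grads_to_cpu` exists.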
- compute_returns(states: torch.Tensor, actions: torch.Tensor, model: torch.nn.modules.module.Module)[source]
Calculates the trajectory returns
- Parameters
states (torch.Tensor) – Trajectory states
actions (torch.Tensor) – Trajectory actions
model (dynamics Model) – Calculates the next states and rewards
- Returns
returns – Trajectory returns of the RS MPC
- Return type
torch.Tensor
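A return computation of this kind might look like the sketch below. The tensor shapes, the `ToyModel` class and its `forward(states, actions)` signature are assumptions made for illustration; the actual dynamics model interface in pytorchrl may differ.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Hypothetical dynamics model: maps (state, action) to (next_state, reward)."""
    def forward(self, states, actions):
        next_states = states + actions
        rewards = -next_states.pow(2).sum(dim=-1)
        return next_states, rewards

def compute_returns(states, actions, model):
    """Sum the predicted rewards along each candidate action sequence.

    states:  (n_candidates, state_dim), the start state repeated per candidate
    actions: (n_candidates, horizon, action_dim)
    """
    returns = torch.zeros(states.shape[0])
    with torch.no_grad():  # planning needs no gradients
        for t in range(actions.shape[1]):
            states, rewards = model(states, actions[:, t])
            returns += rewards
    return returns

start = torch.tensor([[1.0], [1.0]])
plans = torch.tensor([[[-1.0], [0.0]],   # moves to the origin and stays there
                      [[1.0], [1.0]]])   # moves away from the origin
returns = compute_returns(start, plans, ToyModel())  # values: 0 and -13
```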
- classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, test_every=10, max_grad_norm=0.5, num_test_episodes=3)[source]
Returns a function to create a new Model-Based MPC instance.
- Parameters
lr (float) – Dynamics model learning rate.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
mb_epochs (int) – Training epochs for the dynamics model.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
test_every (int) – Regularity of test evaluations.
max_grad_norm (float) – Gradient clipping parameter.
num_test_episodes (int) – Number of episodes to complete in each test phase.
- Returns
create_algo_instance (func) – Function that creates a new MPC_RS class instance.
algo_name (str) – Name of the algorithm.
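The factory pattern described here can be illustrated without the library. In this sketch `ToyAlgo` is a hypothetical class, and the argument order of `create_algo_instance` is an assumption; only the idea of returning a creation function together with the algorithm name comes from the documentation above.

```python
class ToyAlgo:
    """Hypothetical algorithm class used only to illustrate the factory pattern."""

    def __init__(self, lr, envs, actor, device):
        self.lr, self.envs, self.actor, self.device = lr, envs, actor, device

    @classmethod
    def create_factory(cls, lr):
        """Capture hyperparameters now; build the instance later."""
        def create_algo_instance(device, actor, envs):
            # Runtime objects are supplied by whoever eventually builds the agent.
            return cls(lr=lr, envs=envs, actor=actor, device=device)
        return create_algo_instance, "ToyAlgo"

create_algo_instance, algo_name = ToyAlgo.create_factory(lr=1e-3)
algo = create_algo_instance(device="cpu", actor=None, envs=None)
```

Deferring construction this way lets hyperparameters be fixed once while each worker builds its own instance with its own device, actor and environments.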
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the mini-batch size.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- training_step(batch)[source]
Does the forward pass and loss calculation of the dynamics model given the training data.
- Parameters
batch (dict) – Training data with inputs and labels
- Returns
The training loss.
- Return type
torch.Tensor
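A training step of this shape could be sketched as below. The network architecture, the batch keys `inputs` and `labels`, and the choice of mean squared error are assumptions for illustration; the page does not state which loss the dynamics model uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical dynamics network: 4 input features (state and action), 3 outputs
# (for example a next-state prediction plus a reward).
dynamics = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

def training_step(batch):
    """Forward pass and loss computation only; no backward pass or optimizer step."""
    prediction = dynamics(batch["inputs"])
    return nn.functional.mse_loss(prediction, batch["labels"])

batch = {"inputs": torch.randn(32, 4), "labels": torch.randn(32, 3)}
loss = training_step(batch)
```

The returned scalar still carries its computation graph, so a caller can run `loss.backward()` afterwards.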
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
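The behaviour described above amounts to a guarded attribute update. The sketch below uses a hypothetical `ToyAlgo` class with a plain `update_every` attribute; in the real class this value is exposed as a property, so the internal attribute name may differ.

```python
class ToyAlgo:
    """Hypothetical class illustrating a guarded attribute update."""

    def __init__(self):
        self.update_every = 50

    def update_algorithm_parameter(self, parameter_name, new_parameter_value):
        # Only existing attributes are changed; unknown names are ignored.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)

algo = ToyAlgo()
algo.update_algorithm_parameter("update_every", 100)
algo.update_algorithm_parameter("not_a_parameter", 1)
```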
- property update_every
Returns the number of data samples collected between network update stages.
Model Predictive Control (MPC) Cross-Entropy Method (CEM)
- class pytorchrl.agent.algorithms.model_based.mpc_cem.MPC_CEM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]
Bases:
pytorchrl.agent.algorithms.base.Algorithm
Model-Based MPC Cross-Entropy Method (CEM) class. Trains a model of the environment and uses CEM to select actions.
- Parameters
lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
ub (float) – Actions upper bound.
lb (float) – Actions lower bound.
k_best (int) – Number of best action proposals per iteration.
epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.
update_alpha – Action distribution mean soft update parameter.
iter_update_steps – Number of optimizing action sampling iterations.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.
- acting_step(obs, rhs, done, deterministic=False)[source]
Does the MPC search with CEM action planning process.
- Parameters
obs (torch.tensor) – Current world observation
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to sample the action randomly from the predicted distribution or take its mode.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, first setting new gradients if provided. Also update target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute the loss and its gradients without taking an optimization step; return the gradients instead.
- Parameters
batch (dict) – Data batch containing all tensors required to compute the dynamics model losses.
grads_to_cpu (bool) – If the gradient tensors will be sent to another node, they need to be on the CPU.
- Returns
grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current dynamics model iteration information.
- classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, ub=1, lb=-1, k_best=5, epsilon=0.001, update_alpha=0.0, max_grad_norm=0.5, iter_update_steps=3, test_every=10, num_test_episodes=3)[source]
Returns a function to create a new Model-Based MPC instance.
- Parameters
lr (float) – Dynamics model learning rate.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
ub (float) – Actions upper bound.
lb (float) – Actions lower bound.
k_best (int) – Number of best action proposals per iteration.
epsilon (float) – Threshold to stop the training iteration earlier if the action variance is very low.
update_alpha – Action distribution mean soft update parameter.
iter_update_steps – Number of optimizing action sampling iterations.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.
- Returns
create_algo_instance (func) – Function that creates a new MPC_CEM class instance.
algo_name (str) – Name of the algorithm.
- property gamma
Returns discount factor gamma.
- property mini_batch_size
Returns the mini-batch size.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- select_k_best(rewards, action_hist)[source]
Selects k action trajectories that led to the highest reward.
- Parameters
rewards (np.array) – Rewards per rollout
action_hist (np.array) – Action history for all rollouts
- Returns
k_best_rewards (np.array) – Rewards of the k action trajectories that achieved the highest reward.
elite_actions (np.array) – Best action histories
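Elite selection of this kind is typically a sort followed by a slice. The sketch below is a NumPy illustration under assumptions: `k_best` is passed as an argument here, whereas the class takes it at construction time, and the array shapes are hypothetical.

```python
import numpy as np

def select_k_best(rewards, action_hist, k_best=2):
    """Return the k highest rewards and the action histories that produced them."""
    rewards = np.asarray(rewards).reshape(-1)
    # argsort is ascending, so the last k indices belong to the best rollouts.
    elite_idx = np.argsort(rewards)[-k_best:]
    return rewards[elite_idx], action_hist[elite_idx]

rewards = np.array([0.1, 2.0, -1.0, 1.5])
# Four rollouts, horizon 3, one action dimension.
action_hist = np.arange(4 * 3 * 1).reshape(4, 3, 1)
k_best_rewards, elite_actions = select_k_best(rewards, action_hist, k_best=2)
# k_best_rewards holds 1.5 and 2.0, taken from rollouts 3 and 1.
```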
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- training_step(batch)[source]
Does the forward pass and loss calculation of the dynamics model given the training data.
- Parameters
batch (dict) – Training data with inputs and labels
- Returns
The training loss.
- Return type
torch.Tensor
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.
- update_gaussians(old_mu, old_var, best_actions)[source]
Updates the mu and var value for the gaussian action sampling method.
- Parameters
old_mu (np.array) – Old mean value
old_var (np.array) – Old variance value
best_actions (np.array) – Action history that led to the highest reward
- Returns
mu (np.array) – Updated mean values
var (np.array) – Updated variance values
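A common way to refit the sampling distribution in CEM is to take the mean and variance of the elite actions and blend them with the old values. The sketch below assumes that `update_alpha` acts as the blending weight on the old statistics and that it is applied to both mean and variance; the page only describes it as a soft update parameter for the mean, so treat the variance blending as an assumption.

```python
import numpy as np

def update_gaussians(old_mu, old_var, best_actions, update_alpha=0.0):
    """Refit mean and variance to the elite actions, with an optional soft update."""
    new_mu = best_actions.mean(axis=0)
    new_var = best_actions.var(axis=0)
    # update_alpha = 0 replaces the old statistics; values near 1 change them slowly.
    mu = update_alpha * old_mu + (1.0 - update_alpha) * new_mu
    var = update_alpha * old_var + (1.0 - update_alpha) * new_var
    return mu, var

old_mu, old_var = np.zeros(2), np.ones(2)
best_actions = np.array([[1.0, 0.0],
                         [3.0, 0.0]])
mu_hard, var_hard = update_gaussians(old_mu, old_var, best_actions, update_alpha=0.0)
mu_soft, var_soft = update_gaussians(old_mu, old_var, best_actions, update_alpha=0.5)
```

With `update_alpha=0.0` the result is the elite mean `[2, 0]` and variance `[1, 0]`; with `0.5` the old and new statistics are averaged.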
Model Predictive Control (MPC) Planning with Deep Dynamics Models (PDDM)
- class pytorchrl.agent.algorithms.model_based.mpc_pddm.MPC_PDDM(lr, envs, actor, device, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]
Bases:
pytorchrl.agent.algorithms.base.Algorithm
Model-Based MPC Planning with Deep Dynamics Models (PDDM) class. Trains a model of the environment and uses PDDM to select actions.
- Parameters
lr (float) – Dynamics model learning rate.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – actor class instance.
device (torch.device) – CPU or specific GPU where class computations will take place.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
gamma (float) – Reward-weighting factor.
beta (float) – Action filtering coefficient.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.
- acting_step(obs, rhs, done, deterministic=False)[source]
Does the MPC search with PDDM action planning process.
- Parameters
obs (torch.tensor) – Current world observation
rhs (dict) – RNN recurrent hidden states.
done (torch.tensor) – 1.0 if current obs is the last one in the episode, else 0.0.
deterministic (bool) – Whether to sample the action randomly from the predicted distribution or take its mode.
- Returns
action (torch.tensor) – Predicted next action.
clipped_action (torch.tensor) – Predicted next action (clipped to be within action space).
rhs (batch) – Actor recurrent hidden state.
other (dict) – Additional MPC predictions, which are not used in other algorithms.
- apply_gradients(gradients=None)[source]
Take an optimization step, first setting new gradients if provided. Also update target networks.
- Parameters
gradients (list of tensors) – List of actor gradients.
- compute_gradients(batch, grads_to_cpu=True)[source]
Compute the loss and its gradients without taking an optimization step; return the gradients instead.
- Parameters
batch (dict) – Data batch containing all tensors required to compute the dynamics model losses.
grads_to_cpu (bool) – If the gradient tensors will be sent to another node, they need to be on the CPU.
- Returns
grads (list of tensors) – List of actor_critic gradients.
info (dict) – Dict containing current dynamics model iteration information.
- classmethod create_factory(lr, start_steps, update_every, mb_epochs, action_noise, mini_batch_size, gamma=1.0, beta=0.5, max_grad_norm=0.5, test_every=10, num_test_episodes=3)[source]
Returns a function to create a new Model-Based MPC instance.
- Parameters
lr (float) – Dynamics model learning rate.
mb_epochs (int) – Training epochs for the dynamics model.
start_steps (int) – Number of steps collected with initial random policy.
update_every (int) – Amount of data collected in between dynamics model updates.
action_noise – Exploration noise.
mini_batch_size (int) – Size of actor update batches.
gamma (float) – Reward-weighting factor.
beta (float) – Action filtering coefficient.
max_grad_norm (float) – Gradient clipping parameter.
test_every (int) – Regularity of test evaluations.
num_test_episodes (int) – Number of episodes to complete in each test phase.
- Returns
create_algo_instance (func) – Function that creates a new MPC_PDDM class instance.
algo_name (str) – Name of the algorithm.
- property gamma
Returns discount factor gamma.
- get_pred_trajectories(states, model)[source]
Calculates the returns when planning given a state and a model.
- Parameters
states (torch.Tensor) – Initial states that are used for the planning.
model (dynamics model nn.Module) – The dynamics model that is used to predict the next state and reward.
- Returns
actions (np.array) – Action history of the sampled trajectories used for planning.
returns (np.array) – Returns of the action trajectories.
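Once actions and returns are available, PDDM-style planners usually combine the candidates with a softmax weighting over returns, scaled by the reward-weighting factor `gamma`. The sketch below assumes that weighting scheme; `reward_weighted_mean` is a hypothetical helper, and this page does not confirm that pytorchrl aggregates trajectories exactly this way.

```python
import numpy as np

def reward_weighted_mean(actions, returns, gamma=1.0):
    """Average candidate action sequences with weights proportional to exp(gamma * return).

    actions: (n_candidates, horizon, action_dim)
    returns: (n_candidates,)
    """
    # Subtract the max return before exponentiating for numerical stability.
    weights = np.exp(gamma * (returns - returns.max()))
    weights /= weights.sum()
    # Weighted sum over the candidate axis gives a (horizon, action_dim) plan.
    return np.tensordot(weights, actions, axes=(0, 0))

actions = np.array([[[1.0]], [[-1.0]]])  # two candidates, horizon 1, one action dim
equal_plan = reward_weighted_mean(actions, np.array([0.0, 0.0]))
greedy_plan = reward_weighted_mean(actions, np.array([100.0, 0.0]))
```

Equal returns give the plain average of the candidates, while a large return gap makes the plan follow the best candidate almost exclusively. A larger `gamma` sharpens that preference.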
- property mini_batch_size
Returns the mini-batch size.
- property num_epochs
Returns the number of times the whole buffer is re-used before data collection proceeds.
- property num_mini_batch
Returns the number of mini batches per epoch.
- property num_test_episodes
Returns the number of episodes to complete when testing.
- sample_actions(past_action)[source]
Samples action trajectories.
- Parameters
past_action (np.array) – Previous action mean value.
- Returns
actions – Sampled action trajectories.
- Return type
np.array
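In PDDM-style planning, the sampled noise is usually smoothed over time by a filtering coefficient `beta` before being added to the previous action mean. The sketch below assumes that scheme; `n_samples`, `horizon`, `noise_std`, `lb`, `ub` and the clipping step are hypothetical choices, not values taken from pytorchrl.

```python
import numpy as np

def sample_actions(past_action, n_samples=4, horizon=3, beta=0.5,
                   noise_std=0.3, lb=-1.0, ub=1.0, seed=0):
    """Sample action trajectories around past_action with time-correlated noise."""
    rng = np.random.default_rng(seed)
    action_dim = past_action.shape[-1]
    raw = rng.normal(0.0, noise_std, size=(n_samples, horizon, action_dim))
    filtered = np.zeros_like(raw)
    prev = np.zeros((n_samples, action_dim))
    for t in range(horizon):
        # Exponential filter: beta weights fresh noise, (1 - beta) the previous step.
        prev = beta * raw[:, t] + (1.0 - beta) * prev
        filtered[:, t] = prev
    return np.clip(past_action + filtered, lb, ub)

past_action = np.array([0.2])
actions = sample_actions(past_action)
no_noise = sample_actions(past_action, beta=0.0)
```

A smaller `beta` yields smoother trajectories. In this sketch `beta=0.0` suppresses the noise entirely, so every sampled action equals `past_action`.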
- set_weights(actor_weights)[source]
Update actor with the given weights. Update also target networks.
- Parameters
actor_weights (dict of tensors) – Dict containing actor weights to be set.
- property start_steps
Returns the number of steps to collect with initial random policy.
- property test_every
Number of network updates between test evaluations.
- training_step(batch)[source]
Does the forward pass and loss calculation of the dynamics model given the training data.
- Parameters
batch (dict) – Training data with inputs and labels
- Returns
The training loss.
- Return type
torch.Tensor
- update_algorithm_parameter(parameter_name, new_parameter_value)[source]
If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.
- Parameters
parameter_name (str) – Worker.algo attribute name
new_parameter_value (int or float) – New value for parameter_name.
- property update_every
Returns the number of data samples collected between network update stages.