On-Policy

Vanilla On-Policy Buffer

class pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer(size, device, actor, algorithm, envs)[source]

Bases: pytorchrl.agent.storages.base.Storage

Storage class for On-Policy algorithms.

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
envs (VecEnv) – Vector of environments instance.

after_gradients(batch, info)[source]

After updating actor policy model, make sure self.step is at 0.

Parameters

batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

before_gradients()[source]: Before updating actor policy model, compute returns and advantages.

compute_advantages(returns, values)[source]: Compute transition advantage values.

compute_returns(rewards, returns, values, dones, gamma)[source]: Compute return values.
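
As a rough illustration of what compute_returns and compute_advantages produce, the following pure-Python sketch computes discounted returns backwards in time and subtracts value predictions to obtain advantages. The helper names, the `bootstrap_value` argument and the convention that `dones[t] == 1` marks the last transition of an episode are assumptions made for this example; the library operates on tensors and its exact conventions may differ.

```python
def discounted_returns(rewards, dones, bootstrap_value, gamma=0.99):
    """Hypothetical helper: R_t = r_t + gamma * (1 - done_t) * R_{t+1}."""
    returns = [0.0] * len(rewards)
    next_return = bootstrap_value  # value estimate for the state after the last step
    for t in reversed(range(len(rewards))):
        # A done flag at step t stops rewards from later episodes leaking in.
        next_return = rewards[t] + gamma * (1.0 - dones[t]) * next_return
        returns[t] = next_return
    return returns


def simple_advantages(returns, values):
    """Hypothetical helper: advantage = return - value prediction."""
    return [r - v for r, v in zip(returns, values)]
```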

classmethod create_factory(size)[source]

Returns a function that creates VanillaOnPolicyBuffer instances.

Parameters: size (int) – Storage capacity along time axis.
Returns: create_buffer_instance – creates a new VanillaOnPolicyBuffer class instance.
Return type: func
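
The factory pattern described here can be sketched as a closure that fixes the storage hyperparameters up front and receives the remaining constructor arguments later. `ToyBuffer` is a hypothetical stand-in class, and the argument order of the returned function is an assumption, not the library's confirmed signature.

```python
class ToyBuffer:
    """Hypothetical stand-in for a storage class (not the real VanillaOnPolicyBuffer)."""

    def __init__(self, size, device, actor, algorithm, envs):
        self.size = size
        self.device = device
        self.actor = actor
        self.algorithm = algorithm
        self.envs = envs

    @classmethod
    def create_factory(cls, size):
        # Capture `size` now; the other components are supplied later.
        def create_buffer_instance(device, actor, algorithm, envs):
            return cls(size, device, actor, algorithm, envs)

        return create_buffer_instance
```

The benefit of this design is that a configuration script can choose `size` once, while the code that owns the actor, algorithm and environments builds the buffer when those objects exist.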

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1, shuffle=True)[source]

Returns a batch iterator to update actor.

Parameters

num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
shuffle (bool) – Whether to shuffle collected data or generate sequential batches.

Yields

batch (dict) – Generated data batches.
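
A minimal sketch of such a batch iterator, written over plain Python lists instead of tensors. The `seed` argument, the flat `data` layout (one list per field, all of equal length) and the way leftover samples are handled are assumptions for illustration only.

```python
import random


def toy_generate_batches(data, num_mini_batch, mini_batch_size,
                         num_epochs=1, shuffle=True, seed=None):
    """Yield dict batches; `data` maps field names to equally long lists."""
    num_samples = len(next(iter(data.values())))
    rng = random.Random(seed)
    for _ in range(num_epochs):
        indices = list(range(num_samples))
        if shuffle:
            rng.shuffle(indices)  # new permutation every epoch
        for i in range(num_mini_batch):
            chunk = indices[i * mini_batch_size:(i + 1) * mini_batch_size]
            if not chunk:
                break  # fewer samples than requested batches
            # Slice every field with the same indices so samples stay aligned.
            yield {key: [values[j] for j in chunk] for key, values in data.items()}
```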

get_all_buffer_data(data_to_cpu=False)[source]

Return all currently stored data.

Parameters: data_to_cpu (bool) – Whether or not to move data to cpu memory.

get_num_channels_obs(sample)[source]

Obtain num_channels_obs and set it as class attribute.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition).

init_tensors(sample)[source]

Lazy initialization of data tensors from a sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition).

insert_data_slice(new_data)[source]

Replace currently stored data.

Parameters: new_data (dict) – Dictionary of env transition samples to replace self.data with.

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition).

normalize_int_rewards()[source]: In order to keep the rewards on a consistent scale, the intrinsic rewards are normalized by dividing them by a running estimate of their standard deviation.
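
The normalization described above can be sketched with a running standard deviation. This example uses Welford's online algorithm, which is an assumption: the source does not state which estimator the library uses. `RunningStd` and `toy_normalize_int_rewards` are hypothetical names.

```python
import math


class RunningStd:
    """Running (population) standard deviation via Welford's algorithm."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.epsilon = epsilon

    def update(self, values):
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no meaningful estimate yet, leave rewards unscaled
        return math.sqrt(self.m2 / self.count)


def toy_normalize_int_rewards(int_rewards, tracker):
    """Update the running estimate, then divide rewards by it."""
    tracker.update(int_rewards)
    scale = max(tracker.std, tracker.epsilon)  # guard against division by zero
    return [r / scale for r in int_rewards]
```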

reset()[source]: Set class counters to zero and remove stored data.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

update_storage_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Attribute name
new_parameter_value (int or float) – New value for parameter_name.
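
The behaviour described for update_storage_parameter amounts to a guarded attribute update. A minimal sketch follows, with hypothetical `ToyAlgorithm` and `ToyStorage` classes; the attribute name under which the real storage keeps its algorithm is not given in the source and is assumed here.

```python
class ToyAlgorithm:
    """Hypothetical algorithm exposing one tunable attribute."""

    def __init__(self):
        self.gamma = 0.99


class ToyStorage:
    """Hypothetical storage holding a reference to its algorithm."""

    def __init__(self, algorithm):
        self.algorithm = algorithm  # assumed attribute name

    def update_storage_parameter(self, parameter_name, new_parameter_value):
        # Only touch attributes the algorithm already has; unknown names are ignored.
        if hasattr(self.algorithm, parameter_name):
            setattr(self.algorithm, parameter_name, new_parameter_value)
```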

Generalized Advantage Estimator (GAE) Buffer

class pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer(size, device, actor, algorithm, envs, gae_lambda=0.95)[source]

Bases: pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer

Storage class for On-Policy algorithms with Generalized Advantage Estimator (GAE). https://arxiv.org/abs/1506.02438

Parameters

size (int) – Storage capacity along time axis.
gae_lambda (float) – GAE lambda parameter.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.

compute_returns(rewards, returns, values, dones, gamma)[source]: Compute return values.
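
For reference, the GAE recursion from the cited paper can be sketched in pure Python as below. The function name, the `bootstrap_value` argument and the done-flag convention (`dones[t] == 1` marks the last transition of an episode) are assumptions for this example and may not match the library's tensor-based implementation.

```python
def toy_gae_returns(rewards, values, dones, bootstrap_value,
                    gamma=0.99, gae_lambda=0.95):
    """Return GAE-based return targets (advantage estimate + value prediction)."""
    num_steps = len(rewards)
    returns = [0.0] * num_steps
    gae = 0.0
    next_value = bootstrap_value
    for t in reversed(range(num_steps)):
        not_done = 1.0 - dones[t]
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        gae = delta + gamma * gae_lambda * not_done * gae
        returns[t] = gae + values[t]
        next_value = values[t]
    return returns
```

With `gae_lambda=1.0` the result reduces to ordinary discounted returns, and with `gae_lambda=0.0` to one-step TD targets, which is a convenient sanity check.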

classmethod create_factory(size, gae_lambda=0.95)[source]

Returns a function that creates GAEBuffer instances.

Parameters

size (int) – Storage capacity along time axis.
gae_lambda (float) – GAE lambda parameter.

Returns

create_buffer_instance – creates a new GAEBuffer class instance.

Return type

func

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

property used_capacity: Returns the step up to which storage is full with env transitions.

V-trace Buffer

class pytorchrl.agent.storages.on_policy.vtrace_buffer.VTraceBuffer(size, device, actor, algorithm, envs)[source]

Bases: pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer

Storage class for On-Policy algorithms with off-policy correction method V-trace (https://arxiv.org/abs/1802.01561).

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.

before_gradients()[source]: Before updating actor policy model, compute returns and advantages.

compute_vtrace(clip_rho_thres=1.0, clip_c_thres=1.0)[source]

Computes V-trace target values and advantage predictions and stores them, along with the updated action log probabilities, in storage.

Parameters

clip_rho_thres (float) – V-trace rho threshold parameter.
clip_c_thres (float) – V-trace c threshold parameter.
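
A pure-Python sketch of the V-trace recursion from the IMPALA paper shows how the two thresholds enter the computation. The function name, argument list and done-flag convention are assumptions for this example; the library method takes only the two thresholds and reads everything else from storage.

```python
import math


def toy_vtrace(behaviour_log_probs, target_log_probs, rewards, values, dones,
               bootstrap_value, gamma=0.99, clip_rho_thres=1.0, clip_c_thres=1.0):
    """Return (V-trace value targets, policy-gradient advantages)."""
    num_steps = len(rewards)
    # Importance ratios pi(a|x) / mu(a|x), truncated by the two thresholds.
    ratios = [math.exp(t - b) for t, b in zip(target_log_probs, behaviour_log_probs)]
    rhos = [min(clip_rho_thres, ratio) for ratio in ratios]
    cs = [min(clip_c_thres, ratio) for ratio in ratios]
    next_values = list(values[1:]) + [bootstrap_value]

    targets = [0.0] * num_steps
    advantages = [0.0] * num_steps
    correction = 0.0              # v_{t+1} - V(x_{t+1})
    next_target = bootstrap_value  # v_{t+1}
    for t in reversed(range(num_steps)):
        discount = gamma * (1.0 - dones[t])
        delta = rhos[t] * (rewards[t] + discount * next_values[t] - values[t])
        correction = delta + discount * cs[t] * correction
        targets[t] = values[t] + correction
        advantages[t] = rhos[t] * (rewards[t] + discount * next_target - values[t])
        next_target = targets[t]
    return targets, advantages
```

When the behaviour and target log probabilities coincide, all ratios equal 1 and the targets reduce to ordinary discounted returns.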

get_updated_action_log_probs()[source]: Computes new log probabilities of actions stored in storage according to current actor version. It also uses the current actor version to update the value predictions.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

Proximal Policy Optimization with Demonstrations Buffer (PPOD)

class pytorchrl.agent.storages.on_policy.ppod_buffer.PPODBuffer(size, device, actor, algorithm, envs, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]

Bases: pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer

Storage class for PPO+D algorithm.

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.
initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.
supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.
target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.
rho (float) – PPO+D rho parameter.
phi (float) – PPO+D phi parameter.
alpha (float) – PPO+D alpha parameter.
gae_lambda (float) – GAE lambda parameter.
total_buffer_demo_capacity (int) – Maximum number of demos to keep between reward and value demos.
save_demos_every (int) – Save top demos every `save_demos_every`th data collection.
num_agent_demos_to_save (int) – Number of top reward demos to save.
initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.
demo_dtypes (dict) – Data types to use for the demos.

after_gradients(batch, info)[source]

After updating actor policy model, make sure self.step is at 0.

Parameters

batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.

Returns: info – info dict updated with relevant info from Storage.
Return type: dict

anneal_parameters()[source]: Update demos probabilities as explained in PPO+D paper.

before_gradients()[source]: Before updating actor policy model, compute returns and advantages.

check_demo_buffer_capacity()[source]: Check total amount of demos. If total amount of demos exceeds self.max_demos, pop demos.

classmethod create_factory(size, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]

Returns a function that creates PPODBuffer instances.

Parameters

size (int) – Storage capacity along time axis.
initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.
initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.
supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.
target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.
rho (float) – PPO+D rho parameter.
phi (float) – PPO+D phi parameter.
alpha (float) – PPO+D alpha parameter.
gae_lambda (float) – GAE lambda parameter.
total_buffer_demo_capacity (int) – Maximum number of demos to keep between reward and value demos.
save_demos_every (int) – Save top demos every `save_demos_every`th data collection.
num_agent_demos_to_save (int) – Number of top reward demos to save.
initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.
demo_dtypes (dict) – Data types to use for the demos.

Returns

create_buffer_instance – creates a new PPODBuffer class instance.

Return type

func

demos_data_fields = ('Observation', 'Action', 'Reward')

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition).

load_demo(demo_path)[source]: Loads and returns an environment demonstration.

load_initial_demos()[source]: Load initial demonstrations. Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!

load_supplementary_demos()[source]: Load demonstrations found in self.supplementary_demos (if any). Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!

on_policy_data_fields = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

sample_demo(env_id)[source]: With probability rho insert reward demos, with probability phi insert value demos.
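
The sampling rule above can be sketched as a single uniform draw split into three intervals. `choose_demo_source` is a hypothetical helper; treating the two events as mutually exclusive, and falling back to a regular environment episode otherwise, are assumptions for this example.

```python
import random


def choose_demo_source(rho, phi, rng=random):
    """Return "reward", "value", or None (no demo, keep acting in the env)."""
    if rho < 0 or phi < 0 or rho + phi > 1:
        raise ValueError("rho and phi must be non-negative and sum to at most 1")
    u = rng.random()  # uniform in [0, 1)
    if u < rho:
        return "reward"  # replay a demo from the reward demos buffer
    if u < rho + phi:
        return "value"   # replay a demo from the value demos buffer
    return None
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.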

save_demos()[source]: Saves the top num_rewards_demos demos from the reward demos buffer and the top num_value_demos demos from the value demos buffer.

track_potential_demos(sample)[source]: Tracks current episodes looking for potential demos.