On-Policy

Vanilla On-Policy Buffer

class pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer(size, device, actor, algorithm, envs)[source]

Bases: pytorchrl.agent.storages.base.Storage

Storage class for On-Policy algorithms.

Parameters
  • size (int) – Storage capacity along time axis.

  • device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.

  • actor (Actor) – Actor class instance.

  • algorithm (Algorithm) – Algorithm class instance.

  • envs (VecEnv) – Vector of environments instance.

after_gradients(batch, info)[source]

After updating actor policy model, make sure self.step is at 0.

Parameters
  • batch (dict) – Data batch used to compute the gradients.

  • info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

before_gradients()[source]

Before updating actor policy model, compute returns and advantages.

compute_advantages(returns, values)[source]

Compute transition advantage values.
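
The source does not give the formula. A minimal sketch, assuming the advantage of each transition is its return minus its value prediction, could look as follows. `simple_advantages` is a hypothetical helper written with plain Python lists; it is not part of pytorchrl.

```python
def simple_advantages(returns, values):
    """Hypothetical helper: advantage = return - value estimate, per step."""
    if len(returns) != len(values):
        raise ValueError("returns and values must have the same length")
    return [r - v for r, v in zip(returns, values)]


advantages = simple_advantages([1.0, 2.0, 3.0], [0.5, 2.5, 3.0])
print(advantages)
```

A positive advantage marks a transition that did better than the value model predicted.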

compute_returns(rewards, returns, values, dones, gamma)[source]

Compute return values.
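
As an illustration only, discounted returns are often computed with a backward pass over the collected steps. The sketch below assumes `dones[t]` flags that the episode ended at step t and that a bootstrap value for the state after the last step is available; the actual done-indexing convention in pytorchrl may differ. `discounted_returns` and `last_value` are hypothetical names.

```python
def discounted_returns(rewards, dones, last_value, gamma):
    """Hypothetical helper: R_t = r_t + gamma * (1 - done_t) * R_{t+1}.

    No return is propagated backwards across an episode boundary.
    """
    returns = [0.0] * len(rewards)
    running = last_value  # bootstrap from the state after the last stored step
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        running = rewards[t] + gamma * running * mask
        returns[t] = running
    return returns


returns = discounted_returns(
    rewards=[1.0, 1.0, 1.0],
    dones=[False, True, False],
    last_value=2.0,
    gamma=0.5,
)
print(returns)
```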

classmethod create_factory(size)[source]

Returns a function that creates VanillaOnPolicyBuffer instances.

Parameters

size (int) – Storage capacity along time axis.

Returns

create_buffer_instance – creates a new VanillaOnPolicyBuffer class instance.

Return type

func
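
The factory pattern described above can be sketched with a closure: the storage hyperparameters are fixed when the factory is created, and the remaining arguments are supplied later. `ToyBuffer` is a hypothetical stand-in, not the pytorchrl class, and the argument list of the returned function is an assumption based on the constructor signature.

```python
class ToyBuffer:
    """Hypothetical stand-in for a storage class."""

    def __init__(self, size, device, actor, algorithm, envs):
        self.size = size
        self.device = device
        self.actor = actor
        self.algorithm = algorithm
        self.envs = envs

    @classmethod
    def create_factory(cls, size):
        def create_buffer_instance(device, actor, algorithm, envs):
            # Arguments known only at creation time are supplied here.
            return cls(size, device, actor, algorithm, envs)

        return create_buffer_instance


factory = ToyBuffer.create_factory(size=128)
buffer = factory(device="cpu", actor=None, algorithm=None, envs=None)
print(buffer.size, buffer.device)
```

Deferring construction this way lets the same configuration be reused wherever the device, actor, algorithm, and environments become available.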

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1, shuffle=True)[source]

Returns a batch iterator to update actor.

Parameters
  • num_mini_batch (int) – Number of mini batches per epoch.

  • mini_batch_size (int) – Number of samples contained in each mini batch.

  • num_epochs (int) – Number of epochs.

  • shuffle (bool) – Whether to shuffle collected data or generate sequential batches.

Yields

batch (dict) – Generated data batches.
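
A minimal sketch of such a batch iterator is shown below, using plain Python lists instead of tensors. `iterate_minibatches` is a hypothetical helper; unlike the real method it derives the batch size from `num_mini_batch` alone and takes a `seed` argument for reproducible shuffling.

```python
import random


def iterate_minibatches(data, num_mini_batch, num_epochs=1, shuffle=True, seed=None):
    """Hypothetical helper: yield dict mini batches from equally long lists."""
    rng = random.Random(seed)
    num_samples = len(next(iter(data.values())))
    batch_size = num_samples // num_mini_batch
    if batch_size == 0:
        raise ValueError("num_mini_batch exceeds the number of stored samples")
    for _ in range(num_epochs):
        indices = list(range(num_samples))
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, batch_size * num_mini_batch, batch_size):
            chosen = indices[start:start + batch_size]
            yield {key: [values[i] for i in chosen] for key, values in data.items()}


data = {"Reward": [0, 1, 2, 3, 4, 5], "Done": [0, 0, 1, 0, 0, 1]}
batches = list(iterate_minibatches(data, num_mini_batch=3, num_epochs=2, shuffle=False))
print(len(batches), batches[0])
```

With `shuffle=False` each epoch walks the data in storage order, which is the usual choice when samples must stay sequential, for example with recurrent policies.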

get_all_buffer_data(data_to_cpu=False)[source]

Return all currently stored data.

Parameters

data_to_cpu (bool) – Whether or not to move data to cpu memory.

get_num_channels_obs(sample)[source]

Obtain num_channels_obs and set it as class attribute.

Parameters

sample (dict) – Data sample (containing all tensors of an environment transition)

init_tensors(sample)[source]

Lazy initialization of data tensors from a sample.

Parameters

sample (dict) – Data sample (containing all tensors of an environment transition)

insert_data_slice(new_data)[source]

Replace currently stored data.

Parameters

new_data (dict) – Dictionary of env transition samples to replace self.data with.

insert_transition(sample)[source]

Store new transition sample.

Parameters

sample (dict) – Data sample (containing all tensors of an environment transition)

normalize_int_rewards()[source]

In order to keep the rewards on a consistent scale, the intrinsic rewards are normalized by dividing them by a running estimate of their standard deviation.
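
The normalization can be sketched as follows, assuming the running standard deviation is tracked with Welford's online algorithm and updated with the rewards themselves. The source does not say which estimator or which quantity is used to update it, so `RunningStd`, `normalize_rewards`, and `eps` are hypothetical.

```python
import math


class RunningStd:
    """Hypothetical running standard deviation tracker (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, values):
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no meaningful estimate yet, leave rewards unscaled
        return math.sqrt(self.m2 / self.count)


def normalize_rewards(rewards, tracker, eps=1e-8):
    """Update the tracker, then divide each reward by the running std."""
    tracker.update(rewards)
    return [r / (tracker.std + eps) for r in rewards]


tracker = RunningStd()
normalized = normalize_rewards([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], tracker)
print(tracker.std, normalized[0])
```

Note that only the scale is changed; the rewards are not mean-centered, so their sign is preserved.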

reset()[source]

Set class counters to zero and remove stored data.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

update_storage_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, set it to new_parameter_value.

Parameters
  • parameter_name (str) – Attribute name

  • new_parameter_value (int or float) – New value for parameter_name.
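
The "update only if the attribute exists" behaviour can be sketched with `hasattr` and `setattr`. `ToyStorage` is a hypothetical class, and checking the attribute on the object itself is an assumption; the real method may look it up elsewhere.

```python
class ToyStorage:
    """Hypothetical storage with a single tunable parameter."""

    def __init__(self):
        self.gae_lambda = 0.95

    def update_storage_parameter(self, parameter_name, new_parameter_value):
        # Unknown parameter names are ignored instead of creating new attributes.
        if hasattr(self, parameter_name):
            setattr(self, parameter_name, new_parameter_value)


storage = ToyStorage()
storage.update_storage_parameter("gae_lambda", 0.9)
storage.update_storage_parameter("not_a_parameter", 1.0)
print(storage.gae_lambda, hasattr(storage, "not_a_parameter"))
```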

Generalized Advantage Estimator (GAE) Buffer

class pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer(size, device, actor, algorithm, envs, gae_lambda=0.95)[source]

Bases: pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer

Storage class for On-Policy algorithms with Generalized Advantage Estimator (GAE). https://arxiv.org/abs/1506.02438

Parameters
  • size (int) – Storage capacity along time axis.

  • gae_lambda (float) – GAE lambda parameter.

  • device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor class instance.

  • algorithm (Algorithm) – Algorithm class instance.

compute_returns(rewards, returns, values, dones, gamma)[source]

Compute return values.
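
For reference, the GAE recursion from the paper linked above can be sketched as below, using plain Python lists. `gae_returns` and `last_value` are hypothetical names, and treating `dones[t]` as "the episode ended at step t" is an assumption; the library's own indexing may differ.

```python
def gae_returns(rewards, values, dones, last_value, gamma, gae_lambda):
    """Hypothetical helper returning (returns, advantages).

    delta_t = r_t + gamma * mask_t * V_{t+1} - V_t
    A_t     = delta_t + gamma * lambda * mask_t * A_{t+1}
    R_t     = A_t + V_t
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value of the state after the last stored step
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * gae_lambda * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return returns, advantages


returns, advantages = gae_returns(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    dones=[False, False],
    last_value=0.0,
    gamma=1.0,
    gae_lambda=1.0,
)
print(returns, advantages)
```

With `gae_lambda=1.0` the returns reduce to bootstrapped Monte Carlo returns; with `gae_lambda=0.0` the advantages reduce to one-step TD errors.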

classmethod create_factory(size, gae_lambda=0.95)[source]

Returns a function that creates GAEBuffer instances.

Parameters
  • size (int) – Storage capacity along time axis.

  • gae_lambda (float) – GAE lambda parameter.

Returns

create_buffer_instance – creates a new GAEBuffer class instance.

Return type

func

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

property used_capacity

Returns the step up to which storage is full with env transitions.

V-trace Buffer

class pytorchrl.agent.storages.on_policy.vtrace_buffer.VTraceBuffer(size, device, actor, algorithm, envs)[source]

Bases: pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer

Storage class for On-Policy algorithms with off-policy correction method V-trace (https://arxiv.org/abs/1802.01561).

Parameters
  • size (int) – Storage capacity along time axis.

  • device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor class instance.

  • algorithm (Algorithm) – Algorithm class instance.

before_gradients()[source]

Before updating actor policy model, compute returns and advantages.

compute_vtrace(clip_rho_thres=1.0, clip_c_thres=1.0)[source]

Computes V-trace target values and advantage predictions and stores them, along with the updated action log probabilities, in storage.

Parameters
  • clip_rho_thres (float) – V-trace rho threshold parameter.

  • clip_c_thres (float) – V-trace c threshold parameter.
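
To show how the two thresholds are used, here is a minimal sketch of the V-trace recursion with plain Python lists. It is an illustration under stated assumptions, not the library's implementation: `vtrace_targets`, `behaviour_log_probs`, `target_log_probs`, and `last_value` are hypothetical names, and `dones[t]` is assumed to flag that the episode ended at step t.

```python
import math


def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   dones, last_value, gamma, clip_rho_thres=1.0, clip_c_thres=1.0):
    """Hypothetical helper returning (V-trace targets, advantages)."""
    n = len(rewards)
    # Importance ratios pi(a|x) / mu(a|x), truncated by the two thresholds.
    ratios = [math.exp(t - b) for t, b in zip(target_log_probs, behaviour_log_probs)]
    rhos = [min(clip_rho_thres, r) for r in ratios]
    cs = [min(clip_c_thres, r) for r in ratios]

    vs = [0.0] * n
    advantages = [0.0] * n
    next_value = last_value  # V(x_{t+1})
    next_vs = last_value     # v_{t+1}
    for t in reversed(range(n)):
        mask = 0.0 if dones[t] else 1.0
        delta = rhos[t] * (rewards[t] + gamma * mask * next_value - values[t])
        vs[t] = values[t] + delta + gamma * mask * cs[t] * (next_vs - next_value)
        advantages[t] = rhos[t] * (rewards[t] + gamma * mask * next_vs - values[t])
        next_value, next_vs = values[t], vs[t]
    return vs, advantages


vs, advantages = vtrace_targets(
    behaviour_log_probs=[-0.5, -0.5],
    target_log_probs=[-0.5, -0.5],  # identical policies, so all ratios equal 1
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    dones=[False, False],
    last_value=0.0,
    gamma=1.0,
)
print(vs, advantages)
```

When the behaviour and target policies coincide, as in the usage example, the targets reduce to ordinary bootstrapped n-step returns.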

get_updated_action_log_probs()[source]

Computes new log probabilities of actions stored in storage according to current actor version. It also uses the current actor version to update the value predictions.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

Proximal Policy Optimization with Demonstrations Buffer (PPOD)

class pytorchrl.agent.storages.on_policy.ppod_buffer.PPODBuffer(size, device, actor, algorithm, envs, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]

Bases: pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer

Storage class for PPO+D algorithm.

Parameters
  • size (int) – Storage capacity along time axis.

  • device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.

  • envs (VecEnv) – Vector of environments instance.

  • actor (Actor) – Actor class instance.

  • algorithm (Algorithm) – Algorithm class instance.

  • initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.

  • initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.

  • supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.

  • target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.

  • rho (float) – PPO+D rho parameter.

  • phi (float) – PPO+D phi parameter.

  • alpha (float) – PPO+D alpha parameter.

  • gae_lambda (float) – GAE lambda parameter.

  • total_buffer_demo_capacity (int) – Maximum combined number of reward and value demos to keep.

  • save_demos_every (int) – Save top demos every `save_demos_every`-th data collection.

  • num_agent_demos_to_save (int) – Number of top reward demos to save.

  • initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.

  • demo_dtypes (dict) – Data types to use for the demos.

after_gradients(batch, info)[source]

After updating actor policy model, make sure self.step is at 0.

Parameters
  • batch (dict) – Data batch used to compute the gradients.

  • info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

anneal_parameters()[source]

Update demos probabilities as explained in PPO+D paper.

before_gradients()[source]

Before updating actor policy model, compute returns and advantages.

check_demo_buffer_capacity()[source]

Check total amount of demos. If total amount of demos exceeds self.max_demos, pop demos.

classmethod create_factory(size, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]

Returns a function that creates PPODBuffer instances.

Parameters
  • size (int) – Storage capacity along time axis.

  • initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.

  • initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.

  • supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.

  • target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.

  • rho (float) – PPO+D rho parameter.

  • phi (float) – PPO+D phi parameter.

  • alpha (float) – PPO+D alpha parameter.

  • gae_lambda (float) – GAE lambda parameter.

  • total_buffer_demo_capacity (int) – Maximum combined number of reward and value demos to keep.

  • save_demos_every (int) – Save top demos every `save_demos_every`-th data collection.

  • num_agent_demos_to_save (int) – Number of top reward demos to save.

  • initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.

  • demo_dtypes (dict) – Data types to use for the demos.

Returns

create_buffer_instance – creates a new PPODBuffer class instance.

Return type

func

demos_data_fields = ('Observation', 'Action', 'Reward')

insert_transition(sample)[source]

Store new transition sample.

Parameters

sample (dict) – Data sample (containing all tensors of an environment transition)

load_demo(demo_path)[source]

Loads and returns an environment demonstration.

load_initial_demos()[source]

Load initial demonstrations. Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!

load_supplementary_demos()[source]

Load demonstrations found in the supplementary_demos_dir directory (if any). Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!

on_policy_data_fields = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')

sample_demo(env_id)[source]

With probability rho insert reward demos, with probability phi insert value demos.
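
The selection rule can be sketched with a single uniform draw, assuming that the remaining probability mass (1 - rho - phi) corresponds to ordinary environment interaction. `choose_episode_source` and its return labels are hypothetical; the real method operates on the demo buffers directly.

```python
import random


def choose_episode_source(rho, phi, rng):
    """Hypothetical helper: pick where the next episode's data comes from."""
    if rho < 0 or phi < 0 or rho + phi > 1.0:
        raise ValueError("rho and phi must be non-negative and sum to at most 1")
    draw = rng.random()  # uniform in [0, 1)
    if draw < rho:
        return "reward_demo"
    if draw < rho + phi:
        return "value_demo"
    return "environment"


rng = random.Random(0)
counts = {"reward_demo": 0, "value_demo": 0, "environment": 0}
for _ in range(1000):
    counts[choose_episode_source(rho=0.1, phi=0.3, rng=rng)] += 1
print(counts)
```

Over many draws the three counts approach the proportions rho, phi, and 1 - rho - phi.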

save_demos()[source]

Saves the top num_rewards_demos demos from the reward demos buffer and the top num_value_demos demos from the value demos buffer.

track_potential_demos(sample)[source]

Tracks current episodes, looking for potential demos.