pytorchrl.agent.storages.on_policy package
Submodules
pytorchrl.agent.storages.on_policy.gae_buffer module
- class pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer(size, device, actor, algorithm, envs, gae_lambda=0.95)[source]
Bases:
pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer
Storage class for On-Policy algorithms with Generalized Advantage Estimator (GAE). https://arxiv.org/abs/1506.02438
- Parameters
size (int) – Storage capacity along time axis.
gae_lambda (float) – GAE lambda parameter.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
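As a rough illustration of what the GAE lambda parameter controls, the sketch below computes advantages for a single trajectory in plain Python. The function name `compute_gae`, the `gamma` discount argument and the list-based data layout are assumptions made for this example; they are not taken from the GAEBuffer implementation.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one trajectory.

    rewards[t], values[t] and dones[t] describe step t; dones[t] is True
    when the episode ended at step t. last_value is the value estimate of
    the state that follows the final stored step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    gae = 0.0
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # do not bootstrap across episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * gae_lambda * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With gae_lambda=0 every advantage reduces to the one-step TD error; with gae_lambda=1 it becomes the discounted return minus the value estimate.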
- classmethod create_factory(size, gae_lambda=0.95)[source]
Returns a function that creates GAEBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
gae_lambda (float) – GAE lambda parameter.
- Returns
create_buffer_instance – creates a new GAEBuffer class instance.
- Return type
func
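The factory pattern described here can be sketched as follows. `ToyBuffer` is a hypothetical stand-in class, and the argument list of the returned function (device, actor, algorithm, envs) is an assumption based on the constructor signature above, not a confirmed detail of the library.

```python
class ToyBuffer:
    """Hypothetical stand-in for a storage class with a create_factory method."""

    def __init__(self, size, device, actor, algorithm, envs, gae_lambda=0.95):
        self.size = size
        self.device = device
        self.actor = actor
        self.algorithm = algorithm
        self.envs = envs
        self.gae_lambda = gae_lambda

    @classmethod
    def create_factory(cls, size, gae_lambda=0.95):
        # Hyperparameters are fixed now; runtime objects are supplied later.
        def create_buffer_instance(device, actor, algorithm, envs):
            return cls(size, device, actor, algorithm, envs, gae_lambda)

        return create_buffer_instance


# Usage: configure once, then build the buffer wherever the runtime objects exist.
factory = ToyBuffer.create_factory(size=128, gae_lambda=0.9)
buffer = factory(device="cpu", actor=None, algorithm=None, envs=None)
```

Deferring construction this way lets the same configuration be handed to several workers, each of which creates its own buffer on its own device.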
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')
- property used_capacity
Returns the step up to which storage is full with env transitions.
pytorchrl.agent.storages.on_policy.ppod_buffer module
- class pytorchrl.agent.storages.on_policy.ppod_buffer.PPODBuffer(size, device, actor, algorithm, envs, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]
Bases:
pytorchrl.agent.storages.on_policy.gae_buffer.GAEBuffer
Storage class for PPO+D algorithm.
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.
initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.
supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.
target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.
rho (float) – PPO+D rho parameter.
phi (float) – PPO+D phi parameter.
alpha (float) – PPO+D alpha parameter.
gae_lambda (float) – GAE lambda parameter.
total_buffer_demo_capacity (int) – Maximum number of demos to keep between reward and value demos.
save_demos_every (int) – Save top demos every `save_demos_every`th data collection.
num_agent_demos_to_save (int) – Number of top reward demos to save.
initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.
demo_dtypes (dict) – Data types to use for the demos.
- after_gradients(batch, info)[source]
After updating actor policy model, make sure self.step is at 0.
- Parameters
batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.
- Returns
info – info dict updated with relevant info from Storage.
- Return type
dict
- check_demo_buffer_capacity()[source]
Check the total number of demos. If it exceeds self.max_demos, pop demos.
- classmethod create_factory(size, rho=0.1, phi=0.3, gae_lambda=0.95, alpha=10, total_buffer_demo_capacity=51, initial_human_demos_dir=None, initial_agent_demos_dir=None, supplementary_demos_dir=None, target_agent_demos_dir=None, num_agent_demos_to_save=10, initial_reward_threshold=None, save_demos_every=10, demo_dtypes={'Action': <class 'numpy.float32'>, 'Observation': <class 'numpy.float32'>, 'Reward': <class 'numpy.float32'>})[source]
Returns a function that creates PPODBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
initial_human_demos_dir (str) – Path to directory containing human initial demonstrations.
initial_agent_demos_dir (str) – Path to directory containing other agent initial demonstrations.
supplementary_demos_dir (str) – Path to a directory where additional demos can be added after training has started. These demos will be incorporated into the buffer as bonus agent demos.
target_agent_demos_dir (str) – Path to directory where best reward demonstrations should be saved.
rho (float) – PPO+D rho parameter.
phi (float) – PPO+D phi parameter.
alpha (float) – PPO+D alpha parameter.
gae_lambda (float) – GAE lambda parameter.
total_buffer_demo_capacity (int) – Maximum number of demos to keep between reward and value demos.
save_demos_every (int) – Save top demos every `save_demos_every`th data collection.
num_agent_demos_to_save (int) – Number of top reward demos to save.
initial_reward_threshold (float) – Initial value to use as reward threshold for new demos.
demo_dtypes (dict) – Data types to use for the demos.
- Returns
create_buffer_instance – creates a new PPODBuffer class instance.
- Return type
func
- demos_data_fields = ('Observation', 'Action', 'Reward')
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- load_initial_demos()[source]
Load initial demonstrations. Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!
- load_supplementary_demos()[source]
Load demonstrations found in self.supplementary_demos (if any). Warning: make sure the environment frame_skip and frame_stack hyperparameters are the same as those used to record the demonstrations!
- on_policy_data_fields = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')
- sample_demo(env_id)[source]
With probability rho insert reward demos, with probability phi insert value demos.
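One way to read the rho/phi rule is as a single random draw that picks at most one demo type per call. The sketch below follows that reading; the helper name `choose_demo_kind`, the `rng` argument and the mutually exclusive draw are assumptions for illustration and may differ from what sample_demo actually does.

```python
import random


def choose_demo_kind(rho, phi, rng=random):
    """Pick which kind of demo to replay, or None to act in the environment.

    Assumes rho + phi <= 1 and that the two outcomes are mutually exclusive.
    rng is any object with a random() method returning a float in [0, 1).
    """
    u = rng.random()
    if u < rho:
        return "reward"  # replay a demo kept for its high reward
    if u < rho + phi:
        return "value"  # replay a demo kept for its high value estimate
    return None  # no demo: collect a regular on-policy episode
```

With the defaults rho=0.1 and phi=0.3, this reading would replay a demo in roughly 40% of the calls and leave the remaining 60% to regular environment interaction.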
pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer module
- class pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer(size, device, actor, algorithm, envs)[source]
Bases:
pytorchrl.agent.storages.base.Storage
Storage class for On-Policy algorithms.
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
envs (VecEnv) – Vector of environments instance.
- after_gradients(batch, info)[source]
After updating actor policy model, make sure self.step is at 0.
- Parameters
batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.
- Returns
info – info dict updated with relevant info from Storage.
- Return type
dict
- classmethod create_factory(size)[source]
Returns a function that creates VanillaOnPolicyBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
- Returns
create_buffer_instance – creates a new VanillaOnPolicyBuffer class instance.
- Return type
func
- generate_batches(num_mini_batch, mini_batch_size, num_epochs=1, shuffle=True)[source]
Returns a batch iterator to update actor.
- Parameters
num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
shuffle (bool) – Whether to shuffle the collected data; if False, batches are generated sequentially.
- Yields
batch (dict) – Generated data batches.
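The batch iterator can be pictured with the following plain-Python sketch. It assumes the stored data is a dict of equally long lists and that each epoch yields num_mini_batch batches of mini_batch_size samples; the `data` and `seed` arguments are hypothetical additions for this example, and the real method works on the buffer's own tensors.

```python
import random


def generate_batches(data, num_mini_batch, mini_batch_size,
                     num_epochs=1, shuffle=True, seed=None):
    """Yield dict batches drawn from a dict of equally long lists."""
    num_samples = len(next(iter(data.values())))
    if num_mini_batch * mini_batch_size > num_samples:
        raise ValueError("not enough stored samples for the requested batches")
    rng = random.Random(seed)
    for _ in range(num_epochs):
        indices = list(range(num_samples))
        if shuffle:
            rng.shuffle(indices)  # new sample order for every epoch
        for b in range(num_mini_batch):
            chosen = indices[b * mini_batch_size:(b + 1) * mini_batch_size]
            # The same indices are applied to every field, so fields stay aligned.
            yield {key: [values[i] for i in chosen] for key, values in data.items()}
```

Indexing every field with the same index list keeps observations, actions and rewards of one transition together even when the order is shuffled.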
- get_all_buffer_data(data_to_cpu=False)[source]
Return all currently stored data.
- Parameters
data_to_cpu (bool) – Whether or not to move data to CPU memory.
- get_num_channels_obs(sample)[source]
Obtain num_channels_obs and set it as class attribute.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- init_tensors(sample)[source]
Lazy initialization of data tensors from a sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
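Lazy initialization means storage is allocated only when the first sample arrives, so field names and shapes never have to be declared up front. A minimal sketch of the idea, using Python lists instead of tensors; the class `LazyStorage` and its wrap-around step counter are assumptions for illustration only.

```python
class LazyStorage:
    """Hypothetical storage that allocates its fields on the first insert."""

    def __init__(self, size):
        self.size = size
        self.data = None  # nothing is allocated until a sample is seen
        self.step = 0

    def init_tensors(self, sample):
        # One preallocated slot list per field found in the sample.
        self.data = {key: [None] * self.size for key in sample}

    def insert_transition(self, sample):
        if self.data is None:
            self.init_tensors(sample)
        for key, value in sample.items():
            self.data[key][self.step] = value
        self.step = (self.step + 1) % self.size  # assumed wrap-around


storage = LazyStorage(size=3)
storage.insert_transition({"Observation": [0.0, 1.0], "Reward": 1.0})
```

In a tensor-based version, the allocation step would typically create zero tensors shaped like the sample with an extra leading time axis of length size.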
- insert_data_slice(new_data)[source]
Replace currently stored data.
- Parameters
new_data (dict) – Dictionary of env transition samples to replace self.data with.
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- normalize_int_rewards()[source]
In order to keep the rewards on a consistent scale, the intrinsic rewards are normalized by dividing them by a running estimate of their standard deviation.
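Dividing by a running standard deviation can be sketched with Welford's online algorithm, as below. The `RunningStd` class, the `eps` term and the choice to track the raw intrinsic rewards are assumptions for this example; the statistic the library actually tracks may differ.

```python
import math


class RunningStd:
    """Running population standard deviation (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, values):
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # too little data: leave the scale unchanged
        return math.sqrt(self.m2 / self.count)


def normalize_int_rewards(int_rewards, running_std, eps=1e-8):
    """Update the running estimate, then rescale the rewards by it."""
    running_std.update(int_rewards)
    return [r / (running_std.std + eps) for r in int_rewards]
```

Only the scale is changed; the rewards are not mean-centered, so their sign is preserved.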
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')
pytorchrl.agent.storages.on_policy.vtrace_buffer module
- class pytorchrl.agent.storages.on_policy.vtrace_buffer.VTraceBuffer(size, device, actor, algorithm, envs)[source]
Bases:
pytorchrl.agent.storages.on_policy.vanilla_on_policy_buffer.VanillaOnPolicyBuffer
Storage class for On-Policy algorithms with off-policy correction method V-trace (https://arxiv.org/abs/1802.01561).
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
- compute_vtrace(clip_rho_thres=1.0, clip_c_thres=1.0)[source]
Computes V-trace target values and advantage predictions and stores them, along with the updated action log probabilities, in storage.
- Parameters
clip_rho_thres (float) – V-trace rho threshold parameter.
clip_c_thres (float) – V-trace c threshold parameter.
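The role of the two thresholds can be seen in the following single-trajectory sketch of the V-trace recursion. It is a simplified illustration: the function signature, the `gamma` argument and the list-based inputs are assumptions, episode terminations are ignored for brevity, and the code is not the library's implementation.

```python
import math


def compute_vtrace(rewards, values, last_value, behavior_log_probs,
                   target_log_probs, gamma=0.99,
                   clip_rho_thres=1.0, clip_c_thres=1.0):
    """Return (vs, advantages) for one trajectory without episode ends."""
    T = len(rewards)
    # Importance ratios between the current (target) and behavior policies.
    ratios = [math.exp(t - b) for t, b in zip(target_log_probs, behavior_log_probs)]
    rhos = [min(clip_rho_thres, r) for r in ratios]  # clips the TD error weight
    cs = [min(clip_c_thres, r) for r in ratios]  # clips the trace coefficient
    next_values = values[1:] + [last_value]

    vs = [0.0] * T
    acc = 0.0  # holds vs[t+1] - V(x[t+1]) while iterating backwards
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * cs[t] * acc
        vs[t] = values[t] + acc

    next_vs = vs[1:] + [last_value]
    advantages = [rhos[t] * (rewards[t] + gamma * next_vs[t] - values[t])
                  for t in range(T)]
    return vs, advantages
```

When the stored and current log probabilities match and both thresholds are 1.0, all ratios equal 1 and the targets reduce to the usual on-policy n-step returns.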
- get_updated_action_log_probs()[source]
Computes new log probabilities of actions stored in storage according to current actor version. It also uses the current actor version to update the value predictions.
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'ExternalReturn', 'IntrinsicReturn', 'Value', 'IntrinsicValue', 'LogProbability', 'Advantage', 'IntrinsicAdvantage')