pytorchrl.agent.storages.off_policy package
Submodules
pytorchrl.agent.storages.off_policy.ere_buffer module
- class pytorchrl.agent.storages.off_policy.ere_buffer.EREBuffer(size, device, actor, algorithm, envs, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]
Bases: pytorchrl.agent.storages.off_policy.per_buffer.PERBuffer
Storage class for Off-Policy algorithms with Emphasizing Recent Experience buffer (https://arxiv.org/abs/1906.04009).
This component extends PERBuffer, so ERE can be combined with Prioritized Experience Replay (PER) if required. By default, however, the PER parameters epsilon, alpha and beta are set to values that make PER equivalent to a vanilla replay buffer, so that only ERE is used. N-step learning can also be combined with PER and ERE using this component, but the default n_step value is 1.
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.
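As a rough illustration of what eta and cmin control, the sketch below follows the sampling-range rule described in the ERE paper: on the k-th of K consecutive updates, samples are drawn only from the most recent c_k transitions. The function name and its arguments are assumptions made for this example and are not taken from the EREBuffer source.

```python
def ere_sampling_range(buffer_len, k, num_updates, eta=1.0, cmin=5000):
    """Number of most recent transitions to sample from on update k (1-based).

    Hypothetical helper following the rule from the ERE paper:
    c_k = max(N * eta ** (k * 1000 / K), cmin).
    """
    c_k = int(buffer_len * eta ** (k * 1000 / num_updates))
    # Never shrink below cmin, and never exceed what is actually stored.
    return min(buffer_len, max(c_k, cmin))


# The range narrows towards recent data as k grows, down to the cmin floor.
ranges = [
    ere_sampling_range(100_000, k, num_updates=1000, eta=0.996)
    for k in (1, 500, 1000)
]
```

With the default eta=1.0 the range always covers the whole buffer, which matches the statement that the defaults reduce the component to a plain replay buffer.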
- after_gradients(batch, info)[source]
Steps required after updating actor policy model
- Parameters
batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.
- Returns
info – info dict updated with relevant info from Storage.
- Return type
dict
- classmethod create_factory(size, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]
Returns a function that creates EREBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.
- Returns
create_buffer_instance – creates a new EREBuffer class instance.
- Return type
func
- generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]
Returns a batch iterator to update actor.
- Parameters
num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
- Yields
batch (dict) – Generated data batches. Contains also extra information relevant to ERE.
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')
- pytorchrl.agent.storages.off_policy.ere_buffer.dim0_reshape(tensor, size1, size2)[source]
Reshapes tensor so indices are defined like this:
00, 01, 02, 03, 04, 05, 06, 07, 08, 09, size + 1, …, self.max_size
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, size + 1, …, self.max_size
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, size + 1, …, self.max_size
pytorchrl.agent.storages.off_policy.her_buffer module
- class pytorchrl.agent.storages.off_policy.her_buffer.HERBuffer(size, device, actor, algorithm, envs, her_function, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]
Bases: pytorchrl.agent.storages.off_policy.ere_buffer.EREBuffer
Storage class for Off-Policy algorithms using HER (https://arxiv.org/abs/1707.01495).
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.
her_function (func) – Function to update obs, rhs, obs2 and rew according to HER paper.
Warning
When using an environment vector of size larger than 1, episodes must have a fixed length. This HER implementation cannot handle environments with variable episode length, except when the environment vector has size 1.
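The exact signature expected for her_function is not documented here. As a loose sketch of the hindsight relabeling idea from the HER paper, the example below rewrites an episode as if the goal had been the state actually reached at the end. The transition layout (keys "achieved_next", "goal", "reward"), the sparse 0 / -1 reward and the function name are all assumptions for illustration only.

```python
import numpy as np


def relabel_episode_with_final_goal(episode, tol=1e-6):
    """Relabel every transition with the goal achieved at the end of the episode.

    `episode` is a list of dicts; the key names are hypothetical.
    """
    new_goal = episode[-1]["achieved_next"]
    relabeled = []
    for transition in episode:
        # Sparse reward: 0 when the substituted goal is reached, -1 otherwise.
        reached = np.linalg.norm(transition["achieved_next"] - new_goal) <= tol
        relabeled.append(
            {**transition, "goal": new_goal, "reward": 0.0 if reached else -1.0}
        )
    return relabeled


episode = [
    {"achieved_next": np.array([float(t)]), "goal": np.array([9.0]), "reward": -1.0}
    for t in range(3)
]
hindsight = relabel_episode_with_final_goal(episode)
```

The original transitions are left untouched, so both the real and the relabeled episode can be stored.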
- classmethod create_factory(size, her_function=<function HERBuffer.<lambda>>, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]
Returns a function that creates HERBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
- Returns
create_buffer_instance – creates a new HERBuffer class instance.
- Return type
func
- handle_end_of_episode()[source]
At the end of an environment episode, generates HER data and adds it to the replay buffer.
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')
pytorchrl.agent.storages.off_policy.nstep_buffer module
- class pytorchrl.agent.storages.off_policy.nstep_buffer.NStepReplayBuffer(size, device, actor, algorithm, envs, n_step=1)[source]
Bases: pytorchrl.agent.storages.off_policy.replay_buffer.ReplayBuffer
Storage class for Off-Policy algorithms with multi-step learning (https://arxiv.org/abs/1710.02298).
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
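To make the role of n_step concrete, here is a minimal sketch of a truncated n-step return: rewards are discounted and summed for up to n_step steps, stopping early if the episode ends. The helper name, the gamma argument and the returned bootstrap discount are assumptions for this example, not the NStepReplayBuffer internals.

```python
def n_step_return(rewards, dones, gamma=0.99, n_step=3):
    """Discounted sum of up to n_step rewards, truncated at episode end.

    Returns the partial return and the discount to apply to a bootstrap value.
    """
    ret, discount = 0.0, 1.0
    for reward, done in zip(rewards[:n_step], dones[:n_step]):
        ret += discount * reward
        discount *= gamma
        if done:
            # Do not accumulate rewards past the end of the episode.
            break
    return ret, discount


full = n_step_return([1.0, 1.0, 1.0], [False, False, False], gamma=0.5, n_step=3)
cut = n_step_return([1.0, 1.0, 1.0], [False, True, False], gamma=0.5, n_step=3)
single = n_step_return([1.0, 1.0, 1.0], [False, False, False], gamma=0.5, n_step=1)
```

With n_step=1 the result is just the immediate reward, which is why the default value corresponds to ordinary one-step learning.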
- classmethod create_factory(size, n_step=1)[source]
Returns a function that creates NStepReplayBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
- Returns
create_buffer_instance – creates a new NStepReplayBuffer class instance.
- Return type
func
- generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]
Returns a batch iterator to update actor.
- Parameters
num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
- Yields
batch (dict) – Generated data batches.
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')
- pytorchrl.agent.storages.off_policy.nstep_buffer.dim0_reshape(tensor, size)[source]
Reshapes tensor so indices are defined like this:
00, 01, 02, 03, 04, 05, 06, 07, 08, 09, size + 1, …, self.max_size
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, size + 1, …, self.max_size
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, size + 1, …, self.max_size
pytorchrl.agent.storages.off_policy.per_buffer module
- class pytorchrl.agent.storages.off_policy.per_buffer.PERBuffer(size, device, actor, algorithm, envs, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000)[source]
Bases: pytorchrl.agent.storages.off_policy.nstep_buffer.NStepReplayBuffer
Storage class for Off-Policy algorithms using PER (https://arxiv.org/abs/1511.05952).
This component extends NStepReplayBuffer, so PER can be combined with n-step learning. However, the default n_step value is 1, which is equivalent to not using n-step learning at all.
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
- after_gradients(batch, info)[source]
Steps required after updating actor policy model
- Parameters
batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.
- Returns
info – info dict updated with relevant info from Storage.
- Return type
dict
- classmethod create_factory(size, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000)[source]
Returns a function that creates PERBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
- Returns
create_buffer_instance – creates a new PERBuffer class instance.
- Return type
func
- generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]
Returns a batch iterator to update actor.
- Parameters
num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
- Yields
batch (dict) – Generated data batches.
- get_priority(error)[source]
Takes in the error of one or more examples and returns the proportional priority
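The following sketch shows one common way to turn TD errors into proportional priorities and importance-sampling weights, following the formulas in the PER paper. It is an illustration under assumed names (proportional_priority, sampling_probs_and_weights), not the body of get_priority itself.

```python
import numpy as np


def proportional_priority(error, epsilon=0.0, alpha=0.0):
    """Proportional priority p = (|error| + epsilon) ** alpha."""
    return (np.abs(error) + epsilon) ** alpha


def sampling_probs_and_weights(priorities, beta=1.0):
    """Sampling probabilities and max-normalized importance-sampling weights."""
    probs = priorities / priorities.sum()
    weights = (len(priorities) * probs) ** (-beta)
    return probs, weights / weights.max()


errors = np.array([0.1, 2.0, 5.0])

# Documented defaults: every priority is 1, so sampling is uniform.
uniform_probs, uniform_weights = sampling_probs_and_weights(
    proportional_priority(errors)
)

# Non-zero alpha: larger errors are sampled more often and weighted down.
per_probs, per_weights = sampling_probs_and_weights(
    proportional_priority(errors, epsilon=0.01, alpha=0.6)
)
```

With alpha=0.0 the exponent flattens all priorities to the same value, which is how the default parameters make PER behave like a vanilla replay buffer.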
- get_sequence_priority(sequence_data, eta=0.9)[source]
Get priority score for a given data sequence.
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')
- pytorchrl.agent.storages.off_policy.per_buffer.dim0_reshape(tensor, size)[source]
Reshapes tensor so indices are defined like this:
00, 01, 02, 03, 04, 05, 06, 07, 08, 09, size + 1, …, self.max_size
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, size + 1, …, self.max_size
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, size + 1, …, self.max_size
pytorchrl.agent.storages.off_policy.replay_buffer module
- class pytorchrl.agent.storages.off_policy.replay_buffer.ReplayBuffer(size, device, actor, algorithm, envs)[source]
Bases: pytorchrl.agent.storages.base.Storage
Storage class for Off-Policy algorithms.
- Parameters
size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
envs (VecEnv) – Vector of environments instance.
- after_gradients(batch, info)[source]
Steps required after updating actor policy model
- Parameters
batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.
- Returns
info – info dict updated with relevant info from Storage.
- Return type
dict
- classmethod create_factory(size)[source]
Returns a function that creates ReplayBuffer instances.
- Parameters
size (int) – Storage capacity along time axis.
- Returns
create_buffer_instance – creates a new ReplayBuffer class instance.
- Return type
func
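The factory pattern used by create_factory can be sketched as a classmethod that captures the configuration and returns a closure, so the remaining components can be supplied later. TinyBuffer below is a hypothetical stand-in class, and the argument list of the returned function is an assumption, since it is not documented here.

```python
class TinyBuffer:
    """Hypothetical stand-in for a storage class, for illustration only."""

    def __init__(self, size, device, actor, algorithm, envs):
        self.size = size
        self.device = device
        self.actor = actor
        self.algorithm = algorithm
        self.envs = envs

    @classmethod
    def create_factory(cls, size):
        """Capture `size` now; build the instance once the rest is available."""

        def create_buffer_instance(device, actor, algorithm, envs):
            # Assumed signature: the remaining constructor arguments.
            return cls(size, device, actor, algorithm, envs)

        return create_buffer_instance


factory = TinyBuffer.create_factory(size=100)
buffer_a = factory("cpu", None, None, None)
buffer_b = factory("cpu", None, None, None)
```

Each call to the returned function builds a fresh, independent instance with the same captured configuration.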
- generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]
Returns a batch iterator to update actor.
- Parameters
num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.
- Yields
batch (dict) – Generated data batches.
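A batch iterator with this shape can be sketched as a generator that yields num_mini_batch batches per epoch, each holding mini_batch_size samples. The version below draws indices uniformly with replacement from a dict of NumPy arrays; the sampling strategy, the seed argument and the data layout are assumptions, and the real method may differ (for example when the actor is recurrent).

```python
import numpy as np


def generate_batches(data, num_mini_batch, mini_batch_size, num_epochs=1, seed=0):
    """Yield dict batches sampled uniformly from a dict of equally long arrays."""
    rng = np.random.default_rng(seed)
    num_stored = len(next(iter(data.values())))
    for _ in range(num_epochs):
        for _ in range(num_mini_batch):
            # Same indices for every tensor, so transitions stay aligned.
            idxs = rng.integers(0, num_stored, size=mini_batch_size)
            yield {key: values[idxs] for key, values in data.items()}


data = {
    "Observation": np.arange(50).reshape(50, 1),
    "Reward": np.arange(50, dtype=float),
}
batches = list(generate_batches(data, num_mini_batch=4, mini_batch_size=8, num_epochs=2))
```

The total number of yielded batches is num_mini_batch times num_epochs.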
- get_all_buffer_data(data_to_cpu=False)[source]
Returns all currently stored data. If data_to_cpu is True, nothing extra needs to be done, since data tensors are already in CPU memory.
- Parameters
data_to_cpu (bool) – Whether or not to move data tensors to cpu memory.
- Returns
data – data currently stored in the buffer.
- Return type
dict
- get_data_slice(start_pos, end_pos)[source]
Makes a copy of all tensors in the buffer between steps start_pos and end_pos.
- Parameters
start_pos (int) – initial slice position.
end_pos (int) – final slice position.
- Returns
data – data slice copied from the buffer.
- Return type
dict
- init_tensors(sample)[source]
Lazy initialization of data tensors from a sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- insert_data_slice(new_data)[source]
Appends new_data to currently stored data.
- Parameters
new_data (dict) – Dictionary of env transition samples to be added to self.data.
- insert_single_tensor_slice(tensor_storage, tensor_key, tensor_values)[source]
Appends tensor_values to the buffer dict using tensor_key as key.
- Parameters
tensor_storage –
tensor_key (str) – key to use to store the tensor.
tensor_values (np.ndarray) – tensor values.
- Returns
l – length (time axis) of the tensor added to the buffer.
- Return type
int
- insert_transition(sample)[source]
Store new transition sample.
- Parameters
sample (dict) – Data sample (containing all tensors of an environment transition)
- reset()[source]
Set class size and step to zero. If self.actor uses RNNs, add overlap slice of last sequence before reset at the beginning of the storage.
- storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')
- pytorchrl.agent.storages.off_policy.replay_buffer.dim0_reshape(tensor, size)[source]
Reshapes tensor so indices are defined like this:
00, 01, 02, 03, 04, 05, 06, 07, 08, 09, size + 1, …, self.max_size
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, size + 1, …, self.max_size
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, size + 1, …, self.max_size