Off-policy

Replay buffer

class pytorchrl.agent.storages.off_policy.replay_buffer.ReplayBuffer(size, device, actor, algorithm, envs)[source]

Bases: pytorchrl.agent.storages.base.Storage

Storage class for Off-Policy algorithms.

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
envs (VecEnv) – Vector of environments instance.

after_gradients(batch, info)[source]

Steps required after updating the actor policy model.

Parameters

batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

before_gradients()[source]: Steps required before updating actor policy model.

classmethod create_factory(size)[source]

Returns a function that creates ReplayBuffer instances.

Parameters: size (int) – Storage capacity along time axis.
Returns: create_buffer_instance – creates a new ReplayBuffer class instance.
Return type: func
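The factory pattern described above can be illustrated without the library. The following is a minimal, self-contained sketch: ToyBuffer, its attributes, and the argument order of the returned function are hypothetical stand-ins chosen for illustration, not the pytorchrl implementation.

```python
class ToyBuffer:
    """Hypothetical stand-in for a storage class (not the pytorchrl ReplayBuffer)."""

    def __init__(self, size, device, actor, algorithm, envs):
        self.max_size = size
        self.device = device
        self.actor = actor
        self.algorithm = algorithm
        self.envs = envs

    @classmethod
    def create_factory(cls, size):
        """Fix `size` now and supply the remaining arguments later."""

        def create_buffer_instance(device, actor, algorithm, envs):
            # The argument order here is an assumption made for this sketch.
            return cls(size, device, actor, algorithm, envs)

        return create_buffer_instance


# Placeholder strings stand in for real device, actor, algorithm and env objects.
factory = ToyBuffer.create_factory(size=1000)
buffer = factory("cpu", "actor", "algorithm", "envs")
```

Deferring construction this way lets the storage hyperparameters be chosen up front while the device, actor, algorithm and environments are provided wherever the buffer is actually instantiated.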

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]

Returns a batch iterator to update actor.

Parameters

num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.

Yields

batch (dict) – Generated data batches.
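As a rough illustration of how the three arguments interact, the sketch below yields num_epochs * num_mini_batch batches of mini_batch_size samples each. It assumes uniform sampling with replacement from a plain dict of NumPy arrays; the `data` and `seed` arguments are hypothetical, and the real method may sample differently.

```python
import numpy as np


def generate_batches(data, num_mini_batch, mini_batch_size, num_epochs=1, seed=0):
    """Toy batch iterator: uniform sampling with replacement (an assumption)."""
    rng = np.random.default_rng(seed)
    size = len(next(iter(data.values())))  # number of stored transitions
    for _ in range(num_epochs):
        for _ in range(num_mini_batch):
            idx = rng.integers(0, size, size=mini_batch_size)
            # Index every stored tensor with the same sampled positions.
            yield {key: values[idx] for key, values in data.items()}


data = {"Observation": np.arange(10.0), "Reward": np.ones(10)}
batches = list(
    generate_batches(data, num_mini_batch=3, mini_batch_size=4, num_epochs=2)
)
```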

get_all_buffer_data(data_to_cpu=False)[source]

Returns all currently stored data. If data_to_cpu is set, nothing needs to be done because the data tensors are already in CPU memory.

Parameters: data_to_cpu (bool) – Whether or not to move data tensors to CPU memory.
Returns: data – data currently stored in the buffer.
Return type: dict

get_data_slice(start_pos, end_pos)[source]

Makes a copy of all tensors in the buffer between steps start_pos and end_pos.

Parameters

start_pos (int) – initial slice position.
end_pos (int) – final slice position.

Returns

data – data slice copied from the buffer.

Return type

dict

init_tensors(sample)[source]

Lazy initialization of data tensors from a sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition)

insert_data_slice(new_data)[source]

Appends new_data to currently stored data.

Parameters: new_data (dict) – Dictionary of env transition samples to be added to self.data.

insert_single_tensor_slice(tensor_storage, tensor_key, tensor_values)[source]

Appends tensor_values to the buffer dict using tensor_key as key.

Parameters

tensor_storage –
tensor_key (str) – key to use to store the tensor.
tensor_values (np.ndarray) – tensor values.

Returns

l – length (along the time axis) of the tensor added to the buffer.

Return type

int

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition)

reset()[source]: Sets the storage size and step to zero. If self.actor uses RNNs, the overlap slice of the last sequence before the reset is added at the beginning of the storage.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')

update_storage_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Attribute name
new_parameter_value (int or float) – New value for parameter_name.

N-Step Replay Buffer

class pytorchrl.agent.storages.off_policy.nstep_buffer.NStepReplayBuffer(size, device, actor, algorithm, envs, n_step=1)[source]

Bases: pytorchrl.agent.storages.off_policy.replay_buffer.ReplayBuffer

Storage class for Off-Policy algorithms with multi-step learning (https://arxiv.org/abs/1710.02298).

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
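To make the role of n_step concrete, the truncated n-step return is the discounted sum of the next n_step rewards. The sketch below is a standalone illustration; the discount factor `gamma` is an assumed input and is not a parameter of this class.

```python
def n_step_return(rewards, gamma, n_step):
    """Discounted sum of the first n_step rewards (truncated n-step return)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[:n_step]))


rewards = [1.0, 1.0, 1.0, 1.0]
g1 = n_step_return(rewards, gamma=0.5, n_step=1)  # only the immediate reward
g3 = n_step_return(rewards, gamma=0.5, n_step=3)  # 1 + 0.5 + 0.25
```

With n_step=1 the return reduces to the single-step reward, which is why the default is equivalent to not using multi-step learning. A bootstrapped target would typically add gamma ** n_step times the value estimate of the state reached after n_step steps.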

classmethod create_factory(size, n_step=1)[source]

Returns a function that creates NStepReplayBuffer instances.

Parameters

size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.

Returns

create_buffer_instance – creates a new NStepReplayBuffer class instance.

Return type

func

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]

Returns a batch iterator to update actor.

Parameters

num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.

Yields

batch (dict) – Generated data batches.

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition)

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')

update_storage_parameter(parameter_name, new_parameter_value)[source]

If parameter_name is an attribute of the algorithm, change its value to new_parameter_value.

Parameters

parameter_name (str) – Attribute name
new_parameter_value (int or float) – New value for parameter_name.

Prioritized Experience Replay Buffer

class pytorchrl.agent.storages.off_policy.per_buffer.PERBuffer(size, device, actor, algorithm, envs, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000)[source]

Bases: pytorchrl.agent.storages.off_policy.nstep_buffer.NStepReplayBuffer

Storage class for Off-Policy algorithms using PER (https://arxiv.org/abs/1511.05952).

This component extends NStepReplayBuffer, which makes it possible to combine PER with n-step learning. The default n_step value is 1, which is equivalent to not using n-step learning at all.

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
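The epsilon, alpha and beta parameters can be read against the proportional variant of the PER paper, where priority is (|error| + epsilon) ** alpha and importance-sampling weights are (N * P(i)) ** (-beta). The sketch below follows those formulas; whether this class implements them in exactly this form is an assumption, and the helper names sampling_probs and is_weights are hypothetical.

```python
import numpy as np


def get_priority(error, epsilon=0.0, alpha=0.0):
    """Proportional priority from the PER paper: (|error| + epsilon) ** alpha."""
    return (np.abs(error) + epsilon) ** alpha


def sampling_probs(errors, epsilon, alpha):
    """Normalize priorities into a sampling distribution."""
    priorities = get_priority(np.asarray(errors, dtype=float), epsilon, alpha)
    return priorities / priorities.sum()


def is_weights(probs, beta):
    """Importance-sampling weights, scaled so the largest weight is 1."""
    weights = (len(probs) * probs) ** (-beta)
    return weights / weights.max()


errors = [0.5, 1.0, 2.0, 4.0]
uniform = sampling_probs(errors, epsilon=0.0, alpha=0.0)  # the documented defaults
prioritized = sampling_probs(errors, epsilon=0.01, alpha=0.6)
weights = is_weights(prioritized, beta=0.4)
```

With alpha=0.0 every priority equals 1, so sampling is uniform, as in a vanilla replay buffer. Larger alpha skews sampling toward high-error transitions, and the importance-sampling weights shrink for those transitions to compensate.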

after_gradients(batch, info)[source]

Steps required after updating the actor policy model.

Parameters

batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

before_gradients()[source]: Steps required before updating actor policy model.

classmethod create_factory(size, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000)[source]

Returns a function that creates PERBuffer instances.

Parameters

size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.

Returns

create_buffer_instance – creates a new PERBuffer class instance.

Return type

func

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]

Returns a batch iterator to update actor.

Parameters

num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.

Yields

batch (dict) – Generated data batches.

get_priority(error)[source]: Takes the error of one or more examples and returns the proportional priority.

get_sequence_priority(sequence_data, eta=0.9)[source]: Get priority score for a given data sequence.

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition)

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')

Emphasizing Recent Experience Replay Buffer (ERE)

class pytorchrl.agent.storages.off_policy.ere_buffer.EREBuffer(size, device, actor, algorithm, envs, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]

Bases: pytorchrl.agent.storages.off_policy.per_buffer.PERBuffer

Storage class for Off-Policy algorithms with Emphasizing Recent Experience buffer (https://arxiv.org/abs/1906.04009).

This component extends PERBuffer, so ERE can be combined with Prioritized Experience Replay (PER) if required. By default, the PER parameters epsilon, alpha and beta are set to values that make PER equivalent to a vanilla replay buffer, so ERE can be used on its own. N-step learning can also be combined with PER and ERE using this component, but the default n_step value is 1.

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.
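For orientation, the ERE paper restricts the k-th of K consecutive updates to the most recent c_k = max(N * eta ** (k * 1000 / K), cmin) transitions, where N is the current buffer length. The sketch below implements that schedule; the function name and arguments are hypothetical, and the exact schedule used by this class may differ.

```python
def ere_sampling_range(buffer_len, eta, cmin, k, num_updates):
    """Number of most recent transitions to sample from for update k (0-based)."""
    c_k = int(buffer_len * eta ** (k * 1000 / num_updates))
    # Never go below cmin, and never exceed what is actually stored.
    return min(buffer_len, max(c_k, cmin))


first = ere_sampling_range(100_000, eta=0.996, cmin=5000, k=0, num_updates=1000)
last = ere_sampling_range(100_000, eta=0.996, cmin=5000, k=1000, num_updates=1000)
no_ere = ere_sampling_range(100_000, eta=1.0, cmin=5000, k=500, num_updates=1000)
```

With eta=1.0, the documented default, the range always covers the whole buffer, so sampling is not biased toward recent data. Values of eta below 1 shrink the range as k grows, and cmin bounds how small it can get.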

after_gradients(batch, info)[source]

Steps required after updating the actor policy model.

Parameters

batch (dict) – Data batch used to compute the gradients.
info (dict) – Additional relevant info from gradient computation.

Returns

info – info dict updated with relevant info from Storage.

Return type

dict

classmethod create_factory(size, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]

Returns a function that creates EREBuffer instances.

Parameters

size (int) – Storage capacity along time axis.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.

Returns

create_buffer_instance – creates a new EREBuffer class instance.

Return type

func

generate_batches(num_mini_batch, mini_batch_size, num_epochs=1)[source]

Returns a batch iterator to update actor.

Parameters

num_mini_batch (int) – Number of mini batches per epoch.
mini_batch_size (int) – Number of samples contained in each mini batch.
num_epochs (int) – Number of epochs.

Yields

batch (dict) – Generated data batches. Also contains extra information relevant to ERE.

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')

update_eta(info)[source]: Adjusts the eta parameter based on how fast or slow the agent is learning in recent episodes.

Hindsight experience replay buffer (HER)

class pytorchrl.agent.storages.off_policy.her_buffer.HERBuffer(size, device, actor, algorithm, envs, her_function, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]

Bases: pytorchrl.agent.storages.off_policy.ere_buffer.EREBuffer

Storage class for Off-Policy algorithms using HER (https://arxiv.org/abs/1707.01495).

Parameters

size (int) – Storage capacity along time axis.
device (torch.device) – CPU or specific GPU where data tensors will be placed and class computations will take place. Should be the same device where the actor model is located.
envs (VecEnv) – Vector of environments instance.
actor (Actor) – Actor class instance.
algorithm (Algorithm) – Algorithm class instance.
n_step (int or float) – Number of future steps used to compute the truncated n-step return value.
epsilon (float) – PER epsilon parameter.
alpha (float) – PER alpha parameter.
beta (float) – PER beta parameter.
default_error (int or float) – Default TD error value to use for newly added data samples.
eta (float) – ERE eta parameter.
cmin (int) – ERE cmin parameter.
her_function (func) – Function to update obs, rhs, obs2 and rew according to HER paper.
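The kind of relabeling a her_function performs can be sketched on a toy episode. The example below uses the "final" strategy from the HER paper: the goal of every transition is replaced by the state actually reached at the end of the episode, and rewards are recomputed. The function name, its signature, the [state, goal] observation layout and the 0 / -1 reward are all assumptions made for illustration; the signature expected by HERBuffer is not documented here, and recurrent hidden states (rhs) are omitted.

```python
import numpy as np


def relabel_with_final_goal(obs, obs2):
    """Toy HER relabeling. Each row is [state, goal] with equal-sized halves."""
    d = obs.shape[1] // 2
    new_goal = obs2[-1, :d]  # state achieved at the end of the episode
    her_obs, her_obs2 = obs.copy(), obs2.copy()
    her_obs[:, d:] = new_goal
    her_obs2[:, d:] = new_goal
    # Sparse reward: 0 when the next state matches the new goal, -1 otherwise.
    reached = np.all(her_obs2[:, :d] == new_goal, axis=1)
    her_rew = np.where(reached, 0.0, -1.0)
    return her_obs, her_obs2, her_rew


# A 3-step episode over 1-D states 0 -> 1 -> 2 -> 3 with an unreached goal of 9.
states = np.array([[0.0], [1.0], [2.0], [3.0]])
goal = np.full((3, 1), 9.0)
obs = np.hstack([states[:-1], goal])
obs2 = np.hstack([states[1:], goal])
her_obs, her_obs2, her_rew = relabel_with_final_goal(obs, obs2)
```

The original episode never reaches its goal, but the relabeled copy ends in success, which gives the agent a useful learning signal from an otherwise failed trajectory.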

Warning

When using an environment vector of size larger than 1, episodes must be of a fixed length. This HER implementation is not able to deal with envs of variable episode length, except in the case of environment vector size 1.

copy_single_tensor(key, position)[source]: Generates a copy of tensor key at index position.

classmethod create_factory(size, her_function=<function HERBuffer.<lambda>>, n_step=1, epsilon=0.0, alpha=0.0, beta=1.0, default_error=1000000, eta=1.0, cmin=5000)[source]

Returns a function that creates HERBuffer instances.

Parameters: size (int) – Storage capacity along time axis.
Returns: create_buffer_instance – creates a new HERBuffer class instance.
Return type: func

handle_end_of_episode()[source]: At the end of an environment episode, generates HER data and adds it to the replay buffer.

insert_transition(sample)[source]

Store new transition sample.

Parameters: sample (dict) – Data sample (containing all tensors of an environment transition)

storage_tensors = ('Observation', 'RecurrentHiddenStates', 'Done', 'Action', 'Reward', 'IntrinsicReward', 'NextObservation', 'NextRecurrentHiddenStates', 'NextDone', 'ActionProbs')