WIP: add Diversity is All You Need implementation #267
base: master
Conversation
refactor to use classes to enable easy options
Great work @kinalmehta! DIAYN looks like a pretty interesting paper. Some thoughts:
- Have you run some preliminary experiments to see if you can replicate the results reported in the paper?
- `learn_skills` and `use_skills` have a huge amount of duplicate code. If their purpose is to save and load models, consider the approach listed in https://docs.cleanrl.dev/advanced/resume-training/#resume-training_1.
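For illustration, a minimal sketch of one way to factor the shared checkpointing out of the two phases (the helper names and the set of saved modules are illustrative, not the exact approach from the linked resume-training doc):

```python
import torch

def save_checkpoint(path, actor, qf1, qf2, discriminator):
    # one file holding everything the skill-learning phase produces
    torch.save(
        {
            "actor": actor.state_dict(),
            "qf1": qf1.state_dict(),
            "qf2": qf2.state_dict(),
            "discriminator": discriminator.state_dict(),
        },
        path,
    )

def load_checkpoint(path, actor, qf1, qf2, discriminator, device="cpu"):
    # restore those weights before fine-tuning on the environment reward
    ckpt = torch.load(path, map_location=device)
    actor.load_state_dict(ckpt["actor"])
    qf1.load_state_dict(ckpt["qf1"])
    qf2.load_state_dict(ckpt["qf2"])
    discriminator.load_state_dict(ckpt["discriminator"])
```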
group.add_argument("--learn-skills", action='store_true', default=False) | ||
group.add_argument("--use-skills", action='store_true', default=False) | ||
group.add_argument("--evaluate-skills", action='store_true', default=False) |
Please use the following configuration for bool:
parser.add_argument("--autotune", type=lambda x:bool(strtobool(x)), default=False, nargs="?", const=True,
help="automatic tuning of the entropy coefficient")
```python
# INFO: don't need to use OptionsPolicy as it is not used in the paper.
# Instead skill is uniformly sampled from the skills space.
# This can be used later to use pretrained skills to optimize for a specific reward function.
# class OptionsPolicy(nn.Module):
#     def __init__(self, env, num_skills):
#         super().__init__()
#         self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
#         self.fc2 = nn.Linear(256, 256)
#         self.fc3 = nn.Linear(256, num_skills)
#
#     def forward(self, x):
#         x = F.relu(self.fc1(x))
#         x = F.relu(self.fc2(x))
#         x = self.fc3(x)
#         return OneHotCategorical(logits=x)
```
This should go to docs, under the implementation details section.
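For reference, the uniform sampling described in the comment above can be expressed with a fixed prior over the skill space instead of a learned OptionsPolicy; a minimal sketch (the variable names and skill count are illustrative):

```python
import torch
from torch.distributions import OneHotCategorical

num_skills = 10  # illustrative value

# fixed uniform prior p(z) over the skill space
skill_prior = OneHotCategorical(logits=torch.zeros(num_skills))

one_hot_z = skill_prior.sample()           # one-hot skill used for the whole episode
log_p_z = skill_prior.log_prob(one_hot_z)  # log p(z) = -log(num_skills)
```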
```python
def split_aug_obs(aug_obs, num_skills):
    assert type(aug_obs) in [torch.Tensor, np.ndarray] and type(num_skills) is int, "invalid input type"
```
The check may not be needed for simplicity. Otherwise we may also need a check for `aug_obs_z`.
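For context, a sketch of what such a helper presumably does, assuming the one-hot skill is concatenated after the observation (the layout is an assumption, not taken from the PR):

```python
def split_aug_obs(aug_obs, num_skills):
    # inverse of aug_obs_z: drop the trailing one-hot skill from the augmented observation;
    # slicing works on both torch.Tensor and np.ndarray without an explicit type check
    obs = aug_obs[..., :-num_skills]
    one_hot_z = aug_obs[..., -num_skills:]
    return obs, one_hot_z
```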
```python
class DIAYN:
    def __init__(self, args, run_name=None, device=torch.device("cpu")):

        self.args = args
        self.device = device
```
Please use the standard single-file implementation format in place of classes.
```python
# TRY NOT TO MODIFY: start the game
obs = self.envs.reset()
z_aug_obs = aug_obs_z(obs, one_hot_z)
for global_step in range(self.args.total_timesteps):
```
What is the difference between `learn_skills` and `use_skills`? Why are both of them going over 1000000 steps?
`learn_skills` is the unsupervised skill learning phase, whereas `use_skills` is to fine-tune the trained model to optimize for the environment reward.
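For readers of the thread: in the skill-learning phase the policy is trained on the DIAYN pseudo-reward log q_phi(z | s') - log p(z) from the paper instead of the environment reward. A minimal sketch of how it could be computed from the discriminator (the names `discriminator`, `next_obs`, `one_hot_z`, and `num_skills` are assumptions, not the PR's exact code):

```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    logits = discriminator(next_obs)  # unnormalized q_phi(z | s')
    # log q_phi(z | s') for the skill that generated this transition
    log_q_z = -F.cross_entropy(logits, one_hot_z.argmax(dim=-1), reduction="none")
    log_p_z = -torch.log(torch.tensor(float(num_skills)))  # uniform prior: p(z) = 1 / num_skills
    pseudo_reward = log_q_z - log_p_z  # replaces the env reward in the SAC update during learn_skills
```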
self.writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step) | ||
if self.args.autotune: | ||
self.writer.add_scalar("losses/alpha_loss", alpha_loss.item(), global_step) | ||
return |
There is no need for `return`; it's implicit.
```python
# self.actor_optimizer.load_state_dict(models_info["actor_optimizer"])
# self.q_optimizer.load_state_dict(models_info["q_optimizer"])
# self.discriminator_optimizer.load_state_dict(models_info["discriminator_optimizer"])
return
```
There is no need for `return`; it's implicit.
Description
Adds an implementation of the Diversity is All You Need (DIAYN) paper. It is an unsupervised option-learning framework whose learned skills can later be used for transfer learning.
To-Do
Types of changes
Checklist:
- `pre-commit run --all-files` passes (required).
- I have updated the documentation and previewed the changes via `mkdocs serve`.

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

- I have tracked applicable experiments with the `--capture-video` flag toggled on (required).
- I have added documentation and previewed the changes via `mkdocs serve`.
- I have added the learning curves (in PNG format with `width=500` and `height=300`).