WIP: add Diversity is All You Need implementation #267

Draft: kinalmehta wants to merge 10 commits into base: master

Conversation

kinalmehta (Collaborator)

Description

Adds an implementation of the Diversity is All You Need (DIAYN) paper: an unsupervised option-learning framework whose pretrained skills can later be used for transfer learning.
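
For context, here is a minimal sketch of the DIAYN pseudo-reward described in the paper (names and shapes are illustrative, not taken from this PR): a discriminator q_phi predicts which skill z produced a state, and the agent is rewarded for visiting states that make z easy to infer, minus the log-probability of z under the uniform skill prior.

    import math

    import torch
    import torch.nn.functional as F


    def diayn_reward(discriminator_logits: torch.Tensor, z_idx: torch.Tensor, num_skills: int) -> torch.Tensor:
        # log q_phi(z | s): log-probability the discriminator assigns to the sampled skill
        log_q_z = F.log_softmax(discriminator_logits, dim=-1).gather(-1, z_idx.unsqueeze(-1)).squeeze(-1)
        # log p(z) under a uniform prior over num_skills skills
        log_p_z = math.log(1.0 / num_skills)
        # pseudo-reward r = log q_phi(z | s) - log p(z)
        return log_q_z - log_p_z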

To-Do

  • Implement unsupervised skill learning
  • Implement model saving and loading logic
  • Use the pre-trained model to train on a task from the paper for benchmarking
  • Add documentation and benchmarks

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo.
  • I have updated the tests accordingly (if applicable).

@kinalmehta self-assigned this on Aug 27, 2022

Commit: refactor to use classes to enable easy options
@vwxyzjn (Owner) left a comment:


Great work @kinalmehta! DIAYN looks like a pretty interesting paper. Some thoughts:

  1. Have you run some preliminary experiments to see if you can replicate the results reported in the paper?
  2. learn_skills and use_skills have a huge amount of duplicate code. If their purpose is to save and load models, consider the approach listed in https://docs.cleanrl.dev/advanced/resume-training/#resume-training_1 (see the sketch below this list for one way the saving and loading could be split).
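
For reference, a minimal sketch (not the approach from the linked doc) of how the two phases could share checkpointing logic; the argument names and modules (actor, qf1, qf2, discriminator) are placeholders standing in for the ones in this PR.

    import torch


    def save_checkpoint(path, actor, qf1, qf2, discriminator):
        # persist only the state dicts so learn_skills and use_skills can run as
        # separate invocations of the same script
        torch.save(
            {
                "actor": actor.state_dict(),
                "qf1": qf1.state_dict(),
                "qf2": qf2.state_dict(),
                "discriminator": discriminator.state_dict(),
            },
            path,
        )


    def load_checkpoint(path, actor, qf1, qf2, discriminator, device="cpu"):
        # restore the modules trained during the unsupervised phase before fine-tuning
        ckpt = torch.load(path, map_location=device)
        actor.load_state_dict(ckpt["actor"])
        qf1.load_state_dict(ckpt["qf1"])
        qf2.load_state_dict(ckpt["qf2"])
        discriminator.load_state_dict(ckpt["discriminator"])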

Comment on lines 78 to 80
group.add_argument("--learn-skills", action='store_true', default=False)
group.add_argument("--use-skills", action='store_true', default=False)
group.add_argument("--evaluate-skills", action='store_true', default=False)
vwxyzjn (Owner):

Please use the following configuration for bool:

    parser.add_argument("--autotune", type=lambda x:bool(strtobool(x)), default=False, nargs="?", const=True,
        help="automatic tuning of the entropy coefficient")

Comment on lines 119 to 133
# INFO: don't need to use OptionsPolicy as it is not used in the paper.
# Instead skill is uniformly sampled from the skills space.
# This can be used later to use pretrained skills to optimize for a specific reward function.
# class OptionsPolicy(nn.Module):
#     def __init__(self, env, num_skills):
#         super().__init__()
#         self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
#         self.fc2 = nn.Linear(256, 256)
#         self.fc3 = nn.Linear(256, num_skills)
#
#     def forward(self, x):
#         x = F.relu(self.fc1(x))
#         x = F.relu(self.fc2(x))
#         x = self.fc3(x)
#         return OneHotCategorical(logits=x)
vwxyzjn (Owner):

This should go to the docs, under the implementation details section.



def split_aug_obs(aug_obs, num_skills):
    assert type(aug_obs) in [torch.Tensor, np.ndarray] and type(num_skills) is int, "invalid input type"
vwxyzjn (Owner):

For simplicity, the check may not be needed; otherwise we may also need a check for aug_obs_z.
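
For context, a minimal sketch of what the two helpers are assumed to do here, namely concatenating a one-hot skill vector onto the observation along the last axis and splitting it back off (bodies are illustrative, not the PR's exact code):

    import numpy as np
    import torch


    def aug_obs_z(obs, one_hot_z):
        # append the one-hot skill encoding to the observation
        if isinstance(obs, torch.Tensor):
            return torch.cat([obs, one_hot_z], dim=-1)
        return np.concatenate([obs, one_hot_z], axis=-1)


    def split_aug_obs(aug_obs, num_skills):
        # the last num_skills entries are the one-hot skill encoding
        return aug_obs[..., :-num_skills], aug_obs[..., -num_skills:]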

Comment on lines 202 to 206
class DIAYN:
    def __init__(self, args, run_name=None, device=torch.device("cpu")):

        self.args = args
        self.device = device
vwxyzjn (Owner):

Please use the standard single-file implementation format in place of classes.

# TRY NOT TO MODIFY: start the game
obs = self.envs.reset()
z_aug_obs = aug_obs_z(obs, one_hot_z)
for global_step in range(self.args.total_timesteps):
vwxyzjn (Owner):

What is the difference between learn_skills and use_skills? Why do both of them run for 1,000,000 steps?

kinalmehta (Collaborator, Author):

learn_skills is the unsupervised skill-learning phase, whereas use_skills fine-tunes the trained model to optimize for the environment reward.
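
To make the distinction concrete, an illustrative view (not the PR's code) of the reward each phase optimizes:

    def phase_reward(learn_skills: bool, env_reward: float, log_q_z_given_s: float, log_p_z: float) -> float:
        if learn_skills:
            # unsupervised phase: intrinsic DIAYN pseudo-reward only
            return log_q_z_given_s - log_p_z
        # fine-tuning phase: optimize the task's extrinsic reward
        return env_reward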

self.writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step)
if self.args.autotune:
    self.writer.add_scalar("losses/alpha_loss", alpha_loss.item(), global_step)
return
vwxyzjn (Owner):

There is no need for return — it's implicit.

# self.actor_optimizer.load_state_dict(models_info["actor_optimizer"])
# self.q_optimizer.load_state_dict(models_info["q_optimizer"])
# self.discriminator_optimizer.load_state_dict(models_info["discriminator_optimizer"])
return
vwxyzjn (Owner):

There is no need for return — it's implicit.
