
Classifier timeline and white paper #64

Closed
fmingyan opened this issue Apr 22, 2022 · 5 comments

Comments

@fmingyan

Could you provide some guidance on when the classifier weights will become public? Will there be a white paper on the classifier similar to what was published for FLoC?

@jkarlin
Collaborator

jkarlin commented Apr 25, 2022

The model is currently public in the sense that it's shipped to canary/dev/beta browsers that have enabled the API (e.g., via enabling Privacy Sandbox Ads APIs in chrome://flags). It's in your config directory under OptimizationGuidePredictionModels/ as a tflite model. The directory names are cryptic, but you're in the right directory if there is a override_list.pb.gz file in there as well. That override list makes it possible to override the model's output for domains. We currently have the top 10,000 domains manually labeled and placed in that override list.
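Since the directory names are cryptic, a small script can locate candidate model directories by looking for the `override_list.pb.gz` marker file described above. This is a minimal sketch, not part of Chrome's tooling; the profile path you pass in depends on your OS and Chrome channel.

```python
import os

def find_topics_model_dirs(profile_dir):
    """Search a Chrome user-data directory for candidate Topics model
    directories, identified by the presence of override_list.pb.gz.

    `profile_dir` is an assumption: pass the path to your Chrome user
    data directory (e.g. ~/.config/google-chrome on Linux).
    """
    matches = []
    for root, _dirs, files in os.walk(profile_dir):
        # Per the comment above, the cryptic model directory also
        # contains the override list alongside the .tflite model.
        if "override_list.pb.gz" in files:
            matches.append(root)
    return matches
```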

I do expect that we'll have an updated website with more details.

Note that we also intend to make a chrome://topics-internals page for helping to debug topics data. One section of the internals page will include a tool to let developers manually query the model.

@fmingyan
Author

Thank you. Could you also advise where to find the top 10k domains currently used, and how they are mapped to model inputs?

@leeronisrael
Contributor

You can find the path to the model file in the "Classifier" tab of the chrome://topics-internals/ page (see docs here). The top 10k domains currently used are in the override_list.pb.gz file found in the same directory as the model above. The domain-to-topics associations in the list are used by the API in lieu of the output of the model itself.

To run the model directly, refer to documentation here: https://www.tensorflow.org/lite/guide/inference#running_a_model (Also see: https://www.tensorflow.org/learn)

To inspect the override_list.pb.gz file:

1. Unpack it: `gunzip -c override_list.pb.gz > override_list.pb`
2. Use `protoc` to inspect it: `protoc --decode_raw < override_list.pb > output.txt`
Also see: Taxonomy of topics with IDs: https://github.com/patcg-individual-drafts/topics/blob/main/taxonomy_v1.md
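The unpack step above can also be done without `gunzip`, using Python's standard-library `gzip` module; a minimal sketch (file names are the defaults from this thread, so adjust the paths to wherever your browser stores the model):

```python
import gzip
import shutil

def unpack_override_list(src="override_list.pb.gz", dst="override_list.pb"):
    """Decompress the gzipped override list, equivalent to
    `gunzip -c override_list.pb.gz > override_list.pb`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

The resulting `override_list.pb` can then be fed to `protoc --decode_raw` as in step 2.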

@nTastevin

Thank you for the details above.

According to the model's get_input_details() method, the model seems to require these input tensors:

  • 'input_ids_1': array([ 1, 128], dtype=int32)
  • 'input_mask_1': array([ 1, 128], dtype=int32)
  • 'token_type_ids_1': array([ 1, 128], dtype=int32)

In order to run the model locally, could you provide some guidance on how you preprocess domain strings into input tensor values compatible with the model?

@jkarlin
Collaborator

jkarlin commented Jul 13, 2022

I'm not an expert on this, but looking at https://source.chromium.org/chromium/chromium/src/+/main:third_party/tflite_support/src/tensorflow_lite_support/cc/task/processor/bert_preprocessor.cc;l=117;drc=a6fe0210768868959ac7e8d0e04eaf771e83e524;bpv=1;bpt=1 it appears that the input mask is always set to 1 and the type id is the same as the input id.
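Taking the description above at face value, the tensor layout could be sketched as follows. This is only an illustration of the shapes reported by get_input_details(): `token_ids` is assumed to come from the model's own tokenizer (not shown here), and the mask/type-id handling follows the reading of the linked preprocessor source, not verified behavior.

```python
def make_bert_inputs(token_ids, seq_len=128):
    """Build the three [1, seq_len] int32-style tensors listed above
    from a sequence of token ids (hypothetical tokenizer output)."""
    ids = list(token_ids)[:seq_len]
    pad = seq_len - len(ids)
    input_ids = ids + [0] * pad          # zero-padded to seq_len
    input_mask = [1] * seq_len           # "input mask is always set to 1"
    token_type_ids = list(input_ids)     # "type id is the same as the input id"
    # Each tensor has shape [1, seq_len], matching get_input_details().
    return {
        "input_ids_1": [input_ids],
        "input_mask_1": [input_mask],
        "token_type_ids_1": [token_type_ids],
    }
```

These lists would then be converted to int32 arrays and passed to the TFLite interpreter's set_tensor calls as in the inference guide linked earlier.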

You may also find #79 (comment) helpful.

@jkarlin jkarlin closed this as completed Jun 22, 2023