
Reducing delays for aggregatable reports #738

Closed
csharrison opened this issue Mar 28, 2023 · 2 comments · Fixed by #749 or #750

Comments

@csharrison
Collaborator

Currently, aggregatable reports in the API come with a randomized delay of 10 minutes to 1 hour, though the delay can often be much larger due to offline users or inactive browsers. This delay serves to protect whether any particular user had an attributed conversion, which is cross-site data. However, if aggregatable reports leaked less cross-site data, they could be delivered with reduced delay.

In this issue I propose reducing the delay in the API to something like ~0-10 minutes instead of ~10-60 minutes.

We can do this by taking ideas from issue #439 and introducing null reports for some fraction of trigger registrations, to reduce the total amount of cross-site information embedded in a report. Because a lot of the cross-site information is embedded in the existing source_registration_time field, we are also thinking of making this field optional as part of this change. E.g. in the trigger registration JSON:

{
  "aggregatable_source_registration_time": "omit"  // or "include"
  ...
}

If the source_registration_time field is present, the null report rate will need to be higher. Currently we are thinking of something like ~0.05 null reports (in expectation) for triggers which don’t specify the source registration time, and ~0.25 if they do.

Note: ideally this level of configuration would be done globally at the reporting-origin level rather than the trigger level, but that would require introducing global configuration to the API across users. This is something we might explore in the future, so any feedback on where we’d want to vary this across ad-techs / reporting origins would be useful.

We also think that, in the future, embedding the source_registration_time inside the encrypted payload would improve the situation here and reduce the number of null reports. For this reason we think defaulting to omitting this field makes sense.

@csharrison
Collaborator Author

Let me clarify the algorithm I have in mind for generating null reports:

  • Upon a successful trigger registration (whether it generates any real reports or not):
    • If source_registration_time is omitted:
      • Sample n = GeometricDistribution(p_1), and schedule n null reports to be sent
    • Otherwise:
      • For each possible value of source_registration_time:
        • Sample n = GeometricDistribution(p_2), and schedule n null reports to be sent with the given source_registration_time

Where p_1 ≈ 0.952 and p_2 ≈ 0.992, to achieve the expected null report counts above.
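For concreteness, here is a quick sanity check of how these parameters map to the expected counts above. It assumes GeometricDistribution(p) counts the number of failures before the first success (mean (1 - p) / p), and that source_registration_time is rounded to a whole day with a maximum 30-day source expiry, giving ~31 possible values; both details are assumptions for illustration.

# Sanity check: expected null report counts implied by p_1 and p_2.
# Assumes Geometric(p) = failures before first success, mean (1 - p) / p,
# and ~31 possible source_registration_time values (days 0..30).
p_1 = 0.952
p_2 = 0.992
num_times = 31

print((1 - p_1) / p_1)              # ~0.05 expected null reports
print(num_times * (1 - p_2) / p_2)  # ~0.25 expected null reports in total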

To generate a null report, use a contribution where the key is a random 128-bit number and the value is 0. Everything else should be specified by the trigger (except source_registration_time, which is handled above). Null reports should not check or affect rate limits.
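A minimal Python sketch of the above, assuming GeometricDistribution(p) counts failures before the first success (sampled via inverse transform). The report shape, the schedule callback, and the SOURCE_REGISTRATION_TIMES constant are hypothetical names for illustration, not the actual implementation.

import math
import random
import secrets

# Assumption: source_registration_time is rounded to whole days with a
# maximum 30-day source expiry, giving 31 possible values.
SOURCE_REGISTRATION_TIMES = range(31)

P_1 = 0.952  # parameter when source_registration_time is omitted
P_2 = 0.992  # parameter per possible source_registration_time value

def sample_geometric(p):
    """Number of failures before the first success, with success probability p."""
    u = random.random()
    return math.floor(math.log(1.0 - u) / math.log(1.0 - p))

def make_null_report(source_registration_time=None):
    # A null contribution: random 128-bit key, value 0. Everything else
    # would be copied from the trigger as described above.
    report = {"contributions": [{"key": secrets.randbits(128), "value": 0}]}
    if source_registration_time is not None:
        report["source_registration_time"] = source_registration_time
    return report

def schedule_null_reports(include_source_registration_time, schedule):
    """Run on every successful trigger registration, whether or not it
    produced real reports. Null reports bypass rate limit checks entirely."""
    if not include_source_registration_time:
        for _ in range(sample_geometric(P_1)):
            schedule(make_null_report())
    else:
        for t in SOURCE_REGISTRATION_TIMES:
            for _ in range(sample_geometric(P_2)):
                schedule(make_null_report(source_registration_time=t))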

@csharrison
Collaborator Author

A simpler alternative privacy mechanism to geometric noise is a "one-way flipping" proposal. Here is the algorithm:

  • Upon a successful trigger registration:
    • If source_registration_time is omitted:
      • If no real report is present, emit a null report with probability p_1
    • Otherwise, for each possible value of source_registration_time:
      • If no real report is present with this value of source_registration_time, emit a null report with probability p_2
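A minimal self-contained sketch of this mechanism, with the same caveats as before: the report shape and SOURCE_REGISTRATION_TIMES are hypothetical names, and p_1 / p_2 here are separate flip probabilities, not the geometric parameters from the previous comment.

import random
import secrets

SOURCE_REGISTRATION_TIMES = range(31)  # assumption: whole days, 30-day expiry

def make_null_report(source_registration_time=None):
    # A null contribution: random 128-bit key, value 0.
    report = {"contributions": [{"key": secrets.randbits(128), "value": 0}]}
    if source_registration_time is not None:
        report["source_registration_time"] = source_registration_time
    return report

def one_way_flip(real_reports, include_source_registration_time, p_1, p_2, schedule):
    """Run on every successful trigger registration."""
    if not include_source_registration_time:
        # At most one fake report: flip only when no real report is present.
        if not real_reports and random.random() < p_1:
            schedule(make_null_report())
    else:
        present = {r.get("source_registration_time") for r in real_reports}
        for t in SOURCE_REGISTRATION_TIMES:
            # At most one fake report per possible source_registration_time.
            if t not in present and random.random() < p_2:
                schedule(make_null_report(source_registration_time=t))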

This mechanism lacks some interesting properties of the geometric distribution (e.g. privacy amplification from shuffling reports among many users), but at the same time I'm not sure we could achieve those with the report verification proposal. So in general, this one might be preferred because:

  1. It is simpler
  2. It emits a bounded number of fake reports (vs. the unbounded geometric distribution)

cc @linnan-github
