
Identifying the noisy topics #75

Closed
AlexandreGilotte opened this issue May 30, 2022 · 1 comment

@AlexandreGilotte

It seems to me that the current specs of the API may enable a simple and practical attack to identify the noisy topics, which could thus be filtered out by the DSPs.

This attack relies on those two rules:

  • "The caller only receives topics it has observed the user visit in the past."
  • "The exception to this filtering is the 5% random topic, that topic will not be filtered."

A direct consequence of those rules is that if a caller has never observed a user before, then any topic it receives is a random topic.

An attacker could thus call the API with two distinct endpoints:

  • one regular endpoint, observing as much of the web as possible, to get as many user topics as possible (this is just regular API use).
  • an attack endpoint, which has never observed the user before. Any topic returned to this endpoint is a random topic, and should be filtered out from the result of the regular query.

Ensuring that an endpoint has never observed the user may be non-trivial, but a simple proxy would be to use as a caller the site the user is on. Any topic returned to this caller which is not the topic assigned to that site is thus a random topic.
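The filtering step described above can be sketched as a simple set difference. This is an illustrative model only (the function name and topic strings are hypothetical): it assumes a DSP has collected, server-side, the topics returned to each of its two callers.

```python
def filter_noise(regular_topics, attack_topics):
    """Drop any topic that the never-before-seen caller also received.

    Per the two rules quoted above, a caller that has never observed
    the user can only receive the 5% random topic, so any topic it
    returns marks that topic as noise in the regular caller's result.
    """
    noise = set(attack_topics)
    return [t for t in regular_topics if t not in noise]

# Example: the regular caller sees three topics; the fresh caller sees one.
print(filter_noise(["/Sports", "/Travel", "/Arts"], ["/Travel"]))
# ['/Sports', '/Arts']
```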

What are your thoughts on this?

@martinthomson

My understanding here is that there is a fixed process that occurs in any given top-level browsing context.

  1. Initialize an empty set.
  2. For each of the three preceding weeks:
    a. Calculate the top 5 topics for that week.
    b. Draw a number from $[0..20)$ at random.
    c. If the number is 0 (5% probability), draw a number from $[0..349)$ at random and add the corresponding topic from the complete set of topics to the set.
    d. Otherwise, draw a number from $[0..5)$ at random and add the corresponding topic from the top 5 to the set, unless the site calling the API has not witnessed this topic.
  3. Shuffle the set and return it.

The process would use a PRG as its random source with the seed being determined by the top-level site identity and epoch¹. As a result, all third-party content would draw the same random values. The only difference in the result would be that genuine topics would not be seen by a site that wasn't able to witness that topic (the final clause of step 2d).
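The process above can be sketched as follows. This is a guess at the mechanism, not the browser's actual implementation: the function name and the per-(site, epoch) seed argument are assumptions, and the 349-topic taxonomy size is taken from the discussion above. The key property is that the random draws depend only on the seed, so callers with different witnessed sets see the same random values.

```python
import random

TAXONOMY_SIZE = 349  # size of the initial Topics taxonomy

def topics_for_caller(top5_by_week, witnessed, seed):
    """Sketch of the per-epoch selection process described above.

    top5_by_week: three lists, each the top 5 topics of one preceding week.
    witnessed:    set of topics this caller has observed for the user.
    seed:         shared per-(site, epoch) seed, so every caller on the
                  same top-level site draws the same random values.
    """
    rng = random.Random(seed)
    result = []
    for top5 in top5_by_week:
        if rng.randrange(20) == 0:
            # 5% branch: a uniformly random topic, never filtered.
            result.append(rng.randrange(TAXONOMY_SIZE))
        else:
            # 95% branch: one of that week's top 5, filtered if unwitnessed.
            topic = top5[rng.randrange(5)]
            if topic in witnessed:
                result.append(topic)
    rng.shuffle(result)
    return result
```

Because the draws are identical for every caller with the same seed, a caller that has witnessed nothing receives exactly the random-branch topics, a subset of what a fully-witnessing caller receives.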

So yes, this would result in only the random topics being shown to the unique site (your "attack endpoint", though I would frame all sites as attackers here), which could then inform other sites which topics were random. There is a low probability of a random topic also being genuine, meaning that this trick would filter out those genuine topics too, but the chance is low enough to disregard.

An attacker needs a fairly large pool of sites in order to be sure that users haven't seen them call the API before. Getting a domain you control on the PSL is the easiest way to do that. You can't get wildcard certificates for those domains after doing that (though maybe before... ) but running ACME is possible.

It is also possible to classify a topic by whether you receive it more than once: just one repetition makes it extremely likely (>99%) that the topic is genuine, while a topic that never recurs is more likely the randomized one. This takes multiple weeks, as you have to wait for the topic to disappear.
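A rough check of the >99% figure, under the simplifying (and assumed) conditions that a genuine topic remains in the user's top 5 in the later epoch and that both explanations are equally likely a priori:

```python
# If a topic appears again in a later epoch, how likely is it genuine?
p_random_repeat = 0.05 * (1 / 349)  # random branch AND same topic redrawn
p_genuine_repeat = 0.95 * (1 / 5)   # assumed: topic is still in the top 5
posterior_random = p_random_repeat / (p_random_repeat + p_genuine_repeat)
print(f"{1 - posterior_random:.3%}")  # roughly 99.9%
```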

Footnotes

  1. This isn't entirely clear from the explainer and there is no specification, so this is a guess. I believe that this algorithm is necessary to avoid having callers collaborate within a site to enumerate the complete set of topics. Though enumeration requires witnessing from each of a non-trivial number of sites to perform reliably, if the PRG seed isn't constant the randomness can be erased easily.
