
Identifying the noisy topics #75

Closed
AlexandreGilotte opened this issue May 30, 2022 · 1 comment

@AlexandreGilotte

It seems to me that the current specs of the API may enable a simple and practical attack to identify the noisy topics, which could thus be filtered out by the DSPs.

This attack relies on those two rules:

  • "The caller only receives topics it has observed the user visit in the past."
  • "The exception to this filtering is the 5% random topic, that topic will not be filtered."

A direct consequence of those rules is that if a caller has never observed a user before, then any topic it receives is a random topic.

An attacker could thus call the API with two distinct endpoints:

  • one regular endpoint, observing as much of the web as possible, to get as many user topics as possible (this is just regular API use).
  • an attack endpoint, which has never observed the user before. Any topic returned to this endpoint is a random topic, and should be filtered out from the result of the regular query.

Ensuring that an endpoint has never observed the user may be non-trivial, but a simple proxy would be to use as a caller the site the user is on. Any topic returned to this caller which is not the topic assigned to that site is thus a random topic.
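The filtering step described above can be sketched as a simple set difference. This is an illustrative model only (the function name and topic strings are hypothetical): it assumes a DSP has collected, server-side, the topics returned to each of its two callers.

```python
def filter_noise(regular_topics, attack_topics):
    """Drop any topic that the never-before-seen caller also received.

    Per the two rules quoted above, a caller that has never observed
    the user can only receive the 5% random topic, so any topic it
    returns marks that topic as noise in the regular caller's result.
    """
    noise = set(attack_topics)
    return [t for t in regular_topics if t not in noise]

# Example: the regular caller sees three topics; the fresh caller sees one.
print(filter_noise(["/Sports", "/Travel", "/Arts"], ["/Travel"]))
# ['/Sports', '/Arts']
```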

What are your thoughts on this?

@martinthomson

My understanding here is that there is a fixed process that occurs in any given top-level browsing context.

  1. Initialize an empty set.
  2. For each of the three preceding weeks:
    a. Calculate the top 5 topics for that week.
    b. Draw a number from $[0..20)$ at random.
    c. If the number is 0 (5% probability), draw a number from $[0..349)$ at random and add the corresponding topic from the complete set of topics to the set.
    d. Otherwise, draw a number from $[0..5)$ at random and add the corresponding topic from the top 5 to the set, unless the site calling the API has not witnessed this topic.
  3. Shuffle the set and return it.

The process would use a PRG as its random source with the seed being determined by the top-level site identity and epoch¹. As a result, all third-party content would draw the same random values. The only difference in the result would be that genuine topics would not be seen by a site that wasn't able to witness that topic (the final clause of step 2d).
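The process above can be sketched as follows. This is a guess at the mechanism, not the browser's actual implementation: the function name and the per-(site, epoch) seed argument are assumptions, and the 349-topic taxonomy size is taken from the discussion above. The key property is that the random draws depend only on the seed, so callers with different witnessed sets see the same random values.

```python
import random

TAXONOMY_SIZE = 349  # size of the initial Topics taxonomy

def topics_for_caller(top5_by_week, witnessed, seed):
    """Sketch of the per-epoch selection process described above.

    top5_by_week: three lists, each the top 5 topics of one preceding week.
    witnessed:    set of topics this caller has observed for the user.
    seed:         shared per-(site, epoch) seed, so every caller on the
                  same top-level site draws the same random values.
    """
    rng = random.Random(seed)
    result = []
    for top5 in top5_by_week:
        if rng.randrange(20) == 0:
            # 5% branch: a uniformly random topic, never filtered.
            result.append(rng.randrange(TAXONOMY_SIZE))
        else:
            # 95% branch: one of that week's top 5, filtered if unwitnessed.
            topic = top5[rng.randrange(5)]
            if topic in witnessed:
                result.append(topic)
    rng.shuffle(result)
    return result
```

Because the draws are identical for every caller with the same seed, a caller that has witnessed nothing receives exactly the random-branch topics, a subset of what a fully-witnessing caller receives.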

So yes, this would result in only the random topics being shown to the unique site (your "attack endpoint", though I would frame all sites as attackers here), which could then inform other sites which topics were random. There is a low probability of a random topic also being genuine, meaning that this trick would filter out those genuine topics too, but the chance is low enough to disregard.

An attacker needs a fairly large pool of sites in order to be sure that users haven't seen them call the API before. Getting a domain you control on the PSL is the easiest way to do that. You can't get wildcard certificates for those domains after doing that (though maybe before... ) but running ACME is possible.

It is also possible to classify a topic by whether you receive it more than once: just one repetition makes it extremely likely (>99%) that the topic is genuine, while a topic that never recurs is more likely the randomized one. This takes multiple weeks, as you have to wait for the topic to disappear.
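A rough check of the >99% figure, under the simplifying (and assumed) conditions that a genuine topic remains in the user's top 5 in the later epoch and that both explanations are equally likely a priori:

```python
# If a topic appears again in a later epoch, how likely is it genuine?
p_random_repeat = 0.05 * (1 / 349)  # random branch AND same topic redrawn
p_genuine_repeat = 0.95 * (1 / 5)   # assumed: topic is still in the top 5
posterior_random = p_random_repeat / (p_random_repeat + p_genuine_repeat)
print(f"{1 - posterior_random:.3%}")  # roughly 99.9%
```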

Footnotes

  1. This isn't entirely clear from the explainer and there is no specification, so this is a guess. I believe that this algorithm is necessary to avoid having callers collaborate within a site to enumerate the complete set of topics. Though enumeration requires witnessing from each of a non-trivial number of sites to perform reliably, if the PRG seed isn't constant the randomness can be erased easily.
