
Provide Topics API for not adding current page's topics #54

Closed
stguav opened this issue Mar 16, 2022 · 19 comments

@stguav commented Mar 16, 2022

The Topics API provides a single zero-argument function, document.browsingTopics(), which serves three logically distinct purposes:

  1. Getting topics about the user. This is the obvious one.
  2. Determining the caller's eligibility to receive topics from future calls to the API.
  3. Building up the user's set of top 5 topics for the epoch of the current call.

It would be useful to provide a little more control over these three different aspects of the API. In particular, there is some tension between the first two use cases and the last one. For the first two, there is no downside (aside, potentially, from some latency) to calling the API: each ad tech is incentivized to call the API whenever possible, either to get useful signals or to enable nonempty responses for future calls.

On the other hand, there are potential downsides to calling the API when it comes to the third point. Consider, for example, a very large publisher site whose domain- and subdomain-level topics are generic and not commercially relevant. The ad tech might like to call the API to get useful signals, but with the current API it may not be worth the risk of contaminating the user's future top 5 topics with those generic, commercially irrelevant topics.

It would be beneficial to provide an argument that controls this behavior, something like browsingTopics(add_current_topics=true). Since eligibility is determined per API caller, there should be no ecosystem concern about "freeloaders" getting other callers' topics without contributing. There also does not seem to be any detrimental effect on user privacy. While the concern mentioned above might be partially mitigated by improved topic ranking and commercially focused taxonomy changes, it seems best to give API callers this flexibility in how they use the API.
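For concreteness, a minimal sketch of how this could look from a caller's frame, rendering the proposed argument as a JavaScript options object (the add_current_topics option is hypothetical, taken from this proposal, not a shipped parameter):

```js
// Hypothetical option from this proposal: retrieve the user's topics (1)
// without letting the current page's topics count toward caller
// eligibility (2) or the user's future top 5 topics (3).
const topics = await document.browsingTopics({ add_current_topics: false });

// Default behavior, equivalent to today's zero-argument call: retrieve
// topics and also record this page view for (2) and (3).
const topicsAndRecord = await document.browsingTopics();
```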

@michaelkleber (Collaborator)

> Since eligibility is determined per API caller, there should be no ecosystem concern about "freeloaders" getting other callers' topics without contributing.

I'm surprised to hear you say that. During previous discussions of the API we have very consistently heard that it needs to be "pay to play" based on the risk that sites would always choose to request a user's topics without contributing their own visits to the model, if that choice were possible.

@stguav (Author) commented Mar 17, 2022

I should clarify: for exactly this reason, I think we should still combine 2 and 3 from my description of the API. If the API caller sets add_current_topics=false in the call, the current page's topics would be used neither for per-caller topic eligibility nor for building up the profile returned by future calls.

The API caller only gets topics from calls where they contributed. If a caller never contributes /News, then they will not receive it on future calls. So I don't think there are any ecosystem risks: if an ad tech always calls with add_current_topics=false, then they won't receive any topics in future API calls.
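To illustrate the semantics being described, a rough sketch of the per-caller filtering (illustrative pseudologic, not Chromium's actual implementation):

```js
// observations: Map<callerDomain, Set<topicId>> built only from calls that
// recorded a view (i.e., where add_current_topics was not false).
function topicsForCaller(caller, epochTopTopics, observations) {
  const seen = observations.get(caller) ?? new Set();
  // A caller receives an epoch topic only if it observed the user on a
  // page about that topic, so never contributing means never receiving.
  return epochTopTopics.filter((topic) => seen.has(topic));
}
```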

@michaelkleber (Collaborator)

I understand, but the risk we've heard about is of a particular publisher site saying "it's OK to use topics for targeting on my site but not to have my site contribute to the topics profile".

@dmarti (Contributor) commented Mar 17, 2022

Very large sites, such as those that carry user-generated content from large numbers of users, are already effectively opted out of contributing, just by having some content about each topic. (They could opt back in by splitting their content into multiple sections: #17)

@stguav (Author) commented Mar 19, 2022

> I understand, but the risk we've heard about is of a particular publisher site saying "it's OK to use topics for targeting on my site but not to have my site contribute to the topics profile".

This is something that SSPs can manage in their relationships with publishers, similar to other agreements around audience targeting that are common today.

@jkarlin (Collaborator) commented Apr 26, 2022

I agree that the contribution risk is mitigated now that a caller only receives topics for sites that it runs (2) on. I think teasing this apart into (1) and (2+3) as separate surfaces makes sense.

The fact that 1+2+3 are tied together has caused problems for #7, where sending the topic via a request header would also trigger behaviors (2+3) that the server may not have wanted or agreed to. I think the way forward there is to split (1) into a request header and (2+3) into a response header.
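For illustration, a sketch of what that split might look like; the fetch option and header names here are one possible shape, not a finalized design:

```js
// Client side: ask the browser to attach the user's topics (1) to the
// request as a request header.
const res = await fetch("https://adtech.example/bid", {
  browsingTopics: true, // hypothetical fetch option for this split
});

// Server side, conceptually: observation (2+3) becomes a separate, explicit
// opt-in, e.g. the server replies with a response header such as
//   Observe-Browsing-Topics: ?1
// and only then does the browser record this page view for the caller.
```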

Your proposal of adding an argument to document.browsingTopics() seems reasonable to me. We could also consider adding a separate method for (2+3), though I don't have strong feelings; your proposal would be more backwards compatible.

@jkarlin (Collaborator) commented Apr 27, 2022

@stguav thinking more on this, it seems difficult for a server to figure out when it ought to call (2+3) in order to filter out particular topics. The server would (a) need to know what Chrome thinks the topics are for the given site, and (b) risk collateral damage if the site has other topics that the server is interested in.

Is the use case here purely to filter out overreported topics from the user's top 5? The better solution to that problem might be weighting the topics by frequency.
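For illustration, one possible reading of frequency weighting; the scoring here is invented for this sketch and not specified anywhere:

```js
// Down-weight topics that are common everywhere (IDF-style), so that
// generic, overreported topics are less likely to crowd out the top 5.
function topFiveWeighted(visitCounts, globalPrevalence) {
  // visitCounts: Map<topicId, pages this user visited with that topic>
  // globalPrevalence: Map<topicId, fraction of all pages with that topic>
  return [...visitCounts.entries()]
    .map(([topic, count]) => ({
      topic,
      score: count * Math.log(1 / (globalPrevalence.get(topic) ?? 1)),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 5)
    .map(({ topic }) => topic);
}
```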

@stguav (Author) commented Apr 27, 2022

> @stguav thinking more on this, it seems difficult for a server to figure out when it ought to call (2+3) in order to filter out particular topics. The server would (a) need to know what Chrome thinks the topics are for the given site, and (b) risk collateral damage if the site has other topics that the server is interested in.

I'm not sure I agree that this is difficult for a server to figure out. I believe the Chrome classification will be available as discussed in #64 (comment), and this seems similar to other kinds of optimizations that ad tech providers (ATPs) are used to making.

> Is the use case here purely to filter out overreported topics from the user's top 5? The better solution to that problem might be weighting the topics by frequency.

Yes, the primary use case is to have more control over the kinds of topics the API would return, and frequency weighting would help. However, "overreported" is not precisely the same issue. ATPs are interested in commercial relevance. (Just because a topic is rare does not mean that it is commercially relevant.) I expect that Chrome may not want to decide the commercial relevance of different topics, since different parts of the ads ecosystem are likely to have different opinions, and creating a broad consensus seems difficult. So it seems preferable to me to provide more control and flexibility in the API, as long as it doesn't compromise user privacy or ecosystem health.

@jkarlin (Collaborator) commented Apr 28, 2022

I'd need to think further about the impact it might have on user privacy. What if different third parties have different ideas of what topics are commercially relevant, and the top 5 topics wind up being super noisy? I guess the idea is that a little bit of influence over the top topics is better than none?

@zhengweiwithoutthei

One important use case for a separate set/get API (either as JS or as the headers proposed in #7) is regulatory compliance. Unlike user consent signals, publisher control settings might exist only on the server side, so the client side cannot decide whether it is OK to use the API until it has talked to the server. Most of the pressure comes from the setting part of the API (2+3); separating the API into a getter and a setter would largely mitigate the concern.

@dmarti (Contributor) commented May 1, 2022

@zhengweiwithoutthei Not all users are going to be in a consent-based jurisdiction. We have to be able to handle both GDPR-like jurisdictions (where you need a basis for processing, generally consent in the case of marketing) and opt-out-based jurisdictions like California, where you don't need consent in advance, but if an opt-out is in effect for a given user/site pair you can't share their info. (You can't "sell" their info today, where "sell" includes any exchange of data for something of value; the "share" restriction takes effect next year.) So there are regulatory issues around (1) as well.

@zhengweiwithoutthei

@dmarti Yes, I agree there are regulatory issues around (1) as well. My view is as follows: the processing for (1) (use of the topics signal for targeting) is usually done server side, where you have a more complete set of signals for consents, configurations, opt-outs, etc. So even if all the client-side consent checks pass and we call the API to get the topics, the server side can still nullify or ignore the signal if an opt-out is in effect. This should satisfy both consent-based and opt-out-based jurisdictions.

However, (2) and (3) can be a bigger issue: if setting the topics profile happens client side together with (1), we might not yet have all of the legal basis we need in advance, and there is no way to revert the recorded topic after talking to the server.

@jkarlin (Collaborator) commented May 23, 2022

I do think it's reasonable to separate (1) from (2+3); the fact that they were combined at all was due to legacy FLoC reasons that are no longer relevant. So the plan is (sometime in the next few milestones) to add document.browsingTopics({recordView: false}), where the default for recordView is true. Please provide any feedback on the proposal here.
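For concreteness, a minimal sketch of the planned shape, using the recordView option exactly as proposed above (the option name could still change before shipping):

```js
// Retrieve-only call: get the user's topics (1) without recording that this
// caller observed the user on this page, so no effect on (2) or (3).
const topics = await document.browsingTopics({ recordView: false });

// Default behavior (recordView: true): retrieve and record, as today.
const topicsAndObserve = await document.browsingTopics();
```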

@zhengweiwithoutthei

I am supportive of this proposal. Please also consider a request/response header version of the implementation (#7) for the same proposal.

@stguav (Author) commented May 23, 2022

Thanks for the update, @jkarlin!

@jkarlin (Collaborator) commented Jul 12, 2022

Hey all. Quick update on thoughts in this space.

Pros:

  • Enables sending topics in headers (#7, "Should there be a way to send topics via Fetch as a request header?"), which is much better for performance (e.g., on the initial bid request, as opposed to waiting around to create a cross-origin iframe in a new process and postMessage the result back)
  • As mentioned above by others, it's better for regulatory compliance
  • More developer friendly (retrieving topics and noting an observation are more naturally viewed as separate actions)

Cons:

  • Simpler to determine which topics are noise: just use a separate caller domain that you never observe/witness on and only retrieve there; if you retrieve something, it's noise (see #75, "Identifying the noisy topics", and the sketch after this list)
  • Easier to encode information into observations, since retrieving no longer also causes observation, exacerbating the attacks discussed in #74 ("Fingerprinting threat using the Topics API")
  • Note that neither of these attack surfaces is introduced by this change; they're just made a bit simpler
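To make the first con concrete, here is a sketch of the noise-detection trick, assuming the separation ships as the recordView option discussed above; the caller domain and the recordNoiseTopic helper are hypothetical:

```js
// Run from a dedicated caller domain (say, noise-probe.example) that only
// ever retrieves and never observes. Since that caller has observed no
// topics, any topic the API returns to it must be randomly injected noise.
const returned = await document.browsingTopics({ recordView: false });
for (const t of returned) {
  recordNoiseTopic(t.topic); // hypothetical bookkeeping helper
}
```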

@dmarti (Contributor) commented Jul 13, 2022

Another item for the "cons" section is that excluding the current site's topics could help callers monetize scraper sites (and other low-value, low-engagement sites).

If the same widely used caller is on legit topic-focused sites and on scraper sites, it can choose to call the Topics API on the scraper sites in order to get valuable topics leaked from the legit sites -- but today it also ends up contributing (2+3) whatever topics the scraper site happened to have, which pollutes the profile it later retrieves (1), obfuscating its own view of the user.

Blocking the collection of topics for the current site makes it easy for a caller to set up a one-way data flow from legit sites to scraper sites.

@jkarlin (Collaborator) commented Jul 13, 2022

Typically the browser doesn't make judgement calls as to what sites are high or low value. If the user is going to the site, then presumably it has value to the user at that time. Therefore I'm not sure I see that as a con here.

@dmarti (Contributor) commented Jul 13, 2022

People visit low-quality sites all the time, generally because of some kind of deceptive link. They bounce, but usually only after the ad impressions count as viewed.

Depending on the rewards for running a scraper or other crappy site, those sites will have more or less incentive to try to get users to click on deceptive links (using black hat SEO, social and forum spam, malware, whatever). The more the Topics API gives an advantage to low-quality sites over high-quality ones, the more deceptive links users will likely be exposed to.
