Hostnames may be private #185

chrisvls · 2023-05-22T22:37:44Z

1. Intranet hostnames may include private information, like project names, division names, application names.
2. Intranet hostnames may include information that is difficult for a third-party to accurately assign to topics.
3. SaaS applications that use subdomains may include significant information in the hostname. If, for example, it were known that an M&A deal room application had a hostname of "siliconvalleybank.dealroomapp.com", it could move the stock market.
4. For some SaaS applications, the intent or consent to share topic information may be at the instance or even user level. Some instances of a wiki might not be sensitive. Others may be.
5. Many site terms of service would prohibit disclosure of the hostname or sufficient information to assign a topic.

How does the Topics API envision managing these aspects?

chrisvls · 2023-05-23T00:28:38Z

Similar to 3 above, it is also common for SaaS contracts to prohibit disclosure that company X is a customer of App Y except as authorized. As a result, disclosing "companyX.SaaSApplicationY.com" would violate pretty standard clauses governing this kind of confidentiality.

jkarlin · 2023-05-24T16:24:33Z

Thanks for reaching out! Responses inline:

> Intranet hostnames may include private information, like project names, division names, application names.
> Intranet hostnames may include information that is difficult for a third-party to accurately assign to topics.

We have a few layers of protection here. First, we don't classify hostnames that resolve to private IP address space (IANA reserved address ranges), and many intranets live in such reserved ranges. Second, I don't expect that these intranet sites will be serving ads and calling the browsingTopics API, so they won't be included in the user's top topic calculation. Third, the taxonomy is rather coarse-grained. Fourth, we introduce noised topics, so one doesn't know which sites the user actually visited. And finally, the user may have visited any one (or several) of a number of sites about a given topic. At best, inferring which site a user visited is probabilistic.
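As a rough illustration of the first layer, here is a minimal Python sketch of a reserved-range check. It is not the browser's implementation: the function names are hypothetical, the standard-library `ipaddress` flags stand in for the IANA reserved ranges, and the rule that any non-public resolved address excludes the hostname is an assumption made for the example.

```python
import ipaddress


def is_reserved_address(ip: str) -> bool:
    """Return True if ip is in a non-public range (private, loopback, link-local, or reserved)."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved


def eligible_for_classification(resolved_ips: list[str]) -> bool:
    """Hypothetical rule: classify a hostname only if every resolved address is public."""
    # Assumption: an empty resolution result is treated as ineligible.
    return bool(resolved_ips) and not any(is_reserved_address(ip) for ip in resolved_ips)
```

Under these assumptions, a hostname resolving to `10.1.2.3` would be skipped, while one resolving only to a public address would remain a candidate for classification.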

> SaaS applications that use subdomains may include significant information in the hostname. If, for example, it were known that an M&A deal room application had a hostname of "siliconvalleybank.dealroomapp.com", it could move the stock market.

Similar to above. The topics provided by the taxonomy are very coarse-grained. The Topics API currently classifies siliconvalleybank.dealroomapp.com as "149. Finance". That doesn't reveal whether the topic came from the eTLD+1 or from the subdomain, let alone which particular bank it was.
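To make the coarseness argument concrete, here is a toy Python sketch. Only the first mapping comes from this thread; the other hostnames and the `candidate_hosts` helper are hypothetical, and the real classifier is not a lookup table. The point is only that an observer who sees a coarse topic is left with many candidate sites.

```python
# Toy stand-in for classifier output: hostname -> coarse topic label.
TOY_CLASSIFICATION = {
    "siliconvalleybank.dealroomapp.com": "149. Finance",  # mapping quoted in this thread
    "bank.example": "149. Finance",       # hypothetical
    "loans.example": "149. Finance",      # hypothetical
    "budgeting.example": "149. Finance",  # hypothetical
}


def candidate_hosts(topic: str) -> list[str]:
    """Return every hostname in the toy table that maps to the given topic."""
    return sorted(host for host, label in TOY_CLASSIFICATION.items() if label == topic)
```

In this toy table, `candidate_hosts("149. Finance")` returns four hostnames, so the topic alone does not single out the deal room instance.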

> For some SaaS applications, the intent or consent to share topic information may be at the instance or even user level. Some instances of a wiki might not be sensitive. Others may be.
> Many site terms of service would prohibit disclosure of the hostname or sufficient information to assign a topic.

Those apps/sites can disable topics (e.g., via Permissions Policy) on instances where consent is not given or disclosure is prohibited. When calculating the next set of topics for the user, the API only considers hostnames from pages where all of the following hold: the API is called, the permissions policy grants the call, the IP address is not reserved, the user is not in incognito mode, the user hasn't disabled the feature, etc.
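The conditions above can be sketched as a single predicate. This is an illustrative Python model, not browser code: the `PageVisit` fields and `counts_toward_topics` are hypothetical names, and the "etc." in the comment means the real list of conditions may be longer. The header value `browsing-topics=()` is the Permissions-Policy directive a site can send to opt a page out.

```python
from dataclasses import dataclass

# Response header a site can send to disable the Topics API on a page.
OPT_OUT_HEADER = ("Permissions-Policy", "browsing-topics=()")


@dataclass(frozen=True)
class PageVisit:
    """Hypothetical record of one page visit; field names are assumptions."""
    hostname: str
    api_called: bool
    policy_allows_topics: bool
    ip_is_reserved: bool
    incognito: bool
    user_disabled_topics: bool


def counts_toward_topics(visit: PageVisit) -> bool:
    """Return True only if every condition listed in the comment above holds."""
    return (
        visit.api_called
        and visit.policy_allows_topics
        and not visit.ip_is_reserved
        and not visit.incognito
        and not visit.user_disabled_topics
    )
```

In this model, a SaaS instance that serves `OPT_OUT_HEADER` would have `policy_allows_topics` set to `False`, so its hostname would never feed into the topic calculation.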

chrisvls · 2023-05-25T20:54:18Z

Thank you for your thoughtful response. I had misunderstood a very basic aspect -- that the classification occurs within the browser, not in a service running elsewhere. (I think I saw the "public" and "by a partner" and jumped to conclusions.)

I will add, though, that there seems to be an argument that the default permissions policy should deny extraction and sharing of that data during the experimentation phase.

This leads more broadly to a comment that I'll make elsewhere on the spec (when I'm more confident I haven't missed basic points, like the above ;) )... the safety of this API relies very heavily on the coarseness of the topic taxonomy and relies somewhat on the coarseness of the topics calculation input data.

But the spec makes no promise that the taxonomy will remain coarse. Indeed, it highlights that accepting the spec entails accepting changes to the taxonomy. And the spec explicitly states that the input data could include all text in the document.

chrisvls · 2023-05-25T21:08:40Z

Oh, I would also add that lots of internal apps and intranets are hosted on public cloud infrastructure. We may not be in a zero-trust world, but we are definitely in a world where lots of what-used-to-be internal-only apps are accessible from anywhere and not IP-blocked.

chrisvls mentioned this issue Jun 12, 2023

Dumb question... where in the spec does it restrict topic input data collection or topic calculation to pages that have called the API? #197

Closed

chrisvls closed this as completed Jun 12, 2023

chrisvls mentioned this issue Jun 22, 2023

Topics calculation input data should remain local to the user's machine and the classification model should run locally #193

Closed
