The Laboratorium (3d ser.)

A blog by James Grimmelmann

Be regular and orderly in your life, so that you may
be violent and original in your work.

When Law is Code

I have a new Jotwell review of Sarah Lawsky’s Coding the Code: Catala and Computationally Accessible Tax Law. It is nominally a review of this recent (outstanding) article, but I used the occasion to go back through her recent body of work and introduce it to a wider audience who may not be aware of the remarkable nature of her project. Here are some excerpts:

Sarah B. Lawsky’s Coding the Code: Catala and Computationally Accessible Tax Law offers an exceptionally thoughtful perspective on the automation of legal rules. It provides not just a nuanced analysis of the consequences of translating legal doctrines into computer programs (something many other scholars have done), but also a tutorial in how to do so effectively, with fidelity to the internal structure of law and humility about what computers do and don’t do well. …

Coding the Code, like the rest of Lawsky’s work, stands out in two ways. First, she is actively making it happen, using her insights as a legal scholar and logician to push forward the state of the art. Her Lawsky Practice Problems site–a hand-coded open source app that can generate as many tax exercises as students have the patience to work through–is a pedagogical gem, because it matches the computer science under the hood to the structure of the legal problem. (Her Teaching Algorithms and Algorithms for Teaching documents the app and why it works the way it does.)

Second, Lawsky’s claims about the broader consequences of formal approaches are grounded in a nuanced understanding of what these formal approaches do well and what they do not. Sometimes formalization leads to insight; her recent Reasoning with Formalized Statutes shows how coding up a statute section can reveal unexpected edge cases and drafting mistakes. At other times, formalization is hiding in plain sight. As she observes in 2020’s Form as Formalization, the IRS already walks taxpayers through tax algorithms; its forms provide step-by-step instruction for making tax computations. In every case, Lawsky carefully links her systemic claims to specific doctrinal examples. She shows not that computational law will change everything, but rather that it is already changing some things, in ways large and small.

GenLaw 2024

I’m virtually attending the GenLaw 2024 workshop today, and I will be liveblogging the presentations.

Introduction

A. Feder Cooper and Katherine Lee: Welcome!

The generative AI supply chain includes many stages, actors, and choices. But wherever there are choices, there are research questions: how do ML developers make those choices? And wherever there are choices, there are policy questions: what are the consequences for law and policy of those choices?

GenLaw is not an archival venue, but if you are interested in publishing work in this space, consider the ACM CS&Law conference, happening next in March 2025 in Munich.

Kyle Lo

Kyle Lo, Demystifying Data Curation for Language Models.

I think of data in three stages:

  1. Shopping for data, or acquiring it.
  2. Cooking your data, or transforming it.
  3. Tasting your data, or testing it.

Someone once told me, “Infinite tokens, you could just train on the whole Internet.” Scale is important. What’s the best way to get a lot of data? Our #1 choice is public APIs leading to bulk data. 80 to 100% of the data comes from web scrapers (CommonCrawl, Internet Archive, etc.). These are nonprofits that have been operating long before generative AI was a thing. A small percentage (about 1%) is user-created content like Wikipedia or ArXiv. And about 5% or less is open publishers, like PubMed. Datasets also heavily remix existing datasets.

Nobody crawls the data themselves unless they’re really big and have a lot of good programmers. You can either do deep domain-specific crawls, or a broad and wide crawl. A lot of websites require you to follow links and click buttons to get at the content. Writing the code to coax out this content—hidden behind JS—requires a lot of site-specific code. For each website, one has to ask whether going through this is worth the trouble.

It’s also getting harder to crawl. A lot more sites have robots.txt that ask not to be crawled or have terms of service restricting crawling. This makes CommonCrawl’s job harder. Especially if you’re polite, you spend a lot more energy working through a decreasing pile of sources. More data is now available only to those who pay for it. We’re not running out of training data, we’re running out of open training data, which raises serious issues of equitable access.
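To make the robots.txt point concrete, here is a minimal sketch (my illustration, not Kyle’s code) of the politeness check a crawler might run before fetching a page, using Python’s standard library; the crawler name and URLs are hypothetical.

    # Check a site's robots.txt before fetching a page. The user agent and
    # URLs are made up for illustration; real pipelines add rate limits,
    # terms-of-service checks, and much more on top of this.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    user_agent = "MyResearchCrawler"  # hypothetical crawler name
    url = "https://example.com/articles/some-page.html"

    if robots.can_fetch(user_agent, url):
        print("robots.txt permits crawling this URL")
    else:
        print("robots.txt asks us not to crawl this URL")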

Moving on to transformation, the first step is to filter out low-quality pages (e.g., site navigation or r/microwavegang). You typically need to filter out sensitive data like passwords, NSFW content, and duplicates.

Next is linearization: remove header text, navigational links on pages, etc., and convert to a stream of tokens. Poor linearization can be irrecoverable. It can break up sentences and render source content incoherent.
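As an illustration of linearization (again mine, not the speaker’s), here is a toy pass that strips navigation chrome from an HTML page and flattens the rest into a stream of text; it assumes the beautifulsoup4 package, and the HTML snippet is invented.

    # Toy linearization: drop boilerplate tags, then flatten the remaining
    # HTML into a single text stream. Real pipelines are far more careful
    # about preserving sentence and document structure.
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <nav>Home | About | Login</nav>
      <article><h1>A post</h1><p>The content we actually want.</p></article>
      <footer>Copyright 2024</footer>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # remove navigation links, footers, and other chrome

    text = soup.get_text(separator=" ", strip=True)
    print(text)  # "A post The content we actually want."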

There is filtering: cleaning up data. Every data source needs its own pipeline! For example, for code, you might want to include Python but not Fortran. Training on user-uploaded CSVs in a code repository is usually not helpful.

Using small-model classifiers to do filtering has side effects. There are a lot of terms of service out there. If you do deduplication, you may wind up throwing out a lot of terms of service. Removing PII with low-precision classifiers can have legal consequences. Or, sometimes we see data that includes scientific text in English and pornography in Chinese—a poor classifier will misunderstand it.
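For a sense of how deduplication can sweep up boilerplate like terms of service, here is a toy exact-deduplication pass (my sketch, not the speaker’s pipeline); production systems typically add fuzzy methods such as MinHash.

    # Toy exact deduplication: hash a normalized form of each document and
    # keep only the first occurrence. Near-identical boilerplate (such as
    # repeated terms-of-service text) collapses to a single copy.
    import hashlib

    def normalize(doc: str) -> str:
        return " ".join(doc.lower().split())

    def deduplicate(docs):
        seen, kept = set(), []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        return kept

    docs = [
        "Terms of Service: you may not ...",
        "Terms  of service: YOU MAY NOT ...",  # near-identical boilerplate
        "An actual blog post about cooking.",
    ]
    print(len(deduplicate(docs)))  # 2: the boilerplate collapses to one copy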

My last point: people have pushed for a safe harbor for AI research. We need something similar for open-data research. In doing open research, am I taking on too much risk?

Gabriele Mazzini

Gabriele Mazzini, Introduction to the AI Act and Generative AI.

The AI Act is a first-of-its-kind in the world. In the EU, the Commission proposes legislation and also implements it. The draft is sent to the Council, which represents the governments of member states, and to the Parliament, which is directly elected. The Council and Parliament have to agree to enact legislation. Implementation is carried out via member states. The Commission can provide some executive action and some guidance.

The AI Act required some complex choices: it should be horizontal, applying to all of AI, rather than being sector-specific. But different fields do have different legal regimes (e.g. financial regulation).

The most important concept in the AI Act is its risk-based approach. The greater the risk, the stricter the rules—but there is no regulation of AI as such. It focuses on use cases, with stricter rules for riskier uses.

  • From the EU’s point of view, a few uses—such as social scoring—pose unacceptable risk and are prohibited.
  • The high-risk category covers about 90% of the rules in the AI Act. This includes AI systems that are safety components of physical products (e.g. robotics). It also includes some specifically listed uses, such as recruitment in employment. These AI systems are subject to compliance with specific requirements ex ante.
  • The transparency risk category requires disclosures (e.g. that you are interacting with an AI chatbot and not a human). This is where generative AI mostly comes in: that you know that content was created by AI.
  • Everything else is minimal or no risk and is not regulated.

Most generative AI systems are in the transparency category (e.g. disclosure of training data). But some systems, e.g. those trained over a certain compute threshold, are subject to stricter rules.

Martin Senftleben

Martin Senftleben, Copyright and GenAI Development – Regulatory Approaches and Challenges in the EU and Beyond

AI forces us to confront the dethroning of the human author. Copyright has long been based on the unique creativity of human authors, but now generative AI generates outputs that appear as though they were human-generated.

In copyright, we give one person a monopoly right to decide what can be done with a work, but that makes follow-on innovation difficult. That was difficult enough in the past, when the follow-on innovation came from other authors (parody, pastiche, etc.). Here, the follow-on innovation comes from the machine. That makes copyright policy complex right now: it is an attempt to reconcile fair remuneration for human authors with a successful AI sector.

The copyright answer would be licensing—on the input side, pay for each and every piece of data that goes into the data set, and on the output side, pay for outputs. If you do this, you get problems for the AI sector. You get very limited access to data, with a few large players paying for data from publishers, but others getting nothing. This produces bias in the sense that it only reflects mainstream inputs (English, but not Dutch and Slovak).

If you try to favor a vibrant AI sector, you don’t require licensing for training and you make all the outputs legal (e.g. fair use). This increases access and you have less bias on the output, but you have no remuneration for authors.

From a legal-comparative perspective, it’s fascinating to see how different legislators approach these questions. Japan and Southeast Asian countries have tried to support AI developers, e.g. broad text and data mining (TDM) exemptions as applied to AI training. In the U.S., the discussion is about fair use and there are about 25 lawsuits. Fair use opens up the copyright system immediately because users can push back.

In the E.U., forget about fair use. We have the 2019 directive on copyright in the Digital Single Market, which was written without generative AI in mind. The focus was on scientific TDM. That exception doesn’t cover commercial or even non-profit activity, only scientific research. (A research organization can, however, work with a private partner.) There is also a broader TDM exemption that enables TDM unless the copyright owner has opted out using “machine-readable means” (e.g. in robots.txt).

The AI Act makes things more complex; it has copyright-related components. It confirms that reproductions for TDM are still within the scope of copyright and require an exemption. It confirms that opt-outs must be observed. What about training in other countries? If you later want to offer your trained models in the EU, you must have evidence that you trained in accordance with EU policy. This is an intended Brussels effect.

The AI Act also has transparency obligations: specifically a “sufficiently detailed summary of the content used for training.” Good luck with that one! Even knowing what’s in the datasets you’re using is a challenge. There will be an AI Office, which will set up a template. Also, is there a risk that AI trained in the EU will simply be less clever than AI trained elsewhere? That it will marginalize the EU cultural heritage?

That’s where we stand in the E.U. Codes of practice will start in May 2025 and become enforceable against AI providers in August 2025. If you seek licenses now, make sure they cover the training you have done in the past.

Panel: Data Curation and IP

Panelists: Julia Powles, Kyle Lo, Martin Senftleben, A. Feder Cooper (moderator)

Cooper: Julia, tell us about the view from Australia.

Julia: Outside the U.S., copyright law also includes moral rights, especially attribution and integrity. Three things: (1) Artists are feeling disempowered. (2) Lawyers have gotten preoccupied with where (geographically) acts are taking place. (3) Governments are in a giant game of chicken over who will insist that AI providers comply. Everyone is waiting for artists to mount challenges that they don’t have the resources to mount. Most people who are savvy about IP hate copyright. We don’t show students or others impacted by copyright the concern that we show for the AI industry. Australia is being very timid, as are most countries.

Cooper: Martin, can you fill us in on moral rights?

Martin: Copyright is not just about the money. It’s about the personal touch of what we create as human beings. Moral rights:

  • To decide whether a work will be made available to the public at all.
  • Attribution, to have your name associated with the work.
  • Integrity, to decide on modifications to the work.
  • Integrity, to object to the use of the work in unwanted contexts (such as pornography).

The impact on AI training is very unclear. It’s not clear what will happen in the courts. Perhaps moral rights will let authors avoid machine training entirely. Or perhaps they will apply at the output level. Not clear whether these rights will fly due to idea/expression dichotomy.

Cooper: Kyle, can you talk about copyright considerations in data curation?

Kyle: I’m worried about two things: (1) it’s important to develop techniques for fine-tuning, but (2) will my company let me work on projects where we hand off control to others? Without some sort of protection for developing unlearning, we won’t have research on these techniques.

Cooper: Follow-up: you went right to memorization. Are we caring too much about memorization?

Kyle: There’s a simplistic view that I want to get away from: that it’s only regurgitation that matters. There are other harmful behaviors, such as a perfect style imitator for an author. It’s hard to form an opinion about good legislation without knowledge of what the state of the technology is, and what’s possible or not.

Julia: It feels like the wave of large models we’ve had in the last few years have really consumed our thinking about the future of AI. Especially the idea that we “need” scale and access to all copyrighted works. Before ChatGPT, the idea was that these models were too legally dangerous to release. We have impeded the release of bioscience because we have gone through the work of deciding what we want to allow. In many cases, having the large general model is not the best solution to a problem. In many cases, the promise remains unrealized.

Martin: Memorization and learning of concepts is one of the most fascinating and difficult problems. From a copyright perspective, getting knowledge about the black box is interesting and important. Cf. Matthew Sag’s “Snoopy problem.” CC licenses often come with a share-alike restriction. If it can be demonstrated that there are traces of this material in fully-trained models, those models would need to be shared under those terms.

Kyle: Do we need scale? I go back and forth on this all the time. On the one hand, I detest the idea of a general-purpose model. It’s all domain effects. That’s ML 101. On the other hand, these models are really impressive. The science-specific models are worse than GPT-4 for their use case. I don’t know why these giant proprietary models are so good. The more I deviate my methods from common practice, the less applicable my findings are. We have to hyperscale to be relevant, but I also hate it.

Cooper: How should we evaluate models?

Kyle: When I work on general-purpose models, I try to reproduce what closed models are doing. I set up evaluations to try to replicate how they think. But I haven’t even reached the point of being able to reproduce their results. Everyone’s hardware is different and training runs can go wrong in lots of ways.

When I work on smaller and more specific models, not very much has changed. The story has been to focus on the target domain, and that’s still the case. It’s careful scientific work. Maybe the only wrench is that general-purpose models can be prompted for outputs that are different than the ones they were created to focus on.

Cooper: Let’s talk about guardrails.

Martin: Right now, the copyright discussion focuses on the AI training stage. In terms of costs, this means that AI training is burdened with copyright issues, which makes training more expensive. Perhaps we should diversify legal tools by moving from input to output. Let the trainers do what they want, and we’ll put requirements on outputs and require them to create appropriate filters.

Julia: I find the argument that it’ll be too costly to respect copyright to be bunk. There are 100 countries that have to negotiate with major publishers for access to copyrighted works. There are lots of humans that we don’t make these arguments for. We should give these permissions to humans before machines. It seems obvious that we’d have impressive results at hyperscale. For 25 years, IP has debated traditional cultural knowledge. There, we have belatedly recognized the origin of this knowledge. The same goes for AI: it’s about acknowledging the source of the knowledge they are trained on.

Turning to supply chains, in addition to the copying right, there are rights of authorization, importation, and communication, plus moral rights. An interesting avenue for regulation is to ask where the sweatshops of people doing content moderation and data labeling are located.

Cooper: Training is resource-intensive, but so is inference.

Question: Why are we treating AI differently than biotechnology?

Julia: We have a strong physical bias. Dolly the sheep had an impact that 3D avatars didn’t. Also, it’s different power players.

Martin: Pam Samuelson has a good paper on historical antecedents for new copying technologies. Although I think that generative AI dethrones human authors and that is something new.

Kyle: AI is a proxy for other things; it doesn’t feel genuine until it’s applied.

Question: There has been a lot of talk about the power of training on synthetic data. Is copyright the right mechanism for governing training on synthetic data?

Kyle: It is hard to govern these approaches on the output side; you would really have to deal with it on the input side.

Martin: I hate to say this as a lawyer, but … it depends.

Question: We live in a fragmented import/export market. (E.g., the data security executive order.)

Martin: There have been predictions that territoriality will die, but so far it has persisted.

Connor Dunlop

Connor Dunlop, GPAI Governance and Oversight in the EU – And How You Might be Able to Contribute

Three topics:

  1. Role of civil society
  2. My work and how we fit in
  3. How you can contribute

AI operates within a complex system of social and economic structures. The ecosystem includes industry and more. The AI-and-society ecosystem also includes government actors, and NGOs exist to support those actors. There are many types of expertise involved here. Ada Lovelace is an organization that thinks about how AI and data impact people in society. We aim to provide research expertise, promote AI literacy, and build technical tools like audits and evaluations. A possible gap in the ecosystem is strategic litigation expertise.

At Ada Lovelace, we try to identify key topics early on and ground them in research. We do a lot of polling and engagement on public perspectives. And we recognize nuance and try to make sure that people know what the known unknowns are and where people disagree.

On AI governance, we have been asking about different accountability mechanisms. What mechanisms are available, how are they employed in the real world, do they work, and can they be reflected in standards, law, or policy?

Sabrina Küspert

Sabrina Küspert, Implementing the AI Act

The AI Act follows a risk-based approach. (Review of risk-based approach pyramid.) It adopts harmonized rules across all 27 member states. The idea is that if you create trust, you also create excellence. If a provider complies, it gets access to the entire EU market.

For general-purpose models, the rules are transparency obligations. Anyone who wants to build on a general-purpose model should be able to understand its capabilities and what it is based on. Providers must mitigate systemic risks with evaluation, mitigation, cybersecurity, incident reporting, and corrective measures.

The EU AI Office is part of the Commission and the center of AI expertise for the EU. It will facilitate a process to detail the rules around transparency, copyright, risk assessment, and risk mitigation via codes of practice. Also building enforcement structures. It will have technical capacity and regulatory powers (e.g. to compel assessments).

Finally, we’re facilitating international cooperation on AI. We’re working with the U.S. AI Safety Institute, building an international network among key partners, and engaging in bilateral and multilateral activities.

Spotlight Poster Presentations

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI (Robert Hönig, Javier Rando, Nicholas Carlini, Florian Tramer): We investigated methods that artists can use to prevent AI training on their work, and found that these protections can often be disabled. These tools (e.g. Glaze) work by adding adversarial perturbations to an artist’s images in ways that are unnoticeable to humans but degrade models trained on them. You can use an off-the-shelf HuggingFace model to remove the perturbations and recover the original images. In some cases, adding Gaussian noise or using a different fine-tuning tool also suffices to disable the protections.
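As a concrete (and deliberately simplified) illustration of the weakest countermeasure mentioned above, the sketch below just adds mild Gaussian noise to a protected image; the paper’s stronger attacks use off-the-shelf models, and the file path and noise level here are invented.

    # Add mild Gaussian noise to a "protected" image. This illustrates only
    # the simplest baseline mentioned in the summary above, not the authors'
    # code; the file paths and sigma are hypothetical.
    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("protected_artwork.png").convert("RGB"),
                     dtype=np.float32)
    noise = np.random.normal(loc=0.0, scale=8.0, size=img.shape)  # sigma in pixel units
    noisy = np.clip(img + noise, 0, 255).astype(np.uint8)
    Image.fromarray(noisy).save("purified_artwork.png")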

Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law (Giorgio Franceschelli, Claudia Cevenini, Mirco Musolesi): Our motivation is the knowledge that models tend to memorize and regurgitate. We observe that model weights are smaller than the training data, so there is an analogy that training is compression. Given this, is a model a copy or derivative work of training data?

Machine Unlearning Fails to Remove Data Poisoning Attacks (Martin Pawelczyk, Ayush Sekhari, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Seth Neel): Real-world motivations for unlearning are to remove data due to revoked consent or to unlearn bad/adversarial data that impact performance. Typical implementations use likelihood ratio tests (LRTs) that involve hundreds of shadow models. We put poisons in part of the training data; then we apply an unlearning algorithm to our poisoned model and then ask whether the algorithm removed the effects of the poison. We add Gaussian poisoning to existing indiscriminate and targeted poisoning methods. Unlearning can be evaluated by measuring correlation between our Gaussians and the output model. We observe that the state-of-the-art methods we tried weren’t really successful at removing Gaussian poison and no method performs well across both vision and language tasks.
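To make the evaluation idea concrete, here is a simplified stand-in (my sketch, not the authors’ code) for the kind of test statistic described: correlate each injected Gaussian noise vector with the unlearned model’s input-gradient at the corresponding poisoned point; if unlearning truly removed the poison’s influence, the scores should concentrate near zero.

    # Simplified Gaussian-poison check (illustrative only). For each poisoned
    # input x with injected noise z, compute the gradient of the loss with
    # respect to x and correlate it with z. Scores far from zero suggest the
    # poison's influence remains in the (supposedly unlearned) model.
    import torch
    import torch.nn.functional as F

    def gaussian_poison_score(model, poisoned_inputs, injected_noise, labels):
        model.eval()
        scores = []
        for x, z, y in zip(poisoned_inputs, injected_noise, labels):
            # x, z: input-shaped tensors; y: a 0-dim long tensor class label
            x = x.unsqueeze(0).clone().requires_grad_(True)
            loss = F.cross_entropy(model(x), y.unsqueeze(0))
            (grad,) = torch.autograd.grad(loss, x)
            scores.append(F.cosine_similarity(grad.flatten(), z.flatten(), dim=0))
        return torch.stack(scores).mean()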

Ordering Model Deletion (Daniel Wilf-Townsend): Model deletion (a.k.a. model destruction or algorithmic disgorgement) is a remedial tool that courts and agencies can use that requires discontinuing use of a model trained on unlawfully used data. Why do it? First, in a privacy context, the inferences are what you care about, so just deleting the underlying data isn’t sufficient to prevent the harm. Second, it provides increased deterrence. But there are problems, including proportionality. Think of OpenAI vs. a blog post: if GPT-4 trains on a single blog post of mine, then I could force deletion, which is massively disproportionate to the harm. It could be unfair, or create massive chilling effects. Model deletion is an equitable remedy, and equitable doctrines should be used to enforce proportionality and tie the remedy to culpability.

Ignore Safety Directions. Violate the CFAA? (Ram Shankar Siva Kumar, Kendra Albert, Jonathon Penney): We explore the legal aspects of prompt injection attacks. We define prompt injection as inputting data into an LLM that causes it to behave in ways contrary to the model provider’s intentions. There are legal and cybersecurity risks, including under the CFAA, and a history of government and companies targeting researchers and white-hat hackers. Our paper attempts to show the complexity of applying the CFAA to generative-AI systems. One takeaway: whether prompt injection violates the CFAA depends on many factors. Sometimes it does, but there are uncertainties. Another takeaway: we need more clarity from courts and from scholars and researchers. Thus, we need a safe harbor for security researchers.

Fantastic Copyrighted Beasts and How (Not) to Generate Them (Luxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson): We have all likely seen models that generate copyrighted characters—and models that refuse to generate them. It turns out that using generic keywords like “Italian plumber” suffices to elicit them. There was a recent Chinese case holding a service provider liable for generations of Ultraman. Our work introduces a copyrighted-characters reproduction benchmark. We also develop an evaluation suite that measures consistency with user intent while avoiding copyrighted characters. We applied this suite to various models and propose methods to avoid copyrighted characters. We find that prompt rewriting is not fully effective on its own. But we find that using copyrighted character names as negative prompts increases effectiveness from about 50% to about 85%.
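For readers unfamiliar with negative prompts, here is a minimal sketch of the mitigation described, using the diffusers library; the checkpoint and prompts are illustrative, and this is not the paper’s evaluation harness.

    # Generate an image while steering away from a copyrighted character by
    # passing its name as a negative prompt. Checkpoint and prompts are
    # placeholders; requires a GPU and the diffusers package.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="a cheerful Italian plumber video game character",
        negative_prompt="Mario, Nintendo",  # character names as negative prompt
    ).images[0]
    image.save("plumber.png")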

Matthew Jagielski and Katja Filippova

Matthew Jagielski and Katja Filippova, Machine Unlearning: [JG: I missed this due to a livestream hiccup, but will go back and fill it in.]

Kimberly Mai

Kimberly Mai, Data Protection in the Era of Generative AI

Under the GDPR, personal data is “any information relating to an identified or identifiable” person. That includes hashed identifiers of people in an experimental study, or license plate numbers. It depends on how easy it is to identify someone. The UK AI framework has principles that already map to data protection law.

Our view is that data protection law applies at every stage of the AI lifecycle. This makes the UK ICO a key regulator in the AI space. AI is a key area of focus for us. Generative AI raises some significant issues, and the ICO has launched a consultation.

What does “accuracy” mean in a generative-AI context? This isn’t a statistical notion; instead, data must be correct, not misleading, and where necessary up-to-date. In a creative context, that might not require factual accuracy. At the output level, a hallucinating model that produces incorrect outputs about a person might be inaccurate. We think this might require labeling, attribution, etc., but I am eager to hear your thoughts.

Now, for individual rights. We believe that rights to be informed and to access are crucial here. On the remaining four, it’s a more difficult picture. It’s very hard to unlearn, which makes the right to erasure quite difficult to apply. We want to hear from you how machine learning applies to data protection concepts. We will be releasing something on controllership shortly, and please share your thoughts with us. We can also provide advice on deploying systems. (We also welcome non-U.K. input.)

Herbie Bradley

Herbie Bradley, Technical AI Governance

Technical AI governance is technical analysis and tools for supporting effective AI governance. There are problems around data, compute, models, and user interaction. For example, is hardware-enabled compute governance feasible? Or, how should we think about how often to evaluate fine-tuned models for safety? What are best practices for language model benchmarking? And, looking to the future, how likely is it that certain research directions will pan out? (Examples include unlearning, watermarking, differential privacy, etc.)

Here is another example: risk thresholds. Can we translate benchmark results into assessments that are useful to policymakers? The problems are that any such translation is dependent on a benchmark, it has to have a qualitative element, and knowledge and best practices shift rapidly. Any implementation will likely be iterative and involve conversations with policy experts and technical researchers.
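As a toy illustration of what such a translation might look like (mine, not Herbie’s; the benchmark name, cutoffs, and tiers are invented), note that the qualitative judgment lives entirely in where the thresholds are set:

    # Map a benchmark score to a coarse risk tier. Everything here is
    # hypothetical; the point is that the cutoffs encode qualitative policy
    # judgments that must be revisited as benchmarks and practice change.
    RISK_THRESHOLDS = {
        "dangerous-capability-eval": [(0.2, "low"), (0.5, "medium"), (1.0, "high")],
    }

    def risk_tier(benchmark: str, score: float) -> str:
        for cutoff, tier in RISK_THRESHOLDS[benchmark]:
            if score <= cutoff:
                return tier
        return "high"

    print(risk_tier("dangerous-capability-eval", 0.35))  # "medium"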

It is useful to have technical capacity within governments. First, to carry out the actual technical work of implementing a policy or carrying out safety testing. Second, to provide advisory capacity, which is often much more useful.

Takeaways. First, if you’re a researcher, consider joining government or a think tank that supports government. Second, if you’re a policy maker, consider uncertainties that could be answered by technical capacity.

Panel: Privacy and Data Policy

Sabrina Ross, Herbie Bradley, Niloofar Mireshgallah, Matthew Jagielski, Paul Ohm (moderator), Katherine Lee (moderator)

Paul: We have this struggle in policy to come up with rules and standards that can be measured. What do we think about Herbie’s call for metrics?

Sabrina: We are at the beginning; the conversation is being led by discussions around safety. How do you measure data minimization, for example: comparing utility loss to data reduction. I’m excited by the trend.

Niloofar: There are multiple ways. Differential privacy (DP) was a theory concept, used for the census, and now is treated as a good tool. But with LLMs, it becomes ambiguous again. Tools can work in one place but not in another. Events like this help technical people understand what’s missing. I learned that most NLP people think of copyright as verbatim copying, but that’s not the only form of copying.

Paul: I worry that if we lean too hard into evaluation, we’ll lose values. What are we missing here?

Matthew: In the DP community, we have our clear epsilon values, and then we have our vibes, which aren’t measured but are built into the algorithm. The data minimization paper has a lot of intuitive value.

Herbie: Industry, academia, and government have different incentives and needs. Academia may like evaluations that are easily measurable and cheap. Industry may like it for marketing, or reducing liability risk. Government may want it to be robust or widely used, or relatively cheap.

Niloofar: It depends on what’s considered valuable. It used to be that data quality wasn’t valued. A few years ago, at ICML you’d only see theory papers, now there is more applied work.

Paul: You used this word “publish”: I thought you just uploaded things to ArXiv and moved on.

Katherine: Let’s talk about unlearning. Can we talk about evaluations that might be useful, and how unlearning might fit into content moderation?

Matthew: To evaluate unlearning, you need to say something about a counterfactual world. State of the art techniques include things like “train your model a thousand times,” which is impractical for big models. There are also provable techniques; evaluation there looks much different. For content moderation, it’s unclear that this is an intervention on data and not alignment. If you have a specific goal, you can measure that directly.

Herbie: With these techniques, it’s very easy to target adjacent knowledge, which isn’t relevant and isn’t what you want to target. Often, various pieces of PII are available on the Internet, and the system could locate them even if information on them has been removed from the model itself.

Paul: Could we map the right to be forgotten onto unlearning?

Sabrina: There are lots of considerations here (e.g. public figures versus private ones), so I don’t see a universal application.

Paul: Maybe what we want is a good output filter.

Niloofar: Even if you’re able to verify deletion, you may still be leaking information. There are difficult questions about prospective vs. retrospective activity. It’s a hot potato situation: people put out papers then other people show they don’t work. We could use more systematic frameworks.

Sabrina: I prefer to connect the available techniques to the goals we’re trying to achieve.

Katherine: This is a fun time to bring up the copyright/privacy parallel. People talk about the DMCA takedown process, which isn’t quite applicable to generative AI but people do sometimes wonder about it.

Niloofar: I see that NLP people have an idea about memorization, so they write a paper, and they need an application, so they look to privacy or copyright. They appeal to these two and put them together. The underlying phenomenon is the same, but in copyright you can license it. I feel like privacy is more flexible, and you have complex inferences. In copyright, you have idea and expression, and those have different meanings.

Matthew: It’s interesting to see what changes in versions of a model. You are weighing the threat of a passive adversary against one who is really going to try. For computer scientists, this idea of a weak vs. strong adversary is radioactive.

Paul: My Myth of the Superuser paper was about how laws are written to deal with powerful hackers but then used against ordinary users. Licensing is something you can do for copyright risk; in privacy, we talk about consent. Strategically, are they the same?

Sabrina: For a long time, consent was seen as a gold standard. More recently, we’ve started to consider consent fatigue. For some uses it’s helpful, for others it’s not.

Paul: The TDM exception is interesting. The conventional wisdom in privacy was that those dumb American rules were opt-out. In copyright, the tables have turned.

Matthew: Licensing and consent change your distribution. Some people are more likely to opt in or opt out.

Herbie: People don’t have a good sense of how the qualities of licensable data differ from what is available on the Internet.

Niloofar: There is a dataset of people chatting with ChatGPT who affirmatively consented. But people share a lot of their private data through this, and become oblivious to what they have put in the model. You’re often sharing information about other people too. A journalist put their conversation with a private source into the chat!

Paul: Especially for junior grad students, the fact that every jurisdiction is doing this alone might be confusing. Why is that?

Herbie: I.e., why is there no international treaty?

Paul: Or even talk more and harmonize?

Herbie: We do. The Biden executive order influenced the E.U.’s thinking. But a lot of it comes down to cultural values and how different communities think.

Paul: Can you compare the U.K. to the E.U.?

Herbie: We’re watching the AI Act closely. I quite like what we’re doing.

Sabrina: We have to consider the incentives that regulators are balancing. But in some ways, I think there is a ton of similarity. Singapore and the E.U. both have data minimization.

Herbie: There are significant differences between the thinking of different government systems in terms of how up-to-date they are.

Paul: This is where I explain to my horrified friends that the FTC has 45 employees working on this. There is a real resource imbalance.

Matthew: The point about shared values is why junior grad students shouldn’t be disheartened. The data minimization paper pulled out things that can be technicalized.

Niloofar: I can speak from the side of when I was a young grad student. When I came here, I was surprised by copyright. It’s always easier to build on legacy than to create something new.

Paul: None of you signed onto the cynical “It’s all trade war all the way down.” On our side of the pond, one story was that the rise of Mistral changed the politics considerably. If true, Mistral is the best thing ever to happen to Silicon Valley, because it tamps down protectionism. Or maybe this is the American who has no idea what he’s talking about.

Katherine: We’ve talked copyright, privacy, and safety. What else should we think about as we go off into the world?

Sabrina: The problem is the organizing structure of the work to be done. Is fairness a safety problem, a privacy problem, or an inclusion problem? We’ve seen how some conceptions of data protection can impede fairness conversations.

Paul: I am genuinely curious. Are things hardening so much that you’ll find yourself in a group that people say, “We do copyright here; toxicity is down the hall?” (I think this would be bad.)

Herbie: Right now, academics are incentivized to talk about the general interface.

Paul: Has anyone said “antitrust” today? Right now, there is a quiet struggle between the antitrust Lina Khan/Tim Wu camp and all the other information harms. There are some natural monopoly arguments when it comes to large models.

Niloofar: At least on the academic side, people who work in theory do both privacy and fairness. When people who work in NLP started to care more, then there started to be more division. So the toxicity/ethics people are a little separate. When you say “safety,” it’s mostly about jailbreaking.

Paul: Maybe these are different techniques for different problems? Let me give you a thought about the First Amendment. Justice Kagan gets five justices to agree that social media is core protected speech. Lots of American scholars think this will also apply to large language models. This Supreme Court is putting the First Amendment on the rise.

Matthew: I think alignment is the big technique overlap I’m seeing right now. But when I interact with the privacy community, people who do that are privacy people.

Katherine: That’s partly because those are the tools that we have.

Question: If we had unlearning, would that be okay with GDPR?

Question: If we go forward 2-3 years and there are some problems and clear beliefs about how they should be regulated, then how will this be enforced, and what skills do these people have?

Niloofar: On consent, I don’t know what we do about children.

Paul: In the U.S., we don’t consider children to be people.

Niloofar: I don’t know what this solution would look like.

Kimberly: In the U.K., if you’re over 13 you can consent. GDPR has protections for children. You have to consider risks and harms to children when you are designing under data protection by design.

Herbie: If you have highly adversarial users, unlearning might not be sufficient.

Sabrina: We’re already computer scientists working with economists. The more we can bring to bear, the more successful we’ll be.

Paul: I’ve spent my career watching agencies bring in technologists. Some succeed, some fail. Europe has had success with investing a lot. But the state of Oregon will hire half a technologist and pay them 30% of what they would make. Europe understands that you have to write a big check, create a team, and plan for managing them.

Matthew: As an Oregonian, I’m glad Oregon was mentioned. I wanted to mention that people want unlearning to do some things that unlearning is suited for, and some things that are really about data management. (Unless we start calling unlearning techniques “alignment.”)


And that’s it!

The Files are in the Computer

I have a new draft essay, The Files are in the Computer: On Copyright, Memorization, and Generative AI. It is a joint work with my regular co-author A. Feder Cooper, who just completed his Ph.D. in Computer Science at Cornell. We presented an earlier version of the paper at the online AI Disrupting Law Symposium hosted by the Chicago-Kent Law Review in April, and the final version will come out in the CKLR. Here is the abstract:

The New York Times’s copyright lawsuit against OpenAI and Microsoft alleges that OpenAI’s GPT models have “memorized” Times articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. Unfortunately, these debates are clouded by deep ambiguities over the nature of “memorization,” leading participants to talk past one another.

In this Essay, we attempt to bring clarity to the conversation over memorization and its relationship to copyright law. Memorization is a highly active area of research in machine learning, and we draw on that literature to provide a firm technical foundation for legal discussions. The core of the Essay is a precise definition of memorization for a legal audience. We say that a model has “memorized” a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that specific piece of training data. We distinguish memorization from “extraction” (in which a user intentionally causes a model to generate a near-exact copy), from “regurgitation” (in which a model generates a near-exact copy, regardless of the user’s intentions), and from “reconstruction” (in which the near-exact copy can be obtained from the model by any means, not necessarily the ordinary generation process).

Several important consequences follow from these definitions. First, not all learning is memorization: much of what generative-AI models do involves generalizing from large amounts of training data, not just memorizing individual pieces of it. Second, memorization occurs when a model is trained; it is not something that happens when a model generates a regurgitated output. Regurgitation is a symptom of memorization in the model, not its cause. Third, when a model has memorized training data, the model is a “copy” of that training data in the sense used by copyright law. Fourth, a model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly including regurgitated ones) than others. Fifth, memorization is not just a phenomenon that is caused by “adversarial” users bent on extraction; it is a capability that is latent in the model itself. Sixth, the amount of training data that a model memorizes is a consequence of choices made in the training process; different decisions about what data to train on and how to train on it can affect what the model memorizes. Seventh, system design choices also matter at generation time. Whether or not a model that has memorized training data actually regurgitates that data depends on the design of the overall system: developers can use other guardrails to prevent extraction and regurgitation. In a very real sense, memorized training data is in the model–to quote Zoolander, the files are in the computer.
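To give a rough sense of what “a near-exact copy of a substantial portion” might look like in practice, here is a toy check (an illustration only, not the essay’s definition operationalized): find the longest matching block between a generation and a training document. The strings and the length cutoff are invented.

    # Toy near-duplicate check between a model output and a training document.
    # The texts and the 50-character "substantial portion" cutoff are made up.
    from difflib import SequenceMatcher

    training_doc = "The quick brown fox jumps over the lazy dog near the riverbank."
    generation = "As the model put it, the quick brown fox jumps over the lazy dog near the river."

    match = SequenceMatcher(None, training_doc, generation).find_longest_match(
        0, len(training_doc), 0, len(generation)
    )
    if match.size > 50:
        print("Overlap:", training_doc[match.a : match.a + match.size])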

A Statement on Signing

I am serving on Cornell’s Committee on Campus Expressive Activity. We have been charged with “making recommendations for the formulation of a Cornell policy that both protects free expression and the right to protest, while establishing content-neutral limits that ensure the ability of the university community to pursue its mission.” Our mission includes formulating a replacement for Cornell’s controversial Interim Expressive Activity Policy, making recommendations about how the university should respond to violations of the policy, and educating faculty, staff, and students about the policy and the values at stake.

I have resolved that while I am serving on the committee, I will not sign letters or other policy statements on these issues. This is a blanket abstention. It does not reflect any agreement or disagreement with the specifics of a statement within the scope of what the committee will consider.

This is not because I have no views on free speech, universities’ mission, protests, and student discipline. I do. Some of them are public because I have written about them at length; others are private because I have never shared them with anyone; most are somewhere in between. Some of these views are strongly held; others are so tentative they could shift in a light breeze.

Instead, I believe that a principled open-mindedness is one of the most important things I can bring to the committee. This has been a difficult year for Cornell, as for many other colleges and universities. Frustration is high, and trust is low. A good policy can help repair some of this damage. It should help students feel safe, respected, welcomed, and heard. It should help community members be able to trust the administration, and each other. Everyone should be able to feel that the policy was created and is being applied fairly, honestly, and justly. Whether or not we achieve that goal, we have to try.

I think that signing my name to something is a commitment. It means that I endorse what it says, and that I am prepared to defend those views in detail if challenged. If I sign a letter now, and then vote for a committee report that endorses something different, I think my co-signers would be entitled to ask me to explain why my thinking had changed. And if I sign a letter now, I think someone who disagrees with it would be entitled to ask whether I am as ready to listen to their views as I should be.

Other members of the committee may reach different conclusions about what and when to sign, and I respect their choices. My stance reflects my individual views on what signing a letter means, and about what I personally can bring to the committee. Others have different but entirely reasonable views.

I also have colleagues and students who have views on the issues the committee will discuss. They will share many of those views, in open letters, op-eds, and other fora. This is a good thing. They have things to say that the community, the administration, and the committee should hear. I don’t disapprove of their views by not signing; I don’t endorse those views, either. I’m just abstaining for now, because my most important job, while the committee’s work is ongoing, is to listen.

Postmodern Community Standards

This is a Jotwell-style review of Kendra Albert, Imagine a Community: Obscenity’s History and Moderating Speech Online, 25 Yale Journal of Law and Technology Special Issue 59 (2023). I’m a Jotwell reviewer, but I am conflicted out of writing about Albert’s essay there because I co-authored a short piece with them last year. Nonetheless, I enjoyed Imagine a Community so much that I decided to write a review anyway, and post it here.

One of the great non-barking dogs in Internet law is obscenity. The first truly major case in the field was an obscenity case: Reno v. ACLU, 521 U.S. 844 (1997), held that the harmful-to-minors provisions of the federal Communications Decency Act were unconstitutional because they prevented adults from receiving non-obscene speech online. Several additional Supreme Court cases followed over the next few years, as well as numerous lower-court cases, mostly rejecting various attempts to redraft definitions and prohibitions in a way that would survive constitutional scrutiny.

But then … silence. From roughly the mid-2000s on, very few obscenity cases have generated new law. As a casebook editor, I even started deleting material – this never happens – simply because there was nothing new to teach. This absence was a nagging question in the back of my mind. But now, thanks to Kendra Albert’s Imagine a Community, I have the answer, perfectly obvious now that they have laid it out so clearly. The courts did not give up on obscenity, but they gave up on obscenity law.

Imagine a Community is a cogent exploration of the strange career of community standards in obscenity law. Albert shows that although the “contemporary community standards” test was invented to provide doctrinal clarity, it has instead been used for doctrinal evasion and obfuscation. Half history and half analysis, their essay is an outstanding example of a recent wave of cogent scholarship on sex, law, and the Internet, from scholars like Albert themself, Andrew Gilden, I. India Thusi, and others.

The historical story proceeds as a five-act tragedy, in which the Supreme Court is brought low by its hubris. In the first act, until the middle of the twentieth century, obscenity law varied widely from state to state and case to case. Then, in the second act, the Warren Court constitutionalized the law of obscenity, holding that whether a work is protected by the First Amendment depends on whether it “appeals to prurient interest” as measured by “contemporary community standards.” Roth v. United States, 354 U.S. 476, 489 (1957).

This test created two interrelated problems for the Supreme Court. First, it was profoundly ambiguous. Were community standards geographical or temporal, local or national? And second, it required the courts to decide a never-ending stream of obscenity cases. It proved immensely difficult to articulate how works did – or did not – comport with community standards, leading to embarrassments of reasoned explication like Potter Stewart’s “I know it when I see it” in Jacobellis v. Ohio, 378 U.S. 184, 197 (1964).

The Supreme Court was increasingly uncomfortable with these cases, but it was also unwilling to deconstitutionalize obscenity or to abandon the community-standards test. Instead, in Miller v. California, 413 U.S. 15 (1973), it threw up its hands and turned community standards into a factual question for the jury. As Albert explains, “The local community standard won because it was not possible to imagine what a national standard would be.”

The historian S.F.C. Milsom blamed “the miserable history of crime in England” on the “blankness of the general verdict” (Historical Foundations of the Common Law pp. 403, 413). There could be no substantive legal development unless judges engaged with the facts of individual cases, but the jury in effect hid all of the relevant facts behind a simple “guilty” or “not guilty.”

Albert shows that something similar happened in obscenity law’s third act. The jury’s verdict established that the defendant’s material did or did not appeal to the prurient interest according to contemporary standards. But it did so without ever saying out loud what those standards were. There were still obscenity prosecutions, and there were still obscenity convictions, but in a crucial sense there was much less obscenity law.

In the fourth act, the Internet unsettled a key assumption underpinning the theory that obscenity was a question of local community standards: that every communication had a unique location. The Internet created new kinds of online communities, but it also dissolved the informational boundaries of physical ones. Is a website published everywhere, and thus subject to every township, village, and borough’s standards? Or is a national rule now required? In the 2000s, courts wrestled inconclusively with the question of “Who gets to decide what is too risqué for the Internet?”

And then, Albert demonstrates, in the tragedy’s fifth and deeply ironic act, prosecutors gave up the fight. They have largely avoided bringing adult Internet obscenity cases, focusing instead on child sexual abuse material cases and on cases involving “local businesses where the question of what the appropriate community was much less fraught.” The community-standards timbers have rotted, but no one has paid it much attention because they are not bearing any weight.

This history is a springboard for two perceptive closing sections. First, Albert shows that the community-standards-based obscenity test is extremely hard to justify on its own terms, when measured against contemporary First Amendment standards. It has endured not because it is correct but because it is useful. “The ‘community’ allows courts to avoid the reality that obscenity is a First Amendment doctrine designed to do exactly what justices have decried in other contexts – have the state decide ‘good speech’ from ‘bad speech’ based on preference for certain speakers and messages.” Once you see the point put this way, it is obvious – and it is also obvious that this is the only way this story could have ever ended.

Second – and this is the part that makes this essay truly next-level – Albert describes the tragedy’s farcical coda. The void created by this judicial retreat has been filled by private actors. Social-media platforms, payment providers, and other online intermediaries have developed content-moderation rules on sexually explicit material. These rules sometimes mirror the vestigial conceptual architecture of obscenity law, but often they are simply made up. Doctrine abhors a vacuum:

Pornography producers and porn platforms received lists of allowed and disallowed words and content – from “twink” to “golden showers,” to how many fingers a performer might use in a penetration scene. Rules against bodily fluids other than semen, even the appearance of intoxication, or certain kinds of suggestions of non-consent (such as hypnosis) are common.

One irony of this shift from public to private is that it has done what the courts have been unwilling to: create a genuinely national (sometimes even international) set of rules. Another is that these new “community standards” – a term used by social-media platforms apparently without irony – are applied without any real sensitivity to the actual standards of actual community members. They are simply the diktats of powerful platforms.

Perhaps none of this will matter. Albert suggests that the Supreme Court should perhaps “reconsider[] whether obscenity should be outside the reach of the First Amendment altogether.” Maybe it will, and maybe the legal system will catch up to the Avenue Q slogan: “The Internet is for porn.”

But there is another and darker possibility. The law of public sexuality in the United States has taken a turn over the last few years. Conservative legislators and prosecutors have claimed with a straight face that drag shows, queer romances, and trans bodies are inherently obscene. A new wave of age-verification laws sharply restricts what children are allowed to read on the Internet, and forces adults to undergo new levels of surveillance when they go online. It is unsettlingly possible that the Supreme Court may be about to speedrun its obscenity jurisprudence, only backwards and in heels.

But sufficient unto the day is the evil thereof. For now, Imagine a Community is a model for what a law-review essay should be: concise, elegant, and illuminating.