
Proprietary Data Considered Harmful to Open Source AI

[Image: a collection of books on a bookshelf with iron bars]

OSI’s Definition of Open Source AI Raises Critical Legal Concerns for Developers and Businesses, says Luca Antiga, AI developer of PyTorch fame and fellow lobbyist for Big Freedom. He literally wrote the book on the topic, so while every voice is equal in this ongoing asinine debate, some are more equal than others. When people like Luca talk, I listen, which is more than can be said for the anonymous Open Source Initiative (OSI) representative who publicly took him to task over it on LinkedIn, making sure to @-mention him, his publication, and — I kid you not — his employer. I hope an apology will be forthcoming, because this goes way beyond the bounds of professional discourse and is frankly unbecoming of an organisation that claims to champion open dialogue and collaboration.

On the subject of the recently released release candidate for the (too) soon to be released Open Source AI Definition (OSAID), Luca “believes it leaves a critical question unanswered, particularly for developers and businesses looking to adopt open source AI models with confidence.” The “elephant in the room”, he suggests, is that models “may still be considered open source if all information about data sources” is shared (without requiring access to the data sources themselves), whereas the data itself needs to be open for a derivative of it to be open. This has been the case for Open Source to date, and he gives some examples to that effect.

Indeed, ideally training datasets would be made available under an Open Data license, and while there is a large and growing collection of such pristine specimens, the real world is messy. Extremely popular sources like Common Crawl — basically a dump of the entire Internet — don’t own the copyrights to license it to you even if they wanted to, instead making it available for free under their terms of use. This data is apparently so toxic it can’t possibly be used to train Open Source AI, and yet so clean AWS agreed to host it for free as a Public Data Set (cheers!). The good news is that while this post will eventually end up in there, I’m not about to try to sue you for using it to train your AI, and even if I did, I’d be unlikely to prevail. An artist who gazes upon the Mona Lisa is bound to incorporate its teachings consciously or subconsciously into their future work, and yet the remnants of it among such a massive corpus of all public works will be undetectably small.

This right here is the compromise hiding in plain sight that would make most of the objections go away overnight (including my own). Require training datasets but accept Open Access data, even if only on a transitional basis, all the while encouraging the creation of truly open foundations for future generations. The spawn of these could end up in holy sanctuaries like Debian Linux’s main repository, while those with the taint of toxic waste go in contrib or non-free. That will likely eventuate as the de facto two-tier approach that’s also been proposed several times, based on the Library, now Lesser, GPL precedent. Debian, home of the Debian Free Software Guidelines (DFSG) on which the OSI’s own Open Source Definition was based, has spent the last twenty years trying to keep binary blobs of proprietary firmware out of the Linux kernel, only to pragmatically allow them into the installer in 2022, voting to modify the Debian Social Contract in order to do so. If they can stomach it to avoid irrelevance, so can we.

Back to Luca, whom the OSI publicly scolded for giving “an errant interpretation of a footnote in RC1”. As usual, while the big print giveth (“Parameters shall be made available under OSI-approved terms”), the small print taketh away:

“The Open Source AI Definition does not take any stance as to whether model parameters require a license, or any other legal instruments, and whether they can be legally controlled by any such instruments once disclosed and shared.”

Apparently this linguistic car crash — which starts by saying the exact opposite of the claim it’s attempting to clarify — is trying to say that the OSAID “does not require nor advocates a specific legal mechanism for assuring that the model parameters are freely available to all”. So the beating hearts of all AI systems “may be free by their nature”, or could require a license or some other “instrument” that “might become clearer over time”, conveniently after their self-imposed deadline of the All Things Open conference on 28 October. The same community that famously used copyright against itself in inventing copyleft doesn’t even know if the document they’re rushing out the door half-baked is worth the paper it’s not printed on for protecting the Model, or if it’s even required at all given they’re not demanding the Data, and the Code is already covered by OSI Approved licenses?

The OSI then cite another word salad from the FAQ — which is an informative reference that has absolutely no bearing on the normative definition — asking (but not answering) “Are model parameters copyrightable?”, and also linking to a paper that wonders “Is Numerical Data Copyrightable?”. Finally, the origin of the red herring regularly used to kick up dust about unanswered and unanswerable — conveniently until after the cat is out of the bag — questions has been revealed, so let’s take a closer look at that too:

Essentially, this process entails transformation of copyright protected works, for example, a short story, into numerical values. Does copyright extend to this transformed subject matter?

You know what else does that? Digital cameras. And you know what’s still protected by copyright? Digital photos. If you don’t believe me then by all means go train your fancy new AI model on commercial stock photos licensed “including for fee” (which is allowed under the release candidate) from Adobe, Getty, etc. — whose only job is to monetise those datasets — and see how long it is before you get sued out of existence. Go to Jail. Do Not Pass Go, Do Not Collect $200. Jail.

You know how I know that’s what’s going to happen? Because The Times Sue[d] OpenAI and Microsoft Over A.I. Use of Copyrighted Work, arguing that “millions of articles from The New York Times were used to train chatbots that now compete with it” and demanding “destruction […] of all GPT or other LLM models and training sets that incorporate Times Works.” Data custodians cannot possibly allow you to melt down and hand out their crown jewels, and any (subscription-only) license they eventually grant is likely to be so expensive as to make an absolute mockery of Open Source’s promises of No Discrimination Against Persons or Groups [or] Fields of Endeavor. The way I see it is that the current release candidate is like an open manhole cover in a busy street, and any individual or business adopting a model certified by it risks a one-way trip into the sewer.

Luca has since replied saying his “main intention was in fact to spark discussion [like this] in the open [because] there is a diverse set of stakeholders who might not be paying attention, and this is an opportunity to engage in the process,” concluding that “there is a chasm in what the general public perceives as being ‘open source’ and what the definition is proposing, and it needs clarification.” Rather than calling for “more light than heat in these situations”, the OSI’s leadership would do well to take heed of the rapidly growing group of experts bearing torches, as well as his observation that “more precise definitions will likely emerge elsewhere to fill the gap [as] Open Source AI can only move forward toward widespread enterprise adoption with this understanding.”

Earlier tonight we should have discussed this and the many other unresolved objections from the community on the 20th(!) townhall on the topic, but it didn’t happen for whatever reason so we’re back to spilling digital ink. I should really be spending my Kwaai time coding the Personal Artificial Intelligence Operating System (pAI-OS) — an Open Source project I hope will be a beneficiary rather than victim of this process — but this is that important and inexplicably urgent.