The Open Source Initiative (OSI) continued their condescension towards the community today in a self-congratulatory post explaining “How we passed the AI conundrums”, as if they were Alexander slashing through the Gordian knot (the article’s featured image is a modern equivalent, a Megaminx).
Turtles All The Way Down
The anonymous representative starts out by painting “some people” as extremists for demanding the same licensing as has been required by the Open Source Definition for decades. They then set up the strawman that “limiting Open source AI only to systems trainable on freely distributable data would relegate Open Source AI to a niche”, and immediately knock it down by claiming such “freely and legally shareable data is a tiny fraction of what is necessary”. This ignores the inconvenient truth that nobody is still seriously asking for Open Data licensing (as much as I/we would prefer it), instead accepting the significant compromise of publicly accessible data, which still guarantees the ability to exercise the four essential freedoms enumerated in the draft itself, at least with respect to their own creation. It’s not entirely unlike Debian Linux finally agreeing to ship non-free firmware after two decades so you can actually run their software on your hardware, only we’re not even talking about shipping or sharing anything that’s not already publicly shared by others.
The argument also completely ignores the fact that more and more valuable Open Data is coming online every day, and that when Open Source was established decades ago, only a “tiny fraction” of the software available now existed. Indeed, in the absence of a clear and strong Open Source Definition it’s unlikely that the industry would have curated such a large and valuable archive. Just yesterday the creators of AlphaFold won the Nobel Prize in Chemistry; its Creative Commons (CC-BY-SA) licensed database “provides open access to over 200 million protein structure predictions to accelerate scientific research”. The reality is that a more ambitious Open Source AI definition, in line with the existing understanding of what Open Source is, will incentivise rather than deter the creation of new reusable datasets, for example medical data provided with consent by patients (to murder another regular strawman).
Karsten Wade’s proposal yesterday to achieve consensus for the 1.0 release codifies this and resolves nearly all of the outstanding objections (including mine), simply by keeping Open (i.e., shareable) and Public (i.e., accessible) data while eliminating the two most problematic data classes: “Restricted (née Obtainable) and Unavailable (née Unshareable)”. Clearly you cannot exercise any/all of the four freedoms with data you can’t even access, let alone afford, but as I noted yesterday in Proprietary Data Considered Harmful to Open Source AI, restricted datasets “including for fee” (e.g., stock photos, New York Times articles) are even more dangerous in that they are almost guaranteed to get those relying on the OSI’s definition nuked from orbit by data custodians, whose one job it is to exploit their government-granted monopoly over their datasets. At least public datasets like Common Crawl have already been subjected to sterilising sunlight, and anything that needed to be removed will have been removed via the claims process in their terms of use; many eyes make bugs shallow, after all. This Achilles heel alone is enough to render RC1 dead on arrival when the first actual enterprise lawyer looks at it, and if nothing else it must be abandoned for the sake of public safety (you’re welcome to cite this point later and I’ll be sure to post it where it will be seen).
The author then draws a false analogy between training data and proprietary system libraries, forgetting that you can still run and verify the behaviour of a program like Emacs running on a proprietary operating system like macOS. The same cannot be said for machine learning models trained on inaccessible data: the training data is fundamentally tied to the system’s output in a way that libraries are not.
Going for the trifecta, they also resort to equivocation, shifting the definition of “Open Source” to accommodate the missing data by claiming they’re still sharing metadata. This change would very significantly alter the well-established meaning of “Open Source”, doing damage to that entire ecosystem as well, which is reason enough in and of itself for the board to reject the release candidate even if that meant releasing nothing at all. Incidentally, the sunk cost fallacy also made an appearance on the twentieth town hall call, where one of the three reasons given for releasing anything was that it’s been worked on for a while.
I have yet to extract an admission from anyone that they’re knowingly capitulating on the four freedoms and redefining “Open Source”, including on that call, but I’d love for someone to get the claim that the four freedoms are protected on record, and then have them justify it. The entire argument really is a stack of logical fallacies in a trench coat.
Conundrums?
“The AI Conundrums” refers to an insightful post back in July by Stephen O’Grady of developer-focused industry analyst RedMonk, an ice age ago given the accelerating technological change of the AI industry. In the megamix article he covers the business models of commercial providers (ChatGPT, Claude, Gemini, etc.), the commoditisation capabilities of gateways like LiteLLM (product) and OpenRouter (service), and wraps up with a look at Open Source, AI, and Data. Regarding the draft definition’s weasel-worded request for metadata (i.e., data about data) rather than the actual data on which the AI system was trained, O’Grady points out that “many smart and reasonable individuals with decades of open source experience regard this as insufficient” (you can see for yourself via this sampling of links).
“Julia Ferraioli, meanwhile, explicitly made the case […] that the current definition falls short, arguing in essence that without the data – and a sure way to replicate it – the draft AI definition cannot fulfill the four key freedoms that the traditional OSI definition does.” O’Grady concurs:
This is correct. If data is key to AI, and data may or may not be replicated per these more lenient terms, then while users would be able to use or distribute the system, clearly the draft definition could not guarantee the right to study the full system or modify it deeply.
Note that the last word is doing a lot of heavy lifting here because of the common sleight of hand of claiming that providing enough [meta]data to enable any modification (e.g., fine-tuning) is sufficient to satisfy the freedom to modify, equating it to enabling all modifications or improvements: “If your right to modify a program is limited, in substance, to changes that someone else considers an improvement, that program is not free.”
O’Grady notes that “an OSI definition that does not require the inclusion of training data is problematic”, but that requiring “full training set availability” is also problematic, giving two reasons that I would argue are non-issues for public rather than open data:
- Practicality: These datasets are large and unwieldy. Fortunately, they also tend to already be reliably hosted by third parties like AWS (see the short sketch after this list). As I noted yesterday, for popular training datasets like Common Crawl, “this data is apparently so toxic it can’t possibly be used to train Open Source AI, and yet so clean AWS agreed to host it for free as a Public Data Set”.
- Legality: “Authors may rely on training data that cannot be legally released under an open source license,” and while my preferred Open Source AI definition would require Open Data licenses, rather than throwing the baby out with the bathwater the far better compromise is to accept that this is often impractical and loosen the requirement to simply demand that the data be public (and again, such data is typically hosted by third parties).
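To make the “reliably hosted” point concrete, here’s a minimal sketch in Python (using the requests library) of pulling the archive listing for a single Common Crawl release straight from its public hosting. The specific crawl identifier is just an example I’ve picked for illustration, not a reference to any particular model’s training data; the point is that no contract, credential or fee stands between you and the data.

```python
# Minimal sketch: fetch the list of WARC archives for one Common Crawl
# release from its public hosting; no credentials or agreements required.
# The crawl identifier below is only an example; substitute any published crawl.
import gzip
import io

import requests

CRAWL_ID = "CC-MAIN-2024-33"  # example crawl identifier (illustrative)
LISTING_URL = f"https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz"

response = requests.get(LISTING_URL, timeout=60)
response.raise_for_status()

# The listing is a gzipped text file with one relative WARC path per line.
with gzip.open(io.BytesIO(response.content), mode="rt") as listing:
    warc_paths = [line.strip() for line in listing]

print(f"{CRAWL_ID}: {len(warc_paths)} WARC files publicly listed")
print("first archive:", f"https://data.commoncrawl.org/{warc_paths[0]}")
```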
O’Grady then shifts to outcomes, re-confirming the position that “strict adherence to the four freedoms clearly necessitates a full release of training data”. The OSI literally has one job, and that’s to protect those four freedoms, deemed essential for a reason, and ironically listed in the draft definition which then goes on to not protect them:
- Use the system for any purpose and without having to ask for permission.
- Study how the system works and inspect its components.
- Modify the system for any purpose, including to change its output.
- Share the system for others to use with or without modifications, for any purpose.
He then gets to the OSI’s purported “smoking gun”, which now appears in every slide deck, on every call, and at every event trying to justify this sorry state of affairs:
If we assume, for example, that the definition requires full release of datasets, one thing is certain: in Julia’s words, it would be “a definition for which few existing systems qualify.” (OSI note: also less powerful and limited to specific domains)
Leaving aside that the OSI injecting their own opinion is not how quotes work, and that Julia Ferraioli (who you may remember as one of those “smart and reasonable individuals with decades of open source experience” advocating for the availability of data above) is being taken out of context to make the opposite point to her own clearly stated opinion, the statement itself is conditional on requiring “full release of datasets”, which absolutely nobody is asking for.
Maybe RedMonk will do a follow-up, but I did ask them what they thought of the OSI [ab]using their article to argue for a less open Open Source AI definition and they had this to say. I’m going to let Julia wrap it up, as she explains why we consider this issue so important (even existential for Open Source in an AI-dominated world) better than I could:
Open source software was a radical concept in the beginning. We didn’t get to where we are today by abiding by the status quo. We need to carry that forward with us into new domains, into new (or renewed, in the case of AI) technologies. We need to be bold and brave. We need to fight for openness and transparency.
Edited to add: EleutherAI’s Aviya Skowron rightly noted my error in conflating pre-prepared training data with Common Crawl, which is typically filtered, transformed, etc., then combined with other data sources, with this raw or source data superset further processed with code to create the actual training dataset, the distribution of which is as problematic as the original source: “In order to get to the actual training data, you’d need the model / dataset developer to publish the processing code and identify the particular CC dump they used.” You can read “training dataset” in this context as either the actual training dataset or its precursors (both data and code).
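For the avoidance of doubt about what “publish the processing code and identify the particular CC dump” might look like in practice, here is a purely illustrative sketch: the crawl identifier, filter rules, file names and combination step are hypothetical, not drawn from any real model’s pipeline, but they show why the resulting dataset is a function of the source dumps, the other data sources, and this code, all of which would need to be published to reproduce it.

```python
# Illustrative sketch only: the crawl identifier, filter rules and extra
# sources are hypothetical, not any real model's recipe. The point is that
# the training dataset is a function of (source dumps, other data, this code),
# so reproducing it requires publishing all three.
import json

CRAWL_ID = "CC-MAIN-2024-33"                   # which Common Crawl dump was used (example)
EXTRA_SOURCES = ["books.jsonl", "code.jsonl"]  # hypothetical additional data sources


def keep(record: dict) -> bool:
    """Hypothetical quality filter applied to each document extracted from the dump."""
    text = record.get("text", "")
    return len(text) > 200 and record.get("language") == "en"


def build_training_set(cc_records, extra_paths=EXTRA_SOURCES, out_path="training.jsonl"):
    """Filter the raw crawl records, merge in the other sources, write the result."""
    with open(out_path, "w") as out:
        for record in cc_records:          # documents extracted from CRAWL_ID
            if keep(record):
                out.write(json.dumps({"text": record["text"]}) + "\n")
        for path in extra_paths:           # pre-prepared supplementary datasets
            with open(path) as source:
                for line in source:
                    out.write(line)
    return out_path
```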