The Open Source Initiative (OSI) is seeking endorsements of its upcoming Open Source AI Definition (OSAID), which has worked its way through a series of drafts since the start of the year to land on v0.0.9. Any differences between this and RC1 are likely to be minimal, as the form notes they “will double-check with [endorsers] to confirm [their] continued endorsement of any changes from v.0.0.9 to RC1”. Do not endorse this draft.
While the horse is being loaded into the starting gate, it hasn’t bolted yet, so it’s not too late to advocate for a change in direction. Tom Callaway agrees “we can always make the definition more permissive if we discover that we have been too ambitious with a definition, but that it is functionally impossible for us to be more ambitious later”.
Background
As a refresher, the OSI’s own Open Source Definition, which was originally derived from the Debian Free Software Guidelines (DFSG), requires (among other things) that “the source code must be the preferred form in which a programmer would modify the program”. Note that the Free Software Foundation’s own Free Software Definition predates the DFSG, though the DFSG was not derived from it. In a world where software was delivered in a compiled binary form that is infeasible to modify, Open Source gave users the blueprints to make changes as they saw fit, by way of OSI Approved licenses. This is good stuff, and the OSI has played an important role in the industry to date.
Enter cloud computing almost 20 years ago now(!), and the transition from products to services that disrupted the IT industry (AI is set to disrupt every industry). For the Free and Open Source Software (FOSS) community this was a problem too because the “service provider loophole” meant that software was no longer being “conveyed” — a requirement to use it when it had to run on your machine — but merely executed on a server with controlled access provided. Attempts to address this by making viral licenses like the GNU General Public License (GPL) even more viral (e.g., Affero GPL aka AGPL) — triggering the requirement to redistribute source simply by accessing it over a network rather than distributing it — largely failed (thankfully).
That’s why I rounded up several of the early pioneers of cloud computing to form the Open Cloud Initiative (OCI) and, through a consultative process, determined that “Open Cloud” would require open standard formats and interfaces; there’s no point having transparent access (i.e., APIs) to opaque data (i.e., undocumented, obfuscated, or even encrypted formats), nor having transparent data (i.e., open standard formats) without programmatic access to it. Related definitions were included for “Free Cloud” and “Open Source Cloud”, for which we see demand again today with AI. The OSI declined my offer to take on this challenge at the time, but I understand they may yet — hopefully they get it right when and if they do. This context is useful for understanding how we got to where we are.
Open Source AI
Artificial Intelligence (AI), on the other hand, is something the OSI determined needed to be addressed this year, setting out to do so with the first town hall meeting on 12 January 2024, based on early drafts from a private workgroup. From the slides, they rightly started by asking “What is the preferred form to make modifications to an AI system?”, noting that:
To be Open Source, an AI system needs to be available under legal terms that grant the freedoms to:
- Use the system for any purpose and without having to ask for permission.
- Study how the system works and inspect its components.
- Modify the system to change its recommendations, predictions or decisions to adapt to your needs.
- Share the system with or without modifications, for any purpose.
So far, so good. They also rightly define the components of an AI system as:
- Code: Instruction for a computer to complete a task.
- Model: Abstracted representation of what an AI system has learned from the training data.
- Data: Information converted into a form efficient for processing and transfer.
So, in order to protect the four freedoms, we just need to make the three components available under approved licenses, right? Apparently not.
Two-Legged Stool
By draft 0.0.6, the third leg of the stool, data, had been cut off and deemed “not required” but rather merely “appreciated” — a pointless observation for a litmus test. In draft 0.0.7, the data deficiency was acknowledged by rebranding it as “data transparency”, which was when I waded into the discussion:
I worry that while the definition (which should probably be labeled as such rather than “What is Open Source AI”) requires that users be able to “modify the system for any purpose” (which is implied in the DFSG and implemented in its terms), the checklist makes the requisite inputs for said modifications (i.e., training data) optional but “appreciated”.
OSI’s Executive Director, Stefano Maffulli, replied, referring me to an earlier thread from January on the elephant in the room and claiming that “the issue is not that ‘most models will not meet the definition’ but that none will do.” Richard Fontana concurred, adding that “as a practical matter there will be few, if any, open source models other than perhaps some toy examples or research models of limited practical interest and utility”. This point is irrelevant though: either the definition meets the standard set by the community long ago (whether or not any existing implementations are compliant out of the gate), or it does not, and I would join others in arguing that the proposed definition does not.
Existing Open Source AI
Stefano Zacchiroli agreed in the earlier thread, stating that “the ability to redistribute the original training dataset should be a requirement for the freedom to modify an ML-based AI system, because short of that the amount of modifications one can do to such a system is much more limited.” He beat me to responding in the later thread that appropriately licensed training data sets like The Stack v2 (and Wikipedia, Wikidata, Wikimedia Commons, etc.) do exist and are already being used to create what I/we would consider to be true “Open Source AI” like StarCoder2. The “truly open” (their words, not mine) Open Language Model (OLMo) was also cited, demonstrating that truly Open Source AI is an achievable aim, provided we don’t poison the well by lowering the well-established standards (irrespective of noted existing ab/use of the term “Open Source” in the machine learning community, which is also irrelevant). This “prior art” meets the board approval criterion that it “Provides real-life examples: The definition must include relevant examples of AI systems that comply with it at the time of approval, so cannot have an empty set.”
Neither thread came to even a rough consensus, with folks on both sides of the debate maintaining their strongly-held positions. Various issues were raised, including:
- not being able to fix models trained on defamatory data;
- models being derivative works of training data and thus inheriting its license, or alternatively being mathematical calculations unprotectable by copyright;
- the practicality of hosting large data sets and the risk of third-party hosted data sets disappearing;
- the EU’s AI law exclusion for “open weights” only models (which doesn’t matter);
- whether we are weighing reproducibility too heavily or not enough (if not for reproducibility of binaries, what is the point of source code?), and whether byte-for-byte reproducibility is an achievable or even useful goal;
- the issue of open access but copyrighted data, including Flickr images and ImageNet;
- various other interesting but often irrelevant arguments (before the threads devolved into a debate about OSI governance and the history of the OSD).
Weasel Words
By draft 0.0.8 the weasel word “information” was introduced, defining the contrived term “data information” (otherwise known as metadata, as in “data about data”) to be “sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data”. The excuse given on behalf of these non-free models is that the training data is inconvenient or impossible to release for whatever reason, typically because it’s subject to proprietary copyrights (e.g., New York Times articles, YouTube transcripts), or because it contains personally identifiable information (e.g., healthcare records). We owe it to the community to meet our own well-established standards precisely because doing so encourages the creation and curation of free data sets rather than undermining them.
In the current draft, 0.0.9, the additional requirement that “Data information shall be made available with licenses that comply with the Open Source Definition” was introduced, but it is too little, too late; metadata licensing is meaningless if the data itself is unavailable. A redundant and confusing attempt to disambiguate AI models (including weights) from AI weights was also added, despite both code and weights being covered earlier. It is clear to me and many others that this version, which is set to become the release candidate and ultimately the 1.0 release at All Things Open on 28 October 2024, remains deficient and should not be adopted. It’s as if the process — not helped by the curious board approval criterion that it be “Ready by October 2024: A usable version of the definition needs to be ready for approval by the board at the October board meeting” rather than when it’s ready — went something like:
- We have to do something.
- This is something.
- We have to do this.
Why not both?
I would be remiss at this point not to acknowledge the carefully-considered compromise offered by a fellow maintainer of the Personal Artificial Intelligence Operating System (pAI-OS) — hence my interest in the topic — and fellow open source old guard, Karsten Wade, who presented a detailed proposal to satisfy both camps by bifurcating Open Source AI models into those required to release the data (“D+”, the default) and those with an “integrity exception” (“D-”). As discussed in the earlier thread, there is precedent for this dating back to the 1991 GPLv2 system library exemption for proprietary operating system components, but that was when it was not going to be possible to release an Open Source operating system without them, which is no longer the case and was never the case here. A related exception was the creation of the Lesser GPL (LGPL), with which an author can allow their code to be linked with other non-free software.
While this proposal has some traction with another opponent of the proposed definition, Steven Pousty, he pointed out that the “OSI has made no such distinction with software so why should they with AI/ML.” I also prefer to keep solutions as simple as possible (but no simpler), and in any case the OSAID needs to function as a binary litmus test like the Open Source Definition (OSD): either you’re in (i.e., “D+”) or you’re out. Fortunately, there’s a perfectly good existing term that describes “D-”: Open Weights.
Toxic Candy
We owe nothing to those who don’t commit to the cause and need not pander to the wants of their business models, but one could still make the case for a pejoratively-named “Lesser” license for those who don’t or can’t release training data. Mo Zhou, a fellow Debian developer who drafted Debian’s ML Policy and authored the earlier thread, has a special name for these appealing yet problematic models: Toxic Candy.
Toxic Candy Model is a model satisfying ALL the following conditions:
- The model itself is FOSS-Licensed & DFSG-compliant;
- trained from unknown, private, or non-free datasets or simulators;
- corresponding training program or script is present and complete;
It’s worth referencing the definition of a free (as in freedom, not as in beer) model, as Debian appears to have already got it right half a decade ago:
Free Model is a model satisfying ALL the following conditions:
- The model itself is FOSS-Licensed & DFSG-compliant;
- trained from explicitly FOSS-licensed & DFSG-compliant datasets (e.g. for supervised or unsupervised learning) or simulators (e.g. for reinforcement learning), and the dataset is publicly available to anonymous users;
- corresponding training program is present and complete;
While there are a few OSI Approved licenses that are unacceptable under Debian’s DFSG (OSL, RPL, CPAL, etc.), the fact that most models blessed as kosher by the OSAID today would be permanently ineligible for inclusion in Debian should be a five-alarm fire for those pushing it. There’s good reason why Debian won’t distribute AI models any time soon, but that’s no excuse to lower the bar.
OpenAI on Open AI
Being one to eat my own dogfood, I had a chat with the cloud brain (ChatGPT), asking it:
What is the equivalent of the “preferred form” for open source software (i.e., source code, as in “the source code must be the preferred form in which a programmer would modify the program”) in the context of Artificial Intelligence?
This is what it had to say:
In summary, the “preferred form” of an AI system for modification would include:
- The source code for the model architecture and training process.
- The raw or cleaned training data, or at least a detailed description of it.
- The model’s learned weights or checkpoints.
- The configuration settings for training, including hyperparameters.
- Scripts for training, evaluating, and fine-tuning the model.
- Documentation that contextualizes the model’s function, performance, and limitations.
All these components together form the most modifiable and transparent version of an AI system, analogous to source code in traditional software.
Addressing training data specifically, it had this to say (closing the loophole it opened in the bullet point above):
Training Data: In many AI systems, the model is only as good as the data it was trained on. Thus, providing access to the “preferred form” would mean giving access to the training data in its raw or cleaned form, as well as details about any preprocessing steps. Transparency about data sources and curation is critical for reproducibility and auditability.
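To make that concrete, here is a hypothetical sketch of what a complete “preferred form” release could contain, expressed as a simple Python manifest following the components listed above; the file and directory names are purely illustrative and don’t correspond to any real project:

```python
# Hypothetical manifest of a complete "preferred form" AI release.
# All paths and names are illustrative placeholders, not a real project's layout.
PREFERRED_FORM_RELEASE = {
    "code":    ["model_architecture.py", "preprocess.py"],  # model and data-pipeline source
    "data":    ["corpus/", "DATA_CARD.md"],                  # raw/cleaned training data plus provenance notes
    "weights": ["checkpoints/final.safetensors"],            # learned parameters (checkpoints)
    "config":  ["train_config.yaml"],                        # hyperparameters and training settings
    "scripts": ["train.py", "evaluate.py", "finetune.py"],   # training, evaluation and fine-tuning entry points
    "docs":    ["MODEL_CARD.md", "LIMITATIONS.md"],          # function, performance and limitations
}

if __name__ == "__main__":
    # Print the manifest so the components can be eyeballed at a glance.
    for component, artifacts in PREFERRED_FORM_RELEASE.items():
        print(f"{component:8s} {', '.join(artifacts)}")
```

The point being that the training data is a first-class component alongside the code and weights; drop the “data” entry and you no longer have the analogue of source code, you have a binary.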
Preferred Form
If we focus on the Open Source Definition’s critical “preferred form” term — which prevents folks from distributing source code in Japanese on punch cards to meet the letter but not the spirit of the law — then it is a question for practitioners: if tasked with modifying a model, would the preferred form be the input to the training process (i.e., the data) or the output from it (i.e., the model)? As Stefano Zacchiroli stated:
The main argument for including data in the definition is that they are part of the preferred form for modification of an AI system. We can debate (ideally, together with the actors who are actually creating and modifying AI systems) whether it is true or not that training data is part of that “preferred form”, but it is undeniable that if they are, then the definition of Open Source AI must include them.
Sure, you can make minor modifications with the model only (e.g., fine tuning), but for major modifications (e.g., removing, substituting, or supplementing parts of the corpus) you absolutely need the data. Any restriction on the four freedoms above must render the candidate inadmissible, and it is crystal clear that lack of data restricts the ability to study and modify the system at least. I would further argue that it restricts its use — as deploying black boxes of unknown and/or unverifiable provenance is unacceptable to many users — as well as the ability to share in the traditional sense of the term Open Source (i.e., with those same freedoms).
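To illustrate why, here is a minimal, hypothetical PyTorch sketch; the model, file names and helper functions are invented for illustration and not drawn from any real release. Fine-tuning needs only the published weights and your own new data, whereas removing or substituting part of the corpus requires the original data, which a weights-only release simply doesn’t give you.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in for a released model architecture (illustrative only)."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.embed(tokens).mean(dim=1))

model = TinyLM()
# model.load_state_dict(torch.load("released_weights.pt"))  # an "open weights" release: this is all you get

# Minor modification (fine-tuning): the weights plus *your own* new data suffice.
new_batch = torch.randint(0, 1000, (8, 16))   # your supplementary examples
new_labels = torch.randint(0, 1000, (8,))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = nn.functional.cross_entropy(model(new_batch), new_labels)
loss.backward()
optimizer.step()

# Major modification (removing, substituting or supplementing parts of the corpus):
# without the original dataset there is nothing to filter and retrain from, and the
# knowledge already baked into the weights cannot be selectively "un-trained".
# original_corpus = load_original_corpus()              # unavailable for weights-only models
# cleaned = [ex for ex in original_corpus if keep(ex)]  # e.g., drop defamatory or infringing items
# retrain(TinyLM(), cleaned)                            # impossible without the data
```

Fine-tuning merely nudges the existing weights; excising what was learned from particular data is, for all practical purposes, a retraining exercise, which is exactly why the data belongs in the preferred form.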
Historical Mistake in the Making
Indeed, by cementing the status quo it is possible to do more harm than good, and I would cite the precautionary principle in saying that the onus is on those advocating this watered-down approach to address the risk to the Open Source community and the fledgling ecosystem of truly Open Source models. Given the sustained, well-justified objections, I would argue that the third board approval criterion — that it be “Supported by diverse stakeholders: The definition needs to have approval by end users, developers, deployers and subjects of AI, globally” — has not been met. No amount of “endorsers” will change that.
At the IETF, lack of disagreement is more important than agreement, and “coming to consensus is when everyone (including the person making the objection) comes to the conclusion that either the objections are valid, and therefore make a change to address the objection, or that the objection was not really a matter of importance, but merely a matter of taste.”
To quote Mo Zhou:
If OSAID does not require original training dataset, I’d say it will become a historical mistake.
I agree, and whether or not you do too, you should join the discussion.