This FAQ-style article sets out our opinion on the problems with both the process and its product:
Overview
What’s the problem?
The Open Source Initiative (OSI)’s board of 10 people (assisted by their employees) took it upon themselves — without the mandate or ultimate approval of their membership — to release an Open Source AI Definition (OSAID) that does not require the source (i.e., training data) of AI models and systems, and which has been roundly rejected by the industry and its experts.
This means that the resulting models do not fully protect the four essential freedoms of Open Source software (despite the definition citing them), as implemented in the Open Source Definition (which dates back to 1998 and was last updated in 2007). Their position, which happens to align with the desires of their sponsors and thus presents a perceived if not actual conflict of interest, renders compliant models unable to serve as the foundation of the next generation, the way Free & Open Source Software (FOSS) like Linux does today. Other definitions, from the Free Software Foundation (FSF), Debian, and others, do require the data, and the resulting split in the industry is a significant problem in and of itself.
The source (i.e., data) is also required to assess and address ethical, security, and other issues, which is ironic given the OSI claimed to be acting in the interests of, and on behalf of, minorities, the very people who need that data to address bias and fairness issues in AI systems.
Open Source
Who came up with the term Open Source?
Christine Peterson, executive director of the Foresight Institute at the time, coined the term Open Source Software on 3 February 1998, and it was floated on her behalf by Linux programmer Todd Anderson at meetings following Netscape’s announcement that it would release its browser source code.
The introduction of the term “open source software” was a deliberate effort to make this field of endeavor more understandable to newcomers and to business, which was viewed as necessary to its spread to a broader community of users. The problem with the main earlier label, “free software,” was not its political connotations, but that—to newcomers—its seeming focus on price is distracting. A term was needed that focuses on the key issue of source code and that does not immediately confuse those new to the concept. The first term that came along at the right time and fulfilled these requirements was rapidly adopted: open source.
Early adopters and promoters of the term included Tim O’Reilly, Eric Raymond, and Bruce Perens. There are reports of the term pre-dating this use (see Mozilla, OSI, & the memory-holing of Computer History), but this was the point where it entered the mainstream vernacular.
Is it even possible to extend Open Source to AI?
Respected Open Source analyst firm RedMonk has argued in The Narrows Bridge: From Open Source to AI that it is not:
I do not believe the term open source can or should be extended into the AI world.
They liken OSAID to the Narrows bridge which “began to ripple and twist like a ribbon” then “collapsed entirely”, serving as an “object lesson for engineers to this day”.
It is too early to tell whether Open Source can effectively be applied to AI, but I am personally optimistic.
Open Source Initiative (OSI)
What is the Open Source Initiative (OSI)?
A California public benefit corporation with federal 501(c)(3) tax-exempt status (for the time being, given their lobbying).
It was created by fellow Debian Developers as the “marketing department” of Free Software, including:
- Bruce Perens (author of the DFSG, which was cloned as the OSD 1.0 in 1998; left in 2020 over a controversial license approval)
- Eric Raymond (author of The Cathedral and the Bazaar, banned by the OSI board in 2020)
- Ian Murdock (the “ian” in Debian, died in 2015)
Where did the OSI derive its legitimacy?
Community consensus, which it no longer has.
While the OSI has no formal guidelines on this, having never done it before (the OSD, in the form of the DFSG, predates the OSI), the IETF has defined rough consensus, which deals with similar issues: On Consensus and Humming in the IETF.
Does the OSI have community consensus?
No. The OSI has been actively repressing dissent, especially on the missing requirement for training data, throughout the process, and several significant organisations in the FOSS community have come out against it or with conflicting definitions:
- Debian: Concerns regarding the “Open Source AI Definition”
- Digital Public Goods Alliance: The Role of Open Data in AI systems as Digital Public Goods
- Free Software Foundation: FSF is working on freedom in machine learning applications
- Software Freedom Conservancy: Open Source AI Definition Erodes the Meaning of “Open Source”
Does the OSI board have AI competence?
No, none of them even claims AI competence in their profiles.
See also The OSI lacks competence to define Open Source AI.
Who elects the OSI board?
The board, with advisory votes from members (4 seats) and affiliates (4 seats).
What’s this about bullying Open Source & Artificial Intelligence experts?
I have personal experience of this: an OSI executive spent 90 minutes on the phone with an officer of Kwaai regarding my contributions to this unnecessary and damaging debate; fortunately, its leader sees the value in them.
I also received this private message via the OSI’s own forums when the censorship was ramped up:
I was not banned.
But I don’t know if I feel like arguing with alpha males who take advantage of their role to censor women. Sad and pathetic.
CW: mental health, open source AI, harassment
I’m tired, friends. This is a personal, emotional post for me. It is not intended to be scientific or objective.
The whole effort to “define open source AI” has been heartbreaking for many of us who care so very deeply about open source. It is not just the lack of transparency, biased narratives, and fallacious arguments that we have seen from the OSI — it’s how participants who present differing viewpoints have been treated.
I have been called a liar, accused of being an enemy of open source, had my contributions misrepresented and shut down, and had my job not-so-subtly threatened for authentically contributing my expertise — expertise that lies precisely at the intersection of open source and artificial intelligence. I have seen others experience the same.
Despite this, I have made genuine efforts to provide opportunities for repair and remediation that were rejected out of hand.
This is not the open source community that welcomed me over 20 years ago. Maybe that community doesn’t exist anymore in the same form, but we deserve one that treats people with kindness and respect. We all deserve the one that pushes for inclusion and empowerment. The one that opens windows instead of closing doors.
This isn’t about AI. It may have started with AI, but that’s not what is at stake.
It’s about the very meaning of open source, and the OSI’s insistence on undermining it on a technical AND cultural level. I worry about the opportunities that future generations will not have as technology trends closed. I worry about future contributors who will decide to steer clear because of renewed toxicity. I worry about voices silenced, progress reversed, potential lost.
Again, this is very personal. I stepped back from actively participating in the official channels for the OSI’s process because of how damaging it was to my mental health. The behavior of a handful of individuals in positions of authority left me questioning my place in open source writ large. Worse than that — these abuses of power sent me into a severe PTSD flare-up that put my safety at risk.
I love open source. The past year has been personally, professionally, and emotionally devastating. I don’t know how to recover from this. Those of you who have been kind to me during all of this, I appreciate you more than I can possibly say. It keeps me here. It tells me that maybe there is still room for me in open source.
What’s at stake here is more than open source AI, AI, OR open source. People are at stake.
What’s this about censorship?
The OSI’s community manager(s) have censored their discussion forums so aggressively that, at the time of writing, there had not been a third-party post in a month. All posts are now moderated, and several users, including myself, have been temporarily banned and/or threatened with permanent bans:
Are there alternative forums?
Yes, the community has set up open discussion forums at https://discuss.opensourcedefinition.org/
Who are the sponsors?
The OSI’s sponsors include the likes of Google, Cisco, Bloomberg, CapitalOne, GitHub, Intel, Meta, Microsoft, and Red Hat.
Many of these companies have some noteworthy things in common:
- They are in the A.I. business in some way.
- They make use of “Open Source” in their A.I. products.
- They use “Open Source” as a promotional and public relations tool.
- They, in one way or another, work with a closed, proprietary set of A.I. training data.
- They have significant “Diversity, Equity, and Inclusion” efforts.
When you add that all together, this “Open Source AI Definition” begins to make a lot more sense.
It is, in short:
An effort to create a “Certification” which will declare all of their A.I. systems (no matter how closed their data is) as “Open Source”… while simultaneously being run by a DEI activist organization with a focus on racial and gender identity quotas.
It checks a whole lot of check boxes. All at once.
Source: Open Source AI Definition: Not Open, Built by DEI, Funded by Big Tech
Open Source Artificial Intelligence Definition (OSAID)
Does the OSI have the mandate of membership to pursue the OSAID?
No, the OSI’s board did not seek or obtain the mandate of membership (if anything, membership voted against it in a show of hands), nor did they ultimately submit their document for approval by the membership.
Did the members approve the OSAID?
No, just the board.
What process was used for the development of the OSAID?
The OSI’s “Deep Dive” into AI used a relatively unknown process called “co-design” rather than the typical rough consensus process used for technical standards:
Co-design, also called participatory or human-centered design, is a set of creative methods used to solve communal problems by sharing knowledge and power. The co-design methodology addresses the challenges of reaching an agreed definition within a diverse community (Costanza-Chock, 2020; Escobar, 2018; Creative Reaction Lab, 2018; Friedman et al., 2019).
This sounds reasonable, but you don’t have to dive too far into those references to learn about the black feminist concept of the “Matrix of Domination” and discover “design principles and practices erase certain groups of people, specifically those who are intersectionally disadvantaged or multiply-burdened under capitalism, white supremacy, heteropatriarchy, and settler colonialism”.
While interesting and important, this has nothing to do with the technical question of what components of an AI system need to be open for the system to fully protect the four essential freedoms of free software. This is a question for subject matter experts who have been excluded from the process; when everyone’s an expert, nobody is:
We believe that everyone is an expert based on their own lived experience, and that we all have unique and brilliant contributions to bring to a design process.
See also:
- OSI on the co-design process that [still-]birthed the Open Source AI Definition (OSAID)
- Design Justice, A.I., and Escape from the Matrix of Domination
Who led the process?
To create this new “OSAID”, the Open Source Initiative hired Mer Joyce from the consulting agency known as “Do Big Good”.
Why, specifically, was Mer Joyce hired to lead the effort to create a brand new “Open Source” definition, specifically focused on Artificial Intelligence?
- Was it her extensive background in Open Source?
- Or her expertise in A.I. related topics?
- Perhaps it was simply her many years of work in software, in general?
Nope. It was none of those things. Because, in fact, Mer Joyce appears to have approximately zero experience in any of those areas.
In fact, the stated reason that Mer Joyce was chosen to create this Open Source definition is, and I quote:
“[Mer Joyce] has worked for over a decade at the intersection of research, policy, innovation and social change.”
Her work experience appears to be mostly focused on Leftist political activism and working on Democrat political campaigns. <snip>
Why this agency, and this individual, was hired to lead the work on the OSAID is beyond baffling. Just the same, this appears to be part of a larger pattern within Open Source and Big Tech: Hiring non-technical, political activist types to lead highly technical projects. It doesn’t usually go well.
Source: Open Source AI Definition: Not Open, Built by DEI, Funded by Big Tech
What about Diversity, Equity, and Inclusion (DEI)?
Considering that the leadership hired to oversee the OSAID’s creation is extremely non-technical — and almost 100% focused on “anti-racist” and “decolonizing” activism — it’s no surprise that one of the first steps taken was to create “working groups” based entirely on skin color and gender identity.
“The next step was the formation of four working groups to initially analyze four different AI systems and their components. To achieve better representation, special attention was given to diversity, equity and inclusion. Over 50% of the working group participants are people of color, 30% are black, 75% were born outside the US, and 25% are women, trans or nonbinary.”
What does having “25% of the people being Trans or nonbinary” have to do with creating a rule-set for software licensing?
Your guess is as good as mine.
But, from the very start of the OSAID’s drafting, the focus was not on “creating the best Open Source AI Definition possible”… it was on, and I quote, “diversity, equity and inclusion”.
The best and brightest? Not important. Meritocracy? Thrown out the window.
Implement highly racist “skin color quotas” in the name of “DEI”? You bet! Lots of that!
Source: Open Source AI Definition: Not Open, Built by DEI, Funded by Big Tech
Did the OSI faithfully follow the “co-design” process?
No. For example, they drove the discussions and decisions in their direction by unilaterally and prematurely closing a proposal to reach consensus:
We see the role of the designer as a facilitator rather than an expert.
What was the voting scandal?
The OSI describes the “co-design” process as democratic, complete with questionable statistics. Our analysis found that vote-cancelling negative votes were granted only to Meta’s Llama working group, and that these superpowers were used (including by Meta’s lawyers) to cancel other working groups’ votes to require the release of training data, without being reported in the working groups’ final vote tallies.
They admitted it and apologised, only to uphold the flawed decision to exclude training data (even though subsequent analysis showed the vote favoured requiring it), claiming that the process, still advertised as democratic today, was “never meant neither to be scientific, nor representative, nor democratic or anything like that”.
In any case, democracy is not suitable for defining technical standards to meet specific technical requirements (i.e., the full protection of the four essential freedoms of free software: to use, study, modify, and share the work).
See also: Lies, Damned Lies, and Statistics: The path to a meaningful Open Source AI Definition
What do the OSI’s founders think of the OSAID?
OSD author Bruce Perens has argued that:
- The OSAID is flawed.
- The OSI hasn’t done a great job, and so wasn’t necessarily the best team to do this.
- The result is less than Open Source.
He considers the training data to be the source code and that the OSD does not need updating:
You can apply the original Open Source Definition to machine learning and I feel it works better. You need to apply it to two pieces: the software (and possibly specialized hardware), and the training data. The training data is “source code” for the purposes of the Open Source definition. This is complicated because some models evolve with each question asked, etc., and thus the training data is ever changing and becomes very large.
What do OSAID “Endorsers” think about this?
Many of the “endorsers” offer only lukewarm support and impose additional conditions, for example:
- Ai2 are now forced to claim they are “more open than open”: “Open access to pre-training data is a must have for understanding model behavior. We cannot take a scientific approach to AI creation without access to and understanding of the training data.”
- “Nextcloud supports that but when it comes to data, we believe it should be always fully available.” and “In case of Nextcloud, that means we keep our Ethical AI Definition, which does contain a requirement to open up all data, in parallel with the OSI definition.”
- LINAGORA, by way of the OpenLLM-France project’s “data-first approach”, released a presentation specifically calling for the release of training data and calling out the risk of contamination: “Transparence et publication des jeux de données d’apprentissage avec prise de compte de la contamination ou pas pour la licence finale du modèle.” (“Transparency and publication of training datasets, taking into account whether or not they are contaminated, for the final licence of the model.”)
- LINAGORA also proposed their own OpenSourceAI-Definition that explained why data is critical: “Training datasets are the foundation of AI model accuracy and reliability. Their open availability ensures replicability and transparency in AI development.”
Was creating the OSAID even necessary?
No. The Open Source Definition (OSD) has stood the test of time over a quarter century since its release in 1998. While it is strong on “openness”, it is implicit about “completeness”, which has left data dependencies like databases and media in a grey area. A better approach would be to make this explicit, which could be done simply by adding a single sentence to the introduction, as has been proposed for the community WIP.
How do I know if the OSD or OSAID applies to my software?
You don’t. In all likelihood, both do, and in many if not most cases they will give you conflicting answers. Given the OSAID is a much weaker standard, the most common scenario is that software would pass the OSAID and fail the more meaningful OSD. The OSAID casts a very wide net, covering any software that “infers, from the input it receives, how to generate outputs” (i.e., virtually all software).
Does the OSAID conflict with the established OSD?
Yes.
What are the security implications of OSAID?
By not requiring the data, OSAID enables or exacerbates security vulnerabilities including prompt injection, data leakage, inadequate sandboxing, and unauthorized code execution, among others, often making them undetectable. That’s not to say that providing it is a panacea, but you at least have some chance of identifying the needle(s) in the haystack (see the sketch after the list below), and you can even use AI to help find them.
See also:
- Turning Open Source AI Insecurity up to 11 with OSI’s RC1
- OWASP Top 10 for Large Language Model Applications
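To make the “needle in the haystack” point concrete, here is a minimal sketch of the kind of audit that access to the training corpus makes possible. The directory name and patterns are illustrative assumptions only; a real audit would use curated blocklists, hash sets, and classifiers rather than a couple of regexes:

```python
# Illustrative sketch only: scanning an open training corpus for known-bad content.
# The directory name and patterns are assumptions for demonstration purposes.
import re
from pathlib import Path

# Things an auditor might look for: leaked credentials, known copyrighted passages,
# or fingerprints of abusive material (represented here by simple regexes).
SUSPECT_PATTERNS = [
    re.compile(r"-----BEGIN (?:RSA|EC) PRIVATE KEY-----"),  # leaked secrets
    re.compile(r"AKIA[0-9A-Z]{16}"),                         # AWS access key IDs
]

def audit_corpus(corpus_dir: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, pattern) for every suspect match in the corpus."""
    findings = []
    for path in Path(corpus_dir).rglob("*.txt"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for pattern in SUSPECT_PATTERNS:
                if pattern.search(line):
                    findings.append((str(path), lineno, pattern.pattern))
    return findings

if __name__ == "__main__":
    for file, lineno, pattern in audit_corpus("./training-data"):
        print(f"{file}:{lineno}: matched {pattern}")
```

None of this is possible with an Open Weight release: the haystack simply isn’t there to search.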
What types of data are acceptable by the OSAID?
Any data or no data. Yes, you read that right. No, it doesn’t make sense given “the training data is source code” per the OSD’s author.
Their FAQ deceptively divides data into four categories, only to accept any data, or none:
- Open training data: Open data like CC-BY-SA licensed Wikipedia, which could be considered Open Source.
- Public training data: Accessible but unlicensed/improperly licensed data like Common Crawl, which allows users to study and modify the system, but limits their ability to share or even use it (given the often unknown provenance, and possible tainting of outputs with others’ rights). Potentially a candidate for a “limited” Open Source brand like LGPL.
- Obtainable training data: Obtainable “including for fee”, like NYT articles. This typically means there’s a custodian who will aggressively pursue any use of the data for training or inference: The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work
- Unobtainable non-public training data: Not obtainable for any amount of money, like Facebook/Instagram’s social graph. Clearly not a candidate for Open Source, and potentially dangerous.
Open Source Artificial Intelligence (OSAI)
What does Debian think about this?
Fellow Debian developers are falling over themselves to second a General Resolution (GR) opposing the OSAID and declaring it not DFSG-compliant, which would have significant implications: among other things, OSAID-compliant software would be rejected by Debian and its dependent distros. Based on the sentiment across several threads, it is overwhelmingly likely the GR would pass (e.g., Concerns regarding the “Open Source AI Definition” 1.0-RC2). The sentiment can be summarised as follows:
“The source code” must include everything needed to rebuild the software so that it works the same as the original. An AI system doesn’t “work the same” — i.e., give the same output from the same input — without the training data, so the training data is clearly part of the source.
Requiring that users must “build a substantially equivalent” part of the source on their own, as stated in the “Data Information” paragraph, is obviously at odds with the DFSG. That’s like not releasing the source code at all and claiming that it’s still free software because “a skilled person” could rewrite it. That’s obvious bullshit.
OSAID-compliant models will not be eligible for inclusion in Debian and dependent distros.
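To illustrate the “works the same” argument in the smallest possible terms, here is a toy sketch (pure NumPy, made-up numbers, not any real system) showing that identical, fully open training code produces a materially different model when fed a “substantially equivalent” substitute for the original data:

```python
# Toy illustration: same open training code, different training data,
# different model. All numbers are made up for demonstration.
import numpy as np

def train_linear_model(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares: the 'training code' is fully open here."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 3))
y_original = X @ np.array([1.0, -2.0, 0.5])     # the original training data
y_substitute = X @ np.array([0.3, 0.3, 0.3])    # a "substantially equivalent" guess

weights_original = train_linear_model(X, y_original)
weights_rebuilt = train_linear_model(X, y_substitute)

probe = np.array([1.0, 1.0, 1.0])
print("original model output:", probe @ weights_original)
print("rebuilt model output: ", probe @ weights_rebuilt)  # does not "work the same"
```

If the weights are the compiled binary, the training data is the source you need in order to rebuild it.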
What does the Free Software Foundation think about this?
FSF is working on freedom in machine learning applications:
We believe that we cannot say a ML application is free unless all its training data and the related scripts for processing it respect all users, following the four freedoms.
OSAID-compliant models will not be considered “free software” by the FSF.
What does the Digital Public Goods Alliance think about this?
The DPGA has just published their views on The Role of Open Data in AI systems as Digital Public Goods:
Importantly, the proposal, which will soon move to community consultation, is to continue requiring open training data for AI systems to be considered DPGs.
OSAID-compliant models will not be considered “public goods” without the data.
What does the Software Freedom Conservancy think about this?
They consider that the Open Source AI Definition Erodes the Meaning of “Open Source”:
- “The OSI refused to add this requirement because of a fundamental flaw in their process; they decided that ‘there was no point in publishing a definition that no existing AI system could currently meet.’ This fundamental compromise undermined the community process, and amplified the role of stakeholders who would financially benefit.”
- “OSI made an unforced error in this regard. While they could have humbly announced this as ‘recommendations’ or ‘guidelines,’ they instead formalized it as a ‘definition’ — with equivalent authority to their OSD.”
- “Their celerity of response made OSI an easy target for regulatory capture.”
- “To make a final decision about the software freedom and rights implications of such a nascent field led to an automatic bias to accept the actions of first movers as legitimate.”
- “By making this definition official too soon, OSI has endorsed demonstrably bad LLM-backed generative AI systems as ‘open source’ by definition!”
- “OSI also disenfranchised the users and content creators in this process.”
- “The Open Source community remains consistently in danger of excessive insularity, and the OSAID is an unfortunate example of how insular we can be.”
- “Until today, I was always able to say: “I believe that anything the OSI calls ‘open source’ gives you all the rights and freedoms that you deserve”. I now cannot say that again unless/until the OSI revokes the OSAID. Unfortunately, that Rubicon may have now been permanently crossed!”
- “OSI undermines its position as a philosophical leader in Open Source as long as OSAID 1.0 stands as a formal definition.”
- “The humble act now is to admit that it was just too soon to publish a ‘definition’ and rebrand these the OSAID 1.0 as ‘current recommendations.’”
- “Rather than merely be a pundit on this matter, I am instead today putting myself forward to try to be part of the solution… I will work arduously for my entire term to see the OSAID repealed.”
What does Stanford University’s Percy Liang think about this?
A “very informative interview” (Percy Liang on truly open AI) with Percy Liang, Associate Professor of Computer Science at Stanford University, director of the Center for Research on Foundation Models (CRFM), and co-founder of Together AI, has been recommended by OSI in support of the OSAID, likely because he linked to it in quoting the four essential freedoms of free software.
As is often the case, the claim is not in the citation given, and Liang actually undermines their position by confirming the criticality of training data access:
As the default for frontier models is API access, we’ve become accustomed to a low standard of transparency – we’re happy when we get access to weights. But this is only a partial improvement. There’s lots of crucial information we’re missing and without the training data, you risk catastrophic forgetting if you fine-tune.
Furthermore, open science needs open-source models. Without knowledge of the training data, pipelines, or the test-train overlap, how can we interpret test accuracies or understand model capabilities?
He also notes the existing “truly open-source work”, invalidating the OSI’s trope that a meaningful definition that requires data would result in an “empty set”:
The AI community is also producing great truly open-source work, including OLMo from AI2, LLM360’s K2, MAP’s Neo, HuggingFace’s SmolLM, BigCode’s StarCoder, Together’s RedPajama, EleutherAI’s Pythia, GPT-J, and NeoX, and the multi-team DataComp-LM effort.
These teams all release code, weights, most of their checkpoints, and at least some of their training data. Others go as far as releasing all of their training data, evaluation code, and intermediate checkpoints. This means that the work is in principle reproducible, in line with the norms we’d expect in other areas of science.
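Liang’s point about test–train overlap can be illustrated in a few lines of code; note that even this trivial check (the n-gram size and toy data below are arbitrary assumptions) cannot be run at all unless the training corpus is available:

```python
# Sketch of a test-train contamination check; n-gram size and data are illustrative.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(test_example: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a benchmark example if any of its n-grams appears verbatim in training data."""
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)

# Toy usage: the benchmark answer leaked into the training corpus.
training_docs = ["the quick brown fox jumps over the lazy dog near the river bank today"]
test_example = "a quick brown fox jumps over the lazy dog near the river bank today again"
print(contaminated(test_example, training_docs))  # True
```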
Implications
Can AI models reveal their training data?
Yes, and attempts to prevent this are often ineffective:
- Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy: “Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. And yet, we demonstrate that this “perfect” filter does not prevent the leakage of training data.”
- LLMs and Memorization: On Quality and Specificity of Copyright Compliance: LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
- Copyright And The Challenge of Large Language Models: The NYT v. OpenAI case and others raising the same issue highlight a fundamental mismatch between traditional copyright law and the reality of LLM technology and the AI industries’ fair use defense.
It is also not binary; models may produce short strings or paraphrase protected content in ways that may be acceptable in one jurisdiction but not another. Developments like the “Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs” demonstrate this spectrum: Rethinking LLM Memorization through the Lens of Adversarial Compression.
Additionally, rights holders are able to identify not only verbatim and paraphrased copyrighted works, but also inject tamper-resistant watermarks that could be used as proof in cases against violators: Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking.
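As a hedged sketch of what such probes look like in practice: `generate` below is a placeholder for whatever inference API a given model exposes (an assumption, not a real library call), and the ACR function is a simplified reading of the metric described in the cited paper:

```python
# Hypothetical sketch of a verbatim-memorization probe and a simplified ACR.
# `generate` is a placeholder, not a real API; wire it to the model of your choice.
def generate(prompt: str, max_tokens: int) -> str:
    """Placeholder: return the model's continuation of `prompt`."""
    raise NotImplementedError("connect this to the model's inference API")

def leaks_verbatim(known_passage: str, prefix_tokens: int = 32) -> bool:
    """Prompt with the start of a known (e.g. copyrighted) passage and check
    whether the model reproduces the remainder verbatim."""
    tokens = known_passage.split()
    prefix = " ".join(tokens[:prefix_tokens])
    expected = " ".join(tokens[prefix_tokens:])
    completion = generate(prefix, max_tokens=len(tokens) - prefix_tokens)
    return completion.strip().startswith(expected[:100])

def adversarial_compression_ratio(target: str, shortest_prompt: str) -> float:
    """Simplified ACR: length of the target divided by the length of the shortest
    adversarial prompt found to elicit it; values above 1 suggest memorization."""
    return len(target.split()) / max(1, len(shortest_prompt.split()))
```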
What are the implications for copyright?
Users and rights-holders are not able to verify the contents of AI models without access to their training data. This is bad for both groups, as rights may be undermined and devalued by inclusion in Open Source models, and users may be exposed to significant liability for infringement.
This alone should be a showstopper for the OSAID, and it will likely incite action from rights-holders as they become aware of the issue. They may have limited ability to protect themselves from such infringement, which may be hosted outside of their jurisdiction and subject to different copyright regimes.
For example, copyrighted content (for the sake of the argument, Mickey Mouse) could be incorporated into a model in Germany under the guise of teaching and research (Section 60 UrhG), with the results made available as an OSAID-compliant AI model. This could undermine the rights-holders’ rights and/or expose distributors and users to legal liability.
This calls into question whether the rights to even use or share a model are fully protected by OSAID (we know the rights to study and modify models are not).
What are the implications for human creativity?
With more and more human output being run through LLMs, we risk our work being tainted by unknown third-parties’ rights. OSAID-compliant models exacerbate this issue when they should be the antidote for it. This is a huge problem that deserves — and should have received — close attention from experts and philosophers. The Software Freedom Conservancy have raised similar concerns:
FOSS activists err when we unilaterally dictate and define what is ethical, moral, open and Free in areas outside of software. Software rights theorists can (and should) make meaningful contributions in these other areas, but not without substantial collaboration with those creative individuals who produce the source material. Where were the painters, the novelists, the actors, the playwrights, the musicians, and the poets in the OSAID drafting process?
With models being trained on “toxic” datasets like Common Crawl — a dump of the Internet published without Open Data licenses — anybody could potentially demonstrate a “chain of custody” from their work to yours, even if you unknowingly used a model included in word processing software for spelling and grammar checks (an extreme example, granted, but one worthy of consideration nonetheless). It should come as no surprise, then, that Common Crawl is an OSAID booster: “The Common Crawl Foundation fully supports the Open Source AI Definition as a crucial step in setting clear standards for open and transparent AI development. This definition will help ensure AI develops responsibly, staying open and accessible to everyone.”
Excuses
What about the EU AI Act?
We’ve heard constantly that the OSAID needed to be rushed out because of the EU AI Act (Artificial Intelligence Act), which entered into force on 1 August 2024. Notwithstanding that a US-based charitable organisation that risks loss of tax-exempt status for lobbying should not be trying to influence the European Union — ironically via a paid employee in the post-Brexit United Kingdom — Ben Brooks (a Fellow at Harvard University’s Berkman Klein Center) explains why “it’s crucial that the OSAID isn’t adopted, uncritically, in regulation. It has different goals, and could have unintended consequences.”:
Bottom line? We should be mindful of how transparency obligations directed at one objective (e.g. auditability, replicability, “open washing”) could impact the wider community if they are transposed into regulations directed at a different objective (e.g. opt-outs). If regulatory goals can be achieved with a generous interpretation of “open source”, governments should take the expansive view. The OSAID serves a noble but altogether different purpose.
What about medical applications?
The OSI asks “Do you want Open Source AI built on medical data?”, and the answer is a qualified “yes”.
In addition to copyright issues, medical data is generally subject to strict privacy controls, which make it inherently unsuitable for training Open Source AI models. That’s fine, because not everything has to be Open Source, and some things are better off not being Open Source. Given we know that AI models can reveal their training data, the suggestion that the training process offers adequate protection for such sensitive data is as bogus as it is dangerous, both to data subjects and to those seeking to rely on such assurances to avoid legal liability.
Anyone claiming the ability to conceal the source (i.e. data) is a feature of OSAID — or conversely that requiring the data is a limitation of more meaningful definitions — either doesn’t understand the technology or does and is deliberately deceiving you. Either way, in doing so they have proven themselves incompetent to create such a definition. Indeed, allowing the publication of medical models as Open Source without requiring the data (as OSAID does) actually impairs the industry. In any case, an Open Weight model would be published anyway with or without the Open Source moniker, so there is significant cost and no benefit in extending the definition to cover them.
Fortunately, medical data (e.g., lung cancer scans) gathered with explicit consent may be suitable for training medical models, in which case its availability under a meaningful Open Source definition that requires the release of training data will enable researchers to study and modify it as a foundation for future models. The same cannot be said for Open Weight models which do not include the training data, severely limiting the freedom to study and modify the model to e.g. fine-tuning. Furthermore, there is strong incentive for patients to provide such permissions, as the research could be used to save themselves, a loved one, or others.
What about federated learning?
The OSI asks “Can you imagine an Open Source AI built with federated learning?”, and the answer is a qualified “yes”.
Federated learning (aka collaborative learning) is a technique that doesn’t require collecting the training data centrally: each participant trains on its own, generally private, slice of the data, and the results are consolidated into a model trained on the superset of participants’ data without any participant having had access to the others’ data. This is useful for applications like fraud detection, where a consortium of banks can work together to train a model that exceeds the performance of any one bank’s model, without sharing their customers’ data with other banks.
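For the curious, here is a minimal sketch of federated averaging (the canonical federated learning algorithm) using a toy linear model and made-up data; the participants, data, and hyperparameters are assumptions for illustration only. The point is that only model updates cross organisational boundaries, never the underlying records:

```python
# Minimal federated averaging (FedAvg) sketch: toy linear model, made-up data.
# Each participant trains on its private slice; only weights are shared.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One participant's training pass on its own private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(weights, private_datasets):
    """The coordinator averages locally trained weights; it never sees raw data."""
    return np.mean([local_update(weights, X, y) for X, y in private_datasets], axis=0)

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

def make_private_dataset(n=50):
    """Toy 'transaction records' held privately by one participant."""
    X = rng.normal(size=(n, 2))
    return X, X @ true_w + rng.normal(scale=0.1, size=n)

participants = [make_private_dataset() for _ in range(3)]
global_w = np.zeros(2)
for _ in range(10):
    global_w = federated_round(global_w, participants)
print("global model weights:", global_w)  # approaches the jointly optimal model
```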
Federated learning is specifically designed to improve privacy by not sharing the training data, and given the data is the source for AI, it is fundamentally incompatible with Open Source. That’s fine, because not everything has to be Open Source, and some things are better off not being Open Source. Given we know that AI models can reveal their training data, the suggestion that the training process offers adequate protection for such sensitive data is as bogus as it is dangerous, both to data subjects and to those seeking to rely on such assurances to avoid legal liability. Extending the definition to allow for federated learning would be the software equivalent of collecting closed-source binary blob firmware drivers from several vendors and calling the resulting distribution Open Source, which is obviously nonsense.
Anyone claiming the ability to conceal the source (i.e. data) is a feature of OSAID — or conversely that requiring the data is a limitation of more meaningful definitions — either doesn’t understand the technology or does and is deliberately deceiving you. Either way, in doing so they have proven themselves incompetent to create such a definition.
Fortunately, federated learning can still benefit from Open Source tool chains, and the results can be released as Open Weight models which do not deceive users as to the availability of source (i.e., data).
What about training budgets?
Another invalid excuse for hiding the training data is that you couldn’t afford to train with it yourself anyway.
When the Open Source Definition was released and OSI founded in 1998, computers were expensive and took forever to compile a package or program, let alone a full Linux distribution. The software got more efficient and computers more powerful, so this wasn’t a problem for long. Similarly with AI, the hardware is getting faster and cheaper by orders of magnitude, and the models have started to get smaller and more efficient (for example, by training on specialised sets of high-quality data).
Even if this were not the case, companies are securing 9- and 10-figure checks from private investors before even turning to public money. Countries like France and Germany, and regions like Europe, appreciate the strategic value of having their own AI models and services, and are willing to invest to make them a reality:
- TechCrunch: Paris-based AI startup Mistral AI raises $640M
- Lidl owner and Bosch Ventures co-lead $500M Series B into German AI startup Aleph Alpha
Also worth mentioning is that a lot of the training spend is actually in licensing data that was previously “procured” for “free”, but the licensing details are typically not published and are unlikely to apply to freely downloadable models.
What about Child Sexual Abuse Material (CSAM)?
OSI’s leadership have argued that “a popular dataset of images used for training many image generation AI tools […] contained child sexual abuse images for many years” in making the case that “pushing for radical openness with data has clear drawbacks and issues”, claiming that “there was no easy way for the maintainers of this dataset to notice and remove those images” and seemingly suggesting that it should be closed instead of being abandoned altogether. See Open Source AI: Won’t somebody please think of the children?
Taking Action
What can people do about it?
The OSI has had all year to incorporate feedback from the community but has chosen not to, undermining their claims that this is just a 1.0 and future versions will address our concerns.
Unless repealed by the OSI, the community must reject the OSAID. Start by signing and sharing the Open Source Declaration, referring to OSD v1.9 specifically, and joining our uncensored Discourse server.
Who are you and why do you care?
A Debian developer and volunteer at the Kwaai Open Source AI Lab, where I lead development of the Personal Artificial Intelligence Operating System (pAI-OS).
Without a meaningful Open Source standard for AI, there is little incentive to develop such systems, nor the datasets and models that will empower future generations.
Acronyms
- DFSG: Debian Free Software Guidelines
- DPGA: Digital Public Goods Alliance
- FOSS: Free & Open Source Software
- FSF: Free Software Foundation
- OSAID: Open Source AI Definition
- OSD: Open Source Definition
- OSI: Open Source Initiative
- SFC: Software Freedom Conservancy
- WIP: Work In Progress