I’ve downloaded and transcribed all 18 of the Open Source Initiative (OSI)‘s townhalls on the subject of their Open Source AI Definition (OSAID) sideshow so you don’t have to. I used one of OpenAI’s Whisper models (large-v2
) whose “performance is close to that of professional human transcribers” (per the paper released with it). “Whisper’s code and model weights are released under the MIT License.” OpenAI also released a model card and Colab example, wrapping it all up in a nice blog post Introducing Whisper, checking several boxes on the proposed checklist.
Despite having been granted access to the Code and Model under the OSI Approved MIT license, the Data (and related code used to collate and massage it into a useful form for training) was conspicuous in its absence beyond a list of links to third-party data sources, and a single file containing subtitles for a few dozen segments of YouTube videos of the Late Show with Stephen Colbert (which seems like more of a troll given it’s such a tiny fraction of the claimed-but-impossible-to-verify 800,000 hours of training data!).
Predictably, users are unable to fully and freely exercise two of the four essential freedoms of Open Source (to Use, Study, Modify, and Share it), which have happily been determined from the outset to be the same for AI; they can use and share it — already a boon for app developers granted — but attempts to modify and study it have resulted in no end of confusion and dead ends.
Open Source AI Lite
Whisper is exactly the kind of Open Source-ish AI the OSI seeks to certify with the upcoming release candidate due to be launched next month. While it partially protects your freedom to use and share it — provided you’re willing to do either without knowing or being able to verify its provenance — it does virtually nothing to protect your freedom to study and modify it (beyond very limited fine-tuning). It’s more akin to the Lesser (now Library) GPL (LGPL), also created for pragmatic adoption reasons similar to those being touted today (but at least then the software itself was Open Source with the exception to link to proprietary code rather than vice versa).
If excluding training data really is deemed non-negotiable, then the other simple solution to the problem is branding acknowledging this limitation, similar to Karsten Wade’s “D+/D-” proposal:
- Open Source AI Lite
- Limited Open Source AI
- Library Open Source AI
- Lesser Open Source AI
- Qualified Open Source AI
- etc.
Co-Design? Consensus? Democracy? Dictatorship.
The process was declared to be “co-design” from the first minute of the first meeting, apparently decided in the closed mailing list phase last year when transparency was not a priority. It was “to come out of consensus“, but then they “asked the group to vote” only to admit yesterday “those results were never meant neither to be scientific, nor representative, nor democratic or anything like that” — no surprise when counting the votes without granting vote nullifying superpowers to certain candidates gave the wrong answer (i.e., that training data should be required)!
Stop gaslighting the community about “co-design” being an inclusive, transparent, and auditable process when the single most important decision — whether or not to require training data — was apparently pre-determined. Don’t endorse it on that basis, and don’t rely on it on that basis. Don’t rely on my opinions either, rather refer to the repo and draw your own conclusions. While I entered this process with open eyes, at this point I start to accept that it is the result of corporate capture and more the voice of the company sponsors (the likes of Cisco and Google) than the community.
Training Data Indecision
One of the first questions in the first meeting asked about training data (unfortunately dissenting voices have a habit of not getting recorded) and in the response it was already insinuated that it may be “sufficient to have a detailed description of what went into it”. We have not moved one millimeter from that position all year despite clear and consistent protest against it.
I asked GPT-4o (gpt.py
) to review the Whisper transcripts for the impact of inclusion or exclusion of training data on the four essential freedoms, and to specifically surface evidence of predetermined positions, biases, etc. You can see the constant drumbeat of dissent from that first town hall through yesterday in its analysis below.
High Stakes, Low Expectations
If there is consensus on anything it’s that the release candidate does not fully protect all four essential freedoms (to Use, Study, Modify, and Share), and arguably only partially protects two of them (to Use and Share). Apparently, no practitioners have even attempted to demonstrate that it does (per Model Weights is not enough for Open Source AI), and won’t until after it’s released is impossible to retract.
This means projects like the Personal Artificial Intelligence Operating System (pAI-OS) — all this being part of my work for Kwaai (a volunteer based open source AI research and development lab) when I’m not studying for my masters in CS/ML — will not have the same potent tool in the AI arena that other software projects have enjoyed for decades courtesy the Open Source Definition (OSD), and will instead have to deal with competitors also quasi-legitimately claiming to be Open Source AI.
Worst of all, the proposed OSAID won’t do anything to help projects like Whisper share their work — they already are — nor encourage them to raise the bar by sharing data. At yesterday’s town hall I was asked why the first application to be built on pAI-OS decided to use Llama, and it’s the same reason I used GPT for this analysis: powerful Open Source Large Language Model (LLM) candidates like that of Japan’s National Institute of Informatics (NII) are still a work in progress, despite claims to the contrary by Meta with their “Open Source AI” Llama (which although the OSI agree is not Open Source AI, is closer under the current definition than some are comfortable with). Hugging Face just announced there are more than 1,000,000 free (as in beer, not as in freedom) models on their platform too, but there will be less incentive to create more of them, and less opportunities to “stand on the shoulders of giants” by studying, modifying, and sharing existing models to do so.
The Path to a Meaningful Open Source AI Definition
I’ve been asked to contribute less, and less often, and I plan to accept their invitation, but we will continue holding their feet in the fire in the hope that common sense prevails, perhaps at the upcoming board vote (I’ve reached out to them individually and trust they are seeing these updates).
While OSI co-founder Bruce Perens is still around to talk about it (What comes after open source? Bruce Perens is working on it), another co-founder, the late, great Ian Murdock (the “ian” in Debian of the Debian Free Software Guidelines on which the original definition of Open Source was based) is not. Perhaps this fits in with Bruce’s “Post-Open” pitch, but I wonder what he would think about it? After all, the OSI was founded to further the interests of business — and they’re certainly doing that here — but its roots are in Free Software and the Four Freedoms, which are essential for a reason rather than merely “appreciated” like the training data in earlier drafts.
In other news, I gave a talk on Lies, Damned Lies, and Statistics: On the path to a meaningful Open Source AI Definition yesterday, and Kwaai have agreed to take on the topic in their policy track. While the Open Source Initiative may have been the ideal home for this conversation, it’s not the only one — may a thousand Open Source AI Definitions bloom!
Town Hall 01 – January 12, 2024
Overview:
The first OSI townhall focused on establishing a framework for defining Open Source AI, emphasizing the need to include various stakeholders in the process. The meeting outlined the importance of aligning AI with open source principles, using the four essential freedoms as a guide. Discussions touched on the challenges of incorporating training data within these freedoms, highlighting the complexities and potential biases involved. The atmosphere was collaborative, but there were underlying tensions regarding data transparency and legal considerations.
Key Takeaways
- The need to define Open Source AI, not just machine learning.
- Incorporating four freedoms in the AI definition.
- Challenges in including training data within these freedoms.
- Stakeholder inclusion is crucial for the definition process.
- Legal frameworks and documents are key to granting freedoms.
Data Mentions
There were significant discussions on whether and how to include training data in the Open Source AI definition. Questions were raised about the level of access required to the training data, whether the full dataset or just a description would suffice. This was identified as a critical and delicate issue.
Quotes
- “‘If I like an AI system, I must be free to share it with other people.’”
- “‘The question is what kind of access? What level of access? The full on training data set? Or is it sufficient to have a detailed description of what went into it?’”
- “‘We cannot have an open source AI that will never be transparent, will never be explainable or fair.’”
Town Hall 02 – January 26, 2024
Overview:
The second OSI townhall continued the conversation on defining Open Source AI, with a focus on identifying necessary components for using, studying, modifying, and sharing AI systems. The meeting concentrated on analyzing the components of Llama 2, discussing what is necessary to exercise the four essential freedoms. The atmosphere was cooperative but revealed differing opinions on the necessity of including training data for AI systems. The group worked on moving forward with the checklist for legal documents and freedoms, despite differing opinions about data inclusion.
Key Takeaways
- Continued work on defining necessary components for Open Source AI.
- Discussion on the necessity of including training data for AI systems.
- Different interpretations of ‘model parameters’ and their necessity.
- Importance of creating a checklist for legal documents and freedoms.
- Need to engage with policymakers and end users to avoid societal concerns.
Data Mentions
The townhall included discussions on whether training data is necessary for using AI systems like Llama 2. Opinions varied, with some participants considering it ‘nice to have’ for validation, while others deemed it unnecessary for running the model. There was also debate about documenting data and its role in transparency and validation.
Quotes
- “‘We need to have a shared understanding among multiple experts.’”
- “‘Most of the answers are in the not necessary.’”
- “‘We need to make sure that whatever definition comes out of this process is not seen as a threat to society by regulators.’”
- “‘The open data world has a different culture than the open source movement.’”
- “‘We need to work with them to understand exactly what they think of their space once it becomes actionable just like software.’”
Town Hall 03 – February 09, 2024
Overview:
The third OSI townhall advanced the discussion on defining Open Source AI, with a particular focus on the inclusion of training data and its implications for the four essential freedoms. The meeting involved detailed analyses of AI systems such as Llama2 and Pythia, with participants debating the necessity and impact of including training data in the AI definition. The tone was constructive, but disagreements on data transparency and access persisted. The atmosphere highlighted the complexities of aligning AI with open source principles, especially concerning data access and usage.
Key Takeaways
- Deep dive into the necessity of training data in Open Source AI.
- Ongoing debates on data transparency and access.
- Importance of the four freedoms in guiding AI definition.
- Need for consensus on data inclusion or exclusion.
- Scheduled completion of AI system component analysis by May.
Data Mentions
Training data was a significant point of discussion, with debates on its necessity for using, studying, modifying, and sharing AI systems like Llama2 and Pythia. There was a consensus that model parameters are essential, but training data’s role remained contentious.
Quotes
- “‘We need to keep moving. We need to finish this process.’”
- “‘If we face an obstacle, we move around it and return later.’”
- “‘We need a definition of AI system because the open source AI needs to refer to a system.’”
- “‘The conversation around data is very crucial.’”
- “‘We need to come close to a conclusion very soon.’”
Town Hall 04 – February 23, 2024
Overview:
The fourth OSI townhall focused on refining the definition of Open Source AI with a narrowed scope on machine learning. The meeting highlighted ongoing debates about including training data, with opinions split on its necessity. Discussions centered on the need for clarity in defining ‘AI systems’ versus components, emphasizing the importance of ensuring the four essential freedoms. The atmosphere was collaborative, with a focus on reaching consensus and addressing potential biases and legal challenges.
Key Takeaways
- Narrowing focus to machine learning within Open Source AI.
- Continued debate on the inclusion of training data and its implications.
- Clarification needed on defining ‘AI systems’ versus components.
- Emphasis on the four essential freedoms guiding the definition.
- Plans for separate discussions on data governance and availability.
Data Mentions
Training data remained a contentious issue, with discussions on whether full datasets or descriptions are sufficient. The necessity of data for exercising freedoms was debated, with potential separate discussions on data governance proposed.
Quotes
- “‘The freedom to study all the descriptions inside the Kino Manifesto.’”
- “‘We’re really focusing on machine learning right now.’”
- “‘The conversations around data are crucial.’”
- “‘A randomized sample of the data inside the dataset could be sufficient.’”
- “‘We need a definition for a system because we need to have some way of anchoring that conversation.’”
Town Hall 05 – March 08, 2024
Overview:
The fifth OSI townhall continued to focus on the definition of Open Source AI, emphasizing the inclusion of training data and its implications for the four essential freedoms. The meeting involved discussions on required components for AI systems, with a significant focus on the necessity and transparency of training data. Participants explored the legal frameworks and transparency requirements needed to align AI systems with open source principles. The atmosphere was collaborative, with a focus on reaching consensus and addressing potential biases and legal challenges.
Key Takeaways
- Emphasis on training data’s role in Open Source AI.
- Continued debate on data transparency and access.
- Focus on aligning AI systems with the four essential freedoms.
- Discussions on legal frameworks and transparency requirements.
- Planned evaluation of AI systems’ compliance with required components.
Data Mentions
Training data was a major focus, with discussions on its necessity for transparency and validation. The need for detailed documentation and transparency around data was emphasized, with an ongoing debate about the level of access required.
Quotes
- “‘Sufficiently detailed information on how the system was trained is required.’”
- “‘The code used for pre-processing data must be available.’”
- “‘Training data access is not strictly necessary for rebuilding a model.’”
- “‘The necessity of datasets is a continuous debate.’”
- “‘Transparency requirements are likely to be mandated by law.’”
Town Hall 06 – March 22, 2024
Overview:
The sixth OSI townhall centered on finalizing the Open Source AI definition with an emphasis on the four essential freedoms. The meeting continued to grapple with the inclusion of training data, focusing on transparency requirements rather than full data access. The atmosphere was determined, with an intent to move past longstanding obstacles regarding data. Discussions also touched on the legal frameworks needed to support the definition, with a focus on documentation and compliance. The tone was constructive, though underlying disagreements about data inclusion remained.
Key Takeaways
- Finalization of Open Source AI definition, focusing on transparency.
- Continued debate on training data inclusion, focusing on transparency rather than access.
- Emphasis on the legal frameworks and documentation needed for compliance.
- Intention to move forward despite unresolved data issues.
- Recognition of data as a controversial and complex element.
Data Mentions
The townhall emphasized transparency in training data requirements without necessitating full data access. There was recognition of the longstanding debate around this issue, with a decision to proceed without full access to datasets and revisit if necessary.
Quotes
- “‘We must keep going and reach a conclusion.’”
- “‘Access to the preferred form to make modifications is essential.’”
- “‘Data is the most controversial part of open source AI.’”
- “‘Let’s move on and finish this investigation phase.’”
- “‘Transparency requirements are informed by the EU AI Act.’”
Town Hall 07 – April 05, 2024
Overview:
The seventh OSI townhall focused on refining the Open Source AI definition, particularly emphasizing transparency requirements for training data and the overall documentation process. The meeting reiterated the focus on the four essential freedoms and the challenges of balancing transparency with accessibility. The discussion included a detailed examination of the process for establishing these requirements and the importance of community involvement. The atmosphere was constructive, with a focus on finalizing version 0.0.7 by the next week, despite ongoing debates on data transparency.
Key Takeaways
- Refinement of Open Source AI definition with transparency focus.
- Emphasis on documentation requirements for data.
- Continued focus on the four essential freedoms.
- Planning for release of definition version 0.0.7.
- Ongoing challenges in balancing transparency with accessibility.
Data Mentions
The townhall highlighted transparency requirements for training data, specifying that datasets themselves are not required but documentation is essential. This marks a shift from previous discussions that considered full data access.
Quotes
- “‘Data sets themselves are not required, but we have transparency requirements.’”
- “‘Transparency requirements only, with documentation requirements.’”
- “‘Documentation for each of these required components.’”
- “‘The conversation around data is very crucial.’”
- “‘We need that kind of intellectual energy to make this definition a good one.’”
Town Hall 08 – April 19, 2024
Overview:
OSI Townhall 08 focused on the Open Source AI Definition, emphasizing transparency and inclusion of training data as a proxy rather than requiring full datasets. Discussions highlighted ongoing challenges in balancing transparency with practicality and legal considerations. The townhall also included debates on the role of model parameters and their legal status, aiming to finalize the definition for release. The atmosphere was forward-looking, with a focus on overcoming obstacles and ensuring inclusivity and global representation.
Key Takeaways
- Emphasis on transparency and proxy for training data instead of full datasets.
- Continued challenges in balancing transparency, practicality, and legal considerations.
- Debate on the legal status of model parameters and necessary terms.
- Efforts to ensure inclusivity and global representation in the definition process.
- Focus on finalizing the definition for an upcoming release.
Data Mentions
The townhall emphasized transparency through data proxies, like tools for dataset recreation, rather than requiring full datasets. This decision aims to avoid legal and practical hurdles while maintaining openness.
Quotes
- “‘We will automatically exclude from the pool of possible AI systems.’”
- “‘It’s much easier and probably also more relevant to have access instead as a proxy.’”
- “‘We started to work on a frequently asked question document.’”
- “‘We are making history right now.’”
- “‘We need to provide a stable view of what open source AI means.’”
Town Hall 09 – May 03, 2024
Overview:
OSI Townhall 09 focused on the Open Source AI Definition, particularly on the inclusion or exclusion of training data from required components. The meeting discussed the legal and practical implications of data transparency and the possibility of using data information as a substitute for full datasets. The atmosphere was collaborative, with a focus on refining the definition and ensuring stakeholder inclusion. There were ongoing debates on the necessity of training data and the legal frameworks for AI components, highlighting the complexities involved in maintaining the four essential freedoms.
Key Takeaways
- Discussion on including or excluding training data in Open Source AI.
- Use of data information as a substitute for full datasets.
- Focus on legal frameworks and transparency requirements.
- Continued debate on model parameters and their legal status.
- Efforts to ensure a representative and inclusive process.
Data Mentions
The townhall discussed whether training data should be included in the required components of Open Source AI. The use of data information as a substitute for full datasets was emphasized, highlighting the legal and practical challenges of full data inclusion.
Quotes
- “‘The source data is not required because of legal reasons.’”
- “‘Transparency is the feature of open source.’”
- “‘We want to know what’s going in there.’”
- “‘We need to help regulators understand this space.’”
- “‘The definition, the draft 08, 008 is feature complete.’”
Town Hall 10 – May 31, 2024
Overview:
The tenth OSI townhall focused on addressing challenges in the Open Source AI Definition, with a particular emphasis on data inclusion and transparency. The meeting discussed the difficulties faced by volunteer reviewers in accessing required documentation and the necessity of engaging system creators for accurate validation. Key concerns included the transparency and availability of training data, with suggestions to include data cards and adjust documentation requirements. The atmosphere was collaborative, but underlying tensions about data transparency and process complexities persisted.
Key Takeaways
- Challenges in accessing required documentation for AI validation.
- Engagement with system creators is necessary for accurate validation.
- Debates on including data cards and data processing code in requirements.
- Emphasis on transparency and documentation for training data.
- Collaborative efforts with organizations like the Linux Foundation.
Data Mentions
The townhall highlighted the difficulties in accessing training data and documentation, with discussions on including data cards and removing data processing code. The need for transparency in data requirements was emphasized, though full data access was not mandated.
Quotes
- “‘It was hard for volunteer reviewers to find required documents to do the review.’”
- “‘The AWS Open Source team posted a range of concerns with v 0.0.8, foremost on data.’”
- “‘The Linux Foundation team recommended adding Data card to the required components.’”
- “‘No changes will be made without a very clear, structured public process.’”
- “‘The conversation around data is very crucial.’”
Town Hall 11 – June 14, 2024
Overview:
The eleventh OSI townhall focused on ongoing debates regarding the inclusion of training data in the Open Source AI definition. The meeting emphasized the use of data information as a substitute for full datasets, highlighting legal uncertainties and the complexities of implementing the four essential freedoms. Discussions included alternative proposals like synthetic data and the challenges of federated learning. The atmosphere was collaborative, yet underscored by the contentious nature of data inclusion. The townhall aimed to align stakeholders and refine the definition for a final release.
Key Takeaways
- Use of data information instead of full datasets continues to be debated.
- Legal uncertainties around datasets affect the definition process.
- Alternative proposals like synthetic data are considered experimental.
- Challenges in federated learning without creating datasets.
- Focus on aligning stakeholders for a final definition release.
Data Mentions
Discussions centered on using data information as a proxy for training data, considering legal challenges and the practicality of full data inclusion. Synthetic data and federated learning were discussed as alternatives, but both have limitations.
Quotes
- “‘Requiring only data information instead of training datasets is the greatest point of debate now.’”
- “‘The Pile taken down after an alleged copyright infringement in the US.’”
- “‘We need to find ways to provide solid principles that maybe will not change but allow for some parts to be adapted.’”
- “‘The intention of Data information is to allow developers to recreate a substantially equivalent system using the same or similar data.’”
- “‘Alternative proposals like synthetic data are experimental and unproven.’”
Town Hall 12 – June 28, 2024
Overview:
The twelfth OSI townhall focused on refining the Open Source AI Definition, particularly on clarifying the four essential freedoms and their recipients. Discussions centered on the inclusion of training data, with emphasis on transparency and legal compliance, and the use of proxies such as data information instead of full datasets. The meeting also addressed the separation of the checklist document to streamline the definition process. The atmosphere was collaborative, with ongoing debates about data inclusion and transparency, highlighting the need for alignment with legal frameworks and stakeholder input.
Key Takeaways
- Refinement of the four essential freedoms and their recipients.
- Focus on transparency and legal compliance in data inclusion.
- Use of data information as a substitute for full datasets continues.
- Separation of checklist document to streamline the definition process.
- Ongoing collaboration with stakeholders and legal experts.
Data Mentions
The townhall continued to emphasize the use of data information as a proxy for training data instead of requiring full datasets. There was discussion about aligning the checklist with the Model Openness Framework and the need for creators to provide documentation.
Quotes
- “‘The freedom to study, use, modification, and sharing are clarified.’”
- “‘Components must be free from encumbrances that prevent exercising freedoms.’”
- “‘We are moving the checklist to a separate document.’”
- “‘Data information will remain the same because the topic is still being discussed.’”
- “‘Explainability becomes easier when something is really transparent and clear.’”
Town Hall 13 – July 26, 2024
Overview:
The thirteenth OSI townhall focused on the upcoming release of version 0.0.9 of the Open Source AI Definition. The meeting reviewed progress on defining the four essential freedoms and the separation of the checklist document. Emphasis was placed on validation of AI systems against the definition, with discussions around the inclusion of training data remaining prominent. The atmosphere was anticipatory, with participants eager for the new version’s release. The collaborative tone persisted, though underlying tensions about data transparency and legal challenges were evident.
Key Takeaways
- Anticipation for version 0.0.9 of Open Source AI Definition.
- Continued focus on defining four essential freedoms.
- Validation of AI systems against the definition is ongoing.
- Emphasis on separating checklist document for clarity.
- Ongoing discussions on training data inclusion and transparency.
Data Mentions
The townhall continued to emphasize the use of data information as a proxy for training data, maintaining the position that full datasets are not required. The focus was on transparency and clarity, though tensions around legal challenges and data access persisted.
Quotes
- “‘The freedom to use, study, modify, and share the system.’”
- “‘Components and systems must be free from encumbrances.’”
- “‘The checklist will be a separate document and process.’”
- “‘Transparency requirements are crucial.’”
- “‘We’re really focusing on making this definition stable.’”
Town Hall 14 – August 23, 2024
Overview:
The fourteenth OSI townhall focused on unveiling version 0.0.9 of the Open Source AI Definition. The meeting emphasized the necessity of including weights, code, and detailed data information for AI systems, while acknowledging the challenges of providing full access to training data due to legal, privacy, and cultural constraints. The atmosphere was collaborative, with a focus on refining the definition to be globally applicable. Ongoing debates about data inclusion persisted, highlighting the compromise between transparency and practicality. The meeting aimed to align stakeholders for the final release.
Key Takeaways
- Unveiling of OSAID version 0.0.9 with emphasis on weights, code, and data information.
- Acknowledgment of challenges in providing full training data access.
- Focus on creating a globally applicable definition.
- Ongoing debates about data inclusion and transparency.
- Compromise between transparency and practicality.
Data Mentions
The townhall emphasized the inclusion of model weights, code, and detailed information about data rather than full datasets. Legal, privacy, and cultural challenges were acknowledged as obstacles to full data inclusion, with a focus on transparency as a compromise.
Quotes
- “‘Training data is valuable to study AI systems, but not part of the preferred form for making modifications.’”
- “‘Data can be hard to share. Laws limit resharing to protect interests.’”
- “‘Open training data provides the best way to enable users to study the system.’”
- “‘Public training data also enables users to study the work.’”
- “‘If training data had been required, how many models would have met that?’”
Town Hall 15 – September 06, 2024
Overview:
The fifteenth OSI townhall focused on the Open Source AI Definition, particularly on resolving the ongoing debate about the inclusion of training data. The meeting emphasized the importance of transparency and detailed documentation about datasets, while acknowledging legal and practical challenges that prevent full data access. Discussions included legal implications, privacy considerations, and the cultural significance of data. The atmosphere was collaborative, with a strong focus on refining the definition to meet the October deadline. The meeting aimed to gather endorsements for the upcoming release candidate, underscoring the importance of stakeholder alignment.
Key Takeaways
- Focus on transparency and detailed data documentation.
- Legal, privacy, and cultural challenges prevent full data access.
- Emphasis on refining the definition for October deadline.
- Gathering endorsements for the release candidate.
- Stakeholder alignment remains crucial for success.
Data Mentions
The townhall reiterated the exclusion of full training datasets, instead emphasizing the need for transparency and detailed data information. Legal and privacy concerns were highlighted as key reasons for not including complete datasets.
Quotes
- “‘Training data is valuable to study AI systems: to understand the biases that have been learned.’”
- “‘Data can be hard to share. Laws that permit training on data often limit resharing.’”
- “‘Open training data provides the best way to enable users to study the system.’”
- “‘The definition needs to have approval by end users, developers, deployers and subjects of AI, globally.’”
- “‘We need to provide a stable view of what open source AI means.’”
Town Hall 16 – September 13, 2024
Overview:
The sixteenth OSI townhall focused on refining the Open Source AI Definition, particularly on the inclusion of model weights, code, and detailed data information rather than full datasets. Discussions emphasized the legal, privacy, and cultural challenges associated with sharing training data. The meeting highlighted ongoing debates about the necessity of training data for exercising the four essential freedoms, with a strong focus on transparency and accessibility. The atmosphere was collaborative yet revealed underlying tensions about data inclusion, as participants worked towards finalizing the definition for the October release. The meeting also called for endorsements from stakeholders globally to support the upcoming release candidate.
Key Takeaways
- Refinement of Open Source AI Definition focusing on weights, code, and data information.
- Acknowledgment of legal, privacy, and cultural challenges in sharing training data.
- Emphasis on transparency and accessibility rather than full data access.
- Ongoing debates about the necessity of training data for the four essential freedoms.
- Call for global endorsements for the upcoming release candidate.
Data Mentions
The townhall reiterated the exclusion of full training datasets, emphasizing the necessity of transparency and detailed information about data. Legal, privacy, and cultural challenges were highlighted as key reasons for not including complete datasets.
Quotes
- “‘The preferred form of making modifications must include weights, code, and detailed data information.’”
- “‘Training data is valuable to study AI systems but not part of the preferred form for making modifications.’”
- “‘Legal and cultural practices pose challenges in sharing complete datasets.’”
- “‘Transparency and accessibility are emphasized over full data access.’”
- “‘We are seeking endorsements from end users, developers, and stakeholders globally.’”
Town Hall 17 – September 20, 2024
Overview:
The seventeenth OSI townhall focused on the nearing release of the Open Source AI Definition (OSAID) and the ongoing challenges related to the inclusion or exclusion of training data. The meeting highlighted the necessity of transparency, particularly in providing data information rather than full datasets due to legal, privacy, and cultural constraints. The atmosphere was collaborative, with a strong focus on reaching consensus and finalizing the definition. There were ongoing discussions about aligning the definition with global stakeholder interests, and a push for endorsements in preparation for the official launch.
Key Takeaways
- Focus on finalizing the Open Source AI Definition for the upcoming launch.
- Emphasis on transparency through data information instead of full datasets.
- Acknowledgment of legal, privacy, and cultural challenges in data sharing.
- Collaborative efforts to align the definition with global stakeholder interests.
- Push for endorsements from individuals and organizations.
Data Mentions
The townhall reiterated the use of data information as a substitute for full datasets, emphasizing transparency and acknowledging the legal, privacy, and cultural challenges that prevent full data access. There were no major shifts in positions from previous townhalls regarding data inclusion.
Quotes
- “‘The preferred form must include weights, code, and detailed data information.’”
- “‘Endorsement means your name and organizational affiliation will be appended to a press release.’”
- “‘Training data is valuable to study AI systems but not part of the preferred form for making modifications.’”
- “‘Legal, privacy, and cultural challenges limit full access to training data.’”
- “‘We are seeking endorsements from end users, developers, and stakeholders globally.’”
Town Hall 18 – September 27, 2024
Overview:
OSI Townhall 18 focused on the final stages of the Open Source AI Definition (OSAID) process, emphasizing the inclusion of weights, code, and detailed data information instead of full datasets. The meeting was marked by debates on the necessity of including training data and its implications for the four essential freedoms. Discussions highlighted the need to align with regulatory requirements and the complexities of balancing transparency with legal, privacy, and cultural constraints. The atmosphere was collaborative but underscored by ongoing tensions about data inclusion. The meeting aimed to finalize the definition and gather endorsements for its launch, while addressing concerns about potential open-washing.
Key Takeaways
- Focus on finalizing the OSAID with weights, code, and data information.
- Debates on including training data and its impact on the four essential freedoms.
- Alignment with regulatory requirements emphasized.
- Challenges in balancing transparency with legal, privacy, and cultural constraints.
- Efforts to gather endorsements and address open-washing concerns.
Data Mentions
The townhall continued to emphasize the use of data information instead of full datasets, with discussions on transparency and legal, privacy, and cultural challenges. The necessity of training data for the four freedoms was debated, with concerns about potential open-washing if data is not included.
Quotes
- “‘If we assume that the definition requires full release of datasets, few existing systems qualify.’”
- “‘Legal and cultural practices pose challenges in sharing complete datasets.’”
- “‘Transparency and accessibility are emphasized over full data access.’”
- “‘The preferred form of making modifications must include weights, code, and detailed data information.’”
- “‘We need to provide a stable view of what open source AI means.’”