It’s the weekend graveyard shift, so I thought I’d ask ChatGPT what it thought of the Open Source Initiative (OSI)’s release-candidate Open Source AI Definition (OSAID), which doesn’t require the training data, i.e., the closest thing we have to source code for AI systems:
What do you think of an Open Source AI definition that doesn’t require data?
An Open Source AI Definition that doesn’t require data would be a significant departure from the traditional open-source software model. The core issue is that AI systems, especially machine learning models, are inherently tied to data in a way that traditional software isn’t. Models are essentially shaped by the data they’re trained on, meaning that without access to that data (or equivalent data), the openness of the system can be severely limited.
Here’s what I think:
- Reduction in Replicability: If the definition doesn’t mandate access to data, it risks undermining one of the key benefits of open-source AI: replicability. In traditional software, source code is sufficient for replication, but replicating a model typically also requires the data it was trained on (see the sketch after this list). Without that data, others may be unable to verify claims about performance, fairness, or safety, leaving the AI system an effective black box.
- Encourages Model-Only Releases: Dropping the data requirement would likely encourage “model-only” releases, where the inference code and weights are available but the training data is proprietary or restricted. This is already a gray area in open-source AI, and formalizing it would accelerate the trend of model releases without the transparency that data provides. Companies could release pre-trained models while keeping the dataset and training process closed, thus retaining most of the value privately.
- Weakens Open Source Principles: Open source in the AI context should, ideally, include data as part of the “source code” required to fully understand, modify, and use the system. Omitting the data requirement weakens this by allowing key components (training data, annotations, or curated datasets) to remain proprietary, which conflicts with the core open-source principles of transparency and freedom to modify.
- Compromises Ethical Auditing: Data plays a significant role in determining the behavior of an AI system, including its biases and ethical risks. Without access to the data, it becomes difficult to audit or understand the biases that may be baked into a model. Ensuring fairness, accountability, and transparency becomes nearly impossible if developers are not required to release the data they used.
- Possible Trade-Off: On the flip side, mandating data release could be problematic for systems trained on large-scale datasets like Common Crawl or other public sources, where hosting and distributing the data could be burdensome. A middle-ground solution could allow the use of third-party public datasets without requiring developers to host them, as long as the datasets remain accessible and verifiable (a sketch of that pattern closes this post). This would allow for openness without unreasonable data-hosting burdens.
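To make the replicability point concrete, here’s a minimal sketch, using entirely made-up data and a toy linear model, of why identical training code is not enough: the model you end up with is a function of the corpus, so code alone can neither reproduce nor audit it.

```python
# Minimal sketch (hypothetical data) of the replicability point: with
# identical training code, the model is shaped by its data, so releasing
# code without data is not enough to reproduce or audit the model.
import numpy as np

def train(X, y, lr=0.1, epochs=500):
    """Ordinary least squares by gradient descent -- the released 'code'."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Dataset A: labels depend on all three features (the original corpus).
y_a = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
# Dataset B: a plausible substitute whose curation dropped one signal.
y_b = X @ np.array([1.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

print("weights from A:", np.round(train(X, y_a), 2))  # ~[ 1.0, -2.0,  0.5]
print("weights from B:", np.round(train(X, y_b), 2))  # ~[ 1.0, -2.0,  0.0]
# Same "open" code, different data -> materially different model; claims
# about the first model cannot be verified from code plus a substitute.
```

Run it and the two weight vectors disagree on the third feature: any performance or fairness claim about the first model is unverifiable from the code plus a stand-in corpus.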
In summary, I think excluding data from an Open Source AI definition dilutes the very meaning of “open” in this context. AI without data is incomplete, and without full transparency, replicability, and modifiability, it risks becoming Open Source in name only, creating a loophole for releasing semi-open systems. If data cannot be included, a “lesser” or “limited” form of Open Source AI might need to be introduced with clear boundaries on how it differs from truly open-source systems.
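And for the middle ground in ChatGPT’s last bullet, one plausible shape (the URL and digest below are placeholders, not a real dataset) is a release that ships a pointer to the third-party corpus plus a cryptographic digest, so anyone can fetch the same bytes and confirm the training inputs haven’t drifted:

```python
# Sketch of the "middle ground": instead of hosting the corpus, a release
# ships a pointer plus a digest so anyone can fetch the same third-party
# data and verify it bit-for-bit. URL and digest are placeholders.
import hashlib
import urllib.request

DATASET_URL = "https://example.org/corpus-v1.tar.gz"  # hypothetical
EXPECTED_SHA256 = "0" * 64                            # placeholder digest

def fetch_and_verify(url: str, expected_sha256: str, dest: str) -> None:
    """Download a third-party dataset and refuse it if the bytes differ
    from what the model release claims it was trained on."""
    urllib.request.urlretrieve(url, dest)
    h = hashlib.sha256()
    with open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError("dataset drifted or was tampered with; "
                         "the release is no longer reproducible")

fetch_and_verify(DATASET_URL, EXPECTED_SHA256, "corpus-v1.tar.gz")
```

That keeps the hosting burden on the original data provider while still giving downstream users a way to check they are looking at the same training inputs the developers used.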