
Openness vs Completeness: Data Dependencies and Open Source AI

Four Freedoms of Free Software

For those of you new to the Free & Open Source Software (FOSS) party: FOSS consists of creative works that their authors deliberately make available under licenses that protect the “four essential freedoms” of free (as in freedom, not as in beer) software, allowing users to:

  1. Use
  2. Study
  3. Modify
  4. Share

These four freedoms are implemented in the Open Source Definition (OSD) and its 10 principles, although the OSD does not reference them directly (a fix for this has been proposed for a potential future version):

  1. Free Redistribution
  2. Source Code
  3. Derived Works
  4. Integrity of The Author’s Source Code
  5. No Discrimination Against Persons or Groups
  6. No Discrimination Against Fields of Endeavor
  7. Distribution of License
  8. License Must Not Be Specific to a Product
  9. License Must Not Restrict Other Software
  10. License Must Be Technology-Neutral

Open Source Licenses

The 4 freedoms, implemented as the 10 principles, are then made concrete (and useful) in the form of licenses. While most creative output is implicitly or explicitly “All rights reserved” via default protection under prevailing copyright laws, these licenses deliberately and voluntarily grant additional rights to others, for free. This is a potent differentiator, encouraging adoption that would otherwise not be achievable.

Some 96 Open Source licenses have been proposed and approved by community members (mostly lawyers), so a programmer just has to choose one and apply it to their work. Other developers, whether working individually or for a company, can then quickly and easily use any work covered by these licenses without having to worry whether their freedoms to use, study, modify, and share the work are protected. Developers do need to be careful about incompatibilities between Open Source licenses, and especially when combining Open Source code with non-free software (for example, some Open Source code cannot be incorporated into closed-source software). These licenses describe a spectrum of openness, but there are already way too many of them. This “license proliferation” problem “affects the whole FOSS ecosystem negatively by the burden of increasingly complex license selection, license interaction, and license compatibility considerations”.

Free & Open Source software like Linux can be downloaded, modified, and shared freely. Indeed, Linux uses a “viral” license (GPL) that ensures your changes are “infected” by it and stay free for all to use. This is a clever hack called “copyleft” that uses copyright against itself to ensure continued protection of the four freedoms. Fortunately, many if not most licenses are “permissive”, allowing you to share and even sell derivative works without releasing your improvements. After all, Open Source was created as the business-friendly middle ground between Ballmer’s “Linux is Cancer” Microsoft and Stallman’s Free Software Foundation (FSF).

Compare this to proprietary software like Microsoft Office (and Windows on which it runs for that matter), which you can only really use in limited ways; you can’t reverse engineer or modify it (outside of provided APIs), and sharing it is a sure way to get sued. And that’s a perfectly fine business model — it’s their intellectual property after all, and they can do with it what they want (with important exceptions for fair use, reverse engineering, etc. depending on the jurisdiction).

Data Dependencies

This model has worked extraordinarily well over the past quarter century, far better than its creators anticipated at the time. While the Open Source Definition (OSD) does not need to be “fixed” fundamentally, several bugfixes have been proposed for a potential future version to address unforeseen issues. The question of data has been open ever since 1999, the year after the OSD was launched and the OSI incorporated, when id Software released “Quake” as the game engine only, without the all-important data (textures etc.) that describe the game itself. This is fine if you refer to the product as the “Quake Engine”, but not if you call it “Quake”, as it is clearly not the same thing.

Reproducibility Requirement

For software to be Open Source you require reproducibility. Some have mistakenly claimed this goes further into the realm of Open Science (as if that’s a bad thing), but it is the reality and requirements of Open Source today: If the source code produces the software, and is made available under an Open Source License, then the software is Open Source.

In the simplest possible example, if the source code hello.c compiles to the hello program, which prints Hello, World! when executed, and that source code is available under the MIT license, then that program is Open Source. But if the Hello, World! text is stored in a hello.txt file which is NOT appropriately licensed, then the result is non-free. The program does not function like hello without hello.txt (i.e., the data) being present, and if you can’t use, study, modify, and share the program and its dependencies then you can’t call it Open Source under the same brand.
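The hello example can be sketched as a simple check over a program's components. This is a minimal illustration in Python, not an official test of any kind: the `Component` class, the function names, and the tiny `APPROVED_LICENSES` set are all assumptions made up for this sketch, and the real list of approved licenses is far longer.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subset only (an assumption for this sketch); the real list
# of approved Open Source licenses is much longer.
APPROVED_LICENSES = {"MIT", "Apache-2.0", "GPL-2.0-only"}


@dataclass(frozen=True)
class Component:
    """Hypothetical record for one thing the program needs in order to work."""
    name: str                # e.g. "hello.c" or "hello.txt"
    kind: str                # "code" or "data"
    license: Optional[str]   # None means no license was granted


def blocking_components(components):
    """Return the components that are not under an approved license."""
    return [c for c in components if c.license not in APPROVED_LICENSES]


def is_open_source(components):
    """The program qualifies only if every component it needs is licensed."""
    return not blocking_components(components)


# Case 1: the greeting is compiled into hello.c, which is MIT licensed.
self_contained = [Component("hello.c", "code", "MIT")]

# Case 2: the greeting lives in hello.txt, which carries no license.
with_data = [
    Component("hello.c", "code", "MIT"),
    Component("hello.txt", "data", None),
]

print(is_open_source(self_contained))                     # True
print(is_open_source(with_data))                          # False
print([c.name for c in blocking_components(with_data)])   # ['hello.txt']
```

The point of the sketch is that code and data go through the same check: a single unlicensed dependency is enough to make the whole program non-free.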

To do so would be like Microsoft calling Windows “Open Source” and providing hello.c which does something else entirely. Or like id Software releasing the Quake engine without the data (i.e., PAK files, which are like ZIP archives for textures etc.) and trying to call Quake “Open Source”. Clearly this is nonsense, but calling it something else (e.g., “Quake engine”, or print in our simple example above) would be fine and indeed Open Source.

Open Source AI

Unfortunately, the Open Source Initiative (OSI) has been asleep at the wheel for more than a decade; the last Open Source Definition (OSD) release was v1.9 in 2007, and it did not address the issue of data dependencies that effectively render Open Source software non-free.

The question of data has come to the forefront with the advent of Artificial Intelligence (AI), for which the data is the source code, per the OSD’s original author and OSI co-founder, Bruce Perens (since banned from his own organisation by its board, who have been accused of corporate capture):

You can apply the original Open Source Definition to machine learning and I feel it works better. You need to apply it to two pieces: the software (and possibly specialized hardware), and the training data. The training data is “source code” for the purposes of the Open Source definition.

The OSI’s belated attempt to address this by releasing an incompatible and conflicting Open Source AI Definition (OSAID) is as unnecessary as it is dangerous. It is unnecessary, as demonstrated by the community’s Work in Progress on a bugfixed version of the Open Source Definition, and dangerous both to Open Source AI and to existing Open Source Software, given it broadly covers any software that “infers, from the input it receives, how to generate output”, i.e., all software.

Completeness vs Openness

While Open Source has stood the test of time in the openness dimension, fully protecting the four essential freedoms for free software (i.e., source code), it is increasingly stretched on the completeness axis.

This is not a new problem, and it was arguably an oversight in the initial release given that id Software released Quake without its data dependencies the very next year. Hindsight is 20/20, though, and the exclusion of dependencies including data may have been deliberate. The question is nuanced, as we don’t require system libraries and compilers to be Open Source; you can run Open Source software like Firefox on top of a proprietary operating system like Windows, for example. Similarly, you can build and run Open Source AI systems using closed-source drivers like NVIDIA’s Compute Unified Device Architecture (CUDA).

This is where the OSI is wrong (again) in claiming they “passed the AI Conundrums” (incidentally the author of that article has since declared it impossible, likening it to the Tacoma Narrows bridge disaster):

For Open Source AI it’s a similar dance: You can’t legally give us all the data? Fine, we’ll fork it. For example, you made an AI that recognizes bone cancer in humans but the data can’t be shared. We’ll fork it! Tell us exactly how you built the system, how you trained it, share the code you used, and an anonymized sample of the data you used so we can train on our X-ray images. The system will be slightly different but it’s still Open Source AI.

You can’t fork (“take a copy of source code from one software package and start independent development on it, creating a distinct and separate piece of software”) software without the source code, and given that for AI the data is the source code, you need the data. Giving me “sufficiently detailed” metadata (“data information”) and saying a “skilled person can build a substantially equivalent system” is obvious bulls—t, “like not releasing the source code at all and claiming that it’s still free software because “a skilled person” could rewrite it”. Or like releasing a recipe that requires unicorn horns.

Even if Meta’s Llama gave a “sufficiently detailed” description of the Facebook and Instagram social graph data — that you and I can’t access for any amount of money as their secret sauce is not for sale — there is no possible way that you or I as a “skilled person” could build a “substantially equivalent system”. Clearly if I took said instructions and trained it with texts from my mother, I would not end up with anything like Llama! The claim that “the system will be slightly different but it’s still Open Source AI” is patently absurd and designed to deceive readers as to the critical role of data in AI.

Four Quadrants

Inspired by Open Source old guard Karsten Wade’s Proposal to handle Data Openness in the Open Source AI definition, which led to his Proposal to achieve consensus for the 1.0 release (controversially and prematurely closed by OSI leadership), I’ve created a chart showing the openness vs completeness dimensions of Open Source:

The four quadrants are as follows:

Open but Not Complete (Orange): Today’s Open Source Definition v1.9 sits in the bottom right corner in that it has proven itself in openness but lacks completeness, because it does not explicitly require data dependencies to be included (though one could argue it does so implicitly).

Complete but Not Open (Orange): The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence by the Linux Foundation does a much better job on completeness, creating a framework and tool for enumerating and assessing the components that need to be included for reproducibility of an AI system. It fails on openness however, by accepting “any license or unlicensed” for “raw training datasets used in the training of the model” even at its highest Class I (Open Science).

Neither Open Nor Complete (Red): The OSI’s Open Source AI Definition (OSAID) does not require any data so fails completely in both dimensions.

Both Open and Complete (Green): This is where we need to get to, either using a combination of existing tools or a new one.
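For readers who prefer code to charts, the four quadrants can be sketched as a two-input lookup. This is a minimal illustration in Python: the `quadrant` function and the `ASSESSMENTS` table are names invented for this sketch, and the true/false placements simply restate the assessments given above rather than any independent measurement.

```python
def quadrant(is_open: bool, is_complete: bool) -> str:
    """Map the two dimensions onto the quadrant names used in this article."""
    if is_open and is_complete:
        return "Both Open and Complete (Green)"
    if is_open:
        return "Open but Not Complete (Orange)"
    if is_complete:
        return "Complete but Not Open (Orange)"
    return "Neither Open Nor Complete (Red)"


# (is_open, is_complete) for each definition, as assessed in this article.
ASSESSMENTS = {
    "OSD v1.9": (True, False),
    "MOF": (False, True),
    "OSAID": (False, False),
}

for name, (is_open, is_complete) in ASSESSMENTS.items():
    print(f"{name}: {quadrant(is_open, is_complete)}")
```

Nothing in the table maps to the green quadrant, which is the gap the rest of this article is concerned with.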

An Open and Complete Solution

The easiest and safest way to get to where we need to be as an industry, with a solution that satisfies both openness and completeness, is to move into the top right quadrant from the bottom right (i.e., make an open solution like the OSD more complete) or top left (i.e., make a complete solution like MOF more open), or some combination. The Open Source Definition should continue to describe the principles that protect the freedoms, with implementation left to licenses (for openness) and frameworks (for completeness).

The problem with making a complete solution like the MOF more open is that you now have two definitions for the same software. This is particularly problematic as more and more software is written by or incorporates AI. Separating the two would require a third definition, to decide which definition to use! Furthermore, that definition would have to be incorporated into both documents, which means you would have to update the OSD anyway as otherwise it will always conflict with an Open Source AI definition, as is the case today.

Simple Solution

The simplest solution is often the best, and that’s to make completeness explicit in the Open Source Definition, without changing the meaning of Open Source itself.

Everything should be as simple as possible, but not simpler (per Albert Einstein), and one such proposal (which is intended as a proof of concept) is to add a single sentence like this to the OSD covering data dependencies:

In cases where software relies on data—including databases, models, or media—for its creation, modification, or operation, that data is considered integral to the program and is subject to the same requirements.

While this demonstrates that the OSI’s OSAID was not only unwanted but also unnecessary, it’s not to say that this is the only possible way to implement this change. For example, rather than adding it to the introduction as shown in the Work in Progress (WIP) document at the time of writing, OSD 2 (Source Code) could be renamed “Source” and updated to cover data (which is what we started with), ideally also closing the “preferred form” loophole and changing the term “programmer” to “practitioner” to reflect today’s reality. Another suggestion is to explicitly incorporate the four freedoms, which is perhaps the only point the OSAID got (almost) right.

Of course, the default position is to stick with the status quo (as suggested by its author), which is why we’ve also launched the Open Source Declaration:

We declare that Open Source is defined solely by the Open Source Definition (OSD) version 1.9.

Any amendments or new definitions shall only be recognized with clear community consensus via an open and transparent process.