We must be nearing the end of this sorry saga of the Open Source Initiative (OSI)’s Open Source AI Definition (OSAID), because we’ve reached the “Oh, won’t somebody please think of the children” stage at The Linux Foundation Member Summit, which is essentially a public admission that you’ve lost the debate. Specifically, the logical fallacy at work is the appeal to emotion (“You attempted to manipulate an emotional response in place of a valid or compelling argument.”).
Curious as to what a New York Times article entitled “Automated tool gets it wrong” (actually A Dad Took Photos of His Naked Toddler for the Doctor. Google Flagged Him as a Criminal.) has to do with Open Source AI — having already had to dig deep to work out what #TravelingWhileTrans had to do with it — I went back through the archives and found this one (OSI at the United Nations OSPOs for Good) in which the OSI’s Executive Director had this to say:
One striking example is a popular dataset of images used for training many image generation AI tools, which contained child sexual abuse images for many years. A research paper highlighted this huge problem, but no one filed a bug report, and there was no easy way for the maintainers of this dataset to notice and remove those images.
They then go on to make the case against “radical openness”, which is to say, the level of openness required of Open Source by the Open Source Definition for the past quarter century (but sure, go ahead and paint the majority of your members as extremists):
While Open Source software and Open Source AI are still evolving, the necessary ingredients—data, code, and other components—are there. However, the data piece still needs to be debated and finalized. Pushing for radical openness with data has clear drawbacks and issues. It’s going to be a balance of intentions, aiming for the best outcome for the general public and the whole ecosystem.
Given that the OSI’s position is that vendors — coincidentally including the sponsors that pay their salaries — should be able to conceal the data they’ve used to train their “fauxpensource” models, what exactly are they trying to say here?
Assuming good faith, I believe their point is that sharing data is hard to get right, but what’s the alternative? Black boxes that users have no way to study or modify, and thus no way to assess, let alone address, ethical, security (Turning Open Source AI Insecurity up to 11 with OSI’s RC1), or worse issues?
Were this “popular dataset of images used for training many image generation AI tools” not publicly accessible, researchers would never have been able to discover the problem in the first place. Worse, models trained on it could reproduce that material in their outputs, whether inadvertently or deliberately (Extracting Training Data from Diffusion Models), turning hosts of such models into purveyors of explicit, illegal, copyrighted, sensitive, and other contraband information:
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training.
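To make the mechanics concrete, here is a minimal sketch of the generate-and-filter idea, assuming the `diffusers` and `imagehash` libraries and using a crude perceptual-hash comparison in place of the paper’s actual membership-inference pipeline; the model ID, prompt, and threshold below are illustrative placeholders, not the authors’ settings:

```python
# Hypothetical sketch of the generate-and-filter idea: sample images from a public
# diffusion model and flag generations that are near-duplicates of known training
# images. The paper's real pipeline uses membership inference over clusters of
# generations; perceptual hashing here is a deliberately crude stand-in.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
import imagehash

MODEL_ID = "runwayml/stable-diffusion-v1-5"  # any publicly released diffusion model
PROMPT = "a portrait photograph"             # prompts drawn from training captions work best
HAMMING_THRESHOLD = 4                        # max hash distance to call a near-duplicate


def load_training_hashes(paths):
    """Perceptual hashes of the training images we want to test for memorisation."""
    return {path: imagehash.phash(Image.open(path)) for path in paths}


def probe(training_hashes, n_samples=100):
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    hits = []
    for i in range(n_samples):
        image = pipe(PROMPT).images[0]          # generate
        h = imagehash.phash(image)
        for path, known in training_hashes.items():
            if h - known <= HAMMING_THRESHOLD:  # filter: small distance suggests regurgitation
                hits.append((i, path))
    return hits
```

The point is not that this particular script reproduces the paper’s attack; it is that anyone with access to both the model and the data can run this kind of check, and without the data there is nothing to compare against.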
Are we to believe that vendors looking to minimise costs and maximise revenues would do a better job than the public equipped with the data and tools like PhotoDNA? Are we really saying that the better system is one where the worst content imaginable can become part of an enterprise’s supply chain (like conflict diamonds or worse) and Software Bill of Materials (SBOM) without their knowledge or consent?
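As a sketch of what “the public equipped with the data” looks like in practice, the following assumes an open dataset laid out as a directory of image files and a list of hashes of known-bad content obtained from an authoritative source; PhotoDNA itself is a proprietary Microsoft service with its own API, so a generic perceptual hash stands in here purely for illustration:

```python
# Hypothetical sketch: with an open dataset (here just a directory of image files)
# and hashes of known-bad content from an authoritative source, anyone can scan it
# and file the bug report. PhotoDNA itself is a proprietary Microsoft service with
# its own API; a generic perceptual hash stands in here purely for illustration.
from pathlib import Path
from PIL import Image
import imagehash

# Assumption: hashes of known-bad images, supplied by an authoritative clearinghouse.
KNOWN_BAD_HASHES = set()  # e.g. {imagehash.hex_to_hash("d879f8f8f0f0e0c0"), ...}

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def scan_dataset(dataset_dir, threshold=2):
    """Flag files whose perceptual hash is within `threshold` of a known-bad hash."""
    flagged = []
    for path in Path(dataset_dir).rglob("*"):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            h = imagehash.phash(Image.open(path))
        except OSError:
            continue  # unreadable or corrupt file; skip
        if any(h - bad <= threshold for bad in KNOWN_BAD_HASHES):
            flagged.append(path)
    return flagged


# Usage: scan_dataset("/data/public-image-dataset") returns the paths to report.
```

A scan along these lines is what researchers were able to run precisely because the dataset was public; it is what no outsider can run against a model whose training data is withheld.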
Indeed, is this not making the opposite argument: that users, who are ultimately responsible for the output of these models (given the limitations of liability in their terms of use), must absolutely be given what’s required to study and modify them, namely the data? What are the implications for Open Weight models and their users? What about an Open Source AI definition (like OSAID) that does not require the data?