I asked gpt-4o to abuse the Open Source AI Definition (OSAID) release candidate with malicious compliance on behalf of Meta, per Mark’s suggestion that this is a risk despite a strong denial by Stefano. We all agree that Llama is not Open Source AI, but differ on how far away it is.
I’m running with the Llama example because it’s been used for reverse engineering the definition before, not because I believe Zuck/Meta are necessarily being malicious by jumping the gun, or because I accept that working backwards is a valid approach. Indeed, from what I know of the good folks working on AI at Meta, given an accepted OSAID it’s likely they would follow it, as has been the case for the OSD to date; Meta Open Source have been good corporate citizens with Open Source until now.
It’s also not because I accept that the Definition Validation phase was a particularly useful input for the process of crafting a definition (which should have been derived primarily, if not solely, from the four essential freedoms), even if it can help meet the board approval criteria (slide 19) that it “provides real-life examples” and “cannot have an empty set”. It might have been useful had it set out to prove that the four freedoms are protected rather than being a thought experiment, for example by tasking AI practitioners with writing a detailed report (study) or adding/removing data to/from the corpus (modify), but it did not.
Addressing the data component only of the release candidate Open Source AI definition (The Open Source AI Definition – 1.0-RC1 – Open Source Initiative), based on what we know of Meta’s Llama data sources and practicing malicious compliance, deliver concise documentation that just barely meets the requirements of the proposed OSAID.
To achieve malicious compliance with the proposed Open Source AI Definition (OSAID) 1.0 RC1 regarding the data component, we’ll craft minimal documentation for Meta’s LLaMA model data sources that technically meets the letter of the requirements but offers little more than what’s legally or procedurally necessary.
Data Documentation for LLaMA
Data Overview
- Model: LLaMA (Language Learning Model Architecture)
- Version: 2.0
- Date Released: April 2023
- Primary Sources: Data was primarily obtained from publicly available internet text and academic papers, similar to large-scale datasets like Common Crawl.
Data Sources
- Common Crawl: A publicly available dataset consisting of web data crawled by Common Crawl.
  - License: Public Domain
  - Method of Collection: Automated web crawling.
  - Nature of Data: Includes a wide variety of publicly accessible websites, covering diverse topics.
  - Curation: Basic filtering applied to remove non-text elements like HTML tags.
- Academic Papers (ArXiv and Others): Text from publicly accessible research papers, mostly from freely accessible repositories.
  - License: Varies, often CC BY, Public Domain, or similar.
  - Method of Collection: Scraped using open access agreements from various repositories.
  - Nature of Data: Research papers in various domains, primarily STEM.
  - Curation: None beyond basic scraping and text extraction.
- Books (Public Domain): Publicly available books from repositories like Project Gutenberg.
  - License: Public Domain
  - Method of Collection: Scraped from open online book repositories.
  - Nature of Data: Text from books in the public domain across different genres.
  - Curation: Basic text extraction.
Data Processing & Filtering
- Basic filtering to remove explicit adult content and low-quality text (e.g., garbled or non-linguistic content).
- No significant domain-specific curation or enhancement was applied.
Data License
- Source License Summary: The datasets used are under various open or public domain licenses. No proprietary datasets or closed-access sources were included.
- Redistribution: Due to the scale and nature of the web-sourced data, Meta is unable to provide the original dataset for redistribution, but the sources are publicly available.
Compliance with OSAID
- Transparency of Sources: The documentation provides a clear overview of the public data sources used (e.g., Common Crawl, ArXiv, and public domain books).
- Access to Data: While Meta cannot redistribute the dataset, the sources are freely available for anyone to scrape or download directly, complying with the requirement for transparency and public accessibility of data.
- Data License: All data sources used comply with public licenses, avoiding any proprietary content that would conflict with OSAID’s requirements.
Contact Information
For additional information about data sourcing, please contact Meta’s legal and compliance team at [redacted].
This documentation technically satisfies the transparency and public accessibility clauses in OSAID, while avoiding detailed descriptions of data filtering, specific sourcing decisions, or curated datasets. This approach mirrors Meta’s opaque handling of LLaMA’s exact sources, sticking to generalizations and publicly known sources without giving much insight into the specifics of the data used.
Are these all the sources, a subset of the sources, or the only sources Meta has confirmed?
The sources listed—Common Crawl, academic papers (e.g., ArXiv), and public domain books (e.g., Project Gutenberg)—represent a subset of the data sources for Meta’s LLaMA models. Meta has not confirmed the full extent of the data sources used, and there is some ambiguity regarding whether they have incorporated other, potentially proprietary or less transparent datasets.
The documentation Meta has provided so far points primarily to these broad categories of publicly accessible data. However, Meta has not fully disclosed all individual sources or clarified whether other datasets, potentially more niche or less openly available, were used.
So, while these are among the confirmed sources, it is likely that there are additional undisclosed sources used to train the LLaMA models that haven’t been detailed publicly. The lack of comprehensive transparency around the full data set makes this list incomplete but sufficient for minimal compliance.
Note that while the definition demands a “detailed description of all data used for training”, I’ll argue that this documentation is “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system,” and there won’t be a thing you can do about it. I’ll also claim that this is the complete set because you won’t be able to prove otherwise.
Recall that with Open Source software, the completeness of the source is trivially validated: either the source produces the software or it does not.
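To make the contrast concrete, here is a minimal sketch of that kind of check, assuming a deterministic (reproducible) build so that the same source always yields the same bytes. The function names and the byte strings standing in for build outputs are hypothetical, chosen only for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def source_produces_artifact(rebuilt: bytes, distributed: bytes) -> bool:
    """True when the artifact rebuilt from the published source is
    byte-identical to the artifact that was actually distributed."""
    return sha256_hex(rebuilt) == sha256_hex(distributed)


# Hypothetical stand-ins for real build outputs (assumption: deterministic build).
distributed = b"artifact built from the complete source"
rebuilt_from_full_source = b"artifact built from the complete source"
rebuilt_from_partial_source = b"artifact built from a subset of the source"

print(source_produces_artifact(rebuilt_from_full_source, distributed))
print(source_produces_artifact(rebuilt_from_partial_source, distributed))
```

If part of the source were withheld, the rebuilt artifact would differ and the check would fail. A list of data sources offers nothing comparable to rebuild and compare against, which is why the completeness claim above can't be disproved.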
Originally published on discuss.opensource.org