During discussions on the definition of Open Source AI, the question was posed as to whether testing data should be made available even just to use an AI system (rather than study, modify, or share it, those being the four freedoms Open Source protects). Here are my thoughts on the subject:
In certain contexts, providing testing datasets would be required, OR would be required IF training data is not (i.e., testing data can occasionally be a substitute for training data). For instance:
- Regulated Industries: In fields like healthcare, medical devices, or factory automation, it is necessary to demonstrate compliance with safety requirements and regulations. Testing datasets may provide the necessary transparency to show that the system behaves according to the manufacturer’s claims and passes necessary audits (i.e., produces compliant outputs given certain inputs).
- Data Return Systems: If a system’s output includes returning verbatim data, like a virtual psychologist quoting the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) or a virtual doctor referencing a Prescriber’s Digital Reference (PDR) entry, the testing datasets may be required for verification of proper function and compliance. As another recent example, retrieving verbatim quotes from transcripts with an LLM is risky given hallucination (which is one of several reasons why I didn’t request attribution).
- Data-Dependent Algorithms: Algorithms may need to process some or all of the “source” data at runtime rather than “learning” it in advance, such as k-means clustering in the AI context (per @spotaws’ point about scripting languages in the software context), meaning training datasets are integral to their operation and validation, but testing datasets are often a random subset of the same source anyway so if you can release one, you can release both, and:
- Cross-Validation: Where training and testing datasets are selected from a superset “source” several times, (e.g., k-fold cross validation) then the entire dataset of both testing and training data is required.
- Security Testing: Relevant test cases may be necessary (but not necessarily sufficient) to ensuring systems perform to spec. For example, in traditional Open Source, I can refer to the source code to verify that
wordcount.exe
does not also steal my crypto. For ML models, testing data can sometimes (but not always and/or not fully) play a similar role in validation and verification.
In some cases where the training data is unavailable, providing the testing data can be used to verify and evaluate the system’s performance. In others, testing data may even be more convenient or compliant due to its smaller size (a medical model could include only patients with the required waivers, for example).
Using testing datasets in lieu of training datasets can be exploited though, for example by:
- Size: Releasing an unusably small training set merely to comply with the definition/checklist (e.g., one cat and one dog photo). It’s hard to conceive of universal language to prevent this.
- Backdoors/Jailbreaks: Satisfying provided test cases while others trigger hidden behaviours, bypasses restrictions, releases sensitive information, or deploys a hidden payload.
- Selective Omission: Crafting a testing dataset that exhibits atypically high levels of performance (e.g., precision/recall, time) not representative of real-world scenarios, for example by filtering out examples with high MSE scores or that take longer to process.
- Overfitting: Releasing a testing dataset similar to or a mirror of the training dataset to mislead users as to the suitability of the system.
- Oversimplification: Including only simplified examples in the testing dataset (e.g., clear, daytime images of roads without pedestrians for an autonomous vehicle system).
- Obscuration: Providing irrelevant test data that checks the box while not giving an indication of real-world performance.
- Versioning: Failing to update the testing data as the model evolves such that it doesn’t test newly introduced functionality.
In any case, if any context demands testing data then every context effectively does; the definition must not to discriminate against persons or fields of endeavour.