In light of how often the Open Source Initiative (OSI) invokes federated learning as an excuse to let vendors conceal the source (i.e., data) under their flawed Open Source AI Definition (OSAID), I’ve updated the So, you want to write about the OSI’s Open Source AI Definition (OSAID)… article with the following section:
What are the implications for federated learning?
The OSI asks “Can you imagine an Open Source AI built with federated learning?”, and the answer is a qualified “yes”.
Federated learning (aka collaborative learning) is a technique that doesn’t require collecting the training data centrally: instead, each participant trains on their own, generally private, slice of the data, and the results are consolidated into a model trained on the superset of the participants’ data, without any participant having had access to the others’ data. This is useful for applications like fraud detection, where a consortium of banks can work together to train a model that exceeds the performance of any one bank’s model, without sharing their customers’ data with the other banks.
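To make the mechanics concrete, here is a minimal federated-averaging (FedAvg) sketch in plain numpy. The participant count, data sizes, learning rate and round count are all illustrative assumptions, not a production recipe: each participant fits a shared linear model on its own private slice, and only the updated weights — never the data — cross the trust boundary for aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared ground truth the participants are collectively trying to learn.
true_w = np.array([2.0, -3.0, 0.5])

def make_private_data(n):
    """One participant's slice of the data; it never leaves their premises."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_train(w, X, y, epochs=50, lr=0.05):
    """Local gradient descent on a least-squares loss, starting from the
    shared global weights; only the updated weights are returned."""
    w = w.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

clients = [make_private_data(n) for n in (40, 60, 100)]
global_w = np.zeros(3)

for _ in range(10):  # federation rounds
    local_ws = [local_train(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    # FedAvg: weight each participant's update by their data volume.
    global_w = np.average(local_ws, axis=0, weights=sizes)

print("recovered weights:", np.round(global_w, 2))  # ≈ [2.0, -3.0, 0.5]
```

The only thing each round sends to the aggregator is a weight vector, which is precisely why the approach appeals to consortia like the banks above — and precisely why the data is never published.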
Federated learning is specifically designed to improve privacy by not sharing the training data, and given that the data is the source of an AI system, the technique is fundamentally incompatible with Open Source. That’s fine, because not everything has to be Open Source, and some things are better off not being Open Source. Given we know that AI models can reveal their training data, the suggestion that the training process offers adequate protection for such sensitive data is as bogus as it is dangerous, both to data subjects and to those seeking to rely on such assurances to avoid legal liability. Extending the definition to allow for federated learning would be the software equivalent of collecting closed-source binary-blob firmware drivers from several vendors and calling the resulting distribution Open Source, which is obviously nonsense.
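To see why “the raw data never leaves the premises” is not the same as “the data is protected”, consider the simplest form of membership inference. The sketch below is an illustrative toy, not any particular published attack, and its dimensions, noise level and threshold are all assumptions chosen for the demonstration: an overparameterised linear model memorises its noisy training labels exactly, so anyone who can query the model can tell from the loss alone whether a given record was in the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Few samples, many features: the model can interpolate (memorise) its inputs.
n, d = 50, 200
w_true = rng.normal(size=d) / np.sqrt(d)
X_in = rng.normal(size=(n, d))                       # private training set
y_in = X_in @ w_true + rng.normal(scale=0.1, size=n)
X_out = rng.normal(size=(n, d))                      # points never trained on
y_out = X_out @ w_true + rng.normal(scale=0.1, size=n)

# Minimum-norm least squares fits the noisy training labels *exactly*.
w_hat = np.linalg.lstsq(X_in, y_in, rcond=None)[0]

def sq_err(X, y):
    return (X @ w_hat - y) ** 2

# Membership test: a record with near-zero error was almost surely trained on.
threshold = 1e-3
tpr = np.mean(sq_err(X_in, y_in) < threshold)    # members correctly flagged
fpr = np.mean(sq_err(X_out, y_out) < threshold)  # non-members falsely flagged
print(f"members flagged: {tpr:.0%}, non-members falsely flagged: {fpr:.0%}")
```

Real models and real attacks are far more elaborate, but the signal being exploited is the same, and it is exactly the kind of leakage that “the training process protects the data” hand-waves away.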
Anyone claiming that the ability to conceal the source (i.e., data) is a feature of OSAID — or conversely that requiring the data is a limitation of more meaningful definitions — either doesn’t understand the technology or does and is deliberately deceiving you. Either way, they have proven themselves incompetent to create such a definition.
Fortunately, federated learning can still benefit from Open Source toolchains, and the results can be released as Open Weight models, which do not deceive users as to the availability of the source (i.e., data).