Earlier this month Wes Turner, a Senior Lecturer at Rensselaer Polytechnic Institute, presented A Quick Look at the New Open Source AI Definition from the OSI at DevFest 2024, a Google Developer Group (GDG) conference focused this year on Responsible AI. As you can see, like many before him (including myself) he was discombobulated by a document that defies common sense, from a formerly trusted organisation that has recently been gaslighting its own members into questioning their sanity (when it’s not brow-beating them into submission).
It starts with some great context on Free & Open Source Software (FOSS) and the legal foundation on which the last quarter century of outstanding success was based — Open Source is now a $30bn industry supporting something like $8tn in activity and 98% of all enterprises. It then dives into (26m39s) the Open Source Initiative (OSI)’s sponsor-friendly attempts to map this over to AI, which respected industry analysts have likened to the Tacoma Narrows bridge collapse. If you need more background or want to write something yourself, start here: So, you want to write about the OSI’s Open Source AI Definition (OSAID)…
I’ve run the recording through the Open Source-ish Whisper model to produce a transcript which Wes has kindly made available under a CC-BY-4.0 license, enabling me to share it here. I’m also adding my own thoughts inline addressing his comments and concerns on the flawed process and its product: a dangerous proposed definition that has already been rejected by the industry (Debian, Free Software Foundation, Software Freedom Conservancy, etc.), which conflicts with the Open Source Definition (OSD), and which fails to fully protect the four freedoms.
Scroll down to the bold passages which are relevant to the discussion:
Transcript
License: CC-BY-4.0
All right, let’s get started.
So hi, I’m Wesley Turner.
I’m a professor of practice here in the CS department at RPI.
Welcome to Open Source Licensing for AI.
I wanted to give you a little bit of background on myself, just so you can see where I come at this from.
I’m a professor of practice here at RPI, but I’m also the director of the Rensselaer Center for Open Source.
And the Rensselaer Center for Open Source is a project-based, hybrid student-academic organization.
We’re largely run by students, but we also teach an academic course every semester, where students get credit for participating in open source projects.
They can make them up, they can join projects externally.
We’re pretty free and easy on how people come up with the projects, but they do have to be open source.
So we have a vested interest in promoting open source on campus.
A lot of the students you’ll see walking around have either taken or are taking my course.
So it’s kind of an interesting addition to the RPI curriculum.
This isn’t the only experience I have in open source.
I have worked about 23 years in it.
I was the director of open source operations at an organization set up to help the VA open source their software.
That’s now defunct.
I left before it folded; it was healthy when I was there, but it was a federal funding thing.
So I’ve done that, I’ve worked at a bunch of open source places, including Kitware just up north.
I worked on open source software at GE.
I worked on an open source medical image simulator or medical procedure simulator company a few years ago.
So I have a really big interest in open source.
And open source and AI is a topic that is just starting to get some attention.
So I wanted to take this opportunity to talk about what open source actually means and how we can apply it to open AI.
I’m gonna use the inaugural Open Source AI Definition from the OSI, the Open Source Initiative, to give you an idea of some of the concepts and some of the issues that we’re looking at.
But before we get there, that’s kind of the third bullet here, we gotta talk about two other topics.
The first is, what does the concept of free and open source actually mean?
Because how can we know if this is a good definition of open source and AI if we haven’t actually defined what the concepts mean in general?
So we’ll take a quick look at those.
And we’ll also take a quick look at intellectual property, because it impacts what we can do with AI.
It impacts how we can open source things.
It impacts the types of things we can make available to others.
So we’ll talk about those first, and then we’ll kind of take a look at the open source AI definition as it’s come out.
And see what we think about it, and how it does or doesn’t match our concepts of free and open.
Free and open software, open source software has been around for a while.
How many of you use Linux?
I expect some of you at least use Linux.
That’s probably the biggest success story.
You’re all familiar with Red Hat, right?
Red Hat is a company that basically you paid money to get software that you could get for free, which is great.
That’s the basis of an open source ecosystem.
Because the truth is, you could get that software for free, but you would have to package it together.
You’d have to figure out how to install it.
You’d have to put all the tools that you need to use it together.
And you’d have to do that before you could actually use it effectively.
So the cool thing about open source is you’re not prohibited from making money off of it, you just have to think about a different business plan.
And a lot of times that business plan comes in by using your expertise to make it easier for people to accomplish the things they wanna do.
Right, so Red Hat got sold to IBM.
It was for about $34 billion.
So they basically built up a company worth $34 billion by convincing you to pay for something that you could have gotten for free, and you were grateful for it.
And that’s a win, win, win, right?
[LAUGH] You didn’t have to deal with all the headaches.
They got a little bit of money to make it easy and to give you support.
And they ended up being something that IBM was willing to pay $34 billion for.
So let’s take a look at what that actually means in practice.
Free software and free and open source software are both based on these four freedoms.
They were coined by Richard Stallman back in the 80s, I believe.
And the idea was, the story we tell, probably true, was that they had a printer at the MIT AI Lab where he worked.
And they used that printer for years, and then they changed the software and the printer could no longer do what they wanted it to do.
And he was angry about that, because he used to be able to change the software, and then they hid it away from him.
So he decided to coin these principles: data deserves to be free, and computer programs are nothing but data.
And so it is an evil to prohibit people from accessing your code.
And this is what he came up with: the four freedoms of free software.
And you’ll notice that since we are computer scientists in general, we start numbering with zero.
So we have four freedoms, zero through three, not four freedoms, one through four.
It’s the world we live in.
[LAUGH] Okay, so the first freedom he came up with, freedom zero, was if you have a program, you should be able to do what you want with it.
You should be able to use it for any purpose.
You shouldn’t be telling me, once I have this data, this free program that’s in the wild like the elephants and the antelopes.
You can’t tell me what I can do with it.
It can run free and do what I ask it to do.
Or I have the ability to run free with it.
The second freedom, freedom one, is the freedom to study how the program works and change it so it does what you want it to do.
And in order to do this, you have to actually be able to see the source code.
So you have to be actually able to see the stuff that’s used to generate the program that’s running.
Now, technically speaking, this isn’t exactly true, right?
You could go in and modify the bits.
And if you ever worked in the Stone Age back in the 80s and 90s, you probably did a bit level patching to make things work the way you want them to.
But it’s tedious and you can’t do major changes.
You’re stuck in just making tweaks and improvements.
So in order to exercise freedom one effectively, you have to have access to the software in the preferred form for modification, which is at the source code level.
And that’s gonna be important when we start talking about our open AI definition.
Freedom two, the third freedom, if you have the software and the software’s running rampantly and free, you should be able to give that software to anybody you want because it should be able to rampage just as well on their computer as it can on yours.
So this is, how many of you have actually read the Microsoft EULA when you installed the system?
Okay, if you read the EULA, at least the old EULA, I’d read it once.
You didn’t buy the software, you bought the rights to use the software.
So it’s just like renting a car.
They retained all rights.
You didn’t have access to the program.
If you used it in a manner they didn’t like, they had the right to take it away from you.
If you changed it, they had the right to take it away from you.
There are all sorts of rules and things you couldn’t do.
Freedom two, the third freedom, says you can’t do that.
Right?
Once you have the program, you can give it away, you can do whatever you want with it.
And then freedom three is, if you take advantage of freedom one and modify the program, well, you can give your version away to anybody else as well.
You know, so you can modify it and then give your, or sell.
You can distribute your version of the software to anybody else as well.
So these are basically four freedoms.
And this is philosophical.
Okay?
This isn’t, you know, these aren’t hard and fast rules that you can write down.
You can evaluate things in light of this philosophy, but it is a philosophical position based on the fact that it is an evil to hide away the data.
Right?
This data has to be made free.
And this is actually the basis for, if you’re familiar with the GPL licenses, or the other viral or copy left, as they’re called, licenses.
This is basically where they live.
Right?
This is where they came from.
This concept that it’s evil to make the software violate these principles.
So therefore, if you’re using my code and my code’s licensed under a copyleft license, then your code also has to be licensed under a copyleft license.
So a little bit after this, you can understand that if you’re a company, these are not necessarily how you want to build your company.
It is a different business model.
And particularly if you’re looking at some of the GPL code, incorporating GPL code into your code makes your code open source by license.
So you have to abide by the principles as well.
So there’s another approach, and it’s a more pragmatic approach.
They both kind of live together.
It’s not like, you know, warfare.
But this is a pragmatic approach that says open source software, free software, the whole principles of making the source code open is a better way to develop software.
Because it’s a better way to develop software, we’re going to win.
We don’t have to enforce a license on anybody else.
Right?
We’re going to win because we’re better.
So therefore, we’re just going to make our software completely open.
And the cool thing here is they came up with 10 principles, you know, of course.
These are a little bit more measurable.
They’re a little bit more standard.
You know, you can kind of go down a checklist.
But they’re all derived from the original philosophical principles that Stallman wrote.
But they kind of led to a series of licenses that are called permissive licenses, where you can actually use the code.
And as long as you give proper credit to whomever generated the code, you’re free to license it the way you want.
You can incorporate it in a proprietary product, et cetera.
It’s not a strict, you know, four freedoms versus open source principles kind of thing.
You know, open source licensing encompasses all of the GPL licenses, and the free software side recognizes most, but not all, of the open source licenses.
But that’s kind of how these things break down in terms of where they’re most used and most regarded.
And the open source principles, if you read through these, and I apologize, I’ve got a lot of slides with a lot of text.
Not the best way to give the presentation, but I wanted you guys to have the material here in case you actually wanted to look through it.
So if you look through these, you know, free redistribution, right?
If you have the code, you can’t restrict somebody else from giving the code away, right?
Source code: the program must include the source code, right?
Freedom one, the second freedom, said you had to be able to get the source code in the preferred form in order to make modifications.
Derived works, derived works are just works that you create based on the previous work.
So this is your modifications that’s coming in.
Integrity of the author’s code: the only thing about that is you can’t claim something that isn’t yours.
You can use something that isn’t yours, but you can’t claim it, and you can’t make mistakes and then blame them on the original author; that’s essentially what it says.
No discrimination against persons or groups, right?
Freedom zero, you can run it for any purpose you want.
Once you have the code, it’s yours to do with as you like.
Again, no discrimination against fields of endeavor, same one.
Distribution of license.
This is an interesting one, but all it really means is that if I make this code and I give it to you, you automatically have the right to give it to somebody else with the license.
They don’t have to come back to me and say, “Can I have the license too?”
And then, can’t restrict other software.
If I deliver a piece of software with an open source license to you on the same disk as a piece of software without an open source license, that’s okay.
They can live together, peace, harmony, everybody getting along.
And technology neutral, it doesn’t matter whether you’re running on, you know, what chip you’re running on or individual technology or style of interface.
So those are the two main sources of open source licenses in the software community right now.
The first one is from the Free Software Foundation.
Those freedoms were originated by Stallman.
This was originated by a group called the OSI, right, the Open Source Initiative.
The Open Source Initiative has over the years been the caretaker of the term open source, and they actually bless licenses: they say that this is an open source license that meets all of our criteria.
So it becomes kind of the basis, at least for my course, is if you pick an open source license, it could be copy left, it could be permissive, I don’t care.
If you pick an open source license, a license approved by the OSI, then we’re going to just from a practical standpoint assume it’s open source and go from there.
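[Ed: the course rule of thumb above, OSI-approved means open source for practical purposes, is easy to automate. A minimal sketch; the set below is a small illustrative subset of OSI-approved SPDX identifiers, not the full register, and the function name is my own invention.]

```python
# A tiny, illustrative subset of OSI-approved licenses, by SPDX identifier.
# The real list lives at https://opensource.org/licenses
OSI_APPROVED = {
    "MIT", "Apache-2.0", "BSD-3-Clause",         # permissive
    "GPL-2.0-only", "GPL-3.0-only", "MPL-2.0",   # copyleft / weak copyleft
}

def acceptable_for_course(spdx_id):
    """Copyleft or permissive, doesn't matter: OSI approval is the test."""
    return spdx_id in OSI_APPROVED

print(acceptable_for_course("GPL-3.0-only"))  # True: copyleft is fine
print(acceptable_for_course("CC-BY-NC-4.0"))  # False: NonCommercial fails "no discrimination against fields of endeavor"
```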
Okay, so that’s the first part of our talk.
We’re doing pretty good.
Any questions about any of this open source, free software kind of licenses or what they actually mean?
Yeah, go ahead.
Okay, so the question is, can you have a tier of an open source license, where you pick and choose some of the principles and not the others?
So that comes back to the intellectual property that we’re going to talk about in just a second.
You can do that, right?
As a creator of a work, you can license it any way you want.
It has to match all of the principles in order to be open source, okay?
But that’s different than saying you can’t do it; you can, it’s just not open source then.
If it’s — it depends.
You can have ethically, you know, ethical source licenses which try to impose restrictions on field of endeavor.
You can have green source licenses, do no evil licenses.
There are multiple different ones.
If you have no license or if it’s held by a single entity, then it’s called proprietary.
Okay?
And I’m a big — by the way, I teach open source.
I’ve worked in open source.
I’m a big fan of this next part because this next part says what you can and can’t do with your code.
And I think open source is a fine option.
I think, you know, free software, copy left code is a fine option.
But this is your decision.
If you’re writing the code, it’s up to you to decide what you want to do.
And that’s the beauty of being a creator, is you get to make that decision.
But that brings us to why do you get to make that decision.
You get to make that decision because of something called intellectual property.
Anybody know what intellectual property is without just reading it off of the slide?
All right.
So intellectual property is any of the various products of the intellect that have commercial value, including copyrighted property such as literary or artistic works, and ideational properties such as patents, business methods and industrial processes.
It’s the set of rights protecting such works and property from unlawful infringement: any product of someone’s intellect that has commercial value, covering copyrights, patents, trademarks and trade secrets. This is a dictionary definition.
It’s kind of cool.
We should know what it is.
But what it calls out is four different types of intellectual property.
The first, you know, so it calls out copyrights, patents, trademarks and trade secrets and all of these kind of play in this open source realm, kind of.
The first three play in the open source realm, but they’re not as important to it as copyright.
We’re just going to go through them really quickly.
Trade secrets are a form of intellectual property.
You just don’t tell anybody.
So, you know, Coca-Cola has a secret formula.
Kentucky Fried Chicken has 11 herbs and spices.
You may have a special manufacturing process that allows you to create computer chips with very fine traces and, you know, nanometer scale process control.
Those are trade secrets.
If you haven’t published them and you haven’t told anybody, then they’re yours to keep until somebody else figures them out.
And that’s the problem with a trade secret.
It has the longest protection so long as nobody else figures it out.
You know, you walk into Coca-Cola’s vault and you unlock the secret formula and you start making Coca-Cola, you can compete directly with them.
You’ve probably committed industrial espionage, so you’d have to get away with that or spend some time in jail.
But once the trade secret is discovered, its protection is pretty much gone.
Trademarks are just a shortcut for referring to things.
What is this trademark for?
The Swoosh: Nike, right?
The oldest trademark, by the way, is for Bass Ale.
Look it up.
The oldest trademark in the Western world is for Bass Ale, an English beer.
Just a red triangle.
Patent, well, that’s the exclusive — so there was a question about monopolies in the last session.
A patent is essentially a legal monopoly.
If you reveal your idea and you give detail so other people can replicate it, then various governments, including the U.S. government, European Union, et cetera, will give you an exclusive right to use that idea for a period of time.
Just the understanding that after that period of time, you’ve given away the secret and anybody else can fast follow.
So generic drugs are a big indicator or a big result of this.
Generic drugs can’t come about until after the patent’s expired.
I think for drugs, it’s now 20 years past first filing or something like that.
Getting a drug through a process can take a number of years.
So the idea was to have an effective patent length of about 17 years, if I’m not mistaken.
This is not a drug patent.
Anybody recognize what this patent is?
This should be near and dear to most of your hearts.
This is a mechanical mouse, at least one of the diagrams from a mechanical mouse.
But for our purposes, those are interesting.
The big dog is the copyright.
And the copyright is just basically a legal right given to the creator of an artistic work.
So the important thing here is that you have the right as the creator to decide how that something is actually going to be used.
It’s granted by the act of creation itself.
So you can file for a copyright, but you don’t have to.
The fact that you created something is enough to give you the copyright.
So you don’t have to go through any legal process to say this is copyrighted and nobody else can use it.
As soon as you generate it, if you generate it yourself from your imagination, it’s an artistic work.
And you can say how that work can be used by others.
Now it may surprise you to find out that generating a computer program is considered a work of art.
You know, I can look at code structure and find it beautiful.
Maybe you can, maybe you can’t.
Art may be a little stretch even for me.
But that’s not what they mean by art.
It’s a work of skill and imagination.
So most of our basis in the right to control our code comes from this concept of copyright.
Now I will throw out there that you can also get patents for computer processes.
So there you’re not actually copyrighting the text.
What you’re doing is you are patenting the idea behind it.
And those can be very troublesome.
But in general, they’re not what allows us to open source our code, and they’re much less prevalent.
All right.
Any questions about that?
Okay.
So as the copyright holder in a piece of code, you have the right to do with that code what you want.
And before we get on to how this applies to AI applications, I want to take just another second to explain what that means.
That means you can say to somebody they can or cannot use this piece of code.
How many of you have used GitHub?
How many of you put code out on GitHub?
How many of you have put code out on GitHub without a license?
Is that code open source?
I heard a no in the back.
Exactly.
What that code is, is a honey trap.
You put code out there without telling people they can use it.
Right?
And your intent may be perfectly benign and you may want people to take your code and use it.
But if it’s out there without a license, what that means is somebody who takes that code is taking copyrighted code and using it.
Which means that if you can show they took your code and are using it, and you’re mean, right, they either have to stop using it or pay money.
Right?
Not licensing your code is, well, I don’t want to say fraud.
But it’s enticing people to use something that they really don’t have a right to use.
Because you created the code and you own the copyright to the code, you have the right to tell people how they will use it, whether you exercise that right or not.
When you put a license on it, when you put an open source license, you’re specifying the specific things they can do.
And if you’re putting on an open source or free license, by definition, you’re giving them those freedoms.
And they’re written in different ways.
There are different reciprocity obligations if you use a copyleft license.
You know, that does propagate downstream.
Somebody else who takes your code and uses it also has to release theirs under the GPL.
Permissive licenses don’t have that.
But you have the right to tell them that because you wrote the code.
So you’re setting the basis for how they can use it.
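[Ed: the fix for the “honey trap” problem described above is to declare a license explicitly, and the one-line SPDX header is the machine-readable convention for doing so per file. A minimal sketch; the helper function is hypothetical, for illustration only.]

```python
# Unlicensed code on GitHub is all-rights-reserved by default. An SPDX
# header states the grant explicitly; this helper detects whether a
# source file declares one near the top.
SPDX_PREFIX = "SPDX-License-Identifier:"

def declared_license(source_text, scan_lines=5):
    """Return the SPDX identifier declared in the first few lines of a
    file, or None if the file is effectively all-rights-reserved."""
    for line in source_text.splitlines()[:scan_lines]:
        if SPDX_PREFIX in line:
            return line.split(SPDX_PREFIX, 1)[1].strip()
    return None

licensed = "# SPDX-License-Identifier: GPL-2.0-only\nprint('hi')\n"
honey_trap = "print('hi')\n"  # no license: the "honey trap" above

print(declared_license(licensed))    # GPL-2.0-only
print(declared_license(honey_trap))  # None
```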
So now we want to talk about what this applies to in terms of AI applications.
And by the way, I wrote this for a class about a month and a half ago.
And at the time they were on release candidate one, version 0.0.9.
And these first couple of slides I’m going to take exactly from that, or at least this first slide I’m going to take exactly from that because it was actually a lot more informative than the release notes on the first official version, which by the way, I think came out this week.
I think I noticed it came out this week after I’d already prepared all my slides for today.
So I spent a little bit of time going back through and at least verifying what I had was correct.
Fortunately, there weren’t any major changes, so I didn’t have to backtrack too much.
But a month — yeah, the release candidate came out about a month and a half ago, and these were the important things that I read.
They talk about the data information.
So if you’re talking about an artificial intelligence, an AI project, right, data is a big thing.
You’re going to have training data, you’re going to have testing data, and the data is actually kind of the secret sauce behind everything, right?
The data is the source for AI, per the original author of the OSD and founder of the OSI, Bruce Perens.
If you’ve gone to some of the lectures in this room today at least, they’ve talked about bias in the training data, they’ve talked about curation or prepping the data so that you can use it correctly, selecting the data, et cetera.
The data is a big thing, and there are problems with any model, any open release of the data that we can talk about.
So we’ll go through what it says about clarifying all that training data needs to be shared and disclosed, et cetera.
Earlier in the process they correctly identified that the three legs of the Open Source AI stool are the code, the data, and the resulting model, only to cut one of them off by replacing data with metadata (“data information”). This is like giving me a recipe calling for unicorn horns (e.g., Facebook/Instagram social graph data, which went into Llama with or without users’ explicit permission).
We’ll go about that in a few minutes, but the idea is that this is not a cut and dried clean part of the definition.
No, it’s the most contentious part, and it was that way since the start.
As you can see from my repo (samj/osaid) I’ve generated transcripts for all the OSI’s town hall meetings, and the smoking gun on the data subject was revealed by their Executive Director in the first half of the first meeting: “What kind of so access to training data is probably unavoidable. The question is what kind of access? What level of access? The full on training data set? Or is it sufficient to have a detailed description of what went into it? Or something else?”.
Apparently, the decision to exclude the data was already made last year in a closed and since-deleted Google Group (why not the now dead discussion forums at https://discuss.opensource.org, or one of the many mailing lists on https://lists.opensource.org/?), before they embarked on the holier-than-thou, whiter-than-white “co-design” process which was so important that it had to be outsourced to a consultant lacking experience in both Open Source and AI: Open Source AI Definition: Not Open, Built by DEI, Funded by Big Tech
And in fact, it refers to four types of data, by the way.
…only to accept any of them or none, shifting the Overton (Openton?) window of Open Source all the way to the extreme end of the spectrum. If this is not an attempt to obfuscate their real stance with doublespeak, then why not just say “we accept any data or none as Open Source”?
I think this is the one thing that I did want to get out here.
Four types of data: open data, which is open and publicly available; public data, which is public but not necessarily available; obtainable data, which you can get but may have to pay a third party for; and then data that they used that you just can’t have, regardless.
The “including for fee” wording makes the third class of data (“Obtainable”) even more toxic than the fourth (“Unshareable non-public”).
If you license NYT articles or Adobe stock images and train a model on it, what is it going to produce if not the same? In French we say “les chiens ne font pas des chats” (dogs don’t make cats), and the same applies here. Consider Linux as a Model (LaaM) which was trained on Linux kernel code under GPLv2 and generates clearly copyrighted code from same, acting more like a photocopier than a generator. This is irrespective of the amount of 4D chess played in arguing maths cannot be copyrighted, and other related nonsense we’ve heard along the way. Good luck making that argument in front of a judge when you publish a zip file full of Hollywood movies!
So that’s kind of a fly in the ointment on this whole definition.
The data is the source for AI, so the definition is all fly, no ointment.
They also had two other features.
They’re clarifying that the code has to be complete enough for downstream recipients to understand how the training was done.
So one of the things they wanted to clarify was that you can’t just give the code (basically the neural network setup, maybe even the internal weights, and some idea of the data) and not tell them how you trained it, because is that really a dissemination if you’re hiding some of the most important parts?
Again, the data is the source for AI, or its symbolic instructions per Simon Wardley who argues “OSI should be standing its ground here” (they are — just on the side of their sponsors!).
A fellow Debian developer explains it very clearly: “That’s like not releasing the source code at all and claiming that it’s still free software because “a skilled person” could rewrite it. That’s obvious bullshit.”
Yeah, so that’s a second important concept.
You have to be able to know how the training was done in order to actually have any hope of using an open source AI to do similar things to what it was created to do.
And then the last one, if you’re not familiar with open source, this may not be surprising to you, but this surprised me.
And this one just clarified that the aim of open source is not and has never been to enable reproducible software.
And the same is true of open source AI.
Now you look — you think back to the four freedoms, right, and it says you have the right — one of them says you have the right to study the code and figure out how it works, make it do what you want it to work, use it for what you want to do, and you have the right to modify it.
But if you can’t reproduce the original, do you have any hope of ever being able to accomplish that first freedom?
So I’m not sold on this one.
Right?
I’m not really sold on the bullet that it never was intended to be reproducible.
They’re kind of using words that say it is not standing in the way of reproducibility, but it wasn’t to enable reproducibility.
And I’m not sure I’m buying that, because that was always one of the things that I stressed in my classes, is that if we’re writing open source software and nobody else can run the software or generate the programs from it, then are we really being open?
Right?
No, we are not. As I explained in Openness vs Completeness: Data Dependencies and Open Source AI, “For software to be Open Source you require reproducibility. Some have mistakenly claimed this goes further into the realm of Open Science (as if that’s a bad thing), but it is the reality and requirements of Open Source today: If the source code produces the software, and is made available under an Open Source License, then the software is Open Source.”
In making the case as to why the OSAID is not DFSG-compliant, said Debian developer explained:
“The source code” must include everything needed to rebuild the software so that it works the same as the original. An AI system doesn’t “work the same” — i.e., give the same output from the same input — without the training data, so the training data is clearly part of the source.”
So we have to publish all of that stuff along with it.
So I’m a little skeptical on that one.
Anyway, those were the release notes.
The actual definition is at that point, and this is where I’m kind of worried.
You’re being too generous with them here. This is an own goal of epic proportions that poses an existential threat to Open Source from the inside, for any software that “infers, from the input it receives, how to generate outputs” (i.e., all software).
Everything is a little small, but your eyes are probably better than mine, and if not, I’m going to read to you the salient points anyway, or at least summarize the salient points.
And the QR code will get you right to the definition, I hope, unless I forgot to replace it with the release candidate.
I forgot to replace the release candidate, but I think I got it right.
And in comparison, let’s kind of bring up the original four freedoms of open source and just see how these things match.
And the first one — freedom zero — is essentially the first bullet point.
Use the system for any purpose and without having to ask permission.
Right?
That’s kind of the same as freedom to run the program as you wish for any purpose.
Pretty close.
Second one, freedom to study how the program works and change it so it does your computing as you wish, and you need to see the source code as a precondition.
Well, study how the system works and inspect its components.
That gets the first part.
Doesn’t really get the freedom to modify, but that’s taken care of in our third bullet point under AI where it says modify the system for any purpose, including to change its output.
So the second and third bullets kind of take care of freedom one, and then share the system for others to use with or without modifications for any purpose.
That matches the freedoms two and three.
Okay.
I have no objections to these, but why shake them up?
It says essentially exactly the same thing, but they’ve split one of them apart and combined two of them together.
So I’m not sure what the purpose of that was, except that this one was generated by the OSI and that one by the Free Software Foundation, and I’m wondering whether it was just a little refusal to accept the Free Software Foundation’s organization as, I don’t know, a political thing.
Yes, that’s likely what’s at work here, but the freedoms to use, study, modify, and share a work are clearer than combining study & modify (even if there’s a dependency). Indeed, it has been proposed that the four freedoms be explicitly mentioned in a potential future version of the OSD.
Another pertinent point is that they claim that being able to fine-tune a model satisfies the freedom to “modify” it, but you don’t get to tell me what I can and cannot do, nor what constitutes an improvement per the Free Software definition: “Whether a change constitutes an improvement is a subjective matter. If your right to modify a program is limited, in substance, to changes that someone else considers an improvement, that program is not free.”
Anyway, let’s take a look at some of the actual clauses.
That was the philosophy behind the open source AI.
There were some clauses that have to go in here.
We said that one of the first ones was going to be this data information clause, and here’s where things get a little weird.
So when they talk about data information, what they’re saying is you need sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system.
Data information shall be made available under OSI-approved terms.
So the substantially equivalent system is the bow toward not having reproducibility, right?
This weasel wording (including “sufficiently detailed”, “skilled person”, and “substantially equivalent system”) creates several loopholes big enough to drive a truck through sideways, as I’ve demonstrated here: Malicious compliance with the release candidate Open Source AI definition.
And you can kind of understand this because an AI system is really complex, so absolute reproducibility is going to be very difficult, right?
Impossibly difficult according to Redmonk: “I do not believe the term open source can or should be extended into the AI world.”
Not substantial reproducibility, so it does about the same thing, maybe we can kind of get there.
The particulars are kind of interesting too, because what it’s asking for is a complete description of all the data used for training, right?
Including unshareable data, the provenance of the data, scope and characteristics, how the data was obtained and selected.
So all the things you would need to know, but notice that you’re not required to actually give out all of the data.
You have to have a listing of all publicly available data and where you can get it, and a listing of all training data obtainable from third parties and where to obtain it, including resources you have to pay for.
So there’s no way that this is disclosing all of the data — I’m sorry, it is disclosing all of the open data, but it’s not disclosing all of the data used.
And that’s a really interesting point from open source.
The data is the source for AI, so it’s clearly not Open Source.
So let’s take a look at why this might be so, because I had a real headache with this the first couple of times I read it, and then you start thinking about it and you can start seeing where it comes from.
Not everything has to be Open Source, and models trained with data that cannot be released cannot be considered Open Source in the same way that software containing code that cannot be released cannot be Open Source.
If I’m training my model, and particularly if I’m in the EU or in Japan or I think Australia, there are specific clauses written into law that says certain classes of data are available for harvesting to train AIs, but they’re not available for publicizing.
So you can get data under a license, you can get data that’s publicly available that somebody else owns the copyright for, and it’s a principle of free use in those areas, in those countries, that you can use that data to train your AI.
These country-specific excuses are exactly that. If anything, laws like those in Germany, which allow the [ab]use of copyrighted content for research purposes that could produce a model violating said copyright while the researcher disclaims all warranty and liability for downstream use, are at best unethical. Laws that prevent copyright protection of “facts” may protect the vendor, but when that bag of bits recalls and reproduces training material verbatim, users will rightly be sued by the custodians whose one job it is to protect their data: The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work. How is it possible that we find ourselves on the wrong side of this argument?
What you can’t do is take that data and make it publicly available because you don’t own the copyright.
You’ve been given rights that do not extend as far as publicizing the data.
This is like Common Crawl, and it potentially creates a “chain of custody” from a large and increasing share of human output back to content oligarchs, and in this case, anyone on the Internet.
So those are the public available training data and where to obtain it, and similarly if you buy data from somebody, they don’t necessarily give you the right to publicize it as well.
Granting the right to publicise purchased content like NYT articles or Adobe stock would render the entire corpus worthless, so that right will never be granted, except perhaps in a limited way via models delivered as a service like OpenAI’s ChatGPT (indeed, a large share of the costs for training GPT 5 are said to be licensing). Will vendors like Meta be making commercial licenses available in future, or offering their models as a service, licensed on a usage basis?
Now you can make some philosophical arguments about whether that’s enough, whether those are sufficient to keep that from being an open AI.
They’re trying to make a pragmatic choice that allows companies to make an open AI so long as they sufficiently describe the data that you could take similar data obtained from different sources and hopefully come up with something similar.
The technology doesn’t care for pragmatism or your feelings. Either the four freedoms of free software are fully protected, in which case you can enjoy the benefits of Open Source that the market has become accustomed to over the past quarter century, or they are not. If they are not, then the models cannot form the foundation of the next generation, and tomorrow’s computer scientists cannot “stand on the shoulders of giants” like we could.
And that also feeds back into the substantially equivalent system.
Your system will never be the same because if you have to pull the data yourself, it’s not provided to you, there’s the chance for error, there’s the chance for different data sets, there’s a chance for the data to have changed over time, additional data added, data removed.
So of all the clauses, this is the one that I find the least satisfactory, but it is still, it may be the best we can do.
We must do better. Giving me the recipe for a cake I can’t bake because it includes unobtainable ingredients is not Open Source. If I train Llama on texts with my mother rather than Facebook/Instagram’s user data, it’s not going to be anything like Llama. Indeed, not being able to verify the contents of the model means I can’t even use it in many contexts, let alone study, modify, or share it.
And if you look at the process for this, and I’ll get to some resources at the end, you’ll find out that this was actually a principal issue, something that was discussed quite a bit during the formulation of this definition.
Everyone has the code — this is kind of self-evident, but as promised in the release candidate, you do have to give all the source code, the full specification of how the data was processed, and it has to include the code you use to manipulate and manage your data, right?
So if you’re processing and filtering your data, if you wrote code for that, that code has to be supplied, and at a minimum you have to describe what you did anyway, right?
So all of that has to be available under an open source license.
So we’re good on the code, right?
Not only the code to train and run the model, but also the code to process and curate the data to make it correct has to be provided.
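The clause above can be pictured with a minimal sketch. Everything here is invented for illustration (the function names and steps are not from any real training codebase): the point is that under this clause, the curation helpers like `filter_documents` would have to be released under an OSI-approved license right alongside the training loop itself.

```python
# Hypothetical sketch: under the OSAID's "code" clause, every step of a
# training pipeline like this -- not just train() -- must be released
# under an OSI-approved license. Names and steps are illustrative only.

def filter_documents(docs, min_length=100, blocklist=()):
    """Data curation step: drop short or blocklisted documents."""
    return [d for d in docs
            if len(d) >= min_length
            and not any(term in d for term in blocklist)]

def tokenize(docs):
    """Trivial whitespace tokenizer standing in for a real one."""
    return [d.split() for d in docs]

def train(token_streams):
    """Stand-in for the actual (expensive) training run."""
    vocabulary = {tok for stream in token_streams for tok in stream}
    return {"vocab_size": len(vocabulary)}

docs = ["a very long document about open source licensing",
        "spam",
        "another long document about AI definitions"]
model = train(tokenize(filter_documents(docs, min_length=20)))
print(model)
```

Without `filter_documents` and `tokenize` being published, you could not reproduce `model` even if `train` itself were open — which is exactly the speaker’s point about the curation code.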
Model parameters.
So if you’re talking about model parameters, maybe you’re talking about the weights of the nodes.
I think we had a question in the last class about, you know, this is just a feedforward network, we’re just applying weights, it’s essentially a big matrix-vector multiplication.
Yeah, and you have to provide at least the final point of the weights so that somebody can actually reimplement or reinstantiate the weights you used or the weights you generated, even if they can’t generate it from scratch using the data.
So that’s good.
The question is, you know, if our goal is to be able to modify it, can you actually modify from this?
Probably not well, right?
We heard some talks earlier about how you can go in and remove a little bit of data or maybe tweak your model.
You could do that on this.
The modifications you can make are very limited when you only have the weights. It’s like walking up to a mixing desk with 70 billion preconfigured knobs; tweaking them is going to change the sound, but maybe not in the ways you want. You can’t unbake the cake to add or remove ingredients, nor rearchitect the model as you could with Open Source software.
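The “big matrix-vector multiplication” point, and the limits of weights-only modification, can be made concrete with a toy sketch (all numbers here are made up; no real model is this small):

```python
# Toy sketch: a released model is "just weights", and inference is
# matrix-vector multiplication. With only these numbers you can perturb
# them (crude fine-tuning), but you cannot recover the training data or
# change the architecture. All values are invented for illustration.

def matvec(matrix, vector):
    """One layer of inference: weights applied to an input vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# The "open" artifact: final weights only -- no data, no training code.
weights = [[0.2, -0.5],
           [0.7,  0.1]]

output = matvec(weights, [1.0, 2.0])                     # run it: allowed
tweaked = [[w + 0.01 for w in row] for row in weights]   # nudge knobs: allowed

# What you *cannot* do from here: reproduce the weights from scratch,
# audit the training data for bias, or turn this 2x2 "model" into a
# three-layer one in any principled way -- the recipe is missing.
print(output, matvec(tweaked, [1.0, 2.0]))
```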
What you can’t do is you can’t make any major changes to, for instance, remove unintended bias based on data selection.
And in fact, that’s the problem I have with the whole data thing is if you can’t see the data, how do you know that there’s a bias?
You can intuit one from the behavior of the AI itself, but without being able to see the actual data that was used, you cannot prove a bias.
Well, it’s much harder to prove a bias, whereas if you could actually see the data selected it would be easy: you could select the data differently, unbias it, and show the difference between the two variants.
Bias is a big one too because you can’t assess let alone address bias without the source data. If you want to remove Nazi content (which is not to say anyone should do that, but maybe you need to for a children’s educational model, for example, or just want to because Open Source lets you) you can’t. If you want to add security scans of transgender passengers so they don’t get sent to secondary screening, you can’t.
This is actually one of the more perverse parts of the process (Design Justice, A.I., and Escape from the Matrix of Domination), which was purportedly done in the name of minorities while actively excluding experts (“We believe that everyone is an expert based on their own lived experience, and that we all have unique and brilliant contributions to bring to a design process.”).
Said minorities absolutely need the data to deal with ethical issues like fairness and bias, and were either told they didn’t by the OSI’s board and their servants, or were [mis]represented in the results. That would not be at all surprising given the voting scandal.
And again, that’s pretty much the end of the clauses.
This just goes on to define, you know, an AI model consists of the model architecture, parameters, including weights, inference code for running the model, and goes on to define the weights.
So that’s really the philosophy behind this.
You’ll notice that as of right now, there is no open source AI license, right?
This is how an open source AI license will actually be evaluated.
It’ll be evaluated against — oops.
Yeah, it’ll be evaluated against the five slides or so that I just showed you.
That’s how it’ll be determined.
Right now, I don’t know of any, you know, actual open source AI licenses that have been approved yet or, you know, set to go into effect.
And that’s actually what the last paragraph says.
It basically just says that this is our concept for what principles have to go into an open source AI license.
But how this license actually gets implemented is yet — has as of yet not been determined because, well, no one’s done it.
We’ll no doubt hear more on this when they return from holidays next month (yes, they apparently vanished after dropping the bomb on the industry at the All Things Open conference last month), but it’s also a cause for concern.
Currently we have the safety of the 96 OSI-approved licenses which anyone can apply on a self-service basis. We know they’ve been working on a checklist enumerating the components, but does that mean we’ll have a one-size-fits-all “license” for AI? Will the OSI attempt to establish themselves as the certification authority for Open Source AI? Will anyone else? Will sponsors get priority treatment?
So there you go.
This is interesting if you’re into how the sausage gets made at the OSI — the Open Source Initiative.
By the way, RCOS — the group that I am the director of — the students elected to become a member of the Open Source Initiative years ago as an affiliate member.
We were the first student-run affiliate membership there.
So we like the OSI.
But they have an open process.
And if you want to go through and see that process in action, the links that I have up here will kind of take you through the process they used.
You know, how did they actually come up with this?
Who had input?
Some of the things that I brought out to you on how we have to be flexible on data.
A whole discussion forum where mainly the data was discussed.
And then a FAQ website as well.
So it’s kind of interesting reading.
These are the points that I found interesting.
We have about two minutes.
If anybody wants to talk about these or something else, you know, I’m very happy to do it.
The first one, I keep coming back to that reproducibility thing because this is the world that I developed open source in.
Woosh, come on.
Come on, there we go.
This is a dashboard of all the machines that one of the software systems that I worked with runs on every day.
And you can see there are multiple pages of this.
There are all sorts of compute platforms available here.
And all of them are trying to bit match exactly the data or the expected results of — well, they’re all supposed to be the same.
And you can see we have some errors here.
And the errors are taken seriously and flagged.
And if reproducibility is not what we actually wanted, we wouldn’t care.
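The bit-exact check the dashboard performs can be sketched in a few lines: hash each platform’s output and compare it against a reference digest. The platform names and output bytes below are invented for illustration; the mechanism is the point.

```python
import hashlib

# Minimal sketch of a bit-for-bit reproducibility check: hash each
# platform's build output and compare against the expected reference.
# Platform names and byte strings are invented for illustration.

def digest(data: bytes) -> str:
    """SHA-256 hex digest of an output artifact."""
    return hashlib.sha256(data).hexdigest()

reference = digest(b"expected result bytes")

results = {
    "linux-x86_64":   b"expected result bytes",
    "macos-arm64":    b"expected result bytes",
    "windows-x86_64": b"slightly different bytes",  # would be flagged
}

for platform, output in results.items():
    status = "OK" if digest(output) == reference else "MISMATCH"
    print(f"{platform}: {status}")
```

A single flipped bit changes the digest entirely, which is why such dashboards can flag even tiny deviations — and why reproducibility was treated as non-negotiable in that world.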
So I’m still kind of taken aback by that whole reproducibility thing.
I understand where it’s coming from.
But — now how do I get back?
Oh, here we go.
And we built a whole ecosystem based on reproducibility that isn’t actually — that was always a mirage, I guess.
Number two, I’m back on the data thing.
Is this actually open?
And if we’re worried about bad actors — open source is one of the solutions to bad actors because it allows us to see everything — but then we allow the data to hide, and the data is where all the problems come in, is this actually useful?
So there’s still that kind of nagging in the back about that.
And then the third one, code must be complete enough for downstream recipients to understand how the training was done.
Is that necessary?
Is it enough?
Do we have to do more in order to increase the level of reproducibility at least?
And these are all the things that kind of come to my mind on it.
This is only version one.
The discussion is going to continue on.
I expect that it will continue on for at least a year or two.
But again, you guys just got something, as far as I’m concerned, it’s on the cutting edge.
A month ago this didn’t exist and here we are talking about it, which I think is really cool.
And please develop your own opinions.
I’ve got some.
Anyway, I think that’s it.
If you have any questions, let me know.
I’m pretty easy to get in touch with.
It’s the end of the semester, which means that my email is exploding.
So there may be a delay, but I think I answer all my email.
Maybe I lose some, but in general.
So please feel free to contact me.
And I am on campus in Amos Heaton 207 if you ever want to drop in for a talk.
Thank you.
Yeah.
So I was wondering, like, I know like a lot of companies, they develop open source software and then like recently they’ve also started releasing like different AI models open source.
So I’m wondering like how do they do that?
How do they benefit from it?
Okay, so the question is a lot of companies are doing open source.
And some of them are generating it and actually have it as a substantial investment from their company.
And how do they actually make money off it?
The Red Hat one is an example, right?
One of the ways you do it is by providing services that are difficult for the general public to do themselves.
And that’s not as hard as you think it is because, you know, if you’re going to install Linux on all your machines, right, do you want to buy it from Red Hat or do you want to hire three or four people to work 40 hours a week to keep the system up, right?
So you know, providing services, you know, a lot of companies, everybody thinks that companies love to get things for free.
No.
What they want is they want support because when it goes down, that’s when they actually lose money.
So one of the ways is to provide services that somebody else will pay for in order to use the software.
Another way is a lot of the open source stuff that’s developed, you’re working with a broader community and you’re developing something that helps you do your job better, but it’s not the secret sauce that makes you money.
You know, if you’re in that situation, then you’re going to contribute to and generate open source software because you’re working with a community so your cost to develop that software is lower, but the company’s not ever going to sell that and that was a big thing at GE.
We weren’t generating open source software to sell.
We were generating open source software to, for example, build a virtual environment where we could plan the measurement activities of a huge scan head on a part, right?
So our income was going to come from lowering our number of flaws because we could plan our scans better.
The open source software was just, why do this by ourselves, we’re never going to sell it.
So, that’s the open source software.
I look forward to Wes’ next presentation on the subject, now that he knows he’s not alone in questioning the validity of the OSI’s proposed definition.
Hopefully the Rensselaer Center for Open Source, being an affiliate member of the OSI with an advisory vote for 4 of 10 board members (another 4 go to individual members and 2 to the board itself), will join me in taking a strong stance against the OSAID, possibly supporting Bradley M Kuhn’s ticket (he “will work arduously for my entire term to see the OSAID repealed, and republished not as a definition, but merely recommendations, and to also issue a statement that OSI published the definition sooner than was appropriate.”) and signing the Open Source Declaration.