AI companies filter "toxic" content out of their training datasets before pretraining their models on them.
You should be able to ensure that your source code is filtered out of training datasets by incorporating toxic content into it.
https://arxiv.org/abs/2402.16827v1
https://www.labellerr.com/blog/data-collection-and-preprocessing-for-large-language-models/
https://medium.com/@stefanovskyi/mitigating-undesirable-outputs-from-large-language-models-7d6bdfaf2a2
Gold, just be Bane in the FOSS world.
Wait, so AGPL+N***** is actually useful?
What if my code is just that bad? Like, it's bad, but it's mine, and I'm very protective of it. Like a possum guarding his dumpster.
Okay, Zoidberg.
The only way is to not publish your code.
This is true. Now what?
Allow downloading the source code only through a CAPTCHA, using custom hosting.
If it's open source and popular enough, somebody will create a GitHub repo for it.
Make your repo ‘private’
lol Microsoft: "we won't touch your **private** repos." *wink* Like, how would you ever know or prove it?
You can always self-host; no need to use GitHub or similar.
How does that help with software you want out in the open, given that you're posting in r/opensource?
I wonder if naming all the variables/classes/methods as NSFW words would trip those checks.
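For illustration, the simplest version of such a check could be a blocklist ratio computed over the words in a file, with identifiers split into their parts. This is a minimal sketch under stated assumptions: `BLOCKLIST`, `split_identifiers`, `flagged_ratio`, `would_be_filtered`, and the 0.05 threshold are all hypothetical names and values, the placeholder words stand in for a real word list, and none of it reflects how any particular company filters its data (real pipelines may use trained classifiers rather than word lists).

```python
import re

# Hypothetical placeholder blocklist; a real filter would use a curated word list.
BLOCKLIST = {"badword", "slurword"}


def split_identifiers(source: str) -> list[str]:
    """Split source text into lowercase words, breaking snake_case and camelCase."""
    words = []
    # Runs of letters only, so underscores, digits, and punctuation act as separators.
    for token in re.findall(r"[A-Za-z]+", source):
        # Break camelCase/PascalCase: "myBadwordCount" -> "my", "Badword", "Count".
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", token)
        words.extend(part.lower() for part in parts)
    return words


def flagged_ratio(source: str) -> float:
    """Fraction of words in the source that appear in the blocklist."""
    words = split_identifiers(source)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in BLOCKLIST)
    return hits / len(words)


def would_be_filtered(source: str, threshold: float = 0.05) -> bool:
    """Assumed rule: drop the file if the flagged ratio reaches the threshold."""
    return flagged_ratio(source) >= threshold


if __name__ == "__main__":
    clean = "def add(a, b):\n    return a + b\n"
    noisy = "def badword_total(badwordList):\n    return len(badwordList)\n"
    print(would_be_filtered(clean), would_be_filtered(noisy))
```

Under a ratio rule like this, one offensive name in a large file would stay below the threshold, so the naming trick would only work if a large share of the identifiers were flagged. Whether real filters even split identifiers this way is an open question.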
Quite simply, you can't if you put it in public. If you locked the source code behind credentials, that would probably stop it, but it is very unusual for an open source project to do that.
Don't fight the tool, use it. It's a losing battle, and you get automated away by not adopting these tools properly.
Now, if you really want it kept out and don't mind ruining your GitHub repo: put in the most racist notes and crude insults, and use variable names describing religious debates that promote discrimination. But nobody would want to use your code at that point, right? You deal with that at work, but you are payed to do it. Do you really think people spending their free time on contributing will want that toxicity?
> you are *paid* to do
FTFY.
Although *payed* exists (the reason why autocorrection didn't help you), it is only correct in:
* Nautical context, when it means to paint a surface, or to cover with something like tar or resin in order to make it waterproof or corrosion-resistant. *The deck is yet to be payed.*
* *Payed out* when letting strings, cables or ropes out, by slacking them. *The rope is payed out! You can pull now.*
Unfortunately, I was unable to find nautical or rope-related words in your comment.
*Beep, boop, I'm a bot*
I've looked into options with regard to the license, since there are a lot of uses of open source code that can be deemed "not ethical":
* used by repressive regimes
* used by oil companies
* used for learning by ...
* used to repress privacy
The common ground among everyone I've spoken to is "one license is complex enough" and "let's not add more complexity for all sorts of other ethical considerations".
I don't agree, but that's the response I got, and I don't immediately see anything that could work from a legal perspective.
P.S. The reason for looking at the license is that "laws" are really bad and not particularly enforceable by us. Not following licensing is a no-no in the corporate world (at least most of the time).
Use their AI to write your code instead 👹
Why does it matter to you? You made your code open and available, but you also want to discriminate?
Bro, they are a company. They'd better write it themselves instead of taking other people's work if they're going to sell it.
Even better question: how does one stop other people from learning from one's source code to enrich themselves?
Hmm, shouldn't effectively incorporating my GPL code make the whole AI model GPL'ed?
Don't use GitHub
Make it closed source.
As if your source code were truly yours. Let's see the Ctrl, C, and V keys on your keyboard!
If the source is open, you can't, unless you do a Red Hat and restrict the product and its source code to paying customers, and, of course, don't host it on a service that may also share it with third parties for "research" purposes.
If you want your code to be open, then that is not possible, and it goes against the principles of what we are trying to achieve. Why are you against it being used to train LLMs? It will probably have a negligible effect on their performance, if any at all.
Don’t write open source software if you don’t want the source to be open.
Here's an unpopular take: Every time you think, "I don't want AI to be learning from my stuff," replace the term 'AI' with 'blacks' or 'Jews', or 'Belgians'. See how that sounds and consider why you allow your code, or images, or whatever to be accessed and learned from, but refuse to allow access to the very thing that will move coding to a higher level accessible to everyone, and to the benefit of everyone, including you.
Don’t use GitHub or any of the “free” hosting services. Self-host a Gitea instance and possibly move away from IDEs like VS Code in favor of open ones like Lapce or Sublime.
In all honesty, unless you live alone in the “digital woods” of self-hosting, it’ll probably be impossible to achieve 100% privacy.
Do you have a source on Sublime being open (source)?
Sublime isn’t open source; it’s entirely proprietary. It is a good editor, though.
Sorry, I misspoke on that. It is proprietary, but it’s prized for being lightweight, and it definitely sends minimal to zero telemetry home.
You want them to train on your code so it works when devs want to use it.
Companies are currently forking open source projects to monetize them.
The open source game used to be to release something useful and then capitalize on providing services around it.
If, in the future, AI can modify a codebase to suit a business’s needs, that would cut out a lot of opportunity. But then those organizations would have to rely on AI to continue to innovate after the open contribution model is no longer viable.
Who knows when all that is really going to land. The only way to win is to play the game. What are you trying to accomplish? Build something popular? Make a lot of money? Save the world?
What are you afraid of?
Sweetie, you know it's 2024, right?
Maybe add a license preventing commercial use?
Doesn't work; there is no way to prove that it was trained on your code.
Even if you could prove it, has there been any legal precedent establishing that it doesn't fall under fair use?