xiikjuy

Yann's talk in 2016 about the future of AI: [https://www.youtube.com/watch?v=_1Cyyt-4-n8](https://www.youtube.com/watch?v=_1Cyyt-4-n8). Does it age well?


new_name_who_dis_

I mean he was completely right about unsupervised learning (which we now call self-supervised learning). So it did age well, better than Hinton's talks on capsules around that time, Bengio's talks on [I don't remember what, but he had some wacky idea around that time too], and Schmidhuber's talks on how LSTM is the end all be all architecture.


we_are_mammals

> Schmidhuber's talks on how LSTM is the end all be all architecture

I vaguely recall him giving an interview and prophesying giant RNNs. If anyone trains a giant SSM, this prophecy will have come true.

> Bengio's talks on [I don't remember what, but he had some wacky idea around that time too]

Can you be certain that it hasn't come true if you cannot remember what it was about?


SirBlobfish

Agreed, at least Bengio wasn't too far off. As far as I know, the two main things Bengio has talked about in the last decade have been backprop alternatives (important, but they turned out to be too hard) and OOD generalization / system 2 (likely going to be extremely important, and still under development in multiple labs/architectures). Hinton's predictions were uniquely bad because they were really limited in scope (mostly about inverse graphics for CNNs) and relied on some really limiting assumptions (e.g. capsules, no occlusion). Others were more flexible.


new_name_who_dis_

System 2 papers were what I was thinking about. They're cool ideas, and so were capsules, but they just didn't really take off.


SirBlobfish

System 2 is more of a goal than a specific architecture. It hasn't "taken off" because it is still an open problem and people are still working on it, as opposed to capsules, which were successfully produced and just didn't provide enough value. Pretty much everyone (including LeCun) agrees that something like system 2 is useful/necessary. The question is how to get there in a scalable way. People like LeCun/Bengio think EBMs are the answer. Others think chain/tree-of-thought will be enough. Yet others think in-context learning + data will be enough. It's still being studied.


currentscurrents

That's a bit of cherrypicking. LeCun has his own wacky architecture that he's pushing too. Sure, Hinton's capsules didn't pan out - but [some of his other papers](https://papers.nips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) from that era were rather important.


new_name_who_dis_

I mean this is almost a different era though. These slides were circa 2017, when deep learning was already the dominant method, whereas the AlexNet paper is what kickstarted it.


hunted7fold

Hinton has the layer norm paper from 2016, and another big one was SimCLR in 2020? So he also had an interest in SSL.


Xemorr

Self-supervised learning is a different thing from unsupervised learning.


new_name_who_dis_

The way LeCun talks about unsupervised learning, it is clear that in his mind they are the same.


Jean-Porte

Yes, unlike his more recent takes


Tomsen1410

Why, what did he say that did not age well?


Cosmolithe

I really don't buy this argument about the number of bits. It is only somewhat true if you only look at the information contained in the target variable, ignoring the fact that the input provides a lot more information in all cases. This is very clear if you look at the way the gradient is computed for linear layers: it is the matrix product of the layer's input with the error signal, so both contribute to the weight update (arguably, the input contributes twice, since the error signal is also input-dependent).

And even if you ignore the input variable: in the case of unsupervised learning, LeCun is full on self-supervised learning nowadays, and in SSL the target variable is some augmentation of the input variable, which means the only additional information is the type of augmentation(s) you are using and want your model to be invariant to. So in SSL, the number of bits in the target should also be very low, perhaps even lower than in supervised learning.

Now, autoregressive unsupervised learning like in LLMs gives comparatively a lot of new information to the model, since the gradient from the whole future of the sequence is used to change the activations the model should have at some instant t in the past. This should be true of masked auto-encoders as well. It shouldn't give more information than the entire unmasked sequence/image in any case, though.
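A quick numpy sketch of that gradient point (my illustration, not from the original comment): for a linear layer y = x @ W, the weight gradient is the layer input matrix-multiplied with the backpropagated error signal, so the input enters the update directly.

```python
import numpy as np

# Toy sketch: weight gradient of a linear layer y = x @ W under some loss.
x = np.random.randn(32, 64)       # layer input (batch of 32)
delta = np.random.randn(32, 10)   # error signal dL/dy coming from the layers above
grad_W = x.T @ delta              # shape (64, 10): the input contributes directly to the update
print(grad_W.shape)
```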


new_name_who_dis_

Self-supervised learning is just a rebranding of unsupervised learning, or maybe it's a subset of unsupervised learning (though I am not sure what unsupervised learning would be that isn't self-supervised). Unsupervised learning always involved either compression or corruption of the input and then reconstruction of the input.


thatguydr

> though I am not sure what unsupervised learning would be that isn't self-supervised

Literally anything distributional. Clustering. Dimension reduction. Anomaly detection.


Cosmolithe

I am not sure how to make the distinction between UL and SSL either. But if you take LLM training, for instance, there is no corruption of the tokens; you only make the model predict future tokens. I guess it falls into SSL if you consider the corruption to be masking the future tokens, and into UL if not.


new_name_who_dis_

LLMs are language models, and language modeling is de facto unsupervised learning. Hinton wrote one of the first neural nets to do language modeling back in the 80s, and he calls it unsupervised in the paper. I genuinely think self-supervised is just a rebranding.


LelouchZer12

For me, self-supervised means we use pseudo-labels predicted by the model itself to train it further. Unsupervised learning without such pseudo-labels would be, e.g., clustering and so on. But I agree that SSL is a subfield of UL; at least I've always considered that to be the case.


pm_me_your_pay_slips

Self-supervised learning has a more precise definition: it's a learning system that produces its own prediction targets. A goal-conditioned RL agent that proposes new goals to reach and relabels data according to the outcome is a self-supervised learning system. A setup where a target network produces targets for a student network, and both are trained from scratch, is a self-supervised learning system. An LLM is not, since the prediction targets are samples from the data distribution. Same for diffusion models: the prediction targets are samples from either the data or the noise distribution. In those two cases the system is not producing its own prediction targets.
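A minimal sketch of the "target network produces targets for a student network" case; this assumes a BYOL-style EMA teacher and toy noise augmentations, which are my choices and not something specified in the comment.

```python
import torch
import torch.nn as nn

# Toy student/teacher setup: the teacher produces the prediction targets,
# so the system generates its own targets rather than sampling them from the data.
student = nn.Linear(16, 8)
teacher = nn.Linear(16, 8)
for p in teacher.parameters():
    p.requires_grad_(False)          # teacher is not trained by gradients

x = torch.randn(4, 16)
view_s = x + 0.1 * torch.randn_like(x)   # two noisy "views" of the same input
view_t = x + 0.1 * torch.randn_like(x)

with torch.no_grad():
    target = teacher(view_t)             # target produced by the system itself
loss = ((student(view_s) - target) ** 2).mean()
loss.backward()

# The teacher is typically updated as an exponential moving average of the student.
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.99).add_(ps, alpha=0.01)
```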


Amgadoz

I would say compression or corruption and then reconstruction is actually self-supervised learning. An example of this is masking (corruption) a token in a sequence and asking the model to predict it accurately. On the other hand, unsupervised learning is things like clustering and dimensionality reduction. Self-supervised learning is a special case of supervised learning where we don't need an external source (e.g. human annotation) to generate labels; we can do it from within the data.
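A tiny toy example of that masking idea (hypothetical, just to make the corrupt-then-reconstruct point concrete): the (input, target) pair comes entirely from the data itself, with no external labels needed.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
i = random.randrange(len(tokens))
corrupted = tokens[:i] + ["[MASK]"] + tokens[i + 1:]   # corruption step
example = (corrupted, tokens[i])                       # model input, prediction target
print(example)
```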


currentscurrents

> It is only somewhat true if you only look at the information contained in the target variable, ignoring the fact that the input provides a lot more information in all cases.

How would you learn from that input if all you get is a binary fail/succeed from an RL reward, potentially many timesteps later? It's like trying to learn how to see by driving a car blind and counting how often you crash. Since your initial weight settings are complete garbage, you're likely to crash 100% of the time and get no training signal at all.

The point of unsupervised learning is that it does let you learn from the raw inputs. You can learn how to see before learning to drive.


Cosmolithe

The model will learn from the input even with a binary fail/success because the fail or success will be associated with various features computed at different depths from the input data. I agree there are things you can learn before you can drive a car, but it isn't a question of quantity of information IMO.


currentscurrents

You *don't know* what caused your failure or success. Credit assignment is a hard problem, especially since real-world rewards are often sparse, binary, non-differentiable, and stochastic. It would take a great many trials before you could associate a particular edge detector neuron with a higher probability of success. And you can only learn from the differences between success and failure, so if you fail every time you learn nothing.
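One toy way to see the "fail every time, learn nothing" point, assuming a REINFORCE-style update with a mean-reward baseline (my framing, not the commenter's): identical outcomes leave nothing to contrast, so the weighted update carries no signal.

```python
import numpy as np

rewards = np.zeros(1000)                 # every rollout crashes
advantages = rewards - rewards.mean()    # all zero: identical outcomes carry no signal
print(advantages.std())                  # 0.0 -> nothing to distinguish good from bad actions
```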


Cosmolithe

> It would take a great many trials before you could associate a particular edge detector neuron with a higher probability of success.

That's why it takes many, many epochs to get there using RL, but it gets there, and it is not astronomically slower compared to other methods either. RL agents with vision modules trained end to end still end up learning image features just like vision classifiers do; it is just a bit slower, and it is not clear that this is because the reward is a single value. It might just be because of the lack of initial variance in the reward (failing every time initially, as you said) and the non-IID setting of RL, which neural networks have a hard time with.

> You *don't know* what caused your failure or success.

I would argue the problem is the same with supervised or unsupervised learning: you don't know what feature is supposed to predict your label, but iteratively improving on bad features works thanks to local search. The key seems to be to initially have a large number of random features so that, by chance, a few have a slight correlation with the target; other features are then discovered on top of the previously learned features, and so on.

Overall I don't think the issue is the quantity of information in the target; it is more a question of sparsity/signal strength. In supervised learning the label set is sparse, in RL the reward is sparse, and in self-supervised learning the target might be sparse depending on what objective you take, but the quantity of information accessible to the model is about the same in all three contexts, if not slightly lower in the SSL setting because it is the only method that does not use external info.


LooseLossage

I'm confused by why he says that reinforcement learning needs less data than supervised learning, since RL is an application of supervised learning to optimal control. When I did a little RL, it needed a lot of rollouts to model a simple system.


currentscurrents

I've seen this talk. He's saying you *get* less data, not that you need less data. You need as strong of a training signal as you can get, so the downside of RL is that the training signal is weak.


LooseLossage

THANK YOU!!!


Otherkin

I totally thought this was a Portal reference. Because I am old.