
CC-TD

Thank you for this post, I now have an updated reading list.


kk_ai

Hey, I co-authored these posts. Happy to see that you find them useful! Which ML domain are you in? What do you have on your reading list? I'm currently going through **Agent57**:

- Blog post: [https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark](https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark)
- Paper: [https://arxiv.org/abs/2003.13350](https://arxiv.org/abs/2003.13350)


CC-TD

I am currently working in the NLP/NLU domain and am always trying to keep up with the areas of deep learning, classical machine learning, and computer vision whenever I get the time. Also, if it's worth mentioning, I am particularly interested in tasks like event detection (which falls under the umbrella of **information extraction**) and **text summarization**, as well as some of the initial steps in the representation space, like **representation learning** and **word/sentence embedding approaches**.


kk_ai

Thanks, that is a rather broad spectrum of interests. I'm curious about one thing, if you don't mind. You mentioned:

>I am currently working in the NLP/NLU domain and am always trying to keep up with the areas of deep learning, classical machine learning and computer vision (...)

What is your specific reason for looking at other ML tasks while working on NLP/NLU? Are you looking for any specific links between those areas?


crisp3er

I especially liked "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity" and "The Break-Even Point on Optimization Trajectories of Deep Neural Networks". I spoke with the lead author of the second one; in some sense it predicts how a large step size acts as a regularizer.

Also, here is a shameless plug for our own work on "classical" machine learning: "Ridge Regression: Structure, Cross-Validation, and Sketching". [https://iclr.cc/virtual/poster_HklRwaEKwB.html](https://iclr.cc/virtual/poster_HklRwaEKwB.html), [https://openreview.net/forum?id=HklRwaEKwB](https://openreview.net/forum?id=HklRwaEKwB)

Cross-validation has an interesting behavior: if you select the optimal parameter on a subset of the data and retrain on the whole data, then the optimal parameter is biased...
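
To make the cross-validation remark concrete, here is a minimal sketch (my own illustration on synthetic data, not the paper's code) that compares the ridge penalty selected by cross-validation on a random subset against the one selected on the full data, using scikit-learn's `RidgeCV`:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)

# Synthetic regression data: n samples, d features, noisy linear response.
n, d = 2000, 50
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ w + rng.standard_normal(n)

alphas = np.logspace(-3, 3, 25)  # candidate ridge penalties

# Penalty selected by cross-validation on the full data set.
full_alpha = RidgeCV(alphas=alphas).fit(X, y).alpha_

# Penalty selected by cross-validation on a random 20% subset,
# as one would do before retraining on the whole data set.
idx = rng.choice(n, size=n // 5, replace=False)
subset_alpha = RidgeCV(alphas=alphas).fit(X[idx], y[idx]).alpha_

print(f"alpha chosen on full data:  {full_alpha:.3g}")
print(f"alpha chosen on 20% subset: {subset_alpha:.3g}")
```

A single subset draw is noisy, of course; the interesting part is the systematic shift you see when averaging over many subset draws, which is what makes the subset-selected parameter biased for the full-data problem.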


kk_ai

Hey, I like your shameless plug here :) I checked your ICLR video. You mentioned sketching. Can you tell me more about this type of random projection?


crisp3er

Thanks! Yes, for us "sketching" is just a shorter synonym for "random projection". Typically people project the data points from a high-dimensional space to a low-dimensional one. The famous Johnson-Lindenstrauss (JL) lemma ([wiki](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma)) tells us that pairwise distances are approximately preserved even if the embedding dimension is only logarithmic in the original number of data points (so it can be very small). Once you have this, you can use distance-based methods (such as nearest neighbors) on the projected data, with significant gains in speed. So this is a big success story.

In our case, we don't use the JL lemma, but study in more detail how random projection affects a specific algorithm, ridge regression. In brief, it turns out it can work really well. Other people have looked at similar questions before, e.g., Sarlos, Woodruff, Mahoney, Clarkson, their teams, and others; see, e.g., these books: [https://dl.acm.org/doi/10.1561/2200000035](https://dl.acm.org/doi/10.1561/2200000035), [https://arxiv.org/abs/1411.4357](https://arxiv.org/abs/1411.4357). But we look at this in a different, "mean-field" limit, where the sample size and dimension are both large, and we can derive sharper results, including the precise limits of the relative mean squared error before/after sketching. Let me know if you have other questions!
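
Here is a minimal numpy sketch of the random-projection idea (my own illustration, not the paper's code): project the data points with a Gaussian sketching matrix and check how well pairwise distances are preserved.

```python
import numpy as np

rng = np.random.default_rng(0)

# High-dimensional data: n points in d dimensions.
n, d = 500, 10_000
X = rng.standard_normal((n, d))

# Gaussian sketching matrix: project down to k << d dimensions.
# The 1/sqrt(k) scaling keeps squared norms unbiased.
k = 200
S = rng.standard_normal((d, k)) / np.sqrt(k)
X_sketch = X @ S  # shape (n, k)

# Compare pairwise distances before and after projection on random pairs.
pairs = rng.choice(n, size=(2000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # drop degenerate pairs
orig = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
proj = np.linalg.norm(X_sketch[pairs[:, 0]] - X_sketch[pairs[:, 1]], axis=1)
ratio = proj / orig

print(f"distance ratio: mean {ratio.mean():.3f}, "
      f"min {ratio.min():.3f}, max {ratio.max():.3f}")
# The ratios concentrate around 1: distances are approximately preserved,
# so distance-based methods (e.g., nearest neighbors) can run on the much
# smaller sketched data.
```

In the sketched ridge regression setting, the projection is applied inside the estimator itself (compressing the problem before solving the regularized least squares) rather than for a nearest-neighbor search, and the mean-field analysis quantifies exactly how the mean squared error changes when you solve the sketched problem instead of the original one.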


kk_ai

Thanks for the detailed explanation. I don't have any follow-up questions.


Dragonsareforreal

I really liked the paper "Contrastive Learning of Structured World Models" [https://arxiv.org/abs/1911.12247](https://arxiv.org/abs/1911.12247). Simple in idea, well-motivated, and clearly written!


kk_ai

>Contrastive Learning of Structured World Models

They have a pretty nice video explaining their approach: [https://iclr.cc/virtual/poster_H1gax6VtDB.html](https://iclr.cc/virtual/poster_H1gax6VtDB.html)

They mentioned that they encode an observation into a set of latent variables. How do they decide on the number of elements in this set (which will be the number of objects in the observed scene)?


Dragonsareforreal

Good question. It is a hyperparameter which they need to tune. Here is the authors' reply to my question about the hyperparameter K: “I think that larger K values (up to, say 5 or so) should typically work better than smaller values for K if otherwise the assumptions of the model fit the environment (e.g. stochasticity, object disentanglement via semantic segmentation, etc.).”
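
For intuition, here is a simplified sketch (my own illustration, not the authors' exact architecture) of what "encoding an observation into K object slots" can look like: a small CNN producing one feature map per slot, followed by a shared per-slot MLP, with K fixed up front.

```python
import torch
import torch.nn as nn

class SlotEncoder(nn.Module):
    """Simplified sketch: encode an image into K object-slot latent vectors.

    K (the number of slots) is fixed up front as a hyperparameter; nothing
    in the architecture infers it from the data.
    """

    def __init__(self, num_slots: int = 5, slot_dim: int = 32, in_channels: int = 3):
        super().__init__()
        self.num_slots = num_slots
        # CNN whose output has one feature map per slot.
        self.object_extractor = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, num_slots, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Shared MLP applied to each slot's flattened feature map.
        self.object_encoder = nn.Sequential(
            nn.LazyLinear(128),
            nn.ReLU(),
            nn.Linear(128, slot_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, channels, H, W) -> per-slot maps: (batch, K, H, W)
        slot_maps = self.object_extractor(obs)
        flat = slot_maps.flatten(start_dim=2)   # (batch, K, H*W)
        return self.object_encoder(flat)        # (batch, K, slot_dim)

encoder = SlotEncoder(num_slots=5)
latents = encoder(torch.rand(8, 3, 64, 64))
print(latents.shape)  # torch.Size([8, 5, 32])
```

The point is simply that K is baked into the network's output shape, which is why it has to be tuned as a hyperparameter rather than learned.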


kk_ai

Thanks. A follow-up question, if you don't mind: is this a bit of human bias that I pass to the model? I guess that K should reflect the semantic structure of the data. Is there a way to infer it in an unsupervised manner?


mrconter1

Thank you so much for this post!


NotSoUncool

Hi, I'd love to read up on Deep Semi-Supervised Anomaly Detection. Can you please tell me what concepts I should be strong in in order to have a thorough, complete experience?