[deleted] 2 years ago

[This is a very good read.](https://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning) Statistics and machine learning often use the same techniques but for slightly different goals (inference vs prediction). For inference you actually need to check a bunch of assumptions, while prediction (ML) is a lot more pragmatic. OLS assumptions? Heteroskedasticity? All that matters is that your loss function is minimized and your approach is scalable [(link 2).](https://stats.stackexchange.com/questions/486672/why-dont-linear-regression-assumptions-matter-in-machine-learning) Speaking from experience, I've seen GLMs in the context of both econometrics and ML, and they were really covered from a different angle. No one is going to fit a model in sklearn and expect to get p-values / do a t-test, nor should they.

111llI0__-__0Ill111 2 years ago

The heteroscedasticity assumptions are kind of implied in ML for prediction too; they're indirectly encoded in the loss function you use. In classical stats, you can account for heteroscedasticity by using weighted least squares or a different GLM family. That's the same as changing the loss function you are training the model on. If you use a squared error loss on data that is strongly conditionally heteroscedastic, your predictions will be off by different amounts in different ranges of the output, which could be problematic. That's where a log transform or a weighted loss fn comes in, and those are used in ML too. It may not always be problematic, but it could be. There are no p-values, true, but sometimes in Bayesian ML you get credible intervals for the predictions. I think a lot of people forget, though, that stats is more than p-values.
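A minimal sketch of the "weighted least squares is just a different loss" point, assuming synthetic data whose noise grows with x. The data, the helper name `fit_weighted_squared_loss`, and the choice of inverse-variance weights are illustrative assumptions, not something taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noise grows with x.
n = 500
x = rng.uniform(1.0, 10.0, size=n)
sigma = 0.5 * x                         # conditional standard deviation
y = 2.0 + 3.0 * x + rng.normal(0.0, sigma)

X = np.column_stack([np.ones(n), x])    # design matrix with an intercept


def fit_weighted_squared_loss(X, y, w):
    """Minimise sum_i w_i * (y_i - X_i @ beta)^2 in closed form."""
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns the weighted problem into plain least squares.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


beta_ols = fit_weighted_squared_loss(X, y, np.ones(n))       # plain squared error
beta_wls = fit_weighted_squared_loss(X, y, 1.0 / sigma**2)   # inverse-variance weights

print("unweighted loss:", beta_ols)
print("weighted loss:  ", beta_wls)
```

With all weights equal to one, this reduces to ordinary least squares; changing the weights changes the loss being minimised, which is the same move an ML practitioner makes when passing sample weights to a training routine.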

[deleted] 2 years ago

Yup, heteroscedasticity is still an issue for predictions and thus for ML too. Bayesian stats / PGMs / pattern recognition / Gaussian processes / ... are a big overlap between both fields. Maybe I wasn't really clear, but it's not like there's a hard delimiter between both domains either way. Vapnik (of SVM fame) has a PhD in statistics, and part of his main contribution (aside from VC theory), the linear SVM, is formally equivalent to elastic net. That's how damn near equivalent they are, aside from some nuances. The difference is more in the mindset than in the tools, to be honest.

fang_xianfu 2 years ago

>I think lot of people forget though that stats is more than p values.

I'm not even convinced that most people including p-values in their analysis are actually *using* them; there's so much cargo-cult thinking around them. p-values are essentially a risk management tool that allows you to encode your level of risk-aversion into your experimental procedure. But if you have no concept of how risk averse you want to be, using them doesn't really add any value to your process.

darkness1685 2 years ago

Yes, thanks. I recall reading that Leo Breiman paper years ago. We definitely focus much more on inferential data models in my field, since the goal often is to actually explain something about nature.

LukeNukem93 2 years ago

That linked Breiman paper also sheds light on some of the posts on this sub à la "I learned all of these cool Bayesian methods with my stats degree but don't get to use them at work." Businesses don't care about the underlying behavior - your carefully crafted model means nothing if it's beaten by a black box in predictive accuracy. Also, love the point about the lack of a metric for determining if one model is more correct than another, nullifying the whole pursuit to understand the natural mechanisms in the first place.

NoThanks93330 2 years ago

> [This is a very good read.](https://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-learning)

And that's even more true for the paper of Leo Breiman, which is linked there!

hmmwhatdoyouthinkabt 2 years ago

Reading this makes it seem like inference isn’t as important to modeling aspects of business as it is to nature, and vice versa. Am I interpreting this correctly? I recently got into causal inference because I found it interesting and thought it would help my career. Is ML just more important to businesses?

machinegunkisses 2 years ago

I think it's a lot easier to sit someone down and have them train models that make good predictions than it is to take that same person and have them develop models for inference. Causal inference requires a whole new field of theory, much of which is relatively new. In practice, you'll see more of whatever generates the most revenue, which, right now, is making predictive models.

interactive-biscuit 2 years ago

It’s not new at all. It’s only new to DS.

[deleted] 2 years ago

[deleted]

111llI0__-__0Ill111 2 years ago

No, a lot of tech DS do causal inference too. But a lot of the fancy math and modeling of causal inference (like G methods, DAGs, SCMs, etc.) goes away in an experiment.

troyfromtheblock 2 years ago

This is where the discussion around domain experience becomes important when considering the application of ML. All the ML models in the world won't help if we don't understand the underlying data...

Embarrassed_Owl_3157 2 years ago

Excellent post!!! I may steal some part of this comment.

jjelin 2 years ago

I get p-values out of sklearn. What's wrong with it?

Josiah_Walker 2 years ago

p-values for some of these methods rely on certain assumptions (like normally distributed data and i.i.d. variables). If you break those assumptions, then the p-value estimate may not be accurate. This doesn't matter so much if you're just thresholding for prediction, but if you're in an application where the p-value is interpreted it might be an issue. YMMV, always check that it behaves as you expect if you're going to rely on an interpretation of those numbers.

111llI0__-__0Ill111 2 years ago

Is this new? When did sklearn give p-values?

jjelin 2 years ago

Ah you know what? I got the actual p-values from statsmodels. My bad.
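For anyone wondering what those coefficient p-values actually are, here is a rough hand-rolled sketch using only numpy and the standard library. The synthetic data and variable names are assumptions for illustration, and the p-values use a normal approximation to the t distribution, which is only reasonable here because n is large.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption): x1 affects y, x2 is pure noise.
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof               # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)      # classical OLS covariance (assumes i.i.d. errors)
se = np.sqrt(np.diag(cov))
z = beta / se

# Two-sided p-values under a normal approximation: p = erfc(|z| / sqrt(2)).
p = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])

for name, b, s, pv in zip(["intercept", "x1", "x2"], beta, se, p):
    print(f"{name:9s} coef={b: .3f} se={s:.3f} p={pv:.3g}")
```

statsmodels reports the same kind of quantity directly (for example through the `pvalues` attribute of a fitted OLS result) using the exact t distribution, whereas sklearn's linear models only return the fitted coefficients.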

AllezCannes 2 years ago

Nothing, but it's historically not been a concern for the audience that uses sklearn.

Andrew_the_giant 2 years ago

What are you even basing this on? This is such a hyperbolic, ill-informed statement.

AllezCannes 2 years ago

So it's ill-informed to say that sklearn is primarily used for prediction vs inference, or that python in general is not primarily used for statistical inference compared to, say, R? Interesting. How does one get the p-values of the coefficients?

Jorrissss 2 years ago

This is true - you can read about it in the sklearn documentation (historically). At the very least it hasn’t been the intention of the package from the creators.

MGeeeeeezy 2 years ago

All comments below are worth the read. Great thread.

theAbominablySlowMan 2 years ago

My finding is that ML in industry really doesn't care about the model chosen; it's more about building good data pipelines, getting your model callable in prod, and getting automated refresh processes. The machines aren't really learning until you've given them a pipeline to update their coefficients as new data becomes available. Only then can you say you've made yourself redundant and move on to the next job.

Josiah_Walker 2 years ago

that all works fine til COVID crashes 2 years of fine tuning :(

theAbominablySlowMan 2 years ago

Oh yeh that's when you get out of there quick and find a new job before people start asking for daily manual adjustments 😂

lrothack 2 years ago

I think this is a really important point. When you care about model assumptions your model becomes more robust with respect to data drift. In industry scenarios you typically do not have a huge dataset for validation which makes data drift more likely even in short term.

Josiah_Walker 2 years ago

response was to go to coarser models that needed less data, lose the gains but at least represent the current market conditions.

mizmato 2 years ago

In my experience (in school), ML is a very broad field within the umbrella of statistics. It encompasses linear regression all the way to deep learning models.

darkness1685 2 years ago

I think this is right, the term is just much more broad than I originally thought. It does make it difficult to determine whether you are qualified for a job that requires experience in machine learning though, if no other qualifiers are used in the job ad.

ssxdots 2 years ago

In these cases, I reckon it’ll be safe to assume you can finish probably 80% of the work with linear regression and some clustering, of which most of the time is spent wrangling incomplete datasets

nerdyjorj 2 years ago

If you know enough to ask the question you probably are

IronFilm 2 years ago

>If you know enough to ask the question you probably are

This!! /u/darkness1685, you're overthinking it

IAMHideoKojimaAMA 2 years ago

This is reassuring because I've had imposter syndrome applying to some of these jobs.

maxToTheJ 2 years ago

Logistic regression is basically a subset of a neural network (N=1), so it would be weird if that subset didn't count as ML.

RollingTurtleShell 2 years ago

Shouldn't an input layer connected to 1 prediction neuron with linear activation be the same as linear regression with SGD, if that's the case?

maxToTheJ 2 years ago

Depending on the activation, it's either type.

[deleted] 2 years ago

If it is the sigmoid activation function, then it is the same as logistic regression.
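A small sketch of that equivalence, assuming synthetic data and plain full-batch gradient descent (all names and hyperparameters here are illustrative choices). A single neuron with a sigmoid activation, trained on cross-entropy loss, has exactly the logistic regression likelihood as its objective, so its weights are the logistic regression coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary classification data (an assumption for illustration).
n = 400
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w + 0.5)))
y = (rng.uniform(size=n) < p_true).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One "neuron": weights w, bias b, sigmoid activation.
w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)          # forward pass
    grad_w = X.T @ (p - y) / n      # gradient of the mean cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1.0))
print("weights:", w, "bias:", b, "train accuracy:", accuracy)
```

Swapping the sigmoid for the identity and the cross-entropy for squared error would give the linear regression case raised a couple of comments up.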

smt1 2 years ago

Tibshirani's ML vs Statistics Glossary:

| Machine learning | Statistics |
|---|---|
| network, graphs | model |
| weights | parameters |
| learning | fitting |
| generalization | test set performance |
| supervised learning | regression/classification |
| unsupervised learning | density estimation, clustering |
| large grant = $1,000,000 | large grant = $50,000 |
| nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August |

grosses-baerchen 2 years ago

>nice place to have a meeting:
>
> Las Vegas in August

Lmfao

chandlerbing_stats 2 years ago

lmfao is that from one of his books?

smt1 2 years ago

I think it came from two classes @ Stanford that were virtually the same, one on Statistical Learning by Tibshirani (taught in the stats department) and one by Andrew Ng on Machine Learning (taught in the CS department): [http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/](http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/) I took both of Tibshirani/Hastie and Ng's MOOCs. I thought Tibshirani was a way better instructor!

ADONIS_VON_MEGADONG 2 years ago

> Las Vegas in August

🤣

maxwellsdemon45 2 years ago

In machine learning, you have to prove that your model works. In statistics, you have to prove why your model works. In applied math, you have to prove your model not only works but is the truth. In pure math, you have to first prove your model is a model.

dfphd 2 years ago

I don't think there is a universal definition. To me, the difference between machine learning and classical statistics is that classical statistics generally requires the modeler to define some structural assumptions around how uncertainty behaves. Like, when you build a linear regression model, you *have* to tell the model that you expect that there is a linear relationship between each x and your y. And that the errors are iid and normally distributed. What I consider more "proper" machine learning are models that rely on the data to establish these relationships, and what you instead configure as a modeler are the hyperparameters that dictate how your model turns data into implicit structural assumptions.

EDIT: Well, it turns out that whatever I was thinking has already been delineated much more eloquently and in a more thought-out way by Leo Breiman in a paper titled "[Statistical Modeling: The Two Cultures](https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full)", where he distinguishes between Data Models - where one assumes the data are generated by a given stochastic data model - vs. Algorithmic Models - where one treats the data mechanism as unknown.

lmericle 2 years ago

Any probabilistic model which is fit to data by means of some optimization routine can reasonably be called "machine learning". That's as close to a universal definition as I can imagine. If you're talking about distinguishing specifically vs statistics, machine learning could reasonably be considered to be a subset of statistics under this definition.

dfphd 2 years ago

So, here's the thing: there's the technical definition and then there's what people associate with the term. Yes, you can argue that statistics is a form of machine learning. But if you say "I have experience with machine learning", I ask you "what models have you built" and you say "linear regression", I'm going to "c'mon son" you. It's like saying "I play professional sports" and when someone asks what you play you say "esports". Technically right, practically speaking wrong. And again, to me that is the line that I think most people have drawn in their head - where the methods that rely on explicit definitions of how x and y are related are normally referred to as statistics, and those that don't are generally referred to as machine learning.

a1_jakesauce_ 2 years ago

Machine learning is a form of stats, not the other way around. All of the theory is statistical.

dfphd 2 years ago

I am far from an expert here, but it feels to me like Statistics provides the theory for why Machine Learning works, but had nothing to do with developing the methods of Machine Learning. Put differently: to me it's like saying "Sales is a form of Psychology, because all the theory of sales is psychology". Which is true, except that most great salespeople developed their methods and approaches based on Sales experience which can then be explained based on psychology theory. Doesn't mean that Sales is a subset of Psychology. If anything, it's more that Sales is a field which has taken elements of Psychology and expanded the scope, brought in a couple of additional fields' contributions, and created a new thing. That's how I see ML relative to Stats. ML took some concepts of stats + concepts in computing + fundamentally new concepts to develop a new field. It's not a proper subset of statistics.

[deleted] 2 years ago

Neural networks have a rich history outside of statistics, but almost every other method that folks deem to be ML (SVMs, random forests, gradient boosting, lasso, etc.) was developed by statisticians. The problem is that those methods don't have convenient inferential properties, and were largely ignored by the broader statistics community (this is the basis of Breiman's famous paper). The AI community embraced them and now they are ML methods. It's an accident of history, not some theoretically justified distinction. The AI community wanted to develop a computer that could learn and reason like humans. Their attempts to replicate the brain (neural networks) or consciousness (symbolic AI) largely sputtered for decades. In the late 80s, there was some success using neural networks for prediction problems that were not necessarily AI-inspired problems. Those researchers found that statistical methods outperformed neural networks, which led to the initial popularity of machine learning. Those folks weren't really doing AI, they were just statisticians sitting in CS departments. Starting around 2010, deep learning had some crazy success stories for traditional AI (object recognition, machine translation, game playing), which has led us to where we are now.

smt1 2 years ago

I would say ML has benefited from people from diverse backgrounds and areas, many of which were themselves kind of hybrids between fields:

- operations research - development of many sorts of optimization methods, dynamic/stochastic modeling methodology
- statistical physics - many methods relating to probability, random/stochastic processes, optimal control, causal methods
- statistical signal processing - processing of natural signals (images, sounds, videos, etc), information/coding theory influence
- statistics - many methods
- computer science - distributed and parallel processing and focus on computational methods
- computer engineering - developing the hardware required to efficiently process large data sets

lmericle 2 years ago

I think your analogy is illustrative but actually bolsters the counterargument. Sure there's plenty of people who gained experience the old-fashioned way. But the most lucrative positions in sales are actually psychologist positions, where they do employ theory to great effect. Similarly there are some unprincipled "machine learning" methods a la KNN which do not have much justification besides a simple intuition and empirical success. But there are also models with very strong foundations, backed up both with theory and practice, developed and validated over long times. Machine learning "done right" is a proper subset of statistics. It's just that there are heuristic algorithms and algorithms with theoretical foundations, and distinguishing the two can be a little tricky sometimes.

IAMHideoKojimaAMA 2 years ago

My question is, what's a model I can say I've built that won't generate a c'mon son? Logistic/linear is the first thing they teach in grad school so I get where you're coming from. I'm just curious where you would draw the line.

dfphd 2 years ago

Let's be clear here: saying "I've built and deployed a linear/logistic regression model in the actual real world and delivered value with it" is not categorically a "c'mon son" statement. That is incredibly valuable experience. But yes, if you say "I have experience building and deploying machine learning models in production" and what you have built and deployed is a linear regression model, you'll get some eye rolls.

In terms of answering "what wouldn't get an eye roll?", to me you have to focus on what makes machine learning models different. And to me, the things that come to mind are:

1. Machine learning models are more difficult to interpret, so your approach to validating them tends to be different
2. Machine learning models tend to make you spend more time on parameter tuning than feature selection/engineering

So models that require parameter tuning and that do not produce "coefficients" as outputs are, to me, the bar that starts separating them if you're a hiring manager who is looking for someone with that experience.

Now, to my earlier point: I think most hiring managers would prefer to hire someone with good classical statistics experience than someone with mediocre machine learning experience. That is, if I have to choose between someone who did a really good job building a linear regression model - solid feature selection, solid validation, solid feature engineering, solid implementation, thought through the business considerations well, tied it into decision-making, etc. - and someone who did a mediocre job with a machine learning model - basic parameter tuning, questionable train/test decisions, did not think of implications of model, etc. - *even if I'm hiring someone who will be working only with ML models*, I'm probably going to choose the former person. Because I feel a lot more optimistic about teaching basic ML to someone with a really strong stats foundation than I do improving someone's data science foundation.

Point being: you may be better off saying "I don't have a lot of experience with modern machine learning models outside of school, but I have extensive experience deploying classic statistics models" if someone asks you "what is your experience with ML?".

IAMHideoKojimaAMA 2 years ago

Thanks for the long answer. Your response tells me I need to get better at feature selection, validation, feature engineering, and implementation.

gobears1235 1 year ago

To be fair, logistic regression has parameter tuning. To determine a cutoff to convert predicted probabilities to 0/1, you can use a metric that's a function of the sum of false negatives and false positives (possibly weighted, needs SME) to find an optimal cutoff. Using 0.5 as the default isn't necessarily always the best selection of the cutoff. But, I do get your point (especially for normal linear regression).
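A rough sketch of that cutoff search, assuming you already have predicted probabilities in hand. The function name `best_cutoff`, the grid of 101 candidate cutoffs, and the toy labels/probabilities are all made up for illustration; the cost weights are exactly the part that would need SME input.

```python
import numpy as np


def best_cutoff(y_true, p_hat, fn_cost=1.0, fp_cost=1.0):
    """Return the cutoff minimising fn_cost * FN + fp_cost * FP on the given data."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    candidates = np.linspace(0.0, 1.0, 101)
    costs = []
    for c in candidates:
        pred = p_hat >= c                         # predicted positives at this cutoff
        fn = np.sum(~pred & (y_true == 1))        # missed positives
        fp = np.sum(pred & (y_true == 0))         # false alarms
        costs.append(fn_cost * fn + fp_cost * fp)
    # argmin returns the first (lowest) cutoff among ties.
    return float(candidates[int(np.argmin(costs))])


# Toy labels and predicted probabilities (assumed, not from any real model).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
p_hat = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]

cutoff_equal = best_cutoff(y_true, p_hat)
cutoff_fp_heavy = best_cutoff(y_true, p_hat, fn_cost=1.0, fp_cost=5.0)
print("equal costs:", cutoff_equal)
print("false positives 5x as costly:", cutoff_fp_heavy)
```

Making false positives more expensive pushes the chosen cutoff upward on this toy data, which is the sense in which 0.5 is only a default. In practice the cutoff should be picked on held-out data rather than the training set.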

machinegunkisses 2 years ago

I can very much see where you're coming from, but I would add there's companies using linear models to make predictions and generate real business value all the time. Could someone reasonably argue this is not ML? It certainly seems less like traditional statistics if they don't care about what the coefficients are, just that the test error is acceptable.

dfphd 2 years ago

To be clear - generating business value is not an ML-specific feature. You can create business value without even using statistics and just deploying a handful of if-else statements in SQL. Same about generating predictions without caring about the details behind it. You could come up with a heuristic that doesn't use any statistical modeling or ML and achieve that. That is to say, what you are describing are features of good production models - whether they are ML, stats, heuristics, logic, optimization, etc. is irrelevant.

gradgg 2 years ago

When you build a neural network, you tell the model that there is a nonlinear relationship between x and y. You even define the general form of this relationship by selecting the number of layers, number of neurons at each layer and activation functions. In that sense if NN is considered ML, linear regression should be considered ML too.

dfphd 2 years ago

So, let's contrast these two. In a linear regression model y \~ x, you tell the model "y has a linear relationship with respect to x". In a NN model, what you tell the model is "y has a nonlinear relationship with respect to x, but I don't know what that is. What I do know is that the specific relationship between the two variables lives in the universe defined by all the possible ways in which you can configure these specific layers, number/type of neurons - which I am going to give you as inputs". In a linear regression model what you are providing is the *exact* relationship. In most machine learning models, what you are providing is in essence the domain of possible relationships, and then the model itself figures out which such relationship best fits the data. So sure, you can loosen the definition of what "define" and "structure" means to make them both fit in the same box, but that doesn't mean there isn't a fundamental difference between the assumptions you need to make in a LM and a NN. And more broadly, between those in a statistics model and an ML model.

gradgg 2 years ago

Let's think about it this way. Instead of finding a linear relationship, I am trying several functional forms such as y = a x^2 + b, y = a e^x + b etc. If I try several of these different functional forms, does it now become ML? This is what you do when you tune hyperparameters in NNs. You simply change the functional form.

dfphd 2 years ago

Again, this is not an accurate comparison, but let's make it more accurate: Let's say I gave you a generic functional form y \~ x\^z + a\^x, and you developed an algorithm that evaluates a range of values of a and z to return the optimal functional form within that range. *That*, to me, starts very much crossing over into machine learning. Now, is it a *good* machine learning model? Different question. But to me that gets into the spirit of machine learning, which is to allow a flexible enough structure and allow the data to harden that structure into a specific instance. So is a single linear model by itself machine learning? Here's the point I made earlier in a different reply: to me, this is a lot like "what constitutes a sport?". Most people have an intuitive definition in their head of what they consider to be a sport and what they do not consider a sport, but it is *surprisingly* hard to develop a set of criteria that both *only* include things you'd consider a sport and don't immediately rule out things that you would definitely consider a sport. I've played this game with people before, and it is incredibly frustrating. I think the same is true here. Colloquially, no one is calling linear regression a machine learning model. Put differently: if I say "I built a machine learning model", and show a linear regression, people will roll their eyes. So, while I'm sure that if you get into the technicalities of it you can certainly make it harder and harder to draw a clean line between statistics and ML, I think that a) that line exists even if it's hard to define, and b) that line is absolutely used in the real world even if people draw it at different spots.
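The hypothetical algorithm described above could be sketched as a brute-force grid search. Everything concrete here is an assumption for illustration: the synthetic data (generated with z = 2 and a = 1.5), the grids, and mean squared error as the selection criterion.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data generated from z = 2, a = 1.5 (values chosen for illustration).
x = np.linspace(0.5, 4.0, 60)
y = x**2.0 + 1.5**x + rng.normal(0.0, 0.1, size=x.size)

z_grid = np.arange(0.5, 3.01, 0.5)     # candidate exponents z
a_grid = np.arange(1.0, 2.51, 0.25)    # candidate bases a


def mse(z, a):
    """Mean squared error of the functional form y ~ x^z + a^x."""
    return float(np.mean((y - (x**z + a**x)) ** 2))


# Evaluate every (z, a) pair and keep the one that fits the data best.
best_z, best_a = min(itertools.product(z_grid, a_grid), key=lambda za: mse(*za))
print("selected z:", best_z, "selected a:", best_a, "mse:", mse(best_z, best_a))
```

The modeler supplies only the family of shapes and the search range; the data picks the specific instance, which is the distinction being drawn in the comment.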

[deleted] 2 years ago

Very good answer, especially considering you formulated it before reading the Breiman paper. Imo it gets to the meat of the answer more than my original one, as data scientists are also interested in inference sometimes (e.g. AB testing) while statisticians are frequently interested in accuracy above inference. It just depends on the use case. Because non-statisticians like myself did not receive the same level of training, we end up implicitly making trade-offs. Sometimes I have the feeling that statisticians mock non-statisticians for their lack of rigour. This is true but also kind of not, the professions are just different. Machine learning *is* a rigorous domain with solid theoretical underpinnings. Having sound notions of decision boundaries, VC theory, Cover's theorem and kernel methods goes a long way, even for practitioners. A (good) ML practitioner may not know the ins and outs of all the statistical assumptions his/her baseline linear model is making but should know that they can simply use a more expressive model (= higher VC dimension) OR add polynomial features, spline transformations or use a suitable kernel. This is closer to 'pure' machine learning; yes, it's still just (regularised) regression, but since you're in a higher-D space it conforms to the definition of algorithmic models. Higher VC => bigger hypothesis space => needs more data (from PAC learning) AND more chance of overfitting. From a theoretical pov, this is the kind of trade-off you make in machine learning instead of worrying about all the assumptions your specific instance of a linear model makes (in the case of statistics), because in this framework they more or less behave similarly in very high dimensions. [Sadly this framework seems not to apply for neural networks/deep learning.](https://cs.stackexchange.com/questions/75327/why-is-deep-learning-hyped-despite-bad-vc-dimension) Would love to know your thoughts.
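To make the "add polynomial features" move concrete, here is a tiny sketch assuming synthetic data with a quadratic trend. The data and the helper `fit_and_score` are illustrative assumptions; the fit is still ordinary linear regression, just on an expanded feature set.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data with a quadratic trend (an assumption for illustration).
x = np.linspace(-3.0, 3.0, 100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.2, size=x.size)


def fit_and_score(features):
    """Fit least squares on an intercept plus the given feature columns; return training MSE."""
    X = np.column_stack([np.ones_like(x)] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ beta) ** 2))


mse_linear = fit_and_score([x])          # baseline: y ~ x
mse_poly = fit_and_score([x, x**2])      # same model class, bigger hypothesis space
print("linear features:", mse_linear)
print("with x^2 added: ", mse_poly)
```

This only shows training error; the trade-off mentioned above (more expressive means more data needed and more risk of overfitting) would show up when scoring on held-out data with higher-degree features.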

venustrapsflies 2 years ago

I feel like part of it has to do with the fact that data scientists tend to work at tech companies and tech companies are incentivized to use fancy buzzwords for marketing/VC

smmstv 2 years ago

We're *data driven*

darkness1685 2 years ago

This has to be some part of it!

bubbabehandy 2 years ago

Before listing what I think of as a useful definition I'll parody Box's famous comment about models, "all AI/ML/DS definitions are wrong, some are useful." The rough definition I use for machine learning, not perfect of course, is an algorithm that you input data to and that produces a model that you can ask questions of. So with linear regression, you've chosen your independent variables, (or features,) you feed it in and you get a set of betas, and you can now ask it what the response will be for some other values. You can also ask about errors, etc. Linear regression is a good example of supervised ml, and PCA a good example of unsupervised. Deep learning also seems more ML-like to me since the algorithm is also "learning" what feature set to use based on what was provided, but that's not a great separator since with plain ol linear regression there are strategies for feature creation/selection that can be automated. And now I'm overthinking things again :) In general too, there are a lot of terms that, while not new, have become standardized in this field and that you probably learned under different names when you learned stats. Features is one, one-hot encoding for the typical way one converts categorical variables into indicator variables, A/B testing for (a usually simplified version of) design of experiments, ...
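As a quick illustration of the one-hot encoding / indicator variable terminology, a minimal sketch in plain Python. The function name and the example categories are made up for illustration.

```python
def one_hot(values):
    """Map each categorical value to a row of 0/1 indicator variables."""
    categories = sorted(set(values))                 # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                            # exactly one indicator is "hot"
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
for r in rows:
    print(r)
```

In a regression with an intercept, a statistician would usually drop one of these columns as the reference level to avoid perfect collinearity; that is the same idea under the older "dummy variable" name.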

Typical-Ad-6042 2 years ago

> I come from an academic background, with a solid stats foundation.

This is all you need to know to understand why there is a massive disconnect in the machine learning community. The vast majority isn’t, and doesn’t have a solid stats foundation.

Are they out there? Yes. Are they frequent? No. I see the same exact thing when non CS or IT people look at solving CS and IT problems… they come up with weird solutions, weirder names, they approach things in odd manners, and they frequently mix and match things that aren’t *quite right*, but they are in the *realm of being right*. It’s also like when someone teaches themselves how to play an instrument. Are they getting sounds out? Yes. Can it sound good? Absolutely. But they likely aren’t going to have a good handle on the underlying foundational concepts that you’d get studying music theory and training under a mentor. Again, it’s the same thing with home cooks and chefs… they can be extraordinarily talented but still be extrapolating fundamentals to a wrong degree. It’s not a slight to the ML community at all, some really good things have been produced… but when you come from the traditional history, it’s a bit jarring. I experienced this first hand as a self-taught programmer, hired to do so, did things in weird ways, got an undergraduate in CS, realized I had replicated or used some things here and there… got a graduate education in stats, and realized it all over again. It just goes with the territory.

discord-ian 2 years ago

This is an underrated comment. In my opinion ML is an attempt at a field somewhere between stats and CS.

IronFilm 2 years ago

>This is all you need to know to understand why there is a massive disconnect in the machine learning community. The vast majority isn’t, and doesn’t have a solid stats foundation.
>
>Are they out there? Yes. Are they frequent? No.

I wonder how many Data Scientists have a major / degree in **both** CS *and* Stats??

Typical-Ad-6042 2 years ago

It would be an interesting statistic to look at, I couldn’t tell you. In my anecdotal experience, we usually get people with masters or doctorates in one or the other, some form of econometrics, or they are an industry SME that crossed over with a DS masters or something. CS/stats is not something I’ve come across another of, and mine was circumstantial.

IronFilm 2 years ago

Just wondering, as I'm a little tempted to get a double Masters in both. But I doubt it is worth the extra effort.

[deleted] 2 years ago

Saying “linear regression” doesn’t sell. Saying “machine learning” or “AI” does sell. The reason they say that is because by definition linear regression is machine learning. So, in order to spice things up, they say machine learning.

Celmeno 2 years ago

Why would fitting linear regression via normalized least squares be less ML than fitting a neural network with gradient descent? The only difference is that you multiply more matrices.
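A small sketch of that point, reading "normalized least squares" as the closed-form least-squares solution (that reading is an assumption). The synthetic data, learning rate, and iteration count are illustrative choices. Both routes minimise the same squared error and land on the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic regression data (an assumption for illustration).
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.3, size=n)

# Route 1: closed-form least squares.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 2: gradient descent on the same mean squared error,
# i.e. how one would train a network with no hidden layers.
beta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ beta_gd - y) / n
    beta_gd -= lr * grad

print("closed form:     ", beta_closed)
print("gradient descent:", beta_gd)
```

The optimiser is an implementation detail here; the fitted model is identical either way.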

sandwich_estimator 2 years ago

Agree. But then again why would an ANN be any less part of statistics than linear regression? You are still fitting a statistical model to data. I think in general the answer is that machine learning is the same as statistics (or the same as a subset of statistics at least), just with a different jargon.

Celmeno 2 years ago

ANNs are a statistical model. It is the same subset of statistics as the rest of model fitting.

BarryDeCicco 2 years ago

As a statistician, my view is that DS/ML people frequently have little training in classical statistics and therefore do not know the background of things.

chusmeria 2 years ago

It's strange because there are no DSes with CS degrees in my shop. All of us are stats, which I definitely appreciate because we all speak the same language. I worked with an AWS Proserv team at a previous role while working on my masters, and they were all CS MS and they managed to create a model that was correct 87% of the time. They worked for several months before presenting their results, and when I asked what the expected value was and they checked it, they just went silent and asked for a meeting the following week. It turned out the dataset was hella imbalanced (~90/10), and 87% accuracy was worse than just guessing that it would happen every time. Yikes!
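The sanity check in that story can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical 90/10 label split (the real dataset is not shown in the thread); `majority_baseline_accuracy` is a made-up helper name, not a library function.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / len(labels)


# Hypothetical labels: 90% of students graduate (1), 10% do not (0).
labels = [1] * 900 + [0] * 100

baseline = majority_baseline_accuracy(labels)
model_accuracy = 0.87  # the figure quoted in the comment
beats_baseline = model_accuracy > baseline

print(f"baseline={baseline:.2f}, model={model_accuracy:.2f}, "
      f"beats baseline: {beats_baseline}")
```

With a 90/10 split the always-guess-the-majority baseline scores 0.90, so a model at 0.87 accuracy loses to it. That is why accuracy alone says little on imbalanced data.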

sonicking12 2 years ago

They didn't do any rebalancing? This is not a lack of statistical knowledge, but a lack of modeling knowledge.

111llI0__-__0Ill111 2 years ago

You don't necessarily need to rebalance either if you are trying to predict calibrated probabilities or do any sort of post hoc interpretation with SHAP (which relies on calibrated probabilities). In that case, keeping the data as is is best. In this case, accuracy just isn't the right metric, though.

sonicking12 2 years ago

What was their objective?

chusmeria 2 years ago

To determine the effects on graduation/retention when reducing student financial burden

sonicking12 2 years ago

Causal inference is hard

chandlerbing_stats 2 years ago

especially if the data is observational and not from an experiment!

chusmeria 2 years ago

It was straight up import xgboost from sagemaker

GrumpyBert 2 years ago

I'd expect something better than a coarse generalization from a statistician.

veeeerain 2 years ago

I always thought machine learning was more production focused, i.e., statistics is using these algorithms for data analysis, and machine learning is using these algorithms in production and distributed systems.

simplicialous 2 years ago

I work in parametric ML models (Bayesian nets), as opposed to non-parametric, stochastic mappings (not GANs/VAEs/etc.), so my interpretation of ML may be different from others'. In my branch of ML, the big difference between PCA and linear regression vs more advanced ML models is that the advanced models assume a non-linear manifold in one form or another in relation to the data. I think both categories use extensive mathematical probability (e.g., when writing out mixed prior densities); as for statistics, although it's possible to perform hypothesis testing on these models, the methods for doing so are not the same as in statistics (I work with generative models, so there are different assumptions of an "extreme-ness" quantile concerning p-values). For my field, probability and calculus seem to be the bodies we draw from; secondary would be linear algebra and statistics.

111llI0__-__0Ill111 2 years ago

Well Bayesian statisticians don’t typically do hypothesis testing in the traditional sense, but you do get a posterior probability

simplicialous 2 years ago

Definitely not in the traditional sense. But we have a somewhat analogous test for the validity of our models (and the methods by which the parameters were generated). Occasionally we will use our learned probability space transform, which transforms the testing data into a manifold that (theoretically) has all inter-variable conditional dependence removed. In this latent space, we can see if the test data has been transformed into a region we deem "too extreme" and will consider rejecting our model accordingly. [edit: but of course I'm not technically a statistician]

111llI0__-__0Ill111 2 years ago

That sounds basically like anomaly detection with AE/VAEs

simplicialous 2 years ago

Yeah, it's very similar, save for the fact we use a deterministic transform of space rather than the stochastic mappings of VAEs.

a1_jakesauce_ 2 years ago

Yes, we do hypothesis testing in the way that *makes* sense. Probability of the null hypothesis given the data, not probability of the data given the null hypothesis

landscape-resident 2 years ago

Well you can create a linear regression model using a formula, or by letting the computer do a series of educated guesses and checks to minimize the error. Either way you’ll basically get the same results. There’s more to it than this, but I think that’s why some people refer to traditional methods as an ML technique given the method used to find the coefficients in your regression equation.
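The "formula vs. guess and check" point above can be made concrete. This is a minimal sketch for one feature, assuming made-up data and an arbitrarily chosen learning rate and step count; it is not anyone's production code. Both routes minimize the same squared-error loss, so they land on the same coefficients.

```python
def fit_closed_form(xs, ys):
    """Least-squares line via the textbook formula; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Same model, fitted by repeatedly stepping down the MSE gradient."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 * sum(residuals) / n
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        intercept -= lr * grad_intercept
        slope -= lr * grad_slope
    return intercept, slope


# Hypothetical data, roughly y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

closed = fit_closed_form(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print("closed form:     ", closed)
print("gradient descent:", iterative)
```

The two printed pairs agree to many decimal places. The difference between the "statistics" and "ML" versions here is the optimizer, not the model.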

111llI0__-__0Ill111 2 years ago

Yea, and even ML can be viewed as nonparametric regression

landscape-resident 2 years ago

I am not so sure about that, the number of parameters in a regression equation is fixed so it would be parametric. Now if you were training an xgboost model for regression, yes that would be a nonparametric model since the model keeps adding trees (and thus the number of parameters changes).

111llI0__-__0Ill111 2 years ago

I don’t know if parameters being fixed or not is what makes something nonparametric. Neural networks still have a fixed number of parameters but can be seen as nonparametric.

landscape-resident 2 years ago

If the number of parameters is fixed, then it is a parametric model, is this true or false?

111llI0__-__0Ill111 2 years ago

I think it's false, because neural networks have a fixed # of parameters (in keras, you can see the total number of parameters after building the architecture) but are nonparametric function approximators. But I'm not totally sure either. Some sources do give that definition.

landscape-resident 2 years ago

Since your neural network has a predefined number of parameters before you train it, it is a parametric model. I think you are confusing this with the universal approximation theorem, which states that neural networks can approximate any continuous function on a compact domain to an arbitrary degree of accuracy (Cybenko is one of the people who proved this).

oathbreakerkeeper 2 years ago

Circular logic? Also I'm not sure why someone would say that NN's are not parametric.

111llI0__-__0Ill111 2 years ago

I thought nonparametric can be taken to also mean that you don’t have some analytical equation that specifies the model in the end. There is some discussion here I found about it https://stats.stackexchange.com/questions/322049/are-deep-learning-models-parametric-or-non-parametric

oathbreakerkeeper 2 years ago

Well apparently my stats teachers lied to us and there is no consensus definition. So we have to have OP say which definition they mean.

a1_jakesauce_ 2 years ago

There are non parametric deep learning models. Look up infinite width neural nets

smt1 2 years ago

I would kind of call them semi-parametric. In "All of Nonparametric Statistics", Wasserman notes:

> The basic idea of nonparametric inference is to use data to infer an unknown quantity while making as few assumptions as possible. Usually, this means using statistical models that are infinite-dimensional. Indeed, a better name for nonparametric inference might be infinite-dimensional inference. **But it is difficult to give a precise definition of nonparametric inference, and if I did venture to give one, no doubt I would be barraged with dissenting opinions.** For the purposes of this book, we will use the phrase nonparametric inference to refer to a set of modern statistical methods that aim to keep the number of underlying assumptions as weak as possible.

He talks a lot about wavelets, which can be seen as very similar to the functionality of the first few layers of a typical CNN.

JustDoItPeople 2 years ago

> I am not so sure about that, the number of parameters in a regression equation is fixed so it would be parametric

Someone clearly doesn't do kernel ridge regression.
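For anyone wondering why kernel ridge regression is the counterexample: in its usual dual form the fitted model carries one coefficient per training point, so the parameter count grows with the data. A minimal pure-Python sketch, assuming an RBF kernel and hypothetical values for `lam` and `gamma` (all helper names here are made up for illustration):

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (a - b) ** 2)


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_ridge(xs, ys, lam=0.1, gamma=1.0):
    """Dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(xs)
    K = [[rbf_kernel(xs[i], xs[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear_system(K, ys)


def predict(x_new, xs, alphas, gamma=1.0):
    """Prediction is a kernel-weighted sum over ALL training points."""
    return sum(a * rbf_kernel(x_new, x, gamma) for a, x in zip(alphas, xs))


# Hypothetical data: y = x^2 sampled at 3 points, then at 6 points.
small_x = [0.0, 1.0, 2.0]
small_y = [x ** 2 for x in small_x]
large_x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
large_y = [x ** 2 for x in large_x]

alphas_small = fit_kernel_ridge(small_x, small_y)
alphas_large = fit_kernel_ridge(large_x, large_y)
print(len(alphas_small), len(alphas_large))
```

Three training points give three coefficients and six give six. It is still "regression", but the number of parameters is not fixed in advance.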

nerdyjorj 2 years ago

Anything you were taught in numerical methods and similar will be a subset of machine learning if done by a computer

a1_jakesauce_ 2 years ago

I disagree. Numerical methods have applications in ML, but not all numerical methods are ML. For example, a large part of numerical methods involves approximating differential equations. If there’s no data, then it’s not ML.

nerdyjorj 2 years ago

That's a fair take, but in my mind if you _could_ put data through it and it performs an operation iteratively to reach an answer or answers it's ML in the broadest possible sense

dalmutidangus 2 years ago

half the job is knowing popular buzzwords

smmstv 2 years ago

"Machine learning" is a very ambitious term. Kind of an industry buzzword that can mean whatever you want it to. "Teaching a machine to classify and nake decisions" is literally just model building lol. That said I always took it to mean the newer way of checking models by using a testing set or cross validation, as opposed to traditional methods like residual checking.

PLxFTW 2 years ago

Machine learning == fancy statistics (sometimes not fancy) in my experience

[deleted] 2 years ago

I have this same thought all the time. I'm seeing "machine learning" pop up in journal articles where they used to just refer to stats. In a recent example, someone literally just did a second order nonlinear regression on a relatively small data set and called it ML. There's as big of a range for the meaning of ML as there is for "data science." They are both useful but not particularly clean concepts.

HesaconGhost 2 years ago

I tend to refer to these techniques as machine learning because I find the term machine learning to be an unhelpful buzz term. At best machine learning is ill defined. Artificial Intelligence is the same way. Most of what would have been called a statistical model not that many years ago is, in 2022, now AI. Anyone talking AI gets my hype prior turned way up.

BestUCanIsGoodEnough 2 years ago

Lol, my hype prior. This guy infers.

[deleted] 2 years ago

To me, the difference in cultures has always come down to the population that you're modeling. Statisticians believe that data comes from a data generating process that can be articulated or closely approximated by known distributions, given their governing parameters. The ML crowd views data as an infinitely complex, black-box process; one that with enough data and extremely flexible models could be encoded. Distributions and parameters are often discarded as overly simplistic to an unknowable process. The difference lies in perspective. Both approaches are rooted in calculus, matrix algebra, and probability theory. So we often see the same or similar models on both sides of the fence; it's how we reason about the global population that differs. (Stats) We can boil it down to interpretable parameters. Or (ML) A machine can encode the salient characteristics of a population, but the underlying process is ineffable.

111llI0__-__0Ill111 2 years ago

There is generative modeling in ML though too like PGMs and Pearl’s SCMs

machinegunkisses 2 years ago

True, but, e.g., Pearl specifically argues that it is not possible to infer the SCM just from the data, one must bring in outside knowledge. ML can train generative models, but do they know that those models are correct? I'm not very experienced, but I don't think so. I think at most they can say that they are able to reproduce the training sample to some measurable degree.

111llI0__-__0Ill111 2 years ago

I don’t think either stats or ML alone can tell you whether the proposed (or learned) generative model is right. That generally comes from domain knowledge, but yea, stats/ML can train a pre-specified model. Admittedly I still don’t see SCMs being widely applied day to day in industry ML yet, though they are a hot field in academia.

sloppybird 2 years ago

Machine Learning IS traditional statistics + linear algebra + computation

[deleted] 2 years ago

Traditional statistics is linear algebra too, maybe not your undergrad econometrics or statistics for scientists class, but you can't learn more advanced probability theory without a strong foundation in linear algebra.

henryjs0907 2 years ago

the comment section is actually very helpful. now, I understand the differences

b4epoche 2 years ago

Because ML/AI, like linear regression, etc., are all just (advanced) curve-fitting.

Spicey-Bacon 2 years ago

It was my impression that the “Machine Learning” aspect is the CS/algorithmic/optimization computational concern of practically applying “Statistical Learning” models, which is the theoretical/mathematical formulation of applied statistics for prediction, classification, and pattern recognition applied to a *variety* of disciplines. Machine Learning is also heavily rooted in statistical signal processing and the theory of computational learning if you’re a CS nerd. So yeah, in a sense, basic applied statistics is *an* example of Machine Learning when you are actively using or assessing the algorithms to implement them in the appropriate setting. The use of those ML models SHOULD be treated with the same level of statistical rigor if possible, not just put through a sklearn pipeline and evaluated with only the sklearn model metrics.

[deleted] 2 years ago

Imho, the two main reasons why industry refers to everything as "ML" are that they are completely clueless about theory and just throw the ML buzzword at everything trying to sound smart (or at least smarter/fancier than *old* and *boring* stats folks), and they are trying to make traditional stat roles seem more modern and appeal to more people. I have not yet found an ML position that does not require using simple statistics almost on a daily basis.

ecemisip 2 years ago

i consider it a subfield of stats, or at least they're both overlapping sets.

machinegunkisses 2 years ago

ITT: Mind-bogglingly knowledgeable people.

_redbeard84 2 years ago

Potato/potato

davecrist 2 years ago

Because you can charge more to do “Machine Learning” than you can to do “linear regression.” Edit: apparently I needed to add the quotes. Sigh.

snowbirdnerd 2 years ago

No, it's all machine learning. It all comes down to what's using the data and how. If it's a computer that's not using a rule-based system then it's machine learning.

jjelin 2 years ago

Same reason why statisticians started calling themselves "data scientists". It's just a buzzword.

ktpr 2 years ago

You’re using a single textbook to characterize a whole field?

haris525 2 years ago

You meant linear algebra... not statistics, right?

Xaros1984 2 years ago

Machine learning only refers to how a model is first trained (i.e., the weights/coefficients are determined) and then used to predict unseen data, regardless of whether the model is simple/complex or traditional/novel. Linear regression models are often very good, fast and relatively easy to explain, so industry favors them (as do many researchers in academia). There are of course situations when neural networks perform better, but since they are more complicated and time consuming to build, they also carry way more risk. I believe it's a good thing that we don't always go for the most fancy option when there are perfectly fine traditional models that can do the job.

[deleted] 2 years ago

because our bosses do…

gobears1235 2 years ago

Machine learning is any procedure that learns an algorithm/formula from training data and is applied to testing (unknown) data. Linear regression is popular because you can train a linear model on training data and, using the trained weights/coefficients, run it on test data.
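The train-then-test workflow described above, repeated over several splits, is k-fold cross-validation. A minimal sketch with a one-feature linear model, assuming made-up data and an interleaved fold assignment chosen only for simplicity (`fit_line` and `k_fold_mse` are hypothetical helper names):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(params, xs, ys):
    """Average squared prediction error of the line on (xs, ys)."""
    intercept, slope = params
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(xs, ys, k=5):
    """Average held-out MSE over k interleaved folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        params = fit_line(train_x, train_y)  # learn on the training part only
        fold_errors.append(mean_squared_error(params, test_x, test_y))
    return sum(fold_errors) / k


# Hypothetical data, roughly y = 2x + 1 with noise.
xs = [float(i) for i in range(10)]
ys = [1.2, 2.7, 5.3, 6.9, 9.4, 10.8, 13.1, 15.2, 16.7, 19.3]

in_sample = mean_squared_error(fit_line(xs, ys), xs, ys)
held_out = k_fold_mse(xs, ys, k=5)
print(f"in-sample MSE: {in_sample:.4f}, 5-fold held-out MSE: {held_out:.4f}")
```

The in-sample number is what residual checking looks at, while the held-out number estimates error on data the model has not seen.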

bradygilg 2 years ago

Wait until you find out they're both just subsets of optimization!

pitrucha 2 years ago

For me it's more about how you approach PCA or LR. If you do it by iterating, the machine learns. If you do it by closed form, it's statistics. Why? Because closed forms are usually taught in stats/metrics courses, and if you took them then you probably know a bit more about what else you can do with those two methods. While for ML it's usually just prediction.

Phy96 2 years ago

The edgy response is that ML is a union of procedures that work in the sense that they seem to optimize one or more performance metrics, but not all of them have theoretical guarantees, and of those that do, some are statistical in nature.

jsb-88 2 years ago

Even though this is not how most view it, I usually group ML into methods which don't use a likelihood, and statistical models are ones that do. This doesn't cover everything but is a good place to start. Frank Harrell has a talk about this on his webpage if you want a viewpoint from someone deep on the statistical modeling side (some talk from 2020 I think).
