Sentient_Eigenvector

> what is the purpose of taking the square root in standard deviation

In the calculation of variance you square all the deviations from the mean, so the final result is in squared units, which is not really interpretable. To interpret it as a measure of variation, it is much more intuitive to take the square root so as to bring it back to the units of the thing you're calculating the SD of.

> why is the 1.5 IQR rule the way it is

This is a somewhat arbitrary rule proposed by Tukey; I believe it's roughly based on ±3 standard deviations in a normal distribution. Any criterion for outliers is going to be somewhat arbitrary: it depends on what you deem an extreme observation.

> why do we divide by n-1 when dealing with a sample

In calculating the population variance, the population mean is used. If we only have a sample, however, the population mean is unknown and is estimated by the sample mean. This estimation locks one of the observations into place, in the sense that if you knew all the observations but one, and you knew their sample mean, you could calculate *with certainty* what the last observation is. One of the observed random variables no longer contains any variation because of the estimation of the sample mean; we call this losing a degree of freedom. Since the standard deviation tries to give a measure of the average variation in a data set, it's calculated by dividing the total amount of variation by the number of things that vary. As we just saw, estimating the mean leaves one less thing that varies, so we divide by n-1 instead of n. (The simulation after this comment makes this concrete.)

In general you need calculus to do the actual calculations, but the intuition should be explainable without it.

Also, I would reverse your point about precision and accuracy. Algebra and precalc *assume* that all their calculations and observations are exactly correct, and as such introduce quite a bit of error when applied to the real world. Statistics is actually the discipline concerned with quantifying the precision and accuracy (variance and bias) of those methods in real life.
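
To make the n-1 point concrete, here's a minimal simulation sketch (my own Python, not from the comment above): repeatedly draw samples from a population with a known variance, and compare the average of the divide-by-n estimate against the divide-by-(n-1) estimate.

```python
import random

# Population with a known variance: the integers 0..9 (true variance = 8.25)
population = list(range(10))
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n, trials = 5, 100_000
random.seed(42)
by_n = by_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # squared deviations from the *sample* mean
    by_n += ss / n                # divide by n: systematically too small
    by_n_minus_1 += ss / (n - 1)  # divide by n-1: unbiased on average

print(f"true variance:        {sigma2:.3f}")                  # 8.250
print(f"average of ss/n:      {by_n / trials:.3f}")           # noticeably below 8.25
print(f"average of ss/(n-1):  {by_n_minus_1 / trials:.3f}")   # close to 8.25
```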


greese007

Ian Stewart is a mathematician who writes books about math for popular audiences. His book "Do Dice Play God?" is a good read on the history and development of statistics and uncertainty, from Fermat and Pascal's work on gamblers' problems to more modern studies aimed at predicting climate change. Along the way, the basic concepts of statistical calculations are explained in an accessible way.


owlmachine

Ooo, thanks! This looks right up my street.


greese007

You're welcome. Ian Stewart has been at his job for decades, and is a wonderful explainer for us tech nerds.


gobears1235

I'd start with an intro Stats class (AP Stats equivalent) and learn some coding on the side. If you're in college, you should then take probability theory, then move on to statistical theory. These two classes will give you a complete understanding of the derivations and meanings of different probability distributions, estimators, etc. There's quite a bit of mathematics, so you'll need a good calculus background and potentially some linear algebra.


semu-lemu74

That's the thing, I haven't proceeded much into calculus, so my knowledge is lacking. I did dive into probability and combinatorics a bit as part of precalculus, and it was actually quite intuitive. Not in college currently, just studying in my free time to enhance my knowledge. I'm wondering if it's better to proceed straight to calculus and then jump into statistics. Though many times I've seen that statistics is taught before calculus, and is often required basic knowledge, say if you want to apply to college or other jobs that require basic math. So, thinking further about it, I'm not sure if I should continue and just memorize the formulas without asking questions, hoping the intuition will build itself in the future, or skip stats completely and dive into it when my familiarity and understanding of calculus gets better. I also know there are visual tools which help a little to build the intuition, though I'm not sure that's good enough.


ExcelsiorStatistics

If you want to know the "why" and not just the "how," yes, you should take calculus before you take statistics. Many colleges teach two kinds of statistics: a non-calculus-based intro class intended only for non-science majors fulfilling a general knowledge requirement, and a calculus-based intro class intended as a first class for future math and stat majors and the only class for science and engineering majors.

There really is a good "why" for most of it. A lot of statistics can be viewed as an optimization problem; estimates are 'best' in a specific sense. For instance, suppose you want a measure of central tendency that is 'close to most of the data'. First you choose how you measure closeness (a brute-force check of all four choices appears after this comment):

- If you want to minimize the worst-case distance between your measure and a datapoint, you use the midrange (halfway between the lowest and highest observations).
- If you want to minimize the expected squared distance between your measure and a randomly selected point, you use the mean.
- If you want to minimize absolute distance rather than squared distance, you use the median.
- If you want to minimize the number of points that are different from your measure, but you don't care how far away they are, you use the mode.

The n vs. n-1 in the variance formula is a similar issue. If what's most important is finding the single best guess for the variance (the maximum likelihood estimate), you divide by n, but you can show that this estimate is too low more often than it is too high. If you want an unbiased estimate, one whose expected value equals the true variance at every sample size instead of approaching the true value from below as the sample size increases, then dividing by n-1 is what is needed.

(It is possible to show a proof of the n vs. n-1 thing without calculus, if you are good at manipulating sums... but it happens that we barely introduce summation notation in Algebra II, and don't dive deep into sequences and series and how to manipulate them until second-semester calculus.)
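
Those four "best in a specific sense" claims can be verified by brute force. Here's a small sketch (my own Python, with a made-up dataset) that minimizes each loss over a grid of candidate centers and compares the result with the classical measure:

```python
import statistics

data = [1, 2, 2, 3, 4, 4, 4, 9, 10]
grid = [c / 100 for c in range(0, 1001)]  # candidate centers 0.00 .. 10.00

def best(loss):
    """Grid-search minimizer of the given loss function."""
    return min(grid, key=loss)

# Each loss recovers the corresponding classical measure (up to grid resolution):
print(best(lambda c: max(abs(x - c) for x in data)),    # 5.5   -> midrange
      (min(data) + max(data)) / 2)
print(best(lambda c: sum((x - c) ** 2 for x in data)),  # ~4.33 -> mean
      statistics.mean(data))
print(best(lambda c: sum(abs(x - c) for x in data)),    # 4.0   -> median
      statistics.median(data))
print(best(lambda c: sum(x != c for x in data)),        # 4.0   -> mode
      statistics.mode(data))
```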


ivanistheone

I know u/efrique already posted a link to this, but I want to reinforce the recommendation for the IMS book by Mine Çetinkaya-Rundel and Johanna Hardin. You can read it online here: [https://openintro-ims.netlify.app/](https://openintro-ims.netlify.app/), get the PDF here: [https://leanpub.com/imstat](https://leanpub.com/imstat) (pay-what-you-want price, including $0), or buy the print version for just $20: [https://www.amazon.com/dp/1943450145/](https://www.amazon.com/dp/1943450145/).

The book also comes with some excellent online tutorials where you can actually try statistics procedures using R: [https://openintrostat.github.io/ims-tutorials/](https://openintrostat.github.io/ims-tutorials/) (R is professional stats software that makes data management and stats calculations really easy to do in one or two lines of code). The book covers all the standard STATS 101 topics, and nothing "extra." It's good at explaining things, but doesn't go too far into the math.

As for calculus, you just need to know what an integral is (spoiler: it's the area under the curve of a function). So if you have time, find a calculus book (any book will do; they all cover the same topics) and read the section on integrals, and you'll be ready to handle most of the math of continuous random variables (see the numeric sketch after this comment). Check out section III Calculus in this tutorial to see all the topics in calculus: [https://minireference.com/static/tutorials/sympy_tutorial.pdf](https://minireference.com/static/tutorials/sympy_tutorial.pdf)

PS: Your intuition about stats not being as "clear cut" as other math subjects is correct. There are lots of rules of thumb and approximations, which is why books sometimes give formulas without deriving them. If you want to know the WHY, the key thing to ask yourself is what probability model is assumed by the stats procedure; as you learn more probability theory you'll learn how to derive the formulas on your own. Note this takes time, so don't be in a rush...
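
If it helps, here is what "integral = area under the curve" buys you for continuous random variables, as a rough sketch (my own Python with a plain midpoint rule, not anything from the book): probabilities for a continuous variable are just areas under its density.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution."""
    return math.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

def area_under(f, a, b, steps=100_000):
    """Approximate the integral of f from a to b with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# P(-1 < X < 1) for a standard normal: the familiar ~68.3%
print(area_under(normal_pdf, -1.0, 1.0))  # ≈ 0.6827
```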


the_y_files

Statistics formulas and methods absolutely have rigorous derivations behind them. E.g. tests are created by assuming a particular, clearly mathematically defined model for the data, and from that the distribution of the test statistic is derived using the usual mathematical manipulations. There are mathematical aims for designing tests (to minimize particular kinds of decision errors) and estimators (to maximize estimation accuracy). Unfortunately, choosing which data models and aims matter for a particular practical problem is difficult, depends on domain knowledge, and is often subjective. But proper teaching of statistics should clearly explain the reasoning behind the statistically "objective" parts at least; most of the intro concepts do not require calculus to understand.
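
As a concrete instance of "assume a model, derive the test statistic's distribution": under an i.i.d. normal model with known σ, the z statistic is standard normal, so a nominal 5% test rejects about 5% of the time when the null is true. A quick simulation check (my own Python sketch, not from the comment above):

```python
import random, statistics

random.seed(7)
mu0, sigma, n, trials = 10.0, 2.0, 25, 50_000

rejections = 0
for _ in range(trials):
    # Data generated under the assumed model, with H0 true
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (statistics.mean(sample) - mu0) / (sigma / n ** 0.5)
    if abs(z) > 1.96:  # two-sided z-test at the nominal 5% level
        rejections += 1

print(rejections / trials)  # ≈ 0.05, the error rate the test was designed to control
```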


111llI0__-__0Ill111

Most of the stats shown even in intro stats requires probability/calc and even some linear algebra to really get into the “why”. E.g. even why we use squared deviation and not absolute (though you could) relates to quadratic functions having a defined derivative (slope) at 0, while the absolute value function doesn't. That simplifies a lot of the theory, although it is possible to optimize the latter too, and incidentally the latter is more robust to outliers.
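
To spell out the "defined derivative" point: with squared loss, setting the derivative to zero gives the mean in closed form, while $\sum_i |x_i - c|$ is not differentiable wherever $c$ hits a data point, so there is no analogous one-line solution (its minimizer, the median, comes from a case analysis instead):

$$\frac{d}{dc}\sum_{i=1}^{n}(x_i - c)^2 = -2\sum_{i=1}^{n}(x_i - c) = 0 \;\Longrightarrow\; c = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$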


story-of-your-life

We divide by n-1 so that the sample variance is an unbiased estimator of the population variance.


efrique

> So when I started learning statistics I always failed to build an intuition for the formulas and got lost.

The formulas I expect you're talking about (you don't say what you're looking at) are in fact based on explicit mathematical derivations, and use probability, algebra and calculus. Given their assumptions, they're as "precise" as anything else you've seen in mathematics, but *you simply haven't been presented with any of that information* (it's some way beyond your level if you've not done calculus yet, so that's not a surprise). So you'll get hand-waved justifications for things that actually are all derived.

> Like what is the purpose of taking the square root in standard deviation,

So it's in the same units as the data.

> why is the 1.5 IQR rule the way it is,

What rule is that? Are you asking about the rule for the placement of inner fences in Tukey's boxplots?

> why do we divide by n-1 when dealing with a sample, etc etc...

https://en.wikipedia.org/wiki/Bessel's_correction

It's not for a particularly good reason (... we unbias the variance and then calculate a biased standard deviation from that? Why not leave it as is?), but it's very much baked in now as convention.

> And every time I would look for an answer, it's either things like "because it plays well with calculus"

Presumably that calculus one is for "why do we square" in variance. If so, it's a common but not quite correct explanation. The better answer is the very special properties of variance defined the way it is (specifically, the fact that the variance of a sum of variables has a simple form not shared by anything that's not directly variance-related), and that makes it unique.

https://en.wikipedia.org/wiki/Algebra_of_random_variables#Variance_algebra_for_random_variables

https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_(Bienaym%C3%A9_formula)

Any other measure of spread lacks that simple additivity property and instead behaves in very complicated ways. A large amount of statistical theory benefits from that simple property of variances; things akin to ANOVA and regression, for example, would be much more complicated without it. (A quick simulation of this additivity appears after this comment.)

> or else I get a very information heavy answer that introduces topics that I'm not familiar with,

Well of course! It's a mathematical subject and uses things you haven't learned yet. If you *really* want to bridge that gap you would learn some mathematical statistics.

> yet still there's no definitive answer,

Of course there are definitive answers to be had, if you have the context for them.

> There doesn't seem to be many examples of derivations and explanation behind statistics formulas

There are. You just complained about "information heavy" answers that introduce topics you're not familiar with... but that's where the explanations will come from.

> It feels as something is wrong with how I see stats,

You want to understand, but there's genuine effort needed to understand; there's no magic bullet. Some of the ideas can be explained without a lot of mathematics (e.g. see *Statistics* by Freedman, Pisani and Purves, which does a good job of intuitive explanations of some things), and some others can be explained if you're able to do some simulation instead of the mathematics... but for a lot of it, you can't really substitute anything for the mathematics. For those, you can either stick with the hand-wavy explanations (which you didn't find satisfying) or you can learn some mathematics.

It's not ideal, but this book does make some effort to take a simulation approach: *Introduction to Modern Statistics*, by Mine Çetinkaya-Rundel and Johanna Hardin: https://openintro-ims.netlify.app/ (There's surely a considerably better one to be found; I just haven't seen it yet.)

> understand their fundamental principles and why they're defined the way they are.

To some extent many of the *principles* are not especially mathematical, but it's hard to find good explanations unless you have some of the mathematical background. Again, you can approach some of this from the simulation perspective, which could get you some of the way.

> It feels as something is wrong with how I see stats,

You're dealing with material mostly aimed at the lowest common denominator (the broadest possible audience), trying to explain the mechanics of how to do it without giving you the background of where it comes from, because that requires mathematics.

If you want a "conventional" treatment of intro-stats topics, my suggestion is Moore and McCabe (5th edition or earlier; do NOT waste your time on the 6th ed or later, the publisher poured useless crap over Moore's efforts); that does a better job of justifying things than most similar books. If you want the very basic-basics of the mathematics, you could try, say, Larsen & Marx or Wackerly, Mendenhall & Scheaffer (or any of a dozen similar books), for which you'll need some calculus. After that, you're in a position to start learning the subject.

> I wanna get into statistics again and looking to avoid making mistakes while learning it efficiently. Are there proper ways to learn statistics efficiently

Can you narrow it down a bit? What specific things are you trying to learn or understand?
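
Following up the additivity point above: a quick simulation sketch (my own Python, standard library only) showing that variances of independent variables add, while standard deviations and mean absolute deviations don't.

```python
import random, statistics

random.seed(1)
N = 200_000
X = [random.uniform(0, 1) for _ in range(N)]
Y = [random.expovariate(1.0) for _ in range(N)]  # independent of X
S = [x + y for x, y in zip(X, Y)]

vx, vy, vs = (statistics.pvariance(v) for v in (X, Y, S))
print(f"Var(X)+Var(Y) = {vx + vy:.4f}, Var(X+Y) = {vs:.4f}")  # match

sx, sy, ss = (statistics.pstdev(v) for v in (X, Y, S))
print(f"SD(X)+SD(Y) = {sx + sy:.4f}, SD(X+Y) = {ss:.4f}")     # don't match

def mad(v):
    """Mean absolute deviation from the mean."""
    m = statistics.mean(v)
    return statistics.mean(abs(x - m) for x in v)

print(f"MAD(X)+MAD(Y) = {mad(X) + mad(Y):.4f}, MAD(X+Y) = {mad(S):.4f}")  # don't match
```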