WEBVTT 1 00:00:00.160 --> 00:00:03.279 Welcome to the deep dive. You know, when most people 2 00:00:03.319 --> 00:00:07.320 hear machine learning or maybe AI, I think the first 3 00:00:07.320 --> 00:00:09.960 thing that comes to mind is the code. 4 00:00:10.080 --> 00:00:13.240 Right. Oh, absolutely. Python scripts, neural nets, all that complex 5 00:00:13.279 --> 00:00:15.359 engineering stuff. That's the flashy part. 6 00:00:15.519 --> 00:00:17.199 Yeah, the engine. It is the engine. Yeah. 7 00:00:17.199 --> 00:00:19.679 But what our source material for today really emphasizes, and 8 00:00:19.719 --> 00:00:23.440 it's looking specifically at the prerequisites for even building those engines, 9 00:00:24.199 --> 00:00:27.920 is that the real foundation isn't the code. Okay. 10 00:00:28.160 --> 00:00:32.000 It's math. Specifically, it's statistics. You could almost call it 11 00:00:32.039 --> 00:00:33.640 a preliminary requirement. 12 00:00:33.799 --> 00:00:35.759 Right. So that's our mission today, then. We're aiming to 13 00:00:35.759 --> 00:00:38.039 give you a bit of an intellectual shortcut here. We 14 00:00:38.079 --> 00:00:41.679 want to pull out the essential statistical concepts, the 15 00:00:41.679 --> 00:00:45.600 core vocabulary and the toolkit you need for exploring data, 16 00:00:45.840 --> 00:00:48.600 cleaning it up, getting it ready for predictive modeling. 17 00:00:48.280 --> 00:00:51.280 Basically saving you the trouble of reading the whole textbook page 18 00:00:51.079 --> 00:00:54.679 by page. Exactly. This is about getting that statistical fluency 19 00:00:54.719 --> 00:00:57.200 you need before you even think about training 20 00:00:56.920 --> 00:01:00.520 a model. And it's not just about passing some exam.
21 00:01:00.920 --> 00:01:05.359 You genuinely need these concepts because, well, every single step 22 00:01:05.400 --> 00:01:07.640 in an ML pipeline, from the moment you get the 23 00:01:07.719 --> 00:01:11.359 data to evaluating how well your model did, is fundamentally 24 00:01:11.400 --> 00:01:12.519 a statistical operation. 25 00:01:13.200 --> 00:01:15.879 Okay, so where do we start? I guess right at 26 00:01:15.879 --> 00:01:20.000 the beginning, recognizing what kind of data you're even dealing with. 27 00:01:20.319 --> 00:01:23.439 That's the spot. The sources remind us that data collection 28 00:01:23.560 --> 00:01:26.959 isn't just, you know, chaos. It's usually driven by trying 29 00:01:26.959 --> 00:01:29.000 to answer some real-world question. 30 00:01:28.879 --> 00:01:31.560 Like market research before you launch a product, maybe. 31 00:01:31.400 --> 00:01:33.760 Exactly. Is this product feasible? Who are we trying to 32 00:01:33.799 --> 00:01:34.719 reach? That kind of thing. 33 00:01:34.840 --> 00:01:37.280 And the answers we get, the actual numbers we collect 34 00:01:37.280 --> 00:01:37.959 and store? 35 00:01:38.000 --> 00:01:41.799 Those are grouped into what statisticians call random variables. They're 36 00:01:41.879 --> 00:01:44.719 the numerical backbone of whatever research you're doing. 37 00:01:44.840 --> 00:01:47.760 Okay, random variables. So how do we bring some order 38 00:01:47.799 --> 00:01:49.319 to that? How do we structure them? 39 00:01:49.439 --> 00:01:52.200 Well, we mainly split them based on what they can 40 00:01:52.239 --> 00:01:55.719 actually measure. First up, you've got discrete random variables. 41 00:01:55.760 --> 00:01:57.079 Discrete meaning separate. 42 00:01:57.359 --> 00:01:59.519 Yeah, think fixed counts. They have to be whole numbers.
43 00:01:59.519 --> 00:02:03.319 It can be, like, counting how many people clicked on 44 00:02:03.359 --> 00:02:05.719 an ad or the number of gold medals a country 45 00:02:05.719 --> 00:02:09.560 won in the Olympics. Got it, fixed counts. Definite fixed counts. 46 00:02:09.400 --> 00:02:12.360 So what's the other kind, the stuff that isn't fixed counts? 47 00:02:12.599 --> 00:02:15.840 That would be your continuous random variable. This one stores 48 00:02:15.919 --> 00:02:19.879 values that can be decimals or floats, and theoretically, at 49 00:02:19.919 --> 00:02:22.759 least, you could measure them with infinite precision. 50 00:02:22.479 --> 00:02:24.000 Like height or weight. 51 00:02:24.240 --> 00:02:28.800 Perfect examples: height, weight, temperature. You can always, in theory, 52 00:02:29.120 --> 00:02:31.479 add another decimal place to make the measurement finer. 53 00:02:31.800 --> 00:02:34.960 Okay, that makes sense for numbers. But what if the 54 00:02:35.039 --> 00:02:37.280 data isn't a number at all? Like if it's just 55 00:02:37.319 --> 00:02:40.199 a label, someone's city or maybe their preferred brand. 56 00:02:40.319 --> 00:02:44.000 Ah, good question. Then you're working with categorical variables. And 57 00:02:44.039 --> 00:02:47.120 this is where we need another layer of distinction, because 58 00:02:47.479 --> 00:02:50.960 how an ML algorithm handles these depends a lot on 59 00:02:51.000 --> 00:02:54.680 whether the categories have some kind of internal meaning or order. 60 00:02:54.759 --> 00:02:57.680 Wait, internal meaning? Why does that matter? Isn't red just 61 00:02:57.919 --> 00:02:58.840 red to a computer? 62 00:02:59.159 --> 00:03:02.759 It matters quite a bit, actually, mostly because it impacts 63 00:03:02.840 --> 00:03:05.919 how you encode that data before feeding it to a model.
64 00:03:06.759 --> 00:03:11.159 If the categories have absolutely no inherent rank or order, 65 00:03:11.560 --> 00:03:15.439 we call them nominal variables. Okay. Think gender, like male, female, 66 00:03:16.000 --> 00:03:19.479 or maybe types of fruit: apple, banana, orange. You can't 67 00:03:19.520 --> 00:03:23.520 really rank one above the other logically. They're just distinct groups. 68 00:03:23.400 --> 00:03:26.159 Right, distinct groups, makes sense. But what if they can 69 00:03:26.199 --> 00:03:27.240 be ranked? Then it's 70 00:03:27.159 --> 00:03:33.000 an ordinal variable. Think about, say, a customer satisfaction rating: low, medium, high. 71 00:03:33.360 --> 00:03:35.840 Ah, okay, there's a clear hierarchy 72 00:03:35.439 --> 00:03:38.840 there. Exactly. And knowing this difference is key before you 73 00:03:38.879 --> 00:03:42.800 start your feature engineering. For nominal variables, the algorithm often needs 74 00:03:42.800 --> 00:03:46.599 to treat each category as totally separate, maybe using something 75 00:03:46.639 --> 00:03:49.759 called one-hot encoding. But for ordinal variables you might 76 00:03:49.800 --> 00:03:52.439 be able to use encoding methods that preserve that ranking, 77 00:03:52.800 --> 00:03:56.120 which can sometimes make the model simpler or even more accurate. 78 00:03:56.520 --> 00:03:58.840 So yeah, knowing this distinction is pretty fundamental. 79 00:03:59.000 --> 00:04:00.719 All right, so we've figured out what kind of 80 00:04:00.800 --> 00:04:04.719 variables we have. What's the immediate next step? Usually it's 81 00:04:04.800 --> 00:04:09.159 descriptive statistics, right? Trying to summarize potentially huge data sets. 82 00:04:09.240 --> 00:04:12.919 Yes, exactly. We're moving from just defining things to actually 83 00:04:12.960 --> 00:04:15.080 starting to tell the story hidden in the data.
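The nominal versus ordinal distinction can be made concrete with a small sketch. This is only an illustration in plain Python, not the source's method: the sample lists and the helper names `one_hot` and `ordinal_encode` are assumptions made up for the example, and the category names are the ones mentioned in the conversation.

```python
# Hypothetical sample data; only the category names come from the discussion.
fruits = ["apple", "banana", "orange", "banana"]   # nominal: no natural order
ratings = ["low", "high", "medium", "low"]         # ordinal: low < medium < high


def one_hot(values):
    """One-hot encode: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


def ordinal_encode(values, order):
    """Map each category to its position in the given ranking."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


fruit_columns, fruit_rows = one_hot(fruits)
rating_codes = ordinal_encode(ratings, order=["low", "medium", "high"])

print(fruit_columns)   # column names for the one-hot rows
print(fruit_rows)      # each row has exactly one 1
print(rating_codes)    # integers that preserve low < medium < high
```

The one-hot rows carry no ordering between fruits, while the integer codes keep the ranking of the ratings, which is the property the ordinal encoding is meant to preserve.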
The first 84 00:04:15.120 --> 00:04:18.519 step is usually summarizing it, focusing on its center and 85 00:04:18.560 --> 00:04:19.120 its spread. 86 00:04:19.279 --> 00:04:22.439 Okay, center and spread. Let's start with the center. Measures 87 00:04:22.439 --> 00:04:24.879 of central tendency, is that the term? That's the one. And 88 00:04:24.879 --> 00:04:27.439 the big three here are the mean, the median, and 89 00:04:27.480 --> 00:04:28.079 the mode. 90 00:04:28.399 --> 00:04:31.120 Everyone knows the mean, the average, right? Add them all up, 91 00:04:31.160 --> 00:04:34.439 divide by how many there are. Seems simple. But what's 92 00:04:34.480 --> 00:04:37.600 the specific ML insight? Why is it so important? 93 00:04:37.839 --> 00:04:41.240 Well, mathematically, the mean is the center of balance for 94 00:04:41.279 --> 00:04:44.240 your data. But what's really interesting is how it connects 95 00:04:44.240 --> 00:04:47.279 directly to prediction. How so? When you build, say, a 96 00:04:47.399 --> 00:04:51.439 simple linear regression model, what you're essentially doing is trying 97 00:04:51.439 --> 00:04:54.959 to draw a line that minimizes the squared distance between 98 00:04:54.959 --> 00:04:57.279 that line and all your data points. Yeah. The mean 99 00:04:57.360 --> 00:05:00.160 turns out to be the single value that inherently 100 00:05:00.199 --> 00:05:01.720 minimizes that squared error. 101 00:05:02.040 --> 00:05:05.040 Huh. So it's like the best guess if you knew 102 00:05:05.079 --> 00:05:05.759 nothing else. 103 00:05:06.160 --> 00:05:09.720 It's the optimal point prediction if you had zero other information, yes. 104 00:05:09.839 --> 00:05:11.120 It's the point of minimum error. 105 00:05:11.279 --> 00:05:15.279 Okay, but the mean has that famous weakness, right? The 106 00:05:15.279 --> 00:05:18.199 outlier problem.
Like if you're averaging salaries in a 107 00:05:18.199 --> 00:05:21.720 small startup and suddenly the CEO's twenty-million-dollar salary 108 00:05:21.759 --> 00:05:22.519 gets added 109 00:05:22.279 --> 00:05:26.759 in. Exactly. That one massive outlier just yanks the average 110 00:05:26.800 --> 00:05:30.360 way, way up, making it not very representative of the 111 00:05:30.360 --> 00:05:31.360 typical employee. 112 00:05:31.439 --> 00:05:32.720 So that's where the median comes in. 113 00:05:32.959 --> 00:05:36.360 Precisely. The median is the exact middle value when you 114 00:05:36.399 --> 00:05:38.800 sort your data from smallest to largest. Fifty percent of 115 00:05:38.800 --> 00:05:40.720 the data is below it, fifty percent is above it. 116 00:05:40.759 --> 00:05:43.279 And because it only cares about the middle position, 117 00:05:43.519 --> 00:05:46.759 it's incredibly robust to those extreme outliers. That twenty-million- 118 00:05:46.759 --> 00:05:50.279 dollar salary doesn't really affect the median much, if at all. 119 00:05:50.519 --> 00:05:52.759 And if you have an even number of data points, 120 00:05:53.519 --> 00:05:54.839 no single middle value? 121 00:05:54.959 --> 00:05:57.319 Simple. You just take the average of the two middle values. 122 00:05:57.639 --> 00:05:59.480 Still gives you that robust central point. 123 00:05:59.560 --> 00:06:03.199 Okay, so the mean is error-minimizing but sensitive to outliers, 124 00:06:03.240 --> 00:06:05.480 the median is robust. What about the third one, the mode? 125 00:06:05.839 --> 00:06:08.920 The mode is even simpler. It's just the value that 126 00:06:09.000 --> 00:06:12.079 shows up most often in your data set. Most frequent? Yep. 127 00:06:12.759 --> 00:06:16.399 It's typically most useful for categorical data, finding the most 128 00:06:16.399 --> 00:06:18.759 popular choice or the most common group. 129 00:06:18.839 --> 00:06:20.360 Any quirks with the mode?
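The two claims about the mean (it is the constant guess with the smallest squared error, and it is dragged around by outliers) can be checked with a short sketch. The salary figures below are invented for illustration, apart from the twenty-million-dollar CEO salary mentioned in the conversation, and `sum_squared_error` is a hypothetical helper name.

```python
from statistics import mean, median

# Hypothetical salaries for a small startup (made-up numbers).
salaries = [50_000, 55_000, 60_000, 65_000, 70_000]
with_ceo = salaries + [20_000_000]  # add the CEO's salary as an outlier


def sum_squared_error(data, guess):
    """Total squared error of predicting `guess` for every data point."""
    return sum((x - guess) ** 2 for x in data)


# The outlier moves the mean a lot and the median only slightly.
print("without CEO:", mean(salaries), median(salaries))
print("with CEO:   ", round(mean(with_ceo)), median(with_ceo))

# Compare the mean against two nearby constant guesses.
center = mean(salaries)
errors = {
    guess: sum_squared_error(salaries, guess)
    for guess in (center - 5_000, center, center + 5_000)
}
print(errors)  # the entry keyed by the mean is the smallest
```

Only three candidate guesses are compared here, so this is a spot check rather than a proof; the general result follows from setting the derivative of the squared error to zero.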
130 00:06:20.519 --> 00:06:23.439 Couple interesting ones. It's the only measure of center that 131 00:06:23.519 --> 00:06:26.360 might not actually exist for your data, for instance when no value repeats, which sounds 132 00:06:26.399 --> 00:06:28.920 weird but can happen. And you can also have more 133 00:06:28.959 --> 00:06:32.560 than one mode. Like bimodal? Exactly, bimodal if there 134 00:06:32.560 --> 00:06:35.240 are two peaks, or even multimodal. That can be a 135 00:06:35.240 --> 00:06:37.480 clue that your data might actually be composed of a 136 00:06:37.519 --> 00:06:39.639 couple of different underlying groups or clusters. 137 00:06:39.680 --> 00:06:43.279 Okay, so we found the center using mean, median, or mode. 138 00:06:43.800 --> 00:06:47.439 But you said center alone isn't enough. Two data sets 139 00:06:47.439 --> 00:06:50.000 could have the same mean but look totally different. 140 00:06:50.319 --> 00:06:53.439 Right. Imagine one data set clustered tightly around the mean 141 00:06:54.040 --> 00:06:57.519 and another spread way out. Same mean, very different story. 142 00:06:57.720 --> 00:07:01.480 That's why we need measures of dispersion, or spread, and 143 00:07:01.519 --> 00:07:04.199 the main ones are variance and standard deviation, SD. 144 00:07:04.319 --> 00:07:07.240 Okay, variance and SD. They both measure spread, right? How 145 00:07:07.279 --> 00:07:09.639 far data points tend to be from the center, usually 146 00:07:09.680 --> 00:07:10.000 the mean. 147 00:07:10.240 --> 00:07:13.600 That's the core idea. A high value for either variance 148 00:07:13.639 --> 00:07:17.600 or SD means the data is really spread out, dispersed widely. 149 00:07:18.160 --> 00:07:21.000 A small value means everything's huddled close to the mean. 150 00:07:21.879 --> 00:07:24.519 So if they measure the same basic thing, why do 151 00:07:24.560 --> 00:07:27.879 we need both? What's the practical difference, especially thinking about 152 00:07:27.879 --> 00:07:28.720 machine learning?
153 00:07:28.920 --> 00:07:32.519 Okay, so mathematically, the standard deviation is just the square 154 00:07:32.600 --> 00:07:35.399 root of the variance. The absolute key difference is the 155 00:07:35.519 --> 00:07:39.600 units. Units? Yeah. Variance is calculated using squared differences, so 156 00:07:39.639 --> 00:07:42.800 its units are the square of the original data's units. 157 00:07:43.319 --> 00:07:45.920 If you measure height in meters, the variance is 158 00:07:45.959 --> 00:07:49.360 in meters squared, which is kind of awkward to interpret directly. 159 00:07:49.319 --> 00:07:50.160 Not very intuitive. 160 00:07:50.240 --> 00:07:53.240 But the standard deviation, because it's the square root, is 161 00:07:53.279 --> 00:07:55.879 back in the original units. So if your height data 162 00:07:55.920 --> 00:07:58.160 is in meters, the SD is also in meters. 163 00:07:58.240 --> 00:08:01.439 Ah. Okay, so SD is easier to compare directly 164 00:08:01.439 --> 00:08:02.079 to the mean. 165 00:08:02.199 --> 00:08:06.199 Much easier. It makes SD far better for interpretation, for reporting, 166 00:08:06.439 --> 00:08:09.920 and really crucially for something called feature scaling or normalization 167 00:08:10.040 --> 00:08:10.480 in ML. 168 00:08:10.600 --> 00:08:11.519 Why feature scaling? 169 00:08:11.720 --> 00:08:14.160 Well, often in ML you have features measured on totally 170 00:08:14.160 --> 00:08:18.240 different scales, maybe age in years, income in thousands of dollars, 171 00:08:18.279 --> 00:08:21.920 height in centimeters. Models can sometimes struggle with that or give 172 00:08:21.959 --> 00:08:24.959 too much weight to features with larger numerical values. 173 00:08:24.720 --> 00:08:26.160 So you need to put them on a level playing 174 00:08:26.160 --> 00:08:27.040 field. Exactly.
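The units point and the "level playing field" idea can be sketched together. This is a minimal illustration under stated assumptions: the measurements are invented, `standardize` is a hypothetical helper, and the population formulas (`pvariance`, `pstdev`) are used for simplicity; a sample-based analysis might prefer `variance` and `stdev` instead. Rescaling to mean zero and standard deviation one is one common form of feature scaling, often called z-scoring.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical measurements, invented for illustration.
heights_m = [1.55, 1.62, 1.70, 1.78, 1.85]     # meters
incomes_k = [32.0, 45.0, 51.0, 78.0, 120.0]    # thousands of dollars

variance_m2 = pvariance(heights_m)   # units: meters squared
sd_m = pstdev(heights_m)             # units: meters, same as the data


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


heights_z = standardize(heights_m)
incomes_z = standardize(incomes_k)

print(f"variance = {variance_m2:.4f} m^2, SD = {sd_m:.4f} m")
print([round(z, 2) for z in heights_z])
print([round(z, 2) for z in incomes_z])
```

After standardizing, both features are unitless and sit on a comparable scale, even though the raw incomes are numerically far larger than the raw heights.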
175 00:08:27.199 --> 00:08:29.600 You often rescale features so they have a mean of 176 00:08:29.680 --> 00:08:33.000 zero and a standard deviation of one. And standard deviation 177 00:08:33.120 --> 00:08:35.480 is the metric you use to do that rescaling properly. 178 00:08:35.840 --> 00:08:39.279 It's fundamental for preprocessing data for many algorithms. 179 00:08:39.399 --> 00:08:42.960 Okay, we've gone from defining data types to summarizing them 180 00:08:43.000 --> 00:08:46.200 with center and spread. Now how do we pivot towards 181 00:08:46.279 --> 00:08:49.480 using this data for prediction? That feels like the next 182 00:08:49.519 --> 00:08:50.320 logical step. 183 00:08:50.879 --> 00:08:53.559 It is, and that pivot really starts by defining a 184 00:08:53.559 --> 00:08:57.200 potential cause-and-effect relationship. This is where we introduce 185 00:08:57.240 --> 00:09:00.080 the concepts of dependent and independent 186 00:08:59.600 --> 00:09:01.720 variables. Right, setting up the experiment, 187 00:09:01.799 --> 00:09:05.799 essentially. Pretty much. We're defining our modeling goal. What factor 188 00:09:05.879 --> 00:09:10.120 are we changing or observing? The independent variable. And what 189 00:09:10.200 --> 00:09:14.080 outcome are we measuring the effect on? The dependent variable. 190 00:09:14.360 --> 00:09:17.879 So the independent variable is the input, the thing we control, 191 00:09:18.080 --> 00:09:20.159 or the factor we think is causing a change. 192 00:09:20.200 --> 00:09:23.240 Exactly. Like in a drug trial, the dosage level would 193 00:09:23.240 --> 00:09:26.879 be the independent variable. Or, using an example from the source, 194 00:09:27.200 --> 00:09:29.440 maybe the type of pitch a pitcher throws to a batter. 195 00:09:29.600 --> 00:09:30.639 That's the input being
196 00:09:30.559 --> 00:09:33.240 varied. And the dependent variable is the output, the result, 197 00:09:33.399 --> 00:09:35.600 what happens because of the independent variable? 198 00:09:35.759 --> 00:09:38.919 Yes, it's the variable being tested or measured that responds 199 00:09:38.960 --> 00:09:42.679 to the changes. In that baseball example, the batter's performance, 200 00:09:42.720 --> 00:09:45.240 did they hit it, how well, that's the dependent variable. 201 00:09:45.559 --> 00:09:47.320 Its value depends on the pitch type. 202 00:09:47.440 --> 00:09:50.960 And getting these two defined correctly seems absolutely critical. It's 203 00:09:50.960 --> 00:09:54.360 basically framing the entire problem you want your ML model 204 00:09:54.399 --> 00:09:54.879 to solve. 205 00:09:55.080 --> 00:09:58.240 It is. You're specifying the relationship you intend to model 206 00:09:58.279 --> 00:09:58.759 and predict. 207 00:09:58.879 --> 00:10:02.120 Now, underpinning all of this statistical analysis, all these 208 00:10:02.120 --> 00:10:05.639 measurements and relationships, there's a really core principle that gives 209 00:10:05.720 --> 00:10:08.279 us confidence in the results, right? The law of large 210 00:10:08.320 --> 00:10:09.480 numbers, LLN. 211 00:10:09.600 --> 00:10:12.639 Ah. Yes, the LLN. It's absolutely fundamental. It's kind of 212 00:10:12.679 --> 00:10:15.000 the bedrock that makes statistics work reliably. 213 00:10:15.240 --> 00:10:17.799 So what does it state, in simple terms? It 214 00:10:17.759 --> 00:10:21.000 basically says that if you repeat the same experiment over 215 00:10:21.080 --> 00:10:23.360 and over and over again a huge number of times, 216 00:10:23.799 --> 00:10:26.559 the average of the results you get will get closer 217 00:10:26.559 --> 00:10:30.240 and closer to the true expected theoretical value. 218 00:10:30.320 --> 00:10:31.159 Like flipping a coin.
219 00:10:31.320 --> 00:10:34.679 Perfect example. Flip a coin just ten times, you might 220 00:10:34.720 --> 00:10:38.080 easily get, say, seven heads and three tails. That's pretty 221 00:10:38.120 --> 00:10:40.679 far from the expected fifty-fifty, right? But flip 222 00:10:40.720 --> 00:10:43.440 that same coin a million times or ten million times, 223 00:10:44.039 --> 00:10:46.120 the ratio of heads to tails is going to get 224 00:10:46.159 --> 00:10:49.360 incredibly close to exactly one to one. It converges on 225 00:10:49.399 --> 00:10:51.200 the true probability. 226 00:10:50.639 --> 00:10:54.120 And it's that convergence that lets us trust statistical methods. Exactly. 227 00:10:54.279 --> 00:10:57.840 It validates the whole idea of using probabilities and statistics 228 00:10:57.840 --> 00:11:01.799 derived from experiments or samples to understand underlying truths. It 229 00:11:01.799 --> 00:11:04.399 allows us to have confidence in probabilistic models. 230 00:11:04.519 --> 00:11:07.240 So the LLN gives us the confidence, then, to take 231 00:11:07.320 --> 00:11:09.600 results we see in a smaller sample of data and 232 00:11:09.639 --> 00:11:13.200 make reasonable conclusions about the entire population it came from, 233 00:11:13.240 --> 00:11:15.039 which sounds like statistical inference. 234 00:11:15.440 --> 00:11:19.159 That's precisely what statistical inference is about, and it leads 235 00:11:19.240 --> 00:11:22.200 directly to the main framework we use for making those decisions: 236 00:11:22.399 --> 00:11:23.480 hypothesis testing. 237 00:11:23.840 --> 00:11:27.320 Okay, hypothesis testing. This is where we formally test an 238 00:11:27.320 --> 00:11:28.600 idea using the data.
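The coin-flip illustration of the law of large numbers is easy to simulate. The sketch below is an assumption-laden toy: it uses Python's pseudo-random generator with a fixed seed as a stand-in for a fair coin, `heads_fraction` is a hypothetical helper, and the sample sizes are chosen only to keep the run fast.

```python
import random


def heads_fraction(n_flips, rng):
    """Simulate n_flips fair coin flips and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


rng = random.Random(0)  # fixed seed so the run is repeatable
fractions = {n: heads_fraction(n, rng) for n in (10, 1_000, 100_000)}

for n, fraction in fractions.items():
    # Larger n should typically land closer to the true probability 0.5.
    print(f"{n:>7} flips: fraction of heads = {fraction:.4f}")
```

Small runs can stray noticeably from one half, while the largest run should sit very close to it. The convergence is probabilistic, so any single run can still deviate a little.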
239 00:11:28.720 --> 00:11:31.559 Yes, it's the structured process where we use the summary 240 00:11:31.559 --> 00:11:34.960 statistics we calculated, combined with our understanding of probability and 241 00:11:35.000 --> 00:11:39.200 the LLN, to draw conclusions about a whole population based 242 00:11:39.320 --> 00:11:40.879 only on evidence from a sample. 243 00:11:41.240 --> 00:11:44.440 And it usually involves setting up two competing ideas beforehand. 244 00:11:44.559 --> 00:11:47.960 Correct. You have a kind of statistical showdown. The main 245 00:11:48.000 --> 00:11:49.879 goal is to see if there's enough evidence in your 246 00:11:49.919 --> 00:11:52.639 sample data to reject the null hypothesis. 247 00:11:52.759 --> 00:11:56.159 The null hypothesis being the default skeptical position. 248 00:11:56.320 --> 00:11:59.440 Always. It's the statement of no effect, no difference, or 249 00:11:59.440 --> 00:12:03.200 no relation. For example, this new drug has no effect 250 00:12:03.279 --> 00:12:05.720 on recovery time compared to the placebo. It's the 251 00:12:05.720 --> 00:12:08.639 status quo assumption. And we test that against the 252 00:12:08.679 --> 00:12:12.000 alternative hypothesis. This is the statement that contradicts the null. 253 00:12:12.440 --> 00:12:15.240 It's what you, as the researcher, might actually suspect or 254 00:12:15.279 --> 00:12:18.519 hope to prove, like, no, this new drug does reduce 255 00:12:18.559 --> 00:12:19.240 recovery time. 256 00:12:19.480 --> 00:12:22.960 So the whole process is about gathering enough statistical evidence 257 00:12:23.320 --> 00:12:27.679 to confidently say, okay, we can reject the no-effect 258 00:12:27.759 --> 00:12:29.679 idea in favor of the there is an
259 00:12:29.600 --> 00:12:33.759 effect idea. Precisely. And that level of statistical confidence, often 260 00:12:33.799 --> 00:12:36.960 expressed as a p-value or a confidence interval, is 261 00:12:37.000 --> 00:12:39.679 what determines whether you feel justified in acting on your 262 00:12:39.720 --> 00:12:41.759 findings or making a claim about the population. 263 00:12:42.000 --> 00:12:44.840 All right, let's pull this together. We've walked through quite 264 00:12:44.840 --> 00:12:48.799 a statistical toolkit: understanding the different types of variables you encounter. 265 00:12:48.679 --> 00:12:51.679 Discrete, continuous, nominal, ordinal. 266 00:12:51.480 --> 00:12:54.639 Yeah, then summarizing them with measures of center like the 267 00:12:54.720 --> 00:12:59.679 mean and median, understanding spread with standard deviation, especially why SD's 268 00:13:00.159 --> 00:13:01.480 useful practically. 269 00:13:01.159 --> 00:13:03.559 Right, those units matter for comparison and scaling. 270 00:13:03.759 --> 00:13:06.600 Then we moved into setting up predictions by defining dependent 271 00:13:06.679 --> 00:13:10.399 and independent variables, and finally, the framework for making decisions 272 00:13:10.440 --> 00:13:14.399 based on sample data, hypothesis testing, built on the confidence 273 00:13:14.440 --> 00:13:16.320 given by the law of large numbers. 274 00:13:16.559 --> 00:13:19.000 It really does form the essential foundation. You can see 275 00:13:19.000 --> 00:13:22.320 how these concepts are, well, mandatory before you jump into 276 00:13:22.320 --> 00:13:24.039 the more complex ML algorithms. 277 00:13:24.159 --> 00:13:27.000 They really are the entry point for any serious study 278 00:13:27.080 --> 00:13:27.639 or application. 279 00:13:28.000 --> 00:13:30.080 But here's a final thought, something that connects back to 280 00:13:30.120 --> 00:13:34.399 that law of large numbers. The LLN guarantees convergence.
It 281 00:13:34.399 --> 00:13:38.799 gives us certainty, but only over a massive number of trials. 282 00:13:39.159 --> 00:13:40.799 A million coin flips. 283 00:13:40.519 --> 00:13:42.639 Right, requires huge scale. Exactly. 284 00:13:43.200 --> 00:13:46.200 But in the real world, doing market analysis, building a 285 00:13:46.200 --> 00:13:50.480 product prototype, maybe even running a clinical trial, we almost 286 00:13:50.559 --> 00:13:53.519 never have a million data points. We work with samples, 287 00:13:53.799 --> 00:13:57.759 sometimes relatively small samples, because collecting data is expensive or 288 00:13:57.759 --> 00:13:58.480 time-consuming. 289 00:13:58.759 --> 00:14:02.480 So the certainty we get isn't absolute. It's usually probabilistic, 290 00:14:02.559 --> 00:14:05.360 like saying we're ninety-five percent confident or maybe ninety- 291 00:14:05.399 --> 00:14:06.840 nine percent confident. 292 00:14:06.480 --> 00:14:09.600 Right, which leads to the provocative question. If the law 293 00:14:09.639 --> 00:14:13.759 of large numbers only guarantees truth over immense scale, how 294 00:14:13.799 --> 00:14:16.840 often are our everyday decisions, maybe in business, launching a 295 00:14:16.840 --> 00:14:20.639 new feature, or even interpreting a political poll, actually based 296 00:14:20.639 --> 00:14:22.679 on what could be called the fallacy of small 297 00:14:22.519 --> 00:14:25.360 numbers? Meaning we're drawing conclusions from samples that might be 298 00:14:25.399 --> 00:14:28.559 too small to really trust the LLN's guarantee. Potentially.
299 00:14:30.480 --> 00:14:34.080 So the question for you, the listener, is what level 300 00:14:34.080 --> 00:14:37.000 of statistical certainty, that ninety-five percent, that ninety-nine 301 00:14:37.039 --> 00:14:40.240 percent, are you willing to accept, especially when you're moving 302 00:14:40.320 --> 00:14:44.039 from analyzing a potentially small, expensive sample to making a 303 00:14:44.080 --> 00:14:47.360 big assumption about the entire population, an assumption that could 304 00:14:47.360 --> 00:14:49.519 have major consequences, maybe cost millions? 305 00:14:49.600 --> 00:14:51.039 How much uncertainty can you live with? 306 00:14:51.240 --> 00:14:54.720 What's your threshold for risk, framed in that statistical confidence? 307 00:14:54.960 --> 00:14:57.279 Definitely something to think about. A great place to leave 308 00:14:57.279 --> 00:14:59.240 it for this deep dive. Thanks for joining us.
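The drug-versus-placebo example from the hypothesis testing discussion can be sketched as a small permutation test. Everything specific here is an assumption: the recovery times are invented, `permutation_p_value` is a hypothetical helper, and a permutation test is swapped in as one simple way to get a p-value; the conversation does not name a particular test. The null hypothesis is that the drug has no effect, so the group labels are interchangeable.

```python
import random

# Hypothetical recovery times in days (made-up numbers, small samples).
placebo = [12, 14, 11, 15, 13, 16, 14, 12]
drug = [10, 11, 9, 12, 10, 13, 11, 9]


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(a) - mean(b) under random relabeling.

    Counts how often shuffling the group labels produces a difference in
    means at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        group_a, group_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if diff >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


p_value = permutation_p_value(placebo, drug)
alpha = 0.05  # threshold matching the "ninety-five percent" level
print(f"p-value = {p_value:.4f}, reject null at alpha={alpha}: {p_value < alpha}")
```

With only eight patients per group, the result is a statement about how surprising the data would be under the null hypothesis, not a guarantee; choosing `alpha` is exactly the risk threshold the closing question asks about.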