WEBVTT 1 00:00:00.120 --> 00:00:02.359 Welcome to the deep dive, where we cut through the 2 00:00:02.399 --> 00:00:05.200 noise to get straight to the knowledge you need. Today, 3 00:00:05.240 --> 00:00:09.279 we're plunging into data science, a field that's well fundamentally 4 00:00:09.359 --> 00:00:12.320 reshaping our world, often without us even realizing it. 5 00:00:12.320 --> 00:00:12.960 It really is. 6 00:00:13.359 --> 00:00:17.480 Think about it. The surprising underdog victory in moneyball, that's 7 00:00:17.600 --> 00:00:22.719 data science. Or those perfectly tailored recommendations from your streaming 8 00:00:22.760 --> 00:00:24.079 services every single. 9 00:00:23.960 --> 00:00:25.480 Night also data science. 10 00:00:25.719 --> 00:00:26.000 Yep. 11 00:00:26.600 --> 00:00:31.600 It's this force, quietly yet powerfully at work across our society, 12 00:00:31.719 --> 00:00:33.799 far beyond what you might immediately see. 13 00:00:33.840 --> 00:00:37.079 Absolutely, and in this deep dive, our mission is to 14 00:00:37.119 --> 00:00:40.840 distill the let's say, intricate insights from our source material, 15 00:00:41.079 --> 00:00:43.840 which is a brilliant guide to data science, into a clear, 16 00:00:43.960 --> 00:00:46.640 engaging and practical understanding for you exactly. 17 00:00:46.759 --> 00:00:48.600 We're here to give you a shortcut to being well 18 00:00:48.600 --> 00:00:52.520 informed exploring data science not just as a technical discipline, 19 00:00:52.520 --> 00:00:55.880 but maybe more as the art of finding patterns in data. 20 00:00:56.000 --> 00:00:59.079 The art of finding patterns. I like that it captures 21 00:00:59.119 --> 00:01:02.000 so much, doesn't it. It speaks to the creativity involved 22 00:01:02.000 --> 00:01:06.799 in solving complex problems, preparing messy data, and ultimately telling 23 00:01:06.840 --> 00:01:09.680 a compelling story with what you find. 
It does, and 24 00:01:09.719 --> 00:01:12.799 speaking of hidden depths, data science is very much like 25 00:01:12.799 --> 00:01:16.480 an iceberg. You often only see the tip: the sleek 26 00:01:16.599 --> 00:01:21.480 apps or powerful AI. But most of the complex foundational 27 00:01:21.519 --> 00:01:25.760 work, that's hidden beneath the surface, most of it. Today, 28 00:01:26.079 --> 00:01:29.599 we're going to try and illuminate those unseen processes, diving 29 00:01:29.640 --> 00:01:31.760 into the core of how it all actually works. 30 00:01:32.200 --> 00:01:35.359 That's a perfect analogy, and to illuminate that hidden bulk, 31 00:01:35.400 --> 00:01:37.840 we're going to walk you through the iterative data analysis 32 00:01:37.879 --> 00:01:41.519 life cycle. This is like the foundational framework for pretty 33 00:01:41.599 --> 00:01:44.239 much any data science project. Okay. And it's rarely a 34 00:01:44.280 --> 00:01:47.680 straight line. It's much more cyclical, meaning you often revisit 35 00:01:47.719 --> 00:01:52.040 steps as you uncover new insights or unexpected challenges pop up. 36 00:01:52.120 --> 00:01:54.519 Right, so it's a loop, not just a linear path. Yeah, 37 00:01:54.560 --> 00:01:56.959 that makes intuitive sense when you're dealing with something as 38 00:01:57.040 --> 00:02:00.120 dynamic as data. So what are the key stages we'll be 39 00:02:00.200 --> 00:02:01.239 exploring in this journey? 40 00:02:01.400 --> 00:02:06.959 Well, they start with discovery, then move through source, prepare, explore, create, analyse, communicate, 41 00:02:07.040 --> 00:02:11.199 and finally operationalize. Each step builds on the last. But 42 00:02:11.319 --> 00:02:14.639 like you said, the real power lies in its iterative nature. 43 00:02:15.000 --> 00:02:16.840 You can loop back at pretty much any point. 44 00:02:17.000 --> 00:02:21.240 Right, let's unpack this, starting with that crucial first step: discovery.
45 00:02:21.439 --> 00:02:24.199 This is all about framing the problem right, and it 46 00:02:24.280 --> 00:02:27.800 sounds straightforward, but defining what you're actually trying to solve 47 00:02:28.039 --> 00:02:29.919 seems absolutely paramount. 48 00:02:30.039 --> 00:02:33.840 Oh it is because different people often have completely different 49 00:02:33.879 --> 00:02:37.240 ideas about what the real problem is. Our source material 50 00:02:37.280 --> 00:02:40.439 gives a great example. Imagine a music streaming company trying 51 00:02:40.520 --> 00:02:43.919 to fix a subscription problem. Okay, the sales director might 52 00:02:43.960 --> 00:02:47.240 immediately jump to thinking we need to attract new subscribers, 53 00:02:47.439 --> 00:02:49.919 but the finance director they might see the exact same 54 00:02:49.960 --> 00:02:53.759 problem as one of customer retention. You know, existing users 55 00:02:53.800 --> 00:02:54.919 aren't engaging enough. 56 00:02:55.159 --> 00:02:58.560 Ah, two completely different angles for the same business challenge. Yeah. 57 00:02:58.599 --> 00:03:02.840 So getting that problem definition crystal clear right at the 58 00:03:02.879 --> 00:03:06.319 outset really sets the entire direction for your project. It 59 00:03:06.360 --> 00:03:08.719 really does, and it's worth remembering. I suppose that in 60 00:03:08.800 --> 00:03:13.039 smaller teams, one person might wear many hats covering roles 61 00:03:13.039 --> 00:03:16.560 that in larger organizations would be spread across many specialists. 62 00:03:16.599 --> 00:03:20.039 Precisely, and beyond just the internal viewpoints, you also need 63 00:03:20.080 --> 00:03:24.599 to understand the domain context, the actual real world environment 64 00:03:24.639 --> 00:03:25.840 where the problem exists. 65 00:03:26.000 --> 00:03:26.199 Right. 
66 00:03:26.319 --> 00:03:30.159 For instance, analyzing phone faults in an emergency response organization, 67 00:03:30.520 --> 00:03:34.039 where reliability can be literally a matter of life and death, well, 68 00:03:34.039 --> 00:03:36.680 that's vastly different from doing the same analysis for a 69 00:03:36.759 --> 00:03:41.520 typical office environment. Completely different. Exactly. The context fundamentally changes 70 00:03:41.560 --> 00:03:44.800 the problem and its implications. It dictates everything, from, say, 71 00:03:44.879 --> 00:03:47.879 data quality standards to the urgency of finding solutions. 72 00:03:48.319 --> 00:03:52.159 That distinction really matters. Okay, so once you understand the problem, 73 00:03:52.360 --> 00:03:56.120 you need the raw material: data. This brings us nicely 74 00:03:56.159 --> 00:04:00.800 to step two: understanding and sourcing data. What immediately stands out 75 00:04:00.800 --> 00:04:03.680 here is a critical emphasis on using the right data, 76 00:04:03.800 --> 00:04:05.560 not just any data you can get your hands on. 77 00:04:05.840 --> 00:04:09.120 That's such a fundamental distinction, and one that trips up 78 00:04:09.120 --> 00:04:12.960 even giants. Our source highlights the cautionary tale of Sears. 79 00:04:13.000 --> 00:04:17.639 Remember them? Once a retail behemoth. Yeah, I do. They 80 00:04:17.680 --> 00:04:22.240 focused intensely on traditional financial KPIs, key performance indicators, like 81 00:04:22.439 --> 00:04:25.800 pure sales numbers. They were hitting their targets, technically, yet 82 00:04:25.879 --> 00:04:31.199 beneath those apparently strong financial results, customer satisfaction was plummeting, 83 00:04:31.720 --> 00:04:36.480 you know, poor service, outdated stores. Sears struggled to adapt, 84 00:04:36.759 --> 00:04:41.079 eventually filing for bankruptcy.
Their focus on just numerical or 85 00:04:41.160 --> 00:04:47.240 quantitative data completely obscured crucial qualitative insights about how customers 86 00:04:47.279 --> 00:04:47.920 actually felt. 87 00:04:47.959 --> 00:04:49.839 So it's not just about what you can easily count. 88 00:04:50.120 --> 00:04:53.040 You need both quantitative data, which is all about amounts, 89 00:04:53.120 --> 00:04:56.480 usually numerical, and qualitative data, which is more subjective, often 90 00:04:56.519 --> 00:04:59.920 words, like customer reviews or direct feedback about how someone 91 00:05:00.120 --> 00:05:01.519 feels about their broadband service, 92 00:05:01.600 --> 00:05:05.319 perhaps. Exactly right. And even within quantitative data there's a spectrum. 93 00:05:05.399 --> 00:05:09.000 It's often classified using the NOIR scales: nominal, ordinal, interval 94 00:05:09.040 --> 00:05:11.800 and ratio, N-O-I-R. The key insight here 95 00:05:11.920 --> 00:05:15.720 really is that knowing the scale dictates what mathematical operations 96 00:05:15.759 --> 00:05:19.199 you can actually perform on your data. You can't meaningfully 97 00:05:19.240 --> 00:05:22.040 average categories like red or blue, which are nominal, but 98 00:05:22.120 --> 00:05:26.040 you can certainly count them. Understanding NOIR prevents these fundamental 99 00:05:26.120 --> 00:05:29.759 analytical errors and really guides your choice of model later on. 100 00:05:30.120 --> 00:05:32.519 Got it. Now, here's where it gets really interesting and 101 00:05:32.639 --> 00:05:36.959 maybe a bit tricky: bias and skew. Mark Twain famously 102 00:05:37.040 --> 00:05:41.519 quipped about lies, damn lies and statistics. How does bias 103 00:05:41.560 --> 00:05:45.959 sneak into our data, making statistics so easily manipulated sometimes? 104 00:05:46.360 --> 00:05:48.879 Well, it often comes down to how you collect your data, 105 00:05:49.000 --> 00:05:52.439 your sampling methods.
If you only ask, say, a group 106 00:05:52.480 --> 00:05:54.680 of young males if they want more football on TV, 107 00:05:55.240 --> 00:05:57.839 you're highly likely to get a yes bias. The results 108 00:05:57.839 --> 00:06:00.800 would probably be very different if you asked a broad demographic. 109 00:06:00.959 --> 00:06:03.920 Makes sense. So bias creeps in when your sample, or 110 00:06:03.920 --> 00:06:06.199 maybe even the way you ask the questions, causes the 111 00:06:06.199 --> 00:06:09.120 results to lean a certain way, leading to incorrect or 112 00:06:09.160 --> 00:06:10.360 misleading conclusions. 113 00:06:10.680 --> 00:06:14.040 And data can also be skewed, right, meaning it sort 114 00:06:14.040 --> 00:06:17.560 of disproportionately leans in one direction, pulling your averages 115 00:06:17.120 --> 00:06:18.079 with it. Precisely. 116 00:06:18.800 --> 00:06:22.120 The most important takeaway here seems to be how easily 117 00:06:22.199 --> 00:06:26.000 data can be presented to tell a desired story rather 118 00:06:26.079 --> 00:06:29.879 than the full, impartial truth. This raises an important question 119 00:06:29.920 --> 00:06:33.000 for you listening. Yeah. What stands out to you when 120 00:06:33.000 --> 00:06:36.399 you think about the potential for biased data and skewed findings? 121 00:06:36.480 --> 00:06:39.160 Yeah, the potential for misinterpretation is just immense. And then 122 00:06:39.160 --> 00:06:41.839 you layer on top of that big data, which only 123 00:06:41.879 --> 00:06:44.800 amplifies these challenges. It's often defined by the three V's. 124 00:06:44.920 --> 00:06:49.600 The three V's: volume, just vast amounts; velocity, the sheer 125 00:06:49.680 --> 00:06:53.240 speed at which it's created and needs processing; and variety, 126 00:06:53.399 --> 00:06:56.920 that mix of structured data like spreadsheets with unstructured stuff 127 00:06:57.000 --> 00:06:59.399 like video, social media posts, or audio.
128 00:06:59.600 --> 00:07:02.759 And then sometimes it expands to the five V's, adding veracity, 129 00:07:02.800 --> 00:07:06.120 which is about the quality and accuracy, and variability, accounting 130 00:07:06.120 --> 00:07:10.120 for inconsistencies like how employees might be defined differently across 131 00:07:10.240 --> 00:07:14.199 various internal systems in a company. These characteristics clearly create 132 00:07:14.319 --> 00:07:19.079 significant hurdles in identifying, accessing, and frankly trusting the data 133 00:07:19.120 --> 00:07:20.279 you actually need for your project. 134 00:07:20.319 --> 00:07:23.000 They certainly do. And as for collection and storage, 135 00:07:23.079 --> 00:07:26.600 data can originate from all sorts of places: sensors, human entry, 136 00:07:26.879 --> 00:07:29.519 or it can even be synthetically generated these days. And it 137 00:07:29.560 --> 00:07:31.759 needs to be stored efficiently, whether that's in its raw 138 00:07:31.839 --> 00:07:34.800 format in a data lake, structured nicely for reporting in 139 00:07:34.800 --> 00:07:38.240 a data warehouse, or maybe in a more focused subset 140 00:07:38.319 --> 00:07:39.879 in a data mart. 141 00:07:39.720 --> 00:07:42.519 And overseeing all of this, you mentioned data governance, which 142 00:07:42.839 --> 00:07:45.519 sounds a bit like the unsung hero of the data world. 143 00:07:45.639 --> 00:07:49.399 It absolutely is. Data governance ensures the data is safe, efficient, 144 00:07:49.560 --> 00:07:53.600 and crucially reliable for use. It covers everything from who 145 00:07:53.639 --> 00:07:57.319 gets access to mandating storage requirements, and it really underpins 146 00:07:57.399 --> 00:08:01.560 data quality, which is vital. Okay. Our source emphasizes three lenses for 147 00:08:01.600 --> 00:08:04.759 thinking about data quality. Accuracy: is it complete, is it 148 00:08:04.800 --> 00:08:08.639 properly recorded? Latency: how old is it?
Is it still 149 00:08:08.680 --> 00:08:11.519 relevant for the decision you need to make? And lineage: 150 00:08:11.759 --> 00:08:14.079 where did it actually come from? Can its journey be 151 00:08:14.160 --> 00:08:15.199 traced and trusted? 152 00:08:15.439 --> 00:08:17.279 Right, the provenance. Exactly. 153 00:08:17.360 --> 00:08:20.399 The old adage "garbage in, garbage out" perfectly applies here. 154 00:08:20.480 --> 00:08:22.839 If your source data is poor, your insights will be 155 00:08:22.879 --> 00:08:25.879 too, no matter how sophisticated your analysis might be. 156 00:08:26.160 --> 00:08:29.720 Okay, so once you've sourced your data, it's rarely ready 157 00:08:29.720 --> 00:08:32.679 to just use straight away. This brings us to step 158 00:08:32.759 --> 00:08:37.759 three: preparation, or as it's often called, data wrangling. My 159 00:08:37.840 --> 00:08:40.639 understanding is this is typically the most time-consuming part 160 00:08:40.639 --> 00:08:42.480 of a data science project. Is that fair? 161 00:08:42.639 --> 00:08:45.799 Oh, absolutely. It's often where the bulk of the effort lies. 162 00:08:45.840 --> 00:08:49.679 It's all about making the data suitable for analysis, and 163 00:08:49.840 --> 00:08:52.840 the form of the data really matters here. This includes 164 00:08:52.879 --> 00:08:54.840 its granularity. 165 00:08:54.120 --> 00:08:54.679 Granularity? 166 00:08:54.759 --> 00:08:57.440 Yeah. For instance, do you need daily phone fault data 167 00:08:57.480 --> 00:09:00.399 for shift planning, or would monthly data suffice if 168 00:09:00.399 --> 00:09:03.679 you're just doing, say, a recruitment strategy? A key insight 169 00:09:03.759 --> 00:09:06.440 here is that you can derive less granular data from 170 00:09:06.440 --> 00:09:09.679 more detailed data.
You can always roll up daily data into 171 00:09:09.720 --> 00:09:12.919 monthly But you can't magically break down monthly data into 172 00:09:12.960 --> 00:09:15.399 daily insights if you didn't collect it that way. 173 00:09:15.519 --> 00:09:18.759 Right, you can't invent detail and scale matters too, doesn't 174 00:09:18.799 --> 00:09:21.679 it right? If you're comparing, say, phone age in years 175 00:09:22.039 --> 00:09:25.559 with usage in minutes, feature scaling ensures that the larger 176 00:09:25.639 --> 00:09:29.639 numerical range of minutes doesn't unfairly dominate your model compared 177 00:09:29.679 --> 00:09:30.879 to the influence of age. 178 00:09:31.000 --> 00:09:34.080 Exactly. It's about giving every feature a fair chance to 179 00:09:34.080 --> 00:09:37.000 contribute to the model's findings. Makes sense, And during this 180 00:09:37.080 --> 00:09:43.000 preparation phase you'll inevitably encounter common data quality risks, missing values, 181 00:09:43.399 --> 00:09:45.600 duplicate records, and outliers those. 182 00:09:45.440 --> 00:09:47.399 Extreme values right, the odd ones out. 183 00:09:47.559 --> 00:09:49.879 Yeah, And for outliers you have to make a conscious 184 00:09:49.919 --> 00:09:53.039 decision whether to keep them, remove them, or maybe even 185 00:09:53.080 --> 00:09:55.679 correct them, depending on what caused them and what impact 186 00:09:55.679 --> 00:09:58.600 they're having. And of course, the ever present risk of 187 00:09:58.639 --> 00:10:02.080 inherent bias can still lurk here even after sourcing. 188 00:10:02.559 --> 00:10:04.840 So how do you actually go about checking for these 189 00:10:04.840 --> 00:10:07.360 issues effectively? What are the practical steps. 190 00:10:07.240 --> 00:10:11.559 You perform practical checks? 
This includes visual inspection, literally looking 191 00:10:11.559 --> 00:10:14.080 at a sample of the data, maybe sorting it, looking 192 00:10:14.120 --> 00:10:18.799 at the edges for anomalies. Then graphical inspection, using charts 193 00:10:18.879 --> 00:10:22.559 like histograms to spot skewness or box plots to easily 194 00:10:22.600 --> 00:10:27.759 identify outliers. Okay. And finally, cross checks. This means verifying 195 00:10:27.799 --> 00:10:31.080 your transformed data against its original source to make sure 196 00:10:31.080 --> 00:10:34.960 you haven't introduced errors during the wrangling process. Consistency checks. 197 00:10:35.000 --> 00:10:38.200 This all sounds like really meticulous work, but it's clearly 198 00:10:38.320 --> 00:10:42.360 essential for any solid analysis down the line. Speaking of analysis, 199 00:10:42.399 --> 00:10:45.039 let's move to step four, the analytical engine. This is 200 00:10:45.080 --> 00:10:48.480 where we dive into basic concepts and model selection, where 201 00:10:48.519 --> 00:10:51.840 the magic of statistics truly starts to transform that raw, 202 00:10:52.080 --> 00:10:53.080 prepared data. 203 00:10:53.120 --> 00:10:55.519 It's definitely where the patterns start to emerge. We can 204 00:10:55.559 --> 00:10:58.000 begin with the basics: averages. We all know the mean, 205 00:10:58.080 --> 00:11:01.480 the standard arithmetic average, but the median, the middle value 206 00:11:01.480 --> 00:11:03.840 when you order your data, is often your secret weapon, 207 00:11:04.120 --> 00:11:07.000 especially with skewed data. Why is that? Because it's far 208 00:11:07.080 --> 00:11:10.320 more robust to outliers. It gives you the true typical 209 00:11:10.399 --> 00:11:14.480 value in skewed data sets, like understanding typical house prices 210 00:11:14.559 --> 00:11:18.480 in an area without being massively swayed by one huge 211 00:11:18.519 --> 00:11:19.320 mansion sale.
212 00:11:19.480 --> 00:11:20.840 Ah, okay, that makes sense. 213 00:11:20.879 --> 00:11:24.159 And the mode? The mode is simply the most frequent value, 214 00:11:24.600 --> 00:11:27.440 really useful for categorical data where you just want to 215 00:11:27.480 --> 00:11:31.200 know what's most common, like the most popular response in 216 00:11:31.240 --> 00:11:32.519 a survey. Got it. 217 00:11:32.600 --> 00:11:35.200 And then there are measures of spread, which tell you about 218 00:11:35.200 --> 00:11:37.919 the diversity or variability in your data, not just its 219 00:11:37.919 --> 00:11:42.399 center point. We have range, variance, and the more intuitive 220 00:11:42.399 --> 00:11:43.120 standard deviation. 221 00:11:43.639 --> 00:11:45.639 Right, so if you're looking at those house prices again, 222 00:11:45.919 --> 00:11:48.799 a high standard deviation means prices vary a lot around 223 00:11:48.840 --> 00:11:51.679 the average, while a low one means they're tightly clustered. 224 00:11:51.679 --> 00:11:55.200 It gives you a sense of consistency or, well, lack thereof. 225 00:11:54.840 --> 00:11:56.600 Helps quantify that spread. Exactly. 226 00:11:57.120 --> 00:12:00.240 And then there's probability, which is really the language of 227 00:12:00.320 --> 00:12:03.679 uncertainty itself, whether it's simple dice rolls or coin flips. 228 00:12:04.080 --> 00:12:08.240 Probability helps us quantify likelihood, and the law of large 229 00:12:08.320 --> 00:12:12.200 numbers is a powerful concept here. Basically, the more trials 230 00:12:12.240 --> 00:12:14.759 you run, or the more data points you have, the 231 00:12:14.879 --> 00:12:18.440 closer your observed frequency will tend to get to the theoretical probability. 232 00:12:19.240 --> 00:12:22.679 This makes your data-driven insights more reliable and less 233 00:12:22.679 --> 00:12:24.360 prone to just random fluctuations. 234 00:12:24.600 --> 00:12:28.639 That's a really powerful idea.
More data, more certainty in 235 00:12:28.679 --> 00:12:31.600 a way. We also briefly touch on the Cartesian plane 236 00:12:31.639 --> 00:12:35.039 and distance. This might sound like high school geometry flashback. Yeah, 237 00:12:35.120 --> 00:12:38.799 maybe a little, but it's actually foundational for how many 238 00:12:38.879 --> 00:12:43.159 statistical models understand the spatial relationships and similarities between different 239 00:12:43.200 --> 00:12:46.399 data points. It's how they see how close or far 240 00:12:46.559 --> 00:12:48.679 things are from each other in a mathematical space. 241 00:12:48.759 --> 00:12:51.679 Absolutely, it underpins a lot of modeling. So once you 242 00:12:51.759 --> 00:12:55.279 grasp these fundamental concepts, you're ready for the critical decision 243 00:12:55.519 --> 00:12:58.279 choosing the right model. It's so crucial because the wrong 244 00:12:58.320 --> 00:13:01.240 method will lead you to limited or maybe even misleading insights, 245 00:13:01.240 --> 00:13:03.600 no matter how good your data prep was. We can 246 00:13:03.639 --> 00:13:07.279 categorize analytics into three main types, broadly speaking. 247 00:13:07.120 --> 00:13:11.639 First, descriptive analytics, which looks backward to understand the past, 248 00:13:12.200 --> 00:13:16.960 like simply understanding last month's coffee shops sales trends, what happened? 249 00:13:17.080 --> 00:13:19.879 Then predictive analytics, which tries to peek into the future, 250 00:13:19.919 --> 00:13:22.879 like forecasting that latte sales are likely to increase next 251 00:13:22.879 --> 00:13:26.039 winter based on historical patterns and maybe weather data. 252 00:13:26.159 --> 00:13:29.919 Okay, looking ahead, And finally, prescriptive analytics, which is about 253 00:13:29.960 --> 00:13:32.960 deciding what action to take based on those predictions. 
254 00:13:33.000 --> 00:13:37.080 Exactly. So based on that latte prediction, you'd proactively decide, okay, 255 00:13:37.399 --> 00:13:41.120 let's stock up on ingredients and adjust staff schedules. They 256 00:13:41.159 --> 00:13:44.440 really work together for that full-circle, informed decision-making 257 00:13:44.480 --> 00:13:47.159 process. Makes sense. And for each of these types, there 258 00:13:47.159 --> 00:13:50.799 are specific model types we can use. For understanding fundamental 259 00:13:50.799 --> 00:13:55.399 relationships between variables, we often use correlation. Okay. For example, 260 00:13:55.519 --> 00:13:58.320 exploring if more gaming hours tend to correspond to lower 261 00:13:58.360 --> 00:14:01.759 student grades, or if increased advertising spend is associated 262 00:14:01.799 --> 00:14:05.159 with higher sales revenue. But the crucial insight here, the 263 00:14:05.200 --> 00:14:09.240 one everyone needs to remember, is correlation does not imply causation. 264 00:14:10.120 --> 00:14:11.720 Ah, yes, the classic. Say it 265 00:14:11.759 --> 00:14:15.879 again. Correlation does not imply causation. Just because two things 266 00:14:15.919 --> 00:14:18.960 move together doesn't automatically mean one causes the other. There 267 00:14:19.000 --> 00:14:21.960 could be a third factor, or it could be coincidence. 268 00:14:21.480 --> 00:14:25.320 Such a critical pitfall to avoid. Okay. Then we have regression, 269 00:14:25.360 --> 00:14:28.200 which you said is fantastic for predicting numerical values. 270 00:14:28.440 --> 00:14:31.200 That's right, like predicting how much a mobile phone's battery 271 00:14:31.200 --> 00:14:33.919 capacity is likely to decrease as the phone gets older, 272 00:14:34.440 --> 00:14:36.480 predicting a specific number. 273 00:14:36.320 --> 00:14:38.919 Gotcha. And for forecasting patterns over time? 274 00:14:39.159 --> 00:14:43.080 For that, time series analysis is key.
Airlines, for example, 275 00:14:43.279 --> 00:14:46.600 use this extensively to forecast passenger demand. It helps pick 276 00:14:46.679 --> 00:14:50.039 up on trends, seasonality, and other complex patterns in data 277 00:14:50.039 --> 00:14:53.639 that evolve over time. Models like ARIMA are common 278 00:14:53.360 --> 00:14:56.279 here. ARIMA, okay. And when you need to sort data 279 00:14:56.279 --> 00:15:01.039 into predefined categories, like yes or no, or customer or non-customer? 280 00:15:00.799 --> 00:15:04.559 You'd use classification. Think of an e-commerce platform predicting 281 00:15:04.679 --> 00:15:07.960 which website visitors are most likely to actually make a purchase, 282 00:15:08.600 --> 00:15:11.399 or maybe a decision tree model helping someone decide which 283 00:15:11.399 --> 00:15:14.000 phone to buy based on their budget and preferred brand. 284 00:15:14.440 --> 00:15:17.639 It guides you through a series of questions to a category. Right, like 285 00:15:17.600 --> 00:15:19.320 a flowchart. And what if you want to group 286 00:15:19.399 --> 00:15:23.240 similar data points together without knowing the categories beforehand? 287 00:15:23.279 --> 00:15:27.480 Oh, that's clustering. Imagine segmenting your customer base, based on 288 00:15:27.559 --> 00:15:32.039 their actual buying habits, into distinct groups you didn't predefine. 289 00:15:32.200 --> 00:15:35.919 Methods like k-means clustering can reveal these hidden customer 290 00:15:35.960 --> 00:15:38.360 personas just from the data itself. 291 00:15:38.559 --> 00:15:43.080 Finding natural groups. Yeah, okay. And finally, association. 292 00:15:43.080 --> 00:15:47.559 Association helps you discover relationships between items. It's famously used 293 00:15:47.559 --> 00:15:50.679 in market basket analysis in retail to see which products 294 00:15:50.679 --> 00:15:53.879 are frequently bought together.
The classic example is people who 295 00:15:53.879 --> 00:15:57.679 buy diapers often buy beer, apparently, or maybe more commonly, 296 00:15:57.720 --> 00:15:58.360 bread and butter. 297 00:15:58.679 --> 00:16:01.639 Right, finding those connections. Okay, so after selecting and building 298 00:16:01.679 --> 00:16:04.759 your model, model evaluation becomes critical. You need to know 299 00:16:04.759 --> 00:16:07.360 if your predictions are actually meaningful and reliable, not just 300 00:16:07.480 --> 00:16:08.200 random flukes. 301 00:16:08.279 --> 00:16:11.879 Right, absolutely crucial. You need to assess its performance using 302 00:16:12.000 --> 00:16:15.559 various metrics and concepts. One common one is the p-value. 303 00:16:15.679 --> 00:16:18.519 The p-value, often misunderstood. It is. 304 00:16:18.519 --> 00:16:21.720 It's not just about surprise. It's your model's way of asking, 305 00:16:22.200 --> 00:16:24.600 how likely is it that I'd observe this result, or 306 00:16:24.639 --> 00:16:28.000 something even more extreme, if there was no real effect 307 00:16:28.120 --> 00:16:31.519 actually occurring in the world, purely by random chance? Okay. 308 00:16:31.879 --> 00:16:34.879 A tiny p-value gives you confidence that your findings 309 00:16:34.879 --> 00:16:38.080 are statistically significant, meaning they're unlikely to be just a 310 00:16:38.120 --> 00:16:40.360 fluke of the data you happened to collect. Right, not 311 00:16:40.480 --> 00:16:44.000 just noise. Exactly. And you also critically compare the model's 312 00:16:44.039 --> 00:16:46.840 performance on training data, the data used to build it, 313 00:16:47.200 --> 00:16:50.679 and test data, new data it hasn't seen before. This 314 00:16:50.720 --> 00:16:54.960 helps spot overfitting.
Overfitting, that's where the model performs brilliantly 315 00:16:55.000 --> 00:16:57.679 on the data it's already seen, but completely falls apart 316 00:16:57.720 --> 00:17:00.720 when it encounters new, unseen data because it learned the 317 00:17:00.759 --> 00:17:05.000 training data too specifically, including its noise. Or the opposite, 318 00:17:05.119 --> 00:17:08.079 underfitting, where it's too simple and performs poorly on both. 319 00:17:08.279 --> 00:17:10.880 Finding that balance. And you also need to analyze errors 320 00:17:11.200 --> 00:17:13.599 like false positives and false negatives. Definitely. 321 00:17:13.720 --> 00:17:16.240 For example, predicting someone has the flu when they don't 322 00:17:16.519 --> 00:17:20.200 is a false positive. That has very different real-world implications, 323 00:17:20.240 --> 00:17:24.000 maybe unnecessary worry or treatment, than a false negative, which is 324 00:17:24.039 --> 00:17:26.519 predicting they don't have the flu when they actually do. 325 00:17:27.200 --> 00:17:31.640 Understanding the specific consequences of your model's errors is paramount 326 00:17:31.680 --> 00:17:33.200 in deciding if it's fit for purpose. 327 00:17:33.440 --> 00:17:37.519 Absolutely. Okay. This careful evaluation then leads us nicely to 328 00:17:37.599 --> 00:17:41.440 step five: visualizations, where you actually tell the story with 329 00:17:41.519 --> 00:17:45.480 your data. Our source says it beautifully: numbers can transform 330 00:17:45.519 --> 00:17:48.319 into stories and insights leap off the page. 331 00:17:48.480 --> 00:17:50.839 I love that framing too. It really captures it. It's 332 00:17:50.839 --> 00:17:53.119 about so much more than just picking a chart type. 333 00:17:53.240 --> 00:17:55.599 It's about crafting a compelling visual
334 00:17:55.319 --> 00:17:57.599 Narrative, and the key insights here seem to be knowing 335 00:17:57.599 --> 00:17:59.880 your audience, choosing the right chart type to convey your 336 00:18:00.039 --> 00:18:05.759 specific message, clearly simplifying complex data visually and using color wisely, 337 00:18:06.200 --> 00:18:10.440 especially considering accessibility for users with colorblindness, which is often overlooked. 338 00:18:10.720 --> 00:18:14.720 Absolutely good user experience principles applied just as much here. 339 00:18:15.240 --> 00:18:18.079 Think about interactivity through things like tooltips that pop up 340 00:18:18.079 --> 00:18:21.279 with details or filters that allow your audience to explore 341 00:18:21.319 --> 00:18:23.799 the data at their own pace and get answers to 342 00:18:23.839 --> 00:18:25.279 their own specific questions. 343 00:18:25.400 --> 00:18:28.920 Yeah, letting them dig in. We use so many chart types, yeah, 344 00:18:28.960 --> 00:18:32.319 bar charts, histograms, line graphs, scatter plots, box and whisker 345 00:18:32.359 --> 00:18:35.599 plots for showing spread heat maps, even stem and leaf 346 00:18:35.599 --> 00:18:37.200 plots sometimes true. 347 00:18:37.200 --> 00:18:40.000 But often the real power comes from combining charts, like 348 00:18:40.079 --> 00:18:43.680 putting a scatterplot showing individual data points alongside a line 349 00:18:43.720 --> 00:18:46.759 graph showing the overall trend That can tell a much 350 00:18:46.799 --> 00:18:50.240 more complete story, say about sales performance over time, showing 351 00:18:50.279 --> 00:18:54.079 both individual transaction outliers and the overall profit trends together. 352 00:18:54.240 --> 00:18:56.440 Good point, But there's a word of caution here too. 353 00:18:56.680 --> 00:19:00.880 Yes, definitely, visualizations can be subjective and quite a motive. 
354 00:19:01.000 --> 00:19:03.640 It's important to avoid making them overly technical for your 355 00:19:03.640 --> 00:19:07.599 audience or using distracting elements like, say, 3D graphs, 356 00:19:07.680 --> 00:19:11.319 which rarely add clarity and often just confuse the core message. 357 00:19:11.480 --> 00:19:12.599 Keep it clean and clear. 358 00:19:12.799 --> 00:19:17.400 Good advice. Okay, that brings us to the bigger picture, 359 00:19:17.720 --> 00:19:21.359 exploring the broader implications of data science, especially as it 360 00:19:21.400 --> 00:19:24.319 evolves into AI and touches more parts of our lives. 361 00:19:24.559 --> 00:19:27.640 This is such a critical discussion, and we sometimes encounter 362 00:19:27.720 --> 00:19:31.519 situations where the very application of data can spark significant 363 00:19:31.519 --> 00:19:36.160 public debate, raising ethical questions. Like what? Well, consider the 364 00:19:36.240 --> 00:19:39.359 controversy around the A-level exam grading during the COVID 365 00:19:39.400 --> 00:19:43.599 pandemic in the UK. Algorithms used historical school performance data 366 00:19:43.640 --> 00:19:46.759 to help assign grades when exams couldn't happen. This led 367 00:19:46.799 --> 00:19:50.279 to widespread public outcry. I remember that. Many felt it 368 00:19:50.319 --> 00:19:53.960 was deeply unfair to individual high-achieving students in historically 369 00:19:54.000 --> 00:19:57.920 lower-performing schools. It really highlighted the challenges of algorithmic 370 00:19:57.960 --> 00:20:01.119 fairness and how the public reacts when data-driven decisions 371 00:20:01.160 --> 00:20:03.079 don't seem to align with perceived equity. 372 00:20:03.400 --> 00:20:06.400 That example clearly shows the real-world impact these models 373 00:20:06.440 --> 00:20:09.359 can have and the importance of public perception and trust.
374 00:20:09.839 --> 00:20:13.319 It's also why questions arise like: are people comfortable with 375 00:20:13.359 --> 00:20:16.000 their data being used in this specific way? We saw 376 00:20:16.000 --> 00:20:18.799 Elon Musk raise concerns about his private jet movements being 377 00:20:18.839 --> 00:20:22.440 publicly tracked, citing personal privacy and safety risks for his family. 378 00:20:22.759 --> 00:20:24.119 It's a constant 379 00:20:23.640 --> 00:20:27.799 tension. Indeed, and to navigate these complexities we have legal frameworks. 380 00:20:28.319 --> 00:20:32.680 In Europe, for instance, the GDPR principles are key: lawful, fair, 381 00:20:32.839 --> 00:20:39.599 transparent processing, limited purpose, data minimization, accuracy, storage limitation, integrity 382 00:20:39.599 --> 00:20:43.599 and confidentiality, and accountability. These also have stricter rules for 383 00:20:43.720 --> 00:20:47.160 special category data like health information or race, which requires 384 00:20:47.160 --> 00:20:51.960 specific explicit consent. Regulatory bodies like the Information Commissioner's Office, 385 00:20:52.000 --> 00:20:55.039 the ICO, in the UK enforce these rules. They even 386 00:20:55.079 --> 00:20:58.000 reprimanded a school for using a facial recognition system for 387 00:20:58.079 --> 00:21:01.920 cashless catering, emphasizing the need for robust legal compliance around 388 00:21:01.960 --> 00:21:03.920 how data is used, especially sensitive data. 389 00:21:04.039 --> 00:21:08.519 It's a complex landscape. Then there's the exciting, but also 390 00:21:08.640 --> 00:21:12.799 maybe slightly intimidating, rapidly evolving world of machine learning and 391 00:21:12.880 --> 00:21:17.039 artificial intelligence. It feels important to clarify their relationship because 392 00:21:17.039 --> 00:21:19.319 the terms are often used interchangeably. 393 00:21:18.720 --> 00:21:22.440 Aren't they? They are, and they're deeply interconnected but distinct.
394 00:21:22.559 --> 00:21:25.200 You can think of it like this: data science methods 395 00:21:25.200 --> 00:21:28.279 are often used to develop machine learning models, and machine 396 00:21:28.359 --> 00:21:31.079 learning techniques can be applied to solve data science problems 397 00:21:31.240 --> 00:21:32.960 and also to create AI systems. 398 00:21:33.119 --> 00:21:34.519 Okay, so how do we define them? 399 00:21:34.960 --> 00:21:38.920 We can define machine learning, ML, generally as software that 400 00:21:39.000 --> 00:21:42.160 improves as it performs a task through experience with data, 401 00:21:42.480 --> 00:21:47.519