WEBVTT

1
00:00:00.120 --> 00:00:02.359
<v Speaker 1>Welcome to the deep dive, where we cut through the

2
00:00:02.399 --> 00:00:05.200
<v Speaker 1>noise to get straight to the knowledge you need. Today,

3
00:00:05.240 --> 00:00:09.279
<v Speaker 1>we're plunging into data science, a field that's, well, fundamentally

4
00:00:09.359 --> 00:00:12.320
<v Speaker 1>reshaping our world, often without us even realizing it.

5
00:00:12.320 --> 00:00:12.960
<v Speaker 2>It really is.

6
00:00:13.359 --> 00:00:17.480
<v Speaker 1>Think about it. The surprising underdog victory in Moneyball, that's

7
00:00:17.600 --> 00:00:22.719
<v Speaker 1>data science. Or those perfectly tailored recommendations from your streaming

8
00:00:22.760 --> 00:00:24.079
<v Speaker 1>services every single night.

9
00:00:23.960 --> 00:00:25.480
<v Speaker 2>Also data science.

10
00:00:25.719 --> 00:00:26.000
<v Speaker 1>Yep.

11
00:00:26.600 --> 00:00:31.600
<v Speaker 2>It's this force, quietly yet powerfully at work across our society,

12
00:00:31.719 --> 00:00:33.799
<v Speaker 2>far beyond what you might immediately see.

13
00:00:33.840 --> 00:00:37.079
<v Speaker 1>Absolutely, and in this deep dive, our mission is to

14
00:00:37.119 --> 00:00:40.840
<v Speaker 1>distill the, let's say, intricate insights from our source material,

15
00:00:41.079 --> 00:00:43.840
<v Speaker 1>which is a brilliant guide to data science, into a clear,

16
00:00:43.960 --> 00:00:46.640
<v Speaker 1>engaging, and practical understanding for you.

17
00:00:46.759 --> 00:00:48.600
<v Speaker 2>Exactly. We're here to give you a shortcut to being well

18
00:00:48.600 --> 00:00:52.520
<v Speaker 2>informed, exploring data science not just as a technical discipline,

19
00:00:52.520 --> 00:00:55.880
<v Speaker 2>but maybe more as the art of finding patterns in data.

20
00:00:56.000 --> 00:00:59.079
<v Speaker 1>The art of finding patterns. I like that. It captures

21
00:00:59.119 --> 00:01:02.000
<v Speaker 1>so much, doesn't it? It speaks to the creativity involved

22
00:01:02.000 --> 00:01:06.799
<v Speaker 1>in solving complex problems, preparing messy data, and ultimately telling

23
00:01:06.840 --> 00:01:09.680
<v Speaker 1>a compelling story with what you find. It does. And

24
00:01:09.719 --> 00:01:12.799
<v Speaker 1>speaking of hidden depths, data science is very much like

25
00:01:12.799 --> 00:01:16.480
<v Speaker 1>an iceberg. You often only see the tip, the sleek

26
00:01:16.599 --> 00:01:21.480
<v Speaker 1>apps or powerful AI, but most of the complex foundational

27
00:01:21.519 --> 00:01:25.760
<v Speaker 1>work, that's hidden beneath the surface. Most of it. Today,

28
00:01:26.079 --> 00:01:29.599
<v Speaker 1>we're going to try and illuminate those unseen processes, diving

29
00:01:29.640 --> 00:01:31.760
<v Speaker 1>into the core of how it all actually works.

30
00:01:32.200 --> 00:01:35.359
<v Speaker 2>That's a perfect analogy, and to illuminate that hidden bulk,

31
00:01:35.400 --> 00:01:37.840
<v Speaker 2>we're going to walk you through the iterative data analysis

32
00:01:37.879 --> 00:01:41.519
<v Speaker 2>life cycle. This is like the foundational framework for pretty

33
00:01:41.599 --> 00:01:44.239
<v Speaker 2>much any data science project. Okay. And it's rarely a

34
00:01:44.280 --> 00:01:47.680
<v Speaker 2>straight line. It's much more cyclical, meaning you often revisit

35
00:01:47.719 --> 00:01:52.040
<v Speaker 2>steps as you uncover new insights or unexpected challenges pop up.

36
00:01:52.120 --> 00:01:54.519
<v Speaker 1>Right, so it's a loop, not just a linear path. Yeah,

37
00:01:54.560 --> 00:01:56.959
<v Speaker 1>that makes intuitive sense when you're dealing with something as

38
00:01:57.040 --> 00:02:00.120
<v Speaker 1>dynamic as data. So what are the key stages we'll be

39
00:02:00.200 --> 00:02:01.239
<v Speaker 1>exploring in this journey?

40
00:02:01.400 --> 00:02:06.959
<v Speaker 2>Well, they start with discovery, then move through source, prepare, explore, create, analyse, communicate,

41
00:02:07.040 --> 00:02:11.199
<v Speaker 2>and finally operationalize. Each step builds on the last. But

42
00:02:11.319 --> 00:02:14.639
<v Speaker 2>like you said, the real power lies in its iterative nature.

43
00:02:15.000 --> 00:02:16.840
<v Speaker 2>You can loop back at pretty much any point.

44
00:02:17.000 --> 00:02:21.240
<v Speaker 1>Right, let's unpack this, starting with that crucial first step: discovery.

45
00:02:21.439 --> 00:02:24.199
<v Speaker 1>This is all about framing the problem, right? And it

46
00:02:24.280 --> 00:02:27.800
<v Speaker 1>sounds straightforward, but defining what you're actually trying to solve

47
00:02:28.039 --> 00:02:29.919
<v Speaker 1>seems absolutely paramount.

48
00:02:30.039 --> 00:02:33.840
<v Speaker 2>Oh, it is, because different people often have completely different

49
00:02:33.879 --> 00:02:37.240
<v Speaker 2>ideas about what the real problem is. Our source material

50
00:02:37.280 --> 00:02:40.439
<v Speaker 2>gives a great example. Imagine a music streaming company trying

51
00:02:40.520 --> 00:02:43.919
<v Speaker 2>to fix a subscription problem. Okay. The sales director might

52
00:02:43.960 --> 00:02:47.240
<v Speaker 2>immediately jump to thinking, we need to attract new subscribers,

53
00:02:47.439 --> 00:02:49.919
<v Speaker 2>but the finance director, they might see the exact same

54
00:02:49.960 --> 00:02:53.759
<v Speaker 2>problem as one of customer retention. You know, existing users

55
00:02:53.800 --> 00:02:54.919
<v Speaker 2>aren't engaging enough.

56
00:02:55.159 --> 00:02:58.560
<v Speaker 1>Ah, two completely different angles for the same business challenge. Yeah.

57
00:02:58.599 --> 00:03:02.840
<v Speaker 1>So getting that problem definition crystal clear right at the

58
00:03:02.879 --> 00:03:06.319
<v Speaker 1>outset really sets the entire direction for your project. It

59
00:03:06.360 --> 00:03:08.719
<v Speaker 1>really does. And it's worth remembering, I suppose, that in

60
00:03:08.800 --> 00:03:13.039
<v Speaker 1>smaller teams, one person might wear many hats, covering roles

61
00:03:13.039 --> 00:03:16.560
<v Speaker 1>that in larger organizations would be spread across many specialists.

62
00:03:16.599 --> 00:03:20.039
<v Speaker 2>Precisely, and beyond just the internal viewpoints, you also need

63
00:03:20.080 --> 00:03:24.599
<v Speaker 2>to understand the domain context, the actual real world environment

64
00:03:24.639 --> 00:03:25.840
<v Speaker 2>where the problem exists.

65
00:03:26.000 --> 00:03:26.199
<v Speaker 1>Right.

66
00:03:26.319 --> 00:03:30.159
<v Speaker 2>For instance, analyzing phone faults in an emergency response organization,

67
00:03:30.520 --> 00:03:34.039
<v Speaker 2>where reliability can be literally a matter of life and death. Well,

68
00:03:34.039 --> 00:03:36.680
<v Speaker 2>that's vastly different from doing the same analysis for a

69
00:03:36.759 --> 00:03:41.520
<v Speaker 2>typical office environment. Completely different. Exactly. The context fundamentally changes

70
00:03:41.560 --> 00:03:44.800
<v Speaker 2>the problem and its implications. It dictates everything, from say,

71
00:03:44.879 --> 00:03:47.879
<v Speaker 2>data quality standards, to the urgency of finding solutions.

72
00:03:48.319 --> 00:03:52.159
<v Speaker 1>That distinction really matters. Okay, so once you understand the problem,

73
00:03:52.360 --> 00:03:56.120
<v Speaker 1>you need the raw material data. This brings us nicely

74
00:03:56.159 --> 00:04:00.800
<v Speaker 1>to step two: understanding and sourcing data. What immediately stands out

75
00:04:00.800 --> 00:04:03.680
<v Speaker 1>here is a critical emphasis on using the right data,

76
00:04:03.800 --> 00:04:05.560
<v Speaker 1>not just any data you can get your hands on.

77
00:04:05.840 --> 00:04:09.120
<v Speaker 2>That's such a fundamental distinction, and one that trips up

78
00:04:09.120 --> 00:04:12.960
<v Speaker 2>even giants. Our source highlights the cautionary tale of Sears.

79
00:04:13.000 --> 00:04:17.639
<v Speaker 2>Remember them? Once a retail behemoth. Yeah, I do. They

80
00:04:17.680 --> 00:04:22.240
<v Speaker 2>focused intensely on traditional financial KPIs, key performance indicators, like

81
00:04:22.439 --> 00:04:25.800
<v Speaker 2>pure sales numbers. They were hitting their targets technically, yet

82
00:04:25.879 --> 00:04:31.199
<v Speaker 2>beneath those apparently strong financial results, customer satisfaction was plummeting,

83
00:04:31.720 --> 00:04:36.480
<v Speaker 2>you know, poor service, outdated stores. Sears struggled to adapt,

84
00:04:36.759 --> 00:04:41.079
<v Speaker 2>eventually filing for bankruptcy. Their focus on just numerical or

85
00:04:41.160 --> 00:04:47.240
<v Speaker 2>quantitative data completely obscured crucial qualitative insights about how customers

86
00:04:47.279 --> 00:04:47.920
<v Speaker 2>actually felt.

87
00:04:47.959 --> 00:04:49.839
<v Speaker 1>So it's not just about what you can easily count.

88
00:04:50.120 --> 00:04:53.040
<v Speaker 1>You need both quantitative data, which is all about amounts,

89
00:04:53.120 --> 00:04:56.480
<v Speaker 1>usually numerical, and qualitative data, which is more subjective, often

90
00:04:56.519 --> 00:04:59.920
<v Speaker 1>words like customer reviews or direct feedback about how someone

91
00:05:00.120 --> 00:05:01.519
<v Speaker 1>feels about their broadband service, perhaps.

92
00:05:01.600 --> 00:05:05.319
<v Speaker 2>Exactly right. And even within quantitative data there's a spectrum.

93
00:05:05.399 --> 00:05:09.000
<v Speaker 2>It's often classified using the NOIR scales: nominal, ordinal, interval,

94
00:05:09.040 --> 00:05:11.800
<v Speaker 2>and ratio, N-O-I-R. The key insight here

95
00:05:11.920 --> 00:05:15.720
<v Speaker 2>really is that knowing the scale dictates what mathematical operations

96
00:05:15.759 --> 00:05:19.199
<v Speaker 2>you can actually perform on your data. You can't meaningfully

97
00:05:19.240 --> 00:05:22.040
<v Speaker 2>average categories like red or blue, which are nominal, but

98
00:05:22.120 --> 00:05:26.040
<v Speaker 2>you can certainly count them. Understanding NOIR prevents these fundamental

99
00:05:26.120 --> 00:05:29.759
<v Speaker 2>analytical errors and really guides your choice of model later on.

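NOTE
Editor's sketch of the point about measurement scales, assuming a made-up list of phone colours as nominal data. Counting and finding the most frequent category are meaningful here; an arithmetic mean of the labels would not be. The variable names are illustrative and not from the source.
```python
from collections import Counter
# Hypothetical nominal data: labels with no order and no numeric meaning.
colours = ["red", "blue", "red", "green", "red", "blue"]
# Counting categories is a valid operation on nominal data.
counts = Counter(colours)
# The most frequent category (the mode) is also valid; a mean is not defined.
most_common_colour, most_common_count = counts.most_common(1)[0]
print(counts["red"], most_common_colour)
```
Ordinal, interval, and ratio data each allow progressively more operations, which is why the scale should be checked before a model is chosen.
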
100
00:05:30.120 --> 00:05:32.519
<v Speaker 1>Got it. Now, here's where it gets really interesting and

101
00:05:32.639 --> 00:05:36.959
<v Speaker 1>maybe a bit tricky: bias and skew. Mark Twain famously

102
00:05:37.040 --> 00:05:41.519
<v Speaker 1>quipped about lies, damn lies and statistics. How does bias

103
00:05:41.560 --> 00:05:45.959
<v Speaker 1>sneak into our data, making statistics so easily manipulated sometimes?

104
00:05:46.360 --> 00:05:48.879
<v Speaker 2>Well, it often comes down to how you collect your data,

105
00:05:49.000 --> 00:05:52.439
<v Speaker 2>your sampling methods. If you only ask, say, a group

106
00:05:52.480 --> 00:05:54.680
<v Speaker 2>of young males if they want more football on TV,

107
00:05:55.240 --> 00:05:57.839
<v Speaker 2>you're highly likely to get a yes bias. The results

108
00:05:57.839 --> 00:06:00.800
<v Speaker 2>would probably be very different if you asked a broad demographic.

109
00:06:00.959 --> 00:06:03.920
<v Speaker 2>Makes sense. So bias creeps in when your sample, or

110
00:06:03.920 --> 00:06:06.199
<v Speaker 2>maybe even the way you ask the questions, causes the

111
00:06:06.199 --> 00:06:09.120
<v Speaker 2>results to lean a certain way, leading to incorrect or

112
00:06:09.160 --> 00:06:10.360
<v Speaker 2>misleading conclusions.

113
00:06:10.680 --> 00:06:14.040
<v Speaker 1>And data can also be skewed, right? Meaning it sort

114
00:06:14.040 --> 00:06:17.560
<v Speaker 1>of disproportionately leans in one direction, pulling your averages with it.

115
00:06:17.120 --> 00:06:18.079
<v Speaker 2>Precisely.

116
00:06:18.800 --> 00:06:22.120
<v Speaker 1>The most important takeaway here seems to be how easily

117
00:06:22.199 --> 00:06:26.000
<v Speaker 1>data can be presented to tell a desired story rather

118
00:06:26.079 --> 00:06:29.879
<v Speaker 1>than the full, impartial truth. This raises an important question

119
00:06:29.920 --> 00:06:33.000
<v Speaker 1>for you listening. Yeah, what stands out to you when

120
00:06:33.000 --> 00:06:36.399
<v Speaker 1>you think about the potential for biased data and skewed findings?

121
00:06:36.480 --> 00:06:39.160
<v Speaker 2>Yeah, the potential for misinterpretation is just immense. And then

122
00:06:39.160 --> 00:06:41.839
<v Speaker 2>you layer on top of that big data, which only

123
00:06:41.879 --> 00:06:44.800
<v Speaker 2>amplifies these challenges. It's often defined by the three Vs.

124
00:06:44.920 --> 00:06:49.600
<v Speaker 2>The three Vs: volume, just vast amounts; velocity, the sheer

125
00:06:49.680 --> 00:06:53.240
<v Speaker 2>speed at which it's created and needs processing; and variety,

126
00:06:53.399 --> 00:06:56.920
<v Speaker 2>that mix of structured data like spreadsheets with unstructured stuff

127
00:06:57.000 --> 00:06:59.399
<v Speaker 2>like video, social media posts, or audio.

128
00:06:59.600 --> 00:07:02.759
<v Speaker 1>And then sometimes it expands to the five Vs, adding veracity,

129
00:07:02.800 --> 00:07:06.120
<v Speaker 1>which is about the quality and accuracy, and variability, accounting

130
00:07:06.120 --> 00:07:10.120
<v Speaker 1>for inconsistencies, like how employees might be defined differently across

131
00:07:10.240 --> 00:07:14.199
<v Speaker 1>various internal systems in a company. These characteristics clearly create

132
00:07:14.319 --> 00:07:19.079
<v Speaker 1>significant hurdles in identifying, accessing, and frankly trusting the data

133
00:07:19.120 --> 00:07:20.279
<v Speaker 1>you actually need for your project.

134
00:07:20.319 --> 00:07:23.000
<v Speaker 2>They certainly do. And as for collection and storage,

135
00:07:23.079 --> 00:07:26.600
<v Speaker 2>data can originate from all sorts of places: sensors, human entry,

136
00:07:26.879 --> 00:07:29.519
<v Speaker 2>or can even be synthetically generated these days, and it

137
00:07:29.560 --> 00:07:31.759
<v Speaker 2>needs to be stored efficiently, whether that's in its raw

138
00:07:31.839 --> 00:07:34.800
<v Speaker 2>format in a data lake, structured nicely for reporting in

139
00:07:34.800 --> 00:07:38.240
<v Speaker 2>a data warehouse, or maybe in a more focused subset

140
00:07:38.319 --> 00:07:39.879
<v Speaker 2>in a data mart.

141
00:07:39.720 --> 00:07:42.519
<v Speaker 1>And overseeing all of this, you mentioned data governance, which

142
00:07:42.839 --> 00:07:45.519
<v Speaker 1>sounds a bit like the unsung hero of the data world.

143
00:07:45.639 --> 00:07:49.399
<v Speaker 2>It absolutely is. Data governance ensures the data is safe, efficient,

144
00:07:49.560 --> 00:07:53.600
<v Speaker 2>and crucially reliable for use. It covers everything from who

145
00:07:53.639 --> 00:07:57.319
<v Speaker 2>gets access to mandating storage requirements, and it really underpins

146
00:07:57.399 --> 00:08:01.560
<v Speaker 2>data quality, which is vital. Okay. The source emphasizes three lenses for

147
00:08:01.600 --> 00:08:04.759
<v Speaker 2>thinking about data quality: accuracy, is it complete, is it

148
00:08:04.800 --> 00:08:08.639
<v Speaker 2>properly recorded? Latency, how old is it? Is it still

149
00:08:08.680 --> 00:08:11.519
<v Speaker 2>relevant for the decision you need to make? And lineage,

150
00:08:11.759 --> 00:08:14.079
<v Speaker 2>where did it actually come from? Can its journey be

151
00:08:14.160 --> 00:08:15.199
<v Speaker 2>traced and trusted?

152
00:08:15.439 --> 00:08:17.279
<v Speaker 1>Right, the provenance.

153
00:08:17.360 --> 00:08:20.399
<v Speaker 2>Exactly. The old adage, garbage in, garbage out, perfectly applies here.

154
00:08:20.480 --> 00:08:22.839
<v Speaker 2>If your source data is poor, your insights will be

155
00:08:22.879 --> 00:08:25.879
<v Speaker 2>too, no matter how sophisticated your analysis might be.

156
00:08:26.160 --> 00:08:29.720
<v Speaker 1>Okay, so once you've sourced your data, it's rarely ready

157
00:08:29.720 --> 00:08:32.679
<v Speaker 1>to just use straight away. This brings us to step

158
00:08:32.759 --> 00:08:37.759
<v Speaker 1>three: preparation, or, as it's often called, data wrangling. My

159
00:08:37.840 --> 00:08:40.639
<v Speaker 1>understanding is this is typically the most time-consuming part

160
00:08:40.639 --> 00:08:42.480
<v Speaker 1>of a data science project. Is that fair?

161
00:08:42.639 --> 00:08:45.799
<v Speaker 2>Oh, absolutely. It's often where the bulk of the effort lies.

162
00:08:45.840 --> 00:08:49.679
<v Speaker 2>It's all about making the data suitable for analysis, and

163
00:08:49.840 --> 00:08:52.840
<v Speaker 2>the form of the data really matters here. This includes

164
00:08:52.879 --> 00:08:54.840
<v Speaker 2>its granularity.

165
00:08:54.120 --> 00:08:54.679
<v Speaker 1>Granularity?

166
00:08:54.759 --> 00:08:57.440
<v Speaker 2>Yeah. For instance, do you need daily phone fault data

167
00:08:57.480 --> 00:09:00.399
<v Speaker 2>for shift planning, or would monthly data suffice if

168
00:09:00.399 --> 00:09:03.679
<v Speaker 2>you're just doing, say, a recruitment strategy? A key insight

169
00:09:03.759 --> 00:09:06.440
<v Speaker 2>here is that you can consolidate less granular data from

170
00:09:06.440 --> 00:09:09.679
<v Speaker 2>more detail. You can always roll up daily data into

171
00:09:09.720 --> 00:09:12.919
<v Speaker 2>monthly, but you can't magically break down monthly data into

172
00:09:12.960 --> 00:09:15.399
<v Speaker 2>daily insights if you didn't collect it that way.

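NOTE
Editor's sketch of the roll-up idea, assuming a small made-up table of daily phone fault counts keyed by ISO date. Daily values can be summed into monthly totals, but the monthly totals alone cannot be split back into days. All names and numbers are hypothetical.
```python
from collections import defaultdict
# Hypothetical daily fault counts keyed by ISO date (YYYY-MM-DD).
daily_faults = {"2024-01-30": 4, "2024-01-31": 6, "2024-02-01": 3, "2024-02-02": 5}
# Roll up: sum every day into its month (the first seven characters, YYYY-MM).
monthly_faults = defaultdict(int)
for day, faults in daily_faults.items():
    month = day[:7]
    monthly_faults[month] += faults
print(dict(monthly_faults))
```
Going the other way would need the original daily records, which is why it is usually safer to collect at the finer granularity.
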
173
00:09:15.519 --> 00:09:18.759
<v Speaker 1>Right, you can't invent detail. And scale matters too, doesn't

174
00:09:18.799 --> 00:09:21.679
<v Speaker 1>it? Right. If you're comparing, say, phone age in years

175
00:09:22.039 --> 00:09:25.559
<v Speaker 1>with usage in minutes, feature scaling ensures that the larger

176
00:09:25.639 --> 00:09:29.639
<v Speaker 1>numerical range of minutes doesn't unfairly dominate your model compared

177
00:09:29.679 --> 00:09:30.879
<v Speaker 1>to the influence of age.

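NOTE
Editor's sketch of feature scaling, assuming min-max scaling as the technique (the speakers do not name a specific method) and made-up values for phone age and usage. After scaling, both features sit in the range 0 to 1, so minutes no longer dwarf years.
```python
def min_max_scale(values):
    """Rescale numbers to the range 0..1 (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
# Hypothetical data: phone age in years and usage in minutes.
age_years = [1, 2, 3, 5]
usage_minutes = [200, 1200, 700, 2200]
scaled_age = min_max_scale(age_years)
scaled_usage = min_max_scale(usage_minutes)
print(scaled_age, scaled_usage)
```
Standardising to zero mean and unit variance is a common alternative; which one fits best depends on the model and on how outliers should be treated.
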
178
00:09:31.000 --> 00:09:34.080
<v Speaker 2>Exactly. It's about giving every feature a fair chance to

179
00:09:34.080 --> 00:09:37.000
<v Speaker 2>contribute to the model's findings. Makes sense. And during this

180
00:09:37.080 --> 00:09:43.000
<v Speaker 2>preparation phase you'll inevitably encounter common data quality risks: missing values,

181
00:09:43.399 --> 00:09:45.600
<v Speaker 2>duplicate records, and outliers.

182
00:09:45.440 --> 00:09:47.399
<v Speaker 1>Those extreme values, right? The odd ones out.

183
00:09:47.559 --> 00:09:49.879
<v Speaker 2>Yeah. And for outliers you have to make a conscious

184
00:09:49.919 --> 00:09:53.039
<v Speaker 2>decision whether to keep them, remove them, or maybe even

185
00:09:53.080 --> 00:09:55.679
<v Speaker 2>correct them, depending on what caused them and what impact

186
00:09:55.679 --> 00:09:58.600
<v Speaker 2>they're having. And of course, the ever present risk of

187
00:09:58.639 --> 00:10:02.080
<v Speaker 2>inherent bias can still lurk here even after sourcing.

188
00:10:02.559 --> 00:10:04.840
<v Speaker 1>So how do you actually go about checking for these

189
00:10:04.840 --> 00:10:07.360
<v Speaker 1>issues effectively? What are the practical steps?

190
00:10:07.240 --> 00:10:11.559
<v Speaker 2>You perform practical checks. This includes visual inspection, literally looking

191
00:10:11.559 --> 00:10:14.080
<v Speaker 2>at a sample of the data, maybe sorting it, looking

192
00:10:14.120 --> 00:10:18.799
<v Speaker 2>at the edges for anomalies. Then graphical inspection using charts

193
00:10:18.879 --> 00:10:22.559
<v Speaker 2>like histograms to spot skewness or box plots to easily

194
00:10:22.600 --> 00:10:27.759
<v Speaker 2>identify outliers. Okay. And finally, cross-checks. This means verifying

195
00:10:27.799 --> 00:10:31.080
<v Speaker 2>your transformed data against its original source to make sure

196
00:10:31.080 --> 00:10:34.960
<v Speaker 2>you haven't introduced errors during the wrangling process. Consistency checks.

197
00:10:35.000 --> 00:10:38.200
<v Speaker 1>This all sounds like really meticulous work, but it's clearly

198
00:10:38.320 --> 00:10:42.360
<v Speaker 1>essential for any solid analysis down the line. Speaking of analysis,

199
00:10:42.399 --> 00:10:45.039
<v Speaker 1>let's move to step four, the analytical engine. This is

200
00:10:45.080 --> 00:10:48.480
<v Speaker 1>where we dive into basic concepts and model selection, where

201
00:10:48.519 --> 00:10:51.840
<v Speaker 1>the magic of statistics truly starts to transform that raw,

202
00:10:52.080 --> 00:10:53.080
<v Speaker 1>prepared data.

203
00:10:53.120 --> 00:10:55.519
<v Speaker 2>It's definitely where the patterns start to emerge. We can

204
00:10:55.559 --> 00:10:58.000
<v Speaker 2>begin with the basics: averages. We all know the mean,

205
00:10:58.080 --> 00:11:01.480
<v Speaker 2>the standard arithmetic average, but the median, the middle value

206
00:11:01.480 --> 00:11:03.840
<v Speaker 2>when you order your data, is often your secret weapon,

207
00:11:04.120 --> 00:11:07.000
<v Speaker 2>especially with skewed data. Why is that? Because it's far

208
00:11:07.080 --> 00:11:10.320
<v Speaker 2>more robust to outliers. It gives you the true typical

209
00:11:10.399 --> 00:11:14.480
<v Speaker 2>value in skewed data sets, like understanding typical house prices

210
00:11:14.559 --> 00:11:18.480
<v Speaker 2>in an area without being massively swayed by one huge

211
00:11:18.519 --> 00:11:19.320
<v Speaker 2>mansion sale.

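NOTE
Editor's sketch of the mean versus median point, assuming six made-up house prices in thousands where the last one is the mansion. The single extreme sale pulls the mean far above what most houses cost, while the median stays with the typical homes.
```python
from statistics import mean, median
# Hypothetical house prices in thousands; the last value is the mansion sale.
prices = [200, 210, 220, 230, 240, 2000]
# The mean is dragged upward by the outlier; the median is the middle of the ordered values.
print(mean(prices), median(prices))
```
With an even number of values, the median is the average of the two middle ones, here 220 and 230.
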
212
00:11:19.480 --> 00:11:20.840
<v Speaker 1>Ah, okay, that makes sense.

213
00:11:20.879 --> 00:11:24.159
<v Speaker 2>And the mode? The mode is simply the most frequent value,

214
00:11:24.600 --> 00:11:27.440
<v Speaker 2>really useful for categorical data where you just want to

215
00:11:27.480 --> 00:11:31.200
<v Speaker 2>know what's most common, like the most popular response in

216
00:11:31.240 --> 00:11:32.519
<v Speaker 2>a survey. Got it.

217
00:11:32.600 --> 00:11:35.200
<v Speaker 1>And then there are measures of spread, which tell you about

218
00:11:35.200 --> 00:11:37.919
<v Speaker 1>the diversity or variability in your data, not just its

219
00:11:37.919 --> 00:11:42.399
<v Speaker 1>center point. We have range, variance, and the more intuitive

220
00:11:42.399 --> 00:11:43.120
<v Speaker 1>standard deviation.

221
00:11:43.639 --> 00:11:45.639
<v Speaker 2>Right, so if you're looking at those house prices again,

222
00:11:45.919 --> 00:11:48.799
<v Speaker 2>a high standard deviation means prices vary a lot around

223
00:11:48.840 --> 00:11:51.679
<v Speaker 2>the average, while a low one means they're tightly clustered.

224
00:11:51.679 --> 00:11:55.200
<v Speaker 2>It gives you a sense of consistency or, well, lack thereof.

225
00:11:54.840 --> 00:11:56.600
<v Speaker 1>Helps quantify that spread. Exactly.

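NOTE
Editor's sketch of standard deviation as a measure of spread, assuming two made-up sets of house prices in thousands that share the same average. The population standard deviation is used here for simplicity; that choice is an assumption, not something stated by the speakers.
```python
from statistics import mean, pstdev
# Hypothetical prices in thousands: same average, very different spread.
tight = [200, 205, 210]
spread = [100, 205, 310]
# A low standard deviation means values cluster near the mean; a high one means they vary widely.
print(mean(tight), pstdev(tight))
print(mean(spread), pstdev(spread))
```
The two averages match, so only the spread figure reveals that the second neighbourhood is far less consistent.
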
226
00:11:57.120 --> 00:12:00.240
<v Speaker 2>And then there's probability, which is really the language of

227
00:12:00.320 --> 00:12:03.679
<v Speaker 2>uncertainty itself. Whether it's simple dice rolls or coin flips,

228
00:12:04.080 --> 00:12:08.240
<v Speaker 2>probability helps us quantify likelihood, and the law of large

229
00:12:08.320 --> 00:12:12.200
<v Speaker 2>numbers is a powerful concept here. Basically, the more trials

230
00:12:12.240 --> 00:12:14.759
<v Speaker 2>you run, or the more data points you have, the

231
00:12:14.879 --> 00:12:18.440
<v Speaker 2>closer your observed frequency will get to the theoretical probability.

232
00:12:19.240 --> 00:12:22.679
<v Speaker 2>This makes your data driven insights more reliable and less

233
00:12:22.679 --> 00:12:24.360
<v Speaker 2>prone to just random fluctuations.

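NOTE
Editor's sketch of the law of large numbers, assuming a simulated fair coin and a fixed random seed so the run is repeatable. The helper name heads_frequency is hypothetical. As the number of flips grows, the observed share of heads tends to settle near the theoretical 0.5.
```python
import random
def heads_frequency(flips, seed=0):
    """Simulate fair coin flips and return the observed share of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return heads / flips
# Compare small and large numbers of trials.
for n in (10, 1000, 100000):
    print(n, heads_frequency(n))
```
Small samples can land well away from 0.5 by chance, which is the random fluctuation the speakers mention.
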
234
00:12:24.600 --> 00:12:28.639
<v Speaker 1>That's a really powerful idea. More data, more certainty in

235
00:12:28.679 --> 00:12:31.600
<v Speaker 1>a way. We also briefly touch on the Cartesian plane

236
00:12:31.639 --> 00:12:35.039
<v Speaker 1>and distance. This might sound like a high school geometry flashback. Yeah,

237
00:12:35.120 --> 00:12:38.799
<v Speaker 1>maybe a little, but it's actually foundational for how many

238
00:12:38.879 --> 00:12:43.159
<v Speaker 1>statistical models understand the spatial relationships and similarities between different

239
00:12:43.200 --> 00:12:46.399
<v Speaker 1>data points. It's how they see how close or far

240
00:12:46.559 --> 00:12:48.679
<v Speaker 1>things are from each other in a mathematical space.

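NOTE
Editor's sketch of distance on the Cartesian plane, assuming three made-up customers plotted as (age in years, monthly spend) and straight-line Euclidean distance as the measure. Points that are close together are treated as similar.
```python
import math
# Hypothetical customers as (age in years, monthly spend) points.
a, b, c = (25, 40), (28, 44), (60, 90)
# Euclidean distance: square root of the summed squared differences per axis.
print(math.dist(a, b), math.dist(a, c))
```
Customer b sits much closer to a than c does, so a distance-based model would group a and b first. In practice the axes would usually be scaled beforehand so one unit does not dominate.
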
241
00:12:48.759 --> 00:12:51.679
<v Speaker 2>Absolutely, it underpins a lot of modeling. So once you

242
00:12:51.759 --> 00:12:55.279
<v Speaker 2>grasp these fundamental concepts, you're ready for the critical decision

243
00:12:55.519 --> 00:12:58.279
<v Speaker 2>choosing the right model. It's so crucial because the wrong

244
00:12:58.320 --> 00:13:01.240
<v Speaker 2>method will lead you to limited or maybe even misleading insights,

245
00:13:01.240 --> 00:13:03.600
<v Speaker 2>no matter how good your data prep was. We can

246
00:13:03.639 --> 00:13:07.279
<v Speaker 2>categorize analytics into three main types, broadly speaking.

247
00:13:07.120 --> 00:13:11.639
<v Speaker 1>First, descriptive analytics, which looks backward to understand the past,

248
00:13:12.200 --> 00:13:16.960
<v Speaker 1>like simply understanding last month's coffee shop sales trends. What happened?

249
00:13:17.080 --> 00:13:19.879
<v Speaker 2>Then predictive analytics, which tries to peek into the future,

250
00:13:19.919 --> 00:13:22.879
<v Speaker 2>like forecasting that latte sales are likely to increase next

251
00:13:22.879 --> 00:13:26.039
<v Speaker 2>winter based on historical patterns and maybe weather data.

252
00:13:26.159 --> 00:13:29.919
<v Speaker 1>Okay, looking ahead. And finally, prescriptive analytics, which is about

253
00:13:29.960 --> 00:13:32.960
<v Speaker 1>deciding what action to take based on those predictions.

254
00:13:33.000 --> 00:13:37.080
<v Speaker 2>Exactly. So based on that latte prediction, you'd proactively decide, okay,

255
00:13:37.399 --> 00:13:41.120
<v Speaker 2>let's stock up on ingredients and adjust staff schedules. They

256
00:13:41.159 --> 00:13:44.440
<v Speaker 2>really work together for that full circle informed decision-making

257
00:13:44.480 --> 00:13:47.159
<v Speaker 2>process. Makes sense. And for each of these types, there

258
00:13:47.159 --> 00:13:50.799
<v Speaker 2>are specific model types we can use. For understanding fundamental

259
00:13:50.799 --> 00:13:55.399
<v Speaker 2>relationships between variables, we often use correlation. Okay. For example,

260
00:13:55.519 --> 00:13:58.320
<v Speaker 2>exploring if more gaming hours tend to correspond to lower

261
00:13:58.360 --> 00:14:01.759
<v Speaker 2>student grades, or if increased advertising spend is associated

262
00:14:01.799 --> 00:14:05.159
<v Speaker 2>with higher sales revenue. But the crucial insight here, the

263
00:14:05.200 --> 00:14:09.240
<v Speaker 2>one everyone needs to remember, is correlation does not imply causation.

264
00:14:10.120 --> 00:14:11.720
<v Speaker 1>Ah, yes, the classic. Say it again.

265
00:14:11.759 --> 00:14:15.879
<v Speaker 2>Correlation does not imply causation. Just because two things

266
00:14:15.919 --> 00:14:18.960
<v Speaker 2>move together doesn't automatically mean one causes the other. There

267
00:14:19.000 --> 00:14:21.960
<v Speaker 2>could be a third factor, or it could be coincidence.

268
00:14:21.480 --> 00:14:25.320
<v Speaker 1>Such a critical pitfall to avoid. Okay. Then we have regression,

269
00:14:25.360 --> 00:14:28.200
<v Speaker 1>which you said is fantastic for predicting numerical values.

270
00:14:28.440 --> 00:14:31.200
<v Speaker 2>That's right, like predicting how much a mobile phone's battery

271
00:14:31.200 --> 00:14:33.919
<v Speaker 2>capacity is likely to decrease as the phone gets older,

272
00:14:34.440 --> 00:14:36.480
<v Speaker 2>predicting a specific number.

273
00:14:36.320 --> 00:14:38.919
<v Speaker 1>Gotcha. And for forecasting patterns over time?

274
00:14:39.159 --> 00:14:43.080
<v Speaker 2>For that, time series analysis is key. Airlines, for example,

275
00:14:43.279 --> 00:14:46.600
<v Speaker 2>use this extensively to forecast passenger demand. It helps pick

276
00:14:46.679 --> 00:14:50.039
<v Speaker 2>up on trends, seasonality, and other complex patterns in data

277
00:14:50.039 --> 00:14:53.639
<v Speaker 2>that evolve over time. Models like ARIMA are common here.

278
00:14:53.360 --> 00:14:56.279
<v Speaker 1>ARIMA, okay. And when you need to sort data

279
00:14:56.279 --> 00:15:01.039
<v Speaker 1>into predefined categories like yes/no or customer/non-customer?

280
00:15:00.799 --> 00:15:04.559
<v Speaker 2>You'd use classification. Think of an e commerce platform predicting

281
00:15:04.679 --> 00:15:07.960
<v Speaker 2>which website visitors are most likely to actually make a purchase,

282
00:15:08.600 --> 00:15:11.399
<v Speaker 2>or maybe a decision tree model helping someone decide which

283
00:15:11.399 --> 00:15:14.000
<v Speaker 2>phone to buy based on their budget and preferred brand.

284
00:15:14.440 --> 00:15:17.639
<v Speaker 2>It guides you through a series of questions to a category.

285
00:15:17.600 --> 00:15:19.320
<v Speaker 1>Right, like a flowchart. And what if you want to group

286
00:15:19.399 --> 00:15:23.240
<v Speaker 1>similar data points together without knowing the categories beforehand.

287
00:15:23.279 --> 00:15:27.480
<v Speaker 2>Oh, that's clustering. Imagine segmenting your customer base based on

288
00:15:27.559 --> 00:15:32.039
<v Speaker 2>their actual buying habits into distinct groups you didn't predefine.

289
00:15:32.200 --> 00:15:35.919
<v Speaker 2>Methods like k-means clustering can reveal these hidden customer

290
00:15:35.960 --> 00:15:38.360
<v Speaker 2>personas just from the data itself.

291
00:15:38.559 --> 00:15:43.080
<v Speaker 1>Finding natural groups. Yeah, okay. And finally, association?

292
00:15:43.080 --> 00:15:47.559
<v Speaker 2>Association helps you discover relationships between items. It's famously used

293
00:15:47.559 --> 00:15:50.679
<v Speaker 2>in market basket analysis in retail to see which products

294
00:15:50.679 --> 00:15:53.879
<v Speaker 2>are frequently bought together. The classic example is people who

295
00:15:53.879 --> 00:15:57.679
<v Speaker 2>buy diapers often buy beer apparently, or maybe more commonly,

296
00:15:57.720 --> 00:15:58.360
<v Speaker 2>bread and butter.

297
00:15:58.679 --> 00:16:01.639
<v Speaker 1>Right, finding those connections. Okay, So after selecting and building

298
00:16:01.679 --> 00:16:04.759
<v Speaker 1>your model, model evaluation becomes critical. You need to know

299
00:16:04.759 --> 00:16:07.360
<v Speaker 1>if your predictions are actually meaningful and reliable, not just

300
00:16:07.480 --> 00:16:08.200
<v Speaker 1>random flukes.

301
00:16:08.279 --> 00:16:11.879
<v Speaker 2>Right, absolutely crucial. You need to assess its performance using

302
00:16:12.000 --> 00:16:15.559
<v Speaker 2>various metrics and concepts. One common one is the p-value.

303
00:16:15.679 --> 00:16:18.519
<v Speaker 1>The p-value, often misunderstood.

304
00:16:18.519 --> 00:16:21.720
<v Speaker 2>It is. It's not just about surprise. It's your model's way of asking,

305
00:16:22.200 --> 00:16:24.600
<v Speaker 2>how likely is it that I observe this result or

306
00:16:24.639 --> 00:16:28.000
<v Speaker 2>something even more extreme if there was no real effect

307
00:16:28.120 --> 00:16:31.519
<v Speaker 2>actually occurring in the world, purely by random chance? Okay.

308
00:16:31.879 --> 00:16:34.879
<v Speaker 2>A tiny p-value gives you confidence that your findings

309
00:16:34.879 --> 00:16:38.080
<v Speaker 2>are statistically significant, meaning they're unlikely to be just a

310
00:16:38.120 --> 00:16:40.360
<v Speaker 2>fluke of the data you happened to collect. Right, not

311
00:16:40.480 --> 00:16:44.000
<v Speaker 2>just noise. Exactly. And you also critically compare the model's

312
00:16:44.039 --> 00:16:46.840
<v Speaker 2>performance on training data, the data used to build it,

313
00:16:47.200 --> 00:16:50.679
<v Speaker 2>and test data, new data it hasn't seen before. This

314
00:16:50.720 --> 00:16:54.960
<v Speaker 2>helps spot overfitting. Overfitting, that's where the model performs brilliantly

315
00:16:55.000 --> 00:16:57.679
<v Speaker 2>on the data it's already seen, but completely falls apart

316
00:16:57.720 --> 00:17:00.720
<v Speaker 2>when it encounters new, unseen data because it learned the

317
00:17:00.759 --> 00:17:05.000
<v Speaker 2>training data too specifically, including its noise. Or the opposite,

318
00:17:05.119 --> 00:17:08.079
<v Speaker 2>underfitting, where it's too simple and performs poorly on both.

319
00:17:08.279 --> 00:17:10.880
<v Speaker 1>Finding that balance. And you also need to analyze errors

320
00:17:11.200 --> 00:17:13.599
<v Speaker 1>like false positives and false negatives. Definitely.

321
00:17:13.720 --> 00:17:16.240
<v Speaker 2>For example, predicting someone has the flu when they don't

322
00:17:16.519 --> 00:17:20.200
<v Speaker 2>is a false positive. That has very different real-world implications,

323
00:17:20.240 --> 00:17:24.000
<v Speaker 2>maybe unnecessary worry or treatment, than a false negative, which is

324
00:17:24.039 --> 00:17:26.519
<v Speaker 2>predicting they don't have the flu when they actually do.

325
00:17:27.200 --> 00:17:31.640
<v Speaker 2>Understanding the specific consequences of your model's errors is paramount

326
00:17:31.680 --> 00:17:33.200
<v Speaker 2>in deciding if it's fit for purpose.

327
00:17:33.440 --> 00:17:37.519
<v Speaker 1>Absolutely. Okay. This careful evaluation then leads us nicely to

328
00:17:37.599 --> 00:17:41.440
<v Speaker 1>step five, visualizations, where you actually tell the story with

329
00:17:41.519 --> 00:17:45.480
<v Speaker 1>your data. Our source says it beautifully: numbers can transform

330
00:17:45.519 --> 00:17:48.319
<v Speaker 1>into stories and insights leap off the page.

331
00:17:48.480 --> 00:17:50.839
<v Speaker 2>I love that framing too. It really captures it. It's

332
00:17:50.839 --> 00:17:53.119
<v Speaker 2>about so much more than just picking a chart type.

333
00:17:53.240 --> 00:17:55.599
<v Speaker 2>It's about crafting a compelling visual narrative.

334
00:17:55.319 --> 00:17:57.599
<v Speaker 1>And the key insights here seem to be knowing

335
00:17:57.599 --> 00:17:59.880
<v Speaker 1>your audience, choosing the right chart type to convey your

336
00:18:00.039 --> 00:18:05.759
<v Speaker 1>specific message clearly, simplifying complex data visually, and using color wisely,

337
00:18:06.200 --> 00:18:10.440
<v Speaker 1>especially considering accessibility for users with colorblindness, which is often overlooked.

338
00:18:10.720 --> 00:18:14.720
<v Speaker 2>Absolutely. Good user experience principles apply just as much here.

339
00:18:15.240 --> 00:18:18.079
<v Speaker 2>Think about interactivity through things like tooltips that pop up

340
00:18:18.079 --> 00:18:21.279
<v Speaker 2>with details or filters that allow your audience to explore

341
00:18:21.319 --> 00:18:23.799
<v Speaker 2>the data at their own pace and get answers to

342
00:18:23.839 --> 00:18:25.279
<v Speaker 2>their own specific questions.

343
00:18:25.400 --> 00:18:28.920
<v Speaker 1>Yeah, letting them dig in. We use so many chart types. Yeah,

344
00:18:28.960 --> 00:18:32.319
<v Speaker 1>bar charts, histograms, line graphs, scatter plots, box-and-whisker

345
00:18:32.359 --> 00:18:35.599
<v Speaker 1>plots for showing spread, heat maps, even stem-and-leaf

346
00:18:35.599 --> 00:18:37.200
<v Speaker 1>plots sometimes. True.

347
00:18:37.200 --> 00:18:40.000
<v Speaker 2>But often the real power comes from combining charts, like

348
00:18:40.079 --> 00:18:43.680
<v Speaker 2>putting a scatterplot showing individual data points alongside a line

349
00:18:43.720 --> 00:18:46.759
<v Speaker 2>graph showing the overall trend. That can tell a much

350
00:18:46.799 --> 00:18:50.240
<v Speaker 2>more complete story, say about sales performance over time, showing

351
00:18:50.279 --> 00:18:54.079
<v Speaker 2>both individual transaction outliers and the overall profit trends together.

352
00:18:54.240 --> 00:18:56.440
<v Speaker 1>Good point, But there's a word of caution here too.

353
00:18:56.680 --> 00:19:00.880
<v Speaker 2>Yes, definitely. Visualizations can be subjective and quite emotive.

354
00:19:01.000 --> 00:19:03.640
<v Speaker 2>It's important to avoid making them overly technical for your

355
00:19:03.640 --> 00:19:07.599
<v Speaker 2>audience or using distracting elements like, say, 3D graphs,

356
00:19:07.680 --> 00:19:11.319
<v Speaker 2>which rarely add clarity and often just confuse the core message.

357
00:19:11.480 --> 00:19:12.599
<v Speaker 2>Keep it clean and clear.

358
00:19:12.799 --> 00:19:17.400
<v Speaker 1>Good advice. Okay, That brings us to the bigger picture,

359
00:19:17.720 --> 00:19:21.359
<v Speaker 1>exploring the broader implications of data science, especially as it

360
00:19:21.400 --> 00:19:24.319
<v Speaker 1>evolves into AI and touches more parts of our lives.

361
00:19:24.559 --> 00:19:27.640
<v Speaker 2>This is such a critical discussion, and we sometimes encounter

362
00:19:27.720 --> 00:19:31.519
<v Speaker 2>situations where the very application of data can spark significant

363
00:19:31.519 --> 00:19:36.160
<v Speaker 2>public debate, raising ethical questions. Like what? Well, consider the

364
00:19:36.240 --> 00:19:39.359
<v Speaker 2>controversy around the A-level exam grading during the COVID

365
00:19:39.400 --> 00:19:43.599
<v Speaker 2>pandemic in the UK. Algorithms used historical school performance data

366
00:19:43.640 --> 00:19:46.759
<v Speaker 2>to help assign grades when exams couldn't happen. This led

367
00:19:46.799 --> 00:19:50.279
<v Speaker 2>to widespread public outcry. I remember that. Many felt it

368
00:19:50.319 --> 00:19:53.960
<v Speaker 2>was deeply unfair to individual high-achieving students in historically

369
00:19:54.000 --> 00:19:57.920
<v Speaker 2>lower performing schools. It really highlighted the challenges of algorithmic

370
00:19:57.960 --> 00:20:01.119
<v Speaker 2>fairness and how the public reacts when data-driven decisions

371
00:20:01.160 --> 00:20:03.079
<v Speaker 2>don't seem to align with perceived equity.

372
00:20:03.400 --> 00:20:06.400
<v Speaker 1>That example clearly shows the real world impact these models

373
00:20:06.440 --> 00:20:09.359
<v Speaker 1>can have and the importance of public perception and trust.

374
00:20:09.839 --> 00:20:13.319
<v Speaker 1>It's also why questions arise like, are people comfortable with

375
00:20:13.359 --> 00:20:16.000
<v Speaker 1>their data being used in this specific way? We saw

376
00:20:16.000 --> 00:20:18.799
<v Speaker 1>Elon Musk raise concerns about his private jet movements being

377
00:20:18.839 --> 00:20:22.440
<v Speaker 1>publicly tracked, citing personal privacy and safety risks for his family.

378
00:20:22.759 --> 00:20:24.119
<v Speaker 1>It's a constant tension.

379
00:20:23.640 --> 00:20:27.799
<v Speaker 2>Indeed, and to navigate these complexities we have legal frameworks.

380
00:20:28.319 --> 00:20:32.680
<v Speaker 2>In Europe, for instance, the GDPR principles are key: lawful, fair,

381
00:20:32.839 --> 00:20:39.599
<v Speaker 2>transparent processing, limited purpose, data minimization, accuracy, storage limitation, integrity

382
00:20:39.599 --> 00:20:43.599
<v Speaker 2>and confidentiality, and accountability. There are also stricter rules for

383
00:20:43.720 --> 00:20:47.160
<v Speaker 2>special category data like health information or race, which requires

384
00:20:47.160 --> 00:20:51.960
<v Speaker 2>specific explicit consent. Regulatory bodies like the Information Commissioner's Office,

385
00:20:52.000 --> 00:20:55.039
<v Speaker 2>the ICO, in the UK enforce these rules. They even

386
00:20:55.079 --> 00:20:58.000
<v Speaker 2>reprimanded a school for using a facial recognition system for

387
00:20:58.079 --> 00:21:01.920
<v Speaker 2>cashless catering, emphasizing the need for robust legal compliance around

388
00:21:01.960 --> 00:21:03.920
<v Speaker 2>how data is used, especially sensitive data.

389
00:21:04.039 --> 00:21:08.519
<v Speaker 1>It's a complex landscape. Then there's the exciting but also

390
00:21:08.640 --> 00:21:12.799
<v Speaker 1>maybe slightly intimidating, rapidly evolving world of machine learning and

391
00:21:12.880 --> 00:21:17.039
<v Speaker 1>artificial intelligence. It feels important to clarify their relationship because

392
00:21:17.039 --> 00:21:19.319
<v Speaker 1>the terms are often used interchangeably, aren't they?

393
00:21:18.720 --> 00:21:22.440
<v Speaker 2>They are, and they're deeply interconnected but distinct.

394
00:21:22.559 --> 00:21:25.200
<v Speaker 2>You can think of it like this: data science methods

395
00:21:25.200 --> 00:21:28.279
<v Speaker 2>are often used to develop machine learning models, and machine

396
00:21:28.359 --> 00:21:31.079
<v Speaker 2>learning techniques can be applied to solve data science problems

397
00:21:31.240 --> 00:21:32.960
<v Speaker 2>and also to create AI systems.

398
00:21:33.119 --> 00:21:34.519
<v Speaker 1>Okay, so how do we define them?

399
00:21:34.960 --> 00:21:38.920
<v Speaker 2>We can define machine learning, ML, generally as software that

400
00:21:39.000 --> 00:21:42.160
<v Speaker 2>improves as it performs a task through experience with data,

401
00:21:42.480 --> 00:21:47.519
<v Speaker 2>and AI, artificial intelligence, as computer systems performing complex human

402
00:21:47.599 --> 00:21:52.240
<v Speaker 2>tasks like reasoning, problem solving, or creation. So AI is

403
00:21:52.279 --> 00:21:55.880
<v Speaker 2>about the system performing human-like tasks, while ML is

404
00:21:55.920 --> 00:21:58.319
<v Speaker 2>often the method by which that software learns and improves

405
00:21:58.319 --> 00:21:59.759
<v Speaker 2>its performance on those tasks.

406
00:22:00.000 --> 00:22:03.839
<v Speaker 1>It's a helpful distinction. And within AI, there's narrow AI, right?

407
00:22:03.759 --> 00:22:08.000
<v Speaker 2>Which performs highly specific tasks like identifying potholes in road images.

408
00:22:08.079 --> 00:22:11.000
<v Speaker 2>Current machine learning is very good at creating narrow AI.

409
00:22:11.240 --> 00:22:14.680
<v Speaker 1>Versus general AI, which is the hypothetical AI that could

410
00:22:14.680 --> 00:22:17.920
<v Speaker 1>handle all human intellectual tasks as well as we can,

411
00:22:18.240 --> 00:22:21.440
<v Speaker 1>which importantly has not yet been achieved. Not yet.

412
00:22:21.519 --> 00:22:25.279
<v Speaker 2>No. And then there's generative AI, which has exploded recently.

413
00:22:25.759 --> 00:22:31.559
<v Speaker 2>This is AI specifically designed to create new content: text, images, music, code.

414
00:22:31.680 --> 00:22:34.720
<v Speaker 2>Foundation models are a type of generative AI. They're pre-

415
00:22:34.839 --> 00:22:37.960
<v Speaker 2>trained on absolutely vast amounts of data and can then

416
00:22:38.000 --> 00:22:41.200
<v Speaker 2>be adapted to many different downstream uses, like the large

417
00:22:41.279 --> 00:22:44.680
<v Speaker 2>language models, LLMs, that generate the text you might interact

418
00:22:44.720 --> 00:22:45.440
<v Speaker 2>with online.

419
00:22:45.559 --> 00:22:48.279
<v Speaker 1>But even with all this incredible power, AI comes with

420
00:22:48.359 --> 00:22:51.240
<v Speaker 1>inherent challenges and risks. We hear a lot about them.

421
00:22:51.359 --> 00:22:54.240
<v Speaker 2>We do. There's the issue of bias, where the AI's

422
00:22:54.279 --> 00:22:57.160
<v Speaker 2>outputs are skewed or unfair because of biases present in

423
00:22:57.200 --> 00:22:59.839
<v Speaker 2>the massive data sets it was trained on. There's hallucination,

424
00:23:00.319 --> 00:23:03.400
<v Speaker 2>where the AI essentially invents information that sounds plausible but

425
00:23:03.519 --> 00:23:04.079
<v Speaker 2>isn't true.

426
00:23:04.279 --> 00:23:05.759
<v Speaker 1>That's a worrying one. It is.

427
00:23:06.039 --> 00:23:09.920
<v Speaker 2>Then there's transparency, or the black box problem: the difficulty

428
00:23:09.920 --> 00:23:13.160
<v Speaker 2>in understanding how exactly the AI reached its conclusion, which

429
00:23:13.200 --> 00:23:16.440
<v Speaker 2>is crucial for trust and debugging, and of course privacy

430
00:23:16.440 --> 00:23:19.559
<v Speaker 2>concerns about how personal data is used within these huge,

431
00:23:19.640 --> 00:23:20.680
<v Speaker 2>complex systems.

432
00:23:20.960 --> 00:23:23.319
<v Speaker 1>And different parts of the world are approaching these challenges

433
00:23:23.319 --> 00:23:25.000
<v Speaker 1>differently, regulation-wise.

434
00:23:24.920 --> 00:23:27.920
<v Speaker 2>Very much so. The UK and USA tend to take

435
00:23:28.000 --> 00:23:33.480
<v Speaker 2>a more pro-innovation, perhaps lighter-touch approach initially. China

436
00:23:33.559 --> 00:23:38.160
<v Speaker 2>tends to regulate specific AI products and applications directly. The EU,

437
00:23:38.359 --> 00:23:41.799
<v Speaker 2>with its comprehensive AI Act, is taking a risk-based approach,

438
00:23:41.880 --> 00:23:45.759
<v Speaker 2>providing stricter controls for AI systems deemed high risk, aiming

439
00:23:45.799 --> 00:23:48.880
<v Speaker 2>to provide a broad framework for responsible development and deployment

440
00:23:48.880 --> 00:23:52.240
<v Speaker 2>across member states. It's a developing picture globally.

441
00:23:51.920 --> 00:23:54.319
<v Speaker 1>Definitely one to watch. Okay, to really bring all these

442
00:23:54.359 --> 00:23:56.759
<v Speaker 1>concepts, the life cycle, the ethics, the AI connection,

443
00:23:56.799 --> 00:23:59.960
<v Speaker 1>to life, let's dive into some compelling case studies in data science.

444
00:24:00.640 --> 00:24:04.279
<v Speaker 1>First up, Innovation Factory and their Traffic Ear. Sounds intriguing.

445
00:24:04.599 --> 00:24:08.319
<v Speaker 2>It's a fantastic example of putting prescriptive analytics into real

446
00:24:08.359 --> 00:24:13.640
<v Speaker 2>world action. Anwar, the founder, combines sound detection, computer vision,

447
00:24:13.720 --> 00:24:17.519
<v Speaker 2>and generative AI to monitor traffic and pollution in Birmingham.

448
00:24:18.319 --> 00:24:21.440
<v Speaker 2>He developed a classification model to identify different types of

449
00:24:21.519 --> 00:24:25.400
<v Speaker 2>vehicles just by their sound signature, which automated the incredibly

450
00:24:25.480 --> 00:24:30.000
<v Speaker 2>laborious process of manual labeling. The prescriptive analytics then kicked

451
00:24:30.039 --> 00:24:33.599
<v Speaker 2>in, linking these data science outcomes directly to automated actions

452
00:24:33.640 --> 00:24:34.480
<v Speaker 2>in the real world.

453
00:24:34.599 --> 00:24:36.519
<v Speaker 1>And it went beyond just traffic.

454
00:24:36.279 --> 00:24:39.880
<v Speaker 2>Yes, it extended to railway lines. They use the Traffic

455
00:24:39.960 --> 00:24:43.599
<v Speaker 2>Ear technology to detect animals like deer or even kangaroos

456
00:24:43.640 --> 00:24:46.440
<v Speaker 2>near the tracks. The system then plays specific light and

457
00:24:46.519 --> 00:24:49.839
<v Speaker 2>sound patterns designed to deter them safely. It even uses

458
00:24:49.920 --> 00:24:52.680
<v Speaker 2>generative AI to try and determine the animal species and

459
00:24:52.720 --> 00:24:56.480
<v Speaker 2>its activity, then triggers the most appropriate response from literally

460
00:24:56.559 --> 00:25:00.440
<v Speaker 2>twenty thousand options. Wow. It's a prime example of

461
00:25:00.519 --> 00:25:05.839
<v Speaker 2>quite sophisticated data science leading to immediate, automated, intelligent interventions

462
00:25:05.839 --> 00:25:06.720
<v Speaker 2>in the physical world.

463
00:25:06.960 --> 00:25:12.079
<v Speaker 1>That absolutely highlights how models can drive real world actions. Okay,

464
00:25:12.319 --> 00:25:15.119
<v Speaker 1>next up, Smart Container Co. What was their story?

465
00:25:15.240 --> 00:25:17.559
<v Speaker 2>Steve at Smart Container Co. was trying to prove a

466
00:25:17.599 --> 00:25:22.599
<v Speaker 2>specific hypothesis that ultrasonic readings could accurately measure the carbonation

467
00:25:22.759 --> 00:25:26.680
<v Speaker 2>levels inside sealed containers. He put in a solid year

468
00:25:26.799 --> 00:25:31.119
<v Speaker 2>of really diligent effort, collecting robust data, trying various regression

469
00:25:31.200 --> 00:25:32.680
<v Speaker 2>models to try and prove his theory.

470
00:25:32.880 --> 00:25:34.880
<v Speaker 1>Okay, makes sense, and did he prove it?

471
00:25:35.079 --> 00:25:38.640
<v Speaker 2>Surprisingly, no. The data just didn't support it. The hypothesis

472
00:25:38.680 --> 00:25:39.720
<v Speaker 2>remained unproven.

473
00:25:39.880 --> 00:25:41.480
<v Speaker 1>Oh, so a failure?

474
00:25:41.640 --> 00:25:44.759
<v Speaker 2>Well, interestingly, Steve and Smart Container Co. didn't see it

475
00:25:44.799 --> 00:25:47.920
<v Speaker 2>as a failure at all. They now knew definitively how

476
00:25:47.960 --> 00:25:51.079
<v Speaker 2>not to measure carbonation using that particular method, which was

477
00:25:51.119 --> 00:25:54.599
<v Speaker 2>actually immensely valuable information for them. Ah, right. It saved

478
00:25:54.640 --> 00:25:58.160
<v Speaker 2>them from potentially disastrous future investments based on a false

479
00:25:58.200 --> 00:26:03.839
<v Speaker 2>premise. The insight here is the power of rigorous, data-driven disproving.

480
00:26:04.519 --> 00:26:06.759
<v Speaker 2>Sometimes the most valuable result you can get from a

481
00:26:06.839 --> 00:26:10.559
<v Speaker 2>data science project is knowing definitively what doesn't work. It

482
00:26:10.640 --> 00:26:14.279
<v Speaker 2>requires curiosity and a relentless focus on data quality, even

483
00:26:14.319 --> 00:26:16.000
<v Speaker 2>when the answer isn't the one you hoped for.

484
00:26:16.319 --> 00:26:19.640
<v Speaker 1>That's a really powerful lesson. Understanding what's not a solution

485
00:26:20.160 --> 00:26:23.119
<v Speaker 1>can be just as valuable, maybe even more so sometimes

486
00:26:23.400 --> 00:26:24.359
<v Speaker 1>than finding one.

487
00:26:24.680 --> 00:26:28.759
<v Speaker 2>Yeah. Okay, then there's Cognitive Business, applying data science to

488
00:26:28.799 --> 00:26:32.319
<v Speaker 2>wind farms. That sounds like big data territory.

489
00:26:32.519 --> 00:26:35.759
<v Speaker 1>It certainly was. Tie and his team used machine learning

490
00:26:35.759 --> 00:26:38.960
<v Speaker 1>for predictive maintenance in wind farms. The interesting thing is

491
00:26:39.000 --> 00:26:41.640
<v Speaker 1>that turbines within a single wind farm field are often

492
00:26:41.720 --> 00:26:45.079
<v Speaker 1>quite similar. Okay, so learning from the performance patterns of

493
00:26:45.079 --> 00:26:48.119
<v Speaker 1>one turbine could potentially be applied to predict issues in

494
00:26:48.200 --> 00:26:50.880
<v Speaker 1>many others. They had to account for external factors like

495
00:26:50.920 --> 00:26:54.279
<v Speaker 1>wind direction and physical location on the farm, but this similarity

496
00:26:54.319 --> 00:26:57.680
<v Speaker 1>allowed them to scale up massively. Ultimately, they ended up

497
00:26:57.680 --> 00:27:01.079
<v Speaker 1>building and managing millions of individual predictive models, one

498
00:27:01.119 --> 00:27:04.759
<v Speaker 1>for each key component on each turbine. This immense scale

499
00:27:04.799 --> 00:27:08.599
<v Speaker 1>allowed them to identify previously unknown fault types and completely

500
00:27:08.640 --> 00:27:12.599
<v Speaker 1>automate the model building, training and evaluation process. It really

501
00:27:12.640 --> 00:27:16.000
<v Speaker 1>showcases the sheer power of automation in modern data science

502
00:27:16.039 --> 00:27:19.920
<v Speaker 1>deployments. Millions of models. That really speaks to the scale

503
00:27:19.960 --> 00:27:25.440
<v Speaker 1>and potential here. Okay, another compelling story: Goodwith, focused on

504
00:27:25.559 --> 00:27:27.640
<v Speaker 1>financial inclusion. What did they do?

505
00:27:28.039 --> 00:27:30.599
<v Speaker 2>Ellie and her team at Goodwith took a really interesting

506
00:27:30.640 --> 00:27:34.279
<v Speaker 2>blended approach. They combined in-depth qualitative research, talking to

507
00:27:34.400 --> 00:27:39.720
<v Speaker 2>young people about money, with unsupervised machine learning, specifically clustering,

508
00:27:40.079 --> 00:27:42.799
<v Speaker 2>to validate financial personas for young adults.

509
00:27:42.920 --> 00:27:45.640
<v Speaker 1>So, mixing human insight with algorithms.

510
00:27:46.119 --> 00:27:50.160
<v Speaker 2>Exactly. They collected quantitative data through questionnaires and aggregated banking histories.

511
00:27:50.559 --> 00:27:53.920
<v Speaker 2>They even used natural language processing, NLP, that's the

512
00:27:53.960 --> 00:27:56.680
<v Speaker 2>tech that enables computers to understand human language, on the

513
00:27:56.680 --> 00:28:00.640
<v Speaker 2>text descriptions of bank transactions to get richer insights on

514
00:28:00.680 --> 00:28:03.799
<v Speaker 2>transaction data. Interesting. And the clustering results from the ML

515
00:28:03.960 --> 00:28:07.319
<v Speaker 2>remarkably aligned with the initial personas they developed through the

516
00:28:07.400 --> 00:28:11.680
<v Speaker 2>qualitative interviews. This gave them confidence to build personalized financial

517
00:28:11.759 --> 00:28:15.839
<v Speaker 2>learning pathways and ultimately aim to enable better, fairer lending

518
00:28:15.880 --> 00:28:19.640
<v Speaker 2>decisions for often underserved groups. The transparency of their models

519
00:28:19.680 --> 00:28:20.839
<v Speaker 2>was also key for trust.

520
00:28:21.240 --> 00:28:25.680
<v Speaker 1>That blend of deep qualitative insight, powerful quantitative methods, and

521
00:28:25.680 --> 00:28:30.000
<v Speaker 1>transparency sounds like a truly impactful approach. Finally, smart TAB,

522
00:28:30.160 --> 00:28:33.559
<v Speaker 1>providing financial trading signals. What were the challenges there?

523
00:28:33.720 --> 00:28:36.920
<v Speaker 2>Dev's core challenge at SMARTAB was ensuring the quality and

524
00:28:36.960 --> 00:28:41.039
<v Speaker 2>reliability of data coming from multiple external sources, which often

525
00:28:41.079 --> 00:28:44.640
<v Speaker 2>had varying costs and levels of trustworthiness. He had to

526
00:28:44.680 --> 00:28:49.079
<v Speaker 2>implement rigorous sampling and testing protocols, including looking at things

527
00:28:49.119 --> 00:28:52.319
<v Speaker 2>like the standard deviation in price movements to understand market

528
00:28:52.359 --> 00:28:53.319
<v Speaker 2>volatility from each source.

529
00:28:53.200 --> 00:28:55.680
<v Speaker 1>Managing input quality.

530
00:28:55.400 --> 00:28:59.960
<v Speaker 2>Precisely. And crucially, he also integrated unstructured social context data,

531
00:29:00.079 --> 00:29:02.440
<v Speaker 2>things like official commentary from the Bank of England,

532
00:29:02.720 --> 00:29:06.000
<v Speaker 2>using natural language processing to add another layer of understanding

533
00:29:06.000 --> 00:29:10.000
<v Speaker 2>to his quantitative models. This project perfectly illustrates that necessary

534
00:29:10.039 --> 00:29:13.039
<v Speaker 2>blend of deep domain knowledge in finance and pure pattern

535
00:29:13.119 --> 00:29:17.119
<v Speaker 2>detection skill needed to extract real value from complex, messy,

536
00:29:17.240 --> 00:29:18.559
<v Speaker 2>real world data streams.

537
00:29:18.920 --> 00:29:22.000
<v Speaker 1>These case studies really bring the entire data analysis life

538
00:29:22.000 --> 00:29:25.079
<v Speaker 1>cycle to life, don't they? From the initial problem framing

539
00:29:25.200 --> 00:29:27.960
<v Speaker 1>right through to the measurable real world impact. And that

540
00:29:28.079 --> 00:29:31.920
<v Speaker 1>impact really hinges on our final step communication, because even

541
00:29:31.920 --> 00:29:36.079
<v Speaker 1>the best models are useless if you can't communicate the findings effectively.

542
00:29:35.599 --> 00:29:40.279
<v Speaker 2>Absolutely useless. The DIKW pyramid (data, information, knowledge, wisdom) helps

543
00:29:40.359 --> 00:29:44.000
<v Speaker 2>us think about this transformation. Raw data becomes information when

544
00:29:44.000 --> 00:29:48.240
<v Speaker 2>it's processed and analyzed through tested hypotheses. It turns into

545
00:29:48.279 --> 00:29:51.400
<v Speaker 2>knowledge when you combine that information with deep domain context

546
00:29:51.480 --> 00:29:55.400
<v Speaker 2>and experience, and ultimately it leads to wisdom that informs decisive,

547
00:29:55.839 --> 00:30:01.359
<v Speaker 2>actionable strategy. Communication is key at each step.

548
00:30:00.920 --> 00:30:03.279
<v Speaker 1>And communication itself needs to provide a great user experience, right? Is

549
00:30:03.279 --> 00:30:07.640
<v Speaker 1>it usable? Is it useful, desirable, findable, accessible, credible, and

550
00:30:07.759 --> 00:30:10.359
<v Speaker 1>ultimately truly valuable to the person receiving it.

551
00:30:10.839 --> 00:30:14.640
<v Speaker 2>Exactly. And knowing your audience is paramount here. You need to

552
00:30:14.680 --> 00:30:18.559
<v Speaker 2>segment them. Are you talking to technical specialists, managers, executives,

553
00:30:18.640 --> 00:30:22.279
<v Speaker 2>or the project team itself? You have to tailor your message,

554
00:30:22.319 --> 00:30:25.920
<v Speaker 2>your language, your level of detail accordingly, while still maintaining

555
00:30:25.960 --> 00:30:28.440
<v Speaker 2>a consistent core narrative across all groups.

556
00:30:28.799 --> 00:30:31.599
<v Speaker 1>This is where storytelling with data really shines, isn't it.

557
00:30:32.079 --> 00:30:35.720
<v Speaker 1>Using a narrative helps your audience not only understand the knowledge,

558
00:30:35.839 --> 00:30:38.319
<v Speaker 1>but also connect with it on a deeper, maybe more

559
00:30:38.400 --> 00:30:41.440
<v Speaker 1>emotional level, making it more memorable and actionable.

560
00:30:41.519 --> 00:30:44.960
<v Speaker 2>Definitely. Take a contact center example. Instead of just presenting

561
00:30:45.000 --> 00:30:48.480
<v Speaker 2>stark numbers about call handling times and customer satisfaction scores,

562
00:30:48.799 --> 00:30:51.359
<v Speaker 2>you can tell a story. You can illustrate the inherent

563
00:30:51.440 --> 00:30:55.880
<v Speaker 2>conflict often present between, say, strict cost-cutting targets like

564
00:30:55.920 --> 00:31:00.319
<v Speaker 2>reducing call handling time and maintaining high customer satisfaction, measured by

565
00:31:00.319 --> 00:31:03.559
<v Speaker 2>things like net promoter scores. Show how one might negatively

566
00:31:03.599 --> 00:31:04.119
<v Speaker 2>impact the other.

567
00:31:04.079 --> 00:31:07.480
<v Speaker 1>Make the trade-offs clear. And this storytelling is

568
00:31:07.480 --> 00:31:10.480
<v Speaker 1>supported by what the source calls the four pillars of

569
00:31:10.599 --> 00:31:17.279
<v Speaker 1>data storytelling: using symbols effectively, choosing color thoughtfully, crafting clear captions,

570
00:31:17.319 --> 00:31:22.400
<v Speaker 1>and considering the overall editorial layout. It's about designing the communication,

571
00:31:22.640 --> 00:31:23.559
<v Speaker 1>not just the charts.

572
00:31:23.920 --> 00:31:27.400
<v Speaker 2>It really is, and it's also crucial when communicating to

573
00:31:27.480 --> 00:31:31.480
<v Speaker 2>clearly distinguish between evidence backed recommendations that flow directly from

574
00:31:31.480 --> 00:31:35.559
<v Speaker 2>your analysis and hypothesis testing, versus theories or ideas that

575
00:31:35.680 --> 00:31:39.039
<v Speaker 2>might have emerged but still require further testing. For example,

576
00:31:39.200 --> 00:31:43.160
<v Speaker 2>recommending A/B testing different contact handling times to truly understand

577
00:31:43.160 --> 00:31:46.920
<v Speaker 2>their causal impact on customer satisfaction, rather than stating it

578
00:31:46.960 --> 00:31:48.240
<v Speaker 2>as a proven fact initially.

579
00:31:48.039 --> 00:31:50.799
<v Speaker 1>Maintaining that intellectual honesty.

580
00:31:50.920 --> 00:31:54.640
<v Speaker 2>Exactly. And finally, for insights to have lasting impact, solutions often

581
00:31:54.720 --> 00:31:58.200
<v Speaker 2>need to be operationalized. This means integrating them into ongoing

582
00:31:58.240 --> 00:32:02.200
<v Speaker 2>business processes, often through automation, turning those hard-won insights

583
00:32:02.240 --> 00:32:05.079
<v Speaker 2>into continuous, efficient benefit for the organization.

584
00:32:05.559 --> 00:32:09.039
<v Speaker 1>This deep dive has truly unpacked the foundations of data science.

585
00:32:09.640 --> 00:32:12.599
<v Speaker 1>We've seen its transformative power, how it's built on core

586
00:32:12.640 --> 00:32:16.839
<v Speaker 1>foundations in statistics and machine learning, and crucially the importance

587
00:32:16.839 --> 00:32:20.599
<v Speaker 1>of skilled practitioners who can expertly navigate that entire project

588
00:32:20.680 --> 00:32:24.400
<v Speaker 1>life cycle, from the initial problem framing right through to

589
00:32:24.680 --> 00:32:25.839
<v Speaker 1>impactful communication.

590
00:32:26.160 --> 00:32:29.200
<v Speaker 2>And we've touched upon that vital aspect of responsible innovation,

591
00:32:29.759 --> 00:32:32.720
<v Speaker 2>ensuring that as data science continues to evolve and become

592
00:32:33.079 --> 00:32:37.279
<v Speaker 2>frankly more democratized, we build and apply these incredibly powerful

593
00:32:37.279 --> 00:32:41.000
<v Speaker 2>tools with a keen awareness of their broader implications, always

594
00:32:41.039 --> 00:32:44.799
<v Speaker 2>striving for solutions that genuinely benefit society and minimize harm.

595
00:32:45.079 --> 00:32:47.920
<v Speaker 1>So, as technology continues to advance and new data sources

596
00:32:47.960 --> 00:32:52.319
<v Speaker 1>emerge constantly, the possibilities for innovation seem truly endless, and

597
00:32:52.359 --> 00:32:55.039
<v Speaker 1>it's worth remembering not every data science project will give

598
00:32:55.079 --> 00:32:58.440
<v Speaker 1>you some groundbreaking, completely new answer. Sometimes they might just

599
00:32:58.480 --> 00:33:01.759
<v Speaker 1>confirm what you already suspected through experience or intuition, and

600
00:33:01.799 --> 00:33:02.839
<v Speaker 1>that's perfectly okay.

601
00:33:02.960 --> 00:33:06.799
<v Speaker 2>It really is okay. Knowing something definitively, having the data

602
00:33:06.839 --> 00:33:09.480
<v Speaker 2>to back it up, even if it confirms prior intuition,

603
00:33:10.079 --> 00:33:13.559
<v Speaker 2>is still incredibly valuable for making confident decisions.

604
00:33:13.559 --> 00:33:15.920
<v Speaker 1>Absolutely, and that leaves us with this provocative thought for

605
00:33:15.960 --> 00:33:19.519
<v Speaker 1>you to consider. As data continues to permeate every corner

606
00:33:19.519 --> 00:33:21.839
<v Speaker 1>of our lives and the tools to analyze it become

607
00:33:21.880 --> 00:33:25.160
<v Speaker 1>ever more powerful and accessible, how will you, in whatever

608
00:33:25.240 --> 00:33:29.119
<v Speaker 1>role you have, contribute to shaping this incredibly rewarding field

609
00:33:29.480 --> 00:33:32.039
<v Speaker 1>and building a future where data truly serves humanity?
