S2E10 - Statistics & Talk With Alexander Pelletier (PhD Student)
References & Transcript
- Connect with Alexander Pelletier: email@example.com
- Metaphorigins Instagram Page - https://www.instagram.com/metaphorigins/
- Scalene Writing Instagram Page - https://www.instagram.com/scalenewriting/
- Buzzfeed - Article - https://www.buzzfeed.com/tabathaleggett/philosophical-questions-that-get-harder-the-longer-you-th
- Zeno's Paradox of Dichotomy - Wikipedia - https://en.wikipedia.org/wiki/Zeno%27s_paradoxes#Dichotomy_paradox
- 4 out of 5 Dentist Statistic - Article - https://www.targetmarketingmag.com/article/lessons-statistical-interpretations/
- Dr. Arthur Benjamin - https://www.arthurbenjamin.info/
- Dr. Arthur Benjamin - TedTalk - https://www.ted.com/talks/arthur_benjamin_teach_statistics_before_calculus
- Most Good Country - https://www.goodcountry.org/index/results/
- BMJ Journal - Textbook - https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one
- Khan Academy - Online Tutoring - https://www.khanacademy.org/math/statistics-probability
- Mathemagician - TedTalk - https://www.ted.com/talks/arthur_benjamin_a_performance_of_mathemagic?language=en#t-895034
- Dr. Alan Smith - https://www.ft.com/alan-smith
- Dr. Alan Smith - TedTalk - https://www.ted.com/talks/alan_smith_why_you_should_love_statistics
- How Well Do You Know Your Area? - Survey - http://www.neighbourhood.statistics.gov.uk/HTMLDocs/dvc147/index.html
- Dr. Sebastian Wernicke - Twitter - https://twitter.com/_wernicke?lang=en
- Dr. Sebastian Wernicke - TedTalk - https://www.ted.com/talks/sebastian_wernicke_lies_damned_lies_and_statistics_about_tedtalks
- Dr. Daniel Kahneman - https://scholar.princeton.edu/kahneman/home
- Thinking Fast and Slow - Summary - https://medium.com/@marklooi/summary-of-kahnemans-thinking-fast-and-slow-3d1c2ea0e6a
- Dr. Sanne Blauw - https://www.sanneblauw.com/
- Dr. Sanne Blauw - Youtube - https://www.youtube.com/watch?v=mJ63-bQc9Xg
- Homer Simpson - Youtube - https://www.youtube.com/watch?v=sm7ArKlzHSM
- Simpson's Paradox - Wikipedia - https://en.wikipedia.org/wiki/Simpson%27s_paradox
- Mona Chalabi - https://monachalabi.com/
- Mona Chalabi - TedTalk - https://www.ted.com/talks/mona_chalabi_3_ways_to_spot_a_bad_statistic
- PLOS Medicine - Scientific Article - https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124
- P-Hacking - Youtube - https://www.youtube.com/watch?v=0Rnq1NpHdmw
- Dr. Ronald Coase - https://medium.com/@timothyakinyomi/if-you-torture-the-data-long-enough-it-will-confess-to-anything-492786c30169
- Domain of Science - Youtube - https://www.youtube.com/watch?v=OmJ-4B-mS-Y
- BBC Documentary - Youtube - https://www.youtube.com/watch?v=l6oKriR-RjM
- Target Statistics - Article - https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=1&hp
- Dr. Hans Rosling - TedTalk - https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen
- Latin Origin - http://homepage.divms.uiowa.edu/~dzimmer/alphaseminar/Statistics-history.pdf
- CrashCourse - Youtube - https://www.youtube.com/watch?v=sxQaBpKfDRk
- Canadian Census - https://www12.statcan.gc.ca/census-recensement/2021/ref/98-26-0001/2020001/004-eng.cfm
- MiniTab - Article - https://blog.minitab.com/blog/understanding-statistics/17-common-words-with-precise-statistical-meaningsor-more-bewildering-things-statisticians-say
- Humans and Patterns - Article - https://medium.com/@zulierane/the-psychological-reason-you-see-patterns-where-there-are-none-ca9b0dc34e53
- Myside Bias - Scientific Article - http://keithstanovich.com/Site/Research_on_Reasoning_files/TandR07.pdf
- Central Tendency - Wikipedia - https://en.wikipedia.org/wiki/Central_tendency
- Dr. Philippe Rigollet - Youtube - https://www.youtube.com/watch?v=VPZD_aij8H0
- Laerd Statistics - Article - https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php
- MathandScience - Youtube - https://www.youtube.com/watch?v=VK-rnA3-41c&t
- Monty Hall Problem - Wikipedia - https://en.wikipedia.org/wiki/Monty_Hall_problem
- Three Prisoners Problem - Wikipedia - https://en.wikipedia.org/wiki/Three_Prisoners_problem
- Bertrand's Box Paradox - Wikipedia - https://en.wikipedia.org/wiki/Bertrand%27s_box_paradox
- Spurious Correlations - https://www.tylervigen.com/spurious-correlations
- Mark Twain - Wikipedia - https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics
- Jupyter Notebooks - https://jupyter.org/
- Gaussian Distribution Demo - Google Colaboratory - https://colab.research.google.com/drive/1_72Uv5mPfRjifq0Mh67e4vfKrxWcZYxq?usp=sharing
- Flying High by jantrax | https://soundcloud.com/jantr4x
- Music promoted by https://www.free-stock-music.com
- Creative Commons Attribution 3.0 Unported License | https://creativecommons.org/licenses/by/3.0/deed.en_US
To my lovely family and friends. Near and far. Old and new. This is Kevin Mercurio on the mic. And welcome to the twentieth episode and Season 2 Finale of the Metaphorigins Podcast.
Remember, to show support if you like this sort of content, please make sure to rate and subscribe to the podcast on Apple or whatever platform you are listening to this on, and follow @metaphorigins on Instagram, where I will be posting most of my updates, as well as on my personal website: kjbmercurio.com/metaphorigins. Most listeners of this podcast know that I hold a draw every 5 episodes for some cool swag. So in today’s episode, I will be giving away a cute, butterfly-printed Metaphorigins mug to one of you great listeners, right now! I have done the draw for the limited-edition Metaphorigins mug, and the winner is… Dylan Singh!!! Yaaaay! Congratulations! I’ll shoot you a message following this episode.
I am extremely grateful to those who have been following this journey, and your support in creating this podcast. I never would have thought it would lead to anything but a writing exercise or learning something new in science and communication. Yet I have been able to connect with family, friends and others in science to discuss things I am very curious about, and hope to make others curious about as well. With that said, I am announcing that I will take a brief break to focus on my academic responsibilities and other creative endeavours like writing fiction, but trust me there will be more metaphorical content, SciComm rants, and butterfly-styled giveaways in the new season coming to you sometime in February 2021. I have big ideas for this new season and where I want this platform to go, so stay tuned, and thank you from the bottom of my heart.
Okay. We have a special episode today with a special goal in mind. Today’s topic is again, something I touched on briefly in my 10th episode about Scientific Literature. A monster of a topic, colossal in its scope, but microscopic in its interest, which I hope to change. And that topic, is statistics.
Now before I lose you to daydreaming about how shitty the pandemic is, Youtube’s recommended video playlist or one of Buzzfeed’s 19 Questions that Get Harder to Answer the More You Think About Them (Number 5 being a version of Zeno’s Infamous Paradox of Dichotomy), umm, what was I saying before? Forget I said all that. Point is, everything I mentioned that could distract you, and everything of any importance that you did or are about to do today, probably requires some form of statistics.
You might be thinking, “Well yeah, of course, anything with numbers is in some way tied to statistics, and most things in life need numbers. Scheduling, counting, assigning, profiling, even cutting a piece of cake in, coincidentally, the easiest way possible, requires statistics, analyses done by you or by someone before you from the moment in which you undertake the task.” And yeah, that’s right. It creeps up on you in most decisions you make every day, even the brand of toothpaste you use because remember, 4 out of 5 dentists recommend every toothpaste on the market, the one other dentist having a high probability of being biased or working for the competitor.
Statistics, argued by Professor of Mathematics at Harvey Mudd College Dr. Arthur Benjamin, is a topic that could be a lot of fun. Its scope is present within things we would at least deem fun. He says “It’s the mathematics of games, and gambling, it’s analyzing trends, it’s predicting the future!” Statistics can even tell you which country currently does the most good for humanity (right now it’s Finland, with Canada just outside the top 10, and the US at #40). I myself was brought to painful, eye-fluttering boredom re-teaching myself the fundamentals of statistics through online resources like the BMJ Journal, Khan Academy, and my old Introduction to Biostatistics course notes (those damn T-tables!). Since 1998, researchers like those at the University of Alabama have discussed major challenges in teaching statistics at the undergraduate level, like motivation, math anxiety, performance extremes and making the learning last [paper]. However, if taught correctly, it can be learned in an interactive manner and actually be useful. An innate problem with statistics is that it’s often thought about with the same lack of interest and sometimes mathematical difficulty as calculus, topology, and geodesics. Such a loss of utility! Briefly staying on Dr. Benjamin for a second, check out his videos about being a mathemagician, using mnemonic techniques to calculate the squares of large numbers, that is, large numbers multiplied by themselves.
Here’s another way to think about statistics. Dr. Alan Smith, Data Visualization expert for the Financial Times, was a late convert to the idea of statistics, stating “statistics are about us, if you look at the etymology of the word statistics, it’s the science of dealing with the data of the state of the community we live in.” Expanding on this, statistics allows us to determine how we perceive life around us as compared to the reality. The consequence of this, often characterized as a reality check, permits us to understand our collective mindset, our biases, our impressions of the way things are, and make improvements. It identifies problems at not just an individual level, but a societal level. In Dr. Smith’s case, he invented a survey, or more like a game, about statistics using UK census data, in order to gain statistics about how well people knew the statistics of their geographical area. Meta enough for you? More meta is the Data Scientist Dr. Sebastian Wernicke, who, with the power of statistics, created a Ted Talk about analyzing other Ted Talks available on the Ted website to determine what factors are present in good Ted Talks and bad Ted Talks.
Statistics literally describe how we tend to live life. In his book Thinking, Fast and Slow, Nobel Laureate Dr. Daniel Kahneman describes how, despite how advanced our civilization has become, our brains are still wired in a particular, predictable way, with emotional tendencies that are outright illogical. It’s how advertisers devise commercials or postings for their products, or how magicians organize their performances. We know how people are likely to behave or act in particular instances, especially if primed or set up in a particular fashion.
As we realize how much statistics governs the way we live, we also realize how much statistics influences the way we think. News outlets broadcast the latest polls, latest scientific findings, latest collective measurements, through our smart devices so that we can learn more about the world every day. And yes, although this is beneficial for an informed population, how can individuals know whether the conclusions are valid? Whether they are complete? Dr. Sanne Blauw, a Dutch Journalist for De Correspondent, defines the five most common statistical lies in the news: 1) The Good-Looking Graph, so pleasing to the eyes, 2) The Polluted Poll, unique in its population, 3) The Overconfident Decimal Point, the absurdly accurate measurement, 4) The Spectacular Statistic, mind-blowing and click-baity, 5) The Cocky Correlation, the relationship that needs some work. With this knowledge, it’s important to distinguish between what is true and what is too good to be true, and to recognize if presented data falls under Simpson’s Paradox, in which “People can come up with statistics to prove anything, Kent, forfty percent of all people know that”. I’m kidding, that quote is from Homer Simpson in the episode Homer the Vigilante. Simpson’s paradox is described as a phenomenon “in which a trend appears in several different groups of data but disappears or reverses when these groups are combined”. Perhaps there’s some hidden variable in the analyzed data that wasn’t initially taken into account, but once included in the analysis, completely changes statistical findings. More on this later.
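To make Simpson's paradox concrete, here is a minimal Python sketch. It uses the kidney stone treatment counts from the Wikipedia article on Simpson's paradox linked in the references; the function names and the data layout are assumptions made purely for illustration.

```python
# (successful treatments, total patients) for each treatment and stone size.
# Counts are the kidney stone example from the Simpson's paradox Wikipedia page.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rates(treatment):
    """Success rate within each stone-size group for one treatment."""
    return {
        group: successes / total
        for group, (successes, total) in data[treatment].items()
    }


def overall_rate(treatment):
    """Success rate after combining both stone-size groups."""
    successes = sum(s for s, _ in data[treatment].values())
    total = sum(n for _, n in data[treatment].values())
    return successes / total


for name in data:
    print(name, group_rates(name), overall_rate(name))
```

Treatment A has the higher success rate within both the small-stone and large-stone groups, yet Treatment B has the higher rate once the groups are combined. The hidden variable here is stone size: A was given mostly to the harder large-stone cases.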
Those that are involved in statistics should be questioned. Always questioned. Why should private companies divulge any sort of statistics that would not put them in a good light? Even statistics provided by government authorities should be questioned. Journalist Mona Chalabi reminds us that the word statistics “come from the state, to better measure the population in order to serve it.” She also states 3 questions one should ask themselves when confronted with a government statistic that could be extrapolated to all of statistics: 1) Do you see uncertainty in the data, is there any reason to believe this could be wrong, 2) Do you see yourself in the data, can the data be generalized to fit people like you and 3) How was the data collected, are there other ways that they could have collected the data.
Let’s bring it to the major topic at hand, where does this bring us in the context of science? A 2005 critique published in the highly regarded scientific journal PLOS Medicine by Dr. John Ioannidis titled Why Most Published Research Findings Are False, steers us to concern. I mentioned in previous episodes that despite being published in credible journals, scientific studies will have biases, and these are in part due to the way in which academics strive to be successful. Success in academia comes from publishing, and publishing typically requires original, statistically significant research findings (more on significance later). The point being that, as Dr. Ioannidis states in his essay, “A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure “gold” standard is unattainable.” Many scientists know this and use statistics in a way that’s not malicious, but ensures that they have a chance at publishing their work in some journal, because again, publishing = success. For a simple example, think about a medical-related study that measures different health traits ranging from cholesterol levels, to hours of sleep each night, to weight loss, etc. Now apply some treatment to a select group of subjects, while another group of subjects, your control, is not given the treatment. This could be a drug, a specific food, whatever. By measuring many different variables, there is a high probability that at least one of those variables will appear to be affected by your treatment just by chance. Reporting only that variable, as if it were what you set out to test, is called p-hacking, and was even highlighted in an episode of Last Week Tonight with John Oliver. Without a general understanding of statistics, even within the scientific community, there can be no agreement on whether findings are relevant to our understanding of the world and the universe as a whole.
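A rough sketch of why measuring many variables is risky: if each test has a 5% chance of a false alarm, the chance of at least one false alarm grows quickly with the number of tests. The 5% threshold and the assumption that the tests are independent are mine, not from any of the sources above, and the function name is hypothetical.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests
    looks significant purely by chance, given a per-test
    false positive rate of `alpha`."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 10, 20):
    chance = prob_at_least_one_false_positive(num_tests)
    print(f"{num_tests:>2} variables measured: {chance:.0%} chance of a fluke")
```

Under these assumptions, measuring 20 unrelated health traits gives roughly a 64% chance that at least one of them looks "significant" even when the treatment does nothing.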
Statistics is extremely important, extremely useful but extremely dangerous. Dr. Ronald Coase, a renowned British Economist, once said “Torture the data long enough, and it will confess to anything”. In today’s episode, let’s step back and respect statistical values for what they are, quantifications of phenomena that are waiting to be interpreted by us collectively, with as little bias as possible.
Most of this information was obtained from many articles and videos discussing the origins, applications and future directions of the field of statistics. All sources will be mentioned in the description.
I’ve spent weeks trying to determine a good starting place for this section, until finally I decided to start where numbers began, after watching a video about the origins of mathematics by the Youtube channel Domain of Science. Creator Dr. Dominic Walliman states that all mathematics, the basis behind any sort of statistical analysis, comes from the idea of counting something: “counting is not just a human trait, other animals are able to count as well and evidence for counting goes back to prehistoric times with checkmarks made in bones.” The act of counting, recorded on some piece of calcified material, was the first recording of any sort of statistic. Dr. Walliman mentions that Egypt claims the first equation in 3000 BCE, […] China claims the first use of negative numbers in 200 BCE, India claims the first use of the idea of ZERO in year 628, the Persian empire writes the first book on algebra, the rules of how to work with equations, in year 820, and mathematics booms during the Renaissance, along with the sciences. Without going further into pure mathematics, one can only appreciate its beauty, driven by curiosity and the motivation to overcome current frontiers and essentially speak to the universe itself.
Any useful information, transcribed in alphanumerical characters, is present in the data collected. In a 2016 BBC documentary titled the Joy of Data, host Dr. Hannah Fry states that “data is the new currency of our time.” The capturing, storing, sharing and interpreting of data is the very foundation on which statistics is built. Statistical analysis can help spotlight some very weird phenomena. As shown in the documentary, you can figure out when is the best time to inseminate cows based on tracking the number of steps they take, how an infectious disease like cholera spreads and which population is most vulnerable based on common factors, determine the total amount company employees are to be paid based on arithmetic performed at incredible speeds, even determine how much to score letters used in a game of Scrabble based on the amount of Shannon information each letter possesses (like how Z, having a score of 10, has 10.5 bits). Weirder still, statistics can allow companies like Target to find out that your daughter is pregnant before you even suspected it.
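For the curious, the Shannon information of an outcome is the negative base-2 logarithm of its probability, so rarer letters carry more bits. The sketch below is only an illustration: the function names are hypothetical, and instead of assuming a letter frequency table it works backwards from the 10.5 bits quoted above for Z.

```python
import math


def self_information_bits(probability):
    """Shannon information, in bits, of an outcome with this probability."""
    return -math.log2(probability)


def probability_from_bits(bits):
    """Inverse: the probability that corresponds to a given number of bits."""
    return 2 ** -bits


# A fair coin flip carries exactly 1 bit.
print(self_information_bits(0.5))

# 10.5 bits corresponds to an outcome seen roughly once in 1,448 draws.
print(1 / probability_from_bits(10.5))
```

So a letter worth 10.5 bits would have to turn up only about once in every 1,448 letters, which is why it earns such a high score.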
The statistical analysis of data is absolutely fascinating, if done right. Every day we are reminded of how utterly terrible global circumstances are and the hopelessness of the immediate future. However, the power of statistics can be used to understand how much the world has positively changed over time, regarding life expectancy and the size of families. Dr. Hans Rosling, a Swedish physician famous for his dynamic presentations about global economics, states in a presentation about that very topic, “We have data in the UN, National Statistical Agencies, the Universities, and other non-governmental organizations, because the data is hidden in the databases, and the public is there, and the internet is there, but we still have not used it effectively, all that information we saw changing in the world does not include publicly funded statistics, there are some webpages but people put prices on them, stupid passwords and boring statistics, and this won’t work.” In addition to teaching people how data is and can be analyzed, we need ways to show people the reality of the world they live in, and present an analysis in a way that can be comprehended by a vast majority.
The word statistics comes from Latin, statisticum collegium, meaning council of state. As Dr. Alan Smith and journalist Mona Chalabi have already mentioned, its widespread usage came about as a way for societal leaders and whole governments to understand their governing population and create legislation. Basically, this department of smart civil servants, later becoming an entire field to discover ways for optimizing this task, can be summarized as the science of collecting data (also known as Sampling Methods), organizing & analyzing data (also known as Descriptive Statistics), as well as interpreting data (also known as Inferential Statistics).
We can go further and talk about each of those specific words on their own. First of all, what data are we collecting? Let’s say we wanted to know whether those who eat apple pie, the best dessert ever created, are the coolest people in the world. Are we collecting data on individuals, or apple pies? If individuals, are we collecting data on their frequency of pie devouring, the time of day when they eat pies, or both? How do we define coolness? Is there a coolness scale by which we can measure an individual? And if apple pies, are we collecting data on the number of apples in the pies, the brand of apple pie, or neither? How do we define a pie? Is apple pie eaten as an entree still considered a dessert? My point here being, any study, performed by a statistician, medical doctor or any person, should start with a well-defined question for statistics to jump on. Adriene Hill, lecturer on the online platform CrashCourse, puts it nicely, “It’s important to form the right statistical question, for example, the question why do people eat fast food is a difficult question to answer with statistics, but do people who eat fast food often work 80 hours a week is a better question that statistics can get to the bottom of”.
And collecting data can be done in numerous ways. You can collect data by providing your study’s subjects with surveys in which they can answer questions and fill in particular fields. Generally, this is how governments provide statistical portraits of their country, like the census program distributed by Statistics Canada. Since the Liberals took back power in 2015, the long form census was reinstated, with the next one happening in 2021. But it’s obvious that this method of surveying people cannot be the only method of collecting data. In fact, a lot of data is collected by individuals heading a study, by recording observations or measurements. And depending on your study, there are several ways of collecting observations or measurements. How do we know which way is best, and the number of samples needed to generalize findings while still being collected in a reasonable amount of time? Another great question. With the millions of questions one could ask, there is just no way I could list every technical procedure to collect data, so just remember that knowing how particular data is collected is the first step in determining the plausibility of any statistics.
We’ve gone over Sampling Methods, so let’s move to the next bulk. Descriptive Statistics is what is normally regarded as “statistics” by the general public. It’s what is taught in schools across the world, things like charts, graphs, averages, percentages, significance, regression, ideas that some math teacher routinely scribbled on a chalkboard at one point in your life and are now long forgotten or at least not part of our daily thoughts. Yet, it’s these ideas that propagate our knowledge as a civilization. If we can’t start from the same basic knowledge of word definitions, how can we understand the data at all? An interesting post on the Minitab blog acknowledges the challenges of statistical communication, especially since “some words or phrases that mean one thing in statistical vernacular, […] may signify something very different in a popular context”. Things like an independent variable, something statisticians define as a factor we can control, like the duration of an experiment, but in popular context is the exact opposite, a thing we can’t control or influence, like a flat-earth believer.
Descriptive Statistics, again, is how we organize and analyze collected data. Here, we can visualize specific unique trends or unusual spookiness. Humans are incredibly good at that, seeing patterns. In a 2019 article on Medium, writer Zulie Rane states “Humans are so good at recognizing patterns that if we think two variables are connected, we start seeing a trend even if there isn’t one.” This phenomenon, known as natural myside bias, was coined in a 2007 study published in the journal Thinking and Reasoning by Dr. Keith Stanovich and Dr. Richard West. Knowing that this is a thing, what we should not do is dump unprocessed data onto those interested in a study. The data has to be organized in order for any analysis to make sense.
We can organize observations and measurements in several different ways. Tables like stem-and-leaf plots and spreadsheets, graphs like pie charts, bar graphs and scatterplots, even three-dimensional models. With these visualizations, humans can very easily see clusters, gaps, peaks and outliers in data that are likely to be true (but still need to be proven statistically). The choice of presentation depends on the specific data collected, described as variables, which can be put into two groups: quantitative variables can be continuous like weights and heights, or discrete like the number of people you have in your family, while categorical variables can be nominal like sex or blood type (in which the order of options does not matter) or ordinal like when an optometrist asks you whether a particular lens on that absurd phoropter device is worse, same or better than the previous lens (in which the order of options does matter).
Let’s go back to the elementary school curriculum and discuss the infamous statistical alliteration: mean, median and mode. I’ll put these into context of numerical data, but some of these are also used with non-numerical data. All three are measures of something statisticians call central tendency, or the typical value of a distribution of data. The mean, or the arithmetic average, is the sum of a collection of numbers divided by the count of numbers in that collection. Most sports stats, temperature stats about how screwed we are because of climate change, most times someone says “well, generally speaking…” and then continues to say some number, are likely talking about the mean. The median is a value separating the higher half and lower half of a set of data. But what about if there was an even number of measurements taken, Kevin? Well, listener, after listing the measurements in order from least to greatest, take the two middle values and find the MEAN between the two. Because of this definition, categorical data with no natural order would not be analyzed in regards to the median. The mode is the value that appears the most often in a data set. Put that all together, in a data set consisting of the numbers 1, 1, 2, and 8, the mean would be 3, the median would be 1.5, and the mode would be 1. Easy right? Now go out and stats the crap out of life!
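If you want to check the 1, 1, 2, 8 example yourself, here is a small sketch using Python's built-in statistics module. The variable names are just for illustration.

```python
import statistics

# The data set from the example above.
values = [1, 1, 2, 8]

mean_value = statistics.mean(values)      # sum of values / count of values
median_value = statistics.median(values)  # mean of the two middle values
mode_value = statistics.mode(values)      # most frequent value

print(mean_value, median_value, mode_value)
```

Because there is an even number of values, the median falls between the two middle numbers, 1 and 2.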
Unfortunately, it’s not that simple, but I am happy that I get to address a language challenge. In fact, in everyday speech we often talk about each of these distinct ideas as one term, the average. As Professor of Mathematics at MIT Dr. Philippe Rigollet tells his statistics students, “All of statistics is replacing expectations with averages”. The mean or arithmetic average, I mean it’s in the name, but it is generally considered the true average. The problem with it, as mentioned in a post in Laerd Statistics, “it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value”. For example, when considering the average income of a citizen living in your neighbourhood, that new billionaire living across from you with the 5 floor building, fountain encircled driveway would heavily influence the average income and thus not be representative of the TYPICAL income. Medians are often used in this case, which is why a lot of government surveys typically specify things like median resident income, or something like that. But modes could also be used, especially if we’re talking about categorical data, like the average eye colour of your group of friends. Ahhh, which one is better? Remember that what these ideas are trying to measure is central tendency, and it is often possible that all three could be the same statistic. What’s recommended is the following: quantitative data that is not skewed by outliers, use the arithmetic average or mean, quantitative data that is skewed by outliers, use the median, and categorical data, use the mode.
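The billionaire-next-door effect is easy to see with a few numbers. The incomes below are entirely hypothetical, chosen only to show how one outlier drags the mean while barely moving the median.

```python
import statistics

# Hypothetical yearly incomes for five households on one street.
neighbourhood = [40_000, 45_000, 50_000, 55_000, 60_000]

# The same street after a (hypothetical) billionaire moves in.
with_billionaire = neighbourhood + [1_000_000_000]

mean_before = statistics.mean(neighbourhood)
median_before = statistics.median(neighbourhood)
mean_after = statistics.mean(with_billionaire)
median_after = statistics.median(with_billionaire)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps into the hundreds of millions, while the median only shifts from 50,000 to 52,500, which is why the median is the safer description of a typical income here.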
Let’s get even more statistical, I mean I’ve already brought you in this far. Now how do statisticians analyze data even more? A common question a researcher asks is how variable is the data? One could see this by, again, listing all values collected in a data set from least to greatest and writing down what that range is. In a categorical data set, you can do this by assigning values to the categories, like frequencies or the number of people that provided each response in a survey. Though we run into the problem with those pesky outliers again, skewing the analysis in a way that isn’t representative of the data set. The BMJ journal describes a more robust approach called the interquartile range, “divide the distribution of the data into four [called quartiles], and find the points below which are 25%, 50% and 75% of the distribution”. In other words, the interquartile range is the distance between the first and third quartiles, and serves as the range in which the middle 50% of your observations lie. This is a probability-based statistic, probability being a topic I will discuss more later with examples.
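Here is a minimal sketch comparing the plain range with the interquartile range, using made-up numbers. It relies on statistics.quantiles from Python's standard library with its default settings; different tools use slightly different quartile conventions, so treat the exact cut points as one reasonable choice rather than the only one.

```python
import statistics


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Distance between the first and third quartiles."""
    first_quartile, _median, third_quartile = statistics.quantiles(values, n=4)
    return third_quartile - first_quartile


# Made-up measurements, then the same list with one extreme outlier.
typical = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

print(value_range(typical), interquartile_range(typical))
print(value_range(with_outlier), interquartile_range(with_outlier))
```

The outlier blows the range up from 10 to 999, while the interquartile range stays the same, which is what "more robust" means in practice.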
Now, I’m clearly reaching the limits of discussing statistics via an auditory format, but bear with me. Let’s go a bit more abstract to understand two more terms, standard deviation and standard error. These are great statistics that are commonly used to determine the variability of the data. We’ll use an example of 100 students and their grades in a statistics course, with all kinds of grades that we can visualize with a histogram, or a bar chart with grades being listed as ranges on the bottom axis, and the number of students who fall within those ranges measured on the side axis. We can imagine that students could do extremely well, but also do extremely poorly, but generally speaking (wink wink, I have to say it because you can’t see me), students will do okay, let's say a mean grade of 7/10. We can say the data fits a Normal Distribution, a term that might need a whole episode on its own, so we need not go further into detail than that. We can include the standard deviation of this mean, usually as plus/minus a value after the mean value. Essentially, if you were to add or subtract one standard deviation value to your mean value, you would obtain a range that includes about 68% of your observations. Adding or subtracting two standard deviations to your mean value would give a range that includes about 95% of your observations, and adding or subtracting 3 standard deviations to your mean value would give a range that includes about 99.7% of your observations. Therefore, the lower this standard deviation value, the less variable your data set is. This is different from the standard error of the mean, which estimates how far a sample mean is likely to be from the true mean. Using the previous example, let’s say you only survey a sample group of 10 students; this sample will have its own mean and standard deviation, but it may not be truly representative of the 100 students. Well, we can divide this sample standard deviation by the square root of the number of people in your sample group, 10, which will give the standard error of the mean. Generally speaking (here we go again), both measure variability, but the standard error is smaller than the standard deviation for any sample of more than one, and it is sometimes reported instead of the standard deviation, which can make data look less variable than they truly are.
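The two ideas above can be sketched in a few lines of Python. The ten grades are hypothetical, and the 68/95/99.7 figures are checked against a textbook normal distribution (statistics.NormalDist), which assumes the data really is normally distributed.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical grades (out of 10) for a sample of 10 students.
sample_grades = [5, 6, 6, 7, 7, 7, 7, 8, 8, 9]


def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))


def fraction_within(num_sd):
    """Share of a normal distribution within `num_sd` standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(num_sd) - standard_normal.cdf(-num_sd)


sample_mean = statistics.mean(sample_grades)
sample_sd = statistics.stdev(sample_grades)
sample_sem = standard_error(sample_grades)

print(f"mean {sample_mean}, standard deviation {sample_sd:.2f}, standard error {sample_sem:.2f}")
for num_sd in (1, 2, 3):
    print(f"within {num_sd} standard deviation(s): {fraction_within(num_sd):.1%}")
```

Notice that the standard error comes out smaller than the standard deviation for the same ten grades, so it matters which one a study reports after the plus/minus sign.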
Variability is… expected. You could imagine that not everything has the same characteristics. One class of students in one year will not have the same distribution of grades as the next year’s class of students. If you wanted to find out whether, let’s say, the statistics class of 2020 has a different mean grade compared to the statistics class of 2019, how would you know that the values are actually different, and that the gap is not just due to variability? You can test for that, and this is what statisticians call statistical significance. In this example, the mean grades of the two classes can be statistically different from each other, or not statistically different from each other. I won’t go into much more detail about statistical significance, p-values and R^2 values, as I am beginning to think I will need a statistics pt.2 episode, but it’s important to be familiar with the concept of significance as it will surely pop up in any statistical study that you come across.
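To make the idea a little more concrete for transcript readers, here is a rough Python sketch of one way to check whether two class means differ by more than chance would explain. The episode does not name a method, so the permutation test used here is my own choice, and the grades, the function name and its parameters are all hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how often randomly relabeled groups show a mean gap
    at least as large as the gap actually observed."""
    rng = random.Random(seed)
    size_a = len(group_a)
    observed_gap = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffle the pooled grades and split them into two pretend classes.
        rng.shuffle(pooled)
        shuffled_a = pooled[:size_a]
        shuffled_b = pooled[size_a:]
        gap = abs(sum(shuffled_a) / size_a - sum(shuffled_b) / len(shuffled_b))
        if gap >= observed_gap:
            at_least_as_large += 1
    return at_least_as_large / n_permutations

# Made-up grades (out of 10) for ten students from each class.
class_2019 = [6.0, 6.5, 7.0, 5.5, 6.0, 7.5, 6.5, 5.0, 6.0, 7.0]
class_2020 = [8.0, 8.5, 7.5, 9.0, 8.0, 7.0, 8.5, 9.5, 8.0, 7.5]

p_value = permutation_p_value(class_2019, class_2020)
print(f"estimated p-value = {p_value:.4f}")
```

A small estimated p-value suggests the gap between the two classes would rarely arise from shuffling alone. A common, though arbitrary, cutoff for calling a result statistically significant is 0.05.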
We’re now at the stage of interpreting data, or what is known as inferential statistics. Here is where the fun stuff happens and where potentially new knowledge comes to light. Let’s begin with hypothesis testing, as any data-driven decision making requires the formation of some statement, or hypothesis, that we are trying to test. As discussed earlier, statistical questions that form the hypotheses we test occur at the beginning of any study. Statisticians will use the terms null hypothesis, being the statement we are trying to test (for example, that people who eat apple pies daily in Canada make a mean salary of $100,000), and the alternative hypothesis, being literally any other hypothesis (for example, that people who eat apple pies daily in Canada DO NOT make a mean salary of $100,000). From here, any properly devised statistical test will do one of two things: 1) fail to reject the null hypothesis, which is often loosely called accepting it, or 2) reject the null hypothesis. In other words, as described by Jason Gibson, creator of the tutoring platform MathandScience, “This is similar to what happens in the court of law: [the null hypothesis] is presumed to be innocent until the evidence says otherwise, or reject its innocence and declare it guilty of the crime [of basically existing].” And just like a court of law, you can get it wrong in two ways. You can reject the null hypothesis when in fact it is true, also known as a type I error or false positive, or you can fail to reject the null hypothesis when in fact it is false, also known as a type II error or false negative. These rates of false positives and false negatives can be quantified in statistical tests, which can give you a sense of how good the study is. Imagine a medical diagnostic test that claims to detect dangerous levels of SARS-CoV-2 in a patient’s blood sample but has a high false positive rate, meaning that it diagnoses people with COVID-19 when they don’t have it, scaring the pants off them. 
The opposite is also frightening: if the medical diagnostic test has a high false negative rate, it misses people who actually have COVID-19, so they are never told to self-quarantine to reduce viral spread.
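As a small illustration for transcript readers, the two error rates for a diagnostic test can be worked out from four counts. The Python sketch below uses entirely made-up numbers for a hypothetical test run on 1,000 people, and the function name is my own.

```python
def error_rates(true_positives, false_positives, true_negatives, false_negatives):
    """Return (false positive rate, false negative rate) from test counts."""
    # False positive rate: share of people WITHOUT the disease who test positive.
    false_positive_rate = false_positives / (false_positives + true_negatives)
    # False negative rate: share of people WITH the disease who test negative.
    false_negative_rate = false_negatives / (false_negatives + true_positives)
    return false_positive_rate, false_negative_rate

# Hypothetical counts: 100 people have the disease, 900 do not.
fpr, fnr = error_rates(
    true_positives=90,
    false_positives=45,
    true_negatives=855,
    false_negatives=10,
)

print(f"false positive rate = {fpr:.0%}")
print(f"false negative rate = {fnr:.0%}")
```

In this invented scenario, 45 of the 900 healthy people are wrongly flagged and 10 of the 100 sick people are missed, which is exactly the pair of mistakes described above.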
[Charlie Brown noises] Goodness, if you’ve made it this far, well done. Honestly, kudos to you. Let me reward you with something more fun regarding inferential statistics. I’ve mentioned various aspects of probability theory, which give people who do statistical analyses confidence that how they interpret data has credibility, has merit. Well, we can use the power of statistics to make educated guesses about the future. Actually, a big reason statistics is done in the first place is to guess where problems lie or will lie and think about ways to fix them. Statistics can be used to benefit from our choices. To emphasize this idea, I present the Monty Hall Problem, a famous brain teaser based on the American game show, Let’s Make A Deal. You’re the next contestant, and host Monty Hall presents the game: There are 3 doors. Behind two of them is a stinky goat, while the remaining door hides the car of your dreams. You pick a door, let’s say door #1, and Monty, who knows what is behind every door, says, “Alright, let’s open one of those doors!”, opening door #2, revealing a stinky goat. Monty then says, “Let’s make a deal, I’ll give you the chance right now to switch your choice.” The clock ticks down. Do you make the switch, or stick to your guns? Well, statistics says you have a higher probability of winning if you make the switch, in fact about a 67% chance of winning, versus about a 33% chance if you stay with your original choice. Why? Well, were you more likely to choose a goat or the car at the beginning of the game? A goat. The fact that Monty revealed that one of the other doors had a goat does not change that fact, or in other words, the game does not wind down to a 50:50 split. If your first pick was a goat, which happens two times out of three, switching lands you on the car. Therefore, statistically speaking, you should always switch if found in a similar circumstance. Other fun statistical games are the Three Prisoners Problem and Bertrand’s Box paradox, which will be linked in the description.
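If the two-thirds figure still feels wrong, one way to convince yourself is to play the game many times on a computer. The Python sketch below is a simple simulation, assuming the standard version of the puzzle in which Monty knows where the car is and always opens a goat door you did not pick. The function names and the number of games are my own choices.

```python
import random

def play_monty_hall(switch, rng):
    """Play one game and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    first_pick = rng.choice(doors)
    # Assumption: Monty knows where the car is and always opens a goat door
    # that the contestant did not pick.
    opened = rng.choice([d for d in doors if d != first_pick and d != car])
    if switch:
        # Switch to the only door that is neither the first pick nor the opened one.
        final_pick = next(d for d in doors if d != first_pick and d != opened)
    else:
        final_pick = first_pick
    return final_pick == car

def win_rate(switch, n_games=100_000, seed=0):
    """Fraction of games won over many simulated rounds."""
    rng = random.Random(seed)
    wins = sum(play_monty_hall(switch, rng) for _ in range(n_games))
    return wins / n_games

switch_rate = win_rate(switch=True)
stay_rate = win_rate(switch=False)
print(f"win rate when switching: {switch_rate:.3f}")
print(f"win rate when staying:   {stay_rate:.3f}")
```

Over many games, the switching strategy should win close to two thirds of the time and the staying strategy close to one third.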
The last thing I want to discuss in this episode is correlation and causation. Correlation is a term that basically means that there is some measurable relationship between one variable and another. For example, one could track the number of cars purchased each year, as well as track average global temperatures each year. A relationship could be that for every increase of 1000 cars purchased, the average global temperature also increases by 1 degree Celsius (which would be a crazy amount by the way, we’d be barbecued, so this is just a hypothetical scenario). You could say that one variable is positively correlated with the other. Negative correlation would just be the opposite: for every increase of 1000 cars purchased, the global temperature decreases by 1 degree Celsius. But any correlation, any relationship between two variables, does not necessarily mean that one influences the other, or in other words, that manipulating one variable causes a change in the other variable. This is what is defined as causation, or cause and effect, something with cause in it. To determine true causation, one would have to perform further tests that control the manipulation of one variable and observe changes in the other. For the example I’ve mentioned, someone with enough power could disincentivize buying cars one year by increasing taxes on car purchases by 1000%, hopefully observe a decrease in car purchases, and see what effect that has on global temperatures. I know, it’s not quite that simple, since many other factors contribute to global temperatures rising or falling each year, but it would be something along those lines. The point is that studies need to be devised in a controlled way that can actually show whether a correlation reflects causation. A funny blog by Tyler Vigen called Spurious Correlations looks at bizarre pairs of variables that are correlated but clearly not causally linked. 
Things like the number of people who drowned by falling into a pool each year in the US correlating with the number of films Nicolas Cage appeared in each year, or the divorce rate in the state of Maine correlating with the per capita consumption of margarine. Obviously Nicolas Cage has no influence on those who have drowned by falling into a pool, nor does margarine consumption impact people’s romantic lives, but it is a fun exercise in understanding data analysis. My question is, has anyone ever tested those?
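For transcript readers who want to see how a correlation is actually measured, here is a minimal Python sketch of the Pearson correlation coefficient, a common measure that runs from -1 (perfect negative correlation) to +1 (perfect positive correlation). The episode does not name a specific measure, so this choice is mine, and the yearly numbers are invented to match the hypothetical cars and temperature example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Made-up yearly figures for the hypothetical scenario in the episode.
cars_purchased = [1000, 2000, 3000, 4000, 5000]
temperature_rising = [1.0, 2.0, 3.0, 4.0, 5.0]   # goes up with car purchases
temperature_falling = [5.0, 4.0, 3.0, 2.0, 1.0]  # goes down with car purchases

r_positive = pearson_r(cars_purchased, temperature_rising)
r_negative = pearson_r(cars_purchased, temperature_falling)
print(f"positive correlation: r = {r_positive:.2f}")
print(f"negative correlation: r = {r_negative:.2f}")
```

Note that the coefficient only describes how two sets of numbers move together. Nothing in the calculation knows or cares whether one variable causes the other.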
That brings us to the end of this segment. There was so much I either glossed over or completely left out, as I didn’t want to complicate things too much, which would completely negate the purpose of an episode solely dedicated to science communication. My goal here was to introduce the basics of statistical terms and ideas used frequently in mainstream media and get people talking, get people curious about a topic that is so bland in educational curriculums but so important in everyday life. Mark Twain popularized the saying, “There are three kinds of lies: lies, damned lies and statistics.” So remember, if there’s anything I want you to retain, it is the very definition of statistics whenever you come across some statistical value that blows people’s minds. Essentially, how was the data collected, how was the data organized or presented, how was the data analyzed and how was the data interpreted? Asking these four questions will only benefit all of us in a world where anything can be said, but not everything is significant.
For this episode, I would like to continue with what I started in Episode 15. What I want is for this podcast to slowly grow into a platform for academics to come on and give their opinion about communicating science, whether you’re an established researcher or graduate student. More importantly, I think trainees are usually hidden from the public eye, yet they do a substantial amount of work to make research available to other scientists and the public through publishing. With that said, I will be doing my second interview today with someone who I believe is doing great work in their field of computational science and biostatistical analysis.
Alexander is a PhD student at the University of California, Los Angeles in the Computer Science department. He completed his undergraduate degree at UC San Diego in Bioengineering: Bioinformatics, followed by an M.Sc. at the University of Ottawa in Biochemistry, with a bioinformatics specialization. He specializes in developing machine learning models to analyze and process large biomedical datasets. His current research involves writing software to investigate how oxidative stress can cause changes in cardiac proteins over time and can be linked to risk of heart disease. Please welcome the highly intelligent Alexander Pelletier.
Thanks for listening to this special Season 2 Finale of… Metaphorigins. Remember to rate and subscribe for more episodes and to follow the podcast on Instagram for updates on when I will be returning with new content. Again, I will be taking a brief break to focus on academic responsibilities and other hobbies like fiction writing (follow @ScaleneWriting on Instagram!), though there will be more Metaphorigins coming to you sometime in February. But until then, stay skeptical but curious.