
Statistics Bootcamp 1: Laying the Foundations

Learn the math and methods behind the libraries you use daily as a data scientist

Image by Author

Statistics Bootcamp

To more formally address the need for a statistics lecture series on Medium, I have started a series of Statistics Bootcamps, as seen in the title above. These will build on one another and will be numbered accordingly. The motivation is to democratize the knowledge of statistics in a ground-up fashion and to address the need for more formal statistics training in the data science community. These will begin simple and expand upwards and outwards, with exercises and worked examples along the way. My personal philosophy when it comes to engineering, coding, and statistics is that if you understand the math and the methods, the abstraction introduced by a multitude of libraries falls away and allows you to be a producer, not only a consumer, of information. Many facets of these will be a review for many learners/readers; however, having a comprehensive understanding and a resource to refer to is important. Happy reading/learning!

Research and development process

Our whole reason behind statistics and data science is that we are conducting research, in either a formal or an informal setting. I have outlined below the research and development process, which should be followed regardless of whether you are in academia or industry. I like to call these the ‘Elegant 8’ of R and D:

  1. Identify a question of interest (that answering would have some notable impact in your line of work)
  2. Search the literature for information. This is especially important so that you do not unknowingly duplicate work, inadvertently plagiarize, or waste time and resources.
  3. Generate a hypothesis (we will cover this more later). This is especially important because in ‘data science’ you are constantly performing data exploration and cleaning (often in tandem), and it is easy to forget to generate a meaningful hypothesis that adds novel value to the current working understanding of a phenomenon.
  4. Identify variables: independent (proposed cause), dependent (proposed effect)
  5. Collect/acquire/gain access to data
  6. Analyze data – HOW you will analyze the data MUST be considered prior to data acquisition and in tandem with hypothesis creation, so that the hypothesis is formulated in a testable way
  7. Draw conclusions – based on your stats! (what you can say and what you can’t say…)
  8. Make recommendations

Terminology

We shall start with some simple terminology. My apologies that this first article will be laden with more terminology than the subsequent bootcamps, as it sets our stage. So what is statistics? Well, statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data. When we collect or acquire data, we either have all the data on the topic (population level) or a subset of our population (sample). A population is the collection of all individuals or items (subjects) being studied. A sample is a group of subjects selected from a population (it should have similar characteristics to the population).

Statistics can be broken down into two flavors: descriptive and inferential.

Image by Author

Descriptive statistics consists of the collection, organization, and summarization of data. Examples include: mean, median, percentiles, counts, etc.

Example: Average healthcare spending per capita by state governments in 2005–2006 was $2845.

Inferential statistics consists of methods for drawing and measuring the reliability of conclusions about a population, leveraging a sample. Examples include: estimates, hypothesis tests, relationships among variables and predictions.

Example: By the year 2035, 25% of the U.S. population will be 65 years of age or more.
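To make the two flavors concrete, here is a minimal sketch using only Python's standard library. The ten spending values are made up for illustration, and the normal-approximation interval is an assumption chosen for simplicity (with so few observations a t-based interval would be wider).

```python
import math
import statistics

# Hypothetical sample: yearly spending (in dollars) for 10 randomly chosen households.
sample = [2100, 2500, 2750, 2845, 2900, 3000, 3100, 3300, 3450, 3600]

# Descriptive: summarize the data we actually have.
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential: use the sample to say something about the unknown population mean.
# Assumption: a rough 95% interval based on the normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * sample_sd / math.sqrt(len(sample))
interval = (sample_mean - margin, sample_mean + margin)

print(f"descriptive: mean={sample_mean}, median={sample_median}")
print(f"inferential: approximate 95% interval for the population mean = {interval}")
```

The first two numbers describe only these ten households; the interval is a statement about the population they were drawn from, and it is only as trustworthy as the sampling procedure behind it.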

Types of Data

Possible data transformations aside, we have two types of data. These are qualitative and quantitative. Qualitative implies a non-numeric value. There are two types: nominal or ordinal.

Example: Nominal: gender, ethnicity, geographic location. Ordinal: number of rooms in a house (each room may be a different size).

Quantitative data are numeric and can be ordered and ranked. Again, there are two main types: discrete and continuous. Discrete indicates that the possible values can be counted. Think of unsigned integers here. Continuous data can assume an infinite number of values between any two specific values. Think of floating point numbers here.

Example: Continuous: radius of an object in an image, height (fractions available). Discrete: number of students in a class.

Measurement Levels

When breaking down data types, you have to consider the measurement levels. Nominal data are classified into mutually exclusive categories in which no order or ranking can be imposed.

Example: gender, ethnicity, political party

Ordinal refers to ordered or ranked categories, where precise differences between levels do not necessarily exist. E.g. letter grades, Likert scale survey evaluations.

Image by Author

Interval data are those with precise and fixed magnitude differences between consecutive units of measure. There is meaningful distance between the units of measurement, but there is NO meaningful zero.

Example: calendar year, figure skating scores

Ratio data have all the characteristics of interval measurement, but with a TRUE zero. Thus, ratios can be created.

Example: height, weight, salary, time
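One way to see why the true zero matters is to move the zero point and watch what happens. The sketch below uses calendar years (interval) and heights (ratio); the shifted calendar with a 500-year offset is purely hypothetical, chosen only to illustrate the point.

```python
# Two calendar years on the usual scale, and the same years on a hypothetical
# calendar whose year 0 is placed 500 years later (an assumption for illustration).
OFFSET = 500
year_a, year_b = 1000, 2000
shifted_a, shifted_b = year_a - OFFSET, year_b - OFFSET

# Interval data: differences survive a shift of the zero point...
diff_original = year_b - year_a        # 1000 years apart
diff_shifted = shifted_b - shifted_a   # still 1000 years apart

# ...but ratios do not, so "twice as late" is not a meaningful statement.
ratio_original = year_b / year_a       # 2.0
ratio_shifted = shifted_b / shifted_a  # 3.0

# Ratio data: height has a true zero, so the ratio is meaningful.
height_child_cm, height_adult_cm = 90, 180
height_ratio = height_adult_cm / height_child_cm  # 2.0, i.e. twice as tall
```

Because the zero of a calendar is arbitrary, the ratio changes with the choice of zero, while the difference does not. With height, zero means "no height at all", so the ratio carries real information.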

Population vs. Sample

We defined the difference between population and sample above, but now we are going to look at what this means when selecting statistical tests, specifically a statistic or a parameter.

Image by Author

A statistic is a measure obtained by using the data values from a sample.

Example: sample mean, sample standard deviation, median, quartiles, frequency

A parameter, by contrast, is a measure obtained by using ALL the data values from a population; it is often unknown and of interest in research.

Example: population average, population variance
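The distinction is easy to see in a small simulation. In the sketch below the "population" is simulated (10,000 hypothetical heights), which is the only reason we can compute the parameters at all; in real research they are usually unknown.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 people.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameters: computed from ALL the values in the population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)

# Statistics: computed from a sample drawn from that population.
sample = rng.sample(population, 100)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)

print(f"parameters: mu={mu:.2f}, sigma={sigma:.2f}")
print(f"statistics: x_bar={x_bar:.2f}, s={s:.2f}")
```

Note the two different standard deviation functions: `pstdev` treats the data as the whole population, while `stdev` treats it as a sample.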

Sampling

Sampling is the method by which we obtain a subset of the population that is representative of the population of interest and unbiased by the researchers.

Note: the sampling procedure used will influence the type of inferential statistics you can apply to the data!

Simple Random Sample

A sampling procedure where each possible sample of a given size is equally likely to be the one obtained.

How do we do this? We need to generate random samples, and we can do so using a table or function of random numbers. First, number all the subjects/items in the population. Then start at a random location in the table and move in any direction, with a variable step size, until you obtain a sample of the desired size.
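In code, the "function of random numbers" route is short. Here is a minimal sketch using Python's standard `random` module; the helper name `simple_random_sample` and the population of 200 numbered subjects are assumptions for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: subjects numbered 1 through 200.
subjects = list(range(1, 201))
sample = simple_random_sample(subjects, 50, seed=0)
print(sorted(sample))
```

Fixing the seed is useful for reproducing an analysis, but the seed itself should not be chosen after looking at which sample it produces.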

Systematic Sampling

  1. Number each item/subject in the population without ordering by any specific characteristic
  2. Calculate ‘k’ based on your desired sample size: k = population size / sample size
  3. Start at a random number (within the first k)
  4. Pick every kth item until you reach the desired sample size

Example: The population has 2000 subjects and a sample of 50 is needed. k = 2000/50 = 40, so every 40th item is selected.

Note: This could easily present an issue with sampling IF, for example, your population list was ordered alternating male, female, male, female, etc. Using k=2 here would not give you a representative sample, as you would only be selecting one gender (only females if you start at the second item).

Example: male female male female male female
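The steps above can be sketched in a few lines of Python. The function name `systematic_sample` is my own, and the sketch assumes the population size divides evenly by the sample size, as in a population of 2000 with a sample of 50 (k = 40). The second part reproduces the alternating male/female pitfall.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick every kth item, starting at a random position within the first k."""
    if not 0 < sample_size <= len(population):
        raise ValueError("sample_size must be between 1 and the population size")
    k = len(population) // sample_size
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index within the first k items
    return population[start::k][:sample_size]

# A population of 2000 numbered subjects and a sample of 50, so k = 40.
subjects = list(range(1, 2001))
sample = systematic_sample(subjects, 50, seed=3)

# The pitfall: an alternating list sampled with k = 2 yields only one gender.
alternating = ["male", "female"] * 100
biased = systematic_sample(alternating, 100, seed=3)
print(set(biased))
```

Which gender you end up with depends on the random start, but whichever it is, the other half of the population is never selected.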

Cluster Sampling

Cluster sampling works well if your data is HUGE. Essentially, you divide the population into groups of approximately equal size, then obtain a random sample of the clusters. Steps to carry this out:

  1. Divide population into clusters (groups)
    • the number of clusters is usually large and each cluster size is similar
  2. Take a random sample of clusters
  3. Include all members of the selected clusters together to reach your sample size

Ex. departments in a university, classes in a school
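Here is a minimal sketch of those steps, using classes in a school as the clusters. The data (20 classes of 25 students) and the helper name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical population: 20 classes with 25 students each.
classes = {
    f"class_{c:02d}": [f"class_{c:02d}_student_{i}" for i in range(25)]
    for c in range(20)
}

chosen, sample = cluster_sample(classes, 4, seed=1)
print(chosen, len(sample))
```

Note that the randomness happens only at the level of the classes. Once a class is chosen, all of its students are included.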

Image by Author

Stratified Sampling

Stratified sampling is used when we want to ensure we are creating a representation of the population along a variety of characteristics.

  1. Divide the population into subpopulations (groups or strata according to some characteristic)
  2. Obtain a simple random sample from each stratum, of size proportional to the stratum’s share of the population
  3. Combine the samples to form your sample of the population
Image by Author

Ex. Assume we want a sample size of 20:

Stratum sample size = desired sample size × (stratum size / population size)

Image by Author
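The allocation formula and the three steps can be sketched as follows. The strata names and sizes here are hypothetical (they are not the numbers from the figure); only the desired sample size of 20 comes from the example. Rounding is an assumption on my part, and with less convenient stratum sizes the rounded counts may not add up exactly to the desired total.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample from each stratum, proportional to its size."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum sample size = desired sample size * (stratum size / population size)
        n_stratum = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample

# Hypothetical strata: a population of 100 students split by year of study.
strata = {
    "freshman": [f"freshman_{i}" for i in range(40)],
    "sophomore": [f"sophomore_{i}" for i in range(30)],
    "junior": [f"junior_{i}" for i in range(20)],
    "senior": [f"senior_{i}" for i in range(10)],
}

sample = stratified_sample(strata, 20, seed=5)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```

With these sizes the allocation works out to 8, 6, 4, and 2 subjects, which together give the desired 20.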

Cluster vs. Stratified Sampling

Similarities

  • goal: get a representative sample of the population
  • divide the population into groups

Differences

  • cluster: take a random sample of groups
  • stratified: take a random sample within groups

When to use which

  • cluster: large number of naturally occurring groups in the population; expect large differences within groups, so each cluster can have a great deal of variability
  • stratified: small number of groups; expect similarity within groups; the characteristic to stratify on relates to the research question

Combine Sampling Schemes

Clustering within stratified sampling:

Example: Research Q: What is the effect of tackling on the risk of injury in adolescent (13–17) football players?

Stratified by city: Dallas, Chicago, New York. Clustering within each city: adolescent teams within each city. Include all players in each sampled team in the final sample.
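A sketch of this combined scheme is below. The three cities come from the example, but the number of teams per city (6), the roster size (11), and the function name `stratified_cluster_sample` are assumptions for illustration.

```python
import random

def stratified_cluster_sample(teams_by_city, teams_per_city, seed=None):
    """Stratify by city, randomly pick teams (clusters) in each city, keep all players."""
    rng = random.Random(seed)
    players = []
    for city in sorted(teams_by_city):      # every stratum (city) is represented
        teams = teams_by_city[city]
        for team in rng.sample(sorted(teams), teams_per_city):
            players.extend(teams[team])     # include ALL players on a sampled team
    return players

# Hypothetical data: 6 adolescent teams per city, 11 players per team.
cities = ["Dallas", "Chicago", "New York"]
teams_by_city = {
    city: {
        f"team_{t}": [f"{city}/team_{t}/player_{p}" for p in range(11)]
        for t in range(6)
    }
    for city in cities
}

final_sample = stratified_cluster_sample(teams_by_city, 2, seed=11)
print(len(final_sample))
```

Stratifying guarantees that each city appears in the final sample, while clustering keeps data collection practical, since you only need to visit the sampled teams.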

Study Classification

Observational study – the researcher simply observes what is happening or what has happened in the past and tries to draw conclusions based on these observations (cannot manipulate or control). Data Science typically falls in THIS category!

Ex. smoking versus lung cancer. Disadvantage: only association/correlation can be inferred. Note: There is ongoing work in causal modeling that does not typically follow this route; however, this is traditionally true.

Experimental study – the researcher manipulates a variable (independent variable, explanatory variable) and tries to determine how it influences the outcome (dependent variable/response).

Ex. Randomized Controlled Trials (RCTs). Advantage: a causal link can be established.
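The "randomized" part of an RCT is what lets the researcher control the independent variable. As a minimal sketch, assuming a simple two-arm design with equal group sizes (real trials often use more elaborate schemes), random assignment could look like this; the helper name `randomize` and the 40 participant IDs are hypothetical.

```python
import random

def randomize(participants, seed=None):
    """Randomly split participants into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1 through 40.
treatment, control = randomize(range(1, 41), seed=7)
print(sorted(treatment))
print(sorted(control))
```

Because chance alone decides who receives the treatment, systematic differences between the groups other than the treatment itself become unlikely, which is what supports a causal reading of the result.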

Wrap-up

In this bootcamp we have introduced the definition of statistics, the difference between a population and a sample, and how they pair with the measures parameter and statistic, respectively. We’ve covered multiple sampling techniques, noting the pros and cons, and how one might combine them in a research study. Most importantly though, we have set the stage for subsequent bootcamps in this series by outlining our approach to research, with the idea of development following close at hand. Unfortunately, applying statistics ad hoc in the hope of getting the result you desire is contrived and all too common, and it is a large part of the motivation for creating this bootcamp. So, remember your ‘Elegant 8’!

Next bootcamp in the series:

#2 Center, Variation and Position


Additionally, if you like seeing articles like this and want unlimited access to my articles and all those supplied by Medium, consider signing up using my referral link below. Membership is $5(USD)/month; I make a small commission that in turn helps to fuel more content and articles!

Join Medium with my referral link – Adrienne Kline

