Statistics Bootcamp
To more formally address the need for a statistics lecture series on Medium, I have started a series of Statistics Bootcamps, as seen in the title above. These will build on one another and will be numbered accordingly. The motivation is to democratize the knowledge of statistics in a ground-up fashion and to address the need for more formal statistics training in the data science community. The bootcamps will begin simple and expand upwards and outwards, with exercises and worked examples along the way. My personal philosophy when it comes to engineering, coding, and statistics is that if you understand the math and the methods, the abstraction introduced by a multitude of libraries falls away, allowing you to be a producer, not only a consumer, of information. Much of this material will be review for many learners/readers; however, having a comprehensive understanding and a resource to refer to is important. Happy reading/learning!
Research and development process
Our whole reason behind statistics and data science is that we are conducting research, in either a formal or an informal setting. I have outlined below the research and development process, which should be followed regardless of whether you are in academia or industry. I like to call these steps the ‘Elegant 8’ of R&D:
- Identify a question of interest (one whose answer would have a notable impact in your line of work)
- Search the literature for information. This is especially important so that you do not unknowingly duplicate work, inadvertently plagiarize, or waste time and resources.
- Generate a hypothesis (we will cover this more later). This is especially important because in ‘data science’ you are constantly performing data exploration and cleaning (often in tandem), and it is easy to forget to generate a meaningful hypothesis that adds novel value to the current working understanding of a phenomenon.
- Identify variables: independent (proposed cause), dependent (proposed effect)
- Collect/acquire/gain access to data
- Analyze data: HOW you will analyze the data MUST be considered prior to data acquisition, in tandem with hypothesis creation, so that the hypothesis is formulated in a testable way
- Draw conclusions – based on your stats! (what you can say and what you can’t say…)
- Make recommendations
Terminology
We shall start with some simple terminology. My apologies that this first article is more heavily laden with terminology than the subsequent bootcamps, as it sets our stage. So what is statistics? Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data. When we collect or acquire data, we either have all the data on the topic (population level) or a subset of our population (sample). A population is the collection of all individuals or items (subjects) being studied. A sample is a group of subjects selected from a population (it should have similar characteristics to the population).
Statistics can be broken down into two flavors: descriptive and inferential.

Descriptive statistics consists of the collection, organization, and summarization of data. Examples include: mean, median, percentiles, counts, etc.
Example: Average healthcare spending per capita by state governments in 2005–2006 was $2845.
Inferential statistics consists of methods for drawing and measuring the reliability of conclusions about a population, leveraging a sample. Examples include: estimates, hypothesis tests, relationships among variables and predictions.
Example: By the year 2035, 25% of the U.S. population will be 65 years of age or more.
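To make the two flavors concrete, here is a minimal sketch using Python's standard `statistics` module. The spending values in `sample` are made up for illustration and are not real data. The first part summarizes the sample we have (descriptive), while the standard error is a first step toward saying how reliable that summary is as an estimate for a population (inferential).

```python
import math
import statistics

# Hypothetical sample of per-capita spending figures (illustrative values only).
sample = [2610, 2845, 3020, 2480, 3310, 2795, 2905, 2650]

# Descriptive statistics: summarize the data we actually have.
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# A first step toward inference: the standard error estimates how much the
# sample mean would vary from one random sample to the next.
standard_error = sample_sd / math.sqrt(len(sample))

print(f"mean={sample_mean:.1f} median={sample_median:.1f} sd={sample_sd:.1f}")
print(f"standard error of the mean={standard_error:.1f}")
```

Later bootcamps in a series like this would normally build confidence intervals and hypothesis tests on top of quantities such as the standard error.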
Types of Data
Possible data transformations aside, we have two types of data. These are qualitative and quantitative. Qualitative implies a non-numeric value. There are two types: nominal or ordinal.
Example:
- Nominal: gender, ethnicity, geographic location
- Ordinal: letter grades, Likert scale survey responses (the gaps between levels are not necessarily equal)
Quantitative data are numeric values that can be ordered and ranked. Again, there are two main types: discrete and continuous. Discrete indicates that the possible values can be counted. Think of unsigned integers here. Continuous data can take an infinite number of values between any two specific values. Think of floating point numbers here.
Example:
- Continuous: radius of an object in an image, height (fractions available)
- Discrete: number of students in a class
Measurement Levels
When breaking down data types, you have to consider the measurement levels. Nominal data are classified into mutually exclusive categories in which no order or ranking can be imposed.
Example: gender, ethnicity, political party
Ordinal refers to ordered or ranked categories, where precise differences between levels do not necessarily exist.
Example: letter grades, Likert scale survey evaluations

Interval data have precise, fixed differences between consecutive units of measure, so the distance between values is meaningful. However, there is NO meaningful zero.
Example: calendar year, figure skating scores
Ratio data have all the characteristics of interval measurement, but with a TRUE zero. Thus, ratios can be created.
Example: height, weight, salary, time
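A small sketch can show why a true zero matters. It assumes a hypothetical alternative calendar whose epoch is shifted by an arbitrary 500 years. Moving the zero point leaves differences between years intact but changes their ratios, whereas a unit change on ratio data (height in centimetres versus metres) keeps ratios the same.

```python
# Interval data: two calendar years, and the same two years counted from a
# hypothetical alternative epoch (the offset of 500 is arbitrary).
year_a, year_b = 1000, 2000
offset = 500
shifted_a, shifted_b = year_a - offset, year_b - offset

# Differences survive a change of zero point, so they are meaningful.
diff_original = year_b - year_a
diff_shifted = shifted_b - shifted_a

# Ratios do not survive it, so "twice as late" is not a meaningful statement.
ratio_original = year_b / year_a
ratio_shifted = shifted_b / shifted_a

# Ratio data: height has a true zero, so a unit change keeps ratios intact.
height_a_cm, height_b_cm = 90.0, 180.0
ratio_cm = height_b_cm / height_a_cm
ratio_m = (height_b_cm / 100) / (height_a_cm / 100)

print("year differences:", diff_original, diff_shifted)
print("year ratios:", ratio_original, ratio_shifted)
print("height ratios:", ratio_cm, ratio_m)
```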
Population vs. Sample
We defined the difference between population and sample above, but now we are going to look at what this means when selecting statistical tests: specifically, whether the measure we are working with is a statistic or a parameter.

A statistic is a measure obtained by using the data values from a sample.
Example: sample mean, sample standard deviation, median, quartiles, frequency
A parameter, by contrast, is a measure obtained by using ALL the data values from a population. It is often unknown and of interest in research.
Example: population average, population variance
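The distinction can be sketched with a simulated population. Everything below is an assumption for illustration: the population is 10,000 simulated heights, which we would rarely have in full in practice. The parameter is computed from every value, and the statistic from a sample of 100.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: computed from ALL population values (usually unknown in practice).
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)  # population formula (divide by N)

# Statistic: computed from a sample drawn from that population.
sample = rng.sample(population, 100)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample formula (divide by n - 1)

print(f"parameter: mean={population_mean:.2f} sd={population_sd:.2f}")
print(f"statistic: mean={sample_mean:.2f} sd={sample_sd:.2f}")
```

Note the two different standard deviation functions: `pstdev` treats the data as the whole population, while `stdev` treats it as a sample.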
Sampling
Sampling is the method by which we obtain a subset of the population. The sample should be representative of the population of interest and should not be biased by the researchers.
Note: the sampling procedure used will influence the type of inferential statistics you can apply to the data!
Simple Random Sample
A sampling procedure where each possible sample of a given size is equally likely to be the one obtained.
How do we do this? First, number all the subjects/items in the population. Then generate random numbers, using either a table of random numbers or a random number function, and select the subjects whose numbers come up. If you use a table, start at a random location, then move in any direction with a variable step size until you have obtained a sample of the desired size.
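In code, a random number function replaces the table. The following is a minimal sketch using `random.sample` from Python's standard library, which draws without replacement so that every subset of the requested size is equally likely. The helper name `simple_random_sample` and the population of 200 numbered subjects are assumptions for illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct items; every subset of that size is equally likely."""
    rng = random.Random(seed)  # pass a seed only when you need repeatability
    return rng.sample(population, sample_size)

# Hypothetical population: subjects numbered 1 to 200.
subjects = list(range(1, 201))
sample = simple_random_sample(subjects, 10, seed=7)
print(sample)
```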
Systematic Sampling
- Number each item/subject in the population, without ordering by any particular characteristic
- Calculate ‘k’ based on your desired sample size: k = population size / sample size
- Start at a random number (within the first k)
- Pick every kth item until desired sample size
Example: The population has 2000 subjects and a sample of 50 is needed. 2000/50 = 40, so k = 40 and we select every 40th subject.
Note: This could easily present an issue IF, for example, your population list were ordered in an alternating pattern: male, female, male, female, etc. Using k=2 here would not give you a representative sample, as you would only be selecting one gender (only females if you start at the second item).
Example: male female male female male female
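The steps above can be sketched as follows. The function name `systematic_sample` is hypothetical, and the sketch uses integer division for k, which is one reasonable choice when the population size is not an exact multiple of the sample size. The second half reproduces the alternating-list pitfall.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th item after a random start within the first k items."""
    k = len(population) // sample_size  # sampling interval
    if k < 1:
        raise ValueError("sample_size cannot exceed the population size")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random 0-based start within the first k items
    return population[start::k][:sample_size]

# 2000 numbered subjects and a sample of 50 give k = 40.
subjects = list(range(1, 2001))
sample = systematic_sample(subjects, 50, seed=0)
print(sample[:5])

# Pitfall: a list that alternates male/female, sampled with k = 2,
# yields a sample containing only one of the two values.
alternating = ["male", "female"] * 100
biased = systematic_sample(alternating, 100, seed=0)
print(set(biased))
```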
Cluster Sampling
Cluster sampling works well if your data is HUGE. Essentially, you divide the population into groups of approximately equal size, then obtain a random sample of the clusters. Steps to carry this out:
- Divide the population into clusters (groups)
- The number of clusters is usually large and each cluster is of similar size
- Take a random sample of the clusters
- Include all members of the selected clusters to reach your sample size
Example: departments in a university, classes in a school
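As a rough sketch of these steps, the code below randomly selects whole clusters and keeps every member of each selected cluster. The function name `cluster_sample` and the school of 20 classes with 25 students each are assumptions for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters maps a cluster name to its list of members.

    Randomly choose n_clusters clusters and keep ALL of their members.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical school: 20 classes of 25 students each.
classes = {
    f"class_{i:02d}": [f"class_{i:02d}_student_{j:02d}" for j in range(25)]
    for i in range(20)
}
chosen, sample = cluster_sample(classes, 4, seed=1)
print(chosen, len(sample))
```

The randomness lives entirely at the cluster level: once a class is chosen, no further sampling happens inside it.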

Stratified Sampling
Use stratified sampling when we want to ensure that our sample represents the population along a variety of characteristics.
- Divide the population into subpopulations (groups or strata according to some characteristic)
- Obtain a simple random sample from each stratum, with a size proportional to the stratum’s share of the population
- Combine the samples to form your sample of the population

Example: Assume we want a sample size of 20:
Stratum sample size = desired sample size × (stratum size / population size)
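The proportional allocation formula above can be sketched in code. The strata below (four class years with 400, 300, 200 and 100 students) are hypothetical and chosen so that the allocations come out as whole numbers. With real data the rounded stratum sizes may not add up exactly to the desired sample size and may need a small adjustment.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """strata maps a stratum name to its list of members.

    Draw a simple random sample from each stratum, sized in proportion
    to the stratum's share of the population.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum sample size = desired sample size * (stratum size / population size)
        n = round(sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical population of 1000 students split by class year.
sizes = {"freshman": 400, "sophomore": 300, "junior": 200, "senior": 100}
strata = {
    year: [f"{year}_{i}" for i in range(count)]
    for year, count in sizes.items()
}
sample = stratified_sample(strata, 20, seed=5)
allocation = {year: len(members) for year, members in sample.items()}
print(allocation)
```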

Cluster vs. Stratified Sampling
Similarities
- goal: get a representative sample of the population
- divide the population into groups
Differences
- cluster: take a random sample of groups
- stratified: take a random sample within groups
When to use which
- cluster: large number of naturally occurring groups in the population. Each group can contain a great deal of variability, and the groups are expected to resemble one another
- stratified: small number of groups. Expect similarity within each group and differences between groups, and the characteristic used to stratify relates to the research question
Combine Sampling Schemes
Clustering within stratified sampling:
Example: Research Q: What is the effect of tackling on the risk of injury in adolescent (13–17) football players?
- Stratified by city: Dallas, Chicago, New York
- Clustering within each city: adolescent teams within each city
- Include all players in each sampled team in the final sample
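This combined scheme can be sketched as below. The team and player names are invented placeholders, and the choice of 6 teams per city, 15 players per team, and 2 sampled teams per city is an assumption for illustration. Every city (stratum) is represented, teams (clusters) are chosen at random within each city, and all players on a chosen team are included.

```python
import random

def stratified_cluster_sample(teams_by_city, teams_per_city, seed=None):
    """teams_by_city maps city -> {team name -> list of players}.

    Every city is sampled (stratification); within each city a random
    set of teams is chosen and all their players are kept (clustering).
    """
    rng = random.Random(seed)
    players = []
    for city, teams in teams_by_city.items():
        chosen_teams = rng.sample(sorted(teams), teams_per_city)
        for team in chosen_teams:
            players.extend((city, team, player) for player in teams[team])
    return players

# Hypothetical data: 3 cities, 6 teams per city, 15 players per team.
cities = ["Dallas", "Chicago", "New York"]
teams_by_city = {
    city: {
        f"{city} team {t}": [f"player {p}" for p in range(1, 16)]
        for t in range(1, 7)
    }
    for city in cities
}
sample = stratified_cluster_sample(teams_by_city, 2, seed=3)
players_per_city = {c: sum(1 for city, _, _ in sample if city == c) for c in cities}
print(players_per_city)
```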
Study Classification
Observational study: the researcher simply observes what is happening or what has happened in the past and tries to draw conclusions based on these observations (the researcher cannot manipulate or control variables). Data science typically falls in THIS category!
Example: smoking versus lung cancer
Disadvantage: only association/correlation can be inferred.
Note: There is ongoing work in causal modeling that does not follow this rule; however, it traditionally holds.
Experimental study: the researcher manipulates a variable (the independent or explanatory variable) and tries to determine how it influences the outcome (the dependent variable or response).
Example: randomized controlled trials (RCTs)
Advantage: a causal link can be established.
Wrap-up
In this bootcamp we have introduced the definition of statistics and the difference between a population and a sample, along with how they pair with the measures parameter and statistic, respectively. We’ve covered multiple sampling techniques, noting the pros and cons, and how one might combine them in a research study. Most importantly though, we have set the stage for subsequent bootcamps in this series by outlining our approach to research, with the idea of development following close at hand. Unfortunately, applying statistics ad hoc in the hope of getting the result you desire is contrived, all too common, and a large part of the motivation for creating this bootcamp. So, remember your ‘Elegant 8’!
Next bootcamp in the series:
#2 Center, Variation and Position