Esteban Montenegro-Montenegro, PhD

Department of Psychology and Child Development

It might seem trivial to talk about nature and how science relates to nature. However, we study *nature* when we study **statistics**. As Westfall & Henning (2013) mention in their book: “Nature is all aspects of past, present, and future existence. Understanding Nature requires common observation—that is, it encompasses those things that we can agree we are observing” (p. 1).

As psychologists, we study behavior, thoughts, emotions, beliefs, cognition, and the contextual aspects of all of the above. These elements are also part of nature; we mainly study *constructs*, and I will talk about constructs frequently. *Statistics* is the language of science. Statistics concerns the analysis of recorded information or data. Data are commonly observed and subject to common agreement and are therefore more likely to reflect our common reality, or *Nature*.

To study and understand *Nature* we must construct a *model* for how *Nature* works. A model helps you to understand Nature and also allows you to make predictions about it. There is no right or wrong model; they are all wrong! But some are better than others.

We will focus on mathematical models built from equations and theorems.

Models cannot reflect exactly how Nature works, but they are close enough to allow us to make predictions and inferences.

Let’s create a model for driving time. Imagine you drive at 100 km/hour; then we can predict your driving time \(y\) in hours, given a distance \(x\) in kilometers, with the following model: \(y = x/100\).

You could plug in any value to replace \(x\) and you’ll get a prediction:

Example:

- Pay attention to our model. The model we just created produces data; if we replace \(x\) with other values, the model will give us new information:

- A model is like a recipe: it has steps and rules that will help you prepare a cake or a meal. The cake is your data.
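The driving-time model can be tried directly in R; the distances below are arbitrary example values, not taken from the text:

```r
# Driving-time model: y = x / 100, where x is distance in km
# and speed is fixed at 100 km/hour.
drivingTime <- function(x) x / 100

distances <- c(50, 100, 250)  # example distances in km
drivingTime(distances)        # 0.5 1.0 2.5 hours
```

Every distance you plug in produces a new prediction, which is exactly what “the model produces data” means here.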

**Always remember!**

Models produce data; data do not produce models!

You can use data to *estimate* models, but that does not change the fact that your model comes first, before you ever see any data.

**Always remember!**

The order matters: Nature is there before our measurements, and the data come after we establish our way to measure Nature.

Westfall & Henning (2013) make an important distinction:

They write *DATA* in upper case to differentiate it from *data* in lower case. Why?

When we talk about Nature we are talking about an infinite number of observations; from this infinite number of possibilities we extract a portion of DATA. This is similar to the concept of a *population*. For instance, your DATA could be all the college students in California.

*data* (lowercase) means that we have already collected a sample from that *DATA*. For example, if you get information from only college students from the valley, you’ll have one single instance of what is possible in your population.

Your *data* is the only way we have to say something about the *DATA*. Prior to collecting data, the DATA is *random* and *unknown*.

*More about models*

We will use the term “probability model”, which we will represent as \(p(y)\); this translates into “the probability of \(y\)”.

Let’s use the coin-flipping example. We know that the probability of flipping a coin and getting heads is 50%, and the same probability holds for tails (50%). Then we can represent this probability by \(p(heads) = 0.5\) or \(p(tails) = 0.5\).

This is actually a good model! It produces as many random coin flips as you like. “Models produce data” means that YOUR model explains how your DATA will be produced!
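A minimal sketch of this coin model in R, using the built-in `rbinom()` to produce random flips (the seed and number of flips are arbitrary choices):

```r
set.seed(1)  # arbitrary seed, for reproducibility only
# p(heads) = 0.5; each flip is a Bernoulli trial
flips <- rbinom(10, size = 1, prob = 0.5)  # 1 = heads, 0 = tails
flips
mean(flips)  # proportion of heads in this particular sample
```

The model \(p(heads) = 0.5\) produced these data; run it again without the seed and it will happily produce a different sample.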

Your *model* for Nature is that your *DATA* come from a probability model \(p(y)\).

Our model \(p(y)\) can be used to predict and explain Nature. A prediction is a guess about unknown events in the past, present or future or about events that might not happen at all.

It is more like asking “what if.” For example, what if I had an extra $3000 to pay my credit balance?

- In science, you’ll find models that are *deterministic*; that means our outcome is completely determined by our \(x\).

Let’s see this example: Free Fall Calculator

In physics you’ll find several examples of deterministic models, especially from the Newtonian perspective.

Normally these models are represented with \(f(.)\) instead of \(p(.)\), for example the driving distance example can be written as \(f(x) = x/100\).

This symbol \(f(x)\) means a mathematical function. The function will give us only one solution in the case of a deterministic model: we plug in values and we get a solution (\(y = f(x)\)).
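Following the free-fall example, a deterministic model can be sketched in R as an ordinary function. The formula below is the standard free-fall relation \(t = \sqrt{2h/g}\) (assumed here for illustration, not taken from the text):

```r
# Deterministic model: time (seconds) for an object dropped from
# height h (meters) to reach the ground, ignoring air resistance.
g <- 9.81                       # gravitational acceleration, m/s^2
fallTime <- function(h) sqrt(2 * h / g)

fallTime(20)  # ~2.02 s; the same input always gives the same output
```

No matter how many times we call `fallTime(20)`, we get exactly the same answer; there is no variability in the output, which is the defining feature of a deterministic model.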

Probabilistic models assume *variability*; they are not deterministic. Probabilistic models produce data that vary, and therefore we’ll have distributions. Probabilistic models are more realistic for explaining phenomena in psychology.

The following expression represents a probabilistic model:

- The symbol \(\sim\) can be read aloud either as “produced by” or “distributed as.” In a complete sentence, the mathematical shorthand \(Y \sim p(y)\) states that your \(DATA\) \(Y\) are produced by a probability model having mathematical form \(p(y)\).

Would a deterministic model explain how people feel after a traumatic event?

Can we plug in values in a deterministic function to predict your attention span while driving?

You must use a *probabilistic* (stochastic) model to study natural variability.

- A *parameter* is a numerical characteristic of the data-generating process, one that is usually unknown but often can be estimated using data.

For instance, the model shown above has **two unknown parameters**; this model produces data \(Y\). The unknown parameters are represented with Greek letters, for instance \(\mu\) (the mean) and \(\sigma\) (the standard deviation).

**Mantra**

Model produces data.

Model has unknown parameters.

Data reduce the uncertainty about the unknown parameters.

In a probabilistic model the variable \(Y\) is produced at random. This statement is represented by \(Y \sim p(y)\).

\(p(y)\) is called a *probability density function (pdf)*. A pdf assigns a likelihood to your values. I will explain this in the next sessions.

*IMPORTANT:* A purely probabilistic statistical model states that a variable \(Y\) is produced by a pdf having unknown parameters. In symbolic shorthand, the model is given as \(Y \sim p(y|\theta)\). The Greek letter \(\theta\) is called “theta”.

- Let’s suppose that Nature is also observable in our class. Our class belongs to the data generating process of grades in stats classes in the world. Let’s also assume that the data generating process of grades is normally distributed with a mean of 78 (out of 100) and a standard deviation of 1. We can also imagine that we have a single sample of 100 students who have taken stats classes. Let’s check the graph…

```
# Simulate the (near-infinite) data generating process
set.seed(1234)
dataProcess <- rnorm(16000000, mean = 78, sd = 1)

# A single sample of 100 students from the same process
grades <- rnorm(100, mean = 78, sd = 1)

# Compare the density of the process with the density of the sample
plot(density(dataProcess),
     lwd = 2,
     col = "red",
     main = "DATA generating process",
     xlab = "Grade",
     ylab = "p(x)")
lines(density(grades),
      col = "blue",
      lwd = 2)
legend(80, 0.4,
       legend = c("Data Process", "My sample"),
       col = c("red", "blue"), lty = 1)
```

I just used the word “assume”: probabilistic models have *assumptions*. In the previous example we assumed:

- The data generating process is normally distributed.
- We assumed a value for the mean.
- We assumed a value for the standard deviation.

- Normally if we flip a coin we have around 0.50 probability of getting tails, and also around 0.50 of getting heads. But, what would happen if we bend the coin? Are you sure you’ll get 0.50 probability? Can we assume the same probability?

Outcome, \(y\) | \(p(y)\)
---|---
Tails | \(1 - \pi\)
Heads | \(\pi\)
Total | 1.00

- Now we have uncertainty; we can only assume that there is a probability represented by \(\pi\). That’s the only thing we know.

We need to collect data to reduce the uncertainty about the unknown parameters.
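A hedged sketch of this idea: simulate flips of a bent coin whose true \(\pi\) is unknown to the analyst, then estimate \(\pi\) with the sample proportion of heads. The value 0.7 below is an arbitrary choice for illustration:

```r
set.seed(42)
truePi <- 0.7  # hypothetical bias of the bent coin (unknown in practice)
flips <- rbinom(1000, size = 1, prob = truePi)  # 1 = heads, 0 = tails

# Our estimate of pi after only 10 flips vs. after all 1000 flips
mean(flips[1:10])
mean(flips)  # moves closer to the true value as data accumulate
```

With 10 flips the estimate can land far from the truth; with 1000 flips it is much closer, which is exactly the reduction in uncertainty we are after.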

The reduction in uncertainty about model parameters that you achieve when you collect data is called *statistical inference*.

I mentioned before that DATA is RANDOM: this means there is a distribution of values in the universe for what you are studying.

For instance, there is a random distribution of possible romantic relationships for all of you.

But we need to make our research idea more operational. Let me try:

- We need to define an estimand; we could use \(\mu\) as our estimand.
  - \(\mu\) will be our *data generating process mean*.
  - In our example it could be the average number of romantic relationships among young adults in the United States. This process must have a value for \(\mu\), but we don’t know it in real life.
- We will also need a random estimator; in this case we can use the *sample mean* as our estimator. Why? Because we can assume that there is a distribution of random means. We could assume that each time I ask, “How many romantic relationships have you had?”, I will get a different mean value.
- We will also need an estimate, which is a particular observation of our estimator. For example, if I ask you today how many romantic relationships you have had, the mean could be \(\bar{y} = 4\). This is a fixed data point (lowercase) that comes from our estimator.

Following my example of number of romantic relationships, we could simulate data using our estimand (\(\mu\)).

Let’s assume \(\mu=5\). We want to simulate multiple data sets; let’s simulate 2000 data sets. This is equivalent to conducting the same study 2000 times.

Each data set contains 13 observations. Why? We are assuming we are asking the same question 2000 times to multiple classes similar to our class.

We will use our estimator, the sample mean, to find out if our data are close to the mean of the process, which is \(\mu=5\).

1. Package that helps to repeat the same action.
2. The estimand.
3. Number of observations in each simulated data set.
4. Number of data sets generated.
5. The estimated mean in each data set.
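The code block these annotations refer to does not appear in the text, so the sketch below is a plausible reconstruction that matches them. The standard deviation of the process is not given, so `sd = 2` is an assumption, and `purrr` stands in for the unnamed “package that helps to repeat the same action”:

```r
library(purrr)  # 1: package that helps to repeat the same action

mu <- 5         # 2: the estimand
n <- 13         # 3: observations in each simulated data set
nSets <- 2000   # 4: number of data sets generated

set.seed(1234)
# 5: the estimated mean in each data set (sd = 2 is an assumed value)
multipleSamples <- map_dbl(1:nSets, \(i) mean(rnorm(n, mean = mu, sd = 2)))
```

This produces the `multipleSamples` vector of 2000 sample means that the plotting code below relies on.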

```
library(ggplot2)

# Histogram of the 2000 sample means, with the overall mean marked
ggplot(data.frame(multipleSamples), aes(x = multipleSamples)) +
  geom_histogram(color = "black", fill = "white") +
  geom_vline(aes(xintercept = mean(multipleSamples)),
             color = "blue",
             linetype = "dashed",
             linewidth = 1) +
  ggtitle("Histogram of generated means") +
  xlab("Sampled means") +
  ylab("Frequency") +
  annotate(geom = "text", x = 4, y = 150, label = "Mean = 5.002",
           color = "blue") +
  theme_classic()
```

Westfall, P. H., & Henning, K. S. (2013). *Understanding advanced statistical methods*. Boca Raton, FL: CRC Press.