# Correlation and Regression Models

Part 1

## Today’s aims

To introduce correlation to estimate relationships between two variables.

To introduce the notion of covariance.

To study scatter plots to visualize correlations.

## What is a correlation coefficient ?

A

is a numerical index that reflects the relationship between two variables. The value of this descriptive statistic ranges between -1.00 and +1.00.*correlation coefficient*A correlation between two variables is sometimes referred to as a bivariate (for two variables) correlation

## What is a correlation coefficient ?

- At the beginning we will study the correlation named
*Pearson product-moment*. - There other types of correlation estimation depending on the data generating process of each variable.
*Pearson product-moment*deals withDATA.*continuous*

## Correlation interpretation and other features

Salkind & Shaw (2020):

## Correlation interpretation and other features (cont.)

Salkind & Shaw (2020):

A correlation can range in value from \(-1.00\) to \(+1.00\).

A correlation equal to 0 means there is no relationship between the two variables.

The absolute value of the coefficient reflects the strength of the correlation. So, a correlation of \(-.70\) is stronger than a correlation of \(+.50\). One frequently made mistake regarding correlation coefficients occurs when students assume that a direct or positive correlation is always stronger (i.e., “better”) than an indirect or negative correlation because of the sign and nothing else.

A negative correlation is not a “bad” correlation.

We will use the letter

*r*to represent correlation. For example \(r= .06\).

## Correlation interpretation and other features (cont.)

\(r_{xy}\) is the correlation coefficient.

\(n\) is the sample size.

\(X\) represents variable \(X\).

\(Y\) represents variable \(Y\).

\(\Sigma\) means summation or addition.

## Let’s take a look at positive correlations

## Let’s take a look at negative correlations (cont.)

```
Warning in rmvnorm(200, mean = c(30, 63, 95), sigma = cor2cov(coR, SDS)): sigma
is numerically not positive semidefinite
```

## Correlation matrix

Salkind & Shaw (2020):

- You will find a correlation matrix in publications.
- It is the best way to represent several correlations between different pairs of variables.

- You will notice that a a correlation matrix has 1.00 on the diagonal and two “triangles” with the same information.

## Coefficient of Determination

- There is a useful trick, you could square your \(r\) and get a measure of correlation in terms of percentage of shared variance:

## Coefficient of Determination (cont.)

What is the coefficient of determination in this case?

We just need to estimate \(r^2= -0.22^2 = -0.05\). Attention and depression shared only 5% of the variability (variance).

`Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).`

`Warning: Removed 7 rows containing missing values (`geom_point()`).`

## Scatter plots and direction of correlation

I have shown you several plots, these plots are called

.*scatter plots*These plots are useful to explore visually possible correlations.

When you create this plots, you only need to represent one of the variables in the x-axis and the other variable will be represented in the y-axis.

- Can you guess if the next scatter plot corresponds to a positive correlation?

## Scatter plots and direction of correlation (cont.)

```
ggplot(data = rum_scores, aes(x = rumZ, y = worryZ)) +
geom_point() +
scale_y_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
#geom_jitter(width = 0.25) +
##geom_smooth(method = lm) +
xlab("Standardized Rumination Score")+
ylab("Standardized Worry Score")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

`Warning: Removed 23 rows containing missing values (`geom_point()`).`

## Scatter plots and direction of correlation (cont.)

- We can check some values and see what is happening, like case #78, in the next plot:

```
ggplot(data = rum_scores, aes(x = rumZ, y = worryZ)) +
geom_point() +
scale_y_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
#geom_jitter(width = 0.25) +
##geom_smooth(method = lm) +
xlab("Standardized Rumination Score")+
ylab("Standardized Worry Score")+
ggtitle("Is this a positive correlation?") +
gghighlight(worryZ > 2.5, use_direct_label = FALSE)+
geom_label(aes(label = ID),
hjust = 1, vjust = 1, fill = "purple", colour = "white", alpha= 0.5)+
theme_classic()
```

```
Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
ℹ Please use one dimensional logical vectors instead.
ℹ The deprecated feature was likely used in the gghighlight package.
Please report the issue at
<https://github.com/yutannihilation/gghighlight/issues>.
```

`Warning: Removed 23 rows containing missing values (`geom_point()`).`

`Warning: Removed 2 rows containing missing values (`geom_point()`).`

`Warning: Removed 2 rows containing missing values (`geom_label()`).`

## Scatter plots and direction of correlation (cont.)

- Maybe if we add the line of best fit we will see it better:

```
ggplot(data = rum_scores, aes(x = rumZ, y = worryZ)) +
geom_point() +
geom_smooth(method = lm, color = "red", se = FALSE)+
scale_y_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3),limits = c(-3,3)) +
#geom_jitter(width = 0.25) +
##geom_smooth(method = lm) +
xlab("Standardized Rumination Score")+
ylab("Standardized Worry Score")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

`Warning: Removed 23 rows containing non-finite values (`stat_smooth()`).`

`Warning: Removed 23 rows containing missing values (`geom_point()`).`

## Scatter plots and direction of correlation (cont.)

Can you spot the direction of this correlation?

This data come from a questionnaire that asks to rate how emotional you feel. For instance, it asks: Rate how GREAT you feel where 1 = “not feeling” to 6=“I strongly feel it”.

```
<- read.csv("pos_neg.csv")
pos
ggplot(data = pos, aes(x = great , y = down)) +
geom_point() +
#geom_smooth(method = lm, color = "red", se = FALSE)+
geom_jitter(width = 0.25) +
scale_y_continuous(breaks = c(1,2,3,4),limits = c(0,4)) +
scale_x_continuous(breaks = c(1,2,3,4,5),limits = c(0,6)) +
xlab("Great")+
ylab("Down")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

```
Warning: Removed 1 rows containing missing values (`geom_point()`).
Removed 1 rows containing missing values (`geom_point()`).
```

## Scatter plots and direction of correlation (cont.)

- Let’s add again the line of best linear fit:

```
ggplot(data = pos, aes(x = great , y = down)) +
geom_point() +
geom_smooth(method = lm, color = "red", se = FALSE)+
geom_jitter(width = 0.25) +
scale_y_continuous(breaks = c(1,2,3,4),limits = c(0,4)) +
scale_x_continuous(breaks = c(1,2,3,4,5),limits = c(0,6)) +
xlab("Great")+
ylab("Down")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

`Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).`

```
Warning: Removed 1 rows containing missing values (`geom_point()`).
Removed 1 rows containing missing values (`geom_point()`).
```

## Scatter plots and direction of correlation (cont.)

```
ggplot(data = rum_scores, aes(x = ageMonths , y = rumZ)) +
geom_point() +
#geom_smooth(method = lm, color = "red", se = FALSE)+
#geom_jitter(width = 0.25) +
#scale_y_continuous(breaks = c(1,2,3,4),limits = c(0,4)) +
#scale_x_continuous(breaks = c(1,2,3,4,5),limits = c(0,6)) +
xlab("Age in Months")+
ylab("Rumination")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

`Warning: Removed 7 rows containing missing values (`geom_point()`).`

## Scatter plots and direction of correlation (cont.)

- Let’s add the line of linear fit:

```
ggplot(data = rum_scores, aes(x = ageMonths , y = rumZ)) +
geom_point() +
geom_smooth(method = lm, color = "red", se = FALSE)+
#geom_jitter(width = 0.25) +
#scale_y_continuous(breaks = c(1,2,3,4),limits = c(0,4)) +
#scale_x_continuous(breaks = c(1,2,3,4,5),limits = c(0,6)) +
xlab("Age in Months")+
ylab("Rumination")+
ggtitle("Is this a positive correlation?") +
theme_classic()
```

`Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).`

`Warning: Removed 7 rows containing missing values (`geom_point()`).`

## Important remarks

- When the correlation is high, it means there is a large portion of shared variance between \(x\) and \(y\).
- When the correlation is high all the values will converge towards the line of best linear fit.
- When the correlation is low, the values will be sparse and far from the line of best fit.
- A flat linear line means that there is not correlation between \(x\) or \(y\) or the correlation is remarkably low. This means \(r=0\) or closer to zero.

# Computer estimation time!

- In
`R`

you can estimate Pearson correlations using the function`cor()`

as showed here:

```
### pos is the name of the object representing my data set
cor(pos$down, pos$great)
```

`[1] -0.359944`

In this estimation, I’m calculating the correlation between the emotion DOWN and the emotion GREAT. The Pearson correlation was \(r= -0.36\). Is this a strong correlation?

- We could follow an ugly rule of thumb, but be careful, these are not rules cast in stone (Salkind & Shaw, 2020):

## JAMOVI

## JAMOVI

## References

*Statistics for people who (think they) hate statistics: Using r*. Sage publications.