120 points

Author

Esteban Montenegro-Montenegro

Published

October 1, 2024

Description

In this assignment you’ll answer questions based on the presentation Introduction to Probability and Statistics and the lecture Probability distributions and random variables. You may also need to check the examples provided by Westfall & Henning (2013, Chapter 1).

Please submit your answers as a Word, LibreOffice, or Google document, or as a PDF file rendered via Quarto. Copy each question and then write your answer in the paragraph that follows, as in this example:

1) What is a parameter?

Answer: A parameter is an unknown value in a model, but it will be estimated using the data.

Second Part: More to do in R!

In this section, the aim is to practice what we have learned by estimating some models in R. If you’d like to use R, you are free to do it.

  1. Imagine you need to evaluate the mean difference in wage. In this analysis you ask yourself: are there differences in wage by insurance status? Estimate a model that helps you understand the mean difference in wage between people with insurance and people without insurance. What model will you pick?

To answer this question, open the data set named wageData.csv in R. Then, estimate the appropriate statistical model. Remember that the variable wage represents the number of dollars earned per hour. The variable health_ins is the grouping variable (the factor, or independent variable).

Warning

You can copy this R code to open the data set:

url <- "https://raw.githubusercontent.com/blackhill86/mm2/refs/heads/main/dataSets/wageData.csv"

wageData <- read.csv(url)

1.1 Was the value of the mean different? Are people with insurance earning more money? (5 points)

1.2 Is the mean difference explained by chance alone? How do you know? (5 points)

1.3 Create a bar plot showing the mean of wage by group (insurance vs. no insurance). You may follow the example code below. (7 points)

library(ggplot2)
library(dplyr)

### Estimates standard error to plot error bars

StandError <- function(x) {
 sd(x)/sqrt(length(x))
}
### Now we can estimate the mean, and SE by group,
### then we can save the information in a data frame.
summaries <- wageData |> 
  group_by(race) |>
  summarise_at("wage", list(mean= mean,
                          SE = StandError))
ggplot(summaries , aes(x=race,y = mean, fill = race)) + 
  geom_bar(position=position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin=mean-SE, ymax=mean+SE), width=.1)+
  xlab("Race")+
  ylab("Estimated mean")+
  ggtitle("Example of a bar plot in R")+
  theme_classic()
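
The helper StandError() in the example code divides the sample standard deviation by the square root of the number of observations. As a quick sanity check, the sketch below applies it to a small made-up data frame using base R only. The object toy and its columns group and score are hypothetical and are not taken from wageData.csv.

```r
# Hypothetical toy data: two groups with four scores each (not from wageData.csv)
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  score = c(10, 12, 14, 16, 20, 21, 22, 23)
)

# Same helper as in the example code: standard error of the mean
StandError <- function(x) {
  sd(x) / sqrt(length(x))
}

# Mean and standard error of score within each group
groupMeans <- tapply(toy$score, toy$group, mean)
groupSE <- tapply(toy$score, toy$group, StandError)

print(groupMeans)
print(groupSE)
```

Note that sd() returns NA when a group contains missing values, so a variable with missing observations would need them removed before using this helper.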

  2. In this class, I have extensively mentioned the rumination data set. We will use this data set to complete the following set of tasks.

To complete the following questions, open the data set named ruminationExam.csv in R.

Warning

You can open the dataset ruminationExam.csv in R by copying the following code:

urlRumi <- "https://raw.githubusercontent.com/blackhill86/mm2/refs/heads/main/dataSets/ruminationExam.csv"

dataRumi <- read.csv(urlRumi, na.strings = "99")

2.1. The first step is to compute a composite score for rumination and a composite score for depression. This is a very common step in psychology. You will have to add all the columns corresponding to the rumination scale and divide the total by the number of columns. The same step has to be done for the depression scale.

In the data set ruminationExam.csv the rumination scale has 13 items, the corresponding columns range from CRQS1 to CRQS13.

The depression scale has 26 items, the corresponding columns range from CDI1 to CDI26. You may follow the code below:

library(dplyr)

dataRumi <- dataRumi|>
  mutate(depressionScore = rowMeans(pick(starts_with("CDI"))),
         rumScore = rowMeans(pick(starts_with("CR"))))
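
Because the data set is read with na.strings = "99", some item responses may be stored as NA. By default, rowMeans() returns NA for any row with at least one missing item, whereas na.rm = TRUE averages only the items that were answered. The sketch below shows the difference on a made-up three-item scale; the data frame toyItems and its columns are hypothetical. Which behavior is appropriate for this assignment is not stated in the text above, so treat this only as an illustration.

```r
# Hypothetical three-item scale; the second person skipped item2
toyItems <- data.frame(
  item1 = c(1, 2, 3),
  item2 = c(2, NA, 3),
  item3 = c(3, 4, 3)
)

# Default: any missing item makes the composite score NA
strictScore <- rowMeans(toyItems)

# na.rm = TRUE: average only the items that were answered
lenientScore <- rowMeans(toyItems, na.rm = TRUE)

print(strictScore)   # second value is NA
print(lenientScore)  # second value is the mean of 2 and 4
```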

2.2. Create a scatter plot of depression by rumination. Include the figure in your answer. What do you see in the figure? Can you tell whether there is a correlation? Is it positive or negative? (10 points)

2.3. Estimate a Pearson correlation between rumination and depression. Report the estimated correlation. Is this correlation explained by chance alone? (5 points)
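
One way to obtain a Pearson correlation together with its p-value in base R is cor.test(). The sketch below runs it on simulated data. The variables x and y, the seed, and the sample size are made up for illustration and say nothing about the rumination data.

```r
set.seed(123)  # arbitrary seed so the simulated numbers are reproducible

# Simulated data: y is built to be positively related to x
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Pearson correlation with a test of the null hypothesis of zero correlation
result <- cor.test(x, y, method = "pearson")

print(result$estimate)  # estimated correlation
print(result$p.value)   # p-value for the test
```

With real data, cor.test() uses only the pairs of observations in which both variables are present.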

  3. The previous question gave you a simple example of estimating a single Pearson correlation. That is not realistic, however; we often need to estimate the correlations among several pairs of variables and arrange them in what statisticians call a “correlation matrix”.

In this exercise, you will need to open the data set pos_neg.csv in R. Scores in pos_neg range from 1 to 5. For example, a score of 5 on the variable great means the person felt great, whereas a score of 1.8 means the person felt less great. You can see the data pos_neg.csv in the table below:

Warning

You can open the data set pos_neg.csv by copying this code:

urlposNeg <- "https://raw.githubusercontent.com/blackhill86/mm2/refs/heads/main/dataSets/pos_neg.csv"

pos_neg <- read.csv(urlposNeg)

3.1. Estimate a correlation matrix in R between all the emotions: great, cheerful, happy, sad, down, unhappy. Report the correlation matrix. Remember to estimate the \(p\)-value for all the correlations. (5 points)

Note

You may need to install the package Hmisc; then you can use the function rcorr() by running code similar to this: rcorr(as.matrix(data), type="pearson").
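
To see what a correlation matrix with p-values contains, the sketch below builds one in base R by calling cor.test() on every pair of columns. It is a cross-check written for illustration, not the rcorr() approach suggested in the note. The helper corPValues and the simulated data frame toyEmotions are hypothetical names, and the data are random numbers rather than values from pos_neg.csv.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated scores for three made-up variables
toyEmotions <- data.frame(
  v1 = rnorm(50),
  v2 = rnorm(50),
  v3 = rnorm(50)
)

# Matrix of Pearson correlations
rMatrix <- cor(toyEmotions, method = "pearson")

# Hypothetical helper: matrix of p-values, one cor.test() per pair of columns
corPValues <- function(data) {
  k <- ncol(data)
  p <- matrix(NA_real_, k, k, dimnames = list(names(data), names(data)))
  for (i in seq_len(k)) {
    for (j in seq_len(k)) {
      if (i != j) {
        p[i, j] <- cor.test(data[[i]], data[[j]])$p.value
      }
    }
  }
  p
}

pMatrix <- corPValues(toyEmotions)

print(round(rMatrix, 2))
print(round(pMatrix, 3))
```

The diagonal of pMatrix is left as NA because a variable is not tested against itself.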

3.2. Interpret your results, for example you can write something like this (10 points):

Tip

The correlation between variable 1 and variable 2 was \(r = .30\) with a \(p\)-value \(< .05\). This means that when variable 1 increases, variable 2 also tends to increase. In contrast, the correlation between variable 3 and variable 4 was negative, \(r = -.38\), which means that a high score in variable 3 is related to a low score in variable 4. Neither correlation is explained by chance alone because each \(p\)-value is lower than .05.

References

Westfall, P. H., & Henning, K. S. (2013). Understanding advanced statistical methods. Boca Raton, FL: CRC Press.