What have you learned so far in R?

After completing the first practice, you should have learned these key concepts:

R is an object-oriented programming language.
R is free and open source, and you are able to install as many packages as you want for free.
You learned the concept of “package”.
You learned about data frames.

In this practice we will work with data frames , we will create new variables, and we will use extensively pipes along with the package tidyverse.

Good News!

In this practice you will be able to run R code straight from your web browser. Isn’t it awesome!

Let’s put a frame on those data points….

The data frame is one type of objects in R, and it is extremely useful to represent information as a data frame. This R object looks very similar to a spreadsheet in Excel, each column represents a variable, while each row represents an observation.

In this practice session we will wrangle the data set named ruminationToClean.csv. You may download the data set by Clicking Here. This is a dataset that was part of a large project that I conducted in Costa Rica in collaboration with a colleague and friend David Campos. Participants were high school students from the Central Valley in Costa Rica. The aim of this study was to explore the relationship between rumination and depression in teenagers. We added metacognitive beliefs to the study, but we will not talk about metacognition in this practice.

When I say wrangle I mean to manipulate the data set and clean the data. In real life, data sets are not perfectly clean. Many times we need to delete observations that are extreme cases (careless answers, mistakes, etc.), or sometimes we need to delete new variables. Additionally, we might need to compute new variables. We will cover all these topics in this practice using R.

The package `tidyverse` is your friend

If you run the following code, you’ll see the variable names in the data set ruminationToClean.csv:

I’m reading in the data file from my personal repository, fancy right?

Copy the code above to open ruminationToClean.csv in your R session.

As you can see, we have a large list of variables. We are using the function colnames() to get a list of column names, which in this case are research variables. Several of these variables could be dropped from the data set, others will need to be renamed.

Dropping variables

In this example we will use the function select() from the package tidyverse. Remember that you’ll need to call the package by running library(tidyverse).

What is happening here? You may have noticed that inside the function selection() I am adding the name of the variable I want to delete from the data set, and then I added a minus sign. When you add a minus sign before the name of the variable, the package will delete the variable that you list with a minus sign. You can do the same with several variables at the same time:

Now the variables, age, school, and grade are not longer in the dataset, if you don’t believe me click on RUN CODE.

Excercise 1 (15 points)

Download the data set ruminationToClean.csv then open the data file using ruminationRaw <- read.csv(file.choose()) or you may copy the code to open the data from the online repository. After opening the data in R, delete from the data set all the variables that start with th stem EATQ_R_.

Change variable names

In some occasions, the research variable was named wrongly or the name does not reflect the content of the variable. In the data set ruminationToClean.csv there are variables with names in Spanish, we can change them:

In this example, you can see that I’m using the function rename() from the package dplyr which is inside the package tidyverse. When you use rename() you’ll need to declare the new name first and then the old variable name that will be replaced.

Excercise 2 (15 points)

In psychological measurement it is common to implement questionnaires with Likert-response options. We typically sum the items corresponding to the latent variable or construct that we are measuring. For example in the ruminationToClean.csv the construct of “Rumination” is measured by the Children’s Response Styles Questionnaire (CRSQ), however the variable names are not self-explained. Your task is to change the name of the variables with stem CRSQ, the new variables should have the stem rumination_itemNumber. For example rename(rumination_1=CRSQ1). Remember to save the changes in a new object.

Compute a new variable

As always in programming languages, there are several ways to create a new research variable in a data frame. But, we will use the tidyverse method by using a function that actually works like a verb. The function mutate() will help use to create new variables.

I mentioned in Exercise 2 that psychology is a field that studies latent variables also known as hypothetical constructs, they have many names, but at the end we study variables that are implicit; we can measure them but we don’t see them. For example, we have studied variables such depression or rumination in this course, these are latent variables because you cannot capture depression or you don’t see it walking, however you are able to measure depression by asking about symptoms and thoughts.

It is a figure where there is a circle, the word "depression" is inside the circle. Straight arrows are pointing to three squares. The first square has the label "I am feeling lonely", the second square has the label "I feel sad", the third square has the label "I lost my appetite". The squares are grouped as "Observed indicators". The circle with the word depression inside is grouped as a "latent variable". — Example of a latent variable

In the figure above you can see a representation of a latent variable. In psychology we often assume that the latent variable is the common factor underlying our questions, -in this example- questions related to depression.

When we analyze data that corresponds with latent variables we have two options:

We can create a Structural Equation Modeling (I won’t cover this topic).
We sum all the items to compute a total score. This total score will be our approximation to the latent variable. (Yes, this is covered in this practice).

I’ll show an example in the next chunk of R code:

In this example we are using the function mutate() to create a new column named depressionTotal, this new depressionTotal column will contain the depression total score for each participant. In this study we included the “Child’s Depression Inventory (CDI)”, that’s why the columns follow the stem “CDI”. You can see the new column in the next table:

Code

ruminationRaw <- read.csv("ruminationToClean.csv", na.string = "99")
ruminationDepresionScore <- ruminationRaw |>
  mutate(
    depressionTotal = rowSums(across(starts_with("CDI")))
    )
ruminationDepresionScore |> select(starts_with("CDI"),  
                                   depressionTotal ) |>
  gt_preview()

	CDI1	CDI2	CDI3	CDI4	CDI5	CDI6	CDI7	CDI8	CDI9	CDI10	CDI11	CDI12	CDI13	CDI14	CDI15	CDI16	CDI17	CDI18	CDI19	CDI20	CDI21	CDI22	CDI23	CDI24	CDI25	CDI26	depressionTotal
1	0	2	0	0	0	0	0	1	0	0	1	0	1	1	1	1	1	1	1	0	0	1	1	1	1	0	15
2	0	0	1	1	0	0	0	1	0	0	1	0	2	0	1	0	2	1	1	0	0	1	2	1	0	0	15
3	0	1	0	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	4
4	0	1	0	0	0	0	0	0	0	0	0	1	1	0	0	0	1	1	0	0	0	0	0	1	0	0	6
5	0	1	0	1	0	1	0	0	1	1	0	2	1	1	0	1	1	2	0	1	0	0	1	1	1	0	17
6..211
212	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	1	0	0	0	1	0	1	0	0	5

Excercise 3 (15 points)

In the data set ruminationToClean.csv you’ll find seven columns with the stem “DASS”. These columns correspond to the instrument named: Depression, Anxiety, and Stress Scale (DASS). This is an scale that measures three latent variables at the same time, but only the anxiety items were added in this study.

Your task is to change the name of the “DASS” items, you should think a more informative name for these items in the data frame. After that, you will have to compute the total score of anxiety for each participant in the data frame.

Let’s cleanse this data set

It is always satisfactory to clean a data set and then see the final product, but so far we have cleaned these data in small steps. I’ll clean the data set in one step. In this example I will delete unnecessary columns (a.k.a research variables), and I will translate the variables to English when necessary:

Excercise 4 (15 points)

Describe the steps taken in the R code above.

Examine your data

In this second part we will create several plots to examine and explore our data. But first we’ll need to compute total scores again, because after all, the main point is to make conclusions about our latent variables (a.k.a latent factors).

We have now three total scores corresponding to three different latent factors: depression, anxiety, and depression. We are in a good place to explore the data. It is healthy to start looking at histograms because they are now continuous variables:

We can see in the histogram that low values are more frequent. This is expected because all the participants are healthy teenagers without any history of psychopathology. Also, pay attention that we are using the function hist() in R.

We can also generate a density plot to have an idea of the observed shape:

Notice the function plot() which is a generic function in R it is used as a method in the environment. This functions comes with R as a base function. If you need to know how to use it you can run ?plot(), but the help section might not be helpful for new users. Additionally, pay attention on how you can change the labels in your plot.

Excercise 5 (30 points)

Create a density plot for the depression total score. Where is most of the density located?

Note

R Studio has an option to save plots. Just click on the tab named “Plots”, then click where it says “Export”. You may click on “Copy to clipboard”, this will allow you to paste the image on any software. You may also select to export the image as PDF file or image.

Box plots are also useful to understand our data better, you may revisit the box plot’s anatomy in this website: link.

We can see the next example (Click on Run Code):

Excercise 6 (15 points)

Can you tell which group in the box plot has the highest median in rumination? What else can you see in the box plot?

Special topics in `R`

You may remember that I mentioned multiple times how R objects have particular characteristics. This is time to study a few of these characteristics.

We will cover special characteristics of missing values.

Missing data without missing the point….

In scientific research is always problematic when a participant forgets to answer a question, it is more problematic when a participant drops from our study when conducting a longitudinal study. These scenarios are a nightmare for researchers.

I will not demonstrate all the possible missing data treatments, but I will show you how R deals naturally with missing values, and how you can code your missing values in your data set.

You have an `NA` on your face

In R you will see that missing values are flagged with NA. In the rumination data set you will see NA values like the ones on red:

Code

ruminationRaw |> 
  select(starts_with("PSWQC")) |> 
  gt_preview()|>
   data_color(
    columns = c(PSWQC_7, PSWQC_7r, PSWQC_9r),
    method = "numeric",
    palette = "grey",
    domain = c(0, 50),
    na_color = "red"
  )

	PSWQC_1	PSWQC_2	PSWQC_3	PSWQC_4	PSWQC_5	PSWQC_6	PSWQC_7	PSWQC_8	PSWQC_9	PSWQC_10	PSWQC_11	PSWQC_12	PSWQC_13	PSWQC_14	PSWQC_2r	PSWQC_7r	PSWQC_9r
1	2	1	2	3	2	1	1	1	1	2	3	1	0	3	12	2	2
2	1	0	2	1	3	1	0	0	0	1	1	1	0	3	4	12	3
3	0	0	3	3	3	3	0	3	0	3	3	3	3	3	24	3	3
4	1	1	1	1	1	1	NA	0	1	0	1	0	1	0	2	NA	NA
5	2	0	2	2	3	1	2	0	0	1	2	2	1	2	3	1	3
6..211
212	0	1	1	1	0	0	3	0	2	0	3	0	0	0	3	2	0

NA stands for “non-applicable” and it became the flag for missing values. It is already built-in with R. You could change this flag but that is not necessary.

You can add NA to any vector, list, or data frame, see the following example:

Code

vectorNames <- c("Adrian", "Marco", "Sofia")
### Let's add a missing value
vectorNames[3] <- NA 
vectorNames

[1] "Adrian" "Marco"  NA

In this example, I added NA in the third position, but of course you could add NA anywhere in a vector.

In R there are functions that will return a logical value. Logical data contains only TRUE or FALSE, or 0 or 1. We can see an example with our vector with missing values.

Code

## is.na() function will be your friend
is.na(vectorNames)

[1] FALSE FALSE  TRUE

The function is.na() returned a new logical vector flagging TRUE where there is a missing value. In this example is the third element.

Excercise 7 (15 points)

Search on Internet how to write a command where we ask the opposite: “show me TRUE when the value is not missing” .

HINT: You may need the exclamation character.

Add `NA` when you read in data

When you are opening a data set in R you will always need to specify how you flagged or coded your missing information. For example, it is usual to observe the number “-99” when values are missing, it is also possible to use “99” like I did in my rumination study. The main point is to select a number that is not possible to be found in your data set. In the rumination data set, it is impossible to expect a value “99” as a valid score. Because all the scores range from zero to maximum 20. Observe what I did at the beginning of this practice:

Code

dataLink <- "https://raw.githubusercontent.com/blackhill86/mm2/main/dataSets/ruminationToClean.csv"

ruminationRaw <- read.csv(url(dataLink ),
                          na.strings = "99")

In the code above, I’m adding the argument na.string = "99" to the function read.csv(). I did it because I flagged my missing values with “99”.

Important!!

If your external file has empty cells, R will interpret empty cells as missing values NA.

Usual questions when you are learning data coding and stats

Can I replace missing values with zeros?

The straight answers is a well rounded and pretty NO!!!

A zero is not equivalent to missing value. When you collect data you need to think first how you will code your missing values.

Can I replace missing values with the mean?

The short answer is NO! But, I’ll give you a longer answer, I’m a professor after all, it is my weakness!

When you replace a missing value with the mean value of an observed distribution, you are doing mean imputation. Long time ago, when computers where not fast, it was faster to replace missing values using the mean, it was easy, and it didn’t require to have a computer. But, nowadays it is not necessary to follow this practice anymore. However, there are still researchers using this replacement method.

If you assume that the estimated mean is a good representation of your missing value you will inflate your distribution with values close to the observed mean, which is wrong. Let’s see an example.

In this example, I will simulate a regression model. we haven’t covered the theory related to regression but I ebelieve I can explain some details in simple words.

The data-generating process will be the following:

\[Y \sim \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \epsilon \tag{1}\] You may have noticed that all “X” are uppercase and “Y” is uppercase. This means that any outcome $Y$ will be predicted by $X_{1}$, $X_{2}$, and $X_{2}$. This is the model that generates the data. However, given that this is a simulation we can be in control of the process, we can assume that the “population” model looks like this:

\[Y \sim 1 + (0.2)X_{1} + (0.3)X_{2} + (0.4)X_{2} + \epsilon \tag{2}\]

Let’s create 2000 data sets with this model in each data set there will be 100 observations.

Code

library(MASS)
library(purrr)
library(patchwork)
library(ggplot2)

#### Let's simulate some data
set.seed(1236)

generateCompleteData <- function(n){

library(MASS)
MX <- rep(0, 3)
SX <- matrix(1, 3, 3)
SX[1, 2] <- SX[2, 1] <- 0.5
SX[1, 3] <- SX[3, 1] <- 0.2
SX[2, 3] <- SX[3, 2] <- 0.3
X <- mvrnorm(n, MX, SX)
e <- rnorm(n, 0, sqrt(0.546))
y <- 1 + 0.2 * X[,1] + 0.3 * X[,2] + 0.4 * X[,3] + e

reg <- lm(y ~ X[,1] +  X[,2] + X[,3])

dataReg <- data.frame(inter = reg$coefficients[[1]],
                      x1 = reg$coefficients[2],
                      x2 = reg$coefficients[3],
                      x3 = reg$coefficients[4])
                    

return(dataReg)

}

### Multiple models

allModels <- map_df(1:2000, ~generateCompleteData(100))


interPlot <- ggplot(allModels, aes(x=inter)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModels$inter),2), color = "red")+
  annotate("text", 
           x = 0.85, 
           y = 5, 
           label = paste0("M=", round(mean(allModels$inter),2)," ", 
                          "SD=", round(sd(allModels$inter),2)),
           color = "red")+
  xlab("2000 intercepts")+
  ylab("Likelihood")+
  theme_classic()
  
x1Plot <- ggplot(allModels, aes(x=x1)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModels$x1),2), color = "red")+
  annotate("text", 
           x = 0.4, 
           y = 5, 
           label = paste0("M=", round(mean(allModels$x1),2)," ", 
                          "SD=", round(sd(allModels$x1),2)),
           color = "red")+
  xlab("Estimated slopes X1")+
  ylab("Likelihood")+
  theme_classic()


x2Plot <- ggplot(allModels, aes(x=x2)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModels$x2),2), color = "red")+
  annotate("text", 
           x = 0.5, 
           y = 5, 
           label = paste0("M=", round(mean(allModels$x2),2)," ", 
                          "SD=", round(sd(allModels$x2),2)),
           color = "red")+
  xlab("Estimated slopes X2")+
  ylab("Likelihood")+
  theme_classic()

x3Plot <- ggplot(allModels, aes(x=x3)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModels$x3),2), color = "red")+
  annotate("text", 
           x = 0.6, 
           y = 5, 
           label = paste0("M=", round(mean(allModels$x3),2)," ", 
                          "SD=", round(sd(allModels$x3),2)),
           color = "red")+
  xlab("Estimated slopes X3")+
  ylab("Likelihood")+
  theme_classic()

interPlot  + x1Plot +
x2Plot + x3Plot

Simulated values with complet information

As expected, the density plots for our parameters showed that after replicating the same study 2000 times the mean value is exactly at the target. The target are the values that we set in our “population” model in equation Equation 2.

In the next step, I will impose missing values on the data generated, we will see what happens when we replace the values with the mean of the observed distribution.

Code

library(simsem)

set.seed(1236)

generateIncompleteData <- function(n){

library(MASS)
library(simsem)  
  
MX <- rep(0, 3)
SX <- matrix(1, 3, 3)
SX[1, 2] <- SX[2, 1] <- 0.5
SX[1, 3] <- SX[3, 1] <- 0.2
SX[2, 3] <- SX[3, 2] <- 0.3
X <- mvrnorm(n, MX, SX)
e <- rnorm(n, 0, sqrt(0.546))
y <- 1 + 0.2 * X[,1] + 0.3 * X[,2] + 0.4 * X[,3] + e


dataGen <- data.frame(outcome = y,
                      x1 = X[,1],
                      x2 = X[,2],
                      x3 = X[,3]) 


datMis <- imposeMissing(dataGen, 
                         pmMAR = .15, cov="x3")

meanOutcome <- mean(datMis$outcome, na.rm = TRUE)
meanX1 <- mean(datMis$x1, na.rm = TRUE)
meanX2 <- mean(datMis$x2, na.rm = TRUE)

datMis$outcome[is.na(datMis$outcome)] <- meanOutcome 
datMis$x1[is.na(datMis$x1)] <- meanX1
datMis$x2[is.na(datMis$x2)] <- meanX2


reg <- lm(outcome ~ x1 + x2 + x3, data = datMis)

dataReg <- data.frame(inter = reg$coefficients[[1]],
                      x1 = reg$coefficients[2],
                      x2 = reg$coefficients[3],
                      x3 = reg$coefficients[4])
                    

return(dataReg)

}

allModelsMissing <- map_df(1:2000, ~generateIncompleteData(100))


interPlotMissing <- ggplot(allModelsMissing, aes(x=inter)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModelsMissing$inter),2), color = "red")+
  geom_vline(xintercept = 1, color = "blue")+
  annotate("text", 
           x = 0.85, 
           y = 5, 
           label = paste0("M=", round(mean(allModelsMissing$inter),2)," ", 
                          "SD=", round(sd(allModelsMissing$inter),2)),
           color = "red")+
  xlab("2000 intercepts")+
  ylab("Likelihood")+
  theme_classic()
  
x1PlotMissing <- ggplot(allModelsMissing, aes(x=x1)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModelsMissing$x1),2), color = "red")+
  geom_vline(xintercept = 0.2, color = "blue")+
  annotate("text", 
           x = 0.4, 
           y = 5, 
           label = paste0("M=", round(mean(allModelsMissing$x1),2)," ", 
                          "SD=", round(sd(allModelsMissing$x1),2)),
           color = "red")+
  xlab("Estimated slopes X1")+
  ylab("Likelihood")+
  theme_classic()


x2PlotMissing <- ggplot(allModelsMissing, aes(x=x2)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModelsMissing$x2),2), color = "red")+
  geom_vline(xintercept = 0.3, color = "blue")+
  annotate("text", 
           x = 0.5, 
           y = 5, 
           label = paste0("M=", round(mean(allModelsMissing$x2),2)," ", 
                          "SD=", round(sd(allModelsMissing$x2),2)),
           color = "red")+
  xlab("Estimated slopes X2")+
  ylab("Likelihood")+
  theme_classic()

x3PlotMissing <- ggplot(allModelsMissing, aes(x=x3)) + 
  geom_density(fill = "blue",alpha = 0.2 )+
  geom_vline(xintercept = round(mean(allModelsMissing$x3),2), color = "red")+
  geom_vline(xintercept = 0.4, color = "blue")+
  annotate("text", 
           x = 0.5, 
           y = 5, 
           label = paste0("M=", round(mean(allModelsMissing$x3),2)," ", 
                          "SD=", round(sd(allModelsMissing$x3),2)),
           color = "red")+
  xlab("Estimated slopes X3")+
  ylab("Likelihood")+
  theme_classic()

interPlotMissing  + x1PlotMissing +
x2PlotMissing + x3PlotMissing

Simulated values after replacing missing data with the mean

This time the estimated mean of all the slopes and intercepts moved a little bit far from the target, the blue line is our target value, and the red line is the estimated mean after replacing missing values with mean. You may think, that this is not a big deviation but in more complicated models with more parameters, this deviation from the target can cause harm to the point that you are not sure about your results. This is one reason why we don’t replace missing values with the mean.

So, what should we do instead of using `mean imputation`?

You may estimate your models using listwise deletion, this method deletes any case that contains a missing value. This is actually the default behavior in many software. We can study an example:

In this example I’m generating a vector with random numbers, and then I’m esimating the mean:

Code

### Let's create a vector with complete values
vectorComplete <- c(123,56,96,89,59)
### We can estimate the mean:
mean(vectorComplete)

[1] 84.6

The mean is M=84.6, but notice what happens when there is a missing value:

Code

### Let's delete some values to investigate what happens

vectorMissing <- c(123,56,NA,89,59)
mean(vectorMissing)

[1] NA

Now the mean() function returns nothing, it returns NA, why? Because you cannot compute anything if there is an empty slot. You need to delete the observation that is missing and then compute the mean again. This is what is called listwise deletion. You can enable listwise deletion by adding the argument na.rm = TRUE to the mean() function:

Code

### Let's delete some values to investigate what happens

vectorMissing <- c(123,56,NA,89,59)
mean(vectorMissing, na.rm = TRUE)

[1] 81.75

Excercise 8 (15 point)

Fix the following code when estimating the correlation:

Code

### mtcars is a data set already built into R
dataCars <- mtcars

### I'm imposing missing value on values located at 6,7,8,9
dataCars$mpg[c(6,7,8,9)] <- NA 

cor(dataCars$mpg, dataCars$hp) ### This function cor() should give us a estimated correlation, what is wrong?

[1] NA

HINT: You may check the help provided for the function cor() by typing ?cor on your R console. Or search on Internet.

Final note (I promise it is the end)

The best strategy to deal with missing values is something call multiple imputation, it is a statistical model that generates multiple data sets with possible values. Although, this is a topic for a second course in statistics.

--- title: "Practice 2" subtitle: "135 points" title-block-banner: true author: "Esteban Montenegro-Montenegro" date: now format: html: highlight-style: a11y-dark code-fold: true code-tools: true resources: - webr-serviceworker.js - webr-worker.js filters: - webr engine: knitr webr: packages: ['ggplot2', 'dplyr'] home-dir: "/home/esteban/MEGA/stanStateDocuments/PSYC3000/practice2" theme: light: sandstone dark: cyborg toc: true toc-depth: 4 toc-title: Sections self-contained: true bibliography: references.bib csl: apa.csl editor: markdown: wrap: 72 code-annotations: hover lightbox: true --- # What have you learned so far in R? After completing the first practice, you should have learned these key concepts: - `R` is an object-oriented programming language. - `R` is free and open source, and you are able to install as many packages as you want for free. - You learned the concept of "package". - You learned about data frames. In this practice we will work with `data frames` , we will create new variables, and we will use extensively `pipes` along with the package `tidyverse`. ::: callout-note ## Good News! In this practice you will be able to run `R` code straight from your web browser. Isn't it awesome! ::: # Let's put a frame on those data points.... The `data frame` is one type of objects in `R`, and it is extremely useful to represent information as a data frame. This `R` object looks very similar to a spreadsheet in Excel, each column represents a variable, while each row represents an observation. In this practice session we will wrangle the data set named `ruminationToClean.csv`. You may download the data set by [Clicking Here.](https://mega.nz/file/eJFU3BKA#ljpDkezv325VEI2Ii6xuXoU8BbmPg-lNqdHH0XQqk5s){target="_blank"} This is a dataset that was part of a large project that I conducted in Costa Rica in collaboration with a colleague and friend David Campos. Participants were high school students from the Central Valley in Costa Rica. The aim of this study was to explore the relationship between rumination and depression in teenagers. We added metacognitive beliefs to the study, but we will not talk about metacognition in this practice. When I say ***wrangle*** I mean to manipulate the data set and clean the data. In real life, data sets are not perfectly clean. Many times we need to delete observations that are extreme cases (careless answers, mistakes, etc.), or sometimes we need to delete new variables. Additionally, we might need to compute new variables. We will cover all these topics in this practice using `R`. ## The package `tidyverse` is your friend If you run the following code, you'll see the variable names in the data set `ruminationToClean.csv`: ::: column-margin *I'm reading in the data file from my personal repository, fancy right?* ::: ```{webr-r} ## Link to the data set: dataLink <- "https://raw.githubusercontent.com/blackhill86/mm2/main/dataSets/ruminationToClean.csv" ruminationRaw <- read.csv(url(dataLink ), na.strings = "99") colnames(ruminationRaw) ``` Copy the code above to open `ruminationToClean.csv` in your `R` session. As you can see, we have a large list of variables. We are using the function `colnames()` to get a list of column names, which in this case are research variables. Several of these variables could be dropped from the data set, others will need to be renamed. ### Dropping variables In this example we will use the function `select()` from the package `tidyverse`. Remember that you'll need to call the package by running `library(tidyverse)`. ```{webr} ### Let's remove "age" from the data set ruminationClean <- ruminationRaw |> select(-age) colnames(ruminationClean) ``` *What is happening here?* You may have noticed that inside the function `selection()` I am adding the name of the variable I want to delete from the data set, and then I added a minus sign. When you add a minus sign before the name of the variable, the package will delete the variable that you list with a minus sign. You can do the same with several variables at the same time: ```{webr} ### Let's remove "age" from the data set ruminationClean <- ruminationRaw |> select(-school, -grade) colnames(ruminationClean) ``` Now the variables, `age`, `school`, and `grade` are not longer in the dataset, if you don't believe me click on `RUN CODE`. ::: callout-tip ## Excercise 1 (15 points) 1) Download the data set [`ruminationToClean.csv`](https://mega.nz/file/eJFU3BKA#ljpDkezv325VEI2Ii6xuXoU8BbmPg-lNqdHH0XQqk5s){target="_blank"} then open the data file using `ruminationRaw <- read.csv(file.choose())` or you may copy the code to open the data from the online repository. After opening the data in `R`, delete from the data set all the variables that start with th stem `EATQ_R_`. ::: ### Change variable names In some occasions, the research variable was named wrongly or the name does not reflect the content of the variable. In the data set `ruminationToClean.csv` there are variables with names in Spanish, we can change them: ```{webr} ## rename() requires the following grammar: ## newName = oldName ruminationEnglishNames <- ruminationRaw |> rename(otherTreatments = OtrosEspeci, takinPills = medicamentos, psychTreatment = tratamientopsicologico, ruminationTotalScore = Rumiacion) colnames(ruminationEnglishNames) ``` In this example, you can see that I'm using the function `rename()` from the package `dplyr` which is inside the package `tidyverse`. When you use `rename()` you'll need to declare the new name first and then the old variable name that will be replaced. ::: callout-tip ## Excercise 2 (15 points) 2) In psychological measurement it is common to implement questionnaires with Likert-response options. We typically sum the items corresponding to the latent variable or construct that we are measuring. For example in the [`ruminationToClean.csv`](https://mega.nz/file/eJFU3BKA#ljpDkezv325VEI2Ii6xuXoU8BbmPg-lNqdHH0XQqk5s){target="_blank"} the construct of "Rumination" is measured by the [Children's Response Styles Questionnaire (CRSQ),](https://www.researchgate.net/profile/Randy-Auerbach-2/publication/6286250_Responses_to_Depression_in_Children_Reconceptualizing_the_Relation_Among_Response_Styles/links/54e547dc0cf29865c336b5c8/Responses-to-Depression-in-Children-Reconceptualizing-the-Relation-Among-Response-Styles.pdf){target="_blank"} however the variable names are not self-explained. Your task is to change the name of the variables with stem `CRSQ`, the new variables should have the stem `rumination_itemNumber`. For example `rename(rumination_1=CRSQ1)`. Remember to save the changes in a new object. ::: ### Compute a new variable As always in programming languages, there are several ways to create a new research variable in a data frame. But, we will use the `tidyverse` method by using a function that actually works like a verb. The function `mutate()` will help use to create new variables. I mentioned in ***Exercise 2*** that psychology is a field that studies *latent variables* also known as *hypothetical constructs*, they have many names, but at the end we study variables that are implicit; we can measure them but we don't see them. For example, we have studied variables such depression or rumination in this course, these are latent variables because you cannot capture depression or you don't see it walking, however you are able to measure depression by asking about symptoms and thoughts. ![Example of a latent variable](latentVariable.png){fig-alt="It is a figure where there is a circle, the word \"depression\" is inside the circle. Straight arrows are pointing to three squares. The first square has the label \"I am feeling lonely\", the second square has the label \"I feel sad\", the third square has the label \"I lost my appetite\". The squares are grouped as \"Observed indicators\". The circle with the word depression inside is grouped as a \"latent variable\"." fig-align="center" width="1000"} In the figure above you can see a representation of a latent variable. In psychology we often assume that the latent variable is the common factor underlying our questions, -in this example- questions related to depression. When we analyze data that corresponds with latent variables we have two options: 1. We can create a Structural Equation Modeling (I won't cover this topic). 2. We sum all the items to compute a total score. This total score will be our approximation to the latent variable. (Yes, this is covered in this practice). ::: column-margin ![](roll.jpg){fig-alt="This is a meme, the character rolls her eyes insinuating the explanation was boring" width="120"} ::: I'll show an example in the next chunk of `R` code: ```{webr} #-------------------------------------------------------------------------# # Calculates a depression total score and then creates a new variable # # in the data frame. # #-------------------------------------------------------------------------# ruminationDepresionScore <- ruminationRaw |> mutate( depressionTotal = rowSums(across(starts_with("CDI"))) ### Starts_with() helps to avoid a lot of typing ) ``` In this example we are using the function `mutate()` to create a new column named `depressionTotal`, this new `depressionTotal` column will contain the depression total score for each participant. In this study we included the ***"Child's Depression Inventory (CDI)"***, that's why the columns follow the stem "CDI". You can see the new column in the next table: ```{r, setup, include=FALSE} path <- "/home/esteban/MEGA/stanStateDocuments/advancedMethods/practice2" knitr::opts_knit$set(root.dir = path ) library(dplyr) library(gt) ``` ```{r, echo=TRUE, include=TRUE, warning=FALSE, message=FALSE} ruminationRaw <- read.csv("ruminationToClean.csv", na.string = "99") ruminationDepresionScore <- ruminationRaw |> mutate( depressionTotal = rowSums(across(starts_with("CDI"))) ) ruminationDepresionScore |> select(starts_with("CDI"), depressionTotal ) |> gt_preview() ``` ::: callout-tip ## Excercise 3 (15 points) 3) In the data set [`ruminationToClean.csv`](https://mega.nz/file/eJFU3BKA#ljpDkezv325VEI2Ii6xuXoU8BbmPg-lNqdHH0XQqk5s){target="_blank"} you'll find seven columns with the stem "`DASS`". These columns correspond to the instrument named: [Depression, Anxiety, and Stress Scale (DASS)](https://maic.qld.gov.au/wp-content/uploads/2016/07/DASS-21.pdf){target="_blank"}. This is an scale that measures three latent variables at the same time, but only the anxiety items were added in this study. Your task is to change the name of the "DASS" items, you should think a more informative name for these items in the data frame. After that, you will have to compute the total score of anxiety for each participant in the data frame. ::: ### Let's cleanse this data set ::: column-margin ![](datadirty.png){fig-alt="This is a meme, the character is a aging adult cleaning a data set like it was a window in a house" width="200"} ::: It is always satisfactory to clean a data set and then see the final product, but so far we have cleaned these data in small steps. I'll clean the data set in one step. In this example I will delete unnecessary columns (a.k.a research variables), and I will translate the variables to English when necessary: ```{webr} ruminationClean <- ruminationRaw |> rename(otherTreatments = OtrosEspeci, takinPills = medicamentos, psychTreatment = tratamientopsicologico, ruminationTotalScore = Rumiacion, activation = ACTIVACION, attention = ATENCION) |> select(-PBRSad, -NBRS2ad, -starts_with("EATQ_R_")) colnames(ruminationClean) ``` ::: callout-tip ## Excercise 4 (15 points) 4) Describe the steps taken in the `R` code above. ::: # Examine your data In this second part we will create several plots to examine and explore our data. But first we'll need to compute total scores again, because after all, the main point is to make conclusions about our latent variables (a.k.a latent factors). ```{webr} ### I'm calculating the total score of depression, rumination, and anxiety. ruminationClean <- ruminationClean |> mutate( depressionTotal = rowSums(across(starts_with("CDI"))), ### starts_with() function is a time saver! anxietyTotal = rowSums(across(starts_with("DASS"))), ruminationTotal = rowSums(across(starts_with("CR")))) ## This part will print only the new variables in the console ruminationClean |> select(depressionTotal, anxietyTotal, ruminationTotal) |> head() ``` We have now three total scores corresponding to three different latent factors: depression, anxiety, and depression. We are in a good place to explore the data. It is healthy to start looking at histograms because they are now continuous variables: ```{webr} hist(ruminationClean$depressionTotal, main = "Histogram of depression's total score", breaks = 12 ) ``` We can see in the histogram that low values are more frequent. This is expected because all the participants are healthy teenagers without any history of psychopathology. Also, pay attention that we are using the function `hist()` in `R`. We can also generate a density plot to have an idea of the observed shape: ```{webr} ### First we need to get an estimate of the observed density: ruminationDensity <- density(ruminationClean$depressionTotal, na.rm = TRUE) ## We need to add "na.rm=" # because there are missing values. plot(ruminationDensity , main="Density plot: Rumination Total Score" , xlab= "Rumination total score", ylab= "Likelihood") ``` Notice the function `plot()` which is a generic function in `R` it is used as a method in the environment. This functions comes with `R` as a base function. If you need to know how to use it you can run `?plot()`, but the help section might not be helpful for new users. Additionally, pay attention on how you can change the labels in your plot. :::: callout-tip ## Excercise 5 (30 points) 5) Create a density plot for the depression total score. Where is most of the density located? ::: callout-note R Studio has an option to save plots. Just click on the tab named "Plots", then click where it says "Export". You may click on ***"Copy to clipboard"***, this will allow you to paste the image on any software. You may also select to export the image as PDF file or image. ![Option to save generated plots](saveImage.png) ::: :::: Box plots are also useful to understand our data better, you may revisit the box plot's anatomy in this website: [link.](https://r-graph-gallery.com/boxplot.html){target="_blank"} We can see the next example (Click on Run Code): ```{webr} boxplot(ruminationTotal ~ sex, data=ruminationClean, main="Box Plot of Rumination total score by Sex", xlab="Sex", ylab="Rumination total score", names = c("women", "men")) ``` ::: callout-tip ## Excercise 6 (15 points) 6) Can you tell which group in the box plot has the highest median in rumination? What else can you see in the box plot? ::: # Special topics in `R` You may remember that I mentioned multiple times how `R` objects have particular characteristics. This is time to study a few of these characteristics. We will cover special characteristics of missing values. ## Missing data without missing the point.... In scientific research is always problematic when a participant forgets to answer a question, it is more problematic when a participant drops from our study when conducting a longitudinal study. These scenarios are a nightmare for researchers. I will not demonstrate all the possible missing data treatments, but I will show you how `R` deals naturally with missing values, and how you can code your missing values in your data set. ## You have an `NA` on your face In `R` you will see that missing values are flagged with `NA`. In the rumination data set you will see `NA` values like the ones on red: ```{r} ruminationRaw |> select(starts_with("PSWQC")) |> gt_preview()|> data_color( columns = c(PSWQC_7, PSWQC_7r, PSWQC_9r), method = "numeric", palette = "grey", domain = c(0, 50), na_color = "red" ) ``` `NA` stands for "non-applicable" and it became the flag for missing values. It is already built-in with `R`. You could change this flag but that is not necessary. You can add `NA` to any vector, list, or data frame, see the following example: ```{r} vectorNames <- c("Adrian", "Marco", "Sofia") ### Let's add a missing value vectorNames[3] <- NA vectorNames ``` In this example, I added `NA` in the third position, but of course you could add `NA` anywhere in a vector. In `R` there are functions that will return a logical value. Logical data contains only `TRUE` or `FALSE`, or `0` or `1`. We can see an example with our vector with missing values. ```{r} ## is.na() function will be your friend is.na(vectorNames) ``` The function `is.na()` returned a new logical vector flagging `TRUE` where there is a missing value. In this example is the third element. ::: callout-tip ## Excercise 7 (15 points) 7) Search on Internet how to write a command where we ask the opposite: "show me `TRUE` when the value ***is not missing***" . *HINT:* You may need the exclamation character. ::: ## Add `NA` when you read in data When you are opening a data set in `R` you will always need to specify how you flagged or coded your missing information. For example, it is usual to observe the number ***"-99"*** when values are missing, it is also possible to use ***"99"*** like I did in my rumination study. The main point is to select a number that is not possible to be found in your data set. In the rumination data set, it is impossible to expect a value ***"99"*** as a valid score. Because all the scores range from zero to maximum 20. Observe what I did at the beginning of this practice: ```{r, eval=FALSE} dataLink <- "https://raw.githubusercontent.com/blackhill86/mm2/main/dataSets/ruminationToClean.csv" ruminationRaw <- read.csv(url(dataLink ), na.strings = "99") ``` In the code above, I'm adding the argument `na.string = "99"` to the function `read.csv()`. I did it because I flagged my missing values with "99". ::: callout-important ## Important!! If your external file has empty cells, `R` will interpret empty cells as missing values `NA`. ::: ## Usual questions when you are learning data coding and stats ### Can I replace missing values with zeros? The straight answers is a well rounded and pretty NO!!! A zero is not equivalent to missing value. When you collect data you need to think first how you will code your missing values. ::: column-margin ![](noMeme.jpg){fig-alt="Bugs Bunny cartoon saying no to the question: Can I replace missing values with the mean?"} ::: ### Can I replace missing values with the mean? The short answer is NO! But, I'll give you a longer answer, I'm a professor after all, it is my weakness! When you replace a missing value with the mean value of an observed distribution, you are doing `mean imputation`. Long time ago, when computers where not fast, it was faster to replace missing values using the mean, it was easy, and it didn't require to have a computer. But, nowadays it is not necessary to follow this practice anymore. However, there are still researchers using this replacement method. If you assume that the estimated mean is a good representation of your missing value you will inflate your distribution with values close to the observed mean, which is wrong. Let's see an example. In this example, I will simulate a regression model. we haven't covered the theory related to regression but I ebelieve I can explain some details in simple words. The data-generating process will be the following: $$Y \sim \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \epsilon $$ {#eq-1} You may have noticed that all "X" are uppercase and "Y" is uppercase. This means that any outcome $Y$ will be predicted by $X_{1}$, $X_{2}$, and $X_{2}$. This is the model that generates the data. However, given that this is a simulation we can be in control of the process, we can assume that the "population" model looks like this: $$Y \sim 1 + (0.2)X_{1} + (0.3)X_{2} + (0.4)X_{2} + \epsilon $$ {#eq-2} Let's create 2000 data sets with this model in each data set there will be 100 observations. ```{r, fig.height= 7 , fig.width= 9, message=FALSE, warning=FALSE, fig.cap="Simulated values with complet information"} library(MASS) library(purrr) library(patchwork) library(ggplot2) #### Let's simulate some data set.seed(1236) generateCompleteData <- function(n){ library(MASS) MX <- rep(0, 3) SX <- matrix(1, 3, 3) SX[1, 2] <- SX[2, 1] <- 0.5 SX[1, 3] <- SX[3, 1] <- 0.2 SX[2, 3] <- SX[3, 2] <- 0.3 X <- mvrnorm(n, MX, SX) e <- rnorm(n, 0, sqrt(0.546)) y <- 1 + 0.2 * X[,1] + 0.3 * X[,2] + 0.4 * X[,3] + e reg <- lm(y ~ X[,1] + X[,2] + X[,3]) dataReg <- data.frame(inter = reg$coefficients[[1]], x1 = reg$coefficients[2], x2 = reg$coefficients[3], x3 = reg$coefficients[4]) return(dataReg) } ### Multiple models allModels <- map_df(1:2000, ~generateCompleteData(100)) interPlot <- ggplot(allModels, aes(x=inter)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModels$inter),2), color = "red")+ annotate("text", x = 0.85, y = 5, label = paste0("M=", round(mean(allModels$inter),2)," ", "SD=", round(sd(allModels$inter),2)), color = "red")+ xlab("2000 intercepts")+ ylab("Likelihood")+ theme_classic() x1Plot <- ggplot(allModels, aes(x=x1)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModels$x1),2), color = "red")+ annotate("text", x = 0.4, y = 5, label = paste0("M=", round(mean(allModels$x1),2)," ", "SD=", round(sd(allModels$x1),2)), color = "red")+ xlab("Estimated slopes X1")+ ylab("Likelihood")+ theme_classic() x2Plot <- ggplot(allModels, aes(x=x2)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModels$x2),2), color = "red")+ annotate("text", x = 0.5, y = 5, label = paste0("M=", round(mean(allModels$x2),2)," ", "SD=", round(sd(allModels$x2),2)), color = "red")+ xlab("Estimated slopes X2")+ ylab("Likelihood")+ theme_classic() x3Plot <- ggplot(allModels, aes(x=x3)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModels$x3),2), color = "red")+ annotate("text", x = 0.6, y = 5, label = paste0("M=", round(mean(allModels$x3),2)," ", "SD=", round(sd(allModels$x3),2)), color = "red")+ xlab("Estimated slopes X3")+ ylab("Likelihood")+ theme_classic() interPlot + x1Plot + x2Plot + x3Plot ``` As expected, the density plots for our parameters showed that after replicating the same study 2000 times the mean value is exactly at the target. The target are the values that we set in our "population" model in equation @eq-2. In the next step, I will impose missing values on the data generated, we will see what happens when we replace the values with the mean of the observed distribution. ```{r, fig.height= 7 , fig.width= 9, message=FALSE, warning=FALSE, fig.cap="Simulated values after replacing missing data with the mean"} library(simsem) set.seed(1236) generateIncompleteData <- function(n){ library(MASS) library(simsem) MX <- rep(0, 3) SX <- matrix(1, 3, 3) SX[1, 2] <- SX[2, 1] <- 0.5 SX[1, 3] <- SX[3, 1] <- 0.2 SX[2, 3] <- SX[3, 2] <- 0.3 X <- mvrnorm(n, MX, SX) e <- rnorm(n, 0, sqrt(0.546)) y <- 1 + 0.2 * X[,1] + 0.3 * X[,2] + 0.4 * X[,3] + e dataGen <- data.frame(outcome = y, x1 = X[,1], x2 = X[,2], x3 = X[,3]) datMis <- imposeMissing(dataGen, pmMAR = .15, cov="x3") meanOutcome <- mean(datMis$outcome, na.rm = TRUE) meanX1 <- mean(datMis$x1, na.rm = TRUE) meanX2 <- mean(datMis$x2, na.rm = TRUE) datMis$outcome[is.na(datMis$outcome)] <- meanOutcome datMis$x1[is.na(datMis$x1)] <- meanX1 datMis$x2[is.na(datMis$x2)] <- meanX2 reg <- lm(outcome ~ x1 + x2 + x3, data = datMis) dataReg <- data.frame(inter = reg$coefficients[[1]], x1 = reg$coefficients[2], x2 = reg$coefficients[3], x3 = reg$coefficients[4]) return(dataReg) } allModelsMissing <- map_df(1:2000, ~generateIncompleteData(100)) interPlotMissing <- ggplot(allModelsMissing, aes(x=inter)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModelsMissing$inter),2), color = "red")+ geom_vline(xintercept = 1, color = "blue")+ annotate("text", x = 0.85, y = 5, label = paste0("M=", round(mean(allModelsMissing$inter),2)," ", "SD=", round(sd(allModelsMissing$inter),2)), color = "red")+ xlab("2000 intercepts")+ ylab("Likelihood")+ theme_classic() x1PlotMissing <- ggplot(allModelsMissing, aes(x=x1)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModelsMissing$x1),2), color = "red")+ geom_vline(xintercept = 0.2, color = "blue")+ annotate("text", x = 0.4, y = 5, label = paste0("M=", round(mean(allModelsMissing$x1),2)," ", "SD=", round(sd(allModelsMissing$x1),2)), color = "red")+ xlab("Estimated slopes X1")+ ylab("Likelihood")+ theme_classic() x2PlotMissing <- ggplot(allModelsMissing, aes(x=x2)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModelsMissing$x2),2), color = "red")+ geom_vline(xintercept = 0.3, color = "blue")+ annotate("text", x = 0.5, y = 5, label = paste0("M=", round(mean(allModelsMissing$x2),2)," ", "SD=", round(sd(allModelsMissing$x2),2)), color = "red")+ xlab("Estimated slopes X2")+ ylab("Likelihood")+ theme_classic() x3PlotMissing <- ggplot(allModelsMissing, aes(x=x3)) + geom_density(fill = "blue",alpha = 0.2 )+ geom_vline(xintercept = round(mean(allModelsMissing$x3),2), color = "red")+ geom_vline(xintercept = 0.4, color = "blue")+ annotate("text", x = 0.5, y = 5, label = paste0("M=", round(mean(allModelsMissing$x3),2)," ", "SD=", round(sd(allModelsMissing$x3),2)), color = "red")+ xlab("Estimated slopes X3")+ ylab("Likelihood")+ theme_classic() interPlotMissing + x1PlotMissing + x2PlotMissing + x3PlotMissing ``` This time the estimated mean of all the slopes and intercepts moved a little bit far from the target, the blue line is our target value, and the red line is the estimated mean after replacing missing values with mean. You may think, that this is not a big deviation but in more complicated models with more parameters, this deviation from the target can cause harm to the point that you are not sure about your results. This is one reason why we don't replace missing values with the mean. ### So, what should we do instead of using `mean imputation`? You may estimate your models using *listwise deletion*, this method deletes any case that contains a missing value. This is actually the default behavior in many software. We can study an example: In this example I'm generating a vector with random numbers, and then I'm esimating the mean: ```{r} ### Let's create a vector with complete values vectorComplete <- c(123,56,96,89,59) ### We can estimate the mean: mean(vectorComplete) ``` The mean is M=84.6, but notice what happens when there is a missing value: ```{r} ### Let's delete some values to investigate what happens vectorMissing <- c(123,56,NA,89,59) mean(vectorMissing) ``` Now the `mean()` function returns nothing, it returns `NA`, why? Because you cannot compute anything if there is an empty slot. You need to delete the observation that is missing and then compute the mean again. This is what is called *listwise deletion*. You can enable listwise deletion by adding the argument `na.rm = TRUE` to the `mean()` function: ```{r} ### Let's delete some values to investigate what happens vectorMissing <- c(123,56,NA,89,59) mean(vectorMissing, na.rm = TRUE) ``` ::: callout-tip ## Excercise 8 (15 point) 8. Fix the following code when estimating the correlation: ```{r} ### mtcars is a data set already built into R dataCars <- mtcars ### I'm imposing missing value on values located at 6,7,8,9 dataCars$mpg[c(6,7,8,9)] <- NA cor(dataCars$mpg, dataCars$hp) ### This function cor() should give us a estimated correlation, what is wrong? ``` ***HINT:*** You may check the help provided for the function `cor()` by typing `?cor` on your `R` console. Or search on Internet. ::: ## Final note (I promise it is the end) The best strategy to deal with missing values is something call *multiple imputation*, it is a statistical model that generates multiple data sets with possible values. Although, this is a topic for a second course in statistics. ::: column-margin ![](goodfathersMeme.jpg){width="150"} ::: ------------------------------------------------------------------------ ![](finalMeme.jpg){fig-align="center" width="400"}

What have you learned so far in R?

Let’s put a frame on those data points….

The package tidyverse is your friend

Dropping variables

Change variable names

Compute a new variable

Let’s cleanse this data set

Examine your data

Special topics in R

Missing data without missing the point….

You have an NA on your face

Add NA when you read in data

Usual questions when you are learning data coding and stats

Can I replace missing values with zeros?

Can I replace missing values with the mean?

So, what should we do instead of using mean imputation?

Final note (I promise it is the end)

The package `tidyverse` is your friend

Special topics in `R`

You have an `NA` on your face

Add `NA` when you read in data

So, what should we do instead of using `mean imputation`?