Exercise 4: Applied Statistics in R

Author

HeaDS Data Science Lab, University of Copenhagen

This exercise focuses on Gaussian linear models, where data points are assumed to be independent, normally distributed, and have equal variance. Such models are fitted in R using lm(), e.g.

Later sections also introduce logistic regression (for binary outcomes) and linear mixed models (for grouped or hierarchical data).

The purpose of this exercise is to show how to fit and work with statistical models in R. This means that the analyses are not necessarily those that should be used in a real-life analysis of the data.

Start with the Data part.

Data

The dataset to be used concerns risk factors for low infant birth weight and is available in the MASS package. It includes 189 births with the following variables:

Use the commands below to load the MASS package and get the help page for the data. The help page will appear in the lower right window of RStudio. Read about the data; in particular about the variables age, lwt, ftv, smoke, and bwt.
Assign the data to an object named birthwt if it is not already loaded. Then glimpse the dataset and check the first few rows.

Note: birthwt is not a tibble (the modern tidyverse data format), but we can easily convert it:
The variables smoke (mother’s smoking habits) and ftv (number of first-trimester physician visits) are coded as numeric variables.
Convert them to factors so they are treated as categorical variables in your analyses:
Make boxplot of birth weight bwt according the number of visits to physician ftvFac during the first trimester. Split by smoking status.
Make a scatterplot with lwt on the \(x\)-axis and bwt on the \(y\)-axis. Modify the plot such that points are colored according to smoke and size shows ftv.
The ftv is a numerical variable, but we would also like a categorical version of it called visits , with groups corresponding to zero visits, one visit, and two or more visits, respectively.
Report how many how many women had more than one physician visit during the first trimester here.

Regression

The term regression is usually used for models where all explanatory variables are numerical.

If you are not too familiar with linear regression, you can have a look at what the summary() output means here and for understanding regression plot(), have a look here.

Fit a linear regression model with bwt as response and lwt as covariate, and identify the estimates for the intercept and for the slope. Find the 97.5% confidence interval for the slope parameter by using the confint() function.
Performe model validation. Hint: use autoplot. Does the model seem appropriate for the data?
```
#par(mfrow=c(2,2))
#plot(reg1)
```
Fit the multiple linear regression model where you include lwt as well as age and ftv (numerical variables) as covariates. Identify the parameter estimates, and consider what their interpretation is.
Try the following commands, and see if you can figure out what the outcome means. You should replace reg2 with the name of the model object from the previous question.

Fit the multiple linear regression model again, but now only using data from mothers with a weight less than 160 pounds (lwt < 160). Hint: Use the filter function and change the data argument in the lm command.

ANOVA

The term analysis of variance (ANOVA) is used for models where all explanatory variables are categorical (factors). Before proceeding, make sure that smoke is coded as a factor.

Fit a one-way ANOVA (using lm()) where the mean birth weight is allowed to differ between smokers and non-smokers. Find the estimated birth weight for infants from smokers as well as non-smokers. Hint: use emmeans().
```
library(emmeans)    
```
Check, is there a statistically significant difference in mean birth weight between infants of smokers and non-smokers?
Fit the oneway ANOVA where you use visits as the explanatory variable. Perform an overall F-test for differences across all levels. Find the estimated birth weight for each group, and make the pairwise comparisons between the groups.
Hint drop1(, test="F"). What do you conclude about the relationship between the number of visits and infant birth weight?
Now, consider both visits and smoke as explanatory variables.
Since both are categorical, the relevant model is a two-way ANOVA. Fit the model without interaction. Inspect the estimated means and group differences. How do you interpret the estimated means? What do they tell you about birth weight by smoking status and visit frequency?

Fit the twoway ANOVA model with interaction replacing + by a *. Use drop1() to test whether the interaction term is statistically significant. Compare it to model without interaction? What do the p-values suggest about whether the effect of smoking depends on the number of visits?
Use anova() to formally compare the two models. Are the two models statistically different?
What does this imply about including the interaction term?
Report are the 2 models statistically different here.

Logistic regression

The variable low is 1 if birth weight is less than 2500 g, and 0 otherwise.
Visualize the data:

Imagine the actual birth weight bwt was not recorded, but we still want to know whether the birth weight was low.
Because the outcome is binary (low has two values), we use logistic regression to model the probability.

For example, we may consider a model with mothers’s weight, smoking status, and age as predictors, just like in the previous question.

Fit the model, take a look at glm function:
Use drop1() to check if each predictor contributes significantly:

A Note: test = "F" is for Gaussian linear models (lm()), where F-tests are appropriate.
For logistic regression and other GLMs, test = "Chisq" is the correct choice.

Question: Based on the p-values, is there evidence that smoking affects the probability of low birth weight?

We have data for 5 new mothers. Use your fitted logistic model logreg1 to predict the probability of low birthweight and fill the low column.

newData <- tibble(
  low = rep(NA, 5),
  age = c(22, 30, 18, 17, 28),
  lwt = c(140, 155, 130, 97, 165),
  race = c("Black", "Other", "Black", "Other", "White"),
  smoke = c(1, 0, 1, 1, 0),
  ptl = c(1, 0, 0, 0, 0),
  ftv = c(1, 3, 0, 0, 2),
  bwt = c(2800, 4500, 1700, 1500, 3400)
)

Use the predict() function to compute the predicted probabilities for the new cases. Make sure to set type="response" to get probabilities rather than log-odds.
Add the predicted probabilities (red points) and a connecting line to your ggplot. Hint: add geom_point and geom_line with the newData.
Report here - In how many of the new cases is birthweight predicted to be low??