library(MASS)
?birthwt
Exercise 4, part 2: Applied Statistics in R
This part might be easier to follow from the Quarto document. You will use the following exercise4_PART2.qmd file. Please clink on the link, download and open the file in your RStudio.
Except for the last two parts, this exercise is about analysis with so-called Gaussian linear models. This is the class of models where data points are assumed to be independent, Gaussian (i.e. with a normal distribution) and with the same variance. All such models are fitted with the lm
function in R, like so:
lm(y~x, data = my.data)
The term regression is usually used for models where all explanatory variables are numerical, whereas the term analysis of variance (ANOVA) is usually used for models where all explanatory variables are categorical (i.e. factors). However, predictors of both types can be included in the same Gaussian linear model. In the end, we are also concerned with logistic regression models (for binary outcomes e.g. cases vs. controls) and linear mixed models (for data with a block structure).
The purpose of this exercise is to show how to fit and work with statistical models in R. This means that the analyses are not necessarily those that should be used in a real-life analysis of the data.
Start with the Data part. The later parts are almost independent of each other, so you can choose the parts that are most appropriate for you. However, not all statistical concepts are studied in all the parts present.
Data
The dataset to be used is about risk factors associated with low infant birth weight. The data are available in the R-package MASS (part of base R, so it is installed automatically) as a data frame called birthwt. If nothing else is mentioned, then use the infants’s birth weight, bwt
, as outcome (response).
- Use the commands below to load the MASS package and get the help page for the data frame. The help page will appear in the lower right window of RStudio. Read about the data; in particular about the variables
age
,lwt
,ftv
,smoke
, andbwt
.
birthwt
is not a tibble (the modern way to organize data), but we can make it one:
library(tidyverse)
<- as_tibble(birthwt)
birthData birthData
Mother’s smoking habits (
smoke
) is coded as numerical variable. Make it a factor (categorical variable).The
ftv
is a numerical variable, but we would also like a categorical version of it, with groups corresponding to zero visits, one visit, and two or more visits, respectively. Try the following commands, and check the results:
table(birthData$ftv)
<- mutate(birthData, ftvFac = factor(ftv))
birthData <- mutate(birthData, visits = fct_collapse(ftvFac, Never="0", Once="1", other_level="MoreThanOnce"))
birthData summary(birthData)
Make groupwise boxplots of
bwt
byvisits
, i.e., boxplots of birth weight for those women who did not see their doctor, those who saw their doctor once, and those who saw their doctor more than once during the first trimester - all boxplots in one graph. Do the same thing forsmoke
(instead ofvisits
).You may also make groupwise boxplots for each combination of
smoke
andvisits
. You add a combination term with a:
, e.g.smoke:visits
. What is your initial impression about potential associations?Make a scatterplot with
lwt
on the \(x\)-axis andbwt
on the \(y\)-axis. Modify the plot such that points are coloured according to visits.
Modify it further such that the symbol types are different for smokers and non-smokers. This is done by addingpch=smoke
.
Regression
The term regression is usually used for models where all explanatory variables are numerical.
If you are not too familiar with linear regression, you can have a look at what the summary()
output means here and for understanding regression plot()
, have a look here.
Fit a linear regression model with
bwt
as response andlwt
as covariate, and identify the estimates for the intercept and for the slope. Find the 95% confidence interval for the slope parameter by using theconfint()
function.Just by calling
plot()
on your regression model you made you can carry out model validation. Does the model seem appropriate for the data?Fit the multiple linear regression model where you include
lwt
as well asage
andftv
(numerical variables) as covariates. Identify the parameter estimates, and consider what their interpretation is.Try the following commands, and see if you can figure out what the outcome means. You should replace
reg2
with the name of the model object from the previous question.
<- data.frame(lwt=100, age=25, ftv=0)
newData
newDatapredict(reg2, newData)
predict(reg2, newData, interval="prediction")
- Fit the multiple linear regression model again, but now only using data from mothers with a weight less than 160 pounds (
lwt
< 160). Hint: Use thefilter
function and change thedata
argument in thelm
command.
ANOVA
The term analysis of variance (ANOVA) is usually used for models where all explanatory variables are categorical (factors). It is important that you have coded smoke
as a factor, cf. Question 3.
First, install and load the R-package emmeans
.
library(emmeans)
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
Fit the oneway ANOVA (with
lm
) where the expected value ofbwt
is allowed to differ between smokers and non-smokers. Find the estimated birth weight for infants from smokers as well as non-smokers. Is there significant difference between smokers and non-smokers when it comes to infant’s birth weight? Hints: Usesummary()
and theemmeans()
function from theemmeans
R-package. N.B make sure you know what needs to go into the function, use ?.Fit the oneway ANOVA where you use
visits
as the explanatory variable. Find the estimated birth weight for each group, and make the pairwise comparisons between the groups. Furthermore, carry out the \(F\)-test for the overall comparison of the three levels.
Hintdrop1(, test="F")
. What is the conclusion?Now, consider both
visits
andsmoke
as explanatory variables. Since they are both categorical variables (factors), the relevant model is a twoway ANOVA. Fit the twoway ANOVA model without interaction, and make sure you understand the estimates:
<- lm(bwt ~ visits + smoke, data=birthData)
twoway1 summary(twoway1)
- Fit the twoway ANOVA model with interaction (use the command below). Then use
anova
and/ordrop1
to test if the interaction between visits and smoking habits is statistically significant.
<- lm(bwt ~ visits * smoke, data=birthData) twoway2
- You should still use
twoway2
in for this question. Useemmeans
to compute the expected birth weight of infants for smokers and non-smokers, respectively, on average over the three levels ofvisits
.
Models with numerical as well as categorical predictors
Predictors of any type can be included in Gaussian linear models, still using lm
.
- Fit a model where
lwt
(numeric) andsmoke
(factor) are included as predictors in an additive way:
<- lm(bwt ~ lwt + smoke, data=birthData)
model1 summary(model1)
What is the interpretation of the estimates?
- Fit a model with interaction between the two predictors, replacing
+
by a*
:
<- lm(bwt ~ lwt * smoke, data=birthData)
model2 summary(model2)
What is the interpretation of the estimates? Is there evidence in the data that the effect of mother’s weight on infant’s weight differs between smokers and non-smokers? Hint: What does the last question have to do with interaction? Use anova
on appropriate models.
- Fit the model with additive effects of mothers’s weight, smoking status, age, and visit status, and carry out model validation. Give an interpretation of the estimate associated to
smoke
. Does smoking affect the weight of infants?
Logistic regression
The variable low
is 1 if birth weight is less than 2500 g, and 0 if birth weight is larger than 2500g, see the plot
ggplot(birthData, aes(x=bwt, y=low)) + geom_point()
Consider for a moment the situation where the actual birth weight (bwt
) was not registered, such that low
was the only information on the child. Hence, the outcome is binary (low
has two values), and the relevant analysis would be a logistic regression where the probablity \(Pr(low=1)\) is described in terms of predictors. For example, we may consider a model with mothers’s weight, smoking status, age, and visit status as predictors, just like in the previous question.
- Fit the model, and consider the estimates:
<- glm(low ~ lwt + smoke + age + visits, data=birthData, family="binomial")
logreg1 summary(logreg1)
Does this model give evidence for an effect of smoking on the weight of infants? Compare the signs of the estimates from model3
(Gaussian model in the solution for Question 18) and logreg1
(logistic regression model). Can you explain ‘what happens’?
Linear mixed models
- Assume for a moment that the 189 births took place at 19 different medical centers with 10 births at each center, except for one center with only nine births. This is not the case, so we have to generate a center variable artificially. You should of course never invent such an artificial structure for a real dataset! Anyway, let’s do it like this:
set.seed(123)
<- sample(rep(1:19, each=10)[1:189])
center
center<- mutate(birthData,center=factor(center)) birthData
The rep
command repeats the number from 1 to 19, 10 times each. We only need the first 189 numbers. The sample
changes the order of the 189 numbers at random. The set.seed
command has the effect that you get the same sample each time you run the commands. The last line includes the new variable in the original dataset as a categorical variable.
- The
center
variable would typically be included in the model as a random effect. Gaussian models with both fixed and random effects are called linear mixed models, and are fitted with thelmer
function from thelme4
package. Run the code below and identify relevant estimates. Remember thatlme4
must be installed before the commands below can be used.
library(lme4)
<- lmer(bwt ~ lwt + smoke + age + visits + (1|center), data=birthData)
lmm1 summary(lmm1)