#par(mfrow=c(2,2))
#plot(reg1)
Exercise 4: Applied Statistics in R
This exercise focuses on Gaussian linear models, where data points are assumed to be independent, normally distributed, and have equal variance. Such models are fitted in R using lm()
, e.g.
Later sections also introduce logistic regression (for binary outcomes) and linear mixed models (for grouped or hierarchical data).
The purpose of this exercise is to show how to fit and work with statistical models in R. This means that the analyses are not necessarily those that should be used in a real-life analysis of the data.
Start with the Data part.
Data
The dataset to be used concerns risk factors for low infant birth weight and is available in the MASS package. It includes 189 births with the following variables:
Use the commands below to load the MASS package and get the help page for the data. The help page will appear in the lower right window of RStudio. Read about the data; in particular about the variables
age
,lwt
,ftv
,smoke
, andbwt
.Assign the data to an object named
birthwt
if it is not already loaded. Then glimpse the dataset and check the first few rows.Note:
birthwt
is not a tibble (the modern tidyverse data format), but we can easily convert it:The variables
smoke
(motherβs smoking habits) andftv
(number of first-trimester physician visits) are coded as numeric variables.
Convert them to factors so they are treated as categorical variables in your analyses:Make boxplot of birth weight
bwt
according the number of visits to physicianftvFac
during the first trimester. Split by smoking status.Make a scatterplot with
lwt
on the \(x\)-axis andbwt
on the \(y\)-axis. Modify the plot such that points are colored according tosmoke
and size showsftv.
The
ftv
is a numerical variable, but we would also like a categorical version of it calledvisits
, with groups corresponding to zero visits, one visit, and two or more visits, respectively.Report how many how many women had more than one physician visit during the first trimester here.
Regression
The term regression is usually used for models where all explanatory variables are numerical.
If you are not too familiar with linear regression, you can have a look at what the summary()
output means here and for understanding regression plot()
, have a look here.
Fit a linear regression model with
bwt
as response andlwt
as covariate, and identify the estimates for the intercept and for the slope. Find the 97.5% confidence interval for the slope parameter by using theconfint()
function.Performe model validation. Hint: use autoplot. Does the model seem appropriate for the data?
Fit the multiple linear regression model where you include
lwt
as well asage
andftv
(numerical variables) as covariates. Identify the parameter estimates, and consider what their interpretation is.Try the following commands, and see if you can figure out what the outcome means. You should replace
reg2
with the name of the model object from the previous question.
Fit the multiple linear regression model again, but now only using data from mothers with a weight less than 160 pounds (
lwt
< 160). Hint: Use thefilter
function and change thedata
argument in thelm
command.
ANOVA
The term analysis of variance (ANOVA) is used for models where all explanatory variables are categorical (factors). Before proceeding, make sure that smoke
is coded as a factor.
Fit a one-way ANOVA (using
lm()
) where the mean birth weight is allowed to differ between smokers and non-smokers. Find the estimated birth weight for infants from smokers as well as non-smokers. Hint: useemmeans()
.library(emmeans)
Check, is there a statistically significant difference in mean birth weight between infants of smokers and non-smokers?
Fit the oneway ANOVA where you use
visits
as the explanatory variable. Perform an overall F-test for differences across all levels. Find the estimated birth weight for each group, and make the pairwise comparisons between the groups.
Hintdrop1(, test="F")
. What do you conclude about the relationship between the number of visits and infant birth weight?Now, consider both
visits
andsmoke
as explanatory variables.
Since both are categorical, the relevant model is a two-way ANOVA. Fit the model without interaction. Inspect the estimated means and group differences. How do you interpret the estimated means? What do they tell you about birth weight by smoking status and visit frequency?
Fit the twoway ANOVA model with interaction replacing
+
by a*
. Usedrop1()
to test whether the interaction term is statistically significant. Compare it to model without interaction? What do the p-values suggest about whether the effect of smoking depends on the number of visits?Use
anova()
to formally compare the two models. Are the two models statistically different?
What does this imply about including the interaction term?Report are the 2 models statistically different here.
Logistic regression
The variable low
is 1 if birth weight is less than 2500 g, and 0 otherwise.
Visualize the data:
Imagine the actual birth weight bwt
was not recorded, but we still want to know whether the birth weight was low.
Because the outcome is binary (low
has two values), we use logistic regression to model the probability.
For example, we may consider a model with mothersβs weight, smoking status, and age as predictors, just like in the previous question.
Fit the model, take a look at glm function:
Use
drop1()
to check if each predictor contributes significantly:A Note:
test = "F"
is for Gaussian linear models (lm()
), where F-tests are appropriate.
For logistic regression and other GLMs,test = "Chisq"
is the correct choice.Question: Based on the p-values, is there evidence that smoking affects the probability of low birth weight?
We have data for 5 new mothers. Use your fitted logistic model
logreg1
to predict the probability of low birthweight and fill thelow
column.<- tibble( newData low = rep(NA, 5), age = c(22, 30, 18, 17, 28), lwt = c(140, 155, 130, 97, 165), race = c("Black", "Other", "Black", "Other", "White"), smoke = c(1, 0, 1, 1, 0), ptl = c(1, 0, 0, 0, 0), ftv = c(1, 3, 0, 0, 2), bwt = c(2800, 4500, 1700, 1500, 3400) )
Use the
predict()
function to compute the predicted probabilities for the new cases. Make sure to settype="response"
to get probabilities rather than log-odds.Add the predicted probabilities (red points) and a connecting line to your
ggplot
. Hint: add geom_point and geom_line with the newData.Report here - In how many of the new cases is birthweight predicted to be low??