library(tidyverse)
# FactoMineR for PCA computation and factoextra for pretty plots
library(FactoMineR)
library(factoextra)
library(ggpubr)
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
library(DESeq2)
Presentation 3 - EDA - Multivariate Analysis
In this section, we’ll move past basic summary stats and simple bivariate plots to uncover more complex relationships in our data.
We’ll use PCA (Principal Component Analysis) — a popular dimensionality reduction method — to reveal hidden structure and highlight the most informative patterns. in the data.
Our focus is to assess whether the collagen (COL) gene family plays a role in ovarian cancer, and whether cell type proportions (tumor, stromal, or normal) influence collagen expression.
We also want to examine whether patient characteristics, such as tumor stage, correlate with gene expression patterns.
Load packages and the data
Let’s get started by loading the necessary libraries and the cleaned dataset we explored in previous sections:
# Load the pre-saved RData file
load("../data/Ovarian_comb_prep_Col.RData")
# Subset the data we will use and drop NA values
<- df_comb %>%
df_comb select(
where(is.integer) |
where(is.factor) |
where(is.numeric)
%>%
) drop_na(where(is.integer))
Normality and Data Transformation
Before running PCA or other multivariate analyses, it’s important to check if our variables are roughly normally distributed, since many methods assume or work best with normal data.
One simple way to do this is with a QQ (Quantile-Quantile) plots, which visually compares the distribution of our data to a normal distribution.
Checking for Normality: QQ Plot
Let’s check the normality of all integer variables at once by reshaping the data and plotting QQ plots for each.
# Pivot integer columns to long format for easy faceting
<- df_comb %>%
df_long pivot_longer(where(is.integer),
names_to = "variable",
values_to = "value")
# QQ plots faceted by variable
ggplot(df_long, aes(sample = value)) +
geom_qq_line(color = "red") +
geom_qq(color = "#482878FF", alpha = 0.7) +
labs(title = "QQ Plot for Gene Expression",
x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal() +
facet_wrap(vars(variable), nrow = 2, scales = "free")
If the points fall along the diagonal line, the data is approximately normal.
If the points curve above the line, the data is right-skewed.
If the points fall below the line, the data is left-skewed.
This quick check shows whether a transformation is needed to improve symmetry and model accuracy.
In this case, we can see that Age is approximately normally distributed, while percent tumor cells and percent normal cells are not. This is very common — many variables, especially in biological or economic data, naturally tend to be skewed rather than perfectly normal.
Transformations to Reduce Skewness
Depending on your data and goals, it’s often necessary to scale or transform variables for meaningful analysis. If normality is an issue, transformations like a log or square root can help make distributions more symmetric and stabilize variance.
Let’s apply a transformation to our data now to help reduce skewness and get our variables closer to normal.
<- df_comb %>%
df_raw select(where(is.integer))
<- df_raw %>%
df_scaled select(where(is.integer)) %>%
mutate(
#days_to_death = log2(days_to_death + 1),
days_to_death = sqrt(days_to_death),
days_to_tumor_recurrence = log2(days_to_tumor_recurrence + 1),
percent_stromal_cells = log2(percent_stromal_cells + 1),
#percent_not_cancer_cells = log2(percent_not_cancer_cells + 1),
percent_normal_cells = log2(percent_normal_cells + 1),
percent_tumor_cells = percent_tumor_cells^3)
Histogram Before and After Transformation
Lets take a look!
<- df_long %>%
p1 filter(variable %in% c("days_to_death", "percent_normal_cells", "percent_tumor_cells")) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "#482878FF", color = "black") +
theme_minimal() +
facet_wrap(vars(variable), nrow = 1, scales = "free") +
labs(title = "Before")
<- df_scaled %>%
df_long pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
<- df_long %>%
p2 filter(variable %in% c("days_to_death", "percent_normal_cells", "percent_tumor_cells")) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "#2A788EFF", color = "black") +
theme_minimal() +
facet_wrap(vars(variable), nrow = 1, scales = "free") +
labs(title = "After")
ggarrange(p1, p2, nrow = 2, ncol = 1)
Summary
Always check for normality before choosing a transformation.
Right-skewed data can benefit from log or sqrt transformations.
Left-skewed data may benefit from squaring.
QQ plots, histograms are excellent tools to assess transformation effectiveness.
Applying Z-Score Scaling
Z-score standardization is essential when variables have different units or ranges
<- df_comb %>%
p1 select(where(is.integer)) %>%
drop_na() %>%
pivot_longer(everything(), names_to = "Sample", values_to = "Value") %>%
ggplot(aes(x = Sample, y = Value, fill = Sample)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Boxplot: raw") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, vjust = 0.5)) +
scale_fill_viridis_d()
<- as.tibble(scale(df_comb %>%
p2 select(where(is.integer)) %>%
drop_na())) %>%
pivot_longer(everything(), names_to = "Sample", values_to = "Value") %>%
ggplot(aes(x = Sample, y = Value, fill = Sample)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Scaled (Z-score)") +
theme(legend.position="none",
axis.text.x = element_text(angle = 90, vjust = 0.5)) +
scale_fill_viridis_d()
ggarrange(p1, p2, nrow = 1)
Principal Component Analysis (PCA)
There are many frameworks for performing PCA in R. Here, we’ll use the FactoMineR
and factoextra
packages to compute and visualize PCA results.
Since PCA is sensitive to scale and variance, it’s important to use the cleaned and transformed data we prepared earlier.
Let’s run PCA on the integer variables:
<- df_comb %>%
group_var select(where(is.factor))
# Compute PCA
<- PCA(df_scaled, graph = FALSE, scale.unit = TRUE) res.pca_transformed
Eigenvalues and Variance Explained
Next, we’ll check how much variance each principal component explains and visualize the results!
fviz_screeplot(res.pca_transformed, addlabels = TRUE, ylim = c(0, 50))
The scree plot shows the percentage of total variance explained by each principal component. This helps decide how many components to keep.
Variable-Level Results
We can extract variable-level results with:
<- get_pca_var(res.pca_transformed) var
This gives us:
var$coord
: Coordinates (i.e., how variables align with PCs)var$cos2
: Quality of representationvar$contrib
: Contribution to each principal component
Visualize variable relationships: this plot shows how strongly each variable contributes to the PCs. Variables closer to the circle’s edge are well represented and more influential.
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca_transformed,
col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) # Avoid text overlapping)
Check which variables contribute the most to PC1 and PC2 individually.
<- fviz_cos2(res.pca_transformed, choice = "var", axes = 1, top=10)
p1 <- fviz_cos2(res.pca_transformed, choice = "var", axes = 2, top=10)
p2 ggarrange(p1, p2, nrow = 1, ncol = 2)
Individual-Level Results
Beyond looking at variables, PCA also provides insights at the individual level (i.e., each sample or patient). You can access their coordinates, cos2 values, and contributions.
# Extract individual results
<- get_pca_ind(res.pca_transformed) ind
To see which individuals contribute most to the main axes, plot their contributions. (same as for variables):
# Quality of representation (cos²) for individuals
<- fviz_cos2(res.pca_transformed, choice = "ind", axes = 1, top = 20)
p1 <- fviz_cos2(res.pca_transformed, choice = "ind", axes = 2, top = 20)
p2 #p3 <- fviz_cos2(res.pca_transformed, choice = "ind", axes = 1:4, top = 20)
ggarrange(p1, p2, nrow = 1, ncol = 2)
You can also color individuals by important integer variables to explore patterns.
For example, color points by age, stromal cells, normal cells, or tumor cells:
<- fviz_pca_ind(res.pca_transformed,
t1 col.ind = df_scaled$age_at_initial_path_diagn,
geom = c("point"), title = "Age") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_transformed,
t2 col.ind = df_scaled$percent_stromal_cells,
geom = c("point"), title = "Stromal Cells") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_transformed,
t3 col.ind = df_scaled$percent_normal_cells,
geom = c("point"), title = "Normal Cells") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_transformed,
t4 col.ind = df_scaled$percent_tumor_cells,
geom = c("point"), title = "Tumor Cells") + scale_color_viridis_c()
ggarrange(t1, t2, t3, t4, ncol = 2, nrow = 2)
Note that, individuals that are similar are grouped together on the plot.
Coloring by Groups (Factors)
Want to color by groups? Just add a column to your metadata and pass it in:
<- fviz_pca_ind(
p1
res.pca_transformed,col.ind = group_var$batch,
geom = "point",
legend.title = "Batch") +
scale_color_viridis_d() +
theme(legend.position = "bottom")
<- fviz_pca_ind(
p2
res.pca_transformed,col.ind = group_var$stage,
geom = "point",
legend.title = "Stage",
pointsize = 2) +
scale_color_viridis_d() +
theme(legend.position = "bottom")
ggarrange(p1, p2, ncol = 2, nrow = 1)
That’s a complete walkthrough of PCA: computing it, exploring results, and visualizing them effectively. PCA is a critical tool for working with high-dimensional data.
Variance Stabilization
Now that we’ve explored data tranformations and PCAs, it’s time to address another important aspect: variance stabilization.
Log or square-root transformations can help reduce variance when larger values have higher variability.
If that’s not enough, try more robust methods like:
rlog()
(regularized log transformation)varianceStabilizingTransformation()
(VST)
These advanced tools, available in the DESeq2 package, are specifically designed for stabilizing variance in count data.
Example: Apply VST, rlog, and log2 Transformations
Lets examine this using our gene expression data.
Since our expression values are already log2-transformed, we’ll first reverse them back to raw counts. Then, we can test and compare various transformations side-by-side to see how they affect variance and distribution.
# 1. Extract COL genes and convert to counts
<- df_comb %>%
df_matrix select(starts_with('COL')) %>%
mutate(across(everything(), ~ 2^.)) %>%
mutate(across(everything(), as.integer)) %>%
as.matrix()
# 3. Create variance stabilized versions
<- log2(df_matrix)
log2_df <- rlog(df_matrix, blind = TRUE)
rlog_df <- varianceStabilizingTransformation(df_matrix, blind = TRUE) vst_df
Comparing Mean-Variance Relationships
Now we visualize how well each transformation stabilizes variance:
# 4. Function for mean-variance plot
<- function(mat, title) {
plot_mean_var <- tibble(
df_mv mean = rowMeans(mat),
variance = apply(mat, 1, var),
rank = rank(rowMeans(mat))
)
ggplot(df_mv, aes(x = rank, y = variance)) +
geom_point(alpha = 0.5, color = "#482878FF") +
theme_minimal() +
labs(title = title, x = "Mean Rank", y = "Variance")
}
<- plot_mean_var(df_matrix, "Mean-Variance: Raw Counts")
mv1 <- plot_mean_var(log2_df, "Mean-Variance: log2(counts + 1)")
mv2 <- plot_mean_var(rlog_df, "Mean-Variance: rlog")
mv3 <- plot_mean_var(vst_df, "Mean-Variance: VST")
mv4
# 5. Arrange and display plots
ggarrange(mv1, mv2, mv3, mv4, ncol = 2, nrow = 2)
We can see that log2 helps, but rlog and VST do a better job of flattening the variance. However, rlog may overly compress high-expression values. For that reason, we will proceed using the VST-transformed data.
Expression Distribution Comparison
Let’s inspect how the transformations affect overall data spread:
# Function for boxplot
<- function(mat, title) {
plot_box <- as.data.frame(mat) %>%
df_long pivot_longer(everything(), names_to = "Sample", values_to = "Expression")
ggplot(df_long, aes(x = Sample, y = Expression, fill = Sample)) +
geom_boxplot() +
theme_minimal() +
labs(title = title) +
theme(legend.position="none",
axis.text.x = element_text(angle = 90, vjust = 0.5)) +
scale_fill_viridis_d()
}
# Generate plots
<- plot_box(log2_df, "Boxplot: log2")
p1 <- plot_box(rlog_df, "Boxplot: rlog")
p2 <- plot_box(vst_df, "Boxplot: VST")
p3
# Arrange and display plots
ggarrange(p1, p2, p3, nrow = 3)
We will go with the VST!
Caution: Transformations Are Not a Cure-All
Not every issue is solvable with a transformation
Consider non-linear models or weighted least squares when appropriate
Interpret results on the transformed scale, but report findings on the original scale when possible
Best Practice: Try multiple transformations and assess both statistical fit and interpretability.
PCA on VST-Stabilized Data
Now that our gene expression data has undergone variance stabilization using the VST transformation, we’re ready to run PCA and explore structure, key contributors, and biological relevance.
# Run PCA
<- PCA(vst_df, graph = FALSE, scale.unit = TRUE)
res.pca_genes_vst fviz_screeplot(res.pca_genes_vst, addlabels = TRUE, ylim = c(0, 50))
Inspect Variable Contributions
fviz_pca_var(res.pca_genes_vst, col.var = "coord",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
<- fviz_cos2(res.pca_genes_vst, choice = "var", axes = 1, top=10) # cos2 = quality
p1 <- fviz_cos2(res.pca_genes_vst, choice = "var", axes = 2, top=10) # cos2 = quality
p2 ggarrange(p1, p2, nrow = 1, ncol = 2)
Gene Expression Patterns on PCA Projection
We can project individual samples onto PCA space and color them by expression of key genes to understand their influence and explore potential biological relevance.
<- fviz_pca_ind(res.pca_genes_vst, col.ind = log2_df[,"COL5A1"], geom = "point",
p1 title = "COL5A1") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_genes_vst, col.ind = log2_df[,"COL1A1"], geom = "point",
p2 title = "COL1A1") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_genes_vst, col.ind = log2_df[,"COL9A3"], geom = "point",
p3 title = "COL9A3") + scale_color_viridis_c()
<- fviz_pca_ind(res.pca_genes_vst, col.ind = log2_df[,"COL4A6"], geom = "point",
p4 title = "COL4A6") + scale_color_viridis_c()
ggarrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Observation: COL5A1 is a major driver of variation along PC1, while COL9A3 appears more influential along PC2.
Results: Investigating Collagen Genes and Tumor Characteristics
We set out to explore whether the COL (collagen) gene family plays an important role in ovarian cancer, and whether cell type proportions — like tumor or stromal content — influence collagen gene expression.
To explore this, we have used Principal Component Analysis (PCA) to identify the most variable and informative COL genes.
# Get variable loadings (contributions to PCs)
<- res.pca_genes_vst$var$contrib
loadings <- as.data.frame(loadings)
loadings_df $gene <- rownames(loadings_df)
loadings_df
# Top genes from PC1 and PC2
<- loadings_df %>% arrange(desc(abs(Dim.1))) %>% slice_head(n = 2)
top_pc1 <- loadings_df %>% arrange(desc(abs(Dim.2))) %>% slice_head(n = 2)
top_pc2
<- unique(c(top_pc1$gene, top_pc2$gene))
top_genes top_genes
[1] "COL5A1" "COL1A1" "COL9A3" "COL4A6"
These top genes are the ones most strongly contributing to variation along the principal components — suggesting biological relevance.
Gene Expression by Cancer Stage
We next visualized the expression of these genes across cancer stages to see if their expression correlates with disease progression.
<- df_comb %>%
df_long select(any_of(top_genes), stage) %>%
pivot_longer(cols = starts_with('COL'),
names_to = "gene",
values_to = "value")
ggplot(df_long, aes(x = gene, y = value, fill = stage)) +
geom_boxplot() +
scale_fill_viridis_d() +
theme_minimal() +
facet_wrap(vars(gene), nrow = 2, scales = "free") +
labs(title = "Gene Expression by Tumor Stage")
From these boxplots:
Genes highly loading on PC1 seem to increase with tumor stage, suggesting a role in tumor progression.
Genes from PC2 show a more stable or mixed pattern — possibly reflecting different biological regulation.
Gene Expression vs. Tumor Composition
We also examined how collagen gene expression loaded in PC1 relates to tumor cell percentages and stromal content — key features of the tumor microenvironment.
<- loadings_df %>%
sel_genes arrange(desc(abs(Dim.1))) %>%
slice_head(n = 2)
<- df_comb %>%
df_long select(any_of(sel_genes$gene), percent_tumor_cells, percent_stromal_cells) %>%
pivot_longer(cols = starts_with("COL"),
names_to = "gene",
values_to = "value")
ggplot(df_long, aes(x = percent_tumor_cells^3, y = value, color = percent_stromal_cells)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
theme_minimal() +
stat_smooth(method = "loess", col = "red") +
facet_wrap(vars(gene), ncol = 2, scales = "free") +
stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.001) +
labs(title = "Expression vs. Tumor Cell Composition") +
theme(legend.position = "bottom")
`geom_smooth()` using formula = 'y ~ x'
We observe a negative correlation between PC1 gene expression and tumor cell percentage, suggesting that:
Collagen expression may be higher in samples with lower tumor purity, possibly reflecting increased stromal or non-tumor activity.
This highlights the importance of adjusting for cell composition when analyzing expression data — otherwise, we risk attributing biological meaning to technical or compositional bias.
Summary of Key Results
PCA helped us prioritize COL genes most associated with biological variation.
PC1 genes showed higher expression in late-stage tumors, suggesting a role in progression.
These genes also correlate negatively with tumor cell content, implying stromal influence.
Further statistical analysis and modeling would be necessary to confirm these theories properly.