Readers of experimental evolution studies by Richard Lenski, Jeffrey Barrick and others, or of cancer evolution reviews, will have seen stacked area plots that combine information about frequency dynamics and phylogeny. The example below is from a recent preprint by Rohan Maddamsetti, Lenski and Barrick. The horizontal axis is time (measured in generations), and each coloured shape represents a genotype, with height corresponding to genotype frequency (or relative abundance) at the corresponding time point. Descendant genotypes are shown emerging from inside their parents.
Such diagrams are sometimes termed Muller plots in honour of Hermann Joseph Muller (of ratchet fame), who used them in 1932 to illustrate an evolutionary advantage of sex. I wanted to draw Muller plots of some evolutionary simulations. Unable to find generic software, I spent a few days coding in R and tweeted an example result:
Encouraged by the response, I expanded my code into the fully documented R package ggmuller (thanks to Sean Leonard for suggesting the name). Ggmuller is my attempt to set an easy-to-use standard for representing evolutionary dynamics. You can install the package from github with a few lines of R code:
install.packages("devtools") # enables installation from github library(devtools) install_github("robjohnnoble/ggmuller") library(ggmuller)
Some of the code that follows depends on the latest version of ggmuller; if you installed the package before 20th August then please reinstall it for full functionality.
How does it work?
The core function is
get_Muller_df, which constructs a specially ordered data frame from which we can draw a stacked area plot.
get_Muller_df takes two data frames as input: first an adjacency matrix that defines a phylogeny (with columns named “Identity” and “Parent”), and second the populations of each genotype over time (with columns named “Identity”, “Generation” and “Population”). The genotype names in the “Identity” and “Parent” columns can be any words or numbers. You can use a phylo object instead of an adjacency matrix, provided the genotype names in the population data are integers that correspond to the node numbers in the phylo object.
get_Muller_df records a path through the phylogenetic tree. Each genotype appears exactly twice in the path: once before and once after all of the genotype’s descendants. The function then associates each instance of each genotype with half of the genotype’s frequency over time.
To draw plots, ggmuller exploits Hadley Wickham’s ggplot package (which accounts for the first part of its name). Because ggplot’s stacked area plot respects row order, and because the two instances of each genotypes are coloured identically, the resulting plot shows each genotype emerging from the centre of its parent (following Jeffrey Barrick’s example, multiple descendants are stacked from top to bottom in chronological order of appearance). Whereas I could have made descendants emerge at different heights – as in the Lenski lab figure above – it seems simpler and less ambiguous to have all descendants emerge exactly in the middle of their parents. I propose this as a standard to facilitate comparing plots of this type.
If your adjacency matrix and your population data frame are properly formatted, you can use ggmuller to visualize your results with just two lines of R code:
Muller_df <- get_Muller_df(example_edges, example_pop_df) Muller_plot(Muller_df)
example_pop_df with the names of the data frames that contain your adjacency matrix and population data, respectively.
That’s all you need for basic usage. The rest of this post explains why these plots are such a powerful way of visualizing evolutionary dynamics, and introduces some additional features and options.
A couple of minimal examples
To get a better idea of why you might want to use ggmuller, let’s recreate Figure 1 of a recent commentary by Mark Robertson-Tessi & Sandy Anderson. The figure (drawn by Chandler Gatenbee) comprises two diagrams contrasting the “traditional” selective sweep model of cancer progression and the“Big Bang” neutral evolution model proposed by Andrea Sottoriva and colleagues.
Most of the code that follows is to create an appropriate data set. First we make an adjacency matrix that defines the phylogeny. In this minimal example, the gentoype names are integers, and each genotype has exactly one descendant (1 begets 2, which begets 3, which begets 4):
edges1 <- data.frame(Parent = 1:3, Identity = 2:4)
Next we construct a data frame of genotype populations over time. In the neutral evolution case, all genotypes (except the background genotype) have the same growth rate:
# a function for generating exponential growth curves: pop_seq <- function(gens, lambda, start_gen) c(rep(0, start_gen), exp(lambda * gens[0:(length(gens) - start_gen)])) gens <- 0:150 # generations lambda <- 0.1 # baseline exponential growth rate pop1 <- data.frame(Generation = rep(gens, 4), Identity = rep(1:4, each = length(gens)), Population = c(1E2 * pop_seq(gens, lambda, 0), pop_seq(gens, 2*lambda, 0), pop_seq(gens, 2*lambda, 3), pop_seq(gens, 2*lambda, 8)))
get_Muller_df to combine the phylogeny and population dynamics into a data frame we can plot from:
Muller_df1 <- get_Muller_df(edges1, pop1)
We create the plot using
Muller_plot (which is a wrapper for ggplot’s stacked area plot) using a custom colour palette (optional):
my_palette <- c("white", "skyblue2", "red", "green3") plot1 <- Muller_plot(Muller_df1, palette = my_palette)
We proceed similarly for the second panel, in which genotypes have different growth rates (for example, due to the accumulation of advantageous mutations):
edges2 <- data.frame(Parent = 1:3, Identity = 2:4) pop2 <- data.frame(Generation = rep(gens, 4), Identity = rep(1:4, each = length(gens)), Population = c(1E2 * pop_seq(gens, lambda, 0), pop_seq(gens, 2*lambda, 0), pop_seq(gens, 4*lambda, 40), pop_seq(gens, 8*lambda, 80))) Muller_df2 <- get_Muller_df(edges2, pop2) plot2 <- Muller_plot(Muller_df2, palette = my_palette)
Finally, we draw the plots together using the gridExtra package:
library(gridExtra) grid.arrange(plot1, plot2)
And here we see the power of the Muller vizualisation: even though these systems have identical phylogenies, we can see at a glance that they have very different evolutionary dynamics.
A more sophisticated phylogeny
Now let’s look at a more interesting, branched phylogeny:
edges3 <- data.frame(Parent = paste0("clone_", LETTERS[c(rep(1:3, each = 2), 2, 5)]), Identity = paste0("clone_", LETTERS[2:9]))
The population data are generated using diverse fitness values:
fitnesses <- c(1, 2, 2.2, 2.5, 3, 3.2, 3.5, 3.5, 3.8) pop3 <- data.frame(Generation = rep(gens, 9), Identity = paste0("clone_", LETTERS[rep(1:9, each = length(gens))]), Population = c(1E2 * pop_seq(gens, fitnesses*lambda, 0), pop_seq(gens, fitnesses*lambda, 0), pop_seq(gens, fitnesses*lambda, 10), pop_seq(gens, fitnesses*lambda, 20), pop_seq(gens, fitnesses*lambda, 30), pop_seq(gens, fitnesses*lambda, 40), pop_seq(gens, fitnesses*lambda, 50), pop_seq(gens, fitnesses*lambda, 50), pop_seq(gens, fitnesses*lambda, 60)), Fitness = rep(fitnesses, each = length(gens)))
Of course there are more efficient ways of coding this in R, but perhaps the long-winded version is clearer for readers unfamiliar with lapply and its siblings. Note the inclusion of an optional “Fitness” column containing the fitness values. We combine the data using
get_Muller_df as before:
Muller_df3 <- get_Muller_df(edges3, pop3)
Finally, we create two versions of the plot, first coloured by genotype identity (the default) and then (using the RColorBrewer package) coloured by fitness, making use of that optional column:
plot3 <- Muller_plot(Muller_df3, add_legend = TRUE) library(RColorBrewer) num_cols <- length(unique(Muller_df3$Fitness)) + 1 Muller_df3$Fitness <- as.factor(Muller_df3$Fitness) plot3a <- Muller_plot(Muller_df3, colour_by = "Fitness", palette = rev(colorRampPalette(brewer.pal(9, "YlOrRd"))(num_cols)), add_legend = TRUE) grid.arrange(plot3, plot3a)
Since simulations can result in very large sets of genotypes, many of which never become abundant,
get_Muller_df has a
threshold option to exclude rarities. By default, the data frame includes only rows with nonzero population; this can be overridden by setting
add_zeroes = TRUE, in which case all genotypes are represented at every generation (which makes it easier to add new columns, for example).
Converting between adjacency matrix and phylo representations
We can visualize the tree from our branched phylogeny example using Emmanuel Paradis’ ape package, but first we need to use the ggmuller function
adj_matrix_to_tree to convert our adjacency matrix (with arbitrary node names) to a phylo object that ape can understand. Phylo is the standard way of representing trees in R, and it requires nodes to be numbered in a particular order.
tree <- adj_matrix_to_tree(edges3)
After optionally labeling the nodes, we use ape’s
plot.phylo function to draw the tree:
library(ape) tree$tip.label <- 1:length(tree$tip.label) # optional tree$node.label <- (length(tree$tip.label) + 1):10 # optional plot(tree, show.node.label = TRUE, show.tip.label = TRUE, tip.color = "red")
If you already have a tree in phylo format, you can use that instead of an adjacency matrix as input to
get_Muller_df, as long as the genotype names in the population data are integers that correspond to the node numbers in the phylo object.
Showing changes in population size
Although ggmuller deals in frequencies, you can hack it to show changes in the total population size by adding a dummy common ancestor to your phylogeny, with population equal to the maximum total population size minus the total population size at each time point, and then putting a new scale on the y-axis. By default, the dummy common ancestor (which corresponds to unfilled space) will be shown in black, whereas actual genotypes will have paler colours. I may automate this process in future versions.
I hope ggmuller will prove useful to researchers in experimental evolution, cancer research, population genetics, and beyond. I plan to submit the package to CRAN and will continue to add functionality. Meanwhile, if you use ggmuller in a publication, please cite the github version including doi:10.5281/zenodo.240589, and please notify me via email or twitter so I can keep track of uptake. Bug reports and feature requests would be most welcome.
[Edited on 5th September 2016 to fix a typo in the code defining “edges3”.]