Literature studies for the scientist

Literature scholars are far better equipped to talk intelligently about science than scientists are to discuss the study of literature. So lamented a friend of mine on Facebook recently, based on his experience as a professor of English. It is sad, he went on, that the value of his field is as unappreciated among STEM colleagues as in society at large.

Galvanised by an ensuing stream of arguments and rebuttals, I decided to test the claim by polling fellow scientists on Twitter: do you think the study of literature matters, in general and to your work? Of the thirteen who replied, five said the study of literature is not useful for science, and two of those five went further, saying it is not useful at all. It seems my friend has a point.

The poll result is all the more striking given the ambiguity of my question. As one Twitter user put it, did I mean contemplating the works of Shakespeare, Márquez, and Cervantes, or the scientific papers of one’s own discipline? If anything, that uncertainty should have prompted more positive responses. But here I will argue that the distinction shouldn’t matter. Blinkered as I am by an exclusively STEM-focussed post-16 education, I can still see several ways in which wisdom gleaned from the study of literature in general — as practised in university departments of English and their equivalents — might enrich my understanding of the world and help me become a better scientist.

Let’s start with the easiest case. As in any academic field, the literature of science is a messy ecosystem of arguments, counterarguments, modifications, dead ends, and attempts at synthesis. More is written on each topic than any of us could hope to read. We make sense of this textual jungle by being selective; by learning how to discern the strength of evidence; by spotting flawed logic or crucial omissions; and by forming opinions about particular theories, research programmes and researchers. Not all good work is popular, and not all that’s popular is good. Fashion, politics and celebrity matter. And culture matters. For example, in the information age it’s commonplace to make potentially misleading links between DNA and computer code, gene pathways and electrical circuits, or evolution and machine learning. Therefore smart interpretations must account for context. All of which is bread and butter for a literature specialist.

Then there’s how we go about writing the stuff. The standard form for a scientific report describes an orderly progression from question to hypothesis, test, and conclusion. This is of course an artificial narrative imposed on a jumble of events, ideas, and observations. Indeed, many of the finest scientific papers are structured like genre fiction. We shape our science stories according to the idiosyncratic conventions of generalist or specialist journals, conference posters, job talks, and seminar slideshows. Clear communication is notoriously difficult, yet English majors know how to do it better than most.

What about the core of the scientific enterprise: how we attempt to understand reality? Just as painters of the same subject might variously aim to convey light, form, psychology, or narrative, so scientists will draw different features from the same set of observations. Or, pursuing the same question, each will design a different set of experiments. Our ways of seeing are informed by training, personality, and taste in problems. Writing on theoretical biology in particular is often akin to philosophy. The Price equation and inclusive fitness theory offer either deep insights or worthless tautologies, depending on who you ask. Humanities scholars can help us recognise and understand this unavoidable subjectivity.

My particular way of understanding nature is through mathematical models. A useful model describes an imaginary, internally-consistent system that behaves at least a little like some aspect of reality. Modellers prize simplicity. So do playwrights. If you put a gun in your model then it had better go off. Amalgamate your bit players into composite characters. And consider carefully what fundamental feature — the mathematical MacGuffin — you use to drive the action. I find much the same qualities to admire in an elegant mathematical model and a taut movie plot.

Whereas I’ve focussed here on my area of biology, the arguments extend to all of science. For sure, if your sole aim is to measure the mass of an electron then you needn’t worry so much about epistemology. But the mass of an electron, a star, or an elephant is only interesting inasmuch as it provides a parameter of a predictive theory. And theories — even those as successful as general relativity — are always fair game for debate.

I imagine that a literature scholar would find the above arguments woefully simplistic and unoriginal. And that’s exactly my point. My job as a scientist requires me to interpret more or less subjective literature; to weave narratives; and to identify meaningful patterns while accounting for my biasses and those of others; yet scientific training devotes scant time to any of these difficult skills. Rather than relying on checklists and templates, or trying to reinvent the wheel through trial and error, wouldn’t we do well to learn from those whose knowhow is honed by years of specialist study, founded on generations of scholarship addressing precisely this set of problems? Just as athletes gain from cross-training, we can strengthen our critical faculties by exploring alternative intellectual frameworks. At the very least, we might prepare ourselves to ask more informed questions when we next encounter a scholar who doesn’t work in STEM.

For a more nuanced take on the parallels and contrasts between science and literature, I suggest an interview with my PhD advisor Sunetra Gupta, discussing her dual roles of theoretical biologist and novelist.


Mathematical biology versus computational biology (where am I?)

Over years of practice, the seasoned professor develops a sophisticated understanding of how their research programme fits into the broader scientific endeavour. Earning a PhD takes a long time largely because graduate students have no such map. Even when directed towards promising new pastures, the apprentice is bound to spend much time rediscovering well-trodden ground or getting bogged down in unproductive swamps that more experienced explorers know how to avoid. Only a lucky few wanderers happen upon hidden treasure.

Understanding one’s place in academia involves knowing how its subfields are demarcated. A problem here is that the discipline definitions used by journals or in textbooks don’t necessarily correspond to research communities that go by the same names. This discrepancy struck me recently when I attended ISMB/ECCB, which combines the European Conference on Computational Biology with the flagship meeting of the International Society for Computational Biology (ISCB).

The official ISCB journal PLOS Computational Biology publishes works that “further our understanding of living systems at all scales – from molecules and cells, to patient populations and ecosystems – through the application of computational methods.” [1] This broad scope is consistent with NIH’s definition of computational biology as “The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.” [2]

ISMB/ECCB, on the other hand, has a narrower ambit in two respects. First, the conference focusses on data-driven, statistical methods. Second, almost all participants specialise in interpreting molecular data. Indeed, whereas the “CB” of ECCB stands for computational biology, the “MB” of ISMB is molecular biology. ISCB’s original mission statement placed emphasis “on the role of computing and informatics in advancing molecular biology.” [3] Although those words have since been removed from the mission statement, one need only peruse the society’s Communities of Special Interest (COSI) profiles to see that the emphasis remains in practice. Of more than 500 talks at the meeting, only eight mentioned evolution (or a derived word) in their title, of which four were in a special “non-COSI” session.

I don’t at all mean to complain about this particular conference. I attended many excellent ISMB/ECCB talks spanning diverse methods and applications. But it strikes me as worthwhile to examine why communities have formed within particular boundaries, and what we might gain from eroding the divisions. So I drew a diagram:

[Figure: diagram of research communities applying mathematical and computational methods to biology]

The aim of this plot is to help clarify (for junior researchers like me) how scientists applying mathematical and computational methods to biological problems have organised themselves into communities. Of course I’m biassed, and I expect that some aspects of the diagram are demonstrably wrong. I welcome suggestions for improvement and would be happy to post a revised version. But I think the above picture is a useful starting point for discussion.

I’m particularly interested in the white spaces. I don’t doubt there are people developing computational workflows to analyse non-molecular data sets at the tissue, organ, organism, and population scales, but their research community seems to be less prominent than those pertaining to smaller or larger scales. I suppose this is partly because we have better mechanistic understanding at intermediate scales, where systems can be more readily observed and manipulated. Likewise, I’m well aware that mathematical modelling is applied at every biological scale, but (at least based on conference programmes) the mathematical and theoretical biology communities seem to have stronger ties to evolutionary biology, developmental biology, and the study of infectious disease than to molecular biology.

The picture may be changing. I’m fortunate to belong to a Computational Biology Group that uses molecular data to inform agent-based models of tumour evolution, and that uses and advances methods from pure mathematics to strengthen our theoretical grasp of molecular processes. James Sharpe gave a fantastic ISMB/ECCB keynote talk about investigating vertebrate limb development at levels ranging from gene regulatory networks to the physical interactions between cells and tissues. Sharpe conveyed a vision of systems biology not as a subset of computational biology (as narrowly defined) but as a holistic approach to unravelling life’s complexity.

As for myself, I feel most comfortable near the middle of the diagram, though spreading tendrils in each direction to span as many scales and methods as are needed to address the question at hand. So I reckon I’ll keep on attending ISMB/ECCB as well as SMB/ESMTB (mathematical and theoretical biology) and ESEB (evolutionary biology) conferences, and I’ll try to play a part not just in drawing but in redrawing the map.

References

  1. PLOS Computational Biology Journal Information, retrieved 8th August 2017.
  2. Huerta, Michael, et al. (2000) NIH working definition of bioinformatics and computational biology. US National Institutes of Health.
  3. History of ISCB, retrieved 8th August 2017.

Towards a unified theory of cancer risk

Martin Nowak and Bartlomiej Waclaw conclude their recent commentary [1] on the “bad luck and cancer” debate with a look to the future:

“The earlier analysis by Tomasetti and Vogelstein has already stimulated much discussion… It will take many years to answer in detail the interesting and exciting questions that have been raised.”

I agree. When a couple of journalists [2, 3] contacted me for comments on the latest follow-up paper from Christian Tomasetti, Bert Vogelstein and Lu Li, I emphasized what can be gained from rekindling the decades-old debate about the contribution of extrinsic (or environmental, or preventable) factors to cancer risk. In particular, the diverse scientific critiques of Tomasetti and Vogelstein’s analysis suggest important avenues for further inquiry.

My own take is summarized in the figure below. This diagram (inspired by Tinbergen’s four questions) reframes the question in terms of proximate mechanisms and ultimate causes. It also provides a way of categorizing cancer etiology research.

[Figure: causes of cancer, with parts A–E referenced below]

Tomasetti and Vogelstein’s 2015 paper [4] demonstrated that the lifetime number of stem cell divisions is correlated with cancer risk across human tissues (part A in the figure). Colleagues and I have argued [5, 6] that, although characterizing this association is important, it cannot be used to infer what proportion of cancer risk is due to intrinsic versus extrinsic factors. This is because cancer initiation depends not only on mutated cells, but also on the fitness landscape that governs their fate, which is determined by a microenvironment that differs between tissues (figure part B).

Moreover, the supply of mutated cells and the microenvironment are both shaped by an interaction of nature and nurture (figure part C). In a recently published paper [7], Michael Hochberg and I draw attention to the relationship between cancer incidence and environmental changes that alter organism body size and/or life span, disrupt processes within the organism, or affect the germline (figure part D). We posit that “most modern-day cancer in animals – and humans in particular – are due to environments deviating from central tendencies of distributions that have prevailed during cancer resistance evolution”. We support this claim in our paper with a literature survey of cancer across the tree of life, and with an estimate of cancer incidence in ancient humans based on mathematical modelling [7].

To understand why cancer persists at a certain baseline level even in stable environments, we must further examine the role of organismal evolution (figure part E). If cancer lowers organismal fitness then we might expect selection for traits that reduce risk. But continual improvement in cancer prevention is expected to come at a cost, and the net effect on fitness will depend on life history. For example, more stringent control of cell proliferation might reduce cancer risk and so lower the mortality rate at older ages, while also increasing deaths in juveniles and young adults due to impaired wound healing. We can predict outcomes of such trade-offs by calculating selection gradients, which is what I’ve been doing in a research project that I presented at an especially stimulating MBE conference in the UK last week.
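To make this concrete, here is a deliberately simplified illustration in my own notation (not the model from the project mentioned above). Suppose a trait z strengthens control of cell proliferation, lowering late-life cancer mortality \mu_c(z) at the price of raising juvenile mortality \mu_j(z) through impaired wound healing. Writing organismal fitness as W(\mu_j(z), \mu_c(z)), the selection gradient is, by the chain rule,

\frac{dW}{dz} = \frac{\partial W}{\partial \mu_j}\,\frac{d\mu_j}{dz} + \frac{\partial W}{\partial \mu_c}\,\frac{d\mu_c}{dz},

and the trait increases under selection only while the late-life benefit (the second term) outweighs the juvenile cost (the first term). Because the force of selection typically weakens with age, the optimum can leave a residual cancer risk even in a stable environment.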

The quest to understand cancer risk must then encompass not only cell biology, but also ecology and evolution at both tissue and organismal levels. One of my goals is to make connections between these currently disparate lines of research in pursuit of a more unified theory.

References

  1. Nowak, M. A., & Waclaw, B. (2017). Genes, environment, and “bad luck”. Science, 355(6331), 1266–1267.
  2. Ledford, H. (2017) DNA typos to blame for most cancer mutations. Nature News.
  3. Chivers, T. (2017) Here’s Why The “Cancer Is Caused By Bad Luck” Study Isn’t All It Seems. Buzzfeed.
  4. Tomasetti, C., & Vogelstein, B. (2015). Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science, 347(6217), 78–81.
  5. Noble, R., Kaltz, O., & Hochberg, M. E. (2015). Peto’s paradox and human cancers. Philosophical Transactions of the Royal Society B: Biological Sciences, 370(1673), 20150104.
  6. Noble, R., Kaltz, O., Nunney, L., & Hochberg, M. E. (2016). Overestimating the Role of Environment in Cancers. Cancer Prevention Research, 9(10), 773–776.
  7. Hochberg, M. E., & Noble, R. J. (2017). A framework for how environment contributes to cancer risk. Ecology Letters, 20(2), 117–134.

The Box-Einstein surface of mathematical models

As a mathematical modeller in evolutionary biology, my seminar bingo card has four prime boxes. Watching a talk about evolution, I count down the minutes to the first appearance of Dobzhansky’s “nothing in biology” quote (or some variant thereof) or a picture of Darwin’s “I think” sketch. For mathematical modelling, it’ll be either Albert Einstein or George Box:

“All models are wrong but some are useful” – George Box

“Everything should be made as simple as possible, but not simpler” – probably not Albert Einstein

Of course, such quotes are popular for good reason, and I’m not criticising those who use them to good effect, but all the same it can be fun to try to find a new way of presenting familiar material. That’s why in spring 2015 I came up with and tweeted a visual summary of the latter two aphorisms, which I named the Box-Einstein surface of mathematical models:

[Figure: the Box-Einstein surface of mathematical models]

The grey region in the plot ensures that all possible models have some degree of “wrongness”, but the contours in the remaining region tell us that some models are useful all the same. To find the most useful description of a particular phenomenon, we must reduce complexity without overly increasing wrongness.

A key thing to understand about this diagram is that although the boundary of the grey region is invariant, the surface is changeable. If our empirical knowledge of the system becomes richer, or if we change the scope of our enquiry, the most useful model may be more or less complex than before.
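Since the original figure isn’t reproduced here, below is a rough R sketch of how one might draw such a surface. The boundary of the grey region and the usefulness function are invented for illustration; only the qualitative shape matters.

library(ggplot2)

# made-up grid covering the two axes of the surface
surface_df <- expand.grid(Complexity = seq(0, 10, length.out = 201),
 Wrongness = seq(0, 10, length.out = 201))

# hypothetical boundary: the minimum attainable wrongness falls as complexity
# rises, but never reaches zero (all models are wrong)
attainable <- surface_df$Wrongness >= 1 + 7 * exp(-0.4 * surface_df$Complexity)

# hypothetical usefulness, penalising both wrongness and excess complexity;
# zero in the unattainable region, so that region appears as a dark band
surface_df$Usefulness <- ifelse(attainable,
 exp(-0.1 * surface_df$Complexity - 0.2 * surface_df$Wrongness), 0)

ggplot(surface_df, aes(Complexity, Wrongness)) +
 geom_raster(aes(fill = Usefulness)) +
 geom_contour(aes(z = Usefulness), colour = "white", bins = 8) +
 scale_fill_gradient(low = "grey30", high = "gold")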

Einstein’s quote can be seen as simply paraphrasing Occam’s razor, but I think it has additional meaning with regard to what Artem Kaznatcheev calls heuristic and abstract mathematical models, such as are generally used in biology. In statistics, a simple model has few degrees of freedom, which is desirable to reduce overfitting. However, statisticians should also beware what JP Simmons and colleagues termed “researcher degrees of freedom”:

“In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

“It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.”

Likewise, when a researcher makes a mathematical model of a dynamical system – be it a set of differential equations or a stochastic agent-based model – he or she makes numerous decisions, usually with more or less full knowledge of the empirical data against which the model will be judged.

But there’s an important difference between the process of collecting data and that of creating a mathematical model. Ideally, the experimentalist can minimise researcher degrees of freedom by following a suitable experimental design and running controls that enable him or her to test a hypothesis against a null according to a predetermined statistical model. For most mathematical models there is no such template, and a process of trial and improvement is unavoidable, forgivable, and even desirable (inasmuch as it strengthens understanding of why the model works). The role of mathematical modeller is somewhere between experimentalist and pure mathematician. By making our models as simple as possible, we shift ourselves further toward the latter role, and our experimentation becomes less about exploiting our freedom and more about honing our argument.

For further reading, check out Artem Kaznatcheev’s insightful post about what “wrong” might mean, and why Box’s quote doesn’t necessarily apply to all types of model.

Visualizing evolutionary dynamics with ggmuller

A how-to guide for making beautiful plots of your data.

Readers of experimental evolution studies by Richard Lenski, Jeffrey Barrick and others, or of cancer evolution reviews, will have seen stacked area plots that combine information about frequency dynamics and phylogeny. The example below is from a recent preprint by Rohan Maddamsetti, Lenski and Barrick. The horizontal axis is time (measured in generations), and each coloured shape represents a genotype, with height corresponding to genotype frequency (or relative abundance) at the corresponding time point. Descendant genotypes are shown emerging from inside their parents.

[Figure: a Muller plot from the Maddamsetti, Lenski and Barrick preprint]

Such diagrams are sometimes termed Muller plots in honour of Hermann Joseph Muller (of ratchet fame), who used them in 1932 to illustrate an evolutionary advantage of sex. I wanted to draw Muller plots of some evolutionary simulations. Unable to find generic software, I spent a few days coding in R and tweeted an example result:

[Embedded tweet showing an example Muller plot]

Encouraged by the response, I expanded my code into the fully documented R package ggmuller (thanks to Sean Leonard for suggesting the name). Ggmuller is my attempt to set an easy-to-use standard for representing evolutionary dynamics. You can install the package from github with a few lines of R code:

install.packages("devtools") # enables installation from github
library(devtools)
install_github("robjohnnoble/ggmuller") # fetch ggmuller from its github repository
library(ggmuller)

Some of the code that follows depends on the latest version of ggmuller; if you installed the package before 20th August then please reinstall it for full functionality.

How does it work?

The core function is get_Muller_df, which constructs a specially ordered data frame from which we can draw a stacked area plot. get_Muller_df takes two data frames as input: first an adjacency matrix that defines a phylogeny (with columns named “Identity” and “Parent”), and second the populations of each genotype over time (with columns named “Identity”, “Generation” and “Population”). The genotype names in the “Identity” and “Parent” columns can be any words or numbers. You can use a phylo object instead of an adjacency matrix, provided the genotype names in the population data are integers that correspond to the node numbers in the phylo object.

get_Muller_df records a path through the phylogenetic tree. Each genotype appears exactly twice in the path: once before and once after all of the genotype’s descendants. The function then associates each instance of each genotype with half of the genotype’s frequency over time.
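As a toy sketch (not the package’s actual implementation), such a path can be built by a depth-first traversal that records each genotype once on the way down and once on the way back up:

path_visit <- function(edges, genotype) {
 # the genotype's immediate descendants
 children <- edges$Identity[edges$Parent == genotype]
 # record the genotype, then its descendants recursively, then the genotype again
 c(genotype, unlist(lapply(children, function(ch) path_visit(edges, ch))), genotype)
}

toy_edges <- data.frame(Parent = c(1, 1, 2), Identity = c(2, 3, 4))
path_visit(toy_edges, 1) # 1 2 4 4 2 3 3 1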

To draw plots, ggmuller exploits Hadley Wickham’s ggplot2 package (which accounts for the first part of its name). Because ggplot2’s stacked area plot respects row order, and because the two instances of each genotype are coloured identically, the resulting plot shows each genotype emerging from the centre of its parent (following Jeffrey Barrick’s example, multiple descendants are stacked from top to bottom in chronological order of appearance). Whereas I could have made descendants emerge at different heights – as in the Lenski lab figure above – it seems simpler and less ambiguous to have all descendants emerge exactly in the middle of their parents. I propose this as a standard to facilitate comparing plots of this type.

Basic usage

If your adjacency matrix and your population data frame are properly formatted, you can use ggmuller to visualize your results with just two lines of R code:

Muller_df <- get_Muller_df(example_edges, example_pop_df)
Muller_plot(Muller_df)

Simply replace example_edges and example_pop_df with the names of the data frames that contain your adjacency matrix and population data, respectively.

That’s all you need for basic usage. The rest of this post explains why these plots are such a powerful way of visualizing evolutionary dynamics, and introduces some additional features and options.

A couple of minimal examples

To get a better idea of why you might want to use ggmuller, let’s recreate Figure 1 of a recent commentary by Mark Robertson-Tessi & Sandy Anderson. The figure (drawn by Chandler Gatenbee) comprises two diagrams contrasting the “traditional” selective sweep model of cancer progression and the “Big Bang” neutral evolution model proposed by Andrea Sottoriva and colleagues.

Most of the code that follows is to create an appropriate data set. First we make an adjacency matrix that defines the phylogeny. In this minimal example, the genotype names are integers, and each genotype has exactly one descendant (1 begets 2, which begets 3, which begets 4):

edges1 <- data.frame(Parent = 1:3, Identity = 2:4)

Next we construct a data frame of genotype populations over time. In the neutral evolution case, all genotypes (except the background genotype) have the same growth rate:

# a function for generating growth curves: zero population until start_gen,
# then exponential growth (R drops the index 0, so the result always has
# length(gens) elements):
pop_seq <- function(gens, lambda, start_gen) c(rep(0, start_gen),
 exp(lambda * gens[0:(length(gens) - start_gen)]))
gens <- 0:150 # generations
lambda <- 0.1 # baseline exponential growth rate
pop1 <- data.frame(Generation = rep(gens, 4),
 Identity = rep(1:4, each = length(gens)), 
 Population = c(1E2 * pop_seq(gens, lambda, 0), 
 pop_seq(gens, 2*lambda, 0), 
 pop_seq(gens, 2*lambda, 3), 
 pop_seq(gens, 2*lambda, 8)))

We use get_Muller_df to combine the phylogeny and population dynamics into a data frame we can plot from:

Muller_df1 <- get_Muller_df(edges1, pop1)

We create the plot using Muller_plot (which is a wrapper for ggplot’s stacked area plot), with an optional custom colour palette:

my_palette <- c("white", "skyblue2", "red", "green3")
plot1 <- Muller_plot(Muller_df1, palette = my_palette)

We proceed similarly for the second panel, in which genotypes have different growth rates (for example, due to the accumulation of advantageous mutations):

edges2 <- data.frame(Parent = 1:3, 
 Identity = 2:4)
pop2 <- data.frame(Generation = rep(gens, 4),
 Identity = rep(1:4, each = length(gens)),
 Population = c(1E2 * pop_seq(gens, lambda, 0), 
 pop_seq(gens, 2*lambda, 0), 
 pop_seq(gens, 4*lambda, 40), 
 pop_seq(gens, 8*lambda, 80)))
Muller_df2 <- get_Muller_df(edges2, pop2)
plot2 <- Muller_plot(Muller_df2, palette = my_palette)

Finally, we draw the plots together using the gridExtra package:

library(gridExtra)
grid.arrange(plot1, plot2)

[Figure: Muller plots of the neutral evolution (top) and selective sweep (bottom) scenarios]

And here we see the power of the Muller visualisation: even though these systems have identical phylogenies, we can see at a glance that they have very different evolutionary dynamics.

A more sophisticated phylogeny

Now let’s look at a more interesting, branched phylogeny:

edges3 <- data.frame(Parent = paste0("clone_", 
 LETTERS[c(rep(1:3, each = 2), 2, 5)]), 
 Identity = paste0("clone_", LETTERS[2:9]))

The population data are generated using diverse fitness values:

fitnesses <- c(1, 2, 2.2, 2.5, 3, 3.2, 3.5, 3.5, 3.8)
pop3 <- data.frame(Generation = rep(gens, 9),
 Identity = paste0("clone_", LETTERS[rep(1:9, each = length(gens))]),
 Population = c(1E2 * pop_seq(gens, fitnesses[1]*lambda, 0), 
 pop_seq(gens, fitnesses[2]*lambda, 0), 
 pop_seq(gens, fitnesses[3]*lambda, 10), 
 pop_seq(gens, fitnesses[4]*lambda, 20),
 pop_seq(gens, fitnesses[5]*lambda, 30),
 pop_seq(gens, fitnesses[6]*lambda, 40),
 pop_seq(gens, fitnesses[7]*lambda, 50),
 pop_seq(gens, fitnesses[8]*lambda, 50),
 pop_seq(gens, fitnesses[9]*lambda, 60)),
 Fitness = rep(fitnesses, each = length(gens)))

Of course there are more efficient ways of coding this in R (see the sketch below), but perhaps the long-winded version is clearer for readers unfamiliar with lapply and its siblings. Note the inclusion of an optional “Fitness” column containing the fitness values.
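For the curious, here is a more compact way to build the same population data frame, using Map (my own rephrasing, equivalent to the long-winded version above):

start_gens <- c(0, 0, 10, 20, 30, 40, 50, 50, 60) # generation at which each genotype arises
scalings <- c(1E2, rep(1, 8)) # the background genotype starts 100 times larger
pop3_compact <- data.frame(Generation = rep(gens, 9),
 Identity = paste0("clone_", LETTERS[rep(1:9, each = length(gens))]),
 Population = unlist(Map(function(f, s, sc) sc * pop_seq(gens, f * lambda, s),
 fitnesses, start_gens, scalings)),
 Fitness = rep(fitnesses, each = length(gens)))

We combine the data using get_Muller_df as before: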

Muller_df3 <- get_Muller_df(edges3, pop3)

Finally, we create two versions of the plot, first coloured by genotype identity (the default) and then (using the RColorBrewer package) coloured by fitness, making use of that optional column:

plot3 <- Muller_plot(Muller_df3, add_legend = TRUE)
library(RColorBrewer)
num_cols <- length(unique(Muller_df3$Fitness)) + 1
Muller_df3$Fitness <- as.factor(Muller_df3$Fitness)
plot3a <- Muller_plot(Muller_df3, colour_by = "Fitness", 
 palette = rev(colorRampPalette(brewer.pal(9, "YlOrRd"))(num_cols)), 
 add_legend = TRUE)
grid.arrange(plot3, plot3a)

[Figure: Muller plots of the branched phylogeny, coloured by genotype (top) and by fitness (bottom)]

Optional extras

Since simulations can result in very large sets of genotypes, many of which never become abundant, get_Muller_df has a threshold option to exclude rarities. By default, the data frame includes only rows with nonzero population; this can be overridden by setting add_zeroes = TRUE, in which case all genotypes are represented at every generation (which makes it easier to add new columns, for example).
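For example, based on the options just described (parameter values are hypothetical, and I’m assuming the threshold is a frequency cutoff; see ?get_Muller_df for the definitive interface):

Muller_df3_trimmed <- get_Muller_df(edges3, pop3, threshold = 0.01) # exclude rare genotypes
Muller_df3_full <- get_Muller_df(edges3, pop3, add_zeroes = TRUE) # keep all genotypes at every generation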

Converting between adjacency matrix and phylo representations

We can visualize the tree from our branched phylogeny example using Emmanuel Paradis’ ape package, but first we need to use the ggmuller function adj_matrix_to_tree to convert our adjacency matrix (with arbitrary node names) to a phylo object that ape can understand. Phylo is the standard way of representing trees in R, and it requires nodes to be numbered in a particular order.

tree <- adj_matrix_to_tree(edges3)

After optionally labeling the nodes, we use ape’s plot.phylo function to draw the tree:

library(ape)
tree$tip.label <- 1:length(tree$tip.label) # optional
tree$node.label <- (length(tree$tip.label) + 1):10 # optional
plot(tree, show.node.label = TRUE, show.tip.label = TRUE, tip.color = "red")

[Figure: the phylogenetic tree drawn with ape’s plot.phylo]

If you already have a tree in phylo format, you can use that instead of an adjacency matrix as input to get_Muller_df, as long as the genotype names in the population data are integers that correspond to the node numbers in the phylo object.

Showing changes in population size

Although ggmuller deals in frequencies, you can hack it to show changes in the total population size by adding a dummy common ancestor to your phylogeny, with population equal to the maximum total population size minus the total population size at each time point, and then putting a new scale on the y-axis. By default, the dummy common ancestor (which corresponds to unfilled space) will be shown in black, whereas actual genotypes will have paler colours. I may automate this process in future versions.
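For the branched example above, the hack might look something like the following rough sketch (I haven’t shown the y-axis rescaling, and details may vary between versions of the package):

# total population at each time point
totals <- aggregate(Population ~ Generation, data = pop3, FUN = sum)
# the dummy ancestor absorbs the difference between current and maximum total size
dummy_pop <- data.frame(Generation = totals$Generation,
 Identity = "dummy_root",
 Population = max(totals$Population) - totals$Population,
 Fitness = NA) # matching pop3's optional Fitness column
pop3_padded <- rbind(dummy_pop, pop3)
edges3_padded <- rbind(data.frame(Parent = "dummy_root", Identity = "clone_A"),
 edges3)
Muller_plot(get_Muller_df(edges3_padded, pop3_padded))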

[Update on 1st September 2017: ggmuller is now on CRAN with a new function for plotting population sizes instead of frequencies.]

Next steps

I hope ggmuller will prove useful to researchers in experimental evolution, cancer research, population genetics, and beyond. I plan to submit the package to CRAN and will continue to add functionality. Meanwhile, if you use ggmuller in a publication, please cite the github version including doi:10.5281/zenodo.591304, and please notify me via email or twitter so I can keep track of uptake. Bug reports and feature requests would be most welcome.

[Edited on 5th September 2016 to fix a typo in the code defining “edges3”.]