Whole-genome viral phylogeny estimation without sequence alignment

Prerequisites: 1. Phylogenetics; 2. Sequence analysis

Level: Advanced.
Objectives:
To build phylogenies of viruses using complete genome data without multiple sequence alignment.

Introduction

When it comes to inferring phylogenies using sequence data, multiple sequence alignment (MSA) is de rigueur – yet, despite its usefulness, it can be challenging to produce an MSA when we aim to build phylogenies using whole genome information. The difficulties are articulated in Zielezinski et al. (2017).

Let us consider viral genomes. They typically mutate very rapidly, generating a highly mosaic genome that defies modeling using standard nucleotide substitution models. It can be tricky to identify homologous sequences in a set of viral genomes. Even in the best situation when the homologies are precisely known, the order of the genes in the genome is unlikely to be the same, thus making it necessary to make separate multiple sequence alignments of homologous sequences, followed by concatenation – very tedious work indeed!

A Machine Learning Perspective To Phylogeny Estimation

It turns out that looking at the phylogeny estimation problem from the perspective of machine learning can be quite refreshing. To begin with, we have the raw data – the whole genome, which is just a sequence consisting of four characters: A, G, C, and T. What sort of features can be engineered? Think about this for a second.

It might not be hard to come up with the following approach: consider a sequence of length k (a ‘k-mer’). There are 4^k possible k-mers, which we take can use as our features. The value assigned to a k-mer feature is just the number of times that k-mer appears in the genome. This quickly enables the generation of a data matrix with n rows (the genomes of interest) and 4^k columns.

After controlling for genome size, i.e., normalizing each row by its row total, each sequence can be thought of as being represented by a probability distribution of the 4^k k-mers, a "feature-frequency profile" (Sims et al., 2009). The pairwise distance between two genomes can then be estimated using the Jensen-Shannon divergence measure. Subsequently, the resulting distance matrix can be used as input in a clustering algorithm, such as the neighbor-joining algorithm, thus summarising the relatedness of the viral genomes as a phylogenetic tree. All this without aligning any sequences!

Sounds reasonable – but how well does it work in practice?

The R Code

So here’s how we get everything done in R. First, we load the necessary R packages, and the data:

#setting up libraries
#Biostrings Bioconductor package
#source("http://bioconductor.org/biocLite.R")
#biocLite("Biostrings")

library(seqinr)
library(Biostrings)
library(ape)
library(textmineR)

#In the folder containing only the 12 viral sequences
nfiles <- length(dir())
seqdat <- vector("list", nfiles)

for(i in 1:nfiles){
seqdat[[i]] <- read.fasta(file=dir()[i])
}

label <- sapply(1:nfiles, function(k) unlist(strsplit(dir()[k], "[.]"))[1])
names(seqdat) <- label

seqdat_join <- lapply(seqdat, function(k) paste(toupper(unlist(k)),collapse="") )

The next step is to generate tabulate the frequency of all possible 5-mers in each of the viral genomes.

#enumerate all 5-mers
dna <- c("A","C","G","T")
kmer5 <- expand.grid(dna, dna, dna, dna, dna)
kmer5 <- apply(kmer5, 1, function(k) paste(k, collapse=""))

#function for counting all possible kmers (k=5) given a single dna string
kmercount <- function(data){
  sapply(1:length(kmer5), function(k)
  length(unlist(gregexpr2(kmer5[k], data)))
  )}

#vector of counts for all possible kmers (k=5) for all viral sequences
kmer_features <- lapply(seqdat_join, function(k) kmercount(k))


#Collect k-mer counts into a data frame
M <- do.call(rbind, kmer_features)

We then take care of some labelling stuff.

#taxonomic labels
taxonomy <- data.frame(rownames(M), 
c("Filoviridae", "Flaviviridae", "Picornaviridae", "Flaviviridae",
"Picornaviridae", "Flaviviridae", "Filoviridae", "Flaviviridae",
"Flaviviridae", "Picornaviridae", "Picornaviridae",
"Picornaviridae"))
colnames(taxonomy) <- c("Virus","Family")

#Simplify virus species names
virusnames <- sapply(1:nrow(taxonomy), function(k){
chop <- unlist(strsplit(as.character(taxonomy$Virus)[k],"_")) 
chop[length(chop)]}
)

rownames(M) <- virusnames

tipcolor <- c("red","blue","darkviolet")[unclass(taxonomy$Family)]

We compute the appropriate distance between all pairs of 5-mer probability vectors for the viral genomes using the Jensen-Shannon divergence, and then use the resulting distance matrix as input for the bioNJ algorithm for tree construction.

#The correct input for CalcJSDivergence is the (unnormalised) count vector
JSdist <- CalcJSDivergence(M)

plot.phylo(bionj(JSdist), type="unrooted", cex=0.8, tip.color=tipcolor,
rotate.tree=95)

A principal component analysis (PCA) plot is also useful for corroborating the topology of the inferred tree. Here, it seems that the plot of the first and the fourth principal component gives the best ordination, whereby viruses from the same family are clustered together.

#Alternative visualisation
#Q-mode PCA for visualisation of ordination of virus species in low dimensional kmer space
M_norm <- t(apply(M, 1, function(k) k/sum(k)))
pca <- prcomp(M_norm)
summary(pca)

pairs(pca$x[,1:5], pch=16, cex=2,
col=c("red","blue","olivedrab")[unclass(taxonomy$Family)])

#PC1 vs PC4 seems to give best separation for the three viral families

plot(pca$x[,1], pca$x[,4], cex=2, xlim=c(-0.022,0.022), ylim=c(-0.01, 0.01),
pch=16, col=c("red","blue","darkviolet")[unclass(taxonomy$Family)],
xlab="PC1(38%)", ylab="PC5(6%)")

text(pca$x[,1], pca$x[,4]+0.001, virusnames,  cex=0.7, srt=35)

legend(0.01,-0.007,pch=16, pt.cex=2, 
col=c("red","blue","darkviolet"), c("Filoviridae","Flaviviridae","Picornaviridae"))

Download the entire R script bioinfoRmatics-series1.R.R and the virus genomes: virus_genomes.zip.

The Results

To test, I used 12 complete viral genomes from the NCBI database, 5 from the Flaviviridae family (which includes viruses in the tropics like the Zika and Dengue), 5 from the Picornaviridae family (which counts the flu and hand-foot-mouth disease viruses as members), and 2 deadly viruses: Ebola and Marburg, from the Filoviridae family. All members from the same family are correctly clustered in the inferred phylogeny (Fig. 1) and supported by results of principal component analysis (Fig. 2).

A Computationally Efficient Approach

The entire computation was complete in about 10 seconds, using an old laptop running on just 2GB of RAM! Try the R code.

Virus phylogeny — *Figure 1.*Inferred phylogeny using the BioNJ algorithm with a distance matrix computed from Jensen-Shannon divergence. Colour legend: red = *Filoviridae*; blue = *Flaviviridae*; purple = *Picornaviridae*.

clustering of the viral genomes according to family — *Figure 2.*PCA plot of PC5 against PC1, shows clear clustering of the viral genomes according to the families.

You may also be interested in

Pair-wise sequence alignment

Pair-wise sequence alignment methods

Construction of substitution matrices

DNA scoring matrices

Multiple sequence alignment (MSA) tools and resources

Single nucleotide polymorphism (SNP) tools and resources

Back to Tutorials Main Page

Please, send comments and suggestions. Suggest for example a topic you would like to see or any improvements in the content. We will not spam you with emails!

References

1. Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H. (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions." PNAS, 106, 2677-82.

2. Zielezinski, A., Vinga, S., Almeida, J., Karlowski, W.M. (2017). "Alignment-free sequence comparison: benefits, applications, and tools." Genome Biology, 18: 186.