Whole-genome viral phylogeny estimation without sequence alignment
Objectives:
To build phylogenies of viruses using complete genome data without multiple sequence alignment.
Introduction
When it comes to inferring phylogenies using sequence data, multiple sequence alignment (MSA) is de rigueur – yet, despite its usefulness, it can be challenging to produce an MSA when we aim to build phylogenies using whole genome information. The difficulties are articulated in Zielezinski et al. (2017).
Let us consider viral genomes. They typically mutate very rapidly, generating a highly mosaic genome that defies modeling using standard nucleotide substitution models. It can be tricky to identify homologous sequences in a set of viral genomes. Even in the best situation when the homologies are precisely known, the order of the genes in the genome is unlikely to be the same, thus making it necessary to make separate multiple sequence alignments of homologous sequences, followed by concatenation – very tedious work indeed!
A Machine Learning Perspective To Phylogeny Estimation
It turns out that looking at the phylogeny estimation problem from the perspective of machine learning can be quite refreshing. To begin with, we have the raw data – the whole genome, which is just a sequence consisting of four characters: A, G, C, and T. What sort of features can be engineered? Think about this for a second.
It might not be hard to come up with the following approach: consider a sequence of length k (a ‘k-mer’). There are 4k possible k-mers, which we take can use as our features. The value assigned to a k-mer feature is just the number of times that k-mer appears in the genome. This quickly enables the generation of a data matrix with n rows (the genomes of interest) and 4k columns.
After controlling for genome size, i.e., normalizing each row by its row total, each sequence can be thought of as being represented by a probability distribution of the 4k k-mers, a "feature-frequency profile" (Sims et al., 2009). The pairwise distance between two genomes can then be estimated using the Jensen-Shannon divergence measure. Subsequently, the resulting distance matrix can be used as input in a clustering algorithm, such as the neighbor-joining algorithm, thus summarising the relatedness of the viral genomes as a phylogenetic tree. All this without aligning any sequences!
Sounds reasonable – but how well does it work in practice?
The R Code
So here’s how we get everything done in R. First, we load the necessary R packages, and the data:
#setting up libraries
#Biostrings Bioconductor package
#source("http://bioconductor.org/biocLite.R")
#biocLite("Biostrings")
library(seqinr)
library(Biostrings)
library(ape)
library(textmineR)
#In the folder containing only the 12 viral sequences
nfiles <- length(dir())
seqdat <- vector("list", nfiles)
for(i in 1:nfiles){
seqdat[[i]] <- read.fasta(file=dir()[i])
}
label <- sapply(1:nfiles, function(k) unlist(strsplit(dir()[k], "[.]"))[1])
names(seqdat) <- label
seqdat_join <- lapply(seqdat, function(k) paste(toupper(unlist(k)),collapse="") )
The next step is to generate tabulate the frequency of all possible 5-mers in each of the viral genomes.
#enumerate all 5-mers
dna <- c("A","C","G","T")
kmer5 <- expand.grid(dna, dna, dna, dna, dna)
kmer5 <- apply(kmer5, 1, function(k) paste(k, collapse=""))
#function for counting all possible kmers (k=5) given a single dna string
kmercount <- function(data){
sapply(1:length(kmer5), function(k)
length(unlist(gregexpr2(kmer5[k], data)))
)}
#vector of counts for all possible kmers (k=5) for all viral sequences
kmer_features <- lapply(seqdat_join, function(k) kmercount(k))
#Collect k-mer counts into a data frame
M <- do.call(rbind, kmer_features)
We then take care of some labelling stuff.
#taxonomic labels
taxonomy <- data.frame(rownames(M),
c("Filoviridae", "Flaviviridae", "Picornaviridae", "Flaviviridae",
"Picornaviridae", "Flaviviridae", "Filoviridae", "Flaviviridae",
"Flaviviridae", "Picornaviridae", "Picornaviridae",
"Picornaviridae"))
colnames(taxonomy) <- c("Virus","Family")
#Simplify virus species names
virusnames <- sapply(1:nrow(taxonomy), function(k){
chop <- unlist(strsplit(as.character(taxonomy$Virus)[k],"_"))
chop[length(chop)]}
)
rownames(M) <- virusnames
tipcolor <- c("red","blue","darkviolet")[unclass(taxonomy$Family)]
We compute the appropriate distance between all pairs of 5-mer probability vectors for the viral genomes using the Jensen-Shannon divergence, and then use the resulting distance matrix as input for the bioNJ algorithm for tree construction.
#The correct input for CalcJSDivergence is the (unnormalised) count vector
JSdist <- CalcJSDivergence(M)
plot.phylo(bionj(JSdist), type="unrooted", cex=0.8, tip.color=tipcolor,
rotate.tree=95)
A principal component analysis (PCA) plot is also useful for corroborating the topology of the inferred tree. Here, it seems that the plot of the first and the fourth principal component gives the best ordination, whereby viruses from the same family are clustered together.
#Alternative visualisation
#Q-mode PCA for visualisation of ordination of virus species in low dimensional kmer space
M_norm <- t(apply(M, 1, function(k) k/sum(k)))
pca <- prcomp(M_norm)
summary(pca)
pairs(pca$x[,1:5], pch=16, cex=2,
col=c("red","blue","olivedrab")[unclass(taxonomy$Family)])
#PC1 vs PC4 seems to give best separation for the three viral families
plot(pca$x[,1], pca$x[,4], cex=2, xlim=c(-0.022,0.022), ylim=c(-0.01, 0.01),
pch=16, col=c("red","blue","darkviolet")[unclass(taxonomy$Family)],
xlab="PC1(38%)", ylab="PC5(6%)")
text(pca$x[,1], pca$x[,4]+0.001, virusnames, cex=0.7, srt=35)
legend(0.01,-0.007,pch=16, pt.cex=2,
col=c("red","blue","darkviolet"), c("Filoviridae","Flaviviridae","Picornaviridae"))
Download the entire R script bioinfoRmatics-series1.R.R and the virus genomes: virus_genomes.zip.
The Results
To test, I used 12 complete viral genomes from the NCBI database, 5 from the Flaviviridae family (which includes viruses in the tropics like the Zika and Dengue), 5 from the Picornaviridae family (which counts the flu and hand-foot-mouth disease viruses as members), and 2 deadly viruses: Ebola and Marburg, from the Filoviridae family. All members from the same family are correctly clustered in the inferred phylogeny (Fig. 1) and supported by results of principal component analysis (Fig. 2).
A Computationally Efficient Approach
The entire computation was complete in about 10 seconds, using an old laptop running on just 2GB of RAM! Try the R code.
You may also be interested in
Pair-wise sequence alignment methods
Construction of substitution matrices
Multiple sequence alignment (MSA) tools and resources
Single nucleotide polymorphism (SNP) tools and resources
Back to Tutorials Main Page
You may also like:
References
1. Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H. (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions." PNAS, 106, 2677-82.
2. Zielezinski, A., Vinga, S., Almeida, J., Karlowski, W.M. (2017). "Alignment-free sequence comparison: benefits, applications, and tools." Genome Biology, 18: 186.