EMBOSS matcher and supermatcher - incongruent results?

I am trying to align a sequence to the mouse genome. I know a priori that part of my sequence should align to chromosome 9, but not all of it.

I gathered that EMBOSS' matcher and supermatcher would be adequate tools to do this locally on my machine, with supermatcher being (a lot!) faster and matcher being (allegedly) more accurate. Strangely enough, the two programs give me very similar fits (quality-wise), which are, however, at non-identical but very close positions.

  • How can I explain this?
  • What are the odds that an ~800-bp sequence has multiple equally good fits right next to each other (how can I test if I hit a repeat-heavy area?)?
  • Why do matcher and supermatcher not give me both fits then?

My current alignments:

#=======================================
#
# Aligned_sequences: 2
# 1:
# 2: CM001002.2
# Matrix: EDNAFULL
# Gap_penalty: 16.0
# Extend_penalty: 4.0
#
# Length: 357
# Identity:     322/357 (90.2%)
# Similarity:   322/357 (90.2%)
# Gaps:           2/357 ( 0.6%)
# Score: 1458.0
#
#
#=======================================

                  1 AAAAACGTGAAAAATGAGAAATGCACACTGTAGGACCTGAAATATGGCAA 50
                    ||||.|..|.||||||||||||.||||||.||||||.|||||.|||||.|
CM001002.2 35305253 AAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAAAATGGCGA 35305302

                 51 GGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTATAGGACGTGG 100
                    ||||||||||||||||||||||||||||||||||||.||.||||||.|||
CM001002.2 35305303 GGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCTCTGTAGGACATGG 35305352

                101 AATATGGCAAGAAAAATGAAAATCATTGAAAATGAGAAACATACAGTTGA 150
                    |||||||||||||||.||||||||||.|||||||||||||||.||.||||
CM001002.2 35305353 AATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA 35305402

                151 CGACTTGAAAAATGATGAAATCACTGAAAAACGTGAAAAATGAGAAATGC 200
                    .||||||||||||||.|||||||.|.||||||||||||||||||||||||
CM001002.2 35305403 TGACTTGAAAAATGACGAAATCATTAAAAAACGTGAAAAATGAGAAATGC 35305452

                201 ACCCTGTAAGACCTGGAATATGTCGAGAAAACTGAAAATCACGGAAAATG 250
                    .|.|||.|.|||||||||||||..||||||||||||||||||||||||||
CM001002.2 35305453 CCACTGAAGGACCTGGAATATGGGGAGAAAACTGAAAATCACGGAAAATG 35305502

                251 AGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGAAAAAGG 300
                    ||||||||||||||||||||||||||||||||||||||||||||||||||
CM001002.2 35305503 AGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGAAAAAGG 35305552

                301 TGGAAAATTTAGAAATGTCCACTGTAGGACATGGAATAT--GGCAAGAAA 348
                    |||||.||||||||||||||||||||||||.||||||||  |.|.|||||
CM001002.2 35305553 TGGAATATTTAGAAATGTCCACTGTAGGACGTGGAATATAAGTCCAGAAA 35305602

                349 ACTGAAA 355
                    .||.|.|
CM001002.2 35305603 CCTAAGA 35305609


#=======================================
#
# Aligned_sequences: 2
# 1:
# 2: CM001002.2
# Matrix: EDNAFULL
# Gap_penalty: 16
# Extend_penalty: 4
#
# Length: 417
# Identity:     377/417 (90.4%)
# Similarity:   377/417 (90.4%)
# Gaps:           2/417 ( 0.5%)
# Score: 1713
#
#
#=======================================

          180       190       200       210       220
       TGAAAAACGTGAAAAATGAGAAATGCACCCTGTAAGACCTGGAATATGTC
       ::  : ::   : :: :  ::  ::::: ::: : ::::::::::    :
CM0010 TGTCACACACTATAATTTTGAGGTGCACACTGAAGGACCTGGAATTATGC
       35305200  35305210  35305220  35305230  35305240

          230       240       250       260       270
       GAGAAAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGA
       ::::::::::::::::::::::::::::::::::::::::::::::::::
CM0010 GAGAAAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGA
       35305250  35305260  35305270  35305280  35305290

          280       290       300       310       320
       AATATGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTG
       :: ::::::::::::::::::::::::::::::::::::::::::: :::
CM0010 AAAATGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCTCTG
       35305300  35305310  35305320  35305330  35305340

          330       340       350       360       370
       TAGGACATGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAAC
       ::::::::::::::::::::::::::::::::::::::::::::::::::
CM0010 TAGGACATGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAAC
       35305350  35305360  35305370  35305380  35305390

          380       390       400       410       420
       ATCCACTTGACGACGTGAAAAATGACGAAATCACTGAAAAACGTGAAAAA
       :::::::::: ::: :::::::::::::::::: : ::::::::::::::
CM0010 ATCCACTTGATGACTTGAAAAATGACGAAATCATTAAAAAACGTGAAAAA
       35305400  35305410  35305420  35305430  35305440

          430       440       450       460       470
       TGAGAAATGCACACTGTAGGACCTGGAATATGTCGAGAAAACTGAAAATC
       :::::::::: ::::: :::::::::::::::  ::::::::::::::::
CM0010 TGAGAAATGCCCACTGAAGGACCTGGAATATGGGGAGAAAACTGAAAATC
       35305450  35305460  35305470  35305480  35305490

          480       490       500       510       520
       ACGGAAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAA
       ::::::::::::::::::::::::::::::::::::::::::::::::::
CM0010 ACGGAAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAA
       35305500  35305510  35305520  35305530  35305540

          530       540       550       560       570
       CTGAAAAAGTTGGAAAATTTAGAAATGTCCATTGTAGGACATGGAATAT-
       ::::::::: ::::: ::::::::::::::: :::::::: ::::::::
CM0010 CTGAAAAAGGTGGAATATTTAGAAATGTCCACTGTAGGACGTGGAATATA
       35305550  35305560  35305570  35305580  35305590

            580
       -GGCAAGAAAACTGAAA
        : : ::::: :: : :
CM0010 AGTCCAGAAACCTAAGA
       35305600

Understanding the in vivo dynamics of protein localization and their physical interactions is important for many problems in biology. To enable systematic protein function interrogation in a multicellular context, we built a genome-scale transgenic platform for in vivo expression of fluorescent- and affinity-tagged proteins in Caenorhabditis elegans under endogenous cis regulatory control. The platform combines computer-assisted transgene design, massively parallel DNA engineering, and next-generation sequencing to generate a resource of 14,637 genomic DNA transgenes, which covers 73% of the proteome. The multipurpose tag used allows any protein of interest to be localized in vivo or affinity purified using standard tag-based assays. We illustrate the utility of the resource by systematic chromatin immunopurification and automated 4D imaging, which produced detailed DNA binding and cell/tissue distribution maps for key transcription factor proteins.

  • A genome-wide resource for in vivo expression of tagged proteins was engineered
  • The tagged gene alleles provide native protein expression and localization patterns
  • Tag-based ChIP provides genome-wide DNA binding site maps for key transcription factors
  • Live 4D tracing reveals rapid transcription factor protein localization dynamics

Present address: Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA

See also

Program name     Description
cathparse        Generate DCF file from raw CATH files
domainnr         Remove redundant domains from a DCF file
domainrep        Reorder DCF file to identify representative structures
domainseqs       Add sequence records to a DCF file
domainsse        Add secondary structure records to a DCF file
helixturnhelix   Identify nucleic acid-binding motifs in protein sequences
libgen           Generate discriminating elements from alignments
matcher          Waterman-Eggert local alignment of two sequences
matgen3d         Generate a 3D-1D scoring matrix from CCF files
oalistat         Statistics for multiple alignment files
pepcoil          Predict coiled coil regions in protein sequences
rocon            Generate a hits file from comparing two DHF files
rocplot          Perform ROC analysis on hits files
scopparse        Generate DCF file from raw SCOP files
seqalign         Extend alignments (DAF file) with sequences (DHF file)
seqfraggle       Remove fragment sequences from DHF files
seqmatchall      All-against-all word comparison of a sequence set
seqsort          Remove ambiguous classified sequences from DHF files
seqwords         Generate DHF files from keyword search of UniProt
ssematch         Search a DCF file for secondary structure matches
supermatcher     Calculate approximate local pair-wise alignments of larger sequences
water            Smith-Waterman local alignment of sequences
wordfinder       Match large sequences against one or more other sequences
wordmatch        Find regions of identity (exact matches) of two sequences


The following message may appear in the log file.

Replaced ' ' in STAMP alignment with 'X' (STAMP can insert non-sensical whitespaces into its alignments, e.g. instead of a residue character where that residue was missing electron density in the PDB file. DOMAINALIGN replaces each whitespace within a STAMP alignment with an "X").


Jon Ison
The European Bioinformatics Institute Wellcome Trust Genome Campus Cambridge CB10 1SD UK


Please cite the authors and EMBOSS.

Rice P, Longden I and Bleasby A (2000) "EMBOSS - The European Molecular Biology Open Software Suite" Trends in Genetics, 16:276-277.

14.1 Other useful references

Russell, R. B. & Barton, G. J. (1992), Multiple Sequence Alignment from Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels, PROTEINS: Struct. Funct. Genet., 14, 309-323.
C. Notredame, D. Higgins, J. Heringa. T-Coffee: A novel method for multiple sequence alignments. Journal of Molecular Biology, 302, 205-217, (2000)


People can come to falsely remember performing actions that they have not actually performed. Common accounts of such false action memories have invoked source confusion from the overlap of sensory features but largely ignored the role of motor processes. We addressed this lacuna with a paradigm in which participants first perform (vs. do not perform) actions and then observe another person performing some of the non-performed actions. In this paradigm, observation of videos showing another’s actions can later induce false self-attributions of these actions, the observation-inflation effect. Contrary to a sensory-feature account but consistent with a motor-simulation account, we found the effect even with perceptually impoverished action videos in which the majority of sensory features is absent, but motion cues are preserved (Experiment 1). We then created conditions during action observation that should (vs. should not) impede motor simulation. As predicted we found that the effect of observation was reduced when participants executed movements that were incongruent (vs. congruent) with the observed actions (Experiment 2). We discuss the processes that can produce associations of self with observed others’ actions and later affect observers’ action memory.


In this study, we constructed a high-quality chromosome-level genome assembly for wintersweet by combining the long-read sequences from PacBio with highly accurate short reads from Illumina sequencing and using Hi-C data for super-scaffolding. The assembly of wintersweet adds to the growing body of genome information for the Calycanthaceae family. As a relatively domesticated species in the Calycanthaceae family [48], wintersweet has a range of specific biological features such as early blooming in deep winter, strong cold resistance, and fragrant flowers [4, 10, 49]. As a representative of the magnoliids, it also occupies a key evolutionary position on the tree of life. The availability of the wintersweet genome sequence makes it possible to address deep angiosperm phylogenetic questions, determine genome evolution signatures, and reveal the genetic basis of interesting traits. This assembly also facilitates in-depth comparative genomic analysis to elucidate biology and resolve genome evolution between wintersweet and other species within the Calycanthaceae family.

The relationship among magnoliids, monocots, and eudicots has not been conclusively resolved, despite numerous attempts. In four independent studies, four genomes representing three orders (Magnoliales, Piperales, and Laurales) within the magnoliids have been published [13,14,15,16], and each study attempted to resolve the phylogenetic position of the magnoliids. Three species, Piper nigrum (representative of the Piperales clade), L. tulipifera (representative of the Magnoliales clade), and P. americana (representative of the Laurales clade), were placed as sister to the monocots and eudicots, while C. kanehirae (representative of the Laurales clade) was found to be a sister clade to the eudicots. Many factors could be responsible for these topological differences, such as taxon sample size [50], possible incomplete lineage sorting (ILS) [17], and the number of retrieved orthologs [51]. For example, adequate taxon sampling, especially of smaller sister lineages such as Chloranthales in angiosperm clades [52], was vital to obtain a resolved phylogeny. To account for incomplete lineage sorting, we used two complementary tools to extract the single-copy genes and two methods (coalescent and concatenation-based analysis) to reconstruct the phylogeny. In addition, we improved taxon sampling, selected key lineages (representatives of the Chloranthales clade) as well as additional lineages in the monocots and eudicots, and included five magnoliids to cover key representative clades. All the analyses recovered the magnoliids together with eudicots as sister to the monocots. This result is congruent with a recent study of 59 low-copy nuclear genes from 26 mesangiosperm transcriptomes [51] and with 410 single-copy nuclear gene families extracted from genomic and transcriptomic data from 1153 species [53], but disagrees with the plastid trees, which supported a topology of magnoliids as the sister to monocots and eudicots. In comparison with nuclear genes, the plastid genes are uniparentally inherited and may recover different deep-level relationships resulting from ancient lineage sorting and hybridization, which might introduce biases and errors to phylogenetic reconstruction [51]. To date, genome data are still absent for key clades of angiosperms, such as Chloranthales. Even though we have suggested a robust phylogeny using “genome-scale” data, sequencing of the complete angiosperm lineages will facilitate future investigations of the phylogenetic relationships of flowering plants.

Wintersweet is one of the very few flowering plant lineages that bloom in winter, which makes it an ideal perennial plant for flowering-time study. The database of flowering-time gene networks in Arabidopsis thaliana can be used to identify the homologs of flowering-time genes in wintersweet. Comparative transcriptome analyses provide an array of resources for further flowering-time-related gene identification. Mapping quantitative trait loci (QTL) onto linkage maps, with segregating genetic populations, is a powerful strategy for dissecting complex agronomical characters [54]. The availability of a high-quality genome and diverse wintersweet germplasm with different flowering times makes it possible to use this genetic approach to detect flowering-time quantitative trait loci in the future. The petaloid sepal is another striking feature of wintersweet. This flower structure also exists in some basal eudicots (such as Ranunculus and Aquilegia), some monocots (such as Lilium and Tulipa), and basal angiosperm lineages, and is thought to have been present in the ancestral angiosperm flower [55]. The broad expression pattern of B-function genes was shared by these species, which may represent the ancestral condition for angiosperms. The genetic network for seasonal temperature-mediated control of bud break has been elucidated in the vegetative bud of hybrid aspen [39]. In this genetic network, FT and SVL are the homologs of FT and SVP in Arabidopsis, respectively, both of which act as flowering regulators [56]. Similar to the vegetative buds, the floral buds are also subject to dormancy and bud break. The homologs of the key components in wintersweet displayed a similar expression pattern during the transition from endodormancy to the bud break stage, leading us to the hypothesis that wintersweet may utilize common signaling components in both the flowering and bud break processes.

The evolution, adaptation, and domestication of wintersweet resulted in specific qualities and quantities of floral volatiles, primarily consisting of monoterpenes and benzenoids [7]. The diversification of terpenes is mainly determined by the TPS family genes, among which the TPS-b subfamily is well known for monoterpene synthesis [44]. The extensive expansion of TPS-b subfamily genes in the wintersweet genome may be one explanation for diverse monoterpene accumulation. The production of terpenes is regulated to a large extent by the transcription level of TPS genes [43]. The results of the present expression analyses revealed a dynamic expression of the TPS genes, which may be another explanation for the monoterpene diversification. Using the genomic data, we found remarkable duplications of the metabolic genes in both terpene and benzenoid/phenylpropanoid biosynthesis pathways, especially in the TPS and BEAT genes, which are responsible for the production of the major components (linalool and benzyl acetate). Tandem duplication is the major contributor to the expansions of TPS and BEAT genes, and most of these duplicated genes are tandemly organized in clusters. In the Drosophila melanogaster genome, the Adh gene is tandemly duplicated and shows a 2.6-fold greater expression than the single-copy gene. The overactivity caused by the tandem arrangement was proposed to be a general property of tandem gene duplicates [57]. The greater output of the tandem arrangement in the TPS and BEAT genes may increase transcript abundance of the tandem duplicates and thereby lead to the mass production of major components. Based on our data, we speculate that the remarkable duplication, tandem clustering of genes, and gene expression dynamics may contribute to the formation of the abundant characteristic aroma in wintersweet.


Data sets

Twelve sequence data sets were used to evaluate AF methods across five research areas (Table 1).

Protein homology

The reference data sets of protein family members sharing a high (≥ 40%) and low (< 40%) sequence identity were constructed based on two sections of the SCOPe database v. 2.07 [68], namely, ASTRAL95 and ASTRAL40 v. 2.07 [86], respectively. The SCOPe database provides a structural classification of proteins at four levels: classes (proteins with similar secondary structure composition, but different sequences and overall tertiary structures), folds (protein domains of similar topology and structure without detectable sequence similarity), superfamilies (proteins with similar structures and weak sequence similarity), and families (proteins with readily detectable sequence similarity). According to previous studies [5, 8], the ASTRAL data sets were subsequently trimmed to exclude sequences with unknown amino acids and families with fewer than 5 proteins and included only the four major classes (i.e., α, β, α/β, and α + β). To minimize the requirements for AF method submission related to performing all-versus-all sequence comparisons and uploading the output to the AFproject server, we further reduced the data sets by randomly selecting only two protein members in each family. As ASTRAL95 also contains protein family members sharing a sequence identity lower than 40%, the Needleman–Wunsch alignment was performed (using needle software in the EMBOSS package [87]) to select proteins with a sequence identity ≥ 40% to acquire a reference data set of proteins with high sequence identity.

Gene trees

Reference trees and corresponding protein sequences of eleven gene families were downloaded from SwissTree release 2017.0 [58, 88]: Popeye domain-containing protein family (49 genes), NOX “ancestral-type” subfamily NADPH oxidases (54 genes), V-type ATPase beta subunit (49 genes), serine incorporator family (115 genes), SUMF family (29 genes), ribosomal protein S10/S20 (60 genes), Bambi family (42 genes), Asterix family (39 genes), cited family (34 genes), Glycosyl hydrolase 14 family (159 genes), and Ant transformer protein (21 genes).

Gene regulatory elements

The data set of CRMs known to regulate expression in the same tissue and/or developmental stage in fly or human was obtained from Kantorovitz et al. [6]. The data set was specifically selected to test the capacity of AF measures to identify functional relationships among regulatory sequences (e.g., enhancers or promoters). The data set contains 185 CRM sequences taken from D. melanogaster—blastoderm-stage embryo (n = 82), eye (n = 17), peripheral nervous system (n = 23), and tracheal system (n = 9)—and Homo sapiens—HBB complex (n = 17), liver (n = 9), and muscle (n = 28).

Genome-based phylogeny

The sequences of 25 whole mitochondrial genomes of fish species from the suborder Labroidei and the species tree were taken from Fischer et al. [59]. The set of 29 E. coli genome sequences was originally compiled by Yin and Jin [23] and has been used in the past by other groups to evaluate AF programs [24, 25, 89]. Finally, the set of 14 plant genomes is from Hatje et al. [90]. This set was also used in the past to evaluate AF methods. To simulate unassembled reads from these data sets, we used the program ART [91].

Horizontal gene transfer

The 27 E. coli and Shigella genomes, and the 8 Yersinia genomes, were taken from Bernard et al. [62]. We used EvolSimulator [92] to simulate HGT in microbial genomes, adopting an approach similar to that described in Bernard et al. [62]. The HGT events were simulated to occur at random, i.e., anywhere along a genomic sequence and between any pair of genomes in a set. Each set of genomes was simulated under a birth-and-death model at speciation rate = extinction rate = 0.5. The number of genomes in each set was allowed to vary from 25 to 35, with each containing 2000–3000 genes 240–1500 nucleotides long. HGT receptivity was set at a minimum of 0.2, mean of 0.5, and maximum of 0.8, with a mutation rate m = 0.4–0.6 and a number of generations i = 5000. The varying extent of HGT was simulated using the mean number of HGT events attempted per iteration l = 0, 250, 500, 750, and 1000, and divergence factor d = 2000 (transferred genes that are of high sequence divergence, i.e., > 2000 iterations apart, will not be successful). All other parameters in this simulation followed Beiko et al. [92].

Alignment-free tools

AAF [38] reconstructs a phylogeny directly from unassembled next-generation sequencing reads. Specifically, AAF calculates the Jaccard distance between sets of k-mers of two samples of short sequence reads. This distance between samples or species is based on the estimate of the rate parameter from a Poisson process for a mutation occurring at a single nucleotide. The phylogeny is constructed using weighted least squares with weights proportional to the expected variance of the estimated distances. AAF provides features for correcting tip branches and bootstrapping of the obtained phylogenetic trees, directly addressing the problems of sequencing error and incomplete coverage.
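The k-mer comparison at the core of this description can be sketched as follows. This is a minimal illustration of a Jaccard distance between the k-mer sets of two read samples only; the function names are hypothetical, and the Poisson-based rate estimate, tip correction, and bootstrapping that AAF performs are not reproduced here.

```python
def kmer_set(reads, k):
    """Collect every k-mer observed in a collection of reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def jaccard_distance(reads_a, reads_b, k):
    """Return 1 - |A ∩ B| / |A ∪ B| for the k-mer sets of two samples."""
    a = kmer_set(reads_a, k)
    b = kmer_set(reads_b, k)
    union = a | b
    if not union:
        # Assumption: two samples without any k-mers are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    print(jaccard_distance(["ACGT"], ["ACGA"], k=2))
```

Sets are used rather than counts because the Jaccard measure only considers the presence or absence of each k-mer.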

AFKS [34] is a package for calculating 33 k-mer-based dissimilarity/distance measures between nucleotide or protein sequences. AFKS categorizes the measures into nine families: Minkowski (e.g., Euclidean), Mismatch (e.g., Jaccard), Intersection (e.g., Kulczynski), D2 (e.g., D2s), Squared Chord (e.g., Hellinger), Inner Product (e.g., normalized vectors), Markov (e.g., SimMM), Divergence (e.g., KL Conditional), and Others (e.g., length difference). The tool determines the optimal k-mer size for given input sequences and calculates dissimilarity/distance measures between k-mer counts that include pseudocounts (adding 1 to each k-mer count). The obtained distance is standardized to between 0 and 1.

alfpy [5] provides 38 AF dissimilarity measures with which to calculate distances among given nucleotide or protein sequences. The tool includes 25 k-mer-based measures (e.g., Euclidean, Minkowski, Jaccard, and Hamming), eight information-theoretic measures (e.g., Lempel–Ziv complexity and normalized compression distance), three graph-based measures, and two hybrid measures (e.g., Kullback–Leibler divergence and W-metric). alfpy is also available as a web application and Python package. In this study, the results based on 14 dissimilarity measures are evaluated.

ALFRED-G [45] uses an efficient algorithm to calculate the length of maximal k-mismatch common substrings between two sequences. Specifically, to measure the degree of dissimilarity between two nucleic acid or protein sequences, the program calculates the length of maximal word pairs—one word from each of the sequences—with up to k mismatches.

andi [24] estimates phylogenetic distances between genomes of closely related species by identifying pairs of maximal unique word matches a certain distance from each other and on the same diagonal in the comparison matrix of two sequences. Such word matches can be efficiently found using enhanced suffix arrays. The tool then uses these gap-free alignments to estimate the number of substitutions per position.

CAFE [36] is a package for efficient calculation of 28 AF dissimilarity measures, including 10 conventional measures based on k-mer counts, such as Chebyshev, Euclidean, Manhattan, uncentered correlation distance, and Jensen–Shannon divergence. It also offers 15 measures based on the presence/absence of k-mers, such as Jaccard and Hamming distances. Most importantly, it provides a fast calculation of background-adjusted dissimilarity measures including CVTree, d2star, and d2shepp. CAFE allows for both assembled genome sequences and unassembled next-generation sequencing shotgun reads as inputs. However, it does not deal with amino acid sequences. In this study, the results based on CVTree, d2star, and d2shepp are evaluated.

co-phylog [23] estimates evolutionary distances among assembled or unassembled genomic sequences of closely related microbial organisms. The tool finds short, gap-free alignments of a fixed length and consisting of matching nucleotide pairs only, except for the middle position in each alignment, where mismatches are allowed. Phylogenetic distances are estimated from the fraction of such alignments for which the middle position is a mismatch.
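A simplified sketch of the idea described above, not the co-phylog implementation itself: each window is split into a flanking "context" and a middle base, contexts shared by both sequences are paired, and the fraction of pairs whose middle bases differ is reported. The function names, the default flank length, and the choice to drop contexts that occur with more than one middle base in a sequence are assumptions made for illustration.

```python
def context_table(seq, flank):
    """Map each (left flank, right flank) context to its middle base.

    Contexts that occur with different middle bases inside the same
    sequence are discarded as ambiguous (an assumption of this sketch).
    """
    table = {}
    ambiguous = set()
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        context = (window[:flank], window[flank + 1:])
        middle = window[flank]
        if context in table and table[context] != middle:
            ambiguous.add(context)
        table[context] = middle
    for context in ambiguous:
        del table[context]
    return table


def middle_mismatch_fraction(seq_a, seq_b, flank=4):
    """Fraction of shared contexts whose middle bases differ, or None."""
    a = context_table(seq_a, flank)
    b = context_table(seq_b, flank)
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no gap-free alignment with matching flanks was found
    mismatches = sum(1 for context in shared if a[context] != b[context])
    return mismatches / len(shared)


if __name__ == "__main__":
    print(middle_mismatch_fraction("ACGTTGCA", "ACGATGCA", flank=2))
```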

EP-sim [53] computes an AF distance between nucleotide or amino acid sequences based on entropic profiles [93, 94]. The entropic profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. For each position, it computes a score based on the Shannon entropies of the word distribution and variable-length word counts. EP-sim estimates a phylogenetic distance, similar to D2, by summing the entropic profile scores over all positions, or similar to \(D_2^*\), with the sum of normalized entropic profile scores.

FFP [35, 39] estimates the distances among nucleotide or amino acid sequences. The tool calculates the count of each k-mer and then divides the count by the total count of all k-mers to normalize the counts into frequencies of a given sequence. This process leads to the conversion of each sequence into its feature frequency profile (FFP). The pairwise distance between two sequences is then calculated by the Jensen–Shannon divergence between their respective FFPs.
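The two steps described above (normalizing k-mer counts into a frequency profile, then comparing profiles with the Jensen–Shannon divergence) can be sketched as below. The function names are hypothetical, and base-2 logarithms are an assumption chosen so that the divergence lies between 0 and 1; the FFP package itself may differ in such details.

```python
from collections import Counter
from math import log2


def feature_frequency_profile(seq, k):
    """Count every k-mer and divide by the total number of k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi = p.get(kmer, 0.0)
        qi = q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd


if __name__ == "__main__":
    p = feature_frequency_profile("ACGTACGTAC", 3)
    q = feature_frequency_profile("ACGTTCGTAC", 3)
    print(jensen_shannon_divergence(p, q))
```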

FSWM [26] estimates the phylogenetic distance between two DNA sequences. The program first defines a fixed binary pattern P of length l representing “match positions” and “don’t care positions.” Then, it identifies all “Spaced-word Matches” (SpaM) w.r.t. P, i.e., gap-free local alignments of the input sequences of length l, with matching nucleotides at the “match positions” of P and possible mismatches at the “do not care” positions. To estimate the distance between two DNA sequences, SpaMs with low overall similarity are discarded, and the remaining SpaMs are used to estimate the distance between the sequences, based on the mismatch ratio at the “do not care” positions. There is a version of FSWM that can compare sets of unassembled sequencing reads to each other called Read-SpaM [48].

jD2Stat [37] utilizes a series of D2 statistics [17, 18] to extract k-mers from a set of biological sequences and generate pairwise distances for each possible pair as a matrix. For each sequence set, we generated distance matrices (at the defined k; Additional file 1: Table S1), each using \(D_2^S\) (D2S; exact k-mer counts normalized based on the probability of occurrence of specific k-mers), \(D_2^*\) (d2St; similar to \(D_2^S\) but normalized based on means and variance), and \(D_2^n\) (d2n; extension of D2 that expands each word w recovered in the sequences to its neighborhood n, i.e., all possible k-mers with n number of wildcard residues, relative to w).

kmacs [20] compares two DNA or protein sequences by searching for the longest common substrings with up to k mismatches. More precisely, for each position i in one sequence, the program identifies the longest pair of substrings with up to k mismatches, starting at i in the first sequence and somewhere in the second sequence. The average length of these substring pairs is then used to define the distance between the sequences.
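The quantity described above can be illustrated with a naive brute-force sketch. It tries every start position in the second sequence, so it runs in roughly cubic time and is only meant to make the definition concrete; kmacs itself uses a much faster approach, and the function names here are hypothetical. How the average length is turned into a distance is not stated above and is therefore left out.

```python
def extend_with_mismatches(a, i, b, j, k):
    """Length of the longest common extension from a[i] and b[j] with <= k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            if mismatches == k:
                break  # the next mismatch would exceed the budget
            mismatches += 1
        length += 1
    return length


def average_k_mismatch_length(a, b, k):
    """Average, over positions i of a, of the longest k-mismatch match into b."""
    if not a or not b:
        return 0.0
    total = 0
    for i in range(len(a)):
        total += max(extend_with_mismatches(a, i, b, j, k) for j in range(len(b)))
    return total / len(a)


if __name__ == "__main__":
    print(average_k_mismatch_length("ACGTACGT", "ACGAACGT", k=1))
```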

kr [46] estimates the evolutionary distance between genomes by calculating the number of substitutions per site. The estimator for the rate of substitutions between two unaligned sequences depends on a mathematical model of DNA sequence evolution and average shortest unique substring (shustring) length.

kSNP3 [52] identifies single nucleotide polymorphisms (SNPs) in a set of genome sequences without the need for genome alignment or a reference genome. The tool defines a SNP locus as the k-mers surrounding a central SNP allele. kSNP3 can analyze complete genomes, draft genomes at the assembly stage, genomes at the raw reads stage, or any combination of these stages. Based on the identified SNPs, kSNP3.0 estimates phylogenetic trees by parsimony, neighbor-joining, and maximum-likelihood methods and reports a consensus tree with the number of SNPs unique to each node.

kWIP [44] estimates genetic dissimilarity between samples directly from next-generation sequencing data without the need for a reference genome. The tool uses the weighted inner product (WIP) metric, which aims at reducing the effect of technical and biological noise and elevating the relevant genetic signal by weighting k-mer counts by their informational entropy across the analysis set. This procedure downweights k-mers that are typically uninformative (highly abundant or present in very few samples).

LZW-Kernel [40] classifies protein sequences and identifies remote protein homology via a convolutional kernel function. LZW-Kernel exploits code blocks detected by the universal Lempel–Ziv–Welch (LZW) text compressors and then builds a kernel function out of them. LZW-Kernel provides a similarity score between sequences from 0 to 1, which can be directly used with support vector machines (SVMs) in classification problems. LZW-Kernel can also estimate the distance between protein sequences using normalized compression distances (LZW-NCD).

mash [11] estimates the evolutionary distance between nucleotide or amino acid sequences. The tool uses the MinHash algorithm to reduce the input sequences to small “sketches,” which allow fast distance estimations with low storage and memory requirements. To create a “sketch,” each k-mer in a sequence is hashed, which creates a pseudorandom identifier (hash). By sorting these hashes, a small subset from the top of the sorted list can represent the entire sequence (min-hashes). Two sketches are compared to provide an estimate of the Jaccard index (i.e., the fraction of shared hashes) and the Mash distance, which estimates the rate of sequence mutation under an evolutionary model.
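A minimal sketch of the sketching and comparison steps described above. Several details are assumptions made for illustration: SHA-1 from Python's standard library stands in for the hash function, k-mers are not canonicalized against their reverse complement, and the distance formula D = -(1/k) ln(2j / (1 + j)) is taken from the Mash publication rather than from the text above. Function names are hypothetical.

```python
import hashlib
from math import log


def sketch(seq, k, size):
    """Hash every k-mer and keep the `size` smallest distinct hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from the `size` smallest hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    in_both = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in in_both) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a distance for k-mer size k."""
    if j == 0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    a = sketch("ACGTACGTGGCATTACG", k=5, size=8)
    b = sketch("ACGTACGAGGCATTACG", k=5, size=8)
    j = jaccard_estimate(a, b, size=8)
    print(j, mash_distance(j, k=5))
```

Keeping only the smallest hashes makes the sketch size independent of genome length, which is what keeps storage and comparison cheap.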

Multi-SpaM [25], similar to FSWM, starts with a binary pattern P of length l representing “match positions” and “don’t care positions.” It then searches for four-way Spaced-word Matches (SpaMs) w.r.t. P, i.e., local gap-free alignments of length l involving four sequences each and with identical nucleotides at the “match positions” and possible mismatches at the “do not care positions.” Up to 1,000,000 such multiple SpaMs with a score above some threshold are randomly sampled, and a quartet tree is calculated for each of them with RAxML [95]. The program Quartet Max-Cut [96] is used to calculate a final tree of all input sequences from the obtained quartet trees.

phylonium [49] estimates phylogenetic distances among closely related genomes. The tool selects one reference from a given set of sequences and finds matching sequence segments of all other sequences against this reference. These long and unique matching segments (anchors) are calculated using an enhanced suffix array. Two equidistant anchors constitute a homologous region, in which SNPs are counted. From this SNP analysis, phylonium estimates the evolutionary distances between the sequences.

RTD-Phylogeny [51] computes phylogenetic distances among nucleotide or protein sequences based on the time required for the reappearance of k-mers. The time refers to the number of residues in successive appearance of particular k-mers. Thus, the occurrence of each k-mer in a sequence is calculated in the form of a return time distribution (RTD), which is then summarized using the mean (μ) and standard deviation (σ). As a result, each sequence is represented in the form of a numeric vector of size 2·4^k containing the μ and σ of 4^k RTDs. The pairwise distance between sequences is calculated using Euclidean distance.
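A minimal sketch of the return-time representation for DNA follows. Two conventions here are assumptions rather than details taken from the tool: the return time is measured as the difference between successive start positions, and k-mers that occur fewer than twice contribute μ = σ = 0. The function names are hypothetical.

```python
from itertools import product
from math import sqrt


def rtd_vector(seq, k, alphabet="ACGT"):
    """Return [mu_1, sigma_1, mu_2, sigma_2, ...] over all k-mers of the alphabet."""
    vec = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        # Start positions of every occurrence of this k-mer.
        positions = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]
        # Return times: gaps between successive occurrences (assumed convention).
        returns = [b - a for a, b in zip(positions, positions[1:])]
        if returns:
            mu = sum(returns) / len(returns)
            sigma = sqrt(sum((r - mu) ** 2 for r in returns) / len(returns))
        else:
            mu = sigma = 0.0  # assumption: no return observed
        vec.extend([mu, sigma])
    return vec


def euclidean(u, v):
    """Euclidean distance between two equally long numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For DNA and a given k, the vector has 2·4^k entries, matching the size stated above.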

Skmer [50] estimates phylogenetic distances between samples of raw sequencing reads. Skmer runs mash [11] internally to compute the k-mer profile of genome skims and their intersection and estimates the genomic distances by correcting for the effect of low coverage and sequencing error. The tool can estimate distances between samples with high accuracy from low-coverage and mixed-coverage genome skims with no prior knowledge of the coverage or the sequencing error.

Slope-SpaM [97] estimates the phylogenetic distance between two DNA sequences by calculating the number Nk of k-mer matches for a range of values of k. The distance between the sequences can then be accurately estimated from the slope of a certain function that depends on Nk. Instead of exact word matches, the program can also use SpaMs w.r.t. a predefined binary pattern of “match positions” and “don’t care positions.”

spaced [41,42,43] is similar to previous methods that compare the k-mer composition of DNA or protein sequences. However, the program uses the so-called spaced words instead of k-mers. For a given binary pattern P of length l representing “match positions” and “don’t care positions,” a spaced word w.r.t. P is a word of length l with nucleotide or amino acid symbols at the “match positions” and “wildcard characters” at the “do not care positions.” The advantage of using spaced words instead of exact k-mers is that the obtained results are statistically more stable. This idea has been previously proposed for database searching [98, 99]. The original version of Spaced [41] used the Euclidean or Jensen–Shannon [100] distance to compare the spaced-word composition of genomic sequences. By default, the program now uses a distance measure introduced by Morgenstern et al. [43] that estimates the number of substitutions per sequence position.
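To make the notion of a spaced word concrete, here is a small illustrative sketch. It only counts spaced-word occurrences for one pattern and does not implement any of the program's distance measures; the pattern encoding ("1" for match positions, "0" for don't-care positions) and the function name are assumptions.

```python
def spaced_words(seq, pattern):
    """Count spaced words of seq w.r.t. a binary pattern.

    '1' in the pattern marks a match position (symbol kept),
    '0' marks a don't-care position (replaced by the wildcard '*').
    """
    length = len(pattern)
    counts = {}
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        word = "".join(c if p == "1" else "*" for c, p in zip(window, pattern))
        counts[word] = counts.get(word, 0) + 1
    return counts
```

For example, with the pattern `101` the windows `ACG` and `AGG` both map to the spaced word `A*G`, so a mismatch at the don't-care position does not break the match.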

Underlying Approach [47] estimates phylogenetic distances between whole genomes using matching statistics of common words between two sequences. The matching statistics are derived from a small set of independent subwords with variable lengths (termed irredundant common subwords). The dissimilarity between sequences is calculated based on the length of the longest common subwords, such that each region of genomes contributes only once, thus avoiding counting shared subwords multiple times (i.e., subwords occurring in genomic regions covered by other more significant subwords are discarded).


Evaluation of structural and evolutionary relationships among proteins

To test the capacity of AF distance measures to recognize SCOPe relationships (i.e., family, superfamily, fold, and class), we used a benchmarking protocol from previous studies [5, 8]. Accordingly, the benchmarking procedure takes the distances between all sequence pairs present in the data set file. The distances between all protein pairs are subsequently sorted from minimum to maximum (i.e., from the maximum to minimum similarity). The comparative test procedure is based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in the SCOPe database and 0 corresponds to other outcomes. The group can be defined at one of the four different levels of the database (family, superfamily, fold, and class), exploring the hierarchical organization of the proteins in that structure. Therefore, each protein pair is associated with four binary classifications, one for each level. At each SCOPe level, ROC curves and AUC values computed in scikit-learn [101] are obtained to give a unique number of the relative accuracy of each metric and level according to the SCOP classification scheme. The overall assessment of method accuracy is an average of AUC values across all four SCOPe levels.
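The per-level evaluation can be sketched as follows. The benchmark itself computes ROC curves and AUC with scikit-learn; the version below is a dependency-free stand-in that uses the equivalent rank interpretation of AUC (the probability that a same-group pair has a smaller distance than a different-group pair, with ties counted as one half). The function name is hypothetical.

```python
def auc_from_distances(distances, labels):
    """AUC for ranking protein pairs by distance.

    distances: one distance per protein pair (smaller = more similar).
    labels: 1 if the pair shares the same group at the chosen level, else 0.
    """
    pos = [d for d, lab in zip(distances, labels) if lab == 1]
    neg = [d for d, lab in zip(distances, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Count positive/negative comparisons won by the positive pair.
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Running this once per level (family, superfamily, fold, class) and averaging the four values corresponds to the overall assessment described above. The pairwise comparison is quadratic in the number of pairs, so it is meant for illustration only.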

Evaluation of functionally related regulatory sequences

To test how well AF methods can capture the similarity between sequences with similar functional roles, we used the original benchmarking protocol introduced by Kantorovitz et al. [6]. Briefly, a set of CRMs known to regulate expression in the same tissue and/or developmental stage is taken as the “positive” set. An equally sized set of randomly chosen noncoding sequences with lengths matching the CRMs is taken as the “negative” set. Each pair of sequences in the positive set is compared, as is each pair in the negative set. The test evaluates if functionally related CRM sequence pairs (from the positive half) are better scored by a given AF tool (i.e., have lower distance/dissimilarity values) than unrelated pairs of sequences (from the negative half). This procedure is done by sorting all pairs, whether they are from the positive set or the negative set, in one combined list and then counting how many of the pairs in the top half of this list are from the positive set. The overall assessment of method accuracy is the weighted average of the positive pairs across all seven subsets.
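For a single subset, the counting step might look like the sketch below. The weighting used to combine the seven subsets is not specified here, so it is left out; the function name is hypothetical, and ties are broken in favor of positive pairs simply because they are listed first.

```python
def top_half_positive_fraction(positive_distances, negative_distances):
    """Fraction of pairs in the top (most similar) half that are positive pairs.

    Both arguments are lists of distances for sequence pairs drawn from the
    positive (functionally related) and negative (random) sets.
    """
    pairs = [(d, 1) for d in positive_distances]
    pairs += [(d, 0) for d in negative_distances]
    pairs.sort(key=lambda item: item[0])  # smallest distance first
    top_half = pairs[:len(pairs) // 2]
    return sum(label for _, label in top_half) / len(top_half)
```

A method that separates the two sets perfectly scores 1.0 on a subset, and a method that ranks pairs at random is expected to score about 0.5.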

Evaluation of phylogenetic inference

The accuracy of AF methods for data sets from three categories (gene tree inference, genome-based phylogeny, and horizontal gene transfer) was evaluated by a comparison of topology between the method’s tree and the reference tree. The pairwise sequence distances obtained by the AF method were used as input for the neighbor-joining algorithm (fneighbor in the EMBOSS package [87], version: EMBOSS: PHYLIPNEW:3.69.650) to generate the corresponding method tree. To assess the degree of topological (dis)agreement between the inferred and reference trees, we calculated the normalized Robinson–Foulds (nRF) distance [63] using the function in the ETE3 [102] toolkit for phylogenetic trees with the option unrooted = True. The Robinson–Foulds (RF) distance is a measure for the dissimilarity between two tree topologies with the same number of leaves and the same labels (species) at the leaves, i.e., it measures the dissimilarity of branching patterns and ignores branch lengths. More specifically, the RF distance between two trees is defined as the number of certain edit operations that are necessary to transform the first topology into the second topology (or vice versa). Equivalently, one can define the RF distance between two topologies by considering bipartitions of the leaves (species) of the trees, obtained by removing edges from the trees. The RF distance is then the number of bipartitions that can be obtained only from one tree but not from the respective other tree. The nRF measure normalizes the RF distance such that the maximal possible nRF distance for the given number of leaves is set to 1. Thus, the nRF distance has values between 0 and 1 with 0 for identical tree topologies and 1 for maximally dissimilar topologies, where no bipartition in the reference is recovered.
Given certain shortcomings of the nRF distance, such as rapid saturation (i.e., relatively minor differences between trees can result in the maximum distance value) [103] and imprecise values (i.e., the number of unique values that the metric can take is two fewer than the number of taxa) [104], we supplemented the AFproject service with an additional measure of topological disagreement, the normalized Quartet Distance (nQD) [105], which is the fraction of subsets of four leaves that are not related by the same topology in both trees.
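The bipartition definition of the RF distance can be illustrated with a short, dependency-free sketch. The benchmark itself relies on ETE3; the code below is only a stand-in under stated assumptions: trees are given as nested tuples of leaf names, and the normalization divides by the total number of nontrivial bipartitions in both trees. The function names are hypothetical.

```python
def leaf_set(tree):
    """Collect the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial bipartitions of the unrooted tree, each stored as the side
    that does not contain a fixed anchor leaf."""
    all_leaves = frozenset(leaf_set(tree))
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = clade if anchor not in clade else all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def normalized_rf(tree_a, tree_b):
    """RF distance divided by the maximum possible for these two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have identical leaf labels")
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0
```

With five leaves, `(("A", "B"), "C", ("D", "E"))` and `(("A", "C"), "B", ("D", "E"))` share one of their two nontrivial bipartitions, so their nRF distance under this sketch is 0.5.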

Performance summary criteria

Figure 2 shows the color-coded performance of the evaluated AF methods across 12 reference data sets.

Performance score

For our benchmarking data sets, we use different measures to assess the performance of each method for a given data set, for example, nRF or AUC. To make our benchmarking results from different data sets comparable, we converted these measures to a performance score with values between 0 and 100. For the protein sequence classification data sets, this score is defined as AUC × 100. For data sets from the gene tree, genome-based phylogeny, and horizontal gene transfer categories, we define the performance score as (1 − nRF) × 100. For the regulatory element data set, the performance score is already a number between 0 and 100, namely, the weighted average performance across seven data subsets.

Moreover, we define an overall performance score (Additional file 1: Table S14) that assesses each method across the data sets and that also takes values between 0 and 100. For a given method, we calculate revised scores for each data set on which the method was tested as (S − min_score)/(max_score − min_score) × 100, where S is the performance score obtained by the method and min_score and max_score are the minimum and maximum scores obtained with all methods for a given data set, respectively. This way, the best-performing method in a given data set receives a score of 100, and the worst performer receives a score of 0. The overall performance is an average of the revised scores across the data sets on which the given method was tested.
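The rescaling is a per-data-set min-max normalization, (S − min_score)/(max_score − min_score) × 100, followed by an average over the data sets on which a method was tested. A minimal sketch, with hypothetical function names and an assumed convention for the degenerate case where all methods tie:

```python
def revised_scores(scores):
    """Min-max rescale one data set's performance scores to the range 0-100.

    scores: dict mapping method name -> performance score on this data set.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: if every method ties, all receive the top score.
        return {method: 100.0 for method in scores}
    return {method: (s - lo) / (hi - lo) * 100 for method, s in scores.items()}


def overall_performance(per_data_set):
    """Average revised score per method over the data sets it was tested on.

    per_data_set: list of dicts, one per data set (methods may be missing).
    """
    collected = {}
    for scores in per_data_set:
        for method, value in revised_scores(scores).items():
            collected.setdefault(method, []).append(value)
    return {method: sum(v) / len(v) for method, v in collected.items()}
```

Because each data set is rescaled separately, a method is judged relative to the other methods tested on the same data rather than on the absolute difficulty of that data set.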

I.5 Additional comments for non-unix users

Bioperl has mainly been developed and tested under various unix environments, including Linux and MacOS X. In addition, this tutorial has been written largely from a Unix perspective.

Mac users may find Steve Cannon's installation notes and suggestions for Bioperl on OS X at

cann0010/Bioperl_OSX_install.html helpful. Also Todd Richmond has written of his experiences with BioPerl on MacOS 9 (

The bioperl core has also been tested and should work under most versions of Microsoft Windows. For many windows users the perl and bioperl distributions from Active State, at has been quite helpful. Other windows users have had success running bioperl under Cygwin ( See the package's INSTALL.WIN file for more details.

Many bioperl features require the use of CPAN modules, compiled extensions or external programs. These features probably will not work under some or all of these other operating systems. If a script attempts to access these features from a non-unix OS, bioperl is designed to simply report that the desired capability is not available. However, since the testing of bioperl in these environments has been limited, the script may well crash in a less graceful manner.

Material and Methods

Sequence data analysis was implemented in BioPython (Cock et al. 2009) using iPython (Pérez and Granger 2007) and BioPerl (Stajich et al. 2002); phylogenetic computation was implemented using DendroPy 3.10.0 (Sukumaran and Holder 2010) in Python 2.7.2. Scripts are available from the authors upon request. Other tools were used as described later.

Sequence Data Sources and Genome Annotation

Gene Family Allocation

Haloarchaeal genomes are known to harbor inteins (Perler 2002). Before assigning ORFs to families according to sequence homology, intein sequences were identified and removed from protein coding sequences because they are not present in all homologous ORFs (Gogarten et al. 2002). Each known intein sequence from InBase (Perler 2002) was used as a seed to build position-specific scoring matrices with Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) 2.2.23+ (Camacho et al. 2009) against InBase and the haloarchaeal protein sequences with an acceptance threshold e value of 0.0001. Each matrix was used to query the haloarchaeal protein sequences, and alignments with an e value < 1e� were searched at each end with regular expressions designed to match the N-terminus ([ACS][AGFIHMLQSVY]) and C-terminus ([GFHKNS][QSN][CGSTVY]) intein splicing sites from known InBase Bacteria- and Archaea-derived sequences. Multiple alignments of protein sequences with shared KAAS-inferred KEGG orthology numbers that included putative intein-containing sequences were performed using Muscle 3.8.31 (Edgar 2004) with the default settings to confirm the presence of inteins. Inferred intein sequences were removed and are listed in supplementary table S2, Supplementary Material online.
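The two splice-site patterns quoted above can be applied with Python's `re` module as in the sketch below. How the authors anchored the patterns is not stated; anchoring the N-terminal pattern at the start and the C-terminal pattern at the end of a candidate region is an assumption, as are the function name and the example sequences.

```python
import re

# Patterns copied from the text; the ^ and $ anchors are an assumption.
N_TERMINUS = re.compile(r"^[ACS][AGFIHMLQSVY]")
C_TERMINUS = re.compile(r"[GFHKNS][QSN][CGSTVY]$")


def has_splice_sites(region):
    """Return True if a candidate region starts and ends with residues
    matching the N- and C-terminal intein splice-site patterns."""
    return bool(N_TERMINUS.search(region)) and bool(C_TERMINUS.search(region))


# Made-up candidate regions, for illustration only.
print(has_splice_sites("CLAEGTRIFDPVTGTTHRIEDVVVHNC"))  # starts "CL", ends "HNC"
print(has_splice_sites("MKTAYIAKQRQISFVKSHFSRQ"))       # starts "MK"
```

A match only flags a candidate; as described above, presence of an intein was confirmed separately from multiple alignments.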

To establish superfamily clusters of ORFs, each protein sequence was used as a BLASTP (Camacho et al. 2009) query against all proteins, and groups were formed based on an e-value threshold of 1e−4. After single-linkage clustering, the MCL algorithm (Enright et al. 2002) was applied with I = 1.2 to each group using the lesser of hit-query bidirectional BLAST bitscores, normalized to self-hit bitscores, as edge weights, but with weights for hit-query length mismatches beyond 30% set to zero to lessen the influence of less than full-length alignments on ORF cluster formation. The MCL algorithm was repeated on clusters > 210 with increasing I values (1.8, 2.4, 3.0, 3.6, and 4.2), as some very large superfamilies remained after applying smaller I values. The resulting superfamilies of sequences were aligned with Muscle 3.8.31, and any remaining distantly homologous sequences were removed from each with scan_orphanerrs from the RASCAL package (Thompson et al. 2003). The superfamilies were realigned, phylogenies inferred with FastTree version 2.1.2 SSE3 (Price et al. 2010), and gene families inferred using the BranchClust algorithm (Poptsova and Gogarten 2007) with many = 11. BranchClust was started at each terminal edge (see Poptsova and Gogarten 2007 for algorithm details), and the run resulting in the most families and the greatest interfamily edge length (as a tie breaker) was selected.
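The single-linkage step amounts to finding connected components in the graph of significant BLAST hits. A minimal union-find sketch is shown below; it covers only this grouping step (not MCL or the bitscore weighting), and the function name and the input format, a list of (query, hit) pairs that already passed the e-value threshold, are assumptions.

```python
def single_linkage_groups(hits):
    """Group proteins so that any two linked by a chain of hits share a group.

    hits: iterable of (query, hit) identifier pairs.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, hit in hits:
        root_q, root_h = find(query), find(hit)
        if root_q != root_h:
            parent[root_q] = root_h  # merge the two groups

    groups = {}
    for member in list(parent):
        groups.setdefault(find(member), set()).add(member)
    return sorted(sorted(g) for g in groups.values())
```

Single linkage tends to chain distant homologs into very large groups, which is consistent with the subsequent use of MCL to split them.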

Phylogenetic Reconstruction of Widely Distributed Gene Families

All ORF family amino acid sequences were aligned using AQUA (Müller et al. 2010) with default settings (Muscle 3.8.31, MAFFT v6.861b (Katoh et al. 2002), RASCAL 1.34 (Thompson et al. 2003), and norMD 1.2 (Thompson et al. 2001), except for -maxiters 32 in Muscle). Nucleotide sequences were aligned to these using Tranalign from the EMBOSS package version 6.3.1 (Rice et al. 2000). Most haloarchaeal genomes have a higher proportion of guanine and cytosine bases, which causes an increase in erroneous identification of start and stop codons for most gene calling algorithms (Aivaliotis et al. 2007). N-terminal extensions were removed to mitigate phylogenetic reconstruction artifacts caused by inclusion of nonprotein coding sequences. Homology information from ORF family multiple alignments was used to identify putatively erroneous N-terminal extensions, defined as regions of ORFs starting in the multiple alignment earlier than the majority of other members that include 1 or more methionine or valine and had a predicted isoelectric point (pI) >6 (the predicted pI of most haloarchaeal ORFs is <5). Predictions were made using computePI() from the SeqinR library 3.0-5 (Charif and Lobry 2007) for the R statistical computing environment 2.13.2 (Ihaka and Gentleman 1996). C-terminal extensions were rare enough to not warrant similar screening.

Phylogenies were inferred for ORF families with one representative from at least 15 of the 21 genomes from amino acid and nucleotide alignments. Families with more than one ORF from any one genome were excluded from the analysis to minimize ambiguity of histories caused by potential paralogy. For each alignment, substitution model selection for ML reconstruction was made for amino acid alignments with ProtTest (Abascal et al. 2005) using the Akaike Information Criterion (AIC) and for nucleotide alignments with ModelTest (Posada and Crandall 1998) implemented in HyPhy (Kosakovsky Pond et al. 2005). Guide trees were constructed using PhyML 3.0 (Guindon and Gascuel 2003) using the best of NNI and SPR search operations, estimating a proportion of invariant sites and a gamma distribution of among-site rate variation with four rate categories by ML, using the LG substitution matrix (Le and Gascuel 2008) for amino acids and the Hasegawa–Kishino–Yano substitution model (Hasegawa et al. 1985) for nucleotide data. Phylogenies with 100 nonparametric bootstrap replicates were inferred as for the guide trees except where the selected models differed.

Quartet Decomposition

Topologies of all quartets of homologous ORF sequences (each representing a genome) embedded in each set of 100 nonparametric bootstrap replicate phylogenies were extracted from distance matrices of the phylogenies according to the four-point condition of Buneman (1974). This numerical approach proved to be more computationally efficient than inferring embedded quartet topologies by directly manipulating phylogenies represented as data objects. For each embedded quartet in each phylogeny in each set of bootstrap replicates (per gene family), the frequency of each of the three topologies was counted, providing a bootstrap score (BSS) of resolution out of (and adding up to) 100. In simulations performed by Zhaxybayeva et al. (2006) to investigate error rates of false-positive and -negative HGT inference by embedded quartet decomposition, they found that omitting embedded quartets with ≥80% BSS in less than 30% of the genomes in which that quartet exists (i.e., poorly resolved in most cases) provided a negligibly low rate of false positives. They also found that excluding those quartets increased the number of false-negative inferences (missed HGTs). The relatively smaller rate of false-positive than false-negative inferences provided a conservative estimate of transfers. The excluded quartets were probably vulnerable to stochastic noise, that is, occasionally well supported but potentially false-positive topologies due to chance in a finite data set. This definition of a “well resolved” quartet as having a bootstrap score of ≥80% is used in the present analysis.
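The four-point condition gives a simple numerical rule: for four taxa a, b, c, d on a tree, the pairing with the smallest sum of within-pair path distances is the quartet topology. The sketch below applies that rule to a distance matrix and tallies topologies over replicates; the nested-dict matrix format and the function names are assumptions, and ties (unresolved quartets) are resolved arbitrarily in favor of the first pairing.

```python
from collections import Counter


def quartet_topology(dist, a, b, c, d):
    """Return the pairing ((x, y), (z, w)) with the smallest distance sum.

    dist: nested dict of path distances, dist[x][y], taken from one phylogeny.
    """
    sums = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(sums, key=sums.get)


def bootstrap_scores(replicate_dists, a, b, c, d):
    """Count how often each of the three topologies occurs across replicates."""
    return Counter(quartet_topology(m, a, b, c, d) for m in replicate_dists)
```

With 100 bootstrap replicates, the three counts add up to 100 and can be read directly as bootstrap scores for the quartet.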

The greatest of the three scores per quartet was taken from the amino acid phylogenies unless it was <80% BSS and that of the nucleotide quartet was ≥80% BSS, in which case the latter was taken as the score for that quartet. This approach mitigated loss of information if only considering amino acid sequences when the corresponding nucleotide data provided better resolution, as expected for closely related genes. The score for each topology of a quartet across all families in which it is found was summed, and the topology with the highest score was designated the plurality topology for that quartet of genomes (Zhaxybayeva et al. 2006). Embedded quartets may have been affected by long-branch attraction (Felsenstein 1978) when two adjacent long edges in the full phylogeny share a node with the quartet internal edge. Embedded quartets with these characteristics were omitted from the analysis to mitigate false-positive inferences of HGT due to long-branch attraction artifacts (LBAA). Potentially affected quartets were defined as having the shorter of two external adjacent edges on one side of the quartet's central edge more than five times the length of the central edge. Simulations have demonstrated ML estimation accounting for among-site variation to be unaffected by LBAA within these relative long versus short edge length differences (Zhaxybayeva and Gogarten, unpublished). However, the phylogeny inference that provided the embedded quartets was only subject to long-branch attraction with respect to edge lengths in the full phylogeny, not each embedded quartet. Therefore, the lengths used for the external adjacent edges were the innermost with respect to nodes in the full phylogeny. If the outer edge of an embedded quartet formed a terminal edge in the full phylogeny, the whole length of the quartet outer edge was considered.

Phylogenies from Genome Sequences

Concatenated Ribosomal Protein Sequences

We inferred a well-resolved, rooted phylogeny for comparison with each ORF family using a concatenation of ribosomal protein coding genes from the 21 haloarchaeal genomes rooted with three outgroup taxa. Steps were taken to avoid model violations due to nonstationarity caused by compositional heterogeneity and systematic errors caused by long-branch attraction (Felsenstein 1978), which are most likely to affect the edge leading to the outgroup. To decrease the length of the edge to the ingroup, we selected outgroup taxa from two divergent groups: Nanohaloarchaea and Methanomicrobia. Alignments of each homologous ribosomal protein from the ingroup and outgroups were screened for compositional homogeneity using the test of Foster (2004) implemented in PhyloBayes 3.3b using posterior predictive resampling (Lartillot and Philippe 2004). We omitted sequences with a Z score > 2 in an alignment, that is, those with larger deviations in composition, from a concatenation of 59 ribosomal proteins. Sequences from two mesophilic euryarchaea, Methanosarcina acetivorans C2A and Methanococcus aeolicus Nankai-3, were also screened in this way. The latter was selected because it had fewer proteins contributing to compositional heterogeneity. An ML phylogenetic reconstruction was performed with RAxML 7.3.0 starting from 20 randomized parsimony trees with a gamma distribution of among-site substitution rates using per-partition substitution models selected using ProtTest with the AIC (Abascal et al. 2005). Bipartition support was assessed by frequency in 100 nonparametric bootstrap replicates.

Genome Gene Family Composition

For each genome, the presence of a gene family was treated as a character. An MP phylogeny was inferred using the September 2011 version of TNT (Goloboff et al. 2008) with the traditional search, tree bisection reconnection method, 20 search levels, 20 replicated Wagner trees, up to 100 steps for Bremer support (Bremer 1988), and 100 nonparametric bootstrap replicates calculated by frequency differences. To allow an ML phylogenetic reconstruction using PhyML version 20110919 (Guindon and Gascuel 2003), presence was encoded as a cytosine base and absence as an adenine base with the F84 model of nucleotide substitutions (which allows unequal base frequencies and independent rates of transitions and transversions), inferring a proportion of invariable sites and a free distribution of rate categories across a mixture model by ML.

Embedded Quartet Supertree

Plurality-embedded quartet topologies of the strict core gene families were encoded in a matrix according to the method of Baum (1992) and Ragan (1992) used in an MP phylogeny search (MRP) using the September 2011 version of TNT (Goloboff et al. 2008) with the same settings as for gene family composition analysis.

Genome Rearrangements

The strand, order, and chromosome of the core gene families in the subset of genome sequences that were previously fully assembled (Haloferax volcanii, Haloarcula marismortui, Halobacterium, Halogeometricum, Halomicrobium, Haloquadratum DSM 16854 and 16790, Halorhabdus, Halorubrum, Haloterrigena, Natrialba, Natronobacterium, Halalkalicoccus, and Halopiger) were used for neighbor-joining phylogenetic reconstruction (Saitou and Nei 1987) from multichromosomal gene rearrangement distances inferred under the “double-cut-and-join” model implemented in TIBA: Tree Inference with Bootstrap Analysis (Lin et al. 2011, last accessed February 12, 2012).

Inference of HGTs

Screening for Transfers from beyond the Sampled Haloarchaea

It was important not to confuse HGT from unsampled donors with ancient HGTs among ancestors of sampled genomes, else interpretation of HGT donor–recipient partners would suffer inaccuracies. If a homolog is horizontally transferred into the sampled haloarchaea from either an unsampled haloarchaeal lineage sister to the sampled group or a nonhaloarchaeal lineage, the recipient would become a cousin clan (sensu Wilkinson et al. 2007, the unrooted analogue of monophyletic group or clade appropriate for phylogenies in which the root is unknown) in the gene tree to the lineage that is deepest in the rooted reference phylogeny. This would be indistinguishable from an HGT from the deepest sampled lineage by analysis of topological incongruities alone. HGT from a donor outside of the sampled group would, in most cases, deliver a homolog with lower sequence similarity than any sampled donor and would resemble an outgroup often used for rooting phylogenies, that is, an unexpectedly long edge. The following procedure considering branch lengths was used to identify gene families in which incongruities may be due to HGT from unsampled donors from outside of the sampled group, as opposed to HGT among haloarchaea. Gene family phylogenies with unexpectedly long edges were partitioned into sets of homologs on either side of those unexpectedly long edges. Unexpectedly long edges were those that were >75% longer than the mean edge length for that phylogeny. This arbitrary length threshold was used to provide a list of potentially problematic gene families, which were then screened by BLAST analysis. If a set of homologs had lower BLAST expect scores to non-Haloarchaea than to the other sets from that gene family, an HGT from outside of the haloarchaea was concluded and that set of homologs was excluded from the following analyses to avoid false inference of HGT by phylogenetic incongruity.
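The edge-length screen can be sketched in a few lines. The threshold is exposed as a parameter and defaults to 0.75, on the assumption that the cutoff reads as "more than 75% longer than the mean edge length"; the function name and the dict-of-edge-lengths input are hypothetical.

```python
def unexpectedly_long_edges(edge_lengths, excess=0.75):
    """Return the edges whose length exceeds the mean by more than `excess`.

    edge_lengths: dict mapping an edge identifier -> edge length for one
    gene family phylogeny. excess=0.75 is an assumed reading of the cutoff.
    """
    mean_length = sum(edge_lengths.values()) / len(edge_lengths)
    cutoff = mean_length * (1 + excess)
    return [edge for edge, length in edge_lengths.items() if length > cutoff]
```

Note that a very long edge also raises the mean it is compared against, so in small trees the screen is less sensitive than the nominal threshold suggests.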

Identifying Ancestral HGT Recipient–Donor Pairs within the Sampled Haloarchaea

Statistically supported incongruities between a gene family phylogeny and that of vertical descent can be interpreted as an HGT between a pair of ancestral lineages, assuming the descendant of the donor lineage is sampled (see previous section). The difference in topologies caused by a single HGT will result in a different number of conflicting embedded quartets depending on how many nontrivial splits in the reference topology were traversed. For example, two HGTs crossing a small number of splits can cause fewer conflicting embedded quartets than one HGT crossing a large number of splits. The following algorithm infers recipient–donor pairs by analysis of conflicting embedded quartets corresponding to topological incongruities. Embedded quartets taken from bootstrap replicates, which provide better resolution than bipartition supports in full gene phylogenies, were compared with those of the concatenated ribosomal protein phylogeny taken to be a proxy for that of vertical descent. The embedded quartets differing between the ribosomal protein phylogeny and gene family with adequate resolution (>80% BSS) were divided into groups that described the same incongruities (a phylogeny may be affected by more than one HGT). Each group was reduced to a single quartet in which each tip represented regions of the full topologies that were congruent (sometimes referred to as “branch and bind”). This was achieved by combining all two-member quartet topology defined sets if they had shared membership (“single-linkage clustering”). This yielded several sets containing homologs or groups of homologs corresponding to congruent regions of the two topologies. Two of these groups represent exchange partners and are cousin clans (sensu Wilkinson et al. 2007) in the gene family phylogeny but are not sister clades in the genome lineage phylogeny.

HGT exchange partners that appear adjacent in the gene family phylogeny can be recovered by discarding those sets that are sisters in the genome lineage phylogeny. Where several homologs are recovered, an ancestral HGT affecting more than one sampled descendant has been inferred. Repeating this process using a genome reference phylogeny on which previously inferred transfers are applied by subtree pruning and regrafting operations, nested and overlapping transfers in a single gene phylogeny can be recovered. Rearrangements involving sister clades with two members or four member comb phylogenies were inferred by a set of simple conditions for each scenario. When HGT pairs cannot be recovered but conflicting embedded quartets remain, only nonspecific evidence of HGT in that gene family can be concluded due to insufficient resolution in the data. The recipient in the HGT pair can be inferred by assessing which is in a different phylogenetic context in the gene family phylogeny.

Characterization of HGTs

Transfer of Multiple Homologs

For HGT donor–recipient lineage pairs inferred from conflicting embedded quartets for a specific homolog, the hypothesis that its neighboring ORFs were also transferred in the same event was tested. First, the homology of the next ORF in the 5′ direction along the chromosomes of the donor, recipient, and nonrecipient was tested (i.e., did it belong to the same gene family?) allowing up to four inserted or deleted ORFs in each strand. If homologous and in a single copy per genome, widely distributed gene family for which embedded quartets were obtained indicating the same donor–recipient lineage HGT, it was included in the same multi-ORF HGT event. This process was continued along both strand directions until a homolog was not transferred or not identified between the pair.

Additionally, for donor–recipient lineage pairs separated by distance D along the edges of the ribosome phylogeny, where the recipient was within D × 0.85 to other genomes unaffected by HGT for that gene family (nonrecipients), a multiple ORF transfer was inferred if the ML estimate of substitutions per site distance (inferred using the WAG substitution model [Whelan and Goldman 2001] with five rate categories in a gamma distribution as implemented in RAxML 7.3.0 [Stamatakis 2006] from a multiple sequence alignment of all homologs in the sampled genomes) was smaller to the donor than to the nonrecipient, that is, if the ratio of pairwise distances for that homolog was in conflict with that of the concatenated ribosomal protein phylogeny (see fig. 2 for an example). Many donor–recipient pairs had several sampled descendants in which case the analysis with the shortest multi-ORF transfer was retained to provide a conservative estimate of HGT unit size. Chromosome gene maps to aid in this analysis were plotted using the R package genoPlotR (Guy et al. 2010).

A diagram indicating a horizontal transfer of a protein coding ORF inferred by embedded quartet decomposition and tree reconciliation with an adjacent ORF inferred to have been horizontally transferred in the same event. The three horizontal lines represent regions of chromosomes from Halobacterium salinarum R1 (top, putative donor of transferred genetic material), Haloarcula californiae ATCC 33799 (middle, putative recipient), and Halorhabdus utahensis DSM 12940 (bottom, a reference genome). Units are megabases (Mb). Horizontal arrows represent 3′–5′ strand direction and range of protein coding regions. Shared colors indicate most recent homology except for gray, which indicates no local homology. The vertical red arrow indicates which homologs were inferred by embedded quartet decomposition and tree reconciliation to have been transferred between the ancestor of Halobacterium salinarum R1 and Halorhabdus utahensis DSM 12940 and the direction. The reference genome was selected for being more closely related to the putative recipient than donor according to the ribosomal protein phylogeny, plotted to the left side, and to have not been inferred to have been affected by HGT for the gene analyzed with embedded quartets. ML estimates of evolutionary distances measured in substitutions per site are indicated between homologous protein coding regions with the shorter distance indicated by a color.

Mode of Chromosomal Integration

The transferred homolog or homologs were inferred as HR if they were located in a chromosomal region with orthology to the region containing the ancestral versions in the reference genome (described in the previous section). The use of a reference genome allowed confirmation that an ORF underwent HR within an orthologous region with common ancestry between the donor and recipient by excluding the possibility of transfer of that whole region or genomic island (a xenologous region) causing syntenic conservation. If the transferred ORF or ORFs were found in a region other than that identified in the putative donor and close relative, nonhomologous insertion (NHI) followed by loss of the pre-existing version from the orthologous region was inferred. Chromosomal rearrangements during evolution means the probability of identifying homologous regions decreases with evolutionary distance, and for many HGTs, the recipient did not have close relatives with orthology for the gene. Whether these requirements were met for each HGT therefore depended on the phylogenetic placement of the donor and recipient among the available genomes.

If homologous regions were not identified, the mode of integration could not be inferred. If homologous regions were identified and the region of HGT ORF(s) intersects a window of eight ORFs around the center of the region in the recipient, HR was inferred, else NHI (followed by loss of the original homolog for the single copy families analyzed here) was inferred as the mode of chromosomal integration.

Initially, the reference genome chromosomes were scanned with a moving window of eight ORFs. If a single region in the nonrecipient contained two of the same homologs found within four ORFs in or around the HGT unit in the recipient chromosome, those regions were considered homologous. The fewest gene families per genome was 2,212, in Halobacterium salinarum, whereas the average was 3,077. The probability of finding any two of four homologs in a window of eight in a genome of 2,212 homologs is (4 × 8 × [1/2,212])^2 = 0.0002, giving a false-positive rate of 0.02% for transfers to Halobacterium and, for the majority of inferences, about 0.01% on average.
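The arithmetic behind these false-positive rates can be checked with a short sketch. It assumes, as the formula in the text does, that the two required homolog hits are independent; the function name and its parameters are illustrative and not taken from the original analysis.

```python
def window_false_positive_rate(n_families, homologs=4, window=8, required=2):
    """Approximate chance that `required` of the `homologs` fall in one window
    of `window` ORFs by chance, treating each hit as independent (assumption)."""
    per_hit = homologs * window / n_families
    return per_hit ** required


# Genome sizes quoted in the text: smallest (Halobacterium salinarum) and average.
smallest = window_false_positive_rate(2212)
average = window_false_positive_rate(3077)

print(f"smallest genome: {smallest:.4%}")
print(f"average genome:  {average:.4%}")
```

Under these assumptions the smallest genome gives roughly 0.02% and the average genome roughly 0.01%, in line with the figures quoted above.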

Modeling Exchange Partner Sequence Similarity versus Frequency of HGTs

The frequency of HGT was calculated as the quantity of HGT events during the time a pair of HGT partner lineages coexisted. Time of coexistence was estimated as the length of overlapping edges in a maximum clade credibility phylogeny (e.g., the region labeled “t” in fig. 1 B) from a Bayesian posterior distribution of phylogenies using the ribosomal protein sequences described earlier under an uncorrelated log-normal relaxed molecular clock (Drummond et al. 2006). The data were partitioned into large and small ribosomal subunit associated sets of sequences, the tree prior set to a Yule model, and the substitution model to WAG (Whelan and Goldman 2001) with five categories in a gamma distribution of among site rate variation. Four Markov chain Monte Carlo sampling chains of 20,000,000 and one of 14,000,000 generations with a discarded burnin of 800,000 generations using BEAST v1.6.1 (Drummond and Rambaut 2007) and BEAGLE v1.0 (Ayres et al. 2011) with an MSI (City of Industry, CA) N560GTX-TI TWIN FROZR II 2G GPU were calculated. The smallest effective sample size was 170 as calculated by Tracer v1.4.1 (Rambaut and Drummond 2007) as five separate trace files or after serializing with LogCombiner (part of the BEAST package) indicating both an adequate burnin and convergence.

(A) ML phylogenetic reconstruction from 59 concatenated ribosomal protein sequences from 21 haloarchaea with edge lengths scaled to substitutions per site. Two sets of nanohaloarchaeal and one mesophilic methanogen from Methanomicrobia were used as an outgroup. Protein homologs inferred as causing compositional heterogeneity were excluded, and the deepest bipartitions were collapsed due to inconsistency among nonparametric bootstrapped replicates and evidence of LBAA. (B) Bayesian sampled phylogeny inferred from the same data set with edge lengths scaled to a relaxed molecular clock. As an example, the edges marked d1–4 in (A) and the regions labeled "t" in (B) indicate the genetic distance between and the duration of coexistence respectively of the ancestral lineages of Halalkalicoccus and of Haloarcula and Halomicrobium used in HGT frequency versus genetic distance modeling. All pairwise, coexisting, nonsister edges were included.

The sequence similarity was taken to be the substitutions per site across the RAxML inferred ribosomal protein phylogeny described earlier. Although rates of evolution will vary between gene families, relative rates among lineages within gene families may be similar to those of the ribosomal proteins. Specifically, between the points on the donor–recipient edges mid-way along the overlapping region in the relaxed molecular clock tree (e.g., the region labeled “t” in fig. 1 B) scaled to the equivalent point in the substitutions per site tree (e.g., the terminal ends of the regions labeled �” and �” in fig. 1 A) spanning the edges lengths since the donor–recipient last common ancestor (e.g., the regions labeled �” to �” in fig. 1 A). The distances between partners may be underestimated when the phylogenetic resolution within a clan of putative transfer partner homologs (either recipient or donor descendents) was insufficient to infer the precise edge of horizontal transfer: the next deepest edge of resolution 80 would have been returned by the algorithm used to infer HGT by phylogenetic incongruity. The resolution in the gene phylogenies within the inferred HGT partner groups was tested by checking for embedded quartets that supported each of the next edges within the regions of the gene family phylogeny associated with either exchange pair until supported. The mean distance into the ribosome phylogeny along unresolved edges was added to the distance between exchange partners to account for this uncertainty. A linear model was fitted with the lm() function after a log transformation of the HGT frequency data using the log() function of the base package of R 2.14.2 (Ihaka and Gentleman 1996).
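As a rough illustration of the final modeling step, the sketch below fits an ordinary least squares line to log-transformed frequencies. The original analysis used lm() and log() in R; this is a Python stand-in, and the distances and frequencies are made-up values generated from a known curve, not data from the study.

```python
import math


def fit_log_linear(distances, frequencies):
    """Least-squares fit of log(frequency) against distance.

    Returns (intercept, slope) of the fitted line.
    """
    logs = [math.log(f) for f in frequencies]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: frequency = 2 * exp(-3 * distance), for illustration only.
distances = [0.1, 0.2, 0.4, 0.8]
frequencies = [2.0 * math.exp(-3.0 * d) for d in distances]

intercept, slope = fit_log_linear(distances, frequencies)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Because the example data follow an exact exponential decay, the fit recovers the slope and the log of the scale factor; real HGT frequency data would scatter around the line.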

Inferring the Relative Contributions of “in-lineage” and “out-lineage” Sequence Substitutions in Relaxed Core Genes

The total “in lineage” substitutions for ORFs in single copy relaxed core families were calculated as the distance from each tip to the root of the ML ribosomal protein phylogeny multiplied by the quantity of such ORFs in the genome sampled for that lineage (units: − 1 ).

The total “out-lineage” substitutions were calculated by predicting the HGT frequency for each edge between a tip and the root with each of all other coexisting lineages, according to the relaxed molecular clock phylogeny, using the corresponding distances in the substitutions per site phylogeny as the distance for the fitted linear model. For each edge pair, the HGT frequency (units: HGT.time⁻¹) was multiplied by the mean number of ORFs per HGT (units: ORF.time⁻¹) and then by the average of half of the edge lengths in each lineage since the last common ancestor (assuming equal transfers in each direction; otherwise the edge length in the donor lineage would be used) to give horizontally acquired substitutions (units: ORF.substitutions.time⁻¹.site⁻¹), finally multiplying by the length of overlapping edges (units: − 1 ).


We generated pseudo-random sequences to determine how far typical alignment scoring schemes spuriously overextend alignments into neighboring unrelated sequences. Random protein sequences reflected the standard Robinson–Robinson ( 16) amino acid frequencies; random DNA sequences, the human genome average frequency of 60% AT. To mimic extension from a true alignment, a variant of the Needleman–Wunsch algorithm optimized the score over all alignments starting (possibly with gaps) at the beginning of the two sequences but ending anywhere. For a given pair of random sequences, after finding a constrained alignment with the maximal score, we recorded its flank length, which is the number of residues aligned in the first random sequence. We estimated flank length distributions, both by ‘crude Monte Carlo sampling’ (the name for brute-force simulation in statistics), which generates letters independently from the appropriate background frequencies, and by a well-accepted, more efficient but complicated procedure called ‘importance sampling’ (see Methods section for more details) ( 12).
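The constrained alignment described above can be sketched as a small dynamic program. This is an illustrative version, not the authors' code: it assumes a simple match/mismatch DNA score, a gap of length k costing GOP + k × GEP, and takes the flank length to be the position in the first sequence where the best-scoring alignment ends. The function names and defaults are hypothetical.

```python
import random

NEG = float("-inf")


def flank_length(a, b, match=1, mismatch=-1, gop=2, gep=1):
    """Best alignment anchored at the start of both sequences, ending anywhere.

    Returns (flank_length, score); the empty alignment (0, 0) wins ties.
    """
    n, m = len(a), len(b)
    # M: ends in an aligned pair; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0  # the anchored start
    best, best_i = 0, 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            if i > 0:
                X[i][j] = max(M[i - 1][j] - gop - gep, X[i - 1][j] - gep,
                              Y[i - 1][j] - gop - gep)
            if j > 0:
                Y[i][j] = max(M[i][j - 1] - gop - gep, Y[i][j - 1] - gep,
                              X[i][j - 1] - gop - gep)
            cell = max(M[i][j], X[i][j], Y[i][j])
            if cell > best:
                best, best_i = cell, i
    return best_i, best


def random_dna(length, rng):
    """Random DNA with 60% AT content (A and T 0.3 each, C and G 0.2 each)."""
    return "".join(rng.choices("ACGT", weights=[0.3, 0.2, 0.2, 0.3], k=length))


# Crude Monte Carlo: flank lengths for a handful of random sequence pairs.
rng = random.Random(0)
samples = [flank_length(random_dna(60, rng), random_dna(60, rng))[0]
           for _ in range(20)]
print(samples)
```

Running many more pairs and tabulating `samples` would give an empirical flank length distribution for the chosen scoring scheme; the importance sampling procedure mentioned in the text is not reproduced here.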

Figure 3 plots the flank length distributions for several scoring schemes; Table 2 lists the expected lengths and probabilities of length = 0. Although the distributions vary widely, the crude Monte Carlo and importance sampling estimates agree closely. Among protein scoring schemes, BLOSUM50 with GOP = 10 and GEP = 2 has an expected flank length of 23 and probability 0.1 of a flank length exceeding 65. Thus, sizeable overextensions are likely with this scoring scheme. The other protein scoring schemes in Figure 3 are much more restrained: for instance, BLOSUM62 with GOP = 11 and GEP = 1 has an expected flank length of 5.5 and probability 0.1 of a flank length exceeding 17. However, there is always a small probability of getting large flanks: BLOSUM62 with GOP = 11 and GEP = 1 has probability 0.01 of a flank length exceeding 69. Since it is common to perform hundreds or even millions of alignments, these probabilities are not negligible. The flank lengths for NCBI BLAST can be roughly halved by increasing the gap extension penalty to 2.

Probability distributions for the length of overalignment into random sequences. The solid lines show distributions obtained from alignment of 10 000 random sequence pairs (using the variant of the Needleman–Wunsch algorithm mentioned in the Results section). The dashed lines show distributions predicted by importance sampling. The top row refers to protein sequences with Robinson–Robinson frequencies, and the bottom row refers to DNA with 60% AT. The abbreviations are GOP (gap opening cost), GEP (gap extension cost), and + X/−Y (match score/mismatch score).

The flank length distributions for popular DNA scoring schemes vary even more widely. The +5/−4 scheme with GOP = 0 and GEP = 10 is severely prone to overextension, with an expected flank length of 41 and probability 0.1 of a flank length exceeding 141. Surprisingly here, the gap extension penalty is twice the match score, perhaps highlighting the importance of a large gap opening penalty in restraining overextension. Surprisingly also, even with the same gap penalties, and despite apparent similarity, the HoxD55 matrix is much more prone to overextension than HoxD70. Because the HoxD55 scheme with GOP = 400 and GEP = 30 has an expected flank length of 24 and probability 0.1 of a flank length exceeding 94, overextensions like the one in Figure 1 are probable. On the other hand, the default schemes for NCBI BLAST are extremely restrained: the +2/−3 scheme with GOP = 5, GEP = 2 has only probability 0.01 of a flank length exceeding 8, and the +1/−3 scheme is, of course, even more conservative.

Because local alignments of random sequences should not extend to include most of the sequence length, practical scoring systems are constrained to have reasonably strong mismatch and gap penalties. Despite extensive simulation, we were unable to verify that the default scoring schemes in two EMBOSS programs, Water and Supermatcher (but not Matcher) satisfied this constraint for sequences with 60% AT ( 15). [In technical terms, practical scoring systems must be in the ‘local regime’ ( 13), which depends also on the letter frequencies in random sequences. In other words, a scoring system might be in the local regime for GC-rich DNA, but not for AT-rich DNA. Although a few approximate analytical studies are extant ( 17, 18), simulations are generally required to show that a scoring system is in the local regime. We could not verify that the Water and Supermatcher scoring systems were in the local regime.]

Mismatch and/or gap penalties restrain overextension, but there is of course a tradeoff: if penalties are too high, alignments fail to include weakly similar subsequences. Because the tradeoff depends on the nature of weak biological similarities, we studied it in real biological sequences, by examining alignments of mtDNA to recent human NUMTs. Because NUMTs are unrelated DNA insertions with well-defined edges, they serve our purposes particularly well. As described in the Methods section, we identified 31 recent NUMTs. The 31 NUMTs, with 1000 bp of flanking sequence on either side (Supplementary dataset 2), were then aligned to mtDNA from mouse, fugu and hagfish (a borderline vertebrate), representing three levels of divergence.

Figure 4 shows the length distribution of overalignments, where the alignment extends past the edge of the NUMT, and underalignments, where the alignment ends before the edge of the NUMT, for six scoring schemes. Although the default scheme of NCBI BLAST (+2/−3 with GOP = 5, GEP = 2) is indeed resistant to overalignment, it pays for this with a strong tendency for underalignment. On the other hand, the most aggressive scoring schemes (+5/−4 with GOP = 0, GEP = 10 and HoxD55 with GOP = 400, GEP = 30) exhibit the least underalignment, but excessive overalignment. The default scheme of BLASTZ (HoxD70 with GOP = 400, GEP = 30) offers a good balance between under- and overalignment, especially for the level of divergence between human and fugu mtDNA. (To avoid misunderstanding, note that on average, human and fugu mtDNA are much less divergent than human and fugu nuclear DNA.) In general, conservative scoring schemes provide a better balance for closely related sequences, and aggressive schemes for divergent sequences. If one desires a simple match/mismatch scoring scheme, then +1/−1 with GOP = 2, GEP = 1 offers a reasonable balance for a wide range of problems, being somewhat more conservative than the BLASTZ default.

Tradeoff between over- and under-alignment. These graphs refer to Smith–Waterman alignments of mouse, fugu and hagfish mtDNA to 31 human NUMTs with 1000 bp of flanking sequence on either side. The 62 endpoints of the NUMTs are known to within ±5 bp. The solid lines show the distribution of overalignments, and the dashed lines show the distribution of underalignments. We discarded alignments not overlapping the NUMT at all: the horizontal dotted lines indicate the number of endpoints remaining for consideration.

A judicious choice of scoring scheme can make large overextensions infrequent, but it does not prevent them completely. Thus, we need to identify overextensions when they occur. Figure 2 suggests that long overextensions have relatively low scores. Thus, given the score distribution for alignments extending from true alignments into random sequences, a P-value (the probability of a chance flank with equal or greater score) could help identify spurious alignment flanks.

Given a true alignment, a spurious alignment flank is approximately the alignment of two random sequences starting from the final aligned letter pair in the true alignment. [To test robustness of our results by varying the nature of the true alignment, we simulated long sequence pairs under the hybrid alignment model of related sequences ( 19), and then concatenated random unrelated sequences to the aligned sequences. Results remained essentially unchanged (data not shown).] Under the approximation, the contribution to the alignment score from the flank is equivalent to a quantity known as the ‘global maximum score’ ( 20). The global maximum score y has a P-value P ≈ c·e^(−λy), where c is a fixed constant and λ is the so-called ‘Gumbel scale parameter for local alignment’. Analytical formulas for c and λ are known only for gapless alignment ( 21), but importance sampling techniques can estimate c and λ very efficiently for gapped alignment (see Methods section). Crude Monte Carlo sampling confirmed the accuracy of P-values from importance sampling ( Figure 5).

Probability distributions for the scores of overalignments into random sequences. The solid lines show score distributions from alignment of 10 000 random sequence pairs (using the variant of the Needleman–Wunsch algorithm mentioned in the Results section). The dashed lines show distributions predicted by the formula P ≈ c·e^(−λy). Table 2 contains the values of the overalignment parameters c and λ. The dotted lines are the distributions of the maximum left score, as described in the Results section.

Table 2 gives values of λ and c for sixteen scoring schemes; Supplementary dataset 1 gives values for many other scoring schemes. Figure 2 illustrates how the formula P ≈ c·e^(−λy) converts a given flank score into an overalignment P-value. In the bottom row of Figure 2, the cumulative score reaches a minimum value of 515, at the end of the large gap in the lower sequence. Because P ≈ c·e^(−λy) ≈ 0.038 (c = 0.802, λ = 0.00592 and y = 515), and because the UCSC fugu–human data include many thousands of individual alignments, we expect many spurious extensions with P-values of this magnitude.
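The worked number in this paragraph can be reproduced in a few lines. The sketch reads the formula as c multiplied by e raised to the power −λy, which is the reading that yields the quoted 0.038; the function name is illustrative, and capping the result at 1.0 is an added assumption so that small scores still give a valid probability.

```python
import math


def overalignment_p_value(score, c, lam):
    """P-value of a flank score under P ≈ c * exp(-lam * score), capped at 1.0."""
    return min(1.0, c * math.exp(-lam * score))


# Values quoted in the text for the Figure 2 example.
p = overalignment_p_value(515, c=0.802, lam=0.00592)
print(f"P = {p:.3f}")
```

With thousands of alignments in a data set, a per-flank P-value of this size implies that many such flanks will arise by chance, which is the point the paragraph makes.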

Then, how can we use the overalignment P-value to strengthen inferences from alignments? Figure 2 plots the flank P-value against the alignment position with a solid line. After exclusion of the largest flank with P-value P, 1−P becomes a lower bound for the (theoretical) probability that on that flank, the remaining alignment does not involve two random sequences. (The inference might seem feeble, but it is the only inference possible from any alignment P-value).

In bioinformatics, P-values usually flag biological similarities, so this statement might seem counterintuitive. The overalignment P-value, however, aims to exclude biologically spurious flanks, to increase the dependability of the remaining alignment. Several intervals on a flank alignment might have the same score (and thus, the same overalignment P-value), however. Which interval should we exclude?

To introduce some relevant subtleties, consider the boundary position between the true and flank alignments in Figure 2. Consider now the left end-position of the maximal local alignment. Let the ‘left scores’ be the successive cumulative global alignment scores within the flank, starting from the left end-position and moving rightward (as shown by the dotted line in Figure 2). Now, reverse direction and consider the ‘right scores’ (not shown in Figure 2), which are successive cumulative global alignment scores starting from the boundary position and moving leftward. Because the left end-position is the end of the maximal local alignment, it achieves the maximum right score, which we denote here by M. Fortunately, the P-value for the maximum right score M is known from other work ( 20).

Because the alignment score for any interval remains the same under sequence reversal, the left score at the boundary position is also M. Because we know the left end-position of an optimal local alignment but not the boundary position, to exclude a boundary position with a left score M = y, we must exclude every position with left score y. In Figure 2, e.g. we must exclude the rightmost position with left score y, indicated by the downward double arrow. As an intuitive justification, consider every alignment position with left score y. All intervening alignment intervals have a score of 0, which does nothing for our confidence that they represent parts of a biologically interesting alignment.

One should bear in mind that statistical significance does not always reflect biological significance, however. Various rules of thumb can estimate biological significance from BLAST E-values, e.g. PSI-BLAST iterations retain sequences with a statistical E-value of 0.005. Figure 2 suggests that for overalignment P-values, statistical and biological significance are similar, but further practical experience is required to confirm this point.

To increase confidence in an alignment, an investigator could trim the alignment flanks with the overalignment P-value, but trimming also involves a tradeoff: overalignment becomes less frequent but underalignment becomes more frequent. The P-value threshold used for trimming flanks should therefore reflect the subjective penalties assigned to over- and under-alignment. Figure 6 shows the same mtDNA-NUMT alignments as Figure 4, but after removing flanks with P > 0.01. As expected, overalignment decreases but underalignment increases. In particular, underalignments of length around 10 bp are frequent, because true alignments are likely to extend for a few bases into nearby sequences. Since the overalignment P-values for short extensions are near 1.0, no solid judgment is possible about a few residues at the end of any alignment.

Tradeoff between over- and under-alignment after trimming flanks with P > 0.01. These graphs refer to the same alignments as in Figure 4. This time, however, the alignments were shortened at either end by removing flanks with P > 0.01. In a few cases, the trimming removed the entire alignment: we discarded these cases from consideration.

Based on these results, we do not recommend routine trimming of alignment flanks, particularly because well-balanced scoring schemes rarely produce large overextensions. Rather, programs should include the P-value of flanks, so investigators can know how often a random flank produces the indicated alignment. In the case of low-quality alignments of transcription factor binding sites, for example, investigators can then regard any flanks with large P-values with appropriate suspicion.


B.B. thanks D. Tserendulam for help, wisdom and guidance. E.W. thanks St John’s College, Cambridge for facilitating scientific discussion. We thank S. Rankin and the staff of the University of Cambridge High Performance Computing service and the National High-throughput Sequencing Centre (Copenhagen). This work was supported by: The Danish National Research Foundation, The Danish National Advanced Technology Foundation (The Genome Denmark platform, grant 019-2011-2), The Villum Kann Rasmussen Foundation, KU2016, European Union FP7 programme ANTIGONE (grant agreement No. 278976), European Union Horizon 2020 research and innovation programmes, COMPARE (grant agreement No. 643476), VIROGENESIS (grant agreement No. 634650) and the Lundbeck Foundation. The National Reference Center for Hepatitis B and D Viruses is supported by the German Ministry of Health via the Robert Koch Institute (Berlin). B.B. was supported by Taylor Family-Asia Foundation Endowed Chair in Ecology and Conservation Biology. A.D.M.E.O. was supported by N-RENNT of the Ministry of Science and Culture of Lower Saxony, Germany.

Reviewer information

Nature thanks P. Simmonds, B. Shapiro, C. Pepperell and the other anonymous reviewer(s) for their contribution to the peer review of this work.


We report an analysis of the TE content of both the ovary and testis germline transcriptomes for the fire ant, which is the first for a hymenopteran insect. A previous study profiled only ovary gene expression in honeybees ( Niu et al. 2014) and did not examine TEs. Additionally, our study is one of a few insect germline transcriptomes outside of Drosophila and mosquitoes ( Akbari et al. 2013; Yang and Xi 2017). We also report the discovery of a rare case of a currently active TE after a recent HTT (<3 My) in insects. This adds to the few cases of HTT documented for Hymenoptera ( Dotto et al. 2015, 2018). Our study shows that profiling germline expression may be a potential approach for identifying active TEs.

Our analysis revealed that ∼50% of TE-containing transcripts in both the female and male germlines of the fire ant contained sequence from members of the IS630-Tc1-Mariner superfamily ( fig. 1A). Although previous studies suggested that mariners were typically inactivated in eukaryote genomes ( Feschotte and Pritham 2007; Muñoz-López and García-Pérez 2010; Yang and Xi 2017), our results are consistent with the fact that all six known cases of active mariners in animals are from invertebrates ( Muñoz-López and García-Pérez 2010). Our findings also corroborate the previous observation that the mariner family is widespread in insects ( Robertson 1993; Peccoud et al. 2017).

Active TEs are a genomic burden, and consequently, organisms have evolved defense mechanisms against TEs ( Levin and Moran 2011; Yang and Xi 2017). Consistent with control by host defenses, >84% of the TE-containing transcripts in our study were expressed at low levels ( fig. 1B). Self-regulation could also be occurring ( Kidwell and Lisch 2001; Bire et al. 2016). Nevertheless, 17 autonomous TEs may have escaped, or are not yet subject to, host defenses as they are highly expressed in the germline.

Of these, we found Mariner-2_DF particularly interesting because it may still be active in S. invicta, and possibly in a recent phase of expansion. Six lines of evidence strongly support this possibility. First, it has high germline expression and is the only one expressed in all three germline samples based on comparison to the BUSCO genes and it has the highest germline expression in all three samples using Repbase as the reference ( supplementary table S5 , Supplementary Material online). Second, of the 17 highly expressed TEs examined, it is the only one with nonreference copies. Third, it has multiple unique insertion polymorphisms in seven fire ant families ( fig. 3B, supplementary figs. S4–S10, Supplementary Material online). We found at least two insertions per family, which is likely an underestimate because our analysis only surveyed the ∼67% of the scaffolds joined into pseudochromosomes. Fourth, it can undergo somatic excision ( supplementary fig. S11 , Supplementary Material online). Fifth, it is the TE with the most copies (n = 857; all others n ≤ 306; sequences ≥60% of full length). This copy number is similar to other mariner lineages in Drosophila (e.g., ∼460 copies of Dromar6 in D. erecta) that are likely in a recent phase of expansion ( Wallau et al. 2014). Finally, it has the lowest intercopy genetic diversity, including many identical copies, in the fire ant genome ( fig. 2, table 1). The low genetic diversity among the Mariner-2_DF copies suggests that it may be the youngest active mariner in the fire ant genome. This also indicates that the fire ant has not yet evolved a strong defense against Mariner-2_DF.

Although we were successful in discovering one active TE, our analysis may have underestimated the number of active TEs in fire ants for several reasons. For example, we selected for highly expressed TEs in our analysis, thus we would miss active but moderately or lowly expressed TEs. Related, we only profiled one time point for the ovaries (virgin adults) and testes (third and fourth instar), so TEs expressed at other developmental times or during periods of stress (e.g., Naito et al. 2009) would also be missed. Likewise, we did not examine testes from the Sb genotype. Additionally, although we used an improved fire ant genome, there are still assembly gaps, precisely where TEs are typically overrepresented. TE polymorphism (an indication of activity) in the gaps would be undetected. Similarly, fire ant centromeres occupy a third of the genome ( Huang et al. 2018), and any polymorphic insertions there would be difficult to detect.

In addition to contemporary Mariner-2_DF activity in the fire ant, this transposon may have been horizontally transferred into several other species recently (<5.1 My). With the caveat that the analyzed genome assembly qualities were variable, thereby possibly introducing false negatives in Mariner-2_DF presence and sequence completeness, our investigation of its taxonomic distribution revealed a patchy distribution, being found in eight species among 52 diverse insects. For three of the eight species, only remnants of the Mariner-2_DF transposon sequence were detected, indicating host inactivation of the transposon and possibly suggesting an older horizontal transfer date. For the remaining five species, there was high sequence identity among the species and fewer synonymous substitutions in Mariner-2_DF than in nuclear genes in pairwise comparisons, suggesting at least five independent relatively recent horizontal transfer events ( fig. 4). Intact full-length Mariner-2_DF sequences were only detected in S. invicta and D. grimshawi (the youngest, ∼0.18–0.23 My), suggesting that Mariner-2_DF may potentially be active in only these two species. Our results match previous studies reporting HTT for Mariner-2_DF in D. ficusphila (Dromar8Mfic), D. grimshawi (Dromar8) ( Wallau et al. 2014, 2016) and R. prolixus (Rpmar57) ( Filée et al. 2015).

HTT is a well-documented phenomenon among insects. A recent study found that some insects have large proportions of the genome from HTT (24.69% in the stable fly, Stomoxys calcitrans), but in fire ants this value is only 0.75% ( Peccoud et al. 2017). In general, previous research proposed that closely interacting species are more likely to exchange TEs ( Soucy et al. 2015). HTT seems unlikely to have occurred directly among the eight species examined in our study because they have no documented direct ecological interactions. Nevertheless, the current native geographic ranges for R. prolixus and the two ants may overlap ( table 2) and historical geographic ranges may have overlapped for the other species, possibly permitting HTT. More likely, HTT occurred indirectly through one or a series of common vectors between recipient species. These could include viruses, such as baculoviruses or the flock house virus, which are known to carry TEs ( Loreto et al. 2008; Routh et al. 2012; Gilbert et al. 2014), and intimately associated parasites, Wolbachia, or other TEs ( Houck et al. 1991; Loreto et al. 2008; Schaack et al. 2010; Venner et al. 2017). We did check a phoretic mite of fire ants, Histiostoma blomquisti ( Sokolov et al. 2003; Wirth and Moser 2010), which is commonly attached between or under the abdominal tergites of queens. However, we can exclude this mite as the vector because genome sequencing revealed no Mariner-2_DF copies ( Lee and Wang 2016 and unpublished genome).

The direction of HTT, either direct or indirect, among the eight species examined is not clear from our study. Nevertheless, one possibility is that the three species (D. yakuba, D. erecta, and R. prolixus) containing only highly fragmented, and presumably fairly old, copies of Mariner-2_DF, could have been the source for the HTT events into the other five species. Related, and compatible with the first possibility, is that the two ants, which have estimated Mariner-2_DF colonization dates of >2.6 Mya, could have been the source for the three species with more recent insertion dates (D. ficusphila, D. grimshawi, and M. rotunda all <0.57 Mya). Future studies incorporating additional genomes are needed to resolve this issue.

Periods of active transposition may disproportionately shape the host’s genome, leading to increased host genome diversity. Associations between bursts of TE activity and species radiations have been proposed in apes, rodents, and bats ( Warren et al. 2015). Given the evolutionarily recent proliferation of Mariner-2_DF and the high likelihood that it is currently active, highly expressed, and highly polymorphic, we suggest that, of all the TEs, Mariner-2_DF has been disproportionately affecting the fire ant genome. An intriguing question would be: Has this transposon generated beneficial mutations in the fire ant genome that have contributed to its adaptation to the invasive ranges? This topic will be the subject of future experiments and analyses.
