Authors: Joel ZB Low, Tsung Fei Khang, Martti T Tammi
Publication date: 2017/12
Journal: BMC bioinformatics
Volume: 18
Issue: 16
Pages: 575
Publisher: BioMed Central

Abstract
Background

In current statistical methods for calling differentially expressed genes in RNA-Seq experiments, the assumption is that an adjusted observed gene count represents an unknown true gene count. This adjustment usually consists of a normalization step to account for heterogeneous sample library sizes, and then the resulting normalized gene counts are used as input for parametric or non-parametric differential gene expression tests. A distribution of true gene counts, each with a different probability, can result in the same observed gene count. Importantly, sequencing coverage information is currently not explicitly incorporated into any of the statistical models used for RNA-Seq analysis.

Results

We developed a fast Bayesian method which uses the sequencing coverage information determined from the concentration of an RNA sample to estimate the posterior distribution of a true gene count. Our method has better or comparable performance compared to NOISeq and GFOLD, according to the results from simulations and experiments with real unreplicated data. We incorporated a previously unused sequencing coverage parameter into a procedure for differential gene expression analysis with RNA-Seq data.

Conclusions

Our results suggest that our method can be used to overcome analytical bottlenecks in experiments with a limited number of replicates and low sequencing coverage. The method is implemented in CORNAS (Coverage-dependent RNA-Seq), and is available at https://github.com/joel-lzb/CORNAS.
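The idea that many true gene counts can give rise to the same observed count can be illustrated with a small sketch. It is not the CORNAS model: it assumes, purely for illustration, that each true transcript is observed independently with probability equal to the sequencing coverage (binomial sampling) and that the prior over the true count is flat. The function names and the `max_extra` truncation are hypothetical.

```python
import math


def posterior_true_count(observed, coverage, max_extra=2000):
    """Posterior over the true count N given an observed count.

    Illustrative assumptions (not taken from CORNAS): observed ~ Binomial(N, coverage)
    and a flat prior on N >= observed, truncated at observed + max_extra.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    log_p = math.log(coverage)
    log_q = math.log1p(-coverage) if coverage < 1.0 else float("-inf")
    weights = {}
    for extra in range(max_extra + 1):
        n = observed + extra
        # log of the binomial coefficient C(n, observed)
        log_binom = (math.lgamma(n + 1) - math.lgamma(observed + 1)
                     - math.lgamma(extra + 1))
        log_miss = extra * log_q if extra else 0.0
        weights[n] = math.exp(log_binom + observed * log_p + log_miss)
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def posterior_mean(posterior):
    """Expected true count under the posterior."""
    return sum(n * p for n, p in posterior.items())


post = posterior_true_count(observed=10, coverage=0.5)
print(round(posterior_mean(post), 3))
```

Under these assumptions the posterior widens as coverage falls, which is the intuition behind using coverage information when replicates are scarce.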

Authors: Qi Bin Kwong, Chee Keng Teh, Ai Ling Ong, Fook Tim Chew, Sean Mayes, Harikrishna Kulaveerasingam, Martti Tammi, Suat Hui Yeoh, David Ross Appleton, Jennifer Ann Harikrishna

Abstract
Background

Genomic selection (GS) uses genome-wide markers as an attempt to accelerate genetic gain in breeding programs of both animals and plants. This approach is particularly useful for perennial crops such as oil palm, which have long breeding cycles, and for which the optimal method for GS is still under debate. In this study, we evaluated the effect of different marker systems and modeling methods for implementing GS in an introgressed dura family derived from a Deli dura x Nigerian dura (Deli x Nigerian) with 112 individuals. This family is an important breeding source for developing new mother palms for superior oil yield and bunch characters. The traits of interest selected for this study were fruit-to-bunch (F/B), shell-to-fruit (S/F), kernel-to-fruit (K/F), mesocarp-to-fruit (M/F), oil per palm (O/P) and oil-to-dry mesocarp (O/DM). The marker systems evaluated were simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). RR-BLUP, Bayesian A, B, Cπ, LASSO, Ridge Regression and two machine learning methods (SVM and Random Forest) were used to evaluate GS accuracy of the traits.

Results

The kinship coefficient between individuals in this family ranged from 0.35 to 0.62. S/F and O/DM had the highest genomic heritability, whereas F/B and O/P had the lowest. The accuracies using 135 SSRs were low, with accuracies of the traits around 0.20. The average accuracy of machine learning methods was 0.24, as compared to 0.20 achieved by other methods. The trait with the highest mean accuracy was F/B (0.28), while the lowest were both M/F and O/P (0.18). By using whole genomic SNPs, the accuracies for all traits, especially for O/DM (0.43), S/F (0.39) and M/F (0.30) were improved. The average accuracy of machine learning methods was 0.32, compared to 0.31 achieved by other methods.

Conclusion

Due to high genomic resolution, the use of whole-genome SNPs improved the efficiency of GS dramatically for oil palm and is recommended for dura breeding programs. Machine learning slightly outperformed other methods, but required parameter optimization for GS implementation.

Keywords

Genomic prediction; Complex traits; Machine learning; Predictive modeling; Marker-assisted selection; SSR; SNP; Perennial crop

Description

The adage 'garbage in, garbage out' serves as an important reminder to users of NGS technologies to be careful about the quality of sequence data that are used for analyses. Although NGS is a powerful technology that allows us to acquire important biological information about a species, such as its genome, its accuracy depends on the raw sequenced data. Similar to any high

Description:

Overlap Layout Consensus (OLC) assembler: An assembler that identifies all pairs of reads that overlap sufficiently well and then organizes this information into a graph containing a node for every read and an edge between any pair of reads that overlap each other. Contigs are generated as a consensus by inferences from information of all edges in the possible path.

Description:

Rapid technological developments have led to increasingly efficient sequencing approaches. Next Generation Sequencing (NGS) is increasingly common and has become cost-effective, generating an explosion of sequenced data that need to be analyzed. The skills required to apply computational analysis to research on a wide range of applications, including identifying causes of cancer, vaccine design, new antibiotics, drug development, personalized medicine and higher crop yields in agriculture, are highly sought after. This book provides step-by-step guides to complex topics that make it easy for readers to perform essential analyses, from raw sequenced data to answering important biological questions. It is hands-on material for teachers who conduct courses in bioinformatics and reference material for professionals. The chapters are written as standalone recipes, making the book suitable for readers who wish to self-learn selected topics. Readers will gain the skills necessary to work on sequenced data from NGS platforms, making themselves more attractive to employers who need skilled bioinformaticians to handle the deluge of data.

Availability:

  1. Amazon
  2. Book Depository
  3. eBook
  4. bokus

Abstract

Genomic selection (GS) uses genome-wide markers to select individuals with the desired overall combination of breeding traits. A total of 1,218 individuals from a commercial population of Ulu Remis x AVROS (UR x AVROS) were genotyped using the OP200K array. The traits of interest included: shell-to-fruit ratio (S/F, %), mesocarp-to-fruit ratio (M/F, %), kernel-to-fruit ratio (K/F, %), fruit per bunch (F/B, %), oil per bunch (O/B, %) and oil per palm (O/P, kg/palm/year). Genomic heritabilities of these traits were estimated to be in the range of 0.40 to 0.80. GS methods assessed were RR-BLUP, Bayes A (BA), Cπ (BC), Lasso (BL) and Ridge Regression (BRR). All methods resulted in almost equal prediction accuracy. The accuracy achieved ranged from 0.40 to 0.70, correlating with the heritability of traits. By selecting the most important markers, RR-BLUP B has the potential to outperform other methods. The marker density for certain traits can be further reduced based on the linkage disequilibrium (LD). Together with in silico breeding, GS is now being used in oil palm breeding programs to hasten parental palm selection.

Abstract

High-density single nucleotide polymorphism (SNP) genotyping arrays are powerful tools that can measure the level of genetic polymorphism within a population. To develop a whole-genome SNP array for oil palms, SNP discovery was performed using deep resequencing of eight libraries derived from 132 Elaeis guineensis and Elaeis oleifera palms belonging to 59 origins, resulting in the discovery of >3 million putative SNPs. After SNP filtering, the Illumina OP200K custom array was built with 170 860 successful probes. Phenetic clustering analysis revealed that the array could distinguish between palms of different origins in a way consistent with pedigree records. Genome-wide linkage disequilibrium declined more slowly for the commercial populations (ranging from 120 kb at r² = 0.43 to 146 kb at r² = 0.50) when compared with the semi-wild populations (19.5 kb at r² = 0.22). Genetic fixation mapping comparing the semi-wild and commercial populations identified 321 selective sweeps. A genome-wide association study (GWAS) detected a significant peak on chromosome 2 associated with the polygenic component of the shell thickness trait (based on the trait shell-to-fruit; S/F %) in tenera palms. Testing of a genomic selection model on the same trait resulted in good prediction accuracy (r = 0.65) with 42% of the S/F % variation explained. The first high-density SNP genotyping array for oil palm has been developed and shown to be robust for use in genetic studies and with potential for developing early trait prediction to shorten the oil palm breeding cycle.
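The two figures reported for the genomic selection model are consistent with each other: squaring r = 0.65 gives about 0.42. The sketch below shows that arithmetic, assuming (the abstract does not say so explicitly) that prediction accuracy is the Pearson correlation between predicted and observed trait values. The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def variance_explained(r):
    """Fraction of variance explained by a linear fit with correlation r."""
    return r * r


print(round(variance_explained(0.65), 4))
```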

https://doi.org/10.1016/j.molp.2016.04.010

Abstract

Oil palm is a monoecious plant and the sex ratio of the female to male inflorescences on each palm is important for breeding and commercial production. We hypothesise that the sex differences of oil palm inflorescences are due, at least in part, to variations in gene expression, so this study aimed to establish the sexual differences by using whole genome expression analysis. We sequenced the transcriptomes of oil palm male and female inflorescences at the earliest stage at which the male and female tissues can be reliably distinguished. From the transcriptome data, we identified 97 potential sex‐specific transcripts. Among the validated transcripts, oil palm orthologs of acid phosphatase and DEFICIENS showed male‐specific expression patterns whereas orthologs of bZIP transcription factor, late embryogenesis abundant protein and TASSELSEED1 showed female‐specific expression patterns. Transcripts for orthologs of acid phosphatase and late embryogenesis abundant protein were also strongly inflorescence‐specific. Furthermore, we assembled a broad and dense consensus transcriptome from male and female inflorescences, shoot apical meristem, mesocarp, leaf and root of oil palm, which provides a valuable reference for identification of unique and common transcripts between these tissues. We suggest that the combined expression of inflorescence‐ and sex‐specific transcripts may account for sexual differences of oil palm male and female inflorescences.

Wiley Online Library

Abstract

MicroRNAs (miRNAs) are a distinct class of small non-coding RNAs, ~22 nt long, found in a wide variety of organisms. They play important regulatory roles by silencing gene activities at the post-transcriptional level. In this work, we developed a computational workflow to identify conserved miRNA genes in the 10,536 unique Penaeus monodon expressed sequence tags (ESTs). After removing all simple repeats and coding regions in the ESTs, the workflow uses both the conservation of miRNA sequences and several filters obtained from pre-miRNA secondary structure properties to identify conserved miRNAs. Finally, we discovered six potential conserved miRNA genes: mir-4152, mir-466k, miR-32*, lin-4, mir-1346 and mir-4310.

Abstract

Physiological responses to stress are controlled by expression of a large number of genes, many of which are regulated by microRNAs. Since most banana cultivars are salt-sensitive, improved understanding of genetic regulation of salt-induced stress responses in banana can support future crop management and improvement in the face of increasing soil salinity related to irrigation and climate change. In this study we focused on determining miRNA and their targets that respond to NaCl exposure, and used transcriptome sequencing of RNA and small RNA from control and NaCl-treated banana roots to assemble a cultivar-specific reference transcriptome and identify orthologous and Musa-specific miRNA responding to salinity. We observed that banana roots responded to salinity stress with changes in expression for a large number of genes (9.5% of 31,390 expressed unigenes) and reduction in levels of many miRNA, including several novel miRNA and banana-specific miRNA-target pairs. Banana roots expressed a unique set of orthologous and Musa-specific miRNAs, of which 59 respond to salt stress in a dose-dependent manner. Gene expression patterns of miRNA compared with those of their predicted mRNA targets indicated that a majority of the differentially expressed miRNAs were down-regulated in response to increased salinity, allowing increased expression of targets involved in diverse biological processes including stress signaling, stress defence, transport, cellular homeostasis, metabolism and other stress-related functions. This study may contribute to the understanding of gene regulation and abiotic stress response of roots, and the high-throughput sequencing data sets generated may serve as important resources related to salt tolerance traits for functional genomic studies and genetic improvement in banana.

Abstract

Fortilin (also known as TCTP) in Penaeus monodon (PmFortilin) and Fortilin Binding Protein 1 (FBP1) have recently been shown to interact and to offer protection against the widespread White Spot Syndrome Virus infection. However, the mechanism is as yet unknown. We investigated this interaction in detail by a number of in silico and in vitro analyses, including prediction of a binding site between PmFortilin/FBP1 and docking simulations. The basis of the modeling analyses was well-conserved PmFortilin orthologs, containing a Ca2+-binding domain at residues 76–110 representing a section of the helical domain, and the translationally controlled tumor protein signatures 1 and 2 (TCTP_1, TCTP_2) at residues 45–55 and 123–145, respectively. We found that Cys59 and Cys76 formed a disulfide bond in the C-terminus of FBP1, a common structural feature of many exported proteins, as well as the “x–G–K–K” pattern of the amidation site at the end of the C-terminus. This coincided with our previous work, where we found the “x–P–P–x” patterns of an antiviral peptide also to be located in the C-terminus of FBP1. The combined bioinformatics and in vitro results indicate that FBP1 is a transmembrane protein and that FBP1 interacts with the N-terminal region of PmFortilin.

Abstract

Background

Polymorphisms affecting Toll-like receptor (TLR) structure appear to be rare, as would be expected due to their essential coordinator role in innate immunity. Here, we assess variation in TLR4 expression, rather than structure, as a mechanism to diversify innate immune responses.

Methodology/Principal Findings

We sequenced the TLR4 promoter (4.3 kb) in Swedish blood donors. Since TLR4 plays a vital role in susceptibility to urinary tract infection (UTI), promoter sequences were obtained from children with mild or severe disease. We performed a case-control study of pediatric patients with asymptomatic bacteriuria (ABU) or those prone to recurrent acute pyelonephritis (APN). Promoter activity of the single SNPs or multiple allelic changes corresponding to the genotype patterns (GPs) was tested. We then conducted a replication study in an independent cohort of adult patients with a history of childhood APN. Last, in vivo effects of the different GPs were examined after therapeutic intravesical inoculation of 19 patients with Escherichia coli 83972. We identified in total eight TLR4 promoter sequence variants in the Swedish control population, forming 19 haplotypes and 29 genotype patterns, some with effects on promoter activity. Compared to symptomatic patients and healthy controls, ABU patients had fewer genotype patterns, and their promoter sequence variants reduced TLR4 expression in response to infection. The ABU-associated GPs also reduced innate immune responses in patients who were subjected to therapeutic urinary tract inoculation with E. coli.

Conclusions

The results suggest that genetic variation in the TLR4 promoter may be an essential, largely overlooked mechanism to influence TLR4 expression and UTI susceptibility.

Abstract

Background

DNA copy number variation (CNV) has been recognized as an important source of genetic variation. Array comparative genomic hybridization (aCGH) is commonly used for CNV detection, but the microarray platform has a number of inherent limitations.

Results

Here, we describe a method to detect copy number variation using WGS sequencing, CNV-seq. The method is based on a robust statistical model that describes the complete analysis procedure and allows the computation of essential confidence values for detection of CNV. Our results show that the number of reads, not the length of the reads, is the key factor determining the resolution of detection. This favors the next-generation sequencing methods that rapidly produce large amounts of short reads.

Conclusion

Simulation of various sequencing methods with coverage between 0.1× and 8× shows overall specificity between 91.7% and 99.9%, and sensitivity between 72.2% and 96.5%. We also show the results for assessment of CNV between two individual human genomes.
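The dependence on read number rather than read length can be seen in a simple read-depth comparison, where only the mapped start position of each read is counted. The sketch below is a generic windowed log2 ratio between a test and a reference genome, not the CNV-seq statistical model: it omits the confidence values the abstract describes, and the function names, window scheme and normalization by total read count are assumptions made for illustration.

```python
import math


def window_counts(read_starts, genome_length, window):
    """Count reads whose start position falls in each fixed-size window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window] += 1
    return counts


def log2_copy_ratios(test_starts, ref_starts, genome_length, window):
    """Per-window log2(test/reference) read-count ratio.

    Counts are scaled by the total number of counted reads in each sample.
    Windows with no reads in either sample are reported as None.
    """
    test = window_counts(test_starts, genome_length, window)
    ref = window_counts(ref_starts, genome_length, window)
    if sum(test) == 0 or sum(ref) == 0:
        raise ValueError("both samples need at least one read in range")
    scale = sum(ref) / sum(test)
    ratios = []
    for t, r in zip(test, ref):
        if t == 0 or r == 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(t * scale / r))
    return ratios


# Hypothetical data: the test genome has twice the reads in the first window.
reference = list(range(0, 1000, 10))
test_sample = reference + list(range(0, 100, 10))
print(log2_copy_ratios(test_sample, reference, genome_length=1000, window=100))
```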

Keywords

Copy Number Variation; Test Genome; Copy Number Variation Region; Copy Number Ratio; Trace Archive

Abstract

Low target discovery rate has been linked to inadequate consideration of multiple factors that collectively contribute to druggability. These factors include sequence, structural, physicochemical, and systems profiles. Methods individually exploring each of these profiles for target identification have been developed, but they have not been collectively used. We evaluated the collective capability of these methods in identifying promising targets from 1019 research targets based on the multiple profiles of up to 348 successful targets. The collective method combining at least three profiles identified 50, 25, 10, and 4% of the 30, 84, 41, and 864 phase III, II, I, and nonclinical trial targets as promising, including eight to nine targets of positive phase III results. This method dropped 89% of the 19 discontinued clinical trial targets and 97% of the 65 targets failed in high-throughput screening or knockout studies. Collective consideration of multiple profiles demonstrated promising potential in identifying innovative targets.

Abstract

Allergy is a major health problem in industrialized countries. The number of transgenic food crops is growing rapidly, creating the need for allergenicity assessment before they are introduced into the human food chain. While existing bioinformatic methods have achieved good accuracies for highly conserved sequences, the discrimination of allergens and non-allergens from allergen-like non-allergen sequences remains difficult. We describe AllerHunter, a web-based computational system for the assessment of potential allergenicity and allergic cross-reactivity in proteins. It combines an iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine. The pairwise vectorization framework allows the system to model essential features in allergens that are involved in cross-reactivity, but not limited to distinct sets of physicochemical properties. The system was rigorously trained and tested using 1,356 known allergen and 13,449 putative non-allergen sequences. Extensive testing was performed for validation of the prediction models. The system is effective for distinguishing allergens and non-allergens from allergen-like non-allergen sequences. Testing results showed that AllerHunter, with a sensitivity of 83.4% and specificity of 96.4% (accuracy = 95.3%, area under the receiver operating characteristic curve AROC = 0.928±0.004 and Matthews correlation coefficient MCC = 0.738), performs significantly better than a number of existing methods using an independent dataset of 1443 protein sequences.
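The performance measures quoted above (sensitivity, specificity, accuracy and MCC) all derive from the four cells of a binary confusion matrix. A minimal sketch of the standard definitions follows; the counts in the usage line are hypothetical and are not AllerHunter's results.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
print(classification_metrics(tp=90, fn=10, tn=80, fp=20))
```

Because non-allergens far outnumber allergens in the training data described above, accuracy alone can look high while sensitivity lags, which is why MCC is reported alongside it.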

Abstract

A variety of specialist databases have been developed to facilitate the study of allergens. However, these databases either contain different subsets of allergen data or are deficient in tools for assessing potential allergenicity of proteins. Here, we describe Allergen Atlas, a comprehensive repository of experimentally validated allergen sequences collected from in-house laboratory, online data submission, literature reports and all existing general-purpose and specialist databases. Each entry was manually verified, classified and hyperlinked to major databases including Swiss-Prot, Protein Data Bank (PDB), Gene Ontology (GO), Pfam and PubMed. The database is integrated with analysis tools that include: (i) keyword search, (ii) BLAST, (iii) position-specific iterative BLAST (PSI-BLAST), (iv) FAO/WHO criteria search, (v) graphical representation of allergen information network and (vi) online data submission. The latest version contains information of 1593 allergen sequences (496 IUIS allergens, 978 experimentally verified allergens and 119 new sequences), 56 IgE epitope sequences, 679 links to PDB structures and 155 links to Pfam domains.

Abstract

Background

Bioinformatics tools are commonly used for assessing potential protein allergenicity. While these methods have achieved good accuracies for highly conserved sequences, they are less effective when the overall similarity is low. In this study, we assessed the feasibility of using position-specific scoring matrices as a basis for predicting potential allergenicity in proteins.

Results

Two simple methods for predicting potential allergenicity in proteins, based on general and group-specific allergen profiles, are presented. Testing results indicate that the performances of both methods are comparable to the best results of other methods. The group-specific profile approach, with a sensitivity of 84.04% and specificity of 96.52%, gives results similar to those obtained using the general profile approach (sensitivity = 82.45%, specificity = 96.92%).

Conclusion

We show that position-specific scoring matrices are highly promising for constructing computational models suitable for allergenicity assessment. These data suggest it may be possible to apply a targeted approach for allergenicity assessment based on the profiles of allergens of interest.

Keywords

Support Vector Machine; Matthews Correlation Coefficient; Dipeptide Composition; Include Support Vector Machine; Allergen Profile

Abstract

A number of therapeutic targets have been explored for developing anticancer drugs. Continuous efforts have been directed at the discovery of new targets as well as the improvement of therapeutic efficacy of agents directed at explored targets. There are 84 and 488 targets of marketed and investigational drugs, respectively, for the treatment of cancer or cancer-related illness. Analysis of these targets, particularly those of drugs in clinical trials and US patents, provides useful information and perspectives about the trends, strategies and progress in targeting key cancer-related processes and in overcoming the difficulties in developing efficacious drugs against these targets. The efficacy of anticancer drugs directed at these targets is frequently compromised by counteractive molecular interactions and network crosstalk, negative and adverse secondary effects of drugs, and undesired ADMET profiles. Multi-component therapies directed at multiple targets and improved drug targeting methods are being explored for alleviating these efficacy-reducing processes. Investigation of the modes of action of these combinations and targeting methods offers clues to aid the development of more effective anticancer therapies.

Ingenta

Abstract

Allergy is a prevalent health problem in developed countries. With advances in genomic and proteomic technologies, there is a rapid increase in allergy-related data, including allergen sequences, allergic cross-reactivity, molecular structures, clinical measurements, and atmospheric concentrations. The increasingly complex allergy data are fueling the need for advanced approaches to information management and analysis. Computational methods and resources are increasingly the driving force in allergy research. For example, allergen-specific databases are important data sources for allergen characterization. T-cell and B-cell epitope prediction tools focus on identifying immunogenic regions on allergenic proteins. Allergenicity and cross-reactivity prediction tools are increasingly being applied to assess the potential allergenicity of proteins. This review provides an introduction to the growing literature in this area, with particular emphasis on recent developments in bioinformatics relevant to the study of allergens.

Abstract

The constant increase in atopic allergy and other hypersensitivity reactions has intensified the need for successful therapeutic approaches. Existing bioinformatic tools for predicting allergenic potential are primarily based on sequence similarity searches along the entire protein sequence and do not address the dual issues of conformational and overlapping B-cell epitope recognition sites. In this study, we report AllerPred, a computational system that is capable of capturing multiple overlapping continuous and discontinuous B-cell epitope binding patterns in allergenic proteins using SVM as its prediction engine. A novel representation of local protein sequence descriptors enables the system to model multiple overlapping continuous and discontinuous B-cell epitope binding patterns within a protein sequence. The model was rigorously trained and tested using 669 IUIS allergens and 1237 non-allergens. Testing results showed that the area under the receiver operating curve (AROC) of SVM models is 0.81, with 76 percent sensitivity at a specificity of 76 percent. This approach consistently outperforms existing allergenicity prediction systems using a standardized testing dataset of experimentally validated allergens and non-allergen sequences.

Abstract

Background

Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes, and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred.

Results

We have developed a pipeline to estimate the genomic repeat content, where WGS reads are aligned to the genomic sequence and the gene copy number is estimated using the average WGS coverage. This method was applied to the genome of T. cruzi and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22 640 results were stored in a database available online. 18% of all protein coding sequences and pseudogenes were estimated to exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40 000.

Conclusion

Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available WGS data, instead of having to rely on annotated consensus sequences that often are erroneous and possibly misleading. Five repetitive genes were studied in more detail, in order to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.
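The coverage-based estimate described in the Results can be sketched as a ratio: the mean read depth over a gene divided by the depth expected for a single-copy sequence. The abstract does not give the pipeline's details, so the baseline depth, the per-base depth input and the function names below are assumptions made for illustration; only the 14-copy threshold comes from the text.

```python
def mean_coverage(depths):
    """Average per-base read depth over a region."""
    if not depths:
        raise ValueError("region has no positions")
    return sum(depths) / len(depths)


def estimate_copy_number(gene_depths, single_copy_coverage):
    """Estimated copies = mean depth over the gene / single-copy depth."""
    if single_copy_coverage <= 0:
        raise ValueError("single_copy_coverage must be positive")
    return mean_coverage(gene_depths) / single_copy_coverage


def high_copy_genes(depths_by_gene, single_copy_coverage, min_copies=14):
    """Names of genes whose estimated copy number reaches min_copies."""
    return sorted(
        name
        for name, depths in depths_by_gene.items()
        if estimate_copy_number(depths, single_copy_coverage) >= min_copies
    )


# Hypothetical per-base depths for two genes, with a 10x single-copy baseline.
example = {"geneA": [28, 30, 32], "geneB": [150, 140, 160]}
print(estimate_copy_number(example["geneA"], 10.0))
print(high_copy_genes(example, 10.0))
```

Collapsed repeats show up in this scheme as annotated genes with depth well above the baseline, which is how an assembly can under-report the true gene count.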

Keywords

Annotate Gene; Average Coverage; Protein Code Sequence; Trypanosoma Cruzi; Gene Repeat

https://doi.org/10.1186/1471-2164-8-391

Abstract

Modern alignment methods designed to work rapidly and efficiently with large datasets often do so at the cost of method sensitivity. To overcome this, we have developed a novel alignment program, GRAT, built to accurately align short, highly similar DNA sequences. The program runs rapidly and requires no more memory and CPU power than a desktop computer. In addition, specificity is ensured by statistically separating the true alignments from spurious matches using phred quality values. An efficient separation is especially important when searching large datasets and whenever there are repeats present in the dataset. Results are superior in comparison to widely used existing software, and analysis of two large genomic datasets shows the usefulness and scalability of the algorithm.
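Phred quality values encode the estimated probability that a base call is wrong, P = 10^(-Q/10), which is what makes them usable for judging whether a mismatch is a sequencing error or a real difference. The sketch below shows only that standard conversion and a simple expected-error count. How GRAT itself separates true alignments from spurious matches is not described in the abstract, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Probability that a base call with phred quality q is wrong."""
    if q < 0:
        raise ValueError("phred quality must be non-negative")
    return 10 ** (-q / 10)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its phred values."""
    return sum(phred_to_error_prob(q) for q in qualities)


# A mismatch at Q10 is far more likely to be a sequencing error than one at Q30.
print(phred_to_error_prob(10), phred_to_error_prob(30))
print(round(expected_errors([10, 20, 30]), 3))
```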

ScienceDirect

Abstract

Summary: Assessment of potential allergenicity and patterns of cross-reactivity is necessary whenever novel proteins are introduced into the human food chain. Current bioinformatic methods in allergology focus mainly on the prediction of allergenic proteins, with no information on cross-reactivity patterns among known allergens. In this study, we present AllerTool, a web server with essential tools for the assessment of predicted as well as published cross-reactivity patterns of allergens. The analysis tools include graphical representation of allergen cross-reactivity information; a local sequence comparison tool that displays information of known cross-reactive allergens; a sequence similarity search tool for assessment of cross-reactivity in accordance with FAO/WHO Codex Alimentarius guidelines; and a method based on support vector machine (SVM). Ten-fold cross-validation results showed that the area under the receiver operating curve (AROC) of SVM models is 0.90, with 86.00% sensitivity (SE) at a specificity (SP) of 86.00%. Availability: AllerTool is freely available at Contact: zhzhang@i2r.a-star.edu.sg

Abstract

In 1998, the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation, was set up to champion the advancement of bioinformatics in the Asia Pacific. By 2002, APBioNet was able to gain sufficient critical mass to initiate the first International Conference on Bioinformatics (InCoB), bringing together scientists working in the field of bioinformatics in the region. This year, the InCoB2006 Conference was organized as the 5th annual conference of the Asia-Pacific Bioinformatics Network, on Dec. 18–20, 2006 in New Delhi, India, following a series of successful events in Bangkok (Thailand), Penang (Malaysia), Auckland (New Zealand) and Busan (South Korea). This Introduction provides a brief overview of the peer-reviewed manuscripts accepted for publication in this Supplement. It offers a snapshot of the growing research excellence in bioinformatics of the region as we embark on a trajectory of establishing a solid bioinformatics research culture in the Asia Pacific that is able to contribute fully to the global bioinformatics community.

Abstract

Background

Many genome projects are left unfinished due to complex, repeated regions. Finishing is the most time consuming step in sequencing and current finishing tools are not designed with particular attention to the repeat problem.

Results

We have developed DNPTrapper, a WGS sequence finishing tool, specifically designed to address the problems posed by the presence of repeated regions in the target sequence. The program detects and visualizes single base differences between nearly identical repeat copies, and offers the overview and flexibility needed to rapidly resolve complex regions within a working session. The use of a database allows large amounts of data to be stored and handled, and allows viewing of mammalian size genomes. The program is available under an Open Source license.

Conclusion

With DNPTrapper, it is possible to separate repeated regions that previously were considered impossible to resolve, and finishing tasks that previously took days or weeks can be resolved within hours or even minutes.

Keywords

Repeat Region; WGS Sequencing; Mate Pair; Whole Genome WGS; Repeat Copy

Abstract

Background

The accurate prediction of a comprehensive set of messenger RNAs (targets) regulated by animal microRNAs (miRNAs) remains an open problem. In particular, the prediction of targets that do not possess evolutionarily conserved complementarity to their miRNA regulators is not adequately addressed by current tools.

Results

We have developed MicroTar, an animal miRNA target prediction tool based on miRNA-target complementarity and thermodynamic data. The algorithm uses predicted free energies of unbound mRNA and putative mRNA-miRNA heterodimers, implicitly addressing the accessibility of the mRNA 3' untranslated region. MicroTar does not rely on evolutionary conservation to discern functional targets, and is able to predict both conserved and non-conserved targets. MicroTar source code and predictions are accessible at http://tiger.dbs.nus.edu.sg/microtar/, where both serial and parallel versions of the program can be downloaded under an open-source licence.

Conclusion

MicroTar achieves better sensitivity than previously reported predictions when tested on three distinct datasets of experimentally-verified miRNA-target interactions in C. elegans, Drosophila, and mouse.

Keywords

Free Energy; miRNA Target; miRNA Target Prediction; mRNA Secondary Structure; Seed Match

Abstract

The identification of new virus species is a key issue for the study of infectious disease but is technically very difficult. We developed a system for large-scale molecular virus screening of clinical samples based on host DNA depletion, random PCR amplification, large-scale sequencing, and bioinformatics. The technology was applied to pooled human respiratory tract samples. The first experiments detected seven human virus species without the use of any specific reagent. Among the detected viruses were one coronavirus and one parvovirus, both of which were at that time uncharacterized. The parvovirus, provisionally named human bocavirus, was in a retrospective clinical study detected in 17 additional patients and associated with lower respiratory tract infections in children. The molecular virus screening procedure provides a general culture-independent solution to the problem of detecting unknown virus species in single or pooled samples. We suggest that a systematic exploration of the viruses that infect humans, “the human virome,” can be initiated.

Abstract

Whole-genome sequencing of the protozoan pathogen Trypanosoma cruzi revealed that the diploid genome contains a predicted 22,570 proteins encoded by genes, of which 12,570 represent allelic pairs. Over 50% of the genome consists of repeated sequences, such as retrotransposons and genes for large families of surface molecules, which include trans-sialidases, mucins, gp63s, and a large novel family (>1300 copies) of mucin-associated surface protein (MASP) genes. Analyses of the T. cruzi, T. brucei, and Leishmania major (Tritryp) genomes imply differences from other eukaryotes in DNA repair and initiation of replication and reflect their unusual mitochondrial DNA. Although the Tritryp lack several classes of signaling molecules, their kinomes contain a large and diverse set of protein kinases and phosphatases; their size and diversity imply previously unknown interactions and regulatory processes, which may be targets for intervention.

Description

Chapter 16. Biological Databases and Web Services: Metrics for Quality. Tin Wee Tan, Khar Heng Choo, Joo Chuan Tong (Department of Biochemistry, National University of Singapore; {bchtantw, bchckh, bchtjc}@nus.edu.sg); Martti T Tammi (Department of Biological Sciences and Department of Biochemistry, National University of Singapore; martti@nus.edu.sg); Vladimir B Bajic (Institute for Infocomm Research, Singapore; bajicv@i2r.a-star.edu.sg)

1. Introduction. Biological databases (BDs) and web services have been proliferating during the past decade. In 1997, it was estimated that there were more than 400 web-accessible BDs (refs 1, 2). By 2004, the combined figure for Internet-accessible BDs and related web services had grown to more than 1000, and we expect it to double every two years. These Internet-accessible BDs and bioinformatic tools are continuously introduced …

World Scientific

Abstract

We describe a genetic variation map for the chicken genome containing 2.8 million single-nucleotide polymorphisms (SNPs). This map is based on a comparison of the sequences of three domestic chicken breeds (a broiler, a layer and a Chinese silkie) with that of their wild ancestor, red jungle fowl. Subsequent experiments indicate that at least 90% of the variant sites are true SNPs, and at least 70% are common SNPs that segregate in many domestic breeds. Mean nucleotide diversity is about five SNPs per kilobase for almost every possible comparison between red jungle fowl and domestic lines, between two different domestic lines, and within domestic lines—in contrast to the notion that domestic animals are highly inbred relative to their wild ancestors. In fact, most of the SNPs originated before domestication, and there is little evidence of selective sweeps for adaptive alleles on length scales greater than 100 kilobases.

https://doi.org/10.1038/nature03156

Abstract

Although microsatellites with functional effects have been described, these repeats are generally considered “junk” DNA in the same way as other repetitive sequences. Our aim was to investigate whether certain microsatellites can have a functional role as cis-regulatory elements. A database was created of all short tandem repeats, from 2 to 10 bases, located in the first 10 kb 5′ of the transcription start sites of all annotated genes of the human genome. Of 114 microsatellites selected based on their size and location in the promoter, 51 were found to be polymorphic. Using electrophoretic mobility shift assay (EMSA), we studied five repetitive motifs, three of which displayed specific protein binding; these three motifs were found in 12 of the polymorphic microsatellites. An interesting microsatellite is the CTC/GAG repeat which, as double-stranded (DS) DNA, bound specificity protein 1 (SP1) with high affinity, formed triplexes in vitro and displayed differences in SP1 binding and triplex formation capacity for repeats with distinct numbers of repeat units. Interestingly, the polypyrimidine strand of the repeat (CTC) bound other proteins such as polypyrimidine tract-binding protein 1 (PTBP1) as single-stranded (SS) DNA, and a model with two alternative DNA conformations is proposed for these repeats. Distinct protein binding to DS DNA was also observed for different numbers of AAACA and AAAAT repeats. Our results suggest that certain microsatellites may act as cis-regulatory elements, controlling gene expression through transcription factor binding and/or secondary DNA structure formation. Due to their high polymorphism and abundance, they might represent an important source of quantitative genetic variation.

Abbreviations

EMSA, electrophoretic mobility shift assay; DS, double strand; SS, single strand; SP1, specificity protein 1; PTBP1, polypyrimidine tract-binding protein 1; TH, tyrosine hydroxylase gene; ZNF191, zinc finger protein 191; HBP1, HMG-box transcription factor 1; PIG3, p53-induced gene 3; DMPK, dystrophia myotonica–protein kinase gene; DHPLC, denaturing high-performance liquid chromatography; AP2α, activator protein 2 alpha; TFO, triplex-forming oligonucleotide; S.D., standard deviation; GO, Gene Ontology Consortium; PAX7, paired box gene 7; hnRNP K, heterogeneous ribonucleoprotein K; ADRA1B, adrenergic alpha-1B-receptor; BMP7, bone morphogenetic protein 7; CDC14A, cell division cycle 14 homolog A; CAPN1, calpain 1 large subunit; FOXF1, forkhead box F1; EMX2, empty spiracles homolog 2; ZNF409, zinc finger protein 409

Keywords

Triplex; Polypyrimidine; Repeats; Promoter

ScienceDirect

Abstract

Summary: Finishing, i.e. gap closure and editing, is the most time-consuming part of genome sequencing. Repeated sequences together with sequencing errors complicate the assembly and often result in misassemblies that are difficult to correct. Repeat Discrepancy Tagger (ReDiT) is a tool designed to aid in the finishing step. This software processes assembly results produced by any fragment assembly program that outputs ace files. The input sequences are analyzed to determine possible differences between repeated sequences. The output is written as tags in an ace file that can be viewed by, e.g. the Consed sequence editor.

Abstract

Sequencing errors in combination with repeated regions cause major problems in WGS sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in WGS sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair‐wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.
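
The idea of separating sequencing errors from true single base differences can be illustrated on a single column of a multiple alignment. The sketch below is a deliberate simplification and not the MisEd algorithm: it assumes the alignment has already been built, and the function name and the `min_support` threshold are hypothetical. A base is kept when enough reads share it (a candidate DNP); otherwise it is replaced by the column majority.

```python
from collections import Counter


def correct_column(column, min_support=2):
    """Correct one alignment column.

    column: string of bases, one per overlapping read.
    Bases seen in at least `min_support` reads are kept as possible
    true differences; rarer bases are treated as sequencing errors and
    replaced by the most common base in the column.
    """
    if not column:
        return ""
    counts = Counter(column)
    majority = counts.most_common(1)[0][0]
    return "".join(b if counts[b] >= min_support else majority for b in column)


print(correct_column("AAAAGAAT"))  # the lone G and T are replaced
print(correct_column("AAGGAAGA"))  # G is supported by three reads and kept
```

A fixed support threshold is the crudest possible decision rule; it stands in here for the statistical test that a real implementation would apply.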

Abstract

Author: Martti T. Tammi

In the whole genome WGS strategy the genome is fragmented randomly without a prior mapping step. The computer assembly of sequence reads is more demanding than in the HS strategy, owing to the lack of positional information. Therefore, the use of varied clone lengths and the sequencing of both ends is essential to build scaffolds (Edwards & Caskey, 1990; Chen et al., 1993; Smith et al., 1994; Kupfer et al., 1995; Roach et al., 1995; Nurminsky & Hartl, 1996), in particular when sequencing large, complex genomes. This strategy has been used to sequence, e.g., Haemophilus influenzae Rd. (Fleischmann et al., 1995), the 580,070 bp genome of Mycoplasma genitalium, the smallest known genome of any free-living organism, and the Drosophila melanogaster genome (Adams et al., 2000; Myers et al., 2000). However, a clone-by-clone based strategy has been used to finish the Drosophila genome. A WGS approach to sequencing the human genome was proposed by Weber & Myers (1997). This started a debate on the feasibility of the WGS strategy for sequencing the entire human genome (Green, 1997; Eichler, 1998; The Sanger Centre, 1998; Waterston et al., 2002). The method was applied to the whole human genome by the private company Celera Genomics.

Abstract

The software commonly used for assembly of WGS sequence data has several limitations. One such limitation becomes obvious when repetitive sequences are encountered. WGS assembly is a difficult task, even for non-repetitive regions, but the use of quality assessments of the data and efficient matching algorithms have made it possible to assemble most sequences efficiently. In the case of highly repetitive sequences, however, these algorithms fail to distinguish between sequencing errors and single base differences in regions containing nearly identical repeats. None of the currently available fragment assembly programs are able to correctly assemble highly similar repetitive data, and we, therefore, present a novel WGS assembly program, Tandem Repeat Assembly Program (trap). The main feature of this program is the ability to separate long repetitive regions from each other by distinguishing single base substitutions as well as insertions/deletions from sequencing errors. This is accomplished by using a novel multiple-alignment based analysis method. Since repeats are a common complication in most sequencing projects, this software should be of use for the whole sequencing community.

ScienceDirect

Abstract

An increasingly important problem in genome sequencing is the failure of the commonly used WGS assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions or regions containing repeats considerably shorter than the average read length is in practice easy to solve, while longer repeats have been a difficult problem. We here present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1% and the sequencing error may be up to ten times higher. The method is based on the realization that a comparison of only a part of all overlapping sequences at a time in a data set does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions, DNPs, which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, where the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can solve repeats that differ in only two positions in a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in WGS sequencing in the future.
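
The statistical reasoning can be illustrated with a simplified calculation for one alignment column. The sketch assumes independent errors at a uniform per-base rate and that an error produces one particular wrong base with probability `error_rate / 3`; these assumptions and all names are illustrative and are not the model used in the paper, which evaluates combinations of candidate DNPs. The function gives the chance that `k` or more of `n` aligned reads show the same deviating base purely through sequencing error, so a small value points to a real difference between repeat copies.

```python
from math import comb


def prob_same_error(n, k, error_rate):
    """P(at least k of n reads carry one specific wrong base by chance)."""
    p = error_rate / 3.0  # assumed chance of one particular wrong base
    # Upper tail of a Binomial(n, p) distribution.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical depth and error rate, chosen only for illustration.
for k in (1, 2, 3):
    print(k, f"{prob_same_error(n=10, k=k, error_rate=0.10):.4f}")
```

The tail probability drops quickly as `k` grows, which is why a deviation shared by several reads is far more plausible as a repeat difference than as coincident errors.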

Abstract

During the last ten years, a genomics revolution has changed the ways biological research is carried out. The draft sequence of the human genome is available, as well as the sequence of 84 other completed genomes. High-throughput genomics technologies such as genome sequencing with associated bioinformatics tools have been instrumental in this process. The draft genome sequences were determined using the WGS sequencing strategy, where long DNA molecules are randomly sheared into small pieces from which sequences are determined. These are assembled by computer programs in order to reconstruct the original genome sequence. Ubiquitous repeated sequences together with errors in the sequencing process complicate the assembly of WGS fragments. In most genome projects gaps are caused by this complication.

This thesis presents methods and algorithms to separate repeated sequences in WGS projects. The Tandem Repeat Assembly Program (TRAP) builds multiple alignments of reads, which are then analyzed in order to discriminate sequencing errors from real differences between highly similar repeats. The method is based on the fact that sequencing errors are randomly distributed, as opposed to the systematic distribution of mutations in repeat copies. The TRAP assembler was shown to be able to correctly assemble 2000 bp repeat copies that are repeated in tandem eight times. The degree of difference between repeat copies was 1.0%, and the maximum sequencing error 11%.

A refined method based on single base differences between repeat copies has been developed to further improve repeat separation. Results show that in the same sequence, 87% of all the single base differences present in the repeats can be detected, with an error of only 1.6%.

In addition, a novel pattern-matching algorithm was developed. This algorithm takes advantage of the inherent symmetry between indices that can be computed for similar words of the same length and was implemented in the error correction software, MisEd. The results show that up to 99.3% of the sequencing errors can be corrected, while up to 87% of the single base differences remain.

All methods described have thus been shown to be functional, and it is clear that these programs will facilitate genome sequencing and assembly.

Abstract

Mutations in the mitochondrial tRNAleu (UUR) gene have been associated with diabetes mellitus and deafness. We screened for the presence of mtDNA mutations in the tRNAleu (UUR) gene and adjacent ND1 sequences in 12 diabetes mellitus pedigrees with a possible maternal inheritance of the disease. One patient carried a G to A substitution at nt 3243 (tRNAleu (UUR) gene) in heteroplasmic state. In a second pedigree a patient had an A to G substitution at nt 3397 in the ND1 gene. All maternal relatives of the proband had the 3397 substitution in homoplasmic state. This substitution was not present in 246 nonsymptomatic Caucasian controls. The 3397 substitution changes a highly conserved methionine to a valine at aa 31 and has previously been found in Alzheimer's (AD) and Parkinson's (PD) disease patients. Substitutions in the mitochondrial ND1 gene at aa 30 and 31 have been associated with a number of different diseases (e.g. AD/PD, MELAS, cardiomyopathy and diabetes mellitus, LHON, Wolfram‐syndrome and maternally inherited diabetes), suggesting that changes at these two codons may be associated with very diverse pathogenic processes. In a further attempt to search for mtDNA mutations outside the tRNAleu gene associated with diabetes, the whole mtDNA genome sequence was determined for two patients with maternally inherited diabetes and deafness. Except for substitutions previously reported as polymorphisms, neither of the two patients showed any non‐synonymous substitutions in either homoplasmic or heteroplasmic state. These results imply that the maternally inherited diabetes and deafness in these patients must result from alterations of nuclear genes and/or environmental factors.

Abstract

In 1994, at the start of the Parasite Genome Initiatives under the auspices of WHO/TDR, only a few hundred sequences of the parasite Trypanosoma cruzi were known, and genomics was an art for large and specialized centers. At the time of writing, 22 796 T. cruzi sequences have been deposited in the public databases, amounting to about 12% of the (diploid) genome, and many collaborating laboratories, previously not involved in genome projects, have acquired new technology and trained people. The training of people was, from the beginning, one of the secondary objectives of the parasite genome projects (ref. 1). The complete genome sequence of the T. cruzi parasite is now expected within about four years. Indeed, the National Institutes of Health (NIH; Bethesda, MD, USA) has now provided funding to three Genome Centers [The Institute for Genomic Research (TIGR); Seattle Biomedical Research Institute (SBRI) and the University of Uppsala Department of Genetics and Pathology, Sweden] to complete the estimated 43.5 Mb of the haploid genome. Many intriguing details of the genome are emerging and are available at various web sites

Trends in Parasitology

Abstract

We have performed a survey of the active genes in the important human pathogen Trypanosoma cruzi by analyzing 5013 expressed sequence tags (ESTs) generated from a normalized epimastigote cDNA library. Clustering of all sequences resulted in 771 clusters, comprising 54% of the ESTs. In total, the ESTs corresponded to 3054 transcripts that might represent one-fourth of the total gene repertoire in T. cruzi. About 33% of the T. cruzi transcripts showed similarity to sequences in the public databases, and a large number of hitherto undiscovered genes predicted to be involved in transcription, cell cycle control, cell division, signal transduction, secretion, and metabolism were identified. More than 140 full-length gene sequences were derived from the ESTs. Comparisons with all open reading frames in yeast and in Caenorhabditis elegans showed that only 12% of the T. cruzi transcripts were shared among diverse eukaryotic organisms. Comparison with other kinetoplastid sequences identified 237 orthologous genes that are shared between these evolutionarily divergent organisms. The generated data are a useful resource for further studies of the biology of the parasite and for development of new means to combat Chagas' disease.

[The sequence data described in this paper have been submitted to the dbEST database under nos. TENU0001–TENU5214 and the following: AA736292-AA736301, AA738502-AA738535, AA756982-AA756992, AA835598-AA835613, AA866501-AA866550, AA87464-AA874780, AA875669-AA875730, AA875809-AA875824, AA879318-AA897341, AA879376-AA879401, AA882494-AA882518, AA883036-AA883051, AI005678-AI005729, AI007342-AI007441, AI021797-AI021884, AI026370-AI026615, AI037797-AI037846, AI043247-AI043343, AI043427-AI043502, AI046026-AI046290, AI050095-AI050219, AI053146-AI053397, AI057644-AI057957, AI065169-AI065425, AI066117-AI066391, AI069556-AI069908, AI073286-AI073332, AI075466-AI075620, AI077051-AI077281, AI078888-AI079000, AI080790-AI080916, AI083097-AI083245, AI110290-AI110405, AI110412-AI110512, AW324789-AW325325, AW329885-AW330435, and AW621062-AW621094.]

Abstract

We have initiated large-scale sequencing of the third smallest chromosome of the CL Brener strain of Trypanosoma cruzi and we report here the complete sequence of a contig consisting of three cosmids. This contig covers 93.4 kb and has been found to contain 20–30 novel genes and several repeat elements, including a novel chromosome 3-specific 400-bp repeat sequence. The intergenic sequences were found to be rich in di- and trinucleotide repeats of varying lengths and also contained several known T. cruzi repeat elements. The sequence contains 29 open reading frames (ORFs) longer than 700 bp, the longest being 5157 bp, and a large number of shorter ORFs. Of the long ORFs, seven show homology to known genes in parasites and other organisms, whereas four ORFs were confirmed by sequencing of cDNA clones. Two shorter ORFs were confirmed by a database homology and a cDNA clone, respectively, and one RNA gene was identified. The identified genes include two copies of the gene for alanine-aminotransferase as well as genes for glucose-6-phosphate isomerase, protein kinases and phosphatases, and an ATP synthase subunit. An interesting feature of the sequence was that the genes appear to be organized in two long clusters containing multiple genes on the same strand. The two clusters are transcribed in opposite directions and they are separated by an ∼20-kb long, relatively GC-rich sequence, that contains two large repetitive elements as well as a pseudogene for cruzipain and a gene for U2snRNA. It is likely that this strand switch region contains one or more regulatory and promoter regions. The reported sequence provides the first insight into the genome organization of T. cruzi and shows the potential of this approach for rapid identification of novel genes.

[The sequence data described in this paper have been submitted to the GenBank data library under accession nos. AF052831–AF052833.]

Abstract

The WGS sequencing method is the most economical and easily automated sequencing technique and has thus become the method of choice. Computer software is used to put the pieces together very much like a jigsaw puzzle, but it can usually produce only a good estimate of the correct assembly, and extensive manual finishing is needed to generate the final sequence.

Wiley Online Library
