Bioinformatics

Health.cat news Bioinformatics - current issue
Bioinformatics - RSS feed of current issue

Predicting giant transmembrane β-barrel architecture

Motivation: The β-barrel is a ubiquitous fold that is deployed to accomplish a wide variety of biological functions, including the formation of membrane-embedded pores. Key determinants of β-barrel lumen diameter include the number of β-strands (n) and the degree of shear (S), the latter value measuring the extent to which the β-sheet is tilted within the β-barrel. Notably, it has previously been reported that the shear value for small antiparallel β-barrels (n≤24) typically ranges between n and 2n. Conversely, it has been suggested that the β-strands in giant antiparallel β-barrels, such as those formed by pore-forming cholesterol-dependent cytolysins (CDCs), are parallel relative to the axis of the β-barrel, i.e. S=0. The S=0 arrangement, however, has never been observed in crystal structures of small β-barrels. Therefore, the structural basis for how CDCs form a β-barrel and span a membrane remains to be understood.

Results: Through comparison of molecular models with experimental data, we identify how giant CDC β-barrels utilize a ‘near parallel’ arrangement of β-strands where S=n/2. Furthermore, we show that side-chain packing within the β-barrel lumen is an important limiting factor with respect to the possible shear values for small β-barrels (n≤24 β-strands). In contrast, our models reveal that no such limitation restricts the shear value of giant β-barrels (n>24 β-strands). Giant β-barrels can thus access a different architecture compared with smaller β-barrels.

Contact: michelle.dunstone@monash.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
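
The relationship between strand number, shear and barrel size described above can be illustrated with a small sketch. It assumes the widely used idealized barrel geometry (tan α = S·a / (n·b), R = b / (2·sin(π/n)·cos α)), which is not stated in the abstract and is not necessarily the model the authors used. The constants a = 3.3 Å and b = 4.4 Å and the function name barrel_geometry are likewise assumptions made for illustration.

```python
import math

# Assumed constants of the idealized barrel model (not taken from the abstract).
A = 3.3  # angstroms between neighbouring residues along a strand
B = 4.4  # angstroms between neighbouring strands


def barrel_geometry(n, shear):
    """Return (strand tilt in degrees, barrel radius in angstroms).

    n is the number of strands and shear is the shear number S.
    """
    alpha = math.atan((shear * A) / (n * B))
    radius = B / (2.0 * math.sin(math.pi / n) * math.cos(alpha))
    return math.degrees(alpha), radius


if __name__ == "__main__":
    # n = 144 is a hypothetical strand count chosen only for illustration.
    for label, n, shear in [("S=0", 144, 0), ("S=n/2", 144, 72), ("S=n", 144, 144)]:
        tilt, radius = barrel_geometry(n, shear)
        print(f"{label}: tilt {tilt:.1f} deg, radius {radius:.1f} A")
```

Under these assumptions the tilt depends only on the ratio S/n, so S=n/2 gives the same modest tilt for any strand count, while the radius grows roughly in proportion to n.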



1001 Proteomes: a functional proteomics portal for the analysis of Arabidopsis thaliana accessions

Motivation: The sequencing of over a thousand natural strains of the model plant Arabidopsis thaliana is producing unparalleled information at the genetic level for plant researchers. To enable the rapid exploitation of these data for functional proteomics studies, we have created a resource for the visualization of protein information and proteomic datasets for sequenced natural strains of A. thaliana.

Results: The 1001 Proteomes portal can be used to visualize amino acid substitutions or non-synonymous single-nucleotide polymorphisms in individual proteins of A. thaliana based on the reference genome Col-0. We have used the available processed sequence information to analyze the conservation of known residues subject to protein phosphorylation among these natural strains. The substitution of amino acids in A. thaliana natural strains is heavily constrained and is likely a result of the conservation of functional attributes within proteins. At a practical level, we demonstrate that this information can be used to clarify ambiguously defined phosphorylation sites from phosphoproteomic studies. Protein sets for the sequenced natural variants are available for download to enable proteomic studies on these accessions. Together, this information can be used to uncover the possible roles of specific amino acids in determining the structure and function of proteins in the model plant A. thaliana. An online portal to enable the community to exploit these data can be accessed at http://1001proteomes.masc-proteomics.org/

Contact: jlheazlewood@lbl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.



CONTRA: copy number analysis for targeted resequencing

Motivation: In light of the increasing adoption of targeted resequencing (TR) as a cost-effective strategy to identify disease-causing variants, a robust method for copy number variation (CNV) analysis is needed to maximize the value of this promising technology.

Results: We present a method for CNV detection for TR data, including whole-exome capture data. Our method calls copy number gains and losses for each target region based on normalized depth of coverage. Our key strategies include the use of base-level log-ratios to remove GC-content bias, correction for an imbalanced library size effect on log-ratios, and the estimation of log-ratio variations via binning and interpolation. Our methods are made available via CONTRA (COpy Number Targeted Resequencing Analysis), a software package that takes standard alignment formats (BAM/SAM) and outputs in variant call format (VCF4.0), for easy integration with other next-generation sequencing analysis packages. We assessed our methods using samples from seven different target enrichment assays, and evaluated our results using simulated data and real germline data with known CNV genotypes.

Availability and implementation: Source code and sample data are freely available under GNU license (GPLv3) at http://contra-cnv.sourceforge.net/

Contact: Jason.Li@petermac.org

Supplementary information: Supplementary data are available at Bioinformatics online.
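
The idea of calling gains and losses from base-level log-ratios with a library-size correction can be sketched as follows. This is a simplified illustration and not CONTRA's implementation: the function name region_log_ratios, the pseudocount and the single global scaling factor are assumptions, and the GC-bias handling, binning and interpolation steps mentioned in the abstract are left out.

```python
import math


def region_log_ratios(test_depth, control_depth, regions, pseudocount=0.5):
    """Mean base-level log2 ratio (test vs control) for each target region.

    test_depth and control_depth are per-base coverage lists of equal length.
    regions is a list of half-open (start, end) index pairs.
    """
    # Assumed library-size correction: rescale the test sample so that
    # both samples have the same total coverage.
    scale = sum(control_depth) / sum(test_depth)
    means = []
    for start, end in regions:
        ratios = [
            math.log2((test_depth[i] * scale + pseudocount)
                      / (control_depth[i] + pseudocount))
            for i in range(start, end)
        ]
        means.append(sum(ratios) / len(ratios))
    return means


if __name__ == "__main__":
    control = [20] * 20
    test = [20] * 10 + [10] * 10  # second region has half the relative depth
    print(region_log_ratios(test, control, [(0, 10), (10, 20)]))
```

A positive mean suggests a relative gain and a negative mean a relative loss. Turning these values into calls would need a variance estimate, which this sketch does not attempt.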



Probabilistic suffix array: efficient modeling and prediction of protein families

Motivation: Markov models are very popular for analyzing complex sequences such as protein sequences, whose sources are unknown, or whose underlying statistical characteristics are not well understood. A major problem is the computational complexity involved with using Markov models, especially the exponential growth of their size with the order of the model. The probabilistic suffix tree (PST) and its improved variant sparse probabilistic suffix tree (SPST) have been proposed to address some of the key problems with Markov models. The use of the suffix tree, however, implies that the space requirement for the PST/SPST could still be high.

Results: We present the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a Markov model by providing a time and space-efficient alternative to the PST/SPST. Given a sequence of length N, construction and learning in the PSA is done in O(N) time and space, independent of the Markov order. Prediction using the PSA is performed in O(m log(N/|Σ|)) time, where m is the pattern length and Σ is the symbol alphabet. In terms of modeling and prediction accuracy, using protein families from Pfam 25.0, SPST and PSA produced similar results (SPST 89.82%, PSA 89.56%), but slightly lower than HMMER3 (92.55%). A modified algorithm for PSA prediction improved the performance to 91.7%, or just 0.79% from HMMER3 results. The average (maximum) practical construction space for the protein families tested was 21.58±6.32N (41.11N) bytes using the PSA, 27.55±13.16N (63.01N) bytes using SPST and 47±24.95N (140.3N) bytes for HMMER3. The PSA was 255 times faster to construct than the SPST, and 11 times faster than HMMER3.

Availability: http://www.csee.wvu.edu/~adjeroh/projects/PSA

Contact: don@csee.wvu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
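
How a suffix array can stand in for a suffix tree when estimating variable-order Markov probabilities can be sketched as below. This is not the PSA itself: the construction here is a naive sort rather than the O(N) construction described in the abstract, and the function names are hypothetical. It only shows that counting a context by binary search over sorted suffixes is enough to estimate the probability of the next symbol.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


def next_symbol_probability(text, sa, context, symbol):
    """Estimate P(symbol | context) from raw counts, without smoothing.

    A context occurrence at the very end of the text has no successor,
    so the estimates for one context may sum to slightly less than 1.
    """
    context_count = count_occurrences(text, sa, context)
    if context_count == 0:
        return 0.0
    return count_occurrences(text, sa, context + symbol) / context_count


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(next_symbol_probability(text, sa, "a", "b"))
```

Each query costs a logarithmic number of prefix comparisons of length at most m, which is the same general shape as the prediction bound quoted in the abstract, although the details of the published bound are not reproduced here.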



Fulcrum: condensing redundant reads from high-throughput sequencing studies

Motivation: Ultra-high-throughput sequencing produces duplicate and near-duplicate reads, which can consume computational resources in downstream applications. A tool that collapses such reads should reduce storage and assembly complications and costs.

Results: We developed Fulcrum to collapse identical and near-identical Illumina and 454 reads (such as those from PCR clones) into single error-corrected sequences; it can process paired-end as well as single-end reads. Fulcrum is customizable and can be deployed on a single machine, a local network or a commercially available MapReduce cluster, and it has been optimized to maximize ease-of-use, cross-platform compatibility and future scalability. Sequence datasets have been collapsed by up to 71%, and the reduced number and improved quality of the resulting sequences allow assemblers to produce longer contigs while using less memory.

Availability and implementation: Source code and a tutorial are available at http://pringlelab.stanford.edu/protocols.html under a BSD-like license. Fulcrum was written and tested in Python 2.6, and the single-machine and local-network modes depend on a modified version of the Parallel Python library (provided).

Contact: erik.m.lehnert@gmail.com

Supplementary information: Supplementary information is available at Bioinformatics online.
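
The notion of collapsing identical and near-identical reads into single error-corrected sequences can be illustrated with a toy example. The abstract does not describe Fulcrum's algorithm, so everything below is an assumption made for illustration: reads are bucketed by length and a short prefix, clustered greedily by Hamming distance to a seed read, and each cluster is replaced by a per-position majority consensus. The names collapse_reads, key_length and max_mismatches are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_reads(reads, key_length=8, max_mismatches=1):
    """Collapse reads into (consensus sequence, read count) pairs."""
    # Bucket by length and prefix so only plausible duplicates are compared.
    # A sequencing error inside the prefix would split a cluster here.
    buckets = defaultdict(list)
    for read in reads:
        buckets[(len(read), read[:key_length])].append(read)

    collapsed = []
    for bucket in buckets.values():
        clusters = []  # each cluster is a list of reads; the first is its seed
        for read in bucket:
            for cluster in clusters:
                if hamming(read, cluster[0]) <= max_mismatches:
                    cluster.append(read)
                    break
            else:
                clusters.append([read])
        for cluster in clusters:
            # Majority vote per column gives a simple error-corrected consensus.
            consensus = "".join(
                Counter(column).most_common(1)[0][0] for column in zip(*cluster)
            )
            collapsed.append((consensus, len(cluster)))
    return collapsed


if __name__ == "__main__":
    reads = ["ACGTACGTAA", "ACGTACGTAA", "ACGTACGTAT", "ACGTACGTAA", "TTTTTTTTTT"]
    print(collapse_reads(reads))
```

In this example five reads collapse to two sequences, and the single mismatching base is outvoted by the three reads that agree.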



A subspace method for the detection of transcription factor binding sites

Motivation: The identification of the sites at which transcription factors (TFs) bind to deoxyribonucleic acid (DNA) is an important problem in molecular biology. Many computational methods have been developed for motif finding, most of them based on position-specific scoring matrices (PSSMs), which assume the independence of positions within a binding site. However, some experimental and computational studies demonstrate that interdependences between positions exist.

Results: In this article, we introduce a novel motif finding method which constructs a subspace based on the covariance of numerical DNA sequences. When a candidate sequence is projected into the modeled subspace, a threshold in the Q-residuals confidence allows us to predict whether this sequence is a binding site. Using the TRANSFAC and JASPAR databases, we compared our Q-residuals detector with existing PSSM methods. In most of the studied TF binding sites, the Q-residuals detector performs significantly better and faster than MATCH and MAST. As compared with Motifscan, a method which takes into account interdependences, the performance of the Q-residuals detector is better when the number of available sequences is small.

Availability: http://r-forge.r-project.org/projects/meet

Contact: epairo@ibecbarcelona.eu; alexandre.perera@upc.edu

Supplementary information: Supplementary data (1, 2, 3 and 4) are available at Bioinformatics online.



PhyLAT: a phylogenetic local alignment tool

Motivation: The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments.

Results: We introduce PhyLAT, the phylogenetic local alignment tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a five-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA or pplacer.

Availability: www.cse.wustl.edu/~htsun/phylat

Contact: sunhongtao@wustl.edu

Supplementary information: Supplementary data are available at Bioinformatics online.



Fast protein binding site comparisons using visual words representation

Motivation: Finding geometrically similar protein binding sites is crucial for understanding protein functions and can provide valuable information for protein–protein docking and drug discovery. As the number of known protein–protein interaction structures has dramatically increased, a high-throughput and accurate protein binding site comparison method is essential. Traditional alignment-based methods can provide accurate correspondence between the binding sites but are computationally expensive.

Results: In this article, we present a novel method for the comparisons of protein binding sites using a ‘visual words’ representation (PBSword). We first extract geometric features of binding site surfaces and build a vocabulary of visual words by clustering a large set of feature descriptors. We then describe a binding site surface with a high-dimensional vector that encodes the frequency of visual words, enhanced by the spatial relationships among them. Finally, we measure the similarity of binding sites by utilizing metric space operations, which provide speedy comparisons between protein binding sites. Our experimental results show that PBSword achieves a comparable classification accuracy to an alignment-based method and improves accuracy of a feature-based method by 36% on a non-redundant dataset. PBSword also exhibits a significant efficiency improvement over an alignment-based method.

Availability: PBSword is available at http://proteindbs.rnet.missouri.edu/pbsword/pbsword.html

Contact: shyuc@missouri.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
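
The 'visual words' representation described above can be sketched in a few lines. This is a generic bag-of-words illustration rather than PBSword itself: the vocabulary is assumed to be given (the abstract builds it by clustering feature descriptors), the spatial-relationship enhancement is omitted, and the function names and toy two-dimensional descriptors are hypothetical.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))


def word_histogram(descriptors, vocabulary):
    """Normalized frequency of each visual word among the descriptors."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """Euclidean distance between two word-frequency vectors."""
    return math.dist(h1, h2)


if __name__ == "__main__":
    vocabulary = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical cluster centres
    site_a = [(0, 1), (1, 0), (9, 9), (10, 11)]
    site_b = [(0, 0), (1, 1), (0, 2), (10, 10)]
    ha = word_histogram(site_a, vocabulary)
    hb = word_histogram(site_b, vocabulary)
    print(ha, hb, histogram_distance(ha, hb))
```

Once every binding site is reduced to a fixed-length vector, comparing two sites costs a single vector distance instead of a structural alignment, which is where the speed advantage of this kind of representation comes from.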



Matrix eQTL: ultra fast eQTL analysis via large matrix operations

Motivation: Expression quantitative trait loci (eQTL) analysis links variations in gene expression levels to genotypes. For modern datasets, eQTL analysis is a computationally intensive task as it involves testing for association of billions of transcript-SNP (single-nucleotide polymorphism) pairs. The heavy computational burden makes eQTL analysis less popular and sometimes forces analysts to restrict their attention to just a small subset of transcript-SNP pairs. As more transcripts and SNPs get interrogated over a growing number of samples, the demand for faster tools for eQTL analysis grows stronger.

Results: We have developed new software for computationally efficient eQTL analysis called Matrix eQTL. In tests on large datasets, it was 2–3 orders of magnitude faster than existing popular tools for QTL/eQTL analysis, while finding the same eQTLs. The fast performance is achieved by special preprocessing and by expressing the most computationally intensive part of the algorithm in terms of large matrix operations. Matrix eQTL supports additive linear and ANOVA models with covariates, including models with correlated and heteroskedastic errors. The issue of multiple testing is addressed by calculating the false discovery rate; this can be done separately for cis- and trans-eQTLs.

Availability: Matlab and R implementations are available for free at http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL

Contact: shabalin@email.unc.edu
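
The idea of expressing all transcript-SNP tests as one large matrix operation can be sketched for the simplest case, a linear model without covariates. This is an illustration under stated assumptions and not the Matrix eQTL code: the function names are hypothetical, and the nested loop below is written in plain Python only to keep the sketch dependency-free. In practice it would be a single matrix product of the two standardized matrices.

```python
import math


def standardize(row):
    """Centre a row and scale it to unit length (the preprocessing step).

    Rows with no variation are not handled and would divide by zero.
    """
    mean = sum(row) / len(row)
    centered = [x - mean for x in row]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]


def all_pairs_t_statistics(genotypes, expression):
    """t-statistics for every SNP (row of genotypes) against every transcript.

    After standardization, the dot product of two rows is their Pearson
    correlation r, and t = r * sqrt((n - 2) / (1 - r^2)).
    """
    n = len(genotypes[0])
    g = [standardize(row) for row in genotypes]
    e = [standardize(row) for row in expression]
    stats = []
    for g_row in g:
        line = []
        for e_row in e:
            r = sum(a * b for a, b in zip(g_row, e_row))
            denom = max(1.0 - r * r, 1e-300)  # guard against |r| == 1
            line.append(r * math.sqrt((n - 2) / denom))
        stats.append(line)
    return stats


if __name__ == "__main__":
    genotypes = [[0, 1, 2, 0, 1, 2]]
    expression = [[1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
                  [2.0, 1.0, 2.0, 1.0, 2.0, 1.0]]
    print(all_pairs_t_statistics(genotypes, expression))
```

Because the standardization is done once per row, the cost per transcript-SNP pair reduces to one dot product, which is what makes a matrix formulation attractive when the number of pairs runs into the billions.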



Fast and accurate inference of local ancestry in Latino populations

Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas).

Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos.

Availability: http://lamp.icsi.berkeley.edu/lamp/lampld/

Contact: bpasaniu@hsph.harvard.edu

Supplementary information: Supplementary data are available at Bioinformatics online.



Copyright © 2004/2011 health.cat
