conference "BITS2007 Meeting"
  from Thursday 26 April 2007 (12:00)
to Saturday 28 April 2007 (16:20)
at Napoli, Italy
support: bits2007_support@ceinge.unina.it
  Description: The 2007 annual meeting of the Italian Bioinformatics Society (BITS) will be held in Naples.

Contributions from the international scientific community related to bioinformatics and computational biology are welcome. Suggested topics are:

  • Structural and functional analysis of genomes
  • Structural biology and drug design
  • Gene expression and systems biology
  • Novel methodologies, algorithms and tools
  • Large scale analysis of experimental data
  • Molecular evolution and biodiversity
  • Biomedical informatics

NOTE: The final program is available in the "Timetable" section.

Keynote speakers:

The official language of the Meeting is English.

For further information, please contact bits2007@ceinge.unina.it
 

Thursday 26 April 2007

13:15->14:00    Welcome

14:00->15:00    "Giuliano Preparata" Lecture by Roderic Guigo
Description: Investigation of the signals involved in gene specification in genomic sequences.  
14:00  Introduction (10') Cecilia Saccone (ITB-CNR), Emilia Campochiaro
14:10  Investigation of the signals involved in gene specification in genomic sequences (50') Roderic Guigo (Universitat Pompeu Fabra, Barcelona)

15:00->16:20    Session 1: Structural and functional analysis of genomes
Description: Large scale analysis of genomes and their annotation, gene expression analysis, proteomics. 
15:00  Prediction of human targets for viral encoded microRNAs by thermodynamics and empirical constraints (20') Alfredo Ferro (Dipartimento di Matematica e Informatica, Università di Catania; Dipartimento di Scienze Biomediche, Università di Catania)

Motivation: MicroRNAs (miRNAs) are small RNA molecules that modulate gene expression through degradation of specific mRNAs and/or repression of their translation. miRNAs are involved in both physiological and pathological processes, such as apoptosis and cancer. Their presence has been demonstrated in several organisms as well as in viruses. Virus-encoded miRNAs can act as regulators of viral gene expression, but they may also interfere with host gene expression. In particular, viral miRNAs may control cell proliferation by targeting cell-cycle and apoptosis regulators in host cells: accordingly, they could be involved in the pathogenesis of cancer. Computational prediction of miRNA/target pairs is a fundamental step in these studies.

Methods: Here we describe the use of miRiam, a novel program based on both thermodynamic features and empirical constraints, to specifically predict viral miRNA/human target interactions. More precisely, miRiam exploits target mRNA secondary structure accessibility and interaction rules inferred from validated miRNA/mRNA pairs. A set of genes involved in apoptosis and cell-cycle regulation was identified as targets for our studies, a choice supported by the knowledge that DNA tumor viruses interfere with these processes in humans. miRNAs were selected from two cancer-related viruses, Epstein-Barr Virus (EBV) and Kaposi-Sarcoma Associated Herpes Virus (KSHV).

Results: Our predictions show that several of the mRNAs we analyzed (such as BID, BAX, CASP3, CASP10, TP53) have good binding scores for these miRNAs. This suggests that, besides the protein-based host regulation mechanism, interference at the post-transcriptional level may exist. Future work will be aimed at providing experimental validation of these computational predictions.
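As a toy illustration of the two ingredients the abstract combines, empirical constraints (here a seed-complementarity rule) plus a thermodynamic filter, consider the sketch below. The seed rule, the energy threshold and the `duplex_energy` callback are illustrative assumptions, not miRiam's actual parameters or interface; sequences are assumed to be given 5'->3' in the RNA alphabet:

```python
# Illustrative sketch of thermodynamics-plus-constraints target filtering.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(rna))

def seed_sites(mirna: str, utr: str, seed_len: int = 7):
    """Return 3'-UTR positions that pair perfectly with the miRNA seed
    (miRNA positions 2..8, a common empirical constraint)."""
    seed = mirna[1:1 + seed_len]          # miRNA 5' seed region
    target = reverse_complement(seed)     # what we look for in the UTR
    return [i for i in range(len(utr) - seed_len + 1)
            if utr[i:i + seed_len] == target]

def predict_targets(mirna, utr, duplex_energy, max_energy=-20.0):
    """Keep seed-matched sites whose duplex free energy (kcal/mol),
    supplied by an external folding routine, falls below a threshold."""
    return [i for i in seed_sites(mirna, utr)
            if duplex_energy(mirna, utr, i) <= max_energy]
```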

15:20  Identification of candidate regulatory sequences in mammalian 3'-UTR regions by statistical analysis of oligonucleotide distributions (20') Davide Corà (Department of Theoretical Physics, University of Torino, Torino)

Motivation: 3'-UTR regions contain binding sites for several regulatory elements which play an important role in the post-transcriptional control of gene expression, regulating mRNA stability, localization and translation efficiency. In particular, a mechanism of post-transcriptional regulation whose importance was realized in the last few years is mediated by a class of small RNAs called microRNAs. With this abstract, we present a computational methodology aimed at the identification of putative regulatory elements in human 3'-UTR regions, with a focus on miRNA binding sites.

Methods: We propose and develop two complementary approaches to the statistical analysis of oligonucleotide frequencies in mammalian 3'-UTR regions, aimed at the identification of candidate binding sites for regulatory elements. The first method is based on the identification of sets of genes characterized by evolutionarily conserved overrepresentation of an oligonucleotide, without requiring any alignment procedure. The second method is based on the identification of oligonucleotides characterized by statistically significant strand asymmetry in their distribution in 3'-UTR regions. More precisely, we analyzed repeat-masked 3'-UTR sequences of human and mouse genes using two different pipelines, both based only on the statistical properties of oligonucleotide frequencies:

  • Conserved overrepresentation. We constructed, separately for human and mouse, sets of genes sharing overrepresented oligonucleotides. We then selected those oligos whose sets of genes in human and mouse contained a statistically significant fraction of orthologous genes. Oligos selected in this way are thus characterized by an evolutionarily conserved overrepresentation in the 3'-UTR sequences of selected sets of genes and can thus be considered good candidate binding sites.
  • Strand asymmetry. We identified those oligos whose frequency distribution shows a statistically significant strand asymmetry, that is, a difference in frequency between the oligo and its reverse complement. Oligos which are binding sites of regulatory elements acting on single-stranded 3'-UTR sequences are expected to show such an asymmetry since, contrary to what generally happens for transcription factor binding sites, there is no functional equivalence between an oligonucleotide and its reverse complement.

Results: Concentrating on oligonucleotides of length 7, the two methods taken together identify a total of 610 7-mers as candidate binding sites. Comparing the results with databases of known 3'-UTR regulators, we demonstrate that both methods are able to identify many previously known binding sites located in 3'-UTR regions, in particular miRNA-binding sites. Moreover, we obtain a subset of 59 7-mers showing strand asymmetry both in human and mouse and also identified by conserved overrepresentation. We consider this last set of oligos our best candidate binding sequences. Detailed analysis of this subset of 59 oligos allowed us to identify 52 of them as known cis-acting elements. The remaining 7 are strong candidates to represent new cis-acting elements in mammalian 3'-UTRs and are promising candidates for experimental verification.
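The strand-asymmetry statistic lends itself to a compact sketch. Under the null hypothesis that a 7-mer and its reverse complement are equally frequent, the count difference can be scored as a binomial z-statistic; this is an illustrative reading of the idea, not the authors' exact test:

```python
from itertools import product
from math import sqrt

COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(s):
    return "".join(COMP[b] for b in reversed(s))

def kmer_counts(seqs, k=7):
    """Count k-mers over a collection of (repeat-masked) 3'-UTR sequences."""
    counts = {}
    for s in seqs:
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if set(w) <= set("ACGT"):         # skip masked/ambiguous bases
                counts[w] = counts.get(w, 0) + 1
    return counts

def strand_asymmetry(seqs, k=7):
    """z-score of the count difference between each k-mer and its reverse
    complement; under H0 the count is Binomial(n+m, 1/2), so
    z = (n - m) / sqrt(n + m).  Large |z| flags single-strand signals."""
    counts = kmer_counts(seqs, k)
    scores = {}
    for w in ("".join(p) for p in product("ACGT", repeat=k)):
        rc = revcomp(w)
        if w >= rc:                           # score each w/rc pair once
            continue
        n, m = counts.get(w, 0), counts.get(rc, 0)
        if n + m:
            scores[(w, rc)] = (n - m) / sqrt(n + m)
    return scores
```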

15:40  Evaluation of the potential for secondary structure formation in repeated bacterial sequences: identification of 58 families sharing common motifs (20') Luca Cozzuto (CEINGE Biotecnologie Avanzate, Napoli, Italy; S.E.M.M. - European School of Molecular Medicine - Naples site, Italy)

Motivation: Detection of functional elements in genomic sequences by automated methods represents a major research goal in post-genomic analyses. Computational approaches to the analysis of secondary structure may be used to find biological entities such as regulatory hairpins, non-coding RNAs or protein binding domains otherwise not easily detectable. Systematic analysis of high-stability stem-loop structures (SLSs) within a representative set of 40 bacterial genomes showed that their number is larger than expected on the basis of sequence length and composition, and that some SLSs have strong sequence similarity [Petrillo M et al. 2006; 7:170]. Although less frequent than in eukaryotes, families of repeated short sequences have been described in several bacterial genomes and some of them, such as Pu-Bime and ERIC in E. coli, are known to contain transcribed SLSs, possibly active at the RNA level. For this reason, an automatic pipeline was developed to detect, among the available collection, SLS families defined by sequence similarity and sharing a common secondary structure.

Methods: Starting from the full SLS population previously identified, for each bacterial species SLSs predicted to fold with a free energy lower than −5 kcal/mol were selected and filtered to eliminate those falling within either mature stable RNA species (tRNAs, rRNAs) or known ISs. These sequences were grouped using the BLAST-MCL procedure described by Enright A.J. et al. in a very stringent way, in order to obtain homogeneous clusters. The resulting clusters were then re-grouped into candidate families by using the same clustering procedure in a less stringent way. For each family, overlapping elements were fused into SLS-containing regions (SCRs), and a combined cyclic procedure based on a Hidden Markov Model (HMM) genome search was performed in order to define all family members and their boundaries. Finally, manual refinement was performed to combine equivalent models which had escaped previous identification. Two different approaches were used to evaluate the aptitude of sequences from the detected families to fold into a secondary structure: the probability of non-random folding was tested by using RANDFOLD on the SLSs contained within each family; the ability to form conserved secondary structures was measured by using RNAz. The presence of aligned SLSs in agreement with the structures predicted by RNAz was also evaluated.

Results: The procedure led to the identification of 92 families: most of them (66) are defined by a model shorter than 200 bp, while 24 vary between 200 bp and 1 kb; only 2 are larger. Analysis of the available literature showed that 25 correspond to more or less extensively described families and that 67 appear to be novel. For 58 families, a common secondary structure was detected by RNAz: in 46 of them the predicted structure is corroborated by the stacking of the originally found SLSs. In these families, SLSs also tend to be positive to the RANDFOLD test for p<=0.005, a good indicator of structured sequences. The procedure mostly identified simple SLS-based families, as expected given the starting SLS populations, but also identified complex structures, such as Efa-1 in E. faecalis and Pae-1 in P. aeruginosa. Most structured families (31 of 58) are preferentially located within intergenic regions. The identified families also include 34 for which no strong evidence of common secondary structure was found. It is interesting to note that most of these (20) are located within coding regions: this preferential localization may reflect the lower tendency of such regions to assume a structured form.
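The fusion of overlapping SLSs into SCRs is, at its core, a plain interval-merging step. A minimal sketch, assuming SLSs are given as (start, end) coordinate pairs on the same genomic sequence:

```python
def merge_into_scrs(slss):
    """Fuse overlapping stem-loop structures into SLS-containing regions
    (SCRs).  A minimal sketch of the fusion step only."""
    merged = []
    for start, end in sorted(slss):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend current SCR
        else:
            merged.append([start, end])              # open a new SCR
    return [tuple(iv) for iv in merged]

# e.g. merge_into_scrs([(10, 60), (50, 120), (300, 340)])
#   -> [(10, 120), (300, 340)]
```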

16:00  An Integrated Computational Workbench for Structural and Functional Analysis in the Solanaceae Genomics Network (20') Maria Luisa Chiusano (Department of Soil, Plant, Environmental and Animal Production Sciences, University Federico II of Naples, Portici (NA), Italy)

Motivation: In the frame of the International Tomato Genome Sequencing Project (http://www.sgn.cornell.edu/), the Bioinformatics Committee coordinates the management and the analysis of the genomics data generated within the Solanaceae Genomics Network. The aim is to offer, in a distributed platform, integration of data sources from large-scale analyses in order to support research on the Solanaceae family, which includes many species of relevant agricultural interest such as tomato and potato. As participants in the Bioinformatics Committee of the Network, we set up a workbench to support i) the annotation of the Solanum lycopersicum (tomato) genome, ii) the characterization of the tomato genome and of its functional properties, and iii) the comparative analysis of sequence collections from other Solanaceae species. The current organization of the workbench and its novel features are presented.

Methods: The workbench is built and updated to improve the currently available methodologies, to expand the data as they grow, and to enhance the integration of different data sources. To date, the main resources of the workbench are the genome and transcriptome data.

Transcriptome analysis. Within EU-SOL (EU Sixth Framework Programme), we are committed to maintaining EST sequence collections from Solanaceae species retrieved from dbEST. ESTs are processed to remove redundant sequences and then released to the community through our ftp site. The ParPEST pipeline (D'Agostino et al., 2005) is run at each increase of 5000 sequences, to cluster/assemble ESTs and to generate tentative consensus sequences (TCs) for putative transcripts. The resulting data are organized into relational databases (D'Agostino et al., 2007) and released as a complete set of annotated transcript indices. The computational annotation of the expressed sequence data is based on BLASTX analysis versus the UniProtKB/Swiss-Prot database and on the use of controlled vocabularies such as the Gene Ontology and Enzyme Commission numbers. We implemented 'on the fly' mapping of the transcripts onto the KEGG metabolic pathways (Kanehisa et al., 2004) to support expression pattern analysis. We included in the annotation procedure the analysis of non-coding RNAs and correspondences between the EST dataset and the probe-sets from both the tomato reference array TOM1, a cDNA microarray (http://ted.bti.cornell.edu/), and the current Affymetrix chip.

Genome analysis. The preliminary annotation of the BAC sequences released by the SOL Genomics Network was based on the spliced alignments of Solanaceae ESTs and on the tentative consensus sequences (TCs) as generated by our group and as retrieved from the SGN (Mueller et al., 2005) and TIGR repositories (Lee et al., 2005). Arabidopsis thaliana coding sequences and the TIGR collection of Solanaceae repeats are mapped onto the genomic sequences too. Gene models are defined by the software GeneModelEST (D'Agostino et al., 2007), which evaluates the quality and reliability of the tentative consensus sequences from tomato and other Solanaceae species. The annotated BACs can be accessed through the Generic Genome Browser (GBrowse) at http://biosrv.cab.unina.it/GBrowse/.

Results: The computational workbench presented here was built to provide an Italian resource for the genomics of the Solanaceae and for the International Tomato Genome Sequencing Project. The workbench can be accessed via web-based interfaces. It supports the experimental annotation of the genomic data; indeed, EST data are a quick route for discovering new genes and for confirming coding regions in genomic sequences. Moreover, it supports the analysis of coding and non-coding gene families. However, the most relevant aim of the workbench design is to provide methodologies that allow genome-scale structural and functional analyses. The idea is to allow investigations of specific expression patterns as they can be derived from a holistic view of the transcriptome data integrated with array results contributed by the community. Our effort aims to provide novel approaches for investigating the organization and functionality of genomes, to meet the bioinformatic challenges of comprehensive analyses within the Sol Genomics Network based on a systems biology view.

16:20    Coffee break in poster area

16:40->19:00    Session 2: Novel methodologies, algorithms and tools
Description: Development of novel algorithms and tools for the analysis of biological data, critical analysis and assessment of existing computational methods, databases, distributed applications, tools and grid infrastructures. 
16:40  A statistical test based on the Bernstein inequality to discover multi-level structures in bio-molecular data (20') Giorgio Valentini (DSI - Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano)

Motivation: The unsupervised exploration and identification of structures (i.e. clusterings) underlying complex bio-molecular data is a central issue in several branches of bioinformatics, ranging from transcriptomics to proteomics and functional genomics. Several methods based on the concept of stability have been proposed to estimate the 'optimal' number of clusters in clustered bio-molecular data. A major problem with stability-based methods is the detection of multi-level structures underlying the data (e.g. hierarchical subclasses of diseases, or hierarchical functional classes of genes), and the assessment of their statistical significance. Recently, we proposed a chi-squared-based statistical test of hypothesis to assess the significance of the 'optimal' number of clusters and to discover multiple structures simultaneously present in bio-molecular data; however, some assumptions about the distribution of the data are needed to estimate the reliability of the obtained clusterings.

Methods: In this contribution we propose a new distribution-free approach that does not assume any a priori distribution of the similarity measures. In particular, we consider the problem of assessing the reliability of a clustering procedure using a stability-based approach: multiple clusterings are obtained by introducing perturbations into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations. Different procedures may be applied to randomly perturb the data, ranging from bootstrapping techniques to noise injection into the data or random projections into lower-dimensional subspaces. To assess the statistical significance and to discover multi-level structures underlying bio-molecular data, we propose a method based on the Bernstein inequality, which makes no assumption about the distribution of the data, thus ensuring a reliable application of the method to a large range of bioinformatics problems.

Results: Figure 1 (see http://homes.dsi.unimi.it/~valenti/papers/BITS07/Bertoni_Valentini_BITS_2007.html) summarizes the results obtained with the classical Golub leukemia data set, by applying two variants of the Bernstein-based and the chi-squared-based statistical tests. K-clusterings above the dashed horizontal line are significant, that is, significantly more reliable than those below the line (at the 0.001 significance level): all the tests correctly detect the 2- and 3-clusterings as the most reliable, according to the biological meaning of the data (two classes, ALL and AML, with ALL splittable into B-cell and T-cell subclusters). The Bernstein test, due to its more general assumptions (no particular distribution and no independence are assumed for the random variables that represent the similarity between clusterings), is more sensitive than the chi-squared-based test to multiple structures simultaneously present in the data. Nevertheless, it is less selective, that is, subject to more false positives. Assuming independence between the random variables, we obtain a more selective Bernstein inequality-based test, with characteristics intermediate between the chi-squared test and the general Bernstein test (red curve in Figure 1). Results with other DNA microarray data sets show the effectiveness of the proposed test based on the Bernstein inequality in the assessment of the reliability of multiple hierarchical structures discovered in bio-molecular data.
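To make the role of the Bernstein inequality concrete, the sketch below bounds the tail of the mean pairwise similarity between clusterings of perturbed data, and keeps every number of clusters k whose stability cannot be distinguished from the best one. This is an illustrative use of the bound under simplifying assumptions, not the authors' exact test statistic:

```python
import numpy as np

def bernstein_pvalue(x, t, m=1.0):
    """Bernstein bound on P(|mean(X) - E[X]| >= t) for n i.i.d. samples
    bounded by m: 2 * exp(-n t^2 / (2 var + 2 m t / 3)).  Distribution-free."""
    n = len(x)
    var = np.var(x, ddof=1)
    return 2.0 * np.exp(-n * t**2 / (2.0 * var + 2.0 * m * t / 3.0))

def significant_clusterings(stability, alpha=1e-3):
    """stability: dict k -> array of similarity values in [0, 1] between
    clusterings of perturbed data.  Keep every k whose mean stability
    cannot be proven lower than the best one at level alpha."""
    means = {k: np.mean(v) for k, v in stability.items()}
    k_best = max(means, key=means.get)
    keep = []
    for k, v in stability.items():
        gap = means[k_best] - means[k]
        if k == k_best or bernstein_pvalue(v, gap) > alpha:
            keep.append(k)
    return keep
```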

17:00  A tool for interactive analysis of bioinformatics data (20') Roberto Tagliaferri (DMI, University of Salerno, I-84084, via Ponte don Melillo, Fisciano (SA))

Motivation: We propose a scientific data exploration software environment that provides both data clustering and visualization. The approach is based on a pipeline, from input to cluster analysis and assessment, with each stage supported by visualization and interaction tools. The proposed approach enables the user to answer the following questions: Is a dataset really clusterizable? Does it contain localized uniform groups, and are these groups well separable? How many clusters should we produce? How reliable is the clusterization? If we reassign some point to a different cluster, how much does the total reliability change? In which clusters do the points of this subset lie? How can we use a priori partial information to validate a clusterization?

Methods:
1. Parameter estimation. The first step of the pipeline implemented by our tools consists of a procedure for testing the stability of clusters over a prescribed range of the algorithm's parameters, and is based on the Model Explorer approach.
2. Clustering and visualization. The same procedure can be used to obtain a first raw clusterization of the dataset. Our tool provides the K-Means, EM, SOM and PPS algorithms, which can be used both directly as clustering methods and in the parameter estimation process. The clusterization can be inspected by a module offering the visualization of cluster convex hulls as they appear in projected 2-3D spaces through PCA and MDS. The user can select each cluster and each point in a cluster, checking all its features.
3. Cluster reliability. The reliability of each cluster in a clusterization is obtained by first producing several perturbed versions of the dataset, then computing a clusterization over each of these new datasets and their similarity matrices. The fuzzy similarity matrix over these clusterizations is built by summing over the similarity matrices (see the sketch after this abstract).
4. Fuzzy membership analysis. Starting from the fuzzy similarity matrix, we can compute the value of each cluster membership function for each pattern and quantify how much single points belong to a fixed group of other points over different trials of clustering on the perturbed versions of the dataset. The fuzzy membership is represented through a 2-3D visualization of the points of the dataset. The convex hulls of the clusters composed of the points whose membership values are greater than a given threshold are shown. The user is allowed to change the threshold and save the corresponding clusters. Based on the values of every membership function for each outlier, the system suggests manually reassigning a point to a different cluster. The user can accept such suggestions or explore alternative clusterizations.
5. Interactive agglomerative clustering. Since the number of clusters is rarely known in advance, we adopt an interactive agglomerative technique on the clusters obtained in the previous phase. The user can choose different partitionings of the data on the basis of a threshold value. The result is a dendrogram in which each leaf is associated with a cluster from the previous phase. This step could employ the Negentropy distance or any hierarchical clustering.
6. Visualization of prior knowledge. A priori information can be used both to validate a clustering result and to infer new knowledge. The validation of a clustering can be obtained by comparing the prior knowledge with the knowledge obtained from the clustering itself, yielding a confidence degree. The prior knowledge can also be used to produce new knowledge inferred from the presence or absence of objects of a certain class in a certain cluster. Different hypotheses can be made depending on the relations between the prior knowledge and the features used to cluster the objects. Prior knowledge can be shown visually, together with distance and cluster information.

Results: Comparing the results of our cluster analysis with those of hierarchical clustering by Whitfield et al., it appears that the genes classified into the third NEC cluster are distributed into three different Whitfield clusters. Two of these clusters were reported as associated with DNA replication, whereas the third was composed of various, non-classified genes. The NEC cluster appears more enriched in genes belonging to the same functional class (DNA replication = 9.7%; Fisher exact test p = 3.3*10^-11) than two of the three Whitfield clusters (Whitfield 1, DNA replication = 9.5%, Fisher exact test p = 4.4*10^-9; Whitfield 2, DNA replication = 5.3%, p = 7.5*10^-3). Concerning the third Whitfield cluster (composed of non-classified genes), the hierarchical clustering appears to misclassify this set of 77 genes: the analysis of their function showed that they are highly associated with DNA replication (DNA replication = 11.6%, Fisher exact test p = 8.4*10^-6). In conclusion, at least in several cases, the proposed method of clustering appears to be more useful.
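Step 3's fuzzy similarity matrix is, in essence, a co-association matrix over the perturbed clusterizations. A minimal sketch, assuming each clusterization is given as an integer label array:

```python
import numpy as np

def fuzzy_similarity_matrix(labelings):
    """Co-association ('fuzzy similarity') matrix: entry (i, j) is the
    fraction of perturbed clusterings in which points i and j share a
    cluster.  labelings: list of 1-D integer label arrays."""
    n = len(labelings[0])
    m = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        m += (labels[:, None] == labels[None, :]).astype(float)
    return m / len(labelings)

def memberships(fsm, labels):
    """Membership of each point to each cluster of a reference
    clusterization: average similarity to that cluster's points."""
    labels = np.asarray(labels)
    ks = np.unique(labels)
    return np.stack([fsm[:, labels == k].mean(axis=1) for k in ks], axis=1)
```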

17:20  Generating multidimensional embeddings based on fuzzy memberships (20') Stefano Rovetta (Department of Computer and Information Sciences, University of Genova)

Motivation: Exploratory analysis of genomic data sets using unsupervised clustering techniques is often affected by problems due to the small cardinality and high dimensionality of the data set. These problems may be eased by performing clustering in an embedding space. This brings about the problem of selecting an appropriate transformation to perform the required multidimensional embedding, which should keep the necessary information while reducing data dimensionality. If the cardinality of the data set is small compared to the input space dimensionality, then the matrix of mutual distances, or other pairwise pattern evaluation methods such as kernels, may be used to represent data sets in a more compact way. Following this approach, the data matrix is replaced by a pairwise dissimilarity matrix D.

Methods: We have proposed an embedding technique based on the concept of fuzzy membership, which is related to but different from dissimilarity-based representations. A set of vectors, termed probes, are selected from the data set and used as reference points for the rest of the data. Probes are interpreted as fuzzy points; for each of the remaining points in the data set, the fuzzy membership to a probe can be evaluated. Therefore, for each point an ordered set of membership values is defined, one for each probe, and this ordered set can be used as a new feature vector to represent the point itself, embedded in a space induced by the probes. We call this representation space the Membership Embedding Space (MES). A point in the embedding space will be represented by a vector containing only a few non-null components (depending on the support of the membership function), corresponding to the closest probes in the original feature space. In our experiments, the memberships of fuzzy sets centered on the probes were modeled as Gaussians normalized over all probes. Here we propose a generative technique based on Simulated Annealing to select sets of probes of small cardinality. An appropriate generalized energy is defined to represent clustering quality and clustering complexity for the probes.

Results: When applied to clustering, the approach has been demonstrated to lead to significant improvements with respect to the application of clustering algorithms in the original space and in the distance embedding space. We present results based on standard data commonly available online, to make them readily comparable with other approaches. These results indicate that the method supports high-quality clustering solutions using compact sets of probes.
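A minimal sketch of the membership embedding itself, with Gaussian memberships normalized over all probes as described in the abstract; probe selection by simulated annealing is omitted, and `sigma` is an assumed parameter:

```python
import numpy as np

def membership_embedding(x, probes, sigma=1.0):
    """Map each point to its vector of fuzzy memberships to the probes:
    Gaussians centred on the probes, normalized over all probes.
    x: (n, d) data matrix; probes: (m, d) selected reference points."""
    d2 = ((x[:, None, :] - probes[None, :, :]) ** 2).sum(-1)  # (n, m) squared distances
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / g.sum(axis=1, keepdims=True)   # rows sum to 1
```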

17:40  CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment (20') Svetlin Manavski (Elaide srl, Padova, Italy)

Motivation: Searching for similarities in protein sequence databases has become a routine procedure in molecular biology. The Smith-Waterman algorithm has been available for more than 25 years; it is based on dynamic programming and performs an exhaustive search, identifying the exact best alignment of each database sequence against a query sequence. Since protein and DNA databases have become considerably large, heuristic approaches, such as those implemented in FASTA and BLAST, tend to be preferred for searching databases; the loss of accuracy is balanced by a faster execution time. The main motivation of our work is the development of new software/hardware solutions implementing the Smith-Waterman algorithm on personal computers at low cost and high performance.

Methods: Here we describe a new approach to bio-sequence database scanning using an Nvidia CUDA compatible graphics card (GeForce 8800) as an efficient hardware accelerator of the Smith-Waterman algorithm. Unlike previous approaches based on OpenGL, this is the first implementation running directly on the GPU hardware without any conversion of the problem to the graphical domain. However, the CUDA environment imposes specific development and memory models which require state-of-the-art strategies in designing the implementation. When programmed through CUDA, the GPU is viewed as a highly multithreaded co-processor. Its threads are grouped in warps, which compose blocks; finally, a grid of blocks must be managed by the software developer. Furthermore, the memory model of a CUDA device is a composition of six different kinds of memory, so the proper design of the memory usage by the application can dramatically impact the performance of the solution. We explored different software architectures to determine the optimal one with respect to the above constraints. As a result, the implementation extracts the maximum performance from a single GeForce 8800 GTX board and achieves a near-linear speed-up as additional boards are added. This makes it possible for low-cost GPUs to be applied as accelerators with performance in the GCUPS range, never before reached on commodity hardware platforms. We also discuss potential approaches to improve the performance of heuristic methods like BLAST using CUDA compatible graphics hardware as efficient accelerators.

Results: Comparative tests considered our GPU implementation using a single GPU, a CPU Smith-Waterman implementation, FASTA and BLAST. The hardware used is composed of an Nvidia GTX 8800 and a Pentium IV 3.0 GHz. The tests consist of aligning seven different sequences (lengths of 63, 127, 255, 361, 511, 1023 and 2047 amino acids) against the Swiss-Prot database (December 2006 - 250296 sequences). Our solution achieves a speed-up of more than 30 times over a straightforward CPU implementation. Furthermore, it is about 3 times faster than previous GPU approaches in the most useful range of sequence lengths. FASTA is from 9 to 14 times slower than our implementation, while we achieve performance comparable with BLAST, although the latter, like FASTA, benefits from a faster but less accurate heuristic approach. Finally, our approach allows a linear speed-up as additional GPUs are added.
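For reference, the recurrence that the GPU kernels parallelize is the standard Smith-Waterman local-alignment score. A plain CPU sketch with a linear gap penalty and illustrative scoring parameters (real protein searches use substitution matrices such as BLOSUM and affine gap costs):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """CPU reference of the Smith-Waterman local-alignment score,
    computed row by row with O(len(b)) memory."""
    prev = [0] * (len(b) + 1)   # previous DP row
    best = 0
    for ca in a:
        curr = [0]              # column 0 of the current row
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            h = max(0,
                    prev[j - 1] + s,    # diagonal: (mis)match
                    prev[j] + gap,      # gap in a
                    curr[j - 1] + gap)  # gap in b
            curr.append(h)
            best = max(best, h)
        prev = curr
    return best

# e.g. smith_waterman("HEAGAWGHEE", "PAWHEAE") -> best local score
```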

18:00  A Framework for the Prediction of Protein Complexes (20') Mario Cannataro (Laboratory of Bioinformatics, University Magna Graecia of Catanzaro, Catanzaro, Italy), Pietro Hiram Guzzi (Laboratory of Bioinformatics, University Magna Graecia of Catanzaro, Catanzaro, Italy)

Motivation: Cellular processes are composed of a set of elementary events mediated by proteins. Identifying and analyzing all the possible physical interactions among proteins is an important step in studying cellular biology. Protein interactions are modeled as undirected graphs where vertices represent proteins and edges denote interactions among them. The resulting networks of interactions are studied by investigating graph properties, such as connectivity or degree distribution, or by identifying particular regions that encode relevant biological properties. Searching for small and highly interconnected regions is used for the prediction of protein complexes [1], i.e. sets of proteins assembled together to play a biological function. Currently, different approaches exist for the prediction of complexes, each based on a particular graph clustering schema [1, 2, 3]. The in silico prediction of possible complexes could drive the planning of in vitro investigations, avoiding redundant or negative experiments. However, the scenario described above presents two main drawbacks: (i) currently available predictors may be used only by experts, and (ii) results of different algorithms cannot easily be integrated. In this work we propose a meta-predictor based on a novel algorithm that integrates the results produced by different predictors to improve the prediction of complexes, and offers a graphical user interface to simplify its use.

Methods: Consider a set of pairwise interaction data publicly available in different datasets, e.g. MINT (http://cbm.bio.uniroma2.it/mint/) or MIPS (http://mips.gsf.de/genre/proj/mpact), merged into a comprehensive graph, and a set of n predictors that generate n different predictions. Each of these predictions is composed of a set of different subgraphs, each one representing a possible protein complex. The proposed algorithm builds a metaclustering in three steps: (i) it first considers the graphs commonly found by the different predictors, then it finds (ii) sub/supergraphs and (iii) overlapping graphs. The metaclustering algorithm is embedded into a more general framework organized in different modules: (i) a metaclustering module, the core of the architecture, that implements the algorithm; (ii) a set of clustering modules that wrap the existing clustering tools; (iii) a visualizer module that visualizes networks and results; and (iv) a data manager module that manages different data formats and stores a reference dataset for the evaluation of results.

Results: We have designed the overall architecture of the framework and implemented the metaclustering and data manager modules, which provide its core functionalities. In a first experiment, we considered a yeast dataset described in [4] and available at http://rsat.scmbb.ulb.ac.be/~sylvain/clustering_evaluation/. It contains 1095 nodes and 14658 edges and was obtained by taking the real interactions stored in the MIPS database and then randomly adding and removing edges, simulating noisy data [4]. We ran the MCODE [1], RNSC [2] and MCL [3] clustering algorithms on this dataset. The resulting cluster data were then used by the proposed algorithm to build different metaclusters, considering different parameters. For each experiment we measured the Sensitivity, the Positive Predictive Value (PPV), and the Accuracy of the metaclustering, as defined in [4]. The first measures the fraction of proteins of a complex found in a common cluster, the second represents the ability of a cluster to predict a complex, and the last is their geometric average. The results confirm that the metaclustering outperforms the other algorithms with respect to sensitivity and accuracy, with 0.76 and 0.67 respectively against 0.47 and 0.57 for the best single algorithm, while showing a slight decrease in PPV. In summary, the proposed meta-predictor enhances the sensitivity and accuracy of complex prediction with respect to the predictors it uses. More information is available at http://poseidon.bioingegneria.unicz.it/~guzzi/projects.html.

References: [1] Gary Bader et al., An automated method for finding molecular complexes in large protein interaction networks, BMC Bioinformatics 4 (2003), no. 1, 2. [2] King A.D., Graph clustering with restricted neighbourhood search, Master thesis, University of Toronto, Toronto, Ontario, 2004. [3] Van Dongen S et al., An efficient algorithm for large-scale detection of protein families, Nucleic Acids Research 30 (2002), no. 7, 1575-1584. [4] Brohee et al., Evaluation of clustering algorithms for protein-protein interaction networks, BMC Bioinformatics 2006, 7:488.
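A sketch of the evaluation metrics in the style of [4], assuming the overlap between annotated complexes and predicted clusters is supplied as a matrix; the exact weighting used in [4] may differ in detail:

```python
import numpy as np

def complex_prediction_metrics(t, complex_sizes):
    """Sensitivity, PPV and geometric-mean Accuracy for predicted
    complexes, following the overlap-matrix style of Brohee et al. [4].
    t[i, j]: number of proteins of annotated complex i found in
    predicted cluster j; complex_sizes[i]: size of complex i."""
    t = np.asarray(t, dtype=float)
    n = np.asarray(complex_sizes, dtype=float)
    sn = t.max(axis=1).sum() / n.sum()   # complex-wise sensitivity
    ppv = t.max(axis=0).sum() / t.sum()  # cluster-wise positive predictive value
    acc = np.sqrt(sn * ppv)              # geometric average of the two
    return sn, ppv, acc
```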

18:20  Critical Assessment of Side Chain Prediction (CASCP): an in-house evaluation on single-point mutants of lysozyme. (20') Anna Marabotti (Laboratory of Bioinformatics, Institute of Food Science, CNR, Avellino)

Motivation: The simulation of protein structure by computational methods has improved remarkably during the last 15 years, as proved by the results of the CASP (Critical Assessment of techniques for protein Structure Prediction) competition. However, the prediction of the correct conformation of side chains is still a major problem. Several programs are available and widely used to place side chains on the backbone of a protein structure and to optimize their conformations; however, serious questions can be raised about the correctness of their predictions. The aim of the present work is to assess the ability of three widely used and freely available programs for side-chain conformation prediction (NCN, SCAP and SCWRL) to simulate the introduction of a single-point mutation in a structure and the effects of this mutation on the structure itself.

Methods: We chose as a benchmark the phage T4 lysozyme, for which hundreds of single-point mutant structures are present in the PDB database. Starting from the structures of the wild-type and of the "pseudo-wild-type" lysozyme (in which two cysteine residues are mutated to an alanine and a threonine residue), we introduced different single-point mutations (for the pseudo-wild-type, in addition to the two mutant residues already present), and we optimized the side-chain conformations of the mutants with each of the three programs. The resulting structures were compared with the corresponding structures available in the PDB using the traditional parameters adopted for this kind of evaluation, i.e. by analyzing the overall and average RMSD, with or without the inclusion of the Cbeta atom in the calculations, and the differences in chi1 and chi1+2 angles of the side chains.

Results: The predictions on this benchmark appear to be worse than those calculated by the authors on their own benchmarks and published in the papers describing each method. The program NCN appears to be the most accurate on this benchmark, both in terms of angle accuracy and RMSD, but in general it predicts no more than 70% of chi1 and chi1+2 dihedrals correctly (a prediction was considered correct when the deviation between the predicted and the experimental value of a dihedral angle was less than 20°). The overall RMSD is 1.90 A when the Cbeta atom is not included, 1.68 A when it is included in the calculations. The programs SCAP and SCWRL generally perform worse than NCN both in terms of dihedral angle predictions and RMSD. SCWRL performs slightly better than SCAP in terms of dihedral angle predictions; the opposite is true for RMSD. We also decomposed our results in terms of side-chain exposure to solvent, polarity, size of mutations, and influence on the residues nearest to the mutations. We found that side-chain conformations are generally better predicted for core residues (with less than 10% solvent-accessible area of the side chain) than for exposed ones, and the accuracy of prediction is similar for residues near the mutation and far from the site of mutation. The accuracy of prediction is higher when the size of the mutant side chain is conserved, and a large-to-small mutation is better simulated than the opposite. In contrast, the polarity of the mutation affects the prediction of side-chain conformation to a lesser extent. This study confirms the need for better predictions of side-chain conformations; however, it also confirms that programs parameterized taking into account not only a large rotamer library but also potential energy functions perform better, although they require more computational time.
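The dihedral-accuracy criterion is easy to state in code. A minimal sketch of the chi1 check, using the 20-degree tolerance quoted above and handling the wrap-around of angles:

```python
def chi1_accuracy(predicted, experimental, tol=20.0):
    """Fraction of side chains whose predicted chi1 deviates from the
    experimental value by less than `tol` degrees, accounting for the
    periodicity of angles (e.g. 359 vs 1 differ by 2 degrees)."""
    correct = 0
    for p, e in zip(predicted, experimental):
        diff = abs(p - e) % 360.0
        if min(diff, 360.0 - diff) < tol:
            correct += 1
    return correct / len(predicted)

# e.g. chi1_accuracy([62.0, 181.0, -65.0], [55.0, 160.0, -60.0]) -> 2/3
```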

18:40  REEF and LAP: a computational framework for the identification of chromosomal regions associated to functional features enrichment and differential expression (20') Alessandro Coppe (Department of Biology, University of Padova, Padova)

Motivation: Systems biology elevates the study from the single-entity level (e.g., genes, proteins) to higher hierarchies, such as entire genomic regions, groups of co-expressed genes, functional modules, and networks of interactions. Since scientific attention focuses more on critical levels of biological organization and their emerging properties than on the single components of the system, the availability of high-throughput gene expression data, coupled with bioinformatics tools for their analysis, represents a scientific breakthrough in the quest for understanding biological mechanisms. The massive accumulation of high-quality structural and functional annotations of genomes calls for the development of computational frameworks able not only to analyze gene expression profiles per se, but to merge any genomic information. The integration of different types of genomic data (gene sequences, transcriptional levels, functional characteristics) is a fundamental step in the identification of networks of molecular interactions, which will allow turning genomic research into biological hypotheses. We developed two computational procedures integrating functional annotations and genome structural information with transcriptional data: REEF (REgionally Enriched Features in genomes; Coppe et al., 2006) aims at detecting density variations of specific features along the genome sequence, while LAP (Locally Adaptive Statistical Procedure; Callegaro et al., 2006) is a methodology for the identification of differentially expressed chromosomal regions.

Methods: REEF is a procedure to detect density variations of specific features, such as a class or group of genes homogeneous for expression and/or functional characteristics, along the genome sequence. For example, it can be used to identify genomic regions with significant enrichment of genes which are co-expressed, differentially expressed, or related to particular molecular functions. The algorithm adopts a sliding-window approach with the hypergeometric distribution to calculate the statistical significance of local enrichments. False Discovery Rate control circumvents the problem of multiple testing when calculating the genome-wide statistical significance. Results of analyses are presented graphically at genome, chromosome and cluster level. A graphical tree structure enables the user to select and view a chromosome or a specific enriched region. REEF exploits the UCSC Genome Browser Custom Annotation Tracks facility in order to visualize results as custom tracks together with standard tracks from the UCSC Genome Browser. LAP is a method for the identification of differentially expressed chromosomal regions which incorporates transcriptional data and structural information by locally smoothing the expression statistic along the chromosomal coordinate. The smoothing procedure is approached as a non-parametric regression problem using various methods (local variable-bandwidth kernel estimator, spline functions or wavelets). A permutation scheme is used to identify differentially expressed regions, under the assumption that each gene has a unique neighborhood and that the corresponding smoothed statistic is not comparable with any statistic smoothed in other regions of the genome. Specifically, the statistic values are randomly assigned to chromosomal locations through B permutations and then, for each permutation, smoothed over the chromosomal coordinate. The significance of differentially expressed regions (p-value) is computed as the probability that the random null statistic exceeds the observed statistic over the B permutations.

Results: REEF is a multi-platform program written in Python, providing a graphical user interface that allows the interactive display of results. LAP is an R function performing statistical analyses, visualization of results on graphical representations of the genome, and export of the identified regions to genome browsers. The performance of the two algorithms was first assessed and compared using a simulation approach: synthetic data mimicking specific modifications or distortions of real gene expression signals were used to evaluate specificity, sensitivity, and positive predictive values (ROC curves) of the two approaches. Then, REEF and LAP were applied to the analysis of an integrated dataset of gene expression during myelopoietic differentiation. The results of this study allowed us to deepen the knowledge of the role of chromatin remodeling and epigenetic control mechanisms in transcriptional regulation, shedding light on the impact of silencing/induction of specific genes on differential expression, with respect to the contribution of epigenetic mechanisms.

References: Callegaro A, Basso D, Bicciato S. A locally adaptive statistical procedure (LAP) to identify differentially expressed chromosomal regions. Bioinformatics. 2006; 22:2658-66. Coppe A, Danieli GA, Bortoluzzi S. REEF: searching REgionally Enriched Features in genomes. BMC Bioinformatics. 2006; 7:453.
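A sketch of REEF's core statistic: a hypergeometric test for local class enrichment in sliding windows, followed by Benjamini-Hochberg control of the FDR. Window and step sizes are illustrative, and the real tool's windowing and reporting differ in detail; requires scipy:

```python
from scipy.stats import hypergeom

def reef_like_scan(positions, flagged, window=1_000_000, step=100_000):
    """Hypergeometric enrichment p-value per sliding window.
    positions: sorted gene coordinates on one chromosome;
    flagged: set of indices (into positions) of class members."""
    n_tot, k_tot = len(positions), len(flagged)
    results = []
    start = positions[0]
    while start <= positions[-1]:
        idx = [i for i, p in enumerate(positions) if start <= p < start + window]
        k = sum(i in flagged for i in idx)
        # P(X >= k) with X ~ Hypergeom(N=n_tot, K=k_tot, draws=len(idx))
        results.append((start, hypergeom.sf(k - 1, n_tot, k_tot, len(idx))))
        start += step
    return results

def benjamini_hochberg(pvals, q=0.05):
    """Indices of tests significant after BH control of the FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            n_reject = rank       # largest rank passing the step-up rule
    return order[:n_reject]

# usage: windows = reef_like_scan(pos, flagged)
#        hits = benjamini_hochberg([p for _, p in windows])
```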


Friday 27 April 2007

09:00->10:20    Session 3: Molecular evolution and novel algorithms
Description: Methodologies and applications for the study of the evolution of genes and organisms, including novel methods and algorithms for evolutionary studies. 
09:00  Human NumtS: features and bioinformatics approaches for their location and quantification (20') Marcella Attimonelli (Department of Biochemistry and Molecular Biology, University of Bari, Bari)

Motivation: Eukaryotic nuclear genomes contain, to a greater or lesser extent, fragments of their mitochondrial genome counterpart, derived from the "random" insertion of damaged mtDNA fragments. Fragments of mtDNA escape from mitochondria due to the presence of mutagenic agents or other forms of cellular stress; they reach the nucleus and, during the repair of chromosomal breaks, insert themselves into the nuclear DNA [1]. Close examination of the "Nuclear mt Sequences" (NumtS) loci reveals a lack of common features at integration sites. NumtS were discovered in 1967 by du Buy and Riley in the mouse liver nuclear genome; the presence of mtDNA in the nuclear genome was later (1983) confirmed in several other organisms. In 1994, Lopez et al. named these fragments NumtS (pronounced "new mights"). NumtS are not equally abundant in all species and are redundant and polymorphic in terms of copy number. For population geneticists and clinicians studying mt diseases, it is important to have a complete overview of NumtS quantity and location, for two reasons: NumtS are a potential source of contamination when PCR is used to study mtDNA without prior purification of mtDNA; and, being located in the nucleus, they evolve much more slowly than their functional counterparts, so they can be used as "outgroups" in phylogenetic studies. Searching PubMed for NumtS or mitochondrial pseudogenes retrieves hundreds of papers. Many of these papers report compilations of Human NumtS [2], mostly obtained by applying in silico approaches, while only a minority are derived from a wet-lab approach [3]. A comparison of the published compilations clearly shows significant discrepancies among the data, due both to unwise application of bioinformatics methods and to a not yet correctly assembled nuclear genome. Thus the data are still incomplete and imperfect. How to optimize the quantification and localization of NumtS? By applying several bioinformatics approaches and verifying the results through sequencing. Here we report a consensus compilation of Human NumtS obtained by applying different bioinformatics approaches.

Methods: The in silico approach is based on the application of database similarity searching methods, comparing the rCRS sequence [4] with the complete Human Genome. BLAST is the program used for this purpose, but there are many implementations of BLAST, and each implementation may produce different results depending on the parameters chosen and on the subject sequences. Thus, we applied Blastn under different conditions, and then applied MegaBlast and BLAT. With Blastn we submitted different runs, changing the subject sequences (database), Limits by Entrez and Limits on Hits Number. The threshold E-value was always fixed at 0.001. Among the runs, we selected the results obtained by searching the human chromosome database while keeping the Limits on Hits Number high, in order to avoid false negatives. We applied MegaBlast to all the assemblies of the Human Genome at NCBI. Finally, we applied BLAT at the UCSC site. The Human database at UCSC contains the Human golden sequences, and 4 different builds (hg15, hg16, hg17, hg18) can be searched, making it possible to eliminate false positives attributable to assembly artefacts.

Results: The comparison of the results obtained, and their comparison with published ones, has allowed us to produce the RefHumanNumtS compilation, reporting 188 Human NumtS. At present the compilation is available as an Excel spreadsheet, but it will soon be implemented in a database available through the HmtDB genomic resource. The following information is available for each NumtS: the identifier; chromosome and strand location; length; mt and nuclear coordinates; score of the alignment; BLAT block number and similarity; orthologous NumtS in chimpanzee; the sequenced NumtS and their polymorphic copy number; the genomic features of the NumtS locus; and the isochore where the NumtS is located. The mapping to the published compilations is also annotated. Finally, for each gene and regulatory region of the human mt genome, the number of times it is duplicated along the human nuclear genome is reported. Future developments: checking for SNPs in the NumtS, analysing NumtS integration sites, developing a NumtS database, applying our protocol to other species, checking for orthologous NumtS, and sequencing the compiled NumtS. The latter task, however, would require a great effort and substantial funding.

References: [1] Bensasson D et al., J Mol Evol, 2003, 57:343-354. [2] Ricchetti M et al., PLoS Biol, 2004, 2(9):e273. [3] Parr RL et al., BMC Genomics, 2006, 7:185. [4] Andrews RM et al., Nat Genet, 1999, 23:147.

09:20  Automatic identification of Transposable Elements (20') Cristian Del Fabbro (Department of Mathematics and Computer Science, University of Udine, Udine; Istituto di Genomica Applicata, Udine)

Motivation: Our knowledge of the structure and components of genomes is progressing rapidly, in pace with their sequencing. The emerging data show that a significant portion of genomes is composed of transposable elements (TEs). The proper annotation of repeated sequences and transposable elements (SINEs, LINEs, DNA-retrotransposons, Helitrons, etc.; see [1]) in an assembled genome is both important and computationally hard. Typically, researchers in this area use a combination of visual tools - most often Dotter [2] and variants of it - and alignment tools - typically members of the BLAST [3] suite - against known protein databases, in order to combine information on repeated sequences and coding regions.

Methods and Results: Our goal is to develop a tool for the automatic annotation of repeated sequences, integrated within a system capable of searching for and classifying the coding regions characterizing specific TEs. We have developed a prototype taking as input information on repeated sequences in gff format (provided as output by most practical tools for this kind of search, such as RepeatMasker [4]). The information we need is only the relative position in a genomic target sequence (e.g. a coordinate in a contig) and the name of an aligned sequence from a data set S of generic repeated sequences (this information can also be computed de novo by software tools such as ReAS [5]). Using this information, we build a simplified representation of our target string (e.g. our contig) written in an extended alphabet A. One letter of A encodes a sequence in S; consequently, a word in A encodes a sequence of overlapping repeated fragments. Two A-words at distance d are separated by d occurrences of a letter N not in A. Our contig thus becomes a sequence of words in A*, with words separated by stretches of Ns. Both visual and computational analysis of the simplified representations of our contigs (i.e. the encoded versions) can be carried out using Dotter. To do this in a practical manner, we first re-encode A into an isomorphic version in which every character corresponds uniquely to a word in the amino-acid alphabet ARNDCQEGHILKMFPSTWYVBZX. A letter in A becomes a unique sequence of amino acids and, using an ad hoc cost matrix, we use Dotter for the search, visualization and analysis of patterns of repeated sequences within contigs. The next step is to analyze the output of Dotter using information about structured sequences corresponding to proteins commonly occurring in TEs. This step is performed using a characterization of patterns typical of families of transposable elements by structured motifs and SmartFinder [6], a tool for structured motif search expressly developed for TE search.

References: [1] 'A unified classification system for plant transposable elements', Thomas Wicker, Francois Sabot, Jeffrey Bennetzen, Boulos Chalhoub, Andrew Flavell, Philippe Leroy, Michele Morgante, Olivier Panaud, Etienne Paux, Phillip SanMiguel, Alan Schulman - in preparation. [2] 'A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis', Erik L.L. Sonnhammer and Richard Durbin, Gene 167:GC1-10 (1995). [3] Altschul S.F., Gish W., Miller W., Myers E.W. & Lipman D.J. (1990) 'Basic local alignment search tool', J. Mol. Biol. 215:403-410. [4] Smit AFA, Hubley R. & Green P., 'RepeatMasker', http://www.repeatmasker.org. [5] 'ReAS: Recovery of Ancestral Sequences for Transposable Elements from the Unassembled Reads of a Whole Genome Shotgun', Ruiqiang L., Jia Y., Songgang L., Jing W., Yujun H., Chen Y., Jian W., Huanming Y., Jun Y., Gane K.W. and Jun W. [6] 'Structured Motifs Search', M. Morgante, A. Policriti, N. Vitacolonna, A. Zuccolo, Journal of Computational Biology, Vol. 12, no. 8, October 2005.
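A minimal sketch of the re-encoding idea, with the simplifying assumption that each repeat family maps to a single amino-acid letter (the abstract maps each extended-alphabet letter to a unique amino-acid word, which lifts the 23-family limit assumed here):

```python
AMINO = "ARNDCQEGHILKMFPSTWYVBZX"   # 23 usable code letters

def encode_contig(hits, contig_len):
    """Re-encode a contig over an extended alphabet: each distinct
    repeat name becomes one amino-acid letter and unannotated stretches
    become runs of 'N', so Dotter-style tools can analyze the
    compressed representation.  hits: list of (start, end, name)
    tuples from a gff-like annotation (0-based, end exclusive)."""
    letter = {}
    out, pos = [], 0
    for start, end, name in sorted(hits):
        if name not in letter:
            if len(letter) >= len(AMINO):
                raise ValueError("more repeat families than code letters")
            letter[name] = AMINO[len(letter)]
        out.append("N" * max(0, start - pos))  # gap between repeat hits
        out.append(letter[name])               # one letter per hit
        pos = max(pos, end)
    out.append("N" * max(0, contig_len - pos))
    return "".join(out), letter

# e.g. encode_contig([(5, 20, "LINE1"), (20, 35, "SINE_A")], 50)
#   -> ("NNNNNAR" + "N" * 15, {"LINE1": "A", "SINE_A": "R"})
```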

09:40  Use of Soft Topographic Maps for Clustering Bacteria on the basis of their 16S rRNA gene sequence (20') Alfonso Urso (Consiglio Nazionale delle Ricerche - Istituto di CAlcolo e Reti ad alte prestazioni (ICAR/CNR), Palermo, Italy)

Motivation Microbial identification is crucial for the study of infectious diseases. The classical method to attribute a specific name to a bacterial isolate to be identified relays on the comparison of morphologic and phenotypic characters to those described for type or typical strains. In the last years a new standard for identifying bacteria using genotypic information began to be developed. In this new approach phylogenetic relationships of bacteria could be determined by comparing a stable part of the genetic code, and the part of the DNA most commonly used for taxonomic purposes for bacteria is the 16S rRNA `housekeeping` gene. The goal of this work is to show that clustering of bacteria can be obtained using genotypic features and a topographic representation of the clusters allows to better understand the relationships among them. Methods In order to show the effectiveness of the proposed method, the bacteria belonging to Phylum BXII, Proteobacteria; Class III, Gammaproteobacteria, according to the current taxonomy of `Bergey`s Manual of Systematic Bacteriology`, were considered for the analysis. This class includes 14 orders, each of them containing one or more family further subdivided in genera. A total of 147 16S rRNA gene sequences belonging to type strains representative of every single genus within the class was downloaded from the GenBank database. In order to cluster and visualize the bacteria dataset we used an algorithm that can obtain a topographic mapping of a set of objects, starting from proximity data (i.e. mutual distance or dissimilarity measures among the objects). This algorithm builds the map using a set of units (neurons) organized in a rectangular lattice that defines their neighbourhood relationships. This algorithm was developed as an extension of the Self Organizing Map (SOM), a widely used neural network for visualization purposes. The bacteria in the dataset are labelled according to their order and they are characterized by mutual dissimilarities. The dissimilarity matrix is the input for the topographic mapping algorithm, based on a cost function minimized using the deterministic annealing technique. Results We carried out experiments with maps of different dimensions, in particular 8x8, 9x9, 10x10 neurons, trained with the dissimilarity matrix described above, and we noticed that clustering obtained in 10x10 maps is the most accurate. The results are shown in figure: most of the bacteria are clustered according to their order in the present taxonomy. However, some interesting situations are conserved in all the experiments regardless to the map dimensions. Bacteria belonging to orders 5, 6, 9 and 10 are split into two areas of adjacent clusters indicating that they should be divided into different orders. One or two atypical strains, standing alone with respect to the rest of their respective order, were found within orders 1, 3, 5, 13 and 14. These bacteria could deserve a new order attribution or otherwise the sequence was wrongly registered in the GenBank. Bacteria belonging to order 7 and to one of the two clusters of order 5 are very close to each other and they may form a unique order. In general, the proposed method was able to highlight situations in which bacteria could form new orders or they might be incorrectly registered in the GenBank. 
The topographic map we obtained showed a clustering that generally reflects the present bacterial taxonomy, but it also highlighted some peculiar cases where this innovative instrument could provide a tool both to update the current taxonomy on the basis of genotypic features and to correct sequence submissions in GenBank.
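The deterministic-annealing algorithm the authors use is not spelled out in the abstract. As a rough illustration of topographic clustering driven purely by a dissimilarity matrix, the sketch below implements a median SOM, a simpler relative in which each lattice unit's prototype is itself a data item chosen as a neighbourhood-weighted generalized median; grid size and schedule are illustrative only.

    import numpy as np

    def median_som(D, grid=(10, 10), n_iter=30, sigma0=3.0):
        """Median SOM on an n x n dissimilarity matrix D.
        Prototypes are data indices; neighbourhood lives on a rectangular lattice."""
        n = D.shape[0]
        rows, cols = grid
        units = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
        k = len(units)
        proto = np.random.choice(n, size=k)            # init: random data items
        for t in range(n_iter):
            sigma = sigma0 * (0.1 / sigma0) ** (t / max(1, n_iter - 1))
            bmu = np.argmin(D[:, proto], axis=1)       # best-matching unit per item
            H = np.exp(-((units[:, None, :] - units[None, :, :]) ** 2).sum(-1)
                       / (2 * sigma ** 2))             # k x k lattice neighbourhood
            W = H[:, bmu]                              # weight of each item for each unit
            proto = np.argmin(W @ D, axis=1)           # generalized-median update
        bmu = np.argmin(D[:, proto], axis=1)
        return bmu, units[bmu]                         # cluster index and map position

Here bmu assigns each 16S sequence to a map unit, and units[bmu] gives lattice coordinates for plotting the kind of order-coloured map the abstract describes.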

10:00  Analysis of Cancer Genes Duplicability (20') Davide Rambaldi (Bioinformatics Group, IFOM-IEO Campus, Milano)

Motivation This research investigates the relationship between the duplicability of cancer genes and their tumorigenic potential. A major question in molecular evolution concerns the role of gene duplication in the organization of genomes: to what extent are duplicated genes retained in the genome? How does this retention correlate with the gene phenotype and/or with genome features? Genes coding for complex subunits and for highly connected proteins duplicate at a lower rate than random sets, and their duplicates are generally not retained. Similarly, an inverse correlation has been observed between alternative splicing and gene duplicability. All these observations suggest that a correlation does exist between the tendency of a gene to duplicate and its function and regulation. We want to verify whether there is a relation between cancer gene duplicability and tumorigenic potential. Methods In order to determine the duplicability of genes related to tumor initiation and development, we rely on the Cancer Gene Census (below referred to as the 'reference set'). This is a dataset of genes, collected at the Sanger Center, that are implicated in oncogenesis and show two independent lines of experimental evidence of mutation in primary tumours. To detect the duplications specifically associated with cancer genes, we aligned the protein sequences corresponding to the reference set against the human genome using an in-house developed bioinformatics pipeline. Compared to a protein-protein comparison, this pipeline has the advantage of extracting not only the putative protein-coding paralogs, but also pseudogenes that are not transcribed. For each cancer protein in the reference set, a duplicate is defined as a genomic match, other than the best hit, scoring above a given threshold. This threshold was chosen as the best compromise between sensitivity and specificity. We applied the same duplicate-identification pipeline to random datasets of human genes and to a collection of genes with the same functional distribution. Results According to our pipeline, cancer genes duplicate at a rate significantly lower than random sets of human genes. We are now trying to combine this lower-duplicability signal with other characteristics to predict the oncogenic potential of human genes that have not yet been related to cancer.
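The operational definition of a duplicate used above (any genomic match other than the best hit that scores above a threshold) can be stated in a few lines. This is only a paraphrase of the counting rule, not the authors' pipeline; the hit format and threshold below are invented for illustration.

    def count_duplicates(hits, threshold):
        """hits: (locus, score) pairs for one reference protein aligned to the
        genome; the top-scoring hit is taken to be the gene's own locus."""
        ranked = sorted(hits, key=lambda h: h[1], reverse=True)
        return sum(1 for locus, score in ranked[1:] if score >= threshold)

    # Example with made-up scores: two matches beyond the self-hit pass the cutoff.
    print(count_duplicates([("chr1:100", 980), ("chr7:55", 610),
                            ("chr2:30", 590), ("chrX:12", 140)], 500))  # -> 2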

10:20 
Coffee break in poster area

10:50->11:40    Keynote Lecture by Christos Ouzounis
Description: Quantifying vertical and horizontal gene flow across species and time. 
10:50  Quantifying vertical and horizontal gene flow across species and time (50') Christos Ouzounis (Center for Research and Technology Hellas, Greece)

Using large-scale comparative genomics with appropriate algorithms and databases, briefly described, we present a phylogeny of microbial species that takes into account both vertical inheritance and horizontal gene transfer (HGT) events. So far, phylogenetic reconstructions have presented microbial relationships as trees rather than networks. This first attempt to reconstruct such an evolutionary network is termed the "net of life". We use available reconstruction methods to infer vertical inheritance, and an ancestral state inference algorithm to map HGT events onto the tree. We demonstrate that vertical inheritance constitutes the bulk of gene transfer on the tree of life. We term the horizontal gene flows between species "vines", and show that multiple, yet tiny, vines interconnect throughout the tree. These results strongly suggest that the HGT network is a scale-free graph, a finding with important implications for genome evolution. Finally, we propose that genes might propagate extremely rapidly across microbial species through the HGT network, using certain organisms as hubs. Open problems and possible future directions for research will also be discussed. Further reading: Ouzounis CA (2005) Ancestral state reconstructions for genomes. Curr. Opin. Genet. Devel. 15, 595-600. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net of life: Reconstructing the microbial phylogenetic network. Genome Research 15, 954-959. Kunin V, Ouzounis CA (2003) The balance of driving forces during genome evolution in prokaryotes. Genome Research 13, 1589-1594.
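The ancestral state inference method is only named in the abstract; as a generic stand-in, the sketch below runs Fitch's small-parsimony pass on gene presence/absence. Under this toy model, a gene family whose leaf pattern needs many state changes on the species tree is a candidate for horizontal acquisition. The tree encoding and states are invented.

    def fitch(node, children, leaf_state):
        """Bottom-up Fitch pass on a binary tree.
        Returns (candidate state set, minimum number of state changes)."""
        if node not in children:                   # leaf
            return {leaf_state[node]}, 0
        sets, changes = [], 0
        for child in children[node]:
            s, c = fitch(child, children, leaf_state)
            sets.append(s)
            changes += c
        inter = set.intersection(*sets)
        if inter:
            return inter, changes
        return set.union(*sets), changes + 1       # union step costs one change

    # Toy species tree: root -> (A, inner), inner -> (B, C); 1 = family present.
    children = {"root": ["A", "inner"], "inner": ["B", "C"]}
    states = {"A": 1, "B": 0, "C": 1}
    print(fitch("root", children, states))         # -> ({1}, 1): one event suffices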


11:40->12:40    Session 5: Large scale analysis of experimental data
Description: Methods and applications for the analysis of numeric data, images, or other experimental data, from high-throughput approaches in functional genomics, proteomics, cellular biosignalling. 
11:40  BATS: a user-friendly software for analyzing time series microarray experiments (20') (files Slides  ) Claudia Angelini (Istituto per le Applicazioni del Calcolo, CNR)

Motivation The expression levels of genes in a given cell can be influenced by a pharmacological or medical treatment. The response to a given stimulus is usually different for different genes and may depend on time. Microarray experiments allow one to simultaneously monitor the expression level of thousands of genes. Suitable statistical methods are required to automatically detect those genes that can be associated with the biological condition under investigation. In what follows we consider microarray experiments involving comparisons between two biological conditions, such as control and treatment, made over the course of time. Special statistical algorithms are necessary for the efficient analysis of this type of experiment. BATS is a user-friendly software for the Bayesian Analysis of Time Series microarray experiments based on the novel statistical approach proposed in Angelini et al. (2006). BATS can carry out analysis with both simulated and real experimental data; it also handles data from different platforms. Methods BATS implements a functional, fully Bayesian method which allows the user to automatically identify and rank differentially expressed genes and to estimate their profiles on the basis of time series microarray data. The arrays are taken at n different, not necessarily uniformly spaced, time points on the interval [0,T], with possible replicates at some or all time points. For each gene, we assume that the evolution in time of its expression level is governed by a regular function, the true gene expression profile, which is observed with some additive noise. Following Angelini et al. (2006), each gene expression profile is modeled as an expansion over some orthonormal basis, with an unknown number of terms and coefficients. A fully Bayesian model for the data is then developed by eliciting prior distributions on the number of terms, the coefficients and the noise level. In particular, all parameters in the model are treated either as random variables or as nuisance parameters recovered from the data. Three different Bayesian models, which differ in the way the noise is treated, are considered. All evaluations are based on analytic expressions. The proposed procedure successfully manages various technical difficulties which arise in microarray time-course experiments, such as the small number of available observations, non-uniform sampling intervals, the presence of missing or replicated data, and temporal dependence between observations for each gene. Results BATS is a user-friendly software written in Matlab and freely available upon request. It allows a user to analyze time series microarray experiments using three different models to account for various types of errors, thus offering a good compromise between nonparametric techniques and those based on normality assumptions. It allows the user to specify the hyper-parameters of the model or to estimate them from the data. The method accounts for multiplicity, selects and ranks differentially expressed genes, and estimates their expression profiles. Since all evaluations are performed using analytic expressions, the entire procedure requires very little computational effort. In the talk, we describe the statistical model used in BATS, the main features of the software interface, and an application of BATS to a case study of human breast cancer cells stimulated with estrogen. The latter led to the discovery of some new differentially expressed genes which had not been flagged earlier due to the high variability in the raw data.
Although originally designed for cDNA time series microarray experiments, BATS supports different platforms, such as Affymetrix and Illumina. Future work will focus on the development of a method, designed for the data described above, for clustering time series gene expression profiles. References F. Abramovich and C. Angelini (2006) Bayesian maximum a posteriori multiple testing procedure. Sankhya, 68, 436-460. C. Angelini, D. De Canditiis, M. Mutarelli and M. Pensky (2006) Bayesian approach to estimation and testing in time course microarray experiments. Tech. Rep. IAC-CNR n. 317/06, available at http://www.na.iac.cnr.it/rapporti/anno2006.htm. Cicatiello L., Scarfoglio C., Altucci L., Cancemi M., Natoli G., Facchiano A., Iazzetti G., Calogero R., Biglia N., De Bortoli M., Sfiligol C., Sismondi P., Bresciani F. and Weisz A. (2004) A genomic view of estrogen actions in human breast cancer cells by expression profiling of the hormone-responsive transcriptome. Journal of Molecular Endocrinology, 32, 719-775.
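BATS itself is a Matlab package and its estimators are given in the cited reports. Purely to illustrate the ingredients named above (an orthonormal basis expansion, Gaussian priors, and analytic evidence for choosing the number of terms), here is a minimal Bayesian linear-model sketch with Legendre polynomials; the hyper-parameters and the fixed-noise assumption are simplifications, not the BATS choices.

    import numpy as np
    from numpy.polynomial.legendre import legvander

    def fit_profile(t, y, deg, tau2=1.0, sigma2=0.1):
        """Posterior mean and log-evidence for y(t) = sum_k c_k P_k(t) + noise,
        with coefficients c ~ N(0, tau2*I) and known noise variance sigma2."""
        x = 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0   # rescale to [-1, 1]
        B = legvander(x, deg)                                 # n x (deg+1) basis matrix
        n, p = B.shape
        A = B.T @ B / sigma2 + np.eye(p) / tau2               # posterior precision
        mean = np.linalg.solve(A, B.T @ y / sigma2)           # posterior coefficient mean
        _, logdetA = np.linalg.slogdet(A)
        ll = (-0.5 * (y @ y / sigma2 - mean @ A @ mean)       # log marginal likelihood
              - 0.5 * logdetA - 0.5 * p * np.log(tau2)
              - 0.5 * n * np.log(2.0 * np.pi * sigma2))
        return mean, ll

    # choose the number of basis terms by maximizing the evidence, per gene:
    # best_deg = max(range(6), key=lambda d: fit_profile(t, y, d)[1])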

12:00  A fuzzy cluster analysis model for mining the cMap dataset to investigate common drug modes of action (20') (files Slides  ) Francesco Iorio (Telethon Institute of Genetics and Medicine)

Motivation The Connectivity Map (also known as cMap) is a reference collection of gene-expression profiles from cultured human cells treated with bioactive small molecules. In [1] the authors showed how to use this resource to find connections among small molecules sharing a mechanism of action, and left some open questions. To identify small molecules with similar effects on the basis of gene-expression profiles, [1] describes a nonparametric, rank-based pattern-matching strategy based on the Kolmogorov-Smirnov statistic [2]. It is used to produce an Enrichment Score [3] for each profile in the set, quantifying how similar each profile is to a query signature. A query signature is any list of genes whose expression is correlated with a biological state of interest. Each gene in the query signature carries a sign, indicating whether it is up-regulated or down-regulated. The profiles are rank-ordered according to their expression values. The query signature is compared to each rank-ordered list to determine whether its up-regulated genes tend to appear near the top of the list and its down-regulated genes near the bottom (positive connection), or vice versa (negative connection), yielding a connectivity score ranging from +1 to -1. All instances in the database are then ranked according to their connectivity scores; those at the top are most strongly correlated with the query signature, and those at the bottom are most strongly anticorrelated. Our goal is to perform cluster analysis on the profiles contained in the cMap, providing a complete portfolio of small-molecule behaviors and discovering how many groups of (reasonably) similar elements the cMap contains. After this first step, the clusters obtained can be used to find connections between an external signature and the clustered profiles by evaluating a cluster membership function. Methods In agreement with [1], we verified that clustering procedures using classical distance metrics detect dominant clusters related to cell types. In order to discover structures related to the molecule mode of action (MOA), we adopt an approach that, starting from the GSEA [4] definition of enrichment score, derives a novel distance metric that highlights similarities of MOA rather than of cell type. We plan to use this metric to check the similarity between all pairs of gene profiles in the set, building for each profile a query signature that takes into account only the significantly up-regulated and significantly down-regulated genes. The ranking lists obtained for each profile will then be used to build a fuzzy similarity matrix over all of them, on which fuzzy clusters can be obtained and membership functions evaluated. This will allow MOA identification of small molecules. Visualization techniques using spherical surfaces in several different data-transformation domains will be developed to make the clustering results interpretable by eye. Results
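The rank-based score described above is easy to prototype. The sketch below implements the unweighted Kolmogorov-Smirnov running sum and the sign convention for combining up- and down-signatures; it omits the final scaling to [-1, +1] across instances, so treat it as a loose reading of [1]-[3] rather than the published procedure.

    def enrichment_score(ranked, gene_set):
        """Kolmogorov-Smirnov-style running-sum enrichment of gene_set
        in a rank-ordered gene list."""
        hits = set(gene_set)
        n, n_h = len(ranked), len(hits)
        p_hit = p_miss = best = 0.0
        for g in ranked:
            if g in hits:
                p_hit += 1.0 / n_h
            else:
                p_miss += 1.0 / (n - n_h)
            dev = p_hit - p_miss
            if abs(dev) > abs(best):
                best = dev                       # maximum running deviation
        return best

    def connectivity(ranked, up_genes, down_genes):
        """Zero when the two enrichments agree in sign, as in the cMap convention."""
        es_u = enrichment_score(ranked, up_genes)
        es_d = enrichment_score(ranked, down_genes)
        return 0.0 if (es_u >= 0) == (es_d >= 0) else es_u - es_d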

12:20  Overall quantitative pattern extraction analysis of two-dimensional electrophoresis images from prostatic cancer mouse model (20') Saveria Mazzara (Department of Biomedical Engineering, Polytechnic University, Milan, Italy)

Motivation Two-dimensional gel electrophoresis (2DE) is still a key component of current proteomics, since it represents an indispensable tool for the analysis of protein expression profiles in complex biological systems such as whole cells and tissues. The technique provides a 2D map which orthogonally couples a charge-based separation (isoelectric point, pI) to a size-based one (relative molecular mass, Mr). The maps obtained from protein migration are acquired as grey-level images which can be processed to allow the analysis and comparison of the experimental outcome for different samples. However, the complexity and the intrinsic variability of the maps make this task difficult and time consuming. The analysis of gel images is a very labour-intensive step and requires considerable expertise to properly extract information. The process generally takes advantage of dedicated tools for quantitative analysis and comparison of the gels, treated as collections of identified spots that are matched across the different samples by means of image registration techniques. Besides these tools, and in a complementary way, it could be useful to develop strategies for the automatic classification of whole gels, based on the assessment of the complete ensemble of spots shown in the maps, providing a schematic representation of the overall outcome of the experiments without considering the individual spots in detail. We developed such a strategy and applied it to the gel image set acquired in a study on prostatic carcinoma carried out with an animal model of the pathology: the TRAMP mouse (TRansgenic Adenocarcinoma Mouse Prostate). The 2DE maps were generated from serum samples drawn from transgenic and wild-type mice at three different developmental stages. Methods Some recent works describe efforts towards the classification of 2DE maps; briefly, quantitative image descriptors, extracted from pixel intensities quantified in the different zones into which the image is partitioned, are analyzed with dimensionality-reduction techniques to discriminate different clinical conditions. Starting from this kind of approach, we intervened on the definition of the gel descriptors: the idea is that the features refer to areas of the image segmented as spots, excluding artifacts and background signal from the quantitative description; the positions of the detected spots are then expressed no longer in pixels but in terms of pI and molecular weight (MW), after ad hoc calibration. This simple step makes the samples comparable without performing a canonical image matching by means of registration techniques; indeed the new (pI, MW) space is invariant with respect to alterations of protein migration, allowing the inclusion in the analysis of gel images that would otherwise have had to be excluded for lack of the necessary homogeneity with respect to the other samples. The quantitative descriptors so defined were processed through principal component analysis. Results The informativeness of the chosen descriptors allows the gels of the data set to be seen as items in a privileged three-dimensional space, segregating according to their biological conditions. The samples, both for the transgenic mice and for the control mice, obtained at different developmental times, appear linearly separable. The scatterplots for the transgenic vs. wild-type comparison, performed at the three developmental ages, show a degree of separability according to the disease course, in agreement with the biological knowledge. The proposed method is highly repeatable and does not need any a priori information. It may provide an effective visualization tool and important clues for classification; thus it could represent a reliable complement to the routine of a proteomics lab for performing rapid and systematic screening tests capturing the essential impression of the 2DE gel image. References Marengo E, Robotti E, Antonucci F, Cecconi D, Campostrini N, Righetti PG: Numerical approaches for quantitative analysis of two-dimensional maps: A review of commercial software and home-made systems. Proteomics 2005, 5:654-666.
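The descriptor-plus-PCA step can be sketched directly: bin spot volumes on a calibrated (pI, MW) grid, one vector per gel, then project. The bin edges, spot format and standardization below are assumptions; the authors' actual descriptors are only loosely specified in the abstract.

    import numpy as np
    from sklearn.decomposition import PCA

    def gel_descriptor(spots, pi_edges, mw_edges):
        """spots: (pI, MW, volume) triples for one gel, after calibration."""
        H = np.zeros((len(pi_edges) - 1, len(mw_edges) - 1))
        for pi, mw, vol in spots:
            i = np.searchsorted(pi_edges, pi) - 1
            j = np.searchsorted(mw_edges, mw) - 1
            if 0 <= i < H.shape[0] and 0 <= j < H.shape[1]:
                H[i, j] += vol                     # accumulate spot volume per bin
        return H.ravel()

    def project_gels(X, n_components=3):
        """X: one descriptor row per gel; standardize, then PCA."""
        Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
        return PCA(n_components=n_components).fit_transform(Xs)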

12:40 
Lunch

14:00->15:20    Poster Session

15:20->16:00    Session 4: Biomedical Informatics
Description: Tools and databases for the analysis of biomedical information and their interconnection with bioinformatics resources. 
15:20  Microarray data infrastructure using GeBBALab based on Alfresco technology (20') (files Slides ppt  ) Silvia Giuliani (High Performance Systems Division, CINECA, Casalecchio di Reno (Bologna))

Motivation Great progress has been made in recent years in integrating technologies and innovations in computer science with those of the life sciences. However, many activities in biological and especially clinical research still do not have access to the necessary computer technology. Hospitals, for example, often perform outstanding research but lack the bioinformatics tools which could fully exploit the activities carried out. The GeBBALab project is addressing these problems by creating a "virtual laboratory" with contributions from both scientific and technological/industrial partners. The project has identified two key areas: 1. Microarray data infrastructure and analysis 2. Integration of patient and clinical data with genomics information. Although the latter objective is of critical importance in health care, it is still under discussion. This abstract will therefore concentrate on the GeBBALab (Genetica, Biotecnologie e Bioinformatica Applicata) infrastructure for microarray storage and analysis, also because one of the consortium members already provides a microarray analysis service and so can guide and test its design. We emphasise, however, that the system has been designed to allow the addition of clinical data. Methods The efficient storage and analysis of microarray data is of considerable interest and there is much activity worldwide. In general most researchers adopt a "single workstation approach" for managing and analysing expression data. However, this method is rapidly becoming inconvenient for many reasons: - There is no provision for the systematic recording of experimental information - Current PCs are not sufficiently powerful for analysing data - Comparison with data from other researchers or public repositories is difficult. Careful consideration of these points has suggested the following criteria for the design of the microarray infrastructure: - Users must be given the opportunity to use a wide range of common and user-friendly tools for data entry and for the different platforms available, e.g. Affymetrix, Agilent, Illumina etc. - Data should be distributed - Data must be recorded in a format which allows interoperability of all the data sources - User-friendly portals or clients are required to access resources, together with powerful computational facilities to process datasets. To satisfy these criteria the infrastructure was structured into two distinct levels: 1. The data entry and storage level. 2. The application level for running analysis applications. The system consists of a "central" node and many "satellite" nodes, each with its own, potentially virtualized, data store. The system has been designed in a modular way so that it keeps working even if the central node is unavailable; in our schema "central" merely indicates a central registry for distributed indexing and querying. Data are stored, analyzed and exchanged through an architecture built upon Alfresco, an advanced open-source Enterprise Content Management system that provides a common interface and access to distributed data sources. Alfresco also includes user authentication and various levels of access privileges, thus allowing many degrees of data security and privacy. On top of the Alfresco structure we have built several software modules to manipulate MAGE-ML files, extract metadata from them, and store and index that metadata in the repository so that microarray data can be queried according to different search criteria.
All the modules are provided through a SOA (Service Oriented Architecture) layer shared among different web applications, allowing users to choose, through a web portal, the appropriate software from those available; the portal transparently invokes algorithms to fetch the data for analysis. High-performance servers are available for CPU- or memory-intensive calculations. Results We have demonstrated a user-oriented, powerful infrastructure for microarray data management and analysis. It allows the user to enter data and, being distributed, avoids the limitations of a centralised server. A prototype using Alfresco is already available and microarray researchers are invited to contact the authors if they wish to experiment with the system. Future enhancements to GeBBALab will include analysis applications and, crucially, the possibility of integrating patient data.
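As a flavour of the metadata-extraction modules (which in GeBBALab live inside Alfresco, not in Python), here is a minimal XML walk over a MAGE-ML file. The tag and attribute names are illustrative only; real MAGE-ML is far more deeply nested.

    import xml.etree.ElementTree as ET

    def extract_metadata(mage_ml_path):
        """Pull a few descriptive fields out of a MAGE-ML file for indexing.
        'Experiment' and its attributes are assumed names for this sketch."""
        root = ET.parse(mage_ml_path).getroot()
        records = []
        for exp in root.iter("Experiment"):
            records.append({
                "identifier": exp.get("identifier"),
                "name": exp.get("name"),
                "descriptions": [d.text for d in exp.iter("Description") if d.text],
            })
        return records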

15:40  Phenotype Miner: an integrated IT system for supporting genetic studies (20') Angelo Nuzzo (Department of Computer Science and Systems, University of Pavia, Pavia)

Motivation A specific characteristic of the post-genomic era will be the correlation of genotypic and phenotypic information. In this context, studies aimed at the so-called genetic dissection of complex traits represent a first crucial benchmark for Biomedical Informatics. These studies are based on different types of data, i.e. clinical, genetic and genealogical data. The definition of an Information Technology infrastructure is crucial to support both phenotype discovery and the mapping of genotypic traits. We developed a Web-based data management system to combine these different sources of information in an integrated framework, in order to make data investigation more efficient and easier for the final user and to improve the knowledge discovery process. Methods The overall strategy of geneticists' analysis consists of 3 main steps: i) discovering the phenotype or clinical condition to be investigated, given clinical data of the population; ii) searching for relationships between individuals showing the same phenotype (if any exist, genetic causes may be supposed); iii) choosing appropriate loci to be genotyped to identify genotype-phenotype associations. The final purpose of the system that we have developed is to support each of these steps. - Dynamic Query tool A first crucial task that hampers the development of automated IT solutions in genetic studies is the appropriate definition and identification of the phenotypes that geneticists want to investigate. Clinicians and biologists usually define a phenotype by a set of variables and the values they may take. In order to select (and then analyze) the individuals satisfying that set of rules, it is necessary to write a suitable SQL statement to run a query and retrieve them. However, as the users may have no expertise in a query scripting language, we provide a tool that automatically generates, through a graphical user interface, the proper SQL script to select individuals with the defined phenotype. This tool is based on a phenotype formalization that corresponds to the construction of a logical tree, in which the nodes are the conditions, the AND operator is used to go from the top to the bottom (specialization) and the OR operator is used to add an upper node from the bottom to the top (generalization); a minimal sketch of this translation is given after the abstract. - OLAP engine Dealing with clinical data to analyze phenotypic information implies taking their heterogeneity into account, so a browsing interface that allows easy investigation of such data is needed. This means that a tool for performing a multidimensional inspection of the dataset is required. The technique of multidimensional analysis is implemented by software tools called Online Analytical Processing (OLAP) engines. In our system we use an OLAP engine written in the Java programming language, called Mondrian (http://mondrian.pentaho.org/), which executes queries written in the MDX language (which has effectively become a standard for data warehouse applications), reads data from a relational database, and presents the results in a multidimensional format through a Java API, so that the final user may choose the presentation layer. We developed a web application based on JSP pages to integrate it with the Phenotype Editor, deployed as a Java Web Start application. - Pedigree Analysis and Visualization Finally, it is necessary to analyze the relationships between the selected individuals in order to make hypotheses on the possible genetic origin of the phenotype.
This corresponds to searching for common ancestors shared by those individuals and analyzing the shared pedigree. To perform this task, we have developed a software engine (written in Java) that allows the use of a free pedigree analysis program, called Pedhunter (http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/pedhunter.html), directly from the Web application. Output files are provided in the standard Linkage-postMakePed format, so that any available tool can be used for visualization. We integrated our specific tool for pedigree visualization as a Java Web Start application. Results The system described above is still under development, but it has already been used for several tests on a real dataset, the clinical database of the Val Borbera isolated-population study. Geneticists can dynamically compose queries on the dataset using the graphical Phenotype Editor; then they can explore the extracted data at different levels of detail using the OLAP engine, in order to find which phenotypes could be of particular interest for the population. Several phenotypes have been identified (regarding hypertension, thyroid diseases and diabetes) by analyzing the clinical database of the project, which currently contains more than 4000 individuals and about one hundred clinical measures. Finally, the automatic mapping of the selected phenotype onto the pedigree allows the geneticists to make hypotheses on the genetic origin of that phenotype and to make suitable choices for the subsequent genotyping and genetic analysis.
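The sketch promised above: a minimal translation of the AND/OR condition tree into a SQL WHERE clause. Table and column names are invented; the real Phenotype Editor works on a richer formalization.

    def tree_to_sql(node):
        """node is either ('cond', '<sql predicate>') or ('AND'|'OR', [children])."""
        kind = node[0]
        if kind == "cond":
            return node[1]
        joiner = " %s " % kind
        return "(" + joiner.join(tree_to_sql(c) for c in node[1]) + ")"

    # hypothetical phenotype: high systolic pressure AND a thyroid anomaly
    phenotype = ("AND", [("cond", "sbp > 140"),
                         ("OR", [("cond", "tsh < 0.3"), ("cond", "ft4 > 1.8")])])
    sql = "SELECT id FROM individuals WHERE " + tree_to_sql(phenotype)
    # -> SELECT id FROM individuals WHERE (sbp > 140 AND (tsh < 0.3 OR ft4 > 1.8))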


16:00->17:00    "Graziella Persico" Lecture by Guy Rouleau
16:00  Introduction (10') Giuseppe Martini, Catello Polito
16:10  Invited Lecture (50') Guy Rouleau
17:00 
Coffee break

17:20->19:00    BITS Members' Meeting
Description: Assembly of the members of the BITS society. 
17:20 
Free time
19:00 
Transfer to Palazzo Doria d'Angri
20:30 
Social Dinner

Saturday 28 April 2007

09:00->10:20    Session 6: Gene expression and systems biology
Description: Quantitative analysis of the expression of gene sets and their relationships, modeling and inference of biological systems and networks. 
09:00  Identification of Perturbed Pathways Using High-throughput Data (20') Laura Riva (MIT Biological Engineering Division, Cambridge, USA; Department of Biomedical Engineering, Polytechnic University of Milan, Milano)

Motivation Cells live in a dynamic environment in which they encounter various perturbations. These perturbations may arise from toxic compounds, environmental changes, mutations, or a disease. Identifying the molecules and cellular pathways that are affected can reveal the nature of a perturbation, provide potential therapeutic targets and shed light on mechanisms of cellular adaptation. We present a computational method to discover the pathways that are altered by a perturbation by analyzing high-throughput data in Saccharomyces cerevisiae. Methods Our method combines different types of high-throughput data to develop a coherent, mechanistic view of how cellular pathways are altered. We have created a graphical model of the interactome based on physical interaction data. The model also incorporates relations between genes and the proteins that regulate them, using a genome-wide map of experimentally determined transcriptional regulatory sites. We developed a novel algorithm to search for the pathways that are altered by the perturbation. Results We assess our method by applying it to over 100 datasets where the perturbations are known single-gene deletions. In these test data, the genes and pathways identified by the algorithm are relevant to the actual perturbation in 80% of the 104 cases. We use the same algorithm to predict the pathways that are most affected by different compounds. We identify genes known to be responsive to the agents, and affected pathways that are in agreement with biological findings.
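The search algorithm itself is not described in the abstract; as a naive stand-in, the sketch below extracts a candidate perturbed module from an interactome graph by taking the responsive genes, adding their immediate interaction partners, and keeping the largest connected piece. Graph construction and gene names are assumed.

    import networkx as nx

    def candidate_module(G, hits):
        """G: undirected interactome graph; hits: genes responding to the
        perturbation. Returns the largest connected subgraph spanned by
        the hits plus their direct neighbours."""
        nodes = set(h for h in hits if h in G)
        for h in list(nodes):
            nodes.update(G.neighbors(h))
        H = G.subgraph(nodes)
        if H.number_of_nodes() == 0:
            return H
        return H.subgraph(max(nx.connected_components(H), key=len))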

09:20  Integrated experimental and systems biology approach to the identification of transcriptional regulatory network of p63 transcription factor (20') (files Slides pdf  ) Giusy Della Gatta (Telethon Institute of Genetics and Medicine, Via Pietro Castellino 111, Naples, Italy; Seconda Universita degli studi di Napoli, Italy)

Motivation In spite of the large amount of data deriving from high-throughput genomic technologies, understanding how genes and proteins are connected and operate within networks is still a biological challenge. We developed an integrated experimental and systems biology approach to identify the biological pathways and the direct targets of the p63 transcription factor, using primary keratinocytes as a cell culture model. p63, a member of the p53 family, is essential for the development of various ectodermal structures including skin, but its mechanism of action remains largely unknown. Methods A retrovirus expressing an inducible p63 gene was generated by fusing the coding portion of p63 with a tamoxifen-responsive estrogen receptor. Identification of immediate-early genes upon p63 activation was achieved by a time-series microarray analysis. Significant gene expression profiles were obtained for 800 genes. To infer the network surrounding the p63 gene, we developed an algorithm called TSNI (Time Series Network Identification). TSNI models the network as a system of ordinary differential equations relating changes in gene transcript concentrations to each other and to the external stimulus, and ranks the genes according to the probability of their being direct targets of p63. Results We selected the top 100 ranked genes as significant targets of the DNP63a gene. These genes were found to belong to cell cycle control, cell adhesion and keratinocyte differentiation pathways, in agreement with the targets expected for p63 from the literature. To verify the putative p63 target genes, we measured global changes in gene expression in p63-knockdown keratinocytes versus wild-type, and we performed a ChIP-on-chip analysis on custom Agilent arrays to validate the TSNI-predicted genes as functional targets. Our novel approach identified a number of genes that are significantly regulated by p63 at early time points and are involved in cell cycle control, cell adhesion and keratinocyte differentiation. We are currently deciphering the molecular function of these essential genes in stratified epithelia.
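TSNI is defined in the authors' publications; the fragment below is only a caricature of the ODE idea: approximate dx/dt by finite differences and fit a linear model dx/dt = A x + b u by ridge-regularized least squares, then rank genes by the magnitude of their direct-stimulus term. Dimensions and the regularization constant are arbitrary.

    import numpy as np

    def infer_network(X, t, u, lam=1e-2):
        """X: genes x timepoints expression; t: timepoints; u: stimulus profile.
        Fits dx/dt = A x + b u row-wise by regularized least squares."""
        dX = np.gradient(X, t, axis=1)           # numerical time derivatives
        Phi = np.vstack([X, u[None, :]])         # regressors: all genes + stimulus
        G = Phi @ Phi.T + lam * np.eye(Phi.shape[0])
        W = np.linalg.solve(G, Phi @ dX.T).T     # row i: [A_i1..A_in, b_i]
        return W[:, :-1], W[:, -1]               # A (genes x genes), b (genes,)

    # rank candidate direct targets by the magnitude of their stimulus term:
    # A, b = infer_network(X, t, u); order = np.argsort(-np.abs(b))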

09:40  Oscillations and bistability in intracellular signal transduction pathways (20') Maria Bersani Alberto (Dept. of Mathematical Methods and Models (Me.Mo.Mat.), University "La Sapienza", Rome, Italy)

Motivation Enzyme reactions play a pivotal role in intracellular signal transduction. Many enzymes are known to possess Michaelis-Menten (MM) kinetics and the MM approximation is often used when modeling enzyme reactions. However, it is known that the MM approximation is only valid at low enzyme concentrations, a condition not fulfilled in many in vivo situations. In the last decade many mathematical models have been formulated to investigate the behavior of complex intracellular biochemical networks. The aim of such modeling (which is an integral part of the large-scale 'Systems Biology' project) is roughly twofold: to reproduce and study particular phenomena observed experimentally (like bistability, oscillations, ultrasensitivity, etc.) and to investigate the properties of these networks as information processing and transducing devices. Surprisingly, the mathematical formulation of these highly interconnected enzyme reactions usually lacks a serious critique of the delicate passage from the kinetics of simple reactions to the kinetics of complex reaction networks. This can be justified when analyzing underlying mechanisms (e.g., the importance of feedback or the creation of oscillations), where the exact kinetic expressions and parameters are less important, since one is usually interested only in the qualitative behavior that the system can exhibit. However, in the light of the Silicon Cell project, which aims at a qualitatively as well as quantitatively precise representation of the living cell, the use of correct parameters, kinetic expressions and initial conditions (i.e., steady-state concentrations of molecular species) becomes crucial. Methods One of the principal components of the mathematical approach to Systems Biology is the model of biochemical reactions based on classical Michaelis-Menten kinetics. This formulation considers a reaction where a substrate binds an enzyme reversibly to form a complex. The complex can then decay irreversibly into a product and the enzyme, which is then free to bind another molecule of the substrate. This scheme is mathematically represented by a system of two nonlinear ordinary differential equations (ODEs), corresponding initial conditions and two conservation laws. Assuming that the complex concentration is approximately constant after a short transient phase leads to the usual Michaelis-Menten (MM) approximation (or standard quasi steady-state assumption, sQSSA), which is valid when the enzyme concentration is much lower than either the substrate concentration or the Michaelis constant. This condition is usually fulfilled in vitro, but often breaks down in vivo. The advantage of a quasi steady-state approximation is that it reduces the dimensionality of the system, passing from two equations (the full system) to one (the MM approximation, or sQSSA), and thus greatly speeds up numerical simulations, especially for large networks such as those found in vivo. Moreover, the elementary kinetic constants are usually not known, whereas finding the kinetic parameters of the MM approximation is a standard in vitro procedure in biochemistry. However, to simulate physiologically realistic in vivo scenarios, one faces the problem that the MM approximation is no longer valid, as mentioned above. Recently several other mathematical approaches, such as the total quasi steady-state approximation (tQSSA), have been developed for enzymes with MM kinetics. These new approximations are valid not only whenever the MM approximation is, but in a greatly extended parameter range. Importantly, the tQSSA uses the same parameters as the sQSSA; hence, the parameters found in vitro with the MM approach can be used by the tQSSA for modeling in vivo scenarios. Results Our investigation applies to every biochemical network which includes enzyme reaction cascades. Starting from a single reaction and arriving at the mitogen-activated protein kinase (MAPK) cascade, including feedback, we give several examples of biologically realistic scenarios where the MM approximation leads to quantitatively as well as qualitatively wrong conclusions, and show that the tQSSA greatly improves the accuracy of the simulations. However, since the tQSSA, although superior to the sQSSA, also does not always work, we also consider the alternative of simulating each step of the reaction by means of the full system of ODEs, which means describing every reaction in terms of two equations and facing three instead of two parameters per reaction, as has been done for example for the MAPK cascade. We also study simple reaction schemes characterized by bistability and concentration oscillations; we compare the solutions of the full system with their standard and total approximations. We show that the parameter ranges giving oscillations or bistability are sometimes slightly different, concluding that in these cases the approximations are wholly inadequate to represent the behaviour of the network.
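The single-reaction comparison is easy to reproduce numerically. Below, the full two-equation system for E + S <-> C -> E + P is integrated next to its one-equation sQSSA reduction; the rate constants are invented, and the enzyme concentration is deliberately chosen comparable to the substrate's, the regime where the MM reduction is expected to fail.

    import numpy as np
    from scipy.integrate import solve_ivp

    k1, km1, k2 = 1.0, 1.0, 0.3          # illustrative rate constants
    E0, S0 = 2.0, 1.0                    # enzyme not in small excess: sQSSA suspect
    Km = (km1 + k2) / k1

    def full(t, y):                      # y = [free substrate, complex]
        s, c = y
        e = E0 - c                       # enzyme conservation law
        return [-k1 * e * s + km1 * c,
                 k1 * e * s - (km1 + k2) * c]

    def sqssa(t, y):                     # one-equation MM reduction
        s = y[0]
        return [-k2 * E0 * s / (Km + s)]

    T = np.linspace(0.0, 10.0, 200)
    sol_full = solve_ivp(full, (0.0, 10.0), [S0, 0.0], t_eval=T)
    sol_mm = solve_ivp(sqssa, (0.0, 10.0), [S0], t_eval=T)
    # compare total substrate s + c from sol_full against s from the sQSSA;
    # the discrepancy grows as E0 approaches or exceeds Km and S0.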

10:00  MoVIN server: Motif Validation of Interaction Networks (20') (files Slides pdf  ) Paolo Marcatili (Department of Biochemical Sciences, University "Sapienza" of Rome)

Motivation Protein-protein interactions are at the basis of most cellular processes and crucial for many biotechnological applications. During the last few years the development of high-throughput technologies has produced several large-scale protein-interaction data sets for various organisms, and many interaction databases have been created by data-mining techniques. It is well known that interactions can be mediated by the presence of specific features, such as motifs, patches and domains. Even if many efforts are underway to elucidate the role of these features in the regulation of interaction networks, very little is known about them on a genome scale. Data integration and computational methods can be used to assign a confidence level to specific interactions or datasets and to obtain information about the molecular basis regulating such interactions. Methods The MoVIN web server is a new bioinformatics resource for the analysis and validation of protein interaction networks. It combines yeast protein interaction data with other biological resources - such as sequences, process and component ontologies, domains and structures - to construct a high-confidence interaction set by identifying similar features in protein groups sharing a common interaction partner. The results are presented to the user through an integrated graphical interface that offers the possibility of exploring the interaction network and of accessing much biologically relevant data computed by the server or present in other databases. Results To assess the usefulness of our server, we analysed the presence of similar linear motifs, functions, localizations and domains in many different interaction datasets. We observed a statistically significant presence of such features with respect to random datasets and found that these pieces of information are consistent but not redundant. Our study shows that the analysis of shared motifs in protein interaction networks can be a valuable method to investigate the properties of interacting proteins and to provide information that can be effectively integrated with other sources. As more experimental interaction data become available, this method will be a useful tool to gain a wider and more precise picture of the interactome.


10:20->11:10    Spot Presentations
Description: 5 min. presentations of selected posters 
10:20  Identification of cancer signaling pathways from published gene expression signatures using PubLiME (05') Heiko Muller (FIRC Institute of Molecular Oncology Foundation (IFOM), Via Adamello 16, 20139 Milan, Italy; European Institute of Oncology (IEO), Via Ripamonti 435, 20141 Milan, Italy)

Motivation Gene expression technology has become a routine application in many laboratories and has provided a large number of gene expression signatures identified in a variety of cancer types. Interpretation of gene expression signatures would profit from the availability of a procedure capable of assigning differentially regulated genes or entire gene signatures to defined cancer signaling pathways. Here we describe a graph-based approach that identifies cancer signaling pathways from published gene expression signatures. Methods Published gene expression signatures are collected in a database (PubLiME: Published Lists of Microarray Experiments) enabled for cross-platform gene annotation. Significant co-occurrence modules composed of up to ten genes recurring in different gene expression signatures are identified. Significantly co-occurring genes are linked by an edge in an undirected graph. Edge betweenness and k-clique clustering, combined with graph modularity as a quality measure, are used to identify communities in the resulting graph. Results The identified communities consist of cell cycle, apoptosis, phosphorylation cascade, extracellular matrix, interferon and immune response regulators, as well as communities of unknown function. The genes constituting the different communities are characterized by common genomic features and by strongly enriched cis-regulatory modules in their upstream regulatory regions, consistent with the pathway assignment of those genes.
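The graph pipeline maps naturally onto networkx primitives. The sketch below builds a co-occurrence graph from signature gene lists using a raw count cutoff (the real analysis tests significance instead) and picks the edge-betweenness partition with the best modularity.

    import itertools
    import networkx as nx
    from networkx.algorithms.community import girvan_newman, modularity

    def cooccurrence_graph(signatures, min_shared=3):
        """signatures: list of gene sets; link genes co-occurring in at
        least min_shared signatures (a crude proxy for significance)."""
        count = {}
        for sig in signatures:
            for a, b in itertools.combinations(sorted(sig), 2):
                count[(a, b)] = count.get((a, b), 0) + 1
        return nx.Graph((a, b) for (a, b), c in count.items() if c >= min_shared)

    def best_partition(G, max_levels=10):
        """Scan the first edge-betweenness splits; keep the most modular one."""
        best, best_q = None, -1.0
        for comms in itertools.islice(girvan_newman(G), max_levels):
            q = modularity(G, comms)
            if q > best_q:
                best, best_q = comms, q
        return best, best_q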

10:25  IPROC, a remote web application for analysis and processing of multidimensional biological images (05') (files Slides ppt  ) Concita Cantarella (CEINGE Biotecnologie Avanzate, Napoli, Italy; S.E.M.M. European School of Molecular Medicine, Naples, Italy)

Motivation Digital images are in widespread use for basic research in cell and molecular biology as well as for laboratory and clinical diagnosis. Fast optical sectioning techniques, in combination with time-lapse microscopy, generate large multidimensional data sets, used to image the dynamics of biological events. The management of large numbers of images requires the use of databases (DB), while processing of the acquired images is often necessary to enhance the visibility of cell features that would otherwise be hidden. Several image processing options, often designed to address specific biological requirements, are available as unix programs and libraries. Their integration into a web-based image processing engine, accessible online by many concurrent users, promises to be at once a powerful and efficient solution for dealing with images in a scientific environment. Methods A combination of dynamic web pages, an SQL database, web services and concurrent process scheduling was used to create IPROC, a system able to store images and have them processed interactively by a relatively large number of simultaneous users. Three independent modules cooperate to generate the interface and to ensure processing, by coordinating web-page construction, image data retrieval, image processing and delivery. Processing is obtained by integrating a large library of different unix filters installed on the server, while interactivity is provided by the ability to react quickly to user input via small data requests. Filters are applied in sequence, reducing the need for temporary storage and allowing unlimited backstepping. PHP was used as the main language to build a number of objects which together take care of obtaining images from a DB or from external files and applying the required processing steps. Wrapper objects take care of interfacing with different sets of image processing tools, available as command-line programs or web services. Adapter modules allow execution of different image processing steps in sequence. Results IPROC looks and behaves like a locally running application, while retaining the ability to take advantage of storage and computational resources available on the servers. Multidimensional images are treated as collections of independent frames that may be processed simultaneously. Visualization may highlight various aspects of the image data by using a variety of display modes. The modular structure of the application permits the distribution of the various parts of the same job across different machines, thus assuring speed and low latency for the operations involved in page redrawing. In a parallel environment, a large number of frames, requested by one or more users, may be computed at the same time on different cluster nodes. Several processing filters have currently been included, by taking advantage of adapters developed for the ImageMagick and PDL libraries as well as for PHP's internal image functions. Most point or area processing filters are available, as well as tools for modifying image geometry and a number of specific processing steps acting on the image as a whole, such as reslicing, projection or deconvolution. The integration of image analysis tools makes it easy to produce and visualize, in text or graphic formats, histograms and other statistical measurements. A specific set of tools, independently developed for studying cell movement, is also being adapted to work within this environment.
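IPROC itself is PHP wrapped around unix filters; the chaining-with-backstepping idea it describes reduces to a few lines in any language. A toy version in Python, with two stand-in filters:

    import numpy as np

    def run_pipeline(image, filters):
        """Apply filters in sequence, keeping every intermediate frame so the
        user can step back to any earlier stage without recomputation."""
        history = [image]
        for f in filters:
            history.append(f(history[-1]))
        return history

    invert = lambda im: im.max() - im
    binarize = lambda im: (im > im.mean()).astype(float)
    stages = run_pipeline(np.random.rand(64, 64), [invert, binarize])
    # stages[0] is the original frame, stages[-1] the fully processed one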

10:30  Human/chimpanzee trans-specific SNPs: searching for balancing selection (05') (files Slides pdf  ) Matteo Fumagalli (Scientific Institute IRCCS E. Medea, Bioinformatic Lab, Via don L. Monza 20, Bosisio Parini (LC), Italy)

Motivation Most species are monophyletic throughout most of their genome. Yet examples of trans-specific polymorphisms have been reported in different species, including humans. Trans-specific SNPs (ts-SNPs) can be explained by three main causes: 1) SNP survival due to random chance, 2) coincidental mutations occurring after speciation and 3) balancing selection. In particular, the role of balancing selection in human/chimpanzee coding sequence evolution has been estimated to be modest (Asthana et al., 2005). While few genome-wide attempts have been made to detect signatures of balancing selection outside coding sequences, recent data on specific loci have indicated a role for balancing selection in the evolution of cis-acting regulators of genes involved in immune response. The availability of extensive human and chimpanzee SNP data, as well as of genome-wide measures of nucleotide diversity and SNP allele frequency, now allows the identification of regions under balancing selection. Methods Human and chimpanzee SNPs were retrieved from the dbSNP database (build 125); SNP positions refer to the genomic assemblies NCBI build 35 and NCBI build 1 version 1, respectively. Only base substitution polymorphisms were included, and SNPs located at CpG sites in either species were discarded. A total of 8637604 and 1163289 SNPs were obtained for human and chimpanzee, respectively. Whole-genome human-chimpanzee pairwise alignments (available through the UCSC Genome Browser, www.genome.ucsc.edu) were scanned in order to identify SNPs located at the same position and showing the same alleles in both species. Genomic annotations, including multi-species conserved sequences (MCS), as well as data concerning human recombination rates, were retrieved from the UCSC database (tables phastCons17wayElements and snpRecombRateHapmap). Tajima's D values were obtained from the UCSC database for three populations of different descent: African, European and Chinese. Tajima's D (Tajima, 1989) is one of the most frequently used tests of nucleotide diversity: regions with an excess of high-frequency variation (observed as a positive Tajima's D) are consistent with balancing selection. Still, it should be noted that Tajima's D has low power to detect selection, so that even small departures from the neutral expectation might indicate balancing selection. All statistical analyses were performed using R (www.r-project.org). Results We identified a total of 1411 non-CpG ts-SNPs. Since it has been previously estimated that a shared polymorphism would survive for 4.6 million years (conservatively, the time separating human from chimpanzee), based on the number of available chimpanzee SNPs we would expect to identify only 3 'surviving' SNPs. In order to verify whether the number of ts-SNPs is higher than expected from coincidental mutations, we calculated the expected number of ts-SNPs on the basis of a random distribution of human/chimpanzee polymorphic sites and verified that the number of SNPs we identified is significantly higher (binomial test, p < 0.0001). Therefore a subset of ts-SNPs might be maintained by balancing selection. The ts-SNP distribution was as follows: 5 in coding regions, 46 in conserved noncoding sequences, the remainder in introns or intergenic spacers. The frequency of ts-SNPs in conserved noncoding regions did not differ from expectation.
We next wished to identify closer-than-expected ts-SNP pairs: based on the distance distribution of 100 samples of randomly selected SNPs (and an empirical p value of 0.01), we determined a conservative threshold of 10 kb. We identified 24 ts-SNP doublets or trios closer than this threshold as possibly located in regions subject to balancing selection. Indeed, one of the doublets was located in the MHC class II region (in addition to 4 more ts-SNPs), previously shown to be under balancing selection. We next wished to verify whether Tajima's D is higher in regions surrounding ts-SNPs (or a proportion of them). Mean Tajima's D values in the 10 kb surrounding ts-SNPs were significantly higher than the genome average for the 3 populations. Also, ts-SNPs were located in regions displaying a significantly higher than average polymorphism density (Wilcoxon rank sum test, p < 0.0001); although a high SNP frequency might result in an increased probability of retrieving a ts-SNP, an increased nucleotide diversity is expected as a result of balancing selection. The high SNP density surrounding ts-SNPs is not the result of strong recombination activity, since recombination rates around ts-SNPs do not differ from the average (calculated on random SNP samples). We therefore consider that some of the ts-SNPs we identified might represent a molecular mark of balancing selection, and we are applying further tests in order to define genomic regions and specific functional elements subject to this selective process.
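The excess-over-expectation check reduces to a one-sided binomial test. The sketch below uses the counts quoted in the abstract, but the per-SNP coincidence probability is a placeholder: the authors derive it from a random placement model that is not reproduced here.

    from scipy.stats import binomtest   # scipy >= 1.7

    n_chimp_snps = 1163289              # chimpanzee SNPs surveyed (from the abstract)
    observed_ts = 1411                  # non-CpG ts-SNPs found
    p_coincide = 3e-4                   # PLACEHOLDER: P(chance match to a human SNP)

    res = binomtest(observed_ts, n_chimp_snps, p_coincide, alternative="greater")
    print(res.pvalue)                   # small p-value -> excess over chance overlap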

10:35  Integrating current knowledge to predict the effect of mutations on splicing (05') (files Slides pdf  ) Matteo Cereda (Bioinformatics, Scientific Institute I.R.C.C.S. E.Medea)

Motivation The production of functional messenger RNAs in metazoans is critically dependent upon the accuracy of pre-mRNA splicing, a highly regulated process ensuring that introns are removed and an ordered array of exons is maintained in mature transcripts. The process requires accurate recognition of exons, in particular a precise determination and pairing of 5' and 3' splice sites by the splicing machinery. Over the past few years several approaches have been explored for the exact detection of splice sites, and an increasing number of sequence elements have been identified that are involved in the splicing mechanism and in its regulation. In this work we wished to test whether a tool can be built that uses the known splicing determinants to predict the effect of mutations/variations on splicing. Such a tool can be very effective when screening for mutations in non-coding regions, by a priori selecting regions/positions that are predicted, if mutated, to yield splicing aberrations; moreover, it can give useful information when missense mutations have already been identified and, more generally, can be used to predict the possible effects of SNPs. Methods To create the exon database, protein-coding genes annotated in GenBank were selected, considering a single id for each transcription cluster. The data set was composed of 134623 canonical ('AG-GT') exons. A large number of known elements that take part in splicing regulation was evaluated on the exon set, in particular: consensus values, exon length (EL), branch point position (BPP), exonic splicing enhancer motifs (ESE), exonic splicing silencers (ESS), and UAGG and GGGG motifs. Frequency distributions were evaluated for exon size, branch point distance and relative element abundance. A number of intronic sub-sequences (pseudo exons) were selected whose flanking regions resemble splice sites (a threshold of 0.6 on consensus values was used) and whose length lay between the 0.005th and 0.995th quantiles of the exon length distribution. For both exons and pseudo exons, EL, BPP, ESE, RESE and ESS were scored according to the frequency distributions in the real exon set. A total of 8 independent data sets were created, each containing scores from 10000 real and 10000 pseudo exons. To extract the most relevant features of this set we chose a Recursive Feature Elimination (RFE) method built upon a Support Vector Machine with a linear kernel. RFE was performed with a leave-one-out cross-classification approach on 6 independent sets. Once the most relevant variables were determined, an SVM with radial kernel and 500-fold cross-validation was trained on a set different from the ones used in RFE to identify the model. Another set was used as a test set to measure performance in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC). A preliminary set of experimentally well-studied splicing mutations was then analysed by comparing the predictions of this SVM model on wild-type/mutated sequence pairs, in order to verify whether the predictions of our model correspond to the described splicing aberrations. Results The RFE technique applied to the 6 independent datasets allowed the identification of 7 features that proved most important in each set. These features were then used to train the SVM with radial kernel, obtaining a final model (sensitivity = 0.96, specificity = 0.93, accuracy = 0.95 and MCC = 0.89).
The model was then used to predict mutation effects on splicing: out of 15 studied mutations, 10 were completely predicted; in one case the effect was only partially predicted, in 2 cases the predicted effect was different from the described one, and in 2 cases no effect was predicted. These preliminary results are quite encouraging and deserve further investigation. A greater number of cases should be collected to better test the effectiveness of the model, and different feature selection strategies should be investigated.
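The feature-selection-then-classification scheme maps directly onto scikit-learn, shown here on synthetic data (the real matrix of exon/pseudo-exon feature scores is not available here, and cv=5 stands in for the much heavier cross-validation of the abstract).

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)                  # synthetic stand-in data
    X = rng.normal(size=(2000, 20))                 # 20 candidate splicing features
    y = (X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=2000)) > 0

    rfe = RFE(SVC(kernel="linear"), n_features_to_select=7).fit(X, y)
    X_sel = X[:, rfe.support_]                      # keep the 7 surviving features

    acc = cross_val_score(SVC(kernel="rbf"), X_sel, y, cv=5).mean()
    print(rfe.support_, acc)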

10:40  BLASTZ-WEB: a web interface to the BLAST-Z algorithm (05') (files Slides ppt  ) Alessandra Traini (traini.alessandra@gmail.com)

Motivation BlastZ [1] is a pairwise alignment tool based on the gapped BLAST algorithm. BlastZ permits the alignment of genomic sequences and is useful for detecting local similarities in long nucleotide sequences such as, for example, putatively functional conserved DNA sequences or syntenic regions. The BlastZ algorithm searches for short near-perfect matches of fixed length between the two sequences. Each matching region is extended in two steps: first a gap-free dynamic programming approach is used; then, for matches above a specific threshold, the algorithm extends consecutive local similarities using a dynamic programming approach that includes gaps. Only matches with a score above a specified threshold are reported. The BlastZ text output file, which includes only the coordinates of the local similarities and of the possible extended alignments, makes it hard to derive the sequence alignments and is rather difficult to read for non-expert users. Commonly, BlastZ output is graphically supported by LAJ (Local Alignments with Java) [2], an interactive viewer available in the PipMaker package [3]. The tool can display an interactive dot-plot, a percent identity plot (pip), text representations of the aligned sequences and a diagram showing the genome sequence organization, when known. However, LAJ is limited to sequences shorter than 170 Mb, and the dot-plot viewer, though useful for a general view of the regions of similarity between two sequences, is not user friendly or flexible for specific comparative analyses and for managing the multiple local similarities. Therefore, in order to allow a user-friendly view of the resulting information, we wrote a script to parse the BlastZ output file and designed a web-based interface for it. Methods and Results The BlastZ output parser is written in Perl, while the web interface is created using HTML and PHP scripts. Input data files must be in FASTA format. BlastZ analysis is performed on two sequences at a time. The pairwise sequence alignments are reconstructed and reported in an HTML page. All the local alignments resulting from the BlastZ gap-free step are provided, together with popup links to a global view where extended local alignments are joined using gaps. Two multi-FASTA sequence files can be uploaded in order to align each sequence in the first set (queries) against all the sequences in the second set (subjects). In this case, the results are reported as a summary list, and all the alignments produced between each pair of submitted sequences are reported. For each pairwise comparison, links to the BlastZ text output as well as to the HTML viewer are included. The user can also view the same results through the LAJ graphical user interface. References: [1] Schwartz S. et al.: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13:103-7. [2] LAJ was written by Cathy Riemer of the Miller lab at the Penn State University Center for Comparative Genomics and Bioinformatics. [3] Schwartz S. et al.: PipMaker: A Web Server for Aligning Two Genomic DNA Sequences. Genome Res. 2000, 10:577-586.
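For comparison with the authors' Perl parser, here is a minimal Python reading of the coordinate stanzas in BlastZ's text ('lav') output: each 'a {' block carries a score ('s' line) and gap-free segments ('l' lines). The stanza handling below is simplified from memory of the format and may need adjustment against real files.

    def parse_lav(path):
        """Collect score and gap-free segments from each alignment block."""
        blocks, current, in_a = [], None, False
        with open(path) as fh:
            for line in fh:
                tok = line.split()
                if not tok:
                    continue
                if tok[0] == "a" and tok[-1] == "{":
                    current, in_a = {"score": None, "segments": []}, True
                    blocks.append(current)
                elif tok[0] == "}":
                    in_a = False
                elif in_a and tok[0] == "s":
                    current["score"] = int(tok[1])
                elif in_a and tok[0] == "l":
                    beg1, beg2, end1, end2 = (int(x) for x in tok[1:5])
                    current["segments"].append((beg1, beg2, end1, end2,
                                                float(tok[5])))   # percent identity
        return blocks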

10:45  Inferring causal relationships between dynamic genomic phase clusters (05') (files Paper;   files Slides pdf;   files Poster pdf  ) Abhishek Dutta (Department of Information and Communication Technology, University of Trento, Italy; School of Informatics, University of Edinburgh, UK)

Motivation The identification of cyclic genes by biologists and the computational mining of clusters and of their dynamic relationships are both well explored. However, these tasks have been performed separately and on small chunks of data. Here we propose a novel approach that filters and clusters the entire set of yeast genes, correlates the clusters with the 5 cell-cycle phases, and builds a high-level belief network on top to capture the regulatory dynamics with the help of a score metric. Methods The data set consists of 7744 yeast genes whose expression is measured across 14 time points [2]. We identify 3 logical steps. (1) Filtering: missing values were imputed by an Elman network; genes with no valid names or null-value expressions were then deleted across the entire data set, and the data were log-normalised. Since we are interested in the dynamics of the network, genes below the 10th percentile in variance and below the 15th percentile in entropy were subsequently filtered out, leaving a more workable set of 5594 gene profiles. (2) Clustering: these genes were then clustered using self-organising feature maps, characterized by the Kohonen learning rule. The input space has 14 dimensions, and the number of neurons in the competitive learning layer is fixed to 5, the rationale being to later establish a correlation with the cell-cycle phases. We further cluster them using fuzzy k-means with Pearson correlation as the distance measure. (3) Bayesian belief nets: the centroids of the 5 cluster profiles were discretized into up- and down-regulation with respect to their corresponding means, giving 5 vectors of 14 binary data points in time. These form the 5 discrete nodes of the Bayesian network. Any combination of these 5 nodes could be a possible regulator of the others and vice versa (i.e. they could be targets as well), through all possible time lags; predicting the network structure with causal inference from the data is therefore NP-hard in the number of nodes. Hence we make a first assumption that the underlying process is Markovian, i.e. regulators can affect their targets only 1 time step ahead. The second assumption is that a target can be regulated by at most two regulators. For all such alignments between the gene nodes 1 to 5, the prior conditional probability distribution tables are computed from the data matrices. Maximum likelihood estimates are then learned for all such possible pairs, and for the respective joints by marginalising. Further, we compute the score metric by summing over the regulator's ML estimates given the up- and down-regulations of the target, and also with the Bayesian score metric [1], over both the SOM and k-means matrices, for single and paired possible regulators. Results As can be seen in the plots, the mean cluster profiles from SOM and k-means synchronise well with the phases, exactly where our interest lies. They are ordered according to the peaking of the profile, i.e. SG1 peaks before SG2 and so on. They also bear a strong correlation with the Stanford cell-cycle classification performed over a subset of genes, and hence from here on we also label them with their respective phases M, G1, S, G2, etc. The first table shows the scores of the Bayesian network over the SOM clusters for each target, with the corresponding regulators in that row. The second table shows the joint case (2 possible regulators) over k-means, again numbered row-wise. These scores match the Bayesian scoring metric within a bounded fixed ratio.
With respect to these scores, the final causal inferences between the gene nodes, i.e. the phases of the yeast cell cycle, are shown. SOM and k-means unanimously agree that S/G2-phase genes are regulated jointly by M/G1- and S-phase genes, as can be verified from the above tables. They also indicate G1 as the regulator of G2/M. A new framework was thus developed, built from scratch, for obtaining high-level causal relationships between the various phases of the cell-cycle-regulated gene profile clusters.
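As a hedged sketch of step (3), the snippet below discretizes the cluster centroids around their means and scores a candidate regulator set for a target, under the first-order Markov assumption, by the maximum-likelihood log-probability of the target's next state given the regulators' current states; the variable names and this particular score are illustrative, not the authors' implementation.

    import numpy as np

    def discretize(centroids):
        """Binarise centroid profiles (one row per cluster, one column per
        time point) into up (1) / down (0) regulation w.r.t. the row mean."""
        return (centroids > centroids.mean(axis=1, keepdims=True)).astype(int)

    def ml_score(target, regulators):
        """Sum of log ML estimates of P(target[t+1] | regulators[:, t]),
        assuming a first-order Markov process and at most two regulators."""
        counts = {}
        for t in range(len(target) - 1):
            key = (tuple(regulators[:, t]), target[t + 1])
            counts[key] = counts.get(key, 0) + 1
        score = 0.0
        for (ctx, _), n in counts.items():
            total = counts.get((ctx, 0), 0) + counts.get((ctx, 1), 0)
            score += n * np.log(n / total)    # n/total is the ML estimate
        return score

Under the two stated assumptions, only pairs and couples of the 5 phase nodes need to be scored, which keeps the otherwise NP-hard structure search tractable.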

10:50  Open discussion (10')

Discussion of the presentations

11:20  Coffee break in poster area

11:40->12:30    Keynote Lecture by Tom Blundell
Description: Structural Bioinformatics, the Proteome and Chemical Space: New Dimensions of Drug Discovery. 
11:40  Structural Bioinformatics, the Proteome and Chemical Space: New Dimensions of Drug Discovery (50') Tom Blundell (Department of Biochemistry, Tennis Court Road, Cambridge CB2 1GA, UK)

The knowledge of the three-dimensional structures of protein targets now emerging from structural proteomics and targeted structural biology programmes has the potential to greatly accelerate drug discovery. However, knowledge is required not only of the three-dimensional structures of individual proteins but also of the chemical space that defines the range of chemical ligands they might bind. This can be defined by fragment-screening techniques, which inform not only lead discovery but also the optimization of candidate molecules. 1. Blundell, T.L., Jhoti, H. and Abell, C. (2002). High-throughput crystallography for lead discovery in drug design. Nature Reviews Drug Discovery 1, 45-54. 2. Pellegrini, L., Burke, D.F., von Delft, F., Mulloy, B. and Blundell, T.L. (2000). Crystal structure of fibroblast growth factor receptor ectodomain bound to ligand and heparin. Nature 407, 1029-1034. 3. Pellegrini, L., Yu, D.S., Lo, T., Anand, S., Lee, M., Blundell, T.L. and Venkitaraman, A.R. (2002). Insights into DNA recombination from the structure of a RAD51-BRCA2 complex. Nature 420, 287-293. 4. Congreve, M., Murray, C.W. and Blundell, T.L. (2005). Structural biology and drug discovery. Drug Discovery Today 10, 895-907.


12:30->13:50    Session 7: Structural biology and drug design
Description: Studies on structure and function of biological structures and their interaction, including drug design. 
12:30  Globularity criteria to evaluate the structural quality of modeled proteins (20') (files Slides ppt  ) Susan Costantini (Centro di Ricerca Interdipartimentale di Scienze Computazionali e Biotecnologiche, Seconda Universita` di Napoli, Italy; Laboratorio di Bioinformatica e Biologia Computazionale, Istituto di Scienze dell'Alimentazione, CNR, Avellino, Italy)

Motivation The basic problem of any computational approach to structure prediction is evaluating the quality of the models. CASP results have shown that most submitted models are far from the real protein structure. Our work aims to analyze the structural properties that characterize protein globularity, to suggest an operative procedure for assessing the globular quality of theoretical protein models obtained by computational approaches in the absence of experimental target structures, and, finally, to prevent the diffusion of theoretical models inconsistent with the real features of protein globularity. Methods The analyses of experimental structures were performed using the PDBselect set of experimentally determined, non-redundant protein structures with mutual sequence similarity <25%. The secondary structure defined by the DSSP algorithm was used to assign each protein to the correct structural class, according to the four main SCOP classes ("all-alpha", "all-beta", "alpha/beta", and "alpha+beta"). Several features related to protein globularity were evaluated for each structure. Four properties, i.e. the total accessible surface area (ASA), the number of MM-type H-bonds, the number of voids, and the number of water molecules in a layer of 5 Angstroms, correlated better than the others with molecular weight. For each structural class, all four properties fit linear regressions against the molecular weights of the selected proteins, and the linear function was used to estimate the expected value of each feature. A globularity score was then calculated for each protein by summing, over the four properties, the ratio of the difference between the calculated and predicted values to the related error (i.e. the Root Mean Squared Error of the fit). On the basis of the most frequent score values calculated for the proteins belonging to the same class, it was possible to identify a threshold score specific to each of the four structural classes. These threshold scores were then used as cut-offs to evaluate, as a testing dataset, the structural properties of models submitted to the CASP6 protein structure prediction experiment in the New Fold (NF) and difficult Fold Recognition Analogous (FR/A) categories. Results We analyzed different structural properties of experimentally solved globular proteins belonging to the four structural classes. The properties were found to be linearly correlated with protein molecular weight, though with some differences among the four classes. These results were used to develop an evaluation test for theoretical models based on the expected globular properties of proteins: a globularity score for each protein was calculated using the parameters with the highest correlation coefficients with protein molecular weight (i.e. MM-type H-bonds, number of voids, total accessible surface area, and water molecules). To verify the effectiveness of the test, we applied the globularity score to protein models submitted to the sixth edition of CASP. Surprisingly, our results show that many of the submitted models (54.6%) should be discarded a priori because they lack the structural properties expected of globular proteins.
Our study therefore supports the need for careful checks and allows models to be evaluated in the absence of experimental reference structures, preventing the diffusion of incorrect structural models and the formulation of incorrect functional hypotheses. It can be used to check the globularity of predicted models and to supplement other methods already used to evaluate their quality.
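A minimal sketch of the globularity score described above, assuming each property has a class-specific linear fit (slope, intercept) against molecular weight together with the RMSE of that fit; the property keys and the use of absolute deviations are assumptions for illustration.

    def globularity_score(protein, class_fits):
        """Sum, over the four globularity-related properties, of the
        deviation between the observed value and the value predicted from
        molecular weight, normalised by the RMSE of the class-specific fit."""
        score = 0.0
        for prop in ("asa", "mm_hbonds", "voids", "waters_5A"):
            slope, intercept, rmse = class_fits[prop]   # per-class regression
            predicted = slope * protein["mw"] + intercept
            score += abs(protein[prop] - predicted) / rmse
        return score    # compared against the class-specific threshold

A model whose score falls outside the threshold of its structural class would then be flagged as non-globular and discarded a priori.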

12:50  Molecular dynamics analyses of peptides forming amyloid-like fibrils (20') Luigi Vitagliano (Istituto di Biostrutture e Bioimmagini, C.N.R., Napoli)

Motivation The onset of severe neurodegenerative diseases is frequently associated with insoluble amyloid-like fibrils. The structural characterization of these fibrils has long been hampered by the very low solubility and the non-crystalline nature of the aggregates. Major advances in this field have recently been achieved using model peptides whose solid aggregates exhibit most of the features of amyloid fibrils. Particularly impressive are the results achieved in the structural characterization of the peptide GNNQQNY, derived from the prion-determining domain of the yeast protein Sup35. The high-resolution structure of the peptide provided an atomically detailed model for amyloid-like fibrils, denoted the cross-beta spine steric zipper (Nelson, R., Sawaya, M.R., Balbirnie, M., Madsen, A.O., Riekel, C., Grothe, R., Eisenberg, D. (2005) Nature 435, 773-8). To gain further insight into the structural determinants of amyloid fiber structure and formation, we have undertaken molecular dynamics (MD) simulations on a variety of different models arranged in a cross-beta spine structure. Methods The starting coordinates for the MD simulations were derived from the crystal structure of the peptide GNNQQNY. For simulations of polyglutamine systems, the starting models were generated by molecular modelling using the structure of GNNQQNY as template. MD simulations were performed with the GROMACS software package. Models were immersed in rectangular or cubic boxes filled with water molecules. An extensive analysis of GNNQQNY dynamics was performed using replica exchange MD (REMD), an advanced methodology for enhancing conformational sampling. Results The MD simulations of several GNNQQNY assemblies of different sizes indicate that the aggregates are endowed with remarkable stability. Our data also indicate that these assemblies can assume twisted beta-sheet structures; the occurrence of steric zipper interactions is therefore compatible with both flat and twisted beta-sheets. This result is in line with several literature reports suggesting twisted beta-sheet structures as a basic motif of amyloid fibrils. The evolution during the simulations of pairs of sheets separated by a wet interface additionally provided interesting information on the structure of larger aggregates (Esposito, L., Pedone, C. and Vitagliano, L. (2006) Proc. Natl. Acad. Sci. USA 103, 11533-8). Further studies carried out with REMD approaches provided interesting clues about the structural properties of the intermediate states along the fiber formation pathway. Our findings are in line with the experimentally observed concentration-dependent lag phase of GNNQQNY fiber growth. Finally, MD simulations were also used to investigate the structural compatibility with the steric zipper model of polyglutamine fragments, which are associated with several neurodegenerative diseases. These simulations covered a variety of models (parallel and antiparallel pairs of sheets) of different sizes and aggregation levels. Our simulations, carried out over a wide range of temperatures, clearly indicate that these assemblies are very stable. Glutamine side chains contribute strongly to the overall stability of the models by fitting perfectly within the zipper. In contrast to the GNNQQNY zipper motifs, hydrogen-bonding interactions provide a significant contribution to the overall stability of the polyglutamine models.
Simulations carried out at high temperatures (450-500 K) also show that the steric zipper motif is reversibly destroyed, providing a clear indication of the structural determinants that regulate its formation. On this basis, a mechanism explaining the nucleation-dependent process of polyglutamine fiber formation is proposed.
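For readers unfamiliar with REMD: replicas of the system are simulated in parallel at different temperatures, and configurations are periodically swapped between neighbouring temperatures. A sketch of the standard Metropolis swap criterion is given below; this is the textbook rule, not the GROMACS internals actually used by the authors.

    import math, random

    K_B = 0.0083144621    # Boltzmann constant in kJ/(mol K)

    def accept_swap(E_i, T_i, E_j, T_j):
        """Accept the exchange of configurations between replicas i and j
        with probability min(1, exp[(1/kT_i - 1/kT_j) * (E_i - E_j)])."""
        delta = (1.0 / (K_B * T_i) - 1.0 / (K_B * T_j)) * (E_i - E_j)
        return delta >= 0 or random.random() < math.exp(delta)

High-temperature replicas cross energy barriers that trap low-temperature ones, which is what makes REMD effective for sampling aggregation and unfolding events such as the zipper disruption described above.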

13:10  An exhaustive analysis of analogies in protein binding sites of known structure (20') (files Slides ppt  ) Gabriele Ausiello (Department of Biology, University of Tor Vergata, Roma)

Motivation The identification of the determinants of interaction specificity in the molecular recognition between proteins and small ligands is one of the most interesting unsolved problems in structural biology. The unique identification of a ligand by a protein is a complex phenomenon that depends on structural features (shape, flexibility, arrangement of chemical groups) of the protein surface and of the interacting molecule. Methods We performed an exhaustive analysis of all known protein-ligand interfaces to identify the structural determinants of molecular recognition, and the cases in which an arrangement of similar residues is used to recognize analogous chemical groups (pharmacophores) in the context of globally diverse ligands. To this aim, we compared all the interacting interfaces in pairs, excluding all cases in which the correspondences involve homologous proteins or similar ligands. The work is divided into the following steps: (i) use of a local structural comparison method to identify small regions of structural similarity (composed of 3-4 amino acids) among all protein-ligand interfaces included in the PDB; (ii) exclusion from the comparison results of cases of local similarity between homologous proteins (with the same fold) or between proteins interacting with similar ligands (with a high Tanimoto coefficient); (iii) analysis of the correlation between the type and position of the identified pharmacophores in the ligands, for every interface pair sharing a region of local similarity. Results We sought correlations between structural elements on the protein surface and the pharmacophores of their ligands. Some cases of particular interest are discussed as examples of the results obtained by this exhaustive analysis, and we present statistics of the pharmacophore correspondences on the ligands as a function of the distance from the residues with which they interact. The results show that cases exist in which similar pharmacophores, in the context of diverse ligands, are recognized by analogous arrangements of protein residues, and that these correspondences concern the part of the ligand that is effectively located near the residues in question.
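The Tanimoto filter in step (ii) reduces, for two ligands described by binary chemical fingerprints, to the ratio of shared to total "on" bits. A minimal sketch follows; fingerprint generation itself is assumed to come from elsewhere, and the cut-off shown is hypothetical.

    def tanimoto(fp_a, fp_b):
        """Tanimoto coefficient between two ligand fingerprints given as
        sets of 'on' bit positions: |A & B| / |A | B|."""
        union = fp_a | fp_b
        return len(fp_a & fp_b) / len(union) if union else 1.0

    # Interface pairs whose ligands exceed a chosen similarity cut-off,
    # e.g. tanimoto(fp_a, fp_b) > 0.6 (illustrative threshold), would be
    # excluded as involving "similar ligands".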

13:30  The effect of mutations on protein stability changes: a three class pair residue-discrimination study (20') Emidio Capriotti (Structural Genomic Unit, Department of Bioinformatics, Centro de Investigacion Principe Felipe (CIPF) Valencia, Spain)

Motivation A basic question in protein structural studies is to what extent mutations affect protein folding stability. This can be asked starting from the protein sequence and/or from its structure. In proteomics and genomics studies, predicting the protein stability free energy change (DDG) upon single point mutation may also help the annotation process. So far, methods addressing this problem are based on two different approaches: the development of various optimised energy functions (Gilis and Rooman, 1997; Guerois et al., 2002; Zhou and Zhou, 2002) for proteins of known structure, and implementations of machine learning approaches when sequences and/or structures are available (Capriotti et al., 2004; Capriotti et al., 2005a; Capriotti et al., 2005b; Cheng et al., 2006). On the other hand, experimental DDG values are affected by uncertainty, as measured by their standard deviations. Most DDG values are nearly zero (about 32% of the DDG data set ranges from -0.5 to 0.5 kcal/mol); furthermore, both the value and the sign of DDG may differ for the same mutation, blurring the relationship between mutations and expected DDG values. To overcome this problem, we describe a new predictor that discriminates between 3 mutation classes: destabilizing mutations, stabilizing mutations and neutral mutations, where neutral mutations are all substitutions whose effect is a protein stability free energy change (DDG) between -0.5 and 0.5 kcal/mol. Methods In this paper, a support vector machine (SVM), starting from the protein sequence or structure, discriminates between stabilizing, destabilizing and neutral mutations. The machine learning method presented here was trained and tested on experimental data selected from the new release of the ProTherm database (Kumar et al., 2006). We collected more than 1600 mutations and, according to the criterion of thermodynamic reversibility, doubled the thermodynamic data for each mutation, ending up with more than 3200 mutations with which to train our method under a cross-validation procedure. Following our previous works (Capriotti et al., 2005a; Capriotti et al., 2005b), two different SVM methods were developed, depending on the available information. For sequence-based predictions, an input vector of 42 elements is used: it accounts for the residue mutation (encoded in the first 20 elements), the sequence environment (encoded in the next 20 elements) and the experimental conditions (temperature and pH, in the last two elements). When the 3D structure of the mutated protein is known, the prediction can also be performed with the structure-based method, which uses an input vector of 43 elements. This vector is similar to that of the sequence-based predictor, except that the 20-element vector encoding the sequence environment is replaced by one encoding the structural environment, and one element accounting for the relative solvent-accessible area is added. Results We rank all possible substitutions according to a three-class classification system that, besides the prediction, also reports the rate of occurrence of the mutation in the database and the list of proteins in which the mutation effect has been experimentally detected. We show that the overall accuracy of our predictor is as high as 52% when starting from sequence information and 58% when the protein structure is available, with mean correlation coefficients of 0.30 and 0.39, respectively.
These values are about 20 percentage points higher than those of a random predictor. When only mutations with a large effect on protein stability are selected (|DDG| > 0.5 kcal/mol), the predictions of destabilizing and stabilizing mutations are well balanced and reach accuracies of 71% and 76%, with correlation coefficients of 0.43 and 0.52, for the sequence-based and structure-based predictions, respectively.
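As a hedged illustration of the sequence-based encoding, the sketch below builds a 42-element vector consistent with the description above; the -1/+1 coding of the mutation, the window-count encoding of the environment, and the window size are assumptions, not necessarily the authors' exact scheme.

    AA = "ACDEFGHIKLMNPQRSTVWY"    # fixed ordering of the 20 standard residues

    def encode(seq, pos, wt, mut, temperature, ph, window=19):
        """42-element sequence-based input: 20 elements for the mutation
        (assumed -1 for the wild-type residue, +1 for the mutant), 20
        counting residue types in a window around the mutated site, and
        2 for the experimental conditions (temperature and pH)."""
        x = [0.0] * 42
        x[AA.index(wt)] -= 1.0
        x[AA.index(mut)] += 1.0
        lo = max(0, pos - window // 2)
        hi = min(len(seq), pos + window // 2 + 1)
        for i in range(lo, hi):
            if i != pos and seq[i] in AA:
                x[20 + AA.index(seq[i])] += 1.0
        x[40], x[41] = temperature, ph
        return x

The structure-based variant would analogously replace the 20 environment counts with counts of spatial neighbours and append a 43rd element for relative solvent accessibility.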


13:50->14:00    Conclusions



