High-throughput data analysis

With the acquisition of a massively parallel sequencer (Illumina [SOLEXA] Genome Analyzer) and a microarray platform (Illumina BeadStation), along with other high-throughput experimental technologies, the Wistar Institute's Cancer Center is now well positioned to pursue new avenues of cancer research, particularly in understanding and modeling the genomic changes in cancer development and progression. These technologies generate large, genome-wide data sets that require equally sophisticated databases and analysis tools. The bioinformatics shared facility collaborates with the Center for Systems and Computational Biology to develop integrative analytical frameworks for the data sets generated by Wistar investigators. The bioinformatics shared facility will provide consulting and integrative data-mining support for:

  • Analyzing SOLEXA data including ChIP-seq, RNA-seq (digital gene expression), small RNA-seq, RNAs associated with RNA-binding proteins, SNP genotyping, genome re-sequencing, and de novo sequencing.
  • Analyzing microarray data including gene expression, ChIP-chip, methylation profiling, copy number variation (CNV), SNP genotyping, miRNA profiling, and protein/peptide array data.
  • Analyzing proteomics data (e.g. mass spectrometry-based spectra, LCMS, DIGE).
  • Analyzing molecular screening data by working with the molecular screening facility.

Consulting support in customized bioinformatics services

The bioinformatics shared facility works closely with Wistar Cancer Center investigators to assist them with the use of computational bioinformatics tools and methods for processing and interpreting genomic, molecular, and proteomic data. Facility staff also help investigators integrate data processing results into their reports and proposals. The facility uses publicly available tools, databases, and software developed in-house for its analyses, and offers consultation and training in areas of bioinformatics such as:

  • Sequence analysis, including assistance with the annotation of protein sequences and the prediction of genes and gene regulatory regions, such as promoters, transcription factor binding sites, and motifs.
  • Phylogenetic analysis.
  • Gene Ontology and Pathway analysis.
  • 3D molecular modeling, particularly homology modeling; analysis of protein structure properties such as electrostatic potential and surface area; protein-ligand docking; small-molecule screening; protein-protein interaction; and molecular dynamics simulation.

Statistics consultation and predictive model building

Typical tasks and applications, from raw data to functional analysis, include:

  • Advice on experimental design and sample size estimation
  • Point and Confidence Interval estimation
  • Comparative data analysis such as t-tests, ANOVA, SAM, and non-parametric tests
  • Association studies/Contingency table analysis (e.g. chi-square test)
  • High-dimensional data analysis such as repeated measurements, dimension reduction (e.g. SVD, PCA, MDS), and permutation tests
  • Survival analysis such as Kaplan-Meier or Cox Proportional Hazards models
  • Time series data analysis
  • Statistical modeling/Predictive modeling/Machine learning - Data mining in multivariate settings (supervised and unsupervised learning from data, Regression, Classification, Clustering, Generalized Linear Model)
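
As one illustration of the survival analysis listed above, the Kaplan-Meier (product-limit) estimate can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the facility's own software: the function name `kaplan_meier`, its arguments, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times  -- follow-up time for each subject (hypothetical input)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    # Number of observed events at each distinct time.
    deaths = Counter(t for t, e in zip(times, events) if e)
    # Number of subjects leaving the risk set (event or censored) at each time.
    leaving = Counter(times)

    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Subjects with an event or censoring at t leave the risk set after t.
        at_risk -= leaving[t]
    return curve


# Illustrative data only: five subjects, two of them censored.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print([(t, round(s, 3)) for t, s in curve])  # prints [(1, 0.8), (2, 0.6), (3, 0.3)]
```

The sketch counts subjects censored at a tied time as still at risk at that time, which is the usual convention. For real studies, a maintained statistics package that also reports confidence intervals would normally be preferable.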

Computational support for data management, high performance computing and custom programming

Data management

Cancer Center shared facilities and other research programs generate large volumes of high-dimensional data, such as microarray and sequencing data, tissue-related data, image data, and pharmacodynamics data. The bioinformatics facility uses a combination of locally installed and public databases, and provides consulting support to design and maintain databases for various datasets, to share data securely within or across Cancer Centers, and to store and back up data generated by users.

High-performance computing

The bioinformatics shared facility consists of an ever-evolving group of clusters collaboratively administered by the Center for Systems and Computational Biology. The clusters are used as a collective resource for serial and parallel applications that would be too computationally demanding for smaller research groups to implement. Although an individual researcher could purchase a small cluster with grant funds and hire a system administrator to set it up, it is much more efficient to add computing power to the existing infrastructure. Bioinformatics clusters are regularly used for large-scale problems such as:

  • Large-scale sequence alignment
  • Predictive model development
  • 3D molecular modeling
  • Mass spectrometry models
  • Phylogenetic inference

Custom programming

This support is provided for researchers who wish to use the software systems developed and deployed by the shared resource or to develop their own software or tools. The facility provides users with basic training to set up and use the existing software systems and to develop new tools and web applications. Consulting support is also provided to investigators who want to develop databases and workflows in their labs. Facility staff analyze the data-handling requirements of the investigator's lab and help choose the best software solution for its studies.

Additionally, the bioinformatics facility employs caBIG™-compatible products, the caBIG™ Life Science Distribution, and the caBIG™ Data Sharing and Security Framework, which are being developed by the NCI Center for Bioinformatics.

The specialized facility website provides a portal to centralized computation resources including data management and collaboration tools, sequence databases, homology algorithms, and other sequence manipulation tools.