As is traditional in CAMDA contests, neither we nor the producers of the data can provide advice on the datasets to individuals, as dealing with the files forms part of the analysis challenge. There is, however, an open forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate. For CAMDA 2015, we have compiled the following exciting contests:
Please note that CAMDA challenges are not limited to the questions proposed here. We look forward to a lively contest!
The FDA SEQC consortium has compiled a series of synthetic benchmarks and applied use-cases to assess the performance of modern gene transcript expression profiling methods, for the first time systematically assessing RNA-Seq in a wider context. SEQC is providing two benchmark challenges to CAMDA 2015.
References:
In this study, matched RNA-Seq and microarray gene expression profiles were collected from 105 rat livers to test their response to 27 chemicals representing 9 different modes of action (MOA). The NGS reads collected comprise 1.5 Terabases. A key question in the study was the predictability of the chemical mode of action. An initial platform comparison showed consensus as well as variation, and the effects of data processing have not yet been explored further.
Data Description
The data comprise a training set and a test set; in the figure below, the text on the left details the experimental design and the text on the right lists the key analyses conducted. Both microarray and RNA-seq were used to profile the transcriptional responses induced by treating rats with each chemical; each chemical is associated with a specific mode of action (MOA). For each MOA there were three representative chemicals and three biological replicates per chemical. Cross-platform concordance was evaluated at multiple levels: differentially expressed genes, mechanistic pathways and MOAs. To compare the predictive potential of RNA-seq and microarray as gene-expression biomarkers, four MOAs were analyzed by both platforms as a test set. Two of these MOAs (PPARA and CAR/PXR) were present in the training set, whereas the other two were not.
Questions of interest include, but are not limited to
Data download
For this challenge, raw and processed data are provided as separate packages. The data packages contain metadata files, and either processed or raw data folders. Participants who want to use this dataset should read and accept the data download agreement to get access.
In this study, we present a reference benchmark data set comprising 30 billion reads (3 Terabases) of RNA-Seq data from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Centrally prepared mRNA materials, mixed from reference RNA samples with built-in controls, were distributed to several independent sites for the construction of four replicate RNA-seq libraries per sample. Those four replicates, plus a fifth vendor-prepared library distributed to the three ‘official sites’ for each platform, were then sequenced. Platforms include Illumina’s HiSeq 2000 and Life Technologies’ SOLiD 5500 instruments.
Data Description
This dataset comprises a group of studies assessing different sequencing platforms in real-world use cases, including transcriptome annotation and other research applications, as well as clinical settings. In the main study, RNA samples A to D were analyzed. Samples C and D were created by mixing the well-characterized samples A and B in 3:1 and 1:3 ratios, respectively. This allows tests for titration consistency and for the correct recovery of the known mixing ratios. Synthetic RNAs from ERCC were added to samples A and B before mixing and were also sequenced separately to assess dynamic range (samples E and F). Samples were distributed to independent sites for RNA-seq library construction and profiling by Illumina's HiSeq 2000 (three official + three unofficial sites) and Life Technologies' SOLiD 5500 (three official sites + one unofficial site). Unless mentioned otherwise, data show results from the three official sites (italics). In addition to the four replicate libraries each for samples A to D per site, one vendor-prepared library A5…D5 was sequenced at the official sites for each platform, giving a total of 120 libraries. At each site, every library had a unique barcode sequence, and all libraries were pooled before sequencing, so each lane sequenced the same material, allowing a study of lane-specific effects. To support a later assessment of gene models, we sequenced samples A and B by Roche 454 (3×, no replicates, see Supplementary Notes to ref. 1, section 2.5).
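The mixing design described above lends itself to two simple checks, sketched here in Python. This is a minimal illustration under stated assumptions, not the consortium's method: it assumes that a gene's signal in a mixture is the plain weighted average of its signal in A and B (ignoring any difference in mRNA content between the two samples), and all function names and example values are hypothetical.

```python
def expected_mixture(a: float, b: float, frac_a: float) -> float:
    """Expected signal in a mixture containing a fraction `frac_a` of sample A.

    Assumption: signal mixes linearly, which ignores any difference in
    mRNA content between samples A and B.
    """
    return frac_a * a + (1.0 - frac_a) * b


def titration_consistent(a: float, b: float, c: float, d: float) -> bool:
    """True if the four measurements follow the titration order.

    C is 3:1 (A:B) and D is 1:3 (A:B), so the signal should change
    monotonically along A, C, D, B in either direction.
    """
    return (a >= c >= d >= b) or (a <= c <= d <= b)


def count_libraries(samples: int = 4, replicates: int = 4, vendor: int = 1,
                    official_sites: int = 3, platforms: int = 2) -> int:
    """Libraries at the official sites: samples A to D, four replicates plus
    one vendor-prepared library each, at three sites for each of two platforms."""
    return samples * (replicates + vendor) * official_sites * platforms


if __name__ == "__main__":
    a, b = 100.0, 20.0                       # made-up expression values
    c = expected_mixture(a, b, frac_a=0.75)  # sample C, 3:1 mix
    d = expected_mixture(a, b, frac_a=0.25)  # sample D, 1:3 mix
    print(c, d, titration_consistent(a, b, c, d))
    print(count_libraries())
```

With the defaults, `count_libraries()` reproduces the total of 120 libraries stated in the text. In practice, measured values for C and D would be compared against the expected values within some tolerance rather than for exact equality.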
Questions of interest include, but are not limited to
Data download
For this challenge, raw and processed data are provided as separate packages. The data packages contain metadata files, and either processed or raw data folders. Participants who want to use this dataset should read and accept the data download agreement to get access.
Drawing on the comprehensive description of genomic, transcriptomic and epigenomic changes provided by ICGC, the main goal of this challenge is to gain novel biological insights into the less well-studied cancers selected here. However, we are not merely looking for 'old paradigm' cancer subtype classification!
Data Description and Download
For this challenge, only processed data are provided. These cancers all have matched gene expression, microRNA expression, protein expression profiles, somatic CNV, and methylation.
The above links point to release 17 and were meant as the official data set for the contest. However, we have received reports of the ICGC servers failing to deliver particular files without corruption. Although we have no control over the ICGC servers, we have been able to download the corresponding files from release 18 instead. Base links for release 18:
Challenges