From Chasm Software Wiki
CHASM uses a machine learning method called Random Forest that learns to distinguish between driver and passenger somatic missense mutations, based on a training set of labeled positive (driver) and negative (passenger) examples. The positive class of driver mutations was curated from the COSMIC database and the negative class is composed of synthetic passenger mutations, which are generated according to background base substitution frequencies observed for the specific tumor type (passenger mutation rates). We annotate mutations with multiple features, including amino acid substitution properties, alignment-based estimates of conservation at the mutated position, predicted local structure at the mutated position, and annotations curated from the UniProtKB feature table. The user provides a list of mutations for prediction, as well as a table describing the passenger mutation spectrum for the tumor type in which the mutations were observed. For more information, please refer to these publications.
CHASM is packaged together with SNVBox, a database of predictive features precomputed for every codon in all expressed mRNA transcripts included in the RefSeq, CCDS and Ensembl databases. Given a transcript (or protein) accession number, amino acid position, reference and mutant amino acid residues in the encoded protein, and a list of features, SNVBox retrieves the features of interest for the mutation. SNVBox is a command-line utility and can be used independently of CHASM. Our software tools are designed to run on modern Linux platforms and to be incorporated into larger genomic annotation pipelines.
CHASM Mutation Scoring Process
The CHASM mutation scoring pipeline consists of the following steps.
- Somatic non-synonymous single nucleotide mutations to be classified are specified in the form of a transcript accession identifier (RefSeq, CCDS or Ensembl), codon position, reference amino acid residue and mutant amino acid residue. Protein accession identifiers are also accepted.
- Tumor-specific passenger mutation rates are specified.
- Passenger mutations are generated in silico by randomly mutating codons in RefSeq mRNA transcript sequences (included in Classifier Pack), according to tumor-specific passenger mutation rates.
- The classifier is trained using the Classifier Pack driver mutations and the synthetically generated passenger mutations.
- Mutations input by the user are scored by the classifier. A score is defined as the fraction of trees in the Random Forest that voted to classify the mutation as a passenger.
- A second set of passenger mutations is generated (as above) but is filtered to remove any mutations that occur in genes previously associated with cancer in the Cancer Gene Census, COSMIC Cancer Genes, and MSigDB Cancer Genesets (collection C4). The filtered passengers are scored by the classifier to generate an empirical null score distribution.
- P-values are calculated from the null score distribution, and the Benjamini-Hochberg procedure is applied to estimate false discovery rates.
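The last three steps can be sketched in Python as follows. This is a minimal illustration, not CHASM's own code: the function names (`forest_score`, `empirical_p_values`, `benjamini_hochberg`) are hypothetical, and the p-value definition assumes that lower scores are more driver-like, which follows from the score being the fraction of trees voting "passenger".

```python
from bisect import bisect_right


def forest_score(votes):
    """Fraction of trees that voted 'passenger' for one mutation."""
    return sum(1 for v in votes if v == "passenger") / len(votes)


def empirical_p_values(scores, null_scores):
    """Empirical p-value: fraction of null scores <= the observed score.

    Assumption: a low score (few 'passenger' votes) is the extreme
    direction, so the lower tail of the null distribution is used.
    """
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    return [bisect_right(null_sorted, s) / n for s in scores]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A mutation would then be reported with its score, its empirical p-value against the filtered-passenger null distribution, and its adjusted value as the estimated false discovery rate.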
CHASM Training Set
The CHASM training set is composed of driver missense mutations from the COSMIC database and synthetic passenger missense mutations simulated to reproduce the mutation spectrum observed in tumors of similar histological origin.
The driver class of the training set is constructed as follows:
- Genes in the COSMIC database are designated oncogenes or tumor suppressors based on patterns of nonsynonymous mutation observed across multiple tumor sequencing studies.
- A gene is considered an oncogene if the ratio of nonsynonymous mutations affecting the same amino acid position to all nonsynonymous mutations in the gene is greater than 0.15.
- A gene is designated a tumor suppressor if the ratio of inactivating mutations (mutations resulting in a premature stop codon or a shift in the codon reading frame) to all nonsynonymous mutations in the gene is greater than 0.15. Only genes harboring a minimum of 5 nonsynonymous mutations are considered.
- All missense mutations in genes that meet the criteria for oncogene or tumor suppressor in the COSMIC database are included in the driver mutation class.
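The gene designation rules above can be illustrated with a short Python sketch. It rests on several assumptions that the text does not settle: `classify_gene` and the mutation kind labels are hypothetical, "affecting the same amino acid position" is read as "falling on a position hit at least twice", and the minimum of 5 nonsynonymous mutations is applied to both designations.

```python
from collections import Counter

# Hypothetical labels for mutation kinds that inactivate the protein.
INACTIVATING = {"nonsense", "frameshift"}


def classify_gene(mutations, threshold=0.15, min_mutations=5):
    """Return the set of designations for one gene.

    mutations: list of (amino_acid_position, kind) tuples, one per
    nonsynonymous mutation observed in the gene.
    """
    total = len(mutations)
    labels = set()
    if total < min_mutations:
        return labels  # too few mutations to evaluate (assumed for both rules)

    # Assumed reading: count mutations at positions hit two or more times.
    per_position = Counter(pos for pos, _ in mutations)
    recurrent = sum(c for c in per_position.values() if c >= 2)
    if recurrent / total > threshold:
        labels.add("oncogene")

    inactivating = sum(1 for _, kind in mutations if kind in INACTIVATING)
    if inactivating / total > threshold:
        labels.add("tumor_suppressor")

    return labels
```

Under this sketch, every missense mutation in a gene with a non-empty label set would enter the driver class.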
The passenger class of the CHASM training set is synthetically generated to represent somatic missense mutations that might have occurred at random during tumorigenesis. Tumor sequencing studies have demonstrated DNA sequence-specific differences in background substitution rates (also referred to as mutation spectra) among tumors of different histological origins.
Passengers are simulated by sampling from the mutation spectrum observed in sequenced tumors. The mutation spectrum for a tumor sequencing study is first quantified in a passenger mutation rate table. Previous studies of nonsynonymous somatic mutations in cancer have implicated base substitutions in four dinucleotide contexts as having different mutation rates in some tumor types. In passenger mutation rate table construction, counts of base substitutions are collected and divided into eight groups: the four dinucleotide contexts (C in CpG, G in CpG, C in TpC, G in GpA) and all other events occurring at an A, C, G or T. Base substitutions occurring in genes that are frequently mutated in cancer are considered unlikely to have occurred at random and are not counted. The table is normalized to give the relative prevalence of these base substitutions in the tumor genome.
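The table construction can be sketched in Python as below. This is an illustrative simplification rather than the format CHASM actually reads: the function names and the input tuple layout are hypothetical, and the order in which overlapping contexts are tested (for example a C flanked by T on the 5' side and G on the 3' side) is an assumption.

```python
from collections import Counter

GROUPS = ["C in CpG", "G in CpG", "C in TpC", "G in GpA", "A", "C", "G", "T"]


def context_group(five_prime, ref, three_prime):
    """Assign a substituted base to one of the eight groups.

    five_prime / three_prime are the flanking bases on the same strand.
    CpG contexts are tested first (assumed precedence).
    """
    if ref == "C" and three_prime == "G":
        return "C in CpG"
    if ref == "G" and five_prime == "C":
        return "G in CpG"
    if ref == "C" and five_prime == "T":
        return "C in TpC"
    if ref == "G" and three_prime == "A":
        return "G in GpA"
    return ref  # any other event at A, C, G or T


def passenger_rate_table(substitutions, excluded_genes=frozenset()):
    """Normalized counts per group.

    substitutions: iterable of (gene, five_prime, ref, three_prime).
    excluded_genes: genes frequently mutated in cancer, which are skipped.
    """
    counts = Counter()
    for gene, five, ref, three in substitutions:
        if gene in excluded_genes:
            continue
        counts[context_group(five, ref, three)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no countable substitutions")
    return {group: counts[group] / total for group in GROUPS}
```

The resulting relative frequencies could then serve as sampling weights when synthetic passenger mutations are drawn.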