CHASM Tutorial

From Chasm Software Wiki

Revision as of 20:58, 19 September 2013 by WikiSysop (Talk | contribs)


Training the Classifier

  1. Cancer-specific passenger mutation rate table.
    Passenger mutation rates are estimated based on mutation frequencies in eight di-nucleotide contexts. The passenger mutation rate table is a tab delimited text file in the following format:
            C*pG	CpG*	TpC*	G*pA	A	C	G	T
    ->A	0.009	0.227	0.014	0.031	0	0.043	0.077	0.028
    ->C	0	0.005	0	0.015	0.012	0	0.021	0.028
    ->G	0.005	0	0.009	0	0.060	0.029	0	0.015
    ->T	0.172	0.003	0.021	0.023	0.025	0.074	0.057	0

    The asterisk indicates the mutated base in the di-nucleotide contexts.
    The di-nucleotide contexts are non-overlapping, which means that if a C to G mutation is observed in a C*pG context, the same mutation won't be counted for the TpC* or C contexts.

    We recommend building a passenger rate table based on your own mutation data. A suggested protocol is provided in the Passenger Mutation Rate Tables section of this page. We also provide five passenger mutation tables in the ClassifierPack.
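As a quick sanity check before training, a table in this format can be read and inspected with a few lines of Python. This is only a sketch and not part of CHASM: the function names `read_rate_table` and `total_rate` are hypothetical, and the parser assumes the layout shown above (a header row of eight contexts followed by four rows labelled `->A`, `->C`, `->G`, `->T`).

```python
import io

EXAMPLE_TABLE = (
    "\tC*pG\tCpG*\tTpC*\tG*pA\tA\tC\tG\tT\n"
    "->A\t0.009\t0.227\t0.014\t0.031\t0\t0.043\t0.077\t0.028\n"
    "->C\t0\t0.005\t0\t0.015\t0.012\t0\t0.021\t0.028\n"
    "->G\t0.005\t0\t0.009\t0\t0.060\t0.029\t0\t0.015\n"
    "->T\t0.172\t0.003\t0.021\t0.023\t0.025\t0.074\t0.057\t0\n"
)


def read_rate_table(handle):
    """Return {to_base: {context: rate}} from a passenger mutation rate table."""
    lines = [line for line in handle if line.strip()]
    # The header holds only the context names; its row-label cell is blank.
    contexts = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        to_base = fields[0].replace("->", "")  # "->A" becomes "A"
        rates = [float(value) for value in fields[1:]]
        if len(rates) != len(contexts):
            raise ValueError(f"expected {len(contexts)} rates for ->{to_base}")
        table[to_base] = dict(zip(contexts, rates))
    return table


def total_rate(table):
    """Sum every entry; the rates describe one distribution, so expect about 1."""
    return sum(rate for row in table.values() for rate in row.values())


table = read_rate_table(io.StringIO(EXAMPLE_TABLE))
print(table["T"]["C*pG"], round(total_rate(table), 3))
```

For the example table the entries sum to 1.003, which is within rounding of 1. A total far from 1 would suggest a mis-parsed or truncated file.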

  2. Building your Classifier

    Please run BuildClassifier from the directory where the CHASM scripts are located.
    > BuildClassifier -m MutationTable -o ClassifierName -f FeaturesList -i MutationsList -s RandomSeed

    where MutationTable is the location of the passenger mutation rate table generated in step (1), ClassifierName is the name of the classifier, and MutationsList is the set of mutations to be classified. All generated classifiers are placed in the (installation directory)/BuiltClassifiers folder.

    NOTE: If the list of mutations specified with -i uses genomic coordinates, the -g flag must be used:

    > BuildClassifier -m MutationTable -o ClassifierName -f FeaturesList -i MutationsList -s RandomSeed -g

    Please note that -f FeaturesList and -s RandomSeed are optional. If FeaturesList is not specified, the default features list will be used. RandomSeed is the random seed used for training the random forest classifier.
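If you drive BuildClassifier from a script, the argument list can be assembled as sketched here. The flags (-m, -o, -i, -f, -s, -g) are the ones described above; the helper name `build_classifier_command`, the `./BuildClassifier` path, and the file names are assumptions for illustration only.

```python
def build_classifier_command(mutation_table, classifier_name, mutations_list,
                             features_list=None, random_seed=None,
                             genomic=False, executable="./BuildClassifier"):
    """Assemble a BuildClassifier argument list without running it."""
    command = [executable,
               "-m", mutation_table,
               "-o", classifier_name,
               "-i", mutations_list]
    if features_list is not None:   # optional: default features list otherwise
        command += ["-f", features_list]
    if random_seed is not None:     # optional: seed for random forest training
        command += ["-s", str(random_seed)]
    if genomic:                     # required when -i uses genomic coordinates
        command.append("-g")
    return command


# Hypothetical file names; run the result from the CHASM script directory,
# for example with subprocess.run(command, cwd=chasm_dir, check=True).
command = build_classifier_command("OV.context", "my_classifier",
                                   "my_mutations.txt", random_seed=42,
                                   genomic=True)
print(" ".join(command))
```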

Running the Classifier

  1. Construct the input file
    CHASM accepts mutations defined as amino acid substitutions at a specific codon position in transcript identifiers from NCBI RefSeq, CCDS, or Ensembl. Both mRNA transcript and protein accession identifiers are accepted. Prepare a tab delimited text file of your mutations in the following format:
    NP_001135977	R641W
    NP_835455	R151C
    NP_055645	L590V
    NP_689808	D28H
    NP_005472	S372R
    NP_112493	S35R
    NP_859061	A118V
    NP_892018	R153C
    NP_001074003	R264Q
    NP_001073893	R1272C
    NP_808882	R30H

    Optionally, a column can be added before the transcript column with an identifier that might be useful for tracking mutations:

    1      NP_001135977	R641W
    2      NP_835455	R151C
    3      NP_055645	L590V
    4      NP_689808	D28H

    NOTE: The ".tmps" extension used by CHASM stands for "transcript-mutation pairs".
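Before running CHASM it can help to check that every line of the input file has the expected shape. The sketch below is not part of CHASM; `parse_tmps_line` is a hypothetical helper, and the pattern it applies (one uppercase letter, a codon number, one uppercase letter) is an assumption based on the examples above rather than the tool's own validation.

```python
import re

# Reference residue, codon position, substituted residue, e.g. "R641W".
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_tmps_line(line):
    """Split one tab delimited line into (identifier, transcript, ref, pos, alt)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:        # optional identifier column is present
        identifier, transcript, change = fields
    elif len(fields) == 2:
        identifier = None
        transcript, change = fields
    else:
        raise ValueError(f"expected 2 or 3 columns, got {len(fields)}")
    match = SUBSTITUTION.match(change.strip())
    if match is None:
        raise ValueError(f"not an amino acid substitution: {change!r}")
    ref_aa, position, alt_aa = match.groups()
    return identifier, transcript.strip(), ref_aa, int(position), alt_aa


print(parse_tmps_line("NP_001135977\tR641W"))
print(parse_tmps_line("2\tNP_835455\tR151C"))
```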

    Alternatively, CHASM supports genomic coordinates in build GRCh37/hg19 of the human genome. This file should be tab delimited with 6 or 7 columns (the first column with identifiers is optional - CHASM will use row numbers in the input file if no identifier is specified). The columns are described in the note that follows this example:

    1     chr22   25115448        25115449        +       A       G
    2     chr22   25119119        25119120        +       A       G
    3     chr22   25124310        25124311        -       C       T
    4     chr22   25144911        25144912        +       C       T
    5     chr22   25145752        25145753        -       C       T
    6     chr22   25147422        25147423        +       C       T
    7     chr22   25150137        25150138        +       A       G
    8     chr22   25152617        25152618        +       C       T
    9     chr22   25158437        25158438        +       C       T
    10    chr22   24121377        24121378        +       C       T

    NOTE: Genomic coordinate file columns are as follows:

    • Mutation identifier (optional - CHASM will assign row number if this column is left off)
    • Chromosome
    • 0-based coordinate of base substitution
    • 1-based coordinate of base substitution
    • Strand on which reference and alternative bases are reported
    • Reference base at this position
    • Alternate base at this position
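Many variant callers report a single 1-based position, so each row needs both the 0-based and the 1-based coordinate filled in. A minimal sketch, assuming the 0-based value is simply the 1-based position minus one (as in the example rows above); `genomic_row` is a hypothetical helper, not a CHASM function.

```python
def genomic_row(chrom, position, strand, ref, alt, identifier=None):
    """Format one genomic-coordinate input row.

    position is the 1-based coordinate of the substituted base.
    """
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    fields = [chrom, str(position - 1), str(position), strand, ref, alt]
    if identifier is not None:   # optional first column
        fields.insert(0, str(identifier))
    return "\t".join(fields)


print(genomic_row("chr22", 25115449, "+", "A", "G", identifier=1))
```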

  2. Run the classifier
    Run RunChasm from the directory where the CHASM scripts are located.
    For inputs defined as transcripts and amino acid substitutions:
    RunChasm classifier_name mutation_list

    For inputs defined as genomic coordinates:

    RunChasm classifier_name mutation_list -g

    where classifier_name is the name of the classifier you built earlier, and mutation_list is the location of the list of mutations generated in step (1).

  3. The following files will be generated in the directory containing your mutation list file:
    • <input-file>.arff
      • The ARFF (Attribute-Relation File Format) file generated by SNVBox
    @relation headerfile
    @attribute UID string
    @attribute ID string
    @attribute ExonConservation numeric
    @attribute ExonSnpDensity numeric
    @attribute ExonHapMapSnpDensity numeric
    @attribute HMMRelEntropy numeric
    @attribute HMMEntropy numeric
    @attribute HMMPHC numeric
    @attribute MGARelEntropy numeric
    @attribute MGAEntropy numeric
    @attribute MGAPHC numeric
    1 NP_001135977.1_R641W 0.693567650685 0.00588235294118 0.0 0.191425 ...
    2 NP_835455.1_R151C 0.763530500574 0.00496838301716 0.000451671183379 ...
    3 NP_055645.1_L590V 0.520834494575 0.0448717948718 0.00961538461538 ...
    4 NP_689808.2_D28H 0.540801628658 0.0838926174497 0.00335570469799 ...
    • <input-file>.output
      • The results are given in a tab delimited text file:
        • 1st Column is mutation identifier (either user assigned ID or row number of mutation in the input file)
        • 2nd Column is Transcript + Mutation
        • 3rd Column is the raw CHASM score
        • 4th Column is the p-value
        • 5th Column is the Benjamini-Hochberg False Discovery Rate [not computed if number of mutations to be classified is less than a user-configurable value (Default=10)].
    MutationID     Mutation               CHASM   PValue  BHFDR
    1              NP_001135977.1_R641W   0.866    0.726   1.00
    2              NP_835455.1_R151C      0.872    0.749   1.00
    3              NP_055645.1_L590V      0.888    0.808   1.00
    4              NP_689808.2_D28H       0.886    0.801   1.00
    5              NP_005472.2_S372R      0.902    0.857   1.00
    6              NP_112493.2_S35R       0.832    0.608   1.00
    7              NP_859061.3_A118V      0.772    0.409   1.00
    8              NP_892018.1_R153C      0.932    0.924   1.00
    9              NP_001074003.1_R264Q   0.884    0.792   1.00
    10             NP_001073893.1_R1272C  0.884    0.792   1.00
    11             NP_808882.1_R30H       0.976    0.985   1.00

    where <input-file> is the name of your mutation list file.
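The .output file can be loaded for downstream filtering with the standard csv module. The sketch below assumes the header names shown in the example (MutationID, Mutation, CHASM, PValue, BHFDR) and treats a missing or non-numeric BHFDR value as "not computed"; how CHASM actually fills that column for small inputs is not stated here, so that handling is an assumption. `read_chasm_output` is a hypothetical helper.

```python
import csv
import io

SAMPLE_OUTPUT = (
    "MutationID\tMutation\tCHASM\tPValue\tBHFDR\n"
    "1\tNP_001135977.1_R641W\t0.866\t0.726\t1.00\n"
    "7\tNP_859061.3_A118V\t0.772\t0.409\t1.00\n"
    "11\tNP_808882.1_R30H\t0.976\t0.985\t1.00\n"
)


def _to_float(text):
    """Return a float, or None when the value is absent or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def read_chasm_output(handle):
    """Read a tab delimited CHASM .output file into a list of dicts."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "id": record["MutationID"],
            "mutation": record["Mutation"],
            "score": float(record["CHASM"]),
            "p_value": float(record["PValue"]),
            "fdr": _to_float(record.get("BHFDR")),
        })
    return rows


rows = read_chasm_output(io.StringIO(SAMPLE_OUTPUT))
for row in sorted(rows, key=lambda r: r["p_value"]):
    print(row["mutation"], row["score"], row["p_value"], row["fdr"])
```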

Error Messages

Please note that you may encounter SNVBox error messages during feature retrieval while running CHASM.

Passenger Mutation Rate Tables

What is a passenger mutation rate table?

A passenger mutation rate table contains an approximation of the tumor-specific background base substitution rates for base substitutions leading to somatic nonsynonymous passenger mutations.

Available passenger mutation rate tables:

   Currently, CHASM contains passenger mutation rate tables for a number of tumor types including:
   * Acute Myeloid Leukemia
   * Bladder Urothelial Carcinoma
   * Brain Lower Grade Glioma
   * Breast invasive carcinoma
   * Cervical squamous cell carcinoma and endocervical adenocarcinoma
   * Chronic Lymphocytic Leukemia
   * Colon Adenocarcinoma
   * Gastric (intestinal and diffuse)
   * Glioblastoma Multiforme
   * Head and Neck Squamous Carcinoma
   * Hepatocellular carcinoma
   * Kidney Chromophobe
   * Kidney Renal Clear Cell Carcinoma
   * Kidney Renal Papillary Cell Carcinoma
   * Lung Adenocarcinoma
   * Lung Squamous Cell Carcinoma
   * Medulloblastoma
   * Melanoma
   * Ovarian Serous Cystadenocarcinoma
   * Pancreatic Cancer
   * Prostate Adenocarcinoma
   * Prostate Cancer
   * Rectum Adenocarcinoma
   * Skin Cutaneous Melanoma
   * Stomach Adenocarcinoma
   * Thyroid Carcinoma
   * Uterine Corpus Endometrioid Carcinoma

Source of data used to construct passenger frequency tables

Filename	Cancer type	Datasource	Date
BLCA.context	Bladder Urothelial Carcinoma	TCGA	Jun 2013
BRCA.context	Breast invasive carcinoma	TCGA	Jun 2012
CESC.context	Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma	TCGA	Jun 2013
CLL.context	Chronic Lymphocytic Leukemia	ICGC	Mar 2013
COAD.context	Colon Adenocarcinoma	TCGA	Jun 2013
GBM.context	Glioblastoma Multiforme	TCGA	Jun 2013
GID.context	Gastric (Intestinal and Diffuse)	ICGC	Mar 2013
HCCA.context	Hepatocellular Carcinoma (Secondary to Alcohol and Adiposity)	ICGC	Mar 2013
HCCV.context	Hepatocellular Carcinoma (Viral)	ICGC	Mar 2013
HNSC.context	Head and Neck Squamous Carcinoma	TCGA	Jun 2013
KICH.context	Kidney Chromophobe	TCGA	Jun 2013
KIRC.context	Kidney Renal Clear Cell Carcinoma	TCGA	Jun 2013
KIRP.context	Kidney Renal Papillary Cell Carcinoma	TCGA	Jun 2013
LAML.context	Acute Myeloid Leukemia	TCGA	Jun 2013
LGG.context	Brain Lower Grade Glioma	TCGA	Jun 2013
LUAD.context	Lung Adenocarcinoma	TCGA	Jun 2013
LUSC.context	Lung Squamous Cell Carcinoma	TCGA	Jun 2013
MB.context	Medulloblastoma (mixed with data from tumors with similar mutation spectra)	PMID:21163964	Dec 2010
ML.context	Melanoma	Yardena Samuels Lab	Dec 2011
OV.context	Ovarian Serous Cystadenocarcinoma	TCGA	Jun 2013
PC.context	Prostate Cancer	ICGC	Mar 2013
PNCC.context	Pancreatic Cancer (OICR-CA and QCMG-AU)	ICGC	Mar 2013
PRAD.context	Prostate Adenocarcinoma	TCGA	Jun 2013
READ.context	Rectum Adenocarcinoma	TCGA	Jun 2013
SKCM.context	Skin Cutaneous Melanoma	TCGA	Jun 2013
STAD.context	Stomach Adenocarcinoma	TCGA	Jun 2013
THCA.context	Thyroid Carcinoma	TCGA	Jun 2013
UCEC.context	Uterine Corpus Endometrioid Carcinoma	TCGA	Jun 2013

NOTES: There may be problems with the Gastric and Hepatocellular tables. For Gastric, mutations were not labeled as somatic vs. germline. For Hepatocellular, the data were sparse: only ~200 somatic mutations were available. We will provide updated versions of these tables in a future release.

What about cancers other than the types listed?

1) We suggest using the ovarian table as a default, since ovarian cancer shows no clear signatures of exogenous mutagens.
2) You may also consider constructing your own passenger mutation rate table.

There are many ways to construct your own custom table. Here, we suggest a protocol that has worked well for us. It makes a number of simplifying assumptions that you may not agree with. In that case, you may wish to modify the protocol. If you are happy with the results, please share your new protocol with other CHASM users via the mailing list.

Our protocol to build a context table:

Somatic missense mutations observed in tumor sequencing data are composed of a mixture of driver and passenger mutations. In order to limit our analysis to passenger mutation rates, we first subtract mutations observed in genes frequently mutated in cancer. The remaining mutations are conservatively assumed to be passengers, and the underlying base substitutions constitute the background mutation profile for the tumor type. (Simplifying Assumption #1)

To convert this profile to a table, base substitutions are grouped into 8 DNA sequence contexts, selected based on observed significant differences in context mutation rates in previous studies (Sjoblom Supplementary materials). The context table is 8x4 with the 8 selected di-nucleotide contexts as column headers and the 4 nucleotides as the rows. The 8 di-nucleotide contexts are as follows (Simplifying Assumption #2):

1) C*pG: C in a CpG di-nucleotide is mutated
2) CpG*: G in a CpG di-nucleotide is mutated
3) TpC*: C in a TpC di-nucleotide is mutated
4) G*pA: G in a GpA di-nucleotide is mutated
5) A: A is mutated
6) C: C other than C in a CpG or in a TpC is mutated
7) G: G other than G in a CpG or in a GpA is mutated
8) T: T is mutated

These are in order of priority, so a single base substitution in the DNA sequence TCG where the C is mutated would fall under context 1 (C*pG) rather than context 3 (TpC*) or context 6 (C). The p represents the phosphate group between the two nucleotides.
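The priority rule can be written as a short chain of checks in the order the contexts are listed. This is a sketch of the rule as described in this section, not code taken from CHASM; `classify_context` is a hypothetical name, and it expects the mutated base together with its 5' and 3' neighbours on the transcript strand.

```python
VALID_BASES = {"A", "C", "G", "T"}


def classify_context(prev_base, base, next_base):
    """Assign a mutated base to one of the eight contexts, highest priority first."""
    if base not in VALID_BASES:
        raise ValueError(f"unexpected base: {base!r}")
    if base == "C" and next_base == "G":
        return "C*pG"   # 1) C in a CpG is mutated
    if base == "G" and prev_base == "C":
        return "CpG*"   # 2) G in a CpG is mutated
    if base == "C" and prev_base == "T":
        return "TpC*"   # 3) C in a TpC is mutated
    if base == "G" and next_base == "A":
        return "G*pA"   # 4) G in a GpA is mutated
    return base         # 5-8) remaining A, C, G or T


# The C in TCG matches both C*pG and TpC*; the higher-priority context wins.
print(classify_context("T", "C", "G"))
```

The same tie-break applies to the G in CGA, which matches both CpG* and G*pA and is assigned to CpG*.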

In each context, the case where the base is not mutated is ignored. The remaining 8x3 possibilities form a distribution over all base substitutions that could occur in the 8 contexts. Each cell in the table then contains the approximate relative rate at which a specific base substitution occurs within a context, estimated from whole exome sequencing data for a specific tumor type.

Example Context Table:

Base To	C*pG	CpG*	TpC*	G*pA	A	C	G	T
A	0.009	0.227	0.014	0.031	-	0.043	0.077	0.028
C	-	0.005	-	0.015	0.012	-	0.021	0.028
G	0.005	-	0.009	-	0.060	0.029	-	0.015
T	0.172	0.003	0.021	0.023	0.025	0.074	0.057	-

Tumor type-specific context tables can be generated from a MAF or VCF file that contains somatic variant calls from a population of tumor samples for a particular cancer being studied. The tables included in the CHASM package were computed with data from studies that included at least ten samples.


  1. Determine the number of non-silent somatic single base variants.
  2. Drop any mutations occurring in genes mutated in the cancer of interest at a high frequency (gene "mountains"). As an expert in the cancer type you have sequenced, you may already know which genes are mountains. Alternatively, you can use the COSMIC database to select "mountains". For the CHASM publications, the following genes were considered mountains: TP53, KRAS, SMAD4, CDKN2A, NF1, RB1, PIK3CA, PTEN.
  3. Divide the non-silent somatic single base variants not in mountains into the 8x3 context + base substitution categories.
  4. Divide the value for each category by the total number of mutations remaining after Step 2.
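Steps 2 to 4 amount to filtering, counting, and normalising. The sketch below assumes each non-silent somatic variant has already been reduced to a (gene, context, substituted base) triple; the function name `build_context_table` and the toy gene names are hypothetical, and the mountain list is the one quoted in step 2.

```python
from collections import Counter

CONTEXTS = ["C*pG", "CpG*", "TpC*", "G*pA", "A", "C", "G", "T"]
BASES = ["A", "C", "G", "T"]
MOUNTAINS = {"TP53", "KRAS", "SMAD4", "CDKN2A", "NF1", "RB1", "PIK3CA", "PTEN"}


def build_context_table(mutations, mountains=MOUNTAINS):
    """Return {to_base: {context: fraction}} for mutations outside mountain genes.

    mutations is an iterable of (gene, context, to_base) triples.
    """
    kept = [(context, to_base)
            for gene, context, to_base in mutations
            if gene not in mountains]           # step 2: drop mountains
    if not kept:
        raise ValueError("no mutations left after removing mountains")
    counts = Counter(kept)                      # step 3: count per category
    total = len(kept)
    return {to_base: {context: counts[(context, to_base)] / total  # step 4
                      for context in CONTEXTS}
            for to_base in BASES}


# Toy input with made-up gene names, for illustration only.
toy = [("TP53", "C*pG", "T"), ("GENE1", "C*pG", "T"), ("GENE2", "A", "G"),
       ("GENE3", "C*pG", "T"), ("GENE4", "T", "C")]
table = build_context_table(toy)
print(table["T"]["C*pG"], table["G"]["A"], table["C"]["T"])
```

Cells that cannot occur (for example C to C in the C*pG column) simply stay at zero, matching the zeros in the example table.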

NOTE: The CHASM software generates mutations at random in mRNA transcript sequence (not in genomic DNA sequence). Thus when constructing passenger rate tables from DNA sequencing data, if a mutation occurs in a transcript on the negative DNA strand, the mutation should be included in the passenger rate table as it would appear on the negative strand.
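In practice this means reverse-complementing the sequence context and complementing the reference and alternate bases for variants in negative-strand transcripts before assigning a context. A minimal sketch, assuming variants are reported on the genomic plus strand with a three-base context centred on the mutated base; `to_transcript_strand` is a hypothetical helper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def to_transcript_strand(context, ref, alt, transcript_strand):
    """Express a plus-strand variant as it appears on the transcript strand.

    context is the plus-strand sequence centred on the mutated base.
    """
    if transcript_strand == "+":
        return context, ref, alt
    if transcript_strand == "-":
        return reverse_complement(context), COMPLEMENT[ref], COMPLEMENT[alt]
    raise ValueError(f"strand must be '+' or '-', got {transcript_strand!r}")


# A plus-strand G>A in CGA is a C>T in TCG for a negative-strand transcript.
print(to_transcript_strand("CGA", "G", "A", "-"))
```

In this example the variant would be counted as a C*pG to T substitution rather than a CpG* to A substitution.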

How CHASM uses passenger mutation rate tables:

CHASM uses passenger mutation rate tables to create the synthetic class of passenger missense mutations used in classifier training and in the statistical analysis of CHASM scores. We assume that tumor-specific differences in the relative background context + base substitution rates are representative of the mutation process of the tumor type of interest. We expect that missense mutations observed in this tumor type, whether driver or passenger mutations, were generated under the constraints of this same underlying process, but that driver mutations are rare events that are selected for, while passenger mutations occur more frequently and at random.

Configuration Options

  • Users can customise certain parameters used by CHASM by editing the configuration file chasm_classifiers.conf, located in the CHASM directory.
  • Below are the default values:
; CHASM_Classifiers Configuration file
; Contains references to all software and directories used to build the classifiers

; PARF path

; CHASM Classifier Pack location

; Cancer-related gene list (used for nullset generation)

; Minimum number of genes present before passenger generation
; is restricted to the genes in which user variants occurred

; Refseq blob files

; Ensembl to hugo

; CCDS to Refseq

; CHASM Default Classifier Output location

; Classifier Specifications

; Number of trees to build
; Number of random draws used for decision tree construction
; default is sqrt(number of features used)
; Must be <= total number of features.
; Large values may cause classifier overfitting.
  • Description of each parameter:
    • parfpath - location of the parf executable
    • chasmclassifierpack - location of the Classifier Pack
    • blacklist - location of the file containing the list of cancer-related genes used for filtering generated passengers
    • whitelistcutoff - minimum number of genes represented in the user's input file in order for passenger generation to use those genes rather than the default list supplied by CHASM
    • defaultwhitelist - the whitelist set of genes in which to generate passenger mutations if fewer than whitelistcutoff genes are represented in the user's mutation data
    • multipletestingminimum - minimum number of mutations before FDR calculations will be performed
    • transcriptlibrary - location of the file containing Refseq transcript sequences used for passenger generation
    • transcriptlibrarymetaData - location of the file containing additional information for each Refseq transcript
    • transcriptlibraryinfo - location of the file containing mappings between Refseq protein, transcript and Hugo ID
    • Ensembl2Hugo - location of the file containing mappings between Ensembl and Hugo
    • CCDS2Refseq - location of the file containing mappings between CCDS and Refseq transcripts
    • chasmdefaultoutputdirectory - default output directory where new classifiers are placed
    • passengernum - number of passengers generated to train each classifier
    • nullnum - number of passengers generated for each null set
    • treenum - number of trees used in the random forest classifier
    • mtry - number of random draws to perform during random forest construction. The default, squareroot(number of features in the .arff file), can be overridden here.
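To get a feel for the default mtry value, you can count the features in an .arff header and take the square root. The sketch below rests on two assumptions that are not stated on this page: that only the numeric attributes count as features (UID and ID are treated as labels), and that the square root is rounded down. Both function names are hypothetical.

```python
import math

ARFF_HEADER = """@relation headerfile
@attribute UID string
@attribute ID string
@attribute ExonConservation numeric
@attribute ExonSnpDensity numeric
@attribute ExonHapMapSnpDensity numeric
@attribute HMMRelEntropy numeric
@attribute HMMEntropy numeric
@attribute HMMPHC numeric
@attribute MGARelEntropy numeric
@attribute MGAEntropy numeric
@attribute MGAPHC numeric
"""


def count_numeric_attributes(arff_text):
    """Count '@attribute <name> numeric' lines in an ARFF header."""
    count = 0
    for line in arff_text.splitlines():
        parts = line.split()
        if (len(parts) == 3 and parts[0].lower() == "@attribute"
                and parts[2].lower() == "numeric"):
            count += 1
    return count


def default_mtry(num_features):
    """Square root of the feature count, rounded down (assumed), at least 1."""
    return max(1, math.isqrt(num_features))


features = count_numeric_attributes(ARFF_HEADER)
print(features, default_mtry(features))
```

Whatever value is configured, the comment in the configuration file requires that mtry not exceed the total number of features.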