From Chasm Software Wiki
Training the Classifier
- Cancer-specific passenger mutation rate table.
Passenger mutation rates are estimated based on mutation frequencies in eight di-nucleotide contexts. The passenger mutation rate table is a tab delimited text file in the following format:
        C*pG    CpG*    TpC*    G*pA    A       C       G       T
->A     0.009   0.227   0.014   0.031   0       0.043   0.077   0.028
->C     0       0.005   0       0.015   0.012   0       0.021   0.028
->G     0.005   0       0.009   0       0.060   0.029   0       0.015
->T     0.172   0.003   0.021   0.023   0.025   0.074   0.057   0
The asterisk indicates the mutated base in the di-nucleotide contexts.
The di-nucleotide contexts are non-overlapping, which means that if a C to G mutation is observed in a C*pG context, the same mutation won't be counted for the TpC* or C contexts.
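As an illustration of the table layout, here is a minimal Python sketch that reads such a table into a nested dictionary. The function name `parse_rate_table` is hypothetical and not part of CHASM; the sketch assumes the header row holds the eight context labels and each following row starts with a `->X` label.

```python
import io

def parse_rate_table(handle):
    """Return {context: {to_base: rate}} from a passenger mutation rate table."""
    # split() with no argument handles tabs as well as runs of spaces
    rows = [line.split() for line in handle if line.strip()]
    contexts = rows[0]  # header row: the eight context labels
    table = {context: {} for context in contexts}
    for fields in rows[1:]:
        if len(fields) != len(contexts) + 1:
            raise ValueError(f"expected {len(contexts)} rates in row {fields[0]!r}")
        to_base = fields[0].replace("->", "")  # "->A" becomes "A"
        for context, value in zip(contexts, fields[1:]):
            table[context][to_base] = float(value)
    return table

# The example table from this page, written with tab delimiters
EXAMPLE = "\n".join([
    "C*pG\tCpG*\tTpC*\tG*pA\tA\tC\tG\tT",
    "->A\t0.009\t0.227\t0.014\t0.031\t0\t0.043\t0.077\t0.028",
    "->C\t0\t0.005\t0\t0.015\t0.012\t0\t0.021\t0.028",
    "->G\t0.005\t0\t0.009\t0\t0.060\t0.029\t0\t0.015",
    "->T\t0.172\t0.003\t0.021\t0.023\t0.025\t0.074\t0.057\t0",
])

table = parse_rate_table(io.StringIO(EXAMPLE))
total = sum(rate for rates in table.values() for rate in rates.values())
print(table["C*pG"]["T"], round(total, 3))
```

Summing all cells is a quick sanity check: the rates form a distribution over base substitutions, so the total should be close to 1 (up to rounding of the printed values).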
- Building your Classifier
Please run BuildClassifier from the directory where the CHASM scripts are located.
> BuildClassifier -m MutationTable -o ClassifierName -f FeaturesList -i MutationsList -s RandomSeed
where MutationTable is the location of the passenger mutation rate table generated in step (1), ClassifierName is the name of the classifier, and MutationsList is the set of mutations to be classified. All generated classifiers are placed in the (installation directory)/BuiltClassifiers folder.
NOTE: If the list of mutations specified with -i uses genomic coordinates, the -g flag must be used:
> BuildClassifier -m MutationTable -o ClassifierName -f FeaturesList -i MutationsList -s RandomSeed -g
Please note that -f FeaturesList and -s RandomSeed are optional. If FeaturesList is not specified, the default features list will be used. RandomSeed is the random seed used for training the random forest classifier.
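If you call BuildClassifier from a script, the command line can be assembled programmatically. The sketch below only builds the argument list using the flags documented above; the helper name, the `./BuildClassifier` path, and the file names in the example are assumptions for illustration.

```python
import shlex

def build_classifier_command(mutation_table, classifier_name, mutations_list,
                             features_list=None, random_seed=None, genomic=False):
    """Assemble a BuildClassifier argument list from the documented flags."""
    # "./BuildClassifier" assumes the current directory holds the CHASM scripts
    command = ["./BuildClassifier", "-m", mutation_table, "-o", classifier_name]
    if features_list is not None:      # -f is optional
        command += ["-f", features_list]
    command += ["-i", mutations_list]
    if random_seed is not None:        # -s is optional
        command += ["-s", str(random_seed)]
    if genomic:                        # -g is required for genomic coordinates
        command.append("-g")
    return command

# Hypothetical file and classifier names
command = build_classifier_command("OV.context", "MyClassifier", "drivers.txt",
                                   random_seed=42, genomic=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory where the CHASM scripts are located.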
Running the Classifier
- Construct the input file
CHASM accepts mutations defined as amino acid substitutions at a specific codon position in transcript identifiers from NCBI RefSeq, CCDS, or Ensembl. Both mRNA transcript and protein accession identifiers are accepted. Prepare a tab delimited text file of your mutations in the following format:
NP_001135977    R641W
NP_835455       R151C
NP_055645       L590V
NP_689808       D28H
NP_005472       S372R
NP_112493       S35R
NP_859061       A118V
NP_892018       R153C
NP_001074003    R264Q
NP_001073893    R1272C
NP_808882       R30H
...
Optionally, a column can be added before the transcript column with an identifier that might be useful for tracking mutations:
1    NP_001135977    R641W
2    NP_835455       R151C
3    NP_055645       L590V
4    NP_689808       D28H
...
NOTE: The ".tmps" extension used by CHASM stands for "transcript-mutation pairs".
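Before running CHASM it can help to check that every line of the input file has the expected shape. The following sketch is one possible check, not part of CHASM: the function name and the regular expression (one uppercase letter, a codon position, one uppercase letter) are assumptions based on the examples above.

```python
import re

# Assumed pattern for a substitution such as "R641W"
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_pair(line, row_number):
    """Split one transcript-mutation line into (id, accession, ref, position, alt)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:       # optional identifier column is present
        uid, accession, change = fields
    elif len(fields) == 2:     # no identifier: fall back to the row number
        uid = str(row_number)
        accession, change = fields
    else:
        raise ValueError(f"row {row_number}: expected 2 or 3 tab-delimited columns")
    match = SUBSTITUTION.match(change)
    if match is None:
        raise ValueError(f"row {row_number}: unrecognised substitution {change!r}")
    ref_aa, position, alt_aa = match.groups()
    return uid, accession, ref_aa, int(position), alt_aa

first = parse_pair("NP_001135977\tR641W", 1)
second = parse_pair("7\tNP_835455\tR151C", 2)
print(first)
print(second)
```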
Alternatively, CHASM supports genomic coordinates in build GRCh37/hg19 of the human genome. This file should be tab delimited with 6 or 7 columns (the first column with identifiers is optional; CHASM will use row numbers in the input file if no identifier is specified). The columns are described in the note after the following example:
1     chr22    25115448    25115449    +    A    G
2     chr22    25119119    25119120    +    A    G
3     chr22    25124310    25124311    -    C    T
4     chr22    25144911    25144912    +    C    T
5     chr22    25145752    25145753    -    C    T
6     chr22    25147422    25147423    +    C    T
7     chr22    25150137    25150138    +    A    G
8     chr22    25152617    25152618    +    C    T
9     chr22    25158437    25158438    +    C    T
10    chr22    24121377    24121378    +    C    T
...
NOTE: Genomic coordinate file columns are as follows:
- Mutation identifier (optional; CHASM will assign the row number if this column is left off)
- Chromosome (for example, chr22)
- 0-based coordinate of base substitution
- 1-based coordinate of base substitution
- Strand on which reference and alternative bases are reported
- Reference base at this position
- Alternate base at this position
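If your variant calls carry a single 1-based position, a row in the genomic coordinate format can be produced as sketched below. The helper name is hypothetical; the column order follows the example rows above, where the two coordinate columns differ by one (0-based, then 1-based, for the same base).

```python
def genomic_row(chrom, pos_1based, strand, ref, alt, uid=None):
    """Format one tab-delimited genomic coordinate row for a single base substitution."""
    fields = [
        chrom,
        str(pos_1based - 1),  # 0-based coordinate of the substituted base
        str(pos_1based),      # 1-based coordinate of the same base
        strand,
        ref,
        alt,
    ]
    if uid is not None:       # the identifier column is optional
        fields.insert(0, str(uid))
    return "\t".join(fields)

row = genomic_row("chr22", 25115449, "+", "A", "G", uid=1)
print(row)
```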
- Run the classifier
Run RunChasm from the directory where the CHASM scripts are located.
For inputs defined as transcripts and amino acid substitutions:
RunChasm classifier_name mutation_list
For inputs defined as genomic coordinates:
RunChasm classifier_name mutation_list -g
where classifier_name is the name of the classifier you built earlier, and mutation_list is the location of the list of mutations generated in step (1).
- The following files will be generated in the directory containing your mutation list file:
- The ARFF (Attribute-Relation File Format) file generated by SNVBox
@relation headerfile
@attribute UID string
@attribute ID string
@attribute ExonConservation numeric
@attribute ExonSnpDensity numeric
@attribute ExonHapMapSnpDensity numeric
@attribute HMMRelEntropy numeric
@attribute HMMEntropy numeric
@attribute HMMPHC numeric
@attribute MGARelEntropy numeric
@attribute MGAEntropy numeric
@attribute MGAPHC numeric
...
@data
1 NP_001135977.1_R641W 0.693567650685 0.00588235294118 0.0 0.191425 ...
2 NP_835455.1_R151C 0.763530500574 0.00496838301716 0.000451671183379 ...
3 NP_055645.1_L590V 0.520834494575 0.0448717948718 0.00961538461538 ...
4 NP_689808.2_D28H 0.540801628658 0.0838926174497 0.00335570469799 ...
...
- The results are given in a tab delimited text file:
- 1st Column is mutation identifier (either user assigned ID or row number of mutation in the input file)
- 2nd Column is Transcript + Mutation
- 3rd Column is the raw CHASM score
- 4th Column is the p-value
- 5th Column is the Benjamini-Hochberg False Discovery Rate [not computed if number of mutations to be classified is less than a user-configurable value (Default=10)].
- Example results file:
MutationID    Mutation                 CHASM    PValue    BHFDR
1             NP_001135977.1_R641W     0.866    0.726     1.00
2             NP_835455.1_R151C        0.872    0.749     1.00
3             NP_055645.1_L590V        0.888    0.808     1.00
4             NP_689808.2_D28H         0.886    0.801     1.00
5             NP_005472.2_S372R        0.902    0.857     1.00
6             NP_112493.2_S35R         0.832    0.608     1.00
7             NP_859061.3_A118V        0.772    0.409     1.00
8             NP_892018.1_R153C        0.932    0.924     1.00
9             NP_001074003.1_R264Q     0.884    0.792     1.00
10            NP_001073893.1_R1272C    0.884    0.792     1.00
11            NP_808882.1_R30H         0.976    0.985     1.00
...
where <input-file> is the name of your mutation list file.
Please note that you may encounter SNVBox error messages during feature retrieval while running CHASM.
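For downstream analysis, the tab delimited results file can be loaded with Python's standard `csv` module. This is a minimal sketch: the function name is hypothetical, the column names are taken from the example header on this page, and the BHFDR value is treated as possibly missing because the FDR is not computed for small inputs.

```python
import csv
import io

def read_chasm_results(handle):
    """Load a CHASM results file into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    results = []
    for row in reader:
        fdr = row.get("BHFDR")  # may be absent or empty when FDR is not computed
        results.append({
            "id": row["MutationID"],
            "mutation": row["Mutation"],
            "score": float(row["CHASM"]),
            "p_value": float(row["PValue"]),
            "fdr": float(fdr) if fdr else None,
        })
    return results

# Three rows copied from the example results on this page
SAMPLE = (
    "MutationID\tMutation\tCHASM\tPValue\tBHFDR\n"
    "1\tNP_001135977.1_R641W\t0.866\t0.726\t1.00\n"
    "6\tNP_112493.2_S35R\t0.832\t0.608\t1.00\n"
    "7\tNP_859061.3_A118V\t0.772\t0.409\t1.00\n"
)

results = read_chasm_results(io.StringIO(SAMPLE))
by_p_value = sorted(results, key=lambda r: r["p_value"])  # smallest p-value first
print([r["mutation"] for r in by_p_value])
```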
Passenger Mutation Rate Tables
What is a passenger mutation rate table?
A passenger mutation rate table contains an approximation of the tumor-specific background base substitution rates for base substitutions leading to somatic nonsynonymous passenger mutations.
Available passenger mutation rate tables:
Currently, CHASM contains passenger mutation rate tables for a number of tumor types, including:
- Acute Myeloid Leukemia
- Bladder Urothelial Carcinoma
- Brain Lower Grade Glioma
- Breast invasive carcinoma
- Cervical squamous cell carcinoma and endocervical adenocarcinoma
- Chronic Lymphocytic Leukemia
- Colon Adenocarcinoma
- Gastric (intestinal and diffuse)
- Glioblastoma Multiforme
- Head and Neck Squamous Carcinoma
- Hepatocellular carcinoma
- Kidney Chromophobe
- Kidney Renal Clear Cell Carcinoma
- Kidney Renal Papillary Cell Carcinoma
- Lung Adenocarcinoma
- Lung Squamous Cell Carcinoma
- Medulloblastoma
- Melanoma
- Ovarian Serous Cystadenocarcinoma
- Pancreatic Cancer
- Prostate Adenocarcinoma
- Prostate Cancer
- Rectum Adenocarcinoma
- Skin Cutaneous Melanoma
- Stomach Adenocarcinoma
- Thyroid Carcinoma
- Uterine Corpus Endometrioid Carcinoma
Source of data used to construct passenger frequency tables
|BLCA.context||Bladder Urothelial Carcinoma||TCGA||Jun 2013|
|BRCA.context||Breast invasive carcinoma||TCGA||Jun 2012|
|CESC.context||Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma||TCGA||Jun 2013|
|CLL.context||Chronic Lymphocytic Leukemia||ICGC||Mar 2013|
|COAD.context||Colon Adenocarcinoma||TCGA||Jun 2013|
|GBM.context||Glioblastoma Multiforme||TCGA||Jun 2013|
|GID.context||Gastric (Intestinal and Diffuse)||ICGC||Mar 2013|
|HCCA.context||Hepatocellular Carcinoma (Secondary to Alcohol and Adiposity)||ICGC||Mar 2013|
|HCCV.context||Hepatocellular Carcinoma (Viral)||ICGC||Mar 2013|
|HNSC.context||Head and Neck Squamous Carcinoma||TCGA||Jun 2013|
|KICH.context||Kidney Chromophobe||TCGA||Jun 2013|
|KIRC.context||Kidney Renal Clear Cell Carcinoma||TCGA||Jun 2013|
|KIRP.context||Kidney Renal Papillary Cell Carcinoma||TCGA||Jun 2013|
|LAML.context||Acute Myeloid Leukemia||TCGA||Jun 2013|
|LGG.context||Brain Lower Grade Glioma||TCGA||Jun 2013|
|LUAD.context||Lung Adenocarcinoma||TCGA||Jun 2013|
|LUSC.context||Lung Squamous Cell Carcinoma||TCGA||Jun 2013|
|MB.context||Medulloblastoma (mixed with data from tumors with similar mutation spectra)||PMID:21163964||Dec 2010|
|ML.context||Melanoma||Yardena Samuels Lab||Dec 2011|
|OV.context||Ovarian Serous Cystadenocarcinoma||TCGA||Jun 2013|
|PC.context||Prostate Cancer||ICGC||Mar 2013|
|PNCC.context||Pancreatic Cancer (OICR-CA and QCMG-AU)||ICGC||Mar 2013|
|PRAD.context||Prostate Adenocarcinoma||TCGA||Jun 2013|
|READ.context||Rectum Adenocarcinoma||TCGA||Jun 2013|
|SKCM.context||Skin Cutaneous Melanoma||TCGA||Jun 2013|
|STAD.context||Stomach Adenocarcinoma||TCGA||Jun 2013|
|THCA.context||Thyroid Carcinoma||TCGA||Jun 2013|
|UCEC.context||Uterine Corpus Endometrioid Carcinoma||TCGA||Jun 2013|
NOTES: There may be problems with the Gastric and Hepatocellular tables. For Gastric, mutations were not labeled as somatic vs. germline. For Hepatocellular, the data were sparse: only ~200 somatic mutations were available. We will provide updated versions of these tables in a future release.
What about cancers other than the types listed?
1) We suggest using the ovarian table as a default, since ovarian cancer shows no clear signatures of exogenous mutagens.
2) You may also consider constructing your own passenger mutation rate table.
There are many ways to construct your own custom table. Here, we suggest a protocol that has worked well for us. It makes a number of simplifying assumptions that you may not agree with. In that case, you may wish to modify the protocol. If you are happy with the results, please share your new protocol with other CHASM users via the mailing list.
Our protocol to build a context table:
Somatic missense mutations observed in tumor sequencing data are composed of a mixture of driver and passenger mutations. In order to limit our analysis to passenger mutation rates, we first subtract mutations observed in genes frequently mutated in cancer. The remaining mutations are conservatively assumed to be passengers, and the underlying base substitutions constitute the background mutation profile for the tumor type. (Simplifying Assumption #1)
To convert this profile to a table, base substitutions are grouped into 8 DNA sequence contexts, selected based on observed significant differences in context mutation rates in previous studies (Sjoblom Supplementary materials). The context table is 8x4 with the 8 selected di-nucleotide contexts as column headers and the 4 nucleotides as the rows. The 8 di-nucleotide contexts are as follows (Simplifying Assumption #2):
1) C*pG: C in a CpG di-nucleotide is mutated
2) CpG*: G in a CpG di-nucleotide is mutated
3) TpC*: C in a TpC di-nucleotide is mutated
4) G*pA: G in a GpA di-nucleotide is mutated
5) A: A is mutated
6) C: C other than C in a CpG or in a TpC is mutated
7) G: G other than G in a CpG or in a GpA is mutated
8) T: T is mutated
These are in order of priority, so a single base substitution in DNA sequence TCG where the C is mutated would fall under context 1 (C*pG) rather than context 3 (TpC*) or context 6 (C). The p represents the phosphate group between the 2 nucleotides.
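The priority rule can be written as a short chain of checks. The following Python sketch is an illustration only; the function name and the convention of passing a sequence plus an index are assumptions, not CHASM's own code.

```python
def assign_context(sequence, index):
    """Return the context label for the mutated base at sequence[index].

    Checks run in the priority order listed above, so the first match wins.
    """
    base = sequence[index]
    previous_base = sequence[index - 1] if index > 0 else ""
    next_base = sequence[index + 1] if index + 1 < len(sequence) else ""
    if base == "C" and next_base == "G":
        return "C*pG"   # 1) C in a CpG
    if base == "G" and previous_base == "C":
        return "CpG*"   # 2) G in a CpG
    if base == "C" and previous_base == "T":
        return "TpC*"   # 3) C in a TpC
    if base == "G" and next_base == "A":
        return "G*pA"   # 4) G in a GpA
    return base         # 5-8) remaining A, C, G, T

# The TCG example from the text: the middle C is in both a TpC and a CpG
print(assign_context("TCG", 1))
```

Because the checks are ordered, the C in TCG is assigned to C*pG and never reaches the TpC* or C branches.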
In each context, the case where the base is not mutated is ignored. The remaining 8x3 possibilities form a distribution over all base substitutions that could occur in the 8 contexts. Each cell in the table then contains the approximate relative rate at which a specific base substitution occurs within a context, estimated from whole exome sequencing data for a specific tumor type.
Example Context Table:
Base To    C*pG     CpG*     TpC*     G*pA     A        C        G        T
A          0.009    0.227    0.014    0.031    -        0.043    0.077    0.028
C          -        0.005    -        0.015    0.012    -        0.021    0.028
G          0.005    -        0.009    -        0.060    0.029    -        0.015
T          0.172    0.003    0.021    0.023    0.025    0.074    0.057    -
Tumor type-specific context tables can be generated from a MAF or VCF file that contains somatic variant calls from a population of tumor samples for a particular cancer being studied. The tables included in the CHASM package were computed with data from studies that included at least ten samples.
- Determine the number of non-silent somatic single base variants.
- Drop any mutations occurring in genes mutated in the cancer of interest at a high frequency (gene "mountains"). As an expert in the cancer type you have sequenced, you may already know which genes are mountains. Alternatively, you can use the COSMIC database http://www.sanger.ac.uk/genetics/CGP/cosmic/ to select "mountains". For the CHASM publications, the following genes were considered mountains: TP53, KRAS, SMAD4, CDKN2A, NF1, RB1, PIK3CA, PTEN
- Divide the non-silent somatic single base variants not in mountains into the 8x3 context + base substitution categories.
- Divide the value for each category by the total number of mutations remaining after Step 2.
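The filtering and normalisation steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name is hypothetical, each mutation is assumed to be already reduced to a (gene, context, to_base) triple, and the gene names other than TP53 in the example are placeholders.

```python
from collections import Counter

CONTEXTS = ["C*pG", "CpG*", "TpC*", "G*pA", "A", "C", "G", "T"]
BASES = "ACGT"
# Mountain genes listed for the CHASM publications
MOUNTAINS = {"TP53", "KRAS", "SMAD4", "CDKN2A", "NF1", "RB1", "PIK3CA", "PTEN"}

def build_context_table(mutations, mountains=MOUNTAINS):
    """Turn (gene, context, to_base) triples into relative substitution rates."""
    # Drop mutations in mountain genes
    kept = [(context, to_base) for gene, context, to_base in mutations
            if gene not in mountains]
    if not kept:
        raise ValueError("no mutations left after removing mountain genes")
    # Count each context + base substitution category
    counts = Counter(kept)
    # Divide each count by the number of mutations that remain
    total = len(kept)
    return {context: {base: counts[(context, base)] / total for base in BASES}
            for context in CONTEXTS}

# Toy input; GENE1 to GENE4 are placeholder gene names
toy = [
    ("TP53", "C*pG", "T"),   # removed: mountain gene
    ("GENE1", "C*pG", "T"),
    ("GENE2", "A", "G"),
    ("GENE3", "C*pG", "T"),
    ("GENE4", "T", "C"),
]
rates = build_context_table(toy)
print(rates["C*pG"]["T"], rates["A"]["G"])
```

Categories that are never observed, including the "not mutated" cells, simply come out as 0.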
NOTE: The CHASM software generates mutations at random in mRNA transcript sequence (not in genomic DNA sequence). Thus when constructing passenger rate tables from DNA sequencing data, if a mutation occurs in a transcript on the negative DNA strand, the mutation should be included in the passenger rate table as it would appear on the negative strand.
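One way to apply this note is to reverse-complement the local sequence and complement the alternate base for variants in transcripts on the negative strand. The sketch below assumes variants are reported on the plus strand with a three-base window centred on the mutated base; the function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def to_transcript_sense(trinucleotide, alt, transcript_strand):
    """Express a plus-strand substitution as it appears on the transcript's strand.

    trinucleotide: plus-strand bases around the variant, mutated base in the middle.
    alt: plus-strand alternate base.
    """
    if transcript_strand == "+":
        return trinucleotide, alt
    if transcript_strand == "-":
        # Reverse-complement the window and complement the alternate base
        return trinucleotide.translate(COMPLEMENT)[::-1], alt.translate(COMPLEMENT)
    raise ValueError(f"unknown strand {transcript_strand!r}")

# A plus-strand G>A inside CGA reads as C>T inside TCG on the negative strand
print(to_transcript_sense("CGA", "A", "-"))
```

After this conversion, the middle base and its neighbours can be assigned to one of the 8 contexts in the usual way.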
How CHASM uses passenger mutation rate tables:
CHASM uses passenger mutation rate tables to create the synthetic class of passenger missense mutations used in classifier training and in the statistical analysis of CHASM scores. We assume that tumor-specific differences in the relative background context + base substitution rates are representative of the mutation process of the tumor type of interest. We expect that missense mutations observed in this tumor type, whether driver or passenger mutations, were generated under the constraints of this same underlying process, but that driver mutations are rare events that are selected for, while passenger mutations occur more frequently and at random.
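To give a feel for how a rate table can drive the generation of synthetic passengers, the sketch below draws (context, to_base) pairs with probability proportional to their rates. It is not CHASM's implementation, which also has to place mutations in transcript sequence and keep only missense changes; the function name is hypothetical and the toy table uses two columns from the example table.

```python
import random

def sample_substitutions(table, n, seed=None):
    """Draw n (context, to_base) pairs in proportion to their rates."""
    rng = random.Random(seed)
    pairs = [(context, to_base)
             for context, rates in table.items()
             for to_base in rates]
    weights = [table[context][to_base] for context, to_base in pairs]
    return rng.choices(pairs, weights=weights, k=n)

# Two columns of the example context table
toy_table = {
    "C*pG": {"A": 0.009, "C": 0.0, "G": 0.005, "T": 0.172},
    "A": {"A": 0.0, "C": 0.012, "G": 0.060, "T": 0.025},
}
draws = sample_substitutions(toy_table, 1000, seed=1)
print(draws[:5])
```

Pairs with a rate of 0 are never drawn, and frequent substitutions such as C*pG to T dominate the sample.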
- Users can customise certain parameters used by CHASM by editing the configuration file chasm_classifiers.conf located in the CHASM directory.
- Below are the default values:
; CHASM_Classifiers Configuration file
; Contains references to all software and directories used to build the classifiers
; PARF path
;parfpath=/programs/parf/parf
parfpath=./parf
; CHASM Classifier Pack location
chasmclassifierpack=ClassifierPack
; Cancer-related gene list (used for nullset generation)
blacklist=blacklist.ids
; Minimum number of genes present before passenger generation
; is restricted to the genes in which user variants occured
whitelistcutoff=1000
defaultwhitelist=whitelist.ids
multipletestingminimum=10
; Refseq blob files
transcriptlibrary=refseq.NM.fa
transcriptlibrarymetaData=refseq.metaData
transcriptlibraryinfo=refseq.info
; Ensembl to hugo
Ensembl2Hugo=ensembl_export.txt
; CCDS to Refseq
CCDS2Refseq=ccds_refseq.txt
; CHASM Default Classifier Output location
chasmdefaultoutputdirectory=BuiltClassifiers
; Classifier Specifications
passengernum=14000
nullnum=5000
; Number of trees to build
treenum=500
; Number of random draws used for decision tree construction
; default is sqrt(number of features used)
; Must be <= total number of features.
; Large values may cause classifier overfitting.
mtry=default
- Description of each parameter:
- parfpath - location of parf executable
- chasmclassifierpack - location of the Classifier Pack.
- blacklist - location of file containing list of cancer related genes for filtering generated passengers
- whitelistcutoff - minimum number of genes represented in the user's input file in order for passenger generation to use those genes rather than the default list supplied by CHASM
- defaultwhitelist - file containing the default whitelist of genes in which to generate passenger mutations if fewer than whitelistcutoff genes are represented in the user's mutation data
- multipletestingminimum - minimum number of mutations before FDR calculations will be performed
- transcriptlibrary - location of file containing Refseq Transcript sequences used for passenger generation
- transcriptlibrarymetaData - location of file containing additional information for each Refseq transcript
- transcriptlibraryinfo - location of file containing mappings between Refseq protein, transcript and Hugo ID
- Ensembl2Hugo - location of file containing mappings between Ensembl and Hugo
- CCDS2Refseq - location of file containing mappings between CCDS and Refseq Transcripts
- chasmdefaultoutputdirectory - default output directory where new classifiers are placed
- passengernum - number of passengers generated to train each classifier
- nullnum - number of passengers generated for each null set
- treenum - number of trees used in random forest classifier
- mtry - number of random draws to perform during random forest construction. The default, sqrt(number of features in the .arff file), can be overridden here.
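If you want to inspect these settings from a script, Python's standard `configparser` can read the file once a section header is supplied, since the file itself has none. This is a sketch under that assumption; the function name and the `[chasm]` section label are made up for illustration, and the excerpt below is copied from the default values above.

```python
import configparser

def load_chasm_config(text):
    """Parse the key=value lines of chasm_classifiers.conf into a dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. Ensembl2Hugo
    # configparser requires a section, so prepend a placeholder header
    parser.read_string("[chasm]\n" + text)
    return dict(parser["chasm"])

# Excerpt of the default configuration; lines starting with ";" are comments
EXCERPT = """\
; PARF path
;parfpath=/programs/parf/parf
parfpath=./parf
whitelistcutoff=1000
multipletestingminimum=10
Ensembl2Hugo=ensembl_export.txt
treenum=500
mtry=default
"""

config = load_chasm_config(EXCERPT)
print(config["parfpath"], int(config["treenum"]), config["mtry"])
```

All values are returned as strings, so numeric settings such as treenum need an explicit conversion, and mtry must be checked for the literal value default before converting.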