Changes From the Original CHASM
CHASM was originally developed as a software pipeline that depended on a large number of third-party software packages, most of which were used to calculate features. To distribute CHASM as a software package, we redesigned it: features are now systematically precomputed on our lab servers and stored in the SNVBox database for easy access. We also made several steps of the original pipeline more efficient: the synthetic passenger generation algorithm, estimation of missing feature values, and feature standardization.
Users who would like to match the results of the original CHASM more closely are advised to follow a protocol in which a large number of synthetic passenger mutations are generated and a held-out partition of the training set (both drivers and "passengers") is used for feature selection. See Customizing CHASM.
The original CHASM used a base set of 80 features and a feature selection protocol for each cancer type (Carter et al. 2009). Briefly, a training set of drivers and synthetic passengers was split into two equal-sized partitions, with one partition used for feature selection and the other for Random Forest training. The most informative features for a particular cancer type were selected by computing the mutual information between each feature and the class labels, and retaining only features with at least 0.001 bits of mutual information.
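The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the original CHASM: the function names (`split_in_half`, `mutual_information_bits`, `select_features`) are hypothetical, and the sketch assumes that feature values have already been discretized into bins, since the binning used in the original pipeline is not described here.

```python
import math
import random
from collections import Counter


def split_in_half(items, seed=0):
    """Shuffle a copy of the training set and split it into two partitions
    (equal-sized when the number of items is even)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mutual_information_bits(values, labels):
    """Mutual information I(X; Y), in bits, between one discrete feature
    and the class labels, estimated from observed frequencies."""
    n = len(values)
    if n == 0 or n != len(labels):
        raise ValueError("values and labels must be non-empty and of equal length")
    count_x = Counter(values)
    count_y = Counter(labels)
    count_xy = Counter(zip(values, labels))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def select_features(feature_columns, labels, min_bits=0.001):
    """Keep the names of features whose mutual information with the labels
    is at least min_bits (0.001 bits in the protocol described above)."""
    return [
        name
        for name, column in feature_columns.items()
        if mutual_information_bits(column, labels) >= min_bits
    ]
```

In this sketch, `feature_columns` maps a feature name to its list of (binned) values on the feature selection partition, and `labels` holds the matching driver/passenger labels. The Random Forest would then be trained on the other partition using only the selected features.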
86 features are available in SNVBox. We attempted to reduce the number of redundant features. Also, the ortholog conservation feature "OMA score" has been replaced by three conservation scores generated from UCSC Genome Browser 46-way vertebrate alignments.
If you do not specify which SNVBox features you want to use, CHASM beta 0.1 will use all available features by default. It may be possible to achieve better results by performing feature selection prior to classifier training.
CHASM now contains an improved version of the passenger generation algorithm described in Carter et al. 2009. The number of passenger mutations generated in a gene now depends only on information from the passenger mutation rate table and the length of the gene. Previously, we also explicitly incorporated (a multiple of) the number of somatic mutations observed in the gene in Wood et al. 2007, Parsons et al. 2008, and Jones et al. 2008.
The genes used for passenger generation are now based on the genes in which a user's input mutations occur. We do this to control for gene-level bias, so that the classifier is not trained simply to distinguish between driver and passenger mutations based on the genes in which they occur. However, if the user's input mutations fall in fewer than a threshold number of genes, passengers are generated from both these genes and an additional 9,000 genes that have previously been seen to be somatically mutated in five cancer types sequenced at the Sidney Kimmel Cancer Center at Johns Hopkins.
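The gene-pool rule above amounts to a conditional set union. The sketch below is a minimal illustration under stated assumptions: the function name `passenger_gene_pool` and its parameters are hypothetical, and the threshold is left as an argument because its actual value is not given here.

```python
def passenger_gene_pool(input_genes, background_genes, min_genes):
    """Return the set of genes in which synthetic passengers are generated.

    input_genes:      genes containing the user's input mutations
    background_genes: the additional set of previously observed somatically
                      mutated genes (about 9,000 in the text above)
    min_genes:        hypothetical threshold; the real value is not stated
    """
    pool = set(input_genes)
    if len(pool) < min_genes:
        # Too few input genes: add the background genes to the pool.
        pool |= set(background_genes)
    return pool
```

Counting distinct genes (rather than mutations) is an assumption of this sketch; duplicates in `input_genes` are collapsed before the threshold is checked.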
We now apply a more stringent filter to the genes included in the empirical null distribution that is used to generate p-values (described in Carter et al. 2009). The filter is designed to prevent passengers from being generated in genes that have previously been associated with cancer. This minimizes the number of driver mutations created at random by the passenger generation algorithm, which would confound the statistical significance of CHASM scores. The filter now consists of all genes in the Cancer Gene Census, the COSMIC cancer gene list, and the MSigDB cancer gene sets (C4 collection).
Missing Value Fill
The original CHASM used a k-nearest neighbors (KNN) algorithm: a missing feature value was estimated from the values of that feature in the k mutations "closest" to the mutation with the missing value. See Carter et al. 2009 for details. Given the very large number of precomputed features in SNVBox, KNN is no longer tractable for CHASM, and we now fill in missing features using the mean feature value.
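Mean-value filling can be illustrated as follows. This is a sketch, not CHASM's implementation: the helper names are hypothetical, missing values are represented as `None`, and the means here are computed from the observed values in the supplied rows, which is an assumption (the text above does not say which population the mean is taken over).

```python
def column_means(rows):
    """Mean of each feature over the rows where it is present (not None)."""
    sums, counts = {}, {}
    for row in rows:
        for name, value in row.items():
            if value is not None:
                sums[name] = sums.get(name, 0.0) + value
                counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


def fill_missing(rows, means):
    """Return copies of the rows with every None replaced by that feature's mean.

    Raises KeyError if a feature is missing and no mean is available for it.
    """
    return [
        {name: (means[name] if value is None else value) for name, value in row.items()}
        for row in rows
    ]
```

For example, `fill_missing(rows, column_means(rows))` fills each gap with the mean of the observed values for that feature. Precomputed means could be passed in instead of `column_means(rows)`.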
The original CHASM computed raw feature values for all mutations used in training and feature selection and for all mutations to be classified. Each feature was then scaled using all values computed for these sets, by subtracting its mean value and dividing by its rms.
SNVBox contains pre-computed raw feature values for (almost) all positions in the human exome. For each feature it also contains its mean value and rms with respect to the whole exome. CHASM pulls raw features from SNVBox and scales them on the fly with the same formula, in effect scaling based on the whole exome rather than the subset of mutations from a particular dataset.
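The scaling step is the formula (raw value - mean) / rms applied with exome-wide statistics. A minimal sketch follows, assuming the per-feature mean and rms have already been retrieved. The function names and the `exome_stats` dictionary are hypothetical, and the database lookup itself is not shown.

```python
def scale_feature(raw_value, exome_mean, exome_rms):
    """Scale one raw feature value: subtract the mean, divide by the rms."""
    if exome_rms == 0:
        raise ValueError("rms must be non-zero to scale a feature")
    return (raw_value - exome_mean) / exome_rms


def scale_mutation(raw_features, exome_stats):
    """Scale all features of one mutation.

    raw_features: {feature_name: raw_value}
    exome_stats:  {feature_name: (exome_mean, exome_rms)}, assumed to have
                  been read from the precomputed per-feature statistics
    """
    return {
        name: scale_feature(value, *exome_stats[name])
        for name, value in raw_features.items()
    }
```

Because the statistics are fixed for the whole exome, the scaled value of a mutation does not change when other mutations are added to or removed from a dataset, unlike in the original per-dataset scaling.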