Preparing the requisite files
SNVBox accepts mutations defined as amino-acid residue substitutions, along with an accession identifier from NCBI RefSeq, CCDS, or Ensembl. Both mRNA transcript and protein accession identifiers are accepted. Prepare a tab-delimited text file of your mutations, one mutation per line:
NP_001135977	R641W
NP_835455	R151C
NP_055645	L590V
NP_689808	D28H
NP_005472	S372R
NP_112493	S35R
NP_859061	A118V
NP_892018	R153C
NP_001074003	R264Q
NP_001073893	R1272C
NP_808882	R30H
...
Alternatively, a column with mutation identifiers can be included. If the user does not include this column, the row number of the mutation in the input file will be used automatically.
1	NP_001135977	R641W
2	NP_835455	R151C
3	NP_055645	L590V
4	NP_689808	D28H
5	NP_005472	S372R
6	NP_112493	S35R
7	NP_859061	A118V
8	NP_892018	R153C
9	NP_001074003	R264Q
10	NP_001073893	R1272C
11	NP_808882	R30H
...
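A mutation file in this layout can also be generated by a short script. The following is a minimal Python sketch, not part of SNVBox: the helper names `is_valid_substitution` and `write_mutation_file` are hypothetical, and the regular expression is an assumption about what counts as a well-formed substitution (a standard residue letter, a position, and another standard residue letter). Under that assumption, a nonsense change such as Y1136* is dropped before the file is written.

```python
import csv
import re

# Assumed format: reference residue, 1-based position, substituted residue.
# Only the 20 standard amino-acid letters are accepted (an assumption).
_AA = "ACDEFGHIKLMNPQRSTVWY"
_SUBSTITUTION = re.compile(rf"[{_AA}][1-9]\d*[{_AA}]")


def is_valid_substitution(change):
    """Return True if `change` looks like a missense substitution, e.g. R641W."""
    return _SUBSTITUTION.fullmatch(change) is not None


def write_mutation_file(path, mutations, with_ids=False):
    """Write (accession, substitution) pairs as tab-delimited rows.

    Rows whose substitution fails the format check are skipped.
    If with_ids is True, a leading identifier column holding the
    output row number is added. Returns the number of rows written.
    """
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for accession, change in mutations:
            if not is_valid_substitution(change):
                continue
            written += 1
            row = [accession, change]
            if with_ids:
                row.insert(0, str(written))
            writer.writerow(row)
    return written
```

For example, `write_mutation_file("mutations.txt", [("NP_001135977", "R641W")], with_ids=True)` writes a single row with the identifier 1; the file name here is only a placeholder.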
Mutations can also be defined in genomic coordinates in build GRCh37/hg19 of the human genome. This file should be tab-delimited with 6 or 7 columns (the first column, holding identifiers, is optional; CHASM will use row numbers in the input file if no identifier is specified). The columns are described in the note that follows this example:
1	chr22	25115448	25115449	+	A	G
2	chr22	25119119	25119120	+	A	G
3	chr22	25124310	25124311	-	C	T
4	chr22	25144911	25144912	+	C	T
5	chr22	25145752	25145753	-	C	T
6	chr22	25147422	25147423	+	C	T
7	chr22	25150137	25150138	+	A	G
8	chr22	25152617	25152618	+	C	T
9	chr22	25158437	25158438	+	C	T
10	chr22	24121377	24121378	+	C	T
...
NOTE: Genomic coordinate file columns are as follows:
- Mutation identifier (optional - CHASM will assign row number if this column is left off)
- Chromosome (for example, chr22)
- 0-based coordinate of base substitution
- 1-based coordinate of base substitution
- Strand on which reference and alternative bases are reported
- Reference base at this position
- Alternate base at this position
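In the example rows above, the two coordinates always differ by one (for instance 25115448 and 25115449), which is consistent with a 0-based start and a 1-based end for a single substituted base. The following minimal Python sketch builds one row under that assumption from an ordinary 1-based position. The helper `genomic_row` is hypothetical and is not part of CHASM or SNVBox.

```python
_BASES = {"A", "C", "G", "T"}


def genomic_row(chrom, pos_1based, strand, ref, alt, identifier=None):
    """Format one tab-delimited genomic-coordinate row.

    Assumes the two coordinate columns are the 0-based start and the
    1-based end of the single substituted base (pos - 1 and pos).
    The identifier column is only emitted when an identifier is given.
    """
    if pos_1based < 1:
        raise ValueError("position must be 1 or greater")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if ref not in _BASES or alt not in _BASES:
        raise ValueError("reference and alternate must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")

    fields = [chrom, str(pos_1based - 1), str(pos_1based), strand, ref, alt]
    if identifier is not None:
        fields.insert(0, str(identifier))
    return "\t".join(fields)
```

Calling `genomic_row("chr22", 25115449, "+", "A", "G", identifier=1)` reproduces the first example row.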
SNVBox accepts a custom list of features. Prepare a text file with the list of features you want:
ExonConservation
ExonSnpDensity
ExonHapMapSnpDensity
HMMRelEntropy
HMMEntropy
HMMPHC
MGARelEntropy
MGAEntropy
MGAPHC
PredRSAB
PredRSAE
PredBFactorF
PredBFactorS
PredSSC
PredSSE
PredSSH
AAHydrophobicity
AAVolume
AAPolarity
...
If mutations are specified with transcript and amino acid substitution, snvGetTranscript should be used for feature retrieval.
If genomic coordinates are used, snvGetGenomic should be used for feature retrieval.
- To retrieve features from a single mutation file (for genomic coordinates, replace snvGetTranscript with snvGetGenomic):
> snvGetTranscript -f [feature list file] -o [output arff file] [mutation file]
- To retrieve features from multiple classes, using the file names as class labels:
> snvGetTranscript -f [feature list file] -o [output arff file] [mutation file 1 as class 1] [mutation file 2 as class 2] etc.
- To retrieve features from multiple classes with custom class labels:
> snvGetTranscript -c -f [feature list file] -o [output arff file] [class 1] [mutation file 1] [class 2] [mutation file 2] etc.
- Optional Parameters
- -r Show raw feature values. Do not scale using mean and rms values.
- -m Show missing feature values. Do not fill missing values in with mean value.
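When feature retrieval is scripted, the invocations above can be assembled as an argument list. The following is a minimal Python sketch; the helper `build_snvget_command` is hypothetical and only arranges the options documented above (-c, -f, -o, -r, -m). Placing -r and -m before -f is an assumption, since the usage lines above do not show where the optional parameters go. The sketch only builds the list and does not run the program.

```python
import shlex


def build_snvget_command(program, feature_file, output_arff, mutation_files,
                         class_labels=None, raw=False, show_missing=False):
    """Return an argv list for snvGetTranscript or snvGetGenomic.

    With class_labels, the -c form is used and each label is followed
    by its mutation file. Without it, the mutation files are listed
    directly and the file names serve as class labels.
    """
    if program not in ("snvGetTranscript", "snvGetGenomic"):
        raise ValueError("program must be snvGetTranscript or snvGetGenomic")
    if not mutation_files:
        raise ValueError("at least one mutation file is required")
    if class_labels is not None and len(class_labels) != len(mutation_files):
        raise ValueError("need exactly one class label per mutation file")

    argv = [program]
    if class_labels is not None:
        argv.append("-c")
    if raw:
        argv.append("-r")  # assumed position: before -f
    if show_missing:
        argv.append("-m")  # assumed position: before -f
    argv += ["-f", feature_file, "-o", output_arff]

    if class_labels is None:
        argv += list(mutation_files)
    else:
        for label, path in zip(class_labels, mutation_files):
            argv += [label, path]
    return argv


if __name__ == "__main__":
    # File names and class labels below are placeholders.
    command = build_snvget_command(
        "snvGetTranscript", "features.txt", "out.arff",
        ["drivers.txt", "passengers.txt"],
        class_labels=["driver", "passenger"],
    )
    print(shlex.join(command))  # shlex.join requires Python 3.8+
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with file names that contain spaces.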
Sample error messages
Please note that you may run into the following error messages while using SnvGet.
- The "M44APHC" is not recognized and therefore omitted.
- This error occurs when there is an unrecognized feature name in the input file of requested features (-f option). The unrecognized feature will be ignored.
- The "MGAPHC" feature is repeated in the Features list. It has already been added to the feature set.
- This error occurs when there is a duplicated feature in the input file of requested features. The duplicated feature will be ignored.
- "NP_689637 Y1136*" is not a properly formatted mutation
- This error occurs when a row in the .tmps (transcript-mutation pairs) file is not properly formatted, or contains a stop codon (represented by the asterisk). This row will be ignored when retrieving feature values.
- Position/Wildtype 1126/A not found in NP_000236.
- This error occurs when the reference amino acid at the requested codon position does not match the residue at that position in the transcript stored in the database. This may be due to a mismatch in transcript versions. RefSeq, for example, is frequently updated, and the translation start site of a transcript may change between versions. We update the transcript sequences used by SNVBox on a weekly basis and should have the most recent version of each transcript available.
- Transcript ENST00000360484 not found in database.
- This error occurs when SNVBox does not contain a transcript. In this case, all mutations involving the transcript will be ignored.
- Warning: Refseq transcript version number (NM_032173.3) does not match Refseq version in database (NM_032173.2)
- The version number of the transcript used does not match the version number of the transcript in the current version of the SNVBox database. Features will be returned from the SNVBox database for the version stored in the database.
- Sequencing variant 36915 chr22:22288398 C>T maps to NM_014634.3 L185L. Only missense variants will be evaluated by CHASM.
- The specified genomic coordinate did not map to a missense mutation. The mutation will be ignored.
- Sequencing variant chr22:20800834 A>G did not map to a codon.
- The genomic coordinate did not map to a codon. The mutation will be ignored.
- Reference base specified for sequencing variant 1 chr22:25115448 C>G does not match reference base at that coordinate in hg19.
- The reference base specified does not match the base at that position in GRCh37/hg19. The mutation will be ignored.
SNVBox features are detailed here. To use a feature, list its FeatureID in the feature list file that you pass to the SnvGet utility with the -f option.