SNVBox Tutorial

From Chasm Software Wiki

Latest revision as of 21:46, 21 March 2012


Preparing the requisite files

  1. SNVBox accepts mutations defined as amino-acid residue substitutions, along with an accession identifier from NCBI RefSeq, CCDS, or Ensembl. Both mRNA transcript and protein accession identifiers are accepted. Prepare a tab-delimited text file of your mutations, one mutation per line:
    NP_001135977	R641W
    NP_835455	R151C
    NP_055645	L590V
    NP_689808	D28H
    NP_005472	S372R
    NP_112493	S35R
    NP_859061	A118V
    NP_892018	R153C
    NP_001074003	R264Q
    NP_001073893	R1272C
    NP_808882	R30H
    ...
    

    Optionally, a first column with mutation identifiers can be included. If this column is omitted, the row number of each mutation in the input file is used as its identifier.

    1      NP_001135977	R641W
    2      NP_835455	R151C
    3      NP_055645	L590V
    4      NP_689808	D28H
    5      NP_005472	S372R
    6      NP_112493	S35R
    7      NP_859061	A118V
    8      NP_892018	R153C
    9      NP_001074003	R264Q
    10     NP_001073893	R1272C
    11     NP_808882	R30H
    ...
    



    Mutations can also be defined by genomic coordinates in build GRCh37/hg19 of the human genome. This file should be tab-delimited with 6 or 7 columns; the first column, with mutation identifiers, is optional, and CHASM will use row numbers in the input file if no identifier is specified. The columns are described in the note following the example.

    1     chr22   25115448        25115449        +       A       G
    2     chr22   25119119        25119120        +       A       G
    3     chr22   25124310        25124311        -       C       T
    4     chr22   25144911        25144912        +       C       T
    5     chr22   25145752        25145753        -       C       T
    6     chr22   25147422        25147423        +       C       T
    7     chr22   25150137        25150138        +       A       G
    8     chr22   25152617        25152618        +       C       T
    9     chr22   25158437        25158438        +       C       T
    10    chr22   24121377        24121378        +       C       T
    ...
    

    NOTE: Genomic coordinate file columns are as follows:

    • Mutation identifier (optional; CHASM assigns the row number if this column is omitted)
    • Chromosome
    • 0-based coordinate of the base substitution
    • 1-based coordinate of the base substitution (the 0-based coordinate plus 1, as in the example above)
    • Strand on which the reference and alternate bases are reported
    • Reference base at this position
    • Alternate base at this position



  2. SNVBox accepts a custom list of features. Prepare a text file with the list of features you want:
    ExonConservation
    ExonSnpDensity
    ExonHapMapSnpDensity
    HMMRelEntropy
    HMMEntropy
    HMMPHC
    MGARelEntropy
    MGAEntropy
    MGAPHC
    PredRSAB
    PredRSAE
    PredBFactorF
    PredBFactorS
    PredSSC
    PredSSE
    PredSSH
    AAHydrophobicity
    AAVolume
    AAPolarity
    ...
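
    The two mutation file formats described in step 1 can be checked before they are passed to SNVBox. The following Python sketch is not part of SNVBox; the function names are hypothetical, and it assumes that columns may be separated by any whitespace (the examples above mix spaces and tabs) and that only the 20 standard amino acids and the bases A, C, G, T are valid. It returns None for the identifier when that column is omitted, so the caller can substitute the row number.

```python
import re

# Single-letter codes for the 20 standard amino acids (assumption: no others are accepted).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SUBSTITUTION = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")
BASES = {"A", "C", "G", "T"}


def parse_transcript_row(line):
    """Parse '[id] accession substitution' into (id, accession, ref, position, alt)."""
    fields = line.split()
    if len(fields) == 2:
        fields = [None] + fields  # identifier column omitted
    if len(fields) != 3:
        raise ValueError(f"expected 2 or 3 columns: {line!r}")
    mutation_id, accession, substitution = fields
    match = SUBSTITUTION.match(substitution)
    if match is None:
        # Also rejects stop codons such as Y1136*, which SNVBox does not accept.
        raise ValueError(f"not a properly formatted mutation: {line!r}")
    ref, position, alt = match.groups()
    return mutation_id, accession, ref, int(position), alt


def parse_genomic_row(line):
    """Parse '[id] chrom start0 end1 strand ref alt' into a 7-tuple."""
    fields = line.split()
    if len(fields) == 6:
        fields = [None] + fields  # identifier column omitted
    if len(fields) != 7:
        raise ValueError(f"expected 6 or 7 columns: {line!r}")
    mutation_id, chrom, start, end, strand, ref, alt = fields
    start, end = int(start), int(end)
    if end != start + 1:
        raise ValueError(f"1-based coordinate must equal 0-based coordinate + 1: {line!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be + or -: {line!r}")
    if ref not in BASES or alt not in BASES:
        raise ValueError(f"bases must be one of A, C, G, T: {line!r}")
    return mutation_id, chrom, start, end, strand, ref, alt
```

    Rows that raise ValueError here would likely be reported and skipped by SNVBox as well, so fixing them beforehand keeps the row numbering of the input predictable.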
    

Retrieving features

If mutations are specified with transcript and amino acid substitution, snvGetTranscript should be used for feature retrieval.
If genomic coordinates are used, snvGetGenomic should be used for feature retrieval.

  • To retrieve features from a single mutation file (for genomic coordinates, replace snvGetTranscript with snvGetGenomic):
    > snvGetTranscript -f [feature list file] -o [output arff file] [mutation file]
    
  • To retrieve features from multiple classes, using the file names as class labels:
    > snvGetTranscript -f [feature list file] -o [output arff file] [mutation file 1 as class 1]
      [mutation file 2 as class 2] etc.
    
  • To retrieve features from multiple classes with custom class labels:
    > snvGetTranscript -c -f [feature list file] -o [output arff file] [class 1] [mutation file 1]
      [class 2] [mutation file 2] etc.
    
  • Optional Parameters
    • -r Show raw feature values. Do not scale using mean and RMS (root mean square) values.
    • -m Show missing feature values. Do not fill in missing values with the mean value.
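
The invocation forms listed above can be assembled programmatically when many runs are needed. The Python sketch below only builds the argument list and does not run anything. The function name and file names are hypothetical, and two points are assumptions rather than documented behaviour: that -r and -m may be placed before -f, and that snvGetGenomic accepts the same multi-class forms as snvGetTranscript.

```python
def build_snvget_command(feature_file, output_arff, mutation_files,
                         genomic=False, class_labels=None,
                         raw=False, keep_missing=False):
    """Return the argument list for an snvGetTranscript or snvGetGenomic run."""
    argv = ["snvGetGenomic" if genomic else "snvGetTranscript"]
    if class_labels is not None:
        if len(class_labels) != len(mutation_files):
            raise ValueError("need exactly one class label per mutation file")
        argv.append("-c")  # custom class labels precede each mutation file
    if raw:
        argv.append("-r")  # raw values, no scaling (flag position is an assumption)
    if keep_missing:
        argv.append("-m")  # leave missing values unfilled (flag position is an assumption)
    argv += ["-f", feature_file, "-o", output_arff]
    if class_labels is None:
        # File names double as class labels when several files are given.
        argv += list(mutation_files)
    else:
        for label, path in zip(class_labels, mutation_files):
            argv += [label, path]
    return argv
```

If the tools are installed and on the PATH, the returned list could be passed to subprocess.run(argv, check=True); passing a list avoids shell quoting problems with file names.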

Sample error messages

You may encounter the following error messages while using snvGetTranscript or snvGetGenomic.

  • The "M44APHC" is not recognized and therefore omitted.
    • This error occurs when there is an unrecognized feature name in the input file of requested features (-f option). The unrecognized feature will be ignored.
  • The "MGAPHC" feature is repeated in the Features list. It has already been added to the feature set.
    • This error occurs when there is a duplicated feature in the input file of requested features. The duplicated feature will be ignored.
  • "NP_689637 Y1136*" is not a properly formatted mutation
    • This error occurs when a row in the .tmps (transcript-mutation pairs) file is not properly formatted, or contains a stop codon (represented by the asterisk). This row will be ignored when retrieving feature values.
  • Position/Wildtype 1126/A not found in NP_000236.
    • This error occurs when the reference amino acid at the requested codon position in the transcript does not match the amino acid at that codon position in the transcript stored in the database. This may be due to a mismatch in transcript versions. RefSeq, for example, is frequently updated, and the translation start site of a transcript may change between versions. We update the transcript sequences used by SNVBox on a weekly basis and should have the most recent version of each transcript available.
  • Transcript ENST00000360484 not found in database.
    • This error occurs when the SNVBox database does not contain the requested transcript. In this case, all mutations involving the transcript will be ignored.
  • Warning: Refseq transcript version number (NM_032173.3) does not match Refseq version in database (NM_032173.2)
    • The version number of the transcript used does not match the version number of the transcript in the current version of the SNVBox database. Features will be returned from the SNVBox database for the version stored in the database.
  • Sequencing variant 36915 chr22:22288398 C>T maps to NM_014634.3 L185L. Only missense variants will be evaluated by CHASM.
    • The specified genomic coordinate did not map to a missense mutation. The mutation will be ignored.
  • Sequencing variant chr22:20800834 A>G did not map to a codon.
    • The genomic coordinate did not map to a codon. The mutation will be ignored.
  • Reference base specified for sequencing variant 1 chr22:25115448 C>G does not match reference base at that coordinate in hg19.
    • The reference base specified does not match the base at that position in GRCh37/hg19. The mutation will be ignored.
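
When a run produces many messages, it can help to tally them by type to see how many mutations or features were dropped. The Python sketch below matches the sample messages listed above; the category names are hypothetical, and the patterns assume the message wording is exactly as shown here, which may differ between SNVBox versions.

```python
import re
from collections import Counter

# (category, pattern) pairs derived from the sample messages above.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("unrecognized_feature", re.compile(r'^The "(.+)" is not recognized')),
    ("duplicate_feature", re.compile(r'^The "(.+)" feature is repeated')),
    ("bad_mutation_format", re.compile(r"is not a properly formatted mutation")),
    ("reference_aa_mismatch", re.compile(r"^Position/Wildtype \S+ not found in")),
    ("transcript_missing", re.compile(r"^Transcript \S+ not found in database")),
    ("version_mismatch", re.compile(r"^Warning: Refseq transcript version number")),
    ("not_missense", re.compile(r"Only missense variants will be evaluated")),
    ("no_codon", re.compile(r"did not map to a codon")),
    ("reference_base_mismatch", re.compile(r"does not match reference base")),
]


def classify(line):
    """Return the category of one message line, or None if it matches no pattern."""
    text = line.strip()
    for category, pattern in PATTERNS:
        if pattern.search(text):
            return category
    return None


def tally(lines):
    """Count recognized messages per category, skipping unrecognized lines."""
    categories = (classify(line) for line in lines)
    return Counter(c for c in categories if c is not None)
```

Only version_mismatch is a warning for which features are still returned; every other category corresponds to a mutation or feature that is ignored.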

Available Features

SNVBox features are detailed in the linked document SNVBox_Final.pdf. To use a feature, specify its FeatureID in the feature list file (-f option) that you pass to snvGetTranscript or snvGetGenomic.
