SNVBox Tutorial

From Chasm Software Wiki


Preparing the requisite files

  1. SNVBox accepts mutations defined as amino-acid residue substitutions, each paired with an accession identifier from NCBI RefSeq, CCDS, or Ensembl. Both mRNA transcript and protein accession identifiers are accepted. Prepare a tab-delimited text file with one mutation per line:
    NP_001135977	R641W
    NP_835455	R151C
    NP_055645	L590V
    NP_689808	D28H
    NP_005472	S372R
    NP_112493	S35R
    NP_859061	A118V
    NP_892018	R153C
    NP_001074003	R264Q
    NP_001073893	R1272C
    NP_808882	R30H
    ...
    

    Optionally, a first column with mutation identifiers can be included. If this column is omitted, the row number of the mutation in the input file is used as its identifier.

    1      NP_001135977	R641W
    2      NP_835455	R151C
    3      NP_055645	L590V
    4      NP_689808	D28H
    5      NP_005472	S372R
    6      NP_112493	S35R
    7      NP_859061	A118V
    8      NP_892018	R153C
    9      NP_001074003	R264Q
    10     NP_001073893	R1272C
    11     NP_808882	R30H
    ...
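
    Before running feature retrieval, it can help to check the mutation file for malformed rows. The following is a minimal sketch and not part of SNVBox: the function name parse_mutation_line and the regular expression are assumptions based on the examples above (a one-letter reference residue, a position, and a one-letter alternate residue). It accepts rows with or without the identifier column and assumes that row numbers start at 1.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed substitution format, e.g. R641W: reference residue, position, alternate residue.
SUBSTITUTION = re.compile(rf"^([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])$")


def parse_mutation_line(line, row_number):
    """Return (identifier, accession, ref, position, alt), or None if malformed."""
    # split() tolerates both tabs and runs of spaces between columns.
    fields = line.split()
    if len(fields) == 2:
        # No identifier column: fall back to the row number.
        identifier = str(row_number)
        accession, substitution = fields
    elif len(fields) == 3:
        identifier, accession, substitution = fields
    else:
        return None
    match = SUBSTITUTION.match(substitution)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return identifier, accession, ref, int(position), alt


if __name__ == "__main__":
    rows = ["NP_001135977\tR641W", "2\tNP_835455\tR151C", "NP_689637\tY1136*"]
    for number, row in enumerate(rows, start=1):
        print(parse_mutation_line(row, number))
```

    Rows for which the function returns None (for example a substitution to a stop codon, written with an asterisk) can be reviewed by hand before the file is submitted.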
    



    Mutations can also be defined in genomic coordinates in build GRCh37/hg19 of the human genome. This file should be tab-delimited with 6 or 7 columns (the first column, which holds identifiers, is optional; CHASM will use row numbers in the input file if no identifier is specified). The columns are described in the note that follows the example:

    1     chr22   25115448        25115449        +       A       G
    2     chr22   25119119        25119120        +       A       G
    3     chr22   25124310        25124311        -       C       T
    4     chr22   25144911        25144912        +       C       T
    5     chr22   25145752        25145753        -       C       T
    6     chr22   25147422        25147423        +       C       T
    7     chr22   25150137        25150138        +       A       G
    8     chr22   25152617        25152618        +       C       T
    9     chr22   25158437        25158438        +       C       T
    10    chr22   24121377        24121378        +       C       T
    ...
    

    NOTE: Genomic coordinate file columns are as follows:

    • Mutation identifier (optional - CHASM will assign row number if this column is left off)
    • Chromosome
    • 0-based coordinate of base substitution
    • 1-based coordinate of base substitution
    • Strand on which reference and alternative bases are reported
    • Reference base at this position
    • Alternate base at this position
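
    The column layout above can be checked mechanically before a file is submitted. The sketch below is only an illustration under stated assumptions: parse_genomic_line is a hypothetical helper, not an SNVBox function, and the rule that the 1-based coordinate equals the 0-based coordinate plus one is inferred from the example rows rather than from any SNVBox specification.

```python
VALID_BASES = {"A", "C", "G", "T"}


def parse_genomic_line(line, row_number):
    """Return a dict describing one base substitution, or raise ValueError."""
    fields = line.split()
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 columns, got {len(fields)}")
    if len(fields) == 6:
        # Identifier column is absent: use the row number instead.
        fields = [str(row_number)] + fields
    identifier, chrom, start, end, strand, ref, alt = fields
    start, end = int(start), int(end)  # int() raises ValueError on non-numeric text
    if end != start + 1:
        # Assumption drawn from the example rows: a substitution spans one base.
        raise ValueError("1-based coordinate must equal 0-based coordinate + 1")
    if strand not in {"+", "-"}:
        raise ValueError(f"unknown strand {strand!r}")
    if ref not in VALID_BASES or alt not in VALID_BASES or ref == alt:
        raise ValueError(f"invalid base change {ref}>{alt}")
    return {"id": identifier, "chrom": chrom, "start": start, "end": end,
            "strand": strand, "ref": ref, "alt": alt}


if __name__ == "__main__":
    print(parse_genomic_line("1\tchr22\t25115448\t25115449\t+\tA\tG", 1))
```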



  2. SNVBox accepts a custom list of features. Prepare a text file with the list of features you want:
    ExonConservation
    ExonSnpDensity
    ExonHapMapSnpDensity
    HMMRelEntropy
    HMMEntropy
    HMMPHC
    MGARelEntropy
    MGAEntropy
    MGAPHC
    PredRSAB
    PredRSAE
    PredBFactorF
    PredBFactorS
    PredSSC
    PredSSE
    PredSSH
    AAHydrophobicity
    AAVolume
    AAPolarity
    ...
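
    Unrecognized or repeated names in the feature list are reported and ignored at retrieval time, so it may be convenient to tidy the list beforehand. The following is a rough sketch: clean_feature_list is a hypothetical helper, and the optional set of known feature names must be supplied by the user, since this example does not query SNVBox for the features it supports.

```python
def clean_feature_list(names, known=None):
    """Return (kept, duplicates, unknown), preserving first-seen order.

    names: iterable of feature names, e.g. the lines of the feature list file.
    known: optional set of valid feature names (user-supplied assumption).
    """
    kept, duplicates, unknown = [], [], []
    seen = set()
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        if name in seen:
            duplicates.append(name)
        elif known is not None and name not in known:
            unknown.append(name)
        else:
            seen.add(name)
            kept.append(name)
    return kept, duplicates, unknown


if __name__ == "__main__":
    requested = ["MGAPHC", "MGAPHC", "M44APHC", "AAVolume"]
    print(clean_feature_list(requested, known={"MGAPHC", "AAVolume"}))
```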
    

Retrieving features

If mutations are specified with transcript and amino acid substitution, snvGetTranscript should be used for feature retrieval.
If genomic coordinates are used, snvGetGenomic should be used for feature retrieval.

  • To retrieve features from a single mutation file (for genomic coordinates, replace snvGetTranscript with snvGetGenomic):
    > snvGetTranscript -f [feature list file] -o [output arff file] [mutation file]
    
  • To retrieve features for multiple classes, using the file names as class labels:
    > snvGetTranscript -f [feature list file] -o [output arff file] [mutation file 1 as class 1]
      [mutation file 2 as class 2] etc.
    
  • To retrieve features from multiple classes with custom class labels:
    > snvGetTranscript -c -f [feature list file] -o [output arff file] [class 1] [mutation file 1]
      [class 2] [mutation file 2] etc.
    
  • Optional Parameters
    • -r Show raw feature values. Do not scale them using the mean and RMS values.
    • -m Show missing feature values. Do not fill in missing values with the mean value.
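
The invocations above can also be assembled from a script. The sketch below is a hedged illustration and not an official wrapper: build_snvget_command and the file names in the usage example are hypothetical, only the options listed above (-c, -f, -o, -r, -m) are used, and placing -r and -m before -f is an assumption, since the tutorial does not state where the optional flags go.

```python
import shlex


def build_snvget_command(features, output, mutation_files, genomic=False,
                         class_labels=None, raw=False, keep_missing=False):
    """Return the argument list for snvGetTranscript or snvGetGenomic."""
    command = ["snvGetGenomic" if genomic else "snvGetTranscript"]
    if class_labels is not None:
        if len(class_labels) != len(mutation_files):
            raise ValueError("need exactly one class label per mutation file")
        command.append("-c")  # custom class labels
    if raw:
        command.append("-r")  # raw (unscaled) feature values
    if keep_missing:
        command.append("-m")  # leave missing values unfilled
    command += ["-f", features, "-o", output]
    if class_labels is None:
        command += list(mutation_files)
    else:
        # Each class label precedes its mutation file.
        for label, path in zip(class_labels, mutation_files):
            command += [label, path]
    return command


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    args = build_snvget_command("features.txt", "out.arff",
                                ["drivers.tmps", "passengers.tmps"],
                                class_labels=["driver", "passenger"])
    print(shlex.join(args))
```

The returned list can be passed to subprocess.run once the SnvGet utilities are available on the PATH.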

Sample error messages

Please note that you may run into the following error messages while using SnvGet.

  • The "M44APHC" is not recognized and therefore omitted.
    • This error occurs when there is an unrecognized feature name in the input file of requested features (-f option). The unrecognized feature will be ignored.
  • The "MGAPHC" feature is repeated in the Features list. It has already been added to the feature set.
    • This error occurs when there is a duplicated feature in the input file of requested features. The duplicated feature will be ignored.
  • "NP_689637 Y1136*" is not a properly formatted mutation
    • This error occurs when a row in the .tmps (transcript-mutation pairs) file is not properly formatted, or contains a stop codon (represented by the asterisk). This row will be ignored when retrieving feature values.
  • Position/Wildtype 1126/A not found in NP_000236.
    • This error occurs when the reference amino acid at the requested codon position does not match the residue at that position in the transcript stored in the database. This may be due to a mismatch in transcript versions. RefSeq, for example, is frequently updated, and the translation start site of a transcript may change between versions. We update the transcript sequences used by SNVBox on a weekly basis and should have the most recent version of each transcript available.
  • Transcript ENST00000360484 not found in database.
    • This error occurs when the SNVBox database does not contain the requested transcript. In this case, all mutations involving the transcript will be ignored.
  • Warning: Refseq transcript version number (NM_032173.3) does not match Refseq version in database (NM_032173.2)
    • The version number of the transcript used does not match the version number of the transcript in the current version of the SNVBox database. Features will be returned from the SNVBox database for the version stored in the database.
  • Sequencing variant 36915 chr22:22288398 C>T maps to NM_014634.3 L185L. Only missense variants will be evaluated by CHASM.
    • The specified genomic coordinate did not map to a missense mutation. The mutation will be ignored.
  • Sequencing variant chr22:20800834 A>G did not map to a codon.
    • The genomic coordinate did not map to a codon. The mutation will be ignored.
  • Reference base specified for sequencing variant 1 chr22:25115448 C>G does not match reference base at that coordinate in hg19.
    • The reference base specified does not match the base at that position in GRCh37/hg19. The mutation will be ignored.
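
When many mutations are submitted, it can be useful to tally these messages by type. The sketch below is an assumption-laden illustration: the category names are invented here, and the patterns match only the sample wording shown above, which may differ in other SNVBox versions.

```python
import re
from collections import Counter

# (category, pattern) pairs derived from the sample messages above.
# The category names are hypothetical labels chosen for this example.
MESSAGE_PATTERNS = [
    ("unrecognized_feature", re.compile(r"is not recognized and therefore omitted")),
    ("duplicate_feature", re.compile(r"is repeated in the Features list")),
    ("malformed_mutation", re.compile(r"is not a properly formatted mutation")),
    ("residue_mismatch", re.compile(r"^Position/Wildtype \S+ not found in")),
    ("transcript_missing", re.compile(r"^Transcript \S+ not found in database")),
    ("version_mismatch", re.compile(r"^Warning: Refseq transcript version number")),
    ("not_missense", re.compile(r"Only missense variants will be evaluated")),
    ("no_codon", re.compile(r"did not map to a codon")),
    ("reference_base_mismatch", re.compile(r"does not match reference base")),
]


def classify_message(line):
    """Return the category of one message line, or 'other' if none matches."""
    text = line.strip()
    for category, pattern in MESSAGE_PATTERNS:
        if pattern.search(text):
            return category
    return "other"


def summarize(lines):
    """Count messages per category."""
    return Counter(classify_message(line) for line in lines)


if __name__ == "__main__":
    sample = [
        "Transcript ENST00000360484 not found in database.",
        "Position/Wildtype 1126/A not found in NP_000236.",
        "Sequencing variant chr22:20800834 A>G did not map to a codon.",
    ]
    print(summarize(sample))
```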

Available Features

SNVBox features are detailed here. To use a feature, add its FeatureID to the feature list file that you pass to the SnvGet utility.
