SNVBox Tutorial

From Chasm Software Wiki

Revision as of 17:34, 8 November 2010 by Andywong86 (Talk | contribs)

Preparing the requisite files

  1. SNV-Box accepts protein-based mutations. Prepare a tab-delimited ".tmps" text file of your mutations, one per line, with the protein accession in the first column and the amino acid substitution in the second:
    NP_001135977	R641W
    NP_835455	R151C
    NP_055645	L590V
    NP_689808	D28H
    NP_005472	S372R
    NP_112493	S35R
    NP_859061	A118V
    NP_892018	R153C
    NP_001074003	R264Q
    NP_001073893	R1272C
    NP_808882	R30H
    ...
    


    RefSeq, CCDS, and Ensembl accessions are currently supported.

  2. SNV-Box accepts a list of custom features. Prepare a text file with the list of features you want:
    ExonConservation
    ExonSnpDensity
    ExonHapMapSnpDensity
    HMMRelEntropy
    HMMEntropy
    HMMPHC
    MGARelEntropy
    MGAEntropy
    MGAPHC
    PredRSAB
    PredRSAE
    PredBFactorF
    PredBFactorS
    PredSSC
    PredSSE
    PredSSH
    AAHydrophobicity
    AAVolume
    AAPolarity
    ...
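
Before running SNV-Box, it can help to check that a mutation file follows the two-column layout shown in step 1. The following is a minimal sketch, not part of SNV-Box: the function names are hypothetical, and the substitution pattern (one-letter wild type residue, 1-based position, one-letter mutant residue) is an assumption inferred from examples such as R641W.

```python
import re

# Assumed pattern for a substitution such as "R641W": wild type residue,
# 1-based position, mutant residue (one-letter amino acid codes).
SUBSTITUTION = re.compile(
    r"^([ACDEFGHIKLMNPQRSTVWY])([1-9]\d*)([ACDEFGHIKLMNPQRSTVWY])$"
)


def parse_tmps_line(line):
    """Split one tab-delimited line into (accession, wild, position, mutant)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields: {line!r}")
    accession, change = (field.strip() for field in fields)
    match = SUBSTITUTION.match(change)
    if not accession or match is None:
        raise ValueError(f"malformed mutation line: {line!r}")
    wild, position, mutant = match.groups()
    return accession, wild, int(position), mutant


def read_tmps(path):
    """Parse every non-empty line of a .tmps file, failing on the first bad line."""
    with open(path) as handle:
        return [parse_tmps_line(line) for line in handle if line.strip()]
```

A line separated by spaces instead of a tab raises ValueError, which is the most common formatting mistake this check is meant to catch.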
    

Retrieving features

  • To retrieve features from a single mutation file:
    > SNVget -f [feature list file] -o [output arff file] [mutation file]
    
  • To retrieve features from multiple classes, using the file names as class labels:
    > SNVget -f [feature list file] -o [output arff file] [mutation file 1 as class 1]
      [mutation file 2 as class 2] etc.
    
  • To retrieve features from multiple classes with custom class labels:
    > SNVget -c -f [feature list file] -o [output arff file] [class 1] [mutation file 1]
      [class 2] [mutation file 2] etc.
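
When SNVget is called from a script, the three invocation forms above can be assembled programmatically. The sketch below only builds the argument list from the -f, -o and -c options shown in this section; the helper name and the example file names other than Features.list are hypothetical, and it assumes SNVget is on the PATH when the command is eventually run.

```python
import shlex


def snvget_command(feature_list, output_arff, mutation_files, class_labels=None):
    """Build the SNVget argument list for one or more mutation files.

    Without class_labels, the mutation files are passed as-is, so their
    names serve as class labels. With class_labels, -c is added and each
    label precedes its mutation file.
    """
    if not mutation_files:
        raise ValueError("at least one mutation file is required")
    command = ["SNVget"]
    if class_labels is not None:
        if len(class_labels) != len(mutation_files):
            raise ValueError("need exactly one class label per mutation file")
        command.append("-c")
    command += ["-f", feature_list, "-o", output_arff]
    if class_labels is None:
        command += list(mutation_files)
    else:
        for label, path in zip(class_labels, mutation_files):
            command += [label, path]
    return command


if __name__ == "__main__":
    # Hypothetical file names, printed as a shell-quoted command line.
    cmd = snvget_command(
        "Features.list", "out.arff",
        ["set1.tmps", "set2.tmps"], class_labels=["driver", "passenger"],
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names that contain spaces.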
    

Available Features

The following features are currently available in SNV-Box. To use a feature, specify its ID in the Features.list file.

ID

Feature

Description

AACharge

Net residue charge change

The change in formal charge resulting from the mutation (Wildtype - Mutant). Histidine is assumed protonated (formal charge of +1).

AAVolume

Net residue volume change

The change in residue volume resulting from the mutation. (Wildtype - Mutant)

AAHydrophobicity

Net residue hydrophobicity change

The change in hydrophobicity resulting from the substitution. (Wildtype - Mutant)

AAGrantham

Grantham Score

The Grantham substitution score for the wild type to mutant transition.

AAPolarity

Change in Polarity

Change in residue polarity due to the wildtype to mutant transition.

AAEx

Ex substitution score

Amino acid substitution score from the EX matrix.

AAPAM250

PAM250 substitution score

Amino acid substitution score from the PAM250 matrix.

AABLOSUM

BLOSUM 62 substitution score

Amino acid substitution score from the BLOSUM 62 matrix.

AAMJ

MJ Substitution score

Amino acid substitution score from the Miyazawa-Jernigan contact energy matrix.

AAHGMD2003

HGMD2003 mutation count

Number of times that the wild type to mutant substitution occurs in the Human Gene Mutation Database, 2003 version.

AAVB

VB mutation score

Amino acid substitution score from the VB (Venkatarajan and Braun) matrix.

AATransition

Amino Acid Transition probabilities

Frequency of the second amino acid followed by the first amino acid, as observed in UniProt human proteins.

AACOSMIC

Frequency of missense change type in the Catalog of Somatic Mutations in Cancer (COSMIC) database

Natural log of the frequency with which the missense change type (amino acid type X to amino acid type Y, e.g. ALANINE to GLYCINE) is seen in COSMIC. These frequencies were calculated during the week of August 14, 2008, using COSMIC release 38.

AACOSMICvsSWISSPROT

Count of missense change type in the Catalog of Somatic Mutations in Cancer (COSMIC) database divided by count in SWISSPROT database

Natural log of the frequency with which the missense change type (amino acid type X to amino acid type Y, e.g. ALANINE to GLYCINE) is seen in COSMIC. These frequencies were calculated during the week of August 14, 2008, using COSMIC release 38, and normalized by the occurrences of the wild type residue in human proteins found in UniProtKB.

AACOSMICvsHapMap

Count of missense change type in the Catalog of Somatic Mutations in Cancer (COSMIC) database divided by count in HapMap.

Natural log of the frequency with which the missense change type (amino acid type X to amino acid type Y, e.g. ALANINE to GLYCINE) is seen in COSMIC. These frequencies were calculated during the week of August 14, 2008, using COSMIC release 38, and normalized by the number of times the change type is observed in the HapMap SNPs database.

AAHapMap

HAPMAP Amino Acid substitution counts

Natural log of the frequency with which the change type from wild type to mutant amino acid is observed in the HapMap SNPs database.

ExonConservation

44-way exon conservation

The conservation score for the entire exon, calculated from a 44-species phylogenetic alignment using the UCSC Genome Browser. Scores are given for windows of nucleotides. We retrieve the scores for each region that overlaps the exon in which the base substitution occurred and calculate a weighted average of the conservation scores, where the weight is the number of bases with a particular score.

ExonSnpDensity

SNP Density

The number of SNPs in the exon where the mutation is located divided by the length of the exon.

ExonHapMapSnpDensity

HapMap verified SNP Density

The number of HapMap verified SNPs in the exon where the mutation is located divided by the length of the exon.

MGAPHC

Multiz-46-way Alignment Positional Conservation

This feature is calculated based on the degree of conservation of the residue estimated from a column in the Multiz-46-way alignment using the UCSC Human Genome Browser.

MGARelEntropy

Multiz-46-way Alignment Relative Entropy

Kullback-Leibler Distance calculated for the column of Multiz-46-way alignment (corresponding to the location of the mutation) and that of a background distribution of amino acid residues computed from a large sample of multiple sequence alignments.

MGAEntropy

Multiz-46-way Alignment Entropy

The Shannon entropy calculated for the column of the Multiz-46-way alignment, corresponding to the location of the mutation.

HMMPHC

Positional Hidden Markov Model (HMM) conservation score

This feature is calculated based on the degree of conservation of the residue, estimated from a multiple sequence alignment built with SAM-T2K software (29), using the protein in which the mutation occurred as the seed sequence (30). The SAM-T2K alignments are large, superfamily-level alignments that include distantly related homologs (as well as close homologs and orthologs) of the protein of interest.

HMMRelEntropy

Relative entropy of HMM alignments

Kullback-Leibler Distance calculated for the column of the SAM-T2K multiple sequence alignment (corresponding to the location of the mutation) and that of a background distribution of amino acid residues computed from a large sample of multiple sequence alignments.

HMMEntropy

Entropy of HMM alignment

The Shannon entropy calculated for the column of the SAM-T2K multiple sequence alignment, corresponding to the location of the mutation.

PredStabilityL, PredStabilityM, PredStabilityH

Predicted contribution to protein stability

These features consist of the probability that the wild type residue contributes to overall protein stability in a manner that is highly stabilizing, average or destabilizing, as predicted by a neural network trained with Predict-2nd software on a set of 1763 proteins with less than 30% homology. Stability estimates for the neural network training data were calculated using the FoldX force field.

PredSSE, PredSSH, PredSSC

Predicted secondary structure

These features consist of the probability that the secondary structure of the region in which the wild type residue exists is helix, loop or strand as predicted by a neural net trained with Predict-2nd software on a set of 1763 proteins with crystal structures and with less than 30% homology.

PredRSAB, PredRSAI, PredRSAE

Predicted residue solvent accessibility

These features consist of the probability of the wild type residue being buried, intermediate or exposed as predicted by a neural network trained with Predict-2nd software on a set of 1763 proteins with high-resolution X-ray crystal structures sharing less than 30% homology.

PredBFactorS, PredBFactorM, PredBFactorF

Predicted B-factor

These features consist of the probability that the wild type residue backbone is stiff, intermediate or flexible as predicted by a neural network trained with Predict-2nd software (29) on a set of 1763 proteins with less than 30% homology. Flexibilities for the neural net training data were estimated based on normalized temperature factors, computed using the method of (38) from the X-ray crystal structure files.

RegCompP, RegCompC, RegCompG, RegCompDE, RegCompQ, RegCompH, RegCompKR, RegCompWYF, RegCompILVM, RegCompEntropy, RegCompNormEntropy

Regional AA composition

The percentage of amino acids in a 15-residue window surrounding the mutation that fall into each of the following categories: P, C, G, DE, Q, H, KR, WYF, ILVM.

AATripletFirstProbWild, AATripletSecondProbWild, AATripletThirdProbWild

Probability of seeing the wild type residue in that position of an amino acid triple.

Calculated by joint frequencies of amino acid triples in human proteins found in UniProtKB.

AATripletFirstProbMut, AATripletSecondProbMut, AATripletThirdProbMut

Probability of seeing the mutant residue in that position of an amino acid triple.

Calculated by joint frequencies of amino acid triples in human proteins found in UniProtKB.

AATripletFirstProbDiff, AATripletSecondProbDiff, AATripletThirdProbDiff

Difference of the probability of seeing the wild type residue in that triplet position as compared with the mutant residue.

Based on the values calculated in AATripletPositionProbMut/Wild features. (Wildtype - Mutant)

UniprotBINDING, UniprotACTSITE, UniprotSITE, UniprotLIPID, UniprotMETAL, UniprotCARBOHYD, UniprotDNABIND, UniprotNPBIND, UniprotCABIND, UniprotDISULFID, UniprotSECYS, UniprotMODRES, UniprotPROPEP, UniprotSIGNAL, UniprotMUTAGEN, UniprotTRANSMEM, UniprotCOMPBIAS, UniprotREP, UniprotMOTIF, UniprotZNFINGER, UniprotREGIONS, UniprotDOM_PPI, UniprotDOM_RNABD, UniprotDOM_TF, UniprotDOM_LOC, UniprotDOM_MMBRBD, UniprotDOM_Chrom, UniprotDOM_PostModRec, UniprotDOM_PostModEnz

Uniprot Annotations (fingerprints)

These features give annotations, curated from the literature, of general binding sites, general active sites, lipid, metal, carbohydrate, DNA, phosphate and calcium binding sites, disulfides, modified residues, propeptide residues, signal peptide residues, known mutagenic sites, transmembrane regions, compositionally biased regions, repeat regions, known motifs, and zinc fingers. The integer 1 indicates that a feature is present and the integer 0 indicates that it is absent at a mutated position.


The last 8 features are extracted from the DOMAIN annotation as follows:


UniprotDOM_PPI: Protein-protein interaction or oligomerization


UniprotDOM_RNABD: mRNA binding


UniprotDOM_TF: Transcription factor related


UniprotDOM_LOC: Transport and localization related domain (localization signals, etc.)


UniprotDOM_MMBRBD: Membrane binding/interacting (phosphoinositide binding, transmembrane, etc.)


UniprotDOM_Chrom: Chromatin structural remodeling related domains (could be indirectly via interaction with histones, HATs or HDACs)


UniprotDOM_PostModRec: Domains that recognize or interact with post-translational modification sites


UniprotDOM_PostModEnz: Enzymatically active domains resulting in post-translational modification such as phosphorylation, glycosylation, amidation, ubiquitination, etc.
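
The entropy features listed above (MGAEntropy, HMMEntropy) and the relative entropy features (MGARelEntropy, HMMRelEntropy) are described as the Shannon entropy and the Kullback-Leibler distance of an alignment column. The sketch below illustrates those two textbook quantities only. It is not SNV-Box's implementation: the function names are hypothetical, and the choices to skip gap characters, use no pseudocounts, and take logarithms in base 2 are assumptions.

```python
import math
from collections import Counter


def column_frequencies(column):
    """Relative residue frequencies in one alignment column.

    Gap characters ('-' and '.') are ignored; this is an assumption.
    """
    counts = Counter(residue for residue in column if residue not in "-.")
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def shannon_entropy(column):
    """Shannon entropy of the column in bits (0 for a fully conserved column)."""
    freqs = column_frequencies(column)
    return -sum(p * math.log2(p) for p in freqs.values())


def relative_entropy(column, background):
    """Kullback-Leibler distance, in bits, of the column from a background.

    `background` maps residues to probabilities and must be positive for
    every residue that occurs in the column.
    """
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())
```

For example, a column with equal numbers of two residues has an entropy of 1 bit, and a fully conserved column has an entropy of 0. The relative entropy grows as the column distribution departs from the background distribution.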
