VarSome Somatic Classification
(c) Copyright Saphetor SA. All rights reserved.
version: 13.1.2, dated: 15 Mar 2025 06:08:40 UTC
Introduction
The "Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer" were published in 2017 by Marilyn Li et al. in their seminal paper (AMP Guidelines). The VarSome Somatic Variant Classifier automatically generates a tier recommendation based on these guidelines and the vast range of machine-readable genomic data available.
These standards are written for interpretation by humans, not machines: they assume the clinician has a deep knowledge of the domain and of the relevant papers and conditions.
Our guiding principle throughout, following the advice of our clinical advisors, has been to implement a rigorously evidence-based approach to determine whether a variant is relevant for cancer therapy, diagnosis or prognosis. We have leveraged a wide range of public-domain and commercial cancer databases, so the quality of the end result will depend in part on the user's subscription level.
All the rules provide clear natural language explanations of why they were triggered and which evidence was used, or, conversely, a full explanation of why the criteria were not met (these 'negative' explanations are displayed if 'show full detail' is ticked, but they are not retained in the Clinical platform for Tier IV variants).
We strive to continuously improve our implementation, adjusting algorithms, incorporating new data sources, and adding refinements as new publications and methodology changes are suggested. We greatly appreciate feedback from the huge VarSome user community, and always aim to promptly act on any suggestions.
Approach & Overview
The AMP Guidelines do not have a strict set of named rules nor a strict method of calculation detailing how to combine various strengths of evidence to reach a verdict. Rather they consider a series of evidence types and set certain criteria that should be met in order to reach an overall tier I, II, III or IV verdict.
Our implementation considers the following types of evidence. Each is given a 4-letter acronym for convenience, which is displayed in VarSome:
- Path: Disease-associated pathways
- Drug: Drug-gene interaction, therapies & clinical studies
- Type: Mutation type and coding impact
- Freq: Allele frequency
- Pred: In-silico and splicing predictions
- Soma: Somatic sample databases
- Crtd: Curated somatic variants
- Pubs: Supporting scientific publications
- Germ: Evidence from germline databases
The evidence from all these sources is combined to reach an overall recommended tier.
Overall Recommended Tier
Once all the evidence categories have been evaluated and a tier assigned to each, an overall recommended tier is established, in accordance with the AMP guidelines.
1- Curated Evidence
If there is any curated evidence, this is given priority, and the resulting tier is used. This ensures that the classifier is fully in line with the available curated evidence.
2- No curated evidence: use Germline + Pathway + Drug
In the absence of curated evidence, we leverage the germline classification, gene pathway and drug data. This ensures that novel pathogenic variants in genes with a known cancer pathway and associated treatments can be correctly classified. We consider:
- Germline classification per ACMG guidelines.
- Pathway: the variant needs to be in a gene associated with a known cancer pathway.
- Drug: there needs to be an applicable therapy for this gene.
The resulting tier according to the evidence is:
- Tier I: pathogenic or likely pathogenic variant per ACMG, in a known cancer pathway, with a Tier I therapy or guideline.
- Tier II: pathogenic or likely pathogenic variant per ACMG, in a known cancer pathway, with a Tier II therapy or guideline.
- Tier III: VUS variant per ACMG, in a known cancer pathway.
- Tier IV: frequency >= 0.01 per AMP, (Likely) Benign per ACMG, or an ACMG VUS variant in a non-cancer pathway.
Observations:
- Cancer type is critical for a correct somatic classification as this is used to match curated evidence and treatments/guidelines.
- Curated evidence, if available, will heavily influence the resulting classification.
- Novel variants for which there is no curated evidence are classified using our well-proven germline classifier, following the ACMG guidelines.
- The lack of a cancer pathway (Path IV) for the gene will result in at most a Tier III classification.
- For cancer-causing variants, the availability of approved therapies matching the patient's cancer type will determine whether the overall classification is Tier I or Tier II.
- There are instances of variants that are classified benign by ACMG, but are reported in curated cancer databases.
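As an illustration only, the two-step decision described in this section could be sketched as follows. This is not VarSome's actual implementation: the function name, argument names and string labels are hypothetical, and checking the frequency threshold before the other fall-back criteria is an assumption. Combinations that the text does not spell out return None rather than a guessed tier.

```python
from typing import Optional

PATHOGENIC = {"Pathogenic", "Likely Pathogenic"}
BENIGN = {"Benign", "Likely Benign"}


def overall_tier(
    curated_tier: Optional[str],
    acmg: str,
    in_cancer_pathway: bool,
    therapy_tier: Optional[str],
    frequency: Optional[float] = None,
) -> Optional[str]:
    """Hypothetical sketch of the overall tier recommendation."""
    # Step 1: curated evidence takes priority.
    # A curated tier of "IV" (or None) means no curated evidence was found.
    if curated_tier in {"I", "II", "III"}:
        return curated_tier

    # Step 2: fall back to germline classification + pathway + drug.
    # Assumption: the Tier IV criteria are checked first.
    if frequency is not None and frequency >= 0.01:
        return "IV"
    if acmg in BENIGN:
        return "IV"
    if acmg == "Uncertain Significance":
        return "III" if in_cancer_pathway else "IV"
    if acmg in PATHOGENIC and in_cancer_pathway and therapy_tier in {"I", "II"}:
        return therapy_tier

    # Not covered by the description above: leave undetermined.
    return None
```

For example, `overall_tier(None, "Likely Pathogenic", True, "II")` yields "II", while any curated tier of I to III is returned unchanged regardless of the other arguments.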
Sample Information
The somatic variant classifier is able to leverage data from the sample itself in order to provide additional findings to the clinician and help prioritize which variants to review. These findings do not modify the actual tier assigned to a variant, but show up as flags in the report table in VarSome Clinical.
- Cancer type: this highlights any variants for which evidence is found linking to the same cancer type as the sample.
- Tissue: similarly this will highlight any evidence associating the variant or gene to the sample tissue.
- Age: we are able to obtain an age histogram for certain cancer-types and display the patient's age relative to that.
- Ethnicity: allele frequencies can differ between populations and we report the variant's frequency in the relevant ethnic group.
- Sex: we highlight if the provided sex matches the majority of reported cases across somatic sample databases.
Relation to VarSome's germline variant classifier
The automated somatic variant classifier is related to VarSome's germline variant classifier:
- It shares many of the same source databases, though may leverage them differently for cancer.
- It uses the same transcript (and therefore gene) as identified for germline; we refer the user to that classifier's documentation for more details.
- The mutation type rules directly leverage VarSome's germline variant classifier.
The germline classifier is however only used as a fall-back if no evidence exists in the curated data sources. This allows us to potentially identify novel cancer variants.
We have also paid special attention to ensuring we don't double-count the same evidence.
Equivalent Amino-Acid Variants
Our classifier aims to be evidence-based and leverages a significant number of databases. For a number of rules we also consider equivalent amino-acid variants to see whether alternative forms with the same protein impact may have been reported. This approach is applied to the following rules:
- Curated somatic variants
- Supporting scientific publications
- Somatic sample databases
- Evidence from germline databases
The explanations make it clear when evidence for an equivalent variant has been used, and a link to VarSome is provided to view the data for the equivalent variant itself.
Gene-related evidence
The following two evidence types are generally predicated on data from the gene the variant affects. We generally expect all variants within that gene to trigger the same findings, although drug associations may be reported for specific variants only.
"Path": Disease-associated pathways
The somatic variant classifier uses the same transcript and gene as VarSome's germline variant classifier.
The following databases are scanned to see whether this gene is associated with cancer:
- BioCarta
- CKB Genes
- Consensus
- GHR
- KEGG
- Mondo
- The Human Protein Atlas
If a cancer association is found, we assign Path I to tumor-suppressing genes, and Path II to all other genes.
Important: If no cancer association is found, we assign Path IV, which will then result in a Tier IV overall verdict, irrespective of any other evidence, on the assumption that all non-somatic variants are covered by VarSome's germline variant classifier.
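A minimal sketch of the Path assignment described above, assuming the database lookups have already been reduced to two booleans. The function and argument names are hypothetical and do not reflect VarSome's internal code.

```python
def path_tier(cancer_associated: bool, tumor_suppressor: bool) -> str:
    """Hypothetical sketch: map gene-level findings to a Path tier."""
    if not cancer_associated:
        # No cancer association found in any of the scanned databases.
        return "IV"
    # Tumor-suppressing genes get Path I, all other cancer genes Path II.
    return "I" if tumor_suppressor else "II"
```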
"Drug": Drug-gene interaction, therapies & clinical studies
The following sources of evidence are analysed in order to identify any drugs associated to the gene considered:
- AACT
- CIViC
- CKB
- DGI
- OncoKB
- PharmGKB
- Pharmacogenomic Biomarkers
Drug I is assigned to drugs that are FDA or EMA approved, are reported by trusted curated sources, or have phase 3 clinical trials. Drug II is assigned otherwise.
Tier I approved drugs that do not match the patient's cancer type (if provided) will be downgraded to Drug II (in accordance with the AMP guidelines).
Note: gene-level drug or clinical trial searches are disabled if there is no curated evidence, the germline classification is not pathogenic, or the gene is not linked to cancer.
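The Drug rule could be sketched as below. This is an illustration under stated assumptions, not VarSome's implementation: the `DrugEvidence` record and its fields are hypothetical, cancer types are compared as plain strings (a real system would more likely match against an ontology), and the cancer-type downgrade is applied to any Drug I evidence, which is one possible reading of the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrugEvidence:
    """Hypothetical record for one drug-gene association."""
    approved: bool          # FDA or EMA approved
    trusted_curated: bool   # reported by a trusted curated source
    phase3_trial: bool      # has a phase 3 clinical trial
    cancer_type: Optional[str] = None


def drug_tier(ev: DrugEvidence, patient_cancer_type: Optional[str] = None) -> str:
    """Hypothetical sketch: return "I" or "II" for one piece of drug evidence."""
    if not (ev.approved or ev.trusted_curated or ev.phase3_trial):
        return "II"
    # Downgrade when a patient cancer type is provided and does not match.
    # Assumption: exact string equality stands in for cancer-type matching.
    if patient_cancer_type is not None and ev.cancer_type != patient_cancer_type:
        return "II"
    return "I"
```

When no patient cancer type is provided, no downgrade takes place in this sketch.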
Variant-specific evidence
"Crtd": Curated somatic variants
This rule checks whether the variant has been previously reported in any of the following curated databases:
- CIViC
- CKB
- OncoKB
CKB is a database we have licensed from The Jackson Laboratory and it is currently only available for VarSome Clinical users as it incurs a fee per sample processed.
Important: OncoKB is not currently available using the Python API.
Curated Tier
Importantly the curated evidence is filtered by the patient's cancer type. This is a critical component of the clinical analysis. The following evidence tiers are established for curated evidence, in line with the AMP guidelines.
- Crtd I: curated evidence that matches the patient's cancer and for which a Tier I therapy or guideline has been identified.
- Crtd II: curated evidence that matches the patient's cancer and for which a Tier II therapy has been identified.
- Crtd II: downgraded from a Tier I for a different cancer type than the patient's.
- Crtd II: when Tier II evidence was found for a different cancer type than the patient's.
- Crtd III: curated evidence is of uncertain significance.
- Crtd IV: indicates that no curated evidence was identified.
The VarSome user interface will display this evidence in the order indicated above.
Important: the most severe tier will be reported if there is conflicting evidence from the curated sources. The full evidence list can be displayed in the VarSome user interface by ticking 'show full detail' in the Somatic component.
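The curated tier rules above can be illustrated with a short sketch. It assumes each piece of curated evidence has already been reduced to a (tier, cancer type) pair; the function name and the use of plain string comparison for cancer types are hypothetical simplifications.

```python
from typing import List, Tuple

# Lower rank means more severe.
TIER_RANK = {"I": 1, "II": 2, "III": 3, "IV": 4}


def curated_tier(evidence: List[Tuple[str, str]], patient_cancer_type: str) -> str:
    """Hypothetical sketch: combine curated evidence into one Crtd tier.

    evidence is a list of (tier, cancer_type) pairs with tier in I, II, III.
    """
    tiers = []
    for tier, cancer_type in evidence:
        # Tier I evidence for a different cancer type is downgraded to II.
        if tier == "I" and cancer_type != patient_cancer_type:
            tier = "II"
        tiers.append(tier)
    if not tiers:
        # Crtd IV indicates that no curated evidence was identified.
        return "IV"
    # Conflicting evidence: report the most severe tier.
    return min(tiers, key=TIER_RANK.__getitem__)
```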
Note about CKB
Following feedback from our users, and in order to reduce the number of false positives, we are applying some rules to filter the information from CKB.
CKB has both 'variant-level' evidence, specific to the precise variant being annotated, and 'extended evidence' that applies to this class of variant within the gene.
The 'extended evidence' will not be used if either of the two following conditions is met:
- The protein effect is unknown or neutral (as recommended by JAX),
- The germline classification is Benign or Likely Benign.
Similarly, the 'variant-level' evidence will be downgraded to Tier III if either of the two following conditions is met:
- The protein effect is unknown or neutral,
- The variant is missense and the germline classification is Benign or Likely Benign.
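The two CKB filters can be expressed as small predicates. The sketch below is illustrative only; the function names and the string labels for protein effect and germline classification are assumptions, not CKB or VarSome identifiers.

```python
BENIGN = {"Benign", "Likely Benign"}
UNINFORMATIVE_EFFECTS = {"unknown", "neutral"}


def use_extended_evidence(protein_effect: str, germline: str) -> bool:
    """Hypothetical sketch: should CKB 'extended evidence' be used?"""
    if protein_effect in UNINFORMATIVE_EFFECTS:
        return False
    if germline in BENIGN:
        return False
    return True


def variant_level_tier(tier: str, protein_effect: str,
                       is_missense: bool, germline: str) -> str:
    """Hypothetical sketch: downgrade 'variant-level' evidence to Tier III."""
    if protein_effect in UNINFORMATIVE_EFFECTS:
        return "III"
    if is_missense and germline in BENIGN:
        return "III"
    return tier
```

Note the asymmetry the text describes: a Benign germline classification always blocks extended evidence, but only downgrades variant-level evidence for missense variants.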
"Soma": Somatic sample databases
This rule considers evidence from somatic sample databases, counting how many times the variant has been seen in tumor samples, and additionally confirmed to be an acquired mutation as opposed to inherited. We use the following sources:
- CancerHotspots
- GDC
- ICGC
- TP53 Somatic
- cBioPortal
This rule first identifies all the somatic samples containing this variant, and assigns:
- Soma II: if the sample contains confirmed somatic variants,
- Soma I: if the sample has associated publications.
The rule then considers:
- the aggregate of all the somatic samples found for this variant,
- the total number of somatic samples available for the gene,
- the total number of curated somatic variants in the gene,
- the general population gnomAD frequency of the variant (if any).
These figures are then combined using basic statistics computed from CKB, ClinVar & CIViC. Tiers are assigned as follows:
- Soma IV: if no somatic samples are found from any source.
- Soma III: too few samples have been found for a high-frequency variant.
- Soma II: a minimal number of somatic samples have been reported for this variant.
- Soma I: a high number of somatic samples have been reported.
Much lower thresholds are used internally if the variant has not been reported in gnomAD.
NB: this rule will not work as accurately in VarSome Premium for which critical databases such as JAX CKB are not licensed.
"Type": Mutation type and coding impact
Here we evidence the impact of the variant on the protein, leveraging results from the germline classifier as follows:
- Type I is assigned to LoF variants (akin to ACMG rule PVS1).
- Type I is also assigned to variants that are predicted to affect splicing, using the same methodology as VarSome's germline variant classifier.
- Type II is assigned to variants that modify the protein length (akin to ACMG rule PM4, and mutually exclusive with the previous LoF evidence).
- Type IV is assigned to non-coding and synonymous variants whose position is not conserved (see Conservation).
- Type III is assigned to all other variant types.
This evidence is provided for information only and does not impact the overall somatic classification.
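A sketch of how the Type tiers listed above might be assigned, assuming the germline classifier's findings are available as booleans. The function and argument names are hypothetical, and the order of the checks (Type I before II before IV) is an assumption.

```python
def type_tier(is_lof: bool, predicted_splicing: bool,
              changes_protein_length: bool,
              noncoding_or_synonymous: bool,
              position_conserved: bool) -> str:
    """Hypothetical sketch: map mutation type findings to a Type tier."""
    if is_lof or predicted_splicing:
        return "I"    # akin to ACMG rule PVS1, or predicted splicing impact
    if changes_protein_length:
        return "II"   # akin to ACMG rule PM4
    if noncoding_or_synonymous and not position_conserved:
        return "IV"
    return "III"      # all other variant types
```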
"Freq": Allele frequency
This rule combines two factors:
- The frequency of the variant as reported in the healthy adult population.
- Whether this variant is somatic, determined by its allelic balance in the sample.
The general frequency of the variant is read from gnomAD exomes and gnomAD genomes. The implementation leverages VarSome's germline variant classifier as follows:
- Rare variants are identified using gnomAD (akin to ACMG rule PM2); this is further augmented by taking Conservation into account.
- The AMP-recommended frequency threshold of 0.01 is used to assign Freq IV to high-frequency variants.
- We calculate a benign frequency threshold per gene, statistically derived from all curated cancer variants, to assign Freq IV to variants that have a relatively high frequency in the general population (a similar methodology to ACMG rule BS1 using different calibrated thresholds).
If the variant's allelic balance reported in this sample is less than 0.3, we classify it as a tumor variant, and assign Freq I if this is indeed a rare variant, or Freq II otherwise.
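The Freq rule could be sketched as follows. Only the 0.01 and 0.3 thresholds come from the text; the function name, the arguments, the strict comparison against the per-gene threshold, and the order of the checks are assumptions. The rarity decision is passed in as a boolean because its thresholds are not stated here, and cases the text does not cover return None.

```python
from typing import Optional

AMP_FREQUENCY_THRESHOLD = 0.01   # AMP-recommended high-frequency cut-off
TUMOR_ALLELIC_BALANCE = 0.3      # below this, treated as a tumor variant


def freq_tier(population_frequency: Optional[float],
              gene_benign_threshold: Optional[float],
              is_rare: bool,
              allelic_balance: Optional[float]) -> Optional[str]:
    """Hypothetical sketch: map frequency findings to a Freq tier."""
    freq = population_frequency or 0.0  # absent from gnomAD counts as 0
    if freq >= AMP_FREQUENCY_THRESHOLD:
        return "IV"
    # Per-gene benign threshold (similar in spirit to ACMG rule BS1).
    if gene_benign_threshold is not None and freq > gene_benign_threshold:
        return "IV"
    if allelic_balance is not None and allelic_balance < TUMOR_ALLELIC_BALANCE:
        return "I" if is_rare else "II"
    # Not covered by the description above: leave undetermined.
    return None
```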
"Pubs": Supporting scientific publications
Here we evidence any publications that the VarSome community have linked to the variant. This does not impact the overall somatic classification and is provided for information only.
Note that the system does not verify whether the publication explicitly mentions cancer, and in all cases publications must be reviewed by the clinician.
Germline evidence
"Germ": Evidence from germline databases
The evidence from VarSome's germline variant classifier for the variant is summarised as follows:
- Pathogenic ⇒ Germ I
- Likely Pathogenic ⇒ Germ II
- Uncertain Significance ⇒ Germ III
- Likely Benign ⇒ Germ IV
- Benign ⇒ Germ IV
This classification will be leveraged by the somatic classifier in the absence of any curated evidence for the variant.
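The mapping above amounts to a simple lookup table, sketched here with hypothetical names; the classification labels are assumed to be spelled exactly as in the list.

```python
# Germline (ACMG) classification -> Germ tier, as listed above.
GERM_TIER = {
    "Pathogenic": "I",
    "Likely Pathogenic": "II",
    "Uncertain Significance": "III",
    "Likely Benign": "IV",
    "Benign": "IV",
}


def germ_tier(acmg_classification: str) -> str:
    """Hypothetical sketch: look up the Germ tier for an ACMG verdict."""
    return GERM_TIER[acmg_classification]
```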
"Pred": In-silico and splicing predictions
We replicate here the same in-silico predictions as used by the germline classifier, using the updated ACMG guidelines described in PMID:5630195.
This evidence is provided for information only: it does not directly impact the somatic classification. However, it is used by the germline classifier and may be particularly relevant for missense and splicing variants.
Databases
The VarSome automated classification processes rely on vast quantities of accurate curated data from the following databases (in no particular order).
Important: depending on licensing agreements and in some cases the fees charged by source organisations, not all databases are visible to all users, and this may directly impact the completeness or quality of automated classifications.
Databases used by the germline variant classifier
- UniProt Variants, provided by UNIPROT, version 07-Feb-2025 (72.5k records)
- UniProt Regions, provided by UNIPROT, version 07-Feb-2025 (283k records)
- RefSeq, provided by NCBI, version 228
- phyloP100way, provided by CSH, version 13-Apr-2021 (3.14G records)
- PanelApp, provided by Genomics England, version 17-Feb-2025
- MitoTip, provided by CHOP, version 13-Dec-2022 (11.1k records)
- Mitomap, provided by CHOP, version 08-Dec-2023 (39.0k records)
- MitImpact, provided by IRCCS, version 13-Dec-2022 (48.2k records)
- MaxEntScan, provided by Burge Lab, version 5-Apr-2023
- LOVD, provided by LUMC, version 19-Feb-2025
- HPO, version 07-Feb-2025 (19.0k records)
- gnomAD Mitochondrial, provided by Broad, version 3.1 (18.2k records)
- gnomAD genomes coverage, provided by Broad, using version 2.1 (3.14G records) for hg19, and using version 3.0 (3.21G records) for hg38
- gnomAD genomes, provided by Broad, using version 2.1.1 (262M records) for hg19, and using version 4.1 (759M records) for hg38
- gnomAD gene constraints, provided by Broad, version 4.1 (18.6k records)
- gnomAD exomes coverage, provided by Broad, using version 2.1 (59.6M records) for hg19, and using version 4.0 (169M records) for hg38
- ClinVar, provided by NCBI, version 07-Feb-2025 (3.22M records)
- ClinGen Disease Validity, provided by NIH, version 07-Feb-2025 (2.50k records)
- CADD, provided by UW, version 1.7
- CGD, provided by NHGRI, version 03-Jul-2024 (4.74k records)
- DANN SNVs, provided by UCI, using version 2014 (9.41G records) for hg19, unavailable for hg38
- dbNSFP-c, provided by dbNSFP, version 4.9 (82.8M records)
- dbNSFP genes, provided by dbNSFP, version 4.9 (21.5k records)
- dbscSNV, provided by dbNSFP, version v1.1 (15.0M records)
- Domino, provided by UNIL, version 04-Sep-2019 (17.9k records)
- Ensembl, provided by EMBL, version 113
- GenCC, version 07-Feb-2025 (5.17k records)
- gene2phenotype, provided by EBI, version 04-Oct-2024 (2.89k records)
- gnomAD exomes, provided by Broad, using version 2.1.1 (17.2M records) for hg19, and using version 4.1 (184M records) for hg38
- Papers & classifications contributed by the VarSome community.
Databases used by the somatic variant classifier
In addition to the databases used for germline classification, the somatic variant classifier leverages information from:
- TP53 Somatic, provided by IARC, version release 20 (2.45k records)
- TP53 Germline, provided by IARC, version release 20 (436 records)
- Cancer Gene Census, provided by Sanger, version v101
- The Human Protein Atlas, provided by KAW, version 14-Mar-2024 (20.1k records)
- PharmGKB, version 07-Feb-2025
- OncoTree, provided by MSK, version 15-Jan-2024
- Mondo, provided by Monarch, version 07-Feb-2025
- ICGC somatic, provided by ICGC
- GTEx, provided by NIH, version v8 (313k records)
- CPIC Genes-Drugs, provided by CPIC, version 07-Feb-2025
- CKB, provided by JAX, version 23-Feb-2025
- CIViC, provided by WUSTL, version 08-Dec-2023 (849 records)
- AACT, provided by CTTI, version 07-Feb-2025
- CancerHotspots, provided by MSK, version 10-Sep-2021 (2.25M records)
- cBioPortal, provided by MSK, version 06-Jun-2023 (19.5M records)
- DGI, provided by WUSTL, version 04-Jun-2024
- Pharmacogenomic Biomarkers, provided by FDA, version 19-Sep-2022
- GDC, provided by NIH, version 08-Dec-2023 (2.17M records)
- GHR Genes, provided by NLM, version 05-Dec-2024 (1.50k records)
Other Databases
VarSome also annotates variants using the following databases, although these are not currently leveraged by the automated classifications:
- VCF attributes, provided by generic, version generic VCF file
- DGV, provided by TCAG, version 30-Jun-2021 (792k records)
- Semantic Scholar, provided by Allen Institute
- PubMed, provided by NCBI
- PMKB, provided by Weill Cornell Medicine, version 08-Nov-2024 (161 records)
- phastCons100way, provided by CSH, version 14-Apr-2021 (3.14G records)
- Mastermind, provided by Genomenon, version 230612 (22.6M records)
- kaviar3, provided by ISB, version 4-Feb-2016 (83.3M records)
- HGNC, provided by HUGO, version 13-Feb-2025
- GWAS Catalog, provided by EBI, version 07-Feb-2025 (789k records)
- Cosmic Licensed, provided by Sanger, version v101
- ClinVar CNVs, provided by NCBI, version 07-Feb-2025 (61.6k records)
- ClinGen Variants, provided by NIH, version 07-Feb-2025 (9.72k records)
- ClinGen Regions, provided by NIH, version 07-Feb-2025 (516 records)
- ClinGen CNVs, provided by NIH, version 07-Feb-2025 (156 records)
- ClinGen, provided by NIH, version 07-Feb-2025 (1.56k records)
- AlphaMissense, provided by HL, version 03-Jul-2024 (69.1M records)
- Analysis-specific variant data, provided by generic, version any
- BAM Coverage, provided by generic
- Bravo, provided by UMICH, using version Freeze5 (25.5M records) for hg19, and using version Freeze8 (75.5M records) for hg38
- DailyMed, provided by NIH, version 03-Sep-2021
- dbNSFP-p, provided by dbNSFP
- dbSNP, provided by NCBI, version build 156 (1.27G records)
- dbVar, provided by NCBI, version 03-Jul-2024 (3.06M records)
- DVD, provided by UOI, using version v9 (2.49M records) for hg19, unavailable for hg38
- DECIPHER, provided by Sanger, version 07-Feb-2025 (31.0k records)
- DoCM, provided by WUSTL, version 07-Jun-2022 (1.24k records)
- EMA Approved Drugs, provided by EMA, version 03-Sep-2021
- EVE, provided by OATML, unavailable for hg19, and using version 07-Jun-2022 (4.73M records) for hg38
- ExacCNV, provided by Broad, using version 01-Jul-2021 (49.3k records) for hg19, and using version 20180227 (48.6k records) for hg38
- ExAC genes, provided by Broad, version 18-Sep-2018 (18.3k records)
- FDA Approved Drugs, provided by FDA, version 03-Sep-2021
- FusionGDB, provided by UTexas, version 19-Nov-2021 (15.6k records)
- GERP, using version 2010 (2.60G records) for hg19, unavailable for hg38
- gnomAD structural variants, provided by Broad, version 30-Jun-2021 (334k records)
dbNSFP Sources (non-synonymous coding SNVs)
Additional sources annotated using the dbNSFP database:
Functional predictions:
- ALoFT
- BayesDel
- DEOGEN2
- Eigen
- Eigen-PC
- FATHMM
- FATHMM-XF
- FATHMM-MKL
- fitCons
- LIST-S2
- LRT
- M-CAP
- MetaLR
- MetaRNN
- MetaSVM
- MPC
- MutationAssessor
- MutationTaster
- MutPred
- MVP
- Polyphen-2
- PrimateAI
- PROVEAN
- REVEL
- SIFT
- SIFT4G
Conservation scores:
- bStatistic
- phastCons100way Vertebrate
- phastCons30way Mammalian
- phastCons17way Primate
- phyloP100way Vertebrate
- phyloP30way Mammalian
- phyloP17way Primate
- SiPhy
Gene annotation sources:
- BioCarta
- Consensus
- egenetics
- Essential Genes
- GDI
- Gene Ontology
- GHIS
- GNF/Atlas
- HIPred
- KEGG
- LoFTool
- Mouse genes
- P(HI) Score
- P(rec) Score
- RVIS
- UniProt Genes
- Zebrafish genes