VarSome Somatic Classification

(c) Copyright Saphetor SA. All rights reserved.

version: 13.1.2, dated: 15 Mar 2025 06:08:40 UTC

Introduction

The "Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer" were published in 2017 by Marilyn Li et al. in their seminal paper (AMP Guidelines). The VarSome Somatic Variant Classifier automatically generates a tier recommendation based on these guidelines and the vast range of machine-readable genomic data available.

These standards are very much written for interpretation by humans, not machines: they assume the clinician has a deep knowledge of the domain and of the relevant papers and conditions.

Our guiding principle throughout, following the advice of our clinical advisors, has been to implement a rigorously evidence-based approach to determine whether a variant is relevant for cancer therapy, diagnosis or prognosis. We have leveraged a wide range of public-domain and commercial cancer databases, so the quality of the end result will depend in part on the user's subscription level.

All the rules provide clear natural language explanations of why they were triggered and which evidence was used, or, conversely, a full explanation of why the criteria were not met (these 'negative' explanations are displayed if 'show full detail' is ticked, but they are not retained in the Clinical platform for Tier IV variants).

We strive to continuously improve our implementation, adjusting algorithms, incorporating new data sources, and adding refinements as new publications and methodology changes are suggested. We greatly appreciate feedback from the huge VarSome user community, and always aim to promptly act on any suggestions.

Approach & Overview

The AMP Guidelines do not have a strict set of named rules nor a strict method of calculation detailing how to combine various strengths of evidence to reach a verdict. Rather they consider a series of evidence types and set certain criteria that should be met in order to reach an overall tier I, II, III or IV verdict.

Our implementation considers the following types of evidence. Each is given a 4-letter acronym for convenience, which is displayed in VarSome:

  • Path: Disease-associated pathways
  • Drug: Drug-gene interaction, therapies & clinical studies
  • Type: Mutation type and coding impact
  • Freq: Allele frequency
  • Pred: In-silico and splicing predictions
  • Soma: Somatic sample databases
  • Crtd: Curated somatic variants
  • Pubs: Supporting scientific publications
  • Germ: Evidence from germline databases

The evidence from all these sources is combined to reach an overall recommended tier.
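As an illustration, the nine evidence types and their acronyms could be held in a simple enumeration. This is a minimal sketch only: the `EvidenceType` class and the `format_finding` helper are hypothetical names introduced here, not part of any VarSome API.

```python
from enum import Enum


class EvidenceType(Enum):
    """The nine evidence types, keyed by 4-letter acronym (hypothetical class)."""
    PATH = "Disease-associated pathways"
    DRUG = "Drug-gene interaction, therapies & clinical studies"
    TYPE = "Mutation type and coding impact"
    FREQ = "Allele frequency"
    PRED = "In-silico and splicing predictions"
    SOMA = "Somatic sample databases"
    CRTD = "Curated somatic variants"
    PUBS = "Supporting scientific publications"
    GERM = "Evidence from germline databases"


ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV"}


def format_finding(evidence: EvidenceType, tier: int) -> str:
    """Render a finding label such as 'Path I' from an evidence type and a tier (1-4)."""
    return f"{evidence.name.capitalize()} {ROMAN[tier]}"
```

For example, `format_finding(EvidenceType.CRTD, 4)` produces the label `Crtd IV` in the style used throughout this document.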

Overall Recommended Tier

Once all the evidence categories have been evaluated and a tier assigned to each, an overall recommended tier is established, in accordance with the AMP guidelines.

1- Curated Evidence

If there is any curated evidence, this is given priority, and the resulting tier is used. This ensures that the classifier is fully in line with the available curated evidence.

2- No curated evidence: use Germline + Pathway + Drug

In the absence of curated evidence, we leverage the germline classification, gene pathway and drug data. This ensures that novel pathogenic variants in genes with a known cancer pathway and associated treatments can be correctly classified. We consider:

  1. Germline classification per ACMG guidelines.
  2. Pathway: the variant needs to be in a gene associated with a known cancer pathway.
  3. Drug: there needs to be an applicable therapy for this gene.

The resulting tier according to the evidence is:

  • Tier I: pathogenic or likely pathogenic variant per ACMG, in a known cancer pathway, with a Tier I therapy or guideline.
  • Tier II: pathogenic or likely pathogenic variant per ACMG, in a known cancer pathway, with a Tier II therapy or guideline.
  • Tier III: VUS variant per ACMG, in a known cancer pathway.
  • Tier IV: frequency >= 0.01 per AMP, (Likely) Benign per ACMG, or a VUS variant per ACMG in a non-cancer pathway.

Observations:

  • Cancer type is critical for a correct somatic classification as this is used to match curated evidence and treatments/guidelines.
  • Curated evidence, if available, will heavily influence the resulting classification.
  • Novel variants for which there is no curated evidence are classified using our well-proven germline classifier, following the ACMG guidelines.
  • The lack of a cancer pathway (Path IV) for the gene will result in at most a Tier III classification.
  • For cancer-causing variants, the availability of approved therapies matching the patient's cancer type will determine whether the overall classification is Tier I or Tier II.
  • There are instances of variants that are classified benign by ACMG, but are reported in curated cancer databases.
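The two-step procedure described in this section can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not VarSome's implementation: the function name, its parameters and the classification strings are hypothetical, and the value returned for combinations that the tier list does not spell out (for example a pathogenic variant with no applicable therapy) is a placeholder.

```python
from typing import Optional

PATHOGENIC = {"Pathogenic", "Likely Pathogenic"}
BENIGN = {"Benign", "Likely Benign"}


def recommend_tier(curated_tier: Optional[int],
                   germline_class: str,
                   in_cancer_pathway: bool,
                   therapy_tier: Optional[int],
                   population_frequency: Optional[float] = None) -> int:
    """Return an overall tier (1-4); all names here are hypothetical."""
    # Step 1: curated evidence, when present, takes priority.
    # None stands for "no curated evidence was identified".
    if curated_tier is not None:
        return curated_tier

    # Step 2: no curated evidence, so use germline + pathway + drug.
    if population_frequency is not None and population_frequency >= 0.01:
        return 4  # frequency >= 0.01 per AMP
    if germline_class in BENIGN:
        return 4  # (Likely) Benign per ACMG
    if germline_class in PATHOGENIC and in_cancer_pathway:
        if therapy_tier in (1, 2):
            return therapy_tier  # Tier I or Tier II therapy or guideline
    if germline_class == "Uncertain Significance":
        return 3 if in_cancer_pathway else 4

    # ASSUMPTION: remaining combinations are not listed in the text;
    # Tier III is used here purely as a placeholder.
    return 3
```

A pathogenic variant in a known cancer pathway with a Tier I therapy and no curated evidence yields `recommend_tier(None, "Pathogenic", True, 1) == 1`, while any curated tier short-circuits the fallback.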

Sample Information

The somatic variant classifier is able to leverage data from the sample itself in order to provide additional findings to the clinician and help prioritize which variants to review. These findings do not modify the actual tier assigned to a variant, but show up as flags in the report table in VarSome Clinical.

  • Cancer type: this highlights any variants for which evidence is found linking to the same cancer type as the sample.
  • Tissue: similarly this will highlight any evidence associating the variant or gene to the sample tissue.
  • Age: we are able to obtain an age histogram for certain cancer-types and display the patient's age relative to that.
  • Ethnicity: allele frequencies can differ between populations and we report the variant's frequency in the relevant ethnic group.
  • Sex: we highlight if the provided sex matches the majority of reported cases across somatic sample databases.

Relation to VarSome's germline variant classifier

The automated somatic variant classifier is related to VarSome's germline variant classifier:

  • It shares many of the same source databases, though may leverage them differently for cancer.
  • It uses the same transcript (therefore gene) as identified for germline; we refer the user to the germline classifier documentation for more details.
  • The mutation type rules directly leverage VarSome's germline variant classifier.

The germline classifier is however only used as a fall-back if no evidence exists in the curated data sources. This allows us to potentially identify novel cancer variants.

We have also paid special attention to ensuring we don't double-count the same evidence.

Equivalent Amino-Acid Variants

Our classifier aims to be evidence-based and leverages a significant number of databases. For a number of rules we also consider equivalent amino-acid variants to see whether alternative forms with the same protein impact may have been reported. This approach is applied to the following rules:

The explanations make it clear when evidence for an equivalent variant has been used, and a link to VarSome is provided to view the data for the equivalent variant itself.

Gene-related evidence

The following two evidence types are generally predicated on data from the gene the variant affects. We generally expect all variants within that gene to trigger the same findings (however, drug associations may be reported for specific variants only).

"Path": Disease-associated pathways

The somatic variant classifier uses the same transcript and gene as VarSome's germline variant classifier.

The following databases are scanned to see whether this gene is associated with cancer:

  • BioCarta
  • CKB Genes
  • Consensus
  • GHR
  • KEGG
  • Mondo
  • The Human Protein Atlas

If a cancer association is found, we assign Path I to tumor-suppressing genes, and Path II to all other genes.

Important: if no cancer association is found, we assign Path IV, which will then result in a Tier IV overall verdict, irrespective of any other evidence, on the assumption that all non-somatic variants are covered by VarSome's germline variant classifier.
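The Path assignment can be sketched as follows. This is a minimal illustration: the function name and the shape of its inputs (one set of cancer-associated gene symbols per database, plus a set of tumor-suppressing genes) are assumptions made for the example, not the classifier's actual data model.

```python
from typing import Dict, Set


def assign_path_tier(gene: str,
                     cancer_gene_sets: Dict[str, Set[str]],
                     tumor_suppressors: Set[str]) -> str:
    """Assign a Path tier to a gene (hypothetical helper)."""
    # The gene is cancer-associated if any scanned database lists it.
    cancer_associated = any(gene in genes for genes in cancer_gene_sets.values())
    if not cancer_associated:
        return "Path IV"
    # Tumor-suppressing genes get Path I, all other cancer genes Path II.
    return "Path I" if gene in tumor_suppressors else "Path II"
```

With toy inputs such as `{"KEGG": {"TP53", "BRAF"}}` and `{"TP53"}` as the tumor-suppressor set, TP53 is assigned Path I, BRAF Path II, and a gene listed nowhere Path IV.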

"Drug": Drug-gene interaction, therapies & clinical studies

The following sources of evidence are analysed in order to identify any drugs associated with the gene under consideration:

  • AACT
  • CIViC
  • CKB
  • DGI
  • OncoKB
  • PharmGKB
  • Pharmacogenomic Biomarkers

Drug I is assigned to drugs that are FDA or EMA approved, are reported by trusted curated sources, or have phase 3 clinical trials. Drug II is assigned otherwise.

Tier I approved drugs that do not match the patient's cancer type (if provided) will be downgraded to Drug II (in accordance with the AMP guidelines).

Note: gene-level drug or clinical trial searches are disabled if there is no curated evidence, the germline classification is not pathogenic, or the gene is not linked to cancer.
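The Drug I / Drug II assignment, including the cancer-type downgrade, might look like the sketch below. The `DrugEvidence` record and its fields are hypothetical, cancer types are compared as plain strings for simplicity, and the downgrade is applied to any Drug I evidence whose cancer type differs from the patient's, which is an assumption about how the rule generalises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrugEvidence:
    """One drug association for a gene (hypothetical record)."""
    approved: bool                 # FDA or EMA approved
    trusted_curated: bool          # reported by a trusted curated source
    phase3_trial: bool             # has a phase 3 clinical trial
    cancer_type: Optional[str] = None


def assign_drug_tier(evidence: DrugEvidence,
                     patient_cancer_type: Optional[str] = None) -> str:
    """Return 'Drug I' or 'Drug II' for one piece of drug evidence."""
    qualifies = (evidence.approved
                 or evidence.trusted_curated
                 or evidence.phase3_trial)
    if not qualifies:
        return "Drug II"
    # Downgrade when a patient cancer type is provided and does not match.
    # ASSUMPTION: exact string comparison stands in for real cancer-type matching.
    if patient_cancer_type is not None and evidence.cancer_type != patient_cancer_type:
        return "Drug II"
    return "Drug I"
```

If no patient cancer type is provided, no downgrade is applied, mirroring the "(if provided)" wording above.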

Variant-specific evidence

"Crtd": Curated somatic variants

This rule checks whether the variant has been previously reported in any of the following curated databases:

  • CIViC
  • CKB
  • OncoKB

CKB is a database we have licensed from The Jackson Laboratory and it is currently only available for VarSome Clinical users as it incurs a fee per sample processed.

Important: OncoKB is not currently available using the Python API.

Curated Tier

Importantly the curated evidence is filtered by the patient's cancer type. This is a critical component of the clinical analysis. The following evidence tiers are established for curated evidence, in line with the AMP guidelines.

  • Crtd I: curated evidence that matches the patient's cancer and for which a Tier I therapy or guideline has been identified.
  • Crtd II: curated evidence that matches the patient's cancer and for which a Tier II therapy has been identified.
  • Crtd II: downgraded from a Tier I for a different cancer type than the patient's.
  • Crtd II: when Tier II evidence was found for a different cancer type than the patient's.
  • Crtd III: curated evidence is of uncertain significance.
  • Crtd IV: indicates that no curated evidence was identified.

The VarSome user interface will display this evidence in the order indicated above.

Important: the most severe tier will be reported if there is conflicting evidence from the curated sources. The full evidence list can be displayed in the VarSome user interface by ticking 'show full detail' in the Somatic component.
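The curated tiers and the "most severe tier wins" rule can be sketched as below. The `CuratedItem` record is hypothetical, cancer types are matched by simple string equality, and "most severe" is read as the lowest tier number; all three are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CuratedItem:
    """One curated evidence entry (hypothetical record)."""
    source_tier: int   # 1 or 2 for therapy/guideline evidence, 3 for uncertain
    cancer_type: str


def assign_curated_tier(items: List[CuratedItem], patient_cancer_type: str) -> str:
    """Combine curated evidence into a single Crtd tier."""
    if not items:
        return "Crtd IV"  # no curated evidence identified
    tiers = []
    for item in items:
        matches = item.cancer_type == patient_cancer_type
        if item.source_tier == 1 and matches:
            tiers.append(1)
        elif item.source_tier in (1, 2):
            # Tier II evidence, or Tier I downgraded for a different cancer type.
            tiers.append(2)
        else:
            tiers.append(3)  # uncertain significance
    # Conflicting evidence: report the most severe (lowest-numbered) tier.
    return "Crtd " + {1: "I", 2: "II", 3: "III"}[min(tiers)]
```

For instance, a Tier I entry for a different cancer type combined with an uncertain entry for the patient's own cancer type resolves to Crtd II.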

Note about CKB

Following feedback from our users, and in order to reduce the number of false positives, we are applying some rules to filter the information from CKB.

CKB has both 'variant-level' evidence, specific to the precise variant being annotated, and 'extended evidence' that applies to this class of variant within the gene.

The 'extended evidence' will not be used if either of the two following conditions is met:

  • The protein effect is unknown or neutral (as recommended by JAX),
  • The germline classification is Benign or Likely Benign.

Similarly, the 'variant-level' evidence will be downgraded to Tier III if either of the two following conditions is met:

  • The protein effect is unknown or neutral,
  • The variant is missense and the germline classification is Benign or Likely Benign.
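The two CKB filters can be expressed as a pair of small predicates. This is a sketch under assumptions: the function names, the lowercase protein-effect strings and the integer tier encoding (1 to 4) are all hypothetical choices for the example.

```python
BENIGN_CLASSES = {"Benign", "Likely Benign"}
UNINFORMATIVE_EFFECTS = {"unknown", "neutral"}


def use_extended_evidence(protein_effect: str, germline_class: str) -> bool:
    """Return True if CKB 'extended evidence' may be used for this variant."""
    return (protein_effect not in UNINFORMATIVE_EFFECTS
            and germline_class not in BENIGN_CLASSES)


def filter_variant_level_tier(tier: int,
                              protein_effect: str,
                              is_missense: bool,
                              germline_class: str) -> int:
    """Downgrade CKB 'variant-level' evidence to Tier III when a filter applies."""
    downgrade = (protein_effect in UNINFORMATIVE_EFFECTS
                 or (is_missense and germline_class in BENIGN_CLASSES))
    # max() keeps an existing Tier III or IV unchanged instead of promoting it.
    return max(tier, 3) if downgrade else tier
```

Note the asymmetry taken from the rules above: a benign germline classification always blocks extended evidence, but only downgrades variant-level evidence when the variant is missense.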

"Soma": Somatic sample databases

This rule considers evidence from somatic sample databases, counting how many times the variant has been seen in tumor samples, and additionally confirmed to be an acquired mutation as opposed to inherited. We use the following sources:

  • CancerHotspots
  • GDC
  • ICGC
  • TP53 Somatic
  • cBioPortal

This rule first identifies all the somatic samples containing this variant, and assigns:

  • Soma II: if the sample contains confirmed somatic variants,
  • Soma I: if the sample has associated publications.

The rule then considers:

  • the aggregate of all the somatic samples found for this variant,
  • the total number of somatic samples available for the gene,
  • the total number of curated somatic variants in the gene,
  • the general population gnomAD frequency of the variant (if any).

These numbers are then combined using basic statistics computed from CKB, ClinVar & CIViC. Tiers are assigned as follows:

  • Soma IV: if no somatic samples are found from any source.
  • Soma III: too few samples have been found for a high-frequency variant.
  • Soma II: a minimal number of somatic samples have been reported for this variant.
  • Soma I: a high number of somatic samples have been reported.

Much lower thresholds are used internally if the variant has not been reported in gnomAD.

NB: this rule will not work as accurately in VarSome Premium for which critical databases such as JAX CKB are not licensed.

"Type": Mutation type and coding impact

Here we evidence the impact of the variant on the protein, leveraging results from the germline classifier as follows:

  • Type I is assigned to LoF variants (akin to ACMG rule PVS1).
  • Type I is also assigned to variants that are predicted to affect splicing, using the same methodology as VarSome's germline variant classifier.
  • Type II is assigned to variants that modify the protein length (akin to ACMG rule PM4, and mutually exclusive with the previous LoF evidence).
  • Type IV is assigned to non-coding and synonymous variants whose position is not conserved (see Conservation).
  • Type III is assigned to all other variant types.

This evidence is provided for information only and does not impact the overall somatic classification.
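The order in which the Type rules are checked matters, since "all other variant types" is a catch-all. A minimal sketch, assuming the inputs have already been computed elsewhere (the function and its boolean parameters are hypothetical):

```python
def assign_type_tier(is_lof: bool,
                     predicted_splicing: bool,
                     changes_protein_length: bool,
                     is_noncoding_or_synonymous: bool,
                     position_conserved: bool) -> str:
    """Assign a Type tier from precomputed variant properties."""
    # LoF or predicted splicing variants take precedence.
    if is_lof or predicted_splicing:
        return "Type I"
    # Protein-length changes are only considered when no LoF evidence applies.
    if changes_protein_length:
        return "Type II"
    # Non-coding and synonymous variants at non-conserved positions.
    if is_noncoding_or_synonymous and not position_conserved:
        return "Type IV"
    # Everything else, including synonymous variants at conserved positions.
    return "Type III"
```

Checking LoF before protein-length changes reflects the statement that the two kinds of evidence are mutually exclusive.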

"Freq": Allele frequency

This rule combines two factors:

  • The frequency of the variant as reported in the healthy adult population.
  • Whether this variant is somatic, determined by its allelic balance in the sample.

The general frequency of the variant is read from gnomAD exomes and gnomAD genomes. The implementation leverages VarSome's germline variant classifier as follows:

  • Rare variants are identified using gnomAD (akin to ACMG rule PM2); this is further augmented by taking Conservation into account.
  • The AMP-recommended frequency threshold of 0.01 is used to assign Freq IV to high-frequency variants.
  • We calculate a benign frequency threshold per gene, statistically derived from all curated cancer variants, to assign Freq IV to variants that have a relatively high frequency in the general population (a similar methodology to ACMG rule BS1 using different calibrated thresholds).

If the variant's allelic balance reported in this sample is less than 0.3, we classify it as a tumor variant, and assign Freq I if this is indeed a rare variant, or Freq II otherwise.
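The Freq rule can be sketched as follows. Only the 0.01 and 0.3 thresholds come from the text; the function name, the per-gene threshold parameter, the use of `>=` for the per-gene comparison, and the choice to test the Freq IV conditions before the allelic balance are assumptions. Cases the text does not cover return `None`.

```python
from typing import Optional

AMP_FREQUENCY_THRESHOLD = 0.01   # AMP-recommended threshold (from the text)
TUMOR_ALLELIC_BALANCE = 0.3      # below this, treated as a tumor variant (from the text)


def assign_freq_tier(population_frequency: float,
                     is_rare: bool,
                     gene_benign_threshold: float,
                     allelic_balance: Optional[float]) -> Optional[str]:
    """Assign a Freq tier; returns None where the text gives no rule."""
    # High-frequency variants per the AMP threshold.
    if population_frequency >= AMP_FREQUENCY_THRESHOLD:
        return "Freq IV"
    # Relatively high frequency for this gene (hypothetical per-gene threshold).
    if population_frequency >= gene_benign_threshold:
        return "Freq IV"
    # Low allelic balance in the sample suggests a tumor variant.
    if allelic_balance is not None and allelic_balance < TUMOR_ALLELIC_BALANCE:
        return "Freq I" if is_rare else "Freq II"
    return None
```

Here `is_rare` stands for the outcome of the germline classifier's rarity check, which is not reimplemented in this sketch.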

"Pubs": Supporting scientific publications

Here we evidence any publications that the VarSome community have linked to the variant. This does not impact the overall somatic classification and is provided for information only.

Note that the system does not verify whether the publication explicitly mentions cancer, and in all cases publications must be reviewed by the clinician.

Germline evidence

"Germ": Evidence from germline databases

The evidence from VarSome's germline variant classifier for the variant is summarised as follows:

  • Pathogenic ⇒ Germ I
  • Likely Pathogenic ⇒ Germ II
  • Uncertain Significance ⇒ Germ III
  • Likely Benign ⇒ Germ IV
  • Benign ⇒ Germ IV

This classification will be leveraged by the somatic classifier in the absence of any curated evidence for the variant.
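The mapping from germline classification to Germ tier is a direct lookup, which could be written as below. The dictionary and function names are hypothetical, and the classification strings are assumed to be spelled exactly as in the list above.

```python
GERM_TIER = {
    "Pathogenic": "Germ I",
    "Likely Pathogenic": "Germ II",
    "Uncertain Significance": "Germ III",
    "Likely Benign": "Germ IV",
    "Benign": "Germ IV",
}


def assign_germ_tier(germline_class: str) -> str:
    """Map a germline classification to its Germ tier.

    Raises KeyError for an unrecognised classification rather than guessing.
    """
    return GERM_TIER[germline_class]
```

Both benign classes collapse onto Germ IV, so the mapping is not reversible.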

"Pred": In-silico and splicing predictions

We replicate here the same in-silico predictions as used by the germline classifier, using the updated ACMG guidelines described in PMID:5630195.

This evidence is provided for information only: it does not directly impact the somatic classification; however, it is used by the germline classifier and may be particularly relevant for missense and splicing variants.

Databases

The VarSome automated classification processes rely on vast quantities of accurate curated data from the following databases (in no particular order).

Important: depending on licensing agreements and in some cases the fees charged by source organisations, not all databases are visible to all users, and this may directly impact the completeness or quality of automated classifications.

Databases used by the germline variant classifier

  1. UniProt Variants, provided by UNIPROT, version 07-Feb-2025 (72.5k records)
  2. UniProt Regions, provided by UNIPROT, version 07-Feb-2025 (283k records)
  3. RefSeq, provided by NCBI, version 228
  4. phyloP100way, provided by CSH, version 13-Apr-2021 (3.14G records)
  5. PanelApp, provided by Genomics England, version 17-Feb-2025
  6. MitoTip, provided by CHOP, version 13-Dec-2022 (11.1k records)
  7. Mitomap, provided by CHOP, version 08-Dec-2023 (39.0k records)
  8. MitImpact, provided by IRCCS, version 13-Dec-2022 (48.2k records)
  9. MaxEntScan, provided by Burge Lab, version 5-Apr-2023
  10. LOVD, provided by LUMC, version 19-Feb-2025
  11. HPO, version 07-Feb-2025 (19.0k records)
  12. gnomAD Mitochondrial, provided by Broad, version 3.1 (18.2k records)
  13. gnomAD genomes coverage, provided by Broad, using version 2.1 (3.14G records) for hg19, and using version 3.0 (3.21G records) for hg38
  14. gnomAD genomes, provided by Broad, using version 2.1.1 (262M records) for hg19, and using version 4.1 (759M records) for hg38
  15. gnomAD gene constraints, provided by Broad, version 4.1 (18.6k records)
  16. gnomAD exomes coverage, provided by Broad, using version 2.1 (59.6M records) for hg19, and using version 4.0 (169M records) for hg38
  17. ClinVar, provided by NCBI, version 07-Feb-2025 (3.22M records)
  18. ClinGen Disease Validity, provided by NIH, version 07-Feb-2025 (2.50k records)
  19. CADD, provided by UW, version 1.7
  20. CGD, provided by NHGRI, version 03-Jul-2024 (4.74k records)
  21. DANN SNVs, provided by UCI, using version 2014 (9.41G records) for hg19, unavailable for hg38
  22. dbNSFP-c, provided by dbNSFP, version 4.9 (82.8M records)
  23. dbNSFP genes, provided by dbNSFP, version 4.9 (21.5k records)
  24. dbscSNV, provided by dbNSFP, version v1.1 (15.0M records)
  25. Domino, provided by UNIL, version 04-Sep-2019 (17.9k records)
  26. Ensembl, provided by EMBL, version 113
  27. GenCC, version 07-Feb-2025 (5.17k records)
  28. gene2phenotype, provided by EBI, version 04-Oct-2024 (2.89k records)
  29. gnomAD exomes, provided by Broad, using version 2.1.1 (17.2M records) for hg19, and using version 4.1 (184M records) for hg38
  30. Papers & classifications contributed by the VarSome community.

Databases used by the somatic variant classifier

In addition to the databases used for germline classification, the somatic variant classifier leverages information from:

  1. TP53 Somatic, provided by IARC, version release 20 (2.45k records)
  2. TP53 Germline, provided by IARC, version release 20 (436 records)
  3. Cancer Gene Census, provided by Sanger, version v101
  4. The Human Protein Atlas, provided by KAW, version 14-Mar-2024 (20.1k records)
  5. PharmGKB, version 07-Feb-2025
  6. OncoTree, provided by MSK, version 15-Jan-2024
  7. Mondo, provided by Monarch, version 07-Feb-2025
  8. ICGC somatic, provided by ICGC
  9. GTEx, provided by NIH, version v8 (313k records)
  10. CPIC Genes-Drugs, provided by CPIC, version 07-Feb-2025
  11. CKB, provided by JAX, version 23-Feb-2025
  12. CIViC, provided by WUSTL, version 08-Dec-2023 (849 records)
  13. AACT, provided by CTTI, version 07-Feb-2025
  14. CancerHotspots, provided by MSK, version 10-Sep-2021 (2.25M records)
  15. cBioPortal, provided by MSK, version 06-Jun-2023 (19.5M records)
  16. DGI, provided by WUSTL, version 04-Jun-2024
  17. Pharmacogenomic Biomarkers, provided by FDA, version 19-Sep-2022
  18. GDC, provided by NIH, version 08-Dec-2023 (2.17M records)
  19. GHR Genes, provided by NLM, version 05-Dec-2024 (1.50k records)

Other Databases

VarSome also annotates variants using the following databases, although these are not currently leveraged by the automated classifications:

  1. VCF attributes, provided by generic, version generic VCF file
  2. DGV, provided by TCAG, version 30-Jun-2021 (792k records)
  3. Semantic Scholar, provided by Allen Institute
  4. Pub Med, provided by NCBI
  5. PMKB, provided by Weill Cornell Medicine, version 08-Nov-2024 (161 records)
  6. phastCons100way, provided by CSH, version 14-Apr-2021 (3.14G records)
  7. Mastermind, provided by Genomenon, version 230612 (22.6M records)
  8. kaviar3, provided by ISB, version 4-Feb-2016 (83.3M records)
  9. HGNC, provided by HUGO, version 13-Feb-2025
  10. GWAS Catalog, provided by EBI, version 07-Feb-2025 (789k records)
  11. Cosmic Licensed, provided by Sanger, version v101
  12. ClinVar CNVs, provided by NCBI, version 07-Feb-2025 (61.6k records)
  13. ClinGen Variants, provided by NIH, version 07-Feb-2025 (9.72k records)
  14. ClinGen Regions, provided by NIH, version 07-Feb-2025 (516 records)
  15. ClinGen CNVs, provided by NIH, version 07-Feb-2025 (156 records)
  16. ClinGen, provided by NIH, version 07-Feb-2025 (1.56k records)
  17. AlphaMissense, provided by HL, version 03-Jul-2024 (69.1M records)
  18. Analysis-specific variant data, provided by generic, version any
  19. BAM Coverage, provided by generic
  20. Bravo, provided by UMICH, using version Freeze5 (25.5M records) for hg19, and using version Freeze8 (75.5M records) for hg38
  21. DailyMed, provided by NIH, version 03-Sep-2021
  22. dbNSFP-p, provided by dbNSFP
  23. dbSNP, provided by NCBI, version build 156 (1.27G records)
  24. dbVar, provided by NCBI, version 03-Jul-2024 (3.06M records)
  25. DVD, provided by UOI, using version v9 (2.49M records) for hg19, unavailable for hg38
  26. DECIPHER, provided by Sanger, version 07-Feb-2025 (31.0k records)
  27. DoCM, provided by WUSTL, version 07-Jun-2022 (1.24k records)
  28. EMA Approved Drugs, provided by EMA, version 03-Sep-2021
  29. EVE, provided by OATML, unavailable for hg19, and using version 07-Jun-2022 (4.73M records) for hg38
  30. ExacCNV, provided by Broad, using version 01-Jul-2021 (49.3k records) for hg19, and using version 20180227 (48.6k records) for hg38
  31. ExAC genes, provided by Broad, version 18-Sep-2018 (18.3k records)
  32. FDA Approved Drugs, provided by FDA, version 03-Sep-2021
  33. FusionGDB, provided by UTexas, version 19-Nov-2021 (15.6k records)
  34. GERP, using version 2010 (2.60G records) for hg19, unavailable for hg38
  35. gnomAD structural variants, provided by Broad, version 30-Jun-2021 (334k records)

(Version information is subject to change at any time; some databases may require a license and may not be displayed.)

dbNSFP Sources (non-synonymous coding SNVs)

Additional sources annotated using the dbNSFP database:

Functional predictions:

  • ALoFT
  • BayesDel
  • DEOGEN2
  • Eigen
  • Eigen-PC
  • FATHMM
  • FATHMM-XF
  • FATHMM-MKL
  • fitCons
  • LIST-S2
  • LRT
  • M-CAP
  • MetaLR
  • MetaRNN
  • MetaSVM
  • MPC
  • MutationAssessor
  • MutationTaster
  • MutPred
  • MVP
  • Polyphen-2
  • PrimateAI
  • PROVEAN
  • REVEL
  • SIFT
  • SIFT4G

Conservation scores:

  • bStatistic
  • phastCons100way Vertebrate
  • phastCons30way Mammalian
  • phastCons17way Primate
  • phyloP100way Vertebrate
  • phyloP30way Mammalian
  • phyloP17way Primate
  • SiPhy

Gene annotation sources:

  • BioCarta
  • Consensus
  • egenetics
  • Essential Genes
  • GDI
  • Gene Ontology
  • GHIS
  • GNF/Atlas
  • HIPred
  • KEGG
  • LoFTool
  • Mouse genes
  • P(HI) Score
  • P(rec) Score
  • RVIS
  • UniProt Genes
  • Zebrafish genes