license: cc-by-4.0

Pancancer TP53 classifier from H&E resections

This model classifies an H&E-stained digital pathology image as TP53 wildtype or mutant. It was trained by Jakub Kaczmarzyk using CLAM.

Inputs: A bag of patch embeddings. Each patch has a 128um edge length and is embedded with CTransPath.

Output classes: wildtype, mutant

Data

Diagnostic (DX) slides from TCGA were used to train the model. The whole slide images were tiled into 128um x 128um patches, and each patch was encoded with CTransPath, which produces a 768-dimensional embedding per patch.
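Because the patch size is given in micrometers, the edge length in pixels depends on the resolution at which each slide was scanned. The sketch below shows that arithmetic only. The function name and the `mpp` (microns per pixel) argument are hypothetical, and it assumes the slide's resolution is known from its metadata.

```python
def patch_size_in_pixels(patch_um: float, mpp: float) -> int:
    """Return the patch edge length in pixels for a slide scanned at `mpp` microns per pixel."""
    if mpp <= 0:
        raise ValueError("mpp must be positive")
    return round(patch_um / mpp)


# A 128um patch covers more pixels on a higher-resolution scan.
print(patch_size_in_pixels(128, 0.25))  # 512
print(patch_size_in_pixels(128, 0.5))   # 256
```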

Train, validation, and test splits were stratified by TCGA study and TP53 status, and patients did not cross split boundaries.
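One way to build splits like this is to assign patients, not slides, to splits within each (study, TP53 status) stratum, and then let every slide inherit its patient's split. The following is a minimal sketch of that idea, not the procedure actually used for this model. The function name, the 80/10/10 fractions, the seed, and the toy data are all assumptions, and it assumes each patient has a single study and TP53 status.

```python
import random
from collections import defaultdict


def split_patients(patients, fractions=(0.8, 0.1), seed=0):
    """Map patient_id -> "train" / "val" / "test".

    patients: dict of patient_id -> (study, tp53_status).
    fractions: (train, val) fractions per stratum; the remainder goes to test.
    """
    strata = defaultdict(list)
    for patient_id, stratum in patients.items():
        strata[stratum].append(patient_id)

    rng = random.Random(seed)
    assignment = {}
    for stratum in sorted(strata):
        ids = sorted(strata[stratum])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        for i, patient_id in enumerate(ids):
            if i < n_train:
                assignment[patient_id] = "train"
            elif i < n_train + n_val:
                assignment[patient_id] = "val"
            else:
                assignment[patient_id] = "test"
    return assignment


# Toy data (hypothetical): 20 BRCA mutant patients and 10 LUAD wildtype patients.
patients = {f"P{i:03d}": ("BRCA", "mutant") for i in range(20)}
patients.update({f"Q{i:03d}": ("LUAD", "wildtype") for i in range(10)})
patient_split = split_patients(patients)

# Slides inherit the split of their patient, so no patient crosses a split boundary.
slides = {"P000-slide1": "P000", "P000-slide2": "P000", "Q003-slide1": "Q003"}
slide_split = {slide: patient_split[pid] for slide, pid in slides.items()}
```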

Sample sizes:

  • Train: 8,736 slides (7,076 patients)
  • Validation: 1,061 slides (881 patients)
  • Test: 1,069 slides (881 patients)

The TP53 status for each sample was downloaded from cBioPortal.

TCGA studies with fewer than 100 samples of mutated TP53 were excluded from training.

The following TCGA studies were used in training: ACC, BLCA, BRCA, CESC, COADREAD, ESCA, GBM, HNSC, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PAAD, PCPG, PRAD, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC.

The following TCGA studies were not used in training: CHOL, UVM, UCS, KICH, MESO, DLBC.

Reusing this model

To use this model on the command line, see WSInfer-MIL.

Alternatively, you may use PyTorch or ONNX to run the model. First, embed 128um x 128um patches using CTransPath. Then pass the bag of embeddings to the model. The example below uses ONNX Runtime.

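import numpy as np
import onnxruntime as ort

# Placeholder bag: 1,000 patch embeddings, each 768-dimensional (the CTransPath embedding size).
embedding = np.ones((1_000, 768), dtype="float32")
ort_sess = ort.InferenceSession("model.onnx")
# Request both named outputs: the class logits and the attention values.
logits, attention = ort_sess.run(["logits", "attention"], {"input": embedding})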
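The logits are unnormalized scores, so a softmax is needed to read them as class probabilities. The sketch below shows that step in plain Python. It assumes the model returns one logit per class in the order wildtype, mutant (matching the "Output classes" line above, but not verified here), and the example logit values are hypothetical. With the ONNX Runtime output, something like `logits[0].tolist()` would likely supply the input list, assuming the logits have shape (1, 2).

```python
import math

# Assumed class order; check it against the model before relying on it.
CLASS_NAMES = ("wildtype", "mutant")


def softmax(logits):
    """Convert a list of logits to probabilities, subtracting the max for numerical stability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-0.3, 1.2]  # hypothetical values, not real model output
probs = softmax(example_logits)
predicted = CLASS_NAMES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```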

Model performance

The model achieved an AUROC of 0.85 on the full test set.

The AUROC values per TCGA study are listed below. A value of nan means the test set for that study contained only one class in the ground truth labels, so AUROC is undefined.

  • ACC: 0.750
  • BLCA: 0.597
  • BRCA: 0.862
  • CESC: 0.562
  • COADREAD: 0.742
  • ESCA: 0.643
  • GBM: 0.792
  • HNSC: 0.599
  • KIRC: 1.000
  • KIRP: nan
  • LGG: 0.763
  • LIHC: 0.769
  • LUAD: 0.842
  • LUSC: 0.610
  • OV: 0.708
  • PAAD: 0.787
  • PCPG: nan
  • PRAD: 0.657
  • SARC: 0.762
  • SKCM: 0.722
  • STAD: 0.716
  • TGCT: nan
  • THCA: nan
  • THYM: nan
  • UCEC: 0.825
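The nan entries follow from the definition of AUROC: it is the probability that a randomly chosen mutant sample scores higher than a randomly chosen wildtype sample, which cannot be computed when one of the two classes is absent. The sketch below illustrates this with a simple pairwise implementation. It is not the code used to produce the numbers above, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Pairwise AUROC; labels are 0 (wildtype) or 1 (mutant). Returns nan if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def auroc_per_study(records):
    """records: iterable of (study, label, score). Returns dict of study -> AUROC."""
    grouped = defaultdict(lambda: ([], []))
    for study, label, score in records:
        grouped[study][0].append(label)
        grouped[study][1].append(score)
    return {study: auroc(labels, scores) for study, (labels, scores) in grouped.items()}


# Toy records (hypothetical): study "A" has both classes, study "B" has only wildtype.
records = [
    ("A", 0, 0.1), ("A", 0, 0.4), ("A", 1, 0.35), ("A", 1, 0.8),
    ("B", 0, 0.2), ("B", 0, 0.6),
]
per_study = auroc_per_study(records)
```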

Intended uses

This model is ONLY intended for research purposes.

This model must not be used for clinical purposes. This model is distributed without warranties, either express or implied.