SAVSNET/PetBERT

@@ -15,7 +15,7 @@ PetBERT is a masked language model based on the BERT architecture further traine
 ## Paper Abstract
 Effective public health surveillance requires consistent monitoring of disease signals such that researchers and decision-makers can react dynamically to changes in disease occurrence. However, whilst surveillance initiatives exist in production animal veterinary medicine, comparable frameworks for companion animals are lacking. First-opinion veterinary electronic health records (EHRs) have the potential to reveal disease signals and often represent the initial reporting of clinical syndromes in animals presenting for medical attention, highlighting their possible significance in early disease detection. Yet despite their availability, there are limitations surrounding their free text-based nature, inhibiting the ability for national-level mortality and morbidity statistics to occur. This paper presents PetBERT, a large language model trained on over 500 million words from 5.1 million EHRs across the UK. PetBERT-ICD is the additional training of PetBERT as a multi-label classifier for the automated coding of veterinary clinical EHRs with the International Classification of Disease 11 framework, achieving F1 scores exceeding 83% across 20 disease codings with minimal annotations. PetBERT-ICD effectively identifies disease outbreaks, outperforming current clinician-assigned point-of-care labelling strategies up to 3 weeks earlier. The potential for PetBERT-ICD to enhance disease surveillance in veterinary medicine represents a promising avenue for advancing animal health and improving public health outcomes.
-- **Developed by:** SAVSNET
 - **Model type:** Masked Language Model
 - **Language(s) (NLP):** English
 - **License:** openrail
@@ -28,6 +28,16 @@ Effective public health surveillance requires consistent monitoring of disease s
 ## How to Get Started with the Model
@@ -35,7 +45,34 @@ Effective public health surveillance requires consistent monitoring of disease s
 ### Training Data
-Electronic health records have been collected since March 2014 by SAVSNET, the Small Animal Veterinary Surveillance Network, comprising a sentinel network of 253 volunteer veterinary practices across the United Kingdom. A complete description of SAVSNET has been presented elsewhere5. In summary, based on convenience, veterinary practices with compatible practice management software with the SAVSNET data exchange are recruited. Within these participating practices, data is collected from each booked consultation (where an appointment has been made to see a veterinary practitioner or nurse). All owners within these practices can opt-out at the time of consultation, and therefore, their data will be excluded. Data is collected on a consultation-by-consultation basis and includes information such as species, breed, sex, neuter status, age, owner’s postcode, insurance and microchipping status, and crucial to this study, a free-text clinical narrative outlining the events that occurred within that consultation. At the end of each consultation, veterinary practitioners are given 10 ‘main presenting complaint’ (MPC) groups to categorise the main reason the animal presented; these are gastrointestinal, respiratory, pruritus, tumour, renal, trauma, post-operative checkups, vaccination and, other healthy and other unwell. Sensitive information, such as personal identifiers, is cleaned from the data without further preprocessing. SAVSNET has ethical approval from the University of Liverpool Research Ethics Committee (RETH001081). Table 4 summarises the cleaned SAVSNET dataset after the above process for cats and dogs only. We segregated EHRs into training and testing sets based on their respective source practices. This stratification approach ensures that the clinical notes used for testing were separate from those generated by clinicians who had contributed to the training sets, thereby fortifying the robustness of our results and mitigating potential bias.
 ### Dataset availability statement:
 The datasets analysed during the current study are not publicly available due to issues surrounding owner confidentiality. Reasonable requests can be made to the SAVSNET Data Access and Publication Panel ([email protected]) for researchers who meet the criteria for access to confidential data.
@@ -44,20 +81,8 @@ The datasets analysed during the current study are not publicly available due to
 PetBERT was produced using an adaptation of the ULMFiT framework, based upon minimal modifications to the BERT architecture. First, the pre-trained BERT-base model, previously exposed to the general-purpose language of Wikipedia and BooksCorpus, was further fine-tuned on the 500 million token dataset of first-opinion clinical free-text narratives on two simultaneous training tasks, Masked Language Modelling (MLM) and Next Sentence Prediction (NSP), mimicking the tasks used in the initial pre-training of BERT. For MLM training, 15% of the words within a given clinical narrative were masked at random across the entire training dataset. The model was tasked with replacing each masked word with a suitable word, which requires a deep bidirectional understanding of the text. For NSP training, sentences from the narratives were randomly split and rejoined, with a [SEP] token in between, either to the same sentence or to a random sentence. The model had to determine whether the new sentence pairs made sense, enabling a cross-sentence understanding of the text. A randomly selected 10% evaluation set was used to calculate a validation loss and so determine the number of training epochs required. Training ended when the evaluation loss increased, which occurred beyond epoch 8, with the final model selected for downstream tasks. Training took 450 hours on a single NVIDIA A100 GPU.
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 - **Hardware Type:** 1 x NVIDIA A100
 - **Hours used:** ~450 hours
 - **Cloud Provider:** https://www.dur.ac.uk/arc/nvidiacuda/
@@ -67,21 +92,22 @@ Adaption of the ULMFiT framework was utilised in the production of ‘PetBERT’
 ## Citation
 **BibTeX:**
 @article{Farrell2023PetBERT:Records,
-    title = {{PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records}},
-    year = {2023},
-    journal = {Scientific Reports 2023 13:1},
-    author = {Farrell, Sean and Appleton, Charlotte and Noble, Peter John Mäntylä and Al Moubayed, Noura},
-    number = {1},
-    month = {10},
-    pages = {1--14},
-    volume = {13},
-    publisher = {Nature Publishing Group},
-    url = {https://www.nature.com/articles/s41598-023-45155-7},
-    isbn = {0123456789},
-    doi = {10.1038/s41598-023-45155-7},
-    issn = {2045-2322},
-    pmid = {37865683},
-    keywords = {Data mining, Machine learning}
-}

 ## Paper Abstract
 Effective public health surveillance requires consistent monitoring of disease signals such that researchers and decision-makers can react dynamically to changes in disease occurrence. However, whilst surveillance initiatives exist in production animal veterinary medicine, comparable frameworks for companion animals are lacking. First-opinion veterinary electronic health records (EHRs) have the potential to reveal disease signals and often represent the initial reporting of clinical syndromes in animals presenting for medical attention, highlighting their possible significance in early disease detection. Yet despite their availability, there are limitations surrounding their free text-based nature, inhibiting the ability for national-level mortality and morbidity statistics to occur. This paper presents PetBERT, a large language model trained on over 500 million words from 5.1 million EHRs across the UK. PetBERT-ICD is the additional training of PetBERT as a multi-label classifier for the automated coding of veterinary clinical EHRs with the International Classification of Disease 11 framework, achieving F1 scores exceeding 83% across 20 disease codings with minimal annotations. PetBERT-ICD effectively identifies disease outbreaks, outperforming current clinician-assigned point-of-care labelling strategies up to 3 weeks earlier. The potential for PetBERT-ICD to enhance disease surveillance in veterinary medicine represents a promising avenue for advancing animal health and improving public health outcomes.
+- **Developed by:** [Small Animal Veterinary Surveillance Network (SAVSNET)](https://www.liverpool.ac.uk/savsnet/)
 - **Model type:** Masked Language Model
 - **Language(s) (NLP):** English
 - **License:** openrail
 ## How to Get Started with the Model
+```python
+from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
+
+# Load the PetBERT tokenizer and masked-language-model weights from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("SAVSNET/PetBERT")
+model = AutoModelForMaskedLM.from_pretrained("SAVSNET/PetBERT")
+
+# Build a fill-mask pipeline and ask it to predict the [MASK] token
+PetBERT_masked = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+PetBERT_masked('Suspected pneumonia, will require an [MASK] but in the meantime will prescribe antibiotics')
+```
 ### Training Data
+Electronic health records have been collected since March 2014 by SAVSNET, the Small Animal Veterinary Surveillance Network, which comprises a sentinel network of 253 volunteer veterinary practices across the United Kingdom. A complete description of SAVSNET has been presented elsewhere [5]. In summary, veterinary practices whose practice management software is compatible with the SAVSNET data exchange are recruited on a convenience basis. Within these participating practices, data are collected from each booked consultation (where an appointment has been made to see a veterinary practitioner or nurse). Owners at these practices can opt out at the time of consultation, in which case their data are excluded. Data are collected on a consultation-by-consultation basis and include species, breed, sex, neuter status, age, owner’s postcode, insurance and microchipping status and, crucially for this study, a free-text clinical narrative outlining the events that occurred within that consultation. At the end of each consultation, veterinary practitioners categorise the main reason the animal presented into one of 10 ‘main presenting complaint’ (MPC) groups: gastrointestinal, respiratory, pruritus, tumour, renal, trauma, post-operative check-ups, vaccination, other healthy and other unwell. Sensitive information, such as personal identifiers, is removed from the data; no further preprocessing is applied. SAVSNET has ethical approval from the University of Liverpool Research Ethics Committee (RETH001081). The table below summarises the cleaned SAVSNET dataset after the above process for cats and dogs only. We segregated EHRs into training and testing sets based on their source practices. This ensures that the clinical notes used for testing come from practices, and therefore clinicians, that did not contribute to the training set, which strengthens the robustness of our results and mitigates potential bias.
+| Variable | Level                | Dogs              | Cats              |
+|----------|----------------------|-------------------|-------------------|
+| Species  | Dogs                 | 5,275,843         | –                 |
+|          | Cats                 | –                 | 2,062,074         |
+| Sex      | Male                 | 2,710,641 (51.2%) | 1,009,388 (48.1%) |
+|          | Female               | 2,565,202 (48.8%) | 1,052,686 (51.9%) |
+| Country  | England              | 4,715,276 (90.4%) | 1,871,536 (91.6%) |
+|          | Scotland             | 252,024 (4.8%)    | 81,883 (4.2%)     |
+|          | Wales                | 216,799 (4.2%)    | 77,774 (3.9%)     |
+|          | Northern Ireland     | 34,129 (0.6%)     | 6,204 (0.3%)      |
+| Age      | Infant (0 to 1 year) | 501,339 (11.9%)   | 190,534 (9.8%)    |
+|          | Adult (1–10 years)   | 2,830,739 (64.5%) | 887,640 (51.0%)   |
+|          | Senior (>10 years)   | 1,036,075 (23.6%) | 691,738 (39.2%)   |
+| Neutered | Yes                  | 3,587,028 (68.0%) | 1,670,280 (81.4%) |
+|          | No                   | 1,688,093 (32.0%) | 391,794 (19.6%)   |
+| MPC      | Gastroenteric        | 174,688 (3.3%)    | 45,368 (2.4%)     |
+|          | Kidney_disease       | 14,046 (0.2%)     | 18,169 (0.9%)     |
+|          | Other_healthy        | 1,333,760 (25.4%) | 494,170 (23.7%)   |
+|          | Other_unwell         | 1,006,031 (19.6%) | 418,107 (21.3%)   |
+|          | Post_op              | 414,764 (7.8%)    | 135,865 (6.6%)    |
+|          | Pruritus             | 283,880 (5.3%)    | 53,869 (2.7%)     |
+|          | Respiratory          | 52,625 (0.9%)     | 27,123 (1.3%)     |
+|          | Trauma               | 249,039 (4.7%)    | 102,646 (4.9%)    |
+|          | Tumour               | 100,080 (1.8%)    | 23,865 (1.1%)     |
+|          | Vaccination          | 1,639,268 (31.0%) | 739,890 (35.1%)   |
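
The practice-level train/test split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `practice_id` field, the `split_by_practice` helper, the 20% test fraction and the toy records are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_practice(records, test_fraction=0.2, seed=0):
    """Split records so that no practice contributes to both sets.

    `records` is a list of dicts carrying a hypothetical 'practice_id' key.
    The test fraction is an assumed value, not one taken from the paper.
    """
    # Group the records by the practice that produced them.
    by_practice = defaultdict(list)
    for record in records:
        by_practice[record["practice_id"]].append(record)

    # Shuffle the practices (not the records) reproducibly.
    practice_ids = sorted(by_practice)
    random.Random(seed).shuffle(practice_ids)

    # Hold out whole practices for testing.
    n_test = max(1, round(len(practice_ids) * test_fraction))
    test_ids, train_ids = practice_ids[:n_test], practice_ids[n_test:]

    train = [r for pid in train_ids for r in by_practice[pid]]
    test = [r for pid in test_ids for r in by_practice[pid]]
    return train, test


# Toy data: 10 made-up practices with 5 made-up notes each.
records = [
    {"practice_id": f"P{p}", "narrative": f"note {i} from practice {p}"}
    for p in range(10)
    for i in range(5)
]
train, test = split_by_practice(records)
train_practices = {r["practice_id"] for r in train}
test_practices = {r["practice_id"] for r in test}
```

Splitting on the practice rather than on the individual record is what keeps notes written at the same practice from appearing on both sides of the split.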
 ### Dataset availability statement:
 The datasets analysed during the current study are not publicly available due to issues surrounding owner confidentiality. Reasonable requests can be made to the SAVSNET Data Access and Publication Panel ([email protected]) for researchers who meet the criteria for access to confidential data.
 PetBERT was produced using an adaptation of the ULMFiT framework, based upon minimal modifications to the BERT architecture. First, the pre-trained BERT-base model, previously exposed to the general-purpose language of Wikipedia and BooksCorpus, was further fine-tuned on the 500 million token dataset of first-opinion clinical free-text narratives on two simultaneous training tasks, Masked Language Modelling (MLM) and Next Sentence Prediction (NSP), mimicking the tasks used in the initial pre-training of BERT. For MLM training, 15% of the words within a given clinical narrative were masked at random across the entire training dataset. The model was tasked with replacing each masked word with a suitable word, which requires a deep bidirectional understanding of the text. For NSP training, sentences from the narratives were randomly split and rejoined, with a [SEP] token in between, either to the same sentence or to a random sentence. The model had to determine whether the new sentence pairs made sense, enabling a cross-sentence understanding of the text. A randomly selected 10% evaluation set was used to calculate a validation loss and so determine the number of training epochs required. Training ended when the evaluation loss increased, which occurred beyond epoch 8, with the final model selected for downstream tasks. Training took 450 hours on a single NVIDIA A100 GPU.
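
The data preparation and stopping rule described above can be sketched in plain Python. This is an illustrative simplification under stated assumptions, not the training code used for PetBERT: it works on whitespace-separated words rather than WordPiece tokens, it always substitutes `[MASK]` (the original BERT recipe sometimes keeps or randomises the chosen word), and the 50/50 choice between a true and a random continuation is assumed. The helper names, the example notes and the loss values are invented for the example.

```python
import random

MASK, SEP = "[MASK]", "[SEP]"


def mask_words(words, mask_prob=0.15, rng=None):
    """Mask about `mask_prob` of the words; return (masked words, targets).

    `targets` maps each masked position to the word the model must recover.
    """
    if rng is None:
        rng = random.Random()
    if not words:
        return [], {}
    n_mask = max(1, round(len(words) * mask_prob))
    positions = rng.sample(range(len(words)), n_mask)
    targets = {i: words[i] for i in positions}
    masked = [MASK if i in targets else w for i, w in enumerate(words)]
    return masked, targets


def make_nsp_pair(words, other_words, rng=None, p_same=0.5):
    """Split `words` at a random point and rejoin around [SEP].

    With probability `p_same` (an assumed value) the second half is the
    true continuation; otherwise it is the tail of `other_words`.
    Both inputs need at least two words.
    """
    if rng is None:
        rng = random.Random()
    cut = rng.randint(1, len(words) - 1)
    is_next = rng.random() < p_same
    if is_next:
        second = words[cut:]
    else:
        second = other_words[rng.randint(1, len(other_words) - 1):]
    return words[:cut] + [SEP] + second, is_next


def stopping_epoch(eval_losses):
    """Return the 1-based epoch just before the evaluation loss first rises."""
    for i in range(1, len(eval_losses)):
        if eval_losses[i] > eval_losses[i - 1]:
            return i  # 0-based index i - 1 is 1-based epoch i
    return len(eval_losses)


rng = random.Random(0)
# Invented example notes, not taken from the SAVSNET data.
note = "bright and alert , mild cough for 3 days , chest clear on auscultation".split()
other = "booster vaccination given , no concerns reported by owner".split()

masked, targets = mask_words(note, rng=rng)
pair, is_next = make_nsp_pair(note, other, rng=rng)

# Hypothetical validation losses that first rise after the eighth epoch.
losses = [2.00, 1.50, 1.20, 1.00, 0.90, 0.85, 0.82, 0.80, 0.81]
best_epoch = stopping_epoch(losses)
```

In practice the Hugging Face `transformers` library provides `DataCollatorForLanguageModeling` (with an `mlm_probability` argument) for the masking step, so a hand-written version like this is mainly useful for seeing what the task looks like.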
 ## Environmental Impact
 - **Hardware Type:** 1 x NVIDIA A100
 - **Hours used:** ~450 hours
 - **Cloud Provider:** https://www.dur.ac.uk/arc/nvidiacuda/
 ## Citation
 **BibTeX:**
+```bibtex
 @article{Farrell2023PetBERT:Records,
+        title = {{PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records}},
+        year = {2023},
+        journal = {Scientific Reports},
+        author = {Farrell, Sean and Appleton, Charlotte and Noble, Peter John Mäntylä and Al Moubayed, Noura},
+        number = {1},
+        month = {10},
+        pages = {1--14},
+        volume = {13},
+        publisher = {Nature Publishing Group},
+        url = {https://www.nature.com/articles/s41598-023-45155-7},
+        doi = {10.1038/s41598-023-45155-7},
+        issn = {2045-2322},
+        pmid = {37865683},
+        keywords = {Data mining, Machine learning}
+    }
+```