dimfeld and michiyasunaga committed on
Commit
09b7d2f
0 Parent(s):

Duplicate from michiyasunaga/BioLinkBERT-large


Co-authored-by: Michihiro Yasunaga <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,87 @@
+ ---
+ license: apache-2.0
+ language: en
+ datasets:
+ - pubmed
+ tags:
+ - bert
+ - exbert
+ - linkbert
+ - biolinkbert
+ - feature-extraction
+ - fill-mask
+ - question-answering
+ - text-classification
+ - token-classification
+ widget:
+ - text: Sunitinib is a tyrosine kinase inhibitor
+ duplicated_from: michiyasunaga/BioLinkBERT-large
+ ---
+
+ ## BioLinkBERT-large
+
23
+ BioLinkBERT-large model pretrained on [PubMed](https://pubmed.ncbi.nlm.nih.gov/) abstracts along with citation link information. It is introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).
24
+
25
+ This model achieves state-of-the-art performance on several biomedical NLP benchmarks such as [BLURB](https://microsoft.github.io/BLURB/) and [MedQA-USMLE](https://github.com/jind11/MedQA).
26
+
27
+
28
+ ## Model description
29
+
+ LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Concretely, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
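+
+ As a rough, illustrative sketch of that idea at the input level (not from the original model card; the two passages below are hypothetical placeholders), a pair of linked documents can be packed into one context with the tokenizer's standard text-pair encoding:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-large')
+
+ # Two passages connected by a citation link (illustrative placeholders).
+ anchor_doc = "Sunitinib is a tyrosine kinase inhibitor used to treat renal cell carcinoma."
+ linked_doc = "Tyrosine kinase inhibitors block signaling pathways that drive tumor growth."
+
+ # Encoded as a single [CLS] A [SEP] B [SEP] context, BERT's segment-pair format.
+ inputs = tokenizer(anchor_doc, linked_doc, return_tensors="pt")
+ print(inputs["token_type_ids"])  # segment ids distinguish the two documents
+ ```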
+
+ LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance for general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).
+
+
+ ## Intended uses & limitations
+
+ The model can be fine-tuned on a downstream task such as question answering, sequence classification, or token classification.
+ You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).
+
+
+ ### How to use
+
+ To use the model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-large')
+ model = AutoModel.from_pretrained('michiyasunaga/BioLinkBERT-large')
+ inputs = tokenizer("Sunitinib is a tyrosine kinase inhibitor", return_tensors="pt")
+ outputs = model(**inputs)
+ last_hidden_states = outputs.last_hidden_state
+ ```
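+
+ `last_hidden_states` contains one vector per token. If a single embedding per input is needed, a common choice (illustrative, not prescribed by the model card) is the [CLS] vector or a mean over the tokens:
+
+ ```python
+ # Sentence-level embeddings derived from the token-level outputs above.
+ cls_embedding = last_hidden_states[:, 0]           # vector for the [CLS] token
+ mean_embedding = last_hidden_states.mean(dim=1)    # average over all tokens
+ # For padded batches, mask out padding tokens before averaging.
+ ```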
+
+ For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebase.
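+
+ As a concrete illustration only (not part of the LinkBERT repository; the dataset and hyperparameters below are placeholders), a minimal sequence-classification fine-tune with the Hugging Face `Trainer` could look like this:
+
+ ```python
+ from datasets import load_dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           TrainingArguments, Trainer)
+
+ model_name = 'michiyasunaga/BioLinkBERT-large'
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ # The classification head is newly initialized and learned during fine-tuning.
+ model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
+
+ # Placeholder dataset; substitute the biomedical task you actually care about.
+ dataset = load_dataset("imdb")
+ dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
+                       batched=True)
+
+ args = TrainingArguments(output_dir="biolinkbert-finetuned",
+                          learning_rate=2e-5,
+                          per_device_train_batch_size=8,
+                          num_train_epochs=3)
+ trainer = Trainer(model=model, args=args,
+                   train_dataset=dataset["train"],
+                   eval_dataset=dataset["test"],
+                   tokenizer=tokenizer)  # enables dynamic padding via the default collator
+ trainer.train()
+ ```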
+
+
+ ## Evaluation results
+
+ When fine-tuned on downstream tasks, LinkBERT achieves the following results.
+
+ **Biomedical benchmarks ([BLURB](https://microsoft.github.io/BLURB/), [MedQA](https://github.com/jind11/MedQA), [MMLU](https://github.com/hendrycks/test), etc.):** BioLinkBERT attains new state-of-the-art performance.
+
+ | Model | BLURB score | PubMedQA | BioASQ | MedQA-USMLE |
+ | ---------------------- | -------- | -------- | ------- | -------- |
+ | PubmedBERT-base | 81.10 | 55.8 | 87.5 | 38.1 |
+ | **BioLinkBERT-base** | **83.39** | **70.2** | **91.4** | **40.0** |
+ | **BioLinkBERT-large** | **84.30** | **72.2** | **94.8** | **44.6** |
+
+ | Model | MMLU-professional medicine |
+ | ---------------------- | -------- |
+ | GPT-3 (175B params) | 38.7 |
+ | UnifiedQA (11B params) | 43.2 |
+ | **BioLinkBERT-large (340M params)** | **50.7** |
+
+
+ ## Citation
+
+ If you find LinkBERT useful in your project, please cite the following:
+
+ ```bibtex
+ @InProceedings{yasunaga2022linkbert,
+   author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
+   title =   {LinkBERT: Pretraining Language Models with Document Links},
+   year =    {2022},
+   booktitle = {Association for Computational Linguistics (ACL)},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.9.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 28895
+ }
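
These fields describe a standard BERT-large geometry (24 layers, 16 attention heads, hidden size 1024) with a 28,895-token biomedical vocabulary. A minimal, illustrative way to inspect them without pulling the full checkpoint, assuming the standard `transformers` config API:

```python
from transformers import AutoConfig

# Fetches only config.json, not the ~1.3 GB pytorch_model.bin checkpoint.
config = AutoConfig.from_pretrained("michiyasunaga/BioLinkBERT-large")
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
# Expected from the file above: 24 16 1024
```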
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fed75e5716547b54198d4dd123e7a3f3c64a82e1172b3492a11deebd6ab4cd4d
+ size 1334073393
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff