---
language:
- en
tags:
- bert
- bluebert
license:
- PUBLIC DOMAIN NOTICE
datasets:
- PubMed
---

# BlueBERT-Large, Uncased, PubMed

## Model description

A BERT-large, uncased model pre-trained on PubMed abstracts.

## Intended uses & limitations

#### How to use

Please see https://github.com/ncbi-nlp/bluebert for the original code and usage details.

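For quick experimentation, here is a minimal sketch of loading the model with the `transformers` library to extract sentence features. The model id `bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16` is assumed from this repository's path, and the example sentence is purely illustrative.

```python
from transformers import AutoModel, AutoTokenizer

# model id assumed from this repository's path
model_id = "bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# encode an example biomedical sentence and use the [CLS] hidden state
# as a sentence-level feature vector
inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 1024]) for this L-24_H-1024 model
```
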
## Training data

We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models. The corpus contains ~4,000M words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).

Pre-trained model: https://huggingface.co/bert-large-uncased

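For reference, a minimal sketch of downloading and peeking at the preprocessed corpus. It assumes the tarball extracts to a single plain-text file named `pubmed_uncased_sentence_nltk.txt` with one tokenized sentence per line; the archive name suggests this, but it is not confirmed here, and the download is large.

```python
import tarfile
import urllib.request

# corpus URL from the "Training data" section above (large download)
url = ("https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/"
       "pubmed_uncased_sentence_nltk.txt.tar.gz")
archive = "pubmed_uncased_sentence_nltk.txt.tar.gz"

urllib.request.urlretrieve(url, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall()

# assumption: the archive contains one sentence-per-line text file
with open("pubmed_uncased_sentence_nltk.txt", encoding="ascii", errors="replace") as f:
    for _ in range(3):
        print(f.readline().rstrip())
```
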
## Training procedure

The pre-training corpus was preprocessed by:

* lowercasing the text
* removing non-ASCII characters (anything outside `\x00`-`\x7F`)
* tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)

The code snippet below shows these steps in more detail.

```python
import re
from nltk.tokenize import TreebankWordTokenizer

# lowercase, then collapse line breaks and non-ASCII characters into spaces
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)
value = re.sub(r'[^\x00-\x7F]+', ' ', value)

# tokenize with the Treebank tokenizer and re-attach possessive 's
tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)
```

### BibTeX entry and citation info

```bibtex
@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}
```

### Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center, and by the National Library of Medicine of the National Institutes of Health under award number 4R00LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

### Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.