---
title: README
emoji: 👐
colorFrom: orange
colorTo: indigo
sdk: static
pinned: false
---


<div>
  <img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/logo.png" width="380" alt="CALM Logo" />
  <p class="mb-2" style="font-size:30px;font-weight:bold">
    CALM: Collaborative Arabic Language Model
  </p>
  <p class="mb-2">
    The CALM project is a joint effort led by <u><a target="_blank" href="https://sdaia.gov.sa/ncai/?Lang=en">NCAI</a></u> in collaboration with 
    <u><a target="_blank" href="https://yandex.com/">Yandex</a></u>, <u><a href="https://huggingface.co/">Hugging Face</a></u>, and <u><a href="http://www.washington.edu/">UW</a></u> to train an Arabic language model with 
    volunteers from around the globe. The project is an adaptation of the framework proposed at the NeurIPS 2021 demonstration: 
    <u><a target="_blank" href="https://huggingface.co/training-transformers-together">Training Transformers Together</a></u>.
  </p>
  <p class="mb-2">
  One of the main obstacles facing many researchers in the Arabic NLP community is the lack of computing resources needed for training large models. Models with 
  leading performance on Arabic NLP tasks, such as <u><a target="_blank" href="https://github.com/aub-mind/arabert">AraBERT</a></u>, 
  <u><a href="https://github.com/CAMeL-Lab/CAMeLBERT" target="_blank">CAMeLBERT</a></u>, 
  <u><a href="https://huggingface.co/aubmindlab/araelectra-base-generator" target="_blank">AraELECTRA</a></u>, and 
  <u><a href="https://huggingface.co/qarib">QARiB</a></u>, 
  took days to train on TPUs. In the spirit of AI democratization and community enablement, core values at NCAI, CALM aims to demonstrate the effectiveness 
  of collaborative training and to form a community of volunteers for Arabic NLP (ANLP) researchers who have access to only basic cloud GPUs and wish to train their own models collaboratively.
  </p>
  <p class="mb-2">
    CALM trains a single BERT model on a dataset that combines MSA, from the OSCAR corpus and Arabic Wikipedia, with dialectal data for the Gulf region drawn from existing open-source datasets. 
    Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the background to reduce local 
    memory consumption. Computing and aggregating the gradients is performed in a distributed manner, based on the computing abilities of each participating 
    volunteer. Details of the distributed training process are further described in the paper 
    <u><a target="_blank" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">Distributed Deep Learning in Open Collaborations</a></u>.
  </p>
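  <p class="mb-2">
    For the curious, the core of this setup can be sketched with the <u><a target="_blank" href="https://github.com/learning-at-home/hivemind">hivemind</a></u> library that the framework above builds on. The run ID, batch sizes, and model below are illustrative placeholders, not CALM's actual configuration.
  </p>
  <pre><code>
# A minimal, illustrative sketch of collaborative training with hivemind.
# The run_id, batch sizes, and model are hypothetical placeholders.
import hivemind
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 2)  # stand-in for the real BERT model
base_opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Start (or join) the distributed hash table peers use to find each other.
# A real volunteer would pass initial_peers=[...] to join the existing swarm.
dht = hivemind.DHT(start=True)

# Wrap the local optimizer: each peer accumulates gradients at its own pace,
# and an averaging round fires once the swarm reaches the target batch size.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="calm_demo",          # hypothetical experiment name
    optimizer=base_opt,
    target_batch_size=4096,      # global batch size across all volunteers
    batch_size_per_step=8,       # this peer's contribution per step
    use_local_updates=True,
    verbose=True,
)

# Dummy loop standing in for batches streamed from the Arabic corpus.
for step in range(100):
    x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()                   # steps locally; averages when the swarm is ready
    opt.zero_grad()
  </code></pre>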
  
  <p class="mb-2" style="font-size:20px;font-weight:bold">
    How to participate in training?
  </p>
  <p class="mb-2">
  To join the collaborative training, all you have to do is keep a notebook running for at <b>least 15 minutes</b>; you're free to close it after that and join again 
  at another time. There are a few steps to complete before running the notebook:
  </p>
  
  <ul class="mb-2">
    <li>👉 Create an account on <u><a target="_blank" href="https://huggingface.co">Hugging Face</a></u>.</li>
    <li>👉 Join the <u><a target="_blank" href="https://huggingface.co/CALM">NCAI-CALM Organization</a></u> on Hugging Face through the invitation link shared with you by email.</li>
    <li>👉 Get your Access Token; it's required later in the notebook.</li>
  </ul>
  
  <p class="h2 mb-2" style="font-size:18px;font-weight:bold">How to get my Huggingface Access Token</p>
  <ul class="mb-2">
    <li>πŸ‘‰ Go to your <u><a target="_blank" href="https://huggingface.co">HF account</a></u>.</li>
    <li>πŸ‘‰ Go to Settings β‡’ Access Tokens.</li>
    <li>πŸ‘‰ Generate a new Access Token and enter any name for "what's this token for".</li>
    <li>πŸ‘‰ Select <code>read</code> role.</li>
    <li>πŸ‘‰ Copy your access token.</li>
    <li>πŸ‘‰ In cell 4, it will ask you for an Access Token, paste it there.</li>
  </ul>
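  <p class="mb-2">
    For reference, a token-prompt cell typically looks like the following minimal sketch; the actual contents of cell 4 may differ.
  </p>
  <pre><code>
# A minimal sketch of a token-prompt cell; the actual cell 4 may differ.
from huggingface_hub import notebook_login

# Opens an input widget in the notebook; paste the `read` token you copied.
notebook_login()
  </code></pre>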
  
  <p class="mb-2" style="font-size:20px;font-weight:bold">
    Start training
  </p>
  <p class="mb-2">Pick one of the following methods to run the training code.
  <br /><em>NOTE: Kaggle gives you around 40 hrs per week of GPU time, so it's preferred over Colab, unless you have Colab Pro or Colab Pro+.</em></p>
  <ul class="mb-2">
    <li>👉 <span><a href="https://www.kaggle.com/prmais/volunteer-gpu-notebook">
    <img style="display:inline;margin:0px" src="https://img.shields.io/badge/kaggle-Open%20in%20Kaggle-blue.svg"/>
    </a></span> <b> (recommended)</b> <br />
    </li>
    <li>👉 <span><a href="https://colab.research.google.com/github/NCAI-Research/CALM/blob/main/notebooks/volunteer-gpu-notebook.ipynb">
    <img style="display:inline;margin:0px" src="https://colab.research.google.com/assets/colab-badge.svg"/>
    </a></span>
    </li>
    <li>👉 Running locally: if you have your own GPUs, please visit our Discord channel for instructions on setting them up (a quick GPU check is sketched below).
    </li>
  </ul>
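  <p class="mb-2">
    Before joining from a local machine, it's worth confirming that PyTorch can see your GPU. The snippet below is a generic check, not part of the official setup instructions.
  </p>
  <pre><code>
# Generic sanity check that a CUDA-capable GPU is visible to PyTorch.
import torch

if torch.cuda.is_available():
    print(f"Found {torch.cuda.device_count()} GPU(s): {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; consider the Kaggle or Colab notebooks instead.")
  </code></pre>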
  
  <p class="mb-2" style="font-size:20px;font-weight:bold">
    Issues or questions?
  </p>
  
  <p class="mb-2">
    Feel free to reach us on <u><a target="_blank" href="https://discord.gg/peU5Nx77">Discord</a></u> if you have any questions 🙂
  </p>
</div>