---

language: ar
datasets:
- Marefa-NER
---


# Tebyan تبيـان
## Marefa Arabic Named Entity Recognition Model
## نموذج المعرفة لتصنيف أجزاء النص
---------
**Version**: 1.0.1

**Last Update:** 16-05-2021

## Model description

**Marefa-NER** is a large Arabic Named Entity Recognition (NER) model trained on a completely new dataset. It aims to extract up to 9 different types of entities:
```

Person, Location, Organization, Nationality, Job, Product, Event, Time, Art-Work

```

Marefa's model for classifying text spans is entirely new in terms of the data used to train it. It targets up to 9 different types of text spans, listed here in Arabic:
```

شخص - مكان - منظمة - جنسية - وظيفة - منتج - حدث - توقيت - عمل إبداعي

```
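The nine entity types map onto the 19 tags that appear in `labels_list` in the usage example: one `O` tag plus a `B-` (begin) and an `I-` (inside) tag per type. The sketch below rebuilds that list and an id-to-label lookup. The helper name `build_bio_labels` and the `id2label`/`label2id` dictionaries are illustrative assumptions, not part of the model's published API, and the type order is copied from `labels_list`.

```python
# Entity types in the same order as `labels_list` in the usage example.
ENTITY_TYPES = ["nationality", "event", "person", "artwork", "location",
                "product", "organization", "job", "time"]


def build_bio_labels(entity_types):
    """Return ['O', B-<type>..., I-<type>...] for the given entity types."""
    begin = [f"B-{t}" for t in entity_types]
    inside = [f"I-{t}" for t in entity_types]
    return ["O"] + begin + inside


labels = build_bio_labels(ENTITY_TYPES)

# Hypothetical lookups between class indices and tag names.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))   # 19
print(id2label[3])   # B-person
```

A lookup like this is what lets a raw class index such as `LABEL_3` be turned back into a readable tag.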

## How to use كيف تستخدم النموذج

Install `transformers` and `nltk` (Python >= 3.6):

`$ pip3 install transformers==4.6.0 nltk==3.5 protobuf==3.15.3 torch==1.7.1`

> If you are using `Google Colab`, please restart your runtime after installing the packages.

-----------

```python
import nltk

# NLTK's punkt data is required for word tokenization
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# labels list
labels_list = ['O', 'B-nationality', 'B-event', 'B-person', 'B-artwork', 'B-location', 'B-product', 'B-organization', 'B-job', 'B-time', 'I-nationality', 'I-event', 'I-person', 'I-artwork', 'I-location', 'I-product', 'I-organization', 'I-job', 'I-time']

# ===== import the model
m_name = "marefa-nlp/marefa-ner"
tokenizer = AutoTokenizer.from_pretrained(m_name)
model = AutoModelForTokenClassification.from_pretrained(m_name)

# ===== build the NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

# ===== extract the entities from a sample text
example = 'خاضت القوات المصرية حرب السادس من أكتوبر ضد الجيش الصهيوني عام 1973'
# clean the text
example = " ".join(word_tokenize(example))
# feed to the NER model to parse
ner_results = nlp(example)

# ===== post-process: merge adjacent pieces into full entity spans
modified_results = []
for ent in ner_results:
    # map a raw "LABEL_<n>" group to its readable name from labels_list
    if ent["entity_group"].lower().replace("label_", "").isnumeric():
        ent["entity_group"] = int(ent["entity_group"].lower().replace("label_", ""))
        ent["entity_group"] = labels_list[ent["entity_group"]]

    # a piece that starts exactly where the previous one ended continues that entity
    if len(modified_results) > 0 and ent["start"] == modified_results[-1]["end"]:
        modified_results[-1]["word"] += f"{ent['word']}".replace("▁", " ").strip()
        modified_results[-1]["word"] = modified_results[-1]["word"].replace("▁", " ").strip()
        # average the merged score with the new piece's score
        modified_results[-1]["score"] = sum([modified_results[-1]["score"], ent["score"]]) / 2
        modified_results[-1]["end"] = ent["end"]
    else:
        modified_results.append(ent)

for res in modified_results:
    print(res["word"], "==>", res["entity_group"])

#####
# القوات المصرية ==> organization
# حرب السادس من أكتوبر ==> event
# الجيش الصهيوني ==> organization
# عام 1973 ==> time
####
```
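The post-processing step in the example can also be written as two small functions, which makes it possible to try the merging logic without downloading the model. This is a minimal sketch, not the authors' implementation: the names `resolve_label` and `merge_adjacent` are hypothetical, and the `sample` entries (words, scores, offsets) are invented purely for illustration. Like the loop above, it merges any two pieces whose character spans touch, without checking that their entity groups match.

```python
labels_list = ['O', 'B-nationality', 'B-event', 'B-person', 'B-artwork',
               'B-location', 'B-product', 'B-organization', 'B-job', 'B-time',
               'I-nationality', 'I-event', 'I-person', 'I-artwork',
               'I-location', 'I-product', 'I-organization', 'I-job', 'I-time']


def resolve_label(entity_group, labels):
    """Map a raw 'LABEL_<n>' group to its name; return named groups unchanged."""
    index = entity_group.lower().replace("label_", "")
    return labels[int(index)] if index.isnumeric() else entity_group


def merge_adjacent(entities):
    """Merge entities whose character spans touch (start == previous end)."""
    merged = []
    for ent in entities:
        ent = dict(ent)  # copy so the caller's dictionaries are not mutated
        if merged and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            prev["word"] = (prev["word"] + ent["word"]).replace("▁", " ").strip()
            prev["score"] = (prev["score"] + ent["score"]) / 2
            prev["end"] = ent["end"]
        else:
            merged.append(ent)
    return merged


# Hypothetical pipeline-style output, invented for illustration only.
sample = [
    {"entity_group": "time", "word": "عام", "score": 0.875, "start": 0, "end": 3},
    {"entity_group": "time", "word": "▁1973", "score": 0.625, "start": 3, "end": 8},
    {"entity_group": "event", "word": "حرب", "score": 0.5, "start": 20, "end": 23},
]

merged = merge_adjacent(sample)
for ent in merged:
    print(ent["word"], "==>", resolve_label(ent["entity_group"], labels_list))
```

Copying each dictionary before merging is a design choice of this sketch; the loop in the example above updates the pipeline's result dictionaries in place.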

## Acknowledgment شكر و تقدير

The data used to train the model was prepared by a group of volunteers who spent hours refining and reviewing it:

- على سيد عبد الحفيظ (supervisor)
- نرمين محمد عطيه 
- صلاح خيرالله
- احمد علي عبدربه
- عمر بن عبد العزيز سليمان
- محمد ابراهيم الجمال
- عبدالرحمن سلامه خلف
- إبراهيم كمال محمد سليمان
- حسن مصطفى حسن 
- أحمد فتحي سيد
- عثمان مندو
- عارف الشريف
- أميرة محمد محمود
- حسن سعيد حسن
- عبد العزيز علي البغدادي
- واثق عبدالملك الشويطر
- عمرو رمضان عقل الحفناوي
- حسام الدين أحمد على
- أسامه أحمد محمد محمد
- حاتم محمد المفتي
- عبد الله دردير
- أدهم البغدادي
- أحمد صبري
- عبدالوهاب محمد محمد
- أحمد محمد عوض