metadata

language: ar
datasets:
  - Marefa-NER

Tebyan تبيـان

Marefa Arabic Named Entity Recognition Model

نموذج المعرفة لتصنيف أجزاء النص

Version: 1.0.1

Last Update: 16-05-2021

Model description

Marefa-NER is a large Arabic Named Entity Recognition (NER) model trained on a completely new dataset. It aims to extract up to 9 different types of entities:

Person, Location, Organization, Nationality, Job, Product, Event, Time, Art-Work

نموذج المعرفة لتصنيف أجزاء النص. نموذج جديد كليا من حيث البيانات المستخدمة في تدريب النموذج. كذلك يستهدف النموذج تصنيف حتى 9 أنواع مختلفة من أجزاء النص

شخص - مكان - منظمة - جنسية - وظيفة - منتج - حدث - توقيت - عمل إبداعي
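As a rough illustration of how the nine entity types relate to the model's output tags, the sketch below rebuilds the 19-entry tag list that appears as `labels_list` in the usage code further down. It assumes the usual BIO reading of those tags (`B-` marks the first token of an entity, `I-` a continuation, `O` a token outside any entity). The helper names `build_bio_labels` and `resolve_label` are hypothetical and not part of the model or of transformers.

```python
# Entity types, in the order used by labels_list in the usage code
ENTITY_TYPES = [
    "nationality", "event", "person", "artwork", "location",
    "product", "organization", "job", "time",
]


def build_bio_labels(entity_types):
    """Return ['O', 'B-x', ..., 'I-x', ...] for the given entity types."""
    begin_tags = [f"B-{name}" for name in entity_types]
    inside_tags = [f"I-{name}" for name in entity_types]
    return ["O"] + begin_tags + inside_tags


def resolve_label(raw_label, labels):
    """Map a generic label such as 'LABEL_3' to its tag name.

    Labels that are not of the form 'LABEL_<number>' are returned unchanged.
    """
    suffix = raw_label.lower().replace("label_", "")
    return labels[int(suffix)] if suffix.isnumeric() else raw_label


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))                        # 19
print(resolve_label("LABEL_3", labels))   # B-person
```

Nine types yield 19 tags: one `O` tag plus a `B-` and an `I-` tag per type.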

How to use كيف تستخدم النموذج

Install transformers and nltk (Python >= 3.6):

$ pip3 install transformers==4.6.0 nltk==3.5 protobuf==3.15.3 torch==1.7.1

If you are using Google Colab, please restart your runtime after installing the packages.

# download the NLTK punkt data, which word_tokenize needs
from collections import defaultdict
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# labels list
labels_list = ['O', 'B-nationality', 'B-event', 'B-person', 'B-artwork', 'B-location', 'B-product', 'B-organization', 'B-job', 'B-time', 'I-nationality', 'I-event', 'I-person', 'I-artwork', 'I-location', 'I-product', 'I-organization', 'I-job', 'I-time']

# ===== import the model
m_name = "marefa-nlp/marefa-ner"
tokenizer = AutoTokenizer.from_pretrained(m_name)
model = AutoModelForTokenClassification.from_pretrained(m_name)

# ===== build the NER pipeline
nlp = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

# ===== extract the entities from a sample text
example = 'خاضت القوات المصرية حرب السادس من أكتوبر ضد الجيش الصهيوني عام 1973'
# clean the text
example = " ".join(word_tokenize(example))
# feed to the NER model to parse
ner_results = nlp(example)

# merge adjacent pieces so that each entity is reported as one full span

modified_results = []
for ent in ner_results:
  # map generic labels such as "LABEL_3" to their names in labels_list
  label_id = ent["entity_group"].lower().replace("label_", "")
  if label_id.isnumeric():
    ent["entity_group"] = labels_list[int(label_id)]

  if len(modified_results) > 0 and ent["start"] == modified_results[-1]["end"]:
    # this piece starts exactly where the previous one ended: merge the two
    prev = modified_results[-1]
    prev["word"] += ent["word"].replace("▁", " ").strip()
    prev["word"] = prev["word"].replace("▁", " ").strip()
    # the merged score is the mean of the running score and the new piece's score
    prev["score"] = (prev["score"] + ent["score"]) / 2
    prev["end"] = ent["end"]
  else:
    modified_results.append(ent)


for res in modified_results:
  print(res["word"], "==>", res["entity_group"])
  
#####
# القوات المصرية ==> organization
# حرب السادس من أكتوبر ==> event
# الجيش الصهيوني ==> organization
# عام 1973 ==> time
####
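The merging step above can also be tried without downloading the model. The following is a minimal sketch that wraps the same logic in a function and runs it on hand-written input. The `sample` list is hypothetical data shaped like the pipeline's grouped output (it was not produced by the model), and `merge_adjacent` and `group_by_type` are illustrative names rather than part of any library.

```python
from collections import defaultdict


def merge_adjacent(entities):
    """Merge entries whose character span starts where the previous one ends."""
    merged = []
    for ent in entities:
        ent = dict(ent)  # copy so the caller's data is not mutated
        if merged and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            prev["word"] += ent["word"].replace("▁", " ").strip()
            prev["word"] = prev["word"].replace("▁", " ").strip()
            prev["score"] = (prev["score"] + ent["score"]) / 2
            prev["end"] = ent["end"]
        else:
            merged.append(ent)
    return merged


def group_by_type(entities):
    """Collect entity strings under their entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hypothetical input: one word split into two pieces, then a separate entity
sample = [
    {"entity_group": "location", "word": "▁القاهر", "score": 1.0, "start": 0, "end": 6},
    {"entity_group": "location", "word": "ة", "score": 0.5, "start": 6, "end": 7},
    {"entity_group": "time", "word": "1973", "score": 1.0, "start": 13, "end": 17},
]

merged = merge_adjacent(sample)
by_type = group_by_type(merged)
print(by_type)
```

Note that the adjacency test looks only at character offsets, as in the original loop, so two touching pieces are merged even if their entity types differ; the first piece's type is kept.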

Acknowledgment شكر و تقدير

The data used to train the model was prepared by a group of volunteers who spent hours cleaning and reviewing it:

على سيد عبد الحفيظ (supervisor)
نرمين محمد عطيه
صلاح خيرالله
احمد علي عبدربه
عمر بن عبد العزيز سليمان
محمد ابراهيم الجمال
عبدالرحمن سلامه خلف
إبراهيم كمال محمد سليمان
حسن مصطفى حسن
أحمد فتحي سيد
عثمان مندو
عارف الشريف
أميرة محمد محمود
حسن سعيد حسن
عبد العزيز علي البغدادي
واثق عبدالملك الشويطر
عمرو رمضان عقل الحفناوي
حسام الدين أحمد على
أسامه أحمد محمد محمد
حاتم محمد المفتي
عبد الله دردير
أدهم البغدادي
أحمد صبري
عبدالوهاب محمد محمد
أحمد محمد عوض