---
inference: false
language: pt
datasets:
- lener_br
license: mit
pipeline_tag: token-classification
---

# DeBERTinha XSmall for NER

## Full Token Classification Example

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "sagui-nlp/debertinha-ptbr-xsmall-lenerbr"
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=13)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

with torch.no_grad():
    outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# Group SentencePiece sub-tokens back into words: a token starting with "▁"
# opens a new word, so the previous word is flushed with its predicted label.
entities = []
current_entity = []
current_label = None
for token, prediction in zip(tokens[1:-1], predictions[0].numpy()[1:-1]):
    # print((token, model.config.id2label[prediction]))
    if not len(current_entity):
        current_entity.append(token)
        current_label = model.config.id2label[prediction]
    elif token.startswith("▁"):
        entities.append(("".join(current_entity), current_label))
        current_entity = [token]
        current_label = model.config.id2label[prediction]
    else:
        current_entity.append(token)
entities.append(("".join(current_entity), current_label))

# Keep only the words predicted as part of a named entity.
list(filter(lambda x: x[1] != "O", entities))
```
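If you only need entity spans and don't want to write the grouping loop yourself, the `transformers` token-classification pipeline can aggregate sub-tokens for you. This is a minimal sketch, not part of the original card: local execution is assumed (the card's `inference: false` only disables the hosted widget), and `aggregation_strategy="first"` is chosen to mirror the first-sub-token labeling used in training.

```python
from transformers import pipeline

# Minimal sketch (not from the original card): let the pipeline group
# sub-tokens into entity spans instead of the manual "▁"-based loop above.
ner = pipeline(
    "token-classification",
    model="sagui-nlp/debertinha-ptbr-xsmall-lenerbr",
    aggregation_strategy="first",  # label each word by its first sub-token
)

text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal."
for entity in ner(text):
    # Scores come back as numpy floats, hence the explicit cast.
    print(entity["entity_group"], entity["word"], float(entity["score"]))
```

The pipeline's word reconstruction may differ slightly from the manual "▁"-based grouping above, but the predicted labels come from the same model.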
## Training notes

Training used the label of only the first sub-token of each word (`label_all_tokens = False`); special tokens and the remaining sub-tokens are set to `-100` so they are ignored by the loss.

```python
label_all_tokens = False
task = "ner"

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, max_length=512)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

dataset = dataset.map(tokenize_and_align_labels, batched=True)
```
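Because special tokens and non-first sub-tokens carry the label `-100`, evaluation has to filter those positions out before scoring. The snippet below is an illustrative sketch of such a `compute_metrics` function using `seqeval` through the `evaluate` library; it is not taken from the original training script, and deriving `label_list` from `model.config.id2label` is an assumption.

```python
import numpy as np
import evaluate

# Illustrative sketch: the use of evaluate/seqeval and the label_list source
# are assumptions, not taken from the original training script.
metric = evaluate.load("seqeval")
label_list = list(model.config.id2label.values())

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labeled -100 (special tokens and non-first sub-tokens).
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```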

## Citation

```
@misc{campiotti2023debertinha,
      title={DeBERTinha: A Multistep Approach to Adapt DebertaV3 XSmall for Brazilian Portuguese Natural Language Processing Task},
      author={Israel Campiotti and Matheus Rodrigues and Yuri Albuquerque and Rafael Azevedo and Alyson Andrade},
      year={2023},
      eprint={2309.16844},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```