|
--- |
|
datasets: |
|
- multi_nli |
|
- snli |
|
- scitail |
|
metrics: |
|
- accuracy |
|
- f1 |
|
pipeline_tag: zero-shot-classification |
|
language: |
|
- en |
|
model-index: |
|
- name: AntoineBlanot/flan-t5-xxl-classif-3way |
|
results: |
|
- task: |
|
type: nli |
|
name: Natural Language Inference |
|
dataset: |
|
type: multi_nli |
|
name: MultiNLI |
|
split: validation_matched |
|
metrics: |
|
- type: accuracy |
|
value: 0.9230769230769231 |
|
name: Validation (matched) accuracy |
|
- type: f1 |
|
value: 0.9225172687920663 |
|
name: Validation (matched) f1 |
|
|
|
- task: |
|
type: nli |
|
name: Natural Language Inference |
|
dataset: |
|
type: multi_nli |
|
name: MultiNLI |
|
split: validation_mismatched |
|
metrics: |
|
- type: accuracy |
|
value: 0.9222945484133441 |
|
name: Validation (mismatched) accuracy |
|
|
|
- type: f1 |
|
value: 0.9216699467726924 |
|
name: Validation (mismatched) f1 |
|
|
|
- task: |
|
type: nli |
|
name: Natural Language Inference |
|
dataset: |
|
type: snli |
|
name: SNLI |
|
split: validation |
|
metrics: |
|
- type: accuracy |
|
value: 0.9418817313554155 |
|
name: Validation accuracy |
|
|
|
- type: f1 |
|
value: 0.9416213776111287 |
|
name: Validation f1 |
|
|
|
- task: |
|
type: nli |
|
name: Natural Language Inference |
|
dataset: |
|
type: scitail |
|
name: SciTail |
|
split: validation |
|
metrics: |
|
- type: accuracy |
|
value: 0.9662576687116564 |
|
name: Validation accuracy |
|
|
|
- type: f1 |
|
value: 0.6471347983817357 |
|
name: Validation f1 |
|
|
|
--- |
|
# T5ForSequenceClassification |
|
**T5ForSequenceClassification** adapts the original [T5](https://github.com/google-research/text-to-text-transfer-transformer) architecture for sequence classification tasks. |
|
|
|
T5 was originally built for text-to-text tasks and excels at them.

It can handle any NLP task that has been converted to a text-to-text format, including sequence classification!

You can find [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) how the original T5 is used for sequence classification.
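For reference, here is a minimal sketch of that text-to-text route with the stock `transformers` library. The prompt mirrors the Hub widget example linked above; `google/flan-t5-base` is used only to keep the example small.

```python
# Text-to-text classification with the original (Flan-)T5: the model literally
# generates the label as text (e.g. "it is not possible to tell").
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = (
    "Premise: At my age you will probably have learnt one lesson. "
    "Hypothesis: It's not certain how many lessons you'll learn by your thirties. "
    "Does the premise entail the hypothesis?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```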
|
|
|
Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require generating text, so a large decoder is unnecessary.

By removing the decoder we can roughly *halve the original number of parameters* (and thus the computation cost) and *efficiently optimize* the network for the given task.
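A quick way to sanity-check that parameter split, here on the much smaller `google/flan-t5-base` purely for illustration:

```python
# Compare the parameter count of the full encoder-decoder T5 with the encoder
# alone (the encoder figure includes the shared token embeddings).
from transformers import T5Model

t5 = T5Model.from_pretrained("google/flan-t5-base")
total = sum(p.numel() for p in t5.parameters())
encoder_only = sum(p.numel() for p in t5.encoder.parameters())
print(f"full model: {total / 1e6:.0f}M parameters, encoder only: {encoder_only / 1e6:.0f}M")
```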
|
|
|
## Table of Contents |
|
|
|
0. [Usage](#usage) |
|
1. [Why use T5ForSequenceClassification?](#why-use-t5forsequenceclassification) |
|
2. [T5ForClassification vs T5](#t5forclassification-vs-t5) |
|
3. [Results](#results) |
|
|
|
## Usage |
|
**T5ForSequenceClassification** supports the task of zero-shot classification. |
|
It can be used directly for:
|
- topic classification |
|
- intent recognition |
|
- boolean question answering |
|
- sentiment analysis |
|
- and any other task whose goal is to classify a text...
|
|
|
Since the *T5ForClassification* class is currently not supported by the transformers library, you cannot directly use this model on the Hub.
|
To use **T5ForSequenceClassification**, you will have to install additional packages and model weights. |
|
You can find instructions [here](https://github.com/AntoineBlanot/zero-nlp). |
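The zero-shot recipe itself is the standard NLI one: each candidate label is turned into a hypothesis, the input text is used as the premise, and the label whose hypothesis gets the highest entailment score wins. A minimal sketch, where `nli_entailment_prob` is a hypothetical placeholder for whatever inference function the zero-nlp instructions leave you with:

```python
from typing import Callable, List

def zero_shot_classify(
    text: str,
    candidate_labels: List[str],
    nli_entailment_prob: Callable[[str, str], float],  # hypothetical: P(entailment | premise, hypothesis)
    hypothesis_template: str = "This example is about {}.",
) -> str:
    # Score each candidate label by how strongly the model thinks the text
    # (premise) entails the label hypothesis, then pick the best one.
    scores = {
        label: nli_entailment_prob(text, hypothesis_template.format(label))
        for label in candidate_labels
    }
    return max(scores, key=scores.get)

# Example call (once you have a concrete nli_entailment_prob):
# zero_shot_classify("I loved this movie!", ["positive", "negative"], nli_entailment_prob)
```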
|
|
|
|
|
## Why use T5ForSequenceClassification? |
|
Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture like [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge) have shown very strong performance on sequence classification tasks and are still widely used today.

However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), resulting in limited knowledge compared to bigger models.

On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovations with this architecture are very recent and keep coming ([mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).
|
|
|
## T5ForClassification vs T5 |
|
**T5ForClassification** Architecture (a rough code sketch follows the list):
|
- Encoder: same as original T5 |
|
- Decoder: only the first layer (for pooling purposes)
|
- Classification head: simple Linear layer on top of the decoder |
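For concreteness, here is a rough PyTorch sketch of that layout, assembled from the stock `transformers` T5 modules. It only illustrates the encoder + single-decoder-layer + linear-head idea; the released checkpoint's exact pooling and head details may differ, and `num_labels=3` simply matches the 3-way setup in the model name.

```python
import torch
import torch.nn as nn
from transformers import T5Model

class T5ForClassificationSketch(nn.Module):
    def __init__(self, name: str = "google/flan-t5-base", num_labels: int = 3):
        super().__init__()
        t5 = T5Model.from_pretrained(name)
        self.encoder = t5.encoder                    # full T5 encoder, unchanged
        self.decoder = t5.decoder
        self.decoder.block = self.decoder.block[:1]  # keep only the first decoder layer
        self.head = nn.Linear(t5.config.d_model, num_labels)
        self.start_id = t5.config.decoder_start_token_id

    def forward(self, input_ids, attention_mask=None):
        enc = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # A single decoder step over the encoder states acts as a learned pooler.
        dec_in = torch.full(
            (input_ids.size(0), 1), self.start_id,
            dtype=torch.long, device=input_ids.device,
        )
        dec = self.decoder(
            input_ids=dec_in,
            encoder_hidden_states=enc.last_hidden_state,
            encoder_attention_mask=attention_mask,
        )
        return self.head(dec.last_hidden_state[:, 0])  # class logits
```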
|
|
|
Benefits and Drawbacks: |
|
- (**+**) Keeps T5 encoding strength |
|
- (**+**) Parameter count is roughly halved
|
- (**+**) Interpretable outputs (class logits) |
|
- (**+**) No generation mistakes and faster prediction (no generation latency) |
|
- (**-**) Loses the text-to-text ability
|
|
|
## Results |
|
Results on the validation data of **training tasks**: |
|
| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| MNLI (m) | 0.923 | 0.923 |
| MNLI (mm) | 0.922 | 0.922 |
| SNLI | 0.942 | 0.942 |
| SciTail | 0.966 | 0.647 |
|
|
|
Results on the validation data of **unseen tasks** (zero-shot):

| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| ? | ? | ? |
|
|
|
Special thanks to [philschmid](https://huggingface.co/philschmid) for making a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16. |
|
|