---
language:
- en
license: apache-2.0
library_name: timm
tags:
- mobile
- vision
- image-classification
datasets:
- imagenet-1k
metrics:
- accuracy
---

# EfficientFormer-L7

## Table of Contents
- [EfficientFormer-L7](#efficientformer-l7)
  - [Table of Contents](#table-of-contents)
  - [Model Details](#model-details)
  - [How to Get Started with the Model](#how-to-get-started-with-the-model)
  - [Uses](#uses)
      - [Direct Use](#direct-use)
  - [Limitations and Biases](#limitations-and-biases)
  - [Training](#training)
      - [Training Data](#training-data)
      - [Training Procedure](#training-procedure)
  - [Evaluation Results](#evaluation-results)
  - [Citation Information](#citation-information)


<model_details>

## Model Details

EfficientFormer-L7, developed by [Snap Research](https://github.com/snap-research), is one of three EfficientFormer models. The EfficientFormer models were released as part of an effort to show that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

This checkpoint of EfficientFormer-L7 was trained for 300 epochs.

- Developed by: Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren
- Language(s): English
- License: This model is licensed under the Apache-2.0 license
- Resources for more information:
  - [Research Paper](https://arxiv.org/abs/2206.01191)
  - [GitHub Repo](https://github.com/snap-research/EfficientFormer/)

</model_details>

<how_to_start>

## How to Get Started with the Model 

Use the code below to get started with the model.

```python
import requests
import torch
from PIL import Image

from transformers import EfficientFormerImageProcessor, EfficientFormerForImageClassificationWithTeacher

# Load a COCO image of two cats to test the model
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load preprocessor and pretrained model
model_name = "huggingface/efficientformer-l7-300"
processor = EfficientFormerImageProcessor.from_pretrained(model_name)
model = EfficientFormerForImageClassificationWithTeacher.from_pretrained(model_name)

# Preprocess input image
inputs = processor(images=image, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Print the top ImageNet-1K class prediction
logits = outputs.logits
scores = torch.nn.functional.softmax(logits, dim=1)
top_pred_class = torch.argmax(scores, dim=1).item()
# Map the class index to its human-readable label from the model config
print(f"Predicted class: {model.config.id2label[top_pred_class]}")
```
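
To inspect more than the single best class, the logits can be turned into a ranked list of predictions. The snippet below is a minimal sketch, not part of the original release: `top_k_predictions` is a hypothetical helper, and the dummy logits stand in for `outputs.logits` from the example above.

```python
import torch

def top_k_predictions(logits, k=5, id2label=None):
    """Return the k most likely (label, probability) pairs for the first image.

    `id2label` is an optional mapping from class index to label, for example
    `model.config.id2label`; without it the class index is used as the label.
    """
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(k, dim=-1)
    results = []
    for prob, idx in zip(top_probs[0].tolist(), top_ids[0].tolist()):
        label = id2label[idx] if id2label is not None else str(idx)
        results.append((label, prob))
    return results

# Dummy logits for a 4-class problem (a stand-in for outputs.logits)
dummy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
top2 = top_k_predictions(dummy_logits, k=2)
print(top2)
```

With the real model, calling `top_k_predictions(outputs.logits, k=5, id2label=model.config.id2label)` should list the five highest-scoring ImageNet-1K labels.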
</how_to_start>

<uses>

## Uses

#### Direct Use

This model can be used for image classification and semantic segmentation. On mobile devices (the model was tested on an iPhone 12), the CoreML checkpoints perform these tasks with low latency.

</uses>

<Limitations_and_Biases>

## Limitations and Biases

Though most designs in EfficientFormer are general-purpose, e.g., the dimension-consistent design and the 4D block with CONV-BN fusion, the actual speed of EfficientFormer may vary on other platforms. For instance, if GeLU is not well supported while HardSwish is efficiently implemented on specific hardware and compiler, the operator may need to be modified accordingly. The proposed latency-driven slimming is simple and fast. However, better results may be achieved if search cost is not a concern and an enumeration-based brute-force search is performed.
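
The operator change mentioned above can be sketched in PyTorch as a recursive module swap. This is an illustrative assumption, not the authors' procedure: `replace_gelu_with_hardswish` is a hypothetical helper, the toy network stands in for a real model, and a swapped network would most likely need fine-tuning or retraining because the two activations are not numerically equivalent.

```python
import torch
from torch import nn

def replace_gelu_with_hardswish(module: nn.Module) -> int:
    """Recursively replace every nn.GELU submodule with nn.Hardswish.

    Returns the number of activations that were replaced.
    """
    replaced = 0
    # Copy the children into a list so the module can be modified while looping
    for name, child in list(module.named_children()):
        if isinstance(child, nn.GELU):
            setattr(module, name, nn.Hardswish())
            replaced += 1
        else:
            replaced += replace_gelu_with_hardswish(child)
    return replaced

# Toy network with nested GELU activations (hypothetical stand-in)
toy = nn.Sequential(
    nn.Linear(8, 16),
    nn.GELU(),
    nn.Sequential(nn.Linear(16, 16), nn.GELU()),
    nn.Linear(16, 4),
)
num_replaced = replace_gelu_with_hardswish(toy)
print(f"Replaced {num_replaced} GELU activations")
```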

Since the model was trained on ImageNet-1K, the [biases embedded in that dataset](https://huggingface.co/datasets/imagenet-1k#considerations-for-using-the-data) will be reflected in the EfficientFormer models.

</Limitations_and_Biases>

<Training>

## Training

#### Training Data

This model was trained on ImageNet-1K.
 
See the [data card](https://huggingface.co/datasets/imagenet-1k) for additional information.

#### Training Procedure

* Parameters: 82.2 M
* Training epochs: 300

Trained on a cluster with NVIDIA A100 and V100 GPUs.
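
A parameter count like the one listed above can be checked for any PyTorch module by summing the sizes of its parameter tensors. The following is a minimal sketch: `count_parameters` is a hypothetical helper, and the toy network is only a stand-in so the snippet runs without downloading weights.

```python
from torch import nn

def count_parameters(model: nn.Module, trainable_only: bool = False) -> int:
    """Sum the number of elements over all (optionally only trainable) parameters."""
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )

# Toy network (hypothetical stand-in for the loaded EfficientFormer model)
toy = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
total = count_parameters(toy)
print(f"{total} parameters ({total / 1e6:.4f} M)")
```

Passing the model loaded in the getting-started example instead of `toy` would report its parameter count in the same way.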

</Training>

<Eval_Results>

## Evaluation Results

* Top-1 accuracy: 83.3% on ImageNet-1K
* Latency: 7.0 ms
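
For reference, the two metrics can be computed as sketched below. This is a rough illustration under stated assumptions: `top1_accuracy` and `mean_latency_ms` are hypothetical helpers, the toy network and hand-written logits stand in for the real model and validation set, and wall-clock timing of a PyTorch model on a desktop CPU is not comparable to an on-device CoreML measurement.

```python
import time

import torch
from torch import nn

def top1_accuracy(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class matches the target."""
    preds = logits.argmax(dim=-1)
    return (preds == targets).float().mean().item()

@torch.no_grad()
def mean_latency_ms(model: nn.Module, example: torch.Tensor,
                    warmup: int = 3, runs: int = 10) -> float:
    """Average forward-pass time in milliseconds after a few warm-up runs."""
    model.eval()
    for _ in range(warmup):
        model(example)
    start = time.perf_counter()
    for _ in range(runs):
        model(example)
    return (time.perf_counter() - start) * 1000.0 / runs

# Hand-written logits for four samples and two classes; three are correct
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]])
targets = torch.tensor([0, 1, 1, 1])
acc = top1_accuracy(logits, targets)
print(f"Top-1 accuracy: {acc:.2%}")

# Toy convolutional classifier (hypothetical stand-in for the real model)
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
latency = mean_latency_ms(toy, torch.randn(1, 3, 224, 224))
print(f"Mean latency: {latency:.2f} ms")
```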

</Eval_Results>

<Cite>

## Citation Information

```bibtex
@article{li2022efficientformer,
  title={EfficientFormer: Vision Transformers at MobileNet Speed},
  author={Li, Yanyu and Yuan, Geng and Wen, Yang and Hu, Eric and Evangelidis, Georgios and Tulyakov, Sergey and Wang, Yanzhi and Ren, Jian},
  journal={arXiv preprint arXiv:2206.01191},
  year={2022}
}
```
</Cite>