---
pipeline_tag: image-classification
tags:
- arxiv:2010.07611
- arxiv:2104.00298
license: cc-by-nc-4.0
---

To be clear, this model is tailored to my image and video classification tasks, not to ImageNet.
I built EfficientNetV2.5-S to outperform the existing EfficientNet B0 to B4, pruned EfficientNet B1 to B4 (I pruned B4 myself), and EfficientNetV2 T to L models, whether trained with TensorFlow or PyTorch,
in terms of top-1 accuracy, efficiency, and robustness on my dataset and the [CMAD benchmark](https://huggingface.co/datasets/aistrova/CMAD).

## Model Details
- **Model tasks:** Image classification / video classification / feature backbone
- **Model stats:**
  - Params: 16.64 M
  - Multiply-Add Operations: 5.32 G
  - Image size: train = 299x299 / 304x304, test = 304x304
  - Classification layer: defaults to 1,000 classes
- **Papers:**
  - EfficientNetV2: Smaller Models and Faster Training: https://arxiv.org/abs/2104.00298
  - Layer-adaptive sparsity for the Magnitude-based Pruning: https://arxiv.org/abs/2010.07611
- **Dataset:** ImageNet-1k
- **Pretrained:** Yes, though only briefly; further pretraining is recommended
- **Original:** This model architecture is original

<br>

### Load PyTorch JIT Model with 1,000 Classes
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("FredZhang7/efficientnetv2.5_rw_s", trust_remote_code=True)
```
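The model expects a float batch at the 304x304 test resolution listed above. Below is a minimal preprocessing sketch; the helper name `preprocess` and the [0, 1] scaling are my assumptions (the custom-classes example further down also normalizes its example inputs to [0, 1]):

```python
import torch

def preprocess(image_uint8: torch.Tensor) -> torch.Tensor:
    """Scale a (3, 304, 304) uint8 image tensor (already resized) to [0, 1]
    floats and add a batch dimension."""
    x = image_uint8.float() / 255.0
    return x.unsqueeze(0)  # shape: (1, 3, 304, 304)

batch = preprocess(torch.randint(0, 256, (3, 304, 304), dtype=torch.uint8))
# logits = model(batch)  # with the model loaded above
```

With the model loaded as shown, `model(batch)` should return a `(1, 1000)` logits tensor for the default 1,000-class head.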

### Load Model with Custom Classes
To change the number of classes, replace the linear classification layer. 
Here's an example of how to convert the architecture into a trainable model.
```bash
pip install ptflops timm
```
```python
from ptflops import get_model_complexity_info
import torch
import urllib.request

nclass = 3                  # number of classes in your dataset
input_size = (3, 304, 304)  # recommended image input size
print_layer_stats = True    # prints the statistics for each layer of the model
verbose = True              # prints additional info about the MAC calculation

# Download the model. Skip this step if already downloaded
base_model = "efficientnetv2.5_base_in1k"
url = f"https://huggingface.co/FredZhang7/efficientnetv2.5_rw_s/resolve/main/{base_model}.pth"
file_name = f"./{base_model}.pth"
urllib.request.urlretrieve(url, file_name)

shape = (2,) + input_size
example_inputs = torch.randn(shape)
example_inputs = (example_inputs - example_inputs.min()) / (example_inputs.max() - example_inputs.min())

model = torch.load(file_name, weights_only=False)  # the checkpoint stores a full nn.Module
model.classifier = torch.nn.Linear(in_features=1984, out_features=nclass, bias=True)
macs, nparams = get_model_complexity_info(model, input_size, as_strings=False, print_per_layer_stat=print_layer_stats, verbose=verbose)
traced_model = torch.jit.trace(model, example_inputs)

model_name = f'{base_model}_{nparams / 1e6:.2f}M_{macs / 1e9:.2f}G.pth'
traced_model.save(model_name)

# Load the trainable model (TorchScript archives are loaded with torch.jit.load)
model = torch.jit.load(model_name)
```
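Once reloaded, the traced model can be fine-tuned like any `nn.Module`. The sketch below shows a single training step; the optimizer and loss choices are illustrative, and a tiny stand-in network replaces the real model so the snippet runs on its own (in practice, use the model loaded above):

```python
import torch

nclass = 3

# Stand-in classifier so this snippet is self-contained; substitute the
# traced model loaded above in real use.
model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(3, nclass),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

images = torch.rand(4, 3, 304, 304)   # dummy batch scaled to [0, 1]
labels = torch.randint(0, nclass, (4,))

model.train()
optimizer.zero_grad()
loss = criterion(model(images), labels)  # forward pass + cross-entropy
loss.backward()                          # accumulate gradients
optimizer.step()                         # update weights
```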

### Top-1 Accuracy Comparisons
I fine-tuned the existing models at 299x299, 304x304, 320x320, or 384x384 resolution, depending on each model's pretraining input size and its VRAM usage.

`efficientnet_b3_pruned` achieved the second-highest top-1 accuracy, as well as the highest epoch-1 training accuracy, on my task out of EfficientNetV2.5-S and all existing EfficientNet models my 24 GB RTX 3090 could handle.

I will publish the detailed report in [this model repository](https://huggingface.co/aistrova/safesearch-v5.0).
This repository contains only the base model, briefly pretrained on ImageNet, not the model fine-tuned for my task.

### Carbon Emissions
Comparing all models and testing my new architectures cost roughly 648 GPU hours over a span of 35 days.