camenduru committed on Sep 17

Commit 44ca469 • 1 Parent(s): fcfc8e4

thanks to jadechoghari ❤

Files changed (25)

.gitattributes +7 -0
README.md +152 -0
config.json +248 -0
figures/Computational requirements.PNG +0 -0
figures/architecture.jpg +3 -0
figures/architecture.pdf +0 -0
figures/gif_output/blur.gif +0 -0
figures/gif_output/blur.jpg +0 -0
figures/gif_output/blur_back_n_forth.gif +3 -0
figures/gif_output/haze.gif +0 -0
figures/gif_output/haze.jpg +0 -0
figures/gif_output/haze_back_n_forth.gif +3 -0
figures/gif_output/lowlight.gif +0 -0
figures/gif_output/lowlight.jpg +0 -0
figures/gif_output/lowlight_back_n_forth.gif +3 -0
figures/gif_output/rain.gif +3 -0
figures/gif_output/rain.jpg +0 -0
figures/gif_output/rain_back_n_forth.gif +3 -0
figures/qualitative_result.PNG +3 -0
figures/seen_dataset_with_synthetic_degradation.PNG +0 -0
figures/unseen_dataset_with_real_degradation.PNG +0 -0
figures/unseen_dataset_with_synthetic_degradation.PNG +0 -0
model.safetensors +3 -0
preprocessor_config.json +35 -0
robustsam_checkpoint_h.pth +3 -0

.gitattributes CHANGED

@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/architecture.jpg filter=lfs diff=lfs merge=lfs -text
+figures/gif_output/blur_back_n_forth.gif filter=lfs diff=lfs merge=lfs -text
+figures/gif_output/haze_back_n_forth.gif filter=lfs diff=lfs merge=lfs -text
+figures/gif_output/lowlight_back_n_forth.gif filter=lfs diff=lfs merge=lfs -text
+figures/gif_output/rain_back_n_forth.gif filter=lfs diff=lfs merge=lfs -text
+figures/gif_output/rain.gif filter=lfs diff=lfs merge=lfs -text
+figures/qualitative_result.PNG filter=lfs diff=lfs merge=lfs -text
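
The lines added to `.gitattributes` route the larger figures through Git LFS. As a rough illustration only, the sketch below extracts the LFS-tracked patterns from attribute text and checks a path against them. The helper names `lfs_patterns` and `is_lfs_tracked` are hypothetical, the sample text is a shortened copy of the entries above, and `fnmatch` only approximates Git's own pattern-matching rules.

```python
from fnmatch import fnmatch

# Shortened sample of the attribute lines shown in the diff above.
GITATTRIBUTES = """\
*.zip filter=lfs diff=lfs merge=lfs -text
figures/architecture.jpg filter=lfs diff=lfs merge=lfs -text
figures/gif_output/rain.gif filter=lfs diff=lfs merge=lfs -text
figures/qualitative_result.PNG filter=lfs diff=lfs merge=lfs -text
"""


def lfs_patterns(text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and comments.
        if not parts or parts[0].startswith("#"):
            continue
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(path, patterns):
    """Approximate check: Git's real matching has more rules than fnmatch."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(path, p) or fnmatch(name, p) for p in patterns)


patterns = lfs_patterns(GITATTRIBUTES)
print(is_lfs_tracked("figures/gif_output/rain.gif", patterns))  # True
print(is_lfs_tracked("figures/gif_output/blur.gif", patterns))  # False
```

This matches the file list at the top of the commit, where `rain.gif` appears as a three-line LFS pointer (`+3 -0`) while `blur.gif` is stored as a regular binary file.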

README.md ADDED

	@@ -0,0 +1,152 @@

+---
+library_name: transformers
+license: mit
+---
+# RobustSAM: Segment Anything Robustly on Degraded Images (CVPR 2024 Highlight)
+# Model Card for ViT Huge (ViT-H) version
+<a href="https://colab.research.google.com/drive/1mrOjUNFrfZ2vuTnWrfl9ebAQov3a9S6E?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
+[![Huggingfaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/robustsam/robustsam/tree/main)
+Official repository for RobustSAM: Segment Anything Robustly on Degraded Images
+[Project Page](https://robustsam.github.io/) | [Paper](https://arxiv.org/abs/2406.09627) | [Video](https://www.youtube.com/watch?v=Awukqkbs6zM) | [Dataset](https://huggingface.co/robustsam/robustsam/tree/main/dataset)
+## Introduction
+Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation, acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless, its performance is challenged by images with degraded quality. Addressing this limitation, we propose the Robust Segment Anything Model (RobustSAM), which enhances SAM's performance on low-quality images while preserving its promptability and zero-shot generalization.
+Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs, demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset, a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. Extensive experiments across various segmentation tasks and datasets confirm RobustSAM's superior performance, especially under zero-shot conditions, underscoring its potential for extensive real-world application. Additionally, our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring.
+**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copy-pasted from the original [SAM model card](https://github.com/facebookresearch/segment-anything).
+# Model Details
+The RobustSAM model is made up of 4 modules:
+  - The `VisionEncoder`: a ViT-based image encoder. It computes the image embeddings using attention on patches of the image. Relative positional embeddings are used.
+  - The `PromptEncoder`: generates embeddings for points and bounding boxes.
+  - The `MaskDecoder`: a two-way transformer that performs cross-attention between the image embeddings and the point embeddings (->) and between the point embeddings and the image embeddings. The outputs are fed to the `Neck`.
+  - The `Neck`: predicts the output masks based on the contextualized masks produced by the `MaskDecoder`.
+# Usage
+## Prompted-Mask-Generation
+```python
+from PIL import Image
+import requests
+import torch
+from transformers import AutoProcessor, AutoModelForMaskGeneration
+# use the GPU when one is available, otherwise fall back to the CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# load the RobustSAM model and processor, and move the model to the chosen device
+processor = AutoProcessor.from_pretrained("jadechoghari/robustsam-vit-huge")
+model = AutoModelForMaskGeneration.from_pretrained("jadechoghari/robustsam-vit-huge").to(device)
+# load an image from a url
+img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
+# we define input points (2D localization of an object in the image)
+input_points = [[[450, 600]]]  # example point
+```
+```python
+# process the image and input points, placing the tensors on the same device as the model
+inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
+# generate masks using the model
+with torch.no_grad():
+    outputs = model(**inputs)
+masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
+scores = outputs.iou_scores
+```
+Among other arguments to generate masks, you can pass 2D locations on the approximate position of your object of interest, a bounding box wrapping the object of interest (the format should be the x, y coordinates of the top-left and bottom-right corners of the bounding box), or a segmentation mask. At the time of writing, passing text as input is not supported by the official model according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844).
+For more details, refer to this notebook, which shows a walkthrough of how to use the model, with a visual example!
+## Automatic-Mask-Generation
+The model can be used for generating segmentation masks in a "zero-shot" fashion, given an input image. The model is automatically prompted with a grid of `1024` points, which are all fed to the model.
+The pipeline is made for automatic mask generation. The following snippet demonstrates how easily you can run it (on any device; simply pass an appropriate `points_per_batch` argument):
+```python
+from transformers import pipeline
+# initialize the pipeline for mask generation
+generator = pipeline("mask-generation", model="jadechoghari/robustsam-vit-huge", device=0, points_per_batch=256)
+image_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+outputs = generator(image_url, points_per_batch=256)
+```
+Now to display the generated mask on the image:
+```python
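+import matplotlib.pyplot as plt
+from PIL import Image
+import numpy as np
+import requests
+# load the image that was passed to the pipeline so it can be drawn under the masks
+raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
+# simple function to display the mask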
+def show_mask(mask, ax, random_color=False):
+    if random_color:
+        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
+    else:
+        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
+    # get the height and width from the mask
+    h, w = mask.shape[-2:]
+    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
+    ax.imshow(mask_image)
+# display the original image
+plt.imshow(np.array(raw_image))
+ax = plt.gca()
+# loop through the masks and display each one
+for mask in outputs["masks"]:
+    show_mask(mask, ax=ax, random_color=True)
+plt.axis("off")
+# show the image with the masks
+plt.show()
+```
+## Visual Comparison
+<table>
+  <tr>
+    <td>
+      <img src="figures/gif_output/blur_back_n_forth.gif" width="380">
+    </td>
+    <td>
+      <img src="figures/gif_output/haze_back_n_forth.gif" width="380">
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <img src="figures/gif_output/lowlight_back_n_forth.gif" width="380">
+    </td>
+    <td>
+      <img src="figures/gif_output/rain_back_n_forth.gif" width="380">
+    </td>
+  </tr>
+</table>
+<img width="1096" alt="image" src='figures/qualitative_result.PNG'>
+## Reference
+If you find this work useful, please consider citing us!
+```bibtex
+@inproceedings{chen2024robustsam,
+  title={RobustSAM: Segment Anything Robustly on Degraded Images},
+  author={Chen, Wei-Ting and Vong, Yu-Jiet and Kuo, Sy-Yen and Ma, Sizhou and Wang, Jian},
+  booktitle={CVPR},
+  year={2024}
+}
+```
+## Acknowledgements
+We thank the authors of [SAM](https://github.com/facebookresearch/segment-anything), on which our repo is based.
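
The README's Automatic-Mask-Generation section mentions a grid of `1024` point prompts that are processed `points_per_batch` at a time. A minimal sketch of that idea follows, assuming a 32 × 32 grid of cell-centre points in normalized coordinates (the layout used by the original SAM code; the `mask-generation` pipeline's internals may differ). The function names `build_point_grid` and `batches` are hypothetical and not part of `transformers`.

```python
def build_point_grid(points_per_side):
    """Return points_per_side**2 (x, y) points in [0, 1], one per cell centre."""
    offset = 1.0 / (2 * points_per_side)
    step = 1.0 / points_per_side
    coords = [offset + i * step for i in range(points_per_side)]
    # Row-major order: y varies slowest, x varies fastest.
    return [(x, y) for y in coords for x in coords]


def batches(points, points_per_batch):
    """Yield consecutive slices of at most points_per_batch points."""
    for start in range(0, len(points), points_per_batch):
        yield points[start:start + points_per_batch]


grid = build_point_grid(32)
print(len(grid))                      # 1024
print(len(list(batches(grid, 256))))  # 4
```

With `points_per_batch=256`, the 1024 prompts are split into four batches, which is why a smaller value lowers peak memory at the cost of more forward passes through the prompt encoder and mask decoder.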

config.json ADDED

	@@ -0,0 +1,248 @@

+{
+  "_commit_hash": null,
+  "architectures": [
+    "SamModel"
+  ],
+  "initializer_range": 0.02,
+  "mask_decoder_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "attention_downsample_rate": 2,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "relu",
+    "hidden_size": 256,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "iou_head_depth": 3,
+    "iou_head_hidden_dim": 256,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "mlp_dim": 2048,
+    "model_type": "",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 8,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_hidden_layers": 2,
+    "num_multimask_outputs": 3,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": null,
+    "torchscript": false,
+    "transformers_version": "4.29.0.dev0",
+    "typical_p": 1.0,
+    "use_bfloat16": false
+  },
+  "model_type": "sam",
+  "prompt_encoder_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu",
+    "hidden_size": 256,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "image_embedding_size": 64,
+    "image_size": 1024,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "length_penalty": 1.0,
+    "mask_input_channels": 16,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "",
+    "no_repeat_ngram_size": 0,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_point_embeddings": 4,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "patch_size": 16,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": null,
+    "torchscript": false,
+    "transformers_version": "4.29.0.dev0",
+    "typical_p": 1.0,
+    "use_bfloat16": false
+  },
+  "torch_dtype": "float32",
+  "transformers_version": null,
+  "vision_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "attention_dropout": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dropout": 0.0,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "global_attn_indexes": [
+      7,
+      15,
+      23,
+      31
+    ],
+    "hidden_act": "gelu",
+    "hidden_size": 1280,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "image_size": 1024,
+    "initializer_factor": 1.0,
+    "initializer_range": 1e-10,
+    "intermediate_size": 6144,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "mlp_dim": 5120,
+    "mlp_ratio": 4.0,
+    "model_type": "",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 16,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_channels": 3,
+    "num_hidden_layers": 32,
+    "num_pos_feats": 128,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_channels": 256,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "patch_size": 16,
+    "prefix": null,
+    "problem_type": null,
+    "projection_dim": 512,
+    "pruned_heads": {},
+    "qkv_bias": true,
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": null,
+    "torchscript": false,
+    "transformers_version": "4.29.0.dev0",
+    "typical_p": 1.0,
+    "use_abs_pos": true,
+    "use_bfloat16": false,
+    "use_rel_pos": true,
+    "window_size": 14
+  }
+}
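
Several of the `vision_config` and `prompt_encoder_config` values above are tied together arithmetically. The sketch below copies a handful of them from the diff and checks those relationships. The reading of `global_attn_indexes` (layers listed there use global attention, the rest use windowed attention) is an assumption based on the usual SAM ViT convention and is not stated in this commit.

```python
# Values copied from the config.json diff above.
vision_config = {
    "image_size": 1024,
    "patch_size": 16,
    "hidden_size": 1280,
    "mlp_ratio": 4.0,
    "mlp_dim": 5120,
    "num_attention_heads": 16,
    "num_hidden_layers": 32,
    "global_attn_indexes": [7, 15, 23, 31],
}
prompt_encoder_config = {"image_embedding_size": 64, "image_size": 1024, "patch_size": 16}

# A 1024-pixel side split into 16-pixel patches gives a 64 x 64 embedding grid.
patches_per_side = vision_config["image_size"] // vision_config["patch_size"]
# Per-head width of the ViT-H encoder.
head_dim = vision_config["hidden_size"] // vision_config["num_attention_heads"]
# The MLP width is the hidden size scaled by mlp_ratio.
expected_mlp_dim = int(vision_config["hidden_size"] * vision_config["mlp_ratio"])
# Assumption: every layer not listed in global_attn_indexes is windowed.
windowed_layers = [
    i for i in range(vision_config["num_hidden_layers"])
    if i not in vision_config["global_attn_indexes"]
]

assert patches_per_side == prompt_encoder_config["image_embedding_size"]
assert expected_mlp_dim == vision_config["mlp_dim"]

print(patches_per_side, head_dim, len(windowed_layers))  # 64 80 28
```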

figures/Computational requirements.PNG ADDED

figures/architecture.jpg ADDED

Git LFS Details

SHA256: 4ea4cd17052ee5b74e99d5c709115163a1025734671a12302444721cc960f527
Pointer size: 132 Bytes
Size of remote file: 2.54 MB

figures/architecture.pdf ADDED

Binary file (515 kB). View file

figures/gif_output/blur.gif ADDED

figures/gif_output/blur.jpg ADDED

figures/gif_output/blur_back_n_forth.gif ADDED

Git LFS Details

SHA256: 11e91e6bcdcd20fc90f276947958464a5b421dd86990b256d19ab43910725d4e
Pointer size: 132 Bytes
Size of remote file: 1.59 MB

figures/gif_output/haze.gif ADDED

figures/gif_output/haze.jpg ADDED

figures/gif_output/haze_back_n_forth.gif ADDED

Git LFS Details

SHA256: e18eb59510bf9ba0b9c029a2738510e6ddb94762781101944bf0eb852cbd1350
Pointer size: 132 Bytes
Size of remote file: 1.32 MB

figures/gif_output/lowlight.gif ADDED

figures/gif_output/lowlight.jpg ADDED

figures/gif_output/lowlight_back_n_forth.gif ADDED

Git LFS Details

SHA256: 9bbaf17393fed7fe651a0ea48a407ec8c3b77e12c20eeaa28d3436be7662706f
Pointer size: 132 Bytes
Size of remote file: 1.51 MB

figures/gif_output/rain.gif ADDED

Git LFS Details

SHA256: 8238dbdcafe9e9542303e363b2052c79552f35edfab5b4a15423ebb5838f8dda
Pointer size: 132 Bytes
Size of remote file: 1.33 MB

figures/gif_output/rain.jpg ADDED

figures/gif_output/rain_back_n_forth.gif ADDED

Git LFS Details

SHA256: 0669fa4ab685a93d94fc5d73ef6b8777adf1b1c9153fe51c91f4e59bb431a32f
Pointer size: 132 Bytes
Size of remote file: 2.02 MB

figures/qualitative_result.PNG ADDED

Git LFS Details

SHA256: d0e0872fdf7644df754369b9c9ac2d32996ea010cb3a4bc6bca7ea4a957775ad
Pointer size: 132 Bytes
Size of remote file: 2.44 MB

figures/seen_dataset_with_synthetic_degradation.PNG ADDED

figures/unseen_dataset_with_real_degradation.PNG ADDED

figures/unseen_dataset_with_synthetic_degradation.PNG ADDED

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1cf54b58c6e6bb391b3b2032c0ccb4084f36ecac0b9fec362b4abc2f46862761
+size 2564432184
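
The three added lines are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. As an illustrative sketch, the pointer can be parsed with a few lines of Python. The helper name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` layout shown above.

```python
# Pointer text copied from the model.safetensors diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1cf54b58c6e6bb391b3b2032c0ccb4084f36ecac0b9fec362b4abc2f46862761
size 2564432184
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
print(info["algo"], f"{info['size'] / 1e9:.2f} GB")  # sha256 2.56 GB
```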

preprocessor_config.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_pad": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.485,
+    0.456,
+    0.406
+  ],
+  "image_processor_type": "SamImageProcessor",
+  "image_std": [
+    0.229,
+    0.224,
+    0.225
+  ],
+  "mask_pad_size": {
+    "height": 256,
+    "width": 256
+  },
+  "mask_size": {
+    "longest_edge": 256
+  },
+  "pad_size": {
+    "height": 1024,
+    "width": 1024
+  },
+  "processor_class": "SamProcessor",
+  "resample": 2,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "longest_edge": 1024
+  }
+}
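
The preprocessing settings above describe a resize to a longest edge of 1024, a pad to 1024 × 1024, a rescale by 1/255, and a per-channel normalization. The following sketch works through that arithmetic for a single image shape and pixel. It is an approximation under stated assumptions: the rounding rule in `resized_shape` is modelled on the original SAM resize logic, the helper names are hypothetical, and `SamImageProcessor` may differ in detail.

```python
# Constants copied from the preprocessor_config.json diff above.
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
LONGEST_EDGE = 1024
PAD_SIZE = (1024, 1024)  # (height, width)


def resized_shape(height, width, longest_edge=LONGEST_EDGE):
    """Scale so the longer side equals longest_edge, keeping the aspect ratio."""
    scale = longest_edge / max(height, width)
    return int(height * scale + 0.5), int(width * scale + 0.5)


def padding(height, width, pad_size=PAD_SIZE):
    """Rows and columns of padding needed to reach pad_size."""
    return pad_size[0] - height, pad_size[1] - width


def normalize_pixel(rgb):
    """Rescale 0-255 values to 0-1, then subtract the mean and divide by the std."""
    return tuple(
        (c * RESCALE_FACTOR - m) / s
        for c, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


new_h, new_w = resized_shape(1200, 1800)
print((new_h, new_w), padding(new_h, new_w))  # (683, 1024) (341, 0)
```

The padded area is what `reshaped_input_sizes` lets `post_process_masks` crop away before the masks are resized back to `original_sizes`.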

robustsam_checkpoint_h.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:515e3e7437732b54240fe8fc78562e0b4b633451aaee129cc1450323621cef19
+size 3175817941
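
Because the pointer's `oid` is the SHA-256 of the real checkpoint, a downloaded copy can be checked against it. The sketch below hashes a file in chunks so that a multi-gigabyte checkpoint does not need to fit in memory. The helper name `sha256_of_file` and the local file name in the commented usage line are assumptions, and the demo runs on a small temporary file rather than the checkpoint itself.

```python
import hashlib
import os
import tempfile

# Digest copied from the robustsam_checkpoint_h.pth pointer above.
EXPECTED_OID = "515e3e7437732b54240fe8fc78562e0b4b633451aaee129cc1450323621cef19"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file, since the checkpoint is not available here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"robustsam")
    demo_path = tmp.name
demo_digest = sha256_of_file(demo_path)
os.remove(demo_path)
print(demo_digest == hashlib.sha256(b"robustsam").hexdigest())  # True

# Usage, assuming the checkpoint was downloaded to the working directory:
# assert sha256_of_file("robustsam_checkpoint_h.pth") == EXPECTED_OID
```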