|
--- |
|
library_name: transformers |
|
tags: [] |
|
inference: false |
|
--- |
|
|
|
# SuperGlue |
|
|
|
The SuperGlue model was proposed |
|
in [SuperGlue: Learning Feature Matching with Graph Neural Networks](https://arxiv.org/abs/1911.11763) by Paul-Edouard Sarlin, Daniel |
|
DeTone, Tomasz Malisiewicz and Andrew Rabinovich. |
|
|
|
SuperGlue matches two sets of interest points detected in a pair of images. Paired with the
[SuperPoint model](https://huggingface.co/magic-leap-community/superpoint), it can be used to match two images and
estimate the pose between them. This model is useful for tasks such as image matching and homography estimation.
|
|
|
The abstract from the paper is the following: |
|
|
|
*This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences |
|
and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs |
|
are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling |
|
SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, |
|
our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image |
|
pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in |
|
challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and |
|
can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at this [URL](https://github.com/magicleap/SuperGluePretrainedNetwork).* |
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/2I8QDRNoMhQCuL236CvdN.png" alt="drawing" width="500"/> |
|
|
|
|
|
|
This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille). |
|
The original code can be found [here](https://github.com/magicleap/SuperGluePretrainedNetwork). |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
SuperGlue is a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. |
|
It introduces a flexible context aggregation mechanism based on attention, enabling it to reason about the underlying 3D scene and feature |
|
assignments. The architecture consists of two main components: the Attentional Graph Neural Network and the Optimal Matching Layer. |
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/zZGjSWQU2na5aPFRak5kp.png" alt="drawing" width="1000"/> |
|
|
|
|
|
|
The Attentional Graph Neural Network starts with a Keypoint Encoder that combines keypoint positions and visual descriptors
into a single embedding, then alternates self- and cross-attention layers to build powerful matching representations.
The Optimal Matching Layer computes a score matrix, augments it with dustbins for unmatched points,
and finds the optimal partial assignment using the Sinkhorn algorithm.
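To make the matching step concrete, here is a minimal, simplified sketch of dustbin augmentation followed by log-space Sinkhorn normalization. It is illustrative only: the actual model uses a learned dustbin score and non-uniform marginals so that dustbins can absorb several keypoints, and every name below (`log_sinkhorn`, `dustbin_score`) is hypothetical.

```python
import torch

def log_sinkhorn(scores: torch.Tensor, dustbin_score: float = 1.0, iterations: int = 20) -> torch.Tensor:
    """Augment a score matrix with dustbins and normalize it with Sinkhorn iterations."""
    m, n = scores.shape
    bin_score = scores.new_full((1,), dustbin_score)
    # Append a dustbin row and column so unmatched keypoints have somewhere to go
    couplings = torch.cat([
        torch.cat([scores, bin_score.expand(m, 1)], dim=1),
        torch.cat([bin_score.expand(1, n), bin_score.expand(1, 1)], dim=1),
    ], dim=0)  # shape (m + 1, n + 1)
    log_p = couplings
    for _ in range(iterations):
        # Alternate row and column normalization in log space for numerical stability
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
    return log_p  # log of a soft partial assignment matrix

# Toy usage: similarity between random descriptors of 5 and 7 keypoints
desc0, desc1 = torch.randn(5, 64), torch.randn(7, 64)
log_assignment = log_sinkhorn(desc0 @ desc1.T / 64 ** 0.5)
best_candidates = log_assignment[:-1, :-1].argmax(dim=1)  # best match in image1 for each keypoint in image0
```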
|
|
|
- **Developed by:** MagicLeap |
|
- **Model type:** Image Matching |
|
- **License:** ACADEMIC OR NON-PROFIT ORGANIZATION NONCOMMERCIAL RESEARCH USE ONLY |
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** https://github.com/magicleap/SuperGluePretrainedNetwork |
|
- **Paper:** https://arxiv.org/pdf/1911.11763 |
|
- **Demo:** https://psarlin.com/superglue/ |
|
|
|
## Uses |
|
|
|
|
|
|
### Direct Use |
|
|
|
SuperGlue is designed for feature matching and pose estimation tasks in computer vision. It can be applied to a variety of multiple-view |
|
geometry problems and can handle challenging real-world indoor and outdoor environments. However, it may not perform well on tasks that |
|
require different types of visual understanding, such as object detection or image classification. |
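For example, the relative pose between two views can be recovered from SuperGlue matches with standard epipolar geometry tools. The sketch below uses OpenCV on placeholder data: `points0`, `points1`, and the intrinsic matrix `K` are illustrative assumptions; in practice they would come from the matched keypoints produced in the example below and from your camera calibration.

```python
import cv2
import numpy as np

# Placeholder inputs: replace with matched keypoint coordinates from SuperGlue
# and the actual camera intrinsics.
points0 = (np.random.rand(100, 2) * 640).astype(np.float64)  # matched pixels in image 0
points1 = points0 + np.random.randn(100, 2)                  # matched pixels in image 1
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])                              # assumed intrinsic matrix

# Estimate the essential matrix with RANSAC, then decompose it into rotation and translation
E, inliers = cv2.findEssentialMat(points0, points1, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, points0, points1, K, mask=inliers)
```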
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Here is a quick example of using the model. Since SuperGlue is an image matching model, it requires pairs of images as input:
|
|
|
```python |
|
from transformers import AutoImageProcessor, AutoModel |
|
import torch |
|
from PIL import Image |
|
import requests |
|
|
|
url = "https://github.com/magicleap/SuperGluePretrainedNetwork/blob/master/assets/phototourism_sample_images/london_bridge_78916675_4568141288.jpg?raw=true" |
|
im1 = Image.open(requests.get(url, stream=True).raw) |
|
url = "https://github.com/magicleap/SuperGluePretrainedNetwork/blob/master/assets/phototourism_sample_images/london_bridge_19481797_2295892421.jpg?raw=true" |
|
im2 = Image.open(requests.get(url, stream=True).raw) |
|
images = [im1, im2] |
|
|
|
processor = AutoImageProcessor.from_pretrained("stevenbucaille/superglue_indoor") |
|
model = AutoModel.from_pretrained("stevenbucaille/superglue_indoor") |
|
|
|
inputs = processor(images, return_tensors="pt") |
|
outputs = model(**inputs) |
|
``` |
|
|
|
The outputs contain the list of keypoints detected by the keypoint detector as well as the list of matches with their corresponding matching scores.
Because SuperGlue predicts a dynamic number of matches for each image pair, the outputs are padded to a fixed size; use the `mask` attribute to retrieve the valid entries:
|
|
|
```python |
|
from transformers import AutoImageProcessor, AutoModel |
|
import torch |
|
from PIL import Image |
|
import requests |
|
|
|
url_image_1 = "https://github.com/cvg/LightGlue/blob/main/assets/sacre_coeur1.jpg?raw=true" |
|
image_1 = Image.open(requests.get(url_image_1, stream=True).raw) |
|
url_image_2 = "https://github.com/cvg/LightGlue/blob/main/assets/sacre_coeur2.jpg?raw=true" |
|
image_2 = Image.open(requests.get(url_image_2, stream=True).raw) |
|
|
|
images = [image_1, image_2] |
|
|
|
processor = AutoImageProcessor.from_pretrained("stevenbucaille/superglue_indoor") |
|
model = AutoModel.from_pretrained("stevenbucaille/superglue_indoor") |
|
|
|
inputs = processor(images, return_tensors="pt") |
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
# Get the respective image masks |
|
image0_mask, image1_mask = outputs.mask[0]
|
|
|
image0_indices = torch.nonzero(image0_mask).squeeze() |
|
image1_indices = torch.nonzero(image1_mask).squeeze() |
|
|
|
image0_matches = outputs.matches[0, 0][image0_indices] |
|
image1_matches = outputs.matches[0, 1][image1_indices] |
|
|
|
image0_matching_scores = outputs.matching_scores[0, 0][image0_indices] |
|
image1_matching_scores = outputs.matching_scores[0, 1][image1_indices] |
|
``` |
|
|
|
You can then plot the matched keypoints on a side-by-side image to visualize the result:
|
```python
import matplotlib.pyplot as plt
import numpy as np

# Create a side-by-side image
input_data = inputs['pixel_values']
height, width = input_data.shape[-2:]
matched_image = np.zeros((height, width * 2, 3))
matched_image[:, :width] = input_data.squeeze()[0].permute(1, 2, 0).cpu().numpy()
matched_image[:, width:] = input_data.squeeze()[1].permute(1, 2, 0).cpu().numpy()
matched_image = (matched_image * 255).astype(np.uint8)

# Retrieve the keypoints in image0 that actually matched with keypoints in image1
image0_mask = outputs.mask[0, 0]
image0_indices = torch.nonzero(image0_mask).squeeze()
image0_matches_indices = torch.nonzero(outputs.matches[0, 0][image0_indices] != -1).squeeze()
image0_keypoints = outputs.keypoints[0, 0][image0_matches_indices]
image0_matches = outputs.matches[0, 0][image0_matches_indices]
image0_matching_scores = outputs.matching_scores[0, 0][image0_matches_indices]
# Retrieve the corresponding keypoints in image1
image1_keypoints = outputs.keypoints[0, 1][image0_matches]

# Draw the matches on top of the side-by-side image
plt.imshow(matched_image)
plt.axis("off")
for keypoint0, keypoint1, score in zip(image0_keypoints, image1_keypoints, image0_matching_scores):
    keypoint0_x, keypoint0_y = int(keypoint0[0]), int(keypoint0[1])
    keypoint1_x, keypoint1_y = int(keypoint1[0] + width), int(keypoint1[1])
    color = (0.0, 1.0, 0.0, float(score))  # green, opacity scaled by matching score
    plt.plot([keypoint0_x, keypoint1_x], [keypoint0_y, keypoint1_y], color=color, linewidth=1)

# Save the image
plt.savefig("matched_image.png", dpi=300, bbox_inches='tight')
plt.close()
```
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/632885ba1558dac67c440aa8/PiLL7svnN2dTqOxrsobJb.png) |
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
SuperGlue is trained on large annotated datasets for pose estimation, enabling it to learn priors over geometric transformations and to reason about the 3D scene.
The training data consists of image pairs with ground-truth correspondences and unmatched keypoints derived from ground-truth poses and depth maps.
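As a rough illustration of how such labels can be derived (a sketch under simplifying assumptions, not the authors' actual data pipeline), keypoints from the first image can be back-projected with their depth, transformed by the ground-truth relative pose, reprojected into the second image, and accepted as correspondences when the reprojection error is small. All names below are hypothetical:

```python
import numpy as np

def project(keypoints, depth, K, R, t):
    """Reproject pixel keypoints of image 0 into image 1 using depth and relative pose."""
    ones = np.ones((keypoints.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([keypoints, ones]).T).T  # unit-depth rays in camera 0
    points_3d = rays * depth[:, None]                             # scale rays by per-keypoint depth
    points_cam1 = (R @ points_3d.T).T + t                         # transform into camera 1 frame
    projected = (K @ points_cam1.T).T
    return projected[:, :2] / projected[:, 2:3]                   # perspective divide

def ground_truth_matches(kpts0, kpts1, depth0, K, R, t, threshold=3.0):
    """Accept (i, j) pairs whose reprojection error is below `threshold` pixels."""
    reprojected = project(kpts0, depth0, K, R, t)                       # (m, 2)
    dists = np.linalg.norm(reprojected[:, None] - kpts1[None], axis=2)  # (m, n) pixel distances
    nearest = dists.argmin(axis=1)
    valid = dists[np.arange(len(kpts0)), nearest] < threshold
    return np.stack([np.nonzero(valid)[0], nearest[valid]], axis=1)    # (k, 2) index pairs
```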
|
|
|
### Training Procedure |
|
|
|
SuperGlue is trained in a supervised manner using ground-truth matches and unmatched keypoints. The loss function minimizes
the negative log-likelihood of the ground-truth assignments, simultaneously maximizing precision and recall.
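A hedged sketch of this objective, assuming an (m + 1) × (n + 1) log-assignment matrix with dustbins in the last row and column (names below are illustrative, not the released training code):

```python
import torch

def matching_nll(log_assignment, gt_matches, unmatched0, unmatched1):
    """Negative log-likelihood of ground-truth assignments in the augmented matrix."""
    # log_assignment: (m + 1, n + 1) log assignment matrix, dustbins last
    # gt_matches: (k, 2) ground-truth index pairs (i in image 0, j in image 1)
    # unmatched0 / unmatched1: keypoint indices that should land in a dustbin
    nll = -log_assignment[gt_matches[:, 0], gt_matches[:, 1]].sum()
    nll = nll - log_assignment[unmatched0, -1].sum()  # image 0 keypoints -> dustbin column
    nll = nll - log_assignment[-1, unmatched1].sum()  # image 1 keypoints -> dustbin row
    total = gt_matches.shape[0] + unmatched0.shape[0] + unmatched1.shape[0]
    return nll / total
```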
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** fp32 |
|
|
|
#### Speeds, Sizes, Times |
|
|
|
SuperGlue is designed to be efficient and runs in real time on a modern GPU: a forward pass takes approximately 69 milliseconds (about 15 FPS) for an indoor image pair.
The model has 12 million parameters, making it relatively compact for a deep matching network.
Its inference speed makes it suitable for real-time applications, and it can be readily integrated into
modern Simultaneous Localization and Mapping (SLAM) or Structure-from-Motion (SfM) systems.
|
|
|
## Citation
|
|
|
|
|
|
**BibTeX:** |
|
|
|
```bibtex |
|
@inproceedings{sarlin2020superglue, |
|
title={{SuperGlue}: Learning Feature Matching with Graph Neural Networks},
|
author={Sarlin, Paul-Edouard and DeTone, Daniel and Malisiewicz, Tomasz and Rabinovich, Andrew}, |
|
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, |
|
pages={4938--4947}, |
|
year={2020} |
|
} |
|
``` |
|
|
|
## Model Card Authors |
|
|
|
[Steven Bucaille](https://github.com/sbucaille) |
|
|
|
|
|
|