metadata

license: cc-by-nc-4.0
tags:
  - sparsh
  - ijepa
  - base
  - tactile

Sparsh (base-sized model) trained using I-JEPA

Sparsh is a Vision Transformer (ViT) model trained using the I-JEPA method, specifically adapted for vision-based tactile sensors such as DIGIT and GelSight.

Disclaimer: This model card was written by the Sparsh authors. The ViT model and I-JEPA objectives have been adapted for the tactile sensing use case.

Model description

We introduce Sparsh, a family of touch representations trained using Self-Supervised Learning (SSL) across multiple sensors, including DIGIT, GelSight 2017 (with markers), and GelSight Mini (without markers). This model was trained using the I-JEPA SSL approach.

The model takes two tactile images as input, sampled with a temporal stride of 5 samples and concatenated along the channel dimension: $I_t \oplus I_{t-5} \rightarrow x \in \mathbb{R}^{h \times w \times 6}$. For a sensor operating at 60 FPS, this corresponds to an inference window of approximately 80 ms, which is the reaction time humans need to adjust grip force when detecting partial slip.
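
A minimal sketch of how such an input could be assembled, assuming frames are held as `(h, w, 3)` NumPy arrays in a time-ordered list. The helper name `make_pair` and the 224×224 frame size are illustrative assumptions, not part of the Sparsh code base.

```python
import numpy as np

STRIDE = 5   # temporal stride in samples, as described above
FPS = 60     # example sensor frame rate

def make_pair(frames, t, stride=STRIDE):
    """Concatenate frame t and frame t - stride along the channel axis."""
    if t - stride < 0:
        raise IndexError("not enough history for the requested stride")
    return np.concatenate([frames[t], frames[t - stride]], axis=-1)

# Hypothetical buffer of 10 RGB frames; each frame is filled with its index.
frames = [np.full((224, 224, 3), i, dtype=np.float32) for i in range(10)]
x = make_pair(frames, t=7)

print(x.shape)                      # (224, 224, 6)
print(round(1000 * STRIDE / FPS))   # 83, i.e. roughly 80 ms
```

The first three channels hold the current frame and the last three hold the frame from five samples earlier.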

We preprocess the tactile images with background subtraction, which makes the model robust to distractors such as shadows and variations in light placement.
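
Background subtraction could look roughly like the sketch below. It assumes a reference frame captured with nothing touching the sensor, and the rescaling to `[0, 1]` is an illustrative choice; the exact preprocessing used for Sparsh may differ.

```python
import numpy as np

def subtract_background(frame, background):
    """Signed difference to a no-contact frame, mapped linearly to [0, 1]."""
    # Cast to float first: subtracting uint8 arrays directly would wrap around.
    diff = frame.astype(np.float32) - background.astype(np.float32)
    # uint8 inputs give differences in [-255, 255]; "no change" maps to 0.5.
    return (diff + 255.0) / 510.0

# Hypothetical 4x4 frames: uniform background, one pixel brightened by contact.
background = np.full((4, 4, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[1, 1] = 151

out = subtract_background(frame, background)
```

Pixels unchanged by contact end up at 0.5 regardless of the absolute illumination, which is why static shadows or lighting offsets largely cancel out.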

Through SSL pre-training, Sparsh learns representations of tactile image pairs from which features for downstream tasks can be extracted. To train a downstream task in a supervised fashion, you can place a standard decoder (or head) on top of the pre-trained Sparsh encoder, consisting of attentive pooling followed by a shallow MLP.
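
As a rough illustration of such a head, the PyTorch sketch below pools encoder tokens with a single learned query and passes the result through a two-layer MLP. The class name, hidden size, output size, and token count are assumptions made for this example; only the 768-dimensional embedding follows from the ViT-base architecture. It is not the decoder from the Sparsh repository.

```python
import torch
import torch.nn as nn

class AttentivePoolingHead(nn.Module):
    """Pool patch tokens with one learned query, then apply a shallow MLP."""

    def __init__(self, embed_dim=768, num_heads=8, hidden_dim=256, out_dim=3):
        super().__init__()
        # Learned query token that attends over all encoder tokens.
        self.query = nn.Parameter(torch.zeros(1, 1, embed_dim))
        nn.init.trunc_normal_(self.query, std=0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, tokens):
        # tokens: (batch, num_tokens, embed_dim) produced by the encoder
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (batch, 1, embed_dim)
        return self.mlp(pooled.squeeze(1))         # (batch, out_dim)

# Stand-in for encoder output: 2 samples, 196 tokens (assumed), 768 dims.
tokens = torch.randn(2, 196, 768)
head = AttentivePoolingHead()
out = head(tokens)
print(out.shape)  # torch.Size([2, 3])
```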

Intended uses & limitations

You can use the Sparsh model to extract touch representations for vision-based tactile sensors, including DIGIT, GelSight, and GelSight Mini. You have two options:

  - Use the frozen Sparsh encoder: This allows you to leverage the pre-trained weights of the Sparsh model without modifying them.
  - Fine-tune the Sparsh encoder: You can fine-tune the Sparsh encoder along with the training of your downstream task, allowing the model to adapt to your specific use case.

Both options enable you to take advantage of the powerful touch representations learned by the Sparsh model.
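
In PyTorch terms, the difference between the two options comes down to whether the encoder parameters receive gradients. The sketch below uses a small placeholder module in place of the real Sparsh encoder, and the helper name and learning rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

def set_encoder_trainable(encoder, trainable):
    """Freeze or unfreeze every encoder parameter."""
    for p in encoder.parameters():
        p.requires_grad = trainable
    # Put a frozen encoder in eval mode so layers such as dropout are inactive.
    encoder.train(trainable)

# Placeholder modules; substitute the loaded Sparsh encoder and your own head.
encoder = nn.Sequential(nn.Linear(16, 8), nn.GELU())
head = nn.Linear(8, 3)

# Option 1: frozen encoder, only the head is optimized.
set_encoder_trainable(encoder, False)
all_params = list(encoder.parameters()) + list(head.parameters())
head_only = [p for p in all_params if p.requires_grad]
optimizer_frozen = torch.optim.AdamW(head_only, lr=1e-3)

# Option 2: fine-tuning, with a smaller learning rate for the encoder.
set_encoder_trainable(encoder, True)
optimizer_finetune = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
```

Using a lower learning rate for the encoder than for the head is a common way to avoid overwriting pre-trained features early in fine-tuning, though the right values depend on the task.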

How to Use

For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository.

BibTeX entry and citation info

    @inproceedings{
    higuera2024sparsh,
    title={Sparsh: Self-supervised touch representations for vision-based tactile sensing},
    author={Carolina Higuera and Akash Sharma and Chaithanya Krishna Bodduluri and Taosha Fan and Patrick Lancaster and Mrinal Kalakrishnan and Michael Kaess and Byron Boots and Mike Lambeta and Tingfan Wu and Mustafa Mukadam},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
    url={https://openreview.net/forum?id=xYJn2e1uu8}
    }