Region-Based Representations Revisited

Michal Shlapentokh-Rothman1*, Ansel Blume1*, Yao Xiao1, Yuqun Wu1, Sethuraman T V1, Heyi Tao1, Jae Yong Lee1, Wilfredo Torres2, Yu-Xiong Wang1, Derek Hoiem1,2
1University of Illinois at Urbana-Champaign, 2Reconstruct
*Equal Contribution

Our framework revisits the use of region features for downstream applications. We generate region features by first segmenting an image, extracting image features, and then pooling the image features within each region mask.

Abstract

We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong self-supervised representations, like those from DINOv2, and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The representations' compactness also makes them well suited to video analysis and other problems requiring inference across many images.

How It Works


Our approach is to pool image features over generated regions to create one feature vector per region, and to make predictions based on these region representations. The main steps are:

  1. Extract image features. We used strong representation models such as CLIP, DINOv1, DINOv2, and MaskCLIP; we found that DINOv2 worked best.
  2. Generate regions with a class-agnostic segmenter like SAM.
  3. To combine the image features and generated regions, we upsample the extracted features to the mask resolution, superimpose the region masks on the features, and average-pool within each mask to produce one feature vector per region (a minimal sketch follows this list).
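
The pooling step fits in a few lines. Below is a minimal PyTorch sketch; the function name and tensor shapes are our own choices, not the released code.

import torch
import torch.nn.functional as F

def pool_region_features(feats, masks):
    # feats: (C, h, w) patch features from a backbone such as DINOv2.
    # masks: (N, H, W) boolean region masks, e.g. from SAM.
    N, H, W = masks.shape
    # Upsample the low-resolution features to the mask resolution.
    up = F.interpolate(feats.unsqueeze(0), size=(H, W),
                       mode="bilinear", align_corners=False)[0]
    flat = up.flatten(1)                 # (C, H*W)
    m = masks.reshape(N, -1).float()     # (N, H*W)
    counts = m.sum(dim=1, keepdim=True).clamp(min=1)
    # Average-pool inside each mask: one feature vector per region.
    return (m @ flat.T) / counts         # (N, C)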

Augmenting SAM with SLIC

Figure: Original | SAM Only | SLIC on unmasked regions | SAM+SLIC

SAM-generated regions tended to leave many pixels uncovered. To improve coverage, we ran the superpixel clustering algorithm SLIC and intersected its superpixels with the unmasked pixels, adding the resulting segments as regions. SAM+SLIC outperformed SAM and other SAM variants such as MobileSAM and HQ-SAM.
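
Below is a sketch of this augmentation using scikit-image's SLIC implementation. The helper name and segment count are our own choices, not the paper's settings.

import numpy as np
from skimage.segmentation import slic

def augment_with_slic(image, sam_masks, n_segments=200):
    # image: (H, W, 3) uint8 RGB array.
    # sam_masks: list of (H, W) boolean masks produced by SAM.
    covered = np.zeros(image.shape[:2], dtype=bool)
    for m in sam_masks:
        covered |= m
    labels = slic(image, n_segments=n_segments, start_label=1)
    extra = []
    for lab in np.unique(labels):
        # Keep only the part of each superpixel that SAM left uncovered.
        seg = (labels == lab) & ~covered
        if seg.any():
            extra.append(seg)
    return list(sam_masks) + extra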

Figure: generated regions compared across SAM Only, HQ-SAM, Mobile-SAM (v1), SLIC, and SAM+SLIC

Applications

Semantic Segmentation

For semantic segmentation, we classified each region. For pixels belonging to multiple regions, we averaged the regions' confidences. Our results outperformed corresponding patch-based models across different image features. As the qualitative results show, region-based models classify pixels within the same object far more consistently than patch-based models.
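
A minimal sketch of this decoding step, assuming per-region class scores have already been computed (names and shapes are our own):

import torch

def regions_to_pixel_labels(masks, region_scores):
    # masks: (N, H, W) boolean region masks.
    # region_scores: (N, K) per-region class confidences.
    m = masks.float()
    # Sum the scores of every region covering each pixel ...
    pixel_scores = torch.einsum("nhw,nk->khw", m, region_scores)
    # ... and divide by the number of covering regions to average.
    counts = m.sum(dim=0).clamp(min=1)
    return (pixel_scores / counts).argmax(dim=0)   # (H, W) class labels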

Figure: Original | Regions (SAM+SLIC) | Patch Model Prediction | Region Model Prediction | Ground Truth


Multi-View Semantic Segmentation

For multi-view segmentation, we trained a scene transformer on top of the regions from all images in a scene. For scenes averaging 100 frames, such a transformer needs only about 5,000 region tokens instead of roughly 100,000 patch tokens. Predictions from our models were generally better than the noisy ground truth.
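
A hypothetical sketch of such a scene transformer in PyTorch; the depth, width, and head count are assumptions, not the paper's configuration:

import torch.nn as nn

class SceneTransformer(nn.Module):
    def __init__(self, dim=768, num_classes=21, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, region_feats):
        # region_feats: (B, R, C), all region features from a scene.
        x = self.encoder(region_feats)   # regions attend across frames
        return self.head(x)              # (B, R, K) per-region class logits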

Figure: Original | Linear Probe | Single Image Transformer | Scene Transformer | Ground Truth


One-Shot Object Retrieval

We introduce a new task called object retrieval, where the goal is to find all instances of an object in an image collection. We compare the query object's feature to the features of individual regions in each database image. In the examples below, the third column shows the similarity between the highlighted object in the first column and the database image.
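
A minimal sketch of scoring one database image against a query, assuming cosine similarity between region features (the similarity measure is our assumption):

import torch
import torch.nn.functional as F

def retrieval_heatmap(query_feat, db_feats, db_masks):
    # query_feat: (C,) feature of the query object's region.
    # db_feats: (N, C) region features of one database image.
    # db_masks: (N, H, W) boolean masks for those regions.
    sims = F.cosine_similarity(db_feats, query_feat.unsqueeze(0), dim=1)
    # Paint each region with its similarity; overlaps keep the maximum.
    return (db_masks.float() * sims[:, None, None]).amax(dim=0)  # (H, W)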

Figure: Query Image | Database Image | Heatmap


Activity Classification

The last application is multi-frame activity classification. As in multi-view segmentation, we trained a transformer on top of per-frame region features.
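
A hypothetical sketch of such a classifier; aggregating with a learned class token, and the layer sizes, are our assumptions, not the paper's design:

import torch
import torch.nn as nn

class ActivityClassifier(nn.Module):
    def __init__(self, dim=768, num_classes=50, depth=4, heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, region_feats):
        # region_feats: (B, T*R, C), region features from T frames.
        cls = self.cls.expand(region_feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, region_feats], dim=1))
        return self.head(x[:, 0])        # activity logits from the class token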


BibTeX

@inproceedings{regions,
      author = {Michal Shlapentokh{-}Rothman and Ansel Blume and Yao Xiao and Yuqun Wu and Sethuraman TV and Heyi Tao and Jae Yong Lee and Wilfredo Torres and Yu-Xiong Wang and Derek Hoiem},
      title = {Region-Based Representations Revisited},
      booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2024}
  }

This website is based on the Nerfies template.