We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong self-supervised representations, like those from DINOv2, and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The representations' compactness also makes them well suited to video analysis and other problems requiring inference across many images.
Our approach is to pool image features over generated regions to create one feature vector per region, and to make predictions based on these region representations. The main steps are: (1) generate class-agnostic regions (SAM, supplemented with SLIC), (2) pool strong self-supervised features such as DINOv2 over each region, and (3) train lightweight decoders on the resulting region vectors. A sketch of the pooling step follows.
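Here is a minimal sketch of the pooling step, assuming dense per-pixel features (e.g., DINOv2 patch features upsampled to image resolution) and a stack of binary region masks; the function and argument names are illustrative, not the paper's exact code:

```python
import torch

def pool_region_features(feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Average-pool dense features over each region mask.

    feats: (C, H, W) per-pixel features (e.g., upsampled DINOv2 patch features).
    masks: (R, H, W) binary region masks (e.g., from SAM).
    Returns: (R, C), one feature vector per region.
    """
    masks = masks.float()
    # Sum features inside each region, then divide by region area.
    summed = torch.einsum("rhw,chw->rc", masks, feats)
    areas = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(1)
    return summed / areas
```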
SAM-generated regions tended to leave many pixels uncovered. To improve coverage, we ran the superpixel clustering algorithm SLIC and took the intersection of each SLIC superpixel with the uncovered pixels. SAM+SLIC outperformed SAM alone and other SAM variants such as MobileSAM and HQ-SAM.
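One way to implement this coverage fix, assuming scikit-image's SLIC; the paper's exact parameters and variant may differ:

```python
import numpy as np
from skimage.segmentation import slic

def sam_plus_slic(image: np.ndarray, sam_masks: np.ndarray, n_segments: int = 200):
    """Augment SAM masks with SLIC superpixels restricted to uncovered pixels.

    image: (H, W, 3) RGB image.
    sam_masks: (R, H, W) boolean SAM masks.
    Returns a list of boolean masks: the original SAM masks plus, for each
    SLIC superpixel, its intersection with pixels no SAM mask covers.
    """
    covered = sam_masks.any(axis=0)
    segments = slic(image, n_segments=n_segments, start_label=0)
    regions = [m for m in sam_masks]
    for label in np.unique(segments):
        piece = (segments == label) & ~covered  # superpixel ∩ uncovered pixels
        if piece.any():
            regions.append(piece)
    return regions
```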
For semantic segmentation, we classified each region. For pixels belonging to multiple regions, we averaged the confidences. Our results outperformed corresponding patch-based models across different image features. As the qualitative results show, region-based models give much more consistent classifications for pixels within the same object than patch-based models.
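A sketch of this decoding step, assuming a linear classification head over the pooled region features; names are illustrative:

```python
import torch

def segment_from_regions(region_feats, masks, linear_head):
    """Per-pixel class scores by averaging region predictions over overlaps.

    region_feats: (R, C) pooled region features.
    masks: (R, H, W) binary region masks.
    linear_head: nn.Linear mapping C -> num_classes.
    Returns: (num_classes, H, W) averaged class confidences.
    """
    probs = linear_head(region_feats).softmax(dim=-1)   # (R, K) per-region confidences
    masks = masks.float()
    scores = torch.einsum("rk,rhw->khw", probs, masks)  # sum confidences over regions
    counts = masks.sum(dim=0).clamp(min=1.0)            # regions covering each pixel
    return scores / counts                              # average where regions overlap
```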
For multi-view segmentation, we trained a scene transformer on top of regions from all images in a scene. For scenes averaging 100 frames, a patch-based transformer would need roughly 100,000 tokens, whereas ours needs only about 5,000 region tokens. Predictions from our models were generally better than the noisy ground truth.
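A minimal sketch of such a scene transformer, treating each region vector as a token; depth, width, and head count are assumptions, not the paper's configuration:

```python
import torch.nn as nn

class SceneTransformer(nn.Module):
    """Transformer encoder over all region features from a scene's frames."""

    def __init__(self, dim=768, num_classes=21, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, region_feats):
        # region_feats: (B, N, dim), N = region tokens pooled across all frames
        # (e.g., ~5,000 tokens vs. ~100,000 patch tokens for a 100-frame scene).
        return self.head(self.encoder(region_feats))  # per-region class logits
```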
We introduce a new task called object retrieval, where the goal is to find all instances of an object in an image collection. We compare the query object's region feature to the individual region features of each database image. In the examples below, the third column shows the similarity between the highlighted object in the first column and the database image.
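A sketch of this retrieval step, assuming cosine similarity and scoring each image by its best-matching region; both choices are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, db_region_feats, db_image_ids, top_k=10):
    """Rank database images by similarity to a query object's region feature.

    query_feat: (C,) pooled feature of the query object's region.
    db_region_feats: (N, C) region features across the whole collection.
    db_image_ids: (N,) long tensor, image index of each region.
    Returns the top_k (scores, image indices).
    """
    sims = F.cosine_similarity(query_feat.unsqueeze(0), db_region_feats, dim=1)
    num_images = int(db_image_ids.max().item()) + 1
    image_scores = torch.full((num_images,), -1.0)
    # Score each image by its most similar region.
    for img in range(num_images):
        mask = db_image_ids == img
        if mask.any():
            image_scores[img] = sims[mask].max()
    return image_scores.topk(min(top_k, num_images))
```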
The last application is multi-frame activity classification. As in multi-view segmentation, we train a transformer on top of per-frame region features.
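This reuses the encoder-over-region-tokens pattern sketched above; a minimal version for clip-level labels, where mean-pooling the tokens into a single prediction is an assumption:

```python
import torch.nn as nn

class ActivityClassifier(nn.Module):
    """Transformer over per-frame region features, one label per clip."""

    def __init__(self, dim=768, num_activities=50, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_activities)

    def forward(self, region_feats):
        # region_feats: (B, N, dim), region tokens from all frames of a clip.
        tokens = self.encoder(region_feats)
        return self.head(tokens.mean(dim=1))  # pool tokens -> clip-level logits
```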
@inproceedings{regions,
author = {Michal Shlapentokh{-}Rothman* and Ansel Blume* and Yao Xiao and Yuqun Wu and Sethuraman TV and Heyi Tao and Jae Yong Lee and Wilfredo Torres and Yu-Xiong Wang and Derek Hoiem},
title = {Region-Based Representations Revisited},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024}
}
This website is based on the Nerfies template.