Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Document pages: 12 pages

Abstract: Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1 by 4.6 to 7 points while using fewer FLOPs and parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
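
The two steps named in the abstract, grouping pixels into a small set of visual tokens and then running attention among those tokens, can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the tokenizer, so the softmax-weighted spatial pooling below, the single-head attention, the helper names (`tokenize`, `self_attention`), the random weights, and the sizes (a 14x14 feature map, 32 channels, 8 tokens) are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(features, w_attn):
    """Pool (N, C) pixel features into (L, C) tokens (assumed tokenizer)."""
    # One score per pixel per token; the softmax runs over the pixel axis,
    # so every token is a weighted average of pixel features.
    attn = softmax(features @ w_attn, axis=0)      # (N, L)
    return attn.T @ features                       # (L, C)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over tokens with a residual connection."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])         # (L, L)
    return tokens + softmax(scores, axis=1) @ v    # (L, C)

rng = np.random.default_rng(0)
H, W, C, L = 14, 14, 32, 8                         # hypothetical sizes
features = rng.standard_normal((H * W, C))         # flattened feature map
w_attn = rng.standard_normal((C, L))               # random stand-in weights
w_q, w_k, w_v = (0.1 * rng.standard_normal((C, C)) for _ in range(3))

tokens = tokenize(features, w_attn)
out = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape)                     # (8, 32) (8, 32)
```

With these sizes the token-space attention matrix has 8 x 8 = 64 entries, whereas attention computed directly over the 196 pixel positions would need 196 x 196 = 38,416, which gives a sense of why operating on a few tokens is cheaper than pixel-space attention.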
