[Notes] MaxViT: Multi-Axis Vision Transformer

MaxViT: Multi-Axis Vision Transformer [1] is a 2022 paper jointly produced by Google Research and the University of Texas at Austin. The paper proposes a new attention module, named multi-axis attention, which comprises a blocked local attention module and a dilated global attention module. In addition, the paper introduces the MaxViT architecture, which combines multi-axis attention with convolutions and is highly effective on ImageNet benchmarks and downstream tasks. ...
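The two attention axes differ only in how tokens are grouped before self-attention runs within each group. A minimal NumPy sketch of the two partitions (function names `block_partition` and `grid_partition` are illustrative, not from the paper's code):

```python
import numpy as np

def block_partition(x, p):
    """Blocked local attention: group tokens into non-overlapping p x p windows."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    # (num_windows, p*p, C): attention would run within each window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Dilated global attention: group tokens lying on a strided g x g grid."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    # (num_groups, g*g, C): each group spans the whole image at stride H // g
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

x = np.arange(16).reshape(4, 4, 1)
print(block_partition(x, 2)[0].ravel())  # spatially adjacent tokens: [0 1 4 5]
print(grid_partition(x, 2)[0].ravel())   # strided (dilated) tokens:  [0 2 8 10]
```

Running both partitions on the same feature map, as the MaxViT block does sequentially, gives every token a local and a sparse global receptive field at linear cost.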

July 16, 2023 · Ceshine Lee

[Notes] PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions

Introduction Recall that a one-dimensional Taylor series is an expansion of a real function f(x) about a point x = a [2]: $f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(n)}(a)}{n!}(x-a)^n + \dots$ We can approximate the cross-entropy loss using its Taylor series (a.k.a. Taylor expansion) about $a = 1$: $f(x) = -\log(x) = 0 - (x-1) + \frac{(x-1)^2}{2} - \dots = \sum_{j=1}^{\infty} (-1)^j \frac{(j-1)!}{j!} (x-1)^j = \sum_{j=1}^{\infty} \frac{(1-x)^j}{j}$ We can get the expansion for the focal loss simply by multiplying the cross-entropy loss series by $(1-x)^\gamma$: ...
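The series converges quickly when the predicted probability x is not too far from 1, which a few lines of Python can verify numerically (the helper names `ce_taylor` and `focal_taylor` are mine, just for the check):

```python
import math

def ce_taylor(x, n_terms):
    """Partial sum of the Taylor expansion of -log(x) about a = 1."""
    return sum((1 - x) ** j / j for j in range(1, n_terms + 1))

def focal_taylor(x, gamma, n_terms):
    """Same series with every term scaled by (1 - x)**gamma (focal loss)."""
    return sum((1 - x) ** (j + gamma) / j for j in range(1, n_terms + 1))

x = 0.7  # predicted probability of the target class
print(-math.log(x))     # exact cross-entropy loss
print(ce_taylor(x, 5))  # truncated series is already close after 5 terms
print(focal_taylor(x, 2, 5))
```

PolyLoss builds on exactly this view: it perturbs the coefficients 1/j of the leading polynomial terms instead of treating the loss as a fixed function.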

May 15, 2022 · Ceshine Lee

[Notes] Understanding Visual Attention Network

Introduction At the start of 2022, we have a new pure convolution architecture (ConvNeXt) [1] that challenges transformer architectures as a generic vision backbone. The new Visual Attention Network (VAN) [2] is yet another pure, minimalist convolution architecture whose creators claim to have achieved SOTA results with fewer parameters. What ConvNeXt tries to achieve is modernizing a standard ConvNet (ResNet) without introducing any attention-based modules. VAN still has attention-based modules, but the attention weights are obtained from a large-kernel convolution instead of a self-attention block. To overcome the high computation cost of a large-kernel convolution, it is decomposed into three components: a spatial local convolution (depth-wise convolution), a spatial long-range convolution (depth-wise dilation convolution), and a channel convolution (1×1 point-wise convolution). ...
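The three-way decomposition can be sketched in plain NumPy; this is a simplified, single-image version (no batch dimension, no learned bias, kernel sizes and dilation chosen to match the paper's 5×5 / 7×7-dilation-3 setup), not VAN's actual implementation:

```python
import numpy as np

def depthwise_conv(x, k, dilation=1):
    """'Same'-padded depth-wise conv: x is (C, H, W), k is (C, kh, kw), odd kernels."""
    C, H, W = x.shape
    _, kh, kw = k.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for i in range(kh):
        for j in range(kw):
            out += k[:, i:i + 1, j:j + 1] * xp[:, i * dilation:i * dilation + H,
                                               j * dilation:j * dilation + W]
    return out

def large_kernel_attention(x, k_local, k_dilated, w_point):
    """LKA-style gating: local DW conv -> dilated DW conv -> 1x1 conv -> gate."""
    a = depthwise_conv(x, k_local)                # spatial local (e.g. 5x5 DW)
    a = depthwise_conv(a, k_dilated, dilation=3)  # spatial long-range (7x7 DW, d=3)
    a = np.einsum('oc,chw->ohw', w_point, a)      # channel mixing (1x1 point-wise)
    return a * x                                  # attention as element-wise gating
```

The final element-wise product is what makes this an attention module: the decomposed convolutions only produce the weights that gate the input features.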

March 14, 2022 · Ceshine Lee

[Notes] Understanding ConvNeXt

Introduction Hierarchical Transformers (e.g., Swin Transformers [1]) have made Transformers highly competitive as a generic vision backbone and in a wide variety of vision tasks. A new paper from Facebook AI Research — “A ConvNet for the 2020s” [2] — gradually and systematically “modernizes” a standard ResNet [3] toward the design of a vision Transformer. The result is a family of pure ConvNet models, dubbed ConvNeXt, that competes favorably with Transformers in terms of accuracy and scalability. ...

January 28, 2022 · Ceshine Lee

[Notes] Understanding XCiT - Part 2

In Part 1, we introduced the XCiT architecture and reviewed the implementation of the Cross-Covariance Attention (XCA) block. In Part 2, we’ll review the implementation of the Local Patch Interaction (LPI) block and the Class Attention layer. Local Patch Interaction (LPI) Because there is no explicit communication between patches (tokens) in XCA, a layer consisting of two depth-wise 3×3 convolutional layers with Batch Normalization and a GELU non-linearity is added to enable explicit communication. ...
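A minimal NumPy sketch of that LPI layer, assuming the token sequence has already been reshaped back into a (C, H, W) feature map; the ordering of normalization and activation here (conv → BN → GELU → conv) and the inference-style BN without learned scale/shift are simplifying assumptions, not the exact XCiT code:

```python
import numpy as np

def dw_conv3x3(x, k):
    """'Same'-padded depth-wise 3x3 conv: x is (C, H, W), one 3x3 kernel per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C, H, W))
    for i in range(3):
        for j in range(3):
            out += k[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def batch_norm(x, eps=1e-5):
    # per-channel normalization (no learned scale/shift, for brevity)
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def lpi(x, k1, k2):
    """Local Patch Interaction: DW conv -> BN -> GELU -> DW conv."""
    return dw_conv3x3(gelu(batch_norm(dw_conv3x3(x, k1))), k2)
```

Because both convolutions are depth-wise with 3×3 kernels, each token only exchanges information with its immediate spatial neighbors, which is exactly the explicit local communication XCA lacks.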

July 25, 2021 · Ceshine Lee