[Notes] MaxViT: Multi-Axis Vision Transformer

MaxViT: Multi-Axis Vision Transformer [1] is a 2022 paper jointly produced by Google Research and the University of Texas at Austin. The paper proposes a new attention module, named multi-axis attention, which comprises a blocked local attention module and a dilated global attention module. In addition, the paper introduces the MaxViT architecture, which combines multi-axis attention with convolutions and is highly effective on ImageNet benchmarks and downstream tasks. ...
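The two attention axes differ only in how tokens are grouped before self-attention runs within each group. A minimal NumPy sketch of the two partitions (function names `block_partition` and `grid_partition` are illustrative, not from the paper's code):

```python
import numpy as np

def block_partition(x, p):
    """Blocked local attention: group tokens into non-overlapping p x p windows."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    # (num_windows, p*p, C): attention would run within each window
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Dilated global attention: group tokens lying on a strided g x g grid."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    # (num_groups, g*g, C): each group spans the whole image at stride H // g
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

x = np.arange(16).reshape(4, 4, 1)
print(block_partition(x, 2)[0].ravel())  # spatially adjacent tokens: [0 1 4 5]
print(grid_partition(x, 2)[0].ravel())   # strided (dilated) tokens:  [0 2 8 10]
```

Running both partitions on the same feature map, as the MaxViT block does sequentially, gives every token a local and a sparse global receptive field at linear cost.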

July 16, 2023 · Ceshine Lee

[Notes] PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions

Introduction Recall that a one-dimensional Taylor series is an expansion of a real function f(x) about a point x = a [2]: $f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(n)}(a)}{n!}(x-a)^n + \dots$ We can approximate the cross-entropy loss using its Taylor series (a.k.a. Taylor expansion) about $a = 1$: $f(x) = -\log(x) = 0 - (x-1) + \frac{(x-1)^2}{2} - \dots = \sum_{j=1}^{\infty} (-1)^j \frac{(j-1)!}{j!} (x-1)^j = \sum_{j=1}^{\infty} \frac{(1-x)^j}{j}$ We can get the expansion for the focal loss simply by multiplying the cross-entropy loss series by $(1-x)^\gamma$: ...
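The series converges quickly when the predicted probability x is not too far from 1, which a few lines of Python can verify numerically (the helper names `ce_taylor` and `focal_taylor` are mine, just for the check):

```python
import math

def ce_taylor(x, n_terms):
    """Partial sum of the Taylor expansion of -log(x) about a = 1."""
    return sum((1 - x) ** j / j for j in range(1, n_terms + 1))

def focal_taylor(x, gamma, n_terms):
    """Same series with every term scaled by (1 - x)**gamma (focal loss)."""
    return sum((1 - x) ** (j + gamma) / j for j in range(1, n_terms + 1))

x = 0.7  # predicted probability of the target class
print(-math.log(x))     # exact cross-entropy loss
print(ce_taylor(x, 5))  # truncated series is already close after 5 terms
print(focal_taylor(x, 2, 5))
```

PolyLoss builds on exactly this view: it perturbs the coefficients 1/j of the leading polynomial terms instead of treating the loss as a fixed function.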

May 15, 2022 · Ceshine Lee

[Notes] Understanding Visual Attention Network

Introduction At the start of 2022, we have a new pure convolution architecture (ConvNeXt) [1] that challenges transformer architectures as a generic vision backbone. The new Visual Attention Network (VAN) [2] is yet another pure, minimalist convolution architecture whose creators claim to have achieved SOTA results with fewer parameters. What ConvNeXt tries to achieve is modernizing a standard ConvNet (ResNet) without introducing any attention-based modules. VAN still has attention-based modules, but the attention weights are obtained from a large-kernel convolution instead of a self-attention block. To overcome the high computation cost of a large-kernel convolution, it is decomposed into three components: a spatial local convolution (depth-wise convolution), a spatial long-range convolution (depth-wise dilation convolution), and a channel convolution (1×1 point-wise convolution). ...
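The three-way decomposition can be sketched in plain NumPy; this is a simplified, single-image version (no batch dimension, no learned bias, kernel sizes and dilation chosen to match the paper's 5×5 / 7×7-dilation-3 setup), not VAN's actual implementation:

```python
import numpy as np

def depthwise_conv(x, k, dilation=1):
    """'Same'-padded depth-wise conv: x is (C, H, W), k is (C, kh, kw), odd kernels."""
    C, H, W = x.shape
    _, kh, kw = k.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for i in range(kh):
        for j in range(kw):
            out += k[:, i:i + 1, j:j + 1] * xp[:, i * dilation:i * dilation + H,
                                               j * dilation:j * dilation + W]
    return out

def large_kernel_attention(x, k_local, k_dilated, w_point):
    """LKA-style gating: local DW conv -> dilated DW conv -> 1x1 conv -> gate."""
    a = depthwise_conv(x, k_local)                # spatial local (e.g. 5x5 DW)
    a = depthwise_conv(a, k_dilated, dilation=3)  # spatial long-range (7x7 DW, d=3)
    a = np.einsum('oc,chw->ohw', w_point, a)      # channel mixing (1x1 point-wise)
    return a * x                                  # attention as element-wise gating
```

The final element-wise product is what makes this an attention module: the decomposed convolutions only produce the weights that gate the input features.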

March 14, 2022 · Ceshine Lee

[Notes] Understanding ConvNeXt

Introduction Hierarchical Transformers (e.g., Swin Transformers [1]) have made Transformers highly competitive as a generic vision backbone and in a wide variety of vision tasks. A new paper from Facebook AI Research — “A ConvNet for the 2020s” [2] — gradually and systematically “modernizes” a standard ResNet [3] toward the design of a vision Transformer. The result is a family of pure ConvNet models, dubbed ConvNeXt, that competes favorably with Transformers in terms of accuracy and scalability. ...

January 28, 2022 · Ceshine Lee

[Notes] Understanding XCiT - Part 2

In Part 1, we introduced the XCiT architecture and reviewed the implementation of the Cross-Covariance Attention (XCA) block. In Part 2, we’ll review the implementation of the Local Patch Interaction (LPI) block and the Class Attention layer. Local Patch Interaction (LPI) Because there is no explicit communication between patches (tokens) in XCA, a layer consisting of two depth-wise 3×3 convolutional layers with Batch Normalization and a GELU non-linearity is added to enable explicit communication. ...
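A minimal NumPy sketch of that LPI layer, assuming the token sequence has already been reshaped back into a (C, H, W) feature map; the ordering of normalization and activation here (conv → BN → GELU → conv) and the inference-style BN without learned scale/shift are simplifying assumptions, not the exact XCiT code:

```python
import numpy as np

def dw_conv3x3(x, k):
    """'Same'-padded depth-wise 3x3 conv: x is (C, H, W), one 3x3 kernel per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((C, H, W))
    for i in range(3):
        for j in range(3):
            out += k[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def batch_norm(x, eps=1e-5):
    # per-channel normalization (no learned scale/shift, for brevity)
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def lpi(x, k1, k2):
    """Local Patch Interaction: DW conv -> BN -> GELU -> DW conv."""
    return dw_conv3x3(gelu(batch_norm(dw_conv3x3(x, k1))), k2)
```

Because both convolutions are depth-wise with 3×3 kernels, each token only exchanges information with its immediate spatial neighbors, which is exactly the explicit local communication XCA lacks.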

July 25, 2021 · Ceshine Lee