Reading the State of AI in 2025 Report from McKinsey
Preamble and a bit of personal story: I recently ended a multi-year consulting engagement. I might publish a reflection on that experience sometime in the future, but for now, let’s just say it was quite mentally draining. I didn’t have any energy left to write public blog posts during my tenure there, as evidenced by the lack of new posts here over the past two years. ...
[Notes] MaxViT: Multi-Axis Vision Transformer
MaxViT: Multi-Axis Vision Transformer [1] is a paper jointly produced by Google Research and the University of Texas at Austin in 2022. The paper proposes a new attention mechanism, named multi-axis attention, which comprises a blocked local attention module and a dilated global attention module. In addition, the paper introduces the MaxViT architecture, which combines multi-axis attention with convolutions and is highly effective on ImageNet benchmarks and downstream tasks. (Figure: Multi-Axis Attention. Source: [2]) ...
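To make the two attention axes concrete, here is a minimal PyTorch sketch (not the paper's implementation; the module name and the assumption that the grid size G equals the window size P are mine). Block attention mixes tokens inside each PxP window; grid attention mixes tokens that occupy the same position across windows, which yields the sparse, dilated global pattern described above.

```python
# A sketch of multi-axis attention: blocked local attention over PxP windows,
# then dilated global attention over a GxG grid (here G is assumed equal to P).
# Assumes H and W are divisible by the window size, and dim by the head count.
import torch
import torch.nn as nn

class MultiAxisAttention(nn.Module):
    def __init__(self, dim, heads=4, window=7):
        super().__init__()
        self.window = window
        self.block_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, H, W, C)
        b, h, w, c = x.shape
        p = self.window

        # 1) Blocked local attention: partition into non-overlapping PxP
        #    windows and attend within each window.
        win = x.reshape(b, h // p, p, w // p, p, c).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, p * p, c)
        win, _ = self.block_attn(win, win, win)
        x = win.reshape(b, h // p, w // p, p, p, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, h, w, c)

        # 2) Dilated global attention: partition into a GxG grid so that
        #    tokens spaced H/G (and W/G) apart attend to each other.
        g = self.window
        grid = x.reshape(b, g, h // g, g, w // g, c).permute(0, 2, 4, 1, 3, 5)
        grid = grid.reshape(-1, g * g, c)
        grid, _ = self.grid_attn(grid, grid, grid)
        x = grid.reshape(b, h // g, w // g, g, g, c).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(b, h, w, c)

x = torch.randn(1, 14, 14, 32)
print(MultiAxisAttention(32, heads=4, window=7)(x).shape)  # (1, 14, 14, 32)
```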
[Notes] PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
Introduction: Recall that a one-dimensional Taylor series is an expansion of a real function $f(x)$ about a point $x = a$ [2]: $f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n$. We can approximate the cross-entropy loss using its Taylor series (a.k.a. Taylor expansion) about $P_t = 1$: $-\log(P_t) = \sum_{j=1}^{\infty} \frac{1}{j}(1 - P_t)^j$. We can get the expansion for the focal loss simply by multiplying the cross-entropy loss series by $(1 - P_t)^\gamma$: $-(1 - P_t)^\gamma \log(P_t) = \sum_{j=1}^{\infty} \frac{1}{j}(1 - P_t)^{j + \gamma}$. ...
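As a quick numerical sanity check of the expansions above, the snippet below (illustrative only; the helper names ce_series and focal_series are made up) compares the truncated series against the closed-form losses for a sample probability.

```python
# Illustrative check: the truncated Taylor series of -log(p), and the focal
# loss series (each term multiplied by (1 - p)^gamma), match the closed forms.
import numpy as np

def ce_series(p, n_terms=50):
    j = np.arange(1, n_terms + 1)
    return np.sum((1.0 / j) * (1.0 - p) ** j)

def focal_series(p, gamma, n_terms=50):
    j = np.arange(1, n_terms + 1)
    return np.sum((1.0 / j) * (1.0 - p) ** (j + gamma))

p, gamma = 0.8, 2.0
print(ce_series(p), -np.log(p))                                   # both ~0.2231
print(focal_series(p, gamma), -((1.0 - p) ** gamma) * np.log(p))  # both ~0.0089
```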
[Notes] Understanding Visual Attention Network
Introduction: At the start of 2022, a new pure-convolution architecture (ConvNeXt) [1] challenged transformer architectures as a generic vision backbone. The Visual Attention Network (VAN) [2] is yet another pure and simple convolutional architecture, whose creators claim state-of-the-art results with fewer parameters. (Figure source: [2]) What ConvNeXt tries to achieve is modernizing a standard ConvNet (ResNet) without introducing any attention-based modules. VAN still has attention-based modules, but the attention weights are obtained from a large-kernel convolution instead of a self-attention block. To overcome the high computational cost of a large-kernel convolution, it is decomposed into three components: a spatial local convolution (depth-wise convolution), a spatial long-range convolution (depth-wise dilated convolution), and a channel convolution (1x1 point-wise convolution). ...
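A minimal PyTorch sketch of that three-way decomposition follows (a hypothetical module; the 5x5 depth-wise and 7x7 dilation-3 kernel sizes are assumptions used for illustration). The decomposed convolutions produce a per-pixel, per-channel attention map that is applied to the input element-wise, which is how the summary above describes VAN replacing self-attention.

```python
# A sketch of large-kernel attention via the decomposition described above
# (depth-wise conv + depth-wise dilated conv + 1x1 point-wise conv).
# Kernel sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # spatial local convolution (depth-wise)
        self.dw = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # spatial long-range convolution (depth-wise dilated)
        self.dw_dilated = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                    dilation=3, groups=dim)
        # channel convolution (1x1 point-wise)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                         # x: (B, C, H, W)
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x                           # attention applied element-wise

x = torch.randn(1, 64, 56, 56)
print(LargeKernelAttention(64)(x).shape)          # torch.Size([1, 64, 56, 56])
```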
[Notes] Understanding ConvNeXt
Introduction: Hierarchical Transformers (e.g., Swin Transformers [1]) have made Transformers highly competitive as a generic vision backbone and in a wide variety of vision tasks. A new paper from Facebook AI Research, "A ConvNet for the 2020s" [2], gradually and systematically "modernizes" a standard ResNet [3] toward the design of a vision Transformer. The result is a family of pure ConvNet models, dubbed ConvNeXt, that compete favorably with Transformers in terms of accuracy and scalability. ...