[Notes] PPO, GRPO, and GSPO

Cover image generated by Nano Banana 2

Introduction

This blog post provides an overview of the core concepts of the Proximal Policy Optimization (PPO) algorithm and two of its variants: Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO). It assumes basic knowledge of reinforcement learning fundamentals, so it does not explain terms such as policy, on-policy/off-policy learning, and state value function in detail. ...

April 11, 2026 · Ceshine Lee

[Kaggle] Google Research Football 2020

Photo Credit

(This post is an expansion of this Kaggle post.)

My Solution

Thanks to Kaggle, Manchester City F.C., and Google Research for this fantastic competition. Working on this competition was the most fun I’ve had in a while. The tl;dr version of my solution is that I used an MLP model to stochastically imitate WeKick’s agents, with some rules to help it navigate unfamiliar waters.

Why this Approach

After I got the GCP coupon, I looked at the competition timeline and thought that there was no way I could train a competitive RL agent from scratch in less than two weeks. I had to find some way to cut the training time. ...

December 28, 2020 · Ceshine Lee