Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Explore AsyPPO, a framework that replaces traditional value functions with lightweight mini-critics to boost large language model reasoning a...

Level: advanced

By Unknown

Category: research