Policy Gradients: REINFORCE from Scratch with NumPy

via Dev.to · Berkan Sesen

In the DQN post, we trained a neural network to estimate Q-values and then picked the best action with argmax. That works when the action space is discrete: push left or push right. But what if you need to control a robotic arm with continuous joint angles, or steer a car with a continuous throttle? You can't argmax over infinity.

Policy gradient methods flip the approach: instead of learning a value function and deriving a policy from it, we parameterise the policy directly and optimise it via gradient ascent. The network outputs action probabilities, we sample from them, and we nudge the parameters toward actions that led to high rewards. No Q-values, no argmax, no experience replay: just a policy, a gradient, and a reward signal.

By the end of this post, you'll implement the REINFORCE algorithm entirely from scratch in NumPy, including the forward pass, backpropagation, and the RMSProp optimiser, and train it to balance CartPole. The entire implementation is about 100 lines. No PyTorch, no …
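The core move the excerpt describes (sample an action from the policy's output probabilities, then push the parameters toward actions that earned high returns) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the post's actual implementation: the linear softmax policy, the variable names, and the single-sample update are all made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a linear softmax policy over 2 actions for a
# 4-dimensional observation (CartPole-shaped, but purely illustrative).
n_obs, n_actions = 4, 2
W = rng.normal(scale=0.1, size=(n_actions, n_obs))

def policy(state):
    """Softmax over the action logits W @ state."""
    logits = W @ state
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(state, action, probs):
    """Gradient of log pi(a|s) w.r.t. W for a linear softmax policy:
    (onehot(a) - probs) outer state."""
    onehot = np.zeros(n_actions)
    onehot[action] = 1.0
    return np.outer(onehot - probs, state)

# One REINFORCE step from a single (state, action, return) sample:
state = rng.normal(size=n_obs)      # stand-in for an environment observation
probs = policy(state)
action = rng.choice(n_actions, p=probs)
G = 1.0                             # discounted return following this action
lr = 0.01
W += lr * G * grad_log_pi(state, action, probs)   # gradient ascent on log pi * G
```

With a positive return `G`, this step raises the probability the policy assigns to the sampled action in that state; with a negative return it lowers it, which is the whole mechanism behind "nudge the parameters toward actions that led to high rewards".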

Continue reading on Dev.to

