Policy Gradients: REINFORCE from Scratch with NumPy

via Dev.to · Berkan Sesen

In the DQN post, we trained a neural network to estimate Q-values and then picked the best action with argmax. That works when the action space is discrete: push left or push right. But what if you need to control a robotic arm with continuous joint angles, or steer a car with a continuous throttle? You can't argmax over infinity.

Policy gradient methods flip the approach: instead of learning a value function and deriving a policy from it, we parameterise the policy directly and optimise it via gradient ascent. The network outputs action probabilities, we sample from them, and we nudge the parameters toward actions that led to high rewards. No Q-values, no argmax, no experience replay: just a policy, a gradient, and a reward signal.

By the end of this post, you'll implement the REINFORCE algorithm entirely from scratch in NumPy, including the forward pass, backpropagation, and the RMSProp optimiser, and train it to balance CartPole. The entire implementation is about 100 lines. No PyTorch, no …
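The core move the excerpt describes (sample an action from the policy's output probabilities, then push the parameters toward actions that earned high returns) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the post's actual implementation: the linear softmax policy, the variable names, and the single-sample update are all made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a linear softmax policy over 2 actions for a
# 4-dimensional observation (CartPole-shaped, but purely illustrative).
n_obs, n_actions = 4, 2
W = rng.normal(scale=0.1, size=(n_actions, n_obs))

def policy(state):
    """Softmax over the action logits W @ state."""
    logits = W @ state
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(state, action, probs):
    """Gradient of log pi(a|s) w.r.t. W for a linear softmax policy:
    (onehot(a) - probs) outer state."""
    onehot = np.zeros(n_actions)
    onehot[action] = 1.0
    return np.outer(onehot - probs, state)

# One REINFORCE step from a single (state, action, return) sample:
state = rng.normal(size=n_obs)      # stand-in for an environment observation
probs = policy(state)
action = rng.choice(n_actions, p=probs)
G = 1.0                             # discounted return following this action
lr = 0.01
W += lr * G * grad_log_pi(state, action, probs)   # gradient ascent on log pi * G
```

With a positive return `G`, this step raises the probability the policy assigns to the sampled action in that state; with a negative return it lowers it, which is the whole mechanism behind "nudge the parameters toward actions that led to high rewards".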

Continue reading on Dev.to

