
I built the first open-source FP8 linear solver in Python — 2-3x faster than cuBLAS
I'm a second-year CS student. Last week I published ssBlast, an open-source Python library that solves large linear systems 2-3x faster than cuBLAS using FP8 precision on consumer NVIDIA GPUs. Here's exactly how it works and why it's fast.

The Problem

Solving Ax = b (where A is a huge matrix) is one of the most common operations in scientific computing:

Weather prediction: 1,000,000 unknowns
Airplane simulation: 500,000 unknowns
Drug discovery: 100,000 unknowns

CPU solvers take hours. GPU solvers are faster, but existing tools either don't support FP8 or require C++ expertise.

Why FP8 is Faster

Floating-point formats differ in how many bytes they use to store each number:

FP64 = 8 bytes per number (very precise)
FP32 = 4 bytes per number
FP16 = 2 bytes per number
FP8 = 1 byte per number (rough)

Fewer bytes per number means less data to read from GPU memory, which means faster computation.

FP64: 128 MB for a 4000×4000 matrix
FP8: 16 MB for the same matrix (8x less!)

RTX 4050 FP8 Tensor Cores = ~330 TFLOPS R
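The memory figures in the "Why FP8 is Faster" section can be checked with a few lines of Python. This is only a sketch of the arithmetic, not part of ssBlast; the names `BYTES_PER_ELEMENT` and `matrix_megabytes` are made up for illustration, and it assumes a dense square matrix and decimal megabytes (1 MB = 10^6 bytes).

```python
# Bytes per stored number for each format listed in the post.
BYTES_PER_ELEMENT = {"FP64": 8, "FP32": 4, "FP16": 2, "FP8": 1}


def matrix_megabytes(n, fmt):
    """Storage for a dense n x n matrix, in decimal MB (assumption: 1 MB = 10**6 bytes)."""
    return n * n * BYTES_PER_ELEMENT[fmt] / 1e6


# Footprint of a 4000 x 4000 matrix in every format.
sizes = {fmt: matrix_megabytes(4000, fmt) for fmt in BYTES_PER_ELEMENT}

for fmt, mb in sizes.items():
    print(f"{fmt}: {mb:.0f} MB")

# How many times smaller the FP8 copy is than the FP64 copy.
ratio = sizes["FP64"] / sizes["FP8"]
print(f"FP64 / FP8 = {ratio:.0f}x")
```

The ratio depends only on the bytes per element, so the same 8x saving holds for any matrix size.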
Continue reading on Dev.to Python



