From Silent None to Insight: Debugging PySpark UDFs on AWS Glue with Decorators
Last month I was debugging a PySpark UDF that was silently returning None for about 2% of rows in a 10-million-row dataset. No error. No exception. Just... None. I couldn't reproduce it locally because I didn't have the exact row that caused it. I couldn't add print() statements because -- as I painfully discovered -- print() inside a UDF doesn't show up anywhere useful. UDFs run on executor processes, not the driver, so the output vanishes into executor logs that are buried three clicks deep in the Spark UI, if they exist at all.

That frustration led me to build a small set of PySpark debugging decorators. Some of them turned out to be genuinely useful. Others taught me more about Spark's architecture than I expected. And the whole thing sent me down a rabbit hole about how AWS Glue's Docker image actually works under the hood.

This post covers:

- Three decorators I actually use in production debugging
- Why print() inside a UDF doesn't do what you think
- How AWS Glue's local Docker environment works (Livy, Sparkmagic, and the stdout bla
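Before diving in, here is a minimal sketch of the kind of decorator I mean. This is my own illustration, not the exact code from the post: it wraps the Python function behind a UDF, logs (rather than prints) the exact input whenever the function returns None or raises, so the offending row can be reproduced locally. The names `debug_udf` and `udf_debug` are assumptions for this sketch.

```python
import functools
import logging

# Logging goes to stderr by default, which at least lands in the
# executor logs in a searchable form (unlike bare print()).
logger = logging.getLogger("udf_debug")
logging.basicConfig(level=logging.WARNING)

def debug_udf(func):
    """Wrap a UDF body so silent-None cases and exceptions leave a trace."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failing input, then re-raise so the task fails
            # loudly instead of quietly producing None.
            logger.exception("%s raised for args=%r", func.__name__, args)
            raise
        if result is None:
            # Capture the exact row values that produced None.
            logger.warning("%s returned None for args=%r", func.__name__, args)
        return result
    return wrapper

# Hypothetical usage with PySpark:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   clean_value = udf(debug_udf(clean_value_py), StringType())
```

The key design choice is re-raising instead of swallowing exceptions: a UDF that converts errors into None is exactly how the silent 2% happens in the first place.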
Continue reading on Dev.to