
How I Handle Nested Tables and Rowspans (The Hard Parts of HTML Table Parsing)
Parsing HTML tables seems straightforward until you encounter real-world data. Wikipedia tables have navigation rows. Financial sites use complex rowspans. Sports statistics sites nest headers two levels deep.

After building HTML Table Exporter, a table extraction tool used on thousands of different sites, I've catalogued the edge cases that break most parsers. Here's how to handle each one.

Problem 1: Rowspan Expansion

A cell with rowspan="3" occupies vertical space in the current row and the next two rows. If you iterate through row.cells naively, your columns misalign.

The broken output:

    | Country | 2020 | 2021 | 2022 |  <- Header
    | USA     | 100  | 200  | 300  |  <- Expected
    | 150     | 250  | 350  |         <- Missing "USA" (rowspan continued)

The fix: Track occupied positions in a virtual grid.

    function expandRowspans(table) {
      const rows = Array.from(table.rows);
      const grid = [];
      rows.forEach((rowEl, rowIndex) => {
        if (!grid[rowIndex]) grid[rowIndex] = [];
        let colIndex = 0;
        Ar
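The snippet above breaks off in this excerpt. As a rough illustration of the virtual-grid idea, and not the author's actual code, here is a minimal sketch. The name `expandRowspansSketch` is hypothetical, and the colspan handling and the clamping of oversized rowspans are assumptions of mine rather than something the article states. It reads only `rows`, `cells`, `textContent`, `rowSpan`, and `colSpan`, so it should work on a real table element or on plain objects of the same shape.

```javascript
// Sketch: expand rowspans (and colspans) into a rectangular grid of strings.
// `table` may be an HTMLTableElement or a plain object shaped like
// { rows: [{ cells: [{ textContent, rowSpan, colSpan }] }] } (assumed shape).
function expandRowspansSketch(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;

    Array.from(rowEl.cells).forEach((cell) => {
      // Skip positions already claimed by a rowspan from an earlier row.
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;

      const text = (cell.textContent || "").trim();
      // Treat missing or zero spans as 1, and never span past the last row.
      const rowSpan = Math.min(Math.max(1, cell.rowSpan || 1), rows.length - rowIndex);
      const colSpan = Math.max(1, cell.colSpan || 1);

      // Copy the cell's text into every grid position it covers.
      for (let r = 0; r < rowSpan; r++) {
        const target = rowIndex + r;
        if (!grid[target]) grid[target] = [];
        for (let c = 0; c < colSpan; c++) {
          grid[target][colIndex + c] = text;
        }
      }
      colIndex += colSpan;
    });
  });

  return grid;
}

// Usage with a plain-object stand-in for the table from the example above.
const sampleTable = {
  rows: [
    { cells: [{ textContent: "Country" }, { textContent: "2020" }, { textContent: "2021" }, { textContent: "2022" }] },
    { cells: [{ textContent: "USA", rowSpan: 2 }, { textContent: "100" }, { textContent: "200" }, { textContent: "300" }] },
    { cells: [{ textContent: "150" }, { textContent: "250" }, { textContent: "350" }] },
  ],
};
const sampleGrid = expandRowspansSketch(sampleTable);
console.log(sampleGrid[2].join(" | ")); // USA | 150 | 250 | 350
```

Repeating the spanned value in every covered position keeps each output row the same width, which suits CSV-style export. Whether repeated values or blanks are the better fill depends on what consumes the grid.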
Continue reading on Dev.to



