Building a Real-Time News Deduplication Engine with Python

via Dev.to Tutorial · agenthustler

News aggregators show 80% redundant content. A dedup engine clusters duplicate stories to produce a clean feed.

MinHash + LSH

```python
import hashlib
import re
from collections import defaultdict


class MinHashLSH:
    def __init__(self, n=128):
        self.n = n  # number of hash functions, i.e. signature length
        # One (a, b) coefficient pair per hash function.
        # Note: built-in hash() on str is salted per process unless
        # PYTHONHASHSEED is set, so signatures only match within one run.
        self.h = [
            (hash(f"a_{i}") % (2**31 - 1), hash(f"b_{i}") % (2**31 - 1))
            for i in range(n)
        ]
        self.bands = max(1, n // 4)
        self.rows = n // self.bands
        self.bkts = defaultdict(list)  # LSH buckets

    def shingle(self, text, k=3):
        # Lowercase, strip punctuation, then build overlapping k-word shingles.
        w = re.sub(r'[^a-z0-9 ]', '', text.lower()).split()
        return set(' '.join(w[i:i + k]) for i in range(len(w) - k + 1))

    def minhash(self, sh):
        # Each signature entry is the minimum of one hash function over all shingles.
        return [
            min((a * hash(s) + b) % (2**32 - 1) for s in sh) if sh else 0
            for a, b in self.h
        ]

    def add(self, did, text):
        sh = self.shingle(text)
        if not sh:
            return
        sig = self.minhash(sh)
        # Split the signature into bands; each band becomes a bucket key.
        for i in range(self.bands):
            band = tuple(sig[i * self.rows:(i + 1) * self.rows])
            self.bkts[f
```
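Because the excerpt above stops partway through `add`, here is a separate, minimal sketch of the same MinHash + LSH idea that runs on its own. It is not the article's remaining code. The function names (`stable_hash`, `make_coeffs`, `signature`, `candidate_pairs`, `candidate_probability`), the `PRIME` modulus, and the use of `hashlib.blake2b` in place of the built-in `hash()` are all assumptions made for this illustration, chosen so results are reproducible across runs.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

# Assumption of this sketch: a Mersenne prime as the modulus for the hash family.
PRIME = (1 << 61) - 1


def stable_hash(s):
    """Deterministic 64-bit hash of a string (not salted per process)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(text, k=3):
    """Lowercase, strip punctuation, return the set of k-word shingles."""
    words = re.sub(r"[^a-z0-9 ]", "", text.lower()).split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def make_coeffs(n):
    """Derive n (a, b) pairs deterministically; a is kept nonzero."""
    return [
        (stable_hash(f"a_{i}") % (PRIME - 1) + 1, stable_hash(f"b_{i}") % PRIME)
        for i in range(n)
    ]


def signature(sh, coeffs):
    """MinHash signature: per hash function, the minimum over all shingles."""
    hashed = [stable_hash(s) for s in sh]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in coeffs]


def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def candidate_pairs(sigs, bands, rows):
    """Bucket each band of each signature; documents sharing a bucket are candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in sigs.items():
        for i in range(bands):
            band = tuple(sig[i * rows:(i + 1) * rows])
            buckets[(i, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def candidate_probability(s, bands, rows):
    """Chance that two documents with Jaccard similarity s share at least one band."""
    return 1 - (1 - s ** rows) ** bands


# Made-up headlines for illustration only.
docs = {
    "a": "Central bank raises interest rates by half a point amid inflation fears",
    "b": "CENTRAL BANK raises interest rates by half a point, amid inflation fears!",
    "c": "Local team wins championship after dramatic overtime finish on Sunday",
}
coeffs = make_coeffs(128)
sigs = {doc_id: signature(shingles(text), coeffs) for doc_id, text in docs.items()}
pairs = candidate_pairs(sigs, bands=32, rows=4)

print(sorted(pairs))
print(estimated_jaccard(sigs["a"], sigs["c"]))
```

With 128 hash functions split into 32 bands of 4 rows (the same split the excerpt's `n // 4` produces), the candidate probability curve crosses its steep region near a Jaccard similarity of (1/32)^(1/4), roughly 0.42. More rows per band raises that threshold; more bands lowers it.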

Continue reading on Dev.to Tutorial
