How A HyperLogLog And Other Probabilistic Data Structures Work
Titus Brown’s “Awesome Big Data Algorithms” talk from PyCon 2013 is a fascinating look at probabilistic data structures and is worth a watch if you’re interested in computer science. These sometimes mysterious structures, with names like HyperLogLog and Bloom filter, let you do seemingly impossible things like count more unique items than your computer has memory to store. They manage this by keeping a small, fixed-size summary of the data and using statistics to estimate the answer, rather than storing every item in memory. That lets them scale to immense amounts of data, such as what Google might process in a day of crawling the web!
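To make the idea of estimating rather than storing a bit more concrete, here is a rough Python sketch of a HyperLogLog. It is not taken from either talk or article; the class name, the `p` parameter, and the choice of SHA-1 as the hash are assumptions made for illustration, and it leaves out refinements that production implementations include. The core trick is to hash each item, use the first few bits of the hash to pick a register, and remember the longest run of leading zeros seen in the remaining bits.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (names and defaults are assumptions)."""

    def __init__(self, p=12):
        self.p = p                     # number of hash bits used as a register index
        self.m = 1 << p                # number of registers (4096 when p=12)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item):
        # Take 64 bits of a SHA-1 digest; any well-mixed 64-bit hash would do.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash(item)
        width = 64 - self.p
        idx = x >> width                   # top p bits select a register
        rest = x & ((1 << width) - 1)      # remaining bits feed the zero-run count
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of 2**register across all registers, scaled by alpha * m^2.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting on empty registers.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
    print(round(hll.count()))  # an estimate close to, but not exactly, 100000
```

However many items you add, this sketch only ever holds 4,096 small integers, and adding the same item twice changes nothing. The typical relative error of a HyperLogLog is about 1.04 divided by the square root of the number of registers, so roughly 1.6% here; more registers buy more accuracy at the cost of more memory.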
In addition to Titus’ talk, there’s a great explanation of the HyperLogLog data structure from Doug Turnbull that’s worth checking out too. Probabilistic data structures are a fascinating combination of computer science, mathematics, and statistics.