What is HyperLogLog?
Suppose that you have a very large set of objects that may contain the same value multiple times - for example, the IP addresses of Google users who have made a search in the past month. You want to estimate the number of unique elements in this set (i.e. the total number of users).
Answering this question would require vast amounts of memory in order to handle all this data - or would it? The HyperLogLog algorithm can help you approximate the answer very efficiently. But what is HyperLogLog, and how does HyperLogLog work?
HyperLogLog definition
HyperLogLog is defined as both an algorithm for estimating the cardinality of a set, and a data structure that implements that algorithm. The problem that HyperLogLog solves is known in computer science as the “count-distinct problem”: counting the number of distinct elements in a very large set.
Most importantly, HyperLogLog is a probabilistic algorithm. To provide an exact value for the number of unique elements, you would need to hold each element in memory and compare it against all other elements. This could be very expensive in terms of memory if most of the elements in the set are unique.
Instead, HyperLogLog gives a best-guess estimate for the count-distinct problem within a certain range of accuracy. The HyperLogLog algorithm sacrifices precision in favor of speed and efficiency. Using only 1.5 kilobytes of memory, HyperLogLog can run on sets with more than 1 billion distinct elements, with a typical error of 2 percent.
How does HyperLogLog work?
The exact details of the HyperLogLog algorithm are too technical to discuss in depth here; interested readers can check out the original HyperLogLog research paper and its practical implementation by Google. Today, the HyperLogLog algorithm is used by everyone from small tech companies to giants such as Facebook and Twitter, in order to quickly and efficiently provide rough cardinality estimates for very large datasets.
However, the basic concept of HyperLogLog is fairly simple to understand. If you observe an element in the set that consists of n bytes, this increases the likelihood that the cardinality of the set is 2 to the power of n. For example, suppose we sample a set of integers and get the value 6, which is represented as the number 110 in binary. Since this number's representation has 3 bytes, this makes it more likely that the set's cardinality is 8 (i.e. 2 to the power of 3).
Of course, observing just a single element will probably not lead to a good estimate on its own. Instead, the HyperLogLog algorithm splits the set into many smaller subsets, and repeatedly takes samples from each of these subsets, refining its estimate as it sees more samples. Finally, we take the mean of each estimate from each subset to come up with a final estimate for the set's cardinality.
HyperLogLog in Redis
Redis is an open-source, in-memory data structure store frequently used to implement key-value databases, caches, and messaging systems. The Redis project comes with a pre-built HyperLogLog implementation that can easily provide a fairly precise estimate for even databases with millions or billions of elements, all while using extremely limited amounts of space.
There are three relevant Redis commands for implementing the HyperLogLog algorithm:
- PFADD: This command adds the elements to the HyperLogLog data structure.
- PFCOUNT: This command returns the estimated cardinality thus far.
- PFMERGE: This command merges multiple HyperLogLog cardinality estimates into a single estimate.
According to the Redis project, Redis' implementation of HyperLogLog is capable of returning a cardinality estimate with a standard error of 0.81 percent.
Redisson makes it easy to use HyperLogLog in the form of the RHyperLogLog interface, including asynchronous, reactive, and RxJava2 implementations. Below is a toy example of how to use HyperLogLog with Redis and Redisson:
package org.redisson.example.objects; import java.util.Arrays; import org.redisson.Redisson; import org.redisson.api.RHyperLogLog; import org.redisson.api.RedissonClient; public class HyperLogLogExamples { public static void main(String[] args) { // connects to 127.0.0.1:6379 by default RedissonClient redisson = Redisson.create(); RHyperLogLog<String> hyperLogLog = redisson.getHyperLogLog("hyperLogLog"); hyperLogLog.add("1"); hyperLogLog.add("2"); hyperLogLog.add("3"); hyperLogLog.addAll(Arrays.asList("10", "20", "30")); RHyperLogLog<String> hyperLogLog1 = redisson.getHyperLogLog("hyperLogLog1"); hyperLogLog1.add("4"); hyperLogLog1.add("5"); hyperLogLog1.add("6"); RHyperLogLog<String> hyperLogLog2 = redisson.getHyperLogLog("hyperLogLog2"); hyperLogLog1.add("4"); hyperLogLog1.add("5"); hyperLogLog1.add("6"); hyperLogLog2.mergeWith(hyperLogLog1.getName()); hyperLogLog2.countWith(hyperLogLog1.getName()); redisson.shutdown(); } }