To track trends, algorithms have to wade through a constant flow of streaming data, focussing on the important bits while forgetting the rest. This has historically been hard to do quickly and accurately – until now.
One major challenge in designing streaming algorithms is the trade-off between performance factors: if you go for accuracy, the algorithm might be slow and memory-intensive; but focus on speed, and it might fail to reliably pick out the most frequently occurring items. However, an international team has developed an algorithm capable of extracting trends from large volumes of streaming data without sacrificing accuracy or speed.
Jelani Nelson, a computer scientist from Harvard University, co-developed the algorithm with other academics from Denmark and the US. He said that trade-offs between speed and accuracy might no longer be necessary.
“We developed a new algorithm that is simultaneously the best on every performance dimension,” he told Quanta magazine.
This is especially important when analysing streaming data, such as network traffic, sensor networks, satellite feeds and database transactions. If Google wants to track the never-ending stream of search queries, for example, it uses a streaming algorithm.
Streaming algorithms are a way of answering real-time questions about data streams without remembering every piece of data. This new algorithm tackles the “frequent items” or “heavy hitters” problem: which items are appearing most frequently?
The first algorithm to solve this problem was developed in the US during the 1980s, but was poor at change detection; it could pinpoint which items occurred frequently over the whole stream, but not recent trends caused by sudden changes, for instance a spike in water usage caused by a burst pipe.
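The 1980s algorithm alluded to here is generally identified as the Misra–Gries summary (an assumption on our part, since the article does not name it). A minimal Python sketch of the idea, keeping only a handful of candidate counters:

```python
def misra_gries(stream, k):
    """Track up to k-1 candidate heavy hitters over a stream.

    Any item occurring more than len(stream)/k times is guaranteed
    to survive in the returned counter table.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop any at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Because the counters accumulate over the whole stream, a sudden recent spike can be drowned out by older data, which matches the change-detection weakness the article describes.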
Many algorithms developed over the past three decades have used indexing to keep track of how many times particular data items occur. This has provided ready information on data frequency, but can be memory and time intensive due to the effort of keeping track of a large number of individual data items.
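One well-known member of this counter-table family is the Count-Min sketch (named here purely as an illustration; the article does not say which algorithms it means). It answers “how often did this item appear?” cheaply, but recovering the heavy hitters naively still means querying every possible item, which is the costly loop over the universe mentioned below:

```python
import random

class CountMin:
    """A small table of hashed counters estimating item frequencies."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        # One random salt per row gives each row an independent hash.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def _index(self, salt, item):
        return hash((salt, item)) % self.width

    def add(self, item):
        for row, salt in zip(self.tables, self.salts):
            row[self._index(salt, item)] += 1

    def estimate(self, item):
        # Collisions only inflate counts, so the minimum row is tightest.
        return min(row[self._index(salt, item)]
                   for row, salt in zip(self.tables, self.salts))
```

The sketch never undercounts: collisions can only push an estimate up, so `estimate` always returns at least the true frequency.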
Nelson and his team have overcome this issue by breaking the data set into subsets, which can easily be reassembled. For example, the algorithm breaks eight-digit numbers such as 68,909,893 into four two-digit blocks (68, 90, 98, 93). Rather than searching one long list for the most frequent numbers, the algorithm synthesises the answer from four much shorter lists, which speeds up the process.
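The splitting step can be sketched as follows. Note this toy version uses exact `Counter` tallies per block position for clarity; the team's actual algorithm maintains compact sketches instead:

```python
from collections import Counter

def split_blocks(number, block_digits=2, total_digits=8):
    """Break an eight-digit number into four two-digit blocks, as in the article."""
    s = str(number).zfill(total_digits)
    return [s[i:i + block_digits] for i in range(0, total_digits, block_digits)]

def heavy_blocks(stream):
    """Keep one short frequency list per block position instead of one huge list,
    then read off the most common block at each position."""
    positions = [Counter() for _ in range(4)]
    for number in stream:
        for pos, block in enumerate(split_blocks(number)):
            positions[pos][block] += 1
    return [c.most_common(1)[0][0] for c in positions]
```

Joining the four winning blocks back together reconstructs the dominant eight-digit number without ever searching the full 100-million-item universe.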
“Instead of spending 50 million units of time looping over the entire universe, you only have four algorithms spending 100 units of time,” Nelson said.
Each subset item includes a name tag and link tag to enable the original long number to be reconstructed. Because errors in these tags would mean that the original data could not be retrieved, the algorithm introduces error-resistance by connecting each two-digit block to multiple others in an “expander graph”.
In an expander graph, each two-digit block forms a point according to the tagging process, and these points are connected by lines to form clusters. Each point is connected to its adjoining points, but also to others further down the line. For example, in the number 12,345,678, 12 will link with 34, but also with 56. Even if the link between 12 and 34 breaks, the algorithm can still tell that 12 and 56 belong to the same number.
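The redundancy idea can be illustrated with a toy version (a genuine expander graph has much stronger connectivity properties; this sketch only links each block to the next two):

```python
def build_links(blocks):
    """Link each block to its neighbour and to the block after that,
    so a single broken link still leaves all the pieces connected."""
    edges = set()
    for i in range(len(blocks)):
        for j in (i + 1, i + 2):
            if j < len(blocks):
                edges.add((i, j))
    return edges

def connected(edges, n):
    """Union-find check that all n blocks sit in one component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)}) == 1
```

Deleting the direct 12–34 link (edge `(0, 1)`) leaves the four blocks of 12,345,678 in one component, so the original number can still be reassembled.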
As a further safeguard against errors, the developers have included a “cluster-preserving” sub-algorithm that can accurately determine which points in an expander graph should be clustered together, even in the case of missing or incorrect links.
“This guarantees I can recover something that looks like the original clusters,” said Mikkel Thorup, a member of the development team based at the University of Copenhagen.
The algorithm’s creators see massive potential for its use going forward, and regard it as proof that when it comes to streaming data, you can have your cake and eat it too.