Choosing the base algorithm
The best-known algorithm is Apriori. Apriori builds candidate item sets breadth first: it starts with sets containing only one item and expands those sets by one more item in every iteration. After candidates have been generated, they are tested against the data. Infrequent sets, those that do not reach a support threshold defined upfront, are pruned before the next iteration. Pruning removes many candidates, but the biggest weakness of this approach remains: a large number of candidate item sets must be kept in memory.
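As a rough illustration (a sketch, not Elasticsearch's implementation), the breadth-first generate-and-test loop can be written in a few lines of Python:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Breadth-first frequent item set mining (illustrative sketch)."""
    min_count = min_support * len(transactions)
    # Level 1: candidate sets containing a single item each.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    frequent = {}
    while level:
        # Test candidates against the data, then prune infrequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in survivors})
        # Expand surviving sets by one more item. All candidates of the
        # next level sit in memory at once -- the weakness noted above.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == len(a) + 1}
    return frequent
```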
Although the first prototypes of the aggregation used Apriori, it was clear from the beginning that we wanted to switch algorithms later. We looked for one that scales better in runtime and memory. We decided on Eclat; other alternatives were FP-Growth and LCM. All three use a depth-first approach, which fits our resource model much better. Christian Borgelt’s overview paper has details about the various approaches and their differences.
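To show the contrast with Apriori, here is a minimal sketch of the Eclat idea (again illustrative, not the actual implementation): each item carries the set of transaction ids it occurs in, and support for larger sets is computed by intersecting those id sets while recursing depth first, so only the current branch lives in memory:

```python
def eclat(transactions, min_support):
    """Depth-first mining via transaction-id set (tidset) intersection."""
    min_count = min_support * len(transactions)
    # Map each item to the set of transaction ids containing it.
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    frequent = {}

    def recurse(prefix, items):
        # Depth first: extend the current prefix one item at a time.
        for i, (item, tids) in enumerate(items):
            if len(tids) < min_count:
                continue
            new_prefix = prefix | {item}
            frequent[frozenset(new_prefix)] = len(tids)
            # Intersecting tidsets counts supersets without rescanning
            # the transactions.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            recurse(new_prefix, suffix)

    recurse(set(), sorted(tidsets.items()))
    return frequent
```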
Fields and values
An Elasticsearch index consists of documents with fields and values. Values have different types, and a field can hold an array of values. Translated to frequent item set mining, a single item consists of exactly one field and one value. If a field stores an array of values, frequent_item_sets treats every value in the array as a separate item. In other words, a document is a set of items. Not all fields are of interest, however: only the fields configured for frequent_item_sets form the transaction.
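For example, the translation of a document into a transaction might look like the following sketch (the field names are made up for illustration):

```python
def to_transaction(doc, fields):
    """Turn a document into a transaction: a set of (field, value)
    items, restricted to the fields of interest."""
    items = set()
    for field in fields:
        value = doc.get(field)
        if value is None:
            continue
        # Arrays contribute one item per element; scalars one item.
        values = value if isinstance(value, list) else [value]
        items.update((field, v) for v in values)
    return items

doc = {"geo": "US", "category": ["shoes", "sale"], "user": "u1"}
transaction = to_transaction(doc, ["geo", "category"])
# "user" is not configured, so it is not part of the transaction.
```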
Dealing with distributed storage
Beyond choosing the main algorithm, other details required attention. The input data for an aggregation can be spread across one or many indices, which are further split into shards. In other words, data isn’t stored in one central place. This sounds like a weakness at first, but it has an advantage: execution at the shard level happens in parallel, so it makes sense to push as much work as possible into the mapping phase.
Data preparation and mining basics
During mapping, items and transactions are de-duplicated. To reduce size, we encode items and transactions in big tables together with a counter. That counter later helps us reduce runtime.
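A simplified sketch of this de-duplication step, assuming transactions are already sets of items (the real mapping phase works on shard data and uses more compact encodings):

```python
from collections import Counter

def map_phase(transactions):
    """Per-shard mapping sketch: de-duplicate and count."""
    # Identical transactions are stored once, with a counter.
    transaction_counts = Counter(frozenset(t) for t in transactions)
    # Item counts are derived from the de-duplicated transactions,
    # weighted by how often each transaction occurred.
    item_counts = Counter()
    for t, count in transaction_counts.items():
        for item in t:
            item_counts[item] += count
    return transaction_counts, item_counts
```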
Once all shards have sent their data to the coordinating node, the reduce phase starts by merging all shard results. In contrast to other aggregations, this is where the main work of frequent_item_sets begins: most of the runtime is spent generating and testing sets.
After the results are merged, we have a global view and can prune items. An item with a count below the minimum count gets dropped. Transactions might collapse as a result of item pruning. We calculate the minimum count from the minimum support parameter and the total document count.
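Assuming the usual definition, the minimum count is the minimum support multiplied by the total document count, rounded up; pruning then drops every item below it:

```python
import math

def minimum_count(min_support, total_doc_count):
    # A set must appear in at least this many documents to count
    # as frequent (assumed to round up, the common convention).
    return math.ceil(min_support * total_doc_count)

def prune_items(item_counts, min_count):
    # Items below the minimum count can never be part of a
    # frequent set, so they are dropped globally.
    return {item: c for item, c in item_counts.items() if c >= min_count}
```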