Login

kingofring_131@ · 10-04-2017, 08:07 PM

An Implementation of the FP-growth Algorithm

[attachment=406]

ABSTRACT

The FP-growth algorithm is currently one of the fastest approaches
to frequent item set mining. In this paper I describe
a C implementation of this algorithm, which contains
two variants of the core operation of computing a projection
of an FP-tree (the fundamental data structure of the
FP-growth algorithm). In addition, projected FP-trees are
(optionally) pruned by removing items that have become infrequent
due to the projection (an approach that has been
called FP-Bonsai).

INTRODUCTION

One of the currently fastest and most popular algorithms for
frequent item set mining is the FP-growth algorithm [8]. It is
based on a prefix tree representation of the given database
of transactions (called an FP-tree), which can save considerable
amounts of memory for storing the transactions. The
basic idea of the FP-growth algorithm can be described as a
recursive elimination scheme: in a preprocessing step delete
all items from the transactions that are not frequent individually,
i.e., do not appear in a user-specified minimum
number of transactions. Then select all transactions that
contain the least frequent item (least frequent among those
that are frequent) and delete this item from them. Recurse
to process the obtained reduced (also known as projected)
database, remembering that the item sets found in the recursion
share the deleted item as a prefix.

PREPROCESSING

Similar to several other algorithms for frequent item set mining,
like, for example, Apriori or Eclat, FP-growth preprocesses
the transaction database as follows: in an initial scan
the frequencies of the items (support of single element item
sets) are determined. All infrequent items that is, all items
that appear in fewer transactions than a user-specified minimum
number are discarded from the transactions, since,
obviously, they can never be part of a frequent item set.

BUILDING THE INITIAL FP-TREE

After all individually infrequent items have been deleted
from the transaction database, it is turned into an FP-tree.
An FP-tree is basically a prefix tree for the transactions.
That is, each path represents a set of transactions that share
the same prefix, each node corresponds to one item. In addition,
all nodes referring to the same item are linked together
in a list, so that all transactions containing a specific
item can easily be found and counted by traversing this list.
The list can be accessed through a head element, which also
states the total number of occurrences of the item in the
database. As an example, Figure 1 shows the FP-tree for
the (reduced) database shown in Table 1 on the right. The
head elements of the item lists are shown to the left of the
vertical grey bar, the prefix tree to the right of it.