Section: New Software and Platforms


Negative pattern mining with PrefixSpan

Keywords: Pattern discovery - Data mining - Sequential patterns - Traces

Scientific Description: Mining frequent sequential patterns consists in extracting recurrent behaviors, modeled as patterns, from a large sequence dataset. Such patterns indicate which events are frequently observed in sequences, i.e. what actually happens. Sometimes, knowing that a specific event does not happen is more informative than extracting many observed events. Negative sequential patterns (NSPs) express recurrent behaviors as patterns containing both observed events and absent events. Few approaches have been proposed to mine such NSPs. In addition, the syntax and semantics of NSPs differ from one method to another, which makes them difficult to compare. This article provides a unified framework for formulating the syntax and the semantics of NSPs. Then, we introduce a new algorithm, NegPSpan, that extracts NSPs using a PrefixSpan depth-first scheme and supports maxgap constraints that other approaches do not take into account. The formal framework highlights the differences between the proposed approach and the methods from the literature, especially the state-of-the-art approach eNSP. Extensive experiments on synthetic and real datasets show that NegPSpan can extract meaningful NSPs and that it can process larger datasets than eNSP thanks to significantly lower memory requirements and better computation times.

Functional Description: NegPSpan is software that extracts patterns from sequential data (traces, sequences of events, client pathways, etc.). It extracts two types of patterns: classical sequential patterns and negative sequential patterns. Sequential patterns describe recurrent behaviors as a sequence of events (e.g. event A occurs, then event B occurs and finally event C occurs), while negative sequential patterns also hold information about absent events (e.g. event A occurs, then event B occurs, without any C in between).
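The difference between the two pattern types can be illustrated with a minimal Python sketch. It is not NegPSpan's implementation: it assumes sequences of single events (no itemsets), no gap constraints, and one simplified reading of the negative pattern, namely that some occurrence of A is later followed by B with no C strictly in between. The function names are hypothetical.

```python
from typing import Sequence


def contains_positive(seq: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    # `event in it` consumes the iterator up to the first match,
    # so each pattern event must be found after the previous one.
    return all(event in it for event in pattern)


def contains_a_not_c_b(seq: Sequence[str], a: str, c: str, b: str) -> bool:
    """True if some `a` is later followed by `b` with no `c` in between.

    Simplified semantics chosen for illustration only.
    """
    for i, event in enumerate(seq):
        if event != a:
            continue
        for later in seq[i + 1:]:
            if later == c:
                break  # this occurrence of `a` is blocked by `c`
            if later == b:
                return True
    return False


trace = ["A", "C", "B"]
print(contains_positive(trace, ["A", "B"]))       # positive pattern <A, B>
print(contains_a_not_c_b(trace, "A", "C", "B"))   # negative pattern <A, not C, B>
```

On this trace the positive pattern is matched while the negative one is not, because the only A is separated from B by a C.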

The user has to provide a dataset in the IBM sequence format and at least one parameter: the minimal number of occurrences of a pattern in the dataset. Additional parameters are optional. The software efficiently extracts the patterns and outputs them in a text or JSON format. It can use different strategies for exploring negative sequential patterns, and the user can also specify constraints on the expected patterns.
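The role of the minimal number of occurrences can be sketched as follows. This is an illustration under stated assumptions, not the NegPSpan algorithm or its output format: candidate patterns are supplied by hand instead of being grown by a PrefixSpan search, the support of a pattern is taken to be the number of sequences that contain it, the dataset is already loaded in memory rather than read from an IBM-format file, and the JSON layout (`pattern`, `support`) is hypothetical.

```python
import json
from typing import Dict, List, Sequence


def is_subsequence(pattern: Sequence[str], seq: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(event in it for event in pattern)


def frequent_patterns(
    dataset: List[Sequence[str]],
    candidates: List[Sequence[str]],
    min_support: int,
) -> List[Dict[str, object]]:
    """Keep the candidates contained in at least `min_support` sequences."""
    results = []
    for pattern in candidates:
        # Support = number of sequences containing the pattern (assumption).
        support = sum(1 for seq in dataset if is_subsequence(pattern, seq))
        if support >= min_support:
            results.append({"pattern": list(pattern), "support": support})
    return results


# Toy dataset of four event sequences.
dataset = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["B", "C"]]
candidates = [("A", "B"), ("A", "C"), ("C", "A")]

report = frequent_patterns(dataset, candidates, min_support=2)
print(json.dumps(report, indent=2))
```

With a threshold of 2, the candidates (A, B) and (A, C) are kept and (C, A) is discarded, since no sequence contains a C followed by an A.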

News Of The Year: NegPSpan was developed in 2018.