pboost - PrefixSpan based Subsequence Boosting

Introduction

The pboost toolbox is a set of command line programs and a Matlab wrapper for mining frequent subsequences and for sequence classification. For our purposes, a sequence is defined as an ordered list of sets of discrete numbers. (If all sets contain exactly one element, the sequence is a string.) This definition of a sequence is flexible enough to model a number of interesting problems and has been used successfully for human action classification in video data.
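To make the definition concrete, here is a minimal Python sketch of a sequence of sets and a subsequence test. It assumes the usual PrefixSpan notion of containment (order is preserved, gaps are allowed, and each pattern element must be a subset of the element it matches); this page does not spell that out, so treat it as an assumption. The name is_subsequence is hypothetical and is not part of the pboost programs.

```python
from typing import FrozenSet, Sequence

ItemSet = FrozenSet[int]


def is_subsequence(pattern: Sequence[ItemSet], sequence: Sequence[ItemSet]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed).

    Assumption: each pattern element must be a subset of a distinct
    sequence element, and the matches must appear in the same order.
    Greedy leftmost matching is sufficient to decide this.
    """
    pos = 0  # index of the next pattern element still to be matched
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


# A sequence of three sets; the second set holds two numbers.
seq = [frozenset({1}), frozenset({2, 5}), frozenset({3})]

# A string is the special case in which every set has exactly one element.
string = [frozenset({7}), frozenset({8}), frozenset({7})]

print(is_subsequence([frozenset({1}), frozenset({3})], seq))     # True
print(is_subsequence([frozenset({5}), frozenset({1})], seq))     # False
print(is_subsequence([frozenset({7}), frozenset({7})], string))  # True
```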

The pboost classifier checks for the presence of certain subsequences in the sequence to be tested. The subsequences being checked are optimally determined by discriminative subsequence mining. The overall classification function is interpretable because only a small number of subsequences are used to determine the classification decision. Hence, for some applications, subsequence mining can offer an alternative to implicitly represented feature spaces (e.g. string and sequence kernels), which do not allow an interpretation of the resulting classifier.
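The following Python sketch illustrates why such a classifier is easy to read: the decision is a weighted vote over a handful of presence tests. The voting rule (plus the weight if the pattern is present, minus the weight if it is absent), the example patterns, and the weights are all assumptions made for illustration; the exact functional form used by pboost is described in the papers and manpages, not here.

```python
from typing import FrozenSet, List, Sequence, Tuple

ItemSet = FrozenSet[int]
Pattern = Tuple[ItemSet, ...]
Model = List[Tuple[Pattern, float]]  # (subsequence pattern, weight) pairs


def contains(pattern: Sequence[ItemSet], sequence: Sequence[ItemSet]) -> bool:
    """Ordered containment with gaps allowed (assumed subsequence notion)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def decision_value(sequence: Sequence[ItemSet], model: Model) -> float:
    """Sum of votes: +weight if the pattern is present, -weight otherwise."""
    return sum(w if contains(p, sequence) else -w for p, w in model)


def classify(sequence: Sequence[ItemSet], model: Model) -> int:
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(sequence, model) >= 0 else -1


# Hypothetical two-pattern model; the weights are made up for illustration.
model: Model = [
    ((frozenset({1}), frozenset({3})), 0.5),  # "1 followed later by 3"
    ((frozenset({4}),), -0.25),               # "contains 4" votes against
]

seq_a = [frozenset({1}), frozenset({2}), frozenset({3})]
seq_b = [frozenset({4}), frozenset({2})]

print(decision_value(seq_a, model), classify(seq_a, model))  # 0.75 1
print(decision_value(seq_b, model), classify(seq_b, model))  # -0.75 -1
```

Because the model is just a short list of patterns and weights, each prediction can be explained by naming the patterns that were found in the tested sequence.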

On this page, we provide instructions and source code, as well as a number of example real-world data sets, in order to foster the discussion and adoption of the subsequence mining methodology. Please see the Publications section for published papers.

Features and Demo

The pboost toolbox includes source code for the following functionality:

PrefixSpan frequent subsequence mining (frequency defined either by minimum support threshold or by top-K frequent subsequences)
Discriminative subsequence mining
nu-LPBoost 2-class classifier
nu-LPBoost 1.5-class classifier
DDAG-decomposition multiclass classifier
Matlab wrappers to all of the above functionality, as well as glue code to the gboost toolbox.
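As an illustration of the first item in the list above, here is a compact Python sketch of PrefixSpan-style pattern growth with a minimum support threshold, plus a naive top-K variant. It is simplified to strings (sequences in which every set has exactly one element), and the function names prefixspan and top_k are hypothetical. It is not the toolbox's C++ implementation and does not reflect the pspan command line interface.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[int, ...]


def prefixspan(db: Sequence[Sequence[int]], min_support: int) -> Dict[Pattern, int]:
    """Mine all subsequences that occur in at least `min_support` sequences."""
    results: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, projected: List[Tuple[int, int]]) -> None:
        # `projected` holds one (sequence index, suffix start) pair per
        # sequence that contains `prefix`.
        occurrences: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            for pos in range(start, len(db[seq_id])):
                item = db[seq_id][pos]
                if item not in seen:  # leftmost occurrence only, once per sequence
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                grow(pattern, new_projected)  # extend the prefix recursively

    grow((), [(i, 0) for i in range(len(db))])
    return results


def top_k(db: Sequence[Sequence[int]], k: int) -> List[Tuple[Pattern, int]]:
    """Naive top-K: mine everything with support >= 1, then keep the K best.

    A practical miner would raise the threshold during the search instead.
    """
    mined = prefixspan(db, min_support=1)
    return sorted(mined.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))[:k]


db = [[1, 2, 3], [1, 3], [2, 3], [1, 2]]
print(prefixspan(db, min_support=2))
print(top_k(db, 3))
```

The projected database keeps only the suffix that follows the leftmost match of the current prefix, so the support of each extension is simply the number of projected entries.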

All of the code is written in C++ and makes use of libboost, the GETFEM GMM++ matrix library, the COIN-OR Open Solver Interface library and the COIN-OR Linear Programming Solver (CLP). For your convenience, all of these libraries are bundled in the download package below so that the toolbox can be recompiled easily; statically compiled binaries for Linux x86 and x86-64 are also included.

The toolbox includes real-world data sets for testing purposes. To try them, run the included demo.sh file in the dataset/kth-dataset/ directory.

Documentation

Below is the manpage documentation included in the distribution.

Program	Description	TXT	PDF
pspan	PrefixSpan frequent subsequence mining	pspan.txt	pspan.pdf
pboost	Subsequence Boosting	pboost.txt	pboost.pdf
ptest	Classifier test program	ptest.txt	ptest.pdf

Download

The pboost toolbox package includes the source code and precompiled binaries for the Linux/x86 and Linux/x86-64 architectures. Also included are two data sets: one from human action classification in videos, and a toy data set of textual descriptions of country flags. See the included demo.sh file for how they are used.

Distribution: source code, precompiled binaries and demo file

pboost-1.0.tar.gz (41 MB)

License: The software is licensed under the GNU General Public License, version 2. A copy of the license document is included in the distribution.

Installation: The distribution includes statically compiled binaries for your convenience. To compile the programs yourself, or to compile the Matlab wrappers, adjust the variables in the Makefile.options file, especially the MATLABROOT variable. After editing the file accordingly, the programs should compile on any recent Linux system. If running the mex functions fails with the frequently encountered complaint about GCC_3.3 not being found, please refer to this discussion at the Mathworks forums.

Publications

Discriminative Subsequence Mining for Action Classification, ICCV 2007, Sebastian Nowozin, Gökhan Bakır and Koji Tsuda.
Weighted Substructure Mining for Image Analysis, CVPR 2007, Sebastian Nowozin, Koji Tsuda, Takeaki Uno, Taku Kudo and Gökhan Bakır.
A Linear Programming Approach for Molecular QSAR analysis, MLG 2006, Hiroto Saigo, Tadashi Kadowaki and Koji Tsuda.
An Application of Boosting to Graph Classification, NIPS 2004, Taku Kudo, Eisaku Maeda and Yuji Matsumoto.

Contact

sebastian.nowozin@tuebingen.mpg.de, primary author of the toolkit

If you have comments or questions, please feel free to contact me. Thanks!