Introduction

This document explains how to use the code that accompanies the subspace clustering project (SuClust) to perform the related tasks. For more detailed information, refer to the reports (thesis or slides).

Overview

The project consists of 3 main parts, each of which corresponds to a phase in the workflow:

Preprocessing: scripts, mostly written in R, that handle dataset processing (cleaning, normalization, etc.).
Clustering: implementations of several subspace clustering algorithms (in Knime, Python, etc.). They perform the cluster analysis on the data produced by the preprocessing phase and write the results to text files in a specific format.
Post-processing: Python programs that perform redundancy filtering and measure scoring/ranking on the files generated by the clustering phase.

Source code

The up-to-date source code repository is hosted on GitHub: subspace_clustering.

Prerequisites

To get started with SuClust, one must have the following installed:

For executing Python scripts (pre-processing, post-processing, clustering):

Python: Official Python tutorial: version 2.7 is recommended.

NumPy: Official NumPy tutorial: some of the algorithms are built on NumPy's array structures.

SciPy: Official SciPy tutorial: some SciPy modules are also used.

For performing cluster analysis with OpenSubspace (clustering):

Knime: Official Knime-OpenSubspace tutorial: a modified version of Knime with OpenSubspace integrated.

For running R scripts (pre-processing):

R: Official R tutorial
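
As a quick sanity check of the Python prerequisites listed above, the following minimal sketch reports which of the required modules can be imported in the current interpreter. It is not part of the SuClust code base; the function name check_prerequisites is hypothetical and introduced here only for illustration. It does not check the Knime or R installations.

```python
import importlib
import sys


def check_prerequisites(modules=("numpy", "scipy")):
    """Map each module name to its version string, or None if it cannot be imported.

    Hypothetical helper for illustration only; not part of SuClust.
    """
    found = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not installed for this interpreter.
            found[name] = None
        else:
            # Some modules do not expose __version__; report "unknown" then.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    # Print the interpreter version (2.7 is the version recommended above).
    print("Python %d.%d" % sys.version_info[:2])
    for name, version in sorted(check_prerequisites().items()):
        print("%s: %s" % (name, version if version else "NOT INSTALLED"))
```

Saved to a file and run with the interpreter that will execute the SuClust scripts, this prints the Python version followed by one line per module. A "NOT INSTALLED" line means the corresponding package still needs to be installed.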