Post-processing¶
Overview¶
It includes several python scripts for computing the scores and the ranking for a list of clusterings. The format of the input are described in the Clustering Section. More specifically, it includes:
- Redundancy filtering: for each clustering, it will remove from it all redundant clusters using the algorithm given in the report. For the moment, there is only one redundacy model employed (based on structural similarity)
- Scoring/Ranking: for each clustering and a given list of measures, it computes the score of this clustering on each of the measure and eventually the rank of this value. Measures values are all real-valued and by default the ranker will treat them as “smaller is better” (favor clusterings with lower value). In any case, one can pass the parameter to change this behavior.
- Pareto frontier: it takes a list of clusterings and a list of measures and ouput clusterings that lie on the frontier frontier.
Output format¶
There are two different file formats generated by this processs. The first one is called ``cluster level` meaning that, like the input format, it is a csv-like where each line describes a cluster (objects, dimensions...), completed by additional measure variables (each one corresponding to a track). The second one is clustering level meaning they describe only at clustering level (clustering_id and execution information) plus the measure variables.
The first one is perhapps interesting for visualization since we need the information regarding the composition of each of the clusters but much more heavy in term of size. For statistics tasks, the second one seems more suited.
Entry point¶
There are 3 entry points for different tasks, namely: pareto, filtering/scoring and the other one is for generating exhautive kmeans.).:
- km.py : exhautive k_means: actually belongs to the clustering part
- pareto.py: computing the pareto frontier
- subspace_clustering: filtering/scoring/ranking
See Examples section for more information on the usage or type with -h option for help,
Redundancy filtering¶
The details about the redundancy employed can be seen in the report. In brief, the idea is that given a clustering, if we can find 2 clusters that are similar we can take out the one that is less interesting without losing too much information. The similarity is evaluated in term of number of shared objects and dimensions. Specifically, given 2 clustering A and B, we take for instance A as reference (relevant) clustering and compute the precision of B.
precision_score(reference_clustering, target_clustering):
compute precision
the interestingness of a given cluster is simply the product of the number of objects and the number of dimensions that it contains.
Scoring/Ranking¶
This part is responsible for computing the score for a list of measures. There are two type of measures: the first one called external meaning we compare this clustering with another true clustering passed as reference; the second is like the spatial coherence computes the internal value (does not depent on other clustering).
List of measures:
- External measures: we have to load the reference clustering first (Sexton for instance) * F1: measures the similarity in term of objects, it take into account not only homogeneity (precision) but also completeness (recall) * Entropy: also in tern of shared objects, measures mostly the homogeity (precision) * RNIA: not only objects but also dimensions, it is based on the concept of micro-objects * CE: an improved version of RNIA, also based on the concept of micro-objects but it uses the hungarian algorithm to compute the fraction of shared structure.
- Internal measures: * spatial coherence: computing the spatial conpactness of a clustering
It also output the number of rank of each clustering for each of the aformentioned measures. Look at the report for more details about these measures and how they are implemented.
Pareto frontier¶
The most basic definition of the Pareto frontier of a clustering is the set consisting of clusters that are not dominated by any other cluster in any aspect (dimension). Here a simple algorithm to find the frontier was implemented. The idea is pretty straightforward: we can see that the best ranked by any measure always belongs to the Pareto frontier by defition, so we first sort all the clusters by the first measure (the other depends if we want to find the minimum or maximum), add the best ranked and then iterate over the rest to check whether the current cluster is dominated by the last added candidate (otherwise it is not dominated by any of those candidates). If it is not dominated then we add to the Pareto frontier.