ODL
Open and Distance Learning


 

CLASSIFYING VARIABLES BY HIERARCHICAL CLUSTERING MODELS:

EMPIRICAL AND PROBABILISTIC APPROACHES (1)

Helena Bacelar-Nicolau






Keywords: cluster analysis, classification, similarity coefficient, hierarchycal clustering, aggregation criterion, random variable, cumulative distribution function.
 
 

1. Introduction

Cluster analysis or classification refers to a set of multivariate methods for grouping elements (subjects or variables) from some finite set into clusters of similar elements (subjects or variables). Here we are mostly concerned with hierarchical clustering agglomerative models of variables (Bacelar-Nicolau, 1981, 1987, 1988; Gower, 1988; Lerman, 1981; Nicolau and Bacelar-Nicolau, 1998).

Classifying variables, that is partitioning a set of variables into classes or hierarchical groups of classes, is a major problem one has to face very often in Exploratory Multivariate Data Analysis, when dealing with data issued from human and social sciences and related areas. Particular attention has been paid by some researchers in this matter to the development of probabilistic clustering analysis techniques and methods which are able to deal with such kind of data (Bacelar-Nicolau, 1981, 1992; Lerman, 1981; Nicolau and Bacelar-Nicolau, 1998).

In hierarchical cluster analysis of variables we consider two general approaches, depending on the type of model we look for fitting the data, that is, for representing the (hidden) data structure. On the basic level, the most common one, empirical (classical) models are used and comparison (similarity or dissimilarity) coefficients between variables are not associated to any probabilistic aspects. On the upper level we deal with probabilistic models, based on probabilistic coefficients. These models often arise from natural reference assumptions, taking in account some essencial features of the data (and not from any hypothesis meaning that the data "follow" a "given" probabilistic law!).

A hierarchical clustering model can be graphically represented by a tree diagram or dendrogram, corresponding to a hierarchy of partitions (or classifications). In a hierarchical agglomerative model often the hierarchical representation goes from the partition where all the elements are separated in different clusters or classes (singletons) at the first level, to the partition where all the elements are in the same cluster, at the last level. At each level, the clustering algorithm aggregates the two (or more) most similar clusters. We usually refer to a hierarchical agglomerative model as being a general theory about the data and to a comparison function as being a similarity one.

This paper includes three sections: in the next section we introduce the agglomerative hierarchical probabilistic approach; in the second one we present the LEASP software; and in the third part we describe a particular approach to frequency data matrices and give a small example, with real data.
 
 

2. The probabilistic approach

As we have pointed out before, in hierarchical cluster analysis of variables two general approaches are available depending on the type of model we choose to fit the data. On the first level, the lower one, empirical models are used and comparison (usually similarity) coefficients are not associated to any probabilistic aspects. On the upper level we deal with probabilistic clustering models, which can be integrated into some probabilistic parametric families of models. Those families define general structures of data representation, where research on models better fitting the data as well as robustness studies on the methods based on the parameters set, can be performed.

A classical hierarchical clustering model can be characterized by a simple bidimensional functional vector ( s, w ) where s represents a similarity coefficient between variables and w represents an aggregation criterion between clusters of variables. (Bacelar-Nicolau, 1972, 1981, 1985, 1987, 1990; Bacelar-Nicolau and Nicolau, 1994).

In our probabilistic approach ( s, w ) is the realisation of some random vector ( S, W ), under some suitable reference hypothesis, and the general hierarchical clustering model can be discribed as a fourdimensional functional vector ( g , G , L, SD ), where:

1 - grepresents the (exact or asymptotic) cumulative distribution function (cdf) value of S on s; it measures the link between elements (Bacelar-Nicolau, 1972, 1981, 1988, Lerman, 1970, 1973, 1981).

We call g a Validity Link (VL) coefficient, because it validates, that is makes comparable, all the initial s values, in a probabilistic scale.

2 - G is the (exact or asymptotic) cdf value of an appropriate statistics W of the sample of g probabilistic similarity coefficient values on w (Bacelar-Nicolau, 1987, 1988, 1992; Lerman, 1981; Tiago de Oliveira, 1982);

3 - L is a probabilistic hierarchical level index or "level statistics", taking values over the sequence of hierarchy levels (Bacelar-Nicolau, 1972, 1981; Lerman, 1970, 1981; Nicolau, 1981), whose local maxima define the most significant partitions (as a simple particular case, L can take the maximum value of 1- G , in each level), the absolute maximum giving the partition where we usually cut the tree diagram.

4 - and SD is another hierarchical level index (Bacelar-Nicolau, 1972, 1981) for measuring the distortion degree of similarities in each level-partition and/or detecting similarity conserving (stratification) groups of levels / partitions.

The more general situation arises when g or G are already extensions of the previous cdf, each one representing some parametric / adaptive family of VL coefficients or criteria, respectively, instead of single ones (Nicolau, 1981, 1983; Nicolau and Bacelar-Nicolau, 1998): robustness or stability studies based on perturbation of the parameter values can then be performed (G might for instance represent an extension of the well-known Lance and Williams adaptive formula for hierarchical agglomerative criteria. In that case by varying the parameters in a systematic way, robustness and/or validity studies based on the L and SD values can be accomplished inside such a family, helping us to search for the models - either classical or probabilistic models - better fitting the data).

Distribution equivalence (d.e.) between basic coefficients assumes a fundamental role in this context (Bacelar-Nicolau, 1987). In fact, one can find classes of coefficients ascribing the same values to the probabilistic coefficient g exactly or asymptotically (a.d.e.). Therefore d.e. or a.d.e. coefficients allow us - exactly or asymptotically - a unique clustering solution, when combined with some suitable clustering criterion.

The statistical methodology is supported by some specific software packages, that are being developed by the LEAD and some partner laboratories. The statistical software LEASP we refer to here has been developed by a Portuguese team in the Luso-French Scientific Co-operation Program on Multivariate Data Analysis, ADAM, from the previous French software LEAS for teaching statistics and data analysis.
 
 

2.1. Hierarchical clustering probabilistic models

Let S be the random variable (r.v.) associated to a basic similarity coefficient s. For each pair (x, y) of variables and each pair (A,B) of clusters one has:

g (x,y) = P  (S(x,y)  s(x,y)), G (A,B) =P  (W(A,B)  w(A,B))

where the r.v. W(A,B) is a basic aggregation criterion, defined on the set g (A,B) = íg (x,y) ; (x,y) Î A´ Bý with w(A,B) as its actual value; and the set of VL similarities g (A,B) is supposed to be an i.i.d. sample of a r.v. with uniform distribution in the interval [0,1].

Let w be the single linkage criterion. In this case the r.v. W(A,B) will be the maximum of card A ´ card B independent uniform r.v. in [0,1] and G (A,B) takes the following expression:

G (A,B) = ( w(A,B) )a´b

where a = card A and b = card B. This aggregation criterion was the first Validity Linkage method to be derived, the so-called AVL method ("Vraisemblence du Lien" in Lerman, 1970, 1981 and Bacelar-Nicolau, 1972).

The AVL method has been extended, for instance by Nicolau (1983), to some hierarchical parametric families, which can simply be represented by the common following general expression:

G (A,B; a , b ; W ) = ( w(A,B) ) g(a , b ; W )

where g(a , b ; W ) depends on the cardinals a , b and on the parameter space W .

A special case of the above "supermodel" is the following parametric family:

g(a , b ; e , x ) = 1+(e ( (a´b ) x - 1) ), with e , x [0,1]
 
 

It is easy to see that this family includes particular (limited) cases, the single linkage (e = 0 or x = 0) and the AVL (e = x = 1) methods. Thus by varying e from 0 to 1, with x = 1 we can study how the VL models evolve, when going from the "chaining effect" associated to the first criterion, towards the "symmetry effect" associated to the second one. The procedure can of course be generalized to other convenient sets of parameter values. This enables us to look for the best hierarchical clustering model(s) in the parametric family. (Gower, 1988; Gifi, 1990; Bacelar-Nicolau and Nicolau, 1994).

Let N = [nij], i=1,...,r; j=1,...,p, be a frequency data matrix, describing r data units (the rows) by p variables (the columns). The basic affinity coefficient a(x,y) between one pair (x,y) of variables is defined by the inner product of the sequences (, i=1,...,r) and (, i=1,...,r) where (j , j') represents the pair of columns associated to (x , y), and ni/j , ni/j' are conditional frequencies (Bacelar-Nicolau, 1985, 1988, 1994). In case of equally weighted data units, one gets simply the following expression for the affinity a(x,y):

a(x,y) =  (i=1,...,r).

An example of hierarchical clustering empirical model that can be used to deal with such kind of data is represented by the bidimensional vector (a, SL), where (s =) a represents the affinity coefficient and (w =) SL is the single linkage criterion, respectively.

Alternatively we can use a probabilistic hierarchical clustering model based on the following "Validity Affinity Linkage" coefficient:

VAL(x,y) = P (A(x,y)  a(x,y)) @ R (A*(x,y)  a*(x,y))

@F (a*(x,y))


where A is the corresponding random variable and A* is the asymptotic standard normal variable. The asymptotical study of the affinity coefficient A (Bacelar-Nicolau, 1988) has been supported by different general limit theorems, namely the Wald and Wolfowitz (Fraser, 1975) and the Delta-method (Tiago de Oliveira, 1989) theorems, depending on the underlying reference hypothesis. One gets in this way the set of p(p-1)/2 asymptotic values of g (x,y), on which will stand either an aggregation criterion or a general parametric family of criteria.
 
 

3. The LEAS 97 Software

LEASP 97 is the extended, improved Portuguese version of the French package LEAS for teaching and analysing statistical methodology (Bacelar-Nicolau H., Nicolau, F. C., Dias, O., Ramos, L., 1997). The previous version (Bacelar-Nicolau H., C.Nicolau, F., Mira, C., Dias, O., 1994) has been developed into the Luso-French Scientific and Technological Program ADAM1(2) on Multivariate Data Analysis.

LEASP 97 presents Portuguese menus and includes new subroutines, which have been developed by Portuguese teams. Probabilistic hierarchical clustering models, which are now available in LEASP97, deal with classification of variables, a quite important problem in exploratory data analysis (very common in human sciences, for instance). Special attention has been paid to the case of multivariate discrete data analysis: hierarchical clustering algorithms for binary and frequency tables as well as classical and log linear algorithms for contingency tables, were implemented in order to combine exploratory and confirmatory methodology in an easy manner, in the present version of LEASP. This allows us to use a general probabilistic approach for classifying multidimensional data.

The software LEASP 97 is organised in the same pedagogical way as the former LEAS: the tree of menus unfolds into chapters, sections and paragraphs as in a usual course and/or manual of statistics and data analysis. It may help to support a course of three to four semesters. The hierarchy of menus is specially designed for training the observation skills, critical capacity and accuracy of the users, from the first step of collecting data and managing files to the advanced step of using new multivariate statistical methodology. Software improvements presently focuses on graphical representations, implementation of new methods, user's interfaces, and Windows version.

On the next page the general software scheme including the main options offered by the Portuguese version is shown.

The Hierarchical Cluster Analysis scheme emphasises the separation between classification of observations which is mostly concerned with classical clustering methods and classification of variables which is mostly concerned with probabilistic models, in the Portuguese module of LEASP 97. Moreover we use similarity coefficients in classification of variables. From the pedagogical perspective this can be used to learn about and discuss the role and the link among similarity and dissimilarity coefficients in cluster analysis, as well as on the notions of monotonic coefficients and distribution equivalence of coefficients (Bacelar-Nicolau, 1987).

The affinity coefficient (Bacelar-Nicolau, 1985, 1988; Matusita, 1951, 1955) is a similarity coefficient, which we have studied extensively either on the classical or the probabilistic clustering approach. It plays an important role in our hierarchical clustering models and in LEASP software. Moreover its clustering behaviour can easily be compared with other well-known coefficients, like Pearson and Spearman correlation coefficients, commonly used in cluster analysis of variables and in other multivariate data analysis methods.

Concerning the aggregation criteria, the Portuguese module of LEASP 97 offers a set of probabilistic methods, either separately or included in some adaptive formulae. There are some classical aggregation criteria included or not in the adaptive formulae. This allows users to perform validation studies on the clustering results, especially when using an extended version of the well known Lance and Williams adaptive formula for hierarchical agglomerative criteria, based on the VL similarity. Validation studies particularly base on comparison of clustering results obtained from suitable parameters variation and are based on computations concerning the corresponding L ordinal hierarchical levels and SD stratification/distortion coefficients.
 
 
 


 
 
 

5. Classifying ERASMUS students' subject area in approved applicatons from 1992/93

The previous methodology was applied for classifying a real data set concerning the frequency distribution of ERASMUS students, in approved applications in 1992/1993, by home country and subject area. Here we only refer here to the classification of subject area.

It was our goal to find some general patterns of profiles in the set of subject areas, defining "robust" priorities of some country groups. We found such a typology using the above parametric family of probabilistic models based on the VAL coefficient and choosing the significant partitions with two adequate statistics L and SD.

For instance we could compare the three clustering hierarchies corresponding to e = 0 (single linkage ), e = 1 x = 1/2 (AVB) and e = x = 1 (AVL). We could also observe the general pattern evolution of those trees (Bacelar-Nicolau, 1985, 1990), when going from a model which is associated to a chain effect (single linkage), to an other one characterized by an equicardinality or symmetry effect (AVL), crossing a model (AVB) which "balances" both effects. The AVB method gave in fact, as happens very often, the clustering hierarchy which better fits the data, representing a (kind of) consensus between the other two models.

Reading the AVB tree from top to bottom, one finds at the most significant levels the following four clusters: {Agriculture, Law, Medical Sciences}, {Business Management, Engineering, Miscellaneous,}, {Languages, Mathematics, Natural Sciences, Social Sciences}, {Education, Geography, Humanities, Lingua}, which define the general profiles of subject areas explained by the (VAL, AVL, L, SD) probabilistic model. Each profile appeared to be preferentially associated to a given European country set. With the above methodology we were able to find a quite suitable interpretation of the interdependency structure underlying the data.
 

ESTIMATED NUMBER OF STUDENTS IN APPROVED APPLICATIONS IN 1992/93 BY SUBJECT AREA
                 (VAL/WW, AVB, L, SD) PROBABILISTIC MODEL

  AGRICULTURE    --*--------------------------*
                                              |--*
  LAW            --*--------*                 |  |
                            |-----------------*  |
  MEDICAL SCIENCE--*--------*                    |
                                                 |-----*
  BUSINESS MANAGE--*-----*                       |     |
                         |-----------*           |     |
  ENGINEERING    --*     |           |           |     |
                   |-----*           |           |     |
  MISCELLANEOUS  --*                 |-----------*     |
                                     |                 |
  LANGUAGES      --*--------------*  |                 |
                                  |--*                 |
  MATHEMATICS    --*-----------*  |                    |--*
                               |--*                    |  |
  NATURAL SCIENCE--*--*        |                       |  |
                      |--------*                       |  |
  SOCIAL SCIENCES--*--*                                |  |
                                                       |  |
  EDUCATION      --*--------------------*              |  |
                                        |-----------*  |  |--*
  GEOGRAPHY      --*--------------------*           |  |  |  |
                                                    |--*  |  |
  HUMANITIES     --*-----------------------*        |     |  |
                                           |--------*     |  |
  LINGUA         --*-----------------------*              |  |
                                                          |  |
  ARCHITECTURE   --*--------------------------------------*  |
                                                             |
  FINE ARTS      --*-----------------------------------------*
REFERENCES

Bacelar-Nicolau, H. (1972): Analyse d'un algorithme de classification automatique. Rapport Maison de Siences de l'Homme, Paris.

Bacelar-Nicolau, H. (1981): Contributions to the Study of Comparison Coefficients in Cluster Analysis. Univ. Lisbon.

Bacelar-Nicolau, H.(1985): The Affinity Coefficient in Cluster Analysis. Meth. Oper. Res., vol.53, Verlag Anton Hain, 507-512.

Bacelar-Nicolau, H. (1987): On the distribution equivalence in cluster analysis. NATO ASI Series, vol F30, Pattern Recognition Theory and Applications, P.A.Devijver and J.Kittler, Springer-Verlag, 73-79.

Bacelar-Nicolau, H. (1988): Two probabilistic models for classification of variables in frequency tables. in: Classification and Related Methods of Data Analysis, H.H.Bock (ed). North Holland, 181-186.

Bacelar-Nicolau, H. (1989): Sur l'équivalence distributionnelle entre coefficients d'association. Bulletin of the International Statistical Institute (ISI), Contributed Papers of the 47th Session, Book 1, Paris, 89-90.

Bacelar-Nicolau, H. (1990): Classifying integer scale data by the affinity coefficient. Meth. Oper. Res., vol 60, Verlag Anton Hain, 587-595.

Bacelar-Nicolau, H., Nicolau, F. C. (1990): Analyse classificatoire de l'identité nationale chez les portugais. Une étude exploratoire fondée sur le coefficient d'affinité. Mathématiques, Informatique et Sciences Humaines, 28eme année, no 109, 55-64.

Bacelar-Nicolau, H., Nicolau, F.C. and Dias, O. (1992): Le Logiciel LEASP: Methodologie statistique. Applications a l'agro-industrie, Actes des VI Journees Europeennes Agro-Industrie et Methodes Statistiques, Montpellier, França, 118-121.

Bacelar-Nicolau, H., Nicolau, F. C. (1993): Classifying integer scale data by the affinity coefficient: a probabilistic approach. Proceedings of the Sixth International Symposium on Applied Stochastic Models and Data Analysis (ASMDA), J.Jansen and C.H.Skiadas (ed), World Scientific, vol 1, 63-74.

Bacelar-Nicolau, H., C. Nicolau, F. (1994): Exploratory and confirmatory discrete multivariate analysis in a probabilistic approach for studying the regional distribution of AIDS in Angola. New Approaches in Classification and Data Analysis, E. Diday /Y. Lechevalier/M. Schader (eds.), Spriger-Verlag, pp. 610-618

Bacelar-Nicolau, H., C. Nicolau, F., Mira, C., Dias, O., (1994), LEASP: Learning and Teaching New Methodology on Probabilistic Clustering Models. Contributed paper presented at the International Conference on Teaching Statistics (ICOTS4), Marrakech, Marroco

Bacelar-Nicolau, H., C. Nicolau, F. (1997): Classification of Variables VS Classification of Subjects: a Probabilistic Approach for Binary Data. in Invited Paper Meeting on Classification of Variables VS Classification of Subjects, Proceedings of the 51st Session of the International Statistical Institute (ISI), Istambul, Turquia, 349-351.

Bacelar-Nicolau, H., Nicolau, F. C., Dias, O., Ramos, L. (1998): LEASP97: An Improvement in Teaching and Analizing New Methodology on Probabilistic Clustering Models. in Topic 7: "The Role of Technology in the Teaching of Statistics", Proceedings of the Fifth International Conference on Teaching of Statistics, ICOTS-5, Singapore, 863-869,

Fraser, D.A.S. (1975): Non Parametric Methods in Statistics. Chapman and Hall, 235-237.

Gifi, A. (1990): Nonlinear Multivariate Analysis, John Wiley.

Gower, J.C. (1988): Classification, geometry and data analysis. Classification and Related Methods of Data Analysis, H.H.Bock (ed.), North Holland, 3-14.

Lerman, I.C. (1970): Sur l'analyse des données préalable à une classification automatique, proposition d'une nouvelle mesure de similarité. Rapport MSH Paris.

Lerman, I.C. (1973): Etude distributionnelle de statistiques de proximité entre structures finies de même type, application à la classification automatique. Cahiers du BURO, Paris.

Lerman, I.C. (1981): Classification et Analyse Ordinale des Données, Dunod, Paris.

Matusita, K.(1951): On the Theory of Statistical Decision Functions. Ann. Inst. Stat. Math., vol.III, 1-25.

Matusita, K. (1955): Decision rules, based on distance for problems of fit, two samples and estimation. Ann. Inst. Stat. Math., vol. 26, n.º4, 631-640.

Nicolau, F.C. (1981): Hierarchical Cluster Analysis Criteria Based on Distribution Function. Univ. Lisboa.

Nicolau, F.C. (1983): Cluster analysis and distribution function. Meth. Oper. Res., vol 45, Verlag Anton Hain, 431-433.

Nicolau, F.C., Bacelar-Nicolau, H. (1998): Some Trends in the Classification of variables, Data Science, Classification, and Related Methods, C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H. H. Bock, Y. Baba (eds.), Springer, 89-98

Tiago de Oliveira, J.(1982): The Delta-Method for Obtention of Asymptotic Distributions: Applications. Publ.Inst.Stat.Univ.Paris, vol.XXVII, 49-70.

________________________

(1)This work was partially supported by the Erasmus/Sokrates Program on Mathematical Psychology (Program Co-ordinator: Luc Delbeke, Univ. Leuven) and the Luso-French Program on Multivariate Data Analysis, ADAM, (Portuguese co-ordinator: H. Bacelar-Nicolau).

(2)ADAM1: Scientific Research and Technological Luso-French Cooperation Program on Multivariate Data Analysis, supported by the JNICT/Portugal and the Embassy of France, joining the Laboratory of Statistics and Data Analysis (LEAD) at the Faculty of Psychology and Education, University of Lisbon / H. Bacelar-Nicolau, the Laboratory of Statistics and Actuarial Mathematics (LEMA) / Department of Mathematics / New University of Lisbon / F.C. Nicolau and the Biometry Unity, INRA, University of Montpellier / Y. Escoufier.


ODL-Team
Thu Jan 13 2000