Learning from very large data sets

  1. Limas, M.C. 1
  2. Ordieres Mere, J.B. 1
  3. Ciampi, A. 1, 2
  4. Elias, F.A. 1, 3
  1. 1 Universidad de León, León, Spain (GRID grid.4807.b)
  2. 2 McGill University, Montreal, Canada (GRID grid.14709.3b)
  3. 3 Universidad de La Rioja, Logroño, Spain (GRID grid.119021.a)

Journal:
WSEAS Transactions on Information Science and Applications

ISSN: 1790-0832

Year of publication: 2005

Volume: 2

Issue: 10

Pages: 1641-1648

Type: Article

Abstract

Understanding a process from its recorded data yields advantages of great interest. The number of firms that seek and find solutions to their production problems by analyzing their production data is increasing every day. Current technology makes it possible to routinely store in databases the control variables of special interest and the command history of a process. Later analysis of these databases provides a potentially precious source of high-quality information that is of great help in decision making. Nevertheless, analyzing such a database is frequently not a trivial task: it requires the simultaneous use of tools of very varied origin, in what has now become a field of research in itself, known as 'data mining'. Among the first challenges the data miner must tackle is summarizing the complexity of the data into a number of distinct clusters, which represent 'interesting', often unexpected, behavior patterns of the process under analysis. Despite the powerful computers and efficient clustering algorithms currently available, their limits are typically exceeded when mining massive databases such as those arising from industrial processes. Hence the need for new clustering algorithms that directly address the problem of size. We present here one such algorithm, successfully applied to a variety of industrial processes, as well as to data sets of a different nature originating in the fields of medicine, biology and epidemiology. Our algorithm, named CiTree, yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships among sample units. As an example of the use of CiTree, we also show a real case study whose analysis provided useful information.