Citation-Based Document Categorization: An Approach Using Artificial Neural Networks
The automatic organization of large document collections becomes more important as the amount of information available in digital form grows. This study addresses the problem by evaluating the use of Artificial Neural Networks (ANNs) to categorize documents automatically through analysis of the references they cite. The article describes a method, based on bibliometric concepts, for generating clusters of documents. The method is grounded in the premise that common citations indicate relationships among documents, so publications are categorized using citations as the main input. ANNs are typically used to solve problems of approximation, prediction, classification, categorization, and optimization, and many experiments reported in the literature use Self-Organizing Maps (SOMs) to organize documents for information retrieval. In this work, SOM networks are used to categorize the documents in a test database. In this categorization process, the semantic relationships among documents are defined not by shared terms but by shared cited references and their years of publication. After the method was validated with a prototype, a database was created containing the references cited in 200 articles published in the journal IEEE Transactions on Neural Networks between 2001 and 2010. The ANN categorized the publications and presented them in groups organized by their common citations. The results show that the ANN successfully identified clusters of authors and texts through their cited references. These clusters, formed by the automatic classification of documents, provide evidence of semantic relationships between the documents.
Such clusters can be useful, for example, for automatically identifying groups of researchers working in related fields or research trends in specific domains of knowledge. Another application is information retrieval, where the clusters could help users formulate or reformulate their queries.
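
The citation-based categorization described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each document is encoded as a binary vector marking which references it cites (the years of publication of the cited references are omitted for brevity), and it trains a small SOM written directly in NumPy. The function names, the grid size, the learning-rate and neighborhood schedules, and the sample references R1 to R6 are all hypothetical choices made for illustration.

```python
import numpy as np

def build_citation_matrix(docs):
    """Encode documents as binary rows over the set of distinct cited references."""
    refs = sorted({r for cited in docs.values() for r in cited})
    col = {r: j for j, r in enumerate(refs)}
    X = np.zeros((len(docs), len(refs)))
    for i, cited in enumerate(docs.values()):
        for r in cited:
            X[i, col[r]] = 1.0
    return X, refs

def train_som(X, rows=2, cols=2, epochs=200, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; returns one weight vector per map node."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of each node, used for the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = rng.random((rows * cols, X.shape[1]))
    for t in range(epochs):
        frac = t / epochs
        lr = lr0 * (1 - frac)                  # linearly decaying learning rate
        sigma = sigma0 * (1 - frac) + 0.01     # shrinking neighborhood radius
        for x in X[rng.permutation(len(X))]:
            # Best-matching unit: node whose weights are closest to the input.
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighborhood weights
            W += lr * h[:, None] * (x - W)
    return W

def assign(X, W):
    """Map each document to the index of its best-matching node."""
    return [int(np.argmin(((W - x) ** 2).sum(axis=1))) for x in X]

# Hypothetical toy data: document id -> list of cited reference ids.
docs = {
    "A": ["R1", "R2", "R3"],
    "B": ["R1", "R2", "R3"],
    "C": ["R4", "R5"],
    "D": ["R4", "R5", "R6"],
}
X, refs = build_citation_matrix(docs)
W = train_som(X)
labels = assign(X, W)
```

Documents assigned to the same node, or to neighboring nodes on the map, can then be read as one citation-based group. Documents with identical citation vectors always share a node; how documents with only partially overlapping citations are grouped depends on the map size and training schedule.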