Automatic Classification of Wikipedia Articles by Using Convolutional Neural Network

  • Keita Tsuji, Faculty of Library, Information and Media Science, University of Tsukuba

Abstract

Wikipedia has emerged as an important source of information for university students. It has been reported that students tend to start their searches with Google, which leads them to Wikipedia articles, even when they are in university libraries. Recent research findings indicate that relatively few students actually search for and read books. Within this context, we are developing a system that recommends books based on the Wikipedia articles that users read in libraries (the system will be added to the web browsers of libraries’ desktop PCs). The system aims to encourage students to read library books as a more reliable source of information rather than relying on Wikipedia articles. Nippon Decimal Classification (NDC) categories have been found to be effective in machine-learning-based book recommendation. Therefore, if NDC categories could be assigned to Wikipedia articles, they might serve as an effective tool for book recommendation. Accordingly, we developed a method to automatically assign NDC categories to Wikipedia articles by using a convolutional neural network (CNN), one of the representative methods of deep learning. We found that the accuracy of assigning the top level (i.e. Main Class) and the second level (i.e. the combination of Main Class and Division) of NDC reached 87.7% and 74.7%, respectively. These results were achieved by using the titles and categories of Wikipedia articles as input to the CNN; other input combinations, such as titles, categories, and main texts together, yielded lower accuracy.
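
The two accuracy figures above refer to different depths of the NDC hierarchy. As a minimal sketch of how such level-wise accuracy could be computed, the following Python assumes that each NDC code is a string whose first digit is the Main Class and whose second digit is the Division. The function names and the sample labels are hypothetical illustrations, not taken from the paper, and the CNN itself is not shown.

```python
def ndc_prefix(code: str, level: int) -> str:
    """Return the first `level` digits of an NDC code.

    level=1 gives the Main Class; level=2 gives Main Class + Division.
    Non-digit characters such as the decimal point are ignored.
    """
    digits = [ch for ch in code if ch.isdigit()]
    if len(digits) < level:
        raise ValueError(f"NDC code {code!r} has fewer than {level} digits")
    return "".join(digits[:level])


def accuracy_at_level(gold, predicted, level):
    """Fraction of items whose predicted code matches the gold code
    when both are truncated to the given NDC level."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    hits = sum(
        ndc_prefix(g, level) == ndc_prefix(p, level)
        for g, p in zip(gold, predicted)
    )
    return hits / len(gold)


# Hypothetical gold labels and classifier outputs (illustration only).
gold = ["007", "913.6", "210", "548"]
predicted = ["007", "914", "289", "348"]

main_class_acc = accuracy_at_level(gold, predicted, level=1)
division_acc = accuracy_at_level(gold, predicted, level=2)
print(main_class_acc, division_acc)
```

Because a second-level match requires the Main Class to match as well, second-level accuracy computed this way can never exceed top-level accuracy, which is consistent with the ordering of the two reported figures.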

Published
2019-02-11
How to Cite
TSUJI, Keita. Automatic Classification of Wikipedia Articles by Using Convolutional Neural Network. Qualitative and Quantitative Methods in Libraries, [S.l.], v. 6, n. 3, p. 371-380, feb. 2019. ISSN 2241-1925. Available at: <http://qqml-journal.net/index.php/qqml/article/view/415>. Date accessed: 18 jan. 2020.
