Next: Knowledge Discovery in Literature Data Bases
Up: ``Robots Are Us'': Automated Information Discovery
Previous: Addressing the Heterogeneity of Subject Indexing in the ADS Databases
Table of Contents -- Index -- PS reprint -- PDF reprint


Library and Information Services in Astronomy III
ASP Conference Series, Vol. 153, 1998
Editors: U. Grothkopf, H. Andernach, S. Stevens-Rayburn, and M. Gomez
Electronic Editor: H. E. Payne

Comparison of Two ``Document Similarity Search Engines''

Philippe Poinçot
Observatoire de Strasbourg, Université Louis Pasteur, Strasbourg, France

Soizick Lesteven
Observatoire de Strasbourg, Université Louis Pasteur, Strasbourg, France

Fionn Murtagh
Observatoire de Strasbourg, Université Louis Pasteur, Strasbourg, France; Faculty of Informatics, University of Ulster, Londonderry, Northern Ireland

 

Abstract:

We have developed and used the ``CDS document map'', based on neural networks (Kohonen maps) and available at http://simbad.u-strasbg.fr/A+A/map.pl
In this self-organizing map, documents are gradually clustered by subject theme. The tool is based on the keywords associated with the documents. Given a selected document, we locate it on the CDS document map and retrieve the articles clustered in the same area.

The second search engine, used by the ADS[*], can find all abstracts in the ADS database that are similar to a given one, by means of a ``keyword request''.

We have compared the results of the two document similarity search engines using the same set of documents. One example is described and the results are discussed.

1. Introduction

Techniques for disseminating information are in constant development. As a result, electronic publications have become increasingly important in many domains, and particularly in astronomy. Many documents can now be accessed on-line, so it is important to have efficient information retrieval tools at our disposal.

In this article, we will discuss the bibliographical information retrieval system we have developed at the Centre de Données astronomiques de Strasbourg (CDS). In order to validate our system, we compare our results with those from the ADS. The initial results are then discussed.

2. Our Document Retrieval Tool

We are developing at the CDS an information retrieval tool based on Self-Organizing Maps (SOM). These maps are an artificial intelligence technique; more precisely, they are a type of unsupervised neural network.

2.1. Self-Organizing Maps (SOM): Basic Principles

A SOM can be thought of as a table in which objects are classified (Figure 1), with similar objects located in the same area. In the example (Figure 1, right), a global classification is visible: the three different shapes fall into three different clusters, and the largest objects are located towards the center of the table. Each cluster is also ordered internally: the largest objects are at one side of the cluster and the smallest at the other.
Now, let us see how a SOM works:
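As an illustration, the standard Kohonen training loop can be sketched as follows. This is a minimal sketch in plain Python, not the CDS implementation: the function names (train_som, best_matching_unit), the linear decay of the learning rate and radius, the Gaussian neighbourhood and all default parameter values are assumptions chosen for readability.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, radius0=None, seed=0):
    """Train a small rectangular SOM and return its grid of weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0
    # One weight vector per node, initialised at random in [0, 1).
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighbourhood radius shrink over time
            # (linear decay is an assumption of this sketch).
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            radius = max(radius0 * (1.0 - frac), 0.5)
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    # Gaussian neighbourhood centred on the winning node.
                    h = math.exp(-grid_d2 / (2.0 * radius * radius))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

Each input vector pulls its winning node, and to a lesser degree the nodes around it, towards itself. This is why similar objects would be expected to end up in the same area of the table after learning.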


  
Figure 1: Object locations. Left: before learning. Right: after learning.

2.2. The Bibliographic Map

Our set of documents.

We used SOMs to classify bibliographical data. In this study, our set of documents comes from the journal ``Astronomy and Astrophysics'', from 1994 to 1998. The descriptive vector of each document is based on the journal keywords associated with it. We eliminated the rare keywords (those found in fewer than 5 documents). In the end, we used about 4000 documents described by 269 keywords.
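The construction of the descriptive vectors can be sketched as follows. This is a minimal sketch: the function name build_keyword_vectors is hypothetical, and the binary (0/1) encoding is an assumption, since the text states only that the vectors are based on the journal keywords.

```python
from collections import Counter


def build_keyword_vectors(doc_keywords, min_docs=5):
    """Return (vocabulary, vectors) with one binary vector per document.

    doc_keywords maps a document identifier to its list of keywords.
    Keywords found in fewer than min_docs documents are discarded.
    """
    # Count in how many documents each keyword occurs (once per document).
    doc_freq = Counter()
    for keywords in doc_keywords.values():
        doc_freq.update(set(keywords))
    vocabulary = sorted(k for k, n in doc_freq.items() if n >= min_docs)
    index = {k: i for i, k in enumerate(vocabulary)}
    vectors = {}
    for doc_id, keywords in doc_keywords.items():
        # Binary encoding (an assumption): 1 if the keyword is present.
        vec = [0] * len(vocabulary)
        for k in keywords:
            if k in index:
                vec[index[k]] = 1
        vectors[doc_id] = vec
    return vocabulary, vectors
```

With the default min_docs=5, a keyword must appear in at least 5 documents to be kept, which corresponds to the rare-keyword filter described above.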

Our use of SOM.

We used the SOMs, adapting them to our own needs:


  
Figure 2: Our use of SOM. Left: shifting the rows or the columns. Right: two classification levels.

The graphical interface.

We display this classification as a density map, which shows graphically the areas containing papers of similar content and the number of documents in each area. The map is then labeled so that the themes it covers can be located on it (Figure 3, left).

The user can select one node of the map, by clicking on the picture, to obtain information about the articles located in it: the number of documents and the keywords describing them appear on the right side of the interface (Figure 3, right). The user can also access the detailed map and/or the article content (title, authors, abstract), together with all the facilities provided by the CDS bibliographical service, including a link to the ADS and to the on-line full paper when available.

The user interface allows one to select and display on the map only a part of the database. This is used with keyword queries (only the documents containing selected keywords are shown), or with an external list of documents (bibcode queries).


  
Figure 3: Left: the principal map. Right: the user interface.

3. Validation of the Bibliographic Map

3.1. Method

In order to validate our document retrieval tool, we have begun a study comparing our system with the ADS system, taken as a reference. The ADS offers a wide range of query options, so we were able to use exactly the same set of documents with the two systems.

Two different types of comparison have been carried out. In the first, we use the ADS with a keyword query, in order to validate our system more precisely. In the second, we use the ADS with a full-text query, in order to compare the two systems when each uses its own criteria for document retrieval.

What is different in the two systems

The different steps.

The method consists of selecting one document, called the ``starting document'', and retrieving similar documents (from the same bibliographic database) with each of the two systems. The resulting sets of documents are then compared by an expert in the field. More precisely:


  
Figure 4: Spreading of documents retrieved by the ADS. Left to right: 20, 40 and 60 documents.

4. Results

In the following, we use as the starting document the article ``Metallicities and carbon abundances of 5 red supergiants of the SMC globular cluster NGC 330'' by Meliani, M. T. et al. (1995A&A...300..349M). This article is described by the following keywords: GALAXIES: ABUNDANCES, MAGELLANIC CLOUDS, STAR CLUSTERS, GLOBULAR CLUSTERS: NGC 330 (SMC), STARS: ABUNDANCES.

4.1. Validation with the ``keywords only'' ADS version

Results from the CDS map.

The starting document is located in node 6 of the detailed map, and only 2 of the 8 surrounding nodes contain documents (nodes 5 and 11). The first CDS list therefore contains 19 documents, scored as shown in Table 1.
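Retrieval from a node and its surrounding nodes can be sketched as follows. This is a minimal sketch: the function name neighbourhood_documents, the (row, col) addressing of nodes and the absence of wrap-around at the map edges are assumptions, not a description of the CDS code.

```python
def neighbourhood_documents(node_docs, start, rows, cols):
    """Collect documents from the start node and its (up to 8) neighbours.

    node_docs maps (row, col) to the list of documents placed on that node.
    Returns a dict keyed by node, keeping only non-empty nodes.
    """
    r0, c0 = start
    found = {}
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = r0 + dr, c0 + dc
            # Skip positions outside the map (no wrap-around is assumed).
            if 0 <= r < rows and 0 <= c < cols:
                docs = node_docs.get((r, c), [])
                if docs:
                    found[(r, c)] = list(docs)
    return found
```

The result keeps the node of origin for each document, so relevance can then be tallied per node, as in Table 1.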

Results from ``keywords only'' ADS version.

For the keyword query, the ADS retrieved 323 documents. The last 315 all have the same low score because each is described by only one keyword of the query; we eliminated them from our study. Of the 17 documents that remain, only one is judged less relevant by the specialist.
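The elimination of single-keyword matches can be sketched as follows. This is a minimal sketch under a strong assumption: documents are scored simply by the number of keywords they share with the query, which is a stand-in for illustration and not the actual ADS scoring. The function names keyword_matches and drop_single_matches are hypothetical.

```python
def keyword_matches(query_keywords, doc_keywords):
    """Rank documents by the number of keywords shared with the query."""
    query = set(query_keywords)
    scored = [(len(query & set(kws)), doc_id)
              for doc_id, kws in doc_keywords.items()]
    # Highest overlap first; ties broken by identifier for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, n) for n, doc_id in scored if n > 0]


def drop_single_matches(ranked):
    """Keep only documents sharing at least two keywords with the query."""
    return [(doc_id, n) for doc_id, n in ranked if n >= 2]
```

Under this assumed scoring, every document matching a single query keyword receives the same lowest score, so the tail of the ranked list can be cut in one step.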

Comparison

These initial results show that both systems retrieve almost the same documents, which allows our system to be validated.


 
Table 1: Results for the documents coming from the detailed map (list 1).

node   relevant documents   less relevant documents
6*     8                    1
5      4
11     2                    4

*node containing the starting document.
 

4.2. Validation with the ADS full text query

Results from the CDS map.

We take into consideration all the documents of the detailed map, which contains 80 documents (list 2). A check by an expert shows that 22 of them are relevant and 58 are less relevant.

Results for the ADS full text query.

To compare the second ADS list (list B) with the map, we selected its first 80 documents (the ADS sorts documents by decreasing similarity). As described in Section 2.2 (the graphical interface), an external list of documents can be visualized on the density map when the documents are already classified on the map. Figure 4 shows where the documents retrieved by the ADS for the given starting document are located on the map. The documents spread out as the list grows longer: the lower a document's score, the farther it tends to lie from the ``starting document''.
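One way to quantify this spreading is the mean distance on the map between the starting node and the first k retrieved documents. This is a minimal sketch: the function name mean_grid_distance, the (row, col) node coordinates and the choice of Euclidean distance are assumptions, and no such measure is reported in this study.

```python
import math


def mean_grid_distance(ranked_docs, doc_node, start_node, k):
    """Mean Euclidean grid distance to start_node over the first k documents.

    ranked_docs is ordered by decreasing similarity; doc_node maps each
    document to its (row, col) node. Documents not on the map are ignored.
    """
    distances = [
        math.hypot(doc_node[d][0] - start_node[0],
                   doc_node[d][1] - start_node[1])
        for d in ranked_docs[:k]
        if d in doc_node
    ]
    return sum(distances) / len(distances) if distances else 0.0
```

Evaluating this for k = 20, 40 and 60, the list lengths shown in Figure 4, would give one number per panel; an increasing sequence would correspond to the spreading seen on the map.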

A check by an expert gives 25 relevant documents, and 55 less relevant.

Comparison

The relevant documents missing from the ADS list are retrieved if we examine the ADS result list beyond the first 80 documents. These results show that the list of documents retrieved by the ADS (the first documents of list B) is noisier. But if we examine the complete list, the ADS retrieves more relevant documents than our system.

Full-text analysis makes it possible to retrieve more relevant documents, but these are buried in a longer list of documents.


 
Table 2: Results for all the documents coming from the detailed map (list 2), compared to the ADS (list B).

                           relevant   less relevant
ADS (first 80 documents)   25         55
map                        22         58
common                     15         22
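The counts in Table 2 can be turned into simple ratios. This is a minimal sketch: the helper name precision and the derived quantities are illustrative, only the counts come from the table.

```python
def precision(relevant, less_relevant):
    """Fraction of retrieved documents judged relevant."""
    return relevant / (relevant + less_relevant)


# Counts taken from Table 2: (relevant, less relevant).
ads = (25, 55)
cds_map = (22, 58)
common = (15, 22)

ads_precision = precision(*ads)          # 25 of 80 documents
map_precision = precision(*cds_map)      # 22 of 80 documents

# Share of each system's relevant documents that the other also retrieved.
common_share_of_ads = common[0] / ads[0]        # 15 of 25
common_share_of_map = common[0] / cds_map[0]    # 15 of 22

print(f"ADS precision: {ads_precision:.3f}")
print(f"map precision: {map_precision:.3f}")
print(f"common / ADS relevant: {common_share_of_ads:.2f}")
print(f"common / map relevant: {common_share_of_map:.2f}")
```

On these counts, 15 of the 25 relevant ADS documents (60%) are also found on the map, and 15 of the 22 relevant map documents (about 68%) are also in the first 80 ADS documents.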
 

5. Conclusion

This study shows that the bibliographic map gives results comparable with those of the ADS keyword query. These first results allow us to validate our information retrieval system.

When we compare our results with the ADS full-text query, we find that the ADS retrieves more relevant documents, but the bibliographic map appears to be more accurate for the first documents retrieved. It seems logical that the full-text search is noisier, but more complete when the full result list is taken into account.

The two systems appear to be complementary: about 60% of the retrieved documents are identical, and each system also pulls in other similar documents. We recommend using both systems to obtain as much relevant information as possible.

Furthermore, the bibliographic map, with its graphical interface, is a new approach that is well suited to the Web. The keyword and bibcode queries provide complementary access to the map. The links between neighboring nodes provide a new mechanism for helping users retrieve similar documents.

However, the bibliographic map works only with documents described by a well-defined set of keywords. To extend this tool, we are going to develop a new map based on full-text analysis.



Further Reading


Honkela, T., Pulkki, V., & Kohonen, T. 1995, ICANN-95, 2, 3-7

Kohonen, T. 1995, Self-Organizing Maps, (Berlin: Springer)

Lesteven, S., Poinçot, P., & Murtagh, F. 1996, Vistas in Astronomy, 40(3), 395-400

Lin, X., Soergel, D., & Marchionini, G. 1991, In: Proc. 14th. Ann. Int. ACM/SIGIR Conf. on R & D in Information Retrieval, 262-269

Murtagh, F. & Hernández-Pajares, M. 1995, Journal of Classification, 12, 165-190

Poinçot, P. 1997, Actes des Conférences INFORSID, 625-641

Poinçot, P., Lesteven, S., & Murtagh, F. 1998, A&AS, 130, 183

Salton, G. 1991, Science, 253, 974



Footnotes

...ADS
NASA Astrophysics Data System (http://cdsads.u-strasbg.fr, http://adswww.harvard.edu, http://ads.nao.ac.jp)

...search
This is an old version of the ADS Abstract Service, but still accessible.


© Copyright 1998 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA

