Next: Analysis of Evolutionary Trends in Astronomical Literature
Up: ``Robots Are Us'': Automated Information Discovery
Previous: Distributed Information Search and Retrieval
Table of Contents -- Index -- PS reprint -- PDF reprint
Library and Information Services in Astronomy III
ASP Conference Series, Vol. 153, 1998
Editors: U. Grothkopf, H. Andernach, S. Stevens-Rayburn, and M. Gomez
Electronic Editor: H. E. Payne
Soizick Lesteven
CDS - Observatoire Astronomique, 11, rue de l'Université,
67000 Strasbourg, France
F. Bonnarel, P. Dubois, D. Egret, P. Fernique, F. Genova, F. Murtagh,
F. Ochsenbein, M. Wenger
CDS - Observatoire Astronomique, 11, rue de l'Université,
67000 Strasbourg, France
Abstract:
The explosion of on-line services and the rapid evolution in information
technology, with the advent of the WWW, gives its full dimension to
the electronic publication. Electronic publication has to be conceived
with links to external resources (databases, bibliographic services) and with
intelligent information retrieval tools.
To provide links one needs to recognize the relevant information from a
document, and to connect this information to the proper distributed resource.
Recognition is the first step; the whole procedure may include the validation
for correctness and completeness, and the addition of dynamic links
to distributed services.
In addition, publication in electronic form permits new methods to access
published information.
Several activities take place at CDS in this context:
CDS develops and maintains links to and from other
distributed services
(CDS services with electronic publications and the ADS);
CDS develops and maintains services which give access to published
information
(the VizieR catalogue browser for published tables or SIMBAD which
tracks object citations in papers);
CDS develops information retrieval tools (the bibliographic maps or
tools to automatically recognize object names in a text).
All these developments require close connections with the distributed services
(editors, database managers, service managers, ...). A few examples will be
presented.
The rapid evolution in information technology and the explosion of on-line
services are bringing important modifications in the way scientists
collect information for their research. The availability of data and of
scientific literature on the WWW makes it possible to interlink resources
on the network, thus giving a highly value-added service for research
purposes.
In the astronomical community, the scientific literature and the data that
support the research are well-defined and electronically available. Some
interoperabilities already exist between data resources (coming from data
centers as
CDS, NED and observatory archives), the ADS abstract service and electronic
publications (ApJ, AJ, PASP, A&A, NewA, ...). Electronic publishing begins
to be conceived with extensive links both within the document and to external
resources.
To provide links one needs to extract relevant information from the document
and to connect this information to the proper distributed resource.
Information extraction is a really complex process that should not be
underestimated. Recognition is the first step of the whole procedure;
it can be relatively straightforward when the information is
tagged in the text or corresponds to a standard format (tables, bibcodes)
but it can be more complex when the data is heterogeneous (e.g. astronomical
object names). The second step is the validation of the extracted information.
The validation process should ensure the correctness of
the information but also its completeness. This
procedure should be completed manually by an expert.
The third step of the extraction procedure is the addition of dynamic links
to the distributed services. One may have to build procedures
that take into consideration the fact that these services can evolve later on,
whereas the published text has to remain unchanged.
Several CDS activities take place in this context.
CDS has developed and maintains services which give access to published
information, links between distributed services, information retrieval tools,
and automatic recognition and extraction tools.
A few applications will be presented in the following. The first one concerns
the published tabular data and is already operational. The second one
is an extraction tool for astronomical object names and is under construction.
These two applications will illustrate the complexity of the information
extraction process.
The CDS collects and distributes astronomical data catalogues, related to
observations of stars and galaxies, and other galactic and extragalactic
objects. Catalogues about the solar system bodies and atomic data are also
included.
Since January 1993, tables from articles published in Astronomy &
Astrophysics
are prepared and made available on-line at CDS, by agreement with the editor.
Tables from the AAS CD-ROMs were also made available on-line by CDS by
agreement
with the AAS. Tables from some other major journals are also available.
The number of tables is continuously
increasing, in May 1998, one counts 2178 published tables from the
major astronomical journals (Table 1).
Table 1:
Number of published tables from the major astronomical journals available from the CDS catalogue service.
Astronomy and Astrophysics |
375 catalogues |
Astronomy and Astrophysics Supplement Series |
843 catalogues |
Astronomical Journal |
346 catalogues |
Astrophysical Journal |
122 catalogues |
Astrophysical Journal Supplement Series |
267 catalogues |
Publications of the Astronomical Society of the Pacific |
52 catalogues |
Other major journals |
173 catalogues |
The CDS offers two ways to retrieve the catalogued data:
-
The ``Astronomer's Bazaar'' (http://cdsweb.u-strasbg.fr/Cats.html
)
describes all catalogues stored at CDS, which can be copied via anonymous FTP.
-
``VizieR'', allows to access the most complete library
of published astronomical catalogues and tables organized in a self-documented
database (http://vizier.u-strasbg.fr/
). VizieR is an excellent example
of a new
and powerful method to access published electronic information.
These services give access both to astronomical catalogues and published tables,
thanks to the definition of a common standard which is used both by data
centers and publishers.
In order to facilitate the usage of the data in a large variety of
contexts and
the data processing, F. Ochsenbein (1994)
proposed a Standard Description for Astronomical Catalogues. This standard
documentation (accessible on the Web at
http://vizier.u-strasbg.fr/doc/catstd.htx
) is now shared with other
astronomical catalogue producers.
The description gives the signification and the format of the tables
thus allowing easy extraction of the data from the tables. The standardization
plays now a key role for exchange of tabular data between different partners
by allowing:
- Data validation (from information given in the description)
Edition of excerpts of tables to check their validity
-
Transformation into other formats (FITS, Fortran, ..)
-
Automated integration into the VizieR database providing access to all
facilities
From an on-line article (for instance a paper in A&AS) the reader can get
direct access to data tables available at CDS and to the facilities offered
by the catalogue database.
Reversely, from the electronic tables, one can access the corresponding
on-line article when it exists.
The interconnectivity between the electronic tables and the other astronomical
services on the network is already running and new links will be added in the
near future.
This interconnectivity is based on another standard: the 19-digit bibcode
(http://cdsweb.u-strasbg.fr/simbad/refcode.html
).
First developed as a result of the cooperation between NED and CDS to
provide a
unique and readable
representation of a bibliographic reference, it has become a standard code
also - with minor variations -
for ADS and other bibliographic services, in particular on-line journals.
This code facilitates the exchange
and can be automatically created. It makes the interconnectibility feasible
between all bibliographic services and bibliographic data producers.
When an astronomer reads an on-line article where an astronomical object
name is cited, he/she would frequently like to get more information about it.
Presently, this can be done by opening a new window on the screen and
connecting to SIMBAD or NED and sending the
appropriate request. Hypertext features of the Web allow in principle a much
easier approach where just by clicking on the
displayed name one would receive that information directly. The link can be
completely transparent for the users, they don't need to know where the
information is located and how the object has to be written for the
query.
SIMBAD and NED also have to manage this information. The work is done manually
by the bibliographers who read publications. Every time they recognize an
object name in the article (title, abstract, table, ...), they update the
databases. This means that they find how the object name has to be written in
the database, and whether or not the object is already in the
database. If this is not the case,
they create the new object name. Then they link the reference to the existing
or new object and add some basic data (coordinates, magnitudes) when known.
The maintenance of that information is done by the SIMBAD and NED
teams.
To help the bibliographers, and to allow direct access to an astronomical
database from a electronic article text, we have begun to develop tools to
automatically recognize the astronomical object names in texts. The problem is
not trivial because an object name may be very complex, and it can be
written in many
different ways. Moreover, new acronyms are
created on a regular basis for newly published lists.
An astronomical object name may be short, long, structured or not:
examples are Orion Nebula, the Superantennae, DR21(OH),
CCDM J00335+4509BC, NGC 1866, QSO 0347-3819, Cl* NGC 2419 SAW V18, T Tau N,
etc.
The extraction of all these kinds of names in a text is not
straightforward; the way these objects are written is heterogeneous and
varies from one paper to another, or even within a given paper.
To provide an automatic extraction tool, we have developed a software based
on the ``Dictionary of the Nomenclature of Celestial Objects'' (Lortet et
al. 1994).
This dictionary is a reference work which tracks all designations
quoted in
the literature; it is available on the Web at
http://vizier.u-strasbg.fr/cgi-bin/Dic
.
A designation is a structured name basically made of an
acronym and a numbering which are both strings of alphanumeric
characters.
The structure of the numbering is called the format. Examples of formats are
NNN for a running number as in NGC, +/-DD NNNN for a
running number in a declination zone as in BD, JHHMMm+DDMMAAA
for J2000 coordinates as
in CCDM, FFF-NNN for a running number in a field as in ESO, etc.
A specifier can be added. There should be
one object per designation but unfortunately many exceptions to this rule
are found in the published literature. The Dictionary provides full references
and usages of the different acronyms. An example, corresponding to the
ESO
acronym, is given in Table 2. Furthermore, the Dictionary also
gives the corresponding names in SIMBAD. It presently contains more than 5000
acronyms and it is updated on a regular basis.
Table 2:
Dictionary of the Nomenclature of Celestial Objects: Result of
a query for the acronym ``ESO''
Acronym |
Use |
Format |
Year |
1st Author |
Obj. Type |
ESO |
ESO |
FFF-TTT NN |
1981 |
HOLMBERG E.B.+ |
(Opt) |
|
ESO |
FFF-NNN |
|
|
|
ESO |
ESO |
HHMMSS+DDMM.m |
1982 |
LAUBERTS A. |
(Opt) |
(ESO) |
Ruiz |
FFF-NNNA |
1988 |
RUIZ M.T.+ |
* |
|
Ruiz |
FFF-NNNW |
|
|
|
ESO-Halpha |
|
NNN |
1992 |
REIPURTH B.+ |
Em. * |
ESO-HA |
ESO-Halpha |
NNN |
1994 |
PETTERSSON B.+ |
Em. * |
ESO-LV |
ESO-LV |
FFF-NNNN |
1989 |
LAUBERTS A.+ |
G |
Electronic publications are more and more conceived with links to external
resources. Different publishers try to integrate direct links from object names
in on-line articles to external information about the object.
Two different approaches appear. The first one is
implemented by the journal ``New Astronomy". Some object names are selected in
the article by the publisher, and a link to SIMBAD is included after
validation by the
SIMBAD team. The second approach is implemented by the journal
``Astronomy and Astrophysics''. A LATEX macro has been created by the
publisher, allowing authors to tag object names in their article.
Some control tools have to be developed to help the authors and maintain the
correctness of the link. Furthermore, validation by an expert will have to be
performed to ensure the validity of the object name.
Another way to extract object names is to develop an automatic
recognition tool.
The tool is based on the ``Dictionary of the Nomenclature of Celestial
Objects''.
It is written in C language and uses rules (written with regular expressions).
Each designation, coming from the dictionary, is automatically translated into
a rule. The text is searched for the set of rules thus collected.
Identifiers are retrieved. As already discussed, an astronomical object
name can be a complex expression, a validation by an expert remains necessary
to ensure the correctness and completeness of the recognition. Some of the
inaccurate recognitions can be detected by filtering the results
(for example, space mission names, spectral types, atomic or
molecular species can easily be confused with object names). At the end of
the process the object names are tagged in the text. Furthermore, a link
between the name found in the literature and the SIMBAD name can be created.
When the above tool will be operational,
the automatic identification of an astronomical object name will help the
bibliographers who update astronomical databases (SIMBAD, NED). Their
work will
evolve to be more focussed on value-added activities such as
validation of
the names proposed by automatic tools or by the authors, and checking of the
SIMBAD/NED syntax for the object name link. In addition, the following
tasks will remain:
-
Detect cross-identifications
-
Add or improve data (coordinates, magnitudes, spectral types, redshifts, etc.)
-
Detect and create new acronyms
-
Control and detect inconsistencies.
The maintenance of the accuracy of links should not be underestimated.
The links will have to survive changes in the database, while the article
itself will by principle remain unchanged.
Astronomical Object name extraction is a complex process.
The automatic recognition has to evolve with nomenclature. The
``Dictionary of the Nomenclature of Celestial Objects'' is updated on
a regular
basis. Automatic recognition becomes still more
complicated when object names do not respect the different rules of the
nomenclature. Specifications concerning designations have been defined by
the Task Group on Astronomical Designations of IAU Commission 5, who wrote a
document giving recommendations, definitions and examples. This document is
available on-line (http://cdsweb.u-strasbg.fr/iau-spec.html
).
Astronomical object name extraction can be done by the
authors and publishers in parallel with automatic techniques, but an expert
will still have to control that information. The links between the literature
and the databases have to be maintained to ensure the correctness of the
information. New tools have to be developed. As this process is
shared between
the authors, publishers and database managers, cooperation is essential.
The relationship between various services dealing with bibliography is shown
in Figure 1, as seen from CDS. The interesting characteristic of
links through the WWW is that each participating service itself may
well be
in the middle of such a plot.
Figure 1:
Interconnectivity between bibliographic services.
|
To obtain a good interoperability between the different astronomical services,
one needs to maintain all links one has created. This is a real challenge
for database managers.
In this context, the CDS has developed the GLU system (Générateur de Liens
Uniformes). This tool generates automatically hypertext links, avoiding the
well-known drawbacks of hard-coded URLs, which are often not modified when the
target address changes or even when small modifications affect a script
generating the answers to an http request. To realize this purpose, the GLU
implements two concepts:
-
The GLU dictionary which is a compilation of symbolic names with their
corresponding URLs and the way the parameters have to be written;
-
The GLU resolver which replaces symbolic names and their associated
parameters with
relevant URLs on the fly.
Using this tool, data managers can forget the URLs and just use their
symbolic names in all their Web documents. So, the
GLU system is particularly adapted to design cooperative Web services, allowing
to generate links to other services which always remain up-to-date.
The GLU is already used, for example, for the bibliographic surfing in
astronomy between the different CDS services and ADS, in the AstroBrowse NASA
initiative, etc.
Another new application recently developed by the CDS is a visual
tool, a bibliographic map, that allows to retrieve papers
relevant to a domain.
This map is based on a neural network analysis of the keywords associated to
the articles. Documents having similar contents are clustered in the same area
of the map. By clicking on a dot of the map, one can retrieve similar
documents which are relevant to the request. A more complete description of
that tool is presented by P. Poinçot in these proceedings
(page )
(http://simbad.u-strasbg.fr/A+A/map.pl
).
Information extraction allows bibliographic and data surfing between all the
services that deal with astronomy. It provides a high added value for
reseach purposes.
Information extraction tools rely on standards shared at the
astronomical community level. The links need to be permanently resolved
requiring close cooperation between all the services (publishers,
databases, authors, ...) shown in Figure 1.
In the future, information extraction should be improved and diversified to
other type of information (magnitudes, coordinates, space missions, ...).
References:
Fernique, P., Ochsenbein, F. & Wenger, M. 1998
, CDS GLU, a
tool for managing heterogeneous distributed Web services, in
Astronomical Data Analysis Software and Systems VII, ASP
Conf. Ser., Vol. 145, R. Albrecht,
R. N. Hook & H. A. Bushouse, eds., (San Francisco: ASP), 466
Genova, F., Bartlett, J., Bonnarel, F., Dubois, P., Egret, D.,
Fernique, P., Jasniewicz, G., Lesteven, S., Ochsenbein, F. & Wenger,
M. 1998
, The CDS information hub, in Astronomical Data Analysis
Software and Systems VII:
ASP Conf. Ser., Vol. 145, R. Albrecht, R. N. Hook &
H. A. Bushouse, eds., (San Francisco: ASP), 470
Lortet, M.-C., Borde, S. & Ochsenbein, F. 1994,
Second Reference Dictionary of the Nomenclature of Celestial Objects,
A&AS, 107, 193
Ochsenbein, F. 1994, Adopted Standards for Catalogues at
CDS, Bull. Inform. CDS, 44, 19
Ochsenbein, F. 1997, Published Tabular Data, Baltic Astronomy 6, 221
Poinçot, P., Lesteven, S. & Murtagh, F. 1998, A spatial
user interface to the astronomical literature,
A&AS, 130, 183
Schmitz M. et al., 1995, NED and SIMBAD Conventions for
Bibliographic Reference Coding, in Information & On-line Data
in Astronomy, D. Egret & M. A.
Albrecht, eds., (Dordrecht: Kluwer Acad. Publ.), 259
© Copyright 1998 Astronomical Society of the Pacific, 390 Ashton Avenue, San Francisco, California 94112, USA
Next: Analysis of Evolutionary Trends in Astronomical Literature
Up: ``Robots Are Us'': Automated Information Discovery
Previous: Distributed Information Search and Retrieval
Table of Contents -- Index -- PS reprint -- PDF reprint