The Astronomical Data Warehouse

Clive Page
University of Leicester
cgp@star.le.ac.uk
VO2002 - Garching - 2002 June 14

What is a Data Warehouse?

In commerce a data warehouse contains a copy of transactional data specifically structured for querying, reporting, and data mining.
Static – to get repeatability
All datasets in same (or compatible) format
Astronomers:
Mostly don’t have transactional data (except in pipelines)
Do have data widely distributed on the network
Do have data in many different DBMS, different formats

User accesses VO Portal (of which there will be many)
VO Portal knows about at least one Resource Registry.
Resource Registry knows about all (participating) astronomical data resources in the world, and can tell the VO Portal which sites are likely to have data which help in the query.
User (assisted by VO Portal) contacts each site and retrieves data (simplified view).
But – anything which involves scanning an entire large dataset serially (or any substantial port of one) can barely be done by present systems.

Observational queries – requests for data from specific observation (may be raw or processed data).
Positional queries – requests based on a named object or a position in the sky, and the small area around it.
Non-positional queries – all other sorts.

For example:
- get me XMM-Newton data from the observation of NGC9876.
- Are the HST data from HD123456 public yet?
Can be handled, as now, by direct connection between user and observatory archive
No need for Data Warehouse, Registry barely needed.
VO can add: infrastructure for authentication and authorization which distinguishes between privileged and public users when allocating resources.

Request data (source-list or pixels) from named object or small patch of sky
For example:
- Are there any X-ray sources near HD123456?
- I’d like an infra-red image centred on PSR1234-456
Needs Resource Registry to find appropriate resources
Needs intelligent Portal to combine results from several sources of information.
Positional queries are predominant right now, partly because current systems are set up to support them (and not much else).

Essentially everything else, just a few examples:
Cross-matching different catalogues (basis of many types of investigation)
Statistical investigations
- Regression analysis to find correlations
- Finding rare objects as outliers from distributions, etc
Clustering algorithms
Time-series analysis: finding periodicities, outbursts, etc.
Similarity and difference searches: finding objects which have moved, or changes in brightness.
Measuring large-scale structure, e.g. Fourier analysis

Most operations cpu-intensive, I/O-intensive, or both.
Some operations (e.g. cross-matching, statistics) can be done with relational DBMS.
But even fairly simple operations (fuzzy joins between source catalogues) are almost infeasible across networks.
Some operations can be done with existing astronomical applications (AIPS, FTOOLS, IRAF, MIDAS, Starlink…)
Others will require new algorithms and software to be developed.

Provide cpu power – most algorithms parallelize well, so cheap (Beowulf) clusters may be suitable.
Provide disc space
- Data mining mostly involves source lists not pixels, sizes in GB range at present.
- Disc space no longer very expensive.
Use authentication/authorization mechanisms (e.g. from Grid Computing) to ensure scarce resources used sensibly.
Provide copies of most popular datasets and software
Disc space for users to store other datasets, software, etc.

Many archive sites already store data from two or more observatories:
- CADC, CASU, CalTech, CDS, ESO, HEASARC, LEDAS, SAO, STScI, Vilspa, WFAU …
- Systems such as Astrobrowse, Aladin, VizieR, MAST explicitly designed to provide multi-mission support.
Many popular datasets and collections have already been replicated at many sites,
- VizieR collection now at 8 sites.
- Many other datasets with at least two copies on the web: Tycho, Hipparcos, GSC, USNO-A2, 2MASS…

Working on authentication and authorization protocols.
Developing the “MySpace” concept – management of user’s own space on the Astronomical Data Warehouse.
Devised faster algorithm for fuzzy joins on source-catalogues using RDBMS (further details of PCODE idea, which can use pixellation such as HEALPix or HTM, can be seen on http://www.star.le.ac.uk/~cgp/skyindex.html).
Starting evaluation of DBMS for astronomy, especially for fuzzy join and statistical queries.

PostgreSQL – Open Source, object-relational, R-tree indexing, good reputation.
MySQL – Open Source, simple, fast, already used by LEDAS and other archive sites.
Oracle – market leader, full-featured, support for parallelism, spatial data option, (but expensive).
DB2 – mature, full-featured, good support for XML and open standards.
SQL Server – said to be most user-friendly, used by JHU for SLOAN data - SkyServer, (but Windows only).
Sybase – used by large number of astronomical archives.
Aim to complete initial evaluations in August.

Database Task Force (UK e-science project): developing database interfaces for Open Grid Services Architecture: beta releases over this summer.
Alex Szalay’s group at JHU: developing SkyQuery to do distributed queries across the Internet.

Can large datasets even be copied across the Internet?
- Bandwidth limitations
- Original curators may object
How to decide where to do the data mining when two (or more) large datasets are needed?
- Find data warehouse holding largest, can the others be copied to it?
- If not, does any accessible warehouse have cpu power and disc space to cope with them all?
Data formats, protocols, standards, …
How can people use their own code on the Warehouse?