The Astronomical Data Warehouse
Clive Page
University of Leicester
cgp@star.le.ac.uk
VO2002 - Garching - 2002 June 14
What is a Data Warehouse?
- In commerce a data warehouse contains a copy of transactional data specifically structured for querying, reporting, and data mining.
- Static – to get repeatability
- All datasets in same (or compatible) format
- Astronomers:
- Mostly don’t have transactional data (except in pipelines)
- Do have data widely distributed on the network
- Do have data in many different DBMS, different formats
VO Architecture in Action
- User accesses VO Portal (of which there will be many)
- VO Portal knows about at least one Resource Registry.
- Resource Registry knows about all (participating) astronomical data resources in the world, and can tell the VO Portal which sites are likely to have data which help in the query.
- User (assisted by VO Portal) contacts each site and retrieves data (simplified view).
- But – anything which involves scanning an entire large dataset serially (or any substantial port of one) can barely be done by present systems.
Types of Query
- Observational queries – requests for data from specific observation (may be raw or processed data).
- Positional queries – requests based on a named object or a position in the sky, and the small area around it.
- Non-positional queries – all other sorts.
Observational Queries
- For example:
- get me XMM-Newton data from the observation of NGC9876.
- Are the HST data from HD123456 public yet?
- Can be handled, as now, by direct connection between user and observatory archive
- No need for Data Warehouse, Registry barely needed.
- VO can add: infrastructure for authentication and authorization which distinguishes between privileged and public users when allocating resources.
Positional Queries
- Request data (source-list or pixels) from named object or small patch of sky
- For example:
- Are there any X-ray sources near HD123456?
- I’d like an infra-red image centred on PSR1234-456
- Needs Resource Registry to find appropriate resources
- Needs intelligent Portal to combine results from several sources of information.
- Positional queries are predominant right now, partly because current systems are set up to support them (and not much else).
Non-positional Queries
- Essentially everything else, just a few examples:
- Cross-matching different catalogues (basis of many types of investigation)
- Statistical investigations
- Regression analysis to find correlations
- Finding rare objects as outliers from distributions, etc
- Clustering algorithms
- Time-series analysis: finding periodicities, outbursts, etc.
- Similarity and difference searches: finding objects which have moved, or changes in brightness.
- Measuring large-scale structure, e.g. Fourier analysis
Non-positional Query Characteristics
- Most operations cpu-intensive, I/O-intensive, or both.
- Some operations (e.g. cross-matching, statistics) can be done with relational DBMS.
- But even fairly simple operations (fuzzy joins between source catalogues) are almost infeasible across networks.
- Some operations can be done with existing astronomical applications (AIPS, FTOOLS, IRAF, MIDAS, Starlink…)
- Others will require new algorithms and software to be developed.
Data Warehouse Details
- Provide cpu power – most algorithms parallelize well, so cheap (Beowulf) clusters may be suitable.
- Provide disc space
- Data mining mostly involves source lists not pixels, sizes in GB range at present.
- Disc space no longer very expensive.
- Use authentication/authorization mechanisms (e.g. from Grid Computing) to ensure scarce resources used sensibly.
- Provide copies of most popular datasets and software
- Disc space for users to store other datasets, software, etc.
Data Warehouses in Formation?
- Many archive sites already store data from two or more observatories:
- CADC, CASU, CalTech, CDS, ESO, HEASARC, LEDAS, SAO, STScI, Vilspa, WFAU …
- Systems such as Astrobrowse, Aladin, VizieR, MAST explicitly designed to provide multi-mission support.
- Many popular datasets and collections have already been replicated at many sites,
- VizieR collection now at 8 sites.
- Many other datasets with at least two copies on the web: Tycho, Hipparcos, GSC, USNO-A2, 2MASS…
Related Work: AstroGrid
- Working on authentication and authorization protocols.
- Developing the “MySpace” concept – management of user’s own space on the Astronomical Data Warehouse.
- Devised faster algorithm for fuzzy joins on source-catalogues using RDBMS (further details of PCODE idea, which can use pixellation such as HEALPix or HTM, can be seen on http://www.star.le.ac.uk/~cgp/skyindex.html).
- Starting evaluation of DBMS for astronomy, especially for fuzzy join and statistical queries.
DBMS under Evaluation
- PostgreSQL – Open Source, object-relational, R-tree indexing, good reputation.
- MySQL – Open Source, simple, fast, already used by LEDAS and other archive sites.
- Oracle – market leader, full-featured, support for parallelism, spatial data option, (but expensive).
- DB2 – mature, full-featured, good support for XML and open standards.
- SQL Server – said to be most user-friendly, used by JHU for SLOAN data - SkyServer, (but Windows only).
- Sybase – used by large number of astronomical archives.
- Aim to complete initial evaluations in August.
Other Related Work
- Database Task Force (UK e-science project): developing database interfaces for Open Grid Services Architecture: beta releases over this summer.
- Alex Szalay’s group at JHU: developing SkyQuery to do distributed queries across the Internet.
Problem Areas
- Can large datasets even be copied across the Internet?
- Bandwidth limitations
- Original curators may object
- How to decide where to do the data mining when two (or more) large datasets are needed?
- Find data warehouse holding largest, can the others be copied to it?
- If not, does any accessible warehouse have cpu power and disc space to cope with them all?
- Data formats, protocols, standards, …
- How can people use their own code on the Warehouse?