XML, Java, and Tools

Small presentation by Heiko Sommer for the software department meeting on April 18, 2002.
This document is available in html format at www.eso.org/~hsommer/XmlJava20020418.

 

Technologies

Java

Object-oriented language with extensive standard libraries and tools
Overview and history: java.sun.com/java2/whatis.
Getting started: developer.java.sun.com/developer/onlineTraining/new2java
Libraries API: java.sun.com/j2se/1.4/docs/api
Best practices book: Effective Java by J. Bloch.
Language specification (very detailed and yet very good): java.sun.com/docs/books/jls/second_edition/html/jTOC.doc.html

Operating system independence - Interpreted language - Virtual Machine
Java is not a scripting language; source code must be compiled to OS-independent bytecode. Bytecode can be executed directly on specialized Java CPUs, but since no common CPU supports this, we need a Java Virtual Machine (JVM) to interpret it. JVMs are OS-specific: Sun provides them for Windows, Linux, and Solaris; other OSs need to provide their own according to the JVM specification.
To run other languages on the JVM, see flp.cs.tu-berlin.de/~tolk/vmlanguages.html.
Java is more than a programming language; it is usually called a "platform". A large book that touches on everything is "Professional Java Server Programming J2EE Edition", available in my office or at Amazon.

Free and Open
Owned by Sun Microsystems (java.sun.com).
Open standardised evolution through Java Community Process (jcp.org).
Freely available from Sun: Standard/Enterprise/Micro Edition.

Easy, Fast, and Safe
Garbage collection, interface mechanism, dynamic class loading (no more worrying about DLLs!), enforced exception handling, standard libraries, ...
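Three of these features (interfaces, dynamic class loading, enforced exception handling) can be seen together in a few lines. The following is only a minimal sketch written for a current JDK; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: LanguageFeaturesSketch and createList are hypothetical names.
class LanguageFeaturesSketch {

    /**
     * Loads a List implementation by its class name at run time.
     * Falls back to ArrayList if the class cannot be loaded or instantiated.
     */
    static List<String> createList(String className) {
        try {
            // Dynamic class loading: the class is looked up by name at run time.
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            // Interface mechanism: callers only see List, not the concrete class.
            @SuppressWarnings("unchecked")
            List<String> list = (List<String>) instance;
            return list;
        } catch (ReflectiveOperationException e) {
            // Enforced exception handling: without this catch block (or a
            // "throws" clause) the compiler rejects the method.
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        List<String> list = createList("java.util.LinkedList");
        list.add("no DLLs involved");
        System.out.println(list.getClass().getName() + " " + list);
    }
}
```

The list is never freed explicitly; the garbage collector reclaims it once it is no longer reachable.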

Main usage: server-side large business applications
Initial idea: small devices in a network (like smart toasters), but never used much for this.
Later: GUIs (AWT, Swing) serving as front-end to non-Java servers, or funny applets in a webbrowser.
This has completely changed. Now Java is most popular for "serious" server applications with a web-GUI, although the rich Java GUI might be revived thanks to automated deployment with Java Web Start.

Performance
Using just-in-time compilation of Java bytecode to native machine code, modern JVMs match the runtime performance of non-optimizing C++ compilers, although startup is somewhat slower. Benchmark comparisons for typical scientific computations are given in a paper presented at the Java Grande - ISCOPE 2001 Conference, suggesting an average performance loss of about 25% compared to C and Fortran compilers.
For many non-realtime situations, what matters more than execution speed in one thread/process is good scalability, one of Java's strong points.
The site www.javagrande.org has information on Java for high-performance computing.


XML

Extensible Markup Language: a universal format for structured documents and data. XML comes with a whole family of related standards and tools.
See www.w3.org/XML, e.g. the XML-in-10-points overview or the spec.
Another good source is xml.com.

An example of some xml data:

<?xml version="1.0" encoding="UTF-8"?>
<ObsProposal>
  <ObsProposalEntity entityId="4711" entityVersion="1"/>
  <ObsProjectRef entityId="999999999" entityTypeName="ObsProject"/>
  <SciJustification>basic research can hardly ever be justified.</SciJustification>
  <PerformanceGoals>give me five black holes for one crab nebula.</PerformanceGoals>
</ObsProposal>

XML is meant for hierarchical data (tree), although there is a mechanism to specify cross-links through identifiers (graph).
It is text-based with variable record lengths, which means it must be parsed sequentially from the beginning to the point of interest.

Meta Data Definition

Definition of your own XML language (structure, content, and semantics of the XML documents that your application recognizes).
The most widely used schema languages are DTD and XML Schema.
See www.brics.dk/~amoeller/XML/schemas/.

DTD (Document Type Definition)
Built-in schema language, see www.w3.org/TR/2000/REC-xml-20001006#elemdecls.
Subset of SGML-DTD.
Rather particular syntax, which is not itself XML.

XML Schema
For a good introduction, see the W3 schema primer.
Best practices are discussed at www.xfront.com/BestPracticesHomepage.html.

Advantages over DTD: Schema is written in xml (easy to parse!), data is strongly typed, inheritance possible, null representation, ... (see www-106.ibm.com/developerworks/library/x-sbsch.html).


<?xml version = "1.0" encoding = "UTF-8"?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" version = "1">

	<xsd:simpleType name = "EntityIdT">
		<xsd:restriction base = "xsd:integer">
			<xsd:totalDigits value = "9"/>
			<xsd:fractionDigits value = "0"/>
			<xsd:whiteSpace value = "collapse"/>
		</xsd:restriction>
	</xsd:simpleType>

	<xsd:element name = "ObsProposal">
		<xsd:complexType>
			<xsd:sequence>
				<xsd:element ref = "ObsProposalEntity"/>
				<xsd:element name = "ObsProjectRef" type = "ObsProjectRefT"/>
				<xsd:element name = "SciJustification" type = "xsd:string"/>
				<xsd:element name = "PerformanceGoals" type = "xsd:string"/>
			</xsd:sequence>
		</xsd:complexType>
	</xsd:element>

</xsd:schema>

Parsing

Since xml data should be processed by computers rather than humans equipped with emacs, it's important to have powerful and standardized parsing mechanisms. Depending on required speed, resource consumption, and generality of the application, there are different parsing models in use.
Parsers can validate a document against a schema, or merely check that it is well-formed XML.
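The difference between the two kinds of checking can be sketched with the JAXP classes of a current JDK (javax.xml.parsers and javax.xml.validation). The mini schema below is an assumption made up for this example: a single EntityId element holding an integer of at most 9 digits.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative only: class name and schema are hypothetical.
class ValidationSketch {

    // Hypothetical schema: one element whose content is an integer with at most 9 digits.
    static final String SCHEMA =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='EntityId'>"
      + "<xsd:simpleType><xsd:restriction base='xsd:integer'>"
      + "<xsd:totalDigits value='9'/>"
      + "</xsd:restriction></xsd:simpleType>"
      + "</xsd:element></xsd:schema>";

    /** True if the text obeys the XML grammar; no schema is consulted. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    /** True if the text is well-formed and also conforms to SCHEMA. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
            // With no error handler set, the validator throws on the first violation.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

With this sketch, `<EntityId>abc</EntityId>` is well-formed but not valid, while `<EntityId>4711` is neither.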

DOM (Document Object Model)
W3C standard API for a generic object model that can represent any xml document as an in-memory tree which the application can navigate and manipulate.
Language-neutral by design and therefore somewhat awkward in Java; see JDOM for a Java-specific alternative.
Resource intensive, since the parser instantiates the entire document at once.
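As a rough illustration, the ObsProposal example from above can be loaded into a DOM tree, navigated, and modified with the JAXP classes of a current JDK. The class and method names are made up for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative only: DomSketch and its methods are hypothetical names.
class DomSketch {

    static final String PROPOSAL = "<ObsProposal>"
      + "<ObsProposalEntity entityId=\"4711\" entityVersion=\"1\"/>"
      + "<ObsProjectRef entityId=\"999999999\" entityTypeName=\"ObsProject\"/>"
      + "<SciJustification>basic research can hardly ever be justified.</SciJustification>"
      + "<PerformanceGoals>give me five black holes for one crab nebula.</PerformanceGoals>"
      + "</ObsProposal>";

    /** Parses the whole document into an in-memory tree. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Navigates to the first element with the given tag name. */
    static Element first(Document doc, String tagName) {
        return (Element) doc.getElementsByTagName(tagName).item(0);
    }

    public static void main(String[] args) {
        Document doc = parse(PROPOSAL);
        Element entity = first(doc, "ObsProposalEntity");
        System.out.println("id = " + entity.getAttribute("entityId"));
        // The tree can be manipulated in place.
        entity.setAttribute("entityVersion", "2");
        System.out.println("version = " + entity.getAttribute("entityVersion"));
        System.out.println(first(doc, "SciJustification").getTextContent());
    }
}
```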

SAX (Simple API for XML)
API for callback parsers, implemented by most xml parsers. Fast and frugal.
Callback handlers get rather complex though when many levels of hierarchy are involved.
Not a W3C standard, yet the de facto standard; see www.saxproject.org. For a comparison with DOM, see www.w3.org/DOM/faq#SAXandDOM.
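A minimal sketch of a callback handler, using the SAX classes shipped with a current JDK. It records the path of every element it sees; the class names are made up for this example. Note that the handler has to keep track of the nesting itself, which is where the complexity comes from once the hierarchy gets deep.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: SaxSketch and PathHandler are hypothetical names.
class SaxSketch {

    /** Collects the path (e.g. "/a/c/d") of every element start tag. */
    static class PathHandler extends DefaultHandler {
        final List<String> paths = new ArrayList<>();
        // SAX gives no tree, so the handler tracks the open elements itself.
        private final Deque<String> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            open.addLast(qName);
            paths.add("/" + String.join("/", open));
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.removeLast();
        }
    }

    /** Runs the parser over the text; the parser calls back into the handler. */
    static List<String> elementPaths(String xml) {
        PathHandler handler = new PathHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.paths;
    }

    public static void main(String[] args) {
        System.out.println(elementPaths("<a><b/><c><d/></c></a>"));
    }
}
```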

Pull-APIs
Supposedly easier to use than event-driven callback parsers: see e.g. xmlpull.org (API), Kxml.org (parser implementation).
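To give an idea of the pull style, here is a minimal sketch. It does not use the xmlpull.org API but the pull parser that ships with current JDKs (StAX, javax.xml.stream), as a stand-in for the same idea: the application asks for the next event instead of being called back. The class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: PullSketch and firstText are hypothetical names.
class PullSketch {

    /**
     * Returns the text of the first element with the given name, or null if
     * there is none. Assumes that element contains text only.
     */
    static String firstText(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                // The application drives the parser and can stop at any time.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(elementName)) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not read XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ObsProposal><SciJustification>basic research</SciJustification>"
                   + "<PerformanceGoals>five black holes</PerformanceGoals></ObsProposal>";
        System.out.println(firstText(xml, "PerformanceGoals"));
    }
}
```

Because the loop returns as soon as the element is found, the rest of the document is never read.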

XML binding
Specialized classes are generated from a DTD or schema. They parse/validate and write xml data that conforms to that schema.
See www.rpbourret.com/xml/XMLDataBinding.htm and the section on Castor below.

 

Tools

Eclipse

Free integrated development environment (IDE) from IBM (see www.eclipse.org and their FAQ).
Available for Windows and Linux (mostly written in Java, but with OS-dependent SWT GUI).
IBM will use Eclipse as the basis for the new "WebSphere Studio Application Developer", which is the successor of Visual Age.

Actually, Eclipse is not really an IDE, but an open platform for building IDEs, which happens to come with plug-ins for Java and C++.
Anybody may contribute extensions, which a number of big companies plan to do.

Eclipse keeps an internal tree representation of any Java files in the project, which allows for the very powerful smart search and code generation features.


Since the download is large (49 MB), I put the latest stable Windows build here (2002-4-16). Have fun playing with it!

 

Castor

Ambitious open-source project, provides XML-Java binding (www.castor.org/xml-framework.html) among other things.

Use in ALMA

Data gets passed among subsystems and to the database in XML format. Since many different groups have to agree on the data definition, we happily make use of the advanced capabilities of XML Schema. Each so-called "entity object" is defined in its own schema file.

Castor reads in the schema files and generates Java classes with accessor/manipulator methods for all data fields. The instances can be conveniently worked on without knowing much about XML. Data is validated and transformed to its XML representation by the marshal() method; likewise, unmarshal() ingests textual XML into the Java object.
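To illustrate the idea of such a binding class, here is a small hand-written stand-in. It is not Castor output and does not use the Castor library: the class name, the two fields, and the very simple validation rule are assumptions made for this sketch, and the XML handling relies only on the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hand-written stand-in for a generated binding class; NOT actual Castor output.
class ObsProposalSketch {
    private String sciJustification;
    private String performanceGoals;

    String getSciJustification() { return sciJustification; }
    void setSciJustification(String text) { sciJustification = text; }
    String getPerformanceGoals() { return performanceGoals; }
    void setPerformanceGoals(String text) { performanceGoals = text; }

    /** Validates the fields (here: both must be set) and writes them as XML. */
    void marshal(Writer out) throws IOException {
        if (sciJustification == null || performanceGoals == null) {
            throw new IllegalStateException("both fields are required");
        }
        out.write("<ObsProposal>");
        out.write("<SciJustification>" + escape(sciJustification) + "</SciJustification>");
        out.write("<PerformanceGoals>" + escape(performanceGoals) + "</PerformanceGoals>");
        out.write("</ObsProposal>");
    }

    /** Reads textual XML and fills a new object from it. */
    static ObsProposalSketch unmarshal(Reader in) throws IOException {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(in));
            ObsProposalSketch proposal = new ObsProposalSketch();
            proposal.setSciJustification(text(doc, "SciJustification"));
            proposal.setPerformanceGoals(text(doc, "PerformanceGoals"));
            return proposal;
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalArgumentException("not a readable ObsProposal", e);
        }
    }

    // Convenience wrappers for working with strings.
    String toXml() {
        try {
            StringWriter out = new StringWriter();
            marshal(out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static ObsProposalSketch fromXml(String xml) {
        try {
            return unmarshal(new StringReader(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String text(Document doc, String tagName) {
        Node node = doc.getElementsByTagName(tagName).item(0);
        if (node == null) {
            throw new IllegalArgumentException("missing element " + tagName);
        }
        return node.getTextContent();
    }

    // Escapes the characters that must not appear literally in element text.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Application code then only deals with getters and setters, for example `ObsProposalSketch.fromXml(xml).getPerformanceGoals()`, and never touches a parser directly.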