XML

XML stands for eXtensible Markup Language. After spending some time examining recent publications in the field, one might be forgiven for believing that it is the greatest thing ever to have happened to computer science (Harold, 2003), yet it is not without its issues (see Vaughan-Nichols, 2003). For all of the accolades heaped on it for its uses in computer science, it has its roots in SGML (Standard Generalized Markup Language), a language developed in the world of paper publishing.

Origins and Development of XML

Charles Goldfarb, the creator of SGML, saw its development as owing to the work of a number of individuals active in the late 1960s whose main concern was the presentation of printed material (Goldfarb, 1990). A number of people proposed separating the data contained in documents from the formatting of those documents; the formatting would then be described in specific ways. In 1969 Goldfarb and his colleagues at IBM took these ideas and created GML in order to allow text-editing, formatting and information retrieval subsystems to share documents in an integrated law office information system.

Goldfarb continued to work with markup languages (Goldfarb & Prescod, 2001) and designed SGML to have:

Common data representation allowing different hardware/software combinations to read and write the same document
Extensibility allowing the markup language to have the flexibility to be able to work with any of the myriad different types of document
Rules allowing for the creation of a formal description of documents which share a common type

While Goldfarb started work on SGML almost immediately after his work on GML, it wasn't until 1974 that SGML was properly proven, when Goldfarb showed that software could check the validity of a document against its document type definition. SGML was ratified as a standard in 1986, though it had been in use in industry for some time before this.

Tim Berners-Lee used SGML as a basis for the development of HTML (HyperText Markup Language) while he worked at CERN, where he was looking for a way for scientists working on diverse hardware platforms to share data in a universally accessible format over the infant Internet. HTML is an application of SGML, but it shares only the first feature of SGML, in that it allows the same documents to be read on different systems. It is not formally extensible and it does not enforce specific rules. To some extent it might be said that SGML is the father of both HTML and XML, but while HTML is a single markup language defined using SGML, XML is a simplified subset of SGML itself (see figure 1).

While not formally extensible, HTML has been extended by browser creators through proprietary extensions. Generally these extensions have little to do with the focus of this dissertation, as they are associated with multimedia applications rather than the development of ontologies. However, a team at the University of Maryland's department of computer science did create Simple HTML Ontology Extensions (SHOE) in the late 1990s. It was described as a "superset of HTML which adds the tags necessary to embed arbitrary semantic data into web pages" (Heflin et al., 2000), and it approaches the functionality available in modern ontology languages.

While SHOE is no longer actively maintained, the work carried out by the team at the University of Maryland has played a part in the development of other technologies. Indeed, most of the team members seem to be currently involved in these subsequent efforts, despite their initial misgivings about using XML as the basis for such an approach. SHOE seems reasonably straightforward in its approach and it offers some benefits over XML: it builds on a language that is already popular, so developers don't need to learn a whole new paradigm of information representation. Indeed Heflin, with his colleague Hendler, argued persuasively against turning to XML as a basis for future development in an address given to the 2000 Extreme Markup Languages Conference (Heflin & Hendler, 2000). They pointed out that XML is not suited to the aims of semantic interoperability precisely because it is too extensible and offers too many possibilities for developers to create their own approaches without concern for whether those approaches will suit the needs of others. This argument was effectively quashed when the W3C started work on standardising XML-based ontology languages. HTML is now redundant in terms of the description of ontologies, and this position is supported by the Semantic Web Community Portal (2003), which notes that "for applications on the web it is important to have a language with a standardized syntax. Because XML emerges to be the standard language for data interchange on the web, it is desirable to also exchange ontologies using an XML syntax, thus simplifying the task of writing parsers".

What is XML?

An exhaustive review of XML is beyond the remit of this work, but there are many resources, both online and in print, for interested parties. The author particularly recommends the comp.text.xml newsgroup, available in the first instance via Google Groups. We shall now, however, examine the basics of the language, as it forms the foundation of most, if not all, approaches to ontologies and the semantic web in general.

Documents written using XML are used for the storage or exchange of data or information. At first appearance the language is similar to HTML. XML documents, at their simplest, have elements, attributes and content, as shown in figure 2, and these are contained within a single root element. There are many other rules associated with well-formed XML, but for the present that should suffice. There is generally a hierarchy of element tags nested within the root element of an XML document; these elements can be qualified with attributes and can enclose text or other types of content. There can be confusion over whether to use elements or attributes within XML documents, and this question is resolved to some extent by the guidelines published by Daconta (2001). For instance, it is possible to document an address in either of the following two ways:

      
          <?xml version="1.0"?>
          <ADDRESS-BOOK>
            <ADDRESS>
              <STREET> 424 Any St. </STREET>
              <CITY> Bealeton </CITY>
              <STATE> VA </STATE>
              <ZIP> 22712 </ZIP>
            </ADDRESS>
          </ADDRESS-BOOK>

          <ADDRESS-BOOK>
            <ADDRESS street = "424 Any St."
              city = "Bealeton"
              state = "VA"
              zip = "22712"/>
          </ADDRESS-BOOK>

The first example could be described as element-centric, whereas the second could be described as attribute-centric. The proper way is dictated, as with most things associated with XML, by the final application for the data.
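As an illustration of how the choice affects the consuming application, the following minimal sketch reads both styles into the same record structure. It assumes Python's standard xml.etree.ElementTree module; the function name read_addresses and the FIELDS tuple are hypothetical names introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# The two address-book documents from the text, one per style.
ELEMENT_CENTRIC = """<ADDRESS-BOOK>
  <ADDRESS>
    <STREET> 424 Any St. </STREET>
    <CITY> Bealeton </CITY>
    <STATE> VA </STATE>
    <ZIP> 22712 </ZIP>
  </ADDRESS>
</ADDRESS-BOOK>"""

ATTRIBUTE_CENTRIC = """<ADDRESS-BOOK>
  <ADDRESS street = "424 Any St."
    city = "Bealeton"
    state = "VA"
    zip = "22712"/>
</ADDRESS-BOOK>"""

# Hypothetical list of the fields an application might care about.
FIELDS = ("street", "city", "state", "zip")


def read_addresses(document):
    """Return each ADDRESS as a dict, whichever style the document uses."""
    root = ET.fromstring(document)
    addresses = []
    for address in root.findall("ADDRESS"):
        record = {}
        for field in FIELDS:
            child = address.find(field.upper())
            if child is not None:
                # Element-centric: the value is the child's text content.
                record[field] = (child.text or "").strip()
            elif field in address.attrib:
                # Attribute-centric: the value is an attribute of ADDRESS.
                record[field] = address.attrib[field]
        addresses.append(record)
    return addresses


if __name__ == "__main__":
    print(read_addresses(ELEMENT_CENTRIC))
    print(read_addresses(ATTRIBUTE_CENTRIC))
```

Both calls yield the same record, which suggests that the two styles carry the same information here and differ mainly in how the application must retrieve it.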

According to the initial specification of XML (1.0), a document must conform to a document type definition (DTD) in order to be considered valid. About the best description of a DTD which the author has found is:

"A DTD contains markup declarations in two subsets, the internal subset and the external subset. Taken together, these define the allowed structure of a class of XML documents.

"Typically, a group of companies or other organisations with a common interest in a particular type of information will agree (after sometimes lengthy discussion) to automate important processes. That will generate a list of documents that they want to exchange and, in turn, agree on a common structure to be used for the exchange of certain types of business or technical information. The formal expression of the agreed-upon structure is a DTD or other type of schema." (Watt, 2003, p43)

All XML (1.0) documents must conform to a DTD in order to be valid XML, though more recently developers have chafed at the perceived limitations of DTDs, and a DTD can now be replaced with, or used in conjunction with, an XML Schema (XSD). XML Schemas use XML to describe XML documents and according to Webb (2001) "schemas are a richer and more powerful [way] of describing information than what is possible with DTDs". An XML document can be both well-formed and valid, or merely well-formed if it doesn't conform to either a DTD or a Schema. The rationale behind the use of DTDs can initially appear confusing, but it must be remembered that XML is primarily a language for exchanging data between concerned bodies, and as such it is important that the data should correspond to an agreed-upon format.
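Well-formedness can be checked without any DTD or Schema, since it concerns only the syntax of the document itself. The following is a minimal sketch of such a check, assuming Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the two sample strings are hypothetical. Note that this parser does not validate, so checking a document against a DTD or Schema would require a separate validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the document parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Raised for syntax problems such as mismatched or unclosed tags.
        return False
    return True


# Hypothetical samples: the second closes the root with the wrong name.
MATCHING_TAGS = "<ADDRESS-BOOK><ADDRESS/></ADDRESS-BOOK>"
MISMATCHED_TAGS = "<ADDRESS-BOOK><ADDRESS/></ADDRESSBOOK>"

if __name__ == "__main__":
    print(is_well_formed(MATCHING_TAGS))
    print(is_well_formed(MISMATCHED_TAGS))
```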

DTDs are not written in XML but rather in an extended Backus-Naur Form (BNF, originally Backus Normal Form). BNF is a "metasyntax used to express context-free grammars: that is, a formal way to describe formal languages" (Wikipedia, 2003).
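To give a sense of what DTD declarations look like, the sketch below embeds a small DTD in the internal subset of an address-book document and lists the element types it declares. The DTD is a hypothetical one written for this illustration and is not taken from any of the cited sources. The code assumes Python's standard xml.parsers.expat module, which reads the declarations but does not validate the document against them; the function name declared_elements is likewise hypothetical.

```python
import xml.parsers.expat

# Hypothetical DTD for the element-centric address book, placed in the
# internal subset so that the whole example is a single document.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE ADDRESS-BOOK [
  <!ELEMENT ADDRESS-BOOK (ADDRESS+)>
  <!ELEMENT ADDRESS (STREET, CITY, STATE, ZIP)>
  <!ELEMENT STREET (#PCDATA)>
  <!ELEMENT CITY (#PCDATA)>
  <!ELEMENT STATE (#PCDATA)>
  <!ELEMENT ZIP (#PCDATA)>
]>
<ADDRESS-BOOK>
  <ADDRESS>
    <STREET> 424 Any St. </STREET>
    <CITY> Bealeton </CITY>
    <STATE> VA </STATE>
    <ZIP> 22712 </ZIP>
  </ADDRESS>
</ADDRESS-BOOK>"""


def declared_elements(document):
    """Return the element type names declared in the document's DTD."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration; the content model
    # argument is ignored here, only the declared name is recorded.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(document, True)
    return names


if __name__ == "__main__":
    print(declared_elements(DOCUMENT))
```

The declarations use their own syntax rather than XML elements, which is one of the differences from XML Schemas noted in the text.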

Now that the development of XML and some of its associated technologies has been examined, we can turn to the development of ontology description languages based upon the common syntax offered by XML.