Authors: Francesco Furfari, Francesco Potortì
Date: 2007
License: GNU LGPL
Source: XML schema (official repository)
CostGlue is a storage tool for large to huge quantities of data derived from simulation runs and measurements. Data stored in CostGlue is described via metadata in XML format.
In recent years increasing attention has been devoted to metadata in every
application domain. The XML Schema language provides a means for defining the
structure, contents and semantics of an XML document, and it is widely used to
collect data about data, that is, metadata.
The HDF5 data format allows metadata to be associated with every object by
using a series of predefined attributes in the form of name=value pairs. This
mechanism is too simple for our requirements. Consequently, in order to insert
metadata, we defined an XML Schema whose document instances can be saved
together with the simulation data, so as to be part of the logical data
structure of the CostGlue HDF5 archive.
Metadata can be associated with every Data Group; metadata referring to the archive as a whole are saved together with the indexing table, while metadata for a single simulation run are saved in the related Data Group. Metadata can refer to any kind of additional data, which is referenced as post-processing objects in the logical data structure. Examples of post-processing objects are statistics on the raw data, charts, images, and any type of data that is produced from or relevant to the raw data.
The CostGlue metadata XML Schema is inspired by the Reference Model for an Open Archival Information System (OAIS) and uses parts of the CCLRC Scientific Metadata Model (CSMD). OAIS is a technical recommendation for the permanent, or indefinitely long-term, preservation of digital information. The objective of the CSMD model is to aid interoperability of scientific information systems among research organizations. The adoption of a common XML schema could facilitate further aggregation of telecommunication archives into catalogues to be published in repositories for the scientific community, such as CRAWDAD.
Three main elements are present, one for the metadata relative to the root (Study), one for those relative to a Data Group (DataGroup), and one that puts them together (Archive).
The metadata relative to the root are stored in the root of the archive together with the index table. The Institution, Person, Information and Notes elements are those defined by CSMD. The Study element and its embedded Investigation element reflect the taxonomy used in the CSMD model, where instances of an Investigation can be Experiment, Measurement or Simulation.
The Investigation element is thus central to the CostGlue schema. It is a close match to the one defined in CSMD, with a custom DataHolding element that describes the data structure used in CostGlue. A notable addition is the PostProcessingObject element, whose purpose is to allow arbitrary data to be embedded in the archive, such as graphs or statistical characterisations of data relative to the whole archive.
The metadata relative to each Data Group are stored in the Data Group together with the Data Table. One notable element is Software, extended from the CSMD one, which additionally allows the whole program source, the patches used and the input file to be included. These are particularly relevant for a simulation or measurement environment where the simulator or the measurement instruments are heavily or totally software-based. PostProcessingObject elements can also be added to Data Groups.
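To make the relationship between the three main elements concrete, the following minimal Python sketch builds a skeleton document with the standard library. Only the element names (Archive, Study, DataGroup, Institution, Person, Investigation, DataHolding, PostProcessingObject, Software) come from the description above. The exact nesting, the `type`, `name` and `ref` attributes, the `Name` child of Software and all values are assumptions made for illustration, not taken from the schema itself.

```python
import xml.etree.ElementTree as ET


def build_archive_skeleton():
    """Build an illustrative Archive document.

    Nesting, attribute names and values are assumed for illustration only.
    """
    archive = ET.Element("Archive")

    # Root-level metadata: the Study and its embedded Investigation.
    study = ET.SubElement(archive, "Study")
    ET.SubElement(study, "Institution").text = "Example Institute"
    ET.SubElement(study, "Person").text = "A. Researcher"
    investigation = ET.SubElement(study, "Investigation", {"type": "Simulation"})
    ET.SubElement(investigation, "DataHolding")
    # Hypothetical reference to an opaque object relevant to the whole archive.
    ET.SubElement(investigation, "PostProcessingObject",
                  {"ref": "/postproc/summary_chart"})

    # Metadata for a single Data Group (one simulation run).
    group = ET.SubElement(archive, "DataGroup", {"name": "run_0001"})
    software = ET.SubElement(group, "Software")
    ET.SubElement(software, "Name").text = "example-simulator"
    ET.SubElement(group, "PostProcessingObject",
                  {"ref": "/run_0001/postproc/delay_histogram"})
    return archive


skeleton = build_archive_skeleton()
print(ET.tostring(skeleton, encoding="unicode"))
```

A real document would be validated against the CostGlue schema; this sketch only shows how Study and DataGroup metadata might sit side by side under Archive.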
Most of the elements in the schema are optional. This choice is motivated by the need not to impose an excessive burden on the experimenter. The drawback is that vastly incomplete metadata become possible. However, if the objective is cooperation between experimenters, it is easy to envision a CostGlue module that certifies variable degrees of metadata completeness, so that repositories can accept only archives that comply with a certain degree of completeness.
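One way such a completeness check could work is sketched below. This is not an existing CostGlue module: the checklist of element paths, the scoring rule (fraction of checklist paths present) and the function names are all hypothetical, chosen only to illustrate the idea of a repository setting an acceptance threshold.

```python
import xml.etree.ElementTree as ET

# Hypothetical checklist: paths, relative to Archive, that a repository
# might want to see filled in. The real schema may nest elements differently.
CHECKLIST = [
    "Study/Institution",
    "Study/Person",
    "Study/Investigation",
    "Study/Investigation/DataHolding",
    "DataGroup/Software",
]


def completeness(archive, checklist=CHECKLIST):
    """Return the fraction of checklist paths that occur at least once."""
    present = sum(1 for path in checklist if archive.find(path) is not None)
    return present / len(checklist)


def is_acceptable(archive, threshold):
    """True if the document reaches the completeness a repository asks for."""
    return completeness(archive) >= threshold


# A sparse but still well-formed document: only two checklist paths are present.
sparse = ET.fromstring(
    "<Archive><Study><Institution>Example Institute</Institution>"
    "<Investigation/></Study><DataGroup/></Archive>"
)
print(completeness(sparse), is_acceptable(sparse, 0.8))
```

A weighted checklist, or separate thresholds for Study and DataGroup metadata, would be equally plausible designs.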
Some information in the schema requires verification during ingestion of new data, and some can be generated automatically. One case is redundant information, such as the DataGroup/Parameters element, which is a copy of the Index row that points to the Data Group. Another case is an element that refers to another object, such as PostProcessingObject, which is a reference to an opaque object stored in the archive.
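The redundancy between DataGroup/Parameters and the Index row can be checked, or regenerated, at ingestion time. The sketch below assumes that Parameters holds `Parameter` children with a `name` attribute and a text value, and that the Index row is available as a Python dictionary; both are illustrative assumptions, as are the function names.

```python
import xml.etree.ElementTree as ET


def parameters_from_metadata(data_group):
    """Read the assumed Parameters/Parameter children into a dict of strings."""
    params = data_group.find("Parameters")
    if params is None:
        return {}
    return {p.get("name"): (p.text or "").strip()
            for p in params.findall("Parameter")}


def check_parameters(data_group, index_row):
    """Return a list of mismatches; an empty list means the copy is consistent."""
    stored = parameters_from_metadata(data_group)
    expected = {name: str(value) for name, value in index_row.items()}
    problems = []
    for name in sorted(set(stored) | set(expected)):
        if stored.get(name) != expected.get(name):
            problems.append(
                f"{name}: metadata={stored.get(name)!r}, index={expected.get(name)!r}"
            )
    return problems


def regenerate_parameters(data_group, index_row):
    """Replace Parameters with a fresh copy generated from the Index row."""
    old = data_group.find("Parameters")
    if old is not None:
        data_group.remove(old)
    params = ET.SubElement(data_group, "Parameters")
    for name, value in index_row.items():
        ET.SubElement(params, "Parameter", {"name": name}).text = str(value)
```

Generating the element from the Index row avoids asking the experimenter to type the same values twice, while the check catches hand-edited metadata that has drifted from the index.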
The metadata are constructed so as to be separable from the data, so that a complete description of the archive can easily be produced. Such a description requires little storage and can also be made part of repositories, because the metadata include pointers to the archive location. DataGroup elements can be extracted separately for increased efficiency and flexibility.
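As a rough illustration of this separability, the following sketch splits an Archive document into a lightweight catalogue entry (everything except the DataGroup elements) and a set of standalone DataGroup elements. It assumes that DataGroup elements are direct children of Archive and carry a `name` attribute; neither detail is confirmed by the description above, and `split_metadata` is a hypothetical helper.

```python
import copy
import xml.etree.ElementTree as ET


def split_metadata(archive):
    """Split an Archive element into (catalogue, groups).

    catalogue: a copy of the Archive without its DataGroup children.
    groups:    a dict mapping each assumed `name` attribute to a standalone
               copy of the corresponding DataGroup element.
    The input element is left unmodified.
    """
    catalogue = copy.deepcopy(archive)
    for group in catalogue.findall("DataGroup"):
        catalogue.remove(group)
    groups = {g.get("name"): copy.deepcopy(g)
              for g in archive.findall("DataGroup")}
    return catalogue, groups


example = ET.fromstring(
    '<Archive><Study/><DataGroup name="run_0001"/>'
    '<DataGroup name="run_0002"/></Archive>'
)
catalogue, groups = split_metadata(example)
print(ET.tostring(catalogue, encoding="unicode"))
print(sorted(groups))
```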
The schema does not include a way to serialise the archive data. This choice was made because we see the possibility of exporting the whole archive (data plus metadata) in XML format as a feature at a different level. Specifically, HDF5 has an XML schema and a tool to convert a whole HDF5 binary file to XML.