Last modified on 10/10/2007 11:03:18 PM

Unstructured Data Management (UDM)

Files play a major role in LSST and can be considered unstructured data (data without a schema), even though FITS files and images do have an internal structure.

A holistic approach is needed to manage files throughout their lifecycle, including (but not limited to) creation, file transfer, modification, file format, provenance, distribution and archival. The tools, protocols and file formats used may differ (or remain the same) from one stage to the next in the lifecycle of an LSST file.

DC2 Tests

  • Protocols: Four tools or protocols have been chosen for testing: UDT, iRODS, GridFTP and SRB.
  • Sites and Infrastructure: Three sites have been chosen to perform the Unstructured Data Management (UDM) tests: NCSA, SDSC and IN2P3. Chris Cribbs has already set up the required environment (nodes 5,6,7) in the LSST cluster at NCSA.
  • File Size and Format: No particular file format has been chosen so far. Flat files of sizes ranging from 1 CCD to a full exposure could be used. The final format(s) or methodologies used must scale up efficiently. The format(s) must also be appropriate for the life cycle of the LSST file - creation, transfer, update, replication, distribution and archival.

Metrics or features

As mentioned above, a holistic approach that spans the life cycle of files has to be taken for unstructured data management. However, for lack of time, we will focus only on the major issues in the following three categories:

1) Data transfer metrics
2) Data management features
3) Data management reliability

Data transfer metrics

  • Raw speed: The plain transfer of data from disk to disk, i.e. the ability of a protocol to pull/push bytes. This could be measured in many ways, for example as elapsed time or as the percentage of bandwidth used.
  • Continuous transfers: Transfer a large number of files for N hours (N could be 10 or 20 hours) with almost zero human intervention.
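
The two ways of expressing raw speed mentioned above can be sketched as a small calculation. This is only an illustration: the function name transfer_metrics, its parameters and the sample numbers are assumptions made for the example and are not part of the DC2 test plan.

```python
def transfer_metrics(bytes_moved, seconds, link_bits_per_second):
    """Return (throughput in MB/s, fraction of the link bandwidth used).

    All three inputs are measured or known by the person running the test;
    nothing here is specific to UDT, iRODS, GridFTP or SRB.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bytes_per_second = bytes_moved / seconds
    throughput_mb_s = bytes_per_second / 1e6  # decimal megabytes per second
    utilization = (bytes_per_second * 8) / link_bits_per_second
    return throughput_mb_s, utilization


# Hypothetical numbers: 10 GB moved in 100 s over a 1 Gbit/s link.
mb_s, used = transfer_metrics(10e9, 100.0, 1e9)
print(f"{mb_s:.1f} MB/s, {used:.0%} of link")
```

For a continuous transfer run, the same calculation could be applied per file or per hour to see whether the rate holds up over the N hours.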

Data management features

  • Fewer components: As few dependent components as possible should need to be managed to provide all features of data management, including replication, reliability, logical-physical file mapping etc. (fewer moving parts, i.e. fewer systems that can become inconsistent during operations). At the same time, there must be no single point of failure.
  • Infrastructure adaptive: A lesson learnt from moving terabytes between continents on a daily basis is that the flexibility to adapt to and operate in the different existing infrastructures at different centers, with the same or almost equal metrics, is critical.
  • Mass Storage: The format and the data management protocol to be used must be able to work with mass storage systems.
  • Replication: Even though it is not a necessary feature, the ability to maintain (or synchronize) replicas would be a useful feature for data access centers and for the dissemination of LSST data.
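
As an illustration of what logical-physical file mapping and replica tracking involve, here is a minimal in-memory sketch. The class ReplicaCatalog, its method names and the file paths are hypothetical; they do not describe how iRODS, SRB or any other candidate tool implements its catalog.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Map each logical file name to its physical copies, keyed by site."""

    def __init__(self):
        # logical name -> {site name: physical path at that site}
        self._replicas = defaultdict(dict)

    def register(self, logical_name, site, physical_path):
        """Record that a copy of logical_name exists at site."""
        self._replicas[logical_name][site] = physical_path

    def locations(self, logical_name):
        """Return a copy of the {site: path} mapping for one logical file."""
        return dict(self._replicas.get(logical_name, {}))

    def missing_at(self, site):
        """List logical files that have no replica at site yet."""
        return sorted(
            name for name, sites in self._replicas.items() if site not in sites
        )


# Hypothetical file names and paths; the site names come from the test plan.
catalog = ReplicaCatalog()
catalog.register("exposure-0001.fits", "NCSA", "/lsst/raw/exposure-0001.fits")
catalog.register("exposure-0001.fits", "SDSC", "/archive/lsst/exposure-0001.fits")
catalog.register("exposure-0002.fits", "NCSA", "/lsst/raw/exposure-0002.fits")
print(catalog.missing_at("SDSC"))
```

A synchronization job could walk the result of missing_at for each site and schedule the transfers. Keeping this mapping in one place is also exactly the kind of component that must not become a single point of failure.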

Data management reliability

(Murphy's law does not need one more proof)

  • Data loss detection: The system or protocol must find loss or corruption of data that is at rest. This saves a lot of operational time.
  • Data transfer integrity: Checksum comparison at different levels (CRC, MD5 etc.) for data in motion.
  • Data replication: Replication should be provided as part of the data management protocol.
  • Recovery and Restart: The software or protocol must be able to recover and restart seamlessly or have mechanisms to recover from failures during data transfer or other data management operations.
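
The checks listed above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: a local file copy stands in for a network transfer, and the function names (file_digests, verified_copy, find_corrupted) and the manifest layout are hypothetical rather than taken from any of the tools under test.

```python
import hashlib
import shutil
import time
import zlib
from pathlib import Path

CHUNK = 1 << 20  # read in 1 MiB pieces so large images do not fill memory


def file_digests(path):
    """Return (CRC32 as int, MD5 hex digest) for the file at path."""
    crc = 0
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(CHUNK)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
    return crc, md5.hexdigest()


def verified_copy(src, dst, attempts=3, wait_seconds=1.0):
    """Copy src to dst and retry until the destination digests match.

    Returns the number of the attempt that succeeded. The local copy is a
    stand-in (assumption) for a real transfer between sites.
    """
    expected = file_digests(src)
    for attempt in range(1, attempts + 1):
        try:
            shutil.copyfile(src, dst)
            if file_digests(dst) == expected:
                return attempt
        except OSError:
            pass  # treat an I/O error like a failed transfer and retry
        if attempt < attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(f"could not copy {src} intact after {attempts} attempts")


def find_corrupted(manifest):
    """Given {path: digests} recorded earlier, list paths lost or changed at rest."""
    bad = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or file_digests(path) != recorded:
            bad.append(path)
    return bad
```

Here verified_copy covers integrity of data in motion together with a simple restart, while find_corrupted would be run periodically against a stored manifest to detect loss or corruption of data at rest. This sketch restarts a failed copy from the beginning; a production protocol would preferably resume a partial transfer.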