Tuesday, 5 April 2011
I2S2 Project workshop at RAL-STFC
During a busy week in terms of research data management events (to be reported shortly on this blog), last Friday, 1 April, Sonex had the opportunity, thanks to Simon Hodson, JISC MRD programme manager, to attend the I2S2 Project workshop at the Rutherford-Appleton Laboratory (RAL) at STFC in Didcot. I2S2, standing for 'Infrastructure for Integration in Structural Sciences', is a JISC MRD project ending in Mar 2011 that aims to "identify requirements for a data-driven research infrastructure in 'Structural Science', focusing on the domain of Chemistry, but with a view towards inter-disciplinary application".
Several presentations were delivered during the meeting: Brian Matthews on the I2S2 project achievements, the ICAT architecture and the CSMD metadata standard; Brian McMahon, International Union of Crystallography (IUCr), on 'Information Management and Publication in Crystallography'; Tom Griffin on the TopCAT GUI for managing data coming out of the STFC ISIS and DIAMOND facilities; Steve Androulakis on the ANDS-supported TARDIS project at Monash University; Mark Borkum on OreCHEM files; Chris Morris on PiMS (Protein Information Management System); and Juan Bicarregui on the EU PANData project.
The IUCr presentation identified the need for filing and preserving different data categories, such as raw measurements, processed numerical data, derived information and the parameters. The talk also stressed the convenience of providing access to raw diffraction images: these files are a few GB in size, and thus not large enough for Data Centres but too big for sites such as CCDC. A review of the Crystallographic Information Framework (CIF) file formats was provided. imgCIF is used for storing the raw data coming out of the experiment and .fcf for the structure factors obtained after data reduction. A final stage of structure solution and refinement is then performed in the lab before the author starts formatting the results into an IUCr paper, at which point CIF is translated into SGML to produce the final fcf, cif, pdf and html versions.
Raw data was said to be kept for 183 days at STFC and for 3 months at the Australian Synchrotron (in which TARDIS is involved), and a discussion followed on the need to reach some agreement on the kind of data that ought to be stored and preserved. The process of attaching DOIs to datasets was also discussed, with IUCr presently involved in projects such as XYZ or Open Bibliography in order to promote this objective.
A TopCAT demo was provided by Tom Griffin. This open-source GUI (see image above) is being used for storing raw data from STFC facilities such as ISIS and DIAMOND. TopCAT provides access to its contents through an open registration system, thus operating as a sort of STFC institutional data repository, and could potentially be applied to other institutions, facilities and disciplines.
The TARDIS presentation by Steve Androulakis (Monash University, Australia) mentioned their use of XML/METS metadata standards for research data description in the federated institutional repository platform. The platform was initially meant to store X-ray diffraction images and later evolved into a much larger initiative, with applications in microscopy (MicroTARDIS), particle physics and gene processing through the Squirrel software.
Finally, further presentations were delivered on PiMS (Protein Information Management System) by Chris Morris, STFC, and on the European PANData project by Juan Bicarregui, STFC e-Science. PANData aims to build a Photon and Neutron Data Infrastructure through a consortium of European synchrotron facilities and neutron sources.
A final summary was made of the whole set of I2S2-related features presented (imgCIF, CIF, IUCr/XML/RDF BIBLIO, PDBML, CML, ICAT, TopCAT, ICAT Lite/CSMD, TARDIS, PiMS, PANData, NeXus) by mapping them onto the I2S2 Idealized Scientific Research Activity Lifecycle Model (see image above; click on it for an updated version). References were also made to other initiatives not represented at the meeting, such as the Quixote Project for Computational Chemistry CML data management or Protein Production and Crystallization.