Friday, 26 August 2011
STM research data management and the Quixote Project
A one-day seminar was held yesterday, Thursday 25 August, at the Zaragoza Scientific Center for Advanced Modeling (ZCAM) on research data management and the Quixote Project for data management in Computational Chemistry. The session, entitled “Research data management: The experience of the Quixote project for Quantum Chemistry data. Can it be extended into a collection of research data management repositories?”, was attended by a rather diverse group of researchers (computational chemists as well as researchers from other disciplines) and repository managers, who came to learn about research data management initiatives and specifically about the progress of the Quixote Project, in which two researchers from the University of Zaragoza and the CSIC Institute of Physical Chemistry "Rocasolano" are involved.
The Quixote Project (see the paper "The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age", in press with the J Chem Inf) is developing the infrastructure required to convert output from a number of different molecular quantum chemistry (QC) packages, such as NWChem or Gaussian, to a common, semantically rich, machine-readable format and to build repositories of QC results.
The session started with an introduction to "STM Research data management initiatives in Spain and abroad" delivered by SONEX member Pablo de Castro, in which different national approaches to RDM were presented based mainly on the information collected at the JISC MRD Programme International Workshop held last March in Birmingham.
The different approaches to data management taken by the JISC and the SURF Foundation were discussed at Q&A time: for the JISC, datasets are assets per se, regardless of whether they are attached to a research paper as supplementary material, whereas the 'Enhanced publication' approach of the SURF Foundation in the Netherlands regards datasets mainly as digital objects connected to research publications. It was emphasised that the upcoming OpenAIREPlus European project shares the SURF approach.
Two presentations on the Quixote Project followed: "From Databases in QC 2010, ZCAM, Sep 2010 onwards: a brief history of Quixote" by Jorge Estrada and "The Quixote Project: a pioneering work in managing Computational Chemistry research data" by Pablo Echenique. Both Quixote project members explained the results, the challenges and the cooperation opportunities of this RDM project, which has no specific funding of its own, and engaged in a fruitful dialogue with the attending researchers and repository managers on how the QC data assets could best be managed.
Finally, Peter Murray-Rust closed the morning talks with some reflections on the subject "Entering a new era in data management"; see his blog post for a summary of his ideas.
In the afternoon there were joint debates on how to improve the implementation of research data management initiatives. Researcher motivation for dataset sharing was extensively debated: ideally, this motivation should arise not just from a funding agency requiring the data to be made available, but from the sheer advantages (as summarized by Peter Murray-Rust) that sharing would bring to research practice and communication ("improving methodology").
A separate debate session was held to discuss how to start developing a research data management infrastructure in countries where work in this area is only just beginning. The participants in the debate put together the following recommendations:
- A workgroup of IT professionals (not just library-based ones) should be put together to analyse the current infrastructure and the opportunities for building new initiatives on potentially reusable pre-existing ones,
- It would be advisable to analyse researchers' behaviour and needs with regard to storing their datasets in international data-sharing platforms (where these are available for their specific discipline),
- It would be interesting to examine the motivation for data sharing among research groups in different research areas, so that initial efforts to develop data management infrastructures can start with the areas more willing to share their data (Earth Sciences recurrently showing up when analysing the international perspective),
- Pioneering initiatives in which Institutional Repositories (such as eSpacio UNED and Digital.CSIC) provide data handling and storage services to STM researchers should be highlighted as models to be replicated,
- The OpenAIREPlus/SURF Enhanced papers approach could be a good starting point for Institutional Repositories: they could find out which of their presently filed papers have supplementary data attached at the journal site and try to manage those datasets independently,
- The session's talks with researchers revealed a need for a dataset management system at research centres for basic internal organisation purposes. Datasets filed in this internal storage system may or may not be intended for publication,
- Production and publication of potentially citable datasets should be acknowledged as a relevant scientific contribution for research assessment purposes,
- There are big differences in needs, procedures and required infrastructure regarding data management between Big Science and long-tail science (most of the latter actually being groups of three researchers in a lab with specific needs of their own),
- The Library is a potential supplier of know-how on data processing and storage for researchers, and that role should be promoted within institutions,
- The Spanish e-Research National Network, which mostly deals with Grid and supercomputing initiatives, might provide a good workgroup infrastructure for pioneering data management initiatives in Spain,
- There are real collaboration opportunities between the Quixote Project and the research information management infrastructure at the University of Zaragoza (two IRs being currently available, Zaguan at the University and Digital.CSIC at the Spanish National Research Council, CSIC),
- Involving research staff (mainly PhD students) in the management and operation of dataset information management systems (such as the Chempound data repository at the University of Cambridge) seems to be a prerequisite for the success of data management initiatives,
- Owing to the specific features of the data in each research area, the incipient data management infrastructure available is more developed for the Social Sciences and Humanities than for STM research areas.