The first objective of the JISC-supported Sonex initiative was to identify and analyse deposit opportunities (use cases) for ingest of research papers (and potentially other scholarly work) into repositories. Later on, the project scope widened to include identifying and disseminating various projects being developed at institutions in relation to the deposit use cases previously analysed. Finally, Sonex was recently asked to extend its analysis of deposit opportunities to research data.

Wednesday, 27 April 2011

National initiatives for promoting data management strategies: an overview

- "Hello, I want to deposit my data"
- "Sir, this is a library!"
- "Sorry" -he whispers- "I want to deposit my data".
(as told by Brian Hole, British Library, during his presentation of the DRYAD UK initiative)

The main objective of the JISC MRD International Workshop held last month was to review the progress achieved by the JISC Managing Research Data Programme and to discuss it in the context of broader international developments.

As stated in the workshop programme overview, "this dimension reflects key partnerships which JISC, the JISCMRD Programme and the DCC has been building through the IDCC Conference, the Knowledge Exchange and other initiatives. They include the Australian National Data Service, the NSF funded DataNet Projects, institutions in the US and Australia, the DFG, SURF, DANS etc".

Within this broader context, besides a couple of preliminary talks on the European Union approach to (and future funding of) data management initiatives - by John Wood, on the EU 'Riding the wave' report, and by Carlos Morais-Pires, on the Digital Agenda for Europe - the workshop featured a specific session on "National and international infrastructure initiatives" whose first panel was called "Approaches and strategies in the UK, US, and Germany". Australian and Dutch national or specific approaches were also discussed, either at this session or later in the event.

Besides the national initiatives featured in this and later sessions of the meeting (it was reassuring to see such a broad range of strategies and already running projects taking place at the same time in so many different countries), there are also additional, sometimes preliminary, initiatives for promoting data management policies at national or institutional level in other countries such as Finland, Portugal, France, Poland or South Africa.

As new initiatives for research data management keep steadily coming up, this session was an opportunity to get an informal update on DCC's report 'Comparative Study of International Approaches to Enabling the Sharing of Research Data' - see its summary and main findings here as of Nov 2008.

Digital Curation Centre - UK
Kevin Ashley, Digital Curation Centre (DCC), described the present picture of data management in the UK as "a new context", where universities are increasingly willing to take responsibility for data management (especially in areas not covered by Data Centres).
Now that UK funder and NSF rules for Data Management Planning are being implemented, such advance planning is becoming very important for funders, researchers, institutions, collaborators and reusers. Current DCC tasks include integrating different Data Discovery Services and building institutional capacities: skills, policies, etc. Besides that, the DCC is providing the new DMP Online service, aimed at producing and maintaining Data Management Plans.
The good news is that, despite varying degrees of involvement, institutions in the UK have accepted their role in RDM.

NSF-funded DataNet Projects - US
A summary of the present state of research data management in the US was provided by presentations of the DataONE and DataConservancy initiatives, delivered respectively by William Michener (University Libraries at U New Mexico) and Sayeed Choudhury (Johns Hopkins University).

After stating that "researchers are presently using 90% of their time managing data instead of interpreting them", W. Michener presented the Data Observation Network for Earth (DataONE) initiative (a live DataONE presentation at U of Tennessee is available). This NSF-supported initiative aims to ensure preservation of and access to multi-scale, multi-discipline, and multi-national science data. DataONE Coordinating Nodes around the world will help achieve the international collaboration needed to solve the grand science and data challenges, particularly with regard to education.

The DataConservancy initiative aims to research, design, implement, deploy, and sustain data curation infrastructure for cross-disciplinary discovery with an emphasis on observational data. S. Choudhury's presentation stressed the need for data preservation as a necessary condition for data reuse and introduced the recent connection of data and publications through as one of the pilot projects that build upon the Project APIs.

DFG - Germany
New DFG information infrastructure projects in Germany were presented by Dr Stefan Winkler-Nees, who mentioned both the Jan 2009 DFG Recommendations for Secure Storage and Availability of Digital Primary Research Data, as a base report for promoting standardized work in the data management area, and the DFG's running call for proposals "Information infrastructures for research data". Projects selected under this call are due to be announced shortly and will start in May/June 2011. Finally, in a line of thought shared with other initiatives, Dr Winkler-Nees mentioned that the DFG is aiming at the teaching and qualification of both researchers and data curators.

SURF Foundation & DANS - The Netherlands
Later in the workshop, John Doove presented the SURF Enhanced Publications initiative within the SURFshare programme 2007-2011. Six new projects funded during 2011 by the SURF Foundation will allow researchers from a variety of disciplines to share datasets, illustrations, audio files, and musical scores with fellow researchers in the context of Enhanced Publications (programme video available on YouTube). There were already two previous grant rounds for Enhanced Publications. The six running projects, whose results are due in May 2011, take place within five disciplines: Economics (Open Data and Publications, Tilburg University), Linguistics (Lenguas de Bolivia, Radboud University Nijmegen, and Enhanced NIAS Publications, KNAW-Royal Netherlands Academy of Arts and Sciences), Musicology (The Other Josquin, Utrecht University), Communication sciences (Enhancing Scholarly Publishing in the Humanities and Social Sciences, KNAW) and Geosciences (VPcross, KNAW).

The Dutch strategy for increasing the amount of research data available online was rounded off by the presentation "Sustainable and Trusted Data Management" delivered by Laurent Sesink (DANS-Data Archiving and Networked Services). DANS, established in 2005, deals with storage and continuous accessibility of research data in the social sciences and humanities and promotes the 'Data Seal of Approval' for certification of data repositories, which guarantees, via a series of required criteria, a high-quality and reliable way of managing research data.

Australian National Data Service (ANDS) - Australia
Finally, Andrew Treloar, Director of Technology, Australian National Data Service (ANDS), supplied a comprehensive perspective from a national infrastructure provider and in a way summarized the previous talks by saying that, despite differences, there are common themes emerging in national approaches to data management, as there are things only national initiatives can do. During his plenary presentation "Data: Its origins in the past, what the problems are in the present, and how national responses can help fix the future" he mentioned, for instance, that Hubble Space Telescope-related publication statistics show that twice as much research is being done thanks to data reuse. Efficiency, validation, integrity of scholarly records, value for money and self-interest were listed as (non-altruistic) arguments for data reuse.

Having the chance to attend this series of brilliant presentations and seeing how policies for opening access to research data keep spreading across institutions and countries were undoubtedly among the highlights of the Birmingham workshop. The next opportunity for keeping up with it all will be next November at the Knowledge Exchange Workshop on Research Data Management in Bonn, Germany.

Monday, 25 April 2011

Could external cooperation improve collection of specific JISC MRD project-related information?

In the coming days SONEX will be publishing some posts on the JISC MRD Programme International Workshop held on March 28-29th at Aston Business School Conference Centre, Birmingham. Certain aspects debated at this comprehensive meeting were very useful for establishing an approach to research data management from a SONEX viewpoint, as discussed in a SONEX meeting at EDINA on Mar 30th whose outcome will also be blogged shortly.

See the report by Brian McMahon (IUCr) for a general review of the JISC MRD workshop.

One of the most visible disciplinary approaches to data management presented at the JISC MRD event -which featured all kinds of institutional and subject-based initiatives in the area- was the one coming from meteorology, palaeoclimatology and climate-related sciences: there was a presentation of the PEG-BOARD Project (U of Bristol) at the Subject-Oriented Approaches session on Monday, followed by ACRID (U of East Anglia & STFC) and Metafor (BADC & STFC) Project presentations on Tuesday afternoon.

One of the most relevant features of these climate-related projects is interdisciplinarity. The PEG-BOARD Project in particular aims to serve the archaeology research community by supplying it with palaeoclimate data.

A few specific aspects of PEG-BOARD were discussed after the project presentation. The interesting thing about them is that they were not mentioned during the talk, nor are they reported on the project site:

- Due to the project's interdisciplinarity, there are two clearly different user groups for the palaeoclimatology data produced: climatologists, who understand the nature of the datasets involved, as these are central to their discipline, and archaeologists, who do not and need not know much about the data format but need the information contained in it for their own purposes, thus acting as regular non-technical users of the project rather than as researchers. However, as they are indeed researchers, the feedback they may provide on the project outcome could be all the more valuable.

- What archaeologists care about in the end is the data plots, and Data Centres will not provide such processing. So what PEG did was implement specific software capabilities that address the needs of non-technical data users (i.e. archaeologists), so as to allow them to search for the plots or false-colour graphics they need. This piece of middleware is a key conceptual feature of the project in terms of deliverables.

- Climate data is usually archived in binary format, so it's often not easy to process. The UK Met Office provided a lot of information, often incomplete or in old formats. The process of adapting raw data to the project's needs was very interesting and worth disseminating.

- Climate models were written in FORTRAN. When re-written or translated into C++, the results would vary for the same data arrays due to specific treatment by the code. That poses quite a challenge in terms of model interpretation.

- When asked whether researchers provided enriched metadata for their data, the answer was that there's usually an input in terms of past experiments, i.e. "this is the data outcome of such and such experiment when changing initial conditions in such a way". That earlier experiment would in turn be described the same way, until one was reached that wasn't described at all.
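
The discussion did not detail why the FORTRAN and C++ versions of a model disagreed. One common cause of this kind of discrepancy (an assumption here, not something reported by the project) is that floating-point addition is not associative, so a translated loop that accumulates the same values in a different order can return a slightly different result. A minimal Python sketch, with hypothetical values and function names that are not taken from any climate model:

```python
# Illustrative only: values and names are hypothetical, not from PEG-BOARD.

def sum_left_to_right(values):
    """Accumulate in the order the values are stored."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate the same values in reverse order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_left_to_right(values)
backward = sum_right_to_left(values)

# Same inputs, mathematically the same sum, yet the two results
# differ in the last binary digit because of rounding at each step.
print(forward == backward)
print(abs(forward - backward))
```

In a long-running model such last-digit differences can accumulate over many time steps, which is one reason why a port to another language or compiler may not reproduce earlier output exactly.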

The fact that none of these project aspects is recorded or discussed on the project blog raises the question of whether an external approach to data management projects might collect and disseminate very interesting information that researchers may not consider relevant enough to discuss on project blogs. Such an external approach to running projects might be carried out by data librarians in order to share these specific project details with the data management community.

For whatever it may be worth, Sonex would be keen to do this kind of job for the MRD community.

Tuesday, 5 April 2011

I2S2 Project workshop at RAL-STFC

During a busy week in terms of research data management events (due to be reported shortly on this blog), last Friday, Apr 1st, Sonex had the opportunity - thanks to Simon Hodson, JISC MRD programme manager - to attend the I2S2 Project workshop at the Rutherford-Appleton Laboratory (RAL) at STFC in Didcot. I2S2, standing for 'Infrastructure for Integration in Structural Sciences', is a JISC MRD project ending in Mar 2011 that aims to "identify requirements for a data-driven research infrastructure in 'Structural Science', focusing on the domain of Chemistry, but with a view towards inter-disciplinary application".

Several presentations were delivered during the meeting: Brian Matthews on the I2S2 project achievements, the ICAT architecture and the CSMD metadata standard; Brian McMahon, International Union of Crystallography (IUCr), on 'Information Management and Publication in Crystallography'; Tom Griffin on the TopCAT GUI for management of data coming out of the STFC ISIS and DIAMOND facilities; Steve Androulakis on the ANDS-supported TARDIS project at Monash University; Mark Borkum on OreCHEM files; Chris Morris on PiMS (Protein Information Management System); and Juan Bicarregui on the EU PANData project.

During the IUCr presentation, the need was identified for filing and preserving different data categories such as raw measurements, processed numerical data, derived information and the parameters. The value of providing access to raw diffraction images was also stressed, these files being a few GB in size and thus not large enough for Data Centres but too big for sites such as CCDC. A review of Crystallographic Information Framework (CIF) file formats was provided, with imgCIF being used for storing raw data out of the experiment, .fcf for including structure factors after data reduction, and a final stage of structure solution and refinement being performed in the lab before the author starts formatting the results into an IUCr paper, which would translate CIF into SGML for producing the final fcf, cif, pdf and html versions.

Raw data was said to be kept for 183 days at STFC and 3 months at the Australian Synchrotron (in which TARDIS is involved), and a discussion followed on the fact that some agreement should be reached on the kind of data that ought to be stored and preserved. The process of attaching DOIs to datasets was also discussed, IUCr being presently involved in projects such as XYZ or Open Bibliography in order to promote this objective.

A TopCAT demo was provided by Tom Griffin. This open source GUI (see image above) is being used for storing raw data from STFC facilities such as ISIS and DIAMOND. TopCAT provides access to its contents through an open registration system, thus operating as a sort of STFC institutional data repository, and would be potentially applicable to other institutions, facilities and disciplines.

The TARDIS presentation by Steve Androulakis (Monash University, Australia) mentioned the use of XML/METS metadata standards for research data description at this federated institutional repository platform, which was initially meant to store X-ray diffraction images and later evolved into a much larger initiative with applications in microscopy (MicroTARDIS), particle physics and gene processing through the Squirrel software.

Finally, extra presentations were delivered on PiMS (Protein Information Management System) by Chris Morris, STFC, and on the European PANData project by Juan Bicarregui, STFC e-Science. PANData aims to build a Photon and Neutron Data Infrastructure through a consortium of European synchrotron facilities and neutron sources.

A final summary was made of the whole set of presented I2S2-related features (imgCIF, CIF, IUCr/XML/RDF BIBLIO, PDBML, CML, ICAT, TopCAT, ICAT Lite/CSMD, TARDIS, PiMS, PANData, NeXuS) by mapping them onto the I2S2 Idealized Scientific Research Activity Lifecycle Model (see image above; click on it for an updated version). References were also made to other initiatives not represented at the meeting, such as the Quixote Project for Computational Chemistry CML data management, or Protein Production and Crystallization.