The first objective of the JISC-supported Sonex initiative was to identify and analyse deposit opportunities (use cases) for the ingest of research papers (and potentially other scholarly work) into repositories. Later on, the project scope widened to include the identification and dissemination of various projects being developed at institutions in relation to the deposit use cases previously analysed. Finally, Sonex was recently asked to extend its analysis of deposit opportunities to research data.

Sunday, 16 January 2011

"On such a full sea are we now afloat"

This quotation, from Shakespeare's 'Julius Caesar', closed Drs. Eefke Smit's talk "Taking the Current when it Serves: Research Data from the Publisher's Perspective", delivered at the 'Academic Publishing in Europe' (APE) 2011 conference, held at the Berlin-Brandenburg Academy of Sciences in Berlin, Jan 11-12th, 2011.

Aiming to gather facts for its ongoing analysis of research data management and its deposit into repositories, Sonex attended APE2011, a meeting for the publishing industry and its environment held yearly in Berlin since 2006. The conference organisers regularly publish a brief official report shortly after the event (reports on previous APE editions are available here; the report on this edition is due shortly).

This particular visit to Berlin offered the chance to attend another event besides APE2011: the SOAP Symposium. The final report of the SOAP (Study of Open Access Publishing) project survey was presented at this one-day meeting, held on Jan 13th in the Goethe Room of the renowned Harnack-Haus in Berlin. The SOAP project describes and analyses the open access publishing landscape and explores the risks and opportunities of the transition to open access publishing for libraries, publishers and funding agencies - see the preliminary survey results; the final report will be available as of next March.

The conference programme for APE2011, entitled "Smarter Publishing in the New Decade", included promising topics such as the evolution of peer review and ways to improve it, the so-called data deluge, business opportunities in China, and how Open Access is becoming increasingly mainstream within the publishing environment. Discussions on these matters were lively, both at the round tables and during the lunch breaks. Since Sonex's interest lies mainly in research data management, this report focuses on the presentations and debates on that subject.

On the afternoon of Tuesday, Jan 11th, a session was held on “The Data Deluge: to Drown or to Swim?”, chaired by Bob M. Campbell. Herbert Gruttemeier, INIST-CNRS, started his presentation "Helping to Ride: a look at data sharing and access policies" by reminding the audience that, since we were in Berlin, the definition of an Open Access Contribution on page 1 of the Berlin Declaration on Open Access to Knowledge includes “raw data and metadata”. Some highlights from his talk were:

  • A large number of data sharing policies are being defined by administrations, institutions, funding agencies and publishers themselves under the guideline that "data should be made as freely and widely available as possible". See for instance the NSF’s requirement for the submission of data management plans, of May 10th, 2010, under the general policy statement: “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants”.
    Or the very recent (Jan 10th, 2011) commitment by a group of major international funders of public health research to “work together to increase the availability of data emerging from our funded research, in order to accelerate advances in public health”.

  • Publishers such as BioMed Central were featured as high-profile supporters of Open Data (see the Dec 11th, 2010 post at this blog), and NPG's editorial policy on dataset sharing was specifically mentioned during the talk, as well as the Brussels Declaration on STM Publishing statement that “Raw research data should be made freely available to all researchers”. Finally, discipline-based data policies were cited, such as the PaN-Data Scientific Data Policy Draft for a Scientific Data Management Framework at European Photon and Neutron Facilities, or the Joint Data Archiving Policy (JDAP) adopted in a coordinated fashion by Dryad partner journals.

  • Not everything is that simple, though: the Nov 2009 RIN report "Patterns of information use and exchange: case studies of researchers in the life sciences” shows that researchers are not so eager to share their data with others, and that ‘one-size-fits-all’ information and data sharing policies may not achieve the goals they are aiming for, namely scientifically productive and cost-efficient information use in the life sciences.

Drs. Eefke Smit, International Association of STM Publishers, provided a counterexample to these growing publisher data sharing policies during her talk on "Research Data from the Publisher's Perspective" by describing the Journal of Neuroscience's policy of no longer accepting supplementary material from authors as of Nov 1st, 2010, the procedure having posed too heavy a burden on paper reviewers.
She also warned of the so-called data deluge, according to which tera- and petabyte-sized datasets will make up an increasing share of research projects in the coming years.
However, when researchers are asked where they would like to submit their research data, the answer is more often than not "publishers". This raises the issue of research data preservation: the results of an internal survey by STM publishers show what she called “an improvable situation” with regard to preservation.

The planned talk “Data Publishing in the Context of the ICSU World Data System” by Dr. Michael Diepenbroek, Director of WDC-MARE/PANGAEA, University of Bremen, was ultimately dropped from the conference programme. However, the next speaker, Dr. Jan Brase, Managing Director of DataCite, provided some information on the progress of one of the main databases for research data in the geosciences area, stating for instance that there was “a wide cooperation between Elsevier and PANGAEA via DOI-based external links from online papers” at the former’s platforms. This kind of cooperation between publishers and international databases for handling research data might be useful for tackling the abovementioned data preservation issues.
Dr. Brase, affiliated with the German National Library of Science and Technology Hannover, also described the evolution of the DataCite international project as it is carried out by local member institutions: as of Dec '10, over 1M records had already been registered with DOI names. Perspectives for the project include the setting up of a Central Metadata Base as of Jun '11; DataCite becoming a harvesting point for third parties such as WoS; and cooperation via CrossRef for data-article lookup.

The data management session ended with the talk on “Managing Publication and Research Data: the eSciDoc Research Infrastructure” by Dr. Malte Dreyer from the Max Planck Digital Library (MPDL). eSciDoc is a joint project of the Max Planck Society and FIZ Karlsruhe, funded by the Federal Ministry of Education and Research (BMBF), with the aim of realising a next-generation platform for communication and publication in research organisations. Further eSciDoc projects mentioned during the presentation that deal with research data management were the ‘Astronomer‘s Workbench’ (astronomy), Lifecycle Logger (biochemistry) and BW-eSci(T) for computational linguistics. The DARIAH (Digital Research Infrastructure for the Arts and Humanities) project (in whose development eSciDoc is directly involved) and the CLARIN (Common Language Resources and Technology Infrastructure) project were repeatedly highlighted during the session as leading EU projects on the development of digital research infrastructure (including data management) for the Humanities and Social Sciences.

A joint panel discussion was held after the presentations on research data management, with speakers taking questions from the floor. Alicia Wise, Elsevier Director of Universal Access and a former archaeologist, raised the issue of the costs attached to research data management and who should fund them: the panellists agreed that national funding bodies should assume the cost of data management. In her question, Dr. Wise incidentally mentioned that data management at the archaeological research project she used to work for succeeded only thanks to researchers dedicating 50% of their time to data curation. This aspect of dataset deposit will be examined by Sonex in order to identify alternative (automatic) curation procedures currently being used to relieve researchers of the data curation burden.

The data management issues extended well beyond the session specifically devoted to them and into the Innovation session held the next day, where the presentation by Adam Marshall (Portland Press) on the Semantic Biochemical Journal and Project Utopia at the Manchester School of Computer Science dealt extensively with data handling (see “Calling International Rescue: knowledge lost in literature and data landslide!”, Biochem. J. (2009) 424, 317–333, for a review on “how to provide new ways of interacting with the literature, and new and more powerful tools to access and extract the knowledge sequestered within it”).

At the end of the data session panel discussion, Drs. Eefke Smit summarised the three challenges of research data management: normalisation, standardisation and migration. She also reminded the audience of the verses surrounding the one quoted in the title of this post:

(…) On such a full sea are we now afloat,
And we must take the current when it serves,
Or lose our ventures.