The first objective of the JISC-supported Sonex initiative was to identify and analyse deposit opportunities (use cases) for the ingest of research papers (and potentially other scholarly work) into repositories. Later on, the project scope widened to include the identification and dissemination of various projects being developed at institutions in relation to the deposit use cases previously analysed. Finally, Sonex was recently asked to extend its analysis of deposit opportunities to research data.

Saturday, 17 September 2011

Progress on Researcher ID initiatives: IRISC 2011 Helsinki

The problem with names...

Prof. Carlos Martínez-Alonso is a renowned Spanish senior biochemist. He was President of the Spanish National Research Council (CSIC) when the institution signed the Berlin Declaration in January 2006. Prof. Martínez-Alonso has published hundreds of papers in high-impact-factor journals. However, if you try to retrieve a complete list of his publications from the PubMed database, you find that it is not possible without running several parallel author queries: there is a Martinez-A C entry under which most of his publications are listed [222]. But then there's also Martinez-Alonso C [21] and even Alonso CM [1].

It might be argued it's all about funny Spanish names with two surnames in them. That's a problem all right. Not just for Spanish names though: it's quite the same for Portuguese/Brazilian authors as well. Not to mention the transliteration of Asian author names (see the "Which Wei Wang?" Phys Rev 2007 editorial). PubMed is presently running its Author ID project in order to tackle this problem, which is by no means exclusive to them: around 2/3 of the over 6 million authors in MEDLINE share a last name and first initial with at least one other author, and an ambiguous name refers to 8 persons on average (Torvik and Smalheiser, "Author name disambiguation in MEDLINE").
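To make the "last name plus first initial" problem concrete, here is a minimal Python sketch. It is not how PubMed or MEDLINE actually index authors: the `name_key` helper and the sample records are made up for illustration (the three Wang entries are hypothetical people, while the Martínez-Alonso variants are the ones mentioned above). The sketch only shows how such a key merges different people and splits one person at the same time.

```python
import unicodedata
from collections import defaultdict


def name_key(last_name: str, given_names: str) -> str:
    """Hypothetical key: accent-folded, lowercased last name plus first initial."""
    def fold(text: str) -> str:
        # Decompose accented characters and drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.lower()

    return f"{fold(last_name)} {fold(given_names)[:1]}"


# Illustrative records only, not real database entries.
authors = [
    ("Wang", "Wei"),
    ("Wang", "Wen"),
    ("Wang", "William"),
    ("Martínez-Alonso", "Carlos"),
    ("Martinez-A", "C"),
    ("Alonso", "CM"),
]

# Group every record under its key.
groups = defaultdict(list)
for last, given in authors:
    groups[name_key(last, given)].append((last, given))

for key, members in sorted(groups.items()):
    print(key, "->", members)
```

The three Wang records collapse into a single key, while the three spellings of one Spanish author end up under three different keys. Neither mistake can be fixed by string handling alone, which is the argument for a persistent identifier attached to the person rather than to the name.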

Name disambiguation and proper attribution are a well-known problem in the scholarly publishing ecosystem. There have been, and still are, many initiatives trying to tackle this complex issue at subject, institutional or even national level - with remarkable success in the case of the Dutch Digital Author Identifier (DAI).

However, this is not an issue to be tackled at national or subject level, but globally. Commercial stakeholders such as Thomson Reuters or Elsevier-Scopus are therefore in a privileged position to implement an international unique author identification scheme. From a knowledge discovery viewpoint, however, there are some problems with this commercial-stakeholder approach: ResearcherID, the Thomson Reuters author identifier, will provide seamless integration with ISI Web of Knowledge and show all of an author's publications registered in that database, but will otherwise leave out most of the research output.

Some joint effort between public institutions and private stakeholders (notably publishers) must therefore be attempted to unify the multiple author identification standards and devise a single, comprehensive one at a global level. And that's where ORCID comes in.

... and strategies to tackle it: IRISC 2011 workshop

The Open Researcher & Contributor ID (ORCID) initiative started in Dec 2009 as a non-profit organisation. Currently over 240 participants have joined the project to develop a single researcher identifier which is not limited to a discipline, institution or geographical area. Many other projects are working on this issue at the same time (such as the above-mentioned discipline-based PubMed Author ID, and Cornell University's VIVO initiative, which started out as institutional and has since grown to national scope).

ORCID and VIVO were two of the main topics of the IRISC 2011 Workshop on Identity in Research Infrastructure and Scientific Communication held this week (Sep 12-13) in Helsinki - see the event programme with attached presentations. Gudmundur "Mummi" Thorisson, Research Associate at the University of Leicester and member of the ORCID Technical Working Group, was the main organiser of IRISC 2011.

There were two major IRISC 2011 strands: identity for knowledge discovery and identity for security & access control (focusing mainly on identity federation). A third major cross-cutting issue throughout the Helsinki event was research data management, from three different perspectives:

i) dealing with a rapidly increasing amount of biomedical research data (Andrew Lyall, EMBL, ELIXIR Project)

ii) dealing with sensitive clinical research data (see Tony Brookes' GEN2PHEN Project presentation)

iii) the benefits that ORCID implementation might bring to research data attribution and management (mentioned in most ORCID-related presentations and discussions during the workshop)

There were several presentations dealing both with ORCID and with the closely related VIVO initiative. Martin Fenner, Hannover Medical School and member of the ORCID Board of Directors, announced that the ORCID registration service will start operating in spring 2012. ORCID will be open: researchers will be able to manage & maintain their profiles, filed data will be openly available, ORCID-related software will be released as open source, and researchers will control their privacy settings (with the option to share with particular members). Finally, for the purpose of defining an ORCID identity, both self-claims and external claiming sources will be used.

Brian Lowe, Cornell University, presented the already running NIH-funded, institutionally-managed VIVO initiative. VIVO aims for a more comprehensive approach than ORCID, based on an extensible semantic model. However, links have already been established between both initiatives and ORCID is hoping to build upon VIVO's success in the US.

Breakout sessions were held on IRISC Day 2 on the workshop's two main strands: "Unique identifiers and the Digital Scholar" (led by Cameron Neylon and Jason Priem) and "What do researchers need from the authentication and authorisation infrastructure (AAI)?" (chaired by Michael Linden, CSC). Breakout session #1 was devoted to discussing potential tools and services that ORCID could provide to researchers in the short term (6 months from adoption). Several groups were set up for the purpose, and the proposed ideas were later voted on and discussed in order to select three main future lines of work for ORCID. The proposed use cases were the following (the selected ones are marked with an arrow):

-> data submission to repositories (multiple task attribution)

service to enable attribution or comment

pre-populate ORCID data

-> manuscript/grant tracking system

ORCID app gallery

-> automatic CV maintenance (potentially including data citations in CVs)

connecting different author research & social network profiles

The selected ORCID use cases were later introduced by Cameron Neylon in his talk 'ORCID and researchers' at the second annual ORCID Outreach Meeting held at CERN on Sep 16th, 2011.