The following was adapted from a presentation by Wind Cowles, Associate Dean for Data, Research, and Teaching at Princeton University, and Mikala Narlock, Director of the Data Curation Network based at the University of Minnesota. It was presented on October 19, 2023, during a hybrid workshop titled “Developing New Approaches to Promote Equitable and Inclusive Implementation of Open Scholarship Policies,” hosted by the National Academies of Sciences, Engineering, and Medicine’s Roundtable on Aligning Incentives for Open Scholarship.
The slides are archived through the University of Minnesota Digital Conservancy. https://hdl.handle.net/11299/257558
Open data is invaluable in combating both misinformation and the public’s rising mistrust of research. A recent search of the Retraction Watch Database shows that around 1,200 research articles with US-affiliated authors were retracted in the last 12 months for reasons that included issues with the data. There has also been a steady drumbeat of headlines about high-profile retractions and investigations (e.g., “Harvard behavioral scientist faces research fraud allegations”). Open data is therefore an important component of open research, as it ensures that we can reproduce and evaluate previous research.
However, when open data is disorganized, appears sloppy, or is poorly documented, it doesn’t inspire the confidence that we want, and it isn’t really open science. It’s not just about openly sharing data; it’s about sharing data that is well-organized, well-documented, interpretable, and re-usable. We saw both sides of this clearly during the pandemic: sharing data, especially genetic data about the virus, accelerated our response, but we also saw high-profile cases that centered on the challenges of data that can’t or won’t be shared.
Federal and private funders are increasingly requiring data sharing. Merely “open” data isn’t enough to combat misinformation or the rising mistrust of research, however. So how do we move from a compliance mindset, in which researchers might put in only the minimum effort and thus share poorly documented or disorganized data, to an ecosystem that actually enables and rewards open, equitable, and accessible scholarship? A key part of this conversation requires nuance. The question is not simply whether to share data: we have to think about what data can be shared, in what forms, and with what documentation. We have to consider all of the potential outputs of a research project, including how best to organize distinct yet interrelated components and how to document these decisions clearly.
When we envision a system that moves beyond compliance to an ecosystem that can effectively work against misinformation and that incorporates the nuance needed to share data that is truly findable, accessible, interoperable, and reusable (FAIR), data curation plays a key part. But what do we mean by data curation? Let’s first turn to a definition, and then explore how we are working as individual institutions and as a collective to leverage the power of data curation.
As defined by the University of Illinois School of Library and Information Science, “Data curation is the active and ongoing management of data through its lifecycle … [to] enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time.” Put another way: “Data curation is akin to work performed by an art or museum curator. Through the curation process, data are organized, described, cleaned, enhanced, and preserved for public use…” (ICPSR). These definitions highlight multiple components of data curation that make it especially valuable to researchers.
First, curators are thinking about data sharing and management over time, not just about sharing or curating data once. We have to think of sharing and curation as an ongoing, iterative process, especially as scholarly outputs are reused, reproduced, retracted, etc. It is also important to recognize that data curators are well positioned to provide feedback not only on what should be shared openly, but also on where the outputs should be shared and what, if any, access restrictions might need to be placed on data. It is the role of the data curator to support researchers in this effort by providing different levels of curation support, and to enable appropriate long-term access to research data.
There are different levels of curation, and while they are presented sequentially here, no one level is necessarily “ideal” or better than the others. Rather, datasets can be reviewed at different depths, and the depth of review will affect the findability, accessibility, interoperability, and reusability of the data. Each level has different costs and benefits; while it might be quicker to curate at the record level, that may lead to less reproducibility. These are decisions that curators and repositories have to navigate when curating research data.
In the Data Curation Network (DCN), we believe that it is essential to be transparent about our curation processes — and at what level we are curating data. Because different levels of curation will lead to differences in the final dataset, it is imperative that we collaborate with each other and our researchers to be sure we are being good stewards of our data, and to make sure that we are providing clear documentation for future users and reusers of data.
Making sure that open data is trustworthy and reusable involves both local support at each of our institutions and collaborative effort. Local support is crucial because staff within our institutions are often the first point of contact for our researchers, who are often using institutional computational infrastructure and looking for local expertise to help them. This is why investing in the staff and infrastructure for data curation and sharing is so important. Equally important is developing and implementing policies and clear guidance that support open and well-curated data. This is hands-on and labor-intensive work, but it is made easier when researchers know what is coming at the end of a project, and when there are established relationships between researchers and data curators. Training and advice at the local level are also critical, as we have the local context that researchers need.
Importantly, this work is also distributed, because it is done best when we can draw on each other’s expertise and pooled resources. Data comes in all shapes, sizes, formats, and types, and no single institution can be expert in them all. Communities like the DCN are powerful sources of distributed knowledge that enable support for a broader range of researchers than individual programs and institutions could provide alone.
So, how do we incentivize open, curated data? Many of us know firsthand how difficult it is to incentivize this work; you want to be a good research citizen, but preparing and sharing data is time-consuming and can make you feel vulnerable, especially if you don’t have advice from experts about what and how to share.
Funders and journals can require data sharing, and that’s a great step, but without the infrastructure and cultural change to accompany those requirements, it will be difficult for open data to be truly, transformationally open. There is much more power in getting research communities committed to open and curated data. Where open research and open data have really flourished, in neuroscience, for example, it’s been rooted in shared values among the researchers themselves. Working with research societies on what they value from open data is critical to the success of this work.
There are two key things that we can do, both locally and as parts of distributed networks, to incentivize and advance open data:
- Invest in and support an ecosystem of institutional, domain-general, and specialist repositories — these are all essential infrastructures for open, trustworthy data, and data curation and metadata standards create better, more reusable data.
- Value data as a product of research that is co-equal with traditional publications. As many people have already noted, citation is a currency within academic research. We need to make sure that data have equal face value through clear citation standards and practices. The Make Data Count initiative, for example, is promising here.
To conclude, we leave you with two calls to action:
- Fund curation, not just storage.
- Invest in structures, standards, and capacity that allow data to not just be shared, but published.