This post was authored by intern Jodecy Guerra as part of a semester-long internship with the DCN in spring 2024.

This post was edited by Shawna Taylor and Mikala Narlock.

Introduction

As a spring 2024 intern with the Data Curation Network (DCN), I had the opportunity to explore how data curation enhances data reuse. While the DCN effectively tracks its datasets using Digital Object Identifiers (DOIs), Mikala explained that the network had not yet explored whether, and how, datasets curated through the DCN are used or reused. My project aimed to investigate the extent of (re)use among DCN-curated datasets, emphasizing the importance of data curation in facilitating this process.

Data citation, still an emerging practice among researchers, is crucial for tracking data reuse and providing intellectual credit to data creators. It can take various forms, including full data citations, data mentions, and indirect citations (DataCite, 2024). A bibliometric study by DataCite showed that the rate of data citation increased from 0.9% in April 2022 to 1.7% by January 2024 (DataCite, 2024). While this rate remains relatively low, this trend suggests growing awareness of its importance. Data citations not only promote the work of original authors but also help demonstrate research impact and facilitate the tracking of data reuse.

However, it’s important to note that not all data reuse results in formal citations. Datasets may be used for teaching, learning, or background knowledge without being referenced. Despite this limitation, citation evaluation currently remains the most effective method for examining data reuse. This project, although adjacent, complements recent DCN research indicating that researchers highly value the support provided by data curators (Marsolek et al., 2023). Through an in-depth analysis of a single DCN-curated dataset, Mikala and I aimed to identify aspects that enhance its reusability and explore potential points of influence from data curators. 

Defining Data Reuse

There is no widely accepted definition of ‘data reuse’ that cuts across disciplines. Pasquetto et al. (2017) offer a foundational distinction between data “use” and “reuse.” They define “use” as the analysis of data by the original collectors, while “reuse” refers to analysis by individuals other than the original collectors (p. 3). The authors highlight various methods of data sharing, including private exchanges between researchers, posting datasets online, depositing data in repositories or archives, and attaching data as supplemental materials to journal articles. Recognizing that there are complexities beyond this foundational definition, for the purposes of this project we adopted a definition aligned with Pasquetto et al.: Mikala and I considered a dataset to be “used” when it was analyzed by the original researchers and “reused” when it was analyzed by a different researcher.
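As a rough illustration of this definition, the sketch below labels a citing publication as use or reuse by checking whether it shares an author with the dataset. The function name, the placeholder author names, and the rule that any shared author counts as use are assumptions made for illustration; this is not a description of how the project made the determination.

```python
def classify_citation(dataset_authors, citing_authors):
    """Label a citing work as 'use' or 'reuse' of a dataset.

    Assumption (not from the post): if the citing work shares at least
    one author with the dataset, it is treated as the original
    researchers analyzing their own data, i.e. 'use'.
    """
    original = {name.strip().lower() for name in dataset_authors}
    citing = {name.strip().lower() for name in citing_authors}
    return "use" if original & citing else "reuse"


# Hypothetical author names, for illustration only.
print(classify_citation(["A. Author", "B. Author"], ["b. author", "C. Other"]))
print(classify_citation(["A. Author", "B. Author"], ["C. Other"]))
```

Matching on name strings is fragile (initials, name changes, shared names), so a real check would more likely rely on persistent identifiers such as ORCID iDs, followed by manual review.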

Process

Before exploring a single dataset, the first step was to identify which DCN-curated datasets had been reused. To accomplish this, Mikala developed a systematic process to examine the datasets, focusing on the 50 oldest datasets listed on the DCN website. The older datasets were selected because they have had more time to be reused, allowing for a more comprehensive assessment of their impact and reuse patterns. The collected data were organized in a spreadsheet, including information on:

  1. Downloads and views of the original dataset
  2. Citation DOIs
  3. Whether the citations represented use or reuse of the original dataset

To retrieve existing citation results, three primary citation trackers were utilized: OpenAlex, OpenCitations, and Google Scholar.

These tools do not capture every citation because they are centered on DOIs (see Hofelich Mohr et al., 2023, for more information on DOI search limitations), but they were chosen for their ability to quickly surface citing publications. Using all three allowed for a more comprehensive assessment of the datasets’ impact and reuse patterns.
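For readers who want to script a DOI-based lookup rather than search by hand, the sketch below shows one possible shape for a query against OpenCitations. It is a minimal sketch under stated assumptions: the endpoint path and the "citing" field name reflect the OpenCitations COCI REST API as I understand it and should be checked against the current OpenCitations documentation, the helper names are hypothetical, and the DOI in the usage comment is a placeholder rather than the DOI of any DCN dataset.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint: the "citations" operation of the OpenCitations COCI API.
COCI_CITATIONS = "https://opencitations.net/index/coci/api/v1/citations/"


def citations_url(doi):
    """Build the lookup URL for works that cite the given DOI."""
    return COCI_CITATIONS + quote(doi, safe="/")


def citing_dois(records):
    """Return the sorted, de-duplicated citing DOIs from a decoded response.

    Assumes each record is a dict with a "citing" key holding a DOI.
    """
    return sorted({r["citing"].lower() for r in records if r.get("citing")})


def fetch_citing_dois(doi, timeout=30):
    """Query the service (network access required) and list citing DOIs."""
    with urlopen(citations_url(doi), timeout=timeout) as response:
        return citing_dois(json.load(response))


# Usage with a placeholder DOI (hypothetical, not a real dataset):
# fetch_citing_dois("10.1234/example")
```

The returned DOIs could then be pasted into the spreadsheet's citation column. Each one would still need a manual check to decide whether it represents use or reuse.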

Dataset Deep Dive – DCN-10: APAL Coupling Study 2019

Among the datasets identified as having been reused or used, the University of Minnesota dataset “APAL Coupling Study 2019” (DCN-10), curated by Lisa Johnston and Sophia Lafferty-Hess, stood out for further analysis. This dataset has clearly been reused and appeared in the results of two of the three citation trackers described above. The authors of the APAL Coupling Study 2019 dataset are researchers from the Affordance Perception-Action Laboratory (APAL) in the School of Kinesiology at the University of Minnesota.

Repository and Dataset Details

  1. Published in the Data Repository of the University of Minnesota (DRUM) in March 2019
  2. DRUM is built on the DSpace platform
  3. Usage statistics: 17,172 downloads (as of May 2024)
  4. Contents: PDF files, Excel sheet, and a zip folder containing Excel sheets
    • PDFs include the Simulator Sickness Questionnaire (SSQ) and the SSQ Scoring sheet
  5. Comprehensive metadata record including: title, publishing date, group, author contact, type, abstract, description, license, and suggested citation
  6. Thorough README file with short descriptions for each file, methodology explanation, and data-specific information

CURATE(D) Actions

The CURATE(D) actions for this dataset, which are recorded in the DCN’s Jira project management system, included expanding documentation, file format transformation, metadata enhancement, quality assurance, evaluating FAIRness, file renaming, persistent identifier assignment, and risk management. See the DCN Curation Glossary for more detail on these actions.

Citation Analysis

  1. OpenAlex: zero results
  2. OpenCitations: two results
  3. Google Scholar: 12 results, of which six were actual APAL Coupling Study 2019 dataset citations
    • 1 citation result referenced the same set of authors from the original publication, but not the APAL Coupling study
    • 5 results were not associated with either the dataset or the publication 
  4. Publications citing the dataset:
    • MDPI Journals (Information and Sustainability)
    • Sage Journals (Workplace Health and Safety)
    • Journal of Clinical and Experimental Neuropsychology
    • AHFE International Human Factors and Simulation
    • International Journal on Advanced Science Engineering & Information Technology

Usage Pattern

All citing publications referenced either the original Simulator Sickness Questionnaire (SSQ) PDF along with the SSQ Scoring sheet PDF, or just the SSQ itself, from the original APAL dataset. Both documents are (re)used across publications to test human movement and motion sickness. The SSQ and scoring sheet have been used in virtual reality testing on topics including exposure therapy, therapy and rehabilitation, immersive reality in drawing and perceiving architectural spaces, virtual high-altitude exposure, and driving simulation. These publications demonstrate data reuse rather than use of the original dataset, since all of their authors are different researchers from the authors of the original dataset (including the SSQ and SSQ Scoring sheet).

Researcher Location—Data Reuse

To better understand the impact of the dataset, I manually captured the geographic affiliations of researchers who had cited the dataset. The original authors were located at the University of Minnesota (shown in purple), while the dataset has been used by additional authors on five continents at 12 institutions. See below for this distribution.

While the geographic spread of the dataset does not necessarily equate to impact, it is a useful metric for understanding and visualizing the potential for dataset reuse beyond institutional boundaries and contexts.

Reflection

In total, six out of the 50 datasets were identified as having been reused. OpenCitations was the most helpful tool for identifying data citations due to its ease of use and ability to provide a clear list of publications. For DCN-10: APAL Coupling Study 2019, one of the datasets with the most matches, OpenCitations easily provided a list of the publication DOIs. Although Google Scholar returned more total matches, additional verification was required to determine whether the results were actual citations of the dataset or references to related publications. For instance, of the 12 results returned by Google Scholar for this dataset, only six were actual citations of the dataset itself. OpenAlex could have been a useful resource, but its navigation challenges and limited filtering options made it difficult to pinpoint specific DOIs. Ultimately, the structured workflow and the organization of information in a spreadsheet were crucial for managing the citation investigation efficiently and preventing confusion.

Conclusion

The relationship between the number of downloads of the dataset (17,172) and the number of citations found (only six) is interesting, as it underscores a significant gap in data citation practices. It is possible that researchers are simply not using the data, or that they are using it to some degree without citing it as often as they should. Once something is downloaded, tracking its subsequent use becomes challenging, and this is especially true for datasets. While data citations represent one form of reuse, there are numerous other ways in which data can be utilized that do not result in a formal citation. Nevertheless, citations remain a critical metric for assessing reuse, highlighting the importance of initiatives like Make Data Count, which aims to promote responsible and meaningful approaches to research data assessment, with citations playing a key role.
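To put the gap in perspective, the two figures reported for this dataset can be turned into a simple ratio. This is a back-of-the-envelope sketch using only those numbers. It treats every download as comparable, which is an assumption: one person may download several files, or the same file more than once.

```python
downloads = 17_172  # DRUM downloads reported as of May 2024
citations = 6       # verified citations of the dataset

citation_rate = citations / downloads
downloads_per_citation = downloads / citations

print(f"{citation_rate:.3%} of downloads correspond to a citation")
print(f"about one citation per {downloads_per_citation:,.0f} downloads")
```

By this rough measure, the dataset has about one known citation for every 2,862 downloads, or roughly 0.035%.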

This area may be of particular interest to the DCN, and future research could delve into the potential correlations between specific curation actions and dataset reuse. While citations alone may not fully capture this relationship, a deeper analysis of a dataset like this one could provide valuable qualitative insights into patterns of reuse.

To cite this blog post, please use: Guerra, Jodecy. (2024) “Exploring the Reuse of DCN-Curated Data.” Retrieved from the University of Minnesota Digital Conservancy, https://hdl.handle.net/11299/265259

References

Chander, H., Shojaei, A., Deb, S., Kodithuwakku-Arachchige, S. N. K., Hudson, C., Knight, A. C., & Carruth, D. W. (2021). Impact of virtual reality-generated construction environments at different heights on postural stability and fall risk. Workplace Health & Safety, 69(1), 4-51. https://doi.org/10.1177/2165079920934000

Data Curation Network. (n.d.). Mission & Vision.
https://datacurationnetwork.org/about/our-mission/

Gomez-Tone, H. C., Bustamante Escapa, J., Bustamante Escapa, P., & Martin-Gutierrez, J. (2021). The drawing and perception of architectural spaces through immersive virtual reality. Sustainability, 13(11), 6223. https://doi.org/10.3390/su13116223

Hofelich Mohr, A., Kozlowski, W., & Taylor, S. (2023, March 20). Pearls and pitfalls – A story of a programmatic data pull. RDAP Summit. https://doi.org/10.17605/OSF.IO/B5H74

Johnston, L., & Lafferty-Hess, S. (n.d.). DCN-10: APAL coupling study 2019. Data Curation Network. https://datacurationnetwork.org/dataset/apal-coupling-study-2019/

Marsolek, W., Wright, S. J., Luong, H., Braxton, S. M., Carlson, J., & Lafferty-Hess, S. (2023). Understanding the value of curation: A survey of researcher perspectives of data curation services from six US institutions. PLOS ONE, 18(11), e0293534. https://doi.org/10.1371/journal.pone.0293534

OpenAlex. (n.d.). https://openalex.org/

OpenCitations. (n.d.). https://opencitations.net/

Pasquetto, I. V., Randles, B. M., & Borgman, C. L. (2017). On the reuse of scientific data. Data Science Journal, 16, 8. https://doi.org/10.5334/dsj-2017-008 

Pungu Mwange, M. A., Rogister, F., & Rukonic, L. (2022). Measuring driving simulator adaptation using EDA. In T. Ahram & R. Taiar (Eds.), Human Factors and Simulation (pp. 35-43). Springer. https://doi.org/10.1007/978-3-031-04411-5_5

Rodrigues, J., Coelho, T., Menezes, P., & Restivo, M. T. (2020). Immersive environment for occupational therapy: Pilot study. Information, 11(9), 405. https://doi.org/10.3390/info11090405

Suwandi, G. R. F., Khotimah, S. N., Haryanto, F., & Suprijadi. (2023). Electroencephalography signal analysis for virtual reality sickness: Head-mounted display and screen-based. International Journal on Advanced Science, Engineering & Information Technology, 13(4), 1449-1455. https://doi.org/10.18517/ijaseit.13.4.18583

Weitzner, D. S., Calamia, M., & Parsons, T. D. (2021). Test-retest reliability and practice effects of the virtual environment grocery store (VEGS). Journal of Clinical and Experimental Neuropsychology, 43(6), 547-557. https://doi.org/10.1080/13803395.2021.1960277
