This guest blog post was written by OpenRefine Primer authors, Heather Charlotte Owen (University of Rochester, https://orcid.org/0009-0001-1771-366X), Aditi Thite (AVSInfoPro, LLC, https://orcid.org/0009-0007-6193-5407), and Peyton Tvrdy (U.S. Department of Transportation, National Transportation Library, https://orcid.org/0000-0002-9720-4725). All attended the April 2023 Data Curation Workshop sponsored by the National Institutes of Health.

OpenRefine is a powerful data cleaning and curation tool – free, open source, friendly to non-coders, and robust in cleaning ability. When the three members of this primer team met in April 2023 at a DCN (Data Curation Network) training session in Bethesda, Maryland, we all knew we wanted to create a primer for this magnificent tool. This blog article will describe our recently published primer, discuss our process, and reflect on the experience of creating a primer. 

What is OpenRefine?

OpenRefine is a browser-based tool created primarily for cleaning text-based data. It uniquely allows users to understand the dataset with robust sorting and faceting, and allows for cleaning via bulk-editing, clustering, column/row manipulation, and coding (GREL, Python, etc.). OpenRefine also has an extensive history function, allowing one to export a record of all activities performed in OpenRefine, a useful ability when one is keeping a documentation log. OpenRefine is a natural complement to the curation process taught by the DCN. In particular, OpenRefine assists with the Understand (sorting and faceting), Request (editing and clustering), Augment (reconciliation), Transform (changing file types) and Document (export history) steps.

Why did you decide to create this primer?

Creating a primer for OpenRefine was not a natural proposal; unlike other data curation primers, which tend to focus on specific disciplines or formats, this primer would focus less on the nuances of file types and more on the process. DCN did not have any tool-specific primers at that time, and it was unclear how the existing primer templates could be adapted for this new use case. All members were enthusiastic about the idea, and in the end, we decided to attempt to create this new type of primer. Our team came with a solid understanding and knowledge of OpenRefine when we began. Therefore, we decided to create a primer that showcases the various cleaning, curation, and preservation functions of OpenRefine with numerous in-depth examples, along with a separate section presenting OpenRefine’s congruence with the DCN curation checklist.

Our primer accompanies users through their OpenRefine journey. First, it covers basic information on the description, prerequisites, and installation of OpenRefine. The primer then delves into the data cleaning functions of OpenRefine such as transformations, editing, faceting, clustering, and general expressions, which are accompanied by detailed instructions and examples. It further showcases the preservation power of OpenRefine, by emphasizing the history and export functionality. The detailed nature of the primer leaves it friendly to beginners who can use it to become proficient at OpenRefine, and to more skilled users, who may apply more advanced techniques in OpenRefine.

How did you apply the primer template to a curation process?

During our in-depth exploration of OpenRefine, we quickly realized that it is not only a data cleaning and manipulation tool but also a powerful data curation tool that can be used throughout the entire data lifecycle. Therefore, we decided to conclude our primer with our own version of data curation steps. We accomplished this by adapting the DCN CURATE(D) checklist to create our own OpenRefine CURATE(D) checklist. Through our checklist, we showed how OpenRefine functions can effectively perform data curation and preservation actions to make datasets consistent and FAIR. The OpenRefine CURATE(D) checklist includes questions to ask and steps to take to ensure that datasets meet best practices.

Some examples of processes added to our OpenRefine CURATE(D) checklist include:

UNDERSTAND

Use OpenRefine functions to understand the dataset: Use custom sorts and text facets to gain new insights on the data and to find inconsistencies in the data.
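To make the idea of a text facet concrete, here is a minimal Python sketch of what a facet does conceptually: it counts each distinct value in a column so that near-duplicates stand out. This is only an illustration outside of OpenRefine, not how the tool is implemented, and the column values are made up for the example.

```python
from collections import Counter

# Hypothetical column values, invented for illustration.
column = ["Rochester", "rochester", "Rochester ", "Bethesda", "Rochester"]

# A text facet is essentially a count of each distinct cell value.
facet = Counter(column)

# repr() makes trailing whitespace visible, which is a common inconsistency.
for value, count in facet.most_common():
    print(repr(value), count)
```

Seeing "Rochester", "rochester", and "Rochester " listed as three separate values is exactly the kind of inconsistency a facet helps a curator spot.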

REQUEST

Collaborate with the researcher to make necessary changes: Perform cell edits/bulk editing, GREL and clustering functions to fix errors.
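For readers curious about how clustering groups similar values, the sketch below approximates the idea behind a key-collision ("fingerprint") method in Python. It is a simplified stand-in under our own assumptions, not OpenRefine's actual code: the function names `fingerprint` and `cluster` and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a simplified normalization key for a cell value."""
    # Trim, lowercase, and split accented characters into base + accent.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop anything that is not plain ASCII (for example, the accents).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, deduplicate, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep groups with more than one variant."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Hypothetical cell values, invented for illustration.
names = ["Univ. of Rochester", "univ of rochester", "Rochester, Univ of", "MIT"]
clusters = cluster(names)
print(clusters)
```

Values that differ only in case, punctuation, or word order end up with the same key, so they can be reviewed together and merged into one preferred spelling.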

AUGMENT

The reconciliation feature in OpenRefine enables matching the dataset against an external source. OpenRefine matches cell values to the reconciliation information to fix spelling or variations in proper names, clean up subject headings with authorities such as the Library of Congress Subject Headings (LCSH), link data to an existing dataset, and add to an editable platform such as Wikidata.

TRANSFORM

Use OpenRefine to transform file types and formats: Export to a variety of data formats including but not limited to CSV, TSV, HTML-formatted table, XLS/XLSX, Google Sheets, and Wikidata.

EVALUATE

The entire project dataset from OpenRefine can be exported into various open file formats. The ability to export a project and move it to another computer or network also supports data preservation.

DOCUMENT

OpenRefine has a robust project history feature. The Undo/Redo function retains every step and lets users revert to any previous step in the project, thus preserving all actions. This project history can be extracted from OpenRefine to improve and enhance documentation and the Curator Log.
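As a rough illustration of how an extracted history could feed a Curator Log, the Python sketch below turns a JSON list of operations into numbered log lines. The sample JSON is hand-written for this example and only assumes that each operation carries an "op" identifier and a "description"; the exact fields in a real export may differ by OpenRefine version. The helper `history_to_log` is our own hypothetical name.

```python
import json

# Hand-written sample resembling an extracted operation history.
# Assumption: each step has "op" and "description" fields.
SAMPLE_HISTORY = """
[
  {"op": "core/text-transform",
   "columnName": "institution",
   "expression": "value.trim()",
   "description": "Text transform on cells in column institution using expression value.trim()"},
  {"op": "core/mass-edit",
   "columnName": "institution",
   "description": "Mass edit cells in column institution"}
]
"""


def history_to_log(history_json):
    """Convert a JSON operation history into numbered log lines."""
    operations = json.loads(history_json)
    lines = []
    for step, op in enumerate(operations, start=1):
        # Fall back to placeholders if a field is missing.
        description = op.get("description", "(no description)")
        op_name = op.get("op", "unknown")
        lines.append(f"{step}. {description} [{op_name}]")
    return lines


log_lines = history_to_log(SAMPLE_HISTORY)
print("\n".join(log_lines))
```

A curator could paste lines like these into a documentation log alongside notes on why each step was taken.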

Who can use the primer?

As with all DCN Primers, the main audience is data curators. However, with this new tool-based primer, we expect this will be a useful starting point for researchers, data librarians, graduate students, and anyone who works with research data and datasets. OpenRefine caters to all kinds of users with varying degrees of skill levels. The variety of functions in OpenRefine enables users to conduct data cleaning, manipulation, curation and preservation in more than one effective way. OpenRefine has a vast number of applications that require no programming knowledge, and our primer showcases these beginner-level techniques with detailed instructions. For users with advanced application and programming skills, OpenRefine offers powerful functionality including regular expressions, GREL (General Refine Expression Language), reconciliation, and more techniques, described in detail in our primer. 

How do you expect the primer to be used?

Users may primarily utilize OpenRefine for data cleaning and data manipulation to better understand and organize data. OpenRefine is particularly effective at interpreting and evaluating large volumes of data. In addition to data cleaning, users can apply OpenRefine functions for data curation and preservation purposes. 

Examples of OpenRefine Functions for Data Curation, Cleaning, and Standardizing Data Include:

  • Common Transformations
  • Sorting
  • Clustering
  • Individual and Bulk Cell Editing
  • Facets
  • Column Transformations
  • Regular Expressions
  • GREL
  • Reconciliation

Examples of OpenRefine Functionality for Data Preservation Include:

  • Undo/Redo
  • File Format Transformations
  • Extraction of Project History for Documentation and Reporting
  • Exporting Project Data for Archiving

What were the challenges for this project?

Creating a primer for a tool was a new concept when we began in Spring 2023, and our primer differs significantly from current DCN primers. Therefore, our group faced several challenges. Although DCN primers follow a loose template, our primer was unique in content, requiring significant changes to the structure and layout. These revisions occurred not only at the beginning but also mid-project, as feedback from the DCN led to large structural changes. This feedback shifted the focus of the primer from a shorter toolkit with additional resources to a broader manual on OpenRefine as a digital curation tool. 

While this review feedback ultimately led to a better product, it was initially challenging to rebuild our primer from the ground up. Interpreting the peer-review feedback and deciding on a path forward was difficult, but we managed to align our vision and thoughts with the feedback we received. We also faced the challenge of viewing OpenRefine not just as a useful piece of software but as a data curation tool for data stewards. This was mostly a shift in mindset at the beginning of the project, and as the project grew, we successfully reoriented our thinking to view OpenRefine through the lens of data curation.

Additionally, the project became very time-consuming. We did not anticipate that this process would take over a year and result in such a large document. However, the extended timeline and length created a more robust guide for a wider range of users.

What aspects of this project went well?

Despite these challenges, many aspects of the project went very well. This project was immensely rewarding for each of us. We were very collaborative and open, working with each other’s schedules and needs to align our goals. We stayed on track with meetings, goals, and individual assignments through our collaborative attitudes and use of calendars. While teamwork across organizations and schedules can be stressful, this project was not overbearing, even when it required more intense work. We all saw this project as something fun to work on and a way to take a break from our jobs. Our workplaces have been supportive of this additional project with the DCN, and there is general enthusiasm in our work settings about our work on the primer. 

Additionally, this project helped us develop new skills and knowledge of OpenRefine. We all had varying levels of experience with OpenRefine and used it in different applications at our work. These differing levels of experience led to different viewpoints and perspectives during the project. The project also helped us develop our confidence in teaching and sharing information about OpenRefine, which directly improved the workshops and teaching experiences we conducted. Overall, this project encouraged us to learn more about the software, including more difficult functions such as clustering, regular expressions, and reconciliation.

Lastly, by working closely with the DCN’s CURATE(D) Checklist, we improved not only our knowledge of the workflow but also how we implemented the checklist in our jobs. This project greatly improved our data curation skills at our own institutions by incorporating the CURATE(D) steps.

If you had to begin again, what would you do differently?

If we had to start over, we would write excessively and then edit rather than focus on being brief. We initially began this project with the intention of being concise in order to match other DCN primers. However, during the review process of our first draft, through conversations with our mentor Sophia Lafferty-Hess and DCN Director Mikala Narlock, we decided to pivot the purpose of the primer and create a guide that includes examples, steps, resources, and an adaptation of the DCN CURATE(D) checklist. Since this restructuring was quite difficult, if we were to begin this process again, we would aim to create a more fleshed-out and descriptive first draft.

Despite these considerations, this project is a huge success for all of us, and we hope that this primer can serve as a good template for tool-based primers going forward. The primer team is open to answering any questions or giving advice on using OpenRefine.

Check out the newly published OpenRefine Primer! Congrats to the authors on this exciting new resource, and thanks to our peer reviewers for their support and feedback.
