Introduction
Overview
Teaching: 10 min
Exercises: 0 minQuestions
What is OpenRefine useful for?
Objectives
Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.
Lesson
Motivations for the OpenRefine Lesson
- It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
- All actions are easily reversed in OpenRefine.
- You must save your work to a new file; OpenRefine does not modify your original dataset.
- Data is often very messy, and this tool saves a lot of time on cleaning headaches.
- Data cleaning steps often need repeating with multiple files. OpenRefine is perfect for speeding up repetitive tasks by replaying previous actions on multiple datasets.
- Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
Before we get started
Note: this is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.
Follow the Setup instructions to install OpenRefine.
If after installation and running OpenRefine, it does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Basics of OpenRefine
You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. There is a Google Group that can answer a lot of beginner questions and problems. There is also an OpenRefine Google Plus community where you can find a lot of help and a lot of folks from the life sciences are members. As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.
Features
- Open source (source on GitHub).
- A large growing community, from novice to expert, ready to help.
- Works with large-ish datasets (100,000 rows). Does not scale to many millions. (yet).
Key Points
OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.
Working with OpenRefine
Overview
Teaching: 15 min
Exercises: 20 minQuestions
How can we bring our data into OpenRefine?
How can we sort and summarize our data?
How can we find and correct errors in our raw data?
Objectives
Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Lesson
Creating a Project
Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your browser).
Launch OpenRefine (see Getting Started with OpenRefine).
OpenRefine can import a variety of file types, including tab separated (tsv
), comma separated (csv
), Excel (xls
, xlsx
), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we have created a set of TCP data called eebo.csv.
If you haven’t already, download the data from:
../data/eebo.csv
Once OpenRefine is launched in your browser, the left margin has options to Create Project
, Open Project
, or Import Project
. Here we will create a new project:
- click
Create Project
and selectGet data from
This Computer
. - Click
Choose Files
and select the fileeebo.csv
. ClickOpen
or double-click on the filename. - Click
Next>>
under the browse button to upload the data into OpenRefine. - OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click
Update Preview
(bottom left). If this is the wrong file, click<<Start Over
(upper left). - If all looks well, click
Create Project>>
(upper right).
Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from
Web Addresses (URLs)
. However, this won’t work for all URLs.
Faceting
Exploring data by applying multiple filters
OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows for which the cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.
Here we will use faceting to look for potential errors in data entry in the Author
column.
- Scroll over to the
Author
column. - Click the down arrow and choose
Facet
>Text facet
. - In the left panel, you’ll now see a box containing every unique value in the
Author
column along with a number representing how many times that value occurs in the column. - Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
Facet
list. You should see that you have anedit
function available. - You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
Solution
There will be several near-identical entries in
Author
. For example, there is one entry forArchdeacony of London
. This a misspelling ofArchdeaconry
. We will see how to correct these misspelled and mistyped entries in a later exercise.
Exercise
Using faceting, find out how many years are represented in the catalogue.
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Which years have the most and least observations?
Solution
- For the column
Date
doFacet
>Text facet
. A box will appear in the left panel showing that there are 64 unique entries in this column.- By default, the column
Date
is formatted as Text. You can change the format by doingEdit cells
>Common transforms
>To number
. DoingFacet
>Numeric facet
creates a box in the left panel that shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.- After creating a facet, click
Sort by count
in the facet box. The year with the most observations is 1640. There are many data but among them are 1632, 1636.
You can also use a custom facet to search by words in a column.
- Go to author
- Go facts, then select customized facets, and select word facet
- A box called place should appear in the left hand side.
Exercise
Using custom faceting, find out how many words are represented in the place column.
How many representations of London can you find?
Solution
23 choices
3 versions of London: London, London?, Londini
Clustering
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings New York
and new york
are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel
and Godel
probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
- In the
Author
Text Facet we created in the step above, click theCluster
button. - In the resulting pop-up window, you can change the
Method
and theKeying Function
. Try different combinations to see what different mergers of values are suggested. - Select the
key collision
method andmetaphone3
keying function. It should identify three clusters. - Click the
Merge?
box beside each, then clickMerge Selected and Recluster
to apply the corrections to the dataset. - Try selecting different
Methods
andKeying Functions
again, to see what new merges are suggested. You may find there are still improvements that can be made, but don’tMerge
again; justClose
when you’re done. We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Important: If you Merge
using a different method or keying function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
Split
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
- Let us suppose we want to split the
Author
column into separate columns for Authors - Click the down arrow at the top of the
Author
column. ChooseEdit Column
>Split into several columns...
- In the pop-up, in the
Separator
box, replace the comma with a space. - Uncheck the box that says
Remove this column
. - Click
OK
. You’ll get some new columns calledAuthor 1
,Author 2
, and so on. - Notice that in some cases
Author 1
andAuthor 2
are empty. Why is this? What do you think we can do to fix this?
Solution
The entries that have data in
Author 3
andAuthor 4
but not the first twoAuthor
columns had an extra space at the beginning of the entry. Leading white spaces are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data. We’ll look at how to fix leading and trailing white spaces in a later exercise.
Exercise
Try to change the name of the second new column to “Author”. How can you correct the problem you encounter?
Solution
On the
Author 2
column, click the down arrow and thenEdit column
>Rename this column
. Type “Author” into the box that appears. A pop-up will appear that saysAnother column already named Author
. This is because there is another column where we’ve recorded the species abbreviation. You can choose another name likeauthorName
for this column or change the otherAuthor
column you can change the name toallAuthors
.
Undo / Redo
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo
and Redo
operations to make this easy.
- Click where it says
Undo / Redo
on the left side of the screen. All the changes you have made so far are listed here. - Click on the step that you want to go back to, in this case the previous step. The added columns will disappear.
- Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
- Leave the dataset in the state in which the
Author
was clustered, but not yet split.
Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Trim Leading and Trailing Whitespace
Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
- In the header for the column
Author
, chooseEdit cells
>Common transforms
>Trim leading and trailing whitespace
. - Notice that the
Split
step has now disappeared from theUndo / Redo
pane on the left and is replaced with aText transform on 0 cells
- Perform the same
Split
operation onAuthor
that you undid earlier. This time you should only get two new columns. Why?
Solution
Removing the leading white spaces means that each entry in this column has exactly one space (between the genus and species names). Therefore, when you split with space as the separator, you will get only two columns.
Important: Undo
the splitting step before moving on to the next lesson. If you skip this step, your solutions
for later exercises will not be the same as shown in those exercise solutions.
Key Points
Faceting and clustering approaches can identify errors or outliers in data.
Filtering and Sorting with OpenRefine
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can we select only a subset of our data to work with?
How can we sort our data?
Objectives
Filter to a subset of rows by text filter or include/exclude.
Sort table by a column.
Sort by multiple columns.
Lesson
Filtering
There are many entries in our data table. We can filter it to work on a subset of the data in the list for the next set of operations. Please ensure you perform this step to save time during the class.
- Click the down arrow next to
Author
>Text filter
. AAuthor
facet will appear on the left margin. - Type in
arch
and press return. There are 14 matching rows of the original 99 rows (and these rows are selected for the subsequent steps). - At the top, change the view to
Show
50rows
. This way you will see all the matching rows.
Exercise
- What author names are selected by this procedure?
- How would you restrict this to one of the author selected?
Solution
- Do
Facet
>Text facet
on theAuthor
column after filtering. This will show that four names match your filter criteria. They areChurch of England. Archdeaconry of Surrey
andChurch of England. Archdeaconry of Worcester
.- To restrict to only one of these two species, you could make the search case sensitive or you could split the
Author
column into species and genus before filtering or you could include more letters in your filter.
Excluding entries
In addition to the solutions included above, another way to narrow our filter is to include
and/or exclude
entries in a facet. If you still have your facet for Author
, you can use it, or use drop-down menu > Facet
> Text facet
to create a new facet. Only the entries with names that agree with your Text filter
will be included in this facet.
Faceting and filtering look very similar. A good distinction is that faceting gives you an overview description of all of the data that is currently selected, while filtering allows you to select a subset of your data for analysis.
Exercise
Use
include / exclude
to select only entries from one of these two species.Solution
- In the facet (left margin), click on one of the names, such as
Church of England. Archdeaconry of Surrey
. Notice that when you click on the name, or hover over it, there are entries to the right foredit
andinclude
.- Click
include
. This will explicitly include this species, and exclude others that are not expicitly included. Notice that the option now changes toexclude
.- Click
include
andexclude
on the other species (Church of England. Archdeaconry of Worcester
) and notice how the two entries appear and disappear from the table.
Important: Select both species for your filtered dataset before continuing with the rest of the exercises.
Sort
You can sort the data by a column by using the drop-down menu in that column.
There you can sort by text
, numbers
, dates
or booleans
(TRUE
or FALSE
values). You can also specify what order to put Blanks
and Errors
in the sorted results.
If this is your first time sorting this table, then the drop-down menu for the selected column shows Sort...
. Select what you would like to sort by (such as numbers
). Additional options will then appear for you to fine-tune your sorting.
Exercise
Sort by month. How can you ensure that years are in order?
If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to > Sort
without the ...
, to remind you that you have already used this column. It will give you additional options:
-
Sort
>Sort...
- This option enables you to modify your original sort. -
Sort
>Reverse
- This option allows you to reverse the order of the sort. -
Sort
>Remove sort
- This option allows you to undo your sort.
Exercise
Sort the data by
Date
. What year(s) were observations recorded for plot 1 in this filtered dataset.Solution
In the
plot
column, selectSort...
>numbers
and selectsmallest first
. The years represented are 1571 and 1635.
Sorting by multiple columns.
You can sort by multiple columns by performing sort on additional columns. The sort will depend on the order in which you select columns to sort. To restart the sorting process with a particular column, check the sort by this column alone
box in the Sort
pop-up menu.
Exercise
You might like to look for trends in your data by Page length of collection across years.
- How do you sort your data by year?
- How would you do this differently if you were instead trying to see all of your entries in chronological order?
Solution
- For the
Author
column, click onSort...
and thentext
. This will group all entries made in, for example, 1632, together, regardless of the year that entry was collected.- For the
pages
column, click onSort
>Sort...
>numbers
and selectsort by this column alone
. This will undo the sorting by year step.
If you go back to one of the already sorted colunms and select > Sort
> Remove sort
, that column is removed from your multiple sort. If it is the only column sorted, then data reverts to its original order.
Exercise
Sort by
Date
,Author
andPages
in some order. Be creative: try sorting asnumbers
ortext
, and in reverse order (largest to smallest
orz to a
).Use >
Sort
>Remove sort
to remove the sort on the second of three columns. Notice how that changes the order.
Key Points
OpenRefine provides a way to sort and filter data without affecting the raw data.
Examining Numbers in OpenRefine
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can we convert a column from one data type to another?
How can we visualize relationships among columns?
Objectives
Transform a text column into a number column.
Identify and modify non-numeric values in a column using facets.
Use scatterplot facet to examine relationships among columns.
Lesson
Numbers
When a table is imported into OpenRefine, all columns are treated as having text values. We saw earlier how we can sort column values as numbers, but this does not change the cells in a column from text to numbers. Rather, this interprets the values as numbers for the purposes of sorting but keeps the underlying data type as is. We can, however, transform columns to other data types (e.g. number or date) using the Edit cells
> Common transforms
feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.
Be sure to remove any Text filter
facets you have enabled from the left panel so that we can examine our whole dataset. You can remove an existing facet by clicking the x
in the upper left of that facet window.
To transform cells in the VID
column to numbers, click the down arrow for that column, then Edit cells
> Common transforms…
> To number
. You will notice the VID
values change from left-justified to right-justified, and black to green color.
Exercise
Transform three more columns, including
Date
, from text to numbers. Can all columns be transformed to numbers?Solution
Only observations that include only numerals (0-9) can be transformed to numbers. If you apply a number transformation to a column that doesn’t meet this criteria, and then click the
Undo / Redo
tab, you will see a step that starts withText transform on 0 cells
. This means that the data in that column was not transformed.
Numeric facet
Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them.
We can do that with a Numeric facet
.
Exercise
- For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as
abc
) or blank (no number or text).- Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
- Notice that there are several checkboxes in this facet:
Numeric
,Non-numeric
,Blank
, andError
. Below these are counts of the number of cells in each category. You should see checks forNon-numeric
andBlank
if you changed some values.- Experiment with checking or unchecking these boxes to select subsets of your data.
When done examining the numeric data, remove this facet by clicking the x
in the upper left corner of its panel. Note that this does not undo the edits you made to the cells in this column. If you want to reverse these edits, use the Undo / Redo
function.
Scatterplot facet
Now that we have multiple columns representing numbers, we can see how they relate to one another using the scatterplot facet. Select a numeric column, for example Date
, and use the pulldown menu to > Facet
> Scatterplot facet
. A new window called Scatterplot Matrix
will appear. There are squares for each pair of numeric columns organized in an upper right triangle. Each square has little dots for the cell values from each row.
Exercise
- Examine the scatterplots overall. Do the patterns make sense?
- Why does the scatterplot for
VID
vsDate
have the pattern it does?
Examine pair of columns in detail
We can examine one pair of columns by clicking on its square in the Scatterplot Matrix
A new facet with only that pair will appear in the left margin.
Exercise
Click in the scatterplot facet in the left margin and drag to highlight a rectangle. This will subset the data to those entries.
Exercise
- Click on the
Scatterplot Matrix
square forVID
andDate
to get that as a facet in the left margin.- Redo the
Text filter
onAuthor
to show only entries including the lettersArch
. Notice the change in the scatterplot. It might be easier to see if you clickexport plot
to put it on a new browser tab.
Key Points
OpenRefine also provides ways to get overviews of numerical data.
Scripts from OpenRefine
Overview
Teaching: 10 min
Exercises: 5 minQuestions
How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?
Objectives
Describe how OpenRefine generates JSON code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from an analysis.
Apply saved JSON code to an analysis.
Lesson
Scripts
As you conduct your data cleaning and preliminary analysis, OpenRefine saves every change you make to the dataset. These changes are saved in a format known as JSON (JavaScript Object Notation). You can export this JSON script and apply it to other data files. If you had 20 files to clean, and they all had the same type of errors (e.g. species name misspellings, leading white spaces), and all files had the same column names, you could save the JSON script, open a new file to clean in OpenRefine, paste in the script and run it. This gives you a quick way to clean all of your related data.
- In the
Undo / Redo
section, clickExtract...
, and select the steps that you want to apply to other datasets by clicking the check boxes. - Copy the code from the right hand panel and paste it into a text editor (like NotePad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting
Format
>Make plain text
and save the file as atxt
file.
Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.
- Download an uncleaned version of the dataset: Open Refine or use the version of the raw dataset you saved to your computer.
- Start a new project in OpenRefine with this file and name it something different from your existing project.
- Click the
Undo / Redo
tab >Apply
and paste in the contents oftxt
file with the JSON code. - Click
Perform operations
. The dataset should now be the same as your other cleaned dataset.
For convenience, we used the same dataset. In reality you could use this process to clean related datasets. For example, data that you had collected over different fieldwork periods or data that was collected by different researchers (provided everyone uses the same column headings).
Key Points
All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.
Exporting and Saving Data from OpenRefine
Overview
Teaching: 10 min
Exercises: 5 minQuestions
How can we save and export our cleaned data from OpenRefine?
Objectives
Save an OpenRefine project.
Export cleaned data from an OpenRefine project.
Lesson
Saving and Exporting a Project
In OpenRefine you can save or export the project. This means you’re saving the data and all the information about the cleaning and data transformation steps you’ve done. Once you’ve saved a project, you can open it up again and be just where you stopped before.
Saving
By default OpenRefine is saving your project. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again.
Exporting
You can also export a project. This is helpful, for instance, if you wanted to send your raw data and cleaning steps to a collaborator, or share this information as a supplement to a publication.
- Click the
Export
button in the top right and selectExport project
. - A
tar.gz
file will download to your defaultDownload
directory. Thetar.gz
extension tells you that this is a compressed file. Which means that this file contains multiple files. You can double-click on thetar.gz
file and it will expand into a directory. A folder icon will now appear. - Look at the files that appear in this folder. What files are here? What information do you think these files contain?
Solution
You should see:
- a
history
folder which contains threezip
files. Each of these files itself contains achange.txt
file. Thesechange.txt
files are the records of each individual transformation that you did to your data.- a
data.zip
file. When expanded, thiszip
file includes a file calleddata.txt
which is a copy of your raw data. You may also see other files.
You can import an existing project into OpenRefine by clicking Open...
in the upper right > Import Project
and selecting the tar.gz
project file. This project will include all of the raw data and cleaning steps that were part of the origina project.
Exporting Cleaned Data
You can also export just your cleaned data, rather than the entire project.
- Click
Export
in the top right and select the file type you want to export the data in.Tab-separated values
(tsv
) orComma-separated values
(csv
) would be good choices. - That file will be exported to your default
Download
directory. That file can then be opened in a spreadsheet program or imported into programs like R or Python, which we’ll be discussing later in our workshop.
Remember from our lesson on Spreadsheets that using widely-supported, non-proprietary file formats like tsv
or csv
improves the ability of yourself and others to use your data.
Key Points
Cleaned data or entire projects can be exported from OpenRefine.
Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.
Other Resources in OpenRefine
Overview
Teaching: 5 min
Exercises: 5 minQuestions
What other resources are available for working with OpenRefine?
Objectives
Understand that there are many online resources available for more information on OpenRefine.
Identify other resources about OpenRefine.
Lesson
Identify other Resources about OpenRefine.
OpenRefine is more than a simple data cleaning tool. People are using it for all sorts of activities. Here are some other resources that might prove useful.
OpenRefine has its own web site with documentation and a book:
- OpenRefine web site
- OpenRefine Documentation for Users
- OpenRefine documentation Wiki site
- Using OpenRefine book by Ruben Verborgh, Max De Wilde and Aniket Sawant
- OpenRefine history from Wikipedia
In addition, see these other useful resources:
- Grateful Data is a fun sight with many resources devoted to OpenRefine, including a nice tutorial.
- Margaret Heller shows how she uses OpenRefine for Measuring and Counting Impact in Repositories.
- Intersect Course Resources has Jared Berghold’s Cleaning & Exploring your data with Open Refine (scroll down page to find).
- Enipedia OpenRefine Tutorial
There are more advanced uses of OpenRefine, such as bringing in column or cell data using web locators (URLs or APIs). The links above can give you a start on your journey.
Exercise
Visit one of these sites and share what you find with another person.
Key Points
Other examples and resources online are good for learning more about OpenRefine