Working with OpenRefine
OverviewTeaching: 15 min
Exercises: 20 minQuestions
How can we bring our data into OpenRefine?
How can we sort and summarize our data?
How can we find and correct errors in our raw data?Objectives
Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Creating a Project
Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your browser).
Launch OpenRefine (see Getting Started with OpenRefine).
OpenRefine can import a variety of file types, including tab separated (
tsv), comma separated (
csv), Excel (
xlsx), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we have created a set of TCP data called eebo.csv.
If you haven’t already, download the data from:
Once OpenRefine is launched in your browser, the left margin has options to
Open Project, or
Import Project. Here we will create a new project:
Create Projectand select
Get data from
Choose Filesand select the file
Openor double-click on the filename.
Next>>under the browse button to upload the data into OpenRefine.
- OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click
Update Preview(bottom left). If this is the wrong file, click
<<Start Over(upper left).
- If all looks well, click
Create Project>>(upper right).
Note that at step 1, you could upload data in a standard form from a web address by selecting
Get data from
Web Addresses (URLs). However, this won’t work for all URLs.
Exploring data by applying multiple filters
OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows for which the cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.
Here we will use faceting to look for potential errors in data entry in the
- Scroll over to the
- Click the down arrow and choose
- In the left panel, you’ll now see a box containing every unique value in the
Authorcolumn along with a number representing how many times that value occurs in the column.
- Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
Facetlist. You should see that you have an
- You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
There will be several near-identical entries in
Author. For example, there is one entry for
Archdeacony of London. This a misspelling of
Archdeaconry. We will see how to correct these misspelled and mistyped entries in a later exercise.
Using faceting, find out how many years are represented in the catalogue.
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Which years have the most and least observations?
- For the column
Text facet. A box will appear in the left panel showing that there are 64 unique entries in this column.
- By default, the column
Dateis formatted as Text. You can change the format by doing
To number. Doing
Numeric facetcreates a box in the left panel that shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.
- After creating a facet, click
Sort by countin the facet box. The year with the most observations is 1640. There are many data but among them are 1632, 1636.
You can also use a custom facet to search by words in a column.
- Go to author
- Go facts, then select customized facets, and select word facet
- A box called place should appear in the left hand side.
Using custom faceting, find out how many words are represented in the place column.
How many representations of London can you find?
3 versions of London: London, London?, Londini
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings
New York and
new york are very likely to refer to the same concept and just have capitalization differences. Likewise,
Godel probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
- In the
AuthorText Facet we created in the step above, click the
- In the resulting pop-up window, you can change the
Keying Function. Try different combinations to see what different mergers of values are suggested.
- Select the
key collisionmethod and
metaphone3keying function. It should identify three clusters.
- Click the
Merge?box beside each, then click
Merge Selected and Reclusterto apply the corrections to the dataset.
- Try selecting different
Keying Functionsagain, to see what new merges are suggested. You may find there are still improvements that can be made, but don’t
Closewhen you’re done. We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Important: If you
Merge using a different method or keying function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
- Let us suppose we want to split the
Authorcolumn into separate columns for Authors
- Click the down arrow at the top of the
Split into several columns...
- In the pop-up, in the
Separatorbox, replace the comma with a space.
- Uncheck the box that says
Remove this column.
OK. You’ll get some new columns called
Author 2, and so on.
- Notice that in some cases
Author 2are empty. Why is this? What do you think we can do to fix this?
The entries that have data in
Author 4but not the first two
Authorcolumns had an extra space at the beginning of the entry. Leading white spaces are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data. We’ll look at how to fix leading and trailing white spaces in a later exercise.
Try to change the name of the second new column to “Author”. How can you correct the problem you encounter?
Author 2column, click the down arrow and then
Rename this column. Type “Author” into the box that appears. A pop-up will appear that says
Another column already named Author. This is because there is another column where we’ve recorded the species abbreviation. You can choose another name like
authorNamefor this column or change the other
Authorcolumn you can change the name to
Undo / Redo
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides
Redo operations to make this easy.
- Click where it says
Undo / Redoon the left side of the screen. All the changes you have made so far are listed here.
- Click on the step that you want to go back to, in this case the previous step. The added columns will disappear.
- Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
- Leave the dataset in the state in which the
Authorwas clustered, but not yet split.
Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Trim Leading and Trailing Whitespace
Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
- In the header for the column
Trim leading and trailing whitespace.
- Notice that the
Splitstep has now disappeared from the
Undo / Redopane on the left and is replaced with a
Text transform on 0 cells
- Perform the same
Authorthat you undid earlier. This time you should only get two new columns. Why?
Removing the leading white spaces means that each entry in this column has exactly one space (between the genus and species names). Therefore, when you split with space as the separator, you will get only two columns.
Undo the splitting step before moving on to the next lesson. If you skip this step, your solutions
for later exercises will not be the same as shown in those exercise solutions.
Faceting and clustering approaches can identify errors or outliers in data.