OverviewTeaching: 10 min
Exercises: 0 minQuestions
What is OpenRefine useful for?Objectives
Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.
Motivations for the OpenRefine Lesson
- It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
- All actions are easily reversed in OpenRefine.
- You must save your work to a new file; OpenRefine does not modify your original dataset.
- Data is often very messy, and this tool saves a lot of time on cleaning headaches.
- Data cleaning steps often need repeating with multiple files. OpenRefine is perfect for speeding up repetitive tasks by replaying previous actions on multiple datasets.
- Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
Before we get started
Note: this is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.
Follow the Setup instructions to install OpenRefine.
If after installation and running OpenRefine, it does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Basics of OpenRefine
You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. There is a Google Group that can answer a lot of beginner questions and problems. There is also an OpenRefine Google Plus community where you can find a lot of help and a lot of folks from the life sciences are members. As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.
- Open source (source on GitHub).
- A large growing community, from novice to expert, ready to help.
- Works with large-ish datasets (100,000 rows). Does not scale to many millions. (yet).
OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.
Working with OpenRefine
OverviewTeaching: 15 min
Exercises: 20 minQuestions
How can we bring our data into OpenRefine?
How can we sort and summarize our data?
How can we find and correct errors in our raw data?Objectives
Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Creating a Project
Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your browser).
Launch OpenRefine (see Getting Started with OpenRefine).
OpenRefine can import a variety of file types, including tab separated (
tsv), comma separated (
csv), Excel (
xlsx), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we have created a set of TCP data called eebo.csv.
If you haven’t already, download the data from:
Once OpenRefine is launched in your browser, the left margin has options to
Open Project, or
Import Project. Here we will create a new project:
Create Projectand select
Get data from
Choose Filesand select the file
Openor double-click on the filename.
Next>>under the browse button to upload the data into OpenRefine.
- OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click
Update Preview(bottom left). If this is the wrong file, click
<<Start Over(upper left).
- If all looks well, click
Create Project>>(upper right).
Note that at step 1, you could upload data in a standard form from a web address by selecting
Get data from
Web Addresses (URLs). However, this won’t work for all URLs.
Exploring data by applying multiple filters
OpenRefine supports faceted browsing as a mechanism for
- seeing a big picture of your data, and
- filtering down to just the subset of rows that you want to change in bulk.
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows for which the cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.
Here we will use faceting to look for potential errors in data entry in the
- Scroll over to the
- Click the down arrow and choose
- In the left panel, you’ll now see a box containing every unique value in the
Authorcolumn along with a number representing how many times that value occurs in the column.
- Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
Facetlist. You should see that you have an
- You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
There will be several near-identical entries in
Author. For example, there is one entry for
Archdeacony of London. This a misspelling of
Archdeaconry. We will see how to correct these misspelled and mistyped entries in a later exercise.
Using faceting, find out how many years are represented in the catalogue.
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Which years have the most and least observations?
- For the column
Text facet. A box will appear in the left panel showing that there are 64 unique entries in this column.
- By default, the column
Dateis formatted as Text. You can change the format by doing
To number. Doing
Numeric facetcreates a box in the left panel that shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.
- After creating a facet, click
Sort by countin the facet box. The year with the most observations is 1640. There are many data but among them are 1632, 1636.
You can also use a custom facet to search by words in a column.
- Go to author
- Go facts, then select customized facets, and select word facet
- A box called place should appear in the left hand side.
Using custom faceting, find out how many words are represented in the place column.
How many representations of London can you find?
3 versions of London: London, London?, Londini
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings
New York and
new york are very likely to refer to the same concept and just have capitalization differences. Likewise,
Godel probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
- In the
AuthorText Facet we created in the step above, click the
- In the resulting pop-up window, you can change the
Keying Function. Try different combinations to see what different mergers of values are suggested.
- Select the
key collisionmethod and
metaphone3keying function. It should identify three clusters.
- Click the
Merge?box beside each, then click
Merge Selected and Reclusterto apply the corrections to the dataset.
- Try selecting different
Keying Functionsagain, to see what new merges are suggested. You may find there are still improvements that can be made, but don’t
Closewhen you’re done. We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Important: If you
Merge using a different method or keying function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
- Let us suppose we want to split the
Authorcolumn into separate columns for Authors
- Click the down arrow at the top of the
Split into several columns...
- In the pop-up, in the
Separatorbox, replace the comma with a space.
- Uncheck the box that says
Remove this column.
OK. You’ll get some new columns called
Author 2, and so on.
- Notice that in some cases
Author 2are empty. Why is this? What do you think we can do to fix this?
The entries that have data in
Author 4but not the first two
Authorcolumns had an extra space at the beginning of the entry. Leading white spaces are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data. We’ll look at how to fix leading and trailing white spaces in a later exercise.
Try to change the name of the second new column to “Author”. How can you correct the problem you encounter?
Author 2column, click the down arrow and then
Rename this column. Type “Author” into the box that appears. A pop-up will appear that says
Another column already named Author. This is because there is another column where we’ve recorded the species abbreviation. You can choose another name like
authorNamefor this column or change the other
Authorcolumn you can change the name to
Undo / Redo
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides
Redo operations to make this easy.
- Click where it says
Undo / Redoon the left side of the screen. All the changes you have made so far are listed here.
- Click on the step that you want to go back to, in this case the previous step. The added columns will disappear.
- Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
- Leave the dataset in the state in which the
Authorwas clustered, but not yet split.
Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Trim Leading and Trailing Whitespace
Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
- In the header for the column
Trim leading and trailing whitespace.
- Notice that the
Splitstep has now disappeared from the
Undo / Redopane on the left and is replaced with a
Text transform on 0 cells
- Perform the same
Authorthat you undid earlier. This time you should only get two new columns. Why?
Removing the leading white spaces means that each entry in this column has exactly one space (between the genus and species names). Therefore, when you split with space as the separator, you will get only two columns.
Undo the splitting step before moving on to the next lesson. If you skip this step, your solutions
for later exercises will not be the same as shown in those exercise solutions.
Faceting and clustering approaches can identify errors or outliers in data.
Filtering and Sorting with OpenRefine
OverviewTeaching: 10 min
Exercises: 10 minQuestions
How can we select only a subset of our data to work with?
How can we sort our data?Objectives
Filter to a subset of rows by text filter or include/exclude.
Sort table by a column.
Sort by multiple columns.
There are many entries in our data table. We can filter it to work on a subset of the data in the list for the next set of operations. Please ensure you perform this step to save time during the class.
- Click the down arrow next to
Text filter. A
Authorfacet will appear on the left margin.
- Type in
archand press return. There are 14 matching rows of the original 99 rows (and these rows are selected for the subsequent steps).
- At the top, change the view to
rows. This way you will see all the matching rows.
- What author names are selected by this procedure?
- How would you restrict this to one of the author selected?
Text faceton the
Authorcolumn after filtering. This will show that four names match your filter criteria. They are
Church of England. Archdeaconry of Surreyand
Church of England. Archdeaconry of Worcester.
- To restrict to only one of these two species, you could make the search case sensitive or you could split the
Authorcolumn into species and genus before filtering or you could include more letters in your filter.
In addition to the solutions included above, another way to narrow our filter is to
exclude entries in a facet. If you still have your facet for
Author, you can use it, or use drop-down menu >
Text facet to create a new facet. Only the entries with names that agree with your
Text filter will be included in this facet.
Faceting and filtering look very similar. A good distinction is that faceting gives you an overview description of all of the data that is currently selected, while filtering allows you to select a subset of your data for analysis.
include / excludeto select only entries from one of these two species.
- In the facet (left margin), click on one of the names, such as
Church of England. Archdeaconry of Surrey. Notice that when you click on the name, or hover over it, there are entries to the right for
include. This will explicitly include this species, and exclude others that are not expicitly included. Notice that the option now changes to
excludeon the other species (
Church of England. Archdeaconry of Worcester) and notice how the two entries appear and disappear from the table.
Important: Select both species for your filtered dataset before continuing with the rest of the exercises.
You can sort the data by a column by using the drop-down menu in that column.
There you can sort by
FALSE values). You can also specify what order to put
Errors in the sorted results.
If this is your first time sorting this table, then the drop-down menu for the selected column shows
Sort.... Select what you would like to sort by (such as
numbers). Additional options will then appear for you to fine-tune your sorting.
Sort by month. How can you ensure that years are in order?
If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to >
Sort without the
..., to remind you that you have already used this column. It will give you additional options:
Sort...- This option enables you to modify your original sort.
Reverse- This option allows you to reverse the order of the sort.
Remove sort- This option allows you to undo your sort.
Sort the data by
Date. What year(s) were observations recorded for plot 1 in this filtered dataset.
smallest first. The years represented are 1571 and 1635.
Sorting by multiple columns.
You can sort by multiple columns by performing sort on additional columns. The sort will depend on the order in which you select columns to sort. To restart the sorting process with a particular column, check the
sort by this column alone box in the
Sort pop-up menu.
You might like to look for trends in your data by Page length of collection across years.
- How do you sort your data by year?
- How would you do this differently if you were instead trying to see all of your entries in chronological order?
- For the
Authorcolumn, click on
text. This will group all entries made in, for example, 1632, together, regardless of the year that entry was collected.
- For the
pagescolumn, click on
sort by this column alone. This will undo the sorting by year step.
If you go back to one of the already sorted colunms and select >
Remove sort, that column is removed from your multiple sort. If it is the only column sorted, then data reverts to its original order.
Pagesin some order. Be creative: try sorting as
text, and in reverse order (
largest to smallestor
z to a).
Remove sortto remove the sort on the second of three columns. Notice how that changes the order.
OpenRefine provides a way to sort and filter data without affecting the raw data.
Examining Numbers in OpenRefine
OverviewTeaching: 10 min
Exercises: 10 minQuestions
How can we convert a column from one data type to another?
How can we visualize relationships among columns?Objectives
Transform a text column into a number column.
Identify and modify non-numeric values in a column using facets.
Use scatterplot facet to examine relationships among columns.
When a table is imported into OpenRefine, all columns are treated as having text values. We saw earlier how we can sort column values as numbers, but this does not change the cells in a column from text to numbers. Rather, this interprets the values as numbers for the purposes of sorting but keeps the underlying data type as is. We can, however, transform columns to other data types (e.g. number or date) using the
Edit cells >
Common transforms feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.
Be sure to remove any
Text filter facets you have enabled from the left panel so that we can examine our whole dataset. You can remove an existing facet by clicking the
x in the upper left of that facet window.
To transform cells in the
VID column to numbers, click the down arrow for that column, then
Edit cells >
Common transforms… >
To number. You will notice the
VID values change from left-justified to right-justified, and black to green color.
Transform three more columns, including
Date, from text to numbers. Can all columns be transformed to numbers?
Only observations that include only numerals (0-9) can be transformed to numbers. If you apply a number transformation to a column that doesn’t meet this criteria, and then click the
Undo / Redotab, you will see a step that starts with
Text transform on 0 cells. This means that the data in that column was not transformed.
Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them.
We can do that with a
- For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as
abc) or blank (no number or text).
- Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
- Notice that there are several checkboxes in this facet:
Error. Below these are counts of the number of cells in each category. You should see checks for
Blankif you changed some values.
- Experiment with checking or unchecking these boxes to select subsets of your data.
When done examining the numeric data, remove this facet by clicking the
x in the upper left corner of its panel. Note that this does not undo the edits you made to the cells in this column. If you want to reverse these edits, use the
Undo / Redo function.
Now that we have multiple columns representing numbers, we can see how they relate to one another using the scatterplot facet. Select a numeric column, for example
Date, and use the pulldown menu to >
Scatterplot facet. A new window called
Scatterplot Matrix will appear. There are squares for each pair of numeric columns organized in an upper right triangle. Each square has little dots for the cell values from each row.
- Examine the scatterplots overall. Do the patterns make sense?
- Why does the scatterplot for
Datehave the pattern it does?
Examine pair of columns in detail
We can examine one pair of columns by clicking on its square in the
Scatterplot Matrix A new facet with only that pair will appear in the left margin.
Click in the scatterplot facet in the left margin and drag to highlight a rectangle. This will subset the data to those entries.
- Click on the
Scatterplot Matrixsquare for
Dateto get that as a facet in the left margin.
- Redo the
Authorto show only entries including the letters
Arch. Notice the change in the scatterplot. It might be easier to see if you click
export plotto put it on a new browser tab.
OpenRefine also provides ways to get overviews of numerical data.
Scripts from OpenRefine
OverviewTeaching: 10 min
Exercises: 5 minQuestions
How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?Objectives
Describe how OpenRefine generates JSON code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from an analysis.
Apply saved JSON code to an analysis.
- In the
Undo / Redosection, click
Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.
- Copy the code from the right hand panel and paste it into a text editor (like NotePad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting
Make plain textand save the file as a
Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.
- Download an uncleaned version of the dataset: Open Refine or use the version of the raw dataset you saved to your computer.
- Start a new project in OpenRefine with this file and name it something different from your existing project.
- Click the
Undo / Redotab >
Applyand paste in the contents of
txtfile with the JSON code.
Perform operations. The dataset should now be the same as your other cleaned dataset.
For convenience, we used the same dataset. In reality you could use this process to clean related datasets. For example, data that you had collected over different fieldwork periods or data that was collected by different researchers (provided everyone uses the same column headings).
All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.
Exporting and Saving Data from OpenRefine
OverviewTeaching: 10 min
Exercises: 5 minQuestions
How can we save and export our cleaned data from OpenRefine?Objectives
Save an OpenRefine project.
Export cleaned data from an OpenRefine project.
Saving and Exporting a Project
In OpenRefine you can save or export the project. This means you’re saving the data and all the information about the cleaning and data transformation steps you’ve done. Once you’ve saved a project, you can open it up again and be just where you stopped before.
By default OpenRefine is saving your project. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again.
You can also export a project. This is helpful, for instance, if you wanted to send your raw data and cleaning steps to a collaborator, or share this information as a supplement to a publication.
- Click the
Exportbutton in the top right and select
tar.gzfile will download to your default
tar.gzextension tells you that this is a compressed file. Which means that this file contains multiple files. You can double-click on the
tar.gzfile and it will expand into a directory. A folder icon will now appear.
- Look at the files that appear in this folder. What files are here? What information do you think these files contain?
You should see:
historyfolder which contains three
zipfiles. Each of these files itself contains a
change.txtfiles are the records of each individual transformation that you did to your data.
data.zipfile. When expanded, this
zipfile includes a file called
data.txtwhich is a copy of your raw data. You may also see other files.
You can import an existing project into OpenRefine by clicking
Open... in the upper right >
Import Project and selecting the
project file. This project will include all of the raw data and cleaning steps that were part of the origina project.
Exporting Cleaned Data
You can also export just your cleaned data, rather than the entire project.
Exportin the top right and select the file type you want to export the data in.
csv) would be good choices.
- That file will be exported to your default
Downloaddirectory. That file can then be opened in a spreadsheet program or imported into programs like R or Python, which we’ll be discussing later in our workshop.
Remember from our lesson on Spreadsheets that using widely-supported, non-proprietary file formats like
csv improves the ability of yourself and others to use your data.
Cleaned data or entire projects can be exported from OpenRefine.
Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.
Other Resources in OpenRefine
OverviewTeaching: 5 min
Exercises: 5 minQuestions
What other resources are available for working with OpenRefine?Objectives
Understand that there are many online resources available for more information on OpenRefine.
Identify other resources about OpenRefine.
Identify other Resources about OpenRefine.
OpenRefine is more than a simple data cleaning tool. People are using it for all sorts of activities. Here are some other resources that might prove useful.
OpenRefine has its own web site with documentation and a book:
- OpenRefine web site
- OpenRefine Documentation for Users
- OpenRefine documentation Wiki site
- Using OpenRefine book by Ruben Verborgh, Max De Wilde and Aniket Sawant
- OpenRefine history from Wikipedia
In addition, see these other useful resources:
- Grateful Data is a fun sight with many resources devoted to OpenRefine, including a nice tutorial.
- Margaret Heller shows how she uses OpenRefine for Measuring and Counting Impact in Repositories.
- Intersect Course Resources has Jared Berghold’s Cleaning & Exploring your data with Open Refine (scroll down page to find).
- Enipedia OpenRefine Tutorial
There are more advanced uses of OpenRefine, such as bringing in column or cell data using web locators (URLs or APIs). The links above can give you a start on your journey.
Visit one of these sites and share what you find with another person.
Other examples and resources online are good for learning more about OpenRefine