Regular Expressions for Biologists

Do you often work with lots of data files on the computer? Are you often trying to spot particular files or lines of text in them that are important for you?

If so, then using regular expressions could save you a lot of time and frustration!

Regular expressions (often abbreviated to regex or RE) are a method for describing patterns of characters that you want to match in a body of text. A knowledge of regular expressions can be extremely helpful in computational biology, especially when combined with text editors and common command-line tools (grep, sed, awk, etc.).

These course materials are designed to give an introduction to using regular expressions. Working through the materials, you will learn how to quickly find and replace text in large files by controlling the types and numbers of characters matched, handling repeats, keeping certain parts of a matched pattern during replacement, and constructing sets of different options to be matched.
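As a taste of what these skills look like in practice, here is a minimal sketch using Python's built-in `re` module. The lesson itself is not tied to Python, and the FASTA-style records below are invented purely for illustration; the same patterns should work with little or no change in most regex-aware text editors and tools.

```python
import re

# Hypothetical FASTA-style records, made up for this example.
text = """>seq1 Homo_sapiens
ATGCCCGGGTTTAAATAG
>seq2 Mus_musculus
ATGAAACCCGGGTGA
"""

# Character set + repeat: runs of three or more consecutive C or G bases.
gc_runs = re.findall(r"[CG]{3,}", text)

# Capture groups + references: swap the ID and species name in header lines.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
swapped = re.sub(r"^>(\S+) (\S+)$", r">\2 \1", text, flags=re.MULTILINE)

# Alternatives: find lines that end with one of the three stop codons.
stops = re.findall(r"(?:TAA|TAG|TGA)$", text, flags=re.MULTILINE)

print(gc_runs)
print(swapped)
print(stops)
```

Each pattern corresponds to one of the topics covered later: sets and ranges, repeated matches, capture groups and references, and alternative matches.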

These materials do not provide a comprehensive overview of regex syntax or the regex engine. Instead, they cover the vast majority of use cases that the authors encounter. The background of the authors is reflected in many of the examples chosen, which often focus on biological contexts and file formats.

For a comprehensive overview of regular expressions, we highly recommend the excellent regular-expressions.info.

Prerequisites

To follow this lesson, learners should know how to open a text file on their computer. Some familiarity with the Find/Replace functionality available in most text editing tools will be beneficial. The examples used will feel most relevant to those who are familiar with common file formats used to represent biological data, and some knowledge of sequence biology (DNA/RNA and protein sequences) is assumed.

Schedule

	Setup	Download files required for the lesson
00:00	1. Introduction	What is a regular expression? In what programs can I use regular expressions?
00:00	2. Regex Fundamentals	How can I search for sets of characters in a text file? How can I specify ranges of characters in a search?
00:00	3. Tokens and Wildcards	How can I specify that patterns should only match as whole words or whole lines? How can I only match patterns that appear at the very beginning or very end of a line? How can I specify positions in a pattern that could match any character?
00:00	4. Repeated Matches	How can I define a character or set that appears multiple times in a pattern? How can I define the maximum and minimum number of times a character or set should appear in the pattern?
00:00	5. Capture Groups and References	How can I reuse parts of the matched pattern when I replace it?
00:00	6. Alternative Matches	How can I define multiple possible strings that can be matched in a regular expression?
00:00	Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.