Introduction

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What is a regular expression?

In what programs can I use regular expressions?

Objectives

Explain that regular expressions are a way of describing patterns in text.

Describe circumstances where a regular expression search could be helpful.

Access the regular expression search function of a text editor.

What is a regular expression?

A regular expression (regex) is a string of characters defining a pattern to be found in a block of text. A regex can be constructed to allow multiple possible matches, while restricting other possibilities. This makes regular expressions considerably more powerful and precise than the simple ‘CTRL+F’ or ‘Find/Replace’ operations that you might be familiar with from word processing/web browser software. It is also possible to use regular expressions to specify parts of the search pattern that will be kept in the replacement string of a substitution operation.

Illustrative Example

Imagine that you have a long text document. The document contains a lot of numbers - years (2014, 2016, 1998, etc), short numbers (9, 4, 58, etc), and a single phone number (i.e. twelve digits). You need to find the phone number, but it is buried in this large body of text, amongst many other numbers. You cannot remember much about the phone number (like what digit(s) it starts with). You can’t just randomly search strings of digits with CTRL+F/Find/Replace, because there are far too many possible combinations (one trillion twelve-digit numbers!), and simply performing ten individual searches for the digits from 0 to 9 returns many unwanted results.

You could run one of these searches for an individual digit, then manually click through each result, until you eventually find the phone number. Or you could simply scroll through the document, looking for something that resembles a phone number. But that is likely to take a long time, is really boring, and it would be easy to miss the number while hastily scanning through so many pages.

With regular expressions, you can define a pattern to find any combination of twelve digits in succession, which you can use to find the phone number - and nothing else - wherever it is in the document.
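
As a rough sketch of this idea in Python (the lesson itself does not depend on Python, and the sample text and phone number below are invented for illustration), the pattern \d{12} describes twelve digits in succession. Both pieces of syntax used in the pattern are explained in later sections.

```python
import re

# Hypothetical document text: years, short numbers, and one twelve-digit phone number.
text = "In 2014 we sold 58 units; by 2016 it was 9. Call 004912345678 for the 1998 report."

# \d matches a single digit and {12} repeats it exactly twelve times.
match = re.search(r"\d{12}", text)
phone = match.group(0)
print(phone)
```

None of the shorter numbers can satisfy the pattern, so the phone number is the only match in this sample.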

Regular Expression Engines

In order to perform a regex search, you need to provide the expression to a program that can interpret it. Such tools are known as regular expression engines. Unfortunately, there are multiple different ‘flavours’ of regex engines, which each work slightly differently. However, the core rules, which are covered in this lesson and will probably address 99.9% of your text-searching needs, are effectively common to all of them.

How can I use Regular Expressions?

Text Editors

Most text editors include a regular expression engine to provide regex search/replace functionality. See the Setup page for suggested text editors to use on different operating systems when following this lesson.

Usually, you will need to specify that you want to use regular expression searching instead of the normal plain text searching that you may already be familiar with. When you open up the Find/Replace interface in your editor, you should look for an option with a label like “mode”, “grep-like”, “regular expression” or “regex”, and use this to control the type of searching that you want to do.

Command Line

In addition, several Unix command line tools are commonly available that use regular expressions, for searching (e.g. grep/egrep) and replacement/substitution (e.g. sed, awk, perl -e).

Key Points

Regular expressions are a way of describing patterns in text.

Most text editors and many other tools include a regular expression engine for performing these kinds of searches.

Regular expressions are often offered as a mode of find/replace that can be turned on and off by the user.

Regex Fundamentals

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I search for sets of characters in a text file?

How can I specify ranges of characters in a search?

Objectives

Compose regular expressions to match patterns in text.

Specify sets and ranges of characters to include in a search.

Use inverted matches to exclude particular characters from a search.

Basic String Matching

In regular expression mode, you can still search for a literal string i.e. the exact letter(s)/word(s) that you want to find. So, to find the gene name ‘HDAC1’, you would type:

HDAC1

Using your text editor, open the example file example.gff and try using the Find/Replace interface to search for this term. Remember to make sure that you have your searches switched to regex mode.

Gene names are one example of where the letters in your target string might be in upper or lower case (or a mix of the two) - DNA and RNA sequences are another. You should consider this when doing your search: most search functions/programs provide an option to switch case sensitivity on and off. For example, when using grep on the command line, the -i option activates case insensitivity.

grep -i HDAC1 example.gff
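
If you are working in a scripting language rather than on the command line, the same idea usually exists as a flag. The sketch below uses Python's re module, where re.IGNORECASE plays a role similar to grep's -i option. The sample strings are invented and are not taken from example.gff.

```python
import re

# Invented strings standing in for attributes found in a GFF file.
names = ["Name=HDAC1", "Name=hdac1", "Name=Hdac1", "Name=BRCA2"]

# re.IGNORECASE makes the literal pattern match regardless of letter case.
pattern = re.compile(r"HDAC1", re.IGNORECASE)
hits = [name for name in names if pattern.search(name)]
print(hits)
```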

Finding HDACs

Find every instance of a histone deacetylase (HDAC) in the file, based on the gene name. Are there any potential problems with your search? Think about how you might make your search as specific as possible, to avoid spurious matches.
Solution

To make the search less specific, remove the ‘1’ from the end
HDAC
Or, to be more specific and avoid any spurious matches:
Name=HDAC

Sets and Ranges

What if we want to find only ‘HDAC1’ and ‘HDAC2’ but no other patterns beginning with this gene name? You could do this in two separate searches, but getting everything you need in a single search might be better for a number of reasons: it is faster; it will preserve the order of the results, e.g. if you are extracting them to a separate file; if you learn how to do it for two strings, you can apply the same approach to ten, 100, etc; and it is more satisfying!

In a regular expression, we can specify groups of characters to be matched in a certain position by placing them between square brackets []. For example,

[bfr]oot

will match ‘boot’, ‘foot’, and ‘root’, but not ‘loot’, or ‘moot’. Only a single character inside the square brackets [] will be matched, so the pattern above will not match the whole of ‘froot’ or ‘broot’ either. As a substring of these, ‘root’ will be matched in both cases, which may not be what we want. We will learn more soon about how to avoid these partial matches.
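
The difference between matching a whole word and matching a substring can be checked with a short Python sketch. The word list is the one used in the paragraph above. The use of fullmatch and search is simply one convenient way to separate the two cases in Python, and is not part of the lesson's own tooling.

```python
import re

words = ["boot", "foot", "root", "loot", "moot", "froot", "broot"]
pattern = re.compile(r"[bfr]oot")

# fullmatch requires the whole word to match; search also accepts a substring.
whole = [w for w in words if pattern.fullmatch(w)]
partial = [w for w in words if pattern.search(w) and not pattern.fullmatch(w)]
print(whole)    # words matched in full
print(partial)  # words matched only through the substring 'root'
```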

This approach can be used to match only HDAC1 and HDAC2 in our example GFF file, with the regex below:

HDAC[12]

The set of characters specified inside [] can be a mix of letters, numbers, and symbols. So,

[&k8Y]

is a valid set.

What if we want to match any uppercase letter? Applying what we’ve learned already, we could use the set

[ABCDEFGHIJKLMNOPQRSTUVWXYZ]at

to match any string beginning with an upper case letter, followed by ‘at’ e.g. ‘Cat’, ‘Bat’, ‘Mat’, etc. But that’s a lot of typing (30 characters to match only three in the string), and we can instead specify ranges of characters in [] with -. So, to match the same strings as before, our regex can instead look like this:

[A-Z]at

Only seven characters now - that’s much better! All lower case letters can be matched with the set [a-z], and digits with [0-9].

Character Sets

Here, we will discuss only characters that fall inside the ASCII set - a limited set of Roman alphabet letters, numbers, and symbols, which includes almost everything commonly used in English, but not accented characters (ü, é, ø, etc) or many specialised symbols (e.g. €, ¿, ±). Many regular expression engines provide a ‘Unicode mode’, often switched on with the ‘u’ flag, which will allow you to specify and match the full range of Unicode characters. This includes most alphabets, symbols, and even emojis.

Exercise 2.2

a) In total, how many lines mention HDAC 1-5 in example.gff?

Solution

82 (using the regex Name=HDAC[1-5][^0-9])

b) Which of the following expressions could you use to match any four-letter word beginning with an uppercase letter, followed by two vowels, and ending with ‘d’?
	i) [upper][vowel][vowel]d

	ii) [A-Z][a-u][a-u]d

	iii) [A-Z][aeiou][aeiou]d

	iv) [A-Z][aeiou]*2;d
Solution

Option iii) fits the description. You might also have chosen option ii), which would match the described pattern, but also other non-vowel letters in the middle two positions.

c) Try playing around with the character ranges that you have learned above. What does [A-z] evaluate to? What about [a-Z]? Can you create a range that covers all letter and number characters?

Solution

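[A-z] matches all letter characters (both upper and lower case), but it also matches the six symbols that sit between ‘Z’ and ‘a’ in the ASCII table ([, \, ], ^, _ and `), which is rarely what you want. [a-Z] is an invalid set, because a range must run from a lower to a higher character code and ‘a’ comes after ‘Z’. For the same reason, [A-9] is rejected by most engines, since the digits come before the letters in ASCII. To cover all letter and number characters and nothing else, list the ranges together in one set, e.g. [A-Za-z0-9].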
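
Ranges follow the order of the characters in the ASCII table, and this is easy to explore in code. The sketch below uses Python's re module. Other engines may report an invalid range differently, so treat the error handling here as specific to Python.

```python
import re

# [A-z] spans ASCII codes 65-122, which includes six symbols between 'Z' and 'a'.
extras = [ch for ch in "[\\]^_`" if re.fullmatch(r"[A-z]", ch)]
print(extras)

# A range whose end sorts before its start is rejected by Python's engine.
try:
    re.compile(r"[a-Z]")
    a_to_Z_valid = True
except re.error:
    a_to_Z_valid = False
print(a_to_Z_valid)

# Listing the ranges explicitly covers letters and digits and nothing else.
alnum = re.compile(r"[A-Za-z0-9]")
print(bool(alnum.fullmatch("_")), bool(alnum.fullmatch("q")), bool(alnum.fullmatch("7")))
```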

Ranges don’t have to include the whole alphabet or every digit - we can match only the second half of the alphabet with

[N-Z]

and only the numbers from 5 to 8 with

[5-8]

We can specify multiple ranges together in the same set, so matching ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, or any digit can be done with

[a-f0-9]

But if we’re using the - symbol to specify a range of characters, how do we include the literal ‘-’ symbol in a set to be matched? To do this, the - should be specified at the start of the set. So

[-A-K]

will match ‘-’ as well as any uppercase letter between ‘A’ and ‘K’.
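
A quick way to check what a set covers is to test individual characters against it. The following Python sketch does this for the two sets above. The test characters are arbitrary choices made for illustration.

```python
import re

hex_like = re.compile(r"[a-f0-9]")
dash_or_a_to_k = re.compile(r"[-A-K]")

# Keep only the characters that each set accepts.
hex_hits = [ch for ch in "a5fg-Z" if hex_like.fullmatch(ch)]
dash_hits = [ch for ch in "-AKLz" if dash_or_a_to_k.fullmatch(ch)]
print(hex_hits)
print(dash_hits)
```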

Inverted sets

The last thing to tell you about sets and ranges (for now), is that we can also specify a set or range of characters to not match in a position. This is achieved with the ^ symbol at the beginning of the set.

201[^269]

will match ‘2010’, ‘201K’, ‘201j’, etc, but not ‘2012’, ‘2016’, or ‘2019’. In contrast to -, which is only taken literally when at the start of the set, ^ only takes a special meaning at the start of a set - it is treated literally if it appears anywhere else in the set. If you want to invert a set that should include the - symbol, start the set with ^- followed by whatever other characters you don’t want to match.
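
The behaviour of the inverted set, and of ^ away from the start of a set, can be sketched in Python as follows. The candidate strings are the ones listed above.

```python
import re

pattern = re.compile(r"201[^269]")
candidates = ["2010", "201K", "201j", "2012", "2016", "2019"]
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)

# ^ is treated literally when it is not the first character of the set.
caret_literal = re.compile(r"[269^]")
caret_matches = bool(caret_literal.fullmatch("^"))
print(caret_matches)
```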

Exercise 2.3

Use an inverted set to only match the human autosomes (chr1-22), i.e. filtering out chromosomes chrX, chrY and chrM. How many records with autosomes can you find in file example.gff?
Solution
chr[^XYM]
There are 897 records matching this regular expression. In fact, there are only 895 lines beginning with chr[^XYM], but two other lines also match the regex above because they contain the string ‘chromosome’. To avoid matching these, anchor the regex to the beginning of the line with ^ i.e. ^chr[^XYM] (see chapter 3).

Key Points

Wrap characters in [] to define a set of valid matches for a given position.

Use - between two characters to define a range of characters to match.

Use ^ at the start of a set to invert it, indicating that the given characters should be excluded from a match.

Tokens and Wildcards

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I specify that patterns should only match as whole words or whole lines?

How can I only match patterns that appear at the very beginning or very end of a line?

How can I specify positions in a pattern that could match any character?

Objectives

Compose regular expressions that include tokens to match particular classes of character.

Describe the risks associated with using tokens and wildcards that match many characters.

Modify a regular expression to match only strings that appear at the start or end of a line or word.

Referencing multiple characters

The introductory example described a search for runs of digits. The \d token represents any single digit, so the two regular expressions below are equivalent.

[0-9]
\d

Tokens and the Backslash

Tokens in general are shorter form ways of describing standard, commonly-used character sets/classes. The table below describes the tokens available for use in regex.

Token	Matches	Set Equivalent
`\d`	Any digit	`[0-9]`
`\s`	Any whitespace character (space, tab, newlines)	`[ \t\n\r\f]`
`\w`	Any ‘word character’ - letters, numbers, and underscore	`[A-Za-z0-9_]`
`\b`	A word boundary (a position rather than a character)	no equivalent
`\D`	The inverse of `\d` i.e. any character except a digit	`[^0-9]`
`\S`	Any non-whitespace character	`[^ \t\n\r\f]`
`\W`	Any non-word character	`[^A-Za-z0-9_]`

Notice that these tokens have a common syntax - a backslash character ‘\’ followed by a letter establishing the set (lowercase) or inverse of a set (uppercase) being represented. The backslash is important in regular expressions, and in programming/command line computing in general. It is often used as an ‘escape character’ - it signifies to a program/interpreter that the character that follows the backslash should be treated in some special way. In regex, the backslash has a slightly confusing role, in that it is used as both an escape character and as a way of conferring special meaning on characters that would otherwise be treated literally. So, for tokens the inclusion of a backslash changes the meaning of the w, s, b, and d characters from “match exactly the character ‘w’” and so on, to “match any character in this set/class”. But for other characters that already have a special meaning (e.g. ., $, [, etc), the inclusion of a backslash in the preceding position in a regex tells the engine “treat this character literally, instead of interpreting it with a special value”. We will discuss the implications of matching special characters in more detail in later sections.

This even extends so far as to the backslash character itself - you can specify that you want to match a literal backslash, by preceding that backslash character with - you guessed it! - a backslash i.e. with \\.
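
The two roles of the backslash can be seen side by side in a short Python sketch. The sample text is invented, and raw strings (the r prefix) are used so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

text = r"file.txt fileXtxt C:\data"

# Unescaped . has a special meaning, so it matches the 'X' as well as the literal dot.
unescaped = re.findall(r"file.txt", text)
# Escaped \. matches only a literal dot.
escaped = re.findall(r"file\.txt", text)
# \\ in the pattern matches one literal backslash.
backslash = re.findall(r"C:\\data", text)
print(unescaped)
print(escaped)
print(backslash)
```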

Exercise 3.1

Match dates of the form 31.01.2017 (DAY.MONTH.YEAR) in the example file person_info.csv. Pay attention to not wrongly match phone numbers. How many matches do you find?
Solution
\d\d\.\d\d\.\d\d\d\d
There are 25 matches in the file person_info.csv (every even record).

Note that the solution above will also match strings like 131.01.20171 or 99.99.9999. If you really need to avoid matches like that, you will need to construct a more specific regex, e.g. one based on the ranges of years you expect to find in your date matches.

Word Boundaries

The \d, \w, and \s tokens each represent a clearly defined set of characters. The \b token is more interesting - it is used to match what is referred to as a ‘word boundary’, and can be used to ensure matching of whole words only. For example, we want to find every instance of ‘chr1’ and ‘chr2’ in the file example.gff. Using what we’ve already learned, we can design the regex

chr[12]

which will match either of the two target strings. However, this regex will also match all but the last character of ‘chr13’ and ‘chr22’, which is not what we want. How can we be sure that we will only match the two chromosome identifiers that we want, without additional digits on the end? We could add a space character to the end of the regex. But what if the target string appears at the end of a line? Or before a symbol/delimiter such as ‘;’ or ‘.’? These strings will be missed by our regex ending with a space.

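This is where the \b token comes in handy. Rather than matching a character, \b matches a position: the point between a word character (\w) and a non-word character, or between a word character and the beginning or end of the string. This covers all of the options described above - spaces, symbols that might be used as field delimiters, periods and commas, newline characters, and the positions matched by the special regex characters ^ and $, which refer to the beginning and end of a string respectively (more on these in a moment). So, by using the regex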

\bchr[12]\b

we ensure that we will only get matches to ‘chr1’ and ‘chr2’ as whole words, regardless of whether they are flanked by spaces, symbols, or the beginning or end of a line.
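
To see the effect of the word boundaries, compare the two patterns on a small piece of text. This is a Python sketch with an invented string that places the targets next to spaces, symbols, a newline, and the end of the text.

```python
import re

# Invented text with the targets next to spaces, symbols, and line ends.
text = "chr1 chr13;chr2.\nchr22 and finally chr2"

loose = re.findall(r"chr[12]", text)
strict = re.findall(r"\bchr[12]\b", text)
print(loose)   # includes partial matches inside chr13 and chr22
print(strict)  # whole-word matches only
```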

Exercise 3.2

How can you refine the regex from the previous exercise to prevent matching strings with preceding or succeeding digits, such as 131.01.20171?
Solution
\b\d\d\.\d\d\.\d\d\d\d\b
Note that this prevents matching strings like 131.01.20171, but still allows nonsensical dates such as 99.99.9999.

Exercise 3.3

When designing a regular expression to match exactly four digits, what would be the difference between using the two regular expressions \b\d\d\d\d\b and \D\d\d\d\d\D?

Solution

While both regular expressions will prevent leading and succeeding digits, the regular expression \D\d\d\d\d\D won’t work if the four digits appear at the beginning or end of the string. That is because the \D tokens MUST match a character.
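
The difference is easiest to see on a string where four-digit numbers sit at the very beginning and end. The Python sketch below uses an invented sentence. Note that the \D version also includes the flanking characters in each match, because \D consumes a character whereas \b does not.

```python
import re

text = "2017 was followed by 2018 and 20199 then 2020"

with_boundaries = re.findall(r"\b\d\d\d\d\b", text)
with_non_digits = re.findall(r"\D\d\d\d\d\D", text)
print(with_boundaries)  # finds the numbers at the start and end as well
print(with_non_digits)  # misses them, and includes the surrounding spaces
```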

The `.` Wildcard

As well as the set tokens described above, there is also the more general wildcard ., which can be used to match any single character. Although it can be very helpful at times, it is recommended to use a more specific token/set wherever possible so as to avoid spurious matches - we will discuss more about this in the next section. Remember, to match a literal . character, escape it with a backslash i.e. \..

`^`Start and End`$`

The ^ and $ symbols are used in a regex to represent the beginning and end of the searched line - they are referred to as ‘anchor’ characters. This can be extremely helpful when searching for lines that begin with a particular string/pattern, but where that pattern might also be found elsewhere in the lines.

In the example SAM file, ‘example.sam’, there are several header lines before the main body of individual alignments. These header lines begin with the ‘@’ symbol, which is also contained within the quality strings and other fields of the alignment lines that make up the bulk of the file. If we search only with @, we won’t be able to pull out only the header lines, so instead we can use the regex

^@

to capture only the header lines. A similar approach can also be useful when searching for particular primer/adapter sequences in high-throughput DNA sequencing data.
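
A minimal Python sketch of the same search is given here. The lines are an invented, simplified stand-in for a SAM file and are not valid SAM records. One Python-specific detail is assumed: when a whole file is searched as a single string, the re.MULTILINE flag is needed for ^ to match at the start of every line, whereas text editors and grep typically work line by line already.

```python
import re

# Invented, simplified stand-in for a few lines of a SAM file.
sam_text = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t10\tACGT\tII@I\n"
    "read2\t0\tchr1\t20\tACGT\t@III\n"
)

# Without an anchor, every @ is found, including those in the quality strings.
all_at = re.findall(r"@", sam_text)
# With re.MULTILINE, ^ matches at the start of each line.
header_at = re.findall(r"^@", sam_text, flags=re.MULTILINE)
print(len(all_at), len(header_at))
```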

Exercise 3.4

Count how many sequences in example_protein.fasta are of transcript_biotype “protein_coding”. Hint: sequence records have a header that starts with the character “>”.
Solution
^>.*transcript_biotype:protein_coding
There are 17 matches in the file example_protein.fasta. Note: be careful when using > in a regular expression on the command line - the > symbol has a special meaning in many command line environments, and using it can result in accidentally wiping the content of files etc.

Key Points

Use the \b token to match a word boundary, and ^ and $ to match the beginning and end of a line respectively.

\ has special meaning in regular expressions, and \\ should be used to specify a literal backslash in a pattern.

. describes a position that could match any character.

When composing a regular expression, it is good practice to be as specific as possible about what you want to match.

Repeated Matches

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I define a character or set that appears multiple times in a pattern?

How can I define the maximum and minimum number of times a character or set should appear in the pattern?

Objectives

Choose the appropriate modifier to optionally or repeatedly match a given character or set.

Compose a regular expression that will match a character appearing between two and five times in succession.

Now that we’ve learned how to account for uncertainty of the characters that should be matched by a regex, it’s time to focus on how to account for uncertainty in the length of the patterns that we would like to match.

For example, imagine that we are working with a large list of files, a collection of several different file formats. We have FASTA files (extension .fasta), BAM files (.bam), text files (.txt), and BED files (.bed). What if we want to match the complete filename of all of the text files? There are files called ‘table1.txt’, ‘table20.txt’, ‘samples.txt’, ‘Homo_sapiens.txt’, and so on, i.e. the filenames vary in length and lack a standard structure/format. So we can’t construct a working regex to capture all of these ‘.txt’ files, using only the character sets and ranges that we’ve learned up to now.

Repeat Modifiers

The symbols +, ?, and * can be used to control the number of times that a character or set should be matched in the pattern. The behaviour of each is summarised in the table below.

Symbol	Behaviour	Example	Matches	Doesn’t Match
`+`	Match one or more of preceding character/set	`AGC+T`	AGCCCT	AGT
`?`	Match zero or one of preceding character/set	`AGC?T`	AGT	AGCCT
`*`	Match zero or more of preceding character/set	`AGC*T`	AGCCCCCT	AGG

So, bo?t will match ‘bt’ and ‘bot’ only, bo+t will match ‘bot’, ‘boot’, ‘boooot’, and so on, and bo*t will match ‘bt’, ‘bot’, ‘boot’, ‘booot’, and so on. These modifiers can also be applied to sets of characters, so the regex

f[aeiou]+nd

will match ‘find’, ‘found’, and ‘fiuoaaend’. Note that the whole class can be repeated, and it is not only repeats of the same character that match i.e. the regex

d[efor]*med

will match ‘deformed’, as well as ‘dmed’, ‘doomed’, and ‘doooooooooooomed’. It is also important to be aware that the modifiers are ‘greedy’, which means that the regex engine will match as many characters as possible before moving on to the next character of the pattern.
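
The three modifiers, and the greedy behaviour, can be compared in a small Python sketch. The word list is chosen for illustration. The last two lines show that + takes the longest run of vowels available before the engine moves on.

```python
import re

words = ["bt", "bot", "boot", "booot"]
results = {}
for pattern in (r"bo?t", r"bo+t", r"bo*t"):
    # fullmatch keeps only the words that the pattern matches in their entirety.
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(pattern, results[pattern])

# Greedy matching: the set is repeated as many times as possible.
greedy = re.search(r"f[aeiou]+", "fiuoaaend").group(0)
print(greedy)
```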

Exercise 4.1

a) Which of the following strings will be matched by the regular expression MVIA*CP?

i) MVIAACP

ii) MVICP

iii) MVIACP

iv) all of the above
Solution

iv) the regex will match all of the strings i) - iii)

b) Write a regular expression to match the string

“ATGCTTTCG”

and

“ATCTCG”

but not

“ATGGCCG”
Solution
ATG?CT+CG

Specifying Repeats

In addition to the modifiers above, which allow the user to specify whether to match zero, one, or an arbitrary number of repeats of a character/set, it is also possible to match only a certain number of repeats, or within a certain range of numbers of repeats, using {}.

GCA{3}T

matches ‘GCAAAT’ only, while

GC[AT]{3}T

matches ‘GCATAT’, ‘GCTTTT’, and ‘GCTAAT’, but not ‘GCTAT’, ‘GCATT’, or ‘GCp)T’. As well as an exact number of repeats to match, a range can also be specified with {n,m}, where n is the minimum and m the maximum number of matches allowed in the pattern. So,

AG[ACGT]{4,10}GC

matches any sequence between four and ten nucleotides, flanked by ‘AG’ on one side and ‘GC’ on the other.
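
The exact and ranged forms can be checked against a few sequences in Python. This is a sketch only. The first list reuses the strings discussed above, and the sequences tested against the ranged pattern are invented.

```python
import re

exact = re.compile(r"GC[AT]{3}T")
candidates = ["GCATAT", "GCTTTT", "GCTAAT", "GCTAT", "GCATT"]
exact_hits = [s for s in candidates if exact.fullmatch(s)]
print(exact_hits)

ranged = re.compile(r"AG[ACGT]{4,10}GC")
four = bool(ranged.fullmatch("AGACGTGC"))              # four nucleotides between the flanks
three = bool(ranged.fullmatch("AGACGGC"))              # only three
eleven = bool(ranged.fullmatch("AG" + "A" * 11 + "GC"))  # eleven
print(four, three, eleven)
```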

Grouping

In some circumstances, you will need to specify exactly what it is that you would like to repeat within a regex. The +, ?, and * modifiers above apply by default only to the character/set that immediately precedes them in the regex. But what if you would like to match multi-character repeats? Or make a substring in the pattern optional? To apply a modifier to a whole group of characters/sets, wrap the group in () parentheses.

For example, in the string ‘rain wind rainrain sunshine’, we can match ‘rain’ and ‘rainrain’ with the regex (rain)+. This kind of grouping with () is used in a few different ways, as we will discover in the next couple of chapters.
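
The effect of the parentheses can be sketched in Python as follows. finditer is used instead of findall because, in Python, findall would return only the text captured by the group rather than the whole match.

```python
import re

text = "rain wind rainrain sunshine"

# group(0) is the whole match, however many times the group repeated.
matches = [m.group(0) for m in re.finditer(r"(rain)+", text)]
print(matches)

# Without the parentheses, + applies only to the final 'n'.
ungrouped_double = bool(re.fullmatch(r"rain+", "rainrain"))
ungrouped_n = bool(re.fullmatch(r"rain+", "rainnn"))
print(ungrouped_double, ungrouped_n)
```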

Exercise 4.2

Use {} to search example_protein.fasta for tryptophan (W) followed by tyrosine (Y), with an increasing number of leucine (L) residues in between. Start by searching for this pattern with three leucines (i.e. ‘WLLLY’), then reduce this to two, and one. Is this working as you expect? How would you search for a range of lengths of the leucine repeat? Try searching for any patterns with between one and four leucines. What happens if you leave out the upper limit inside the {}? Or the lower limit? Can you leave out both?
Solution
WL{3}Y
WL{2}Y
WL{1}Y
WL{1,4}Y

Specifying only one of these limits, while retaining the comma in the appropriate position, allows you to set only an upper or only a lower limit on the number of repeats. So,

GC[A]{3,}GC

matches ‘GCAAAGC’, ‘GCAAAAAAAAGC’, and any other string starting and ending with ‘GC’, with more than two ‘A’s in between. Conversely,

GCA{,4}GC

matches ‘GCGC’, ‘GCAGC’, ‘GCAAGC’, and so on, but won’t match ‘GCAAAAAAGC’ or any other combination containing a run of more than four ‘A’s between flanking ‘GC’ substrings. This syntax, where the minimum number of matches is left blank, is only valid in a limited number of programs/environments.
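
Python's re module is one environment that accepts both open-ended forms, so it can be used to try them out. Treat this as a sketch for that engine only, since your text editor may behave differently with the {,4} form.

```python
import re

at_least_three = re.compile(r"GC[A]{3,}GC")
at_most_four = re.compile(r"GCA{,4}GC")

lower_hits = [s for s in ["GCAAGC", "GCAAAGC", "GCAAAAAAAAGC"] if at_least_three.fullmatch(s)]
upper_hits = [s for s in ["GCGC", "GCAGC", "GCAAAAGC", "GCAAAAAAGC"] if at_most_four.fullmatch(s)]
print(lower_hits)
print(upper_hits)
```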

Key Points

? indicates that the preceding character or set should be treated as optional in this position.

* indicates that the preceding character or set should appear 0 or more times in this position.

+ indicates that the preceding character or set should appear 1 or more times in this position.

{2,4} indicates that the preceding character or set should appear at least twice but no more than four times in this position.

Capture Groups and References

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I reuse parts of the matched pattern when I replace it?

Objectives

Compose regular expressions that include capture groups.

Compose replacement strings that include characters captured in a regular expression.

One of the most common uses of regular expressions is in string replacement, or substitution, where the patterns found will be replaced by some other string - this usage is similar again to the ‘Find and Replace’ functionality of word processors that we mentioned in the introduction. Beyond the potential for “fuzzy” finding, a major advantage that regular expressions provide in this context is that they allow all or part(s) of the matched string to be re-used in the replacement. This means matched patterns can be rearranged, added to, and given new context, without the need for prior knowledge of the specific pattern that will be found.

To understand the difference between this and the functionality of a standard, literal ‘Find/Replace’ tool, consider the following example: you have been given a FASTA file of protein sequences in the following format

>GX3597KLM "Homo sapiens"
MQSYASAMLSVFNSDDYSPAVQENIPALRRSSSFLCTESCNSKYQCETGENSKGNVQDRVKRPMNAFIVW
SRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
NCSLLPADPASVLCSEVQLDNRLYRDDCTKATHSRMEHQLGHLPPINAASSPQQRDRYSHWTKL

>GK9113FGH "Homo sapiens"
MRPEGSLTYRVPERLRQGFCGVGRAAQALVCASAKEGTAFRMEAVQEGAAGVESEQAALGEEAVLLLDDI
MAEVEVVAEEEGLVERREEAQRAQQAVPGPGPMTPESAPEELLAVQVELEPVNAQARKAFSRQREKMERR
RKPHLDRRGAVIQSVPGFWANVIANHPQMSALITDEDEDMLSYMVSLEVGEEKHPVHLCKIMLFFRSNPY
FQNKVITKEYLVNITEYRASHSTPIEWYPDYEVEAYRRRHHNSSLNFFNWFSDHNFAGSNKIAEILCKDL
WRNPLQYYKRMKPPEEGTETSGDSQLLS

>GF7745PUP "Mus musculus"
MGRLVLLWGAAVFLLGGWMALGQGGAAEGVQIQIIYFNLETVQVTWNASKYSRTNLTFHYRFNGDEAYDQ
CTNYLLQEGHTSGCLLDAEQRDDILYFSIRNGTHPVFTASRWMVYYLKPSSPKHVRFSWHQDAVTVTCSD
LSYGDLLYEVQYRSPFDTEWQSKQENTCNVTIEGLDAEKCYSFWVRVKAMEDVYGPDTYPSDWSEVTCWQ
RGEIRDACAETPTPPKPKLSKFILISSLAILLMVSLLLLSLWKLWRVKKFLIPSVPDPKSIFPGLFEIHQ
GNFQEWITDTQNVAHLHKMAGAEQESGPEEPLVVQLAKTEAESPRMLDPQTEEKEASGGSLQLPHQPLQG
GDVVTIGGFTFVMNDRSYVAL

>GD8332BAG "Homo sapiens"
MEELTAFVSKSFDQKSKDGNGGGGGGGGKKDSITYREVLESGLARSRELGTSDSSLQDITEGGGHCPVHL
FKDHVDNDKEKLKEFGTARVAEGIYECKEKREDVKSEDEDGQTKLKQRRSRTNFTLEQLNELERLFDETH
YPDAFMREELSQRLGLSEARVQVWFQNRRAKCRKQENQMHKGVILGTANHLDACRVAPYVNMGALRMPFQ
QVQAQLQLEGVAHAHPHLHPHLAAHAPYLMFPPPPFGLPIASLAESASAAAVVAAAAKSNSKNSSIADLR
LKARKHAEALGL

>GK3091TFB "Mus musculus"
MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
VRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARAMCMKNLKLTIIIIIVSIVFIYIIVSPLCG
GFTWPSCVKK

>GK3141YRK "Pan troglodytes"
[ many more records follow... ]

In the sequence header lines (the lines starting with ‘>’), you have what looks like a unique identifier as the sequence ID string, and some additional information about species in the description. These are helpful for find/replace, as they allow us to search the large file for sequences with a particular identifier or that relate to a particular species. But there are two potential problems: first, the species string is wrapped in quotes, which can cause problems in downstream analysis; second, there are spaces in the sequence ID line, which can often result in the information after the first space being lost when the data is processed by a lot of software. To avoid these issues, what we’d really like to have is the species name - without quotes - attached to the ID string with underscores. This should make sure that all relevant information will be retained in any further processing steps.

We can match the species name in each header line, with the following regex:

[A-Z][a-z]+ [a-z]+

which is a good start. But now we need to consider how to make sure that we keep these species names during the replacement. In a standard “Find/Replace” operation, as well as specifying exactly the text that we want to find (as discussed before), we also specify exactly the text that we will replace it with. So, we can find “Pan troglodytes” and replace it with Pan_troglodytes, but if our file contains 100 different species names, we would need to manually perform 100 of these operations.

Using regular expressions, we can identify and store parts of a matched pattern for reuse in the replacement string. This means that we can find and replace all of those species names in a single operation, while maintaining the specific strings that we still need. To do this, we need to use capture groups.

As we saw in the previous chapter, a group is established in a regular expression with () parentheses. When matched, these groups can be referred to in the replacement string with \N, where N is an integer (1-9) signifying which group in the regex should be substituted in at the specified position. Captured groups are counted from left to right. This should be made clearer with an example below:

group_1, group_2, group_3

In this case, we want to rearrange the parts of the string above, so that we have ‘group_3’ first, followed by ‘group_1’ twice, and then ‘group_2’. Using what we’ve already learned, we can match each ‘group_x’ substring with one of the regexes below

group_\d

# or

group_[123]

# or

group_[0-9]

so we could match the whole string as follows:

group_\d, group_\d, group_\d

However, in order to rearrange the parts in a replacement string, we need to capture them individually with () parentheses. Using the following regex:

(group_\d), (group_\d), (group_\d)

we could create a rearranged replacement string with

\3, \1, \1, \2

# returns "group_3, group_1, group_1, group_2"

or, to wrap each group in quotation marks:

"\1", "\2", "\3"

# returns "group_1", "group_2", "group_3"
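
The same two substitutions can be reproduced with Python's re.sub, which also uses the \1, \2 notation in replacement strings. This is a sketch of one environment only, and the syntax in your text editor may differ.

```python
import re

original = "group_1, group_2, group_3"
pattern = r"(group_\d), (group_\d), (group_\d)"

# Reuse the captured groups in a new order, repeating \1.
rearranged = re.sub(pattern, r"\3, \1, \1, \2", original)
# Wrap each captured group in quotation marks.
quoted = re.sub(pattern, r'"\1", "\2", "\3"', original)
print(rearranged)
print(quoted)
```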

A Note About Capture Group References

Here we are using the \1, \2 notation for referencing captured groups, but you will often see $1, $2 etc used instead. This is the case when using regular expressions in Perl, and in many text editors e.g. Atom. Be wary of the different tokens and wildcards used for regexes in different environments - it can easily trip you up. You should always try out a regex replacement before running it on any large volume of data/text, and make sure that you have a backup so that you can easily revert any unintended changes!

Using this approach, you can capture up to nine different groups and re-use as many of them as you like in your replacement string. This is really helpful when reformatting large files e.g. to remove additional characters, which can otherwise be very fiddly and time-consuming.

Now we can return to our FASTA sequence header example introduced at the beginning of this section. How can we use capture groups to reach our aim of attaching the species names to the sequence IDs, removing the quotation marks along the way? First, we should identify the groups that we need to capture in the regex matching the species names:

[A-Z][a-z]+ [a-z]+ # matches the species names

([A-Z][a-z]+) ([a-z]+) # captures genus & species separately as \1 and \2

Great, but now we need to add the quotation marks and leading space to our regex, so that we can remove/replace them when performing the substitution for each matched line. Adding this punctuation to the regex, gives us:

\s"([A-Z][a-z]+) ([a-z]+)"

Note, that we don’t need to reuse the punctuation during replacement, so we haven’t wrapped them in () parentheses. Now we can define our replacement string to add underscores before and in between the genus and species names:

_\1_\2

Using this pair of regex and replacement string should convert the headers from the FASTA snippet above:

>GX3597KLM "Homo sapiens"
>GK9113FGH "Homo sapiens"
>GF7745PUP "Mus musculus"
>GD8332BAG "Homo sapiens"
>GK3091TFB "Mus musculus"
>GK3141YRK "Pan troglodytes"

to the desired format:

>GX3597KLM_Homo_sapiens
>GK9113FGH_Homo_sapiens
>GF7745PUP_Mus_musculus
>GD8332BAG_Homo_sapiens
>GK3091TFB_Mus_musculus
>GK3141YRK_Pan_troglodytes
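The conversion can be checked with a short Python sketch. This is only one way to run the substitution (a text editor's find/replace would do the same job); the header lines are copied from the snippet above, and the variable names are chosen for this example.

```python
import re

# Header lines from the FASTA snippet above.
headers = [
    '>GX3597KLM "Homo sapiens"',
    '>GK9113FGH "Homo sapiens"',
    '>GF7745PUP "Mus musculus"',
    '>GD8332BAG "Homo sapiens"',
    '>GK3091TFB "Mus musculus"',
    '>GK3141YRK "Pan troglodytes"',
]

# The regex and replacement string built up in this section.
pattern = r'\s"([A-Z][a-z]+) ([a-z]+)"'
replacement = r"_\1_\2"

converted = [re.sub(pattern, replacement, header) for header in headers]
for header in converted:
    print(header)
```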

Exercise 5.1

The file example_protein_malformed.fasta is missing the > character at the beginning of the headers. Use a capture group to add them.

Solution

Use the regex ^(ENSP\d.+) and substitute with >\1.

Note that we explicitly match a digit character \d. This is because E, N, S and P are all characters of the amino acid alphabet, and thus ENSP alone could wrongly match the start of a protein sequence line.
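The effect of the \d can be seen in the following Python sketch. The two lines of text are invented for illustration (they are not taken from example_protein_malformed.fasta): one header-like line, and one sequence line that happens to start with ENSP.

```python
import re

# Made-up content: a header missing its ">" and a sequence line
# that happens to begin with the letters ENSP.
text = "ENSP00000000001 hypothetical header\nENSPAKLMQRST\n"

# re.MULTILINE makes ^ match at the start of every line, not just the first.
fixed = re.sub(r"^(ENSP\d.+)", r">\1", text, flags=re.MULTILINE)
print(fixed)
```

Only the first line gains a > character, because the second line has a letter rather than a digit after ENSP.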

Exercise 5.2

The file file_paths.txt contains file paths of image files. The files are organised into folders based on vacations, but the files themselves have cryptic names. You want to prefix each file name with its vacation and move the files into a shared folder. At the end, the list should look like:
/Users/Jane/shared/vacation-pics/France-2015-IMG-06650.jpg
/Users/Jane/shared/vacation-pics/France-2015-IMG-06651.jpg
...
/Users/Jane/shared/vacation-pics/France-2017-IMG-08449.jpg
...
/Users/Jane/shared/vacation-pics/Greece-2016-IMG-07895.jpg
...
Use a capture group to transform the file paths accordingly.

Solution

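Use the regex Pictures\/([A-Za-z]+-\d{4})\/ and substitute with shared/vacation-pics/\1-. The capture group matches any folder named as a word followed by a hyphen and a four-digit year, so France-2015, France-2017 and Greece-2016 are all handled. Capturing the literal string France-2015 would only transform the paths from that one vacation.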
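A minimal Python sketch of this kind of substitution is shown below, using a general capture group ([A-Za-z]+-\d{4}) so that every vacation folder is handled. The input paths are assumptions, reconstructed from the desired output shown in the exercise rather than copied from file_paths.txt. In Python the / character does not need to be escaped.

```python
import re

# Assumed input paths (inferred from the desired output, not from the real file).
paths = [
    "/Users/Jane/Pictures/France-2015/IMG-06650.jpg",
    "/Users/Jane/Pictures/France-2017/IMG-08449.jpg",
    "/Users/Jane/Pictures/Greece-2016/IMG-07895.jpg",
]

# Capture the vacation folder name: letters, a hyphen, then a four-digit year.
pattern = r"Pictures/([A-Za-z]+-\d{4})/"
replacement = r"shared/vacation-pics/\1-"

renamed = [re.sub(pattern, replacement, path) for path in paths]
for path in renamed:
    print(path)
```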

Key Points

Capture groups are defined within () in a regular expression.

The left-most capture group in a regular expression is referred to with \1 in the replacement string, the next with \2, and so on.

Alternative Matches

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I define multiple possible strings that can be matched in a regular expression?

Objectives

Compose regular expressions to match several different strings.

So far, we’ve seen how to perform very specific matching (literal string matching, using wildcards to match the same thing multiple times, etc.) and how to match every substring that fits a particular pattern (e.g. all strings of at least six digits, every line starting or ending with a particular character, etc.). But what if we only want to match one of a limited set of alternatives? For example, consider again the FASTA file introduced in the previous chapter, a sample of which is reproduced below:

>GX3597KLM "Homo sapiens"
MQSYASAMLSVFNSDDYSPAVQENIPALRRSSSFLCTESCNSKYQCETGENSKGNVQDRVKRPMNAFIVW
SRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
NCSLLPADPASVLCSEVQLDNRLYRDDCTKATHSRMEHQLGHLPPINAASSPQQRDRYSHWTKL

>GK9113FGH "Homo sapiens"
MRPEGSLTYRVPERLRQGFCGVGRAAQALVCASAKEGTAFRMEAVQEGAAGVESEQAALGEEAVLLLDDI
MAEVEVVAEEEGLVERREEAQRAQQAVPGPGPMTPESAPEELLAVQVELEPVNAQARKAFSRQREKMERR
RKPHLDRRGAVIQSVPGFWANVIANHPQMSALITDEDEDMLSYMVSLEVGEEKHPVHLCKIMLFFRSNPY
FQNKVITKEYLVNITEYRASHSTPIEWYPDYEVEAYRRRHHNSSLNFFNWFSDHNFAGSNKIAEILCKDL
WRNPLQYYKRMKPPEEGTETSGDSQLLS

>GF7745PUP "Mus musculus"
MGRLVLLWGAAVFLLGGWMALGQGGAAEGVQIQIIYFNLETVQVTWNASKYSRTNLTFHYRFNGDEAYDQ
CTNYLLQEGHTSGCLLDAEQRDDILYFSIRNGTHPVFTASRWMVYYLKPSSPKHVRFSWHQDAVTVTCSD
LSYGDLLYEVQYRSPFDTEWQSKQENTCNVTIEGLDAEKCYSFWVRVKAMEDVYGPDTYPSDWSEVTCWQ
RGEIRDACAETPTPPKPKLSKFILISSLAILLMVSLLLLSLWKLWRVKKFLIPSVPDPKSIFPGLFEIHQ
GNFQEWITDTQNVAHLHKMAGAEQESGPEEPLVVQLAKTEAESPRMLDPQTEEKEASGGSLQLPHQPLQG
GDVVTIGGFTFVMNDRSYVAL

>GD8332BAG "Homo sapiens"
MEELTAFVSKSFDQKSKDGNGGGGGGGGKKDSITYREVLESGLARSRELGTSDSSLQDITEGGGHCPVHL
FKDHVDNDKEKLKEFGTARVAEGIYECKEKREDVKSEDEDGQTKLKQRRSRTNFTLEQLNELERLFDETH
YPDAFMREELSQRLGLSEARVQVWFQNRRAKCRKQENQMHKGVILGTANHLDACRVAPYVNMGALRMPFQ
QVQAQLQLEGVAHAHPHLHPHLAAHAPYLMFPPPPFGLPIASLAESASAAAVVAAAAKSNSKNSSIADLR
LKARKHAEALGL

>GK3091TFB "Mus musculus"
MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
VRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARAMCMKNLKLTIIIIIVSIVFIYIIVSPLCG
GFTWPSCVKK

>GK3141YRK "Pan troglodytes"
...

This FASTA file is huge - containing many tens of thousands of sequences. We would like to know how many sequences belonging to either Pan troglodytes *or* Homo sapiens are contained in the file. We could count the two species separately and add the results together, but what if we wanted the count for three species? Or ten? In this case, it might be better (and would almost certainly be faster) to perform a count using a set of optional matches in a single regular expression.

A set of options for matching in a regex can be defined using the | pipe symbol. For example, to match either ‘fish’ or ‘whale’, we can construct the following expression:

fish|whale

So, to match only ‘Homo sapiens’ or ‘Pan troglodytes’ in the FASTA file mentioned above, we can construct the regex:

Homo sapiens|Pan troglodytes

which we can then use with grep or some other program to get a count of the matches. You can use this approach to selectively extract lines from a larger file, while preserving their order relative to each other. For example, this can be useful when subsetting a GFF annotation file based on feature type, source, etc.
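As a sketch of such a count, the Python snippet below applies the regex to the six header lines from the sample above (sequence lines omitted). The variable names are chosen for this example only.

```python
import re

# Header lines from the FASTA sample above.
headers = [
    '>GX3597KLM "Homo sapiens"',
    '>GK9113FGH "Homo sapiens"',
    '>GF7745PUP "Mus musculus"',
    '>GD8332BAG "Homo sapiens"',
    '>GK3091TFB "Mus musculus"',
    '>GK3141YRK "Pan troglodytes"',
]

# Match either species name in a single pass.
species = re.compile(r"Homo sapiens|Pan troglodytes")

# Keep the matching lines in their original order, then count them.
matching = [header for header in headers if species.search(header)]
print(len(matching))  # 4
```

With grep, the -E option enables this kind of alternation and -c reports the number of matching lines instead of printing them.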

Exercise 6.1

If you study the contents of the file person_info.csv, you will see that some variation exists in the address formatting. For example, some of the addresses use ‘First Street’ while others use ‘1st Street’ or some other variation. Can you find all the lines containing information on a person living on 1st/First Street/street, using a single regular expression?
Solution

Here are two possible solutions:
[Ff]irst [Ss]treet|1st [Ss]treet
(Fir|fir|1)st [Ss]treet
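To check that the two expressions behave the same way, they can be compared on a few test lines in Python. The lines below are invented for this sketch and do not come from person_info.csv.

```python
import re

# Made-up test lines covering the variations mentioned in the exercise.
lines = [
    "Jane Doe,12 First Street",
    "John Roe,4 1st street",
    "Ann Poe,9 first Street",
    "Bob Moe,77 Second Street",
]

solution_a = re.compile(r"[Ff]irst [Ss]treet|1st [Ss]treet")
solution_b = re.compile(r"(Fir|fir|1)st [Ss]treet")

hits_a = [line for line in lines if solution_a.search(line)]
hits_b = [line for line in lines if solution_b.search(line)]

print(hits_a == hits_b)  # True
print(len(hits_a))       # 3
```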

Exercise 6.2

The FASTQ file example.fastq contains sequence reads with quality scores, in the format
@sequence header line with barcode sequence
sequence
+
quality scores for basecalls
Unfortunately, the barcode sequences in the header lines are wrong, and the barcodes are still attached to the front of the sequences. There are three barcodes that we are interested in: AGGCCA, TGTGAC, and GCTGAC.

a) how many reads are there in the file for each of these barcodes?

Solution

AGGCCA: 25

TGTGAC: 29

GCTGAC: 19

b) write a regular expression that will find these barcodes, and a replacement string that will remove them from the start of the sequences in which they are found

Solution

regex: ^(AGGCCA|TGTGAC|GCTGAC)([ACGTN]+\n\+\n) replacement string: \2 ($2)

c) of course, the format of the file means that you should probably remove the quality scores associated with those sequence positions too. Rewrite your regex so that the barcode sequence AND its corresponding quality scores (i.e. the first six characters on the sequence and quality lines) are removed.

Solution

regex: ^(AGGCCA|TGTGAC|GCTGAC)([ACGTN]+\n\+\n).{6} replacement string: \2 ($2)

d) finally, can you build on the regex and replacement string from part c), to replace the incorrect index sequences in the header lines with the barcodes for each relevant sequence record?

Solution

regex: ^(@VFI-SQ316.+:)GCGCTG\n(AGGCCA|TGTGAC|GCTGAC)([ACGTN]+\n\+\n).{6} replacement string: \1\2\n\3
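The solution to part d) can be tried out in Python on a small made-up example. The two records below are invented for this sketch (the header fields after @VFI-SQ316 are assumptions and not taken from example.fastq): the first read starts with the barcode AGGCCA, the second does not start with any of the three barcodes.

```python
import re

# Two hypothetical FASTQ records, each with the incorrect index GCGCTG.
fastq = (
    "@VFI-SQ316:1:1101:GCGCTG\n"
    "AGGCCAACGTTGCA\n"
    "+\n"
    "IIIIIIHHHHHHHH\n"
    "@VFI-SQ316:1:1102:GCGCTG\n"
    "TTTTTTACGTACGT\n"
    "+\n"
    "IIIIIIIIIIIIII\n"
)

# \1 = header up to the last colon, \2 = barcode, \3 = rest of the read plus
# the "+" line. The trailing .{6} consumes the first six quality scores.
pattern = r"^(@VFI-SQ316.+:)GCGCTG\n(AGGCCA|TGTGAC|GCTGAC)([ACGTN]+\n\+\n).{6}"
replacement = r"\1\2\n\3"

# re.MULTILINE lets ^ match at the start of each line in the string.
cleaned = re.sub(pattern, replacement, fastq, flags=re.MULTILINE)
print(cleaned)
```

In the output, the first record has AGGCCA in its header, and both the barcode and the first six quality scores are gone. The second record is left unchanged.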

Key Points

Alternative strings to match can be combined with |.
