Search term selection with litsearchr
Overview
Teaching: 30 min
Exercises: 20 min
Questions
How does litsearchr select important keywords?
Objectives
Explain the method litsearchr uses to identify important terms and work through this process.
How litsearchr identifies useful keywords
Because not all of the keywords are relevant (and you probably do not want to run a search with several thousand terms!), the next step is to determine which keywords are important to the topic of the systematic review and manually review those suggestions.
To do this, litsearchr puts all of the terms into a keyword co-occurrence network. If two terms both appear in the title/abstract/keywords of an article, they are marked as co-occurring. Terms that frequently co-occur with other terms are considered important to the topic because they are well-connected, especially if they are connected to other terms that are themselves well-connected. Building the keyword co-occurrence network lets litsearchr automatically identify which terms are probably important and suggest them as potential keywords.
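To make the idea concrete, here is a tiny toy illustration (not part of the litsearchr workflow) using two made-up article texts and two hypothetical terms. The terms co-occur in the first text but not the second, so that pair would get one co-occurrence in the network.
# two made-up article texts (hypothetical, for illustration only)
docs <- c("alcohol advertising and alcohol consumption in young people",
          "physical activity and public health")
# "alcohol" and "young people" co-occur only in the first text
grepl("alcohol", docs) & grepl("young people", docs)
## [1]  TRUE FALSE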
Assessing how important terms are
First, we will create a document-feature matrix. litsearchr takes the list of potential keywords and determines which articles each appears in. We once again use the my_text object we created earlier by pasting together titles, abstracts, and keywords. These are the elements in which we want to look for occurrences of our features, the raked_keywords. This code creates a large simple triplet matrix, so it can take a little while to run.
naive_dfm <- create_dfm(
elements = my_text,
features = raked_keywords
)
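It can be worth a quick sanity check before moving on; the exact dimensions will depend on your dataset, but the rows should correspond to the articles and the columns to the candidate keywords.
# a quick check: rows are articles, columns are candidate keywords
dim(naive_dfm)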
Now that we have a document-feature matrix, we can convert it to a network graph, which makes it easy to visualize relationships between terms and to assess their importance. We can restrict how many studies a term must appear in with min_studies, or how many times it must appear in total with min_occ; here we have left both at the same value we used when identifying possible terms (2).
naive_graph <- create_network(
search_dfm = as.matrix(naive_dfm),
min_studies = 2,
min_occ = 2
)
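If you want a quick look at the network before going further, printing the object shows a summary of its size; the exact vertex and edge counts will depend on your dataset.
# print a summary of the network; counts will vary with your data
naive_graph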
Check-in with helpers
For virtual lessons: head to breakout rooms to check in with helpers.
Now that we have our network, we need to figure out which terms are the most important. One of the interesting things about keyword co-occurrence networks (and networks in general) is that their importance metrics follow a power law: there are many terms with low importance and a few terms that are very important. To see what this looks like, we can extract the node strength from the network using the igraph package, sort the terms in order of their strengths, and plot the node strengths.
# load the igraph library
library(igraph)
## Attaching package: ‘igraph’
## The following objects are masked from ‘package:stats’:
## decompose, spectrum
## The following object is masked from ‘package:base’:
## union
Next, we want to get the node strength for the graph and sort the strengths in decreasing order, so that the most important terms are at the top.
strengths <- sort(strength(naive_graph), decreasing=TRUE)
# then we can see the most important terms
head(strengths, 10)
## alcohol consumption alcohol advertising
## 1556 1106
## public health alcohol drinking
## 1046 1026
## physical activity young people
## 993 896
## school students drinking patterns
## 839 820
## united states alcohol drinking patterns
## 722 660
# most of them look pretty relevant to the topic, which is good
To see what we mean by the network node strengths following a power law, let's plot our sorted vector of strengths. We will use type="l" to plot a line, and las=1 to rotate the y-axis labels so they are horizontal and easy to read.
plot(strengths, type="l", las=1)
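As an optional check on the power-law claim, a heavy-tailed distribution like this looks roughly linear when both axes are on a log scale (this assumes all strengths are positive, since log(0) is undefined).
# optional: a power-law-like distribution is roughly linear on log-log axes
plot(strengths, type="l", las=1, log="xy")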
Check-in with helpers
For virtual lessons: head to breakout rooms to check in with helpers.
Selecting important terms
We can use this power law relationship as a way to identify important terms. We want all the terms above some cutoff in importance, and do not want to consider the terms in the tail of the distribution.
To find a cutoff in node importance, we can use the aptly named find_cutoff function. With the "cumulative" method, the function returns a cutoff strength such that the smallest possible set of terms above it accounts for 30% of the total importance in the network.
cutoff <- find_cutoff(
naive_graph,
method = "cumulative",
percent = .30,
imp_method = "strength"
)
print(cutoff)
## [1] 162
Exercise: Changing cutoff stringency
In some cases, we may want to be more or less flexible with where the cutoff is placed for term importance. Using the code from the lesson as a template, find a new cutoff that returns 80% of the strength of the network.
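One possible solution, assuming only the percent argument needs to change (the object name cutoff_80 is just a suggestion):
cutoff_80 <- find_cutoff(
  naive_graph,
  method = "cumulative",
  percent = .80,
  imp_method = "strength"
)
print(cutoff_80)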
Next, we want to reduce our graph (reduce_graph) to only the terms whose node strength is above our cutoff value. We can then get the keywords (get_keywords) that are still included in the network, which gives us a vector of important keywords to review manually and decide whether or not each should be included in the search terms.
reduced_graph <- reduce_graph(naive_graph, cutoff_strength = cutoff)
search_terms <- get_keywords(reduced_graph)
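It can help to glance at the first few suggested terms before starting the manual review; the exact terms returned will depend on your dataset.
# look at the first few suggested terms (results will vary with your data)
head(search_terms)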
Challenge: Counting search terms
How many terms has litsearchr suggested we review manually? Hint: length will be useful.
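One possible solution:
# count the number of suggested search terms
length(search_terms)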
Bonus Challenge
How many more search terms are retrieved when using a threshold of 50% instead of 30% of the network strength?
Hint: you will need to find cutoff values for both 50% and 30%, create two new reduced graphs with the different cutoff strengths, and check the length of the resulting search terms for each graph.
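A sketch of one approach, reusing the functions from this lesson (the object names are just suggestions):
# find cutoffs for 50% and 30% of the total network strength
cutoff_50 <- find_cutoff(naive_graph, method = "cumulative",
                         percent = .50, imp_method = "strength")
cutoff_30 <- find_cutoff(naive_graph, method = "cumulative",
                         percent = .30, imp_method = "strength")
# reduce the graph with each cutoff and extract the remaining keywords
terms_50 <- get_keywords(reduce_graph(naive_graph, cutoff_strength = cutoff_50))
terms_30 <- get_keywords(reduce_graph(naive_graph, cutoff_strength = cutoff_30))
# the difference is how many more terms the 50% threshold retrieves
length(terms_50) - length(terms_30)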
Key Points
litsearchr helps identify important search terms related to the topic of a systematic review.