Last updated on 2025-11-24 | Edit this page

Estimated time: 120 minutes

Overview

Questions

What is Natural Language Processing?
What are some common applications of NLP?
What makes text different from other data?
Why not just learn Large Language Models?
What linguistic properties should we consider when dealing with texts?

Objectives

Define Natural Language Processing
Show the most relevant NLP tasks and applications in practice
Learn how to handle Linguistic Data and how is Linguistics relevant to NLP

What is NLP?

Natural language processing (NLP) is an area of research and application that focuses on making human languages processable for computers, so that they can perform useful tasks. It is therefore not a single method, but a collection of techniques that help us deal with linguistic inputs. The range of techniques spans simple word counts, to Machine Learning (ML) methods, all the way up to complex Deep Learning (DL) architectures.

We use the term “natural language”, as opposed to “artificial language” such as programming languages, which are by design constructed to be easily formalized into machine-readable instructions. In contrast to programming languages, natural languages are complex, ambiguous, and heavily context-dependent, making them challenging for computers to process. To complicate matters, there is not only a single human language. More than 7000 languages are spoken around the world, each with its own grammar, vocabulary, and cultural context.

In this course we will mainly focus on written language, specifically written English, we leave out audio and speech, as they require a different kind of input processing. But consider that we use English only as a convenience so we can address the technical aspects of processing textual data. While ideally most of the concepts from NLP apply to most languages, one should always be aware that certain languages require different approaches to solve seemingly similar problems. We would like to encourage the usage of NLP in other less widely known languages, especially if it is a minority language. You can read more about this topic in this blogpost.

Challenge

NLP in the real world

Name three to five tools/products that you use on a daily basis and that you think leverage NLP techniques. To do this exercise you may make use of the Web.

Show me the solution

These are some of the most popular NLP-based products that we use on a daily basis:

Agentic Chatbots (ChatGPT, Perplexity)
Voice-based assistants (e.g., Alexa, Siri, Cortana)
Machine translation (e.g., Google translate, DeepL, Amazon translate)
Search engines (e.g., Google, Bing, DuckDuckGo)
Keyboard autocompletion on smartphones
Spam filtering
Spell and grammar checking apps
Customer care chatbots
Text summarization tools (e.g., news aggregators)
Sentiment analysis tools (e.g., social media monitoring)

We can already find differences between languages in the most basic step for processing text. Take the problem of segmenting text into meaningful units, most of the times these units are words, in NLP we call this task tokenization. A naive approach is to obtain individual words by splitting text by spaces, as it seems obvious that we always separate words with spaces. Just as human beings break up sentences into words, phrases and other units in order to learn about grammar and other structures of a language, NLP techniques achieve a similar goal through tokenization. Let’s see how can we segment or tokenize a sentence in English:

PYTHON

english_sentence = "Tokenization isn't always trivial."
english_words = english_sentence.split(" ")
print(english_words)
print(len(english_words))

OUTPUT

['Tokenization', "isn't", 'always', 'trivial.']
4

The words are mostly well separated, however we do not get fully formed words (we have punctuation with the period after “trivial” and also special cases such as the abbreviation of “is not” into “isn’t”). But at least we get a rough count of the number of words present in the sentence. Let’s now look at the same example in Chinese:

PYTHON

# Chinese Translation of "Tokenization is not always trivial"
chinese_sentence = "标记化并不总是那么简单" 

chinese_words = chinese_sentence.split(" ")
print(chinese_words)
print(len(chinese_words))

OUTPUT

['标记化并不总是那么简单']
1

The same example however did not work in Chinese, because Chinese does not use spaces to separate words. This is an example of how the idiosyncrasies of human language affects how we can process them with computers. We therefore need to use a tokenizer specifically designed for Chinese to obtain the list of well-formed words in the text. Here we use a “pre-trained” tokenizer called MicroTokenizer, which uses a dictionary-based approach to correctly identify the distinct words:

PYTHON

import MicroTokenizer  # A popular Chinese text segmentation library
chinese_sentence = "标记化并不总是那么简单"
chinese_words = MicroTokenizer.cut(chinese_sentence)
print(chinese_words)
# ['mark', 'transform', 'and', 'no', 'always', 'so', 'simple']
print(len(chinese_words))  # Output: 7

OUTPUT

['标记', '化', '并', '不', '总是', '那么', '简单']
7

We can trust that the output is valid because we are using a verified library - jieba, even though we don’t speak Chinese. Another interesting aspect is that the Chinese sentence has more words than the English one, even though they convey the same meaning. This shows the complexity of dealing with more than one language at a time, as is the case in task such as Machine Translation (using computers to translate speech or text from one human language to another).

Natural Language Processing deals with the challenges of correctly processing and generating text in any language. This can be as simple as counting word frequencies to detect different writing styles, using statistical methods to classify texts into different categories, or using deep neural networks to generate human-like text by exploiting word co-occurrences in large amounts of texts.

Why should we learn NLP Fundamentals?

In the past decade, NLP has evolved significantly, especially in the field of deep learning, to the point that it has become embedded in our daily lives, one just needs to look at the term Large Language Models (LLMs), the latest generation of NLP models, which is now ubiquitous in news media and tech products we use on a daily basis.

The term LLM now is often (and wrongly) used as a synonym of Artificial Intelligence. We could therefore think that today we just need to learn how to manipulate LLMs in order to fulfill our research goals involving textual data. The truth is that Language Modeling has always been part of the core tasks of NLP, therefore, by learning NLP you will understand better where are the main ideas behind LLMs coming from.

NLP is an interdisciplinary field, and LLMs are just a subset of it

LLM is a blanket term for an assembly of large neural networks that are trained on vast amounts of text data with the objective of optimizing for language modeling. Once they are trained, they are used to generate human-like text or fine-tunned to perform much more advanced tasks. Indeed, the surprising and fascinating properties that emerge from training models at this scale allows us to solve different complex tasks such as answer elaborate questions, translate languages, solve complex problems, generate narratives that emulate reasoning, and many more, all of this with a single tool.

It is important, however, to pay attention to what is happening behind the scenes in order to be able trace sources of errors and biases that get hidden in the complexity of these models. The purpose of this course is precisely to take a step back, and understand that:

There are a wide variety of tools available, beyond LLMs, that do not require so much computing power
Sometimes a much simpler method than an LLM is available that can solve our problem at hand
If we learn how previous approaches to solve linguistic problems were designed, we can better understand the limitations of LLMs and how to use them effectively
LLMs excel at confidently delivering information, without any regards for correctness. This calls for a careful design of evaluation metrics that give us a better understanding of the quality of the generated content.

Let’s go back to our problem of segmenting text and see what ChatGPT has to say about tokenizing Chinese text:

We got what sounds like a straightforward confident answer. However, it is not clear how the model arrived at this solution. Second, we do not know whether the solution is correct or not. In this case ChatGPT made some assumptions for us, such as choosing a specific kind of tokenizer to give the answer, and since we do not speak the language, we do not know if this is indeed the best approach to tokenize Chinese text. If we understand the concept of Token (which we will today!), then we can be more informed about the quality of the answer, whether it is useful to us, and therefore make a better use of the model.

And by the way, ChatGPT was almost correct, in the specific case of the gpt-4 tokenizer, the model will return 12 tokens (not 11!) for the given Chinese sentence.

We can also argue if the statement “Chinese is generally tokenized character by character” is an overstatement or not. In any case, the real question here is: Are we ok with almost correct answers? Please note that this is not a call to avoid using LLM’s but a call for a careful consideration of usage and more importantly, an attempt to explain the mechanisms behind via NLP concepts.

Language as Data

From a more technical perspective, NLP focuses on applying advanced statistical techniques to linguistic data. This is a key factor, since we need a structured dataset with a well defined set of features in order to manipulate it numerically. Your first task as an NLP practitioner is to understand what aspects of textual data are relevant for your application and apply techniques to systematically extract meaningful features from unstructured data (if using statistics or Machine Learning) or choose an appropriate neural architecture (if using Deep Learning) that can help solve our problem at hand.

What is a word?

When dealing with language our basic data unit is usually a word. We deal with sequences of words and with how they relate to each other to generate meaning in text pieces. Thus, our first step will be to load a text file and provide it with structure by splitting it into valid words (tokenization)!

Callout

Token vs Word

For simplicity, in the rest of the course we will use the terms “word” and “token” interchangeably, but as we just saw they do not always have the same granularity. Originally the concept of token comprised dictionary words, numeric symbols and punctuation. Nowadays, tokenization has also evolved and became an optimization task on its own (How can we segment text in a way that neural networks learn optimally from text?). Tokenizers allow one to reconstruct or revert back to the original pre-tokenized form of tokens or words, hence we can afford to use token and word as synonyms. If you are curious, you can visualize how different state-of-the-art tokenizers split text in this WebApp

Let’s open a file, read it into a string and split it by spaces. We will print the original text and the list of “words” to see how they look:

PYTHON

with open("data/84_frankenstein_clean.txt") as f:
  text = f.read()

print(text[:100])
print("Length:", len(text))

proto_tokens = text.split()
print(proto_tokens[:40])
print(len(proto_tokens))

OUTPUT

Letter 1 St. Petersburgh, Dec. 11th, 17-- TO Mrs. Saville, England You will rejoice to hear that no disaster has accompanied the commencement of an en
Length: 417931

Proto-Tokens:
['Letter', '1', 'St.', 'Petersburgh,', 'Dec.', '11th,', '17--', 'TO', 'Mrs.', 'Saville,', 'England', 'You', 'will', 'rejoice', 'to', 'hear', 'that', 'no', 'disaster', 'has', 'accompanied', 'the', 'commencement', 'of', 'an', 'enterprise', 'which', 'you', 'have', 'regarded', 'with', 'such', 'evil', 'forebodings.', 'I', 'arrived', 'here', 'yesterday,', 'and', 'my']
74942

Splitting by white space is possible but needs several extra steps to separate out punctuation appropriately. A more sophisticated approach is to use the spaCy library to segment the text into human-readable tokens. First we will download the pre-trained model, in this case we only need the small English version:

PYTHON

! python -m spacy download en_core_web_sm

This is a model that spaCy already trained for us on a subset of web English data. Hence, the model already “knows” how to tokenize into English words. When the model processes a string, it does not only do the splitting for us but already provides more advanced linguistic properties of the tokens (such as part-of-speech tags, or named entities). You can check more languages and models in the spacy documentation

Callout

Pre-trained Models and Fine-tuning

These two terms frequently arise in discussions of NLP. The notion of pre-trained comes from Machine Learning and describes a model that has already been optimized on relevant data for a given task. Such a model can typically be loaded and applied directly to new datasets, often working “out of the box.” without need of further refinement. Ideally, publicly released pre-trained models have undergone rigorous testing for both generalization and output quality on different textual data that it was intended to be used on. Nevertheless, it remains essential to carefully review the evaluation methods used before relying on them in practice. It is also recommended that you perform your own evaluation of the model on text that you intend to use it on.

Sometimes a pre-trained model is of good quality, but it does not fit the nuances of our specific dataset. For example, the model was trained on newspaper articles but you are interested in poetry. In this case, it is common to perform fine-tuning, this means that instead of training your own model from scratch, you start with the knowledge obtained in the pre-trained model and adjust it (fine-tune it) to work optimally with your specific data. If this is done well it leads to increased performance in the specific task you are trying to solve. The advantage of fine-tuning is that you often do not need a large amount of data to improve the results, hence the popularity of the technique.

Let’s now import the model and use it to parse our document:

PYTHON

import spacy

nlp = spacy.load("en_core_web_sm") # we load the small English model for efficiency

doc = nlp(text) # Doc is a python object with several methods to retrieve linguistic properties

# SpaCy-Tokens
tokens = [token.text for token in doc] # Note that spacy tokens are also python objects 
print(tokens[:40])
print(len(tokens))

OUTPUT

['Letter', '1', 'St.', 'Petersburgh', ',', 'Dec.', '11th', ',', '17', '-', '-', 'TO', 'Mrs.', 'Saville', ',', 'England', 'You', 'will', 'rejoice', 'to', 'hear', 'that', 'no', 'disaster', 'has', 'accompanied', 'the', 'commencement', 'of', 'an', 'enterprise', 'which', 'you', 'have', 'regarded', 'with', 'such', 'evil', 'forebodings', '.']
85713

The differences look subtle at the beginning, but if we carefully inspect the way spaCy splits the text, we can see the advantage of using a specialized tokenizer. There are also several useful features that spaCy provides us with. For example, we can choose to extract only symbols, or only alphanumerical tokens, and more advanced linguistic properties, for example we can remove punctuation and only keep alphanumerical tokens:

PYTHON

only_words = [token for token in doc if token.is_alpha]  # Only alphanumerical tokens
print(only_words[:50])
print(len(only_words))

OUTPUT

[Letter, Petersburgh, TO, Saville, England, You, will, rejoice, to, hear, that, no, disaster, has, accompanied, the, commencement, of, an, enterprise, which, you, have, regarded, with, such, evil, forebodings, I, arrived, here, yesterday, and, my, first, task, is, to, assure, my, dear, sister, of, my, welfare, and, increasing, confidence, in, the]
75062

or keep only the verbs from our text:

PYTHON

only_verbs = [token for token in doc if token.pos_ == "VERB"]  # Only verbs
print(only_verbs[:10])
print(len(only_verbs))

OUTPUT

[rejoice, hear, accompanied, regarded, arrived, assure, increasing, walk, feel, braces]
10148

SpaCy also predicts the sentences under the hood for us. It might seem trivial to you as a human reader to recognize where a sentence begins and ends but for a machine, just like finding words, finding sentences is a task on its own, for which sentence-segmentation models exist. In the case of Spacy, we can access the sentences like this:

PYTHON

sentences = [sent.text for sent in doc.sents] # Sentences are also python objects
print(sentences[:5])
print(len(sentences))

OUTPUT

['Letter 1 St. Petersburgh, Dec. 11th, 17-- TO Mrs. Saville, England You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings.', 'I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking.', 'I am already far north of London, and as I walk in the streets of Petersburgh, I feel a cold northern breeze play upon my cheeks, which braces my nerves and fills me with delight.', 'Do you understand this feeling?', 'This breeze, which has traveled from the regions towards which I am advancing, gives me a foretaste of those icy climes.']
3317

We can also see what named entities the model predicted:

PYTHON

print(len(doc.ents))
for ent in doc.ents[:5]:
    print(ent.label_, ent.text)

OUTPUT

1713
DATE Dec. 11th
CARDINAL 17
PERSON Saville
GPE England
DATE yesterday

These are just basic tests to demonstrate how you can immediately process the structure of text using existing NLP libraries. The spaCy models we used are simpler relative to state of the art approaches. So the more complex the input text and task, the more errors are likely to appear when using such models. The biggest advantage of using these existing libraries is that they help you transform unstructured plain text files into structured data that you can manipulate later for your own goals such as training language models.

Challenge

Computing stats with spaCy

Use the spaCy Doc object to compute an aggregate statistic about the Frankenstein book. HINT: Use the python set, dictionary or Counter objects to hold the accumulative counts. For example: - Give the list of the 20 most common verbs in the book - How many different Places are identified in the book? (Label = GPE) - How many different entity categories are in the book? - Who are the 10 most mentioned PERSONs in the book? - Or any other similar aggregate you want…

Show me the solution

Let’s describe the solution to obtain all the different entity categories. For that we should iterate the whole text and keep a python set with all the seen labels.

PYTHON

entity_types = set()

for ent in doc.ents:
    entity_types.add(ent.label_)

print(entity_types)
print(len(entity_types))

OUTPUT

{'CARDINAL', 'GPE', 'WORK_OF_ART', 'ORDINAL', 'DATE', 'LAW', 'PRODUCT', 'QUANTITY', 'ORG', 'TIME', 'PERSON', 'LOC', 'LANGUAGE', 'FAC', 'NORP'}
15

NLP tasks

The previous exercise shows that a great deal of NLP techniques are embedded in our daily life. Indeed NLP is an important component in a wide range of software applications that we use in our day to day activities.

There are several ways to describe the tasks that NLP solves. From the Machine Learning perspective, we have:

Unsupervised tasks: exploiting existing patterns from large amounts of text.

Supervised tasks: learning to classify texts given a labeled set of examples

The Deep Learning perspective usually involves the selection of the right model among different neural network architectures to tackle an NLP task, such as:

Multi-layer Perceptron
Recurrent Neural Network
Convolutional Neural Network
Long-Short Term Memory Networks (LSTMs)
Transformer (including LLMs!)

Regardless of the chosen method, below we show one possible taxonomy of NLP tasks. The tasks are grouped together with some of their most prominent applications. This is definitely a non-exhaustive list, as in reality there are hundreds of them, but it is a good start:

Text Classification: Assign one or more labels to a given piece of text. This text is usually referred to as a document and in our context this can be a sentence, a paragraph, a book chapter, etc…
- Language Identification: determining the language in which a particular input text is written.
- Spam Filtering: classifying emails into spam or not spam based on their content.
- Authorship Attribution: determining the author of a text based on its style and content (based on the assumption that each author has a unique writing style).
- Sentiment Analysis: classifying text into positive, negative or neutral sentiment. For example, in the sentence “I love this product!”, the model would classify it as positive sentiment.
Token Classification: The task of individually assigning one label to each word in a document. This is a one-to-one mapping; however, because words do not occur in isolation and their meaning depend on the sequence of words to the left or the right of them, this is also called Word-In-Context Classification or Sequence Labeling and usually involves syntactic and semantic analysis.
- Part-Of-Speech Tagging: is the task of assigning a part-of-speech label (e.g., noun, verb, adjective) to each word in a sentence.
- Chunking: splitting a running text into “chunks” of words that together represent a meaningful unit: phrases, sentences, paragraphs, etc.
- Word Sense Disambiguation: based on the context what does a word mean (think of “book” in “I read a book.” vs “I want to book a flight.”)
- Named Entity Recognition: recognize world entities in text, e.g. Persons, Locations, Book Titles, or many others. For example “Mary Shelley” is a person, “Frankenstein or the Modern Prometheus” is a book, etc.
- Semantic Role Labeling: the task of finding out “Who did what to whom?” in a sentence: information from events such as agents, participants, circumstances, subject-verb-object triples etc.
- Relation Extraction: the task of identifying named relationships between entities in a text, e.g. “Apple is based in California” has the relation (Apple, based_in, California).
- Co-reference Resolution: the task of determining which words refer to the same entity in a text, e.g. “Mary is a doctor. She works at the hospital.” Here “She” refers to “Mary”.
- Entity Linking: the task of disambiguation of named entities in a text, linking them to their corresponding entries in a knowledge base, e.g. Mary Shelley’s biography in Wikipedia.
Language Modeling: Given a sequence of words, the model predicts the next word. For example, in the sentence “The capital of France is _____”, the model should predict “Paris” based on the context. This task was initially useful for building solutions that require speech and optical character recognition (even handwriting), language translation and spelling correction. Nowadays this has scaled up to the LLMs that we know. A byproduct of pre-trained Language Modeling is the vectorized representation of texts which allows to perform specific tasks such as:
- Text Similarity: The task of determining how similar two pieces of text are.
- Plagiarism detection: determining whether a piece of text, B, is close enough to another known piece of text, A, which increases the likelihood that it was plagiarized.
- Document clustering: grouping similar texts together based on their content.
- Topic modelling: a specific instance of clustering, here we automatically identify abstract “topics” that occur in a set of documents, where each topic is represented as a cluster of words that frequently appear together.
- Information Retrieval: this is the task of finding relevant information or documents from a large collection of unstructured data based on user’s query, e.g., “What’s the best restaurant near me?”.
Text Generation: the task of generating text based on a given input. This is usually done by generating the output word by word, conditioned on both the input and the output so far. The difference with Language Modeling is that for generation there are higher-level generation objectives such as:
- Machine Translation: translating text from one language to another, e.g., “Hello” in English to “Que tal” in Spanish.
- Summarization: generating a concise summary of a longer text. It can be abstractive (generating new sentences that capture the main ideas of the original text) but also extractive (selecting important sentences from the original text).
- Paraphrasing: generating a new sentence that conveys the same meaning as the original sentence, e.g., “The cat is on the mat.” to “The mat has a cat on it.”.
- Question Answering: given a question and a context, the model generates an answer. For example, given the question “What is the capital of France?” and the Wikipedia article about France as the context, the model should answer “Paris”. This task can be approached as a text classification problem (where the answer is one of the predefined options) or as a generative task (where the model generates the answer from scratch).
- Conversational Agent (ChatBot): Building a system that interacts with a user via natural language, e.g., “What’s the weather today, Siri?”. These agents are widely used to improve user experience in customer service, personal assistance and many other domains.

For the purposes of this episode, we will focus on supervised learning tasks and we will emphasize how the Transformer architecture is used to tackle some of them.

Challenge

Inputs and Outputs

Look at the NLP Task taxonomy described above and write down a couple of examples of (Input, Output) instance pairs that you would need in order to train a supervised model for your chosen task.

Show me the solution

Example: the task of Conversational agent. Here are 3 instances to provide supervision for a model:

Input: “Hello, how are you?” Output: “I am fine thanks!”

Input: “Do you know at what time is the World Cup final today?” Output: “Yes, the World Cup final will be at 6pm CET”

Input: “What color is my shirt?” Output: “Sorry, I am unable to see what you are wearing.”

A Primer on Linguistics

Natural language exhibits a set of properties that make it more challenging to process than other types of data such as tables, spreadsheets or time series. Language is hard to process because it is compositional, ambiguous, discrete and sparse.

Compositionality

The basic elements of written languages are characters, a sequence of characters form words, and words in turn denote objects, concepts, events, actions and ideas (Goldberg, 2016). Subsequently words form phrases and sentences which are used in communication and depend on the context in which they are used. We as humans derive the meaning of utterances from interpreting contextual information that is present at different levels at the same time:

The first two levels refer to spoken language only, and the other four levels are present in both speech and text. Because in principle machines do not have access to the same levels of information that we do (they can only have independent audio, textual or visual inputs), we need to come up with clever methods to overcome this significant limitation. Knowing the levels of language is important so we consider what kind of problems we are facing when attempting to solve our NLP task at hand.

Ambiguity

The disambiguation of meaning is usually a by-product of the context in which utterances are expressed and also the historic accumulation of interactions which are transmitted across generations (think for instance to idioms – these are usually meaningless phrases that acquire meaning only if situated within their historical and societal context). These characteristics make NLP a particularly challenging field to work in.

We cannot expect a machine to process human language and simply understand it as it is. We need a systematic, scientific approach to deal with it. It’s within this premise that the field of NLP is born, primarily interested in converting the building blocks of human/natural language into something that a machine can understand.

The image below shows how the levels of language relate to a few NLP applications:

Diagram showing building blocks of language

Challenge

Levels of ambiguity

Discuss what the following sentences mean. What level of ambiguity do they represent?:

“The door is unlockable from the inside.” vs “Unfortunately, the cabinet is unlockable, so we can’t secure it”
“I saw the cat with the stripes” vs “I saw the cat with the telescope”
“Please don’t drive the cat to the vet!” vs “Please don’t drive the car tomorrow!”
“I never said she stole my money.” (re-write this sentence multiple times and each time write a different word in italics).

Show me the solution

This is why the previous statements were difficult:

“Un-lockable vs Unlock-able” is a Morphological ambiguity: Same word form, two possible meanings
“I saw the cat with the telescope” has a Syntactic ambiguity: Same sentence structure, different properties
“drive the cat” vs “drive the car” shows a Semantic ambiguity: Syntactically identical sentences that imply quite different actions.
“I NEVER said she stole MY money.” is a Pragmatic ambiguity: Meaning relies on word emphasis

Whenever you are solving a specific task, you should ask yourself what kind of ambiguity can affect your results, and to what degrees? At what level are your assumptions operating when defining your research questions? Having the answers to this can save you a lot of time when debugging your models. Sometimes the most innocent assumptions (for example using the wrong tokenizer) can create enormous performance drops even when the higher level assumptions were correct.

Sparsity

Another key property of linguistic data is its sparsity. This means that if we are hunting for a specific phenomenon, we may often realize it barely occurs inside a vast amount of text. Imagine we have the following brief text and we are interested in pizzas and hamburgers:

PYTHON

# A mini-corpus where our target words appear
text = """
I am hungry. Should I eat delicious pizza?
Or maybe I should eat a juicy hamburger instead.
Many people like to eat pizza because is tasty, they think pizza is delicious as hell!
My friend prefers to eat a hamburger and I agree with him.
We will drive our car to the restaurant to get the succulent hamburger.
Right now, our cat sleeps on the mat so we won't take him.
I did not wash my car, but at least the car has gasoline.
Perhaps when we come back we will take out the cat for a walk.
The cat will be happy then.
"""

We can first use spaCy to tokenize the text and do some direct word count:

PYTHON

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
words = [token.lower_ for token in doc if token.is_alpha]  # Filter out punctuation and new lines
print(words)
print(len(words))

We have in total 104 words, but we actually want to know how many times each word appears. For that we use the Python Counter and then we can visualize it inside a chart with matplotlib:

PYTHON

from collections import Counter
import matplotlib.pyplot as plt

word_count = Counter(words).most_common()
tokens = [item[0] for item in word_count]
frequencies = [item[1] for item in word_count]

plt.figure(figsize=(18, 6))
plt.bar(tokens, frequencies)
plt.xticks(rotation=90)
plt.show()

This bar chart shows us several things about sparsity, even with such a small text:

The most common words are filler words such as “the”, “of”, “not” etc. These are known as stopwords because such words by themselves generally do not hold a lot of information about the meaning of the piece of text.
The two concepts (hamburger and pizza) we are interested in, appear only 3 times each, out of 104 words (comprising only ~3% of our corpus). This number only goes lower as the corpus size increases
There is a long tail in the distribution, where actually a lot of meaningful words are located.

Callout

Stop Words

Stop words are extremely frequent syntactic filler words that do not provide relevant semantic information for our use case. For some use cases it is better to ignore them in order to fight the sparsity phenomenon. However, consider that in many other use cases the syntactic information that stop words provide is crucial to solve the task.

SpaCy has a pre-defined list of stopwords per language. To explicitly load the English stop words we can do:

PYTHON

from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)  # a set of common stopwords
print(len(STOP_WORDS)) # There are 326 words considered in this list

You can also manually extend the list of stop words if you are interested in ignoring other unlisted terms that you encounter in your data.

Alternatively, you can filter out stop words when iterating your tokens (remember the spaCy token properties!) like this:

PYTHON

doc = nlp(text)
content_words = [token.text for token in doc if token.is_alpha and not token.is_stop]  # Filter out stop words and punctuation
print(content_words)

Callout

Sparsity is closely related to what is frequently called domain-specific data. The discourse context in which language is used varies importantly across disciplines (domains). Take for example law texts and medical texts which are typically filled with domain-specific jargon. We should expect the top part of the distribution to contain mostly the same words as they tend to be stop words. But once we remove the stop words, the top of the distribution will contain very different content words.

Also, the meaning of concepts described in each domain might significantly differ. For example the word “trial” refers to a procedure for examining evidence in court, but in the medical domain this could refer to a clinical “trial” which is a procedure to test the efficacy and safety of treatments on patients. For this reason there are specialized models and corpora that model language use in specific domains. The concept of fine-tuning a general purpose model with domain-specific data is also popular, even when using LLMs.

Discreteness

There is no inherent relationship between the form of a word and its meaning. For this reason, by syntactic or lexical analysis alone, there is no automatic way of knowing if two words are similar in meaning or how they relate semantically to each other. For example, “car” and “cat” appear to be very closely related at the morphological level, only one letter needs to change to convert one word into the other. But the two words represent concepts or entities in the world which are very different. Conversely, “pizza” and “hamburger” look very different (they only share one letter in common) but are more closely related semantically, because they both refer to typical fast foods.

How can we automatically know that “pizza” and “hamburger” share more semantic properties than “car” and “cat”? One way is by looking at the context (neighboring words) of these words. This idea is the principle behind distributional semantics, and aims to look at the statistical properties of language, such as word co-occurrences (what words are typically located nearby a given word in a given corpus of text), to understand how words relate to each other.

Let’s keep using the list of words from our mini corpus:

PYTHON

words = [token.lower_ for token in doc if token.is_alpha]

Now we will create a dictionary where we accumulate the words that appear around our words of interest. In this case we want to find out, according to our corpus, the most frequent words that occur around pizza, hamburger, car and cat:

PYTHON

target_words = ["pizza", "hamburger", "car", "cat"] # words we want to analyze
co_occurrence = {word: [] for word in target_words}
co_occurrence

We iterate over each word in our corpus, collecting its surrounding words within a defined window. A window consists of a set number of words to the left and right of the target word, as determined by the window_size parameter. For example, with window_size = 3, a word W has a window of six neighboring words—three preceding and three following—excluding W itself:

PYTHON

window_size = 3 # How many words to look at on each side
for i, word in enumerate(words):
    # If the current word is one of our target words...
    if word in target_words:
        start = max(0, i - window_size) # get the start index of the window
        end = min(len(words), i + 1 + window_size) # get the end index of the window
        context = words[start:i] + words[i+1:end]  # Exclude the target word itself
        co_occurrence[word].extend(context)

print(co_occurrence)

We call the words that fall inside this window the context of a target word. We can already see other interesting related words in the context of each target word, but a lot of non interesting stuff is in there. To obtain even nicer results, we can delete the stop words from the context window before adding it to the dictionary. You can define your own stop words, here we use the STOP_WORDS list provided by spaCy:

PYTHON

from spacy.lang.en.stop_words import STOP_WORDS

co_occurrence = {word: [] for word in target_words} # Empty the dictionary

window_size = 3 # How many words to look at on each side
for i, word in enumerate(words):
    # If the current word is one of our target words...
    if word in target_words:
        start = max(0, i - window_size) # get the start index of the window
        end = min(len(words), i + 1 + window_size) # get the end index of the window
        context = words[start:i] + words[i+1:end]  # Exclude the target word itself
        context = [w for w in context if w not in STOP_WORDS] # Filter out stop words
        co_occurrence[word].extend(context)

print(co_occurrence)

Our dictionary keys represent each word of interest, and the values are a list of the words that occur within window_size distance of the word. Now we use a Counter to get the most common items:

PYTHON

# Print the most common context words for each target word
print("Contextual Fingerprints:\n")
for word, context_list in co_occurrence.items():
    # We use Counter to get a frequency count of context words
    fingerprint = Counter(context_list).most_common(5)
    print(f"'{word}': {fingerprint}")

OUTPUT

Contextual Fingerprints:

'pizza': [('eat', 2), ('delicious', 2), ('tasty', 2), ('maybe', 1), ('like', 1)]
'hamburger': [('eat', 2), ('juicy', 1), ('instead', 1), ('people', 1), ('agree', 1)]
'car': [('drive', 1), ('restaurant', 1), ('wash', 1), ('gasoline', 1)]
'cat': [('walk', 2), ('right', 1), ('sleeps', 1), ('happy', 1)]

As our mini experiment demonstrates, discreteness can be combatted with statistical co-occurrence: words with similar meaning will occur around similar concepts, giving us an idea of similarity that has nothing to do with syntactic or lexical form of words. This is the core idea behind most modern semantic representation models in NLP.

Callout

NLP Libraries

Related to the need of shaping our problems into a known task, there are several existing NLP libraries which provide a wide range of models that we can use out-of-the-box (without further need of modification). We already saw simple examples using SpaCy for English and jieba for Chinese. Again, as a non-exhaustive list, we mention some widely used NLP libraries in Python:

Callout

Linguistic Resources

There are also several curated resources (textual data) that can help solve your NLP-related tasks, specifically when you need highly specialized definitions. An exhaustive list would be impossible as there are thousands of them, and also them being language and domain dependent. Below we mention some of the most prominent, just to give you an idea of the kind of resources you can find, so you don’t need to reinvent the wheel every time you start a project:

HuggingFace Datasets: A large collection of datasets for NLP tasks, including text classification, question answering, and language modeling.
WordNet: A large lexical database of English, where words are grouped into sets of synonyms (synsets) and linked by semantic relations.
Europarl: A parallel corpus of the proceedings of the European Parliament, available in 21 languages, which can be used for machine translation and cross-lingual NLP tasks.
Universal Dependencies: A collection of syntactically annotated treebanks across 100+ languages, providing a consistent annotation scheme for syntactic and morphological properties of words, which can be used for cross-lingual NLP tasks.
PropBank: A corpus of texts annotated with information about basic semantic propositions, which can be used for English semantic tasks.
FrameNet: A lexical resource that provides information about the semantic frames that underlie the meanings of words (mainly verbs and nouns), including their roles and relations.
BabelNet: A multilingual lexical resource that combines WordNet and Wikipedia, providing a large number of concepts and their relations in multiple languages.
Wikidata: A free and open knowledge base initially derived from Wikipedia, that contains structured data about entities, their properties and relations, which can be used to enrich NLP applications.
Dolma: An open dataset of 3 trillion tokens from a diverse mix of clean web content, academic publications, code, books, and encyclopedic materials, used to train English large language models.

What did we learn in this lesson?

Key Points

NLP is a subfield of Artificial Intelligence (AI) that, using the help of Linguistics, deals with approaches to process, understand and generate natural language
Linguistic Data has special properties that we should consider when modeling our solutions
Key tasks include language modeling, text classification, token classification and text generation
Deep learning has significantly advanced NLP, but the challenge remains in processing the discrete and ambiguous nature of language
The ultimate goal of NLP is to enable machines to understand and process language as humans do

Content from From words to vectors

Last updated on 2025-11-24 | Edit this page

Estimated time: 120 minutes

Overview

Questions

Why do we need to preprocess text in NLP?
What are the most common preprocessing operations and in which contexts should each be used?
What properties do word embeddings have?
What is a word2vec model?
What insights can I get from word embeddings?
How do I train my own word2vec model?

Objectives

After following this lesson, learners will be able to:

Perform basic NLP preprocessing operations
Implement a basic NLP Pipeline
Explain the motivation for vectorization in modern NLP
Train a custom Word2Vec model using the Gensim library
Apply a Word2Vec model to interpret and analyze semantics of text (either a pre-trained model or custom model)
Describe the kinds of semantic relationships captured by Word2Vec, and identify NLP tasks it is suited to support
Explain, with examples, what the limitations are for the Word2Vec representation

Introduction

In this episode, we will learn about the importance of preprocessing text in NLP, and how to apply common preprocessing operations to text files. We will also learn more about NLP Pipelines, learn about their basic components and how to construct such pipelines.

We will then address the transition from rule-based NLP to distributional semantics approaches which encode text into numerical representations based on statistical relationships between tokens. We will introduce one particular algorithm for this kind of encoding called Word2Vec proposed in 2013 by Mikolov et al. We will show what kind of useful semantic relationships these representations encode in text, and how we can use them to solve specific NLP tasks. We will also discuss some of the limitations of Word2Vec which are addressed in the next lesson on transformers before concluding with a summary of what we covered in this lesson.

Preprocessing Operations

As is common in data science and machine learning, raw textual data often comes in a form that is not readily suitable for downstream NLP tasks. Preprocessing operations in NLP are analogous to the data cleaning and sanitation steps in traditional non-NLP Machine Learning tasks. Sometimes you are extracting text from PDF files which contain line breaks, headers, tables etc. that are not relevant to NLP tasks and which need to be removed. You may need to remove punctuation and special characters, or lowercase text for some NLP tasks etc. Whether you need to perform certain preprocessing operations, and the order in which you should perform them, will depend on the NLP task at hand.

Also note that preprocessing can differ significantly if you work with different languages. This is both in terms of which steps to apply, but also which methods to use for a specific step.

Here we will analyze with more detail the most common pre-processing steps when dealing with unstructured English text data:

Data Formatting

Text comes from various sources and are available in different formats (e.g., Microsoft Word documents, PDF documents, ePub files, plain text files, Web pages etc…). The first step is to obtain a clean text representation that can be transferred into python UTF-8 strings that our scripts can manipulate.

Take a look at the data/84_frankenstein_or_the_modern_prometheus.txt file:

PYTHON

filename = "data/84_frankenstein_or_the_modern_prometheus.txt"
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

print(text[:300]) # print the first 300 characters

Our file is already in plain text so it might seem we do not need to do anything; however, if we look closer we see new line characters separating not only paragraphs but breaking the lines in the middle of sentences. While this is useful to keep the text in a narrow space to help the human reader, it introduces artificial breaks that can confuse any automatic analysis (for example to identify where sentences start and end).

One straightforward way is to replace the new lines with spaces so all the text is in a single line:

PYTHON

text_flat = text.replace("\n", " ")
print(text_flat[:300]) # print the first 300 characters

Other data formatting operations might include: - Removal of special or noisy characters. For example:

SH

- Random symbols: "The total cost is $120.00#" → remove #
- Incorrectly recognized letters or numbers: 1 misread as l, 0 as O, etc. Example: "l0ve" → should be "love"
- Control or formatting characters: \n, \t, \r appearing in the middle of sentences. Example: "Please\nsubmit\tyour form." → "Please submit your form."
- Non-standard Unicode characters: �, �, or other placeholder symbols where OCR failed. Example: "Th� quick brown fox" → "The quick brown fox"

Remove HTML tags (e.g., if you are extracting text from Web pages)
Strip non-meaningful punctuation (e.g., “The quick brown fox jumps over the lazy dog and con- tinues to run across the field.)
Strip footnotes, headers, tables, images etc.
Remove URLs or phone numbers

Callout

What if I need to extract text from MS Word docs or PDF files or Web pages? - There are various Python libraries for helping you extract and manipulate text from these kinds of sources. - For MS Word documents python-docx is popular. - For (text-based) PDF files PyPDF2 and PyMuPDF are widely used. Note that some PDF files are encoded as images (pixels) and not text. If the text in these files is digital (as opposed to scanned handwriting), you can use OCR (Optical Character Recognition) libraries such as pytesseract to convert the image to machine-readable text. - For scraping text from websites, BeautifulSoup and Scrapy are some common options. - LLMs also have something to offer here, and the field is moving pretty fast. There are some interesting open source LLM-based document parsers and OCR-like extractors such as Marker, or PyMuPDF4LLM, just to mention a couple.

Callout

Another important choice at the data formatting level is to decide at what granularity do you need to perform the NLP task:

Are you analyzing phenomena at the word level? For example, detecting abusive language (based on a known vocabulary).
Do you need to first extract sentences from the text and do analysis at the sentence level? For example, extracting entities in each sentence.
Do you need full chunks of text? (e.g. paragraphs or chapters?) For example, summarizing each paragraph in a document.
Or perhaps you want to extract patterns at the document level? For example each full book should have one genre tag (Romance, History, Poetry).

Sometimes your data will be already available at the desired granularity level. If this is not the case, then during the tokenization step you will need to figure out how to obtain the desired granularity level.

Tokenization

Tokenization is a foundational operation in NLP, as it helps to create structure from raw text. This structure is a basic requirement and input for modern NLP algorithms to attribute and interpret meaning from text. This operation involves the segmentation of the text into smaller units referred to as tokens. Tokens can be sentences (e.g. 'the happy cat'), words ('the', 'happy', 'cat'), subwords ('un', 'happiness') or characters ('c','a', 't'). Different NLP algorithms may require different choices for the token unit. And different languages may require different approaches to identify or segment these tokens.

Python strings are by definition sequences of characters, thus we can iterate through the string character by character:

PYTHON

print(type(text_flat))  # Should be <class 'str'>
for ch in text_flat:
    print(ch)

However, it is often more advantageous if our atomic units are words. As we saw in Lesson 1, the task of extracting word tokens from texts is not trivial, therefore pre-trained models such as spaCy can help with this step. In this case we will use the small English model that was trained on a web corpus:

PYTHON

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
print(type(doc))  # Should be <class 'spacy.tokens.doc.Doc'>
print(len(doc))
print(doc)

Callout

A good word tokenizer for example, does not simply break up a text based on spaces and punctuation, but it should be able to distinguish:

abbreviations that include points (e.g.: e.g.)
times (11:15) and dates written in various formats (01/01/2024 or 01-01-2024)
word contractions such as don’t, these should be split into do and n’t
URLs

Many older tokenizers are rule-based, meaning that they iterate over a number of predefined rules to split the text into tokens, which is useful for splitting text into word tokens for example. Modern large language models use subword tokenization, which learn to break text into pieces that are statistically convenient, this makes them more flexible but less human-readable.

We can access the tokens by iterating the document and getting its .text property:

PYTHON

tokens_txt = [token.text for token in doc]
print(tokens_txt[:15])

OUTPUT

['Letter', '1', '\n\n\n', 'St.', 'Petersburgh', ',', 'Dec.', '11th', ',', '17', '-', '-', '\n\n', 'TO', 'Mrs.']

This shows us the individual tokens, including new lines and punctuation (in case we didn’t run the previous cleaning step). spaCy allows us to filter based on token properties. For example, assuming we are not interested in the newlines, punctuation nor in numeric tokens, in one single step we can keep only the token objects that contain alphabetical characters:

PYTHON

tokens = [token for token in doc if token.is_alpha]
print(tokens[:15])

OUTPUT

[Letter, Petersburgh, TO, Saville, England, You, will, rejoice, to, hear, that, no, disaster, has, accompanied]

We do not have to depend necessarily on the Doc and Token spaCy objects. Once we tokenized the text with the spaCy model, we can extract the list of words as a list of strings and continue our text analysis:

PYTHON

words = [token.text for token in doc]
print(words[:20])

Again, it all depends on what your requirements are. For example, sometimes it is more useful if our atomic units are sentences. Think of the NLP task of classifying each whole sentence inside a text as Positive/Negative/Neutral in terms of sentiment (e.g., within movie reviews). spaCy also helps with this using a sentence segmentation model:

PYTHON

sentences = [sent.text for sent in doc.sents]
[print(s) for s in sentences[:5]]

Note that in this case each sentence is a python object, and the property .text returns an untokenized string (in terms of words). But we can still access the list of word tokens inside each sentence object if we want:

PYTHON

sents_sample = list(doc.sents)[:10]
for sent in sents_sample:
    print("Sentence:", sent.text)
    for token in sent:
        print("\tToken:", token.text)

This will give us enough flexibility to work at the sentence and word level at the same time. In terms of what we can do with these sentences once spaCy has identified them, we could ask humans to label each sentence as either Positive/Negative/Neutral and train a supervised model for sentiment classification on the set of sentences. Or if we have a pre-trained model for sentiment classification on sentences, we could load this model and classify each of our input sentences as either Positive/Negative/Neutral.

Lowercasing

Removing uppercases to e.g. avoid treating “Dog” and “dog” as two different words is also a common step, for example to train word vector representations, we want to merge both occurrences as they represent exactly the same concept. Lowercasing can be done with Python directly as:

PYTHON

lower_text = text_flat.lower()
lower_text[:100] # Beware that this is a python string operation

Beware that lowercasing the whole string as a first step might affect the tokenizer behavior since tokenization benefits from information provided by case-sensitive strings. We can therefore tokenize first using spaCy and then obtain the lowercase strings of each token using the .lower_ property:

PYTHON

lower_text = [token.lower_ for token in doc]
lower_text[:10] # Beware that this is a list of strings now!

In other tasks, such as Named Entity Recognition (NER), lowercasing before training can actually lower the performance of your model. This is because words that start with an uppercase (not preceded by a period) can represent proper nouns that map into Entities, for example:

PYTHON

# Preserving uppercase characters increases the likelihood that an NER model
# will correctly identify Apple and Will as a company (ORG) and a person (PER)
# respectively.
str1 = "My next laptop will be from Apple, Will said." 
# Lowercasing can reduce the likelihood of accurate labeling
str2 = "my next laptop will be from apple, will said."

Lemmatization

Although it has become less widely used in modern NLP approaches, normalizing words into their dictionary form can help to focus on relevant aspects of text. Consider how “eating”, “ate”, “eaten” are all variations of the root verb “eat”. Each variation is sometimes known as an inflection of the root word. Conversely, we say that the word “eat” is the lemma for the words “eating”, “eats”, “eaten”, “ate” etc. Lemmatization is therefore the process of rewriting each token or word in a given input text as its lemma.

Lemmatization is not only a possible preprocessing step in NLP but also an NLP task on its own, with different algorithms for it. Therefore we also tend to use pre-trained models to perform lemmatization. Using spaCy we can access the lemmmatized version of each token with the lemma_ property (notice the underscore!):

PYTHON

lemmas = [token.lemma_ for token in doc]
print(lemmas)

Note that the list of lemmas is now a list of strings.

Having a lemmatized text allows us to merge the different surface occurrences of the same concept into a single token. This can be very useful for count-based NLP methods such as topic modelling approaches which count the frequency of certain words to see how prevalent a given topic is within a document. If you condense “eat”, “eating”, “ate”, “eaten” to the same token “eat” then you can count four occurrences of the same “topic” in a text, instead of treating these four tokens as distinct or unrelated topics just because they are spelled differently. You can also use lemmatization for generating word embeddings. For example, you can have a single vector for eat instead of one vector per verb tense.

As with each preprocessing operation, this step is optional. Consider, for example, the cases where the differences of verb usage according to tense is informative, or the difference between singular and plural usage of nouns, in those cases lemmatizing will get rid of important information for your task. For example, if your chosen NLP task is to detect past tense verbs from a document, then lemmatizing “eaten” into “eat” loses crucial information about tense that your model requires.

Stop Word Removal

The most frequent words in texts are those which contribute little semantic value on their own: articles (‘the’, ‘a’, ‘an’), conjunctions (‘and’, ‘or’, ‘but’), prepositions (‘on’, ‘by’), auxiliary verbs (‘is’, ‘am’), pronouns (‘he’, ‘which’), or any highly frequent word that might not be of interest in several content-only related tasks. Let’s define a small list of stop words for this specific case:

PYTHON

STOP_WORDS = ["the", "you", "will"] # This list can be customized to your needs...

Using Python directly, we need to manually define a list of what we consider to be stop words and directly filter the tokens that match this. Notice that lemmatization was a crucial step to get more coverage with the stop word filtering:

PYTHON

lemmas = [token.lemma_ for token in doc]
content_words = []
for lemma in lemmas:
    if lemma not in STOP_WORDS:
        content_words.append(lemma)
print(content_words[:20])

Using spaCy we can filter the stop words based on the token properties:

PYTHON

tokens_nostop = [token for token in tokens if not token.is_stop]
print(tokens[:15])

There is no canonical definition of stop words because what you consider to be a stop word is directly linked to the objective of your task at hand. For example, pronouns are usually considered stopwords, but if you want to do gender bias analysis then pronouns are actually a key element of your text processing pipeline. Similarly, removing articles such as prepositions from text is obviously not advised if you are doing dependency parsing (the task of identifying the parts of speech in a given text).

Another special case is the word ‘not’ which may encode the semantic notion of negation. Removing such tokens can drastically change the meaning of sentences and therefore affect the accuracy of models for which negation is important to preserve (e.g., sentiment classification “this movie was NOT great” vs. “this movie was great”).

NLP Pipeline

The concept of NLP pipeline refers to the sequence of operations that we apply to our data in order to go from the original data (e.g. original raw documents) to the expected outputs of our NLP Task at hand. The components of the pipeline refer to any manipulation we apply to the text, and do not necessarily need to be complex models, they involve preprocessing operations, application of rules or machine learning models, as well as formatting the outputs in a desired way.

A simple rule-based classifier

Imagine we want to build a very lightweight sentiment classifier. A basic approach is to design the following pipeline:

Clean the original text file (as we saw in the Data Formatting section)
Apply a sentence segmentation or tokenization model
Define a set of positive and negative words (a hard coded dictionary)
For each sentence:
- If it contains one or more of the positive words, classify as POSITIVE
- If it contains one or more of the negative words, classify as NEGATIVE
- Otherwise classify as NEUTRAL
Output a table with the original sentence and the assigned label

This is implemented with the following code:

Read the text and normalize it into a single line

PYTHON

import spacy
nlp = spacy.load("en_core_web_sm")

filename = "data/84_frankenstein_or_the_modern_prometheus.txt"
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

text = text.replace("\n", " ") # some cleaning by removing new line characters

Apply Sentence segmentation

PYTHON

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

Define the positive and negative words you care about:

PYTHON

positive_words = ["happy", "excited", "delighted", "content", "love", "enjoyment"]
negative_words = ["unhappy", "sad", "anxious", "miserable", "fear", "horror"]

Apply the rules to each sentence and collect the labels

PYTHON

classified_sentences = []

for sent in sentences:
    if any(word in sent.lower() for word in positive_words):
        classified_sentences.append((sent, 'POSITIVE'))
    elif any(word in sent.lower() for word in negative_words):
        classified_sentences.append((sent, 'NEGATIVE'))
    else:
        classified_sentences.append((sent, 'NEUTRAL'))

Save the classified data

PYTHON

import pandas as pd
df = pd.DataFrame(classified_sentences, columns=['sentence', 'label'])
df.to_csv('results_naive_rule_classifier.csv', sep='\t')

Challenge

Discuss the pros and cons of the proposed NLP pipeline: 1. Do you think it will give accurate results? 2. What do you think about the coverage of this approach? What cases will it miss? 3. Think of possible drawbacks of chaining components in a pipeline.

Show me the solution

This classifier only considers the presence of one word to apply a label. It does not analyze sentence semantics or even syntax.
Given how the rules are defined, if both positive and negative words are present in the same sentence it will assign the POSITIVE label. It will generate a lot of false positives because of the simplistic rules
The errors from previous steps get carried over to the next steps increasing the likelihood of noisy outputs.

So far we’ve seen how to format and segment the text to have atomic data at the word level or sentence level. We then apply operations to the word and sentence strings. This approach still depends on counting and exact keyword matching. And as we have already seen it has several limitations. The method cannot interpret words outside the dictionary defined for example.

One way to combat this is by transforming each word into numeric representation and study statistical patterns in how these words are distributed in text. For example, what words tend to occur “close” to a given word in my data? For example, if we analyze restaurant menus we find that “cheese”, “mozzarella”, “base” etc. frequently occur near the token “pizza”. We can then exploit these statistical patterns to inform various NLP tasks. This concept is commonly known as distributional semantics. It is based on the assumption “words that appear in similar contexts have similar meanings.”

This concept is powerful for enabling, for example, the measurement of semantic similarity of words, sentences, phrases etc. in text. And this, in turn, can help with other downstream NLP tasks, as we shall see in the next section on word embeddings.

Word Embeddings

Reminder: Neural Networks

Understanding how neural networks work is out of the scope of this course. For our purposes we will simplify the explanation in order to conceptually understand how Neural Network works. A Neural Network (NN) is a pattern-finding machine with layers (a deep neural network is the same concept but scaled to dozens or even hundreds of layers). In a neural network, each layer has several interconnected neurons, each one corresponding to a random number initially. The deeper the network is, the more complex patterns it can learn. As the neural netork gets trained (that is, as it sees several labeled examples that we provide), each neuron value will be updated in order to maximize the probability of getting the answers right. A well trained neural network will be able to predict the right labels on completely new data with certain accuracy.

After seeing thousands of examples, each layer represents different “features” that maximize the success of the task, but they are not human-readable. The last layer acts as a classifier and outputs the most likely label given the input

The main difference with traditional machine learning models is that we do not need to design explicitly any features, rather the network will adjust itself by looking at the data alone and executing the back-propagation algorithm. The main job when using NNs is to encode our data properly so it can be fed into the network.

Rationale behind Embeddings

A word embedding is a numeric vector that represents a word. Word2Vec exploits the “feature agnostic” power of neural networks to transform word strings into trained word numeric representations. Hence we still use words as features but instead of using the string directly, we transform that string into its corresponding vector in the pre-trained Word2Vec model. And because both the network input and output are the words themselves in text, we basically have billions of labeled training datapoints for free.

To obtained the word embeddings, a shallow neural network is optimized with the task of language modeling and the final hidden layer inside the trained network holds the fixed size vectors whose values can be mapped into linguistic properties (since the training objective was language modeling). Since similar words occur in similar contexts, or have same characteristics, a properly trained model will learn to assign similar vectors to similar words.

By representing words with vectors, we can mathematically manipulate them through vector arithmetic and express semantic similarity in terms of vector distance. Because the size of the learned vectors is not proportional to the amount of documents we can learn the representations from larger collections of texts, obtaining more robust representations, that are less corpus-dependent.

There are two main algorithms for training Word2Vec:

Continuous Bag-of-Words (CBOW): Predicts a target word based on its surrounding context words.
Continuous Skip-Gram: Predicts surrounding context words given a target word.

If you want to know more about the technicl aspecs of training Word2Vec you can visit this tutorial

The Word2Vec Vector Space

The python module gensim offers a user-friendly interface to interact with pre-trained Word2vec models and also to train our own. First we will explore the model from the original Word2Vec paper, which was trained on a big corpus from Google News (English news articles). We will see what functionalities are available to explore a vector space. Then we will prepare our own text step-by-step to train our own Word2vec models and save them.

Load the embeddings and inspect them

The library gensim has a repository with English pre-trained models. We can take a look at the models:

PYTHON

import gensim.downloader
available_models = gensim.downloader.info()['models'].keys()
print(list(available_models))

We will download the google News model with:

PYTHON

w2v_model = gensim.downloader.load('word2vec-google-news-300')

We can do some basic checkups such as showing how many words are in the vocabulary (i.e., for how many words do we have an available vector), what is the total number of dimensions in each vector, and print the components of a vector for a given word:

PYTHON

print(len(w2v_model.key_to_index.keys())) # 3 million words
print(w2v_model.vector_size) # 300 dimensions. This can be chosen when training your own model
print(w2v_model['car'][:10]) # The first 10 dimensions of the vector representing 'car'.
print(w2v_model['cat'][:10]) # The first 10 dimensions of the vector representing 'cat'.

OUTPUT

3000000
300
[ 0.13085938  0.00842285  0.03344727 -0.05883789  0.04003906 -0.14257812
  0.04931641 -0.16894531  0.20898438  0.11962891]
[ 0.0123291   0.20410156 -0.28515625  0.21679688  0.11816406  0.08300781
  0.04980469 -0.00952148  0.22070312 -0.12597656]

As we can see, this is a very large model with 3 million words and the dimensionality chosen at training time was 300, thus each word will have a 300-dimension vector associated with it.

Even with such a big vocabulary we can always find a word that won’t be in there:

PYTHON

print(w2v_model['bazzinga'][:10])

This will throw a KeyError as the model does not know that word. Unfortunately this is a limitation of Word2vec - unseen words (words that were not included in the training data) cannot be interpreted by the model.

Now let’s talk about the vectors themselves. They are not easy to interpret as they are a bunch of floating point numbers. These are the weights that the network learned when optimizing for language modelling. As the vectors are hard to interpret, we rely on a mathematical method to compute how similar two vectors are. Generally speaking, the recommended metric for measuring similarity between two high-dimensional vectors is cosine similarity .

Callout

cosine similarity ranges between [-1 and 1]. It is the cosine of the angle between two vectors, divided by the product of their length. Mathematically speaking, when two vectors point in exactly the same direction their cosine will be 1, and when they point in the opposite direction their cosine will be -1. In python we can use Numpy to compute the cosine similarity of vectors.

We can use sklearn learn to measure any pair of high-dimensional vectors:

PYTHON

from sklearn.metrics.pairwise import cosine_similarity

car_vector = w2v_model['car']
cat_vector = w2v_model['cat']

similarity = cosine_similarity([car_vector], [cat_vector])
print(f"Cosine similarity between 'car' and 'cat': {similarity[0][0]}")

similarity = cosine_similarity([w2v_model['hamburger']], [w2v_model['pizza']])
print(f"Cosine similarity between 'hamburger' and 'pizza': {similarity[0][0]}")

PYTHON


Cosine similarity between 'car' and 'cat': 0.21528185904026031
Cosine similarity between 'hamburger' and 'pizza': 0.6153676509857178

Or you can use directly the w2v_model.similarity('car', 'cat') function which gives the same result.

The higher similarity score between the hamburger and pizza indicates they are more similar based on the contexts where they appear in the training data. Even though is hard to read all the floating numbers in the vectors, we can trust this metric to always give us a hint of which words are semantically closer than others

Challenge

Think of different word pairs and try to guess how close or distant they will be from each other. Use the similarity measure from the word2vec module to compute the metric and discuss if this fits your expectations. If not, can you come up with a reason why this was not the case?

Show me the solution

Interesting cases are synonyms, antonyms and morphologically related words:

PYTHON

print(w2v_model.similarity('democracy', 'democratic'))
print(w2v_model.similarity('queen', 'princess'))
print(w2v_model.similarity('love', 'hate')) #!! (think of "I love X" and "I hate X")
print(w2v_model.similarity('love', 'lover'))

OUTPUT

Vector Neighborhoods

Now that we have a metric we can trust, we can retrieve neighborhoods of vectors that are close to a given word. This is analogous to retrieving semantically related terms to a target term. Let’s explore the neighborhood around `pizza` using the most_similar() method:

PYTHON

print(w2v_model.most_similar('pizza', topn=10))

This returns a list of ranked tuples with the form (word, similarity_score). The list is already ordered in descent, so the first element is the closest vector in the vector space, the second element is the second closest word and so on…

OUTPUT

[('pizzas', 0.7863470911979675),
('Domino_pizza', 0.7342829704284668),
('Pizza', 0.6988078355789185),
('pepperoni_pizza', 0.6902607083320618),
('sandwich', 0.6840401887893677),
('burger', 0.6569692492485046),
('sandwiches', 0.6495091319084167),
('takeout_pizza', 0.6491535902023315),
('gourmet_pizza', 0.6400628089904785),
('meatball_sandwich', 0.6377009749412537)]

Exploring neighborhoods can help us understand why some vectors are closer (or not so much). Take the case of love and lover, originally we might think these should be very close to each other but by looking at their neighborhoods we understand why this is not the case:

PYTHON

print(w2v_model.most_similar('love', topn=10))
print(w2v_model.most_similar('lover', topn=10))

OUTPUT

[('loved', 0.6907791495323181), ('adore', 0.6816874146461487), ('loves', 0.6618633270263672), ('passion', 0.6100709438323975), ('hate', 0.6003956198692322), ('loving', 0.5886634588241577), ('Ilove', 0.5702950954437256), ('affection', 0.5664337873458862), ('undying_love', 0.5547305345535278), ('absolutely_adore', 0.5536840558052063)]

[('paramour', 0.6798686385154724), ('mistress', 0.6387110352516174), ('boyfriend', 0.6375402212142944), ('lovers', 0.6339589953422546), ('girlfriend', 0.6140860915184021), ('beau', 0.609399676322937), ('fiancé', 0.5994566679000854), ('soulmate', 0.5993717312812805), ('hubby', 0.5904166102409363), ('fiancée', 0.5888950228691101)]

The first word is a noun or a verb (depending on the context) that denotes affection to someone/something , so it is associated with other concepts of affection (positive or negative). The case of lover is used to describe a person, hence the associated concepts are descriptors of people with whom the lover can be associated.

Word Analogies with Vectors

Another powerful property that word embeddings show is that vector algebra can preserve semantic analogy. An analogy is a comparison between two different things based on their similar features or relationships, for example king is to queen as man is to woman. We can mimic this operations directly on the vectors using the most_similar() method with the positive and negative parameters:

PYTHON

# king is to man as what is to woman?
# king + woman - man = queen
w2v_model.most_similar(positive=['king', 'woman'], negative=['man'])

OUTPUT

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]

Train your own Word2Vec

The gensim package has implemented everything for us, this means we have to focus mostly on obtaining clean data and then calling the Word2Vec object to train our own model with our own data. This can be done like follows:

PYTHON

import gensim 
from gensim.models import Word2Vec 

model = Word2Vec([clean_tokens], sg=0 , vector_size=300, window=5, min_count=1, workers=4)

With this line code we are configuring our whole Word2Vec training schema. We will be using CBOW (sg=0 means CBOW, sg=1 means Skip-gram). We are interested in having vectors with 300 dimensions vector_size=300 and a context size of 5 surrounding words window=5. Because we already filtered our tokens, we include all words present in the filtered corpora, regardless of their frequency of occurrence min_count=1. The last parameters tells python to use 4 CPU cores for training.

See the Gensim documentation for more training options.

Save and Retrieve your model

Once your model is trained it is useful to save the checkpoint in order to retrieve it next time instead of having to train it every time. You can save it with:

PYTHON

model.save("word2vec_mini_books.model")

Let’s put everything together. We have now the following NLP task: train our own Word2Vec model. We are interested on having vectors for content words only, so even though our preprocessing will unfortunately lose a lot of the original information, in exchange we will be able to manipulate the most relevant conceptual words as individual numeric representations.

To load back the pre-trained vectors you just created you can use the following code:

PYTHON

model = Word2Vec.load("word2vec_mini_books.model")
w2v = model.wv
# Test:
w2v.most_similar('monster')

Challenge

Let’s apply this step by step on a longer text. In this case, because we are learning the process, our corpus will be only one book but in reality we would like to train a network with thousands of them. We will use two books: Frankenstein and Dracula to train a model of word vectors.

Write the code to follow the proposed pipeline and train the word2vec model. The proposed pipeline for this task is:

load the text files
tokenize files
keep only alphanumerical tokens
lemmatize words
Remove stop words
Train a Word2Vec model (feed the clean tokens to the Word2Vec object) with vector_size=50
Save the trained model

Show me the solution

PYTHON

import spacy
from gensim.models import Word2Vec 

def process_book(book_filename: str, spacy_model: spacy.lang) -> list[str]:
    with open(book_filename) as f:
        book_text = f.read()
    
    book_doc = spacy_model(book_text)
    valid_tokens = [tok for tok in book_doc if tok.is_alpha and not tok.is_stop]
    lemmas = [tok.lemma_ for tok in valid_tokens] 
    return lemmas
    
nlp = spacy.load("en_core_web_sm")

# Load the Tokens
franken_tokens = process_book("data/84_frankenstein_clean.txt", nlp)
dracula_tokens = process_book("data/345_dracula_clean.txt", nlp)

# Train our own model
spooky_model = Word2Vec([franken_tokens, dracula_tokens], sg=0 , vector_size=50, window=5, min_count=1, workers=4)

# Test the vectors
print(len(spooky_model.wv['Frankenstein']))
print(spooky_model.wv['Frankenstein'][:30])
print(spooky_model.wv.most_similar("Frankenstein"))

Callout

Dataset size in training

To obtain your own high-quality embeddings, the size/length of the training dataset plays a crucial role. Generally tens of thousands of documents are considered a reasonable amount of data for decent results.

Is there a strict minimum? Not really. It’s important to keep in mind that vocabulary size, document length, and desired vector size all interact with each other. Higher-dimensional vectors (e.g., 200–300 dimensions) provide more features to capture a word’s meaning, resulting in higher-quality embeddings that can represent words across a finer-grained and more diverse set of contexts.

While Word2vec models typically perform better with large datasets containing millions of words, using a single page is sufficient for demonstration and learning purposes. This smaller dataset allows us to train the model quickly and understand how word2vec works without the need for extensive computational resources.

Key Points

We can run a preprocessing pipeline to obtain clear words that can be used as features
We learned how are words converted into vectors of numbers (which makes them interpretable for machines)
We can easily compute how words are similar to each other with the cosine similarity
Using gensim we can train our own word2vec models

Content from Transformers: BERT and Beyond

Last updated on 2025-11-24 | Edit this page

Estimated time: 120 minutes

Overview

Questions

What are some drawbacks of static word embeddings?
What are Transformers?
What is BERT and how does it work?
How can I use BERT to solve supervised NLP tasks?
How should I evaluate my classifiers?

Objectives

Understand how a Transformer works and recognize their main components.
Understand how to use pre-trained transformers (Use Case: BERT)
Use BERT to predict words in context.
Use BERT to classify texts.
Learn to evaluate your NLP Classifier.
Understand assumptions and basic evaluation for NLP outputs.

Word embeddings such as Word2Vec can be used to represent words as unique vectors instead of Python strings. These vector representations give us numerical “proxy” representations for words. And these allow us to provide mathematical definitions for attributing semantics or meaning to words. They also enable metrics and measures for studying linguistic relationships with words. For example, one can devise metrics for similarity and semantic closeness of words, by defining a measure of distance between their corresponding vectors. This has proven to be quite useful for downstream lexical-related tasks.

However, a big drawback of Word2Vec is that each word is represented in isolation, which means that once we finished training a model, each word has exactly one vector associated with it, regardless of the different contexts in which it appears in the corpus text. This is what is called static word embedding, and unfortunately that is a serious limitation in expressing finer-grained complexities in language. Words derive their meaning dynamically, based on the specific context in which they are used. Think of syntactic information, which is relevant to understand the difference between “the dog bit the man” and “the man bit the dog”. Another case is polysemy, where the same word can have very different meanings depending on the context, for example, “bit” in “the dog bit the man” and “in this bit of the book”. Therefore, we would like to have richer vectors of words that are themselves sensitive to their context in order to obtain finer-grained representations of word meaning.

Challenge

Polysemy in Language

Think of words (at least 2) that can have more than one meaning depending on the context. Come up with one simple sentence per meaning and explain what they mean in each context. Discuss: How do you know which of the possible meanings does the word have when you see it?

Show me the solution

Two possible examples can be the words ‘fine’ and ‘run’

Sentences for ‘fine’: - She has a fine watch (fine == high-quality) - He had to pay a fine (fine == penalty) - I am feeling fine (fine == not bad)

Sentences for ‘run’: - I had to run to catch the bus (run == moving fast) - Stop talking, before you run out of ideas (run (out) == exhaust)

Note how in the “run out” example we even have to understand that the meaning of run is not literal but goes accompanied with a preposition that changes its meaning.

Callout

Bridging Word Embeddings and Contextualized Models

The limitations of Word2Vec became apparent as researchers tackled more complex natural language understanding tasks. BERT, the model we will describe below, was not the first attempt to improve upon the drawbacks of Word2Vec. Several intermediate models emerged to address the shortcomings. Some prominent models were:

FastText (Joulin, et. al., 2016), developed by Facebook, extended Word2Vec by representing words as bags of character n-grams rather than atomic units. This subword tokenization approach enabled the model to generate embeddings for previously unseen words by combining learned subword representations—for instance, understanding “unhappiness” through its components “un-,” “happiness,” and “-ness.” FastText proved particularly valuable for morphologically rich languages and handling misspellings or rare word forms.

ELMo (Peters, et. al., 2018) marked a paradigm shift by integrating context into the word representations. Unlike Word2Vec’s static embeddings, ELMo generated different representations for the same word based on its surrounding context using bidirectional LSTM networks. The model was pretrained on large text corpora using a language modeling objective—predicting words from both left and right contexts simultaneously—and crucially introduced effective transfer learning to NLP. BERT would basically replicate this concept but using a more powerful neural network architecture: the Transformer, which allowed for the scaling of training material.

ULMFiT (Howard & Ruder, 2018). Universal Language Model Fine-tuning, also tackled the problem via transfer learning; that is, re-using the same model for learning several different tasks and hence enriching word representations after each task was learned. This idea also enriched BERT post-training methodologies.

These intermediate models established several crucial concepts: that subword tokenization could handle vocabulary limitations, that context-dependent representations were superior to static embeddings, that deep bidirectional architectures captured richer linguistic information, and most importantly, that large-scale pretraining followed by task-specific fine-tuning could dramatically improve performance across diverse NLP applications.

In late 2018, the BERT language model was introduced. Using a novel architecture called Transformer (2017), BERT was specifically designed to scale the amount of training data and integrate context into word representations. To understand BERT, we will first look at what a transformer is and we will then directly use some code to make use of BERT.

Transformers

The Transformer is a deep neural network architecture proposed by Google researchers in a paper called Attention is all you Need. They tackled specifically the NLP task of Machine Translation (MT), which is stated as: how to generate a sentence (sequence of words) in target language B given a sentence in source language A? We all know that translation cannot be done word by word in isolation, therefore integrating the context from both the source language and the target language is necessary. In order to translate, first one neural network needs to encode the whole meaning of the sentence in language A into a single vector representation, then a second neural network needs to decode that representation into tokens that are both coherent with the meaning of language A and understandable in language B. Therefore we say that translation is modeling language B conditioned on what language A originally said.

As seen in the picture, the original Transformer is an Encoder-Decoder network that tackles translation. We first need a token embedder which converts the string of words into a sequence of vectors that the Transformer network can process. The first component, the Encoder, is optimized for creating rich representations of the source sequence (in this case an English sentence) while the second one, the Decoder is a generative network that is conditioned on the encoded representation. The third component we see is the infamous attention mechanism, a third neural network what computes the correlation between source and target tokens (which word in Dutch should I pay attention to, to decide a better next English word?) to generate the most likely token in the target sequence (in this case Dutch words).

Challenge

Emulate the Attention Mechanism

Pair with a person who speaks a language different from English (we will call it language B). Think of 1 or 2 simple sentences in English and come up with their translations in the second language. In a piece of paper write down both sentences (one on top of the other with some distance in between) and try to: 1. Draw a mapping of words or phrases from language B to English. Is it always possible to do this one-to-one for words? 2. Think of how this might relate to attention in transformers?

Show me the solution

Here is an example of a sentence in English and its translation into Spanish. We can look at the final mapping and observe that: 1. Even though they are closely related languages, the translation is not linear 2. There is also not a direct word-to-word mapping between the sentences 3. Some words present in the source are not present in the target (and vice versa) 4. Some words are not translations of each other but they are still very relevant to understand the context

Next, we will see how BERT exploits the idea of a Transformer Encoder to perform the NLP Task we are interested in: generating powerful word representations.

BERT

BERT is an acronym that stands for Bidirectional Encoder Representations from Transformers. The name describes it all: the idea is to use the power of the Encoder component of the Transformer architecture to create powerful token representations that preserve the contextual meaning of the whole input segment, instead of each word in isolation. The BERT vector representations of each token take into account both the left context (what comes before the word) and the right context (what comes after the word). Another advantage of the transformer Encoder is that it is parallelizable, which made it possible for the first time to train these networks on millions of datapoints, dramatically improving model generalization.

Callout

Pretraining BERT

To obtain the BERT vector representations the Encoder is pre-trained with two different tasks: - Masked Language Model: for each sentence, mask one token at a time and predict which token is missing based on the context from both sides. A training input example would be “Maria [MASK] Groningen” and the model should predict the word “loves”. - Next Sentence Prediction: the Encoder gets a linear binary classifier on top, which is trained to decide for each pair of sequences A and B, if sequence A precedes sequence B in a text. For the sentence pair: “Maria loves Groningen.” and “This is a city in the Netherlands.” the output of the classifier is “True” and for the pair “Maria loves Groningen.” and “It was a tasty cake.” the output should be “false” as there is no obvious continuation between the two sentences.

Already the second pre-training task gives us an idea of the power of BERT: after it has been pretrained on hundreds of thousands of texts, one can plug-in a classifier on top and re-use the linguistic knowledge previously acquired to fine-tune it for a specific task, without needing to learn the weights of the whole network from scratch all over again. In the next sections we will describe the components of BERT and show how to use it. This model and hundreds of related transformer-based pre-trained encoders can also be found on Hugging Face.

BERT Architecture

The BERT Architecture can be seen as a basic NLP pipeline on its own:

Tokenizer: splits text into tokens that the model recognizes
Embedder: converts each token into a fixed-sized vector that represents it. These vectors are the actual input for the Encoder.
Encoder several neural layers that model the token-level interactions of the input sequence to enhance meaning representation. The output of the encoder is a set of Hidden layers, the vector representation of the ingested sequence.
Output Layer: the final encoder layer (which we depict as a sequence H’s in the figure) contains arguably the best token-level representations that encode syntactic and semantic properties of each token, but this time each vector is already contextualized with the specific sequence.
OPTIONAL Classifier Layer: an additional classifier can be connected on top of the BERT token vectors which are used as features for performing a downstream task. This can be used to classify at the text level, for example sentiment analysis of a sentence, or at the token-level, for example Named Entity Recognition.

BERT uses (self-) attention, which is very useful to capture longer-range word dependencies such as co-reference, where, for example, a pronoun can be linked to the noun it refers to previously in the same sentence. See the following example:

BERT for Word-Based Analysis

Let’s see how these components can be manipulated with code. For this we will be using the HuggingFace’s transformers Python library. The first two main components we need to initialize are the model and tokenizer. The HuggingFace hub contains thousands of models based on a Transformer architecture for dozens of tasks, data domains and also hundreds of languages. Here we will explore the vanilla English BERT which was how everything started. We can initialize this model with the next lines:

PYTHON

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")

BERT Tokenizer

We start with a string of text as written in any blog, book, newspaper etcetera. The tokenizer object is responsible of splitting the string into recognizable tokens for the model and embedding the tokens into their vector representations

PYTHON

text = "Maria loves Groningen"
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)

The print shows the encoded_input object returned by the tokenizer, with its attributes and values. The input_ids are the most important output for now, as these are the token IDs recognized by BERT

{
    'input_ids': tensor([[  101,  3406,  7871,   144,  3484, 15016,   102]]),
    'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}

NOTE: the printing function shows transformers objects as dictionaries; however, to access the attributes, you must use the Python object syntax, such as in the following example:

PYTHON

print(encoded_input.input_ids.shape)

Output:

torch.Size([1, 7])

The output is a 2-dimensional tensor where the first dimension contains 1 element (this dimension represents the batch size), and the second dimension contains 7 elements which are equivalent to the 7 tokens that BERT generated with our string input.

In order to see what these Token IDs represent, we can translate them into human readable strings. This includes converting the tensors into numpy arrays and converting each ID into its string representation:

PYTHON

token_ids = list(encoded_input.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print("IDs:", token_ids)
print("TOKENS:", string_tokens)

IDs: [101, 3406, 7871, 144, 3484, 15016, 102]

TOKENS: ['[CLS]', 'Maria', 'loves', 'G', '##ron', '##ingen', '[SEP]']

First we feed the tokenized input into the model:

PYTHON

output = model(**encoded_input)

Now we can explore the representations in the model.

Callout

In the case of wanting to obtain a single vector for Groningen, you can average the three vectors that belong to the token pieces that ultimately form that word. For example:

PYTHON

import numpy as np
tok_G = output.last_hidden_state[0][3].detach().numpy()
tok_ron = output.last_hidden_state[0][4].detach().numpy()
tok_ingen = output.last_hidden_state[0][5].detach().numpy()

tok_groningen = np.mean([tok_G, tok_ron, tok_ingen], axis=0)
print(tok_groningen.shape)
print(tok_groningen[:10])

We obtain the output:

(768,)
[ 0.00990007 -0.44266668  0.05274092  0.08865168  0.71115637 -0.4092603
  0.18841815  0.19883917  0.24680579 -0.07899686]

We use the functions detach().numpy() to bring the values from the Pytorch execution environment (for example a GPU) into the main Python thread and treat it as a Numpy vector for convenience. Then, since we are dealing with three Numpy vectors we can average the three of them and end up with a single Groningen vector of 768-dimensions representing the average of 'G', '##ron', '##ingen'.

Analyzing polysemy with BERT

We can encode two sentences containing the word note to see how BERT actually handles polysemy (note means something very different in each sentence) thanks to the representation of each word now being contextualized instead of isolated as was the case with Word2vec.

PYTHON

# We tokenize the sentence and identify the two different 'note' tokens inside the sentence
text_note = "Please note that this bank note is fake!"
tokenized_text = tokenizer(text_note, return_tensors='pt')
token_ids = list(tokenized_text.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(string_tokens)

We are printing the BERT tokens extracted from the sentence, and displaying them:

['[CLS]', 'Please', 'note', 'that', 'this', 'bank', 'note', 'is', 'fake', '!', '[SEP]']

We then manually count which token indices in the list belong to note, so we can extract their vectors later

PYTHON

note_index_1 = 2 # first occurrence of `note`
note_index_2 = 6 # second occurrence of `note`
print(string_tokens[note_index_1], string_tokens[note_index_2])

We now pass the sentence through the BERT encoder and extract the encoded vectors belonging to each note token:

PYTHON

# Encode the sentence and extract the vector belonging to each 'note' token
with torch.no_grad():
    bert_output = model(**tokenized_text)
note_vector_1 = bert_output.last_hidden_state[0][note_index_1].detach().numpy()
note_vector_2 = bert_output.last_hidden_state[0][note_index_2].detach().numpy()

print(note_vector_1[:5])
print(note_vector_2[:5])

[1.0170387  0.93691176 0.30571502 0.33091038 0.73093796]
[0.17840035 0.65847856 0.22412607 0.21162085 0.5393072 ]

By printing the first 5 dimensions of the vectors we can see that, even though both vectors technically belong to the same string input note, they have different numeric representations. This is the case because the BERT encoder takes into account the position of the token as well as all the previous and following tokens when computing the representation for each token.

To be sure, we can compute the cosine similarity of the word note in the first sentence and the word note in the second sentence confirming that they are indeed two different representations, even when both cases have the same token-id and they are the 12th token of the sentence:

PYTHON

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

vector1 = np.array(note_vector_1).reshape(1, -1)  # Reshape to 2D array
vector2 = np.array(note_vector_2).reshape(1, -1)  # Reshape to 2D array

similarity = cosine_similarity(vector1, vector2)
print(f"Cosine Similarity 'note' vs 'note': {similarity[0][0]}")

With this small experiment, we have confirmed that the Encoder produces context-dependent word representations, as opposed to Word2Vec, where note would always have the same vector no matter where it appeared.

Callout

When running examples in a BERT pre-trained model, it is advisable to wrap your code inside a torch.no_grad(): context. This is linked to the fact that BERT is a Neural Network that has been trained (and can be further fine-tuned) with the Backpropagation algorithm. Essentially, this wrapper tells the model that we are not in training mode, and we are not interested in updating the weights (as it would happen when training any neural network), because the weights are already optimal enough. By using this wrapper, we make the model more efficient as it does not need to calculate the gradients for an eventual backpropagation step, since we are only interested in what comes out of the Encoder. So the previous code can be made more efficient like this:

PYTHON

import torch 

with torch.no_grad():
    output = model(**encoded_input)
    print(output)
    print(output.last_hidden_state.shape)

BERT as a Language Model

As mentioned before, the main pre-training task of BERT is Language Modelling (LM): calculating the probability of a word based on the known neighboring words (yes, Word2Vec was also a kind of LM!). Obtaining training data for this task is very cheap, as all we need is millions of sentences from existing texts, without any labels. In this setting, BERT encodes a sequence of words, and predicts from a set of English tokens, what is the most likely token that could be inserted in the [MASK] position

We can therefore start using BERT as a predictor for word completion. From now own, we will learn how to use the pipeline object, this is very useful when we only want to use a pre-trained model for predictions (no need to fine-tune or do word-specific analysis). The pipeline will internally initialize both model and tokenizer for us and also merge word pieces back into complete words.

Callout

The pipeline object is very convenient to test all the out-of-the-box models that you can find in the HuggingFace platform, and if you are lucky and you like what you see, this is the fastest way to get model predictions.

But consider that for some tasks you will want more control over the tokenization and how the model merges the predictions. If that is the case, initializing the tokenizer, encoding the texts into word-vectors and running the model inference yourself might be the way to go (like it is shown in the Polysemy in BERT section).

Even if you like what you see, it is important that you evaluate the performance of the models in your data and use case regardless of what the model performance claims to be. We will show an example with a couple of evaluation metrics in the BERT for Text Classification section.

In this case again we use bert-base-cased, which refers to the vanilla BERT English model. Once we declared a pipeline, we can feed it with sentences that contain one masked token at a time (beware that BERT can only predict one word at a time, since that was its training scheme). For example:

PYTHON

from transformers import pipeline

def pretty_print_outputs(sentences, model_outputs):
    for i, model_out in enumerate(model_outputs):
        print("\n=====\t",sentences[i])
        for label_scores in model_out:
            print(label_scores)


nlp = pipeline(task="fill-mask", model="bert-base-cased", tokenizer="bert-base-cased")
sentences = ["Paris is the [MASK] of France", "I want to eat a cold [MASK] this afternoon", "Maria [MASK] Groningen"]
model_outputs = nlp(sentences, top_k=5)
pretty_print_outputs(sentences, model_outputs)

=====	 Paris is the [MASK] of France
{'score': 0.9807965755462646, 'token': 2364, 'token_str': 'capital', 'sequence': 'Paris is the capital of France'}
{'score': 0.004513159394264221, 'token': 6299, 'token_str': 'Capital', 'sequence': 'Paris is the Capital of France'}
{'score': 0.004281804896891117, 'token': 2057, 'token_str': 'center', 'sequence': 'Paris is the center of France'}
{'score': 0.002848200500011444, 'token': 2642, 'token_str': 'centre', 'sequence': 'Paris is the centre of France'}
{'score': 0.0022805952467024326, 'token': 1331, 'token_str': 'city', 'sequence': 'Paris is the city of France'}

=====	 I want to eat a cold [MASK] this afternoon
{'score': 0.19168031215667725, 'token': 13473, 'token_str': 'pizza', 'sequence': 'I want to eat a cold pizza this afternoon'}
{'score': 0.14800849556922913, 'token': 25138, 'token_str': 'turkey', 'sequence': 'I want to eat a cold turkey this afternoon'}
{'score': 0.14620967209339142, 'token': 14327, 'token_str': 'sandwich', 'sequence': 'I want to eat a cold sandwich this afternoon'}
{'score': 0.09997560828924179, 'token': 5953, 'token_str': 'lunch', 'sequence': 'I want to eat a cold lunch this afternoon'}
{'score': 0.06001955270767212, 'token': 4014, 'token_str': 'dinner', 'sequence': 'I want to eat a cold dinner this afternoon'}

=====	 Maria [MASK] Groningen
{'score': 0.24399833381175995, 'token': 117, 'token_str': ',', 'sequence': 'Maria, Groningen'}
{'score': 0.12300779670476913, 'token': 1104, 'token_str': 'of', 'sequence': 'Maria of Groningen'}
{'score': 0.11991506069898605, 'token': 1107, 'token_str': 'in', 'sequence': 'Maria in Groningen'}
{'score': 0.07722211629152298, 'token': 1306, 'token_str': '##m', 'sequence': 'Mariam Groningen'}
{'score': 0.0632941722869873, 'token': 118, 'token_str': '-', 'sequence': 'Maria - Groningen'}

When we call the nlp pipeline, requesting to return the top_k most likely suggestions to complete the provided sentences (in this case k=5). The pipeline returns a list of outputs as Python dictionaries. Depending on the task, the fields of the dictionary will differ. In this case, the fill-mask task returns a score (between 0 and 1, the higher the score the more likely the token is), a tokenId, and its corresponding string, as well as the full “unmasked” sequence.

In the list of outputs we can observe: the first example shows correctly that the missing token in the first sentence is capital, the second example is a bit more ambiguous, but the model at least uses the context to correctly predict a series of items that can be eaten (unfortunately, none of its suggestions sound very tasty); finally, the third example gives almost no useful context so the model plays it safe and only suggests prepositions or punctuation. This already shows some of the weaknesses of the approach.

We will next see the case of combining BERT with a classifier on top.

Discussion

Challenge

Play with the fill-mask pipeline and try to find examples where the model gives bad predictions and examples where the predictions are very good. You can change the top_k parameter and even play with the multilingual BERT model to compare. To do this last think you should change the model and tokenizer parameters to bert-base-multilingual-cased :::: solution This is a free exercise, so anything works. But even by running the same cases with the multilingual models we see some interesting aspects. For example, the predictions are of less quality in English. This is due to the “spread” of information across other languages, including a worse tokenization, since this model tries do predict for around 200 languages.

Another interesting example is searching for bias in the completions. Compare the outputs you get for the sentences “This man works as a [MASK].” and “This woman works as a [MASK].”.

:::

BERT for Text Classification

The task of text classification is assigning a label to a whole sequence of tokens, for example a sentence. With the parameter task="text_classification" the pipeline() function will load the base model and automatically add a linear layer with a softmax on top. This layer can be fine-tuned with our own labeled data or we can also directly load the fully pre-trained text classification models that are already available in HuggingFace.

Let’s see the example of a ready pre-trained emotion classifier based on RoBERTa model. This model was fine-tuned in the Go emotions dataset, taken from English Reddit and labeled for 28 different emotions at the sentence level. The fine-tuned model is called roberta-base-go_emotions. This model takes a sentence as input and outputs a probability distribution over the 28 possible emotions that might be conveyed in the text. For example:

PYTHON


classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=3)

sentences = ["I am not having a great day", "This is a lovely and innocent sentence", "Maria loves Groningen"]
model_outputs = classifier(sentences)

pretty_print_outputs(sentences, model_outputs)

=====	 I am not having a great day
{'label': 'disappointment', 'score': 0.46669483184814453}
{'label': 'sadness', 'score': 0.39849498867988586}
{'label': 'annoyance', 'score': 0.06806594133377075}

=====	 This is a lovely and innocent sentence
{'label': 'admiration', 'score': 0.6457845568656921}
{'label': 'approval', 'score': 0.5112180113792419}
{'label': 'love', 'score': 0.09214121848344803}

=====	 Maria loves Groningen
{'label': 'love', 'score': 0.8922032117843628}
{'label': 'neutral', 'score': 0.10132959485054016}
{'label': 'approval', 'score': 0.02525361441075802}

This code outputs again a list of dictionaries with the top-k (k=3) emotions that each of the two sentences convey. In this case, the first sentence evokes (in order of likelihood) disappointment, sadness and annoyance; whereas the second sentence evokes love, neutral and approval. Note however that the likelihood of each prediction decreases dramatically below the top choice, so perhaps this specific classifier is only useful for the top emotion.

Callout

Fine-tuning BERT is very cheap, because we only need to train the classifier layer, a very small neural network, that can learn to choose between the classes (labels) for your custom classification problem, without needing a big amount of annotated data. This classifier is a one-layer neural layer that assigns a score that can be translated to the probability over a set of labels, given the input features provided by BERT, which already encodes the meaning of the entire sequence in its hidden states. Unfortunately fine-tuning is out of the scope of this course but you can learn more about fine-tuning BERT-like models in this HuggingFace tutorial

Model Evaluation

Model evaluation is a critical step in any machine learning project, and it is also the case for NLP. While it may be tempting to rely on the accuracy scores observed during training, or the model descriptions provided on the web, this approach can be misleading and often results in models that fail in real-world applications, and that includes your data.

Callout

The fundamental principle of model evaluation is to test your model on data it has never seen before. If you are training your own models, this is typically achieved by splitting your dataset into training, validation, and test sets. The training set is used to teach the model, the validation set helps tune hyperparameters or other aspects of model development, and the test set provides a less biased assessment of the final model’s performance. Without this separation, you risk overfitting, where your model memorizes the training data rather than learning generalizable patterns.

If you are using out-of-the-box models, evaluation is also mandatory, as you must be sure that the outputs you obtain behave in the manner that you would expect. In this case you only need to design the test set (with human annotators) and measure the performance. It doesn’t matter how broadly a model was tested, you must always confirm that it is suitable for your intended usage. It is also not enough to use semi-automated evaluation, designing human-labeled data will also provide insightful information for shaping your own task and judging the predictions you get.

Evaluation Metrics

Several metrics are commonly used to evaluate NLP models, each offering different insights into performance. Here we will describe the 4 most used in suervised learning classifiers

Let’s give the toy example of classifying photographs of cats and dogs. In this task, there are 2 classes: cat and dog, and we have a model that return the label cat or dog given an input photograph. If we want to evaluate how good is our model at recognizing dogs (our positive class in this example), there are 4 possibilities when comparing with the ground truth (your labeled data):

True Positives (TP): the number of dog photographs the model correctly labeled as dog.
True Negatives (TN): the number of non-dog photographs correctly labeled as something else.
False Positives (FP): the number of non-dog photographs the model mistakenly labeled as dog.
False Negatives (FN): the number of dog photographs mistakenly labeled as something else.

Based on this simple counts, we can derive four metrics that inform us at scale about the performance of our classifiers:

Accuracy: measures the global proportion of correct predictions, regardless of the class they hold. This is al true cases (TP + TN) divided by all tested instances (TP+TN+FP+FN).
Precision: This answers the following question: “Of all the predictions the model made for a particular class, how many were actually correct?”. This is TP divided by (TP+FP).
Recall: This answers the following question: “Of all the actual instances of a class, how many did the model successfully identify?”. This is TP divided by (TP+FN).
F1-score: provides a harmonic mean of precision and recall, offering a single metric that balances both concerns.

Deciding which metric is the most relevant to your case depends on your specific task, but having a view at all of the metrics is always insightful.

An example for a classifier of Cats and Dogs. Source: Wikipedia

It’s important to remember that a high accuracy score doesn’t always indicate a good model. For example, if you’re classifying rare events that occur only 5% of the time, a naive model that always predicts “no event” would achieve 95% accuracy while being completely useless. This is why examining multiple metrics and understanding your data’s characteristics is essential for proper model evaluation.

In Python, the scikit-learn package already provides us with these (and many other) evaluation metrics. All we need to do is prepare an ordered list with the true_labels and a list with the corresponding predicted_labels for each example in our data.

To illustrate the usage of evaluation, we will use a simpler sentiment model that predicts 5 classes: Very positive, positive, neutral, negative and very negative. Here is an example of the model predictions for four toy examples:

PYTHON

from transformers import pipeline

# Load the classification pipeline with the specified model
pipe = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

# Classify a new sentence
sentences = [
    "I love this product! It's amazing and works perfectly.",
    "The movie was a bit boring, I could predict the ending since minute 1.",
    "Mary Shelley wrote this book around 1816.",
    "Everything suuuucks!"
]
gold_labels = [
    "Very Positive",
    "Negative",
    "Neutral",
    "Very Negative"
]

# The pipeline can also run on a batch of examples
result = pipe(sentences)

# Print the result
predicted_labels = []
for res in result:
    print(res)
    predicted_labels.append(res['label'])

{'label': 'Very Positive', 'score': 0.5586304068565369}
{'label': 'Negative', 'score': 0.9448591470718384}
{'label': 'Neutral', 'score': 0.9229359030723572}
{'label': 'Very Negative', 'score': 0.36225152015686035}

We can see that the model predicts correctly the 4 examples we gave. This is unsurprising as they are incredibly obvious examples. We can also print the results and inspect them because they are only 4 instances, but it is clearly not a scalable approach.

Callout

Note that many models will provide a confidence score, with their predictions. It is very tempting to interpret these scores as a proxy to “how certain is the model of prediction X”. However, you should be very careful, this score is only a relative confidence measure with respect to the training data, and it does not always translate well to unseen data. Most of the times it is better to just ignore it, especially if it is a model that you didn’t train yourself.

We can obtain an automated evaluation report, including the basic evaluation metrics, from scikit-learn by calling:

PYTHON

from sklearn.metrics import classification_report

print(classification_report(y_true=gold_labels, y_pred=predicted_labels))

The output is a table showing a breakdown of metrics by class. The column support shows how many examples per class were present in the gold data.

               precision    recall  f1-score   support

     Negative       1.00      1.00      1.00         1
      Neutral       1.00      1.00      1.00         1
Very Negative       1.00      1.00      1.00         1
Very Positive       1.00      1.00      1.00         1

     accuracy                           1.00         4
    macro avg       1.00      1.00      1.00         4
 weighted avg       1.00      1.00      1.00         4

These 4 metrics range from 0 to 1 (note that sometimes people multiply the scores by 100 to gain more granularity with decimal places). In this case, because we had a perfect score everything amounts to 1. In the most catastrophic scenario all scores would be zero.

Challenge

Evaluate Sentiment Classifier

Now it is time to scale things a little but more… Use the same pipeline from the given toy example to run predictions over 100 examples of short book reviews. Then print the classification report for the given test set. These examples are given in the data/sentiment_film_data.tsv file.

You can use the following helper functions, the first one helps you read the file and the second one normalizes the 5-class predictions into the 3-class annotations given in the test set:

PYTHON

def load_data(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()[1:] # skip header
    sentences, labels = zip(*(line.strip().split('\t') for line in lines))
    return list(sentences), list(labels)

def get_normalized_labels(predictions):
    # predicitons is a list with dicts such as {'label': 'positive', 'score': 0.95}
    # We also need to normalize the labels to match the true labels (which are only 'positive' and 'negative')
    normalized = []
    for pred in predictions:
        label = pred['label'].lower()
        if 'positive' in label:
            normalized.append('positive')
        elif 'negative' in label:
            normalized.append('negative')
        else:
            normalized.append('neutral')
    return normalized

Show me the solution

PYTHON

from sklearn.metrics import classification_report, precision_recall_fscore_support
import matplotlib.pyplot as plt

sentences, labels = load_data('data/sentiment_film_data.tsv')
# The labels from our dataset
y_true = labels
# Run the model to get predictions per sentence
y_pred = pipe(sentences)
# Normalize the labels to match the gold standard
y_pred = get_normalized_labels(y_pred)

# Detailed report with all metrics
print(classification_report(y_true, y_pred))

Confusion Matrix

The confusion matrix is another direct and informative tool for understanding your model’s performance, by offering an intuitive visualization of your model’s behavior in detail. It is a table that compares your model’s predictions against the true labels. The rows typically represent the actual classes, while the columns show the predicted classes. Each cell contains the count of instances that fall into that particular combination of true and predicted labels. Perfect predictions would result in all counts appearing along the diagonal of the matrix, with zeros everywhere else.

PYTHON

from sklearn.metrics import ConfusionMatrixDisplay

def show_confusion_matrix(y_true, y_pred, labels=None):
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=labels, cmap='Blues')
    plt.show()

show_confusion_matrix(y_true, y_pred)

This code shows the following matrix:

Some of the insights from a confusion matrix include:

Observe which classes your model handles well and which ones it struggles with.
Observe confusion patterns between classes. By examining the off-diagonal cells, you can identify systematic errors your model makes. For example, perhaps your sentiment classifier consistently confuses neutral statements with positive ones, but rarely mistakes neutral for negative.
Detect Bias, for example by exposing a tendency to over-predict certain classes while ignoring others.
Detect class imbalance. Even if your overall accuracy seems high, the confusion matrix might reveal that your model achieves this by simply predicting the majority class most of the time.

BERT for Token Classification

Just as we plugged in a trainable text classifier layer, we can add a token-level classifier that assigns a class to each of the tokens encoded by a transformer (as opposed to one label for the whole sequence). A specific example of this task is Named Entity Recognition, but you can basically define any task that requires to highlight sub-strings of text and classify them using this technique.

Named Entity Recognition

Named Entity Recognition (NER) is the task of recognizing mentions of real-world entities inside a text. The concept of Entity includes proper names that unequivocally identify a unique individual (PER), place (LOC), organization (ORG), or other object/name (MISC). Depending on the domain, the concept can expanded to recognize other unique (and more conceptual) entities such as DATE, MONEY, WORK_OF_ART, DISEASE, PROTEIN_TYPE, etcetera…

In terms of NLP, this boils down to classifying each token into a series of labels (PER, LOC, ORG, O[no-entity] ). Since a single entity can be expressed with multiple words (e.g. New York) the usual notation used for labeling the text is IOB (Inner Out Beginnig of entity) notations which identifies the limits of each entity tokens. For example:

This is a typical sequence classification problem where an imput sequence must be fully mapped into an output sequence of labels with global constraints (for example, there can’t be an inner I-LOC label before a beginning B-LOC label). Since the labels of the tokens are context dependent, a language model with attention mechanism such as BERT is very beneficial for a task like NER.

Because this is one of the core tasks in NLP, there are dozens of pre-trained NER classifiers available in HuggingFace that you can use right away. We use once again the pipeline() wrapper to directly run the model for predictions , in this case with task="ner". For example:

PYTHON

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# We can also pass to the pipeline the initialized model & tokenizer. 
# This way we have access to both separately if we need them later
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

ner_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")
example = "My name is Wolfgang Schmid and I live in Berlin"

ner_results = ner_classifier(example)
for nr in ner_results:
    print(nr)

The code now prints the following:

{'entity_group': 'PER', 'score': 0.9995944, 'word': 'Wolfgang Schmid', 'start': 11, 'end': 26}
{'entity_group': 'LOC', 'score': 0.99956733, 'word': 'Berlin', 'start': 41, 'end': 47}

As you can see, the outputs contain already the entities aggregated at the Span Leven (instead of the Token Level). Word pieces are merged back into human words and also multiword entities are assigned a single entity unified label. Depending on your use case you can request the pipeline to give different aggregation_strateg[ies]. More info about the pipeline can be found here.

The same evaluation metrics you learned for the text classifier can also be applied to this classification task. A common Python library to deal with token classification is seqeval or evaluate. Remember to always test on a significant human-labeled dataset to assess if the predictions you are getting make sense. If they don’t, more advances use of the models, including fine-tuning should be used.

What did we learn in this lesson?

Key Points

Static word representations, such as word2vec, still lack of enough context to do more advanced tasks, we made this weakness evident by studying polysemy.
The transformer architecture consists of three main components: an Encoder to create powerful text representations (embeddings), an Attention Mechanism to learn more from the full sequence context, and a Decoder, a generative model to predicts the next token based on the context it has so far.
BERT is a deep encoder that creates rich contextualized representations of words and sentences. These representations are very powerful features that can be re-used by other machine Learning and deep learning models.
Several of the core NLP tasks can be solved using Transformer-based models. In this episode we covered language modeling (fill-in the mask), text classification (sentiment analysis) and token classification (named entity recognition).
Evaluating the model performance using your own data for your own use case is crucial to understand possible drawbacks when using this model for unknown predictions

Content from Episode 3: Using large language models

Last updated on 2025-09-24 | Edit this page

Estimated time: 60 minutes

Background

Chat assistants like ChatGPT and Claude, which are based on Large Language Models (LLMs) are widely used today for tasks such as content generation, question answering, research and software development. The rapid rise of such models has had quite a disruptive and strong impact. But what are these models exactly? How do they work ‘under the hood’? And how can one use them programmatically, in a responsible and effective way?

This episode is a gentle introduction to LLMs which aims to equip you with knowledge of the underpinnings of LLMs based on transformers architecture, as well as practical skills to programmatically work with LLMs in your own projects.

Company A Company B Company C

Company D Company E Company F Company C

1. What are Large Language Models (LLMs)?

Large language models (LLMs) are transformer-based language models that are specialised to interpret and generate text, and to converse in a conversational-like manner with humans. The text that they generate are mostly natural language but can, in theory, constitute any character or symbol sequence such as software code. They represent a significant advancement in AI and NLP. and are trained on vast amounts of textual data mostly obtained from the internet.

1.1 Examples of LLMs

Many different LLMs have been, and continue to be, developed. There are both proprietary and open-source varieties. Open-source varieties often make the data that their LLMs are trained on free, open and accessible online. Some even make the code they use to train these models open-source as well. Below is a summary of some current LLMs together with their creators, chat assistant interfaces, and proprietary status:

1.2 Applications of LLMs

LLMs can be used for many different helpful tasks. Some common tasks include:

Question Answering
Text Generation
Text Summarisation
Sentiment Analysis
Machine Translation
Code Generation

Exercise 1: Your first programmatic LLM interaction (30 minutes)

Before exploring how we can invoke LLMs programmatically to solve the kinds of tasks abve, let us setup and load our first LLM.

Step 1. Setup code

Install required packages transformers and torch and import required libraries.

PYTHON

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Step 2: Load and setup an LLM

Let’s load a lightweight LLM.

PYTHON

# We'll use SmolLM-135M - an open, small, fast model
# model_name = "HuggingFaceTB/SmolLM2-135M" # base model
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct" # fine-tuned assistant model
# model_name = "HuggingFaceTB/SmolLM3-3B-Base" # base model
# model_name = "HuggingFaceTB/SmolLM3-3B" # fine-tuned assistant model

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Check if model is loaded correctly
print(f"Model loaded! It has {model.num_parameters():,} parameters")

Step 3: Basic Text Generation

Let’s perform inference with the LLM to generate some text.

PYTHON

# Set pad_token_id to eos_token_id to avoid warnings
# if tokenizer.pad_token_id is None:
#     tokenizer.pad_token = tokenizer.eos_token
#     model.config.pad_token_id = tokenizer.eos_token_id
    
# Build pipeline
llm = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "{}"
response = llm(prompt, max_new_tokens=100, do_sample=True, top_k=50, temperature=0.7)[0]["generated_text"]
print(f"Prompt: {prompt}")
print(f"Response: {response}")

max_new_tokens: sets maximum number of tokens (roughly words/word pieces) that the model will generate in total. It’s a hard limit - generation stops when this limit is reached, even mid-sentence. Useful for controlling cost / time. The more tokens you need to generate for an answer the more time it takes. LLMs called through paid APIs often charge per a set number of tokens (e.g. $0.008 per 1000 tokens).

temperature: positive float value that controls the randomness/creativity of the model’s token selection during generation. The model predicts probabilities for each possible next token, temperature modifies these probabilities before making the final choice.

0.0: Completely deterministic - always picks the most likely token 1.0+: More random, and “creative”, but potentially less coherent

do_sample: when do_sample=True, the model generates text by sampling from the probability distribution of possible next tokens. If do_sample=False, the model uses greedy decoding (always picking the most likely next token), which makes the output more deterministic but often repetitive.

top_k: This is a sampling strategy called Top-K sampling. Instead of considering all possible next tokens, the model looks at the k most likely tokens (based on their probabilities) and samples only from that reduced set. If top_k=50, the model restricts its choices to the top 50 most probable words at each step.

Step 5. Sentiment analysis

Let us try a sentiment analysis task to see how well different models (with different number of parameters perform). Consider the following set of lines from product reviews:

Product reviews:

I love this movie! It was absolutely fantastic and made my day. [positive]
This product is terrible. I hate everything about it. [negative]
Nothing says quality like a phone that dies after 20 minutes. [negative]
The movie was dark and depressing — exactly what I was hoping for. [positive]
The food was delicious, but the service was painfully slow. [mixed]

Set the prompt for this as (substitute the above sentences for text each time):

Classify the sentiment of the following text as either POSITIVE or NEGATIVE. Text: "{text}"

Examine the results afterwards to see which models correctly classified them and which didn’t.

PYTHON

sentiment_llm = pipeline("text-generation", model=model, tokenizer=tokenizer)
sentiment_texts = [
    "I love this movie! It was absolutely fantastic and made my day.",
    "This product is terrible. I hate everything about it.",
    "Nothing says quality like a phone that dies after 20 minutes.",
    "The movie was dark and depressing — exactly what I was hoping for.",
    "The food was delicious, but the service was painfully slow."
]
text = sentiment_texts[0]
prompt = "Classify the sentiment of the following text as either POSITIVE or NEGATIVE. Text: "{text}""
response = sentiment_llm(prompt, max_new_tokens=100, do_sample=True, top_k=50, temperature=0.7)[0]["generated_text"]
print(f"Prompt: {prompt}")
print(f"Response: {response}")

Discussion: Post-exercise questions

What did you notice about the models’ responses?
- Were they always accurate? Always coherent?
- How did different prompts affect the quality?
Temperature Effects:
- What happened when temperature was low (e.g. 0.0 or 0.1) vs. high (e.g. 1.2)?
- Under which circumstances would you want more random / creative responses vs. consistent responses?
Model Size:
- What were the differences across different models?
- What trade-offs do you think exist between model size and performance?
Max Length Effects:
- Did you notice a difference in speed of responses when adjusting the max_length parameter?

Exercise 2: Other NLP tasks

Write and execute the following prompts in Python code within Jupyter notebook.

Question answering:

answering general knowledge questions

Human: What is the longest river in the world?
LLM: The Nile River in Africa is traditionally considered the longest river in the world, stretching about 6,650 km (4,130 miles) through 11 countries before emptying into the Mediterranean Sea.

expert advice in a particular domain or scientific field

Human: What are good strategies for film-making on a budget in Night conditions?
LLM: Night filmmaking can be both creatively rewarding and technically challenging, especially on a budget. Here are some strategies that indie filmmakers often use to maximize image quality and mood without expensive gear:...

Text generation:

Writing essays, business plans and other documents

Human: Write a template cover letter for a mid-level software engineering position at a large AI startup.
LLM: Dear [Hiring Manager’s Name], I am excited to apply for the Software Engineer position at [Company Name]...

Creative writing (slogans, poems, fiction, jokes)

Human: Write a two sentence comedy sketch.
LLM: [Scene: A man sits in a job interview.] Interviewer: "Your resume says you’re fluent in Python?" Candidate: pulls out a snake from his briefcase "She does most of the talking."

Text summarisation

Human: Shorten: "Amsterdam, the capital of the Netherlands, is a city celebrated for its rich history, cultural diversity, and iconic canals. Known as the “Venice of the North,” it is home to a vast network of waterways lined with elegant 17th-century houses, giving the city its distinct charm. Amsterdam seamlessly blends old-world character with modern vibrancy, attracting millions of visitors each year."
LLM: Amsterdam, the capital of the Netherlands, is known for its history, canals, and 17th-century houses, combining old charm with modern vibrancy that attracts millions each year.

Sentiment or text classification

Sentiment analysis
Human: Is this a positive, neutral or negative statement about the movie: "Incredible and not in a good way."
LLM: That’s a negative statement about the movie. The phrase “Incredible” usually suggests something impressive, but the qualifier “not in a good way” makes it clear the speaker means the film was bad or shockingly poor.

Text classification
Human: Categorise this statement into either sports, business, or science: "Lionel Messi scored twice to lead Argentina to victory in the World Cup qualifier."
LLM: This statement falls under Sports. It talks about a soccer player and a match result, which clearly relates to athletic competition.

Language translation
Human: Translate "I have recently naturalised as a Dutch citizen" into Dutch. LLM: Ik ben onlangs genaturaliseerd als Nederlands staatsburger.

Generating software code

Human: Write a short Python function to demonstrate bubble sort
LLM: ...

PYTHON

def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        # Last i elements are already in place
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                # Swap if the element is greater than the next
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

# Example usage
numbers = [64, 34, 25, 12, 22, 11, 90]
sorted_numbers = bubble_sort(numbers)
print(sorted_numbers)

Key Takeaways

LLMs are generative models - they predict the next most likely tokens
Prompts matter - the way you ask affects what you get
Parameters control behavior - temperature, max_length, etc. tune the output
Models have limitations - they can be wrong, inconsistent, or biased
Size vs Speed trade-off - smaller models are faster but less capable

1.3 LLM selection criteria

Choosing the right LLM for your specific use case requires consideration of multiple factors. This section will guide you through some decision points that will help you select an appropriate model for your needs.

1.3.1 Openness and Licensing Considerations

The spectrum of model availability ranges from fully open to completely proprietary:

Open-weights release the trained model parameters while keeping training code or data proprietary. This allows you to run and fine-tune the model locally but if you don’t have the code used to train the model or information about the architecture used, it limits your ability to fully understand or replicate the training process.

Open training data they release the text data used for pretraining.

Open architecture they publish a paper about the neural network architecture and specific configuration they used for training. Or they release the actual source code they used for pretraining.

Ideally, if you want to use a model for empirical academic research you might decide for models that are completely open in all three of the above facets. Although, open training data is quite rare for available state-of-the-art models.

Commercial/proprietary models like GPT-4, Claude, or Gemini are accessed only through APIs. While often offering superior performance, they provide no access to internal architecture and may have usage restrictions or costs that scale with volume.

Consider your requirements for: - Code modification and customization - Data privacy and control - Commercial usage rights - Research reproducibility - Long-term availability guarantees

If you wish to build an application that makes use of LLM text generation, and you need accurate results, commercial APIs may be more suitable.

1.3.2 Hardware and Compute Requirements

Your available computational resources significantly constrain your model options:

Modern GPU access (RTX 4090, A100, H100, etc.) enables you to run larger models locally. Consider: - VRAM requirements: 7B parameter models typically need 14+ GB, 13B models need 26+ GB, 70B models require 140+ GB or multi-GPU setups - Inference speed requirements for your application - Whether you need real-time responses or can accept slower processing

CPU-only environments limit you to smaller models (such as SmolLM2 and SmolLM3) or quantized versions.

Cloud/API access removes hardware constraints but introduces ongoing costs and potential latency issues.

1.3.3 Performance Evaluation

Different models excel at different tasks. Some evaluation criteria include:

General capability benchmarks like those found on the Open LLM Leaderboard provide standardized comparisons across models for reasoning, knowledge, and language understanding tasks.

Multilingual performance varies significantly between models. The MMLU-Pro benchmark offers insights into cross-lingual capabilities if you need support for non-English languages.

Task-specific performance should be evaluated based on your particular needs: - Code generation - Mathematical reasoning - Reading comprehension and summarization - Creative writing and dialogue quality - Scientific and technical domain knowledge

Always validate benchmark performance with your own test cases, as real-world performance may differ from standardized evaluations.

1.3.4 Purpose or Use Case

Scientific and research applications often prioritize reproducibility, transparency, and the ability to modify model behavior. Open-source models with detailed documentation are typically preferred (e.g. SmolLM, LLama, Olmo)

Applications (mobile or web apps) may require: - Reliable API uptime and support - Clear licensing for commercial use - Scalability to handle many concurrent users - Content filtering and safety features

Personal or educational use might emphasize: - Cost-effectiveness - Ease of setup and use

1.3.5 Integration and Deployment Considerations

Software integration requirements affect model choice: - API-based models offer simpler integration but require internet connectivity - Local models provide more control but require more complex deployment - Consider latency requirements, offline capabilities, and data privacy needs

Hosting and serving capabilities determine whether you can run models locally: - Do you have the infrastructure to serve models at scale? - Are you self-hosting the model?

1.3.6 Domain-Specific Models

Many models have been fine-tuned for specific domains or tasks. For example:

Medical and healthcare applications (e.g., BioGPT)
Legal document processing (e.g., SaulLM)

Remember that the LLM landscape evolves rapidly. New models are released frequently, and performance benchmarks should be regularly reassessed. Consider building your system with model-agnostic interfaces to facilitate future transitions between different LLMs as your needs evolve or better options become available.

1.4 Transformers and LLMs

LLMs are also trained using the transformer neural network architecture, making use of the self-attention mechanism discussion in Lesson 02. This means that an LLM is also a transformer-based language model. However, they are distinct from general transformer-based language models in three main characteristics:

Scale: there are two dimensions in which current LLMs exceed general transformer language models in terms of scale. The most important one is the number of training parameters (weights) that are used for training models. In current models there are hundreds of billions of parameters up to trillions. The second factor is the amount of training data (raw text sequences) used for training. Current LLMs use snapshots of the internet (upwards of hundreds of terabytes in size) as a base for training and possibly augment this with additional manually curated data. The sheer scale characteristic of LLMs mean that such models require extremely resource-intensive computation to train. State-of-the-art LLMs require multiple dedicated Graphical Processing Units (GPUs) with tens or hundreds of gigabytes of memory to load and train in reasonable time. GPUs offer high parallelisability in their architecture for data processing which makes them more efficient for training these models.
Post-training: After training a base language model on textual data, there is an additional step of fine-tuning for enabling conversation in a prompt style of interaction with users, which current LLMs are known for. After the pre-training and neural network training stages we end up with what is called a base model. The base model is a language model which is essentially a token sequence generator. This model by itself is not suitable for the interaction style we see with current LLMs, which can do things like answer questions, interpret instructions from the user, and incorporate feedback to improve responses in conversations.
Generalization: LLMs can be applied across a wide range of NLP tasks such as summarization, translation, question answering, etc., without necessarily the need for fine-tuning or training separate models for different NLP tasks.

What about the relation between BERT, which we learned about in Lesson 02, and LLMs? Apart from the differences described above, BERT only makes use of the encoder layer of the transformers architecture because the goal is on creating token representations preserving contextual meaning. There is no generative component to do something with those representations.

2. How are LLMs trained?

Training LLMs involves a series of steps. There are two main phases: pretraining and post training. Pretraining generally involves the following substeps:

2.1 Obtaining and pre-processing textual data for training

Downloading and pre-processing text: State-of-the-art LLMs include entire snapshots of the internet as the core textual data for training. This data can be sourced from efforts such as CommonCrawl. Proprietary LLMs may augment or supplement this training data with additional licensed or proprietary textual data (e.g., books) from other sources or companies. The raw web pages are not usable by themselves, we need to extract the raw text from those HTML pages. This requires a preprocessing or data cleaning step.

Tokenization: As we saw in Lesson 01, the raw text itself cannot be used in the training step, we need a way to tokenize and encode the text for processing by the neural network. As an example of what these encodings look like for OpenAI models like GPT, you can visit TikTokenizer.

2.2 Neural network training

With LLMs the training goal is to predict the next token in a one-dimensional sequence of tokens. This is different from BERT where the goal is to predict masked tokens in the input sequence. BERT is therefore not natively developed for generating text, whereas LLMs are. In the internals of the transformer architecture, this is illustrated by the fact that BERT only makes use of the Encoder component to create its contextualised word embeddings. It does not use the Decoder component to generate new tokens for the input sequence.

After training we obtain a base LLM which is predicts or generates token sequences that resemble its training data. However, a post training step is required in order to fine-tune the model to accept instructions, answer questions in a conversational style and to have behavior that is more suitable for interaction with humans.

2.3 Post training

What does post training for LLMs look like? Why is this step necessary? What would happen if you skip this step and just use the base model trained in Step 2.2 for inference? The answer is that the base model is just a token sequence predictor. It just predicts the most likely next token for an input sequence of tokens. It does not understand how to deal with conversations or to interpret instructions (the intentions and tone behind written communication).

Therefore, you may encounter unexpected conversations like this if interacting with a base LLM:

Not Following Instructions

Human: Summarize this paragraph in one sentence: The Nile is the longest river in Africa and flows through 11 countries before emptying into the Mediterranean Sea.
Raw LLM: The Nile is the longest river in Africa and flows through 11 countries before emptying into the Mediterranean Sea.

In this example interaction, the model was trained to predict text, not to follow instructions. So it might not give expected or correct responses although, statistically, these response tokens are indeed the next most likely tokens.

Regurgitation

Human: Donald John Trump (born June 14, 1946) is an American politician,
Raw LLM: media personality, and businessman who is the 47th president of the United States. A member of the Republican Party, he served as the 45th president from 2017 to 2021...

In this interaction, the model just “regurgitates” verbatim a response from Wikipedia (part of it’s training data), instead of creatively formulating a response in its own words. These are just a few examples of the kind of undesirable behavior one might encounter when doing inference with the base model without post training.

Therefore, to avoid issues like this, and to be useful as conversational assistants, LLMs generally have to be fine-tuned on additional unseen conversation datasets (containing on the order of hundreds of thousands of conversations). These datasets are manually created and validated by humans. For proprietary models, these humans are hired professionally to contribute to this dataset from platforms such as Upwork and ScaleAI. There are also open-source efforts to create such datasets such as OpenAssistant.

Once such a dataset is obtained, the conversations need to be represented as a one-dimensional token sequence. Conversations therefore have to be encoded into such representations before being used as input for training. The encoding method for conversations currently differ among LLMs. The figure below shows how GPT-4 tokenizes and encodes input conversations as one dimensional token sequences (TikTokenizer is a page for visualizing and experimenting with tokenization approaches used by OpenAI LLMs).

After encoding or tokenization, the conversational data is supplied as input for fine-tuning the base model (replacing the original training data). I.e., the base model’s weights or parameters are updated through this process.

Below are some examples of conversational data that one might use to fine-tune an LLM. In general, you want multi-turn conversations where each participant in the conversation gets the opportunity to respond multiple times. Because this is the likely mode of interaction that will take place between humans and LLM-based chat assistants:

**Conversation 1 (Travel Planning):**
Human: Can you suggest a weekend getaway near Amsterdam?
LLM: Sure! Giethoorn – Known as the “Venice of the North,” this village has canals, thatched-roof houses, and a peaceful atmosphere. Utrecht, a charming city with medieval streets, beautiful canals, and cozy cafés.
Human: Which one would be better for a quiet, relaxing trip?
LLM: Giethoorn would be the quieter option. Unlike Utrecht, which is a bustling city, Giethoorn is a small village with a calm, serene atmosphere, ideal for a peaceful getaway.

**Conversation 2 (Math Help):**
Human: What's the derivative of 3𝑥^2 + 5𝑥 − 7?
LLM: The derivative is 6𝑥 + 5.
Human: Great, and what about the second derivative?
LLM: The second derivative is 6.

**Conversation 3 (Creative Writing):**
Human: Give me a one-line horror story.
LLM: The reflection in the mirror smiled back before I did.
Human: Can you make it even creepier?
LLM: The reflection winked, though I was alone in the room.

References

Comprehensive overview of LLMs: https://dl.acm.org/doi/abs/10.1145/3744746

Andrej Karpathy deep dive into LLMs: https://www.youtube.com/watch?v=7xTGNNLPyMI

Multilingual benchmark for LLMs: https://mmluprox.github.io/