Content from Introduction
Last updated on 2025-09-18
Estimated time: 120 minutes
Overview
Questions
- What is Natural Language Processing?
- What are some common applications of NLP?
- What makes text different from other data?
- Why not just learn Large Language Models?
- What linguistic properties should we consider when dealing with texts?
- How does NLP relate to Deep Learning methodologies?
Objectives
- Define Natural Language Processing
- Show the most relevant NLP tasks and applications in practice
- Learn how to handle linguistic data and how Linguistics is relevant to NLP
- Learn a general workflow for solving NLP tasks
What is NLP?
Natural language processing (NLP) is an area of research and application that focuses on making human languages accessible to computers, so that they can perform useful tasks. It is therefore not a single method, but a collection of techniques that help us deal with linguistic inputs. The range of techniques goes from simple word counts, through Machine Learning (ML) methods, all the way to complex Deep Learning (DL) architectures.
The term “natural language” is used as opposed to “artificial languages”, such as programming languages, which are by design constructed to be easily formalized into machine-readable instructions. Natural languages, on the contrary, are complex, ambiguous, and heavily context-dependent, which makes them challenging for computers to process. To complicate things further, there is no single human language: more than 7,000 languages are spoken around the world today, each with its own grammar, vocabulary, and cultural context.
In this course we will mainly focus on written English (with a few other languages in some specific examples), but this is only a convenience so we can concentrate on the technical aspects of processing textual data. While most NLP concepts ideally apply to most languages, one should always be aware that certain languages require different approaches to solve seemingly similar problems.
We can already find differences at the most basic step of processing text. Take the problem of segmenting text into meaningful units; most of the time these units are words, and in NLP this task is called tokenization. A naive approach is to split text by spaces, since it seems obvious that words are always separated by spaces. Let’s see how we can segment the same sentence in English and Chinese:
PYTHON
english_sentence = "Tokenization isn't always trivial."
chinese_sentence = "标记化并不总是那么简单" # Chinese Translation
english_words = english_sentence.split(" ")
print(english_words)
print(len(english_words))
chinese_words = chinese_sentence.split(" ")
print(chinese_words)
print(len(chinese_words))
OUTPUT
['Tokenization', "isn't", 'always', 'trivial.']
4
['标记化并不总是那么简单']
1
Let’s look first at the English sentence. Words are mostly well separated; however, we do not get fully “clean” words (we still have punctuation attached and special cases such as “isn’t”), but at least we get a rough count of the words in the sentence. The same approach, however, did not work for Chinese, because Chinese does not use spaces to separate words. We need to use a Chinese pre-trained tokenizer, which uses a dictionary-based approach to properly split the words:
PYTHON
import jieba # A popular Chinese text segmentation library
chinese_sentence = "标记化并不总是那么简单"
chinese_words = jieba.lcut(chinese_sentence)
print(chinese_words)
print(len(chinese_words)) # Output: 7
OUTPUT
['标记', '化', '并', '不', '总是', '那么', '简单']
7
We can trust that the output is valid because we are using a well-tested library, even though we may not speak Chinese. Another interesting aspect is that the Chinese sentence has more tokens than the English one, even though they convey the same meaning. This shows the complexity of dealing with more than one language at a time, as in Machine Translation.
Pre-trained Models and Fine-tuning
These two terms will appear very frequently when talking about NLP. The term pre-trained is taken from Machine Learning and refers to a model that has already been optimized on relevant data to perform a task. It is possible to directly load such a model and use it out-of-the-box on our own dataset. Ideally, released pre-trained models have already been tested for generalization and quality of outputs, but it is always important to double-check the evaluation process they were subjected to before using them.
Sometimes a pre-trained model is of good quality, but it does not fit the nuances of our specific dataset: for example, the model was trained on newspaper articles but you are interested in poetry. In this case, it is common to perform fine-tuning. This means that instead of training your own model from scratch, you start with the knowledge captured by the pre-trained model and adjust it (fine-tune it) with your specific data. If this is done well, it leads to increased performance on the specific task you are trying to solve. The advantage of fine-tuning is that you do not need a large amount of data to improve the results, hence the popularity of the technique.
In more general terms, NLP deals with the challenges of correctly processing and generating text. This can be as simple as counting word frequencies to detect different writing styles, using statistical methods to classify texts into different categories, or using deep neural networks to generate human-like text by exploiting word co-occurrences in large amounts of text.
Language as Data
From a more technical perspective, NLP focuses on applying Machine Learning techniques to linguistic data. This makes all the difference, since ML methods expect a structured dataset with a well-defined set of features that engineers can work with. Your first task as an NLP practitioner is to understand which aspects of textual data are relevant for your application, and then either systematically extract meaningful features (if using ML) or choose appropriate neural architectures (if using DL) so that the unstructured data can help solve the problem at hand.
NLP in the real world
Name three to five tools/products that you use on a daily basis and that you think leverage NLP techniques. To solve this exercise you can get some help from the web.
These are some of the most popular NLP-based products that we use on a daily basis:
- Agentic Chatbots (ChatGPT, Perplexity)
- Voice-based assistants (e.g., Alexa, Siri, Cortana)
- Machine translation (e.g., Google translate, Amazon translate)
- Search engines (e.g., Google, Bing, DuckDuckGo)
- Keyboard autocompletion on smartphones
- Spam filtering
- Spell and grammar checking apps
- Customer care chatbots
- Text summarization tools (e.g., news aggregators)
- Sentiment analysis tools (e.g., social media monitoring)
NLP tasks
The exercise above shows that a great deal of NLP technology is embedded in our daily lives. Indeed, NLP is an important component in a wide range of software applications that we use every day.
There are several ways to describe the tasks that NLP solves. From the Machine Learning perspective, we have:
Supervised tasks: learning to classify texts given a labeled set of examples
Unsupervised tasks: exploiting existing patterns from large amounts of text.
From the Deep Learning perspective we can consider different neural network architectures to tackle an NLP task, such as:
Multi-layer Perceptron
Recurrent Neural Network
Convolutional Neural Network
LSTMs (Long Short-Term Memory networks)
Transformer
Below we show one possible taxonomy of NLP tasks, where we instead focus on the problem formulation aspect. The tasks are grouped together with some of their most prominent applications. This is definitely a non-exhaustive list, as in reality there are hundreds of them, but it is a good start:
- Language Modeling: Given a sequence of words, the model predicts the next word. For example, in the sentence “The capital of France is _____”, the model should predict “Paris” based on the context. This task was initially useful for building solutions that require speech and optical character recognition (even handwriting), language translation and spelling correction. Nowadays it has scaled up to the LLMs that we know.
- Text Classification: Assign one or more labels to a “document”. A document in our context can be a sentence, a paragraph, a book chapter, etc.
- Language Identification: determining the language of a given text.
- Spam Filtering: classifying emails into spam or not spam based on their content.
- Authorship Attribution: determining the author of a text based on its style and content (based on the assumption that each author has a unique writing style).
- Sentiment Analysis: classifying text into positive, negative or neutral sentiment. For example, in the sentence “I love this product!”, the model would classify it as positive sentiment.
- Token Classification: The task of assigning a label to each word individually. Because words do not occur in isolation, their meaning depends on the words to their left and right. This is also called Word-in-Context Classification or Sequence Labeling and usually involves syntactic and semantic analysis.
- Part-Of-Speech Tagging: is the task of assigning a part-of-speech label (e.g., noun, verb, adjective) to each word in a sentence.
- Chunking: splitting a running text into “chunks” of words that together represent a meaningful unit: phrases, sentences, paragraphs, etc.
- Word Sense Disambiguation: based on the context what does a word mean (think of “book” in “I read a book.” vs “I want to book a flight.”)
- Named Entity Recognition: recognizing real-world entities in text, e.g. persons, locations, book titles, and many others. For example, “Mary Shelley” is a person and “Frankenstein; or, The Modern Prometheus” is a book.
- Semantic Role Labeling: the task of finding out “Who did what to whom?” in a sentence: extracting information about events such as agents, participants and circumstances.
- Relation Extraction: the task of identifying named relationships between entities in a text, e.g. “Apple is based in California” has the relation (Apple, based_in, California).
- Co-reference Resolution: the task of determining which words refer to the same entity in a text, e.g. “Mary is a doctor. She works at the hospital.” Here “She” refers to “Mary”.
- Entity Linking: the task of disambiguating named entities in a text by linking them to their corresponding entries in a knowledge base, e.g. linking “Mary Shelley” to her biography in Wikipedia.
- Text Similarity: The task of determining how similar two pieces of text are.
- Plagiarism detection: determining whether a text B is close enough to a known text A, which increases the likelihood that B was copied from A.
- Document clustering: grouping similar texts together based on their content.
- Topic modelling: A specific instance of clustering, here we automatically identify abstract “topics” that occur in a set of documents, where each topic is represented as a cluster of words that frequently appear together.
- Information Retrieval: the task of finding relevant information or documents from a large collection of unstructured data based on a user’s query, e.g., “What’s the best restaurant near me?”.
- Text Generation: The task of generating text based on a given input. This covers a family of tasks, including:
- Machine Translation: translating text from one language to another, e.g., “Hello” in English to “Que tal” in Spanish.
- Summarization: generating a concise summary of a longer text. It can be abstractive (generating new sentences that capture the main ideas of the original text) but also extractive (selecting important sentences from the original text).
- Paraphrasing: generating a new sentence that conveys the same meaning as the original sentence, e.g., “The cat is on the mat.” to “The mat has a cat on it.”.
- Question Answering: Given a question and a context, the model generates an answer. For example, given the question “What is the capital of France?” and the Wikipedia article about France as the context, the model should answer “Paris”. This task can be approached as a text classification problem (where the answer is one of the predefined options) or as a generative task (where the model generates the answer from scratch).
- Conversational Agent (ChatBot): Building a system that interacts with a user via natural language, e.g., “What’s the weather today, Siri?”. These agents are widely used to improve user experience in customer service, personal assistance and many other domains.
What is a word?
When dealing with language we deal with sequences of words and with how they relate to each other to generate meaning. Our first step to provide structure to text is therefore to split it into words.
Token vs Word
For simplicity, in the rest of the course we will use the terms “word” and “token” interchangeably, but as we just saw they do not always have the same granularity. Originally the concept of token comprised dictionary words, numeric symbols and punctuation. Nowadays, tokenization has evolved into an optimization task of its own (how can we segment text in a way that neural networks learn optimally from it?). Tokenizers always allow us to reconstruct human-readable words from tokens even if internally they split the text differently, hence we can afford to use the two terms as synonyms. If you are curious, you can visualize how different state-of-the-art tokenizers work here
Finally, we will start working with text data! Let’s open a file, read it into a string and split it by spaces. We will print the original text and the list of “words” to see how they look:
PYTHON
with open("text1_clean.txt") as f:
    text = f.read()

print(text[:100])
print("Length:", len(text))

proto_tokens = text.split()
print("Proto-Tokens:")
print(proto_tokens[:40])
print(len(proto_tokens))
OUTPUT
Letter 1 St. Petersburgh, Dec. 11th, 17-- TO Mrs. Saville, England You will rejoice to hear that no disaster has accompanied the commencement of an en
Length: 417931
Proto-Tokens:
['Letter', '1', 'St.', 'Petersburgh,', 'Dec.', '11th,', '17--', 'TO', 'Mrs.', 'Saville,', 'England', 'You', 'will', 'rejoice', 'to', 'hear', 'that', 'no', 'disaster', 'has', 'accompanied', 'the', 'commencement', 'of', 'an', 'enterprise', 'which', 'you', 'have', 'regarded', 'with', 'such', 'evil', 'forebodings.', 'I', 'arrived', 'here', 'yesterday,', 'and', 'my']
74942
Splitting by white space is possible, but it needs several extra steps to get clean words and separate the punctuation appropriately. Instead, we will introduce the [spaCy](https://github.com/explosion/spaCy) library to segment the text into human-readable tokens. First we will download the pre-trained model; in this case we only need the small English version:
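A minimal way to do this (assuming spaCy itself is already installed) is shown below; the same model can also be downloaded from the command line with python -m spacy download en_core_web_sm:
PYTHON
import spacy

# Download the small English pipeline once; afterwards it can be loaded with spacy.load
spacy.cli.download("en_core_web_sm")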
This is a model that spaCy already trained for us on a subset of English web data. Hence, the model already “knows” how to tokenize text into English words. When the model processes a string, it not only does the splitting for us but also provides more advanced linguistic properties of the tokens (such as part-of-speech tags or named entities). Let’s now import the model and use it to parse our document:
PYTHON
import spacy
nlp = spacy.load("en_core_web_sm") # we load the small English model for efficiency
doc = nlp(text) # Doc is a python object with several methods to retrieve linguistic properties
# SpaCy-Tokens
tokens = [token.text for token in doc] # Note that spacy tokens are also python objects
print(tokens[:40])
print(len(tokens))
OUTPUT
['Letter', '1', 'St.', 'Petersburgh', ',', 'Dec.', '11th', ',', '17', '-', '-', 'TO', 'Mrs.', 'Saville', ',', 'England', 'You', 'will', 'rejoice', 'to', 'hear', 'that', 'no', 'disaster', 'has', 'accompanied', 'the', 'commencement', 'of', 'an', 'enterprise', 'which', 'you', 'have', 'regarded', 'with', 'such', 'evil', 'forebodings', '.']
85713
The differences look subtle at first, but if we carefully inspect the way spaCy splits the text, we can see the advantage of using a proper tokenizer. spaCy also gives us several token properties to work with: for example, we can remove punctuation and keep only alphabetic tokens, or access more advanced linguistic attributes:
PYTHON
only_words = [token for token in doc if token.is_alpha] # Only alphabetic tokens
print(only_words[:10])
print(len(only_words))
OUTPUT
[Letter, Petersburgh, TO, Saville, England, You, will, rejoice, to, hear]
1199
or keep only the verbs from our text:
PYTHON
only_verbs = [token for token in doc if token.pos_ == "VERB"] # Only verbs
print(only_verbs[:10])
print(len(only_verbs))
OUTPUT
[rejoice, hear, accompanied, regarded, arrived, assure, increasing, walk, feel, braces]
150
spaCy also segments the text into sentences under the hood. We can access them like this:
PYTHON
sentences = [sent.text for sent in doc.sents] # Sentences are also python objects
print(len(sentences))
for sentence in sentences[:5]:
    print(sentence)
OUTPUT
48
Letter 1 St. Petersburgh, Dec. 11th, 17-- TO Mrs. Saville, England You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings.
I arrived here yesterday, and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking.
I am already far north of London, and as I walk in the streets of Petersburgh, I feel a cold northern breeze play upon my cheeks, which braces my nerves and fills me with delight.
Do you understand this feeling?
This breeze, which has travelled from the regions towards which I am advancing, gives me a foretaste of those icy climes.
We can also see what named entities the model predicted:
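One way to inspect them is via the doc.ents attribute (a small sketch; the output below shows the total count followed by the first few entities):
PYTHON
print(len(doc.ents))  # total number of predicted entities
for ent in list(doc.ents)[:5]:
    print(ent.label_, ent.text)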
OUTPUT
1713
DATE Dec. 11th
CARDINAL 17
PERSON Saville
GPE England
DATE yesterday
These are just basic examples to show how you can structure text right away using existing NLP libraries. Of course, we used a small, simplified model, so the more complex the task, the more errors will appear. The biggest advantage of using these existing libraries is that they help you transform unstructured plain-text files into structured data that you can manipulate later for your own goals.
NLP Libraries
Related to the need to shape our problems into known tasks, there are several existing NLP libraries which provide a wide range of models that we can use out-of-the-box. We already saw simple examples using spaCy for English and jieba for Chinese. Again, as a non-exhaustive list, we mention here some of the most used NLP libraries in Python:
Linguistic Resources
There are also several curated resources that can help solve your NLP-related tasks, specifically when you need highly specialized definitions. An exhaustive list would be impossible, as there are thousands of them and they are also language- and domain-dependent. Below we mention some of the most prominent ones, just to give you an idea of the kind of resources you can find, so you don’t need to reinvent the wheel every time you start a project:
- HuggingFace Datasets: A large collection of datasets for NLP tasks, including text classification, question answering, and language modeling.
- WordNet: A large lexical database of English, where words are grouped into sets of synonyms (synsets) and linked by semantic relations.
- Europarl: A parallel corpus of the proceedings of the European Parliament, available in 21 languages, which can be used for machine translation and cross-lingual NLP tasks.
- Universal Dependencies: A collection of syntactically annotated treebanks across 100+ languages, providing a consistent annotation scheme for syntactic and morphological properties of words, which can be used for cross-lingual NLP tasks.
- PropBank: A corpus of texts annotated with information about basic semantic propositions, which can be used for English semantic tasks.
- FrameNet: A lexical resource that provides information about the semantic frames that underlie the meanings of words (mainly verbs and nouns), including their roles and relations.
- BabelNet: A multilingual lexical resource that combines WordNet and Wikipedia, providing a large number of concepts and their relations in multiple languages.
- Wikidata: A free and open knowledge base initially derived from Wikipedia, that contains structured data about entities, their properties and relations, which can be used to enrich NLP applications.
- Dolma: An open dataset of 3 trillion tokens from a diverse mix of clean web content, academic publications, code, books, and encyclopedic materials, used to train English large language models.
Why should we learn NLP Fundamentals?
In the past decade, NLP has evolved significantly, especially through deep learning, to the point that it has become embedded in our daily lives. One only needs to look at the term Large Language Models (LLMs), the latest generation of NLP models, which is now ubiquitous in news media and in the tech products we use every day.
The term LLM is now often (and wrongly) used as a synonym of Artificial Intelligence. We could therefore think that today we just need to learn how to manipulate LLMs in order to fulfill our research goals involving textual data. The truth is that language modeling has always been one of the core tasks of NLP; therefore, by learning NLP you will better understand where the main ideas behind LLMs come from.

LLM is a blanket term for an assembly of large neural networks that are trained on vast amounts of text data with the objective of optimizing for language modeling. Once trained, they are used to generate human-like text or fine-tuned to perform much more advanced tasks. Indeed, the surprising and fascinating properties that emerge from training models at this scale allow us to solve complex tasks such as answering elaborate questions, translating languages, solving difficult problems, and generating narratives that emulate reasoning, all with a single tool.
It is important, however, to pay attention to what is happening behind the scenes in order to be able to trace sources of errors and biases that get hidden in the complexity of these models. The purpose of this course is precisely to take a step back and understand that:
- There is a wide variety of tools available, beyond LLMs, that do not require so much computing power.
- Sometimes a much simpler and easier method is already available that can solve our problem at hand.
- If we learn how previous approaches to linguistic problems were designed, we can better understand the limitations of LLMs and how to use them effectively.
- LLMs excel at confidently delivering information, without any regard for correctness. This calls for a careful design of evaluation metrics that give us a better understanding of the quality of the generated content.
Let’s go back to our problem of segmenting text and see what ChatGPT has to say about tokenizing Chinese text:

We got what sounds like a straightforward, confident answer. However, first, it is not clear how the model arrived at this solution; second, we do not know whether the solution is correct or not. In this case ChatGPT made some assumptions for us, such as choosing a specific kind of tokenizer to give the answer, and since we do not speak the language, we do not know whether this is indeed the best approach to tokenize Chinese text. If we understand the concept of a token (which we will today!), then we can be better informed about the quality of the answer and whether it is useful to us, and therefore make better use of the model.
And by the way, ChatGPT was almost correct: in the specific case of the GPT-4 tokenizer, the model returns 12 tokens (not 11!) for the given Chinese sentence.

We can also debate whether the statement “Chinese is generally tokenized character by character” is an overstatement. In any case, the real question here is: are we OK with almost-correct answers? Please note that this is not a call to avoid using LLMs, but a call for careful consideration of their usage and, more importantly, an attempt to explain the mechanisms behind them via NLP concepts.
Relevant Linguistic Aspects
Natural language exhibits a set of properties that make it more challenging to process than other types of data such as tables, spreadsheets or time series. Language is hard to process because it is compositional, ambiguous, discrete and sparse.
Compositionality
The basic elements of written languages are characters; a sequence of characters forms words, and words in turn denote objects, concepts, events, actions and ideas (Goldberg, 2016). Subsequently, words form phrases and sentences which are used in communication and depend on the context in which they are used. We as humans derive the meaning of utterances by interpreting contextual information that is present at several levels at the same time: phonetics, phonology, morphology, syntax, semantics and pragmatics.
The first two levels refer to spoken language only, and the other four levels are present in both speech and text. Because in principle machines do not have access to the same levels of information that we do (they only have independent audio, textual or visual inputs), we need to come up with clever methods to overcome this significant limitation. Knowing the levels of language is important so that we can consider what kinds of problems we face when attempting to solve the NLP task at hand.
Ambiguity
The disambiguation of meaning is usually a by-product of the context in which utterances are expressed, and also of the historic accumulation of interactions which are transmitted across generations (think, for instance, of idioms: these are usually meaningless phrases that acquire meaning only when situated within their historical and societal context). These characteristics make NLP a particularly challenging field to work in.
We cannot expect a machine to process human language and simply understand it as it is. We need a systematic, scientific approach to deal with it. It’s within this premise that the field of NLP is born, primarily interested in converting the building blocks of human/natural language into something that a machine can understand.
The image below shows how the levels of language relate to a few NLP applications:

Levels of ambiguity
Discuss what the following sentences mean. What level of ambiguity does each of them represent?
- “The door is unlockable from the inside.” vs “Unfortunately, the cabinet is unlockable, so we can’t secure it”
- “I saw the cat with the stripes” vs “I saw the cat with the telescope”
- “Colorless green ideas sleep furiously”
- “I NEVER said she stole my money.” vs “I never said she stole MY money.” (the same sentence with the emphasis on different words)
This is why the previous statements were difficult:
- “Un-lockable” vs “Unlock-able” is a morphological ambiguity: the same word form has two possible internal structures and therefore two meanings
- “I saw the cat with the telescope” is a syntactic ambiguity: the same sentence allows different parses (who has the telescope, the speaker or the cat?)
- “Colorless green ideas sleep furiously” is a semantic ambiguity: grammatical but meaningless (ideas do not have color as a property; even if they did, they would be either colorless or green)
- “I NEVER said she stole MY money.” is a pragmatic ambiguity: the meaning relies on word emphasis
Whenever you are solving a specific task, you should ask yourself: what kind of ambiguity can affect my results? At what level are my assumptions operating when defining my research questions? Having the answers to this can save you a lot of time when debugging your models. Sometimes the most innocent assumptions (for example, using the wrong tokenizer) can create enormous performance drops even when the higher-level assumptions were correct.
Discreteness
There is no inherent relationship between the form of a word and its meaning. For the same reason, by textual means alone, there is no way of knowing whether two words are similar or how they relate to each other. Take the words “pizza” and “hamburger”: how can we automatically know that they share more properties than “car” and “cat”? One way is by looking at the contexts in which these words are used and how they are related to each other. This idea is the principle behind distributional semantics, which looks at the statistical properties of language, such as word co-occurrences, to understand how words relate to each other.
Let’s do a simple exercise:
PYTHON
from collections import Counter
# A mini-corpus where our target words appear
text = """
I am hungry . Should I eat delicious pizza ?
Or maybe I should eat a juicy hamburger instead .
Many people like to eat pizza because is tasty , they think pizza is delicious as hell !
My friend prefers to eat a hamburger and I agree with him .
We will drive our car to the restaurant to get the succulent hamburger .
Right now , our cat sleeps on the mat so we won't take him .
I did not wash my car , but at least the car has gasoline .
Perhaps when we come back we will take out the cat for a walk .
The cat will be happy to see us when we come back .
"""
words = [token.lower() for token in text.split()]
target_words = ["pizza", "hamburger", "car", "cat"] # words we want to analyze
stop_words = ["i", "am", "my", "to", "the", "a", "and", "is", "as", "at", "we", "will", "not", "our", "but", "least", "has", ".", ","] # words to ignore
co_occurrence = {word: [] for word in target_words}
window_size = 3 # How many words to look at on each side
# Find the context for each target word
for i, word in enumerate(words):
    if word in target_words:
        start = max(0, i - window_size)
        end = min(len(words), i + 1 + window_size)
        context = words[start:i] + words[i+1:end]  # Exclude the target word itself
        context = [w for w in context if w not in stop_words]  # Filter out stop words from context
        co_occurrence[word].extend(context)

# Print the most common context words for each target word
print("Contextual Fingerprints:\n")
for word, context_list in co_occurrence.items():
    # We use Counter to get a frequency count of context words
    fingerprint = Counter(context_list).most_common(5)
    print(f"'{word}': {fingerprint}")
OUTPUT
Contextual Fingerprints:
'pizza': [('eat', 2), ('delicious', 2), ('?', 1), ('or', 1), ('maybe', 1)]
'hamburger': [('eat', 2), ('juicy', 1), ('instead', 1), ('many', 1), ('agree', 1)]
'car': [('drive', 1), ('restaurant', 1), ('wash', 1), ('gasoline', 1)]
'cat': [('walk', 2), ('now', 1), ('sleeps', 1), ('on', 1), ('take', 1)]
Stop Words
Depending on the use case you will always perform some preprocessing steps in order to have the data as normalized and clean as possible before making any computations. These steps are not exhaustive and are task-dependent. In this case, we introduced the concept of “stop_words” (extremely common words that do not provide relevant information for our use case and, given their high frequency, they tend to obscure the results we are interested in). Of course, our list could have been much bigger, but it served the purpose for this toy example.
spaCy has a pre-defined list of stop words per language. To explicitly load the English stop words we can do:
PYTHON
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # a set of common stopwords
print(len(STOP_WORDS)) # There are 326 words considered in this list
Alternatively you can filter out stop words when iterating your tokens (remember the spacy token properties!) like this:
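PYTHON
# A small sketch reusing the doc object created earlier with nlp(text)
content_tokens = [token.text for token in doc if not token.is_stop]
print(content_tokens[:20])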
Sparsity
Another key property of linguistic data is its sparsity. This means that if we are hunting for a specific phenomenon, we will realize it barely occurs inside an enormous amount of text. Our previous example of pizzas and hamburgers worked because the experiment was run on a text hand-crafted exactly for that purpose. It would be hard to scale this up in a real setting, where we would possibly need to dive into public reviews that specifically mention those two foods in order to construct a corpus.
Sparsity is tightly linked to what is frequently called domain-specific data. The discourse context in which language is used varies considerably across disciplines (domains). Take for example legal texts and medical texts: the meaning of the concepts described in each domain differs significantly. Another example would be to compare texts written in 16th-century English vs 21st-century English, where significant shifts have occurred at all linguistic levels. For this reason there are specialized models and corpora that model language use in specific domains. The practice of fine-tuning a general-purpose model with domain-specific data is also popular.
While it is true that sparsity has been reduced in the era of LLMs and big training corpora, especially when dealing with general-purpose tasks, if you have domain-specific objectives you should take special care with the assumptions and results you get out of pre-trained models, including LLMs.
NLP in the real world
Use what you have learned so far to search the Frankenstein book for how many times the word “love” appears, and how many times “hate” appears. Compute what percentage of the content words in the text these two terms together represent. To do this experiment you should:
- Read the file and save it into a text variable
- Use spacy to load the text into a Doc object.
- Iterate the document and keep all tokens that are alphanumeric (use the token.is_alpha property), and are not stopwords (use the property token.is_stop).
- Lowercase all the tokens to merge the instances of “Love” and “love” into a single one.
- Iterate the tokens and count how many of them are exactly “love”
- Iterate the tokens and count how many of them are exactly “hate”
- Compute the percentage of “love” + “hate” relative to all content words, for example with: (len(hate_words) + len(love_words)) / len(content_words) * 100
Following our preprocessing procedure, there are 30,500 content words. The word love appears 59 times and the word hate appears only 9 times. Together they account for 0.22% of the content words in the text. Even though intuitively these words should be quite common, in reality they occur only a handful of times. So if we are interested in studying the occurrences of love/hate in the novel, we can only rely on those occurrences. Code:
PYTHON
with open("84_frankenstein_clean.txt") as f:
    text = f.read()

doc = nlp(text)  # Process the text with spaCy
words = [token.lower_ for token in doc if token.is_alpha and not token.is_stop]
print("Total Words:", len(words))

love_words = [word for word in words if "love" == word]
hate_words = [word for word in words if "hate" == word]
print("Love and Hate percentage:", (len(love_words) + len(hate_words)) / len(words) * 100, "% of content words")
NLP = Machine Learning + Linguistics
So far we have seen how important it is to consider the linguistic properties of our data. Recall that NLP is also built on top of ideas from Machine Learning and Deep Learning. A general workflow for solving an NLP task therefore looks quite close to the general Machine Learning workflow:
- Formulate the problem
- Gather relevant data
- Data pre-processing
- Structure your Corpus (Inputs/Outputs)
- Split your Data (Train/Validation/Test)
- Choose Approach/Model/Architecture
- If necessary Train or Fine-tune a new Model
- Evaluate Results on Test Set
- Refine your approach
- Share model/results
With this in mind we can start now exploring more NLP techniques!
- NLP is embedded in numerous daily-use products
- Key tasks include language modeling, text classification, information extraction, information retrieval, conversational agents, and topic modeling, each supporting various real-world applications.
- NLP is a subfield of Artificial Intelligence (AI) that deals with approaches to process, understand and generate natural language
- Deep learning has significantly advanced NLP, but the challenge remains in processing the discrete and ambiguous nature of language
- The ultimate goal of NLP is to enable machines to understand and process language as humans do, but challenges in measuring and interpreting linguistic information still exist.
Content from Episode 1: From text to vectors
Last updated on 2025-09-18
Estimated time: 120 minutes
Overview
Questions
- How do I load text and do basic linguistic analysis?
- Why do we need to prepare a text for training?
- How do I use words as features in a machine learning model?
- What is a word2vec model?
- What properties do word embeddings have?
- What insights can I get from word embeddings?
- How do we train a word2vec model?
Objectives
After following this lesson, learners will be able to:
- Implement a full preprocessing pipeline on a text
- Use Word2Vec to train a model
- Inspect word embeddings
Introduction
In the previous episode we emphasized how text is different from structured datasets. Given the linguistic properties embedded in unstructured text, we also learned how to use existing libraries such as SpaCy for segmenting text and accessing basic linguistic properties.
We learned about the different levels of language and that it is ambiguous, compositional and discrete. Because of this, it is hard to know how words relate to each other, and obtaining meaning from text alone is possible only through proxies that can be quantified. We made our first attempt at approaching word meaning by using co-occurrences of words in a fixed text window around specific target words of interest.
In this episode, we will expand on this idea and continue working with words as individual features of text. We will introduce the concept of the Term-Document Matrix, one of the most basic techniques for using words as features that represent the texts in which they appear, which can be fed directly into Machine Learning classifiers.
We will then visit the distributional hypothesis, which the linguist J.R. Firth summarized in the 1950s with the phrase: “You shall know a word by the company it keeps”. Based on this hypothesis, Mikolov et al. trained neural networks on large amounts of text to predict a word based on its surrounding context, or vice versa, in the famous Word2Vec model. We will learn how to use these models and understand how they map discrete words into numerical vectors that capture the semantic similarity of words in a continuous space. By representing words with vectors, we can manipulate them mathematically through vector arithmetic and exploit the similarity patterns that emerge from a collection of texts. Finally, we will show how to train your own Word2Vec models.
Preprocessing Text
NLP models work by learning the statistical regularities within the constituent parts of the language (i.e., letters, digits, words and sentences) in a text. However, text also contains other types of information that humans find useful for conveying meaning. To signal pauses, give emphasis and convey tone, for instance, we use punctuation. Articles, conjunctions and prepositions also alter the meaning of a sentence. The machine does not know the difference among all of these linguistic units, as it treats them all as equal.
We have already done some basic data pre-processing in the introduction. Here we will formalize this initial step and present some of the most common pre-processing steps when dealing with textual data. This is analogous to the data cleaning and sanitation step in any Machine Learning task. In the case of linguistic data, we are interested in getting rid of unwanted components (such as rare punctuation or formatting characters) that can confuse a tokenizer and, depending on the task at hand, we might also be interested in normalizing our tokens to avoid possible noise in our final results. As we already know, an NLP library such as spaCy comes in handy for preprocessing text; here is the list of the recommended (always optional!) steps:
- Tokenization: splitting strings into meaningful/useful units. This step also includes a method for “mapping back” the segments to their character position in the original string.
- Lowercasing: removing uppercase letters to, e.g., avoid treating “Dog” and “dog” as two different words.
- Punctuation and Special Character Removal: if we are interested in content only, we can filter out anything that is not alphanumerical. We can also explicitly exclude symbols that are just noise in our dataset. Note that getting rid of punctuation can significantly change meaning! A special mention goes to new lines, which are also characters in the text; sometimes we can use them to our benefit (for example to separate paragraphs) but many times they are just noise.
- Stop Word Removal: as we’ve seen, the most frequent words in texts are those which contribute little semantic value on their own: articles (‘the’, ‘a’, ‘an’), conjunctions (‘and’, ‘or’, ‘but’), prepositions (‘on’, ‘by’), auxiliary verbs (‘is’, ‘am’), pronouns (‘he’, ‘which’), or any highly frequent word that might not be of interest in content-focused tasks. A special case is the word ‘not’, which carries the significant semantic value of negation.
- Lemmatization: although it has become less frequent, normalizing words into their dictionary form can help to focus on relevant aspects of text. Think how “eating”, “ate”, “eaten” are all a variation of the verb “eat”.
PYTHON
# A minimal sketch: load the raw ("dirty") Frankenstein text and reduce it to clean tokens
# for training Word2Vec. The file name below is an assumption; nlp is the spaCy pipeline
# loaded earlier.
with open("84_frankenstein_dirty.txt") as f:
    raw_text = f.read()

doc = nlp(raw_text)
clean_tokens = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
print(clean_tokens[:20])
Preprocessing choices significantly affect the quality of training when working with word embeddings. For example, [Rahimi & Homayounpour (2022)](https://link.springer.com/article/10.1007/s10579-022-09620-5) demonstrated that for text classification and sentiment analysis, the removal of punctuation and stopwords leads to higher performance.
You do not always need to do all the preprocessing steps, and which ones you should do depends on what you want to do. For example, if you want to segment text into sentences then characters such as ‘.’, ‘,’ or ‘?’ are the most important; if you want to extract Named Entities from text, you explicitly do not want to lowercase the text, as capitals are a component in the identification process, and if you are interested in gender bias you definitely want to keep the pronouns, etc…
Preprocessing can differ considerably across languages, both in terms of which steps to apply and in terms of which methods to use for a specific step.
We will prepare the data for the two experiments in this episode:
1. Build a Term-Document Matrix
2. Train a Word2Vec model
For both tasks we need to prepare our texts by applying the same preprocessing steps. We are focusing on content words for now, so even though our preprocessing will unfortunately lose a lot of the original information, in exchange we will be able to manipulate words as individual numeric representations. The preprocessing therefore includes: cleaning the text, tokenizing, lowercasing words, removing punctuation, lemmatizing words and removing stop words. Let’s apply this step by step.
1. Cleaning the text
We start by importing the spaCy library that will help us go through the preprocessing steps. spaCy is a popular open-source library for NLP in Python and it works with pre-trained language models that we can load and use to process and analyse text efficiently. We can then load the spaCy model to create the processing pipeline.
Next, we’ll eliminate the triple dashes that separate different news articles, as well as the vertical bars used to divide some columns.
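A minimal sketch of this cleaning step, assuming the raw newspaper text has already been read into a string called corpus (the exact separator characters are an assumption; adapt them to your own file):
PYTHON
# Replace article separators ("---") and column dividers ("|") with spaces
corpus_clean = corpus.replace("---", " ").replace("|", " ")
print(corpus_clean[:100])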
2. Tokenizing
Tokenization is essential in NLP, as it helps to create structure from raw text. It involves the segmentation of the text into smaller units referred to as tokens. Tokens can be sentences (e.g. 'the happy cat'), words ('the', 'happy', 'cat'), subwords ('un', 'happiness') or characters ('c', 'a', 't'). The choice of tokens depends on the requirements of the model used for training, and on the text. This step is carried out by a pre-trained model (called a tokeniser) that has been fine-tuned for the target language. In our case, this is the en_core_web_sm model loaded before.
A good word tokeniser for example, does not simply break up a text based on spaces and punctuation, but it should be able to distinguish:
- abbreviations that include points (e.g.: e.g.)
- times (11:15) and dates written in various formats (01/01/2024 or 01-01-2024)
- word contractions such as don’t, these should be split into do and n’t
- URLs
Many older tokenisers are rule-based, meaning that they iterate over a number of predefined rules to split the text into tokens, which is useful for splitting text into word tokens, for example. Modern large language models use subword tokenisation, which is more flexible.
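To see the difference, here is a small sketch of subword tokenisation using the Hugging Face transformers library (this assumes the library is installed; bert-base-uncased is just one example model, and the exact splits may vary by tokeniser):
PYTHON
from transformers import AutoTokenizer

# A subword tokeniser splits rare or unknown words into smaller known pieces
subword_tokeniser = AutoTokenizer.from_pretrained("bert-base-uncased")
print(subword_tokeniser.tokenize("Tokenization isn't always trivial."))
# e.g. ['token', '##ization', 'isn', "'", 't', 'always', 'trivial', '.']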
PYTHON
spacy_corpus = nlp(corpus_clean)
# Get the tokens from the pipeline
tokens = [token.text for token in spacy_corpus]
tokens[:10]
OUTPUT
['mens', 'op', 'maan', '\n ', '„', 'de', 'eagle', 'is', 'geland', '”']
As one can see, the tokeniser has split the text into word tokens; however, it has also kept blank spaces (such as \n) and punctuation.
3. Lowercasing
Our next step is to lowercase the text. Our goal here is to generate a list of unique words from the text, so in order to not have words twice in the list - once normal and once capitalised when it is at the start of a sentence for example - we can lowercase the full text.
PYTHON
corpus_lower = corpus_clean.lower()
print(corpus_lower)
OUTPUT
mens op maan „ de eagle is geland ” reisduur : 102 uur , uitstappen binnen 20 iuli , 21.17 uur 45 […]
4. Remove punctuation
The next step we will apply is to remove punctuation. We are interested in training our model to learn the meaning of the words. This task is highly influenced by the state of our text and punctuation would decrease the quality of the learning as it would add spurious information. We’ll see how the learning process works later in the episode.
The punctuation symbols are defined in Python's string.punctuation constant. We can check tokens against this set to remove them from the text:
PYTHON
import string

# remove punctuation tokens using the set of symbols in string.punctuation
tokens_no_punct = [token for token in tokens if token not in string.punctuation]
# remove also blank spaces
tokens_no_punct = [token for token in tokens_no_punct if token.strip() != '']
print(tokens_no_punct[:10])
OUTPUT
['mens', 'op', 'maan', 'de', 'eagle', 'is', 'geland', 'reisduur', '102', 'uur']
Visualise the tokens
This was the end of our preprocessing step. Let’s look at what tokens we have extracted and how frequently they occur in the text.
PYTHON
import matplotlib.pyplot as plt
from collections import Counter
# count the frequency of occurrence of each token
token_counts = Counter(tokens_no_punct)
# get the top n most common tokens (otherwise the plot would be too crowded) and their relative frequencies
most_common = token_counts.most_common(100)
tokens = [item[0] for item in most_common]
frequencies = [item[1] for item in most_common]
plt.figure(figsize=(12, 6))
plt.bar(tokens, frequencies)
plt.xlabel('Tokens')
plt.ylabel('Frequency')
plt.title('Token Frequencies')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
As one can see, words in the text have a very specific skewed distribution, such that there are few very high-frequency words that account for most of the tokens in text (e.g., articles, conjunctions) and many low frequency words.
5. Stop word removal
For some NLP tasks only the important words in the text are needed. A text, however, often contains many stop words: common words such as de, het, een that add little meaningful content compared to nouns and verbs. In those cases, it is best to remove stop words from your corpus to reduce the number of words to process.
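A small sketch of this step, assuming we reuse spaCy's built-in Dutch stop word list (the resulting variable tokens_no_stopwords is the one used for training further below):
PYTHON
from spacy.lang.nl.stop_words import STOP_WORDS as DUTCH_STOP_WORDS

# Keep only tokens that are not in the Dutch stop word list
tokens_no_stopwords = [token for token in tokens_no_punct if token not in DUTCH_STOP_WORDS]
print(tokens_no_stopwords[:10])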
Term-Document Matrix
A Term-Document Matrix (TDM) is a matrix where:
- Each row is a unique word (term) in the corpus
- Each column is a document in the corpus
- Each cell \((i,j)\) has a value of 1 if term \(i\) appears in document \(j\), and 0 otherwise
This is also sometimes known as a bag-of-words representation, as it ignores grammar and word order in exchange for emphasizing content: each document is characterized by the words that appear in it. Similar documents will contain similar bags of words, and documents that talk about different topics will be associated with columns of numbers that differ from each other. Let’s look at a quick example:
- Doc 1: “Natural language processing is exciting”
- Doc 2: “Processing natural language helps computers understand”
- Doc 3: “Language processing with computers is NLP”
- Doc 4: “Today it rained a lot”
Term | Doc1 | Doc2 | Doc3 | Doc4 |
---|---|---|---|---|
natural | 1 | 1 | 0 | 0 |
language | 1 | 1 | 1 | 0 |
processing | 1 | 1 | 1 | 0 |
is | 1 | 0 | 1 | 0 |
exciting | 1 | 0 | 0 | 0 |
helps | 0 | 1 | 0 | 0 |
computers | 0 | 1 | 1 | 0 |
understand | 0 | 1 | 0 | 0 |
with | 0 | 0 | 1 | 0 |
NLP | 0 | 0 | 1 | 0 |
today | 0 | 0 | 0 | 1 |
it | 0 | 0 | 0 | 1 |
rained | 0 | 0 | 0 | 1 |
a | 0 | 0 | 0 | 1 |
lot | 0 | 0 | 0 | 1 |
We can represent each document by taking its column and treating it as a vector of 0’s and 1’s. The vector is of fixed size (in this case the vocabulary size is 15), and is therefore suitable for traditional ML classifiers. A TDM has a scalability problem, as the matrix size grows with the number of documents times the vocabulary found in those documents. This means that if we have 100 documents in which 5,000 unique words appear, we would have to store a matrix of 500,000 numbers! We also face the problem of sparsity: document “vectors” will contain mostly 0’s. A TDM is a good solution for characterizing documents based on their vocabulary; however, the converse is even more desirable: to characterize words based on the contexts in which they appear, so we can study words independently of their documents of origin and, more importantly, how they relate to each other. To solve these and other limitations we enter the world of word embeddings!
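As a quick illustration, a binary term-document matrix like the one above can be built with scikit-learn's CountVectorizer (a sketch using the four example documents; note that the default tokenizer lowercases the text and drops one-letter words such as "a"):
PYTHON
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Natural language processing is exciting",
    "Processing natural language helps computers understand",
    "Language processing with computers is NLP",
    "Today it rained a lot",
]

# binary=True records presence (1/0) instead of raw counts
vectorizer = CountVectorizer(binary=True)
doc_term = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())
print(doc_term.T.toarray())  # transpose to get the term-document view used above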
What are word embeddings?
A word embedding is a type of word representation that maps words into vectors of numbers in a multidimensional space, capturing their meaning based on characteristics or context. Since similar words occur in similar contexts, or share the same characteristics, the system automatically learns to assign similar vectors to similar words.
Let’s illustrate this concept using animals. This example will show us an intuitive way of representing things into vectors.
Suppose we want to represent a cat using measurable characteristics:
- Furriness: Let’s assign a score of 70 to a cat
- Number of legs: A cat has 4 legs
So the vector representation of a cat becomes:
[70 (furriness), 4 (legs)]

This vector doesn’t fully describe a cat but provides a basis for comparison with other animals.
Let’s add vectors for a dog and a caterpillar:
- Dog: [56, 4]
- Caterpillar: [70, 100]

To determine which animal is more similar to a cat, we use cosine similarity, which measures the cosine of the angle between two vectors. It is computed as the dot product of the two vectors divided by the product of their lengths (norms): \(\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}\). Cosine similarity ranges between -1 and 1, and it is a useful metric to measure how similar two vectors are likely to be.

PYTHON
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 2D arrays (one row per animal): [furriness, number of legs]
cat = np.asarray([[70, 4]])
dog = np.asarray([[56, 4]])
caterpillar = np.asarray([[70, 100]])

similarity_cat_dog = cosine_similarity(cat, dog)[0][0]
similarity_cat_caterpillar = cosine_similarity(cat, caterpillar)[0][0]
print(f"Cosine similarity between cat and dog: {similarity_cat_dog}")
print(f"Cosine similarity between cat and caterpillar: {similarity_cat_caterpillar}")
OUTPUT
Cosine similarity between cat and dog: 0.9998987965747193
Cosine similarity between cat and caterpillar: 0.6192653797321375
The higher similarity score between the cat and the dog indicates they are more similar based on these characteristics. Adding more characteristics can enrich our vectors, detecting more semantic nuances.

Challenge
- Add one or two other dimensions. What characteristics could they map?
- Add another animal and map their dimensions
- Compute again the cosine similarity among those animals and find the couple that is the least similar and the most similar
- Add one or two other dimensions
We could add the dimension of “velocity” or “speed” that goes from 0 to 100 meters/second.
- Caterpillar: 0.001 m/s
- Cat: 1.5 m/s
- Dog: 2.5 m/s
(just as an example, actual speeds may vary)
PYTHON
cat = np.asarray([[70, 4, 1.5]])
dog = np.asarray([[56, 4, 2.5]])
caterpillar = np.asarray([[70, 100, .001]])
Another dimension could be weight in Kg:
- Caterpillar: .05 Kg
- Cat: 4 Kg
- Dog: 15 Kg
(just as an example, actual weight may vary)
PYTHON
cat = np.asarray([[70, 4, 1.5, 4]])
dog = np.asarray([[56, 4, 2.5, 15]])
caterpillar = np.asarray([[70, 100, .001, .05]])
Then we can compute the cosine similarities again in the same way as before.
- Add another animal and map their dimensions
Another animal that we could add is the Tarantula!
PYTHON
cat = np.asarray([[70, 4, 1.5, 4]])
dog = np.asarray([[56, 4, 2.5, 15]])
caterpillar = np.asarray([[70, 100, .001, .05]])
tarantula = np.asarray([[80, 6, .1, .3]])
- Compute again the cosine similarity among those animals - find out the most and least similar couple
Given the values above, the least similar couple is the dog and the caterpillar, whose cosine similarity is array([[0.60855407]]). The most similar couple is the cat and the tarantula: array([[0.99822302]])
By representing words as vectors with multiple dimensions, we capture more nuances of their meanings or characteristics.
- We can represent text as vectors of numbers (which makes it interpretable for machines)
- The most efficient and useful way is to use word embeddings
- We can easily compute how words are similar to each other with the cosine similarity
When semantic change occurs, the words in a word's context also change. We can trace how a word evolves semantically over time by comparing that word with its most similar words in models trained on texts from different periods. The idea is that the set of most similar words is not fixed across years: it shifts when the word acquires a new meaning.
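For instance, one could compare a word's nearest neighbours in two embedding models trained on texts from different periods. The sketch below assumes two such models exist; the file names are hypothetical placeholders:
PYTHON
from gensim.models import KeyedVectors

# Hypothetical models trained on newspapers from two different decades
model_1950s = KeyedVectors.load_word2vec_format("w2v/news_1950s.w2v", binary=True)
model_1990s = KeyedVectors.load_word2vec_format("w2v/news_1990s.w2v", binary=True)

word = "maan"
print(model_1950s.most_similar(word, topn=5))
print(model_1990s.most_similar(word, topn=5))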
Train the Word2Vec model
Load the embeddings and inspect them
We proceed to load our models. We will load the pre-trained model files from the original Word2Vec paper, which was trained on a big corpus from Google News. The library gensim contains a class called KeyedVectors which allows us to load them.
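A minimal sketch of loading such a file with gensim (the file name GoogleNews-vectors-negative300.bin is the commonly distributed one; adjust the path to wherever you saved it, and note that the file is several gigabytes):
PYTHON
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (binary word2vec format)
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(word_vectors["king"].shape)           # a 300-dimensional vector
print(word_vectors.most_similar("king", topn=5))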
Prepare the data to be ingested by the model (preprocessing)
Also, the decision to remove or retain these parts of the text (punctuation, stop words, etc.) is quite crucial for training our model, as it affects the quality of the generated word vectors.
Dataset size in training
To obtain high-quality embeddings, the size/length of your training dataset plays a crucial role. Generally tens of thousands of documents are considered a reasonable amount of data for decent results.
Is there, however, a strict minimum? Not really. Keep in mind that vocabulary size, document length and the desired vector size interact with each other. The higher the dimensionality of the vectors (e.g. 200-300 dimensions), the more data is required, and it must be of high quality, i.e. data that allows the learning of words in a variety of contexts.
While word2vec models typically perform better with large datasets containing millions of words, using a single page is sufficient for demonstration and learning purposes. This smaller dataset allows us to train the model quickly and understand how word2vec works without the need for extensive computational resources.
For the purpose of this episode and to make training easy on our laptop, we’ll train our word2vec model using just one book. Subsequently, we’ll load pre-trained models for tackling our task.
Now we will train a two-layer neural network to transform our tokens into word embeddings. We will be using the library gensim, and the model we will be using is called Word2Vec, developed by Tomas Mikolov et al. in 2013.
Import the necessary libraries:
PYTHON
import gensim
from gensim.models import Word2Vec
# import logging to monitor training
import logging
# set up logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
There are two main architectures for training Word2Vec:
- Continuous Bag-of-Words (CBOW): Predicts a target word based on its surrounding context words.
- Continuous Skip-Gram: Predicts surrounding context words given a target word.

CBOW is faster to train, while Skip-Gram is more effective for infrequent words. Increasing context size improves embeddings but increases training time.
We will be using CBOW. We are interested in having vectors with 300 dimensions and a context size of 5 surrounding words. We include all words present in the corpus, regardless of their frequency of occurrence, and use 4 CPU cores for training. All these specifics translate into only one line of code.
Let’s train our model then:
PYTHON
model = Word2Vec([tokens_no_stopwords], vector_size=300, window=5, min_count=1, workers=4, sg=0)
We can already inspect the output of this training by checking the top 5 most similar words to “maan” (moon):
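One way to query this with gensim (a sketch using the model trained above):
PYTHON
print(model.wv.most_similar("maan", topn=5))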
OUTPUT
[('plek', 0.48467501997947693), ('ouders', 0.46935707330703735), ('supe|', 0.3929591178894043), ('rotterdam', 0.37788015604019165), ('verkeerden', 0.33672046661376953)]
We have trained our model on only one page of the newspaper, so training was very quick. However, to approach our problem it is best to train the model on the entire dataset. We don't have the resources to do that on our local laptop but, luckily for us, Wevers, M. (2019) already did it and released the models publicly. Let's download them and save them in a folder called w2v.
Content from Episode 2: BERT and Transformers
Last updated on 2025-09-18 | Edit this page
Estimated time: 120 minutes
Overview
Questions
- What are some drawbacks of static word embeddings?
- What are Transformers?
- What is BERT and how does it work?
- How can I use BERT to solve NLP tasks?
- How should I evaluate my classifiers?
- Which other Transformer variants are available?
Objectives
- Understand how a Transformer works and recognize their different use cases.
- Understand how to use pre-trained transformers (Use Case: BERT)
- Use BERT to classify texts.
- Use BERT as a Named Entity Recognizer.
- Understand assumptions and basic evaluation for NLP outputs.
Static word embeddings such as Word2Vec can be used to represent each word as a unique vector. Vector representations also allow us to apply numerical operations that map to syntactic and semantic properties of words, such as finding analogies or synonyms. Once we transform words into vectors, these can also be used as features for classifiers that can be trained to predict any supervised NLP task.
However, a big drawback of Word2Vec is that each word is represented in isolation, and unfortunately that is not how language works. Words get their meaning from the specific context in which they are used (take for example polysemy, where the same word can have very different meanings depending on the context); therefore, we would like to have richer vector representations that also take context into account in order to obtain more powerful representations.
Polysemy in Language
Think of (at least 2) different words that can have more than one meaning depending on the context. Come up with one simple sentence per meaning and explain what they mean in each context. Discuss: How do you know what of the possible meanings does the word have when you use it?
OPTIONAL: Why do you think Word2Vec can't capture different meanings of words?
Two possible examples can be the words ‘fine’ and ‘run’
Sentences for 'fine':
- She has a fine watch (fine == high-quality)
- He had to pay a fine (fine == penalty)
- I am feeling fine (fine == not bad)
Sentences for 'run':
- I had to run to catch the bus (run == moving fast)
- Stop talking, before you run out of ideas (run (out) == exhaust)
Note how in the "run out" example we even have to understand that the meaning of run is not literal, but is accompanied by a preposition that changes its meaning.
In 2019, the BERT language model was introduced. Using a novel architecture called Transformer (2017), BERT can integrate context into word representations. To understand BERT, we will first look at what a transformer is and we will then directly use some code to make use of BERT.
Transformers
The Transformer is a neural network architecture proposed by Google researchers in 2017 in a paper called Attention is all you Need. They specifically tackled the NLP task of Machine Translation (MT), which is stated as: how do we generate a sentence (sequence of words) in target language B given a sentence in source language A? We all know that translation cannot be done word by word in isolation; therefore integrating the context from both the source language and the target language is necessary. In order to translate, one neural network first needs to encode the whole meaning of the sentence in language A into a single vector representation, then a second neural network needs to decode that representation into tokens that are both coherent with the meaning of language A and understandable in language B. Therefore we say that translation is modeling language B conditioned on what language A originally said.

As seen in the picture, the original Transformer is an Encoder-Decoder network that tackles translation. We first need a token embedder, which converts the string of words into a sequence of vectors that the Transformer network can process. The first component, the Encoder, is optimized for creating rich representations of the source sequence (in this case an English sentence), while the second one, the Decoder, is a generative network conditioned on the encoded representation. The third component we see is the famous attention mechanism, a third neural network that computes the correlation between source and target tokens (which word in English should I pay attention to in order to decide on a better next Dutch word?) to generate the most likely token in the target sequence (in this case Dutch words).
Emulate the Attention Mechanism
Pair with a person who speaks a language different from English (we will call it language B). This time you should think of 2 simple sentences in English and come up with their translations in the second language. On a piece of paper write down both sentences (one on top of the other) and try to:
1. Draw a one-to-one mapping of words in English to language B. Is it always possible to do this?
2. For each word in language B, draw as many lines as necessary to the relevant English words that can "help you" predict the word in language B.
If you managed, congratulations, this is how attention works!
Here is an image of a bilingual "manual attention" example:
Next, we will see how BERT exploits the idea of a Transformer Encoder to perform the NLP Task we are interested in: generating powerful word representations.
BERT
BERT is an acronym that stands for Bidirectional Encoder Representations from Transformers. The name describes it all: the idea is to use the power of the Encoder component of the Transformer architecture to create powerful token representations that preserve the contextual meaning of the whole input segment, instead of each word in isolation. The BERT vector representations of each token take into account both the left context (what comes before the word) and the right context (what comes after the word). Another advantage of the Transformer Encoder is that it is parallelizable, which made it possible for the first time to train these networks on millions of datapoints, dramatically improving model generalization.
Pretraining BERT
To obtain the BERT vector representations, the Encoder is pre-trained with two different tasks:
- Masked Language Model: for each sentence, mask one token at a time and predict which token is missing based on the context from both sides. A training input example would be "Maria [MASK] Groningen" and the model should predict the word "loves".
- Next Sentence Prediction: the Encoder gets a linear binary classifier on top, which is trained to decide, for each pair of sequences A and B, whether sequence A precedes sequence B in a text. For the sentence pair "Maria loves Groningen." and "This is a city in the Netherlands." the output of the classifier is "True", and for the pair "Maria loves Groningen." and "It was a tasty cake." the output should be "False", as there is no obvious continuation between the two sentences.
Already the second pre-training task gives us an idea of the power of BERT: after it has been pretrained on hundreds of thousands of texts, one can plug in a classifier on top and re-use the linguistic knowledge previously acquired, fine-tuning it for a specific task without needing to learn the weights of the whole network from scratch all over again. In the next sections we will describe the components of BERT and show how to use it. This model and hundreds of related transformer-based pre-trained encoders can also be found on Hugging Face.
BERT Architecture
The BERT Architecture can be seen as a basic NLP pipeline on its own:
- Tokenizer: splits text into tokens that the model recognizes
- Embedder: converts each token into a fixed-sized vector that represents it. These vectors are the actual input for the Encoder.
- Encoder: several neural layers that model the token-level interactions of the input sequence to enhance the meaning representation. The output of the encoder is a set of hidden states, the vector representations of the ingested sequence.
- Output Layer: the final encoder layer (which we depict as a sequence H’s in the figure) contains arguably the best token-level representations that encode syntactic and semantic properties of each token, but this time each vector is already contextualized with the specific sequence.
- OPTIONAL Classifier Layer: an additional classifier can be connected on top of the BERT token vectors which are used as features for performing a downstream task. This can be used to classify at the text level, for example sentiment analysis of a sentence, or at the token-level, for example Named Entity Recognition.

BERT uses (self-)attention, which is very useful to capture longer-range word dependencies such as coreference, where, for example, a pronoun can be linked to the noun it refers to earlier in the same sentence. See the following example:

BERT for Word-Based Analysis
Let's see how these components can be manipulated with code. For this we will be using Hugging Face's transformers Python library. The first two main components we need to initialize are the model and the tokenizer. The Hugging Face hub contains thousands of models based on the Transformer architecture for dozens of tasks, data domains and also hundreds of languages. Here we will explore the vanilla English BERT, which is where everything started. We can initialize this model with the next lines:
PYTHON
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained("bert-base-cased")
BERT Tokenizer
We start with a string of text as written in any blog, book, newspaper, etcetera. The tokenizer object is responsible for splitting the string into tokens recognizable by the model and converting them into the token IDs that index their vector representations:
PYTHON
text = "Maria loves Groningen"
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)
The print shows the encoded_input object returned by the tokenizer, with its attributes and values. The input_ids are the most important output for now, as these are the token IDs recognized by BERT:
{
'input_ids': tensor([[ 101, 3406, 7871, 144, 3484, 15016, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}
NOTE: the printing function shows transformers objects as dictionaries; however, to access the attributes, you must use Python object syntax, as in the following example:
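PYTHON
# A sketch: dot-access the input_ids tensor and print its shape
print(encoded_input.input_ids.shape)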
Output:
torch.Size([1, 7])
The output is a 2-dimensional tensor where the first dimension contains 1 element (this dimension represents the batch size), and the second dimension contains 7 elements, corresponding to the 7 tokens that BERT generated from our string input.
In order to see what these token IDs represent, we can translate them into human-readable strings. This involves converting the tensors into numpy arrays and converting each ID into its string representation:
PYTHON
token_ids = list(encoded_input.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print("IDs:", token_ids)
print("TOKENS:", string_tokens)
IDs: [101, 3406, 7871, 144, 3484, 15016, 102]
TOKENS: ['[CLS]', 'Maria', 'loves', 'G', '##ron', '##ingen', '[SEP]']
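The next examples assume that we have encoded a longer sentence and run it through the model to obtain its hidden states. A minimal sketch of that step, using the sentence whose tokens appear in the output further below:
PYTHON
text = "Maria's passion for music is clearly heard in every note and every enchanting melody."
encoded_input = tokenizer(text, return_tensors='pt')
# Run the encoder to obtain the contextualized vectors (hidden states)
output = model(**encoded_input)
token_ids = list(encoded_input.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)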
If you want to obtain a single vector for the word enchanting, you can average the three vectors that belong to the word pieces that form it. For example:
PYTHON
import numpy as np
tok_en = output.last_hidden_state[0][15].detach().numpy()
tok_chan = output.last_hidden_state[0][16].detach().numpy()
tok_ting = output.last_hidden_state[0][17].detach().numpy()
tok_enchanting = np.mean([tok_en, tok_chan, tok_ting], axis=0)
tok_enchanting.shape
We use detach().numpy() to bring the values from the PyTorch execution environment (for example a GPU) into the main Python process and treat them as numpy vectors for convenience. Then, since we are dealing with three numpy vectors, we can average them and end up with a single enchanting vector of 768 dimensions, representing the average of 'en', '##chan', '##ting'.
Polysemy in BERT
We can encode two sentences containing the word note to see how BERT actually handles polysemy (note means something very different in each sentence) thanks to the representation of each word now being contextualized instead of isolated as was the case with word2vec.
PYTHON
# Search for the index of 'note' and obtain its vector from the sequence
note_index_1 = string_tokens.index("note")
note_vector_1 = output.last_hidden_state[0][note_index_1].detach().numpy()
note_token_id_1 = token_ids[note_index_1]
print(note_index_1, note_token_id_1, string_tokens)
print(note_vector_1[:5])
We are printing the tokenized sentence, the index of the token note in the list of tokens, and the token ID assigned to it. Finally, the last print shows the first five dimensions of the vector representing the token note.
12 3805 ['[CLS]', 'Maria', "'", 's', 'passion', 'for', 'music', 'is', 'clearly', 'heard', 'in', 'every', 'note', 'and', 'every', 'en', '##chan', '##ting', 'melody', '.', '[SEP]']
[0.15780845 0.38866335 0.41498923 0.03389652 0.40278202]
Let's now encode another sentence, also containing the word note, and confirm that the same token string, with the same assigned token ID, holds a vector with different weights:
PYTHON
# Encode and then take the 'note' token from the second sentence
note_text_2 = "I could not buy milk in the supermarket because the bank note I wanted to use was fake."
encoded_note_2 = tokenizer(note_text_2, return_tensors="pt")
token_ids = list(encoded_note_2.input_ids[0].detach().numpy())
string_tokens_2 = tokenizer.convert_ids_to_tokens(token_ids)
note_index_2 = string_tokens_2.index("note")
note_vector_2 = model(**encoded_note_2).last_hidden_state[0][note_index_2].detach().numpy()
note_token_id_2 = token_ids[note_index_2]
print(note_index_2, note_token_id_2, string_tokens_2)
print(note_vector_2[:5])
12 3805 ['[CLS]', 'I', 'could', 'not', 'buy', 'milk', 'in', 'the', 'supermarket', 'because', 'the', 'bank', 'note', 'I', 'wanted', 'to', 'use', 'was', 'fake', '.', '[SEP]']
[ 0.5003222 0.653664 0.22919582 -0.32637975 0.52929205]
To be sure, we can compute the cosine similarity of the word note in the first sentence and the word note in the second sentence, confirming that they are indeed two different representations, even though in both cases they have the same token ID and are the 12th token of their sentence:
PYTHON
from sklearn.metrics.pairwise import cosine_similarity
vector1 = np.array(note_vector_1).reshape(1, -1)
vector2 = np.array(note_vector_2).reshape(1, -1)
similarity = cosine_similarity(vector1, vector2)
print(f"Cosine Similarity 'note' vs 'note': {similarity[0][0]}")
With this small experiment, we have confirmed that the Encoder produces context-dependent word representations, as opposed to Word2Vec, where note would always have the same vector no matter where it appeared.
When running examples through a pre-trained BERT model, it is advisable to wrap your code inside a torch.no_grad(): context. This is linked to the fact that BERT is a neural network that has been trained (and can be further fine-tuned) with the backpropagation algorithm. Essentially, this wrapper tells the model that we are not in training mode and are not interested in updating the weights (as would happen when training any neural network), because the weights are already optimized. By using this wrapper we make the model more efficient, as it does not need to calculate gradients for an eventual backpropagation step; we are only interested in what comes out of the Encoder. So the previous code can be made more efficient like this:
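PYTHON
import torch

# Disable gradient tracking: we only need the forward pass of the encoder
with torch.no_grad():
    output = model(**encoded_input)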
BERT as a Language Model
As mentioned before, the main pre-training task of BERT is Language Modelling (LM): calculating the probability of a word based on the known neighboring words (yes, Word2Vec was also a kind of LM!). Obtaining training data for this task is very cheap, as all we need is millions of sentences from existing texts, without any labels. In this setting, BERT encodes a sequence of words and predicts, from a set of English tokens, which is the most likely token that could be inserted in the [MASK] position.

We can therefore start using BERT as a predictor for word completion. From now on, we will learn how to use the pipeline object, which is very useful when we only want to use a pre-trained model for predictions (no need to fine-tune or do word-specific analysis). The pipeline will internally initialize both model and tokenizer for us and also merge word pieces back into complete words.
In this case again we use bert-base-cased, which refers to the vanilla English BERT model. Once we have declared a pipeline, we can feed it sentences that contain one masked token at a time (beware that BERT can only predict one masked word at a time, since that was its training scheme). For example:
PYTHON
from transformers import pipeline
def pretty_print_outputs(sentences, model_outputs):
    for i, model_out in enumerate(model_outputs):
        print("\n=====\t", sentences[i])
        for label_scores in model_out:
            print(label_scores)
nlp = pipeline(task="fill-mask", model="bert-base-cased", tokenizer="bert-base-cased")
sentences = ["Paris is the [MASK] of France", "I want to eat a cold [MASK] this afternoon", "Maria [MASK] Groningen"]
model_outputs = nlp(sentences, top_k=5)
pretty_print_outputs(sentences, model_outputs)
===== Paris is the [MASK] of France
{'score': 0.9807965755462646, 'token': 2364, 'token_str': 'capital', 'sequence': 'Paris is the capital of France'}
{'score': 0.004513159394264221, 'token': 6299, 'token_str': 'Capital', 'sequence': 'Paris is the Capital of France'}
{'score': 0.004281804896891117, 'token': 2057, 'token_str': 'center', 'sequence': 'Paris is the center of France'}
{'score': 0.002848200500011444, 'token': 2642, 'token_str': 'centre', 'sequence': 'Paris is the centre of France'}
{'score': 0.0022805952467024326, 'token': 1331, 'token_str': 'city', 'sequence': 'Paris is the city of France'}
===== I want to eat a cold [MASK] this afternoon
{'score': 0.19168031215667725, 'token': 13473, 'token_str': 'pizza', 'sequence': 'I want to eat a cold pizza this afternoon'}
{'score': 0.14800849556922913, 'token': 25138, 'token_str': 'turkey', 'sequence': 'I want to eat a cold turkey this afternoon'}
{'score': 0.14620967209339142, 'token': 14327, 'token_str': 'sandwich', 'sequence': 'I want to eat a cold sandwich this afternoon'}
{'score': 0.09997560828924179, 'token': 5953, 'token_str': 'lunch', 'sequence': 'I want to eat a cold lunch this afternoon'}
{'score': 0.06001955270767212, 'token': 4014, 'token_str': 'dinner', 'sequence': 'I want to eat a cold dinner this afternoon'}
===== Maria [MASK] Groningen
{'score': 0.24399833381175995, 'token': 117, 'token_str': ',', 'sequence': 'Maria, Groningen'}
{'score': 0.12300779670476913, 'token': 1104, 'token_str': 'of', 'sequence': 'Maria of Groningen'}
{'score': 0.11991506069898605, 'token': 1107, 'token_str': 'in', 'sequence': 'Maria in Groningen'}
{'score': 0.07722211629152298, 'token': 1306, 'token_str': '##m', 'sequence': 'Mariam Groningen'}
{'score': 0.0632941722869873, 'token': 118, 'token_str': '-', 'sequence': 'Maria - Groningen'}
We call the nlp pipeline requesting the top_k most likely suggestions to complete the provided sentences (in this case k=5). The pipeline returns a list of outputs as Python dictionaries. Depending on the task, the fields of the dictionary will differ: in this case, the fill-mask task returns a score (between 0 and 1; the higher the score, the more likely the token is), a token ID and its corresponding string, as well as the full "unmasked" sequence.
In the list of outputs we can observe: the first example shows correctly that the missing token in the first sentence is capital, the second example is a bit more ambiguous, but the model at least uses the context to correctly predict a series of items that can be eaten (unfortunately, none of its suggestions sound very tasty); finally, the third example gives almost no useful context so the model plays it safe and only suggests prepositions or punctuation. This already shows some of the weaknesses of the approach.
We will next see the case of combining BERT with a classifier on top.
BERT for Text Classification
The task of text classification is assigning a label to a whole sequence of tokens, for example a sentence. With the parameter task="text-classification", the pipeline() function will load the base model and automatically add a linear layer with a softmax on top. This layer can be fine-tuned with our own labeled data, or we can directly load one of the fully pre-trained text classification models already available in Hugging Face.

Let's see the example of a readily available pre-trained emotion classifier based on the RoBERTa model. This model was fine-tuned on the GoEmotions dataset, taken from English Reddit and labeled for 28 different emotions at the sentence level. The fine-tuned model is called roberta-base-go_emotions. It takes a sentence as input and outputs a probability distribution over the 28 possible emotions that might be conveyed in the text. For example:
PYTHON
classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=3)
sentences = ["I am not having a great day", "This is a lovely and innocent sentence", "Maria loves Groningen"]
model_outputs = classifier(sentences)
pretty_print_outputs(sentences, model_outputs)
===== I am not having a great day
{'label': 'disappointment', 'score': 0.46669483184814453}
{'label': 'sadness', 'score': 0.39849498867988586}
{'label': 'annoyance', 'score': 0.06806594133377075}
===== This is a lovely and innocent sentence
{'label': 'admiration', 'score': 0.6457845568656921}
{'label': 'approval', 'score': 0.5112180113792419}
{'label': 'love', 'score': 0.09214121848344803}
===== Maria loves Groningen
{'label': 'love', 'score': 0.8922032117843628}
{'label': 'neutral', 'score': 0.10132959485054016}
{'label': 'approval', 'score': 0.02525361441075802}
This code again outputs a list of dictionaries with the top-k (k=3) emotions that each sentence conveys. In this case, the first sentence evokes (in order of likelihood) disappointment, sadness and annoyance; the second evokes admiration, approval and love; and the third evokes love, neutral and approval. Note however that the likelihood of each prediction decreases dramatically below the top choice, so perhaps this specific classifier is only useful for the top emotion.
Fine-tuning BERT is very cheap, because we only need to train the classifier layer, a very small neural network, which can learn to choose between the classes (labels) of your custom classification problem without needing a large amount of annotated data. This classifier is just a single neural layer with a softmax, which assigns a score that can be interpreted as a probability over a set of labels, given the input features provided by BERT, which encode the meaning of the entire sequence in its hidden states.

BERT for Token Classification
Just as we plugged in a trainable text classifier layer, we can add a token-level classifier that assigns a class to each of the tokens encoded by a transformer (as opposed to one label for the whole sequence). A specific example of this task is Named Entity Recognition, but you can basically define any task that requires highlighting sub-strings of text and classifying them using this technique.
Named Entity Recognition
Named Entity Recognition (NER) is the task of recognizing mentions of real-world entities inside a text. The concept of Entity includes proper names that unequivocally identify a unique individual (PER), place (LOC), organization (ORG), or other object/name (MISC). Depending on the domain, the concept can be expanded to recognize other unique (and more conceptual) entities such as DATE, MONEY, WORK_OF_ART, DISEASE, PROTEIN_TYPE, etcetera.
In terms of NLP, this boils down to classifying each token into a series of labels (PER, LOC, ORG, O [no-entity]). Since a single entity can be expressed with multiple words (e.g. New York), the usual notation used for labeling the text is IOB (Inside, Outside, Beginning of entity), which identifies the limits of each entity's tokens. For example:
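As a small textual illustration (the exact tags depend on the chosen tag set), the sentence used later in this section would be labeled token by token like this:
PYTHON
tokens   = ["My", "name", "is", "Wolfgang", "Schmid", "and", "I", "live", "in", "Berlin"]
iob_tags = ["O",  "O",    "O",  "B-PER",    "I-PER",  "O",   "O", "O",    "O",  "B-LOC"]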

This is a typical sequence classification problem where an input sequence must be fully mapped into an output sequence of labels with global constraints (for example, there can't be an inner I-LOC label before a beginning B-LOC label). Since the labels of the tokens are context dependent, a language model with an attention mechanism such as BERT is very beneficial for a task like NER.
Because this is one of the core tasks in NLP, there are dozens of pre-trained NER classifiers in Hugging Face that you can use right away. We once again use the pipeline() to run the model for predictions on your custom data, in this case with task="ner". For example:
PYTHON
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
ner_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang Schmid and I live in Berlin"
ner_results = ner_classifier(example)
for nr in ner_results:
    print(nr)
The code prints the following:
{'entity': 'B-PER', 'score': 0.9996068, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}
{'entity': 'I-PER', 'score': 0.999582, 'index': 5, 'word': 'Sc', 'start': 20, 'end': 22}
{'entity': 'I-PER', 'score': 0.9990482, 'index': 6, 'word': '##hm', 'start': 22, 'end': 24}
{'entity': 'I-PER', 'score': 0.9951691, 'index': 7, 'word': '##id', 'start': 24, 'end': 26}
{'entity': 'B-LOC', 'score': 0.99956733, 'index': 12, 'word': 'Berlin', 'start': 41, 'end': 47}
In this case the output of the pipeline is a list of dictionaries, each one representing an entity IOB label at the BERT token level. IMPORTANT: this list is per word piece and NOT per human word, even if the provided text is pre-tokenized. You can assume all of the tokens that don't appear in the output were labeled as no-entity, that is "O". To recover the full-word entities you can initialize the pipeline with aggregation_strategy="first":
PYTHON
ner_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first")
example = "My name is Wolfgang Schmid and I live in Berlin"
ner_results = ner_classifier(example)
for nr in ner_results:
    print(nr)
The code now prints the following:
{'entity_group': 'PER', 'score': 0.9995944, 'word': 'Wolfgang Schmid', 'start': 11, 'end': 26}
{'entity_group': 'LOC', 'score': 0.99956733, 'word': 'Berlin', 'start': 41, 'end': 47}
As you can see, entities are now aggregated at the span level (instead of the token level). Word pieces are merged back into human words, and multiword entities are assigned a single entity label, unifying the IOB labels into one. Depending on your use case, you can request the pipeline to use different aggregation_strateg[ies]. More info about the pipeline can be found here.
The next step is crucial: evaluate how the pre-trained model actually performs on your dataset. This is important since the fine-tuned model could be overfitted to other custom benchmarks that do not share the characteristics of your dataset.
To observe this, we can first look at the performance on the test portion of the dataset on which this classifier was trained, and then evaluate the same pre-trained classifier on a NER dataset from a different domain.
Model Evaluation
To perform evaluation on your own data you can again use the seqeval package:
PYTHON
from seqeval.metrics import classification_report

# gold_labels and model_predictions are lists of label sequences
# (one list of IOB tags per sentence)
print(classification_report(gold_labels, model_predictions))
Since we took a classifier that was not trained on the book domain, the performance is quite poor. But this example shows us that classifiers that perform very well on their own domain often transfer poorly to other, apparently similar, datasets.
The solution in this case is to use another of the great characteristics of BERT: fine-tuning for domain adaptation. It is possible to train your own classifier with relatively small data (given that a lot of linguistic knowledge was already provided during the language modeling pre-training). In the following section we will see how to train your own NER model and use it for predictions.
Content from Episode 3: Using large language models
Last updated on 2025-09-18 | Edit this page
Estimated time: 60 minutes
Overview
Questions
- What is a Large Language Model (LLM)?
- How do LLMs differ from traditional NLP models?
- What is the Transformer architecture, and why is it important for LLMs?
- How does prompt engineering influence LLM outputs?
- What are some real-world applications of LLMs?
- How can LLMs generate, classify, and summarise text?
Objectives
After following this lesson, learners will be able to:
- Understand what Large Language Models (LLMs) are and their role in NLP
- Explain the Transformer architecture and why it is foundational for LLMs
- Use prompt engineering to generate high-quality text responses
- Apply LLMs to real-world tasks like news article analysis
- Explore pre-trained LLMs and experiment with custom tasks
- Set up a chatbot and a simple RAG
- Understand the impact of LLMs in modern AI and language processing
Large Language Models (LLMs) are a big and hot topic these days, and are continuously in the news. Everybody has heard of ChatGPT, many have tried it out for a variety of purposes, or have even incorporated these tools into their daily work. But what are these models exactly, how do they work 'under the hood', and how can you make use of them in the best possible way?
In this episode, we will:
- Explore LLMs, which represent a significant advancement in natural language processing (NLP). We will begin by defining what LLMs are and touching on their foundational architecture, particularly the Transformer model, which allows the LLM to understand and generate human-like text.
- Through practical examples, you will discover how to work with these models for specific tasks with content generation.
- We’ll also discuss real-world applications of LLMs, such as analyzing news articles and generating insights from large amounts of text data.
This episode aims to equip you with both theoretical knowledge and practical skills, preparing you to harness the power of LLMs in your own projects and applications.
What are Large Language Models?
Large language models (LLMs) represent a significant advancement in artificial intelligence, designed to process and interpret large-scale natural language data to generate responses to user queries. By being trained on extensive datasets through advanced machine learning algorithms, these models learn the intricate patterns, structures and nuances of human language. This enables them to produce coherent and natural-sounding language outputs across various inputs. As a result, large language models are becoming increasingly essential in a range of tasks such as text generation, text summarisation, rewriting, question answering, and language translation.
The emergence of ChatGPT, powered by OpenAI’s advanced LLMs, has brought these capabilities into the mainstream. With ChatGPT, users interact through natural language, enabling seamless conversations and performing complex tasks across various sectors, like customer service, education, and content creation. Models like GPT-4, BERT, and LLaMA are also used across various applications; from chatbots, virtual assistants, text analysis, translation, summarisation, and more.
Notably, the success of ChatGPT and other LLM-driven applications highlights their versatility and potential to transform how humans interact with digital systems. These models continue to push the boundaries of what’s possible in human-computer interaction, offering a glimpse into a future where machines and humans communicate more naturally and effectively.
Timeline
The journey of large language models over the past decade highlights the rapid evolution and growing impact of AI in language processing:
2012: Significant advancements in deep learning architectures and access to larger datasets lay the groundwork for what would later become the GPT (Generative Pre-trained Transformer) framework, marking a pivotal point in AI development.
2018: Google unveils BERT (Bidirectional Encoder Representations from Transformers), introducing a groundbreaking architecture that sets the stage for the next generation of large language models.
2020: OpenAI releases GPT-3, which, with its remarkable 175 billion parameters, establishes a new standard for performance in language-related tasks.
2022: The launch of ChatGPT marks a turning point by making GPT-3 and similar models easily accessible to the public via a web interface, significantly raising public awareness and interest in LLMs and generative AI.
2023: Open-source LLMs begin to deliver impressive outcomes with the introduction of models like Dolly 2.0, LLaMA, Alpaca, and Vicuna. Meanwhile, the release of GPT-4 pushes the envelope by setting new records for both parameter scale and overall performance.
Starting up a simple chat
Let’s explore how these LLMs work by doing a simple chat demonstration.
Imagine asking a computer a question and getting a human-like response. An LLM, like GPT-4, can generate responses that feel natural, informative, and tailored to your question by analyzing the context and meaning of your input. In the demonstration below, you’ll see how an LLM responds to a prompt and how it builds upon previous information to maintain the conversation.
Example Chat
User: “What’s the tallest mountain in the world?”
LLM: “The tallest mountain in the world is Mount Everest, which stands at 8,848 meters (29,029 feet).”
User: “How long does it take to climb it?”
LLM: “Climbing Mount Everest typically takes around two months, due to the acclimatization process needed to adjust to high altitudes. Expeditions often start from Base Camp in Nepal, where climbers spend several weeks preparing.”
User: “Could someone with no climbing experience try it?”
LLM: “Climbing Mount Everest is extremely challenging and requires significant training and experience due to the harsh conditions, high altitudes, and technical aspects of the climb. Most climbers spend years training and gaining experience on other mountains before attempting Everest.”
This example shows how LLMs can maintain context across multiple questions. Notice how the model connects information across each question, keeping the conversation relevant and informed.
This quick chat example introduces you to the conversational abilities of LLMs, demonstrating their ability to respond contextually, provide coherent answers, and follow up on earlier parts of the conversation.
How are LLMs different from other NLP techniques?
LLMs stand apart from traditional NLP techniques due to their scale, versatility, and reliance on deep learning models, especially Transformers. Here’s a breakdown of how they differ:
Scale of training data and parameters
- LLMs: Trained on massive datasets (trillions of words) and use billions of parameters, allowing them to learn language patterns, facts, and relationships at an unprecedented depth.
- Traditional NLP: Techniques like rule-based systems or smaller machine learning models rely on much smaller datasets, often requiring domain-specific training for each task (e.g., sentiment analysis or named entity recognition).
Model architecture
- LLMs: Use the Transformer architecture, particularly self-attention, to analyze relationships between words regardless of position. This allows them to capture long-range dependencies and context better than traditional models.
- Traditional NLP: Often use simpler models like bag-of-words, TF-IDF (term frequency-inverse document frequency), RNNs (recurrent neural networks), and LSTMs (long-short-term memory models), which treat words independently or consider only local context, missing the complex, global relationships.
Generalization vs. task-specific models
- LLMs: Can be applied across a wide range of tasks—summarization, translation, question answering, etc.—without the need for separate models for each. Fine-tuning makes them even more adaptable to specific needs.
- Traditional NLP: Typically requires developing or training a separate model for each task. For example, separate models for sentiment analysis, translation, and entity recognition.
Learning from unlabeled data
- LLMs: Leverage unsupervised or self-supervised learning during pretraining, enabling them to learn language patterns from raw text without human-labeled data.
- Traditional NLP: Often relies on supervised models that need labeled data for training (e.g., labeled sentiment or part-of-speech tags), which can be costly and time-consuming to create at scale.
Context and language nuance
- LLMs: Excel at understanding context, tone, and nuance, due to their ability to weigh word relationships dynamically. This enables better handling of idioms, sarcasm, and ambiguous phrases.
- Traditional NLP: Struggles with complex language nuances, often producing more rigid or literal interpretations. Contextual understanding is limited, especially for longer sentences or paragraphs.
Adaptability and fine-tuning
- LLMs: Easily adaptable to new tasks or domains with fine-tuning, making them versatile across different applications.
- Traditional NLP: Less flexible, often requiring retraining from scratch or heavy feature engineering to adapt to new domains or tasks.
In short, LLMs represent a leap forward by combining scale, flexibility, and deep learning power, allowing for more accurate, nuanced, and versatile language processing compared to traditional NLP techniques.
What LLMs are good at
- Language generation: Creating coherent and contextually appropriate text, making them ideal for creative writing, chatbots, and automated responses.
- Summarization and translation: Quickly summarizing articles, books, and translating text between languages with reasonable accuracy.
- Information retrieval and answering questions: LLMs can recall and apply general knowledge from their training data to answer questions, though they don’t actually “know” facts.
- Sentiment and text classification: LLMs can classify text for tasks like sentiment analysis, spam detection, and topic categorization.
What LLMs struggle with
- Fact-based accuracy: Since LLMs don’t “know” facts, they may generate incorrect or outdated information and are prone to hallucinations (making up facts).
- Understanding context over long passages: LLMs can struggle with context over very long texts and may lose track of earlier details, affecting coherence.
- Mathematical reasoning and logic: Though improving, LLMs often find complex problem-solving and detailed logical reasoning challenging without direct guidance.
- Ethical and sensitive issues: LLMs may produce biased or offensive text based on biases present in the training data, making content moderation necessary in sensitive applications.
How do LLMs work?
So, how is it that you can chat with a model and receive responses that seem almost human? The answer lies in the architecture and training of Large Language Models (LLMs), which are powered by advanced neural networks that understand, generate, and even translate human language with surprising accuracy.
At the core of LLMs lies a framework known as the transformer; a concept already encountered in the previous episode. Transformers allow these models to process vast amounts of text and learn the structure and nuances of language. This setup enables LLMs not only to answer questions but also to predict, complete, and even generate coherent text based on the patterns they’ve learned.
LLMs are trained on large text datasets and later fine-tuned on specific tasks, which helps them adapt to a wide range of applications, from conversation to text classification. The result? A model that can chat, summarize, translate, and much more, all by leveraging these core mechanisms. LLMs rely on the following key concepts:
- Transformers and self-attention: The transformer architecture, especially the self-attention mechanism, is at the heart of LLMs. Self-attention enables these models to understand the importance of each word in relation to others in a sequence, regardless of their position.
- Pretraining and fine-tuning: LLMs are first pre-trained on large text datasets using tasks like predicting the next word in a sentence, learning language patterns. They are then fine-tuned on specific tasks (e.g., translation, summarization) to enhance performance for targeted applications.
- Generative vs. discriminative models: LLMs can be applied to both generative tasks (e.g., text generation) and discriminative tasks (e.g., classification).
In practice, this attention mechanism helps LLMs produce coherent responses by establishing relationships between words as each new token is generated. Here’s how it works:
Understanding word relationships. Self-attention enables the model to weigh the importance of each word in a sentence, no matter where it appears, to make sense of the sentence as a whole.
Predicting next words based on context. With these relationships mapped out, the model can predict the next word in a sequence. For example, in “The fox,” self-attention allows the model to anticipate that “jumps” or “runs” might come next rather than something unrelated like “table.”
Structuring responses. As each word is generated, the model assesses how each new token impacts the entire sentence, ensuring that responses are relevant, logically sound, and grammatically correct. This ability to “structure” language is why LLMs can produce responses that are contextually meaningful and well-organized.
A zoo of Large Language Models
The era of Large Language Models gained momentum in 2018 with the release of Google’s BERT. Since then, many companies have rapidly developed newer and more powerful models. Among these are GPT (OpenAI), Llama (Meta), Mistral (Mistral AI), Gemini (Google DeepMind), Claude (Anthropic), and Grok (xAI). Some are open-source or more transparent than others, revealing their architectures, parameter counts, or training data and the collection thereof.
A list of LLMs:
- GPT - OpenAI
- Llama - Meta
- Mistral / Mixtral - Mistral AI (founded by former engineers from Google DeepMind and Meta)
- Gemini - Google DeepMind
- Claude - Anthropic (founded by former OpenAI employees)
- Grok - xAI (Elon Musk)
Training a large language model is extremely resource intensive. For example, Meta's Llama 3.1 405B model has 405 billion parameters. It was trained on 15 trillion tokens, used 31 million GPU hours (on H100 GPUs), and emitted almost 9000 tons of CO2 (for the training process alone).
Inference also consumes considerable resources and has a significant environmental impact. Large models require a lot of memory for storing and loading the model weights (storing the weights alone can require hundreds of gigabytes) and need high-performance GPUs to achieve reasonable runtimes. As a result, many models run on cloud-based servers, increasing power consumption, especially when scaled to accommodate large numbers of users.
Which one to choose when?
With so many available models, the question arises: which model should you use when? One thing to consider is whether you want to use an open source model or not. Another important aspect is the task at hand. There are various leaderboards (for example on HuggingFace) that track which tasks specific models are good at, based on widely used benchmarks. Also, which language are you using? Most models are largely trained on English; not many models are trained on Dutch text. So if you are working with Dutch texts, you may want to look for a model that is trained on or fine-tuned for Dutch. Additionally, some LLMs are multimodal models, meaning they can process various forms of input: text, images, time series, audio, video and so on.
Building a chatbot
It is time to start using an LLM! We are not going to train our own LLM, but use Meta’s open source Llama model to set up a chatbot.
Starting Ollama
Ollama is a platform that allows users to run various LLMs locally on your own computer. This is different from, for example, using ChatGPT, where you log in and use the online API. ChatGPT collects the input you provide and can use it for its own purposes. Running an LLM locally using Ollama thus preserves your privacy. It also allows you to customize a model, by setting certain parameters, or even by fine-tuning it.
To start Ollama:
ollama serve
Next, download the large language model to be used. In this case we use the smallest open source Llama model, which is llama3.1:8b. Here 3.1 is the version of the model and 8b stands for the number of parameters of the model (8 billion).
!ollama pull llama3.1:8b
In general, a bigger version of the same model (such as llama3.1:70b) is more accurate, but since it is larger it takes more resources to run and can hence be too much for a laptop.
Import the packages that will be used:
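PYTHON
# A sketch of the imports used in this section (assuming the langchain-ollama
# and langchain-core packages are installed)
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage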
Create a model instance
Here, model defines the LLM to be used, which is set to the model just downloaded, and temperature sets the randomness of the model; using the value zero ensures that repeating a question will give the same model output (answer).
llm = ChatOllama(model="llama3.1:8b", temperature=0)
Now that the model is set up, it can be invoked - ask it a question.
PYTHON
question = "When was the moon landing?"
chatresult = llm.invoke([HumanMessage(content=question)])
print(chatresult.content)
Challenge
Play around with the chat bot by changing the questions.
- How is the quality of the answers?
- Is it able to answer general questions, and very specific questions?
- Which limitations can you identify?
- How could you get better answers?
Challenge (continued)
This Llama chat bot, just like ChatGPT, is quite generic. It is good at answering general questions; things that a lot of people know. Going deeper and asking very specific questions often leads to vague or inaccurate results.
Use context
To steer what the LLM returns, it is also possible to provide it with some context. For example, add:
PYTHON
context = "You are a highschool history teacher trying to explain societal impact of historic events."
messages = [
    SystemMessage(content=context),
    HumanMessage(content=question),
]
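With the context in place, the model can be invoked with both messages (a sketch, reusing the llm and question defined above):
PYTHON
chatresult = llm.invoke(messages)
print(chatresult.content)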
The benefit here is that your answer will be phrased in a way that fits your context, without having to specify this for every question.
Use the chat history
With this chatbot the LLM can be invoked to generate output based on the provided input and context. However, what is not possible in this state is to ask follow-up questions. These can be useful to refine the output that it generates. The next step is therefore to implement message persistence in the workflow.
PYTHON
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph
from IPython.display import Image, display
The package LangGraph is a library designed to build LLM agents using workflows represented as graphs. The workflows you create consist of connected components, which allows you to build multi-step processes. The workflow graphs can be easily visualised, which makes them quite insightful. LangGraph also has a built-in persistence layer, exactly what we want right now!
First, define an empty workflow graph with the StateGraph class, using the MessagesState schema (a simple schema with messages as its only key):
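PYTHON
# Create the graph; MessagesState keeps a running list of chat messages
workflow = StateGraph(state_schema=MessagesState)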
Then define a function to invoke the llm with a message
PYTHON
def call_llm(state: MessagesState):
    response = llm.invoke(state["messages"])
    return {"messages": response}
Then add the call_llm function as a node to the graph and connect it with an edge to the start point of the graph. This start node sends the user input to the graph, which in this case only contains the LLM element.
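A sketch of this step (the node name "call_llm" is our own choice):
PYTHON
workflow.add_node("call_llm", call_llm)
workflow.add_edge(START, "call_llm")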
Initialise a memory that will preserve the messages state in a dictionary while going through the graph multiple times as you ask follow-up questions.
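For example:
PYTHON
# In-memory checkpointer that stores the conversation state between invocations
memory = MemorySaver()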
Then compile and visualise the graph with the memory as checkpoint.
PYTHON
graph = workflow.compile(checkpointer=memory)
display(Image(graph.get_graph().draw_mermaid_png()))
Define a memory id for the current conversation:
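PYTHON
# The thread_id identifies this conversation; the value is our own choice
config = {"configurable": {"thread_id": "conversation-1"}}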
Then call the workflow, with the memory we created, using the original question:
PYTHON
question = 'Who landed on the Moon?'
messages = [HumanMessage(question)]
output = graph.invoke({"messages": messages}, config)
output["messages"][-1].pretty_print()
The question and answer are now saved in the graph state with this config, and followup questions and answers with the same config will be added to it.
Everything that is saved can be inspected in the graph state for this config:
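PYTHON
# A sketch: inspect the checkpointed state for this conversation
print(graph.get_state(config))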
The workflow can now be used to ask follow-up questions without having to repeat the original question, building on the previously generated answer.
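For example (a sketch; the follow-up wording is our own):
PYTHON
followup = "How long did they stay there?"
output = graph.invoke({"messages": [HumanMessage(followup)]}, config)
output["messages"][-1].pretty_print()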
Retrieval Augmented Generation - Build a RAG
A chatbot tends to give quite generic answers. A more specific chatbot can be made by building a Retrieval Augmented Generation (RAG) agent. This is an agent that you provide with a knowledge base: a collection of documents. When prompted with a question, the agent first retrieves relevant sections of the data in the knowledge base, and then generates an answer based on that data. In this way you can build an agent with very specific knowledge.
The simplest form of a RAG consists of two parts, a retriever and a generator. The retriever collects relevant fragments from the provided data, so first a knowledge base has to be created for the retriever.
To generate text in the RAG the trained Llama model will be used, which works well for English text. Because this model was not trained on Dutch text, the RAG will work better for an English knowledge base.
Three newspaper pages will be used for the example RAG; these are pages from a Curacao newspaper. This is a Dutch newspaper with an additional page in English. The text versions of the newspapers can be downloaded to get only these specific English pages. Save them in a folder called "rag_data" for further processing:
- page1
- page2
- page3
The knowledge base - a vector store
Language models all work with vectors (embedded text). Instead of saving raw text, the data has to be stored as embeddings in a vector store, where the retriever can shop around for the relevant text.
There are a number of packages to be used in this section to build the RAG.
PYTHON
import os
from IPython.display import Image, display
from typing_extensions import List, TypedDict
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.messages import HumanMessage
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from langchain_nomic.embeddings import NomicEmbeddings
Define the large language model to be used to generate an answer based on provided context:
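PYTHON
# Reuse the locally pulled Llama model; temperature=0 keeps answers reproducible
llm = ChatOllama(model="llama3.1:8b", temperature=0)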
Define the embeddings model; this is the model used to convert our knowledge base texts into vector embeddings, and it will be used for the retrieval part of the RAG:
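PYTHON
# A sketch, assuming the nomic-embed-text-v1.5 model; inference_mode="local"
# runs the embedding model on your own machine
embeddings = NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local")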
For more on Nomic embeddings see: https://python.langchain.com/api_reference/nomic/embeddings/langchain_nomic.embeddings.NomicEmbeddings.html
Using inference_mode="local" uses [Embed4All](https://docs.gpt4all.io/old/gpt4all_python_embedding.html).
In the text files, the articles are separated by '---'. This information can be used to store the individual articles in a list. Store the filenames of the articles in a list as well, so that one can easily find from which file a text snippet was taken.
PYTHON
dir = "./rag_data"
articles = []
metadata = []
# Iterate over files and add individual articles and corresponding filenames to lists
for file in os.listdir(dir):
    file_path = os.path.join(dir, file)
    with open(file_path, "r") as f:
        content = f.read().split('---')
    articles.extend(content)
    metadata.extend([file_path] * len(content))
The generator will in the end provide an answer based on the text snippets that are retrieved from the knowledge base. If a fragment is very long, it may contain a lot of irrelevant information, which will blur the generated answer. Therefore it is better to split the data into smaller parts, so that the retriever can collect very specific pieces of text to generate an answer from. It is useful to keep some overlap between the splits, so that information does not get lost due to, for example, splits in the middle of a sentence.
PYTHON
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
# Attach the source filename (collected above in the metadata list) to each chunk
documents = text_splitter.create_documents(articles, metadatas=[{'filename': fname} for fname in metadata])
print(documents)
This text splitter splits text based on the defined character chunk size, but also takes spaces and newlines into account to split into "smart" chunks, so the chunks will not be exactly of length 500.
Finally, convert each text split into a vector, and save all vectors in a vector store. The text is converted into embeddings using the earlier defined embeddings model.
PYTHON
vectorstore = InMemoryVectorStore.from_texts(
    [doc.page_content for doc in documents],
    embedding=embeddings,
)
The contents of the vectorstore can be printed as follows:
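PYTHON
# A sketch, assuming the `store` attribute of InMemoryVectorStore (a dict of id -> record)
for doc_id, record in vectorstore.store.items():
    print(doc_id, record)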
It shows that for each text fragment that was given, a vector is created and it is saved in the vectorstore together with the original text.
Setting up the retriever and generator
Define the structure of a dictionary with the keys question, context, and answer:
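PYTHON
# The state passed between the RAG steps: the question, the retrieved context,
# and the generated answer
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str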
Define the retriever function of the RAG. It takes in the question, does a similarity search in the created vectorstore, and returns the text snippets that were found to be similar. The similarity search converts the question into an embedding vector and uses cosine similarity to determine the similarity between the question and the snippets. It then returns the top 4 snippets with the highest cosine similarity score. The snippets are returned in their original text form, i.e. the retrieved vectors are transformed back into text.
PYTHON
def retrieve(state: State):
    "Retrieve documents that are similar to the question."
    retrieved_docs = vectorstore.similarity_search(state["question"], k=4)
    return {"context": retrieved_docs}
Define the generator function of the RAG. In this function a prompt is defined for the RAG using the context and question. The large language model (the Llama model defined above) is then invoked with this prompt and generates an answer, which is returned as the answer key of the dictionary.
PYTHON
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    rag_prompt = """You are an assistant for question-answering tasks.
Here is the context to use to answer the question:
{context}
Think carefully about the above context.
Now, review the user question:
{question}
Provide an answer to this question using only the above context.
Use 10 sentences maximum and keep the answer concise.
Answer:"""
    rag_prompt_formatted = rag_prompt.format(context=docs_content, question=state["question"])
    response = llm.invoke([HumanMessage(content=rag_prompt_formatted)])
    return {"answer": response.content}
Build the workflow
The retriever and generator are combined into a workflow graph. The workflow is defined as a StateGraph that uses the dictionary structure (with the keys question, context, and answer) defined above. The retriever and generator are added as nodes, and the two are connected via an edge. The retrieve node is set as the start point of the workflow, and finally the graph is compiled into an executable.
workflow = StateGraph(State)
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)
workflow.add_edge("retrieve", "generate")
workflow.set_entry_point("retrieve")
graph = workflow.compile()
PYTHON
display(Image(graph.get_graph().draw_mermaid_png()))

That’s it! The RAG can now be asked questions. Let’s see what it can tell about the moon landing:
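PYTHON
# A sketch: only the question key is needed; retrieve and generate fill in the rest
question = "When was the moon landing?"
response = graph.invoke({"question": question})
print(response["answer"])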
This is quite a specific answer. It can be seen why by looking at the text snippets that were used:
PYTHON
print(response["context"])
While a general chatbot uses all the information in the material that it was trained on, the RAG only uses the information that was stored in the vectorstore to generate the answer.
Challenge
Try generating more answers with the RAG based on other questions, perhaps also looking at the newspaper texts that are used. What stands out?
For example:
- The RAG returns in some cases that no answer can be generated from the context it was provided with.
- For some questions, the LLM returns that it cannot provide an answer because of safety precautions that are inherent to the LLM used, for example for questions about violent acts.
This is the simplest form of a RAG, with just a retriever and a generator. The workflow can be made more complex by adding more components: for example, a step that checks the relevance of the retrieved documents and removes those that turn out to be irrelevant for answer generation, or a component that reformulates the question. Another example is a hallucination checker after the generator, which verifies that the generated answer can actually be found in the provided context.
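As a sketch of the first idea, a relevance-checking node could be inserted between the retriever and the generator. The node name, the prompt wording, and the simple yes/no check below are assumptions, not part of the workflow built above:
PYTHON
def grade_documents(state: State):
    "Keep only the retrieved snippets that the LLM judges relevant to the question."
    relevant = []
    for doc in state["context"]:
        grading_prompt = (
            "Does the following snippet contain information that is relevant to the question?\n"
            f"Question: {state['question']}\n"
            f"Snippet: {doc.page_content}\n"
            "Answer with a single word: yes or no."
        )
        verdict = llm.invoke([HumanMessage(content=grading_prompt)])
        if "yes" in verdict.content.lower():
            relevant.append(doc)
    return {"context": relevant}

# Rebuild the workflow with the grader between the retriever and the generator
workflow = StateGraph(State)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_edge("retrieve", "grade_documents")
workflow.add_edge("grade_documents", "generate")
workflow.set_entry_point("retrieve")
graph_with_grader = workflow.compile()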
Pitfalls, limitations, privacy
While LLMs are very powerful and provide us with many great possibilities and opportunities, they also have limitations.
- Training data: LLMs are trained on large data sets. These are often collected from the internet and books, but these come with downsides:
- They have a date cutoff: LLMs are trained on a static data set, meaning they contain information only up to a certain date and therefore do not have the latest information. They are definitely not useful for recent news events (often they mention this themselves), but they also lag behind on, for example, technological advancements. Always consider how old the model is that you are using.
- There is no fact checking involved in the training data. If the training data contains a lot of incorrect information or fake news, this will affect the answers generated.
- The training data is often biased, including social and cultural biases. This can lead to harmful responses that are stereotyping or racist. Training LLMs does involve human review for fine-tuning the models, so that they are prevented from answering questions about illegal activities, giving political advice, giving advice on harming yourself or others, or generating violent or sexual content. When prompted with questions about politics, for example, an LLM will provide generic factual information, but will also state that it will not give advice or opinionated answers.
- For GPT-4, there is no exact information on which data it was trained, meaning that the training data might violate privacy laws or infringe copyright.
- The data an LLM is trained on is generic, which means it is not good at generating answers to specialised questions. There are, however, already many models that are fine-tuned for specific fields.
- Language: LLMs are primarily trained on data collected from the internet, so they perform ‘best’ in the most widely spoken languages. ChatGPT is trained on many languages, but languages that are less widely spoken automatically have less data to train on, which makes the LLM less accurate in those languages.
- Multi-step thinking: LLMs are generally not good at multi-step reasoning. They are very good at providing bullet-point lists of information, but reasoning the way humans do, drawing a conclusion from combined pieces of logic, is something they are not good at (yet).
- Hallucinations: LLMs tend to hallucinate. When a model ‘does not know the answer’, it will often still try to produce one. You should therefore not blindly trust the answers from an LLM, but check the given information yourself.
- Privacy: when using a language model locally, as done above with Llama, your privacy is preserved: the model runs only on your laptop, and the data you provide is not uploaded to any server. But when you use, for example, ChatGPT via the web interface, there is no such privacy. Any information you provide (questions, provided context, and so on) will be used by ChatGPT, among other things for improving the model, which may be considered a good thing, but what else it is used for is not necessarily known. This means that you should be careful with what you provide to the LLM. Never provide sensitive information or private data to an LLM that you do not run fully locally.
Key points to remember
- Transformer models power LLMs: The Transformer architecture and its self-attention mechanism allow LLMs to handle long-range dependencies in text effectively.
- LLMs excel at multiple tasks: From text generation to classification and summarisation, LLMs like Llama are versatile and perform well on various NLP tasks.
- Prompt engineering is crucial: Designing effective prompts can significantly improve the quality of outputs from LLMs, especially for creative tasks like text generation and translation.
- Real-world use cases: LLMs can be applied to real-world problems like news classification, summarisation, and headline generation, improving efficiency in content management and delivery.