Key Points

Introduction


  • NLP is a subfield of Artificial Intelligence that, with help from Linguistics, deals with processing, understanding, and generating natural language data.
  • Linguistic data has unique properties that make it challenging to process computationally: it is unstructured, ambiguous, context-dependent, and varies significantly across the 7000+ human languages.
  • Tokenization is the foundational step in NLP: splitting text into meaningful units (tokens) creates the structure that all downstream algorithms require.
  • Language Modeling is a subset of NLP, not a synonym for AI: understanding NLP fundamentals (tokenization, statistical models, evaluation, …) can help to trace errors, detect biases, and use LLMs more critically and effectively.
  • Text pre-processing is a pipeline of decisions: character cleaning, tokenization, lowercasing, and lemmatizing are common steps that can be used to improve the performance of your task.
  • Libraries like spaCy are lightweight and make it practical to extract linguistic features (tokens, lemmas, part-of-speech tags, named entities, and sentence boundaries) from text in different languages with minimal code.
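
The pre-processing steps named above (character cleaning, lowercasing, tokenization) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not the spaCy pipeline: the helper name `preprocess` and the regular expression are invented for the example, and lemmatization is left out because it needs a language-specific resource.

```python
import re

def preprocess(text: str) -> list[str]:
    """Hypothetical helper: clean, lowercase, and tokenize a string."""
    # Character cleaning: collapse runs of whitespace (including newlines).
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Lowercasing: treat "The" and "the" as the same token.
    lowered = cleaned.lower()
    # Tokenization: keep words (with internal apostrophes) and
    # single punctuation marks as separate tokens.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", lowered)

tokens = preprocess("The  cat didn't sit\non the mat.")
print(tokens)
```

Each step is a decision: lowercasing, for example, merges "The" and "the" but would also merge "US" and "us", so whether it helps depends on the task.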

A Primer on Linguistics


  • NLP tasks can be approached as supervised (learning from labeled examples), semi-supervised (learning from text tokens), or unsupervised (exploiting patterns in raw text).
  • The main families of NLP tasks are text classification, token classification, language modeling, and text generation.
  • Language is compositional: meaning is built layer by layer from words to sentences to discourse, and sometimes more than two layers are needed in order to describe the meaning of a piece of text.
  • Language is ambiguous at multiple levels and resolving this ambiguity requires contextual information that is challenging for machines to capture.
  • Language is sparse: words of interest typically appear rarely in a corpus, dominated by high-frequency stopwords that carry little semantic content.
  • Language is discrete, and word form does not reflect meaning: “car” and “cat” differ by one letter yet are unrelated, while “pizza” and “hamburger” look very different but are semantically more similar.
  • Domain-specific language shifts the distribution of meaningful words and can change the meaning of terms entirely (e.g., “trial” in law vs. “trial” in medicine).
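
The sparsity point can be made concrete by counting word frequencies. The sketch below uses an invented toy corpus and a hand-picked stopword set (both are assumptions for illustration; real stopword lists are much longer) to show how a few high-frequency words account for a large share of all tokens.

```python
from collections import Counter

# Toy corpus invented for this example.
corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the judge opened the trial"
)
counts = Counter(corpus.split())

# Hypothetical, hand-picked stopword set.
stopwords = {"the", "on", "and", "while"}

# Share of all tokens that are stopwords.
stopword_share = sum(counts[w] for w in stopwords) / sum(counts.values())
most_common_word, most_common_count = counts.most_common(1)[0]

print(most_common_word, most_common_count)
print(round(stopword_share, 2))
```

In this toy corpus a content word such as "trial" appears once, while four stopwords cover roughly half of the tokens.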

From words to vectors


  • An NLP pipeline is a chain of steps from raw text to a structured task output.
  • Word embeddings represent words as dense numeric vectors. These are learned by training a neural network on a language modeling objective. Because similar words appear in similar contexts, semantically related words will have geometrically similar vectors.
  • Word2Vec is a popular word embedding model. It encodes semantic relationships in vector arithmetic: analogies like “king − man + woman ≈ queen” emerge naturally from the geometry of vectors.
  • Word2Vec has a key limitation: words unseen during training (out-of-vocabulary) cannot be represented, and each word receives one fixed vector regardless of context.
  • Cosine similarity measures the angle between two vectors and is the standard metric for comparing word embeddings: it captures semantic relatedness independently of vector magnitude, and returns values between -1 (vectors pointing in opposite directions) and 1 (vectors pointing in the same direction), with 0 for orthogonal vectors.
  • Topic modelling is an unsupervised method for discovering latent themes in a document collection, representing each topic as a weighted list of characteristic words and each document as a mixture of topics.
  • BERTopic builds on language model embeddings by chaining document vectorization (each document gets a single vector that represents it), dimensionality reduction (UMAP), clustering (HDBSCAN), and topic labeling (c-TF-IDF) — each component can be swapped to adapt to different languages, text lengths, or research goals.
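
Cosine similarity, as described in the points above, is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the three-dimensional "embeddings" are made-up numbers for illustration, not values from a trained Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors; real embeddings have hundreds of dimensions.
pizza = [0.9, 0.8, 0.1]
hamburger = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(pizza, hamburger))  # close to 1
print(cosine_similarity(pizza, car))        # noticeably lower
# Scaling a vector does not change the result.
print(cosine_similarity(pizza, [2 * x for x in pizza]))
```

Because the lengths are divided out, only the direction of the vectors matters, which is why the metric is independent of vector magnitude.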

Transformers: BERT and Beyond


  • Static word representations, such as Word2Vec, lack the context needed for more advanced tasks. We made this weakness evident by studying polysemy.

  • The transformer architecture consists of three main components: an Encoder to create powerful text representations (embeddings), an Attention Mechanism to learn more from the full sequence context, and a Decoder, a generative model that predicts the next token based on the context it has so far.

  • BERT is a deep encoder that creates rich contextualized representations of words and sentences. These representations are very powerful features that can be re-used by other machine learning and deep learning models.

  • Several of the core NLP tasks can be solved using Transformer-based models. In this episode we covered language modeling (fill in the mask), text classification (sentiment analysis), and token classification (named entity recognition).

  • Evaluating model performance on your own data and for your own use case is crucial to understand possible drawbacks before relying on the model's predictions for unseen inputs.
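
To illustrate how an attention mechanism lets each token draw on the full sequence context, here is a simplified sketch of scaled dot-product self-attention in plain Python. It assumes that queries, keys, and values are the input vectors themselves; real transformers such as BERT use learned projections and multiple attention heads. The function names and the toy vectors are invented for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        out = [sum(w * v[i] for w, v in zip(weights, vectors))
               for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
contextual, weights = self_attention(tokens)
print(weights[0])
print(contextual[0])
```

Each output vector mixes information from every position in the sequence, which is what makes the resulting representations contextual rather than static.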

Using large language models


  • LLMs differ from vanilla transformers in three key ways: scale (parameters and context window sizes), post-training (SFT + RLHF), and generalization capabilities (given the amount of seen training data).
  • The LLM landscape spans distinct model families: embedder models (encoder-only, optimized for similarity), base generative models, instruct-tuned assistants, reasoning models, and tool-augmented models. Choosing the right type depends on the task.
  • LLMs can be “open” in three independent dimensions: open weights, open training data, and open architecture. An LLM that only exposes a remote API is a proprietary model and we don’t have much control over its behavior.
  • Open LLMs can run locally with tools like Ollama and the HuggingFace transformers pipeline. This eliminates API costs, keeps sensitive data private, and makes experiments more reproducible without depending on remote services.
  • Generative models can solve classification tasks through careful prompt design: label choices, output format, and system instructions directly shape what the model returns. Outputs should still be evaluated with standard metrics (precision, recall, F1).
  • LLMs can produce confident and fluent content, but this does not imply the content will be factually correct. Guardrails partially mitigate this but do not eliminate it.
  • LLMs carry systematic biases inherited from training data and post-training: gender stereotypes, anglosphere-centric factual accuracy, and underrepresentation of languages and cultures outside the western mainstream are structural failures. These biases should always be considered, and results analyzed carefully, before drawing conclusions.
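
The standard metrics mentioned above (precision, recall, F1) can be computed by comparing the labels an LLM returns against gold labels. The sketch below is a plain-Python illustration for a single positive class; the helper name `precision_recall_f1` and the label lists are invented for the example and do not come from a real model run.

```python
def precision_recall_f1(gold, predicted, positive_label):
    """Precision, recall, and F1 for one label treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented gold labels and hypothetical model outputs.
gold = ["positive", "negative", "positive", "positive", "negative"]
predicted = ["positive", "positive", "positive", "negative", "negative"]

precision, recall, f1 = precision_recall_f1(gold, predicted, "positive")
print(precision, recall, f1)
```

In practice, generated text usually has to be normalized first (for example, stripped and lowercased) so that it matches one of the allowed labels before metrics are computed.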