Summary and Schedule
Welcome
This lesson teaches the fundamentals of Natural Language Processing (NLP) in Python. It will equip you with the foundational skills and knowledge needed to carry out text-based research projects. The lesson is designed with researchers in the Humanities and Social Sciences in mind, but is also applicable to other fields of research.
On the first day we will dive into text preprocessing and word embeddings while exploring semantic shifts in various words over multiple decades. The second day begins with an introduction to transformers, and we will work on classification and named entity recognition with the BERT model. In the afternoon, we will cover large language models, and you will learn how to build your own agents.
Prerequisites
Before joining this course, participants should have:
- Basic Python programming skills
Do you want to teach this lesson? Find more help in the README Feel free to reach out to us with any questions that you have. Just open a new issue. We also value any feedback on the lesson!
| Duration | Episode | Key questions |
| --- | --- | --- |
| | Setup Instructions | Download files required for the lesson |
| 00h 00m | 1. Introduction | What is Natural Language Processing? What are some common applications of NLP? What makes text different from other data? Why not just learn Large Language Models? What linguistic properties should we consider when dealing with texts? |
| 02h 00m | 2. From words to vectors | What operations should I perform to get clean text? What properties do word embeddings have? What is a word2vec model? What insights can I get from word embeddings? How do I train my own word2vec model? |
| 04h 00m | 3. Transformers: BERT and Beyond | What are some drawbacks of static word embeddings? What are Transformers? What is BERT and how does it work? How can I use BERT to solve NLP tasks? How should I evaluate my classifiers? Which other Transformer variants are available? |
| 06h 00m | 4. Using large language models | |
| 07h 00m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Software Setup
Installing Python
Python is a popular language for scientific computing, and a frequent choice for machine learning as well. To install Python, follow the Beginner’s Guide or head straight to the download page.
Please set up your Python environment at least a day in advance of the workshop. If you encounter problems with the installation procedure, ask your workshop organizers for assistance via e-mail so you are ready to go as soon as the workshop begins.
Installing the required packages
Pip is the package management system built into Python; it should be available on your system once you have installed Python successfully. Please note that installing the packages can take some time, in particular on Windows.
Open a terminal (Mac/Linux) or Command Prompt (Windows) and run the following commands.
- Create a virtual environment called nlp_workshop:

Mac/Linux:
python3 -m venv nlp_workshop

Windows:
py -m venv nlp_workshop
- Activate the newly created virtual environment:
Mac/Linux:
source nlp_workshop/bin/activate

Windows:
nlp_workshop\Scripts\activate
Remember that you need to activate your environment every time you restart your terminal!
- Install the required packages:
Mac/Linux:
python3 -m pip install jupyterlab jieba spacy gensim matplotlib transformers

Windows:
py -m pip install jupyterlab jieba spacy gensim matplotlib transformers
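To confirm the installation worked, you can run a short check like the one below (a minimal sketch; the package list mirrors the pip command above, and the helper name `missing_packages` is our own):

```python
from importlib.util import find_spec

# Importable module names for the packages installed above.
REQUIRED = ["jupyterlab", "jieba", "spacy", "gensim", "matplotlib", "transformers"]

def missing_packages(names):
    """Return the subset of names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages installed.")
```

If any package is reported missing, re-run the pip command above inside the activated environment.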
Jupyter Lab
We will teach using Python in Jupyter Lab, a programming environment that runs in a web browser. Jupyter Lab is compatible with Firefox, Chrome, Safari and Chromium-based browsers. Note that Internet Explorer and Edge are not supported. See the Jupyter Lab documentation for an up-to-date list of supported browsers.
To start Jupyter Lab, open a terminal (Mac/Linux) or Command Prompt (Windows) and type the command:
jupyter lab
Ollama
We will use Ollama to run large language models. The installer (available for Linux/Windows/Mac OS) can be downloaded here:
Run the installer and follow the instructions on screen.
Next, download the model that we will be using from a terminal (Mac/Linux) or Command Prompt (Windows) by typing the command:
ollama pull llama3.2:1b
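Once the model is pulled and the Ollama application is running (by default it serves a REST API on http://localhost:11434), it can be queried from Python. Below is a minimal sketch, assuming the default port and the llama3.2:1b model pulled above; the helper names are our own:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the Ollama server to be running):
# print(generate("llama3.2:1b", "In one sentence, what is NLP?"))
```

We will build on this kind of programmatic access when working with large language models and agents on the second day.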
Data Sets
Datasets and example files are placed in the episodes/data/ directory.
You can manually download each of the 4 .txt files by clicking on them and using the down-arrow button (“Download raw file”) in the upper right corner of the screen, below the word “History”.
You should also manually download the notebook template available in the learners/notebook directory.
The 4 text files and the notebook should be placed together in the same directory.
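A quick way to verify that everything ended up in the same place is a small check like this (a sketch; the filenames below are hypothetical placeholders, so substitute the actual names of the files you downloaded):

```python
from pathlib import Path

def missing_files(directory, names):
    """Return the files from names that are not present in directory."""
    d = Path(directory)
    return [n for n in names if not (d / n).is_file()]

# Hypothetical filenames -- replace with the actual downloaded files.
EXPECTED = ["text1.txt", "text2.txt", "text3.txt", "text4.txt", "template.ipynb"]

# Example: check the current working directory.
# print(missing_files(".", EXPECTED))
```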
Word2Vec
Download the Word2Vec models trained on data from six Dutch national newspapers spanning the period 1950 to 1989 (Wevers, M., 2019). These models are available on Zenodo.

In addition, download the pretrained word2vec-google-news-300 model with gensim's downloader. Open a terminal (Mac/Linux) or Command Prompt (Windows) and type the command:

Mac/Linux:
python3 -m gensim.downloader --download word2vec-google-news-300

Windows:
py -m gensim.downloader --download word2vec-google-news-300
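Once a word2vec model is loaded, questions like "how similar are two words?" reduce to cosine similarity between their vectors. A minimal sketch with tiny hand-made vectors (real embeddings have hundreds of dimensions; the numbers here are purely illustrative, not taken from any trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (illustrative only).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much lower
```

gensim exposes the same idea through methods such as `most_similar` on a loaded model, which we will use during the lesson.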
Spacy English
Download the trained pipelines for English from Spacy. To do so, open a terminal (Mac/Linux) or Command Prompt (Windows) and type the command:
Mac/Linux:
python3 -m spacy download en_core_web_sm

Windows:
py -m spacy download en_core_web_sm