Introduction
NLP is a subfield of Artificial Intelligence (AI) that draws on Linguistics to develop approaches for processing, understanding, and generating natural language
Linguistic data has special properties that we should take into account when modeling our solutions
Key tasks include language modeling, text classification, token classification and text generation
Deep learning has significantly advanced NLP, but handling the discrete and ambiguous nature of language remains a challenge
The ultimate goal of NLP is to enable machines to understand and process language as humans do
From words to vectors
- We can run a preprocessing pipeline to obtain clean words (tokens) that can be used as features
- We learned how words are converted into vectors of numbers, which makes them interpretable for machines
- We can easily compute how similar words are to each other using cosine similarity
- Using gensim we can train our own word2vec models (a minimal sketch of this workflow follows after this list)
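A minimal sketch of the words-to-vectors workflow summarized above; the tiny corpus and the hyperparameter values are illustrative assumptions chosen only for demonstration, not values from the lesson:

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Toy corpus: each document is preprocessed into a clean list of lowercase tokens.
raw_docs = [
    "The cat sat on the mat.",
    "Dogs and cats are common pets.",
    "The dog chased the cat around the garden.",
]
tokenized_docs = [simple_preprocess(doc) for doc in raw_docs]

# Train a small word2vec model; a real corpus would need far more sentences.
model = Word2Vec(sentences=tokenized_docs, vector_size=50, window=3, min_count=1, epochs=50)

# Each word is now a dense vector, and cosine similarity measures how close two words are.
print(model.wv["cat"])                    # the 50-dimensional vector for "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity between "cat" and "dog"
```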
Transformers: BERT and Beyond
Static word representations, such as word2vec, lack the context needed for more advanced tasks; we made this weakness evident by studying polysemy.
The transformer architecture consists of three main components: an Encoder that creates powerful text representations (embeddings), an Attention Mechanism that learns from the full sequence context, and a Decoder, a generative model that predicts the next token based on the context seen so far.
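To make the attention component more concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are illustrative, and real transformers add learned projections, multiple heads, and masking on top of this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the value vectors V,
    where the weights reflect how well the query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # context-aware representation for each token

# Toy sequence of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8): one vector per token
```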
BERT is a deep encoder that creates rich contextualized representations of words and sentences. These representations are powerful features that can be reused by other machine learning and deep learning models.
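A hedged sketch of extracting such reusable features with the HuggingFace `transformers` library; the sentences and the mean-pooling choice are illustrative assumptions:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I deposited cash at the bank.", "We had a picnic on the river bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state holds one contextual vector per token; the same word "bank"
# gets a different vector in each sentence, unlike a static word2vec embedding.
token_embeddings = outputs.last_hidden_state  # shape: (2, seq_len, 768)

# Mean-pooling over tokens gives a simple sentence-level feature vector that can
# be fed into a downstream classifier.
sentence_embeddings = token_embeddings.mean(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```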
Several of the core NLP tasks can be solved using Transformer-based models. In this episode we covered language modeling (fill-mask), text classification (sentiment analysis), and token classification (named entity recognition).
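A minimal sketch of these three tasks with the HuggingFace `pipeline` API; the models are the pipeline defaults (apart from the explicit BERT checkpoint for fill-mask) and the example sentences are made up for illustration:

```python
from transformers import pipeline

# Language modeling: fill in the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Text classification: sentiment analysis.
sentiment = pipeline("sentiment-analysis")
print(sentiment("I really enjoyed this workshop!"))

# Token classification: named entity recognition.
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Marie Curie worked at the University of Paris."))
```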
Evaluating model performance on your own data, for your own use case, is crucial for understanding possible drawbacks before relying on the model for unseen inputs.
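A hedged sketch of such a sanity check: the handful of labelled examples below are invented, and the label names (`POSITIVE`/`NEGATIVE`) are those of the default sentiment model, which may differ for other models.

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

# A few examples from your own use case, with your own gold labels.
my_data = [
    ("The instrument calibration failed again.", "NEGATIVE"),
    ("The new release fixed the crash on startup.", "POSITIVE"),
    ("Support never answered my ticket.", "NEGATIVE"),
]

predictions = [sentiment(text)[0]["label"] for text, _ in my_data]
gold = [label for _, label in my_data]
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(f"Accuracy on my own examples: {accuracy:.2f}")
```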
Using large language models
- We learned how so-called LLMs differ from the first generation of Transformers
- There are different kinds of LLMs, and understanding their differences and limitations is key to choosing the best model for your case
- We learned how to use the HuggingFace pipeline with SmolLM2, an open-source model (see the first sketch after this list).
- We learned how to use Ollama to run conversational models on our laptop (see the second sketch after this list)
- Classification tasks can be done with generative models if we craft the prompt carefully
- Hidden biases are always present when using LLMs; we should be aware of them before drawing conclusions from their outputs.
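For the SmolLM2 and prompt-based classification points above, a minimal sketch using the HuggingFace `pipeline`; the model id `HuggingFaceTB/SmolLM2-360M-Instruct`, the prompt wording, and the example review are illustrative assumptions rather than the exact ones used in the episode:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")

review = "The plot was predictable and the acting felt flat."
prompt = (
    "Classify the sentiment of the following review as exactly one word, "
    "either 'positive' or 'negative'.\n"
    f"Review: {review}\nSentiment:"
)

# max_new_tokens keeps the answer short so it stays a single-word label.
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])  # includes the prompt followed by the model's label
```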
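For the Ollama point above, a minimal sketch using the `ollama` Python client; it assumes Ollama is installed locally and that a model such as `llama3.2` has already been pulled (e.g. with `ollama pull llama3.2`), which are setup assumptions, not steps from the episode:

```python
import ollama

# Send a single-turn conversation to the locally running model.
response = ollama.chat(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Summarise what a word embedding is in one sentence."},
    ],
)
print(response["message"]["content"])
```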