Introduction


  • Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that draws on Linguistics to develop approaches to process, understand and generate natural language

  • Linguistic data has special properties that we should consider when modeling our solutions

  • Key tasks include language modeling, text classification, token classification and text generation

  • Deep learning has significantly advanced NLP, but the discrete and ambiguous nature of language remains a challenge

  • The ultimate goal of NLP is to enable machines to understand and process language as humans do

From words to vectors


  • We can run a preprocessing pipeline to obtain clean words that can be used as features
  • We learned how words are converted into vectors of numbers (which makes them interpretable by machines)
  • We can easily compute how similar two words are to each other with cosine similarity
  • Using gensim we can train our own word2vec models, as shown in the sketch below
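
A minimal sketch of this workflow with gensim 4.x, using a made-up toy corpus (the hyperparameters are illustrative, not recommendations):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of preprocessed tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

# Train a small word2vec model on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

# Cosine similarity between the learned vectors of two words
print(model.wv.similarity("cat", "dog"))

# The words whose vectors are most similar to that of "cat"
print(model.wv.most_similar("cat", topn=3))
```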

Transformers: BERT and Beyond


  • Static word representations, such as word2vec, lack the context needed for more advanced tasks; we made this weakness evident by studying polysemy.

  • The transformer architecture consists of three main components: an Encoder to create powerful text representations (embeddings), an Attention Mechanism to learn from the full sequence context, and a Decoder, a generative model that predicts the next token based on the context it has seen so far.

  • BERT is a deep encoder that creates rich contextualized representations of words and sentences. These representations are powerful features that can be re-used by other machine learning and deep learning models (see the first sketch after this list).

  • Several of the core NLP tasks can be solved using Transformer-based models. In this episode we covered language modeling (fill in the mask), text classification (sentiment analysis) and token classification (named entity recognition); the second sketch after this list shows all three.

  • Evaluating model performance on your own data, for your own use case, is crucial to understand possible drawbacks before relying on the model's predictions on unseen inputs
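
As a first sketch, here is one common way to extract contextual BERT embeddings with the Hugging Face transformers library; the checkpoint name and example sentences are illustrative:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained BERT encoder and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# "bank" should receive a different vector in each sentence (polysemy)
sentences = ["I deposited money at the bank", "We sat on the river bank"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; these can be re-used as features
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```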
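A second sketch covers the three tasks with the high-level pipeline API. Where no model is named, transformers downloads a default checkpoint, so treat the exact outputs as illustrative:

```python
from transformers import pipeline

# Language modeling: fill in the masked token
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])

# Text classification: sentiment analysis (uses a default checkpoint)
sentiment = pipeline("sentiment-analysis")
print(sentiment("I really enjoyed this episode!"))

# Token classification: named entity recognition, grouped into entities
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Marie Curie was born in Warsaw."))
```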

Using large language models


  • We learned how so-called LLMs differ from the first generation of Transformers
  • There are different kinds of LLMs, and understanding their differences and limitations is key to choosing the best model for your case
  • We learned how to use the HuggingFace pipeline with SmolLM2, an open source model (see the first sketch below).
  • We learned how to use Ollama to run conversational models on our laptops (second sketch below)
  • Classification tasks can be done using generative models if we craft the prompt carefully (last sketch below)
  • Hidden biases will always be present when using LLMs; we should be aware of them before we draw conclusions from the outputs.
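
A minimal sketch of text generation with a SmolLM2 checkpoint through the pipeline API (the exact checkpoint id and size are assumptions; pick what fits your hardware):

```python
from transformers import pipeline

# Load a small instruction-tuned SmolLM2 checkpoint (name assumed)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")

prompt = "Natural language processing is"
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```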
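Running a local conversational model through Ollama's Python client might look like this; it assumes the Ollama server is running on your machine and that the model named here (an example, not a requirement) has been pulled with `ollama pull`:

```python
import ollama

# Ask a local model a question through the chat endpoint
response = ollama.chat(
    model="llama3.2",  # example model; any pulled chat model works
    messages=[{"role": "user", "content": "What is tokenization?"}],
)
print(response["message"]["content"])
```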
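Finally, a sketch of prompting a generative model to act as a classifier; the prompt template, label set and example review are made up for illustration:

```python
from transformers import pipeline

# Reuse a small generative model (checkpoint name assumed)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")

# Constrain the model to a fixed label set through the prompt itself
review = "The plot was predictable and the acting was flat."
prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    f"Review: {review}\n"
    "Sentiment:"
)
result = generator(prompt, max_new_tokens=3, do_sample=False)

# Keep only the newly generated text after the prompt
print(result[0]["generated_text"][len(prompt):].strip())
```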