
Text Analysis in Python: Glossary

Key Points

Introduction to Natural Language Processing
  • NLP comprises models that perform different tasks.

  • Our workflow for an NLP project consists of designing the task, preprocessing the text, creating a representation, running the model, generating output, and interpreting that output.

  • NLP tasks can be adapted to suit different research interests.

Corpus Development - Text Data Collection
  • You will need to evaluate whether data is suitable for inclusion in your corpus, taking into consideration issues such as legal/ethical restrictions and data quality, among others.

  • It is important to think critically about data sources and the context of how they were created or assembled.

  • Becoming familiar with your data and its characteristics can help you prepare your data for analysis.

Preparing and Preprocessing Your Data
  • Tokenization breaks strings into smaller parts for analysis.

  • Casing (lowercasing) removes capital letters so that words match regardless of case.

  • Stopwords are common words that do not contain much useful information.

  • Lemmatization reduces words to their base dictionary form (lemma). These steps are sketched in the example below.
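
A minimal sketch of these steps, assuming the NLTK library; the example sentence is made up, and the punkt, stopwords, and wordnet resources are downloaded first (newer NLTK releases may also need punkt_tab).

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("punkt")      # tokenizer models
    nltk.download("stopwords")  # lists of common words
    nltk.download("wordnet")    # dictionary used by the lemmatizer

    text = "The cats were sitting on the mats."

    tokens = nltk.word_tokenize(text)                   # tokenization
    tokens = [t.lower() for t in tokens]                # casing (lowercasing)
    tokens = [t for t in tokens if t.isalpha()]         # drop punctuation
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]      # stopword removal
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization
    print(lemmas)                                       # ['cat', 'sitting', 'mat']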

Vector Space and Distance
  • We model documents by representing them as vectors (points) in a high-dimensional space.

  • Euclidean distance between document vectors is highly dependent on document length.

  • Documents are modeled as vectors so that cosine similarity, which ignores differences in length, can be used as a similarity metric (see the sketch below).
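
A small sketch of both points, assuming scikit-learn; the three short "documents" are invented, and the second is simply the first repeated ten times so that only its length differs.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

    docs = [
        "the cat sat on the mat",
        "the cat sat on the mat " * 10,   # same content, ten times longer
        "the dog chased the ball",
    ]

    X = CountVectorizer().fit_transform(docs)   # each document becomes a vector

    print(euclidean_distances(X)[0, 1])   # large: distance grows with document length
    print(cosine_similarity(X)[0, 1])     # ~1.0: same direction, length is ignored
    print(cosine_similarity(X)[0, 2])     # lower: genuinely different content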

Document Embeddings and TF-IDF
  • Some words convey more information about a corpus than others.

  • One-hot encodings treat all words equally.

  • TF-IDF encodings give lower weight to overly common words (see the sketch below).
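
A short sketch using scikit-learn's TfidfVectorizer on a made-up three-document corpus (a recent scikit-learn is assumed for get_feature_names_out).

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the cat sat on a mat",
        "the dog sat on a log",
        "the whale swam in a sea",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    # Weights for the first document: "the" appears in every document and so
    # scores lower than words unique to this document, such as "cat" and "mat".
    for word, weight in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[0]):
        if weight > 0:
            print(f"{word}: {weight:.2f}")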

Latent Semantic Analysis
  • Topic modeling helps explore and describe the content of a corpus.

  • LSA defines topics as spectra that the corpus is distributed over.

  • Each dimension (topic) in LSA corresponds to a contrast between positively and negatively weighted words (see the sketch below).
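
A sketch of LSA as a truncated SVD over a TF-IDF matrix, assuming scikit-learn; the four toy documents (two about courts, two about football) and the choice of two topics are purely illustrative.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the judge ruled on the appeal in court",
        "the court heard the appeal from the lawyer",
        "the striker scored a goal late in the match",
        "the team won the match after the final goal",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)

    lsa = TruncatedSVD(n_components=2, random_state=0)
    doc_topics = lsa.fit_transform(X)   # where each document sits on each topic

    # Each topic (dimension) is a weighted mix of words; positive and negative
    # weights can contrast groups of words that rarely co-occur in the corpus.
    terms = tfidf.get_feature_names_out()
    for i, component in enumerate(lsa.components_):
        top = sorted(zip(terms, component), key=lambda wv: abs(wv[1]), reverse=True)[:4]
        print(f"topic {i}:", [(word, round(weight, 2)) for word, weight in top])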

Intro to Word Embeddings
  • Word embeddings can help us derive additional meaning stored in text at the level of individual words.

  • Word embeddings have many use cases in text analysis and other NLP-related tasks (see the sketch below).
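
A brief sketch using gensim's downloader to explore pre-trained GloVe vectors; "glove-wiki-gigaword-50" is one of the small models gensim hosts, and the first call downloads it (roughly 65 MB).

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")     # maps each word to a 50-dimensional vector

    print(wv["coffee"][:5])                     # first few dimensions of one word vector
    print(wv.most_similar("coffee", topn=3))    # nearest words in embedding space
    print(wv.similarity("coffee", "tea"))       # cosine similarity between two words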

The Word2Vec Algorithm
  • Artificial neural networks (ANNs) are powerful models that can approximate any function given sufficient training data.

  • The best way to decide between the two training methods (CBOW and Skip-gram) is to try both and see which works better for your specific application, as in the sketch below.
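
A sketch of trying both architectures with gensim's Word2Vec, where sg=0 selects CBOW and sg=1 selects Skip-gram; the toy sentences stand in for your own tokenized corpus.

    from gensim.models import Word2Vec

    # Toy tokenized corpus, repeated so there is enough data to train on.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "log"],
        ["dogs", "and", "cats", "are", "friendly", "pets"],
    ] * 50

    cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)  # CBOW
    skip = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # Skip-gram

    # Probe both models on words that matter for your application and keep
    # whichever produces the more sensible neighbours.
    print(cbow.wv.most_similar("cat", topn=3))
    print(skip.wv.most_similar("cat", topn=3))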

Training Word2Vec
  • As an alternative to using a pre-trained model, training a Word2Vec model on a specific dataset allows you to use Word2Vec for NER-related tasks; a brief training sketch follows.
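
A sketch of training on your own documents and saving the result for reuse; the two example sentences and the file name are placeholders for your actual corpus.

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    corpus = [
        "Captain Ahab pursued the white whale across the ocean.",
        "Ishmael sailed from Nantucket aboard the Pequod.",
        # ... the rest of your documents ...
    ]

    sentences = [simple_preprocess(doc) for doc in corpus]   # tokenize and lowercase
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    model.save("my_corpus_word2vec.model")       # store the trained vectors
    model = Word2Vec.load("my_corpus_word2vec.model")

    # Vectors learned from your own corpus reflect its vocabulary and usage,
    # e.g. finding terms that appear in similar contexts to a seed entity.
    print(model.wv.most_similar("whale", topn=5))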

Finetuning LLMs
  • HuggingFace has many examples of LLMs you can fine-tune.

  • Examine preexisting examples to get an idea of what your model expects.

  • Label Studio and other tagging tools allow you to easily tag your own data.

  • Looking at the metrics commonly used in your subject area, and at how other models perform on them, will give you an idea of how well your model did (see the sketch below).
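
A minimal fine-tuning sketch using the HuggingFace transformers and datasets libraries; the checkpoint (distilbert-base-uncased), the imdb dataset, and the hyperparameters are stand-ins for your own model, labelled data, and settings.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    checkpoint = "distilbert-base-uncased"      # a small pre-trained model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    dataset = load_dataset("imdb")              # swap in your own labelled data

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="finetuned-model",
                             num_train_epochs=1,
                             per_device_train_batch_size=8)
    trainer = Trainer(model=model,
                      args=args,
                      train_dataset=dataset["train"].shuffle(seed=0).select(range(1000)),
                      eval_dataset=dataset["test"].shuffle(seed=0).select(range(500)))
    trainer.train()
    print(trainer.evaluate())                   # evaluation loss; add task metrics as needed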

Ethics and Text Analysis
  • Text analysis is a tool and cannot assign meaning to results.

  • As researchers, we are responsible for understanding and explaining our methods and results.

Glossary

FIXME