Bag-of-Words

The bag-of-words (BoW) model is a way of representing text data for natural language processing tasks. The basic idea is to convert a text document into a numerical vector in which each element corresponds to a specific word in the vocabulary and its value is the number of times that word occurs in the document. Word order and grammar are discarded; only word frequencies are retained.

BoW can be seen as an extension of One-Hot Encoding: rather than only flagging whether a word is present, it records how many times the word occurs.
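
To make the relationship concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the toy document is hypothetical, not from the source): with binary=True the vectorizer only flags presence, which is the one-hot-style view, while the default behavior counts occurrences, which is BoW.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the dog"]  # hypothetical toy document

# One-hot-style view: flag only whether each word appears.
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(docs).toarray())  # [[1 1 1]]

# Bag-of-words view: count how often each word appears.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())     # [[1 2 2]]
print(bow.get_feature_names_out())           # ['chased' 'dog' 'the']
```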

Process

The process of creating a BoW representation of a text document typically involves several steps; a runnable sketch of the whole pipeline follows the list:

  1. Tokenization: The text document is split into individual words or tokens.
  2. Filtering: Uninformative tokens, most commonly stop words (frequent words like "the," "and," and "is"), are removed from the text.
  3. Stemming/Lemmatization: The remaining words are then reduced to their root form (stemming) or their dictionary form (lemmatization).
  4. Counting: The number of occurrences of each word in the filtered and stemmed/lemmatized text is counted.
  5. Vectorization: The counts are arranged into a numerical vector over a fixed vocabulary, where each element corresponds to one vocabulary word and its value is that word's occurrence count.
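
Putting the five steps together, here is a minimal end-to-end sketch in Python. The toy corpus, the regex tokenizer, and the deliberately tiny stop-word list are illustrative simplifications of my own; only NLTK's PorterStemmer is a real library component, and a real pipeline would use a proper tokenizer and a full stop-word list.

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer  # pip install nltk; no corpus download needed

# Hypothetical toy corpus and a deliberately tiny stop-word list.
DOCS = [
    "The cat sat on the mat.",
    "The dog sat on the log and the dog barked.",
]
STOP_WORDS = {"the", "and", "is", "on", "a"}

stemmer = PorterStemmer()

def preprocess(text):
    # 1. Tokenization: lowercase the text and split it into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Filtering: drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Stemming: reduce each remaining token to its root form.
    return [stemmer.stem(t) for t in tokens]

# 4. Counting: tally how often each processed word occurs per document.
counts = [Counter(preprocess(doc)) for doc in DOCS]

# 5. Vectorization: fix a vocabulary order, then emit one count vector per document.
vocab = sorted(set().union(*counts))
vectors = [[c[word] for word in vocab] for c in counts]

print(vocab)    # ['bark', 'cat', 'dog', 'log', 'mat', 'sat']
for vec in vectors:
    print(vec)  # [0, 1, 0, 0, 1, 1] then [1, 0, 2, 1, 0, 1]
```

Because the vocabulary is fixed from the corpus, every document maps to a vector of the same length; words outside the vocabulary are simply dropped.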
META

Status:: #wiki/notes/mature
Plantations:: Data Preprocessing - 20230221095709
References:: Mastering Machine Learning with scikit-learn