Summarize all available vectors in the cookbook
Could be regrouped under #5 if relevant; close if not.
Issue: The cookbook is very useful for understanding the different usages of vectors, but a summary of the vectors and their intended usage would also help.
Proposal: add a Markdown table of the different vector types to the cookbook (draft below).
Here is a summary of the available vectors:
| Vector Type | Description | Best Used For |
|-------------|-------------|---------------|
| **Text Embedding Vectors** |
| `LitellmTextEmbVector` | Dense text embeddings using LiteLLM models | General text similarity, semantic search |
| `OpenAITextEmbVector` | Dense text embeddings using OpenAI models | General text similarity, semantic search when using OpenAI |
| **Text Processing Vectors** |
| `NGramVector` | Text similarity based on character n-grams | Fuzzy text matching, typo-tolerant search |
| `BagOfWordsVector` | Word-level text similarity using MinHash | Document similarity, keyword matching |
| `BiGramVector` | Character bigram-based text similarity | Language detection, fuzzy matching |
| **Categorical Vectors** |
| `CategoricalVector` | One-hot encoding with pity factor | Exact category matching with tolerance |
| `VocabularyVector` | Weighted multi-category encoding | Tag systems, multi-label classification |
| `HashedVocabularyVector` | Hashed version of VocabularyVector | Large vocabulary spaces, memory-efficient tagging |
| **Numerical Vectors** |
| `IntegersSquareKernelVector` | Integer similarity with square kernel | Numerical range queries, year matching |
| **Geospatial Vectors** |
| `H3LocVector` | H3 geospatial hashing | Location-based search with Uber H3 |
| `GHLocVector` | Geohash-based location encoding | General purpose location search |
| **Hierarchical Vectors** |
| `HierarchicalVector` | Tree-structured data encoding | Category hierarchies, taxonomies |
| **Image Vectors** |
| `LitellmImageDescriptionVector` | Image embeddings via LLM descriptions | Image similarity via textual descriptions |
| `OpenAIImageDescriptionVector` | OpenAI-based image descriptions | High-quality image search via descriptions |
| `LitellmMultiModalEmbVector` | Multi-modal embeddings | Combined text and image search |
| **Special Purpose** |
| `ExistsVector` | Simple presence/absence encoding | Filtering, existence checks, flags |
| `SotaMinHashVector` | State-of-the-art MinHash implementation | Set similarity, efficient similarity search |
| `OneBitICWSMinHashVector` | One-bit ICWS MinHash | Memory-efficient set similarity |
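
To make the n-gram and MinHash rows more concrete for cookbook readers, here is a minimal, library-independent sketch of the underlying idea: compare two strings by their character n-gram sets, then approximate that set similarity with a MinHash signature. All function names below are hypothetical and are not part of the library's API; the actual `NGramVector`, `BiGramVector`, and `BagOfWordsVector` implementations may work differently.

```python
import hashlib


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of `item`, using the seed as a BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per seeded hash function.

    Assumes `items` is non-empty; an empty set raises ValueError.
    """
    return [
        min(_seeded_hash(item, seed) for item in items)
        for seed in range(num_hashes)
    ]


def minhash_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = char_ngrams("night", n=2)
    b = char_ngrams("nacht", n=2)
    print("exact Jaccard:", jaccard(a, b))
    print(
        "MinHash estimate:",
        minhash_similarity(minhash_signature(a), minhash_signature(b)),
    )
```

The exact Jaccard score needs both full sets, while the MinHash signature has a fixed length regardless of input size. That fixed size is presumably why MinHash-style vectors are listed for document and set similarity, though this is an inference from the table and not something confirmed by the library.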