Vector Database Tutorial using Pinecone: the new generation database for AI
Table of contents
No headings in the article.
The amount of unstructured data has significantly increased nowadays with the difficulty of analyzing, and the requirement of specific tools because most data tools cannot ingest it.
The structured data is stored with predefined relationships between tables and columns while in unstructured data like images, video, text and audio there can not be any predefined relationship because not every two images or texts would have the same relation between them.
This unstructured data can be converted to embedding vectors to make it useful for the analysis of the data. Machine learning and deep learning models can help us by transforming unstructured data into vector embeddings. The meaning of embedding is a mapping of a discrete, categorical variable to a vector of continuous numbers. Embeddings are low-dimensional, learned continuous vector representations of discrete variables that describe complex data as numerical values in different dimensions. The embeddings overcome the traditional way of the label and one-hot encoding of the categorical data.
In simple terms vector databases are tools that allow searching through unstructured data effectively using vector embeddings rather than some human-generated labels. Vector embeddings can represent unstructured data features like a component inside an image, select frames in a video, or select a window in audio. These embeddings make searching across data simple.
Some use cases where the use of vector databases can increase are:
Semantic search
Semantic Textual Similarity
Image Search
Sentence and Paraphrase mining
You can look at one of my projects on Image Search which involves generating vector embeddings: https://www.kaggle.com/code/warcoder/image-search-on-landscape-pictures/edit
Now we will look at one of the vector databases "Pinecone".
According to its official website, "Pinecone makes it easy to build high-performance vector search applications. It’s a managed, cloud-native vector database with a simple API and no infrastructure hassles."
Pinecone has three core concepts:
Vector Search: It means to find items most similar to the query using vector embeddings instead of a traditional pattern or keyword matching technique.
Vector Embedding: To use pinecone one must have data consisting of vector embeddings.
Vector Database: A vector database indexes and stores vector embeddings for the management of data and retrieving it as per usage.
Pinecone basic python code structure:
"""Installing Pinecone"""
pip install pinecone-client
"""Setting up pinecone"""
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
# To get api key visit https://app.pinecone.io/
"""Creating Index"""
pinecone.create_index("quickstart", dimension=8, metric="euclidean", pod_type="p1")
"""
An index is the highest-level organizational unit of vector data in Pinecone. More on indexes at https://docs.pinecone.io/docs/indexes
Dimension: The dimensions of the vectors
Pods are hardware units on which your pinecone services run
Metrics have three options:cosine,euclidean,dotptoduct
"""
"""Listing indexes"""
pinecone.list_indexes()
"""Accessing an index"""
index = pinecone.Index("Index-1")
"""Describing some index stats"""
index.describe_index_stats()
# if any data is stored then the output would be something like this -> {'dimension': 8, 'index_fullness': 0.0, 'namespaces': {'': {'vector_count': 5}}}
"""Inserting data using upsert"""
index.upsert([('Text',[-4.11227345e-03,5.18435985e-02,-1.14086755e-01,-1.81790709e-01,-1.98660389e-01,-2.43360624e-01,1.97100341e-01,3.32243741e-02])])
"""Deleting index"""
pinecone.delete_index("quickstart")
Some other vector databases one can look at:
Zilliz
Weaviate
Conclusion: Vector Database can change the way searching, ranking or retrieval systems work with their ability to index and store embedding vectors. So in future, we may see a rise in the use of these vector databases in many NLP and Vision-based applications.