StarRocks
StarRocks is a High-Performance Analytical Database.
StarRocksis a next-gen sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics and ad-hoc query.
Usually
StarRocksis categorized into OLAP, and it has showed excellent performance in ClickBench β a Benchmark For Analytical DBMS. Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.
Here weβll show how to use the StarRocks Vector Store.
Setupβ
#!pip install pymysql
Set update_vectordb = False at the beginning. If there is no docs
updated, then we donβt need to rebuild the embeddings of docs
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import StarRocks
from langchain.vectorstores.starrocks import StarRocksSettings
update_vectordb = False
/Users/dirlt/utils/py3env/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (5.1.0)/charset_normalizer (2.0.9) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
Load docs and split them into tokensβ
Load all markdown files under the docs directory
for starrocks documents, you can clone repo from
https://github.com/StarRocks/starrocks, and there is docs directory in
it.
loader = DirectoryLoader(
"./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()
Split docs into tokens, and set update_vectordb = True because there
are new docs/tokens.
# load text splitter and split docs into snippets of text
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
# tell vectordb to update text embeddings
update_vectordb = True
split_docs[-20]
Document(page_content='Compile StarRocks with Docker\n\nThis topic describes how to compile StarRocks using Docker.\n\nOverview\n\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\n\nStarRocks version and DEV ENV image\n\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\n\nFor Ubuntu 22.04:\n\n| Branch name | Image name |\n | --------------- | ----------------------------------- |\n | main | starrocks/dev-env-ubuntu:latest |\n | branch-3.0 | starrocks/dev-env-ubuntu:3.0-latest |\n | branch-2.5 | starrocks/dev-env-ubuntu:2.5-latest |\n\nFor CentOS 7.9:\n\n| Branch name | Image name |\n | --------------- | ------------------------------------ |\n | main | starrocks/dev-env-centos7:latest |\n | branch-3.0 | starrocks/dev-env-centos7:3.0-latest |\n | branch-2.5 | starrocks/dev-env-centos7:2.5-latest |\n\nPrerequisites\n\nBefore compiling StarRocks, make sure the following requirements are satisfied:\n\nHardware\n\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})
print("# docs = %d, # splits = %d" % (len(documents), len(split_docs)))
# docs = 657, # splits = 2802
Create vectordb instanceβ
Use StarRocks as vectordbβ
def gen_starrocks(update_vectordb, embeddings, settings):
if update_vectordb:
docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)
else:
docsearch = StarRocks(embeddings, settings)
return docsearch
Convert tokens into embeddings and put them into vectordbβ
Here we use StarRocks as vectordb, you can configure StarRocks instance
via StarRocksSettings.
Configuring StarRocks instance is pretty much like configuring mysql instance. You need to specify: 1. host/port 2. username(default: βrootβ) 3. password(default: ββ) 4. database(default: βdefaultβ) 5. table(default: βlangchainβ)
embeddings = OpenAIEmbeddings()
# configure starrocks settings(host/port/user/pw/db)
settings = StarRocksSettings()
settings.port = 41003
settings.host = "127.0.0.1"
settings.username = "root"
settings.password = ""
settings.database = "zya"
docsearch = gen_starrocks(update_vectordb, embeddings, settings)
print(docsearch)
update_vectordb = False
Inserting data...: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 2802/2802 [02:26<00:00, 19.11it/s]
zya.langchain @ 127.0.0.1:41003
username: root
Table Schema:
----------------------------------------------------------------------------
|name |type |key |
----------------------------------------------------------------------------
|id |varchar(65533) |true |
|document |varchar(65533) |false |
|embedding |array<float> |false |
|metadata |varchar(65533) |false |
----------------------------------------------------------------------------
Build QA and ask question to itβ
llm = OpenAI()
qa = RetrievalQA.from_chain_type(
llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
query = "is profile enabled by default? if not, how to enable profile?"
resp = qa.run(query)
print(resp)
No, profile is not enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`