Chatbots For Social Change/Practicalities of LLMs


Statement Embedding Models

To assess the similarity in meaning between two statements, we need a model that maps each statement to an embedding vector. Several state-of-the-art algorithms and tools, many of them open source, are available:

OpenAI's Embedding Models[1] OpenAI offers embedding models that are particularly tuned for functionalities such as text similarity and text search. These models receive text as input and return an embedding vector that can be utilized for a variety of applications, including assessing the similarity between statements.
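To show the shape of the workflow, here is a minimal sketch using OpenAI's v1 Python client with the text-embedding-ada-002 model (any of their embedding models works the same way); it assumes an OPENAI_API_KEY environment variable, and the example statements are of course illustrative.

```python
# Minimal sketch: compare two statements with OpenAI embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return np.array(response.data[0].embedding)

a = embed("The city should invest more in public transit.")
b = embed("We ought to spend more money on buses and trains.")

# Cosine similarity: close to 1.0 for near-paraphrases, near 0 for unrelated text.
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```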

Spark NLP[2] This library provides a suite of transformer-based models, including BERT and the Universal Sentence Encoder, which can produce rich semantic embeddings. It is fully open source under the Apache 2.0 license.

Spark NLP has the following requirements:

  • Java 8 or 11
  • Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, or 3.0.x

GPU (optional): Spark NLP 5.1.4 is built with the ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required only for GPU support:

  • NVIDIA® GPU drivers version 450.80.02 or higher
  • CUDA® Toolkit 11.2
  • cuDNN SDK 8.1.0

The Massive Text Embedding Benchmark (MTEB)[3] should help us determine which embedding algorithm to use.


Vector Similarity Search

LLMRails

The MTEB led me to the ember-v1 model by LLMRails, because of its success on the SprintDuplicateQuestions dataset. The goal is to embed statements such that statements or questions a community deems duplicates end up closest together; the dataset compiles marked duplicates from Stack Exchange, the Sprint technical forum, and Quora.
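As a quick sketch of how this model can be used for duplicate scoring, assuming the checkpoint is published on Hugging Face as llmrails/ember-v1 and is compatible with the sentence-transformers library:

```python
# Minimal sketch: scoring candidate duplicates with ember-v1.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("llmrails/ember-v1")  # assumed Hugging Face model id

statements = [
    "How do I reset my voicemail password?",
    "What are the steps to change my voicemail PIN?",  # near-duplicate of the first
    "What is the battery life of this phone?",         # unrelated
]

# Normalizing makes the dot product equal to cosine similarity.
embeddings = model.encode(statements, normalize_embeddings=True)
scores = util.cos_sim(embeddings, embeddings)

# The (0, 1) entry should be the largest off-diagonal score.
print(scores)
```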

LLMRails[4] is a platform offering embedding models that help applications understand the significance of text at scale, with features like semantic search, categorization, and reranking.

From the marketing copy: "Elevate your data game with our cutting-edge ChatGPT-style chatbot! All you need to do is link your data sources and watch as our chatbot transforms your data into actionable insights."

The pitch is that LLMRails offers developers access to advanced neural search technology, providing more precise and pertinent results than conventional search.

Also from the website: "with private invitation, join the LLMRails and start your AI advanture!" (sic). How did they get this wrong?

Pricing:

  • Embed: $0.00005 per 1k tokens
  • Rerank: $0.001 per search
  • Search: $0.0005 per search
  • Extract: $0.30 per document
Note: This service does not offer the capabilities I need; it is a bit too managed. I just need vector embeddings and retrieval.

Other Vector Databases as a Service

  • Amazon OpenSearch Service is a fully managed service that simplifies deploying, scaling, and operating OpenSearch in the AWS Cloud. It supports vector search capabilities and efficient vector query filters, which can improve the responsiveness of applications such as semantic or visual search experiences.
  • Azure Cognitive Search: This service allows the addition of a vector field to an index and supports vector search. Azure provides tutorials and APIs to convert input into a vector and perform the search, as well as Azure OpenAI embeddings for tasks like document search.
  • Zilliz Cloud is a managed service built on the open-source vector database Milvus, designed to handle tens of billions of vectors.
    • Zilliz offers a 30-day free trial worth $400 in credits (4 CUs, i.e. compute units).
    • Pricing: $0.001 per unit of Zilliz Cloud Usage (each unit is 0.1 cents of usage).
  • A more comprehensive list, Awesome Vector Search, on GitHub.[5]
    • For cloud services, they list Zilliz first, then Relevance AI, Pinecone, and MyScale.
  • Graft also came up in my search.
    • It is extremely expensive: $500/month for 10,000 data points, and unlimited data points at $5k/month.
    • Perhaps it is more managed than Zilliz, or perhaps that is simply what the infrastructure costs either way.
    • The high price could also be an indication of the value of this sort of technology (they also handle the embedding and document upload for you).

Open Source Models

  • Milvus is a "vector database built for scalable similarity search. Open-source, highly scalable, and blazing fast." Seems perfect; a usage sketch follows this list. They have a managed version, but I'm not sure it's necessary now.[6]
  • Elastic NLP: Text Embeddings and Vector Search: Provides guidance on deploying text embedding models and explains how vector embeddings work, converting data into numerical representations[7].
  • TensorFlow Recommenders' ScaNN[8] TensorFlow provides an efficient library for vector similarity search named ScaNN. It allows for the rapid searching of embeddings at inference time and is designed to achieve the best speed-accuracy tradeoff with state-of-the-art vector compression techniques.
  • Other notable vector databases and search engines include Chroma, LanceDB, Marqo, Qdrant, Vespa, Vald, and Weaviate, as well as databases and search engines like Cassandra, Coveo, Elasticsearch, and OpenSearch that support vector search capabilities.
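As promised above, here is a minimal sketch of Milvus in action. It assumes pymilvus 2.4+, whose MilvusClient can run an embedded "Milvus Lite" instance backed by a local file; the collection name, dimension, and random vectors are placeholders.

```python
# Minimal sketch: store statement embeddings in Milvus and query them.
import numpy as np
from pymilvus import MilvusClient

client = MilvusClient("statements.db")  # local, file-backed Milvus Lite

DIM = 768  # placeholder; must match the embedding model's output dimension
client.create_collection(collection_name="statements", dimension=DIM)

# In practice these vectors would come from an embedding model (see above);
# random vectors keep the sketch self-contained.
rng = np.random.default_rng(0)
rows = [
    {"id": i, "vector": rng.standard_normal(DIM).tolist(), "text": f"statement {i}"}
    for i in range(100)
]
client.insert(collection_name="statements", data=rows)

# Retrieve the five nearest neighbors of a query vector.
query = rng.standard_normal(DIM).tolist()
hits = client.search(
    collection_name="statements",
    data=[query],
    limit=5,
    output_fields=["text"],
)
print(hits[0])
```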

Milvus Benchmark

Milvus has conducted benchmarks, which should give us an idea of overall cost, and how much we can scale before buckling.

  • CPU: An Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz. This is a high-end server-grade processor suitable for demanding tasks. It belongs to Intel's Xeon scalable processors, which are commonly used in enterprise-level servers for their reliability and performance.
  • Memory: 16 × 32 GB RDIMM, 3200 MT/s. This means the server has 16 memory slots, each holding a 32 GB RDIMM (Registered DIMM) module, for 512 GB of RAM in total. The memory operates at 3200 MT/s (megatransfers per second).
  • SSD: SATA 6 Gbps. This indicates that the server uses a Solid State Drive connected through a SATA interface, with a transfer rate of 6 Gigabits per second. SSDs are much faster than traditional hard drives and are preferred for their speed and reliability.

To find an approximate AWS EC2 equivalent, we would need to match these specs as closely as possible. Given the CPU and memory specifications, you might look into the EC2 instances that offer Intel Xeon Scalable Processors (2nd Gen or 3rd Gen) and the ability to configure large amounts of memory.

A possible match could be an instance from the m5 or r5 families, which are designed for general-purpose (m5) or memory-optimized (r5) workloads. For example, the r5.12xlarge instance provides 48 vCPUs and 384 GiB of memory, which, while not an exact match to these specs (it has less memory), is within the same performance ballpark.

However, keep in mind that AWS offers a wide range of EC2 instances, and the actual choice would depend on the specific balance of CPU, memory, and I/O performance that you need for your application. Also, pricing can vary significantly based on region, reserved vs. on-demand usage, and additional options like using Elastic Block Store (EBS) optimized instances or adding extra SSD storage.

According to the AWS pricing calculator, such an instance comes to roughly $3 per hour.

  • Search: 7k to 10k QPS at 128 dimensions for a cluster with one replica; 4k to 7.5k QPS standalone.
  • Scalability
    • Going from 8 to 16 CPU cores, QPS roughly doubles; beyond that, gains are sublinear.
    • Going from 1 to 8 replicas raises QPS from 7k to 31k and more than doubles the number of supported concurrent queries (to 1,200).

There are 3,600 seconds in an hour, so the price per query works out to $3 / (7,000 × 3,600) ≈ $0.000000119.
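The same arithmetic, as a small helper. Note that the 8-replica figure assumes the cluster costs roughly eight times the single-server price, which is an assumption on my part, not a benchmark result.

```python
# Back-of-the-envelope cost per query for a self-hosted vector search server.
def cost_per_query(hourly_usd: float, qps: float) -> float:
    """Dollar cost of one query, assuming the server runs at full load."""
    return hourly_usd / (qps * 3600)

print(cost_per_query(3.0, 7_000))       # ~1.19e-07, the figure above
print(cost_per_query(3.0 * 8, 31_000))  # ~2.15e-07 with 8 replicas (assumed 8x cost)
```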



Large Language Models

A useful article comparing open-source LLM models was published on Medium.

LLMs In Hosted Environments

Model | Cost per 1M input tokens | Cost per 1M output tokens | Notes
AI21 Labs Jurassic-2 Ultra | $150 | $150 | Highest quality
AI21 Labs Jurassic-2 Mid | $10 | $10 | Optimal balance of quality, speed, and cost
AI21 Labs Jurassic-2 Light | $3 | $3 | Fastest and most cost-effective
AI21 Labs Jurassic-2 Chat | $15 | $15 | Complex, multi-turn interactions
Anthropic Claude Instant | $1.63 | $5.51 | Low latency, high throughput
Anthropic Claude 2.0, 2.1 | $8 | $24 | Best for tasks requiring complex reasoning
Cohere Command | $1.00 | $2.00 | Standard offering
Cohere Command Light | $0.30 | $0.60 | Lighter version
Google Bard | Free | Free | Likely limited; requires a Google account
GPT-4 Turbo (gpt-4-1106-preview) | $10 | $30 |
GPT-4 Turbo (gpt-4-1106-vision-preview) | $10 | $30 |
GPT-4 | $30 | $60 |
GPT-4-32k | $60 | $120 |
GPT-3.5 Turbo (gpt-3.5-turbo-1106) | $1.00 | $2.00 |
GPT-3.5 Turbo (gpt-3.5-turbo-instruct) | $1.50 | $2.00 |

The AI21 Labs models are free for the first $1,000 in usage.
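To make these prices concrete, here is a small helper that estimates monthly spend for a hypothetical workload; the per-token prices are copied from the table above, and the model keys are just shorthand labels.

```python
# Estimate monthly hosted-LLM spend from per-token prices (USD per 1M tokens).
PRICES_PER_1M = {  # (input, output), taken from the table above
    "gpt-3.5-turbo-1106": (1.00, 2.00),
    "claude-instant": (1.63, 5.51),
    "gpt-4": (30.00, 60.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 10,000 conversations per month,
# averaging 1,500 input tokens and 500 output tokens each.
for name in PRICES_PER_1M:
    print(name, round(monthly_cost(name, 10_000 * 1_500, 10_000 * 500), 2))
```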

LLMs On Your Own Hardware

From the model card: "Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations."

It turns out that you have to ask Meta nicely to get access to the parameter sets, agreeing to their terms of use.

  • It's clear from a little research that running and training locally (I have a 2021 Mac M1) is going to cause a lot of headaches.
  • AWS SageMaker seems to be a great option for getting up and running with open-source models; a deployment sketch follows this list.
    • It has access to dozens of models of varying sizes through its JumpStart feature.
    • In practice, you say "go" and are dropped right away into a JupyterLab instance.
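The deployment sketch promised above, in the style of the Hugging Face TGI (text generation inference) container on SageMaker described in Phil Schmid's posts (below). The container version, model id, and token are assumptions you would adapt; the Llama 2 weights are gated, so the Hugging Face token must belong to an approved account.

```python
# Sketch: deploying Llama 2 7B to a SageMaker endpoint with the TGI container.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # works inside SageMaker notebooks
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",  # gated; access required
        "SM_NUM_GPUS": "1",            # the g5.2xlarge has a single A10G GPU
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "HUGGING_FACE_HUB_TOKEN": "<your-token>",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # $1.52/hour per the table below
)

print(predictor.predict({"inputs": "What is a deliberative poll?"}))
```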

Hardware requirements of Llama (Nov 2023)

Model | Instance Type | Quantization | # of GPUs per replica | Cost per hour
Llama 7B | ml.g5.2xlarge | - | 1 | $1.52
Llama 13B | ml.g5.12xlarge | - | 4 | $7.09
Llama 70B | ml.g5.48xlarge | bitsandbytes | 8 | $20.36
Llama 70B | ml.p4d.24xlarge | - | 8 | $37.69

Benchmarking AWS SageMaker and Llama

Fortunately, Phil Schmid has conducted thorough benchmarks of different deployments of Llama on SageMaker in AWS. His blog posts in 2023 in particular are an incredible reference for getting started with these LLMs.

To give the most economical example: the g5.2xlarge ($1.52/hour) can handle 5 concurrent requests while delivering 120 tokens of output per second. Incredible! That works out to $3.50 per 1M tokens. ChatGPT, for comparison, offers gpt-3.5-turbo (the cheapest option) at $0.0020 per 1K tokens, or $2.00 per 1M tokens. Comparable, and it is not surprising that OpenAI is cheaper.
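The conversion from an hourly instance price and measured throughput to a per-token price is simple enough to sanity-check:

```python
# Convert an hourly instance price and aggregate throughput into $/1M tokens.
def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    return hourly_usd / (tokens_per_second * 3600) * 1_000_000

print(usd_per_million_tokens(1.52, 120))  # ~3.52, the g5.2xlarge figure above
```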

Let's compare the most expensive deployment to the most sophisticated OpenAI model, GPT-4. Llama 70B runs on a $37.69/hour server (ml.p4d.24xlarge) serving 20 concurrent requests at 321 tokens/second, which comes to $10.43 per 1M tokens. For comparison, GPT-4 costs $0.06 per 1K tokens, or $60 per 1M.

It should be noted as well that Phil Schmid was able to get decent performance (15 seconds per thousand tokens generated) from a budget deployment on AWS's new Inferentia2 hardware (inf2.xlarge), which costs just $0.75 per hour. That is about $550 per month, so better not to leave it on, but still: very cool!

He also fine-tunes a 7B-parameter Mistral model on an ml.g5.4xlarge instance ($2.03/hour). Fine-tuning on 15,001 examples, each processed three times (3 epochs), took 3.9 hours, for an overall cost of under $8.

Integrations

To achieve the widest reach, we want to integrate our chatbots with low-effort communication media, such as text messages, phone calls, WhatsApp, Facebook Messenger, WeChat, or perhaps decentralized messaging platforms like those built on nostr. Each option has somewhat different benefits, limitations, and monetary cost. This section gives an overview of the available connections, along with the pricing and basic principles to get you started.
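Whatever the channel, most of these integrations deliver inbound messages to your server as webhooks. The sketch below shows the general shape using Flask; the route and payload field names are illustrative only, since each provider (Twilio, Meta's WhatsApp Cloud API, and so on) defines its own payload format and verification handshake.

```python
# Generic inbound-message webhook; field names are illustrative only.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_reply(text: str) -> str:
    # Placeholder for a call to the LLM backend discussed above.
    return f"You said: {text}"

@app.route("/webhook/message", methods=["POST"])
def incoming_message():
    payload = request.get_json(force=True)
    sender = payload.get("from")     # assumed field name
    text = payload.get("text", "")   # assumed field name
    return jsonify({"to": sender, "text": generate_reply(text)})

if __name__ == "__main__":
    app.run(port=8080)
```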

Facebook (now under the parent company Meta) has plans to integrate its messaging services across WhatsApp, Instagram, and Facebook Messenger. Mark Zuckerberg is leading an initiative to merge the underlying technical infrastructure of these apps, while keeping them as separate apps[9]. This would allow cross-platform messaging between the services, with all the messaging apps adopting end-to-end encryption[10]. The integration raises concerns around antitrust issues, privacy, and further consolidation of Facebook's power over its various platforms[11].

Third-party platforms like Tidio, Aivo's AgentBot, Respond.io, BotsCrew, Gupshup, Landbot, and Sinch Engage allow businesses to create chatbots that can integrate with WhatsApp, Facebook Messenger, Instagram and other channels.

Here is a table summarizing the messaging integrations supported by various third-party platforms, along with their approximate pricing and relevant notes:

Platform | Messaging Integrations | Approximate Pricing | Notes
Landbot | WhatsApp, Facebook Messenger | Starter: €49/month; Pro: €99/month; Business: custom | Offers AI chatbot builder, opt-in tools, workflows, surveys, etc. Needs at least a Pro account to integrate with webhooks.
BotSpace | WhatsApp | Starter: ₹3,499/month; Pro: ₹7,499/month; Premium: ₹23,499/month | Supports team inboxes, roles & permissions, custom workflows.
Callbell | WhatsApp | €50/month per 10 agents, plus €20/month per WhatsApp number | Offers an advanced bot-builder module for €59/month.
DelightChat | WhatsApp (others not specified) | Pricing not provided | Offers plans for businesses at different stages.
Brevo | WhatsApp | Pay-as-you-go, no recurring fees | Only pay for WhatsApp messages sent.
AiSensy | WhatsApp | Basic: ₹899/month ($10.77); Pro: ₹2,399/month ($28.73) | Limits on free service conversations per month.
Flowable Engage | WhatsApp, Facebook Messenger, WeChat, LINE | Pricing not provided | Supports voice/video calls, templates, rich media on some platforms. Account requirements vary.

All the listed platforms support WhatsApp integration, as it is a popular messaging channel for businesses. Some platforms like Landbot and Flowable Engage also support Facebook Messenger integration. Platforms like Flowable Engage offer integration with additional messaging apps like WeChat and LINE. Pricing models vary, with some offering subscription plans (monthly/annual) and others following a pay-per-message or per-agent model. Certain platforms bundle additional features like AI chatbots, custom workflows, surveys, etc. along with messaging integration.

Meta (Facebook) is also working on enabling interoperability between its own messaging apps (WhatsApp, Messenger, Instagram) and with approved third-party messaging services, as mandated by the EU's Digital Markets Act[12][13]. However, the extent of this interoperability, and its impact on existing third-party integrations, is currently unclear.

  1. "Introducing text and code embeddings". OpenAI. Retrieved 2023-11-07.
  2. "GPU vs CPU benchmark". Spark NLP. Retrieved 2023-11-07.
  3. MTEB
  4. llmrails
  5. Awesome Vector Search
  6. Milvus homepage
  7. Elastic
  8. "Efficient serving". TensorFlow Recommenders. Retrieved 2023-11-07.
  9. The New York Times. https://www.nytimes.com/2019/01/25/technology/facebook-instagram-whatsapp-messenger.html
  10. The Verge. https://www.theverge.com/2019/1/25/18197628/facebook-messenger-whatsapp-instagram-integration-encryption
  11. Wired. https://www.wired.com/story/facebook-plans-unite-messaging-apps/
  12. The Verge. https://www.theverge.com/2023/3/24/23655688/eu-digital-markets-act-messaging-interoperability-meta-whatsapp-imessage
  13. Reuters. https://www.reuters.com/technology/eu-rules-force-meta-open-up-messaging-apps-2023-03-24/