A.H.

How I Replaced My OpenAI Embedding Spend with Open Source Models and Modal

Update 10/29/2023: I wrote this post before I had the chance to set up and try out Hugging Face’s text-embeddings-inference, which turned out to be way faster than my approach below. I’ve left my original post below, unchanged, but figured I’d share the latest.

Long story short: on a T4 GPU, you can get to ~100K tokens/sec if you run the code on the GPU machine itself (avoiding the network round trip), or ~70K tokens/sec if you're sending the batches over the wire. That is about 6x (or 4.5x, respectively) faster than OpenAI!
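
For a sense of what the "over the wire" path looks like, here is a minimal client sketch, assuming a text-embeddings-inference server reachable at a placeholder URL via its /embed endpoint; the batch size is also just an illustrative value.

from typing import List

import requests

TEI_URL = "http://localhost:8080/embed"  # placeholder; point this at your TEI server
BATCH_SIZE = 256  # illustrative batch size


def embed_over_the_wire(texts: List[str]) -> List[List[float]]:
    """Send texts to a text-embeddings-inference server in batches."""
    embeddings: List[List[float]] = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i : i + BATCH_SIZE]
        resp = requests.post(TEI_URL, json={"inputs": batch})
        resp.raise_for_status()
        embeddings.extend(resp.json())  # one embedding vector per input text
    return embeddings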



I’ve been messing around with LLMs and embeddings for a few personal projects, some of which now involve processing reasonably large amounts of text data (tens of gigabytes).

The Motivation

One use case that I have involves generating summaries of a public company’s earnings transcripts. Usually, companies hold earnings calls on a quarterly basis, with calls that last roughly an hour; so, for a company that has been public for 10 years, that’s roughly 40 hours of text. Using OpenAI’s gpt-3.5-turbo and text-embedding-ada-002, wordcel (a Python package I wrote to thread API calls), and llama_index for in-memory vector search, it costs (ballpark) around 40-50 cents per company and takes around 20 minutes to generate the final document with summaries for each earnings transcript.

While that doesn’t sound like a lot, the costs could grow pretty quickly if I wanted to scale up. So far, I haven’t, because I don’t want to be spending hundreds or thousands of dollars a month on inference costs.

On top of that, I can't run inference any faster than OpenAI's rate limits allow: 90,000 tokens per minute for gpt-3.5-turbo and 1 million tokens per minute for text-embedding-ada-002.

This had me wondering whether the developments in open source language models, and the wide range of tooling out there, were enough for me to come up with an alternative to OpenAI. The grail (I'll stop short of calling it "holy" per se) here would be to get something that was faster, cheaper, and as easy to use as OpenAI.

Criteria

To this end, the axes that matter to me are quality, speed, cost, and ease of setup and maintenance.

The Solution

I’ll cut to the chase and just tell you what I ended up converging on.

The setup of my problem is that I have a Postgres table with columns date and text (among others), and another table of embeddings of the text in the first table, with columns date, chunked_text, and embedding.
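
To make that concrete, here is a rough sketch of the two tables as SQLAlchemy definitions. The table names, the column types, and the float-array embedding column are my assumptions; the post only names the columns (plus the id column used in the join later on).

from sqlalchemy import ARRAY, Column, Date, Float, Integer, MetaData, Table, Text

metadata = MetaData()

# Source table of raw documents. Names and types are guesses; the post only
# mentions the `date` and `text` columns (among others).
src_table = Table(
    "transcripts", metadata,
    Column("id", Integer, primary_key=True),
    Column("date", Date),
    Column("text", Text),
)

# Destination table of embeddings, one row per chunk. Assumes the embedding is
# stored as a float[] column rather than, say, a pgvector column.
dst_table = Table(
    "transcript_embeddings", metadata,
    Column("id", Integer),  # id of the source row
    Column("date", Date),
    Column("chunked_text", Text),
    Column("embedding", ARRAY(Float)),
)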

For my model, I ended up using bge-small-en-v1.5, the smallest model in the top 10 of the Hugging Face MTEB leaderboard. I had (and still have?) some doubts that such a small model would be as good as OpenAI's ada, but as of this writing bge-small is in eighth place on the leaderboard and ada is in 15th. I'll leave it up to the experts to decide which is actually better.

For cloud infrastructure, Modal ended up being a delightful experience to use. With two to three additional decorators on my Python functions and relatively few and minor changes to the structure of my code, Modal enabled me to deploy and scale my functions across CPU/GPU containers.

I do not plan to publish the code, as it is not really meant for general use and would require some custom setup to get working, but if you would like to see it anyway, feel free to message me.

Anyway, the important parts of the code are below.

My embedding function, and the Modal wrapper function. Note that I specified the cheapest GPU.

from typing import List

import modal
import numpy as np
import pandas as pd
from modal import Image
from sentence_transformers import SentenceTransformer

# The post uses bge-small-en-v1.5; the stub name below is just a placeholder.
EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"

stub = modal.Stub("embeddings")

EMBEDDING_IMAGE = Image.debian_slim().pip_install(
    "pandas", "numpy", "python-dotenv",
    "langchain", "sentence-transformers",
    "psycopg2-binary", "sqlalchemy", "tiktoken"
)


def embed(texts: List[str], model=None) -> np.ndarray:
    """Embed an array of texts; returns one embedding vector per text."""
    assert model is not None, "Must specify a model."
    model = SentenceTransformer(model)
    embeddings = model.encode(texts)
    return embeddings


@stub.function(
    image=EMBEDDING_IMAGE,
    gpu="T4",
)
def embed_docs(docs: pd.DataFrame):
    """Embed a prepared dataframe of chunks, and add each embedding as another
    column."""
    print("Embedding...")
    embeddings = embed(docs["chunked_text"].tolist(), model=EMBEDDING_MODEL)
    embeddings = pd.DataFrame(embeddings)
    docs["embedding"] = embeddings.values.tolist()
    return docs

My chunking function, and the Modal wrapper function around it.

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Token overlap between consecutive chunks (example value; the original
# constant isn't shown here).
DEFAULT_CHUNK_OVERLAP_LEN = 50


def chunk_text(text, model=None, chunk_overlap=DEFAULT_CHUNK_OVERLAP_LEN) -> List[str]:
    """Split text into chunks sized to the embedding model's context window."""
    assert model is not None, "Must specify a model."
    text_splitter = SentenceTransformersTokenTextSplitter(
        model_name=model, chunk_overlap=chunk_overlap
    )
    texts = text_splitter.split_text(text)
    return texts


@stub.function(image=EMBEDDING_IMAGE)
def chunk_docs(docs: List[str]):
    """Chunk a list of documents."""
    chunked_docs = [chunk_text(doc, model=EMBEDDING_MODEL) for doc in docs]
    return chunked_docs
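
As a quick sanity check, you can call chunk_text directly on a local string before sending anything to Modal (the sample text below is just an illustration):

# Illustrative local check of the chunker; the sample text is made up.
sample_transcript = (
    "Operator: Good afternoon, and welcome to the third quarter earnings call. "
) * 500
chunks = chunk_text(sample_transcript, model=EMBEDDING_MODEL)
print(f"{len(chunks)} chunks; first chunk starts with: {chunks[0][:60]!r}")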

And then the main subroutine which functioned as the entrypoint to Modal.

@stub.local_entrypoint()
def main(src_table, dst_table, column, start_date, end_date, local=False):
    """Embed a column in the table."""
    # Some assertions and validations around the arguments.
    ...
    
    # Getting the diff between the `src_table` of text and the `dst_table`
    # of embeddings.
    ...
    
    unembedded_rows = pd.concat(unembedded_rows)

    # Chunking step.
    print("Chunking...")
    chunk_batches = [
        batch[1][column].tolist()
        for batch in unembedded_rows.groupby(unembedded_rows.index // CHUNK_BATCH_SIZE)
    ]

    chunked_text = []
    for result in chunk_docs.map(chunk_batches):
        chunked_text.extend(result)
    
    flattened: pd.DataFrame = flatten_chunked_text(chunked_text)
    # Create equal length subsections of the dataframe.
    batches = [batch[1] for batch in flattened.groupby(flattened.index // GPU_LIMIT)]

    # Embedding step.
    print(f"Embedding {len(batches)} batches...")
    print(set([len(batch) for batch in batches]))
    embedded = []
    for embeddings in embed_docs.map(batches):
        # Join the embeddings df with the original dataframe by the index of the 
        # original df and the `doc_idx` column of the embeddings df, in order to
        # get the `id` column of the original df.
        embeddings = embeddings.join(unembedded_rows, on="doc_idx")
        embeddings = embeddings[["id", "date", "chunked_text", "embedding"]]
        embedded.append(embeddings)

    print(f"Embedded {len(batches)} batches.")
    for embeddings in embedded:
        _upload_df_to_table(embeddings, dst_table)
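
One helper referenced above but not shown is flatten_chunked_text. Here is a minimal sketch of what it might look like, assuming unembedded_rows has a plain 0..n-1 integer index so each chunk can be tied back to its source row via the doc_idx column that the join in the embedding step relies on:

def flatten_chunked_text(chunked_text: List[List[str]]) -> pd.DataFrame:
    """Flatten per-document chunk lists into one row per chunk, recording the
    positional index of the source document in a `doc_idx` column.

    A sketch only; the original helper isn't shown in the post.
    """
    rows = [
        {"doc_idx": doc_idx, "chunked_text": chunk}
        for doc_idx, chunks in enumerate(chunked_text)
        for chunk in chunks
    ]
    return pd.DataFrame(rows)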

The Results

I tested out the above code on around 2 million tokens of data.

2 million tokens would cost around 20 cents with OpenAI ($0.0001 per 1,000 tokens) and take around 2 minutes to embed given the rate limits. With the above system and bge-small, it took around half that time and cost roughly 25 cents with Modal (though around 5 cents of that was spent on CPUs for chunking). So it was roughly comparable on cost and about twice as fast.
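
For reference, the back-of-the-envelope math on the OpenAI side of that comparison:

tokens = 2_000_000
ada_price_per_1k = 0.0001      # USD per 1K tokens, at the time of writing
ada_rate_limit = 1_000_000     # tokens per minute

print(tokens / 1_000 * ada_price_per_1k)   # 0.20 dollars
print(tokens / ada_rate_limit)             # ~2 minutes, ignoring request overhead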

As for ease of maintenance and developer experience: Modal makes it really easy to track how your jobs are doing. For example, below is a screenshot of the two functions above, split across 19 containers (9 CPU containers for chunking, followed by 10 GPU containers for embedding).

[Screenshot: Modal dashboard showing the chunking and embedding functions running across 19 containers.]

Bloopers

Other things I tried included:

Conclusion

Embeddings aren’t that expensive. At $0.0001 per 1K tokens from OpenAI, it was arguably not worth the effort to set up a separate inference pipeline. But I do think this represents a fairly important moment for open source AI, where an average engineer can self-serve their way to cost and speed parity with OpenAI relatively easily.