When trying to find my way around in the buzzing lands of OpenAI and vector
databases, the dots were not always easy to connect. In this guide I’m sharing
what I’ve learned during my journey to make yours even better. You might find a
trick or a treat!

Most of OpenAI tooling and examples is based on Python, but this guide uses
JavaScript exclusively.

We’ll begin with a brief explanation of some core concepts, before diving into
more and more code. Towards the finish we’ll discuss some strategies for token
management and maintaining a conversation.


Here are the topics we will be discussing:

OpenAI endpoints

In this guide, we will work with two OpenAI REST endpoints.

Chat Completions

POST https://api.openai.com/v1/chat/completions

The Create chat completion endpoint generates a human-like text completion
for a provided prompt. We’ll use it to start and keep the conversation going
between the end-user and OpenAI’s Large Language Models (LLMs) such as GPT-3.5
and GPT-4.

Create Embeddings

POST https://api.openai.com/v1/embeddings

With the embeddings endpoint, we can create embeddings from plain text. We
will use these embeddings to store and query a vector database. Embeddings?
Vector database? No worries, we have you covered.

The openai package

We’re going to use these endpoints directly, and not OpenAI’s npm package.
This package targets Node.js, but eventually you might want to deploy your own
endpoint on an environment without Node.js, such as a serverless or edge
platform like Cloudlare Workers, Netlify Edge or Deno. Now that fetch is
ubiquitous I think the REST APIs are just as easy to use without any
dependencies. I like being “closer to the metal” and stay flexible.

Key concepts

We’ve already introduced a few concepts that may be new to you. Let’s discuss
embeddings, vector databases and prompts briefly before diving
into any code.

If you’re familiar with them, feel free to skip straight to ingestion.


Vector embeddings are numerical representations of textual data in a
high-dimensional space. They are generated using large language models (LLMs).
Embeddings allow for efficient storage and search of content that is
semantically related to a user’s query. Semantically similar text is mapped
close together in the vector space, and we can find relevant content using a
vector embedding created from user input.

For comparison, a lexical or “full text” search looks for literal matches of the
query words and phrases, without understanding the overall meaning of the query.

Vector databases

Why do we need a vector database? Can’t we just query OpenAI and get a response?

Yes, we can use the ChatGPT UI or even the OpenAI chat completions
endpoint directly. However, the response will be limited to what the OpenAI
models are trained on. The response may not be up-to-date, accurate, or specific
enough for your needs.

What if you want to have OpenAI generate responses based solely on your own
domain-specific content? For users to “chat with your content”. Sounds
interesting! But how to go about this?

Unlike ChatGPT, the OpenAI APIs are not storing any of your content and they do
not store state or a session of the conversation(s). This is where vector
databases come in. Adding a vector database in the mix has interesting

  • Store and maintain domain-specific knowledge.
  • Support semantic search across your content.
  • Control your own data and keep it up-to-date and relevant.
  • Reduce the number of calls to OpenAI.
  • Store the user’s conversational history.

Setting up a vector database might be easier than you think. I’ve been trying
out managed solutions like Pinecone and Supabase without any issues.
There are more options though, and I don’t feel like I’m in a position to
recommend one over another. I do like that I can use Pinecone without
dependencies using only fetch and their REST API.


A prompt is the textual input we send to the chat completions endpoint to have
it generate a relevant “completion”. You could say a prompt is a question, and a
completion is an answer.

Prompts are plain text and we can provide extra details and information to
improve the results. The more context we provide, the better the response will

Requests to the chat completions endpoint are essentially stateless: not your
content, no session, no state. The challenge is to optimize and include the
right information with each request. We’ll be discussing prompts throughout this
guide, and ways to optimize them.


Armed with this knowledge, let’s begin building a chat application with a vector

We’ll need to get content into this database. Content is stored as vector
embeddings, and we can create those from textual content by using the
embeddings endpoint.


Before creating the database table or index, it’s important to consider what we
will do with the results of semantic search queries.

Vector embeddings are a compressed representation of semantics for efficient
storage and querying. It’s not possible to translate them back to the original
text. This is the reason we need to store the original text along with the
embeddings in the database.

The text can be stored as metadata and can include more useful things to display
in the application, such as document or section titles and URL’s to link back to
the original source.


There are tools that can help with this. I have seen a few solutions like
Markprompt and Databerry that offer easy content ingestion, but
you’re not free to choose where the content will be stored. Do you know of any


As I prefer to start out with command-line tools and learn more about the OpenAI
APIs, embeddings and vector database, I decided to develop a tool myself.

This work ended up as 7-docs and comes with the 7d command-line tool to
ingest content from plain text, Markdown and PDF files into a vector database.
It ingests content from local files, GitHub repositories and also HTML from
public websites. Currently it supports “upserting” vectors into Pinecone indexes
and Supabase tables.

To get an idea what ingestion using 7d looks like, here are some examples that
demonstrate how to ingest Markdown files:

7d ingest --files '*.md' --db pinecone --namespace my-docs
7d ingest --source github --repo reactjs/react.dev 
  --files 'src/content/reference/react/*.md' 
  --db supabase 
  --namespace react


When the embeddings and metadata are in the database, we can query it. We’ll
look at some example code to implement this 4-step strategy:

  1. Create a vector embedding from the user’s textual input.
  2. Query the database with this vector for related chunks of content.
  3. Build the prompt from the search results and the user’s input.
  4. Ask the model to generate a chat completion based on this prompt.

The next examples show working code, but contain no error handling or
optimizations. Just plain JavaScript without dependencies.

(Don’t want to implement this yourself, or just want to see examples? Visit
7-docs for available demos and starterkits to hit the ground running.)

1. Create a vector embedding

The first function we’ll need creates a vector embedding based on the user’s

export const createEmbeddings = async ({ token, model, input }) => {
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    method: 'POST',
    body: JSON.stringify({ input, model }),

  const { error, data, usage } = await response.json();

  return data;

This function can be called like this:

const vector = await createEmbeddings({
  token: '[OPENAI_API_TOKEN]',
  model: 'text-embedding-ada-002',
  input: 'What is an embedding?',

2. Query the database

In the second step we are going to query the database with the vector
embedding we just created. Below is an example that queries a Pinecone index for
vectors with related content using fetch. The rows returned from this query
are mapped to the metadata that’s stored with the vector in the same row. We
need this metadata in the next step.

export const query = async ({ token, vector, namespace }) => {
  const response = await fetch('https://[my-index].pinecone.io/query', {
    headers: {
      'Content-Type': 'application/json',
      'Api-Key': token,
    method: 'POST',
    body: JSON.stringify({
      topK: 10,
      includeMetadata: true,

  const data = await response.json();
  return data.matches.map(match => match.metadata);

This query function can be invoked with the vector we received from
createEmbeddings() like so:

const metadata = await query({
  token: '[PINECONE_API_KEY]',
  vector: vector, 
  namespace: 'my-knowledge-base',

3. Build the prompt

The third step builds the prompt. There are multiple ways to go about this and
the content of the template probably requires customization on your end, but
here is an example:

const template = `Answer the question as truthfully and accurately as possible using the provided context.
If the answer is not contained within the text below, say "Sorry, I don't have that information.".

Context: {CONTEXT}

Question: {QUERY}

Answer: `;

const getPrompt = (context, query) => {
  return template.replace('{CONTEXT}', context).replace('{QUERY}', query);

And here is how we can create the prompt with context from the metadata
returned from the database query:

const context = metadata.map(metadata => metadata.content).join(' ');

const prompt = getPrompt(context, 'What is an embedding?');

Later in this guide, we will also look at example code to maintain a
instead of merely asking one-shot questions.

4. Generate chat completion

We are ready for the last step: ask the model for a chat completion with our
prompt. Here’s an example function to call this endpoint:

export const chatCompletions = async ({ token, body }) => {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    body: JSON.stringify(body),

  return response;

And here’s how to make the request with the prompt:

const messages = [];
  role: 'user',
  content: prompt, 

const response = await chatCompletions({
  token: '[OPENAI_API_TOKEN]',
  body: {
    model: 'gpt-3.5-turbo',

const data = await response.json();
const text = data.choices[0].message.content;

The text contains the human-readable answer from OpenAI.

Excellent, this is the essence of generating chat completions based on your own
vector database. Now, how do we combine these four steps and integrate them into
a user interface? You can create a function that abstracts this away, or use the
@7-docs/edge package to do this for you. Keep reading to see an example.

In the next part of this guide, we will explore a UI component featuring a basic
form for users to submit their queries. This component will also render the
streaming response generated by the function in the next section.

User Interface

Let’s put our 4-step strategy into action and build function and

(Don’t want to implement this yourself, or just want to see examples? Visit
7-docs for available demos and starterkits to hit the ground running.)


The /api/completion endpoint will listen to incoming requests and respond
using all of the query logic from the previous section.

We’re going to use the @7-docs/edge package, which abstracts away the 4-step
strategy and some boring boilerplate. We need to pass the OPENAI_API_KEY and a
query function from a database adapter, Pinecone in this example. We pass it
to getCompletionHandler so it can query the database when it needs to. We
would pass a different function if we wanted to used a different type of
database (like Supabase or Milvus).

Let’s bring this together in a serverless or edge function handler in just a few
lines of code:

import { getCompletionHandler, pinecone } from '@7-docs/edge';
import { createClient } from '@supabase/supabase-js';

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const PINECONE_URL = process.env.PINECONE_URL;
const namespace = 'my-knowledge-base';

const query: QueryFn = (vector: number[]) =>
    url: PINECONE_URL,
    token: PINECONE_API_KEY,

export default getCompletionHandler({ OPENAI_API_KEY, query });

This pattern can be used anywhere from traditional servers to edge functions,
since there are no dependencies on modules only available in Node.js.


Now we still need a UI component to render an input field, send the input to the
/api/completion endpoint, and render the streaming response.

This minimal example uses a little React and JSX for an easy read, but it could
just as well be plain JavaScript or any other framework.

import { useState } from 'react';

export default function Page() {
  const [query, setQuery] = useState('');
  const [outputStream, setOutputStream] = useState('');

  function startStream(query) {
    const searchParams = new URLSearchParams();
    searchParams.set('query', encodeURIComponent(query));
    searchParams.set('embedding_model', 'text-embedding-ada-002');
    searchParams.set('completion_model', 'gpt-3.5-turbo');
    const url = '/api/completion?' + searchParams.toString();

    const source = new EventSource(url);
    source.addEventListener('message', event => {
      if (event.data.trim() === '[DONE]') {
      } else {
        const data = JSON.parse(event.data);
        const text = data.choices[0].delta.content;
        if (text) setOutputStream(v => v + text);

  const onSubmit = event => {
    if (event) event.preventDefault();

  return (
      <form onSubmit={onSubmit}>
          onChange={event => setQuery(event.target.value)}
        <input type="submit" value="Send" />


Now all the components in a “chat with your content” have come together:

  • Ingest content as vector embeddings into a database
  • Create a function to query the content using the 4-step strategy
  • Build a UI to accept user input and render a streaming response

The following sections will build on to make everything even more interesting!


To start a chat, we’ve seen how to build a basic prompt. This is good
enough for one-shot questions, but we need more to build a meaningful
conversation. The chat completions endpoint accepts an array of messages, so a
pattern to fill this array could look like this:

  1. Add a system message that explains the model (i.e. the assistant) how to
    behave and respond.
  2. Add the conversation history with user and assistant messages.
  3. Add the user prompt, containing the context and the query.

Here is an example building on the initial prompt example that extends the
messages array to build the conversation:

const context = metadata.map(metadata => metadata.content).join(' ');

const system = `Answer the question as truthfully as possible using the provided context.
If the answer is not contained within the text below, say "Sorry, I don't have that information.".`;

const history = [
  ['What is an embedding?', 'An embedding is...'],
  ['Can you give an example?', 'Here is an example...'],

const prompt = getPrompt(context, 'Can I restore the original text?');

const messages = [];

  role: 'system',
  content: system,

history.forEach(([question, answer]) => {
    role: 'user',
    content: question,

    role: 'assistant',
    content: answer,

  role: 'user',
  content: prompt,

const response = await chatCompletions({
  token: '[OPENAI_API_TOKEN]',
  model: 'gpt-3.5-turbo',

The actual history can come from the client. For instance, this could be
stored in UI component state, or browser session storage. In that case, it will
need to be sent with every request to the function. Other ways of storing
and retrieving the conversation history is outside the scope of this guide.

See the starter kits for examples to handle this in the user interface in tandem
with the @7-docs/edge package.


Tokens (not characters) are the unit used by OpenAI for limits and usage. There
are limits to the number of tokens that can be sent to and received from the API
endpoints with each request.


The maximum number of input tokens to create embeddings with the
text-embedding-ada-002 model is 8191.

The price is $ 0.0004 per 1k tokens, which comes down to a maximum of
$ 0.0032 per request when sending 8k tokens. That’s roughly 6.000 words that
can be sent at once to create vector embeddings. We can send as many requests as
we want.

During content ingestion you may need this endpoint for a short period in
bursts, depending on the amount of content. Remember that we also need it to
create an embedding from the user’s input to query the vector database.
Depending on the user’s input this request is usually smaller, but may occur
frequently for a longer period depending on application traffic.

Chat completions

For the chat completions endpoint, the max_tokens value represents the number
of tokens the model is allowed to use when generating the completion. The models
have their own limit (context length) and pricing:

Model Context Length $/1k prompt $/1k completion
gpt-3.5-turbo 4.096 $ 0.002 $ 0.002
gpt-4 8.192 $ 0.03 $ 0.06
gpt-4-32k 32.768 $ 0.06 $ 0.12

The sum of the tokens for the prompt plus the max_tokens for completion cannot
exceed the model’s context length. For gpt-3.5-turbo this means:

num_tokens(prompt) + max_tokens <= 4096

To see what this means in practice, we'll discuss tokenization first and then
look at an example calculation.


The number of tokens for a given text can be calculated using a tokenizer (such
as GPT-3-Encoder). Tokenization can be slow on larger chunks, and npm
packages for Node.js may not work in other environments such as the browser or

The alternative is to make an estimate: use 4 characters per token or 0.75 words
per token. That's 75 words per 100 tokens. This is a very rough estimate for the
English language and varies per language. You should probably also add a small
safety margin to stay within the limits and prevent erors.

OpenAI provides an online Tokenizer. For Python there's tiktoken.


Let's say you're using the gpt-3.5-turbo model. If you want to preserve 25%
for the completion, use max_tokens: 1024. The rest of the model's context can
be occupied by the prompt. That's 3072 tokens (4096-1024), which comes down
to an estimated 2304 words (3072*0.75) or 12.288 characters (3072*4).

The length of the prompt is the combined length of all content in the
messages (i.e. the combined messages of the system, user and assistant
roles in Conversation).

If the prompt has the maximum length and the model would use all completion
tokens, using 4096 tokens would cost $ 0.008 (4*$0.002).

Using the gpt-4 model, the same roundtrip would cost $ 0.15 (3*$0.03 for the
prompt + 1*$0.06 for the completion).


To optimize for your end-user, you'll need to find the right balance between
input (prompt) and output (completion).

When adding context and conversation history to the chat completion request it
may become a challenge to keep everything within the model's limit. More context
and more conversation history (input) means less room for the completion

There are a few ways I can think of to help mitigate this:

  • Limit the number of messages to keep in the conversation history.
  • Truncate or leave out previous answers from the assistant.
  • Send some sort of summary of the conversation history. That would likely
    require additional effort and requests.
  • Use a solution like GPTCache to cache query results.
  • Some form of "compression" could work in certain cases. An example using GPT-4
    can be found at gpt4_compression.md.

Another thing to consider is the amount of context to send with the prompt. This
context comes from the semantic search results when querying the vector
database. You may want to create smaller vector embeddings during ingestion to
eventually have more options and wiggle room when building the context for the
chat completion. On the other hand, including smaller but more varied pieces of
context may result in less "focused" completions.

Overall, I think what matters most is to not lose the first and last question
throughout the conversation. Keep in mind that the model does not store state or


When using OpenAI endpoints, the token usage for the request is included in
the response (with separate prompt_tokens and completion_tokens).
Unfortunately, usage is not included for streaming chat completion responses
(stream: true).


A quick overview of some common parameters you may want to tweak for better chat


The temperature parameter is a number between 0 and 2 (default: 1). A
low number like 0.2 makes the output more focused and deterministic. You want
this when the output should be generated based on the context sent within the
prompt. A higher value like 0.8 makes the output more random.

presence_penalty and frequency_penalty

A number between -2 and 2 to decrease or increase the presence and frequency
of tokens. The default value is 0 and this is fine for most situations. If you
want to reduce repetition, try numbers between 0.1 and 1. Negative numbers
increase the likelihood of repetition.


As we've seen when creating the messages array, each message is assigned a
role (system, user or assistant). You can make the conversation more
personal and send a name with each message.

Markdown & code blocks

If you ingest Markdown content, you likely also want the completion to include
Markdown and code blocks when relevant. Here's a list of things to remember
during ingestion and building the client application:


  • Don't strip out code blocks from the Markdown during ingestion.
  • Try to prevent splitting text in the middle of code blocks.


  • Include something like "Use Markdown" and "Try to include a code example in
    language-specific fenced code blocks" in the prompt, ideally in the system
  • Use a Markdown renderer (e.g. react-markdown).
  • Use a syntax highlighter (e.g. react-syntax-highlighter).

Next steps

After figuring out how to connect the dots, it's exciting to tinker and continue
the journey to improve the user experience. Here are a few pointers that may
inspire you:

  • Consider the integration of the conversation in the user interface, as well as
    the place and the role of the chat box.
  • Keep refining the prompt to better align with your content and your target
  • Improve chat completions by further tweaking the parameters, vector embedding
    sizes, and context in the prompt.
  • Empowere users with more control by providing affordances to adjust the prompt
    or by incorporating multiple prompts.
  • Combine multiple sources of content, such as searching a database with source
    code or a table with more generic content.
  • Generate multiple chat completions in a single response.
  • Use the Moderations endpoint to make sure the input text does not
    violate OpenAI's content policy.
  • Last but not least, listen to your customers. What are their needs?

We've explored many aspects of using OpenAI with JavaScript to create useful
applications. We've covered everything from ingesting content to building a user
interface with your own serverless or edge function. Hopefully, this guide is
helpful in your own journey. Good luck!

I would love to hear about your thoughts and what you are building, please
share with me on Twitter!

Special thanks goes out to Enis Bayramoğlu for a great review.

Read More