OpenAI Compatible Server

We host a test deployment of ScalarLM on TensorWave. You can access it at the https://llama8btensorwave.cray-lm.com endpoint.

For example, to submit a chat completion request with curl:

curl https://llama8btensorwave.cray-lm.com/v1/openai/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
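
Because the endpoint follows the OpenAI chat completions API, you can also point the official openai Python package at it. The snippet below is a sketch: it assumes the /v1/openai prefix from the curl example above is the OpenAI-compatible base URL and that the test deployment accepts a placeholder API key.

from openai import OpenAI

# Assumption: the ScalarLM deployment serves the OpenAI API under /v1/openai
# and does not validate API keys, so a placeholder key is passed.
client = OpenAI(
    base_url="https://llama8btensorwave.cray-lm.com/v1/openai",
    api_key="placeholder",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)

print(response.choices[0].message.content)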

Using the Python client

You can also use the Python client to interact with the ScalarLM server.

import scalarlm

# Point the client at the hosted ScalarLM deployment.
scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"

def get_dataset():
    # Build a small batch of prompts to send in a single generate call.
    dataset = []

    count = 4

    for i in range(count):
        dataset.append(f"What is {i} + {i}?")

    return dataset


llm = scalarlm.SupermassiveIntelligence()

dataset = get_dataset()

# Submit the whole batch of prompts at once.
results = llm.generate(prompts=dataset)

print(results)

Batching Support

ScalarLM supports batching through the Python client. Notice in the example above that a list of prompts is provided to the llm.generate call.

ScalarLM will automatically distribute mini-batches of requests to inference GPUs and keep them fully utilized.

Requests are queued by ScalarLM, so you can submit very large numbers of queries; parallelism and backpressure are handled automatically by the ScalarLM client and server queues.
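
For example, you can hand llm.generate a much larger list of prompts in a single call and let ScalarLM's queues do the rest. The sketch below reuses the generate call shown above; the batch size of 1000 is arbitrary, and the assumption that results contains one completion per prompt is noted in the comments.

import scalarlm

scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"

# Build a large batch of prompts; ScalarLM queues them server-side,
# so there is no need to chunk the list yourself.
prompts = [f"What is {i} + {i}?" for i in range(1000)]

llm = scalarlm.SupermassiveIntelligence()

# A single generate call submits the whole batch; mini-batching across
# inference GPUs and backpressure are handled by the client and server queues.
results = llm.generate(prompts=prompts)

# Assumption: results holds one completion per submitted prompt.
print(len(results))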