Inference
OpenAI Compatible Server
We host a deployment of ScalarLM on TensorWave for testing. You can access it at the https://llama8btensorwave.cray-lm.com endpoint.
For example, to submit a chat completion request to it with curl:
curl https://llama8btensorwave.cray-lm.com/v1/openai/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
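Because the endpoint speaks the OpenAI chat completions API, you can also point the standard openai Python package at it. The sketch below is an assumption-based example: it treats the /v1/openai prefix from the curl command as the base URL and assumes the test deployment does not enforce an API key, so the key passed is an arbitrary placeholder.

from openai import OpenAI

# Reuse the ScalarLM deployment through its OpenAI-compatible route.
# Assumption: the /v1/openai prefix from the curl example is the base URL,
# and no real API key is required, so a placeholder is passed.
client = OpenAI(
    base_url="https://llama8btensorwave.cray-lm.com/v1/openai",
    api_key="placeholder",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)

print(response.choices[0].message.content)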
Using the Python Client
You can also use the Python client to interact with the ScalarLM server.
import scalarlm

# Point the client at the hosted ScalarLM deployment.
scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"

def get_dataset():
    # Build a small batch of prompts.
    dataset = []
    count = 4
    for i in range(count):
        dataset.append(f"What is {i} + {i}?")
    return dataset

llm = scalarlm.SupermassiveIntelligence()
dataset = get_dataset()

# Submit the whole batch of prompts in a single call.
results = llm.generate(prompts=dataset)
print(results)
Batching Support
ScalarLM supports batching through the Python client. Notice in the example above that a list of prompts is passed to the llm.generate call.
ScalarLM will automatically distribute mini-batches of requests to inference GPUs and keep them fully utilized.
Requests are queued by ScalarLM, so you can submit very large numbers of queries; parallelism and back pressure are handled automatically by the ScalarLM client and server queues. A sketch of a larger batched workload follows.
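The example below submits a few hundred prompts in a single call. It is a minimal sketch that assumes, as in the example above, that llm.generate returns one completion per prompt in the same order as the input list.

import scalarlm

scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"

llm = scalarlm.SupermassiveIntelligence()

# A larger batch: ScalarLM queues these requests and schedules mini-batches
# onto the available inference GPUs.
prompts = [f"What is {i} + {i}?" for i in range(256)]

results = llm.generate(prompts=prompts)

# Assumption: results is a list of completions aligned with the input prompts.
for prompt, completion in zip(prompts, results):
    print(prompt, "->", completion)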