Training
Training jobs
You can also use the Python client to submit training jobs to the ScalarLM server.
```python
import scalarlm

scalarlm.api_url = "https://llama8btensorwave.cray-lm.com"


def get_dataset():
    dataset = []
    count = 5
    for i in range(count):
        dataset.append(
            {"input": f"What is {i} + {i}?", "output": str(i + i)}
        )
    return dataset


llm = scalarlm.SupermassiveIntelligence()

dataset = get_dataset()

status = llm.train(dataset, train_args={"max_steps": 200, "learning_rate": 3e-3})

print(status)
```
You will see command-line output like this:

```
(environment) gregorydiamos@Air-Gregory cray % python test/deployment/train.py
{'job_id': '1', 'status': 'QUEUED', 'message': 'Training job launched', 'dataset_id': 'dataset', 'job_directory': '/app/cray/jobs/69118a251a074f9f9d37a2ddc903243e428d30c3c31ad019cbf62ac777e42e6e', 'model_name': '69118a251a074f9f9d37a2ddc903243e428d30c3c31ad019cbf62ac777e42e6e'}
```
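The returned status is a plain Python dictionary, so you can pull fields such as the generated model name out of it directly. A minimal sketch, using the values from the sample run above (your `job_id`, `job_directory`, and `model_name` will differ):

```python
# Status dictionary as returned by llm.train(), values copied from
# the sample run above.
status = {
    "job_id": "1",
    "status": "QUEUED",
    "message": "Training job launched",
    "dataset_id": "dataset",
    "job_directory": "/app/cray/jobs/69118a251a074f9f9d37a2ddc903243e428d30c3c31ad019cbf62ac777e42e6e",
    "model_name": "69118a251a074f9f9d37a2ddc903243e428d30c3c31ad019cbf62ac777e42e6e",
}

# The model_name identifies the trained model once the job completes,
# so it is worth saving for later use.
model_name = status["model_name"]
print(f"Job {status['job_id']} is {status['status']}; model: {model_name}")
```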
Multi-GPU Training
To use multiple GPUs, pass the "gpus" argument to llm.train. Jobs will automatically be distributed among the GPUs in your ScalarLM deployment.

```python
llm.train(
    dataset,
    train_args={
        "max_steps": 200,
        "learning_rate": 3e-3,
        "gpus": 2,
    },
)
```
Custom Training
When you submit a training job using ScalarLM, it will train an LLM using the source code for the data loader, training loop, and PyTorch model from the ml/ directory.
You can check out the source code to see how it works at: https://github.com/tensorwavecloud/ScalarLM/tree/main/ml
ScalarLM also allows you to use your own custom model code, e.g. if you want to adjust the model or its hyperparameters. If you check out the ml/ directory from the repo and put it in the same directory as your training script, your local version will be uploaded to the ScalarLM server and used for your training job.
For example, consider a directory structure like this:
```
./train.py
./ml/... # custom ml directory containing your custom training code
```
In this case, your custom ./ml directory will be used for training jobs submitted by the ./train.py script.
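Before submitting, it can help to confirm the layout is what the client expects. A minimal illustrative check (this helper is not part of the ScalarLM API; it only inspects the local filesystem):

```python
from pathlib import Path

# Check whether a custom ml/ directory sits next to this script.
# In a real train.py you would use Path(__file__).resolve().parent;
# here we check the current working directory for illustration.
script_dir = Path(".").resolve()
ml_dir = script_dir / "ml"

if ml_dir.is_dir():
    print(f"Custom training code found at {ml_dir}")
else:
    print("No local ml/ directory; the server's default training code will be used.")
```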