It definitely seems faster, and I can see it using about 10% more GPU.

Unfortunately when I ask it to return 512 tokens I’m getting an abort before the end of generation:

CUDA error 1 at ggml-cuda.cu:1920: invalid argument

Testing on H100 80GB. Ubuntu 20.04, gcc 9.4.0. CUDA toolkit 12.0.1.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Compiled with:

make clean && LLAMA_CUBLAS=1 make -j

Command line arguments:

 ./main --color  -ngl 60 --temp 0.7 --repeat_penalty 1.1 -n 512 --ignore-eos -m /workspace/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin  -p "USER: write a story about llamas\nASSISTANT:"

Full output:

[pytorch2] ubuntu@h100:/workspace/git/cuda_llama git:(cuda-full-gpu-2) $ ./main --color  -ngl 60 --temp 0.7 --repeat_penalty 1.1 -n 512 --ignore-eos -m /workspace/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin  -p "USER: write a story about llamas\nASSISTANT:"
main: build = 678 (0fe5ff2)
main: seed  = 1686600421
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA H100 PCIe
llama.cpp: loading model from /workspace/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2532.68 MB (+ 3124.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 60 layers to GPU
llama_model_load_internal: total VRAM used: 17736 MB
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 13 / 26 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0


 USER: write a story about llamas\nASSISTANT:Once upon a time, in the high mountains of South America, there lived a group of llamas. They were known for their long necks and soft wool coats that kept them warm during the cold winter months. The llamas roamed the rugged terrain, grazing on grasses and plants that grew in the harsh environment.
One day, a young llama named Lluvia decided to explore beyond her usual range. She wandered through fields of wildflowers and up steep ravines, until she came to a stream. As she drank from the cool water, she noticed a group of other llamas on the opposite bank. They were dressed in colorful saddles and had bags strapped to their backs.
Curious, Lluvia crossed the stream and approached the group. "Hello," she said. "What are you doing here?"
The leader of the group, a wise old llama named Tupac, replied, "We are on a journey to deliver supplies to a village at the base of the mountains. The people there need our help."
Lluvia was impressed by the llamas' bravery and kindness. She asked if she could join them on their journey, and they welcomed her with open hearts.
Together, the group set out down the mountain, carrying their heavy loads with ease. They passed through fields of crops and orchards, and Lluvia learned about the different plants and animals that lived in the region.
Finally, they arrived at the village, where they were greeted by grateful locals who thanked them for their help. The llamas unloaded their supplies and spent the night in a cozy barn, surrounded by the warmth of the community.
In the morning, Lluvia said goodbye to her new friends and returned to her own herd. But she never forgot the experience of helping others and the sense of purpose it had given her. From that day forward, she made a promise to herself to always be kind and compassionate, just like the wise old llama Tupac.
And so, Lluvia went on to live a long and fulfilling life, inspiring others with her courage and generosity, and always remembering the lessons she had learned on her journey with the kind-CUDA error 1 at ggml-cuda.cu:1920: invalid argument
[pytorch2] ubuntu@h100:/workspace/git/cuda_llama git:(cuda-full-gpu-2) $

Reducing to -n 400 resulted in successful completion.
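For what it's worth, the numbers are consistent with the abort only happening once the context window fills up and the context-swap path kicks in. A rough back-of-the-envelope check (the prompt token count is an estimate, not from the actual tokenizer):

```python
# Rough arithmetic only: 13 is an assumed token count for the short prompt,
# not a measured value from the llama.cpp tokenizer.
n_ctx = 512
prompt_tokens = 13

for n_predict in (400, 512):
    total = prompt_tokens + n_predict
    print(f"-n {n_predict}: ~{total} tokens total, "
          f"{'exceeds' if total > n_ctx else 'fits within'} n_ctx={n_ctx}")
```

With -n 512 the run would need ~525 tokens against a 512-token context, so generation has to evict/rotate the KV cache near the end, which is roughly where the crash lands; with -n 400 it stays under the limit and completes. That would point at the CUDA offload code in the context-swap path rather than ordinary generation.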
