---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
---

## Nemotron-4-340B-Base

[![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)[![Model size](https://img.shields.io/badge/Params-340B-green)](#model-architecture)[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)

### Model Overview:

Nemotron-4-340B-Base is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is pre-trained on a total of 9 trillion tokens, consisting of a diverse assortment of English-based texts, 50+ natural languages, and 40+ coding languages. After an initial pre-training phase of 8 trillion tokens, continuous pre-training on a further 1 trillion tokens was performed on top of the pre-trained model to improve model quality. During continuous training, we altered the data composition by using a different distribution than the one seen at the beginning of training.

Under the NVIDIA Open Model License, NVIDIA confirms:

- Models are commercially usable.
- You are free to create and distribute Derivative Models.
- NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.

### License:

[NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf)

### Intended use

Nemotron-4-340B-Base is a completion model intended for use in 50+ natural and 40+ coding languages. For best performance on a given task, users are encouraged to customize the completion model using the [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html) suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/SteerLM/RLHF using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner).

**Model Developer:** NVIDIA

**Model Input:** Text

**Input Format:** String

**Input Parameters:** One-Dimensional (1D)

**Model Output:** Text

**Output Format:** String

**Output Parameters:** 1D

**Model Dates:** Nemotron-4-340B-Base was trained between December 2023 and May 2024.

### Required Hardware

BF16 Inference:
- 8x H200 (1x H200 Node)
- 16x H100 (2x H100 Nodes)
- 16x A100 (2x A100 Nodes)

### Model Architecture:

Nemotron-4-340B-Base is trained with a global batch size of 2304 and a sequence length of 4096 tokens, and uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).

**Architecture Type:** Transformer Decoder (auto-regressive language model)

**Network Architecture:** Nemotron-4

### Usage

1. We will spin up an inference server and then call the inference server in a Python script. Let's first define the Python script ``call_server.py``.

```python
import requests
import json

# All requests to the NeMo text-generation server are JSON over HTTP.
headers = {"Content-Type": "application/json"}


def text_generation(data, ip='localhost', port=None):
    # Send one generation request to the server's /generate endpoint.
    resp = requests.put(f'http://{ip}:{port}/generate', data=json.dumps(data), headers=headers)
    return resp.json()


def get_generation(prompt, greedy, add_BOS, token_to_gen, min_tokens, temp, top_p, top_k, repetition, batch=False):
    # Build the request payload; `prompt` is a single string unless batch=True,
    # in which case it is a list of prompts.
    data = {
        "sentences": [prompt] if not batch else prompt,
        "tokens_to_generate": int(token_to_gen),
        "temperature": temp,
        "add_BOS": add_BOS,
        "top_k": top_k,
        "top_p": top_p,
        "greedy": greedy,
        "all_probs": False,
        "repetition_penalty": repetition,
        "min_tokens_to_generate": int(min_tokens),
        "end_strings": ["<|endoftext|>", "", "\x11", "User"],
    }
    sentences = text_generation(data, port=1424)['sentences']
    return sentences[0] if not batch else sentences


PROMPT_TEMPLATE = "{prompt}"

question = "Write a poem on NVIDIA in the style of Shakespeare"
prompt = PROMPT_TEMPLATE.format(prompt=question)
print(prompt)

# The returned string echoes the prompt at the start of the completion, so strip it off.
response = get_generation(prompt, greedy=True, add_BOS=False, token_to_gen=1024, min_tokens=1,
                          temp=1.0, top_p=1.0, top_k=0, repetition=1.0, batch=False)
response = response[len(prompt):]
print(response)
```
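The script above issues a single greedy request. As a minimal sketch, assuming the inference server from the next steps is already listening on port 1424, the same ``get_generation`` helper can also send a batch of prompts in one request with sampling enabled; the prompts and sampling values below are illustrative placeholders only.

```python
# Illustrative sketch only: assumes call_server.py's helpers are already defined
# and the NeMo inference server from steps 2-3 is listening on port 1424.
prompts = [
    "Summarize the benefits of synthetic data generation in two sentences.",
    "Write a limerick about GPUs.",
]

# batch=True sends the whole list in one request; greedy=False enables sampling.
responses = get_generation(
    prompts,
    greedy=False,
    add_BOS=False,
    token_to_gen=256,
    min_tokens=1,
    temp=0.8,
    top_p=0.95,
    top_k=0,
    repetition=1.0,
    batch=True,
)

# As in the single-prompt case, each returned string echoes its prompt, so strip it.
for p, completion in zip(prompts, responses):
    print(completion[len(p):])
```

Because all decoding parameters are forwarded verbatim to the server, the same helper covers both greedy and sampled generation.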
2. Given this Python script, we will create a bash script that spins up the inference server within the NeMo container (``docker pull nvcr.io/nvidia/nemo:24.01.framework``) and calls the Python script ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows:

```bash
NEMO_FILE=$1
WEB_PORT=1424

# Block until the inference server responds on the given host and port.
depends_on () {
    HOST=$1
    PORT=$2
    STATUS=$(curl -X PUT http://$HOST:$PORT >/dev/null 2>/dev/null; echo $?)
    while [ $STATUS -ne 0 ]
    do
        echo "waiting for server ($HOST:$PORT) to be up"
        sleep 10
        STATUS=$(curl -X PUT http://$HOST:$PORT >/dev/null 2>/dev/null; echo $?)
    done
    echo "server ($HOST:$PORT) is up and running"
}

# Launch the NeMo text-generation server across 2 nodes
# (tensor parallel 8 x pipeline parallel 2) in BF16.
/usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
    gpt_model_file=$NEMO_FILE \
    pipeline_model_parallel_split_rank=0 \
    server=True tensor_model_parallel_size=8 \
    trainer.precision=bf16 pipeline_model_parallel_size=2 \
    trainer.devices=8 \
    trainer.num_nodes=2 \
    web_server=False \
    port=${WEB_PORT} &
SERVER_PID=$!

readonly local_rank="${LOCAL_RANK:=${SLURM_LOCALID:=${OMPI_COMM_WORLD_LOCAL_RANK:-}}}"
# Only the first rank on the first node waits for the server and runs the client script.
if [ $SLURM_NODEID -eq 0 ] && [ $local_rank -eq 0 ]; then
    depends_on "0.0.0.0" ${WEB_PORT}

    echo "start get json"
    sleep 5

    echo "SLURM_NODEID: $SLURM_NODEID"
    echo "local_rank: $local_rank"
    /usr/bin/python3 /scripts/call_server.py
    echo "clean up daemons: $$"
    kill -9 $SERVER_PID
    pkill python
fi
wait
```

3. We can launch ``nemo_inference.sh`` with a Slurm script like the one below, which starts a two-node job for model inference.

```bash
#!/bin/bash
#SBATCH -A SLURM-ACCOUNT
#SBATCH -p SLURM-PARTITION
#SBATCH -N 2
#SBATCH -J generation
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
set -x

RESULTS=                                              # directory for output and error logs
OUTFILE="${RESULTS}/slurm-%j-%n.out"
ERRFILE="${RESULTS}/error-%j-%n.out"
MODEL=/Nemotron-4-340B-Base                           # path to the downloaded checkpoint
CONTAINER="nvcr.io/nvidia/nemo:24.01.framework"
MOUNTS="--container-mounts=:/scripts,MODEL:/model"    # mount the folder containing call_server.py at /scripts

read -r -d '' cmd <