Training Material

Connecting to Cluster

Each team will be assigned a temporary Andrew ID, which you will need to access the cluster.

Follow the instructions to connect to the cluster.
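
A typical login from a terminal is a single SSH command; the hostname below is a placeholder, so use the login host given in the connection instructions together with your temporary Andrew ID.

<syntaxhighlight lang="bash">
# Placeholder values: substitute the login host from the connection instructions
# and the temporary Andrew ID assigned to your team.
ssh <andrew_id>@<babel_login_node>
</syntaxhighlight>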

Working with SLURM

SLURM (Simple Linux Utility for Resource Management) is a job scheduler and resource management system commonly used in high-performance computing (HPC) environments. You will need to become familiar with basic SLURM usage in order to interact with the cluster.

Refer to the Beginner's guide to the SLURM workload manager for more details.
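
The commands below are a minimal sketch of day-to-day SLURM usage; the flags and values are generic examples (including the hypothetical script name job.sh), not Babel-specific defaults, so consult the guide above for the options used on this cluster.

<syntaxhighlight lang="bash">
# List the partitions and nodes you are allowed to submit to
sinfo

# Run a short interactive command on a compute node (1 GPU for 1 hour; illustrative values)
srun --gres=gpu:1 -t 1:00:00 --pty nvidia-smi

# Submit a batch job described in a script (job.sh is a hypothetical name) and watch its queue state
sbatch job.sh
squeue -u <andrew_id>
</syntaxhighlight>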

Monitoring

To monitor cluster activity such as jobs, compute resources, and disk usage, see Monitoring for essential techniques and tools.
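
As a starting point, the standard SLURM and Unix commands below cover most routine checks; they are generic examples, and the Monitoring page describes any cluster-specific tooling.

<syntaxhighlight lang="bash">
# Your queued and running jobs
squeue -u <andrew_id>

# Full details of one job: assigned node, time limit, requested resources (replace <job_id>)
scontrol show job <job_id>

# Per-node state of the cluster
sinfo -N -l

# Disk usage of your user data directory
du -sh /data/user_data/<andrew_id>
</syntaxhighlight>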

LLM Deployment Demo

LLaMA

Released in February 2023 by Meta AI, LLaMA is a family of openly released pre-trained LLMs ranging in size from 7 billion to 65 billion parameters. It was trained mostly on scraped internet data, supplemented by other sources such as GitHub code, books, and academic papers. Since its release, LLaMA has also been fine-tuned for instruction following and conversational alignment.

Accessibility: LLaMA is more efficient and less resource-intensive than many comparable models, and it is available to researchers and other organizations under a non-commercial license. It comes in several sizes (7B, 13B, 33B, and 65B parameters), making it usable across a range of computing resources.

Open-source Community: LLaMA models are part of the open-source ecosystem, so users benefit from the extensive community support, documentation, and shared resources available through platforms like Hugging Face.

Setting up LLaMA on Babel:

Server

  1. Clone the repository: git clone https://github.com/neulab/lti-llm-deployment
  2. Check out the update-lamma branch: git checkout update-lamma
  3. (If not already installed) Install pip using https://pip.pypa.io/en/stable/installation/
  4. Create a virtual environment with Miniconda
    1. wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    2. bash Miniconda3-latest-Linux-x86_64.sh
    3. conda create --name myenv
    4. conda activate myenv
  5. Start a shell on a compute node (a filled-in example appears after this list): srun -G <num_of_gpu> -t <num_of_hours> -p <partition_name> -w <node_name> --pty $SHELL
    • num_of_gpu : the number of GPUs you would like to allocate; each team has 8 GPUs.
    • num_of_hours : the time limit for the job (in days); the maximum is 20 days.
    • partition_name , node_name : each team is assigned a specific partition and compute node; consult your mentor if you don't have them.
  6. Activate the conda environment: conda activate myenv
  7. Change directory into the repository: cd lti-llm-deployment
  8. Install dependencies (README):
    1. pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.4 (the version constraints are quoted so the shell does not treat >= as a redirect)
    2. pip install sentencepiece
  9. Set the environment variable (required for models with more than 7B parameters): export TRANSFORMERS_CACHE=/data/user_data/<andrew_id>
  10. Run the desired script to launch the LLM: bash launch_llama7b_fp16_2gpu_server.sh. Upon successful execution, the message “model loaded” is printed to the terminal.
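
As a sanity check, the sketch below fills in step 5 with illustrative values (2 GPUs for 2 days) and confirms the allocation from inside the compute-node shell; your actual GPU count, time limit, partition, and node come from your team's assignment.

<syntaxhighlight lang="bash">
# Illustrative values only; -t here uses SLURM's D-HH:MM:SS format (two days)
srun -G 2 -t 2-00:00:00 -p <partition_name> -w <node_name> --pty $SHELL

# Inside the compute-node shell, confirm the requested GPUs are visible
nvidia-smi
</syntaxhighlight>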

Client

  1. Start another session on Babel to set up the client.
  2. Create a Python file with the following contents:<syntaxhighlight lang="python">
import llm_client

# Connect to the node where the server script is running ("babel-0-23" is an example;
# use the compute node on which you launched the server).
client = llm_client.Client(address="babel-0-23", port=5000)

text = "CMU students are"
output = client.prompt([text])         # generate a completion for the prompt
tokens, scores = client.score([text])  # per-token scores for the prompt

print(text)
print(output[0].text)
for tok, s in zip(tokens, scores):
    print(f"[{tok}]: {s:.2f}")
</syntaxhighlight>

  3. Run python filename.py