CXCSCMU GroupWiki

From CMU -- Language Technologies Institute -- HPC Wiki

Organize your code

Create your own branch for development and regularly push changes to it! Here are some development steps you may refer to:

  1. For new features/experiments, it is highly recommended to create a new branch
  2. After development and thorough testing, merge the new features into your branch
  3. If you think these features can benefit everyone in the group, merge them into the main branch via a pull request (peers should review and test the code)

Organize your results

Loss curve

If you train a model, the loss curve is vital for debugging, gaining insights, and reproducing your results.

It is recommended to use wandb to track your loss curve:

"""
Wandb setup
"""
wandb_project = YOUR_PROJECT_NAME
# IMPORTANT! Record the most important hyperparameters here
wandb_run_name = MODEL_NAME-DATASET_NAME-BATCH_SIZE-LEARNING_RATE-YOUR_RUN_NAME
# Optional but highly recommended
hparams = YOUR_HYPER_PARAMS_DICT
out_dir = YOUR_WANDB_OUTPUT_DIR
# You may be asked to fill in the API key. Find it online
wandb.init(project=wandb_project, name=wandb_run_name, config=hparams, dir=out_dir)
"""
Wandb logging
"""
wandb.log({
           "step": train_step,
           "train/loss": train_loss,
           "val/loss": val_loss,
           "step time": step_time,
           "lr": lr,
         })
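
For context, here is a minimal sketch of where the logging call typically sits in a training loop; `model`, `train_loader`, `optimizer`, and `train_one_step` are hypothetical placeholders, not part of any group codebase:

import time

# Minimal training-loop sketch with hypothetical helpers: log every
# `log_interval` steps so the online curve stays readable without
# slowing training down.
log_interval = 10
for train_step, (inputs, targets) in enumerate(train_loader):
    t0 = time.time()
    train_loss = train_one_step(model, inputs, targets, optimizer)
    if train_step % log_interval == 0:
        wandb.log({
            "step": train_step,
            "train/loss": train_loss,
            "lr": optimizer.param_groups[0]["lr"],
            "step time": time.time() - t0,
        })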

While training, you can keep track of the curve online and export a report to share at the end of the run. For your reference, this is an example report.

Evaluation numbers

Create a folder named after yourself or your project under the CXCSCMU_Group Google Drive, and use Google Sheets to present the numbers.

  • Make sure the column/row names clearly describe the model you are actually evaluating
    • ❌ Pythia
    • ✅ Pythia-160M, full-model fine-tuned on SST-2 for 1 epoch, lr=1e-5, bs=8
  • Group models that can be compared fairly, using the same format you would use in a research paper

Codebases

An LLM codebase built on Lit-GPT and PyTorch Lightning, especially useful for efficiently pre-training LMs from scratch.

What can it do:

  • Pre-train state-of-the-art decoder-only models (Llama, Llama 2, Pythia, Vicuna, GPT-2, ...)
  • Fine-tune on task-specific data
  • Evaluate with the Language Model Evaluation Harness (see the sketch after this list)
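
For the evaluation step, the harness can also be called directly from Python. Below is a minimal sketch of the generic lm-evaluation-harness (v0.4-style) API, not a wrapper specific to this codebase; the checkpoint and task names are only examples:

import lm_eval

# Generic lm-evaluation-harness usage (v0.4-style API), independent of the
# group codebase; the checkpoint and tasks below are example choices.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["arc_easy", "hellaswag"],
    batch_size=8,
)
print(results["results"])  # per-task metrics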

Pros:

  • State-of-the-art distributed training strategies: DDP, FSDP, DeepSpeed (see the sketch after this list)
  • Modern acceleration strategies: FlashAttention, Fused Adam, mixed precision
  • Parameter-efficient fine-tuning: Adapter, Adapter v2, LoRA, QLoRA, ...
  • Large-scale evaluation datasets: cover almost every common NLP task and are continuously updated
  • Training speed comparable to Hugging Face, with better flexibility
  • Relatively easy to convert model weights from/to Hugging Face via name mapping
  • Detailed tutorials for each use case, so it is easy to get started
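
Since the codebase is built on PyTorch Lightning, the distributed and mixed-precision options above are typically selected when constructing the Lightning Trainer. The following is a minimal generic Lightning sketch, not this codebase's actual entry point; the model and datamodule names are hypothetical:

import lightning as L

# Generic PyTorch Lightning sketch: the distributed strategy and precision
# are chosen at Trainer construction time. `MyLitModel` and `my_datamodule`
# are hypothetical placeholders, not part of the group codebase.
trainer = L.Trainer(
    accelerator="gpu",
    devices=8,                # GPUs per node
    strategy="fsdp",          # or "ddp", "deepspeed"
    precision="bf16-mixed",   # mixed-precision training
    max_steps=100_000,
)
trainer.fit(MyLitModel(), datamodule=my_datamodule)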

Cons:

  • Does not support models with other architectures, such as T5 or BERT
  • Does not support as many training datasets as Hugging Face; you may need to define the dataset class or preprocess the dataset yourself
  • Still in development and requires everyone's effort to maintain it

(Yu et al., 2022) Paper | Github | Docs

A Python-based library for conducting Neural Information Retrieval (Neu-IR) research experiments. The library contains both neural and traditional IR modules, making it easy to run baseline experiments for comparison.

What can it do:

  • Template-based Data Processing. Convenient templates for processing raw data -- no need to reformat data to conform to the software's input format.
  • Efficient Data Accessing. Integrated with HF Datasets, which enables access to large datasets with minimal memory overhead (see the sketch after this list).
  • Sharded Search. Implements a two-stage sharded search, which avoids having to load the whole dataset into memory at once.
  • A sample of models that are supported: DPR, ANCE, T5, BERT, etc.
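
As an illustration of the memory-efficient data access mentioned above, here is a minimal sketch using plain HF Datasets streaming (not OpenMatch-specific code; the dataset name is only an example):

from datasets import load_dataset

# Plain HuggingFace Datasets sketch (not OpenMatch-specific): with
# streaming=True, records are read lazily, so a large corpus never has to
# fit in memory at once.
corpus = load_dataset("ms_marco", "v2.1", split="train", streaming=True)

for i, example in enumerate(corpus):
    # each example is a plain dict; process it and move on
    print(example.keys())
    if i >= 2:
        break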

Requirements:

  • PyTorch
  • HuggingFace Datasets
  • Transformers
  • Faiss

A working directory of OpenMatch V2.0 exists in `/data/group_data/cx_group/OpenMatch` on the Babel server.

Paper reading

Research assistants are expected to read at least five papers per week by default. For your reference, this is an example.

Paper spotlight meetings

Meeting 1

Meeting 2 - November 3, 2023

Paper #Votes
SELF-RAG: Learning to Retrieve, Generate and Critique through Self-reflection 4
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning 2
Tree Prompting: Efficient Task Adaptation without Fine-Tuning 2
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards 2
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models 2
Are Emergent Abilities of Large Language Models a Mirage? 2
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness 2
Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective 1
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning 1
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning 1
CodeFusion: A Pre-trained Diffusion Model for Code Generation 0
Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers 0
ChunkAttention: Efficient Attention on KV Cache with Chunking Sharing and Batching 0
Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading 0

Meeting 3 - February 2, 2024

Paper #Votes
Tuning Language Models by Proxy 5
Learning to Tokenize for Generative Retrieval 4
Can AI Assistants Know What They Don’t Know? 3
Lost in the Middle: How Language Models Use Long Contexts 2
Distilling Semantic Concept Embeddings from Contrastively Fine-tuned Language Models 2
Datamodels: Predicting Predictions from Training Data 2
User Embedding Model for Personalized Language Prompting 1
Improving Passage Retrieval with Zero-Shot Question Generation 1
Text Representation Distillation via Information Bottleneck Principle 1
Self-Rewarding Language Models 0
Retrieval is Accurate Generation 0
Learning Query-aware Embedding Index for Improving E-commerce Dense Retrieval 0
How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark 0
Explaining and Improving Model Behavior with k Nearest Neighbor Representations 0