CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs
CodeACT is a novel framework designed to enhance the performance and training efficiency of Code Large Language Models (LLMs). It integrates a Complexity and Diversity Aware Sampling (CDAS) method for selecting high-quality training data and a Dynamic Pack padding strategy that minimizes computational resource usage. Experimental results demonstrate that CodeACT significantly improves benchmark performance while drastically reducing training time and GPU memory consumption.
Article Points:
1. The CodeACT framework enhances Code LLM performance and training efficiency.
2. The CDAS method selects high-quality, complex, and diverse training data.
3. The Dynamic Pack padding strategy minimizes padding tokens, optimizing resource use.
4. CodeACT-DeepSeek-Coder-6.7B achieved an 8.6% increase on HumanEval using only 40% of the training data.
5. CodeACT reduces training time by 78% and peak GPU memory by 27%.
6. K-Means proves optimal for diverse data selection, balancing efficiency and performance.
Objective
- Enhance Code LLM performance
- Improve training efficiency

Key Components

CDAS (Data Sampling)
- Adaptive data selection balancing complexity and diversity
- Uses the IFD (Instruction-Following Difficulty) score to measure complexity
- Uses K-Means clustering for diversity (see the sketch after this list)
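
A minimal sketch of the CDAS selection step in Python, under stated assumptions: instruction embeddings and per-sample response losses have already been computed with the model being tuned, and the function names (ifd_score, select_cdas), the default cluster count, and the per-cluster top-k policy are illustrative, not taken from the CodeACT codebase. The IFD score is the loss of the response conditioned on the instruction divided by the loss of the response alone, so higher ratios mark samples where the instruction helps less, i.e. more complex samples.

```python
import numpy as np
from sklearn.cluster import KMeans

def ifd_score(loss_conditioned: np.ndarray, loss_direct: np.ndarray) -> np.ndarray:
    """Instruction-Following Difficulty: loss of the response given the
    instruction divided by the loss of the response alone. A higher ratio
    means the instruction helps less, i.e. a more complex sample."""
    return loss_conditioned / loss_direct

def select_cdas(embeddings: np.ndarray, ifd: np.ndarray,
                sample_rate: float = 0.4, n_clusters: int = 100) -> np.ndarray:
    """Cluster instruction embeddings with K-Means for diversity, then keep
    the highest-IFD (most complex) fraction of each cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        k = max(1, int(len(idx) * sample_rate))
        # Rank this cluster's samples by descending IFD and keep the top k.
        keep.extend(idx[np.argsort(-ifd[idx])[:k]])
    return np.sort(np.array(keep))
```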

Dynamic Pack Padding
- Sorts samples by length
- Concatenates multiple samples into a single sequence
- Minimizes padding tokens (see the sketch after this list)
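
A minimal sketch of a Dynamic Pack-style packing step in Python, assuming already-tokenized samples and a fixed maximum sequence length; the function name and the greedy first-fit policy are illustrative assumptions rather than the exact CodeACT implementation.

```python
from typing import List

def dynamic_pack(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """Sort tokenized samples by length, then greedily concatenate
    consecutive samples into one sequence while the result still fits
    within max_len. Fuller sequences mean fewer padding tokens per batch.
    (A real implementation must also track sample boundaries for the
    attention mask and insert separator tokens; both are omitted here.)"""
    packed: List[List[int]] = []
    current: List[int] = []
    for sample in sorted(samples, key=len):
        if current and len(current) + len(sample) > max_len:
            packed.append(current)  # current sequence is full; start a new one
            current = []
        current = current + sample
    if current:
        packed.append(current)
    return packed
```

For example, samples of 100, 400, 500, and 900 tokens with max_len=1024 pack into two sequences of 1000 and 900 tokens, so the batch pads to 1024 twice instead of four times.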

Experimental Results
- Significant performance increase on code benchmarks
- Training time reduced by 78%
- Peak GPU memory usage reduced by 27%
- A 40% sampling rate proved optimal

Advantages
- Narrows the performance gap between open-source and closed-source models
- Resource-efficient training
- Improved generalization

Limitations
- Evaluated on a limited range of model sizes
- Does not guarantee the functional correctness of selected code data

Related Work
- Base Code LLMs
- Data Generation methods
- Data Selection techniques