CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs
CodeACT is a novel framework designed to enhance the performance and training efficiency of Code Large Language Models (LLMs). It integrates Complexity and Diversity Aware Sampling (CDAS), a method for selecting high-quality training data, with a Dynamic Pack padding strategy that minimizes computational resource usage. Experimental results demonstrate that CodeACT significantly improves benchmark performance while drastically reducing training time and GPU memory consumption.
Article Points:
1. The CodeACT framework enhances Code LLM performance and training efficiency.
2. The CDAS method selects high-quality training data that is both complex and diverse.
3. The Dynamic Pack padding strategy minimizes padding tokens, optimizing resource use.
4. CodeACT-DeepSeek-Coder-6.7B achieved an 8.6% increase on HumanEval while using only 40% of the training data.
5. CodeACT reduces training time by 78% and peak GPU memory usage by 27%.
6. K-Means clustering is the optimal choice for diverse data selection, balancing efficiency and performance.
Objective
- Enhance Code LLM performance
- Improve training efficiency

Key Components
CDAS (Data Sampling)
- Adaptive data selection based on complexity and diversity
- Scores complexity with the IFD (Instruction-Following Difficulty) metric
- Uses K-Means clustering for diversity (see the sketch below)
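To make the CDAS step concrete, here is a minimal sketch in Python. It assumes, based on the components listed above, that complexity is scored with the IFD ratio (perplexity of the response given the instruction over perplexity of the response alone) and that diversity comes from K-Means clusters over sample embeddings; the names `avg_token_nll`, `ifd_score`, and `cdas_select` are illustrative, not the paper's actual API.

```python
# Hedged sketch of CDAS-style selection; all names are illustrative.
import math
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def avg_token_nll(model, tokenizer, prefix, target, device="cpu"):
    """Average negative log-likelihood of `target` tokens given `prefix`.
    Assumes the tokenizer prepends a BOS token, so the prefix encoding
    is never empty."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(target, add_special_tokens=False,
                           return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so the target span is
    # predicted by the slice starting one step earlier.
    preds = logits[:, prefix_ids.size(1) - 1 : -1, :]
    return F.cross_entropy(preds.transpose(1, 2), target_ids).item()

def ifd_score(model, tokenizer, instruction, response):
    """IFD = PPL(response | instruction) / PPL(response): exp of the NLL
    difference equals the perplexity ratio. Higher means the instruction
    helps less, i.e. the sample is harder."""
    cond = avg_token_nll(model, tokenizer, instruction, response)
    uncond = avg_token_nll(model, tokenizer, "", response)
    return math.exp(cond - uncond)

def cdas_select(embeddings, ifd_scores, n_clusters=10, frac=0.4):
    """Cluster embeddings for diversity, then keep the top `frac` of
    samples by IFD inside each cluster for complexity. frac=0.4 mirrors
    the 40% sampling rate the experiments report as optimal."""
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    scores = np.asarray(ifd_scores)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-scores[idx])]  # hardest samples first
        keep.extend(idx[: max(1, int(len(idx) * frac))].tolist())
    return sorted(keep)
```

Filtering inside each cluster, rather than globally, keeps hard samples from every region of the embedding space instead of letting one dominant topic crowd out the rest.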
Dynamic Pack Padding
- Sorts samples by length
- Concatenates multiple samples into a single sequence
- Minimizes padding tokens (see the sketch below)
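The packing strategy itself fits in a few lines. A minimal sketch, assuming a greedy merge of length-sorted samples up to the model's maximum sequence length (`dynamic_pack` and `max_len` are illustrative names, not the paper's API):

```python
def dynamic_pack(sample_lengths, max_len=4096):
    """Greedy length-sorted packing: sort samples by token length, then
    merge consecutive samples into one sequence while the combined
    length still fits in max_len. Returns index groups, one per row."""
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i])
    packs, current, used = [], [], 0
    for i in order:
        if current and used + sample_lengths[i] > max_len:
            packs.append(current)  # current row is full; start a new one
            current, used = [], 0
        current.append(i)
        used += sample_lengths[i]
    if current:
        packs.append(current)
    return packs

# With max_len=10, lengths [9, 2, 3, 7, 4] pack as [[1, 2, 4], [3], [0]]:
# 5 padding tokens in total, versus 20 when every sample is padded to
# the batch maximum of 9.
```

Because samples are sorted first, items of similar length land in the same packed row, which is what keeps the residual padding small.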
Experimental Results
- Significant performance increase across benchmarks
- Training time reduced by 78%
- Peak GPU memory decreased by 27%
- 40% sampling rate found to be optimal
Advantages
- Narrows the gap between open-source and closed-source models
- Resource-efficient training
- Improved generalization
Limitations
- Evaluated on a limited range of model sizes
- Correctness of selected code data is not verified
Related Work