LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
LLM-FE is a novel framework for automated feature engineering on tabular data that combines the domain knowledge and reasoning of Large Language Models (LLMs) with evolutionary search. It frames feature engineering as a program search problem in which LLMs iteratively propose and refine feature transformations, guided by data-driven feedback. LLM-FE consistently outperforms state-of-the-art baselines and significantly improves the performance of tabular prediction models. ✨
Article Points:
1. LLM-FE: LLMs + evolutionary search for automated tabular feature engineering.
2. Formulates FE as program search; LLMs propose, data feedback refines transformations.
3. Leverages LLM domain knowledge and iterative data-driven feedback for feature discovery.
4. Consistently outperforms state-of-the-art baselines in tabular prediction tasks.
5. Enhances performance across diverse models: XGBoost, MLP, TabPFN.
6. Domain knowledge, evolutionary search, and feedback are crucial for LLM-FE's impact.
Problem

Traditional FE: limited search space, no domain knowledge

Prior LLM-based FE: direct prompting alone, no insights carried over from earlier attempts

Tabular data: challenging, with a vast combinatorial space of candidate features

Approach

Combines LLMs' domain knowledge & reasoning

Uses evolutionary search for feature optimization

Formulates FE as a program search problem

Iterative generation & data-driven feedback
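
A minimal sketch of how such a loop could look in Python. Everything below is an illustrative assumption, not the paper's actual implementation: the caller supplies the prompt builder, the LLM client, and the data-driven scorer, and the loop keeps scored programs in a multi-population memory that seeds later prompts.

```python
import random

def llm_fe_loop(build_prompt, llm_propose, evaluate, task_description,
                n_iters=20, n_islands=4, k_shots=2, seed=0):
    """Evolutionary feature search driven by an LLM (illustrative skeleton).

    build_prompt(task_description, examples) -> str : structured input prompt
    llm_propose(prompt) -> str                      : candidate feature program
    evaluate(program_code) -> float                 : validation score (reward)
    """
    rng = random.Random(seed)
    # Experience management: multi-population ("island") memory of scored programs.
    islands = [[] for _ in range(n_islands)]
    for _ in range(n_iters):
        island = rng.choice(islands)
        # Seed the prompt with the best-scoring programs from the sampled island.
        examples = sorted(island, key=lambda p: p["score"], reverse=True)[:k_shots]
        prompt = build_prompt(task_description, examples)
        code = llm_propose(prompt)   # LLM proposes a feature transformation
        score = evaluate(code)       # data-driven feedback as the reward
        island.append({"code": code, "score": score})
    # Best program found across all populations.
    return max((p for isl in islands for p in isl), key=lambda p: p["score"])
```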

Key Components

Feature Generation: LLM creates programs

Data-Driven Evaluation: Model performance as reward

Experience Management: Multi-population memory

Structured Input Prompt: Guides LLM
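
For concreteness, here is a hedged sketch of what these components might look like in code. The prompt wording, the `add_features` program (including the column names bmi, age, glucose, insulin), and the `score_program` helper are all assumptions for illustration, not the paper's actual artifacts; the reward is simply cross-validated XGBoost accuracy on the transformed data.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Structured input prompt (illustrative wording): task description, column
# names, and previous high-scoring programs are filled in before each call.
PROMPT_TEMPLATE = """Task: predict {target} from a tabular dataset.
Columns: {columns}
Write a Python function add_features(df) that returns df with new, informative columns.
Previous high-scoring programs:
{examples}
"""

# Feature generation: the kind of program an LLM might emit for a
# diabetes-style dataset (column names are hypothetical).
def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["bmi_age_ratio"] = out["bmi"] / (out["age"] + 1)               # interaction
    out["glucose_per_insulin"] = out["glucose"] / (out["insulin"] + 1e-6)
    out["is_obese"] = (out["bmi"] > 30).astype(int)                    # domain threshold
    return out

# Data-driven evaluation: downstream model performance is the reward signal.
def score_program(feature_fn, X: pd.DataFrame, y: pd.Series) -> float:
    X_new = feature_fn(X)
    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    return cross_val_score(model, X_new, y, cv=5, scoring="accuracy").mean()
```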

Performance

Outperforms SOTA baselines consistently

Enhances XGBoost, MLP, TabPFN models

Effective on classification & regression tasks

Robust to noise, computationally efficient
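
As a usage sketch, the same engineered features can be scored under different downstream predictors. The helper below is hypothetical: it assumes an `add_features`-style program like the one sketched above and sklearn-compatible estimators; TabPFN's classifier from the `tabpfn` package can be slotted in the same way.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

def compare_downstream_models(feature_fn, X, y, cv=5):
    """Cross-validated accuracy of several predictors on the engineered features."""
    X_new = feature_fn(X)
    models = {
        "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
        "mlp": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)),
        # If the `tabpfn` package is installed, its sklearn-style classifier
        # (e.g. TabPFNClassifier()) can be added here in the same way.
    }
    return {name: cross_val_score(m, X_new, y, cv=cv, scoring="accuracy").mean()
            for name, m in models.items()}
```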

Impact

Reduces manual effort, improves predictive power

Generates interpretable, contextually relevant features

Generalizable across models & LLM backbones

Future Directions

Integrate more powerful LLMs

Extend to data cleaning, augmentation

Apply to model tuning and hyperparameter optimization (HPO)