Feature-engine

Feature-engine is an open-source Python library for feature engineering that works natively with pandas DataFrames and integrates seamlessly with scikit-learn pipelines.

Feature-engine is an open-source Python library designed to simplify and standardize feature engineering for machine learning workflows. It bridges the gap between pandas-based data manipulation and scikit-learn’s estimator ecosystem, enabling robust, reproducible, and pipeline-compatible data transformations.

Unlike scikit-learn transformers, which return NumPy arrays by default, Feature-engine takes pandas DataFrames as input and returns DataFrames. Column names and ordering are preserved throughout transformations, which makes data analysis, debugging, and deployment more transparent.

Motivation

Feature engineering is one of the most time-consuming and critical stages in machine learning projects. Raw business data is rarely suitable for direct model training. Feature-engine was created to make feature engineering explicit, reusable, and compatible with modern machine learning pipelines.

By encapsulating pandas logic inside transformers that implement fit() and transform(), Feature-engine learns and stores transformation parameters from training data and applies them consistently to new data.
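The fit()/transform() pattern described above can be sketched with a small hand-written transformer. This is not Feature-engine's own code: MedianFiller, its medians_ attribute, and the toy data are hypothetical names chosen only to show how a parameter is learned from training data and then reused on new data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MedianFiller(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: learns medians in fit(), reuses them in transform()."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn and store one median per selected column from the training data.
        self.medians_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        X[self.variables] = X[self.variables].fillna(self.medians_)
        return X


train = pd.DataFrame({"age": [20.0, 30.0, None, 40.0], "city": ["A", "B", "A", "C"]})
new = pd.DataFrame({"age": [None, 55.0], "city": ["B", "A"]})

filler = MedianFiller(variables=["age"]).fit(train)
new_filled = filler.transform(new)  # the gap in `new` is filled with the training median
```

The median comes from the training data only, so new data is always filled with the same stored value.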

Core Characteristics

  • Native DataFrame input and output

  • No column renaming or reordering during transformations

  • Built-in selection of the variables each transformer acts on, through a variables parameter, with no need for auxiliary transformers

  • Full compatibility with scikit-learn pipelines, grid search, and cross-validation

  • Automatic detection of numerical, categorical, and datetime variables

  • Validation and safeguards against invalid transformations

Feature Engineering Capabilities

Missing Data Imputation

Feature-engine provides multiple strategies to handle missing values in numerical and categorical variables, including statistical, arbitrary, random, and indicator-based approaches.

Categorical Encoding

The library supports a wide range of categorical encoding techniques, such as one-hot encoding, ordinal encoding, target-based encodings, and decision-tree-based encoders, enabling effective use of categorical data in machine learning models.

Variable Discretization

Numerical features can be converted into discrete intervals using equal-width, equal-frequency, user-defined, geometric, or decision-tree-based binning strategies.

Outlier Handling

Outliers can be capped or removed using statistically defined thresholds or user-specified limits, helping improve model stability and performance.

Numerical Transformations

Feature-engine includes variance-stabilizing and distribution-shaping transformations such as logarithmic, power, Box–Cox, and Yeo–Johnson transformations.

Feature Creation

New features can be generated through mathematical combinations, relative scaling, cyclical encoding (sine and cosine), or decision-tree-based feature generation.

Datetime Features

Datetime variables can be decomposed into meaningful components or used to compute time-based differences, enabling effective use of temporal information in models.

Feature Selection

The library extends classical feature selection techniques with methods for removing constant, duplicate, or correlated features, as well as advanced model-based and statistical selection strategies.

Time Series and Forecasting

Feature-engine supports transforming time series data into supervised learning tables using lag features, rolling windows, and expanding window aggregations.

Preprocessing and Scaling

Transformers ensure consistent data types and feature alignment between training and inference data, while allowing selective application of scikit-learn scalers where needed.

Open Source and Community

Feature-engine is released under the BSD 3-Clause license and developed openly on GitHub. The project welcomes contributions across code, documentation, testing, and knowledge sharing, and is supported by a growing global community of data scientists and engineers.

Typical Use Cases

  • Feature engineering for machine learning models

  • Reproducible data preprocessing pipelines

  • Data science research and experimentation

  • Production-grade ML systems

  • Educational and training environments

Team

Soledad Galli

Train In Data