Feature-engine
Feature-engine is an open-source Python library that simplifies and standardizes feature engineering for machine learning workflows. It bridges the gap between pandas-based data manipulation and scikit-learn's estimator ecosystem, so that data transformations are reproducible and can be used inside scikit-learn pipelines.
Unlike many scikit-learn preprocessing tools, which return NumPy arrays by default, Feature-engine takes pandas DataFrames as input and returns DataFrames as output. Column names and ordering are preserved throughout transformations, which makes data analysis, debugging, and deployment easier to follow.
Motivation
Feature engineering is one of the most time-consuming and critical stages in machine learning projects. Raw business data is rarely suitable for direct model training. Feature-engine was created to make feature engineering explicit, reusable, and compatible with modern machine learning pipelines.
By encapsulating pandas logic inside transformers that implement fit() and transform(), Feature-engine learns and stores transformation parameters from training data and applies them consistently to new data.
Core Characteristics
Native DataFrame input and output
Existing columns are not renamed or reordered during transformations
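Built-in selection of the variables to transform through each transformer's variables parameter, with no need for auxiliary transformers such as scikit-learn's ColumnTransformer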
Full compatibility with scikit-learn pipelines, grid search, and cross-validation
Automatic detection of numerical, categorical, and datetime variables
Validation and safeguards against invalid transformations
Feature Engineering Capabilities
Missing Data Imputation
Feature-engine provides multiple strategies to handle missing values in numerical and categorical variables, including statistical, arbitrary, random, and indicator-based approaches.
Categorical Encoding
The library supports a wide range of categorical encoding techniques, such as one-hot encoding, ordinal encoding, target-based encodings, and decision-tree-based encoders, enabling effective use of categorical data in machine learning models.
Variable Discretization
Numerical features can be converted into discrete intervals using equal-width, equal-frequency, user-defined, geometric, or decision-tree-based binning strategies.
Outlier Handling
Outliers can be capped or removed using statistically defined thresholds or user-specified limits, helping improve model stability and performance.
Numerical Transformations
Feature-engine includes variance-stabilizing and distribution-shaping transformations such as logarithmic, power, Box–Cox, and Yeo–Johnson transformations.
Feature Creation
New features can be generated through mathematical combinations, relative scaling, cyclical encoding (sine and cosine), or decision-tree-based feature generation.
Datetime Features
Datetime variables can be decomposed into meaningful components or used to compute time-based differences, enabling effective use of temporal information in models.
Feature Selection
The library extends classical feature selection techniques with methods for removing constant, duplicate, or correlated features, as well as advanced model-based and statistical selection strategies.
Time Series and Forecasting
Feature-engine supports transforming time series data into supervised learning tables using lag features, rolling windows, and expanding window aggregations.
Preprocessing and Scaling
Transformers ensure consistent data types and feature alignment between training and inference data, while allowing selective application of scikit-learn scalers where needed.
Open Source and Community
Feature-engine is released under the BSD 3-Clause license and developed openly on GitHub. The project welcomes contributions across code, documentation, testing, and knowledge sharing, and is supported by a growing global community of data scientists and engineers.
Typical Use Cases
Feature engineering for machine learning models
Reproducible data preprocessing pipelines
Data science research and experimentation
Production-grade ML systems
Educational and training environments