Feature Engineering for Machine Learning

Published: November 2025 | 25 min read

The Art and Science of Feature Engineering

Feature engineering is the process of transforming raw data into features that represent the underlying problem more clearly to a predictive model, with the goal of improving accuracy on unseen data.

Key Concepts

Feature Types: numerical (continuous/discrete), categorical (nominal/ordinal), text/unstructured, time-series/temporal, geospatial

Feature Transformation: normalization/scaling, encoding categorical variables, handling missing values, binning/discretization

Feature Creation: domain-specific features, interaction terms, polynomial features, time-based aggregations
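
Handling missing values and binning are listed under Feature Transformation, so here is a minimal sketch of both using scikit-learn. The toy `customer_age` values, the median strategy, and the choice of three quantile bins are assumptions made for illustration, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy column with two missing values
ages = pd.DataFrame({"customer_age": [22, 35, np.nan, 58, 41, np.nan, 73, 19]})

# Fill missing values with the median of the observed values
imputer = SimpleImputer(strategy="median")
ages_imputed = imputer.fit_transform(ages)

# Discretize into 3 equal-frequency bins, encoded as ordinal integers
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
age_bins = binner.fit_transform(ages_imputed)

print(imputer.statistics_)  # the median used for imputation
print(age_bins.ravel())     # bin index for each row
```

Both steps are fitted transformers, so they can be dropped into a ColumnTransformer and learned from training data only.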

Practical Implementation

import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, 
    KBinsDiscretizer, FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from feature_engine import (
    datetime as dt_engine,
    imputation as imp,
    encoding as enc
)

# Sample data (seeded so the example is reproducible)
np.random.seed(42)
data = {
    'transaction_date': pd.date_range('2023-01-01', periods=100, freq='D'),
    'amount': np.random.normal(100, 20, 100).round(2),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'customer_age': np.random.randint(18, 80, 100),
    'is_fraud': np.random.choice([0, 1], 100, p=[0.95, 0.05])
}
df = pd.DataFrame(data)

# Create time-based features
df['day_of_week'] = df['transaction_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['transaction_date'].dt.month

# Create interaction features
df['amount_per_age'] = df['amount'] / df['customer_age']

# Define preprocessing steps
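numeric_features = ['amount', 'customer_age', 'amount_per_age']
# is_weekend is listed here because ColumnTransformer drops unlisted columns by default
categorical_features = ['category', 'day_of_week', 'month', 'is_weekend']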

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create a feature engineering pipeline
feature_engineering_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    # Add more feature engineering steps as needed
])

# Apply transformations
X = df.drop(['is_fraud', 'transaction_date'], axis=1)
y = df['is_fraud']

X_transformed = feature_engineering_pipeline.fit_transform(X)

# Get feature names after transformation
numeric_features_transformed = feature_engineering_pipeline.named_steps['preprocessor']\
    .named_transformers_['num'].get_feature_names_out(numeric_features)

categorical_features_transformed = feature_engineering_pipeline.named_steps['preprocessor']\
    .named_transformers_['cat'].get_feature_names_out(categorical_features)

all_features = np.concatenate([
    numeric_features_transformed,
    categorical_features_transformed
])

print(f"Total features after transformation: {len(all_features)}")

Advanced Feature Engineering Techniques

1. Target Encoding

from category_encoders import TargetEncoder

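# Initialize target encoder
target_enc = TargetEncoder(cols=['category'])

# Fit and transform. Fitting on the full dataset is for illustration only;
# in practice, fit on the training split to avoid target leakage.
df['category_encoded'] = target_enc.fit_transform(
    df[['category']],
    df['is_fraud']
)['category']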
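
Because a target encoder sees the label, encoding a row with statistics that include that row's own target leaks information. One common workaround is out-of-fold encoding, sketched below with plain pandas and KFold instead of category_encoders. The toy data, the fold count, and the fallback to the fold's overall mean are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data standing in for the transaction frame
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "category": rng.choice(["A", "B", "C", "D"], 200),
    "is_fraud": rng.choice([0, 1], 200, p=[0.8, 0.2]),
})

toy["category_oof"] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(toy):
    fit_part = toy.iloc[fit_idx]
    # Category means come from the other folds only
    fold_means = fit_part.groupby("category")["is_fraud"].mean()
    encoded = toy.iloc[enc_idx]["category"].map(fold_means)
    # Categories unseen in the fitting folds fall back to those folds' overall mean
    encoded = encoded.fillna(fit_part["is_fraud"].mean())
    toy.loc[toy.index[enc_idx], "category_oof"] = encoded.to_numpy()

print(toy[["category", "category_oof"]].head())
```

Each row's encoded value is computed without its own label, which is the property that matters when the encoded column feeds a model trained on the same rows.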

2. Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_transformed, y)

# Get selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = [all_features[i] for i in selected_indices]
print(f"Selected features: {selected_features}")
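
Feature selection that uses the target should be refit inside each cross-validation fold, otherwise the held-out rows influence which features survive. A minimal sketch, assuming synthetic data from make_classification and a logistic regression as a stand-in model:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scaling and selection live inside the pipeline,
# so both are refit on the training part of every fold
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Mean ROC AUC: {scores.mean():.3f}")
```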

3. Time-Series Features

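from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

# The sample data has no customer identifier, so assign a synthetic one
# (illustration only; use the real customer key in practice)
df['customer_id'] = np.random.randint(0, 5, len(df))

# Build rolling windows per customer.
# Only the id, sort, and numeric value columns are passed in,
# which also keeps the is_fraud target out of the features.
df_ts = roll_time_series(
    df[['customer_id', 'transaction_date', 'amount']],
    column_id="customer_id",
    column_sort="transaction_date",
    max_timeshift=30,
    min_timeshift=5
)

# Extract time-series features for each rolled window
features_ts = extract_features(
    df_ts[["id", "transaction_date", "amount"]],
    column_id="id",
    column_sort="transaction_date"
)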
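
For a lighter-weight take on time-based aggregations, plain pandas is often enough. The sketch below builds a lag and a rolling mean per customer, shifting by one row so each feature only sees earlier transactions. The toy frame, the `customer_id` column, and the window of 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions for three customers
rng = np.random.default_rng(0)
tx = pd.DataFrame({
    "customer_id": rng.integers(0, 3, 60),
    "transaction_date": pd.date_range("2023-01-01", periods=60, freq="D"),
    "amount": rng.normal(100, 20, 60).round(2),
}).sort_values(["customer_id", "transaction_date"]).reset_index(drop=True)

# Previous transaction amount for the same customer
tx["prev_amount"] = tx.groupby("customer_id")["amount"].shift(1)

# Mean of up to 3 previous amounts; shift(1) excludes the current row
tx["rolling_mean_3"] = tx.groupby("customer_id")["amount"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(tx.head(6))
```

The first transaction of each customer has no history, so both features are NaN there and need imputation or a model that tolerates missing values.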

Feature Engineering Best Practices

Start Simple: begin with basic features, add complexity gradually, validate each addition

Domain Knowledge: incorporate business insights, understand the data generation process, consult with domain experts

Automation: use feature stores, implement feature versioning, automate feature validation

Monitoring: track feature distributions, monitor feature importance, set up data quality checks
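
One way to track feature distributions is the population stability index (PSI), which compares a newer sample of a feature against a reference sample. The function below is a minimal sketch, not a library API: the name `population_stability_index`, the quantile binning, and the small floor on bin fractions are all assumptions made for illustration.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a newer sample of one feature."""
    # Bin edges come from quantiles of the reference distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip new data into the reference range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids division by zero and log(0) for empty bins
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 20, 5000)  # e.g. training-time amounts
same = rng.normal(100, 20, 5000)       # new data, same distribution
shifted = rng.normal(120, 20, 5000)    # new data, mean shifted upward

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same): {psi_same:.4f}, PSI (shifted): {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats values below about 0.1 as stable and values above about 0.25 as a significant shift, though suitable thresholds depend on the feature and the use case.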

Feature Stores

Modern feature stores help manage the feature engineering lifecycle:

Feast - Open-source feature store
Tecton - Enterprise feature platform
Hopsworks - Open-source feature store
Amazon SageMaker Feature Store - Managed service on AWS

Common Pitfalls

Data Leakage: using future information, improper cross-validation, target leakage

Over-Engineering: creating too many features, complex transformations without justification, ignoring model interpretability

Scalability Issues: high-dimensional feature spaces, inefficient transformations, lack of incremental updates
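
For temporal data, "using future information" often creeps in through shuffled cross-validation. A minimal sketch of a time-ordered split with scikit-learn's TimeSeriesSplit, assuming a hypothetical frame of daily rows already sorted by date:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily rows, already sorted by date
tx_dates = pd.DataFrame({
    "transaction_date": pd.date_range("2023-01-01", periods=100, freq="D")
})

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(tx_dates))

for fold, (train_idx, test_idx) in enumerate(splits):
    train_end = tx_dates["transaction_date"].iloc[train_idx].max()
    test_start = tx_dates["transaction_date"].iloc[test_idx].min()
    # Every training row precedes every test row in each fold
    print(f"Fold {fold}: train through {train_end.date()}, test from {test_start.date()}")
```

Any fitted transformation (scalers, encoders, selectors) should then be fit on the training indices of each fold only.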

Next Steps

Implement automated feature validation
Set up feature monitoring
Explore automated feature engineering
Consider feature store implementation
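
As a starting point for automated feature validation, a few rule-based checks per column go a long way. The sketch below is a hypothetical helper, not part of any library: the function name `validate_features`, the rule keys (`min`, `max`, `max_null_frac`), and the example limits are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def validate_features(df, rules):
    """Return a list of violations; an empty list means all checks passed."""
    problems = []
    for column, rule in rules.items():
        if column not in df.columns:
            problems.append(f"{column}: column missing")
            continue
        series = df[column]
        # Share of missing values, compared against the allowed maximum
        null_frac = series.isna().mean()
        if null_frac > rule.get("max_null_frac", 0.0):
            problems.append(f"{column}: {null_frac:.0%} nulls")
        # Range checks apply to the observed (non-null) values
        observed = series.dropna()
        if "min" in rule and (observed < rule["min"]).any():
            problems.append(f"{column}: value below {rule['min']}")
        if "max" in rule and (observed > rule["max"]).any():
            problems.append(f"{column}: value above {rule['max']}")
    return problems

# Hypothetical rules and a small batch that violates some of them
rules = {
    "amount": {"min": 0, "max_null_frac": 0.05},
    "customer_age": {"min": 18, "max": 120},
}
batch = pd.DataFrame({
    "amount": [50.0, -5.0, 75.0],
    "customer_age": [30, 17, np.nan],
})

problems = validate_features(batch, rules)
for problem in problems:
    print(problem)
```

Checks like these can run before training and before each scoring batch, failing fast when an upstream change breaks a feature.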