Skip to content

Feature Engineering for Machine Learning

Published: November 2025 | 25 min read

The Art and Science of Feature Engineering

Feature engineering is the process of transforming raw data into meaningful features that better represent the underlying problem to predictive models, resulting in improved model accuracy on unseen data.

Key Concepts

  1. Feature Types
  2. Numerical (continuous/discrete)
  3. Categorical (nominal/ordinal)
  4. Text/Unstructured
  5. Time-series/Temporal
  6. Geospatial

  7. Feature Transformation

  8. Normalization/Scaling
  9. Encoding categorical variables
  10. Handling missing values
  11. Binning/Discretization

  12. Feature Creation

  13. Domain-specific features
  14. Interaction terms
  15. Polynomial features
  16. Time-based aggregations

Practical Implementation

import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, 
    KBinsDiscretizer, FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from feature_engine import (
    datetime as dt_engine,
    imputation as imp,
    encoding as enc
)

# Sample data
data = {
    'transaction_date': pd.date_range('2023-01-01', periods=100, freq='D'),
    'amount': np.random.normal(100, 20, 100).round(2),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'customer_age': np.random.randint(18, 80, 100),
    'is_fraud': np.random.choice([0, 1], 100, p=[0.95, 0.05])
}
df = pd.DataFrame(data)

# Create time-based features
df['day_of_week'] = df['transaction_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['transaction_date'].dt.month

# Create interaction features
df['amount_per_age'] = df['amount'] / df['customer_age']

# Define preprocessing steps
numeric_features = ['amount', 'customer_age', 'amount_per_age']
categorical_features = ['category', 'day_of_week', 'month']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create a feature engineering pipeline
feature_engineering_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    # Add more feature engineering steps as needed
])

# Apply transformations
X = df.drop(['is_fraud', 'transaction_date'], axis=1)
y = df['is_fraud']

X_transformed = feature_engineering_pipeline.fit_transform(X)

# Get feature names after transformation
numeric_features_transformed = feature_engineering_pipeline.named_steps['preprocessor']\
    .named_transformers_['num'].get_feature_names_out(numeric_features)

categorical_features_transformed = feature_engineering_pipeline.named_steps['preprocessor']\
    .named_transformers_['cat'].get_feature_names_out(categorical_features)

all_features = np.concatenate([
    numeric_features_transformed,
    categorical_features_transformed
])

print(f"Total features after transformation: {len(all_features)}")

Advanced Feature Engineering Techniques

1. Target Encoding

from category_encoders import TargetEncoder

# Initialize target encoder
target_enc = TargetEncoder(cols=['category'])

# Fit and transform
df['category_encoded'] = target_enc.fit_transform(
    df['category'], 
    df['is_fraud']
)

2. Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_transformed, y)

# Get selected feature indices
selected_indices = selector.get_support(indices=True)
selected_features = [all_features[i] for i in selected_indices]
print(f"Selected features: {selected_features}")

3. Time-Series Features

from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series

# Create time-series features
df_ts = roll_time_series(
    df, 
    column_id="customer_id",
    column_sort="transaction_date",
    max_timeshift=30,
    min_timeshift=5
)

# Extract time-series features
features_ts = extract_features(
    df_ts.drop("is_fraud", axis=1),
    column_id="id", 
    column_sort="transaction_date"
)

Feature Engineering Best Practices

  1. Start Simple
  2. Begin with basic features
  3. Add complexity gradually
  4. Validate each addition

  5. Domain Knowledge

  6. Incorporate business insights
  7. Understand the data generation process
  8. Consult with domain experts

  9. Automation

  10. Use feature stores
  11. Implement feature versioning
  12. Automate feature validation

  13. Monitoring

  14. Track feature distributions
  15. Monitor feature importance
  16. Set up data quality checks

Feature Stores

Modern feature stores help manage the feature engineering lifecycle:

  1. Feast - Open source feature store
  2. Tecton - Enterprise feature platform
  3. Hopsworks - Open-source feature store
  4. AWS Feature Store - Managed service

Common Pitfalls

  1. Data Leakage
  2. Using future information
  3. Improper cross-validation
  4. Target leakage

  5. Over-Engineering

  6. Creating too many features
  7. Complex transformations without justification
  8. Ignoring model interpretability

  9. Scalability Issues

  10. High-dimensional feature spaces
  11. Inefficient transformations
  12. Lack of incremental updates

Next Steps

  1. Implement automated feature validation
  2. Set up feature monitoring
  3. Explore automated feature engineering
  4. Consider feature store implementation