StringIndexer and OneHotEncoder

Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

Convert categorical data into numeric form

Overview

This workflow demonstrates how to encode categorical variables for machine learning by using StringIndexer and OneHotEncoder in Spark. It transforms text-based or discrete categorical features into numerical representations suitable for modeling.

Details

The housing dataset is loaded first. The StringIndexer node then converts categorical columns, such as the number of bedrooms and bathrooms, into numeric indices. These indices are passed to the OneHotEncoder node, which creates binary vector representations so that models do not read an ordinal relationship into the index values.
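For readers who want to reproduce the two nodes in code, the following is a minimal PySpark sketch, not the template's actual implementation. It assumes Spark 3.0 or later (where both classes accept `inputCols`/`outputCols`) and a DataFrame that is already loaded. The function name `encode_categoricals`, the helper `suffixed`, and the `_index`/`_onehot` column suffixes are hypothetical names chosen for illustration.

```python
from typing import List


def suffixed(cols: List[str], suffix: str) -> List[str]:
    """Derive output column names, e.g. 'bedrooms' -> 'bedrooms_index'."""
    return [f"{c}_{suffix}" for c in cols]


def encode_categoricals(df, input_cols: List[str]):
    """Index, then one-hot encode, the given columns of a Spark DataFrame."""
    # Imported here so the naming helper above works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    index_cols = suffixed(input_cols, "index")    # hypothetical naming scheme
    onehot_cols = suffixed(input_cols, "onehot")  # hypothetical naming scheme

    # Stage 1: map each distinct value to a numeric index
    # (by default the most frequent value gets index 0.0).
    indexer = StringIndexer(inputCols=input_cols, outputCols=index_cols)

    # Stage 2: turn each index into a sparse binary vector.
    encoder = OneHotEncoder(inputCols=index_cols, outputCols=onehot_cols)

    # Both stages are estimators, so they are fitted before transforming.
    pipeline = Pipeline(stages=[indexer, encoder])
    return pipeline.fit(df).transform(df)
```

Assuming the housing DataFrame has columns named `bedrooms` and `bathrooms` (the actual column names are not given here), `encode_categoricals(df, ["bedrooms", "bathrooms"]).show(5)` would print the first rows with both the indexed and the one-hot columns appended.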

The Print N Rows nodes display the encoded outputs so that the indexed and one-hot encoded data can be compared side by side. This preprocessing makes the data usable by algorithms that require numerical input and can improve model accuracy.
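To give a sense of what the indexed and one-hot outputs look like next to each other, here is a plain-Python sketch that imitates the documented default behavior of the two Spark stages: indices assigned by descending frequency, and the last category dropped from the one-hot vector. It is an illustration only. Spark itself emits the index as a double and the encoding as a sparse vector, and the sample values below are made up.

```python
from collections import Counter
from typing import Dict, List


def fit_string_index(values: List[str]) -> Dict[str, int]:
    """Assign indices by descending frequency; ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index: int, num_categories: int, drop_last: bool = True) -> List[int]:
    """Binary vector for one index; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec


bedrooms = ["3", "2", "3", "4", "3", "2"]  # made-up sample values
mapping = fit_string_index(bedrooms)

# Print value, index, and one-hot vector for each row.
for value in bedrooms:
    idx = mapping[value]
    print(value, idx, one_hot(idx, len(mapping)))
```

In this sample the most frequent value, `"3"`, receives index 0 and the rarest, `"4"`, receives the last index, which the one-hot step represents as an all-zero vector because the final category is dropped by default.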
