Workflow Automation Templates
A library of ready-to-use workflow templates to accelerate your data journey

Classification Algorithm
Predict diabetes with logistic regression

Overview
This workflow builds a classification model using the Logistic Regression algorithm to predict whether a patient is diabetic based on clinical measurements. It covers data cleaning, feature preparation, model training, scoring, and saving the trained model for future use.
Details
The workflow begins by importing patient data using the Read CSV node. It then preprocesses the data with the Drop Rows With Null and Row Filter nodes, which improve data quality by removing records that contain missing values and records that are not relevant to the analysis.
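The cleaning steps above can be sketched in plain Python to show what each node contributes. This is only an illustration, not the platform's implementation: the column names (Glucose, BMI, Age, Outcome), the sample values, and the filter rule (treating a glucose reading of 0 as invalid) are all assumptions, since the template does not list its schema or filter condition.

```python
import csv
import io

# Hypothetical sample data; the column names and values are assumptions,
# not the schema used by the template.
SAMPLE_CSV = """Glucose,BMI,Age,Outcome
148,33.6,50,1
85,26.6,31,0
,23.3,32,1
0,28.1,21,0
137,43.1,33,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (stands in for Read CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_rows_with_null(rows):
    """Keep only rows in which every field has a non-empty value."""
    return [
        row for row in rows
        if all(value is not None and value.strip() != "" for value in row.values())
    ]


def row_filter(rows, predicate):
    """Keep only rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]


rows = read_rows(SAMPLE_CSV)
complete = drop_rows_with_null(rows)
# Assumed rule: a glucose reading of 0 is not a plausible measurement.
cleaned = row_filter(complete, lambda row: float(row["Glucose"]) > 0)

print(len(rows), len(complete), len(cleaned))  # prints: 5 4 3
```

Running the null check before the filter means the filter predicate can convert fields to numbers without guarding against empty strings.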
Next, the cleaned dataset is transformed using the Vector Assembler node to combine multiple clinical features into a single feature vector. The dataset is split into training and testing sets with the Split node.
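As a rough sketch of the feature preparation described above, the following plain-Python example assembles several columns into one feature vector per record and then splits the records into training and testing sets. The column names, sample values, 80/20 ratio, and seed are assumptions made for illustration. The shuffle-and-cut split shown here is a simple stand-in and may differ from how the Split node partitions data.

```python
import random

# Hypothetical cleaned records; field names and values are assumptions.
records = [
    {"Glucose": 148.0, "BMI": 33.6, "Age": 50.0, "Outcome": 1},
    {"Glucose": 85.0, "BMI": 26.6, "Age": 31.0, "Outcome": 0},
    {"Glucose": 183.0, "BMI": 23.3, "Age": 32.0, "Outcome": 1},
    {"Glucose": 89.0, "BMI": 28.1, "Age": 21.0, "Outcome": 0},
    {"Glucose": 137.0, "BMI": 43.1, "Age": 33.0, "Outcome": 1},
    {"Glucose": 116.0, "BMI": 25.6, "Age": 30.0, "Outcome": 0},
    {"Glucose": 78.0, "BMI": 31.0, "Age": 26.0, "Outcome": 1},
    {"Glucose": 115.0, "BMI": 35.3, "Age": 29.0, "Outcome": 0},
    {"Glucose": 197.0, "BMI": 30.5, "Age": 53.0, "Outcome": 1},
    {"Glucose": 125.0, "BMI": 27.4, "Age": 54.0, "Outcome": 1},
]

FEATURE_COLUMNS = ["Glucose", "BMI", "Age"]  # assumed input columns


def assemble(record, columns):
    """Combine the selected columns into a single feature vector."""
    return [float(record[column]) for column in columns]


def split(rows, train_fraction, seed):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


assembled = [
    {"features": assemble(record, FEATURE_COLUMNS), "label": record["Outcome"]}
    for record in records
]
train_rows, test_rows = split(assembled, train_fraction=0.8, seed=42)

print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, which makes it easier to compare models trained on different runs.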
The Logistic Regression node trains a model on the training data to classify patients as diabetic (1) or non-diabetic (0). The Predict node then scores the test data, and the Drop Columns and Print N Rows nodes trim the output and display the results. Finally, the trained model is stored using the Spark ML Model Save node for reuse in production or evaluation.
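To make the train, predict, and save sequence concrete, here is a minimal sketch of binary logistic regression trained with batch gradient descent in plain Python. It is not the algorithm implementation used by the Logistic Regression node. The toy data (two features already scaled to the 0 to 1 range), the learning rate, the epoch count, and the JSON file format are all assumptions; the Spark ML Model Save node would write a Spark ML model rather than a JSON file.

```python
import json
import math
import os
import tempfile


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(model, x):
    """Return the predicted probability of class 1 for one feature vector."""
    z = sum(w * xi for w, xi in zip(model["weights"], x)) + model["bias"]
    return sigmoid(z)


def train(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    model = {"weights": [0.0] * len(features[0]), "bias": 0.0}
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(model["weights"])
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = score(model, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        model["weights"] = [
            w - learning_rate * g / n for w, g in zip(model["weights"], grad_w)
        ]
        model["bias"] -= learning_rate * grad_b / n
    return model


def predict(model, x):
    """Classify as 1 (diabetic) or 0 (non-diabetic) at a 0.5 threshold."""
    return 1 if score(model, x) >= 0.5 else 0


# Hypothetical, pre-scaled toy data; not real patient measurements.
train_x = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.15, 0.2], [0.85, 0.8]]

model = train(train_x, train_y)
predictions = [predict(model, x) for x in test_x]

# Show the first N scored rows, keeping only the columns of interest.
for x, label in list(zip(test_x, predictions))[:2]:
    print({"features": x, "prediction": label})

# Save the model and load it back (JSON is a stand-in storage format).
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "model.json")
    with open(path, "w") as handle:
        json.dump(model, handle)
    with open(path) as handle:
        loaded = json.load(handle)
```

Saving the fitted parameters separately from the training code is what lets a later job score new records without retraining.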
This workflow demonstrates a complete end-to-end classification pipeline for medical data analysis, offering a foundation for predictive healthcare applications.