Distributed Text Summarization System

Multi-GPU training pipeline for abstractive news summarization using transformer models.

Distributed Systems · NLP · Model Training

Problem statement

Training large summarization models on long-form news articles required distributed computation and careful data handling to keep preprocessing and data loading from becoming bottlenecks.

Architecture overview

A BART-based summarization model was trained on the CNN/DailyMail dataset using multi-GPU distributed training on an HPCC cluster. Data preprocessing and PII redaction were automated with spaCy and Bash-based workflows.
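As a rough sketch of what such a training loop might look like (the writeup itself does not include code), the snippet below uses PyTorch DistributedDataParallel with a Hugging Face BART checkpoint and the cnn_dailymail dataset. The checkpoint name, batch size, sequence lengths, and learning rate are illustrative assumptions, not the project's actual configuration.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler
    from transformers import BartForConditionalGeneration, BartTokenizer
    from datasets import load_dataset

    def main():
        # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each worker process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
        model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        raw = load_dataset("cnn_dailymail", "3.0.0", split="train")

        def tokenize(batch):
            inputs = tokenizer(batch["article"], max_length=1024,
                               truncation=True, padding="max_length")
            targets = tokenizer(batch["highlights"], max_length=128,
                                truncation=True, padding="max_length")
            # In practice pad tokens in the labels are usually replaced with -100
            # so they are ignored by the loss; omitted here for brevity.
            inputs["labels"] = targets["input_ids"]
            return inputs

        dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
        dataset.set_format("torch")

        # DistributedSampler shards the data so each GPU sees a disjoint slice per epoch
        sampler = DistributedSampler(dataset)
        loader = DataLoader(dataset, batch_size=4, sampler=sampler)

        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
        model.train()
        for epoch in range(3):
            sampler.set_epoch(epoch)  # reshuffle the shards each epoch
            for batch in loader:
                batch = {k: v.cuda(local_rank) for k, v in batch.items()}
                loss = model(**batch).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

A script like this would be launched with torchrun, e.g. torchrun --nproc_per_node=4 train.py, with one process per GPU.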

Technical decisions & tradeoffs

  • Used distributed training to reduce wall-clock time at the cost of orchestration complexity.
  • Maintained separate masked and unmasked datasets to evaluate privacy tradeoffs (a redaction sketch follows this list).
  • Focused on pipeline reliability over rapid iteration.
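The masking step used to build the redacted dataset can be illustrated with a short spaCy sketch. The en_core_web_sm model, the set of entity labels treated as PII, and the placeholder format are assumptions for illustration, not the project's exact configuration.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    PII_LABELS = {"PERSON", "GPE", "ORG", "DATE"}  # assumed set of sensitive entity types

    def redact(text: str) -> str:
        """Replace named entities considered PII with placeholder tokens."""
        doc = nlp(text)
        redacted = text
        # Replace entities from the end of the string so character offsets stay valid
        for ent in reversed(doc.ents):
            if ent.label_ in PII_LABELS:
                redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
        return redacted

    print(redact("John Smith filed the report in London on Tuesday."))
    # e.g. "[PERSON] filed the report in [GPE] on [DATE]."

Running the same training pipeline on the redacted and unredacted corpora then makes it possible to compare summarization quality against the privacy gained from masking.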

Lessons learned

Distributed training pipelines are systems problems first. Data movement, synchronization, and failure recovery matter as much as model architecture.