Postdoctoral Researcher

Alireza Olama

Information Technology Department
Faculty of Science and Engineering
Åbo Akademi University, Vaasa, Finland


Research Focus

  • Machine Learning Systems
  • Distributed Training
  • Sparsity & Pruning
  • GPU/HPC Computing
  • Energy-Efficient AI
  • Federated Learning

News & Updates

Recent activities, publications, and announcements.

Feb 2025

Paper submitted to ACM ICS 2026

PruneX: A Hierarchical Communication-Efficient System for Distributed CNN Training with Structured Pruning

Feb 2025

Paper submitted to ICLR 2026

Federated Learning With ℓ0 Constraint Via Probabilistic Gates For Sparsity

Spring 2025

Teaching: Parallel Programming & GPU Programming

New courses at Åbo Akademi University with hands-on access to LUMI, Puhti, and Mahti supercomputers

Dec 2023

Joined Åbo Akademi University

Started as Postdoctoral Researcher in Information Technology Department

About Me

Postdoctoral researcher focused on machine learning systems (MLSys) for energy-efficient, large-scale AI. My work addresses communication, memory, and energy bottlenecks in distributed training and inference of foundation models.

Biography

I design and implement scalable ML systems on HPC platforms, bridging algorithmic sparsity with distributed execution models. My research vision is that performance optimization is energy optimization: reducing communication, memory traffic, and redundant computation translates directly into sustainability gains for AI infrastructure.

Education

  • Ph.D. in Automation & Systems Engineering
    Federal University of Santa Catarina (UFSC), Brazil, 2023
  • M.Sc. in Electrical & Computer Engineering
    Shiraz University of Technology (SUTECH), Iran, 2017

Research Interests

  • Energy-efficient distributed training
  • Sparsity-aware optimization & structured pruning
  • Hardware–software co-design for inference
  • Communication-efficient parallelism
  • Federated & decentralized learning
  • GPU programming & HPC systems

Technical Skills

Deep expertise in high-performance computing, distributed systems, and ML infrastructure for building scalable, energy-efficient AI systems.

Programming Languages

  • Python: Proficient
  • C/C++: Proficient
  • CUDA: Proficient
  • HIP: Basic
  • Bash: Proficient

AI & Machine Learning

  • PyTorch: Proficient
  • Distributed PyTorch (DDP/FSDP): Proficient
  • Mixed-Precision Training (AMP): Proficient

Distributed Computing & HPC

  • Slurm: Proficient
  • Multi-Node, Multi-GPU Training: Proficient
  • NCCL, MPI, OpenMP: Proficient

AI Inference & Performance

  • ONNX & TensorRT: Basic
  • Pruning (Structured, N:M): Proficient
  • Quantization: Proficient
  • TVM & ML Compilers: Basic

GPU Kernel Optimization

  • Memory Hierarchy Optimization: Proficient
  • Nsight Systems / Nsight Compute: Proficient
  • Roofline & Bottleneck Analysis: Proficient
  • Custom CUDA Kernels (PyTorch C++): Proficient

Software Engineering

  • Git / GitHub: Proficient
  • Docker: Basic
  • CI/CD: Basic
  • Linux & HPC Ecosystems: Proficient

Research Timeline

2015–2017

M.Sc.

Shiraz Univ. of Tech., Iran

2019–2023

Ph.D.

UFSC, Brazil

2021–2022

Visiting Researcher

Univ. of Bologna, Italy

2023–Present

Postdoc

Åbo Akademi, Finland

Research Interests

My research focuses on making AI systems faster, more efficient, and scalable, from training foundation models to deploying them in production.


Efficient Training of Foundation Models

Scalable training strategies for LLMs and vision transformers using parallelism techniques:

  • Data Parallelism: Distributed gradient computation across devices
  • Tensor Parallelism: Splitting model layers across GPUs
  • Pipeline Parallelism: Micro-batch pipelining for deep models
  • Mixed Precision: FP16/BF16 training with loss scaling
  • ZeRO Optimization: Memory-efficient optimizer states
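The data-parallel pattern in the list above can be sketched in a few lines: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce step), and every worker applies the same update. This is a minimal framework-free simulation; the toy loss, data, and worker count are illustrative assumptions, not part of any real training stack.

```python
# Simulate data parallelism: each worker computes a gradient on its own
# shard, then gradients are all-reduced (averaged) before a shared update.

def local_gradient(w, shard):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 over one shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in practice
    avg = sum(grads) / len(grads)                   # all-reduce: average gradients
    return w - lr * avg                             # identical update on all workers

# Two workers, each holding a shard of y = 2x data.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # converges toward 2.0
```

Because every worker sees the same averaged gradient, all replicas stay bit-identical, which is exactly what synchronous data parallelism (e.g. DDP) guarantees at scale.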

Efficient Inference

Reducing computational and memory costs for model deployment:

  • Sparsity: Structured/unstructured pruning, N:M patterns
  • Quantization: INT8/INT4 weights, activation quantization
  • Knowledge Distillation: Teacher-student model compression
  • Sparse Tensor Cores: Hardware-accelerated sparse ops
  • Dynamic Inference: Early exit, adaptive computation
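The N:M pattern mentioned above is easy to illustrate for the common 2:4 case: every contiguous group of four weights keeps only its two largest-magnitude entries, which is the layout sparse tensor cores accelerate. A pure-Python sketch (magnitude-based selection is one standard criterion, used here for illustration):

```python
def prune_2_of_4(weights):
    """Zero out the 2 smallest-magnitude weights in every group of 4 (2:4 sparsity)."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

w = [0.1, -0.9, 0.05, 0.4, -0.3, 0.2, 0.8, -0.01]
print(prune_2_of_4(w))  # [0.0, -0.9, 0.0, 0.4, -0.3, 0.0, 0.8, 0.0]
```

The fixed 50% sparsity per group is what makes the pattern hardware-friendly: the kernel knows exactly how many nonzeros to expect in each block.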

Performance Optimization of AI Systems

System-level optimizations for maximum hardware utilization:

  • Communication Efficiency: Gradient compression, overlap
  • Kernel Fusion: Reducing memory bandwidth bottlenecks
  • Memory Optimization: Activation checkpointing, offloading
  • Hardware Profiling: CUDA profiling, roofline analysis
  • Compiler Optimization: XLA, TorchInductor, Triton
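The roofline analysis listed above reduces to one comparison: a kernel's arithmetic intensity (FLOPs per byte moved) against the machine balance. A toy calculation, with hardware numbers that are rough A100-class illustrations, not a profile of any specific device:

```python
def attainable_gflops(flops, bytes_moved, peak_gflops, mem_bw_gbs):
    """Roofline model: attainable performance is the lesser of peak compute
    and memory bandwidth times arithmetic intensity."""
    intensity = flops / bytes_moved          # FLOPs per byte
    return min(peak_gflops, mem_bw_gbs * intensity)

# Illustrative A100-like numbers: ~19500 GFLOP/s FP32, ~1555 GB/s HBM.
PEAK, BW = 19500.0, 1555.0

# Elementwise add: 1 FLOP per 12 bytes (two fp32 loads + one store) -> memory-bound.
add = attainable_gflops(1, 12, PEAK, BW)
# Well-tiled matmul with heavy reuse, say 64 FLOPs per byte -> compute-bound.
mm = attainable_gflops(64, 1, PEAK, BW)

print(round(add, 1), round(mm, 1))
```

The gap between the two cases is why kernel fusion pays off: fusing elementwise ops into a producer kernel raises arithmetic intensity and moves the workload up the memory-bound slope of the roofline.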

Distributed Optimization

Algorithms for decentralized and federated machine learning:

  • ADMM: Alternating direction method of multipliers
  • Distributed SGD: Synchronous/asynchronous variants
  • Consensus Optimization: Decentralized averaging methods
  • Federated Learning: Privacy-preserving distributed training
  • Sparse Optimization: L0/L1 regularized distributed problems
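The consensus-averaging idea above can be sketched as repeated mixing with a doubly stochastic matrix: on a ring, each agent averages its value with its two neighbors, and all values converge to the network mean without any central coordinator. The ring topology and uniform 1/3 weights are illustrative choices.

```python
def consensus_step(x, weight=1.0 / 3.0):
    """One round of decentralized averaging on a ring: each agent mixes its
    value with both neighbors using equal (doubly stochastic) weights."""
    n = len(x)
    return [weight * (x[(i - 1) % n] + x[i] + x[(i + 1) % n]) for i in range(n)]

x = [0.0, 4.0, 8.0, 12.0]        # initial local values, network mean = 6.0
for _ in range(50):
    x = consensus_step(x)
print([round(v, 3) for v in x])  # all entries approach 6.0
```

Because the mixing weights are doubly stochastic, the mean is preserved at every round, and the deviation from the mean contracts geometrically, which is the basic convergence argument behind decentralized averaging methods.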

Selected Publications

15 publications spanning machine learning systems, distributed optimization, and control theory.

[ML] Distributed ℓ₀ Sparse Aggregative Optimization
Olama, A., Carnevale, G., Notarstefano, G. & Camponogara, E.
IEEE CASE, 2024

[ML] A Tracking Augmented Lagrangian Method for ℓ₀ Sparse Consensus Optimization
Olama, A., Carnevale, G., Notarstefano, G. & Camponogara, E.
CoDIT, 2023

[OPT] Sparse Convex Optimization Toolkit: A Mixed-Integer Framework
Olama, A., Camponogara, E. & Kronqvist, J.
Optimization Methods and Software, 38(6), 2023

[OPT] A Distributed Primal Outer Approximation Algorithm for Sparse Convex Programming (DiPOA)
Olama, A., Camponogara, E. & Mendes, P.R.C.
Journal of Global Optimization, 86(3), 2023

[OPT] Relaxed Hybrid Consensus ADMM for Distributed Convex Optimisation with Coupling Constraints
Olama, A., Bastianello, N., Mendes, P.R.C. & Camponogara, E.
IET Control Theory & Applications, 2019

[CTRL] Lyapunov-based Hybrid Model Predictive Control for Energy Management of Microgrids
Olama, A., Mendes, P.R.C. & Camacho, E.F.
IET Generation, Transmission & Distribution, 2018

Talks & Presentations

Conference presentations, invited talks, and workshop lectures.

Workshop

HPC Workshop on Mahti Supercomputer

CSC Finland · Spring 2025

Hands-on parallel programming and performance optimization techniques for graduate students and researchers.

Conference

PruneX: Hierarchical Communication-Efficient Distributed Training

ACM ICS 2026 (Submitted)

Structured pruning co-designed with cluster topology for scalable multi-node GPU training.

Conference

Distributed ℓ₀ Sparse Aggregative Optimization

IEEE CASE 2024

Sparse consensus optimization in multi-agent networked systems.

Conference

Tracking Augmented Lagrangian Method for ℓ₀ Sparse Consensus

CoDIT 2023, Rome, Italy

Novel tracking-based approach for cardinality-constrained distributed optimization.

Lecture Series

GPU Programming for AI Systems

Åbo Akademi University · 2025+

Graduate course covering CUDA programming, memory optimization, and kernel development.

Teaching

Responsible for designing, modernizing, and delivering courses at Åbo Akademi University, with integration of national CSC HPC infrastructures (Puhti, Mahti, LUMI).

Åbo Akademi Courses

  • Database Systems (Bachelor) · Winter 2024
  • Parallel Programming (Master) · Spring 2025+
    Focus on data-center computing and AI
  • GPU Programming (Master) · Autumn 2025+
    CUDA, performance optimization

Workshops & Resources

  • HPC Workshop on Mahti · Spring 2025
    Parallel programming and performance optimization
  • HPC4AI YouTube Channel
    Recorded lectures and course materials
  • Hands-on HPC Labs
    Using Puhti, Mahti, LUMI supercomputers

Open-Source Software

Research software for distributed machine learning, sparse optimization, and HPC.

Research Impact

Citation metrics and academic footprint.

  • Publications: 15 (full list on Google Scholar)
  • h-index: 5 (as of 2026)
  • Students supervised: BSc, MSc, and PhD theses

Last updated: March 2026 · View Google Scholar Profile

Get in Touch

Open to collaboration on energy-efficient ML systems, distributed training, and HPC research.

Location

Vaasa, Finland

Affiliation

Åbo Akademi University
Information Technology Department