Humber Polytechnic × University of Geneva

Synthetic Dataset Generator for
Time-Resolved Spectroscopy

A rigorous framework implementing 18 first-order reaction mechanisms across 7 topological superclasses, bridging theoretical chemical kinetics and practical machine learning for automated mechanism identification from spectroscopic data.

18

Mechanisms

Curated catalogue

15,192

Spectra

Production dataset

50–100×

Speedup

Eigenvalue solver

0.98+

Pearson r

Validation score


Authors

Bermúdez, S., Noriega Cedeño, J. R., Pavel, J. A., and Fernández-Terán, R.

Faculty of Applied Science & Technology, Humber Polytechnic, Toronto • Department of Physical Chemistry, University of Geneva

Abstract

We present the architecture, methodology, and practical implementation of a synthetic dataset generator for time-resolved UV-visible spectroscopy designed for training and evaluating machine learning models. The generator implements the rigorous mathematical framework established by Fernández-Terán and colleagues, incorporating the bilinear model of spectrokinetics, matrix exponential kinetics, analytical instrument response function convolution, and realistic multi-component noise injection.

The system encompasses a curated catalogue of 18 first-order reaction mechanisms organized into seven topological superclasses, enabling generation of physically grounded synthetic spectra that capture the full diversity of kinetic behaviours observed in experimental systems. Key methodological advances include eigenvalue decomposition for computational efficiency (50–100× speedup), mechanism-aware quality scoring, and comprehensive numerical stability safeguards.

Time-Resolved Spectroscopy • Chemical Kinetics • Synthetic Data Generation • Machine Learning • Matrix Exponential • IRF Convolution • Mechanism Classification • Eigenvalue Decomposition

Theoretical Foundations

Mathematical Framework

Built upon rigorous spectrokinetic theory, the generator implements analytical solutions for computational efficiency and numerical precision.

Eq. (1)

Bilinear Model of Spectrokinetics

Time-resolved spectra decompose into the product of concentration and spectral matrices, assuming spectral time-invariance and well-defined chemical intermediates.

C ∈ ℝ^{n_t × N} contains time-dependent concentrations of N species; S ∈ ℝ^{N × n_λ} contains species-associated spectra.

$$D(t, \lambda) = C(t) \cdot S(\lambda) + E$$
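As a minimal sketch of the bilinear model, the snippet below assembles a data matrix D from a concentration matrix C, a spectral matrix S, and an additive noise term E. The rate constant, band centres, band widths, and noise level are illustrative assumptions, not values taken from the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

n_t, n_lam, N = 250, 250, 2               # grid sizes and species count (illustrative)
t = np.logspace(-3, 3, n_t)               # time axis in seconds
lam = np.linspace(350.0, 750.0, n_lam)    # wavelength axis in nm

# Concentrations for a simple A -> B decay; k is a hypothetical rate constant.
k = 0.5
C = np.column_stack([np.exp(-k * t), 1.0 - np.exp(-k * t)])   # shape (n_t, N)


def gaussian_band(center, width):
    """Single Gaussian absorption band on the wavelength axis."""
    return np.exp(-0.5 * ((lam - center) / width) ** 2)


# One band per species; centres and widths are assumptions for illustration.
S = np.vstack([gaussian_band(450.0, 30.0), gaussian_band(600.0, 40.0)])  # (N, n_lam)

E = rng.normal(scale=1e-3, size=(n_t, n_lam))   # additive noise term
D = C @ S + E                                   # bilinear model, shape (n_t, n_lam)
```

Each row of D is the spectrum at one time point, and each column is the kinetic trace at one wavelength.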

Eq. (4)

Matrix Exponential Kinetics

Coupled first-order reactions are governed by a system of linear ODEs with the rate matrix K, solved analytically via the matrix exponential.

Mass conservation enforced: each column of K sums to zero. Off-diagonal K_{ij} represent interconversion rates.

$$\frac{dC(t)}{dt} = K \cdot C(t), \qquad C(t) = \exp(K \cdot t) \cdot C(0)$$
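The following sketch illustrates these relations for a sequential A → B → C mechanism, using `scipy.linalg.expm` as the direct method. The rate constants and time points are hypothetical, and the rate-matrix layout (column index = source species) is chosen so that each column sums to zero.

```python
import numpy as np
from scipy.linalg import expm

# Sequential mechanism A -> B -> C with hypothetical rate constants (s^-1).
k1, k2 = 1.0, 0.1
K = np.array([
    [-k1, 0.0, 0.0],   # A is consumed at rate k1
    [k1, -k2, 0.0],    # B is formed from A and consumed at rate k2
    [0.0, k2, 0.0],    # C is formed from B
])

# Mass conservation: every column of K sums to zero.
assert np.allclose(K.sum(axis=0), 0.0)

C0 = np.array([1.0, 0.0, 0.0])            # all population starts in A
times = np.array([0.0, 0.5, 5.0, 50.0])   # seconds

# C(t) = exp(K t) C(0), evaluated one time point at a time.
conc = np.array([expm(K * t) @ C0 for t in times])   # shape (n_t, N)
```

Because the columns of K sum to zero, the total concentration in each row of `conc` stays equal to its initial value.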

Eq. (28)

Eigenvalue Decomposition Solver

The matrix exponential is computed efficiently via eigendecomposition, providing 50–100× speedup over direct methods for batch generation.

V contains the eigenvectors of K, λ_i are the corresponding eigenvalues, and α = V⁻¹C(0) are the modal amplitudes. The one-time O(N³) decomposition cost is amortized over all time evaluations.

$$C(t) = V \cdot \operatorname{diag}\left(e^{\lambda_1 t}, e^{\lambda_2 t}, \ldots, e^{\lambda_N t}\right) \cdot \alpha$$
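A minimal sketch of this solver is given below. The function name `solve_kinetics_eig` is hypothetical, and the sketch assumes K is diagonalizable (distinct eigenvalues); a mechanism with exactly equal rate constants could make V singular and would need separate handling.

```python
import numpy as np


def solve_kinetics_eig(K, C0, times):
    """Return C(t) for dC/dt = K C via eigendecomposition; rows index time."""
    eigvals, V = np.linalg.eig(K)        # one-time O(N^3) decomposition
    alpha = np.linalg.solve(V, C0)       # modal amplitudes, alpha = V^-1 C(0)
    # exp(lambda_i * t) for every (time, mode) pair at once, shape (n_t, N)
    modes = np.exp(np.outer(times, eigvals))
    # C(t) = V diag(exp(lambda t)) alpha, vectorized over all time points
    conc = (modes * alpha) @ V.T
    return conc.real


# Sequential A -> B -> C example with hypothetical rate constants (s^-1).
k1, k2 = 1.0, 0.1
K = np.array([[-k1, 0.0, 0.0], [k1, -k2, 0.0], [0.0, k2, 0.0]])
C0 = np.array([1.0, 0.0, 0.0])
times = np.logspace(-3, 3, 250)

conc = solve_kinetics_eig(K, C0, times)   # shape (250, 3)
```

The decomposition is computed once per rate matrix, so adding more time points only adds exponential evaluations and one matrix product.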

Eq. (5)

Analytical IRF Convolution

Gaussian instrument response function convolution evaluated analytically, avoiding numerical artefacts on logarithmic time grids.

σ is the Gaussian IRF width (FWHM = 2σ√(2ln2)). 10–20× faster than numerical convolution.

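$$\Psi(t; \lambda_i, \sigma, t_0) = \frac{1}{2} \exp\left[\lambda_i\left(t - t_0 + \frac{\lambda_i \sigma^2}{2}\right)\right] \cdot \left[1 + \operatorname{erf}\left(\frac{t - t_0 + \lambda_i \sigma^2}{\sigma\sqrt{2}}\right)\right]$$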
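The expression above can be sketched directly in NumPy. The function name `irf_convolved_exponential` is hypothetical, and λ_i is taken to be a non-positive eigenvalue of K. This naive form can overflow for times far before t₀ when |λ_i| is large, so a production implementation would likely need a guarded evaluation.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)   # element-wise error function from the standard library


def irf_convolved_exponential(t, lam, sigma, t0=0.0):
    """Analytical convolution of exp(lam * t), t >= 0, with a Gaussian IRF of width sigma."""
    u = np.asarray(t, dtype=float) - t0
    # Exponential term, including the lam^2 * sigma^2 / 2 shift from the convolution
    growth = np.exp(lam * (u + lam * sigma ** 2 / 2.0))
    # Smoothed step that switches the signal on around t0
    step = 1.0 + _erf((u + lam * sigma ** 2) / (sigma * math.sqrt(2.0)))
    return 0.5 * growth * step


# Example on a logarithmic time grid with hypothetical parameters.
t = np.logspace(-3, 3, 250)
psi = irf_convolved_exponential(t, lam=-0.5, sigma=0.1)
```

Well after t₀ the result approaches a pure exponential scaled by exp(λ_i²σ²/2), and well before t₀ it approaches zero.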

Mechanism-Aware Quality Scoring

$$Q = w_{\mathrm{SNR}} \cdot q_{\mathrm{SNR}} + w_{\mathrm{kin}} \cdot q_{\mathrm{kin}} + w_{\mathrm{spec}} \cdot q^{*}_{\mathrm{spec}} + w_{\mathrm{rank}} \cdot q_{\mathrm{rank}}$$

A composite quality score that accounts for mechanistic topology. Traditional metrics based solely on condition number inappropriately penalize mechanisms with inherent collinearity. For parallel and branched pathways, the spectral overlap penalty is relaxed with a floor of 0.3 rather than 0.0, reflecting the physical reality that product concentrations are inherently collinear.
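One possible reading of the relaxed spectral term is sketched below. The equal weights, the set of relaxed superclasses, and the interpretation of q*_spec as the raw spectral score clipped from below at the floor are all assumptions made for illustration; only the floor value of 0.3 comes from the description above.

```python
# Superclasses treated as "parallel or branched" here; this membership is an assumption.
RELAXED_TOPOLOGIES = {"BRANCHED_3SP", "HUB_TOPOLOGY"}


def composite_quality(q_snr, q_kin, q_spec, q_rank, superclass,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted composite score with a topology-dependent floor on the spectral term."""
    # Floor of 0.3 for relaxed topologies, 0.0 otherwise
    floor = 0.3 if superclass in RELAXED_TOPOLOGIES else 0.0
    q_spec_star = max(q_spec, floor)
    w_snr, w_kin, w_spec, w_rank = weights   # hypothetical equal weights by default
    return w_snr * q_snr + w_kin * q_kin + w_spec * q_spec_star + w_rank * q_rank


# Same raw sub-scores, different topology: only the spectral term changes.
branched = composite_quality(1.0, 1.0, 0.0, 1.0, "BRANCHED_3SP")
sequential = composite_quality(1.0, 1.0, 0.0, 1.0, "SEQUENTIAL_3SP")
```

With these inputs the branched sample keeps a nonzero spectral contribution while the sequential sample does not.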

Mechanism Catalogue

18 Reaction Mechanisms, 7 Superclasses

A curated catalogue organized by kinetic topology and distinguishability, from simple direct decay to complex hub networks with branching and exchange.

Note: M3.1 (Parallel Decay) and M4.3 (Branched Three-Product) were excluded because they produce rank-1 concentration matrices, making them spectroscopically indistinguishable from simple single-exponential decay.
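The degeneracy behind this exclusion can be illustrated with a small sketch of parallel decay, A → B and A → C, using hypothetical rate constants. The rate matrix has a single nonzero eigenvalue, and the two product profiles are proportional to each other, so every trace follows the same single-exponential time dependence.

```python
import numpy as np

# Parallel decay A -> B (k1) and A -> C (k2); rate constants are hypothetical.
k1, k2 = 0.3, 0.7
K = np.array([
    [-(k1 + k2), 0.0, 0.0],   # A decays through both channels
    [k1, 0.0, 0.0],           # B is formed from A
    [k2, 0.0, 0.0],           # C is formed from A
])

eigvals = np.linalg.eigvals(K)
n_nonzero_rates = int(np.sum(np.abs(eigvals) > 1e-9))   # number of observable decay rates

# Closed-form concentration profiles for C(0) = (1, 0, 0).
t = np.logspace(-3, 3, 250)
k_total = k1 + k2
conc_A = np.exp(-k_total * t)
conc_B = k1 / k_total * (1.0 - conc_A)
conc_C = k2 / k_total * (1.0 - conc_A)

# The product profiles differ only by the constant branching ratio k1 / k2.
branching_ratio = conc_B[-1] / conc_C[-1]
```

Only the sum k1 + k2 appears as a decay rate, so the individual branching rates cannot be recovered from the time dependence alone.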

SC-0

N = 2 species, 2 mechanisms

DIRECT_DECAY

Simple direct decay or equilibrium

SC-1

N = 3 species, 4 mechanisms

SEQUENTIAL_3SP

Linear three-species chains with optional reversibility

SC-2

N = 3–4 species, 2 mechanisms

BRANCHED_3SP

Competitive or parallel pathways converging to product

SC-3

N = 4 species, 1 mechanism

LINEAR_4SP_IRR

Fully irreversible four-species linear chain

SC-4

N = 4 species, 3 mechanisms

LINEAR_4SP_PARTIAL

Linear chains with partial reversibility

SC-5

N = 4 species, 3 mechanisms

LINEAR_4SP_FULL

Fully reversible linear four-species systems

SC-6

N = 4 species, 3 mechanisms

HUB_TOPOLOGY

Hub-like networks with branching or exchange

System Design

Architecture & Pipeline

A modular Python framework with six primary components, designed for extensibility and batch generation efficiency.

STEP 1

Mechanism Selection

Sample from 18 mechanisms across 7 superclasses with configurable distribution weights

STEP 2

Rate Constant Sampling

Log-uniform distribution over [10⁻⁴, 10] s⁻¹ with kinetic separation enforcement (β = 2.0)

STEP 3

Spectral Generation

1–4 Gaussian peaks per species, 350–750 nm range, 250 channels, enforced spectral distinguishability

STEP 4

Kinetic Solving

Eigenvalue decomposition with vectorized exponential evaluation, O(N³ + N·n_t) complexity

STEP 5

IRF Convolution

Analytical Gaussian convolution (FWHM 0.05–1.0 s), 10–20× faster than numerical methods

STEP 6

Quality Assessment & Export

Mechanism-aware scoring, spectral overlap filtering (> 0.5 rejected), NPZ/CSV/JSON export
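The rate constant sampling in Step 2 could be sketched as follows. The function name is hypothetical, and the separation rule is an assumption: here β = 2.0 is read as a minimum ratio between neighbouring sorted rate constants, enforced by rejection sampling.

```python
import numpy as np


def sample_rate_constants(n, rng, k_min=1e-4, k_max=10.0, beta=2.0, max_tries=1000):
    """Draw n log-uniform rate constants (s^-1) whose sorted neighbours differ by >= beta."""
    for _ in range(max_tries):
        # Log-uniform: uniform in log10(k), then exponentiate
        log_k = rng.uniform(np.log10(k_min), np.log10(k_max), size=n)
        k = np.sort(10.0 ** log_k)
        # Accept only if every adjacent pair is separated by at least a factor beta
        if n < 2 or np.all(k[1:] / k[:-1] >= beta):
            return k
    raise RuntimeError("could not satisfy the kinetic separation constraint")


rng = np.random.default_rng(42)
k = sample_rate_constants(3, rng)   # three sorted, well-separated rate constants
```

Rejection sampling keeps the marginal distribution close to log-uniform; other enforcement schemes would bias the draw differently.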

Core Modules

Mechanism Catalogue

18 mechanisms, 7 superclasses, rate matrix builders

Spectral Engine

Parametric Gaussian peak generation with diversity constraints

Kinetic Solver

Eigenvalue decomposition with condition number monitoring

Noise Module

5-component model: shot, read, drift, artefacts, cosmic rays

Quality Engine

Mechanism-aware composite scoring with topology awareness

Dataset Exporter

NPZ, CSV, JSON with complete provenance metadata
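A minimal sketch of the Spectral Engine idea is shown below: each species receives 1–4 Gaussian peaks on a 250-channel, 350–750 nm axis, and candidate sets are rejected when any pair of spectra overlaps by more than 0.5. The overlap measure (cosine similarity), the peak parameter ranges, and all function names are assumptions for illustration.

```python
import numpy as np

wavelengths = np.linspace(350.0, 750.0, 250)   # nm, 250 channels


def random_spectrum(rng):
    """Sum of 1-4 Gaussian peaks with hypothetical centre, width, and amplitude ranges."""
    n_peaks = rng.integers(1, 5)                       # 1 to 4 peaks inclusive
    centers = rng.uniform(370.0, 730.0, size=n_peaks)
    widths = rng.uniform(10.0, 40.0, size=n_peaks)
    amps = rng.uniform(0.2, 1.0, size=n_peaks)
    peaks = amps[:, None] * np.exp(
        -0.5 * ((wavelengths - centers[:, None]) / widths[:, None]) ** 2
    )
    return peaks.sum(axis=0)


def max_pairwise_overlap(S):
    """Largest cosine similarity between any two distinct rows of S (needs >= 2 rows)."""
    unit = S / np.linalg.norm(S, axis=1, keepdims=True)
    gram = unit @ unit.T
    return gram[~np.eye(len(S), dtype=bool)].max()


def sample_distinct_spectra(n_species, rng, threshold=0.5, max_tries=1000):
    """Rejection-sample a set of species spectra whose pairwise overlap stays <= threshold."""
    for _ in range(max_tries):
        S = np.vstack([random_spectrum(rng) for _ in range(n_species)])
        if max_pairwise_overlap(S) <= threshold:
            return S
    raise RuntimeError("could not generate sufficiently distinct spectra")


rng = np.random.default_rng(7)
S = sample_distinct_spectra(3, rng)   # shape (3, 250)
```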

Performance Characteristics

30–60s

Batch of 1,000 samples

250 × 250

Time × wavelength grid

10⁻³–10³ s

Temporal range (6 decades)

350–750 nm

Spectral window

Validation

Chameleon Reaction Benchmark

Validated against the oxidation of reducing sugars by permanganate, a well-characterized experimental system with known rate constants and species-associated spectra.

Quantitative Agreement with Experimental Data

KMnO₄ + Sugar substrates in basic solution

Substrate	Mechanism	R²	Pearson r
Fructose	M4.1	0.9857	0.9929
Glucose	M2.1	0.9657	0.9827
Sucrose	M2.1	0.9945	0.9974

11 pre-trigger time points excluded from analysis.
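For reference, the two agreement metrics can be computed as sketched below. The function names and toy data are hypothetical. The tabulated R² values are numerically consistent, within rounding, with the square of the tabulated Pearson r (for example 0.9929² ≈ 0.986), although the exact definition used for R² is not stated here, so both variants are shown.

```python
import numpy as np


def pearson_r(observed, simulated):
    """Pearson correlation between two data matrices, flattened to vectors."""
    return float(np.corrcoef(np.ravel(observed), np.ravel(simulated))[0, 1])


def coefficient_of_determination(observed, simulated):
    """1 - SS_res / SS_tot, treating the simulated data as the prediction."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    ss_res = np.sum((observed - simulated) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy example: a noisy copy of a reference matrix (not experimental data).
rng = np.random.default_rng(3)
reference = rng.uniform(0.0, 1.0, size=(50, 40))
simulated = reference + rng.normal(scale=0.02, size=reference.shape)

r = pearson_r(reference, simulated)
r_squared = r ** 2
r2_det = coefficient_of_determination(reference, simulated)
```

The two R² variants coincide only when the simulated data carry no systematic offset or scale error relative to the reference.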

Correlation with Experimental Data

Fructose (M4.1): r = 0.9929
Glucose (M2.1): r = 0.9827
Sucrose (M2.1): r = 0.9974

All correlations exceed r = 0.98, confirming deterministic, physically accurate generation across three independent experimental systems.

Additional Accuracy Metrics

< 5 nm

Spectral peak MAE

Across all species

< 8%

Kinetic parameter error

From noisy synthetic data

< 0.02 AU

SAS reconstruction RMSE

Via pseudoinverse

< ε_mach

Eigenvalue vs. expm

Machine precision agreement
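The pseudoinverse reconstruction of species-associated spectra (SAS) behind the RMSE metric can be sketched as follows. The mechanism, rate constants, band parameters, and noise level are hypothetical, so the resulting RMSE is illustrative and is not the figure reported above.

```python
import numpy as np

rng = np.random.default_rng(1)

t = np.logspace(-3, 3, 250)              # seconds
lam = np.linspace(350.0, 750.0, 250)     # nm

# Sequential A -> B -> C concentrations with hypothetical rate constants (s^-1).
k1, k2 = 1.0, 0.05
conc_A = np.exp(-k1 * t)
conc_B = k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
conc_C = 1.0 - conc_A - conc_B
C = np.column_stack([conc_A, conc_B, conc_C])            # shape (n_t, N)

# One Gaussian band per species; centres and widths are assumptions.
centers = np.array([420.0, 540.0, 660.0])
S_true = np.exp(-0.5 * ((lam - centers[:, None]) / 30.0) ** 2)   # shape (N, n_lam)

# Noisy data matrix from the bilinear model.
D = C @ S_true + rng.normal(scale=0.005, size=(t.size, lam.size))

# Least-squares SAS estimate: S_hat = C^+ D, with C^+ the Moore-Penrose pseudoinverse.
S_hat = np.linalg.pinv(C) @ D
rmse = float(np.sqrt(np.mean((S_hat - S_true) ** 2)))
```

This reconstruction assumes the concentration matrix is known and has full column rank; collinear profiles would make the pseudoinverse solution ill-conditioned.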

Machine Learning Results

Initial Approaches (Modest Performance)

Classical ML (Random Forest, XGBoost, Logistic Regression) and sequential deep learning (GRUs, LSTMs) yielded top-1 validation accuracies in the range of 0.30–0.60, initially suggesting fundamental limits from spectrokinetic degeneracy.

Breakthrough: 2D CNN Architecture

Treating D(t, λ) as a 2D image and classifying it with a convolutional neural network exploited the inherent spatial structure of spectrokinetic data. Combined with a scaled dataset (120K initial, 15K filtered), the 2D CNN achieved consistently accurate top-1 predictions across all mechanism superclasses.

Production Dataset

Dataset Statistics & Structure

119,988 initial spectra generated with balanced sampling, aggressively filtered to 15,192 high-quality samples (12.7% pass rate).

119,988

Initial spectra

6,666 per mechanism

15,192

Filtered dataset

12.7% pass rate

300 × 250

Sample dimensions

Time × wavelength

0.358

Mean overlap

Well below 0.5 threshold

Superclass Distribution

Superclass	Samples	Share
SEQUENTIAL_3SP	5,426	35.7%
DIRECT_DECAY	4,005	26.4%
HUB_TOPOLOGY	1,904	12.5%
BRANCHED_3SP	1,673	11.0%
LINEAR_4SP_PARTIAL	1,094	7.2%
LINEAR_4SP_FULL	886	5.8%
LINEAR_4SP_IRR	204	1.3%

Aggressive filtering disproportionately removed samples from LINEAR_4SP_IRR (M4.1). Fully irreversible linear chains are particularly susceptible to kinetic degeneracy when rate constants approach similar values.

Export Format

training_data.npz: NumPy compressed archive

Data matrices D, concentration profiles C, species-associated spectra S

n_samples × n_t × n_λ | n_samples × n_t × N | n_samples × N × n_λ

metadata.csv: Tabular summary

Per-sample parameters, mechanism IDs, quality metrics (SNR, condition number, kinetic separation, spectral overlap)

n_samples rows × 12+ columns

labels.json: Complete provenance

All rate constants, spectral component parameters, K-matrix eigenvalues, DAS fingerprints

Full TRSpectrumLabel dictionaries

axes.npz: Shared axes

Wavelength (350–750 nm) and time (0.001–1000 s) vectors

n_λ = 250 | n_t = 250
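A round trip through the NPZ layout described above might look like the sketch below. The array key names `D`, `C`, and `S` are assumptions based on the symbols used in this section, the arrays are zero-filled placeholders, and the grid sizes follow the axes listed above.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays with the documented shapes (tiny sample count for illustration).
n_samples, n_t, n_lam, N = 4, 250, 250, 3
D = np.zeros((n_samples, n_t, n_lam), dtype=np.float32)   # data matrices
C = np.zeros((n_samples, n_t, N), dtype=np.float32)       # concentration profiles
S = np.zeros((n_samples, N, n_lam), dtype=np.float32)     # species-associated spectra

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.npz")
    # Key names are hypothetical; the real archive may use different ones.
    np.savez_compressed(path, D=D, C=C, S=S)

    # Reading back: np.load returns a lazy archive indexed by key name.
    with np.load(path) as archive:
        shapes = {name: archive[name].shape for name in archive.files}
```

Inspecting `archive.files` on the real export is the safest way to discover the actual key names before indexing.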

Compute Grant Request

Help Us Scale Continuum

We have validated the synthetic data pipeline and demonstrated that 2D CNNs can classify reaction mechanisms from spectrokinetic data. To reach production-grade accuracy and extend to real experimental data, we need compute resources.

Why This Research Matters

Automating Expert Analysis

Traditional spectrokinetic analysis requires deep expertise in both spectroscopy and chemical kinetics. Our ML approach automates mechanism identification, democratizing access to these analytical capabilities.

Bridging Theory & Practice

Synthetic data with complete ground truth labels enables rigorous benchmarking impossible with experimental data alone. Transfer learning will extend these models to real laboratory measurements.

Open Science Impact

The generator, datasets, and trained models will be released open-source, providing the spectroscopy community with validated tools for automated kinetic analysis across chemistry, biochemistry, and materials science.

Dataset Generation at Scale

• Scale from 15K to 500K+ filtered samples
• Full 18-mechanism balanced coverage
• Multi-noise-level augmentation
• Estimated: ~200 GPU-hours on A100

CNN Model Training & Optimization

• 2D CNN architecture for D(t,λ) image classification
• Hyperparameter search across 7 superclasses
• Cross-validation with stratified splits
• Estimated: ~500 GPU-hours on A100/H100

Physics-Informed Neural Networks

• PINNs leveraging analytical structure of kinetic equations
• Embedding eigenvalue constraints as physics losses
• Transfer learning from synthetic to experimental data
• Estimated: ~800 GPU-hours on H100

Mixed-Order Kinetics Extension

• Extend beyond first-order to second/mixed-order kinetics
• Hybrid analytical-numerical ODE solvers
• Concentration-dependent kinetic models
• Estimated: ~400 GPU-hours on A100

Estimated Resource Requirements

Resource	Quantity	Estimated Cost
GPU Compute (A100/H100)	2,000 GPU-hours	$6,000–$10,000
Storage (Datasets + Checkpoints)	5 TB	$500
CPU Compute (Data Generation)	500 hours	$250
Experiment Tracking (W&B Pro)	12 months	$600
Total Estimated		$7,350–$11,350

Support This Research

A compute grant would enable us to scale dataset generation, train production-grade models, and validate transfer learning from synthetic to experimental spectrokinetic data.


Prof. Ricardo Fernández-Terán • University of Geneva • Ricardo.FernandezTeran@unige.ch

Team & Affiliations

Research Team

Simón Bermúdez

Project Manager & Lead Developer

Humber Polytechnic

System architecture, ML pipeline, and data engineering

José Rafael Noriega Cedeño

Chemistry Subject Matter Expert

Humber Polytechnic

Mechanism catalogue design, kinetic validation, and chemical accuracy

Julie Anne Pavel

Data Analyst

Humber Polytechnic

Statistical analysis, quality metrics, and dataset characterization

Prof. Ricardo Fernández-Terán

Principal Investigator & Supervisor

University of Geneva

Theoretical framework, spectrokinetic methods, and experimental validation

Institutions

Humber Polytechnic

Faculty of Applied Science & Technology

Toronto, ON M9W 5L7, Canada

University of Geneva

Department of Physical Chemistry

CH-1205 Geneva, Switzerland

Foundational Reference

Fernández-Terán, R. et al. (2022). "A sweet introduction to the mathematical analysis of time-resolved spectra and complex kinetic mechanisms." J. Chem. Educ., 99(6), 2327–2337.

We thank Dr. Ricardo Fernández-Terán for insightful discussions on spectrokinetic analysis and for providing the theoretical framework. We acknowledge the support of Humber Polytechnic's Faculty of Applied Science and Technology. This work was inspired by the treatment of the chameleon reaction published in the Journal of Chemical Education.
