Humber Polytechnic × University of Geneva

Synthetic Dataset Generator for
Time-Resolved Spectroscopy

A rigorous framework implementing 18 first-order reaction mechanisms across 7 topological superclasses, bridging theoretical chemical kinetics and practical machine learning for automated mechanism identification from spectroscopic data.

18 | Mechanisms | Curated catalogue
15,192 | Spectra | Production dataset
50–100× | Speedup | Eigenvalue solver
0.99+ | Pearson r | Validation score

Authors

Bermúdez, S., Noriega Cedeño, J. R., Pavel, J. A., and Fernández-Terán, R.

Faculty of Applied Science & Technology, Humber Polytechnic, Toronto • Department of Physical Chemistry, University of Geneva

Abstract

We present the architecture, methodology, and practical implementation of a synthetic dataset generator for time-resolved UV-visible spectroscopy designed for training and evaluating machine learning models. The generator implements the rigorous mathematical framework established by Fernández-Terán and colleagues, incorporating the bilinear model of spectrokinetics, matrix exponential kinetics, analytical instrument response function convolution, and realistic multi-component noise injection.

The system encompasses a curated catalogue of 18 first-order reaction mechanisms organized into seven topological superclasses, enabling generation of physically grounded synthetic spectra that capture the full diversity of kinetic behaviours observed in experimental systems. Key methodological advances include eigenvalue decomposition for computational efficiency (50–100× speedup), mechanism-aware quality scoring, and comprehensive numerical stability safeguards.

Time-Resolved Spectroscopy • Chemical Kinetics • Synthetic Data Generation • Machine Learning • Matrix Exponential • IRF Convolution • Mechanism Classification • Eigenvalue Decomposition

Theoretical Foundations

Mathematical Framework

Built upon rigorous spectrokinetic theory, the generator implements analytical solutions for computational efficiency and numerical precision.

Eq. (1)

Bilinear Model of Spectrokinetics

Time-resolved spectra decompose into the product of concentration and spectral matrices, assuming spectral time-invariance and well-defined chemical intermediates.

C ∈ ℝ^{n_t × N} contains the time-dependent concentrations of N species; S ∈ ℝ^{N × n_λ} contains the species-associated spectra; E ∈ ℝ^{n_t × n_λ} is the residual (noise) term.

D(t, λ) = C(t) · S(λ) + E
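
As a minimal illustration of the bilinear model, the sketch below assembles a small data matrix from a two-species A → B system. The rate constant, band centres, band widths, grid sizes, and noise level are all hypothetical values chosen for the example, not parameters of the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

n_t, n_lam = 50, 40                        # illustrative grid sizes
t = np.linspace(0.0, 10.0, n_t)            # time axis, s
wl = np.linspace(350.0, 750.0, n_lam)      # wavelength axis, nm

# C: concentration profiles for A -> B with a hypothetical k = 0.5 s^-1
k = 0.5
C = np.column_stack([np.exp(-k * t), 1.0 - np.exp(-k * t)])      # shape (n_t, N)

# S: one Gaussian band per species (hypothetical centres and widths, nm)
centres = np.array([450.0, 600.0])
widths = np.array([30.0, 40.0])
S = np.exp(-0.5 * ((wl[None, :] - centres[:, None]) / widths[:, None]) ** 2)  # (N, n_lam)

# E: additive residual term; D: the time-resolved data matrix
E = 1e-3 * rng.standard_normal((n_t, n_lam))
D = C @ S + E                                                     # shape (n_t, n_lam)
```

The noise-free product C @ S has rank equal to the number of spectroscopically distinct species, which is why rank-deficient mechanisms cannot be told apart from simpler ones.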

Eq. (4)

Matrix Exponential Kinetics

Coupled first-order reactions are governed by a system of linear ODEs with the rate matrix K, solved analytically via the matrix exponential.

Mass conservation enforced: each column of K sums to zero. Off-diagonal K_{ij} represent interconversion rates.

dC(t)/dt = K · C(t)
C(t) = exp(K · t) · C(0)
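
The following sketch shows one way to build a rate matrix and propagate concentrations with the matrix exponential, using `scipy.linalg.expm`. The sequential A → B → C mechanism and its rate constants are hypothetical examples, and `rate_matrix_sequential` is an illustrative helper, not a function of the generator.

```python
import numpy as np
from scipy.linalg import expm

def rate_matrix_sequential(k1, k2):
    """Rate matrix for A -> B -> C under the convention dC/dt = K @ C."""
    return np.array([
        [-k1, 0.0, 0.0],   # A is consumed at rate k1
        [k1, -k2, 0.0],    # B is formed from A and consumed at rate k2
        [0.0, k2, 0.0],    # C is formed from B and never consumed
    ])

K = rate_matrix_sequential(k1=1.0, k2=0.2)   # hypothetical values, s^-1

# Mass conservation: every column of K sums to zero.
column_sums = K.sum(axis=0)

c0 = np.array([1.0, 0.0, 0.0])               # all population starts in A
times = np.array([0.0, 0.5, 2.0, 10.0])

# C(t) = exp(K t) C(0), evaluated one time point at a time
C = np.array([expm(K * t) @ c0 for t in times])   # shape (n_t, N)
```

Because the columns of K sum to zero, the total concentration in each row of C stays constant over time.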

Eq. (28)

Eigenvalue Decomposition Solver

The matrix exponential is computed efficiently via eigendecomposition, providing 50–100× speedup over direct methods for batch generation.

Here V contains the eigenvectors of K, λ_i are the corresponding eigenvalues, and α = V⁻¹C(0) are the modal amplitudes. The one-time O(N³) decomposition cost is amortized over all time evaluations.

C(t) = V · diag(e^{λ₁t}, e^{λ₂t}, ..., e^{λ_Nt}) · α
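
A minimal NumPy sketch of the eigendecomposition route is given below. The function name `solve_kinetics_eig` and the example rate constants are assumptions made for illustration; the sketch also assumes K is diagonalizable, which fails when rate constants coincide exactly.

```python
import numpy as np

def solve_kinetics_eig(K, c0, t):
    """Evaluate C(t) = V diag(exp(lambda_i t)) alpha on a whole time grid.

    Assumes K is diagonalizable; nearly equal rate constants make V
    ill-conditioned, so a production solver would monitor cond(V).
    """
    lam, V = np.linalg.eig(K)                  # one-time O(N^3) decomposition
    alpha = np.linalg.solve(V, np.asarray(c0, dtype=float))   # alpha = V^-1 C(0)
    modes = np.exp(np.outer(t, lam))           # exp(lambda_i * t_j), shape (n_t, N)
    C = (modes * alpha) @ V.T                  # sum_i V[:, i] * alpha_i * exp(lambda_i t)
    return C.real                              # imaginary parts cancel for real K, c0

# Hypothetical A -> B -> C example on a logarithmic time grid
k1, k2 = 1.0, 0.2
K = np.array([[-k1, 0.0, 0.0],
              [k1, -k2, 0.0],
              [0.0, k2, 0.0]])
t = np.logspace(-3, 3, 250)
C = solve_kinetics_eig(K, [1.0, 0.0, 0.0], t)

# Closed-form solution of the same mechanism, for comparison
A_exact = np.exp(-k1 * t)
B_exact = k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
```

All time points are evaluated in a single vectorized expression, which is where the saving over calling a general matrix exponential once per time point comes from.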

Eq. (5)

Analytical IRF Convolution

Gaussian instrument response function convolution evaluated analytically, avoiding numerical artefacts on logarithmic time grids.

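σ is the Gaussian IRF width (FWHM = 2σ√(2 ln 2)), λ_i is the i-th eigenvalue of K, and t₀ is the IRF centre (time zero). The analytical form is 10–20× faster than numerical convolution.

Ψ(t; λ_i, σ, t₀) = ½ exp[λ_i(t – t₀ + λ_iσ²/2)] · [1 + erf((t – t₀ + λ_iσ²)/(σ√2))]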
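
The expression for Ψ translates almost directly into NumPy and `scipy.special.erf`. The sketch below is an illustrative transcription under that assumption; the function names and example values are hypothetical, and no overflow protection is included.

```python
import numpy as np
from scipy.special import erf

def sigma_from_fwhm(fwhm):
    """Convert a Gaussian FWHM to its standard deviation sigma."""
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))

def irf_convolved_exponential(t, lam, sigma, t0=0.0):
    """Psi(t; lam, sigma, t0): exp(lam * t) convolved with a Gaussian IRF.

    lam is an eigenvalue of K (non-positive for decaying modes).
    Caveat: for times far before t0 combined with large |lam|, the
    exponential factor can overflow; this sketch does not guard against it.
    """
    shifted = np.asarray(t, dtype=float) - t0
    growth = np.exp(lam * (shifted + lam * sigma**2 / 2.0))
    step = 1.0 + erf((shifted + lam * sigma**2) / (sigma * np.sqrt(2.0)))
    return 0.5 * growth * step

# Hypothetical example: one decaying mode on a logarithmic time grid
t = np.logspace(-3, 3, 250)
sigma = sigma_from_fwhm(0.5)          # FWHM of 0.5 s, within the 0.05-1.0 s range
psi = irf_convolved_exponential(t, lam=-1.0, sigma=sigma, t0=0.0)
```

Because the result is evaluated pointwise, it works on unevenly spaced time grids where discrete convolution would need resampling.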

Mechanism-Aware Quality Scoring

Q = w_SNR · q_SNR + w_kin · q_kin + w_spec · q*_spec + w_rank · q_rank

A composite quality score that accounts for mechanistic topology. Traditional metrics based solely on condition number inappropriately penalize mechanisms with inherent collinearity. For parallel and branched pathways, the spectral overlap penalty is relaxed with a floor of 0.3 rather than 0.0, reflecting the physical reality that product concentrations are inherently collinear.
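
A hedged sketch of how such a composite score could be computed is shown below. The equal weights, the assumption that each component score lies in [0, 1], and the choice of which superclasses receive the relaxed floor are illustrative guesses, not values taken from the generator.

```python
# Superclasses assumed (for illustration only) to get the relaxed overlap floor
RELAXED_SUPERCLASSES = {"BRANCHED_3SP", "HUB_TOPOLOGY"}

def composite_quality(q_snr, q_kin, q_spec, q_rank, superclass,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    """Q = w_SNR*q_SNR + w_kin*q_kin + w_spec*q*_spec + w_rank*q_rank.

    q*_spec is the spectral score after applying a topology-dependent floor:
    0.3 for superclasses with inherently collinear products, 0.0 otherwise.
    The default weights are hypothetical.
    """
    floor = 0.3 if superclass in RELAXED_SUPERCLASSES else 0.0
    q_spec_star = max(q_spec, floor)
    w_snr, w_kin, w_spec, w_rank = weights
    return w_snr * q_snr + w_kin * q_kin + w_spec * q_spec_star + w_rank * q_rank

# Same component scores, different topology
q_branched = composite_quality(1.0, 1.0, 0.0, 1.0, "BRANCHED_3SP")
q_direct = composite_quality(1.0, 1.0, 0.0, 1.0, "DIRECT_DECAY")
```

With identical component scores, the branched sample keeps part of its spectral credit while the direct-decay sample does not.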

Mechanism Catalogue

18 Reaction Mechanisms, 7 Superclasses

A curated catalogue organized by kinetic topology and distinguishability, from simple direct decay to complex hub networks with branching and exchange.

Note: M3.1 (Parallel Decay) and M4.3 (Branched Three-Product) were excluded because they produce rank-1 concentration matrices, making them spectroscopically indistinguishable from simple single-exponential decay.

Superclass | Name | Species (N) | Mechanisms | Description
SC-0 | DIRECT_DECAY | 2 | 2 | Simple direct decay or equilibrium
SC-1 | SEQUENTIAL_3SP | 3 | 4 | Linear three-species chains with optional reversibility
SC-2 | BRANCHED_3SP | 3–4 | 2 | Competitive or parallel pathways converging to product
SC-3 | LINEAR_4SP_IRR | 4 | 1 | Fully irreversible four-species linear chain
SC-4 | LINEAR_4SP_PARTIAL | 4 | 3 | Linear chains with partial reversibility
SC-5 | LINEAR_4SP_FULL | 4 | 3 | Fully reversible linear four-species systems
SC-6 | HUB_TOPOLOGY | 4 | 3 | Hub-like networks with branching or exchange

System Design

Architecture & Pipeline

A modular Python framework with six primary components, designed for extensibility and batch generation efficiency.

STEP 1

Mechanism Selection

Sample from 18 mechanisms across 7 superclasses with configurable distribution weights

STEP 2

Rate Constant Sampling

Log-uniform distribution over [10⁻⁴, 10] s⁻¹ with kinetic separation enforcement (β = 2.0)

STEP 3

Spectral Generation

1–4 Gaussian peaks per species, 350–750 nm range, 250 channels, enforced spectral distinguishability

STEP 4

Kinetic Solving

Eigenvalue decomposition with vectorized exponential evaluation, O(N³ + N·n_t) complexity

STEP 5

IRF Convolution

Analytical Gaussian convolution (FWHM 0.05–1.0 s), 10–20× faster than numerical methods

STEP 6

Quality Assessment & Export

Mechanism-aware scoring, spectral overlap filtering (> 0.5 rejected), NPZ/CSV/JSON export
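
The first three pipeline steps can be sketched in a few lines of NumPy. Everything below is an illustrative assumption rather than the generator's code: the function names, the reading of β as a minimum ratio between adjacent sorted rate constants, the peak width and height ranges, and the use of cosine similarity as the spectral overlap measure.

```python
import numpy as np

# Mechanisms per superclass as listed in the catalogue (sums to 18)
MECHANISMS_PER_SUPERCLASS = {
    "DIRECT_DECAY": 2, "SEQUENTIAL_3SP": 4, "BRANCHED_3SP": 2,
    "LINEAR_4SP_IRR": 1, "LINEAR_4SP_PARTIAL": 3, "LINEAR_4SP_FULL": 3,
    "HUB_TOPOLOGY": 3,
}

def sample_superclass(rng, weights=None):
    """Step 1: draw a superclass; default weights make each mechanism equally likely."""
    names = list(MECHANISMS_PER_SUPERCLASS)
    if weights is None:
        counts = np.array([MECHANISMS_PER_SUPERCLASS[n] for n in names], dtype=float)
        weights = counts / counts.sum()
    return str(rng.choice(names, p=weights))

def sample_rate_constants(n, rng, k_min=1e-4, k_max=10.0, beta=2.0, max_tries=10_000):
    """Step 2: log-uniform rates, rejected until adjacent sorted rates differ by >= beta."""
    lo, hi = np.log10(k_min), np.log10(k_max)
    for _ in range(max_tries):
        k = 10.0 ** rng.uniform(lo, hi, size=n)
        ordered = np.sort(k)
        if np.all(ordered[1:] / ordered[:-1] >= beta):
            return k
    raise RuntimeError("could not satisfy the kinetic separation constraint")

def gaussian_spectrum(wl, rng, max_peaks=4):
    """One species spectrum as a sum of 1..max_peaks Gaussian bands (hypothetical ranges)."""
    n_peaks = rng.integers(1, max_peaks + 1)
    centres = rng.uniform(wl[0], wl[-1], size=n_peaks)
    widths = rng.uniform(10.0, 60.0, size=n_peaks)
    heights = rng.uniform(0.2, 1.0, size=n_peaks)
    bands = heights[:, None] * np.exp(
        -0.5 * ((wl[None, :] - centres[:, None]) / widths[:, None]) ** 2)
    return bands.sum(axis=0)

def max_overlap(S):
    """Largest pairwise cosine similarity between rows of S (assumed overlap metric)."""
    if len(S) < 2:
        return 0.0
    unit = S / np.linalg.norm(S, axis=1, keepdims=True)
    gram = unit @ unit.T
    return float(np.max(gram[np.triu_indices(len(S), k=1)]))

def sample_spectra(n_species, wl, rng, threshold=0.5, max_tries=10_000):
    """Step 3: redraw the whole spectral set until the overlap is at most the threshold."""
    for _ in range(max_tries):
        S = np.stack([gaussian_spectrum(wl, rng) for _ in range(n_species)])
        if max_overlap(S) <= threshold:
            return S
    raise RuntimeError("could not satisfy the spectral overlap constraint")

rng = np.random.default_rng(42)
wl = np.linspace(350.0, 750.0, 250)
superclass = sample_superclass(rng)
k = sample_rate_constants(3, rng)
S = sample_spectra(3, wl, rng)
```

Rejection sampling keeps the sketch short; a stricter separation factor or lower overlap threshold would simply lower the acceptance rate.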

Core Modules

Mechanism Catalogue

18 mechanisms, 7 superclasses, rate matrix builders

Spectral Engine

Parametric Gaussian peak generation with diversity constraints

Kinetic Solver

Eigenvalue decomposition with condition number monitoring

Noise Module

5-component model: shot, read, drift, artefacts, cosmic rays

Quality Engine

Mechanism-aware composite scoring with topology awareness

Dataset Exporter

NPZ, CSV, JSON with complete provenance metadata
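
To make the Noise Module's five components concrete, the sketch below adds one simple term per component to a noise-free data matrix. The functional form and magnitude of each term are assumptions chosen for illustration, in particular the "artefact" term, which is modelled here as a short-lived feature near the first time point.

```python
import numpy as np

def add_noise(D, t, rng, shot_scale=0.01, read_sigma=0.002, drift_amp=0.005,
              artefact_amp=0.02, artefact_tau=0.01, cosmic_prob=1e-4, cosmic_amp=0.5):
    """Return D plus five hypothetical noise contributions (all parameters illustrative)."""
    n_t, n_lam = D.shape
    # 1. Shot noise: Gaussian approximation with std proportional to sqrt(|signal|)
    shot = shot_scale * np.sqrt(np.abs(D)) * rng.standard_normal(D.shape)
    # 2. Read noise: signal-independent Gaussian noise
    read = read_sigma * rng.standard_normal(D.shape)
    # 3. Drift: a random linear baseline ramp along the time axis
    drift = drift_amp * rng.standard_normal() * np.linspace(0.0, 1.0, n_t)[:, None]
    # 4. Artefact: fast-decaying, wavelength-independent feature at early times
    artefact = artefact_amp * np.exp(-(t - t[0]) / artefact_tau)[:, None] * np.ones((1, n_lam))
    # 5. Cosmic rays: sparse, large positive spikes on isolated pixels
    cosmic = cosmic_amp * (rng.random(D.shape) < cosmic_prob)
    return D + shot + read + drift + artefact + cosmic

# Hypothetical noise-free input: single-exponential decay with a flat spectrum
t = np.logspace(-3, 3, 250)
D_clean = np.exp(-0.1 * t)[:, None] * np.ones((1, 250))
D_noisy = add_noise(D_clean, t, np.random.default_rng(1))
```

Passing a seeded generator makes each noisy realization reproducible, which matters when the same clean sample is augmented at several noise levels.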

Performance Characteristics

30–60 s | Batch of 1,000 samples
250 × 250 | Time × wavelength grid
10⁻³–10³ s | Temporal range (6 decades)
350–750 nm | Spectral window
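
The grid described above can be constructed as follows. Logarithmic spacing of the time axis and uniform spacing of the wavelength axis are assumptions made for this sketch.

```python
import numpy as np

# 250 time points spanning 10^-3 to 10^3 s, assumed logarithmically spaced
time_axis = np.logspace(-3, 3, 250)

# 250 wavelength channels spanning 350 to 750 nm, assumed uniformly spaced
wavelength_axis = np.linspace(350.0, 750.0, 250)

decades = np.log10(time_axis[-1] / time_axis[0])             # temporal dynamic range
channel_width = wavelength_axis[1] - wavelength_axis[0]      # nm per channel
```

With these choices each channel is about 1.6 nm wide and the time axis has roughly 42 points per decade.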

Validation

Chameleon Reaction Benchmark

Validated against the oxidation of reducing sugars by permanganate, a well-characterized experimental system with known rate constants and species-associated spectra.

Quantitative Agreement with Experimental Data

KMnO₄ + Sugar substrates in basic solution

Substrate | Mechanism | R² | Pearson r
Fructose | M4.1 | 0.9857 | 0.9929
Glucose | M2.1 | 0.9657 | 0.9827
Sucrose | M2.1 | 0.9945 | 0.9974
11 pre-trigger time points excluded from analysis.

Correlation with Experimental Data

Fructose (M4.1): r = 0.9929
Glucose (M2.1): r = 0.9827
Sucrose (M2.1): r = 0.9974

All three correlations exceed r = 0.98, supporting the physical accuracy of the generator across three independent experimental systems.
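
For reference, the two agreement metrics reported above can be computed as in the sketch below. How the original analysis flattened or weighted the data is not stated, so comparing the flattened arrays point by point is an assumption, and the "experimental" trace here is a synthetic stand-in rather than real measurement data.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two arrays of equal shape, compared elementwise."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    observed = np.ravel(observed)
    predicted = np.ravel(predicted)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Stand-in traces: a decay curve and a slightly perturbed copy of it
t = np.linspace(0.0, 5.0, 100)
experimental = np.exp(-t)
simulated = experimental + 0.01 * np.sin(3.0 * t)

r = pearson_r(experimental, simulated)
r2 = r_squared(experimental, simulated)
```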

Additional Accuracy Metrics

< 5 nm | Spectral peak MAE | Across all species
< 8% | Kinetic parameter error | From noisy synthetic data
< 0.02 AU | SAS reconstruction RMSE | Via pseudoinverse
< ε_mach | Eigenvalue vs. expm | Machine precision agreement
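
The pseudoinverse reconstruction of species-associated spectra (SAS) follows from the bilinear model: given D and known concentration profiles C, the least-squares estimate is S ≈ C⁺D. The sketch below demonstrates this on a hypothetical two-species system; the rate constant, band shapes, and noise level are illustrative.

```python
import numpy as np

def reconstruct_sas(D, C):
    """Least-squares estimate of S from D = C @ S + E via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(C) @ D

def rmse(a, b):
    """Root-mean-square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical A -> B system with one Gaussian band per species
t = np.linspace(0.0, 10.0, 200)
wl = np.linspace(350.0, 750.0, 250)
C = np.column_stack([np.exp(-0.5 * t), 1.0 - np.exp(-0.5 * t)])
S_true = np.stack([np.exp(-0.5 * ((wl - 450.0) / 30.0) ** 2),
                   np.exp(-0.5 * ((wl - 600.0) / 40.0) ** 2)])

rng = np.random.default_rng(0)
D_clean = C @ S_true
D_noisy = D_clean + 1e-3 * rng.standard_normal(D_clean.shape)

rmse_clean = rmse(reconstruct_sas(D_clean, C), S_true)
rmse_noisy = rmse(reconstruct_sas(D_noisy, C), S_true)
```

The reconstruction is only as well conditioned as C itself, so nearly collinear concentration profiles amplify noise in the recovered spectra.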

Machine Learning Results

Initial Approaches (Modest Performance)

Classical ML (Random Forest, XGBoost, Logistic Regression) and sequential deep learning (GRUs, LSTMs) yielded top-1 validation accuracies in the range of 0.30–0.60, initially suggesting fundamental limits from spectrokinetic degeneracy.

Breakthrough: 2D CNN Architecture

Treating D(t, λ) as a 2D image and classifying it with a convolutional neural network exploited the inherent two-dimensional structure of spectrokinetic data. Combined with a scaled-up dataset (120K initial, 15K filtered), the 2D CNN achieved consistently accurate top-1 predictions across all mechanism superclasses.

Production Dataset

Dataset Statistics & Structure

119,988 initial spectra generated with balanced sampling, aggressively filtered to 15,192 high-quality samples (12.7% pass rate).

119,988 | Initial spectra | 6,666 per mechanism
15,192 | Filtered dataset | 12.7% pass rate
300 × 250 | Sample dimensions | Time × wavelength
0.358 | Mean overlap | Well below 0.5 threshold

Superclass Distribution

Superclass | Samples | Share
SEQUENTIAL_3SP | 5,426 | 35.7%
DIRECT_DECAY | 4,005 | 26.4%
HUB_TOPOLOGY | 1,904 | 12.5%
BRANCHED_3SP | 1,673 | 11.0%
LINEAR_4SP_PARTIAL | 1,094 | 7.2%
LINEAR_4SP_FULL | 886 | 5.8%
LINEAR_4SP_IRR | 204 | 1.3%

Aggressive filtering disproportionately removed samples from LINEAR_4SP_IRR (M4.1). Fully irreversible linear chains are particularly susceptible to kinetic degeneracy when rate constants approach similar values.

Export Format

training_data.npz | NumPy compressed archive

Data matrices D, concentration profiles C, species-associated spectra S

n_samples × n_t × n_λ | n_samples × n_t × N | n_samples × N × n_λ

metadata.csv | Tabular summary

Per-sample parameters, mechanism IDs, quality metrics (SNR, condition number, kinetic separation, spectral overlap)

n_samples rows × 12+ columns

labels.json | Complete provenance

All rate constants, spectral component parameters, K-matrix eigenvalues, DAS fingerprints

Full TRSpectrumLabel dictionaries

axes.npz | Shared axes

Wavelength (350–750 nm) and time (0.001–1000 s) vectors

n_λ = 250 | n_t = 250
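
Archives of this kind can be read with `numpy.load`. The sketch below writes a tiny stand-in archive to a temporary directory and reads it back; the array keys `D`, `C`, and `S` follow the symbols used above, but the actual key names inside the released files are an assumption, as are the small array sizes.

```python
import tempfile
from pathlib import Path

import numpy as np

n_samples, n_t, n_lam, n_species = 4, 25, 30, 3   # tiny illustrative sizes
rng = np.random.default_rng(0)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "training_data.npz"

    # Write a stand-in archive with the three documented array shapes
    np.savez_compressed(
        path,
        D=rng.random((n_samples, n_t, n_lam)),       # data matrices
        C=rng.random((n_samples, n_t, n_species)),   # concentration profiles
        S=rng.random((n_samples, n_species, n_lam)), # species-associated spectra
    )

    # Read it back; arrays are decompressed lazily on key access
    with np.load(path) as archive:
        keys = sorted(archive.files)
        D = archive["D"]
        C = archive["C"]
        S = archive["S"]

# Each sample obeys the bilinear shape contract: (n_t, N) @ (N, n_lam) -> (n_t, n_lam)
product_shape = (C[0] @ S[0]).shape
```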

Compute Grant Request

Help Us Scale Continuum

We have validated the synthetic data pipeline and demonstrated that 2D CNNs can classify reaction mechanisms from spectrokinetic data. To reach production-grade accuracy and extend to real experimental data, we need compute resources.

Why This Research Matters

Automating Expert Analysis

Traditional spectrokinetic analysis requires deep expertise in both spectroscopy and chemical kinetics. Our ML approach automates mechanism identification, democratizing access to these analytical capabilities.

Bridging Theory & Practice

Synthetic data with complete ground truth labels enables rigorous benchmarking impossible with experimental data alone. Transfer learning will extend these models to real laboratory measurements.

Open Science Impact

The generator, datasets, and trained models will be released open-source, providing the spectroscopy community with validated tools for automated kinetic analysis across chemistry, biochemistry, and materials science.

Dataset Generation at Scale

  • Scale from 15K to 500K+ filtered samples
  • Full 18-mechanism balanced coverage
  • Multi-noise-level augmentation
  • Estimated: ~200 GPU-hours on A100

CNN Model Training & Optimization

  • 2D CNN architecture for D(t,λ) image classification
  • Hyperparameter search across 7 superclasses
  • Cross-validation with stratified splits
  • Estimated: ~500 GPU-hours on A100/H100

Physics-Informed Neural Networks

  • PINNs leveraging analytical structure of kinetic equations
  • Embedding eigenvalue constraints as physics losses
  • Transfer learning from synthetic to experimental data
  • Estimated: ~800 GPU-hours on H100

Mixed-Order Kinetics Extension

  • Extend beyond first-order to second/mixed-order kinetics
  • Hybrid analytical-numerical ODE solvers
  • Concentration-dependent kinetic models
  • Estimated: ~400 GPU-hours on A100

Estimated Resource Requirements

Resource | Quantity | Estimated Cost
GPU Compute (A100/H100) | 2,000 GPU-hours | $6,000–$10,000
Storage (Datasets + Checkpoints) | 5 TB | $500
CPU Compute (Data Generation) | 500 hours | $250
Experiment Tracking (W&B Pro) | 12 months | $600
Total Estimated | | $7,350–$11,350

Support This Research

A compute grant would enable us to scale dataset generation, train production-grade models, and validate transfer learning from synthetic to experimental spectrokinetic data.

Prof. Ricardo Fernández-Terán • University of Geneva • Ricardo.FernandezTeran@unige.ch

Team & Affiliations

Research Team

Simón Bermúdez

Project Manager & Lead Developer

Humber Polytechnic

System architecture, ML pipeline, and data engineering

José Rafael Noriega Cedeño

Chemistry Subject Matter Expert

Humber Polytechnic

Mechanism catalogue design, kinetic validation, and chemical accuracy

Julie Anne Pavel

Data Analyst

Humber Polytechnic

Statistical analysis, quality metrics, and dataset characterization

Prof. Ricardo Fernández-Terán

Principal Investigator & Supervisor

University of Geneva

Theoretical framework, spectrokinetic methods, and experimental validation

Institutions

Humber Polytechnic
Faculty of Applied Science & Technology
Toronto, ON M9W 5L7, Canada
University of Geneva
Department of Physical Chemistry
CH-1205 Geneva, Switzerland

Foundational Reference

Fernández-Terán, R. et al. (2022). "A sweet introduction to the mathematical analysis of time-resolved spectra and complex kinetic mechanisms." J. Chem. Educ., 99(6), 2327–2337.

We thank Dr. Ricardo Fernández-Terán for insightful discussions on spectrokinetic analysis and for providing the theoretical framework. We acknowledge the support of Humber Polytechnic's Faculty of Applied Science and Technology. This work was inspired by the treatment of the chameleon reaction published in the Journal of Chemical Education.