MS2LDA (Substructure Discovery) Service — Uncover Hidden Molecular Patterns from MS/MS Data

Go beyond spectral library matching. Automatically discover recurring fragmentation patterns that reveal the molecular substructures within your unknowns.

Every untargeted metabolomics experiment generates thousands of MS/MS spectra, yet standard library matching typically annotates fewer than 10% of features. The rest remain structurally uncharacterised — not because the data lacks information, but because the fragmentation patterns that encode substructure information have not been systematically extracted.

Our MS2LDA substructure discovery service applies Latent Dirichlet Allocation (LDA) topic modeling — a machine learning technique originally developed for text mining — to your MS/MS data. It automatically identifies Mass2Motifs: sets of co-occurring fragment peaks and neutral losses that represent chemically meaningful molecular substructures such as glycosyl groups, aromatic rings, carbonyl moieties, and other recurring scaffolds. Integrated with feature-based molecular networking and GNPS, this approach transforms raw fragmentation data into a chemically interpretable map of the substructures present in your sample.

Key Advantages:

Discovers substructures without requiring a reference spectral library
Reveals the substructure composition of every feature in your dataset
Integrates seamlessly with FBMN and GNPS molecular networking workflows
Supports both discovery mode (de novo motif mining) and annotation mode (MotifDB matching)
MS2LDA 2.0 delivers up to 14× faster processing with automated Mass2Motif Annotation Guidance (MAG)


MS2LDA substructure discovery concept diagram


What MS2LDA Can Reveal That Libraries Miss

When you run a standard spectral library search on an untargeted LC-MS/MS dataset, the result is almost always the same: a handful of confident matches, a larger set of low-confidence hits, and a majority of features that remain completely unannotated. This is not a limitation of your instrumentation or your sample preparation — it is a fundamental constraint of library-dependent approaches. Spectral libraries can only annotate compounds that have been characterised and deposited before your experiment.

MS2LDA (Mass Spectral Latent Dirichlet Allocation) takes a fundamentally different approach. Instead of asking "does this spectrum match a known compound?", it asks "what recurring fragmentation patterns exist across my entire dataset?" Using LDA topic modeling, MS2LDA automatically discovers Mass2Motifs — groups of fragment ions and neutral losses that consistently co-occur across many spectra. Each Mass2Motif typically corresponds to a specific molecular substructure: a hexose sugar, a flavonoid core, a terpene backbone, a phosphate group, or any other recurring chemical feature that produces a characteristic fragmentation signature.

This means MS2LDA can reveal structural information about compounds that have never been characterised before. Even when a feature matches no known library entry, its Mass2Motif profile can tell you that it contains, for example, a glycosyl moiety and an aromatic ring — partial structural information that is invaluable for prioritising downstream isolation, bioassay testing, or targeted MSn experiments.

Our service integrates MS2LDA with feature-based molecular networking (FBMN), combining the quantitative and organisational power of molecular networks with the structural depth of Mass2Motif analysis. The result is a multi-dimensional view of your metabolome: which features are related (molecular network), how they differ quantitatively across conditions (FBMN feature table), and what substructures they share (MS2LDA Mass2Motifs).

How MS2LDA Works: From Fragmentation Spectra to Mass2Motifs

MS2LDA is built on a text-mining analogy that makes the concept intuitive. Imagine each MS/MS spectrum as a document, each fragment ion as a word, and each Mass2Motif as a topic. Just as topic modeling can discover that certain words tend to appear together in chemistry-related documents, MS2LDA discovers that certain fragment ions and neutral losses tend to co-occur in spectra from compounds that share a common substructure.
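To make the analogy concrete, the sketch below turns one spectrum into a bag of "words". It is an illustration only: the function name, the 0.01 Da bin width, the `fragment_`/`loss_` word labels, and the 0 to 100 intensity scaling are assumptions chosen for readability, not the settings used by MS2LDA itself.

```python
from collections import Counter


def spectrum_to_words(precursor_mz, peaks, bin_width=0.01):
    """Convert one MS/MS spectrum into a bag of 'words' (hypothetical helper).

    peaks is a list of (mz, intensity) tuples. Each peak yields a fragment
    word and, when the fragment is lighter than the precursor, a neutral-loss
    word. Word counts are intensities rescaled to 0-100 and rounded.
    """
    words = Counter()
    if not peaks:
        return words
    max_intensity = max(intensity for _, intensity in peaks)
    for mz, intensity in peaks:
        weight = round(100 * intensity / max_intensity)
        if weight == 0:
            continue  # peak too small to contribute a word
        # Snap the fragment m/z to the nearest bin centre.
        fragment_bin = round(mz / bin_width) * bin_width
        words[f"fragment_{fragment_bin:.4f}"] += weight
        # Neutral loss = precursor m/z minus fragment m/z.
        loss = precursor_mz - mz
        if loss > 0:
            loss_bin = round(loss / bin_width) * bin_width
            words[f"loss_{loss_bin:.4f}"] += weight
    return words


# Example: a spectrum with two fragment peaks.
example_words = spectrum_to_words(325.113, [(163.06, 1000.0), (85.03, 250.0)])
```

Each spectrum processed this way becomes a "document" whose word counts can be passed to a topic model.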

The LDA algorithm works iteratively. It starts by randomly assigning each fragment peak in each spectrum to a "topic" (Mass2Motif), then refines these assignments through repeated probabilistic sampling. Over hundreds of iterations, the algorithm converges on a stable set of Mass2Motifs — each defined by a probability distribution over fragment peaks — and a decomposition of each spectrum into a mixture of these Mass2Motifs.
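The iterative procedure described above can be sketched as a small collapsed Gibbs sampler. This is a teaching example in pure Python under our own assumptions (the function name, the hyperparameters `alpha` and `beta`, and the iteration count are illustrative); it is not the inference code used inside MS2LDA, which may differ in both method and detail.

```python
import random


def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs is a list of token lists, one per spectrum. Returns (theta, phi,
    vocab): per-spectrum topic mixtures, per-topic word distributions, and
    the sorted vocabulary that indexes the columns of phi.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Step 1: assign every token to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each token's topic given all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Step 3: turn the final counts into smoothed probability estimates.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab
```

Here each row of `phi` plays the role of one Mass2Motif (a probability distribution over fragment and loss words), and each row of `theta` is the decomposition of one spectrum into a mixture of those motifs.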

The process is entirely unsupervised. MS2LDA does not require any prior knowledge of the compounds in your sample, any training data, or any spectral library. It discovers the substructure patterns that are latent in your fragmentation data, driven purely by the statistical structure of the fragment co-occurrence patterns.

MS2LDA operates in two complementary modes:

Discovery mode: The algorithm learns a set of "free" Mass2Motifs de novo from your data. These represent the most statistically salient fragmentation patterns in your dataset, which may correspond to known or novel substructures.
Annotation mode: Known Mass2Motifs from MotifDB — a curated database of experimentally validated substructure patterns — are imported and matched against your data. This allows you to immediately recognise substructures that have been previously characterised.
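One simple way to check whether a motif found in discovery mode resembles a previously characterised one is to compare their word distributions by cosine similarity. The sketch below illustrates that comparison only; it is not how MS2LDA's annotation mode is implemented, and the function names, the 0.7 threshold, and the reference motif names are hypothetical.

```python
import math


def cosine_similarity(motif_a, motif_b):
    """Cosine similarity between two motifs given as {word: probability}."""
    shared = set(motif_a) & set(motif_b)
    dot = sum(motif_a[w] * motif_b[w] for w in shared)
    norm_a = math.sqrt(sum(p * p for p in motif_a.values()))
    norm_b = math.sqrt(sum(p * p for p in motif_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reference_match(motif, reference_motifs, threshold=0.7):
    """Return (name, score) of the most similar reference motif, or None
    when no reference reaches the (assumed) similarity threshold."""
    best_name, best_score = None, 0.0
    for name, reference in reference_motifs.items():
        score = cosine_similarity(motif, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None
    return best_name, best_score


# Hypothetical reference motifs, for illustration only.
references = {
    "hexose_like": {"fragment_163.0600": 0.6, "loss_162.0500": 0.4},
    "phosphate_like": {"fragment_98.9800": 1.0},
}
discovered = {"fragment_163.0600": 0.5, "loss_162.0500": 0.5}
match = best_reference_match(discovered, references)
```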

Both modes can be combined in a single analysis, and the results are fully compatible with GNPS molecular networking outputs. The MS2LDA-enriched edge files can be re-imported into Cytoscape to create networks where edges represent not just spectral similarity but shared substructure content.
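The idea of a motif-enriched edge can be sketched as follows: for every pair of connected nodes, record the Mass2Motifs that both nodes contain. The field names and the input layout here are assumptions made for illustration and do not reproduce the actual MS2LDA or GNPS edge file format.

```python
def enrich_edges(edges, motifs_by_node):
    """Annotate network edges with the motifs shared by both end nodes.

    edges is a list of (node_a, node_b, cosine_score) tuples and
    motifs_by_node maps a node ID to the set of motif names assigned to it.
    Returns one dict per edge (hypothetical layout).
    """
    enriched = []
    for node_a, node_b, score in edges:
        motifs_a = motifs_by_node.get(node_a, set())
        motifs_b = motifs_by_node.get(node_b, set())
        enriched.append(
            {
                "source": node_a,
                "target": node_b,
                "cosine": score,
                "shared_motifs": sorted(motifs_a & motifs_b),
            }
        )
    return enriched


# Example: nodes 1 and 2 share motif_4; nodes 2 and 3 share nothing.
example_edges = enrich_edges(
    [(1, 2, 0.80), (2, 3, 0.75)],
    {1: {"motif_4", "motif_9"}, 2: {"motif_4"}, 3: {"motif_7"}},
)
```

An edge with a non-empty `shared_motifs` list links two features that are similar overall and also share at least one substructure pattern.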

What Sets MS2LDA Apart from Conventional Annotation

Library-independent substructure discovery

MS2LDA discovers Mass2Motifs directly from your data without requiring any reference library. This means it can reveal substructures in truly novel compounds — those that have never been characterised and deposited in any database.

Substructure-level annotation of every feature

Every feature in your dataset is decomposed into its constituent Mass2Motifs with probability scores. Instead of a binary "matched/not matched" result, you get a detailed structural profile for each feature.

Seamless GNPS and FBMN integration

MS2LDA directly consumes FBMN output files (MGF spectra + edge files) and produces enriched networks that overlay Mass2Motif information onto molecular network topology.

Discovery and annotation in one workflow

Combining de novo motif discovery with MotifDB matching gives you both the broad coverage of unsupervised learning and the precision of curated substructure knowledge.

Cross-sample substructure comparison

With optional MS1 abundance data, MS2LDA enables differential analysis of Mass2Motif enrichment between experimental groups — revealing which substructures change under different conditions.

MS2LDA 2.0: 14× faster with automated annotation

The latest MS2LDA 2.0 release delivers up to a 14× speed improvement through algorithmic optimisation and introduces the Mass2Motif Annotation Guidance (MAG) tool, which is reported to achieve median substructure overlap scores of 0.75–0.93 on chemically diverse benchmark datasets (Torres Ortega et al., 2025, bioRxiv preprint).

When MS2LDA Is the Right Choice for Your Project

MS2LDA is not a replacement for spectral library matching or molecular networking — it is a complementary layer that adds structural depth to your metabolomics analysis. The following scenarios are particularly well suited to an MS2LDA approach:

Low library annotation rates: If fewer than 10% of your features match a spectral library entry, MS2LDA can extract substructure information from many of the remaining features, which would otherwise stay structurally uncharacterised.
Identifying shared substructures across compound families: When you need to know which compounds in your dataset share a specific chemical feature — for example, which features contain a glycosyl moiety or a particular heterocyclic core — MS2LDA provides direct answers.
Natural product dereplication with unknown analogs: In natural product discovery, MS2LDA can reveal that an unknown feature shares Mass2Motifs with a known compound class, suggesting it may be a structural analog even when the exact structure cannot be determined from MS/MS alone.
Environmental contaminant identification: For contaminants and transformation products that are rarely represented in spectral libraries, MS2LDA can identify substructure signatures that point to compound class membership.
Drug metabolite substructure characterisation: MS2LDA can reveal which substructures of a parent drug are retained or modified in its metabolites, supporting structure elucidation efforts.

MS2LDA works best as part of an integrated workflow that includes LC-HRMS/MS dereplication and molecular networking. Our team will help you determine whether MS2LDA is appropriate for your specific research question and dataset.

Our MS2LDA Workflow

Our MS2LDA service follows a structured six-step pipeline that integrates feature detection, molecular networking, topic modeling, and expert structural interpretation. The entire workflow is designed to be compatible with data from all major LC-MS/MS instrument platforms.

Data preprocessing and feature detection. Raw LC-MS/MS data from Thermo .raw, SCIEX .wiff, Agilent .d, or Bruker .d formats is processed using MZmine 3 or MS-DIAL. Feature detection, deisotoping, and alignment produce a feature table and an MGF file of consensus MS/MS spectra.
Feature-based molecular networking on GNPS. The feature table and MGF file are submitted to GNPS for FBMN analysis. The resulting molecular network organises features by spectral similarity and quantifies their abundance across samples.
MS2LDA topic modeling with LDA. The MGF file and edge files from the FBMN analysis are fed into MS2LDA. The LDA algorithm iteratively learns Mass2Motifs — recurring patterns of co-occurring fragment ions and neutral losses — from the entire dataset.
Mass2Motif extraction and MAG automated annotation. The discovered Mass2Motifs are processed through the MS2LDA 2.0 Mass2Motif Annotation Guidance (MAG) tool, which automatically assigns structural annotations by matching motif fragmentation patterns against the MotifDB database.
MotifDB matching and expert structural interpretation. Our bioinformatics team reviews the Mass2Motif assignments, cross-references them with spectral library matches and molecular network topology, and provides expert interpretation of the substructure landscape.
Deliverables compilation. All results are compiled into a comprehensive deliverable package including the annotated Mass2Motif table, enriched molecular network files, MS2LDA Dict file, and a summary report.
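For readers preparing their own input, the MGF file produced in the first step is a plain-text format in which each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=VALUE` header lines followed by `m/z intensity` peak lines. The reader below is a minimal sketch for inspecting such files; the function name and the returned layout are our own, and production pipelines normally rely on an established parsing library instead.

```python
def parse_mgf(lines):
    """Parse MGF text (an iterable of lines) into a list of spectra.

    Each spectrum is a dict with 'params' (header fields as strings) and
    'peaks' (a list of (mz, intensity) float tuples).
    """
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header line such as PEPMASS=325.113
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z and intensity separated by whitespace.
                parts = line.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Example with a single, made-up spectrum.
example_mgf = """BEGIN IONS
FEATURE_ID=1
PEPMASS=325.113
MSLEVEL=2
163.06 1000.0
85.03 250.0
END IONS
"""
example_spectra = parse_mgf(example_mgf.splitlines())
```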

MS2LDA workflow diagram six steps from raw data to report

This workflow is fully compatible with our broader natural product MS discovery service portfolio. MS2LDA analysis can be added to any FBMN project, or run independently on your pre-processed MGF and edge files.

MS2LDA vs. Other Annotation Methods: A Technical Comparison

Each MS/MS annotation approach has distinct strengths. The table below compares MS2LDA against three widely used alternatives across six dimensions relevant to substructure discovery.

Dimension	MS2LDA (Topic Modeling)	Spectral Library Matching	SIRIUS + CSI:FingerID	Deep Learning (MS2DeepScore)
Approach	Unsupervised LDA topic modeling discovers recurring fragment patterns from the dataset itself	Cosine or dot-product similarity against reference spectra in curated libraries (GNPS, NIST, MoNA)	Fragmentation tree computation + fingerprint prediction from molecular formula	Neural network trained on large spectral libraries to predict structural similarity between spectra
Reference dependency	None — discovers patterns de novo from user data	High — requires reference spectra for every compound	Moderate — requires molecular formula (from isotope pattern) and a training database for fingerprint prediction	High — requires large training dataset of annotated spectra
Substructure resolution	Direct — each Mass2Motif corresponds to a specific substructure pattern	None — provides compound-level matches only	Indirect — molecular fingerprints encode substructure presence/absence	Indirect — similarity scores do not directly reveal which substructures are shared
Unknown annotation capability	Strong — reveals substructure information for compounds with no library match	None — cannot annotate spectra not in the library	Moderate — can predict molecular formula and fingerprint for unknowns	Weak — performance degrades significantly for compound classes underrepresented in training data
Integration with networking	Native — produces enriched edge files that overlay Motif information onto molecular networks	Limited — library matches can be overlaid on networks but provide no substructure-level integration	Partial — SIRIUS results can be imported into networks via GNPS but require manual mapping	Limited — similarity scores can be used for network construction but lack substructure interpretability
Interpretability	High — each Mass2Motif is defined by a specific set of fragment peaks and neutral losses that can be chemically interpreted	High for matches — but only for compounds already in the library	Moderate — fingerprint scores indicate substructure probability but do not provide direct fragment-level evidence	Low — deep learning similarity scores are difficult to trace back to specific structural features

MS2LDA occupies a distinct position in this landscape. Of the four approaches compared here, it is the only one that provides direct, library-independent substructure discovery with native integration into molecular networking workflows. For projects where understanding the substructure composition of unknowns is the primary goal, MS2LDA offers capabilities that complement, rather than compete with, other annotation tools. For complementary approaches, see our deep learning MS annotation service.

Sample Requirements

MS2LDA analysis can be performed on pre-existing LC-MS/MS datasets or on samples that we process in-house. The table below outlines the recommended sample specifications.

Sample Type	Recommended Amount	Concentration	Format	Notes
LC-MS/MS raw data (pre-acquired)	≥3 biological replicates per group	N/A	Thermo .raw, SCIEX .wiff, Agilent .d, or Bruker .d	Provide MS/MS acquisition parameters (collision energy, isolation window, resolution). DDA data preferred.
Pre-processed MGF + edge files	N/A	N/A	.mgf (spectra) + .tsv/.csv (edges with self-loops)	Files must be from a completed FBMN or classical molecular networking analysis on GNPS.
Crude extract (for in-house LC-MS/MS)	≥100 µg dried extract or ≥100 µL liquid extract	≥1 mg/mL in suitable solvent	Glass vial, dry or in HPLC-grade methanol/acetonitrile	Avoid non-volatile buffers (PBS, Tris) and detergents. Provide extraction protocol.
Purified compound or fraction	≥10 µg	≥0.1 mg/mL	Glass vial, dry or in HPLC-grade solvent	Provide expected molecular formula or mass if known.
Processed feature table (optional)	N/A	N/A	.csv from MZmine, MS-DIAL, or XCMS	Include feature ID, retention time, m/z, and intensity across samples for differential analysis.

Note: For projects requiring MS1 abundance-based differential analysis of Mass2Motif enrichment, a feature table with per-sample intensity values is required. Please contact us to discuss your specific dataset and experimental design.
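To show why per-sample intensities are needed, the sketch below sums the MS1 intensity of all features carrying a given Mass2Motif in each sample and compares two groups by log2 fold change. It is a simplified illustration under our own assumptions (function name, input layout, pseudocount); it is not the statistical procedure used by MS2LDA or ms2lda.org, and a real analysis would add replicate-aware significance testing.

```python
import math
from statistics import mean


def motif_log2_fold_change(feature_table, feature_motifs, group_a, group_b,
                           pseudocount=1.0):
    """Log2 fold change (group_b over group_a) of summed motif intensity.

    feature_table maps feature_id -> {sample_name: intensity}.
    feature_motifs maps feature_id -> set of motif names for that feature.
    group_a and group_b are non-empty lists of sample names.
    """
    samples = list(group_a) + list(group_b)
    totals = {}  # motif -> {sample: summed intensity}
    for feature_id, motifs in feature_motifs.items():
        intensities = feature_table.get(feature_id, {})
        for motif in motifs:
            per_sample = totals.setdefault(motif, {s: 0.0 for s in samples})
            for sample in samples:
                per_sample[sample] += intensities.get(sample, 0.0)

    fold_changes = {}
    for motif, per_sample in totals.items():
        mean_a = mean(per_sample[s] for s in group_a)
        mean_b = mean(per_sample[s] for s in group_b)
        # The pseudocount avoids division by zero for absent motifs.
        fold_changes[motif] = math.log2(
            (mean_b + pseudocount) / (mean_a + pseudocount)
        )
    return fold_changes
```

A positive value indicates that features carrying the motif are, in aggregate, more abundant in `group_b` than in `group_a`.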

Deliverables

Mass2Motif annotation table: A detailed table listing every feature in your dataset, its associated Mass2Motifs, probability scores, and MAG structural annotations where available.
Enriched molecular network: A GNPS-compatible molecular network with Motif-enriched edge files, ready for visualisation in Cytoscape. Edges are annotated with shared Mass2Motif information, revealing substructure relationships between nodes.
MS2LDA Dict file: The complete MS2LDA experiment file, compatible with the ms2lda.org web application for interactive exploration, manual annotation, and differential analysis.
Motif PDF report: A graphical report showing each Mass2Motif's fragment peak and neutral loss composition, with structural annotations where identified.
Summary report: A written interpretation of the substructure landscape, highlighting the most salient Mass2Motifs, their distribution across sample groups, and their relationship to known compound classes.

Representative MS2LDA Results


Example Mass2Motif-enriched molecular network

Nodes represent individual features, edges represent spectral similarity, and node colours indicate the dominant Mass2Motif assigned to each feature. This visualisation reveals not only which compounds are related by spectral similarity, but also which substructures they share.

Case Study: MS2LDA + FBMN for Compound Profiling of a Traditional Medicinal Plant

Kong, X., Tian, G., Wu, T., Hu, S., Zhao, J. (2024). Journal of Separation Science, 47(16), e202400248. https://doi.org/10.1002/jssc.202400248

Background

Lanbuzheng (Geum japonicum Thunb. var. chinense Bolle) is a plant found in Southwest China that has been used in traditional medicine for its haematopoietic and antioxidant properties. Despite its therapeutic potential, the majority of its chemical constituents remained uncharacterised, posing a challenge for quality control, pharmacological studies, and natural product development. Kong et al. (2024) applied an integrated strategy combining UHPLC-Q-Exactive Orbitrap HRMS, FBMN, and MS2LDA to systematically profile the chemical composition of Lanbuzheng.

Methods

Dried Lanbuzheng material was extracted with 70% methanol and analysed by UHPLC-Q-Exactive Orbitrap HRMS in positive ionisation mode. Raw data were processed using MZmine 3 for feature detection, deisotoping, and alignment. The resulting feature table and MGF file were submitted to GNPS for FBMN analysis. The FBMN output — including the MGF file and edge files with self-loops — was then used as input for MS2LDA, which performed LDA topic modeling to discover Mass2Motifs associated with the detected features. A custom in-house library of 206 compounds was used for targeted identification alongside the unsupervised MS2LDA analysis.

Results

The FBMN analysis organised the detected features into a molecular network that revealed the relationships between different compound classes. MS2LDA analysis identified Mass2Motifs corresponding to key substructure classes, including glycosyl moieties, phenolic hydroxyl groups, and terpene backbones. Based on the combined FBMN clustering and MS2LDA Mass2Motif assignments, the constituents of Lanbuzheng were classified into four major compound classes: tannins, triterpenes, flavonoids, and phenolics. The custom library of 206 compounds enabled targeted identification of known constituents, while MS2LDA provided substructure-level annotation for features that did not match any library entry. Importantly, the study detected 210 features that fell outside the coverage of the custom library, and MS2LDA's Mass2Motif analysis provided partial structural characterisation for many of these unknowns — revealing, for example, which features contained glycosyl substitutions or flavonoid-like fragmentation patterns.

Conclusion

This case study demonstrates the power of combining FBMN with MS2LDA for comprehensive natural product profiling. The FBMN provided quantitative organisation and spectral similarity relationships, while MS2LDA added a structural dimension by identifying Mass2Motifs that revealed shared substructures across compound classes. For researchers working with complex natural product extracts, traditional medicines, or any sample where library coverage is limited, this integrated approach offers a practical pathway to deeper structural characterisation.


Integrated workflow combining UHPLC-Q-Exactive Orbitrap HRMS, FBMN, and MS2LDA for comprehensive compound profiling of Lanbuzheng.


Frequently Asked Questions

Q: What is a Mass2Motif and how is it different from a regular MS/MS fragment?

A Mass2Motif is a set of co-occurring fragment peaks and/or neutral losses that MS2LDA identifies as a recurring pattern across many spectra. Unlike individual fragments, a Mass2Motif represents a chemically meaningful substructure — such as a glycosyl moiety, an aromatic ring, or a carbonyl group — that is shared across multiple compounds. While a single fragment ion may be non-specific, the co-occurrence pattern captured by a Mass2Motif provides robust evidence for a particular substructure.

Q: What types of LC-MS/MS data are suitable for MS2LDA analysis?

Any untargeted LC-MS/MS dataset with data-dependent acquisition (DDA) MS/MS spectra is suitable. Common applications include natural product extracts, microbial metabolomics, clinical biofluids, plant metabolomics, environmental samples, and food science studies. We recommend at least three biological replicates per experimental group for statistical robustness. Data from Thermo, SCIEX, Agilent, and Bruker instruments are all supported.

Q: How does MS2LDA complement FBMN and GNPS molecular networking?

FBMN organises features by spectral similarity and quantifies their abundance across samples. MS2LDA adds a structural dimension by identifying which substructures (Mass2Motifs) each feature contains. Together, they reveal not just which compounds are related but why: they share chemical substructures. The MS2LDA-enriched edge files can be imported into Cytoscape to create networks where edges represent both spectral similarity and shared substructure content.

Q: Can MS2LDA annotate compounds that do not match any spectral library?

Yes — this is MS2LDA's key strength. Instead of requiring a library match, MS2LDA discovers substructure patterns directly from your data. Even if a compound is novel and has never been characterised, its Mass2Motifs can reveal partial structural information — for example, that it contains a glycosyl group and a flavonoid core — which is invaluable for prioritising downstream experiments.

Q: What deliverables will I receive from your MS2LDA service?

You will receive a Mass2Motif annotation table with probability scores, an enriched molecular network (GNPS link + Cytoscape file with Motif edges), an MS2LDA Dict file compatible with ms2lda.org for interactive exploration, a Motif PDF report, and a summary report with expert structural interpretation of the substructure landscape.

Q: Can I run MS2LDA myself on GNPS? Why should I use your service?

MS2LDA is available as an open-source tool on GNPS, and we encourage researchers to explore it. However, optimal results depend on careful parameter selection (LDA free motif count, bin width, probability thresholds), proper data preprocessing, and — most importantly — expert interpretation of the discovered Mass2Motifs. Our service provides end-to-end support: from data preprocessing and FBMN analysis through MS2LDA parameter optimisation, MAG automated annotation, and expert structural interpretation by experienced metabolomics bioinformaticians.

References

van der Hooft, J.J.J., Wandy, J., Barrett, M.P., Burgess, K.E.V., Rogers, S. (2016). Topic modeling for untargeted substructure exploration in metabolomics. Proceedings of the National Academy of Sciences, 113(48), 13738–13743. https://doi.org/10.1073/pnas.1608041113
Torres Ortega, L.R., Dietrich, J.A., Wandy, J., Mol, J.G.J., van der Hooft, J.J.J. (2025). Large-scale discovery and annotation of hidden substructure patterns in mass spectrometry profiles. bioRxiv, 2025.06.19.659491. https://doi.org/10.1101/2025.06.19.659491
Kong, X., Tian, G., Wu, T., Hu, S., Zhao, J. (2024). Feature-based molecular networking with MS2LDA to profile compounds in Lanbuzheng based on ultra-high-performance liquid chromatography-quadrupole Exactive Orbitrap high-resolution mass spectrometry. Journal of Separation Science, 47(16), e202400248. https://doi.org/10.1002/jssc.202400248

Uncover Hidden Substructures in Your MS/MS Data

Submit your LC-MS/MS dataset or raw samples and our metabolomics team will apply MS2LDA topic modeling to reveal the Mass2Motifs that library matching alone cannot find.


For Research Use Only. Not for use in diagnostic or clinical procedures.
