Advances in mass spectrometry (MS) techniques have greatly improved the throughput and coverage of untargeted metabolomics, making it a powerful tool for screening altered metabolites associated with phenotypes or stimuli. Tens of thousands of metabolites can be detected in a single run and quantitatively measured as the peak areas of MS features. The precision of this peak-area quantitation is affected by a variety of factors, such as batch effects, ionization efficiency, and dilution effects. To combat these unwanted variations, several data processing and pretreatment steps are needed, and multiple algorithms have been developed for each of them. However, there has so far been no widely accepted workflow for untargeted metabolomics data analysis. On the one hand, integrated pipelines implement only a limited selection of data analysis algorithms, without systematic evaluation of their performance. On the other hand, users are encouraged to try different data processing algorithms and choose the best one themselves.
In this study, I developed MetaboPro, an expert analysis system for untargeted metabolomics data that implements multiple approaches for each processing step and provides systematic evaluations of them. I showed that there is no single solution for missing value imputation, batch effect removal, sample normalization, transformation, or scaling: the performance of different approaches varied widely, and no approach consistently outperformed the others across datasets. I reviewed and compared the existing evaluation criteria for each analysis step and implemented the commonly used data analysis approaches and evaluation criteria in MetaboPro. In addition, I developed a stepwise imputation strategy that classifies missing values into three classes and imputes each according to its origin, which greatly improved imputation accuracy.
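The idea of implementing several approaches for one step and ranking them by an evaluation criterion can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from MetaboPro: the two normalization methods (sum and median), the criterion (median relative standard deviation across replicate quality-control injections, lower being better), the simulated data, and all function names.

```python
import numpy as np

def sum_normalize(X):
    """Scale each sample (row) so that its total intensity equals the mean total."""
    totals = X.sum(axis=1, keepdims=True)
    return X / totals * totals.mean()

def median_normalize(X):
    """Scale each sample (row) so that its median equals the mean sample median."""
    medians = np.median(X, axis=1, keepdims=True)
    return X / medians * medians.mean()

def median_qc_rsd(X):
    """Median relative standard deviation over features (hypothetical criterion)."""
    rsd = X.std(axis=0, ddof=1) / X.mean(axis=0)
    return float(np.median(rsd))

def simulate_qc(n_samples=12, n_features=200, seed=0):
    """Simulated replicate QC injections: one true profile, a per-sample
    dilution factor, and small multiplicative noise (all values assumed)."""
    rng = np.random.default_rng(seed)
    true_profile = rng.lognormal(mean=10, sigma=1, size=n_features)
    dilution = rng.uniform(0.5, 2.0, size=(n_samples, 1))
    noise = rng.lognormal(mean=0, sigma=0.05, size=(n_samples, n_features))
    return true_profile * dilution * noise

def compare_methods(X):
    """Apply every candidate method and score each result with the criterion."""
    methods = {
        "none": lambda A: A,
        "sum": sum_normalize,
        "median": median_normalize,
    }
    return {name: median_qc_rsd(fn(X)) for name, fn in methods.items()}

scores = compare_methods(simulate_qc())
best = min(scores, key=scores.get)  # method with the lowest median QC RSD
print(scores, best)
```

On real data the ranking would depend on the dataset and on the criterion chosen, which is consistent with the observation that no approach is consistently the best.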
This expert analysis system may serve the community in at least three ways. First, MetaboPro provides step-by-step guidance on the necessity of each data processing step and on how to evaluate its outcome, which helps new users better understand the data analysis. Second, this integrated tool greatly improves the robustness of the statistical outcome, leading to more precise interpretation of phenotypes. Finally, this study will advance the development of untargeted metabolomics data analysis and speed up the formation of a widely accepted workflow.