2019.06
Applying machine learning strategy for microsatellite status detection in plasma sample type

Authors: Lili Zhao, Wenbo Han, Wei Zhou, Lu Fang, Yuanyuan Hong, Chen Tian, Weizhi Chen, Ji He


Background: Microsatellite (MS) status is an important biomarker in cancer diagnosis. Most available detection algorithms count the validated allele types in control and case samples, then determine MS status from the difference between them. This approach suits tissue samples, which contain clear and sufficient tumor nucleic acid. In liquid biopsy samples such as plasma, however, the lower ctDNA concentration and the higher background noise introduced by laboratory and sequencing procedures make it necessary to develop a new method for plasma MS status detection.


Methods: We developed a machine learning-based approach to detect MS status from next-generation sequencing (NGS) data of plasma samples. A total of 98 blood cell samples, 12 tumor tissue samples with PCR-confirmed microsatellite-stable (MSS) status, and 87 plasma samples from 134 patients were used to build a normal baseline and to check the sequencing stability and population polymorphism of 207 candidate simple sequence repeat (SSR) loci in our panel. Of these, 97 loci passed and were chosen as the initial biomarkers for model construction. For each incoming sample, the allele probability distribution was obtained at each locus, and the Kullback–Leibler divergence (KLD) between the target sample and the baseline samples was calculated for each locus. The KLD values were used as input for feature selection and machine learning model fitting. Finally, a leave-one-out strategy was used to evaluate model performance.
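
The per-locus KLD step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the direction of the divergence, how zero counts are handled, or how the baseline distribution is pooled, so the choice of D(sample || baseline), the small pseudocount, the function names, and the toy read counts below are all assumptions made for illustration.

```python
import math

def normalize(counts, pseudocount=1e-6):
    """Turn raw allele (repeat-length) read counts into a probability distribution.

    The pseudocount is an assumption; it keeps every probability strictly positive.
    """
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q), in nats, for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same allele bins")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def locus_kld_features(sample_counts, baseline_counts):
    """Return one KLD value per locus: sample distribution vs. baseline distribution."""
    features = {}
    for locus, counts in sample_counts.items():
        p = normalize(counts)
        q = normalize(baseline_counts[locus])
        features[locus] = kl_divergence(p, q)
    return features

# Hypothetical read counts per allele length at two made-up loci.
baseline = {"locus_A": [5, 90, 5], "locus_B": [10, 80, 10]}
sample = {"locus_A": [5, 90, 5], "locus_B": [40, 50, 10]}

features = locus_kld_features(sample, baseline)
for locus, value in features.items():
    print(locus, round(value, 4))
```

In this toy input, locus_A matches the baseline and its KLD is essentially zero, while locus_B has shifted toward a shorter allele and its KLD is clearly positive. A vector of such values, one per locus, is the kind of input the abstract describes passing to feature selection and model fitting.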


Results: We obtained the 17 most relevant SSR loci suitable for MS status analysis of plasma samples. A support vector machine (SVM) model fitted with data from 49 real patient samples (21 MSI and 28 MSS) predicted the MS status of plasma samples with a sensitivity of 90.47% and a specificity of 96.42%.
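
The reported percentages can be checked with a few lines of Python. The abstract does not give the underlying counts; the values used here (19 of 21 MSI samples and 27 of 28 MSS samples classified correctly) are inferred as the only whole-number counts consistent with the reported figures, and the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Inferred counts (assumption): 21 MSI samples with 19 called MSI,
# 28 MSS samples with 27 called MSS.
sens, spec = sensitivity_specificity(tp=19, fn=2, tn=27, fp=1)
print(f"sensitivity = {sens:.4%}, specificity = {spec:.4%}")
```

Under these counts the reported values correspond to 19/21 and 27/28 truncated to two decimal places.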


Conclusions: Our results suggest that our MSI calculation model, which combines KLD values with an SVM, is appropriate for plasma samples. Further large-scale prospective studies are needed to validate the model.