Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution

Publication: Contribution to journal › Journal article › Research › Peer-reviewed

Standard

Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. / Yang, Meng; Huang, Lichao; Huang, Haiping; Tang, Hui; Zhang, Nan; Yang, Huanming; Wu, Jihong; Mu, Feng.

In: Nucleic Acids Research, Vol. 50, No. 14, e81, 2022.


Harvard

Yang, M, Huang, L, Huang, H, Tang, H, Zhang, N, Yang, H, Wu, J & Mu, F 2022, 'Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution', Nucleic Acids Research, vol. 50, no. 14, e81. https://doi.org/10.1093/nar/gkac326

APA

Yang, M., Huang, L., Huang, H., Tang, H., Zhang, N., Yang, H., Wu, J., & Mu, F. (2022). Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Research, 50(14), Article e81. https://doi.org/10.1093/nar/gkac326

Vancouver

Yang M, Huang L, Huang H, Tang H, Zhang N, Yang H et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Research. 2022;50(14):e81. https://doi.org/10.1093/nar/gkac326

Author

Yang, Meng ; Huang, Lichao ; Huang, Haiping ; Tang, Hui ; Zhang, Nan ; Yang, Huanming ; Wu, Jihong ; Mu, Feng. / Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. In: Nucleic Acids Research. 2022 ; Vol. 50, No. 14.

Bibtex

@article{9a6f838d1a5a42f18f8f6dc9cebee122,
title = "Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution",
abstract = "Interpretation of non-coding genome remains an unsolved challenge in human genetics due to impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches emerge recently to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling task, and further extended to variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization benchmarking against the fully supervised model, DeepSEA and 1% parameterization against a recent BERT-based DNA language model. For allelic-effect prediction, locality introduced by one dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and human genome and demonstrate LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base-resolution.",
author = "Meng Yang and Lichao Huang and Haiping Huang and Hui Tang and Nan Zhang and Huanming Yang and Jihong Wu and Feng Mu",
note = "{\textcopyright} The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.",
year = "2022",
doi = "10.1093/nar/gkac326",
language = "English",
volume = "50",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "14",
pages = "e81",
}

RIS

TY - JOUR
T1 - Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution
AU - Yang, Meng
AU - Huang, Lichao
AU - Huang, Haiping
AU - Tang, Hui
AU - Zhang, Nan
AU - Yang, Huanming
AU - Wu, Jihong
AU - Mu, Feng
N1 - © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.
PY - 2022
Y1 - 2022

N2 - Interpretation of non-coding genome remains an unsolved challenge in human genetics due to impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches emerge recently to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling task, and further extended to variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization benchmarking against the fully supervised model, DeepSEA and 1% parameterization against a recent BERT-based DNA language model. For allelic-effect prediction, locality introduced by one dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and human genome and demonstrate LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base-resolution.

AB - Interpretation of non-coding genome remains an unsolved challenge in human genetics due to impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches emerge recently to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling task, and further extended to variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization benchmarking against the fully supervised model, DeepSEA and 1% parameterization against a recent BERT-based DNA language model. For allelic-effect prediction, locality introduced by one dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and human genome and demonstrate LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base-resolution.

U2 - 10.1093/nar/gkac326
DO - 10.1093/nar/gkac326
M3 - Journal article
C2 - 35536244
VL - 50
JO - Nucleic Acids Research
JF - Nucleic Acids Research
SN - 0305-1048
IS - 14
M1 - e81
ER -

ID: 307365945