BioBERT

BioBERT provides pre-trained language representations specialized for biomedical text mining to improve extraction of biomedical named entities, relations, and answers from the biomedical literature.


Key Features:

  • BERT-based architecture: Built upon BERT (Bidirectional Encoder Representations from Transformers) to leverage bidirectional transformer representations.
  • Domain-specific pre-training: Pre-trained on large-scale biomedical corpora (PubMed abstracts and PMC full-text articles) to address the shift in word distribution between general-domain and biomedical text.
  • Pre-trained language representation model: Supplies representations optimized for biomedical vocabulary and semantics.
  • Fine-tuning for downstream tasks: Supports fine-tuning for task-specific models such as named entity recognition, relation extraction, and question answering.
  • Empirical performance gains: Reports improvements over previous state-of-the-art models of 0.62% in F1 score for named entity recognition, 2.80% in F1 score for relation extraction, and 12.24% in mean reciprocal rank (MRR) for question answering.
  • Consistent architecture across tasks: Uses the same underlying model architecture without task-specific structural modifications.

Scientific Applications:

  • Biomedical named entity recognition: Identifies biomedical entities in text with improved F1 performance (+0.62% reported).
  • Relation extraction: Extracts relations between biomedical entities with improved F1 performance (+2.80% reported).
  • Question answering: Retrieves and ranks answers to biomedical questions with enhanced mean reciprocal rank (+12.24% MRR reported).
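
The two metrics quoted above can be illustrated with a minimal sketch. It assumes exact-match scoring of entity spans for F1 and a ranked candidate list per question for MRR. The function names and toy data are hypothetical and are not taken from the BioBERT code base, and the evaluation scripts used in the publication may differ in detail.

```python
from typing import Hashable, Sequence, Set


def f1_score(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over exact-match items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked_answers: Sequence[Sequence[str]],
                         gold_answers: Sequence[Set[str]]) -> float:
    """Average of 1 / rank of the first correct answer (0 if none is correct)."""
    if not ranked_answers:
        return 0.0
    total = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(candidates, start=1):
            if answer in gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)


# Hypothetical toy data: entities as (start, end, type) spans.
gold_entities = {(0, 2, "Disease"), (5, 6, "Gene"), (9, 10, "Chemical")}
predicted_entities = {(0, 2, "Disease"), (5, 6, "Chemical"), (9, 10, "Chemical")}
ner_f1 = f1_score(predicted_entities, gold_entities)

# Hypothetical ranked answers for two questions, with their gold answers.
ranked = [["BRCA1", "TP53"], ["aspirin", "ibuprofen", "warfarin"]]
gold = [{"TP53"}, {"warfarin"}]
qa_mrr = mean_reciprocal_rank(ranked, gold)

print(f"NER F1: {ner_f1:.4f}")
print(f"QA MRR: {qa_mrr:.4f}")
```

In the toy data, two of three predicted spans match the gold spans exactly, so precision and recall are both 2/3. The correct answers appear at ranks 2 and 3, so the MRR is the mean of 1/2 and 1/3.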

Methodology:

The model is pre-trained on large-scale biomedical corpora using the BERT architecture and then fine-tuned separately for each downstream task.

Details

Programming Languages: Python
Added: 11/14/2019
Last Updated: 11/24/2024

Publications

Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2019;36(4):1234-1240. doi:10.1093/bioinformatics/btz682. PMID:31501885. PMCID:PMC7703786.

Funding: National Research Foundation of Korea (NRF), funded by the Korea government: NRF-2014M3C9A3063541, NRF-2017M3C4A7065887, NRF-2017R1A2A1A17069645
