PDFDataExtractor
PDFDataExtractor extracts and interprets metadata from Portable Document Format (PDF) scientific literature to support chemical literature mining and metadata-driven analyses.
Key Features:
- Enhanced Data Extraction: Extracts latent metadata from PDF files and integrates chemical named-entity recognition via ChemDataExtractor.
- Template-Based Architecture: Reconstructs the logical structure of scientific articles from layout templates and extracts semantic information across diverse layouts.
- Comprehensive Metadata Extraction: Outputs detailed metadata in JSON and plain text formats, including paper title, authors, affiliations, email addresses, abstracts, keywords, journal names, publication years, DOIs, references, and issue numbers.
- Evaluated Precision: Extraction precision for key metadata fields was assessed on an evaluation set of articles assembled by the authors.
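The JSON and plain-text outputs listed above can be pictured with a minimal sketch. It does not call PDFDataExtractor: the `ArticleMetadata` class, its field names, and both helper functions are hypothetical and only mirror the metadata fields named in the feature list. The tool's actual output schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleMetadata:
    """Hypothetical record; field names are illustrative, not the tool's schema."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    emails: List[str] = field(default_factory=list)
    abstract: str = ""
    keywords: List[str] = field(default_factory=list)
    journal: str = ""
    year: Optional[int] = None
    doi: str = ""
    issue: str = ""
    references: List[str] = field(default_factory=list)


def to_json(record: ArticleMetadata) -> str:
    """Serialize the record as indented JSON."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


def to_plain_text(record: ArticleMetadata) -> str:
    """Serialize the record as 'field: value' lines, joining lists with '; '."""
    lines = []
    for name, value in asdict(record).items():
        if isinstance(value, list):
            value = "; ".join(value)
        elif value is None:
            value = ""
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Values taken from the publication cited on this page.
    record = ArticleMetadata(
        authors=["Zhu M", "Cole JM"],
        journal="Journal of Chemical Information and Modeling",
        year=2022,
        doi="10.1021/acs.jcim.1c01198",
        issue="7",
    )
    print(to_json(record))
    print(to_plain_text(record))
```

Keeping one record type and two serializers means the JSON and plain-text views cannot drift apart, which is one reasonable way to offer both formats.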
Scientific Applications:
- Chemical literature mining: Enables extraction and correlation of chemical entities and property data from PDFs by leveraging ChemDataExtractor.
- Data-driven materials discovery: Provides structured metadata to support data science workflows and materials-discovery analyses driven by literature-derived data.
Methodology:
Uses a template-based architecture to reconstruct the logical structure of each document, applies ChemDataExtractor for chemical named-entity recognition, and outputs metadata in JSON and plain text. Precision was assessed on a self-created evaluation article set.
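As a rough illustration of the template-based idea, the sketch below picks a layout template for a page of already-extracted text and applies that template's field patterns. Everything in it is an assumption made for illustration: the `Template` class, the two example templates, and the regular expressions are hypothetical and are not taken from PDFDataExtractor. The chemical named-entity recognition step is omitted.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Template:
    """Hypothetical layout template: a detector plus one regex per field."""
    name: str
    detect: re.Pattern             # matches text typical of this layout
    fields: Dict[str, re.Pattern]  # each pattern has exactly one capture group


DOI_PATTERN = re.compile(r"doi:\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)

# Ordered from most specific to most general; the last entry matches any text.
TEMPLATES: List[Template] = [
    Template(
        name="layout_a",
        detect=re.compile(r"^Journal of ", re.MULTILINE),
        fields={
            "doi": DOI_PATTERN,
            "year": re.compile(r"\b((?:19|20)\d{2})\b"),
        },
    ),
    Template(name="generic", detect=re.compile(r""), fields={"doi": DOI_PATTERN}),
]


def select_template(text: str) -> Template:
    """Return the first template whose detector matches the text."""
    for template in TEMPLATES:
        if template.detect.search(text):
            return template
    return TEMPLATES[-1]


def extract(text: str) -> Dict[str, Optional[str]]:
    """Apply the selected template's patterns; unmatched fields become None."""
    template = select_template(text)
    result: Dict[str, Optional[str]] = {"template": template.name}
    for field_name, pattern in template.fields.items():
        match = pattern.search(text)
        # Trim trailing sentence punctuation that the \S+ capture may pick up.
        result[field_name] = match.group(1).rstrip(".,;") if match else None
    return result


if __name__ == "__main__":
    page = (
        "Journal of Chemical Information and Modeling. "
        "2022;62(7):1633-1643. doi:10.1021/acs.jcim.1c01198."
    )
    print(extract(page))
```

Separating template selection from field extraction means a new layout can be supported by adding one entry to the template list rather than changing the extraction logic.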
Details
- License: MIT
- Cost: Free of charge
- Tool Type: Library
- Operating Systems: Mac, Linux, Windows
- Programming Languages: Python
- Added: 7/6/2022
- Last Updated: 11/24/2024
Publications
Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling. 2022;62(7):1633-1643. doi:10.1021/acs.jcim.1c01198. PMID:35349259. PMCID:PMC9049592.