PDFDataExtractor

PDFDataExtractor extracts and interprets metadata from Portable Document Format (PDF) scientific literature to support chemical literature mining and metadata-driven analyses.

Key Features:

Enhanced Data Extraction: Extracts latent metadata from PDF files and integrates chemical named-entity recognition via ChemDataExtractor.
Template-Based Architecture: Reconstructs the logical structure of scientific articles and extracts semantic information across diverse layouts.
Comprehensive Metadata Extraction: Outputs detailed metadata in JSON and plain text formats, including paper title, authors, affiliations, email addresses, abstracts, keywords, journal names, publication years, DOIs, references, and issue numbers.
Improved Precision: Precision in extracting key metadata fields was assessed on a self-created evaluation set of articles.
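
The metadata fields listed above can be pictured as one record per article. The following is a minimal sketch of such a record and of its JSON and plain-text serialization, not the tool's own API: the `ArticleMetadata` class, its field names, and the two helper functions are hypothetical names chosen for illustration, and the sample values are taken from the publication cited on this page.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ArticleMetadata:
    """Hypothetical container for the metadata fields named in this listing."""
    title: str = ""
    authors: list = field(default_factory=list)
    affiliations: list = field(default_factory=list)
    emails: list = field(default_factory=list)
    abstract: str = ""
    keywords: list = field(default_factory=list)
    journal: str = ""
    year: str = ""
    doi: str = ""
    references: list = field(default_factory=list)
    issue: str = ""


def to_json(record: ArticleMetadata) -> str:
    """Serialize the record as indented JSON."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


def to_plain_text(record: ArticleMetadata) -> str:
    """Serialize the record as 'field: value' lines, joining lists with '; '."""
    lines = []
    for key, value in asdict(record).items():
        if isinstance(value, list):
            value = "; ".join(value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Sample values come from the publication entry on this page.
record = ArticleMetadata(
    title=("PDFDataExtractor: A Tool for Reading Scientific Text and "
           "Interpreting Metadata from the Typeset Literature in the "
           "Portable Document Format"),
    authors=["Zhu M", "Cole JM"],
    journal="Journal of Chemical Information and Modeling",
    year="2022",
    doi="10.1021/acs.jcim.1c01198",
    issue="7",
)

print(to_json(record))
print(to_plain_text(record))
```

Keeping every field in a single record makes it straightforward to emit both output formats from the same data; the actual field names and layout produced by PDFDataExtractor may differ, so consult the user manual for the real output schema.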

Scientific Applications:

Chemical literature mining: Enables extraction and correlation of chemical entities and property data from PDFs by leveraging ChemDataExtractor.
Data-driven materials discovery: Provides structured metadata to support data science workflows and materials-discovery analyses driven by literature-derived data.

Methodology:

Uses a template-based architecture to reconstruct the logical structure of each document, applies ChemDataExtractor for chemical named-entity recognition, and outputs metadata in JSON and plain text. Precision was assessed with a self-created evaluation article set.
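
As a rough illustration of what a template-based approach can look like, the sketch below selects a layout template by a signature pattern and then applies that template's field patterns to the text. It is a toy example under stated assumptions, not PDFDataExtractor's implementation: the `Template` class, the two layouts, their regular expressions, and the sample text (including its DOI) are all invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical layout template: a signature plus per-field patterns."""
    name: str
    signature: re.Pattern  # pattern that identifies the layout
    fields: dict           # field name -> pattern with one capture group


# Two invented layouts; real publisher layouts are far more varied.
TEMPLATES = [
    Template(
        name="layout_a",
        signature=re.compile(r"^Cite This:", re.M),
        fields={
            "doi": re.compile(r"https://doi\.org/(\S+)"),
            "keywords": re.compile(r"^KEYWORDS:\s*(.+)$", re.M),
        },
    ),
    Template(
        name="layout_b",
        signature=re.compile(r"^Article history:", re.M),
        fields={
            "doi": re.compile(r"\bdoi:\s*(\S+)", re.I),
            "keywords": re.compile(r"^Keywords:\s*(.+)$", re.M),
        },
    ),
]


def select_template(text):
    """Return the first template whose signature occurs in the text."""
    for template in TEMPLATES:
        if template.signature.search(text):
            return template
    return None


def extract(text):
    """Apply the matching template's field patterns; empty dict if none match."""
    template = select_template(text)
    if template is None:
        return {}
    result = {"template": template.name}
    for name, pattern in template.fields.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


# Invented sample text standing in for text already pulled out of a PDF.
sample = (
    "Article history:\n"
    "Received 1 January 2020\n"
    "Keywords: perovskite, solar cell\n"
    "doi: 10.1000/example.123\n"
)

print(extract(sample))
```

Separating layout detection from field extraction means a new layout can be supported by adding one more template entry rather than changing the extraction logic.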

Topics

Natural language processing, Chemistry, Physics, Data mining

Details

License: MIT
Cost: Free of charge
Tool Type: library
Operating Systems: Mac, Linux, Windows
Programming Languages: Python
Added: 7/6/2022
Last Updated: 11/24/2024

Publications

Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling. 2022;62(7):1633-1643. doi:10.1021/acs.jcim.1c01198. PMID:35349259. PMCID:PMC9049592.

Funding: Royal Academy of Engineering (RCSRF1819\7\10)

Documentation

User manual

https://pdfdataextractor.readthedocs.io/en/latest/index.html
