PDFDataExtractor

PDFDataExtractor extracts and interprets metadata from Portable Document Format (PDF) scientific literature to support chemical literature mining and metadata-driven analyses.


Key Features:

  • Enhanced Data Extraction: Extracts latent metadata from PDF files and integrates chemical named-entity recognition via ChemDataExtractor.
  • Template-Based Architecture: Reconstructs the logical structure of scientific articles and extracts semantic information across diverse layouts.
  • Comprehensive Metadata Extraction: Outputs detailed metadata in JSON and plain text formats, including paper title, authors, affiliations, email addresses, abstracts, keywords, journal names, publication years, DOIs, references, and issue numbers.
  • Evaluated Precision: Precision in extracting key metadata fields was assessed on an evaluation set of articles assembled by the authors.
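
The template-based extraction and the JSON and plain-text outputs listed above can be illustrated with a minimal sketch. It does not use the PDFDataExtractor API: the `Template` class, the regular expressions, the "Example Press" marker, and the sample text are all hypothetical, and only show the general idea of selecting layout-specific rules and serializing the result.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical layout template: a marker plus per-field regex rules."""
    name: str
    marker: str   # text whose presence identifies the layout
    rules: dict   # field name -> regex with exactly one capture group


# Assumed templates for illustration; real layouts need far richer rules.
TEMPLATES = [
    Template(
        name="example_press",
        marker="Example Press",
        rules={
            "doi": r"doi[:\s]+(10\.\d{4,9}/\S+)",
            "year": r"\b((?:19|20)\d{2})\b",
        },
    ),
    # Fallback: an empty marker is contained in every string.
    Template(name="generic", marker="", rules={"doi": r"(10\.\d{4,9}/\S+)"}),
]


def pick_template(text):
    """Return the first template whose marker occurs in the text."""
    for template in TEMPLATES:
        if template.marker in text:
            return template
    raise ValueError("no template matched")


def extract(text):
    """Apply the selected template's rules and collect the matched fields."""
    template = pick_template(text)
    record = {"template": template.name}
    for field, pattern in template.rules.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            record[field] = match.group(1)
    return record


def to_plain_text(record):
    """Render a record as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


# Made-up page text; the DOI is the one cited under Publications.
sample = "Example Press 2022 doi: 10.1021/acs.jcim.1c01198"
record = extract(sample)
as_json = json.dumps(record, indent=2)
as_text = to_plain_text(record)
```

Keeping the rules inside each template means a new layout can be supported by adding an entry to `TEMPLATES` without touching the extraction loop.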

Scientific Applications:

  • Chemical literature mining: Enables extraction and correlation of chemical entities and property data from PDFs by leveraging ChemDataExtractor.
  • Data-driven materials discovery: Provides structured metadata to support data science workflows and materials-discovery analyses driven by literature-derived data.
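
As a rough illustration of how structured metadata might feed a data science workflow, the sketch below aggregates a few JSON records with the Python standard library. The records and their field names (`title`, `year`, `keywords`) are placeholders invented for this example, not actual tool output.

```python
import json
from collections import Counter

# Placeholder records standing in for extracted metadata (assumed schema).
records_json = """
[
  {"title": "Paper A", "year": 2020, "keywords": ["perovskite", "solar cell"]},
  {"title": "Paper B", "year": 2021, "keywords": ["perovskite"]},
  {"title": "Paper C", "year": 2021, "keywords": ["battery"]}
]
"""

records = json.loads(records_json)

# Count papers per publication year.
papers_per_year = Counter(r["year"] for r in records)

# Count how often each keyword appears across all papers.
keyword_counts = Counter(k for r in records for k in r["keywords"])

print(papers_per_year.most_common())
print(keyword_counts.most_common(1))
```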

Methodology:

PDFDataExtractor uses a template-based architecture to reconstruct the logical structure of a document, applies ChemDataExtractor for chemical named-entity recognition, and outputs metadata in JSON and plain text. Its precision was assessed with a self-created evaluation article set.
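
The listing does not say how precision was computed. Assuming the standard definition (correct extractions divided by all extractions), a per-field evaluation could be sketched as follows. The document IDs and titles are hypothetical.

```python
def precision(extracted, gold):
    """Fraction of extracted values that equal the reference value
    for the same document (standard precision; an assumption here)."""
    if not extracted:
        return 0.0
    correct = sum(
        1 for doc_id, value in extracted.items() if gold.get(doc_id) == value
    )
    return correct / len(extracted)


# Hypothetical reference titles for a small evaluation set.
gold_titles = {
    "doc1": "Title One",
    "doc2": "Title Two",
    "doc3": "Title Three",
    "doc4": "Title Four",
}

# Hypothetical extractor output: three titles returned, two of them correct.
extracted_titles = {
    "doc1": "Title One",
    "doc2": "Title 2",
    "doc3": "Title Three",
}

title_precision = precision(extracted_titles, gold_titles)
print(f"title precision: {title_precision:.2f}")
```

Documents for which nothing was extracted (here `doc4`) do not lower precision; they would lower recall, which this sketch does not measure.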

Details

License: MIT
Cost: Free of charge
Tool Type: library
Operating Systems: Mac, Linux, Windows
Programming Languages: Python
Added: 7/6/2022
Last Updated: 11/24/2024

Publications

Zhu M, Cole JM. PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format. Journal of Chemical Information and Modeling. 2022;62(7):1633-1643. doi:10.1021/acs.jcim.1c01198. PMID:35349259. PMCID:PMC9049592.

Funding: Royal Academy of Engineering (RCSRF1819\7\10)
