scikit-activeml

scikit-activeml implements active learning algorithms to select informative samples for labeling and improve model performance in data-scarce bioinformatics tasks such as genomics and proteomics.


Key Features:

  • Foundation: Built on the SciPy and scikit-learn frameworks.
  • Integration with scikit-learn: Works with scikit-learn classifiers, such as logistic regression, so that active learning can be applied to existing models.
  • Handling Unlabeled Data: Represents unlabeled instances by a designated MISSING_LABEL value in the label vector y_true.
  • Query Strategies: Implements a range of query strategies to identify the most informative data points for labeling.
  • Customizable Active Learning Cycles: Supports iterative active learning cycles configurable with different classifiers.
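
As a rough illustration of what a query strategy computes, the sketch below scores unlabeled samples with three common uncertainty measures and picks the one to label next. It uses plain NumPy rather than scikit-activeml's own API; the function names, the example probabilities, and the use of np.nan as a stand-in for the MISSING_LABEL marker are all assumptions made for this example.

```python
import numpy as np

# Hypothetical helpers for illustration; these are not scikit-activeml functions.
def least_confidence(proba):
    # 1 minus the probability of the most likely class.
    return 1.0 - proba.max(axis=1)

def margin_uncertainty(proba):
    # 1 minus the gap between the two most likely classes.
    ordered = np.sort(proba, axis=1)
    return 1.0 - (ordered[:, -1] - ordered[:, -2])

def entropy_uncertainty(proba):
    # Shannon entropy of the predicted class distribution.
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Made-up class probabilities for three samples (rows) and two classes.
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.70, 0.30]])

# Label vector: sample 0 is labeled, the others carry the missing-label
# marker (np.nan is assumed here as the stand-in for MISSING_LABEL).
y = np.array([0.0, np.nan, np.nan])

# Only unlabeled samples are candidates for a query.
unlabeled = np.flatnonzero(np.isnan(y))
scores = entropy_uncertainty(proba[unlabeled])
query_idx = int(unlabeled[np.argmax(scores)])
```

With these example probabilities all three measures rank the second sample (index 1) as the most uncertain, so it would be queried first.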

Scientific Applications:

  • Genomics: Reduce labeling effort for genomic datasets where experimental validation is costly by prioritizing informative samples.
  • Proteomics: Prioritize labels in proteomic datasets to improve model accuracy with fewer labeled instances.
  • Large-scale biological data analysis: Improve model performance under limited labeling resources in other large-scale bioinformatics analyses.

Methodology:

Uses uncertainty sampling within iterative active learning cycles. In each cycle, a scikit-learn classifier (e.g., logistic regression) is fit on the labeled instances, the samples about which the model is least certain are selected, and their labels are queried. Unlabeled instances are marked with MISSING_LABEL in y_true.
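
The cycle described above can be sketched with NumPy and scikit-learn alone. This is a minimal illustration under stated assumptions, not scikit-activeml's own interface: the synthetic dataset, the least-confidence score, the number of cycles, and np.nan as the missing-label marker are all choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for a real dataset.
X, y_true = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Start with every label missing (np.nan assumed as the missing-label marker).
y = np.full(y_true.shape, np.nan)

# Seed the pool with one labeled sample per class so the classifier can be fit.
for c in np.unique(y_true):
    y[np.flatnonzero(y_true == c)[0]] = c

clf = LogisticRegression(max_iter=1000)
n_cycles = 20  # assumed labeling budget

for _ in range(n_cycles):
    labeled = ~np.isnan(y)
    clf.fit(X[labeled], y[labeled].astype(int))

    # Score unlabeled candidates by least confidence.
    candidates = np.flatnonzero(~labeled)
    proba = clf.predict_proba(X[candidates])
    uncertainty = 1.0 - proba.max(axis=1)

    # Query the most uncertain sample; the "oracle" here is y_true.
    query_idx = candidates[np.argmax(uncertainty)]
    y[query_idx] = y_true[query_idx]

# Final fit on everything labeled so far.
labeled = ~np.isnan(y)
clf.fit(X[labeled], y[labeled].astype(int))
n_labeled = int(labeled.sum())
accuracy = clf.score(X, y_true)
```

In a real application the oracle step would be an experimental measurement or expert annotation rather than a lookup in y_true, and the library's own query strategies and classifier wrappers would replace the hand-written scoring.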

Details

License: BSD-3-Clause
Tool Type: library
Programming Languages: Python
Added: 11/29/2021
Last Updated: 11/29/2021

Publications

Kottke D, Herde M, Pham Minh T, Benz A, Mergard P, Roghman A, Sandrock C, Sick B. scikit-activeml: A Library and Toolbox for Active Learning Algorithms. Preprints. 2021. doi:10.20944/preprints202103.0194.v1.
