ML / AI dataset candidates

Dataset candidate dashboard

Governance-focused listing: this page shows identifiers and review metadata only. Do not treat the aggregates shown here as validation of the underlying chemistry or of any confidential content.

Review before ML use

Dataset candidates reference reviewed records; approval workflows and leakage checks must complete before training or benchmarking.

Training

1. Training dataset candidates

Curated knowledge claims nominated for ML model training — identifiers, record type, review metadata, and curation status.

status filter

Nominate training candidate

Nominate a knowledge claim as a training candidate by specifying the record type, record ID, dataset type, source, citation IDs, and quality flags.

source_id (optional)

record_type

record_id

dataset_type

citation_ids_json (comma-separated integers)

quality_flags_json (comma-separated)
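As a rough illustration of the form above, the following Python sketch turns the comma-separated inputs into list values. Only the field names come from the form; the helper names parse_csv and build_training_nomination, the dict layout, and the example values are assumptions made for illustration, and the real request format may differ.

```python
def parse_csv(text):
    """Split a comma-separated string into trimmed, non-empty tokens."""
    return [tok.strip() for tok in text.split(",") if tok.strip()]


def build_training_nomination(record_type, record_id, dataset_type,
                              citation_ids="", quality_flags="",
                              source_id=None):
    """Hypothetical helper: assemble a nomination payload from form inputs."""
    payload = {
        "record_type": record_type,
        "record_id": record_id,
        "dataset_type": dataset_type,
        # The form asks for citation IDs as comma-separated integers.
        "citation_ids_json": [int(tok) for tok in parse_csv(citation_ids)],
        # Quality flags are kept as plain strings.
        "quality_flags_json": parse_csv(quality_flags),
    }
    if source_id is not None:  # source_id is optional on the form
        payload["source_id"] = source_id
    return payload


# Example values below are placeholders, not real records.
example = build_training_nomination(
    "knowledge_claim", 101, "training",
    citation_ids="12, 15,19", quality_flags="needs_review, low_confidence",
)
```

Empty inputs produce empty lists, so a nomination without citations or flags still carries both keys.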

Benchmark

2. Benchmark dataset candidates

Knowledge claims nominated for ML benchmark evaluation — includes leakage risk label and split recommendation. Citation IDs are not modeled on benchmark candidates and display as blank.

status filter

Nominate benchmark candidate

Nominate a knowledge claim as a benchmark evaluation candidate by specifying the record type, record ID, benchmark type, and leakage risk classification.

source_id (optional)

record_type

record_id

benchmark_type

quality_flags_json (comma-separated)

Versions

3. Dataset versions

Versioned snapshots of training and benchmark splits — each version locks candidate IDs into train, validation, test, and holdout partitions for reproducible model training.

status filter

POST dataset version

Populate split_json with keys train, validation, test, holdout (comma-separated candidate IDs per field). source_record_ids_json is the deduplicated union of split IDs.

name

dataset_type

version

split_json · train (comma-separated candidate IDs)

split_json · validation (comma-separated candidate IDs)

split_json · test (comma-separated candidate IDs)

split_json · holdout (comma-separated candidate IDs)
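The relationship between split_json and source_record_ids_json can be sketched in Python as follows. This assumes candidate IDs are integers and that the payload is a plain dict; the helper names build_version_payload and overlapping_ids are hypothetical. The overlap check is one simple way to spot an ID placed in more than one partition and is not necessarily how the dashboard derives its leakage warnings.

```python
PARTITIONS = ("train", "validation", "test", "holdout")


def parse_ids(text):
    """Parse comma-separated candidate IDs (assumed to be integers)."""
    return [int(tok) for tok in text.split(",") if tok.strip()]


def build_version_payload(name, dataset_type, version, **fields):
    """Hypothetical helper: build split_json and the deduplicated union."""
    split = {part: parse_ids(fields.get(part, "")) for part in PARTITIONS}
    # Deduplicated union of all split IDs, keeping first-seen order.
    union = list(dict.fromkeys(cid for ids in split.values() for cid in ids))
    return {
        "name": name,
        "dataset_type": dataset_type,
        "version": version,
        "split_json": split,
        "source_record_ids_json": union,
    }


def overlapping_ids(split):
    """Return {candidate_id: [partitions]} for IDs in more than one partition."""
    where = {}
    for part, ids in split.items():
        for cid in set(ids):
            where.setdefault(cid, []).append(part)
    return {cid: parts for cid, parts in where.items() if len(parts) > 1}
```

An ID that appears in both train and test would show up in the result of overlapping_ids, which makes it easy to review a version before it is posted.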

Leakage

4. Leakage risk warnings

Aggregated from benchmark leakage_risk_label and dataset version leakage_warnings_json (summaries only).

Benchmark rows are counted per leakage_risk_label value: low, medium, high, unknown.

No leakage_warnings_json entries on loaded dataset versions.
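The aggregation described in this section might look like the following Python sketch. It assumes benchmark rows and dataset versions are available as dicts, that leakage_warnings_json holds a list, and that a missing or unrecognized label is counted as unknown; none of these details are confirmed by the page, and summarize_leakage is a hypothetical name.

```python
from collections import Counter

LABELS = ("low", "medium", "high", "unknown")


def summarize_leakage(benchmark_rows, dataset_versions):
    """Count benchmark rows per label and collect version-level warnings."""
    counts = Counter()
    for row in benchmark_rows:
        label = row.get("leakage_risk_label")
        # Assumption: anything outside the four known labels is "unknown".
        counts[label if label in LABELS else "unknown"] += 1
    warnings = [
        warning
        for version in dataset_versions
        for warning in (version.get("leakage_warnings_json") or [])
    ]
    # Always report all four labels, including those with zero rows.
    return {label: counts[label] for label in LABELS}, warnings
```

When no loaded dataset version carries warnings, the returned list is empty, which corresponds to the empty-state message above.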

Quality

5. Quality flags

Counts from training and benchmark quality_flags_json (flag strings only).

No quality flags on loaded candidates.
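A minimal Python sketch of the flag counting, assuming each candidate is a dict whose quality_flags_json value is a list of flag strings (or missing). The function name count_quality_flags is hypothetical.

```python
from collections import Counter


def count_quality_flags(training_rows, benchmark_rows):
    """Count flag strings across training and benchmark candidates."""
    counts = Counter()
    for row in list(training_rows) + list(benchmark_rows):
        # Assumption: quality_flags_json is a list of strings, possibly absent.
        counts.update(row.get("quality_flags_json") or [])
    return counts
```

An empty Counter means no flags were found on the loaded candidates.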
