About this prototype
This is a research prototype showing what a
faceted catalogue can look like for an evidence-extracted corpus.
Layers
- Aspects are broad themes that cluster related
topics together. Currently 12 aspects, derived from
hierarchical clustering of topic embeddings.
- Topics are clusters of related phrases that
appear across documents. Currently 44 topics. Some sit
inside a parent topic when most of their documents are also
covered by the parent.
- Sub-groups live inside each topic. Click the
small ▸ next to a topic to see how the system splits its
phrases into finer groups (e.g. fresh meat splits
into meat / fishery products / feedingstuffs / sauces /…).
Each sub-group carries a confidence label
(high / medium / low / no_signal); the system
may also decide a topic is indivisible and show no caret.
- Documents are the original source records.
Each linked document shows its key phrases highlighted once
phrase positions finish loading.
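
The nesting rule mentioned for topics above (a topic sits inside a parent when most of its documents are also covered by the parent) can be illustrated with a small sketch. This is not the prototype's actual code: the function names and the `min_coverage` threshold are hypothetical, and the real cut-off for "most" is not stated here.

```python
def coverage(child_docs, parent_docs):
    """Fraction of the child's documents that the parent also covers."""
    child = set(child_docs)
    if not child:
        return 0.0
    return len(child & set(parent_docs)) / len(child)


def is_nested(child_docs, parent_docs, min_coverage=0.8):
    # min_coverage is an assumed value standing in for "most of their documents".
    return coverage(child_docs, parent_docs) >= min_coverage


# Hypothetical document ids, for illustration only.
child = {"d1", "d2", "d3", "d4"}
parent = {"d1", "d2", "d3", "d9"}
print(coverage(child, parent))  # 0.75
```

With these made-up ids the child shares three of its four documents with the parent, so it would be nested at a threshold of 0.7 but not at the assumed default of 0.8.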
How it was built
- For each document, an LLM extracts evidence
phrases and tags each as a subject (e.g. "fresh
meat") or an attribute (e.g. "valid until 30 June").
- Phrases are deduplicated and embedded into a 384-dim
sentence-transformer space.
- K (the number of clusters) is chosen from the data: three
independent mathematical criteria (silhouette, eigengap, BIC)
each suggest a value, and the median of the three is used.
- KMeans clusters the unique phrases into rough groups, then
LLM rebasketing and merge stages build the final topics.
- Topics are then grouped into dashboard aspects from bottom-up
node labels and document evidence.
- Inside each aspect, parent → child edges are added when a
child's documents are mostly covered by a candidate
parent's documents (and the parent isn't too broad).
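
The K-selection step in the list above combines three estimates by taking their median. A minimal sketch of that combination step only, assuming each criterion has already produced its own candidate K; the function name `choose_k` and the example numbers are hypothetical, and computing the criteria themselves is out of scope here.

```python
from statistics import median_low


def choose_k(estimates):
    """Return the median of the per-criterion estimates of K.

    `estimates` maps a criterion name to the K it suggests. With three
    criteria the median is the middle value, so a single outlying
    estimate cannot pull the result.
    """
    if not estimates:
        raise ValueError("need at least one estimate of K")
    # median_low always returns one of the supplied values, so K stays an integer.
    return median_low(estimates.values())


# Hypothetical estimates, for illustration only.
example = {"silhouette": 6, "eigengap": 9, "bic": 14}
print(choose_k(example))  # 9
```

Taking the median rather than the mean means two criteria that roughly agree outvote a third that disagrees strongly.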
What is honest about the labels
Topic and aspect names are generated from the discovered evidence
groups. Use the related-phrases list to see what each topic really
contains.
What this is NOT
- It is not a confirmed taxonomy. It is candidate evidence
for human review.
- It does not predict external reference labels unless a
downstream evaluation stage is explicitly added.
- Some aspect groups and tree edges are still rough. Treat the
tree as a starting point, not a finished hierarchy.