
Semantic Decomposition of Medical Data

Most health apps store medical information as files: a PDF here, a photo there, a discharge summary in a folder. Search means scrolling. Trends mean manual comparison. Health Vault took a different path — semantic decomposition — breaking every document into atomic medical facts that exist independently of their source.

The Document-Centric Trap

Traditional personal health records treat documents as the primary entity. You upload a lab report from March 2023 and another from September 2024. To compare glucose levels, you open both files and read the numbers yourself.

This model breaks down quickly:

  • Different labs use different names for the same test ("Glucose" vs "GLU" vs "Blood sugar").
  • Reference ranges vary between laboratories.
  • Clinical findings buried in free text are invisible to search and analytics.
  • Longitudinal analysis requires manual data entry — which nobody does consistently.

Factor-Centric Architecture

Health Vault inverts the hierarchy. The primary entities are medical factors — biomarkers, diagnoses, prescriptions, clinical observations — each with:

  • a standardized code (LOINC for lab tests, SNOMED CT for clinical concepts);
  • a value and unit;
  • a timestamp;
  • a link to the source document (for provenance, not for storage).
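As a rough illustration, a factor with these four attributes could be modeled as a small immutable record. This is a minimal sketch, not Health Vault's actual schema: the class name, field names, document id, and the sample value are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Factor:
    """One atomic medical fact, independent of the file it came from."""
    code_system: str          # "LOINC" or "SNOMED CT"
    code: str                 # standardized code, e.g. "2345-7"
    display_name: str         # what the user sees, e.g. "Glucose"
    value: Optional[float]    # None for findings that carry no number
    unit: Optional[str]       # e.g. "mmol/L"
    observed_at: datetime     # when the measurement or observation was made
    source_document_id: str   # provenance link only; the file lives elsewhere


# Illustrative record; the value and document id are invented.
glucose = Factor(
    code_system="LOINC",
    code="2345-7",
    display_name="Glucose",
    value=5.4,
    unit="mmol/L",
    observed_at=datetime(2023, 3, 14),
    source_document_id="doc-001",
)
```

Keeping only an identifier for the source document, rather than the document itself, reflects the "provenance, not storage" point above.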

When you upload a PDF, the pipeline runs:

  1. OCR — extract text from PDF, image, or scan.
  2. NLP parsing — identify test names, values, units, dates, and narrative findings.
  3. Normalization — map local names to LOINC/SNOMED codes.
  4. Validation — check plausibility (value ranges, unit consistency).
  5. Storage — write factors to the user's longitudinal profile.

The document remains accessible, but analytics operate on factors, not files.
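Steps 2 through 5 can be sketched on text that OCR has already produced. Everything here is a simplified assumption for illustration: the regex stands in for real NLP parsing, the lookup table and plausibility limits are hypothetical, and the profile is just a list of dictionaries.

```python
import re
from datetime import date

# Hypothetical local-name table; a real one would be far larger.
LOCAL_NAME_TO_LOINC = {"glucose": "2345-7", "glu": "2345-7", "blood sugar": "2345-7"}

# Illustrative plausibility limits per (code, unit); NOT clinical reference ranges.
PLAUSIBLE = {("2345-7", "mg/dL"): (10.0, 1500.0)}

# Matches lines shaped like "<test name> <number> <unit>".
LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z ]+?)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>\S+)\s*$"
)


def parse_line(line):
    """Step 2: pull a test name, value, and unit out of one line, if present."""
    m = LINE.match(line)
    if m is None:
        return None
    return m["name"], float(m["value"]), m["unit"]


def normalize(name):
    """Step 3: map a local test name to a LOINC code, or None if unknown."""
    return LOCAL_NAME_TO_LOINC.get(name.strip().lower())


def is_plausible(code, value, unit):
    """Step 4: accept only known code/unit pairs with a value inside the limits."""
    limits = PLAUSIBLE.get((code, unit))
    return limits is not None and limits[0] <= value <= limits[1]


def process(ocr_text, observed_on, document_id, profile):
    """Run steps 2-5 over OCR output and append accepted factors to the profile."""
    for line in ocr_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        name, value, unit = parsed
        code = normalize(name)
        if code is None or not is_plausible(code, value, unit):
            continue  # a real system would flag these for review, not drop them
        # Step 5: store the factor with a provenance link to its document.
        profile.append({"code": code, "value": value, "unit": unit,
                        "observed_on": observed_on, "source": document_id})


profile = []
process("Lab report\nGLU 95 mg/dL\nGlucose 9500 mg/dL", date(2024, 9, 2), "doc-002", profile)
```

In this run only the `GLU 95 mg/dL` line reaches the profile: the header line does not parse, and the 9500 reading fails the plausibility check.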

Why Standard Codes Matter

Without standardization, "glucose" from Lab A and "GLU" from Lab B are two unrelated strings. Once both are mapped to LOINC code 2345-7 (glucose, mass per volume in serum or plasma), they are the same factor, and Health Vault plots them on one chart regardless of source format.
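The effect of a shared code can be shown by merging results from two labs into one series. This is a sketch under assumptions: the name table is hypothetical, the values are invented, and the dates simply reuse the March 2023 and September 2024 reports mentioned earlier.

```python
from collections import defaultdict
from datetime import date

# Hypothetical mapping from lab-specific names to LOINC codes.
# A real mapper would also consider specimen type, since LOINC assigns
# different codes to, for example, whole-blood and serum measurements.
NAME_TO_LOINC = {"glucose": "2345-7", "glu": "2345-7", "blood sugar": "2345-7"}

# (lab, name as printed, date, value, unit); values are invented.
raw_results = [
    ("Lab B", "GLU", date(2024, 9, 2), 5.6, "mmol/L"),
    ("Lab A", "Glucose", date(2023, 3, 14), 5.1, "mmol/L"),
]


def build_series(results):
    """Group results by LOINC code and order each group chronologically."""
    series = defaultdict(list)
    for lab, name, day, value, unit in results:
        code = NAME_TO_LOINC.get(name.lower())
        if code is None:
            continue  # unmapped names stay out of the chart
        series[code].append((day, value, unit, lab))
    for points in series.values():
        points.sort(key=lambda p: p[0])
    return dict(series)


series = build_series(raw_results)
```

Both results land under the single key `"2345-7"`, ordered by date, which is what a one-line trend chart needs.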

This approach mirrors the modeling used in enterprise healthcare data warehouses (Data Vault, OMOP CDM), adapted to consumer use. A patient does not need to understand LOINC; they see "Glucose" with a unified history.

Extracting Clinical Findings

Biomarkers are the easy part: structured tables with numbers. Clinical narrative is harder: "focal changes in the thyroid gland," "signs of fatty hepatosis," "recommend follow-up in 6 months."

Health Vault uses NLP to extract these as SNOMED-coded findings. They appear in the health profile alongside lab values, contributing to the health index and AI reports. A user with rising ALT and a finding of "hepatic steatosis" from an ultrasound report gets a coherent picture — not two disconnected documents.
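A profile view that puts a lab trend next to extracted findings might look like the sketch below. It is only an assumption about how such a view could be assembled: the ALT values are invented, the SNOMED CT code is left as a placeholder rather than guessed, and "rising" is reduced to "each value is higher than the previous one".

```python
from datetime import date

# (LOINC code, date, value, unit); 1742-6 is ALT in serum or plasma. Values invented.
lab_values = [
    ("1742-6", date(2023, 3, 14), 32.0, "U/L"),
    ("1742-6", date(2024, 9, 2), 61.0, "U/L"),
    ("1742-6", date(2023, 11, 20), 48.0, "U/L"),
]

# (code, date, extracted text); the code is a placeholder, not a real SNOMED CT id.
findings = [
    ("SNOMED:<placeholder>", date(2024, 9, 5), "hepatic steatosis"),
]


def is_rising(points):
    """True if the values increase strictly when ordered by date."""
    values = [value for _, value in sorted(points)]
    return len(values) >= 2 and all(a < b for a, b in zip(values, values[1:]))


def profile_view(code, labs, extracted_findings):
    """Combine one biomarker's trend with the findings in the same profile."""
    points = [(day, value) for c, day, value, _ in labs if c == code]
    return {
        "code": code,
        "rising": is_rising(points),
        "findings": [text for _, _, text in extracted_findings],
    }


view = profile_view("1742-6", lab_values, findings)
```

The point of the sketch is that both kinds of fact sit in one structure, so a report can mention the trend and the ultrasound finding together.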

Benefits for Users and Clinicians

For users:

  • automatic trend charts without manual entry;
  • health index and biological age from accumulated data;
  • AI reports that reference both labs and clinical findings.

For clinicians:

  • a structured summary before appointments;
  • secure sharing via time-limited links;
  • up to 25% time saved on documentation review.

Challenges and Trade-offs

Semantic decomposition is not perfect:

  • OCR quality depends on photo clarity and document layout.
  • Unusual lab formats may require model updates.
  • Free-text findings have lower extraction confidence than structured tables.
  • Standard code mapping fails for truly novel or local test names.

We address this with confidence scores, human-review flags for low-confidence extractions, and continuous model improvement from anonymized patterns.
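Routing by confidence score can be sketched as a simple threshold. The threshold value, the `Extraction` fields, and the sample scores are hypothetical; the source does not say how scores are computed or which cutoff is used.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_THRESHOLD = 0.85  # hypothetical cutoff, chosen only for the example


@dataclass
class Extraction:
    text: str               # what was read from the document
    code: Optional[str]     # standardized code, or None if mapping failed
    confidence: float       # extraction confidence between 0 and 1


def route(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted items and items flagged for review."""
    accepted, needs_review = [], []
    for item in extractions:
        if item.code is not None and item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)  # low confidence or no standard code
    return accepted, needs_review


accepted, needs_review = route([
    Extraction("Glucose 5.4 mmol/L", "2345-7", 0.97),   # clean table row
    Extraction("signs of fatty hepatosis", None, 0.90),  # unmapped finding
    Extraction("GLU 5,4", "2345-7", 0.60),               # blurry photo
])
```

Here an unmapped item is sent to review even when its confidence is high, so the mapping failures listed above never enter the profile silently.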

Conclusion

Moving from documents to factors transforms medical data from an archive into an analytical asset. Health Vault's factor-centric architecture is the foundation for biomarker tracking, biological age, digital health twins, and AI-powered health reports — all from the same uploaded PDFs and photos.


Originally published on Habr

Vert Neo Limited, developer of Health Vault