Semantic Decomposition of Medical Data

Most health apps store medical information as files: a PDF here, a photo there, a discharge summary in a folder. Search means scrolling. Trends mean manual comparison. Health Vault takes a different path, semantic decomposition: it breaks every document into atomic medical facts that exist independently of their source.

The Document-Centric Trap

Traditional personal health records treat documents as the primary entity. You upload a lab report from March 2023 and another from September 2024. To compare glucose levels, you open both files and read the numbers yourself.

This model breaks down quickly:

Different labs use different names for the same test ("Glucose" vs "GLU" vs "Blood sugar").
Reference ranges vary between laboratories.
Clinical findings buried in free text are invisible to search and analytics.
Longitudinal analysis requires manual data entry — which nobody does consistently.

Factor-Centric Architecture

Health Vault inverts the hierarchy. The primary entities are medical factors — biomarkers, diagnoses, prescriptions, clinical observations — each with:

a standardized code (LOINC for lab tests, SNOMED CT for clinical concepts);
a value and unit;
a timestamp;
a link to the source document (for provenance, not for storage).
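
As a rough illustration, the four attributes above could be modeled as a small immutable record. This is a minimal sketch, not Health Vault's actual schema: the class name `Factor`, its field names, the sample value, the day of the month, and the document identifier are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Factor:
    """One atomic medical fact, independent of the file it came from."""

    code_system: str         # "LOINC" or "SNOMED CT"
    code: str                # standardized code, e.g. "2345-7"
    value: Optional[float]   # None for findings that carry no number
    unit: Optional[str]      # e.g. "mg/dL"
    observed_at: datetime    # when the measurement or observation was made
    source_document_id: str  # provenance link only; the file is stored elsewhere


# Illustrative glucose result; the value, date, and document id are invented.
glucose = Factor(
    code_system="LOINC",
    code="2345-7",
    value=97.0,
    unit="mg/dL",
    observed_at=datetime(2023, 3, 14),
    source_document_id="doc-001",
)
```

Making the record frozen reflects the idea that a factor is a fact about a point in time: a corrected extraction would produce a new factor rather than mutate an old one.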

When you upload a document (a PDF, photo, or scan), the pipeline runs five stages in order:

OCR — extract text from PDF, image, or scan.
NLP parsing — identify test names, values, units, dates, and narrative findings.
Normalization — map local names to LOINC/SNOMED codes.
Validation — check plausibility (value ranges, unit consistency).
Storage — write factors to the user's longitudinal profile.

The document remains accessible, but analytics operate on factors, not files.
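
The parsing and validation stages can be sketched in a few lines, assuming OCR has already produced plain text. Everything here is hypothetical: the "name value unit" row format, the function names, and the plausibility window, which is a sanity check for extraction errors and not a clinical reference range. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol).

```python
import re
from typing import Optional

# Hypothetical row format "name value unit", e.g. "Glucose 5.4 mmol/L".
ROW = re.compile(
    r"^(?P<name>[A-Za-z ]+?)\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>\S+)$"
)

# 1 mmol/L of glucose is about 18.016 mg/dL.
GLUCOSE_MG_DL_PER_MMOL_L = 18.016

# Illustrative plausibility window in mg/dL, not a clinical reference range.
PLAUSIBLE_GLUCOSE_MG_DL = (10.0, 1000.0)


def parse_row(line: str) -> Optional[dict]:
    """Split one OCR text row into name, value, and unit, or return None."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return {
        "name": match["name"].strip(),
        "value": float(match["value"]),
        "unit": match["unit"],
    }


def glucose_to_mg_dl(value: float, unit: str) -> float:
    """Convert a glucose value to mg/dL so results share one unit."""
    if unit == "mg/dL":
        return value
    if unit == "mmol/L":
        return value * GLUCOSE_MG_DL_PER_MMOL_L
    raise ValueError(f"unsupported glucose unit: {unit}")


def is_plausible_glucose(value: float, unit: str) -> bool:
    """Reject values that are more likely OCR errors than measurements."""
    low, high = PLAUSIBLE_GLUCOSE_MG_DL
    try:
        return low <= glucose_to_mg_dl(value, unit) <= high
    except ValueError:
        return False


row = parse_row("Glucose 5.4 mmol/L")
```

A row such as "Glucose 5400 mg/dL" would parse successfully but fail the plausibility check, which is the kind of case a dropped decimal point in OCR tends to produce.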

Why Standard Codes Matter

Without standardization, "glucose" from Lab A and "GLU" from Lab B are two unrelated strings. Mapped to LOINC code 2345-7 (glucose in serum or plasma, mass per volume), they are the same factor, and Health Vault plots them on one chart regardless of source format.
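
A minimal sketch of that merge, assuming a hand-written alias table. The table, the function names, and the sample values are hypothetical, and mapping every alias straight to 2345-7 is a simplification: a real mapper would also have to check specimen type and unit before choosing a code.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table; a production table would be far larger and curated.
LOCAL_NAME_TO_LOINC = {
    "glucose": "2345-7",
    "glu": "2345-7",
    "blood sugar": "2345-7",
}


def normalize(name: str) -> Optional[str]:
    """Map a lab's local test name to a LOINC code, or None if unknown."""
    return LOCAL_NAME_TO_LOINC.get(name.strip().lower())


def build_series(
    rows: List[Tuple[str, date, float]]
) -> Dict[str, List[Tuple[date, float]]]:
    """Group (name, date, value) rows by code and sort each series by date."""
    series = defaultdict(list)
    for name, observed_on, value in rows:
        code = normalize(name)
        if code is None:
            continue  # unmapped names are left out of the chart
        series[code].append((observed_on, value))
    for points in series.values():
        points.sort()
    return dict(series)


# Two reports from different labs; the values are invented for illustration.
rows = [
    ("GLU", date(2024, 9, 2), 104.0),
    ("Glucose", date(2023, 3, 14), 97.0),
]
chart_data = build_series(rows)
```

Both rows end up in a single list keyed by "2345-7", ordered by date, which is the shape a trend chart needs.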

This approach mirrors modeling practices from enterprise healthcare data warehouses (Data Vault, OMOP CDM) but targets consumer use. A patient does not need to understand LOINC; they see "Glucose" with a unified history.

Extracting Clinical Findings

Biomarkers are the easy part: they arrive in structured tables with numbers. Clinical narrative is harder, with phrases such as "focal changes in the thyroid gland," "signs of fatty hepatosis," or "recommend follow-up in 6 months."

Health Vault uses NLP to extract these as SNOMED-coded findings. They appear in the health profile alongside lab values, contributing to the health index and AI reports. A user with rising ALT and a finding of "hepatic steatosis" from an ultrasound report gets a coherent picture — not two disconnected documents.
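
To make the idea concrete, here is a deliberately naive stand-in for the NLP step: plain phrase matching against a small dictionary. It is not the extraction method Health Vault uses, and the concept identifiers are placeholders, not real SNOMED CT codes. The point it illustrates is that two different wordings ("fatty hepatosis" and "hepatic steatosis") resolve to one concept.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    concept: str       # placeholder concept identifier
    matched_text: str  # phrase that triggered the match


# Placeholder identifiers, NOT real SNOMED CT codes.
PHRASE_TO_CONCEPT = {
    "fatty hepatosis": "CONCEPT:hepatic-steatosis",
    "hepatic steatosis": "CONCEPT:hepatic-steatosis",
    "focal changes in the thyroid gland": "CONCEPT:thyroid-focal-change",
}


def extract_findings(text: str) -> List[Finding]:
    """Return every known phrase found in the text, case-insensitively."""
    lowered = text.lower()
    return [
        Finding(concept, phrase)
        for phrase, concept in PHRASE_TO_CONCEPT.items()
        if phrase in lowered
    ]


report = "Ultrasound: signs of fatty hepatosis. Recommend follow-up in 6 months."
findings = extract_findings(report)
```

Phrase matching ignores negation ("no signs of fatty hepatosis") and context, which is one reason narrative extraction needs a real NLP model and carries lower confidence than table parsing.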

Benefits for Users and Clinicians

For users:

automatic trend charts without manual entry;
health index and biological age from accumulated data;
AI reports that reference both labs and clinical findings.

For clinicians:

a structured summary before appointments;
secure sharing via time-limited links;
up to 25% time saved on documentation review.

Challenges and Trade-offs

Semantic decomposition is not perfect:

OCR quality depends on photo clarity and document layout.
Unusual lab formats may require model updates.
Free-text findings have lower extraction confidence than structured tables.
Standard code mapping fails for truly novel or local test names.

We address these limitations with confidence scores, human-review flags for low-confidence extractions, and continuous model improvement from anonymized patterns.
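
One way the review flag could work is a simple threshold on the confidence score. The sketch below is an assumption about the mechanism, not a description of the production system; the threshold value, class name, and sample scores are all invented.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cut-off; a real value would be tuned per extraction type.
REVIEW_THRESHOLD = 0.80


@dataclass
class Extraction:
    label: str               # what was extracted, e.g. a test name or finding
    confidence: float        # model confidence in the range 0.0 to 1.0
    needs_review: bool = False


def flag_for_review(
    extractions: List[Extraction], threshold: float = REVIEW_THRESHOLD
) -> List[Extraction]:
    """Mark low-confidence extractions and return only the flagged ones."""
    for extraction in extractions:
        extraction.needs_review = extraction.confidence < threshold
    return [e for e in extractions if e.needs_review]


batch = [
    Extraction("glucose from a structured table", 0.97),
    Extraction("finding from free text", 0.62),
]
review_queue = flag_for_review(batch)
```

Here the table-derived value passes straight through, while the free-text finding lands in the review queue, matching the observation that narrative extractions are less certain.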

Conclusion

Moving from documents to factors transforms medical data from an archive into an analytical asset. Health Vault's factor-centric architecture is the foundation for biomarker tracking, biological age, digital health twins, and AI-powered health reports — all from the same uploaded PDFs and photos.

Originally published on Habr
