Semantic Decomposition of Medical Data
Most health apps store medical information as files: a PDF here, a photo there, a discharge summary in a folder. Search means scrolling. Trends mean manual comparison. Health Vault took a different path — semantic decomposition — breaking every document into atomic medical facts that exist independently of their source.
The Document-Centric Trap
Traditional personal health records treat documents as the primary entity. You upload a lab report from March 2023 and another from September 2024. To compare glucose levels, you open both files and read the numbers yourself.
This model breaks down quickly:
- Different labs use different names for the same test ("Glucose" vs "GLU" vs "Blood sugar").
- Reference ranges vary between laboratories.
- Clinical findings buried in free text are invisible to search and analytics.
- Longitudinal analysis requires manual data entry — which nobody does consistently.
Factor-Centric Architecture
Health Vault inverts the hierarchy. The primary entities are medical factors — biomarkers, diagnoses, prescriptions, clinical observations — each with:
- a standardized code (LOINC for lab tests, SNOMED CT for clinical concepts);
- a value and unit;
- a timestamp;
- a link to the source document (for provenance, not for storage).
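To make the shape of a factor concrete, here is a minimal sketch of how such a record could be modeled. The class name, field names, and the sample values are illustrative assumptions, not Health Vault's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Factor:
    """One atomic medical fact, independent of the file it came from."""
    code_system: str            # "LOINC" for lab tests, "SNOMED CT" for clinical concepts
    code: str                   # standardized code, e.g. "2345-7"
    display_name: str           # label shown to the user
    value: Optional[float]      # numeric result; None for a purely coded finding
    unit: Optional[str]
    observed_at: datetime
    source_document_id: str     # provenance link only; the factor holds no file content


# Illustrative record (hypothetical values and document id).
glucose = Factor(
    code_system="LOINC",
    code="2345-7",
    display_name="Glucose",
    value=95.0,
    unit="mg/dL",
    observed_at=datetime(2023, 3, 14, tzinfo=timezone.utc),
    source_document_id="doc-001",
)
```

Making the record immutable (`frozen=True`) is a design choice in this sketch: a factor is a statement about a past measurement, so a correction would be written as a new factor rather than as an in-place edit.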
When you upload a document, the pipeline runs these steps in order:
- OCR — extract text from PDF, image, or scan.
- NLP parsing — identify test names, values, units, dates, and narrative findings.
- Normalization — map local names to LOINC/SNOMED codes.
- Validation — check plausibility (value ranges, unit consistency).
- Storage — write factors to the user's longitudinal profile.
The document remains accessible, but analytics operate on factors, not files.
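The five stages above can be sketched as a chain of small functions. This is a toy version under stated assumptions: the OCR stage is a stand-in that just decodes text, the parser only recognizes lines shaped like "name number unit", the lookup tables are hypothetical and tiny, and timestamps are left out for brevity. None of the function names come from Health Vault.

```python
import re
from typing import Dict, List

# Hypothetical lookup tables; a real system would hold far more entries.
LOINC_BY_LOCAL_NAME = {"glucose": "2345-7", "glu": "2345-7"}
EXPECTED_UNIT = {"2345-7": "mg/dL"}  # 2345-7 is a mass-per-volume measurement

LINE = re.compile(r"([A-Za-z ]+?)\s+(\d+(?:\.\d+)?)\s+(\S+)")


def ocr(raw: bytes) -> str:
    """Stand-in for a real OCR engine: the input is assumed to be plain text."""
    return raw.decode("utf-8")


def parse(text: str) -> List[dict]:
    """Pick out lines shaped like '<name> <number> <unit>'."""
    rows = []
    for line in text.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            name, value, unit = match.groups()
            rows.append({"name": name, "value": float(value), "unit": unit})
    return rows


def normalize(row: dict) -> dict:
    """Attach a standard code, or None when the local name is unknown."""
    return {**row, "code": LOINC_BY_LOCAL_NAME.get(row["name"].lower())}


def validate(row: dict) -> dict:
    """Mark a row valid when it is coded, positive, and in the expected unit."""
    ok = (
        row["code"] is not None
        and row["value"] > 0
        and row["unit"] == EXPECTED_UNIT.get(row["code"])
    )
    return {**row, "valid": ok}


def run_pipeline(raw: bytes, document_id: str, profile: Dict[str, list]) -> List[dict]:
    """Run all stages; store valid rows in the profile and return the rejected ones."""
    rejected = []
    for row in (validate(normalize(r)) for r in parse(ocr(raw))):
        if row["valid"]:
            profile.setdefault(row["code"], []).append(
                {"value": row["value"], "unit": row["unit"], "source": document_id}
            )
        else:
            rejected.append(row)
    return rejected


profile: Dict[str, list] = {}
rejected = run_pipeline(
    b"GLU 95 mg/dL\nGlucose 5.4 mmol/L\nMystery 3 units", "doc-001", profile
)
```

In this run only the first line reaches the profile. The second is rejected because its unit does not match the one expected for the code, and the third because its name has no mapping. The stored entries keep only a document id, which matches the idea that the file is provenance rather than storage.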
Why Standard Codes Matter
Without standardization, "glucose" from Lab A and "GLU" from Lab B are two unrelated strings. Mapped to LOINC code 2345-7 (glucose, mass per volume in serum or plasma), they are the same factor, and Health Vault plots them on one chart regardless of source format.
This approach mirrors the modeling used in enterprise healthcare data warehouses (Data Vault, OMOP CDM), applied at consumer scale. A patient does not need to understand LOINC; they simply see "Glucose" with a unified history.
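As a rough illustration of that unified history, the sketch below merges results from two labs into one time series keyed by LOINC code. The alias table, the record layout, and the values (assumed to be in mg/dL) are made up for the example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table: local lab names mapped to one standard code.
ALIASES = {"glucose": "2345-7", "glu": "2345-7"}


def to_code(local_name: str) -> Optional[str]:
    """Look up a local test name, ignoring case and surrounding whitespace."""
    return ALIASES.get(local_name.strip().lower())


def build_series(results: List[dict]) -> Dict[str, List[Tuple[date, float]]]:
    """Group results by standard code and order each group chronologically."""
    series = defaultdict(list)
    for result in results:
        code = to_code(result["name"])
        if code is not None:
            series[code].append((result["date"], result["value"]))
    for points in series.values():
        points.sort()
    return dict(series)


# Two reports from different labs, deliberately listed out of order.
results = [
    {"lab": "Lab B", "name": "GLU", "value": 101.0, "date": date(2024, 9, 1)},
    {"lab": "Lab A", "name": "Glucose", "value": 95.0, "date": date(2023, 3, 1)},
]
series = build_series(results)
```

Both results end up under the single key "2345-7", sorted by date, which is the list a chart would draw from. The lab name plays no part in the grouping.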
Extracting Clinical Findings
Biomarkers are the easy part, since they arrive as structured tables with numbers. Clinical narrative is harder: "focal changes in the thyroid gland," "signs of fatty hepatosis," "recommend follow-up in 6 months."
Health Vault uses NLP to extract these as SNOMED-coded findings. They appear in the health profile alongside lab values, contributing to the health index and AI reports. A user with rising ALT and a finding of "hepatic steatosis" from an ultrasound report gets a coherent picture — not two disconnected documents.
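Once both kinds of fact sit in one profile, linking them is a simple query. The sketch below is an assumption about how that could look, not the product's logic. It treats "rising" as every ALT value being higher than the previous one, uses LOINC 1742-6 for ALT, and uses 197321007 as the SNOMED CT concept for steatosis of liver (worth verifying in a SNOMED CT browser before relying on it). The ALT values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ALT_LOINC = "1742-6"            # alanine aminotransferase in serum or plasma
STEATOSIS_SNOMED = "197321007"  # assumed concept id for steatosis of liver


@dataclass
class Fact:
    system: str                  # "LOINC" or "SNOMED CT"
    code: str
    observed_on: date
    value: Optional[float] = None


def is_rising(values: List[float]) -> bool:
    """True when there are at least two values and each exceeds the previous."""
    return len(values) >= 2 and all(a < b for a, b in zip(values, values[1:]))


def alt_rising_with_steatosis(profile: List[Fact]) -> bool:
    """Check for a rising ALT series together with a coded steatosis finding."""
    alt = sorted(
        (f for f in profile
         if f.system == "LOINC" and f.code == ALT_LOINC and f.value is not None),
        key=lambda f: f.observed_on,
    )
    has_finding = any(
        f.system == "SNOMED CT" and f.code == STEATOSIS_SNOMED for f in profile
    )
    return has_finding and is_rising([f.value for f in alt])


# Illustrative profile: three lab results plus one finding from an ultrasound report.
profile = [
    Fact("LOINC", ALT_LOINC, date(2024, 9, 1), 58.0),
    Fact("LOINC", ALT_LOINC, date(2023, 3, 1), 31.0),
    Fact("LOINC", ALT_LOINC, date(2023, 11, 1), 44.0),
    Fact("SNOMED CT", STEATOSIS_SNOMED, date(2024, 9, 10)),
]
linked = alt_rising_with_steatosis(profile)
```

The same check against two separate files would need someone to read both. Here it is a filter over coded facts.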
Benefits for Users and Clinicians
For users:
- automatic trend charts without manual entry;
- health index and biological age from accumulated data;
- AI reports that reference both labs and clinical findings.
For clinicians:
- a structured summary before appointments;
- secure sharing via time-limited links;
- up to 25% time saved on documentation review.
Challenges and Trade-offs
Semantic decomposition is not perfect:
- OCR quality depends on photo clarity and document layout.
- Unusual lab formats may require model updates.
- Free-text findings have lower extraction confidence than structured tables.
- Standard code mapping fails for truly novel or local test names.
We address these limitations with confidence scores, human-review flags for low-confidence extractions, and continuous model improvement from anonymized patterns.
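The review flag can be pictured as a simple routing step. In the sketch below, the threshold value, the class, and the sample confidence scores are all hypothetical. The only idea taken from the text is that low-confidence or unmapped extractions go to a human instead of straight into the profile.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off, not a documented value


@dataclass
class Extraction:
    text: str               # the fragment the model read from the document
    code: Optional[str]     # standard code, or None when mapping failed
    confidence: float       # model confidence between 0 and 1


def route(
    extractions: List[Extraction], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into auto-accepted and flagged-for-review lists."""
    accepted, needs_review = [], []
    for item in extractions:
        if item.code is not None and item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


accepted, needs_review = route([
    Extraction("GLU 95 mg/dL", "2345-7", 0.98),           # structured table row
    Extraction("signs of fatty hepatosis", None, 0.91),   # no code found
    Extraction("smudged value near page fold", "2345-7", 0.40),
])
```

An unmapped name is flagged even when the model is confident about the text itself, which covers the case of novel or local test names alongside poor OCR.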
Conclusion
Moving from documents to factors transforms medical data from an archive into an analytical asset. Health Vault's factor-centric architecture is the foundation for biomarker tracking, biological age, digital health twins, and AI-powered health reports — all from the same uploaded PDFs and photos.
Originally published on Habr