AI-Powered Medical Document Recognition
Medical documents arrive in every imaginable format: structured PDFs from modern labs, scanned paper reports, smartphone photos taken under fluorescent lighting, handwritten prescriptions in cursive. Manually entering data from these sources is tedious and error-prone. Health Vault solves this with an AI pipeline that achieves over 90% extraction accuracy on typical lab reports.
The Recognition Pipeline
When you upload a document, it passes through several stages:
1. Document Classification
The system first determines the document type: laboratory report, imaging conclusion, discharge summary, prescription, or insurance form. Classification then routes the document to the appropriate extraction model, since lab parsers differ from narrative clinical parsers.
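The routing idea can be sketched as follows. This is a minimal illustration, not Health Vault's implementation: the keyword lists stand in for a trained classifier, and the names `classify`, `route`, `parse_lab_table`, and `parse_narrative` are hypothetical.

```python
from typing import Callable, Dict, Tuple


def parse_lab_table(text: str) -> dict:
    """Placeholder for a table-oriented lab parser (hypothetical)."""
    return {"parser": "lab_table", "chars": len(text)}


def parse_narrative(text: str) -> dict:
    """Placeholder for a free-text clinical parser (hypothetical)."""
    return {"parser": "narrative", "chars": len(text)}


# Keyword stand-in for a trained classifier (assumption for illustration).
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "lab_report": ("reference range", "hemoglobin", "glucose"),
    "imaging_conclusion": ("mri", "ultrasound", "x-ray"),
    "discharge_summary": ("discharge", "admitted"),
    "prescription": ("prescribed", "dosage"),
    "insurance_form": ("policy number", "insurer"),
}

# Only lab reports get the table parser; every other type, including
# "unknown", falls back to the narrative parser in this sketch.
PARSERS: Dict[str, Callable[[str], dict]] = {"lab_report": parse_lab_table}


def classify(text: str) -> str:
    """Return the document type with the most keyword hits, or 'unknown'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(text: str) -> Tuple[str, dict]:
    """Classify the text, then hand it to the parser for that type."""
    doc_type = classify(text)
    parser = PARSERS.get(doc_type, parse_narrative)
    return doc_type, parser(text)
```

Keeping the type-to-parser mapping in a dictionary means a new document type needs a new entry rather than another branch in the routing code.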
2. Optical Character Recognition (OCR)
For images and scanned PDFs, OCR converts pixels to text. Health Vault handles:
- multi-column lab layouts;
- mixed Russian and English text;
- tables with reference ranges;
- low-quality photos (with graceful degradation and confidence scoring).
3. Natural Language Processing (NLP)
Raw OCR text is noisy — misread characters, broken table alignment, missing headers. NLP models:
- identify test names and map them to standard terminologies;
- extract numeric values with units;
- parse dates in various formats;
- detect reference ranges and flag out-of-range values;
- extract narrative findings from free-text conclusions.
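A few of these steps can be illustrated with plain regular expressions. The sketch below assumes a single-line result layout of "name value unit low - high" and three date formats chosen for illustration; a production system would rely on trained models, and the function names here are hypothetical. Terminology mapping is not shown.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed layout: "<test name> <value> <unit> [<low> - <high>]".
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?P<value>\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>[A-Za-z/%]+)"
    r"(?:\s+(?P<low>\d+(?:[.,]\d+)?)\s*-\s*(?P<high>\d+(?:[.,]\d+)?))?\s*$"
)

# Date formats assumed for illustration; real documents use many more.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y")


def _to_float(raw: str) -> float:
    """Accept both '13.5' and '13,5' as decimal notation."""
    return float(raw.replace(",", "."))


def parse_result_line(line: str) -> Optional[dict]:
    """Extract name, value, unit, and optional reference range from a line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    value = _to_float(match.group("value"))
    low = _to_float(match.group("low")) if match.group("low") else None
    high = _to_float(match.group("high")) if match.group("high") else None
    # A value can only be flagged when the line carries a reference range.
    out_of_range = (
        low is not None and high is not None and not (low <= value <= high)
    )
    return {
        "name": match.group("name"),
        "value": value,
        "unit": match.group("unit"),
        "low": low,
        "high": high,
        "out_of_range": out_of_range,
    }


def parse_date(text: str) -> Optional[date]:
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

For example, `parse_result_line("Glucose 6.2 mmol/L 3.9 - 5.5")` yields a record with `out_of_range` set to `True`, because 6.2 lies above the upper bound of 5.5.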
4. Validation and Normalization
Extracted data passes validation:
- Plausibility checks: a glucose value of 500 mmol/L triggers review (likely an OCR error).
- Unit normalization: values reported in mg/dL and mmol/L are converted to one consistent unit per test.
- Code assignment: LOINC for lab tests, SNOMED CT for clinical concepts.
- Deduplication: the same test on the same date from a re-uploaded document is stored only once.
Validated values are stored in the user's profile; the source document is preserved for reference.
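Three of the checks above (unit normalization, plausibility, deduplication) can be sketched for a single analyte. The example below is an assumption-laden illustration, not the production logic: the `Result` type and function names are hypothetical, and the plausibility bounds are made up for the example and are not clinical guidance. The glucose conversion factor follows from its molar mass of about 180.16 g/mol, so 1 mmol/L corresponds to about 18.016 mg/dL.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass ~180.16 g/mol).
GLUCOSE_MG_DL_PER_MMOL_L = 18.016

# Illustrative bounds in mmol/L; assumed for this sketch only.
PLAUSIBLE_MMOL_L = {"glucose": (0.5, 60.0)}


@dataclass(frozen=True)
class Result:
    test: str
    value: float
    unit: str
    taken_on: date


def normalize(result: Result) -> Result:
    """Convert glucose from mg/dL to mmol/L; leave other results as they are."""
    if result.test == "glucose" and result.unit == "mg/dL":
        converted = round(result.value / GLUCOSE_MG_DL_PER_MMOL_L, 2)
        return Result(result.test, converted, "mmol/L", result.taken_on)
    return result


def needs_review(result: Result) -> bool:
    """Flag normalized values outside the plausible range for their test."""
    bounds = PLAUSIBLE_MMOL_L.get(result.test)
    if bounds is None:
        return False  # no rule defined for this test in the sketch
    low, high = bounds
    return not (low <= result.value <= high)


def dedupe(results: Iterable[Result]) -> List[Result]:
    """Keep the first result seen for each (test, date) pair."""
    seen = set()
    unique = []
    for result in results:
        key = (result.test, result.taken_on)
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

Normalizing before the plausibility check matters: 90 mg/dL is an ordinary glucose reading, but the same number read as mmol/L would be flagged.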
Supported Input Formats
| Format | Support | Notes |
|---|---|---|
| PDF (digital) | Full | Best accuracy — native text extraction |
| PDF (scan) | Full | OCR applied to embedded images |
| JPEG/PNG photo | Full | Works best with flat, well-lit images |
| Handwritten prescriptions | Partial | Depends on legibility; confidence score shown |
| DICOM reports | Planned | Imaging metadata and conclusions |
Accuracy and Limitations
What works well:
- standard lab panels from major Russian and international laboratories;
- structured tables with clear test name / value / unit columns;
- digital PDFs with embedded text.
What is harder:
- heavily stylized lab branding with non-standard layouts;
- handwritten notes with poor photo quality;
- documents in languages not yet supported;
- combined reports with multiple unrelated sections.
Every extracted value includes a confidence score. Low-confidence items are flagged for user review before they enter analytics.
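The review gate described here amounts to a threshold split. In the sketch below, the threshold of 0.85, the `confidence` field name, and the function name are all assumptions for illustration; the article does not state the actual cutoff.

```python
from typing import Iterable, List, Tuple

REVIEW_THRESHOLD = 0.85  # assumed cutoff; the real value is not published


def split_by_confidence(
    items: Iterable[dict], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[dict], List[dict]]:
    """Separate items ready for analytics from those needing user review."""
    accepted: List[dict] = []
    flagged: List[dict] = []
    for item in items:
        if item["confidence"] >= threshold:
            accepted.append(item)
        else:
            flagged.append(item)
    return accepted, flagged
```

Only the `accepted` list would feed trend charts and reports; the `flagged` list would be shown to the user for confirmation or correction first.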
From Recognition to Analytics
Recognition is not the end goal — it is the entry point. Once biomarkers are extracted:
- trend charts appear automatically;
- the health index updates;
- biological age recalculates with new data;
- AI reports incorporate the latest values.
This closed loop — upload, extract, analyze — is what makes Health Vault more than a document scanner. The Argus OCR module, originally developed for Health Vault, was also released as a standalone Telegram bot and reached #1 on ProductRadar.
Privacy During Processing
Documents are processed within Health Vault's secure infrastructure:
- TLS 1.3 encryption in transit;
- AES-256 encryption at rest;
- no sharing with third parties for model training;
- compliance with Federal Law 152-FZ on personal data.
Tips for Best Results
- Photograph documents flat, in good lighting, without shadows.
- Include the full page — headers often contain dates and patient identifiers.
- Prefer PDF downloads from lab portals over re-photographed printouts.
- Review flagged low-confidence values after upload.
Conclusion
AI document recognition removes the biggest barrier to personal health tracking: manual data entry. Health Vault turns photos and PDFs into structured, coded, longitudinal health data — automatically, accurately, and securely.
Originally published on Habr