AI-Powered Medical Document Recognition

Medical documents arrive in every imaginable format: structured PDFs from modern labs, scanned paper reports, smartphone photos taken under fluorescent lighting, handwritten prescriptions in cursive. Manually entering data from these sources is tedious and error-prone. Health Vault solves this with an AI pipeline that achieves over 90% extraction accuracy on typical lab reports.

The Recognition Pipeline

When you upload a document, it passes through several stages:

1. Document Classification

The system first determines document type: laboratory report, imaging conclusion, discharge summary, prescription, or insurance form. Classification routes the document to the appropriate extraction model — lab parsers differ from narrative clinical parsers.
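The routing idea can be illustrated with a deliberately simple keyword scorer. This is a minimal sketch, not Health Vault's actual classifier (which is presumably a trained model): the keyword lists, the parser names, and the functions `classify` and `route` are all hypothetical.

```python
import re

# Hypothetical keyword lists, one per document type named in the text.
KEYWORDS = {
    "lab_report": ["reference range", "result", "units", "hemoglobin", "glucose"],
    "imaging_conclusion": ["mri", "ct", "ultrasound", "x-ray", "impression"],
    "discharge_summary": ["discharge", "admitted", "hospital course"],
    "prescription": ["rx", "take", "tablet", "dosage"],
    "insurance_form": ["policy", "insurer", "claim", "coverage"],
}

# Hypothetical parser names: tabular parsers for labs, narrative parsers for free text.
EXTRACTORS = {
    "lab_report": "table_parser",
    "imaging_conclusion": "narrative_parser",
    "discharge_summary": "narrative_parser",
    "prescription": "prescription_parser",
    "insurance_form": "form_parser",
}


def classify(text: str) -> str:
    """Return the document type whose keywords occur most often as whole words."""
    lowered = text.lower()
    scores = {
        doc_type: sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered)) for kw in kws
        )
        for doc_type, kws in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(text: str) -> tuple[str, str]:
    """Pick the extractor for a document; unknown types go to manual review."""
    doc_type = classify(text)
    return doc_type, EXTRACTORS.get(doc_type, "manual_review")
```

Whole-word matching matters here: a plain substring count would find "ct" inside "doctor" and misroute documents.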

2. Optical Character Recognition (OCR)

For images and scanned PDFs, OCR converts pixels to text. Health Vault handles:

multi-column lab layouts;
mixed Russian and English text;
tables with reference ranges;
low-quality photos (with graceful degradation and confidence scoring).
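Graceful degradation on poor photos can be sketched as follows: rather than rejecting a blurry page, keep the recognized text and attach a score that tells later stages how much to trust it. The sketch assumes the OCR engine reports a per-word confidence between 0.0 and 1.0; the `OcrWord` type, the blending formula, and the 0.6 threshold are illustrative assumptions, not the product's actual scoring.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (certain)


def line_confidence(words: list[OcrWord]) -> float:
    """Blend the mean and the minimum so one misread token lowers the whole line."""
    if not words:
        return 0.0
    confs = [w.confidence for w in words]
    return round(0.5 * mean(confs) + 0.5 * min(confs), 3)


def degrade(words: list[OcrWord], threshold: float = 0.6) -> dict:
    """Keep the text, but mark it for review when the line score is low."""
    score = line_confidence(words)
    return {
        "text": " ".join(w.text for w in words),
        "confidence": score,
        "needs_review": score < threshold,
    }
```

Including the minimum in the score is a design choice: in a lab line, a confidently read test name does not compensate for a doubtful numeric value.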

3. Natural Language Processing (NLP)

Raw OCR text is noisy — misread characters, broken table alignment, missing headers. NLP models:

identify test names and map them to standard terminologies;
extract numeric values with units;
parse dates in various formats;
detect reference ranges and flag out-of-range values;
extract narrative findings from free-text conclusions.

4. Validation and Normalization

Extracted data passes validation:

Plausibility checks: a glucose value of 500 mmol/L triggers review (likely an OCR error).
Unit normalization: values in mg/dL and mmol/L are converted to a single unit per test, using the factor specific to that analyte.
Code assignment: LOINC for lab tests, SNOMED CT for clinical concepts.
Deduplication: the same test on the same date from a re-uploaded document is stored only once.
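Three of these checks can be sketched for glucose alone. The conversion factor follows from glucose's molar mass (about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL). Everything else is an assumption for illustration: the `Observation` type, the plausibility window, and the choice of mmol/L as the target unit.

```python
from dataclasses import dataclass
from datetime import date

# 1 mmol/L of glucose is about 18.016 mg/dL (molar mass about 180.16 g/mol).
GLUCOSE_MGDL_PER_MMOLL = 18.016

# Hypothetical plausibility window in mmol/L; real limits need clinical review.
PLAUSIBLE = {"glucose": (0.5, 60.0)}


@dataclass(frozen=True)
class Observation:
    test: str
    value: float
    unit: str
    taken_on: date


def normalize(obs: Observation) -> Observation:
    """Convert glucose in mg/dL to mmol/L; leave everything else untouched."""
    if obs.test == "glucose" and obs.unit.lower() == "mg/dl":
        converted = round(obs.value / GLUCOSE_MGDL_PER_MMOLL, 2)
        return Observation(obs.test, converted, "mmol/L", obs.taken_on)
    return obs


def is_plausible(obs: Observation) -> bool:
    """Check a normalized value against the window; unknown tests pass."""
    low, high = PLAUSIBLE.get(obs.test, (float("-inf"), float("inf")))
    return low <= obs.value <= high


def deduplicate(observations: list[Observation]) -> list[Observation]:
    """Keep the first observation for each (test, date) pair."""
    seen: set[tuple[str, date]] = set()
    unique = []
    for obs in observations:
        key = (obs.test, obs.taken_on)
        if key not in seen:
            seen.add(key)
            unique.append(obs)
    return unique
```

Order matters: the plausibility window is expressed in the normalized unit, so `normalize` has to run before `is_plausible`. A reading of 99 mg/dL becomes about 5.5 mmol/L and passes, while 500 mmol/L is rejected.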

Validated factors are stored in the user's profile; the source document is preserved for reference.

Supported Input Formats

Format	Support	Notes
PDF (digital)	Full	Best accuracy — native text extraction
PDF (scan)	Full	OCR applied to embedded images
JPEG/PNG photo	Full	Works best with flat, well-lit images
Handwritten prescriptions	Partial	Depends on legibility; confidence score shown
DICOM reports	Planned	Imaging metadata and conclusions

Accuracy and Limitations

What works well:

standard lab panels from major Russian and international laboratories;
structured tables with clear test name / value / unit columns;
digital PDFs with embedded text.

What is harder:

heavily stylized lab branding with non-standard layouts;
handwritten notes with poor photo quality;
documents in languages not yet supported;
combined reports with multiple unrelated sections.

Every extracted value includes a confidence score. Low-confidence items are flagged for user review before entering analytics.
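The review gate described above amounts to splitting extracted values by confidence before analytics sees them. A minimal sketch, assuming each value is a dictionary with a `confidence` key between 0.0 and 1.0; the function name and the 0.85 threshold are hypothetical.

```python
def split_for_review(items: list[dict], threshold: float = 0.85) -> tuple[list[dict], list[dict]]:
    """Return (accepted, flagged); only accepted items should reach analytics."""
    accepted, flagged = [], []
    for item in items:
        target = accepted if item["confidence"] >= threshold else flagged
        target.append(item)
    return accepted, flagged
```

Flagged items stay out of trend charts and derived indices until the user confirms or corrects them.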

From Recognition to Analytics

Recognition is not the end goal — it is the entry point. Once biomarkers are extracted:

trend charts appear automatically;
the health index updates;
biological age recalculates with new data;
AI reports incorporate the latest values.

This closed loop — upload, extract, analyze — is what makes Health Vault more than a document scanner. The Argus OCR module, originally developed for Health Vault, was also released as a standalone Telegram bot and reached #1 on ProductRadar.

Privacy During Processing

Documents are processed within Health Vault's secure infrastructure:

TLS 1.3 encryption in transit;
AES-256 encryption at rest;
no sharing with third parties for model training;
compliance with Federal Law 152-FZ on personal data.

Tips for Best Results

Photograph documents flat, in good lighting, without shadows.
Include the full page — headers often contain dates and patient identifiers.
Prefer PDF downloads from lab portals over re-photographed printouts.
Review flagged low-confidence values after upload.

Conclusion

AI document recognition removes the biggest barrier to personal health tracking: manual data entry. Health Vault turns photos and PDFs into structured, coded, longitudinal health data — automatically, accurately, and securely.

Originally published on Habr
