AI-Powered Medical Document Recognition

Medical documents arrive in every imaginable format: structured PDFs from modern labs, scanned paper reports, smartphone photos taken under fluorescent lighting, handwritten prescriptions in cursive. Manually entering data from these sources is tedious and error-prone. Health Vault solves this with an AI pipeline that achieves over 90% extraction accuracy on typical lab reports.

The Recognition Pipeline

When you upload a document, it passes through several stages:

1. Document Classification

The system first determines document type: laboratory report, imaging conclusion, discharge summary, prescription, or insurance form. Classification routes the document to the appropriate extraction model — lab parsers differ from narrative clinical parsers.
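The routing step can be illustrated with a minimal sketch. It assumes a keyword-count classifier standing in for whatever model Health Vault actually uses; the keyword lists, `classify`, `route`, and both extractor functions are hypothetical names introduced only for this example.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class DocType(Enum):
    LAB_REPORT = "laboratory report"
    IMAGING = "imaging conclusion"
    DISCHARGE = "discharge summary"
    PRESCRIPTION = "prescription"
    INSURANCE = "insurance form"


# Hypothetical keyword lists; a production classifier would be a trained model.
KEYWORDS: Dict[DocType, Tuple[str, ...]] = {
    DocType.LAB_REPORT: ("reference range", "mmol/l", "hemoglobin"),
    DocType.IMAGING: ("mri", "ultrasound", "x-ray"),
    DocType.DISCHARGE: ("discharge", "admitted", "hospital course"),
    DocType.PRESCRIPTION: ("tablet", "dosage", "twice daily"),
    DocType.INSURANCE: ("policy", "insurer", "claim"),
}


def classify(text: str) -> DocType:
    """Pick the document type whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    # Ties (including all-zero scores) resolve to the first type in KEYWORDS.
    return max(scores, key=lambda doc_type: scores[doc_type])


def extract_lab_table(text: str) -> str:
    return "lab parser"  # placeholder for a table-oriented extractor


def extract_narrative(text: str) -> str:
    return "narrative parser"  # placeholder for a free-text extractor


# Only lab reports get the table parser here; everything else falls back
# to the narrative parser.
EXTRACTORS: Dict[DocType, Callable[[str], str]] = {
    DocType.LAB_REPORT: extract_lab_table,
}


def route(text: str) -> str:
    """Classify the document, then hand it to the matching extractor."""
    extractor = EXTRACTORS.get(classify(text), extract_narrative)
    return extractor(text)
```

Keeping the type-to-extractor mapping in a dictionary means a new document type only needs a new entry, not a change to the routing logic.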

2. Optical Character Recognition (OCR)

For images and scanned PDFs, OCR converts pixels to text. Health Vault handles:

  • multi-column lab layouts;
  • mixed Russian and English text;
  • tables with reference ranges;
  • low-quality photos (with graceful degradation and confidence scoring).

3. Natural Language Processing (NLP)

Raw OCR text is noisy — misread characters, broken table alignment, missing headers. NLP models:

  • identify test names and map them to standard terminologies;
  • extract numeric values with units;
  • parse dates in various formats;
  • detect reference ranges and flag out-of-range values;
  • extract narrative findings from free-text conclusions.
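A few of the tasks above (numeric values with units, reference ranges, out-of-range flags, and date parsing) can be sketched with regular expressions. This is a simplified stand-in for the NLP models, not the method Health Vault uses; the line layout "name value unit low - high", the `LabResult` class, and the list of date formats are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

NUMBER = r"\d+(?:[.,]\d+)?"  # accepts both "5.4" and "5,4"

# Assumed layout: "<test name> <value> <unit> <low> - <high>"
LINE_RE = re.compile(
    rf"^(?P<name>.+?)\s+(?P<value>{NUMBER})\s+(?P<unit>\S+)\s+"
    rf"(?P<low>{NUMBER})\s*[-–]\s*(?P<high>{NUMBER})$"
)

# Hypothetical set of date formats; real documents use many more.
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y")


@dataclass
class LabResult:
    name: str
    value: float
    unit: str
    ref_low: float
    ref_high: float

    @property
    def out_of_range(self) -> bool:
        return not (self.ref_low <= self.value <= self.ref_high)


def to_float(text: str) -> float:
    """Convert a number that may use a decimal comma."""
    return float(text.replace(",", "."))


def parse_lab_line(line: str) -> Optional[LabResult]:
    """Return a LabResult, or None if the line does not fit the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LabResult(
        name=match["name"].strip(),
        value=to_float(match["value"]),
        unit=match["unit"],
        ref_low=to_float(match["low"]),
        ref_high=to_float(match["high"]),
    )


def parse_date(text: str) -> Optional[date]:
    """Try each known format in turn; return None if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

For example, `parse_lab_line("Glucose 7,2 mmol/L 3.9 - 6.1")` yields a result whose `out_of_range` property is true, while a line that does not fit the layout returns `None` and can be passed to a more tolerant parser.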

4. Validation and Normalization

Extracted data passes validation:

  • Plausibility checks: a glucose value of 500 mmol/L triggers review (likely an OCR error).
  • Unit normalization: values reported in mg/dL and mmol/L are converted to one consistent unit per test.
  • Code assignment: LOINC for lab tests, SNOMED CT for clinical concepts.
  • Deduplication: the same test on the same date from a re-uploaded document is stored only once.
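Unit normalization and the plausibility check can be sketched together for glucose. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL); the plausibility bounds and the function names are illustrative assumptions, not Health Vault's actual limits.

```python
from typing import Dict, Tuple

# mg/dL per 1 mmol/L, derived from molar mass (glucose: ~180.16 g/mol).
MG_DL_PER_MMOL_L: Dict[str, float] = {"glucose": 18.016}

# Hypothetical plausibility bounds in mmol/L; values outside go to review.
PLAUSIBLE_MMOL_L: Dict[str, Tuple[float, float]] = {"glucose": (0.5, 60.0)}


def to_mmol_l(test: str, value: float, unit: str) -> float:
    """Normalize a value to mmol/L; raise on units this sketch does not know."""
    unit_key = unit.strip().lower()
    if unit_key == "mmol/l":
        return value
    if unit_key == "mg/dl":
        return value / MG_DL_PER_MMOL_L[test]
    raise ValueError(f"unsupported unit: {unit!r}")


def needs_review(test: str, value_mmol_l: float) -> bool:
    """Flag values outside the plausible range for the given test."""
    low, high = PLAUSIBLE_MMOL_L[test]
    return not (low <= value_mmol_l <= high)
```

Under these assumed bounds, 500 mg/dL normalizes to roughly 27.8 mmol/L and passes, while 500 mmol/L is flagged. Normalizing before checking is what lets the check separate a mislabeled unit from a plausible measurement.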

Validated values are stored in the user's profile; the source document is preserved for reference.
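Deduplication at storage time can be sketched by keying each result on the test and the collection date. The `Profile` class and its key are assumptions for illustration; a real system would more likely key on the assigned code than on a normalized name.

```python
from datetime import date
from typing import Dict, Tuple

ResultKey = Tuple[str, date]


class Profile:
    """Hypothetical in-memory profile that stores one value per test and date."""

    def __init__(self) -> None:
        self._results: Dict[ResultKey, float] = {}

    def add(self, test: str, taken_on: date, value: float) -> bool:
        """Store the result; return False if it is already present."""
        key = (test.strip().lower(), taken_on)
        if key in self._results:
            return False  # duplicate from a re-uploaded document
        self._results[key] = value
        return True

    def __len__(self) -> int:
        return len(self._results)
```

Uploading the same report twice then leaves the profile unchanged, because the second `add` call for each test returns `False`.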

Supported Input Formats

Format                    | Support | Notes
PDF (digital)             | Full    | Best accuracy (native text extraction)
PDF (scan)                | Full    | OCR applied to embedded images
JPEG/PNG photo            | Full    | Works best with flat, well-lit images
Handwritten prescriptions | Partial | Depends on legibility; confidence score shown
DICOM reports             | Planned | Imaging metadata and conclusions

Accuracy and Limitations

What works well:

  • standard lab panels from major Russian and international laboratories;
  • structured tables with clear test name / value / unit columns;
  • digital PDFs with embedded text.

What is harder:

  • heavily stylized lab branding with non-standard layouts;
  • handwritten notes with poor photo quality;
  • documents in languages not yet supported;
  • combined reports with multiple unrelated sections.

Every extracted value includes a confidence score. Low-confidence items are flagged for user review before entering analytics.
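The review gate can be sketched as a simple threshold split. The 0.85 cutoff, the `ExtractedValue` class, and the `triage` function are hypothetical; the article does not state which threshold Health Vault applies.

```python
from dataclasses import dataclass
from typing import List, Tuple

REVIEW_THRESHOLD = 0.85  # hypothetical cutoff, not taken from the article


@dataclass
class ExtractedValue:
    name: str
    value: float
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def triage(
    items: List[ExtractedValue], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[ExtractedValue], List[ExtractedValue]]:
    """Split items into (accepted, flagged for user review)."""
    accepted = [item for item in items if item.confidence >= threshold]
    flagged = [item for item in items if item.confidence < threshold]
    return accepted, flagged
```

Only the accepted list would feed analytics directly; flagged items wait until the user confirms or corrects them.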

From Recognition to Analytics

Recognition is not the end goal — it is the entry point. Once biomarkers are extracted:

  • trend charts appear automatically;
  • the health index updates;
  • biological age recalculates with new data;
  • AI reports incorporate the latest values.

This closed loop — upload, extract, analyze — is what makes Health Vault more than a document scanner. The Argus OCR module, originally developed for Health Vault, was also released as a standalone Telegram bot and reached #1 on ProductRadar.

Privacy During Processing

Documents are processed within Health Vault's secure infrastructure:

  • TLS 1.3 encryption in transit;
  • AES-256 encryption at rest;
  • no sharing with third parties for model training;
  • compliance with Federal Law 152-FZ on personal data.

Tips for Best Results

  1. Photograph documents flat, in good lighting, without shadows.
  2. Include the full page — headers often contain dates and patient identifiers.
  3. Prefer PDF downloads from lab portals over re-photographed printouts.
  4. Review flagged low-confidence values after upload.

Conclusion

AI document recognition removes the biggest barrier to personal health tracking: manual data entry. Health Vault turns photos and PDFs into structured, coded, longitudinal health data — automatically, accurately, and securely.


Originally published on Habr

Vert Neo Limited, developer of Health Vault