# Introducing Ragextract
Ragextract, by SubworkflowAI and available today, reduces AI document OCR costs for large documents.
Published Jan 15, 2024
Posted in #ragextract

## It's a Search Problem
If you work with large documents, you are typically only interested in a small portion of them: a customer's details, the specifics of a policy, or the numbers in a table buried midway through a report.
Modern document AI services help you convert these documents to markdown using LLMs, so that you can build your own search over the contents. This has worked well for a time, but it is not without drawbacks:
* **It's slow**, since the LLM must parse every page and transcribe its contents to text, which can take up to a minute depending on the information density of the page.
* **It's expensive**, as the transcribed text is billed as output tokens, which are priced several times higher than input tokens.
* **It's wasteful**, because not every page is relevant and much of the output is never used. That dense appendix filler page that took a minute to transcribe, chunk and store? Discarded almost immediately.
We decided to focus on the search optimisation problem and set out to test a simple assumption: **what if we could enable document search first, before the costly LLM conversion step?**
## Ragextract is Born
After months of R&D, our solution finally arrived in early 2025 by way of multimodal embeddings. This technology lets us vectorize entire pages visually, making processing 3x faster than full conversion and roughly 100x cheaper in token cost. In production, this approach let one project skip OCR parsing on 90% of its pages with no drop in search accuracy, saving thousands in LLM costs over the quarter.
With these learnings, we created **[Ragextract](https://ragextract.com)** - a standalone service that offers this state-of-the-art, refined approach to others. Of course, embeddings alone aren't enough! Ragextract is a complete end-to-end service that handles storage, search and retrieval, dramatically cutting down the supporting infrastructure and tooling needed for most document AI workloads.
|Capability|Ragextract|Existing Services|
|-|-|-|
|Process document|✅ Designed for large documents|❌ Assumes small workloads|
|Conversion to markdown|✅ No need!|❌ Yes and forced on every page|
|Generate embeddings|✅ Multimodal means text, images, charts and diagrams |❌ Not supported|
|Upload to vector store|✅ Automatic and isolated vector store provided|❌ Not supported|
|Search endpoint|✅ Production API ready to return matching pages instantly|❌ Not supported|
|Retrieve binary files|✅ Retrieval API and storage provided for your data|❌ Not supported|
|Extract data|✅ Bring-your-own-provider means full control over costs|❌ Markup on LLM costs|
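Putting the table together in practice, here is a minimal sketch of the search-first flow using the three endpoints demonstrated later in this post; the file name, query and dataset ID are placeholders.
```bash
# 1. Vectorize a document into a dataset (file name is a placeholder)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@annual_report.pdf"

# 2. Search the dataset for just the pages you need (dataset ID is a placeholder)
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"query": "total revenue by segment", "datasetIds": ["ds_placeholder"]}'

# 3. Retrieve the matching pages from storage for downstream extraction
curl "https://api.subworkflow.ai/v1/datasets/ds_placeholder/items" \
  --header "x-api-key: $RAGEXTRACT_API_KEY"
```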
## Using Ragextract
Ragextract is available as a production-ready API service that requires no installation. Simply sign up for an API key to start using Ragextract today.
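The examples below read your key from a `RAGEXTRACT_API_KEY` environment variable; export it once per shell session (the value shown is a placeholder).
```bash
# Make the API key available to the curl examples below (placeholder value)
export RAGEXTRACT_API_KEY="your-api-key-here"
```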
### 1. Extracting Vehicle Details in Insurance Claims
Insurance teams need to receive and process a variety of lengthy forms from customers and third parties. While data extraction may seem simple, no two forms are the same, and the first challenge is always locating the right data within the document. Running vision-enabled AI OCR on every page of a 100+ page document (for over 30,000 documents a year!) gets expensive quickly. Ragextract performs that same search - across text, images, tables and fields - quickly and cheaply, no matter how large the document is.
**Use-case**: Automate third party claim submissions, data entry and classification from inbox to CMS.
```bash
# Vectorize the claim form (the @ tells curl to upload the file contents)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@insurance_claim_application.pdf"

# After processing is complete, search for the relevant pages
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "Find motor vehicle details and registration of all parties involved in claim",
    "datasetIds": ["ds_B5bsOBDzsXsqfmLo"]
  }'
```
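A typical next step is to hand only the matching pages to your own extraction model. The response handling below is an illustrative assumption (a `results` array with `page` numbers) rather than the documented schema; check the API reference for the actual fields.
```bash
# Save the search response and pull out the matched page numbers with jq.
# NOTE: the "results[].page" shape is an illustrative assumption, not the documented schema.
curl -s -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"query": "Find motor vehicle details and registration of all parties involved in claim", "datasetIds": ["ds_B5bsOBDzsXsqfmLo"]}' \
  | tee search_response.json \
  | jq '[.results[].page]'
```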
### 2. Identifying Signals in Tender Documents
More than being able to find relevant tenders, bid managers need a way to quickly filter out irrelevant ones. When tender documents arrive in competing standards, even skimming a promising project only to discover a disqualifying hard requirement can waste half a day. Ragextract can run search queries across all documents simultaneously - whether they are questionnaires, appendices or diagrams - in seconds. This lets managers not only save precious time but also expand their capacity to evaluate more tenders per day.
**Use-case**: Summarize and get answers from any collection of unstructured tender documents in as little as 30 seconds. Increased productivity means a bigger pipeline and more opportunities caught.
```bash
# Vectorize the whole tender bundle in one request (the @ uploads each file's contents)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@invitation_to_tender.docx" \
  --form "file=@questionnaire_a.docx" \
  --form "file=@questionnaire_b.docx" \
  --form "file=@appendix_4.docx" \
  --form "file=@appendix_3.docx" \
  --form "file=@sitemap.pdf"

# Ask one question across everything that was uploaded
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "What previous work experience is required for this tender?"
  }'
```
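Bid screening usually comes down to the same handful of knock-out questions asked of every tender. A small shell loop over the search endpoint covers that; the questions below are illustrative.
```bash
# Run a fixed set of knock-out questions against the uploaded tender documents
# (the questions are illustrative placeholders)
questions=(
  "What previous work experience is required for this tender?"
  "Is an on-site presence or local office mandatory?"
  "What insurance or accreditation levels are required?"
)
for q in "${questions[@]}"; do
  curl -s -X POST https://api.subworkflow.ai/v1/search \
    --header "x-api-key: $RAGEXTRACT_API_KEY" \
    --header "Content-Type: application/json" \
    --data "{\"query\": \"$q\"}"
done
```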
### 3. Concurrency for Financial Statements
Bank statements present a slightly different challenge: it's less about search and more about bulk file handling. Leverage Ragextract's reliable document processing pipeline, which indexes hundreds of pages per second, and our storage and retrieval API, which returns pages in either PDF or JPG format.
**Use-case**: Shortcut building your own document pipeline by using Ragextract, and gain day-one support for concurrency to handle more documents, plus properly maintained APIs to quickly integrate into your backend.
```bash
# Submit a statement for extraction (the @ tells curl to upload the file contents)
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@apr2025.pdf"

# Returns pages 2 to 10 as JPGs (quote the URL so the shell doesn't split on &)
curl "https://api.subworkflow.ai/v1/datasets/ds_ar7e4PtGX7fGGnSt/items?rows=jpg&cols=2:10" \
  --header "x-api-key: $RAGEXTRACT_API_KEY"
```
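On the client side, concurrency can be as simple as fanning uploads out with `xargs` while the service-side pipeline handles the rest; the `statements/` directory and parallelism level below are placeholders.
```bash
# Upload a folder of statements to the extract endpoint, 8 at a time
# (client-side parallelism via xargs; the statements/ directory is a placeholder)
ls statements/*.pdf | xargs -P 8 -I {} \
  curl -s -X POST https://api.subworkflow.ai/v1/extract \
    --header "x-api-key: $RAGEXTRACT_API_KEY" \
    --form "file=@{}"
```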
## Client SDKs and Integrations
You can also use Ragextract from your favourite platform using the following packages.
* **Typescript/Javascript Library** - [https://github.com/Subworkflow-AI/ragextract-js](https://github.com/Subworkflow-AI/ragextract-js)
* **n8n Community Node** - [https://github.com/Subworkflow-AI/n8n-nodes-ragextract](https://github.com/Subworkflow-AI/n8n-nodes-ragextract)
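Both packages can be installed in the usual way. The npm package names below are assumed to match the GitHub repository names; the n8n node can also be installed from the n8n UI under Settings > Community Nodes.
```bash
# TypeScript/JavaScript client (package name assumed to match the repository name)
npm install ragextract-js

# n8n community node, installed into your n8n custom nodes directory
# (package name assumed to match the repository name)
npm install n8n-nodes-ragextract
```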
## Conclusion
In migrating AI document workflows to Ragextract, we found that our alternative approach of search-first, parse-later is a viable and preferable strategy for selective data extraction from long documents. Here's a summary of our learnings:
1. Multimodal embeddings perform better than text-only embeddings for search, as they also capture visual information such as images, charts and diagrams in documents.
2. Multimodal embeddings are many times faster and cheaper than generating LLM output tokens and the text chunking work that follows.
3. For production usage, however, you still need a durable document processing pipeline that handles concurrency, object storage and a database, and serves retrieval APIs.
**Ragextract provides all of the above.** We encourage you to read the [official Ragextract documentation](https://docs.subworkflow.ai/intro) to explore how the Ragextract approach can improve your document workflows.
**[Sign up for a free 14-day trial of Ragextract today!](https://ragextract.com)**
Thanks for reading!
Questions or feedback? Send us a message on our Discord or email hello@subworkflow.ai.