# Introducing Ragextract
Ragextract, by SubworkflowAI and available today, reduces AI document OCR costs for large documents.
Published Jan 15, 2024
Posted in #ragextract

## It's a Search Problem
If you work with large documents, you are typically only interested in a small portion of them: a customer's details, the specifics of a policy, or the numbers in a table buried midway through a report.
Modern document AI services help you convert these documents to markdown using LLMs, so that you can build your own search over the contents. This has worked well for a time, but it is not without drawbacks:
* **It's slow**, since the LLM must parse every page and transcribe its contents to text, which can take up to a minute depending on the information density of the page.
* **It's expensive**, as the transcribed text is billed as output tokens, which are priced several times higher than input tokens.
* **It's wasteful**, because not every page is relevant and much of the output is never used. That dense appendix filler page that took a minute to transcribe, chunk and store? Discarded almost immediately.
We decided to focus on the search optimisation problem and set out to test a simple assumption: **what if we could enable document search first, before the costly LLM conversion step?**
## Ragextract is Born
After months of R&D, our solution finally arrived in early 2025 by way of multimodal embeddings. This technology lets us vectorize entire pages visually, making processing 3x faster than full conversion and roughly 100x cheaper in token cost. In production, this approach let one project skip OCR parsing on 90% of its pages with no drop in search accuracy, saving thousands in LLM costs over the quarter.
With these learnings, we created **[Ragextract](https://ragextract.com)** - a standalone service that offers this state-of-the-art, refined approach to others. Of course, embeddings alone aren't enough! Ragextract is a complete end-to-end service that handles storage, search and retrieval, dramatically cutting down the supporting infrastructure and tooling needed for most document AI workloads.
|Capability|Ragextract|Existing Services|
|-|-|-|
|Process document|✅ Designed for large documents|❌ Assumes small workloads|
|Conversion to markdown|✅ No need!|❌ Yes and forced on every page|
|Generate embeddings|✅ Multimodal means text, images, charts and diagrams |❌ Not supported|
|Upload to vector store|✅ Automatic and isolated vector store provided|❌ Not supported|
|Search endpoint|✅ Production API ready to return matching pages instantly|❌ Not supported|
|Retrieve binary files|✅ Retrieval API and storage provided for your data|❌ Not supported|
|Extract data|✅ Bring-your-own-provider means full control over costs|❌ Markup on LLM costs|
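Putting the table together in practice, here is a minimal sketch of the search-first flow using the three endpoints demonstrated later in this post; the file name, query and dataset ID are placeholders.
```bash
# 1. Vectorize a document into a dataset (file name is a placeholder)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@annual_report.pdf"

# 2. Search the dataset for just the pages you need (dataset ID is a placeholder)
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"query": "total revenue by segment", "datasetIds": ["ds_placeholder"]}'

# 3. Retrieve the matching pages from storage for downstream extraction
curl "https://api.subworkflow.ai/v1/datasets/ds_placeholder/items" \
  --header "x-api-key: $RAGEXTRACT_API_KEY"
```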
## Using Ragextract
Ragextract is available as a production-ready API service that requires no installation. Simply sign up for an API key to start using Ragextract today.
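The examples below read your key from a `RAGEXTRACT_API_KEY` environment variable; export it once per shell session (the value shown is a placeholder).
```bash
# Make the API key available to the curl examples below (placeholder value)
export RAGEXTRACT_API_KEY="your-api-key-here"
```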
### 1. Extracting Vehicle Details in Insurance Claims
Insurance teams need to receive and process a variety of lengthy forms from customers and third parties. While data extraction may seem simple, no two forms are the same, and the first challenge is always locating the right data within the document. Running vision-enabled AI OCR on every page of a 100+ page document (for over 30,000 documents a year!) gets expensive quickly. Ragextract performs that same search - across text, images, tables and fields - quickly and cheaply, no matter how large the document is.
**Use-case**: Automate third party claim submissions, data entry and classification from inbox to CMS.
```bash
# Vectorize the claim form (the @ tells curl to upload the file contents)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@insurance_claim_application.pdf"

# After processing is complete, search for the relevant pages
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "Find motor vehicle details and registration of all parties involved in claim",
    "datasetIds": ["ds_B5bsOBDzsXsqfmLo"]
  }'
```
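A typical next step is to hand only the matching pages to your own extraction model. The response handling below is an illustrative assumption (a `results` array with `page` numbers) rather than the documented schema; check the API reference for the actual fields.
```bash
# Save the search response and pull out the matched page numbers with jq.
# NOTE: the "results[].page" shape is an illustrative assumption, not the documented schema.
curl -s -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"query": "Find motor vehicle details and registration of all parties involved in claim", "datasetIds": ["ds_B5bsOBDzsXsqfmLo"]}' \
  | tee search_response.json \
  | jq '[.results[].page]'
```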
### 2. Identifying Signals in Tender Documents
More than being able to find relevant tenders, bid managers need a way to quickly filter out irrelevant ones. When tender documents arrive in competing standards, even skimming a promising project only to discover a disqualifying hard requirement can waste half a day. Ragextract can run search queries across all documents simultaneously - whether they are questionnaires, appendices or diagrams - in seconds. This lets managers not only save precious time but also expand their capacity to evaluate more tenders per day.
**Use-case**: Summarize and get answers from any collection of unstructured tender documents in as little as 30 seconds. Increased productivity means a bigger pipeline and more opportunities caught.
```bash
# Vectorize the whole tender bundle in one request (the @ uploads each file's contents)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@invitation_to_tender.docx" \
  --form "file=@questionnaire_a.docx" \
  --form "file=@questionnaire_b.docx" \
  --form "file=@appendix_4.docx" \
  --form "file=@appendix_3.docx" \
  --form "file=@sitemap.pdf"

# Ask one question across everything that was uploaded
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "What previous work experience is required for this tender?"
  }'
```
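Bid screening usually comes down to the same handful of knock-out questions asked of every tender. A small shell loop over the search endpoint covers that; the questions below are illustrative.
```bash
# Run a fixed set of knock-out questions against the uploaded tender documents
# (the questions are illustrative placeholders)
questions=(
  "What previous work experience is required for this tender?"
  "Is an on-site presence or local office mandatory?"
  "What insurance or accreditation levels are required?"
)
for q in "${questions[@]}"; do
  curl -s -X POST https://api.subworkflow.ai/v1/search \
    --header "x-api-key: $RAGEXTRACT_API_KEY" \
    --header "Content-Type: application/json" \
    --data "{\"query\": \"$q\"}"
done
```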
### 3. Concurrency for Financial Statements
Bank statements present a slightly different challenge: it's less about search and more about bulk file handling. Leverage Ragextract's reliable document processing pipeline, which indexes hundreds of pages per second, and our storage and retrieval API, which returns pages in either PDF or JPG format.
**Use-case**: Shortcut building your own document pipeline by using Ragextract, and gain day-one support for concurrency to handle more documents, plus properly maintained APIs to quickly integrate into your backend.
```bash
# Submit a statement for extraction (the @ tells curl to upload the file contents)
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@apr2025.pdf"

# Returns pages 2 to 10 as JPGs (quote the URL so the shell doesn't split on &)
curl "https://api.subworkflow.ai/v1/datasets/ds_ar7e4PtGX7fGGnSt/items?rows=jpg&cols=2:10" \
  --header "x-api-key: $RAGEXTRACT_API_KEY"
```
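On the client side, concurrency can be as simple as fanning uploads out with `xargs` while the service-side pipeline handles the rest; the `statements/` directory and parallelism level below are placeholders.
```bash
# Upload a folder of statements to the extract endpoint, 8 at a time
# (client-side parallelism via xargs; the statements/ directory is a placeholder)
ls statements/*.pdf | xargs -P 8 -I {} \
  curl -s -X POST https://api.subworkflow.ai/v1/extract \
    --header "x-api-key: $RAGEXTRACT_API_KEY" \
    --form "file=@{}"
```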
## Client SDKs and Integrations
You can also use Ragextract from your favourite platform using the following packages.
* **Typescript/Javascript Library** - [https://github.com/Subworkflow-AI/ragextract-js](https://github.com/Subworkflow-AI/ragextract-js)
* **n8n Community Node** - [https://github.com/Subworkflow-AI/n8n-nodes-ragextract](https://github.com/Subworkflow-AI/n8n-nodes-ragextract)
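Both packages can be installed in the usual way. The npm package names below are assumed to match the GitHub repository names; the n8n node can also be installed from the n8n UI under Settings > Community Nodes.
```bash
# TypeScript/JavaScript client (package name assumed to match the repository name)
npm install ragextract-js

# n8n community node, installed into your n8n custom nodes directory
# (package name assumed to match the repository name)
npm install n8n-nodes-ragextract
```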
## Conclusion
In migrating AI document workflows to Ragextract, we found that our alternative approach of search-first, parse-later is a viable and preferable strategy for selective data extraction from long documents. Here's a summary of our learnings:
1. Multimodal embeddings perform better than text-only embeddings for search, as they also capture visual information such as images, charts and diagrams in documents.
2. Multimodal embeddings are many times faster and cheaper than generating LLM output tokens and the text chunking work that follows.
3. For production usage, however, you still need a durable document processing pipeline that handles concurrency, object storage and a database, and serves retrieval APIs.
**Ragextract provides all of the above.** We encourage you to read the [official Ragextract documentation](https://docs.subworkflow.ai/intro) to explore how the Ragextract approach can improve your document workflows.
**[Sign up for a free 14-day trial of Ragextract today!](https://ragextract.com)**
Thanks for reading!
Questions or feedback? Send us a message on our Discord or email hello@subworkflow.ai.