Introducing Ragextract
Ragextract: AI document search for messy documents

## AI Document Search for AI Agents
Most LLM agents today can accept most documents as input and happily read, analyse and extract from them without much issue. Where they struggle is with large documents, which tend to flood their context windows with filler and irrelevant content and lead to mixed results.
At [Subworkflow](https://subworkflow.ai), we saw this happen first-hand with our AI document projects in insurance, legal and proptech. We worked intensively with RAG during its formative years and, as document volumes grew with adoption, we quickly realised that our existing workflow had hit a technical bottleneck in how many documents we could process and how fast. The natural next step was to reimagine our RAG setup around a robust document processing pipeline: one which would take care of uploading, splitting, storing, indexing, searching and retrieval for even our most challenging scenarios.
Today, we're proud to announce [Ragextract](https://ragextract.com), a managed API service which encapsulates our learnings and experience in building robust AI document search and retrieval for messy PDFs.
## Who is it for?
Ragextract was built for document intelligence without context limits. It is for businesses that process large and messy PDFs (scanned pages, images, tables) yet only need the relevant details extracted from each. It enables them to:
**1. Get set up with a document processing and search pipeline quickly**
Saving months of work on architecting and developing a bespoke solution
**2. Increase document processing capacity greatly without additional resources**
Enabling teams to maximise productivity and budgets with existing workloads
**3. Simplify their current document processing pipelines**
Avoiding refactoring efforts and instead focusing on their core business offerings
||Ragextract|Existing Services|
|-|-|-|
|Process document|✅ Designed for large documents|❌ Assumes small workloads|
|Conversion to markdown|✅ No need!|❌ Yes and forced on every page|
|Generate embeddings|✅ Multimodal: covers text, images, charts and diagrams|❌ Not supported|
|Upload to vector store|✅ Automatic and isolated vector store provided|❌ Not supported|
|Search endpoint|✅ Production API ready to return matching pages instantly|❌ Not supported|
|Retrieve binary files|✅ Retrieval API and storage provided for your data|❌ Not supported|
|Extract data|✅ Bring-your-own-provider means full control over costs|❌ Markup on LLM costs|
## Using Ragextract
Ragextract is available as a production-ready API service that requires no installation. Simply sign up for an API key to start using Ragextract today.
### 1. Extracting Vehicle Details in Insurance Claims
<div className="text-sm">Insurance teams need to receive and process a variety of lengthy forms from customers and third parties. Whilst data extraction may seem simple, no two forms are the same and the first challenge is always locating the relevant data within the document. Running vision-enabled LLM OCR on every page of a 100+ page document, for over 30,000 documents a year, can get quite expensive! Ragextract can perform the same search across text, images, tables and fields quickly and cheaply, no matter how large the document is.</div>
**Use-case**: Automate third party claim submissions, data entry and classification from inbox to CMS.
```bash
# Upload the claim form for processing (@ tells curl to send the file's contents)
curl -X POST https://api.subworkflow.ai/v1/vectorize \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@insurance_claim_application.pdf"
# After processing is complete...
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "Find motor vehicle details and registration of all parties involved in claim",
    "datasetIds": ["ds_B5bsOBDzsXsqfmLo"]
  }'
```
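To put the volumes quoted above in perspective, here is a rough back-of-the-envelope sketch. The 100 pages and 30,000 documents come from the text; the number of relevant pages per document is a hypothetical figure chosen purely for illustration, and no pricing is assumed.

```bash
#!/usr/bin/env bash
# Figures from the text: 100 pages per document, 30,000 documents a year.
pages_per_doc=100
docs_per_year=30000

# Hypothetical: pages per document that a search returns as relevant.
relevant_pages_per_doc=3

# Pages sent to a vision-enabled LLM under each approach.
ocr_every_page=$((pages_per_doc * docs_per_year))
ocr_relevant_only=$((relevant_pages_per_doc * docs_per_year))

echo "LLM OCR on every page:       $ocr_every_page pages/year"
echo "LLM OCR on search hits only: $ocr_relevant_only pages/year"
```

Under these assumptions, searching first means only a small fraction of pages ever reach the LLM, which is where the savings would come from.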
### 2. Identifying Signals in Tender Documents
<div className="text-sm">More than being able to find relevant tenders, bid managers need a way to quickly filter out irrelevant ones. When tender documents arrive in differing standards and structures, even skimming through a promising project, only to discover a hard requirement that rules it out, may waste half a day. Ragextract can perform search queries across all documents simultaneously, whether they are questionnaires, appendixes or diagrams, in seconds. This not only saves managers precious time but also expands their capacity to evaluate more tenders per day.</div>
**Use-case**: Summarize and get answers from any collection of unstructured tender documents in as little as 30 seconds. Increased productivity means a larger pipeline and more opportunities caught.
```bash
# Upload each tender document (one request per file)
for f in invitation_to_tender.docx questionnaire_a.docx questionnaire_b.docx \
         appendix_4.docx appendix_3.docx sitemap.pdf; do
  curl -X POST https://api.subworkflow.ai/v1/vectorize \
    --header "x-api-key: $RAGEXTRACT_API_KEY" \
    --form "file=@$f"
done
# Query the uploaded documents
curl -X POST https://api.subworkflow.ai/v1/search \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "query": "What previous work experience is required for this tender?"
  }'
```
### 3. Concurrency for Financial Statements
<div className="text-sm">Bank statements present a slightly different challenge, as the problem is less about search and more about bulk file handling. Fintech startups can leverage Ragextract's reliable document processing pipeline, which indexes hundreds of pages per second, along with our secure storage facilities and easy-to-use retrieval APIs to build their own solutions.</div>
**Use-case**: Reduce development time by months by building on top of Ragextract. If handling high page count or high frequency documents isn't your core business (it's grunt work!), check out Ragextract now!
```bash
# Upload the statement for processing
curl -X POST https://api.subworkflow.ai/v1/extract \
  --header "x-api-key: $RAGEXTRACT_API_KEY" \
  --form "file=@apr2025.pdf"
# returns pages 2 to 10 as jpgs
curl "https://api.subworkflow.ai/v1/datasets/ds_ar7e4PtGX7fGGnSt/items?rows=jpg&cols=2:10" \
  --header "x-api-key: $RAGEXTRACT_API_KEY"
```
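When scripting many retrievals, it may help to build the items URL in one place. The sketch below is a hypothetical helper, not part of any Ragextract tooling: the path and the `rows`/`cols` parameters are copied from the example above, while the function name `items_url` and its argument order are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the items URL used in the example above.
# The path and the rows/cols parameters are copied from that example;
# the function name and argument order are illustrative only.
items_url() {
  local dataset_id="$1" format="$2" page_range="$3"
  printf 'https://api.subworkflow.ai/v1/datasets/%s/items?rows=%s&cols=%s\n' \
    "$dataset_id" "$format" "$page_range"
}

url="$(items_url ds_ar7e4PtGX7fGGnSt jpg 2:10)"
echo "$url"
# Pass it to curl in double quotes so the shell does not interpret ? or &:
#   curl "$url" --header "x-api-key: $RAGEXTRACT_API_KEY"
```

Keeping the URL in a quoted variable matters because an unquoted `&` would make the shell run the command in the background and drop the `cols` parameter.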
### 4. Client SDKs and Integrations
You can also use Ragextract from your favourite platform using the following packages.
* **TypeScript/JavaScript Library** - [https://github.com/Subworkflow-AI/ragextract-js](https://github.com/Subworkflow-AI/ragextract-js)
* **n8n Community Node** - [https://github.com/Subworkflow-AI/n8n-nodes-ragextract](https://github.com/Subworkflow-AI/n8n-nodes-ragextract)
## Conclusion
[Ragextract](https://ragextract.com) is undoubtedly a service we wish we'd had ourselves, and today it powers existing and new clients of [Subworkflow](https://subworkflow.ai). We're glad to now make it available to fellow AI and automation builders and we're looking forward to hearing how you're using it!
Here's a summary of our learnings:
1. Multimodal embeddings perform better than text-only for search as they also capture visual information such as images, charts and diagrams in documents.
2. Multimodal embeddings are many times faster and cheaper than generating LLM output tokens, and they avoid the work that goes into text chunking strategies.
3. However, for production usage, a durable document processing pipeline that handles concurrency, object storage and databases, and serves retrieval APIs, is still needed.
**Ragextract provides all the above.** We encourage you to read our [official Ragextract documentation](https://docs.ragextract.com/intro) to explore how the Ragextract approach can improve your document workflows.
**[Sign up for a free 14-day trial of Ragextract today!](https://ragextract.com)**
Thanks for Reading!
Questions or Feedback? Send a message on our Discord or email hello@subworkflow.ai