Extract and Search Document Content

Learn how to extract text from a document and make it searchable.

This guide shows how to extract the text from a document image and make it searchable, which is useful for any application that needs to work with the content of physical documents.

See in Postman

This guide's API calls are available as a Postman collection. You can use it to quickly test the API and see how it works.

Business Problem

Let's continue with the law firm example from the previous guide. Now that the firm has digitized its case files, the next step is to make them searchable. A lawyer should be able to search for a specific term (e.g., a case number, a name, a legal term) and find all the relevant documents.

Solution

We can achieve this by using the Scan Documents API to extract the text from the digitized documents. Here's the plan:

  1. Upload and Digitize: First, we'll upload and digitize the document as we did in the previous guide.
  2. Extract Text: Then, we'll use the extract-text operation to get the content of the document as plain text.
  3. Index for Search: Finally, we'll store the extracted text in a search engine such as Elasticsearch, or in a database that supports full-text search, so that users can search across the content of their documents.

Step 1: Upload and Digitize the Document

Please refer to the Digitize Document for Archiving guide to learn how to upload and digitize a document. For this guide, we'll assume you have a digitized document with the ID file_jmjje3ut90btw1r9.

[Image: Digitized Document]

Step 2: Extract the Text

Next, you need to extract the text from the digitized document. You can do this by creating an extract-text task and setting the format parameter to plain.

Extract Text

Creates a task to extract text from a specified image file.

curl -X POST "https://api.scan-documents.com/v1/image-operations/extract-text" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "file_jmjje3ut90btw1r9",
    "format": "plain"
  }'
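The same request can be sketched in Python using only the standard library. The endpoint, header name, and body fields are taken from the curl call above; the helper name build_extract_text_request and the API_BASE constant are illustrative, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.scan-documents.com/v1"


def build_extract_text_request(
    file_id: str, api_key: str, fmt: str = "plain"
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an extract-text task."""
    body = json.dumps({"input": file_id, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/image-operations/extract-text",
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Same file ID and placeholder key as in the curl example.
request = build_extract_text_request("file_jmjje3ut90btw1r9", "YOUR_API_KEY")
```

To send it, pass the request to urllib.request.urlopen and decode the response body with json.load.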

The API responds with a task object. Once the task's status is completed, its result.content field holds the extracted text.

{
    "id": "task_wsn9ej4f8bmrg7j7",
    "operation": "extract-text",
    "status": "completed",
    "parameters": {
        "input": "file_jmjje3ut90btw1r9",
        "format": "plain"
    },
    "result": {
        "format": "plain",
        "content": "I represent that my performance of all the terms of this Intellectual Property Agreement\nwill not breach any agreement to keep in confidence proprietary information acquired by me in\nconfidence or in trust prior to my Relationship with the Company. I have not entered into, and I\nagree I will not enter into, any oral or written agreement in conflict herewith. I agree to execute\nany proper oath or verify any proper document required to carry out the terms of this Intellectual\nProperty Agreement.\n7.\nEquitable Relief\nThe Company and I each agree that disputes relating to or arising out of a breach of the\ncovenants contained in this Intellectual Property Agreement may cause the Company or me, as\napplicable, to suffer irreparable harm and to have no adequate remedy at law. In the event of any\nsuch breach or default by a party, or any threat of such breach or default, the other party will be\nentitled to injunctive relief, specific performance and other equitable relief. The parties further\nagree that no bond or other security shall be required in obtaining such equitable relief and hereby\nconsents to the issuance of such injunction and to the ordering of specific performance.\n8.\nGeneral Provisions\n(a) Governing Law; Consent to Personal Jurisdiction. This Intellectual Property\nAgreement will be governed by the laws of the State of California as they apply to contracts entered\ninto and wholly to be performed within such state. I hereby expressly consent to the nonexclusive\npersonal jurisdiction and venue of the state and federal courts located in the federal Northern\nDistrict of California for any lawsuit filed there by either party arising from or relating to this\nIntellectual Property Agreement.\n(b)\nEntire Agreement. This Intellectual Property Agreement sets forth the entire\nagreement and understanding between the Company and me relating to the subject matter herein\nand merges all prior discussions between us. No modification of or amendment to this Intellectual\nProperty Agreement, or any waiver of any rights under this Intellectual Property Agreement, will\nbe effective unless in writing signed by the party to be charged. Any subsequent change or changes\nin my duties, salary or compensation will not affect the validity or scope of this Intellectual\nProperty Agreement.\n(c) Severability. If one or more of the provisions in this Intellectual Property\nAgreement are deemed void by law, then the remaining provisions will continue in full force and\neffect.\n(d) Successors and Assigns. This Intellectual Property Agreement will be binding\nupon my heirs, executors, administrators and other legal representatives and will be for the benefit\nof the Company and its successors and assigns.\n\n[Signature Page Follows]\nA-5\n76409-0001/LEGAL20300065.1"
    },
    "callback_url": null,
    "created_at": "2025-08-23T14:47:28.000Z",
    "updated_at": "2025-08-23T14:47:40.000Z"
}
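A task may not be finished when the first response arrives, so client code usually has to wait for the completed status before reading the result. The following is a minimal sketch under stated assumptions: fetch_task is a hypothetical callable that returns the current task object for a task ID (this guide does not cover how to retrieve a task), and "failed" is an assumed failure status. Only the "completed" status and the result.content field come from the response above.

```python
import time
from typing import Any, Callable, Dict


def wait_for_task(
    fetch_task: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> Dict[str, Any]:
    """Poll until the task reports status "completed" or the attempts run out.

    fetch_task is a hypothetical callable supplied by the caller; it must
    return the current task object for the given task ID.
    """
    for attempt in range(max_attempts):
        task = fetch_task(task_id)
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":  # assumed failure status, not confirmed by this guide
            raise RuntimeError(f"Task {task_id} failed")
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"Task {task_id} did not complete after {max_attempts} attempts")


def extracted_text(task: Dict[str, Any]) -> str:
    """Return the text stored in result.content of a completed extract-text task."""
    if task.get("operation") != "extract-text":
        raise ValueError(f"Unexpected operation: {task.get('operation')!r}")
    if task.get("status") != "completed":
        raise ValueError(f"Task {task.get('id')} is not completed yet")
    return task["result"]["content"]
```

Passing the fetch function in as a parameter keeps the polling logic independent of any particular HTTP client and makes it easy to exercise with canned responses.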

Step 3: Index the Content

Now that you have the document's text, you can store it in your preferred search engine or database. For example, with Elasticsearch you would index a document like this, using the file ID from Step 1 so that search hits can be traced back to the source file:

{
  "file_id": "file_jmjje3ut90btw1r9",
  "content": "I represent that my performance..."
}
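One way to send such a document to Elasticsearch is through its REST API, which accepts PUT requests of the form /<index>/_doc/<id>. The sketch below only builds that request. The node URL, the index name case-files, and the helper name are assumptions for illustration, and authentication is left out.

```python
import json
import urllib.request
from urllib.parse import quote

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch node
INDEX_NAME = "case-files"         # hypothetical index name


def build_index_request(file_id: str, content: str) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_doc/<id> request for one document."""
    document = {"file_id": file_id, "content": content}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX_NAME}/_doc/{quote(file_id, safe='')}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The file ID from this guide, with the same truncated text as the example document.
index_request = build_index_request(
    "file_jmjje3ut90btw1r9", "I represent that my performance..."
)
```

Using the file ID as the Elasticsearch document ID means that re-indexing the same file replaces the earlier entry rather than creating a duplicate.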

With the content indexed, your users can search a large collection of documents for a case number, a name, or a legal term and quickly find the files that contain it.
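As an illustration of the search side, the sketch below builds an Elasticsearch _search request that looks for an exact phrase in the content field and then reads the matching file IDs from the response. The node URL and index name are the same assumptions as for indexing, the helper names are illustrative, and the choice of a match_phrase query is one option among several.

```python
import json
import urllib.request
from typing import Any, Dict, List

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch node
INDEX_NAME = "case-files"         # hypothetical index name


def build_search_request(term: str, size: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST /<index>/_search request for a phrase."""
    query = {
        "size": size,
        # match_phrase requires the words to appear together and in order.
        "query": {"match_phrase": {"content": term}},
        # Ask Elasticsearch to return the matching snippets as well.
        "highlight": {"fields": {"content": {}}},
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX_NAME}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def matching_file_ids(search_response: Dict[str, Any]) -> List[str]:
    """Collect the file_id of every hit in a decoded _search response."""
    return [hit["_source"]["file_id"] for hit in search_response["hits"]["hits"]]


search_request = build_search_request("injunctive relief")
```

Each returned file ID can then be used to look up the original digitized document.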