
Extract PDF's Text


Extract text from PDF documents using the pdf-text endpoint.


The pdf-text endpoint extracts text from PDF documents. In this tutorial we first use cURL to call the endpoint directly as a REST call. We then use the DynamicPDF Cloud API client libraries to call the endpoint programmatically.

Required Resources

To complete this tutorial, you must add the Extract Text (pdf-text endpoint) sample to your samples folder in your cloud storage space using the Resource Manager. After adding the sample resources, you should see a samples/extract-text-pdf-text-endpoint folder containing the resources for this tutorial.

Sample | Sample Folder | Resources
Extract Text | samples/extract-text-pdf-text-endpoint | fw4.pdf
  • From the Resource Manager, download fw4.pdf to your local system; here we assume /temp/dynamicpdf-api-samples/extract-text.
  • After downloading, delete fw4.pdf from your cloud storage space using the Resource Manager.
Resource | Cloud/Local
fw4.pdf | local
tip

See Sample Resources for instructions on adding sample resources.

Obtaining API Key

This tutorial assumes a valid API key obtained from the DynamicPDF Cloud API's Environment Manager. Refer to the following for instructions on getting an API key.

tip

If you are not familiar with the Resource Manager or Apps and API Keys, refer to the following tutorial and relevant Users Guide pages.

Calling API Directly Using POST

The pdf-text endpoint takes a POST request. When using cURL, you specify the endpoint, the HTTP method, the API key, and the local resources required. The endpoint also accepts the starting page and page count as optional query parameters.
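For readers who prefer a script to the command line, the same request pieces can be assembled with Python's standard library. This is a minimal sketch, not part of the DynamicPDF client libraries: the function name build_pdf_text_request is hypothetical, and the header values simply mirror the ones used in the cURL command in this tutorial.

```python
import urllib.request

# Endpoint URL as used in this tutorial's cURL command.
ENDPOINT = "https://api.dynamicpdf.com/v1.0/pdf-text"


def build_pdf_text_request(pdf_bytes: bytes, api_key: str,
                           url: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a POST request with the PDF as the body.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(
        url,
        data=pdf_bytes,   # raw PDF bytes, like cURL's --data-binary
        method="POST",
        headers={
            "Content-Type": "application/pdf",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

To send the request, read fw4.pdf in binary mode, pass its bytes and your own API key to build_pdf_text_request, and hand the result to urllib.request.urlopen; the response body is the JSON document described later in this tutorial.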

Let's extract the text of only the first two pages of the PDF. In addition to sending the PDF and API key in the request, we must send two query string parameters, startPage and pageCount.

Figure 1. To extract the first two pages of this PDF, specify the start page and the number of pages.

Parameter | Parameter Type | Value
startPage | Query | 1
pageCount | Query | 2
info

Setting both startPage and pageCount to zero (or omitting the query string parameters) returns the text of all pages of the PDF.
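The defaulting rule above can be sketched as a small URL builder. This is an illustrative helper, not an official API: the name build_pdf_text_url is hypothetical, and the choice to drop the query string when both values are zero is one way to apply the rule stated in the note (sending both parameters as zero should be equivalent).

```python
from urllib.parse import urlencode

# Endpoint URL as used in this tutorial's cURL command.
ENDPOINT = "https://api.dynamicpdf.com/v1.0/pdf-text"


def build_pdf_text_url(start_page: int = 0, page_count: int = 0) -> str:
    """Return the request URL for the given page range.

    Hypothetical helper: when both values are zero the query string is
    omitted, which per the note above requests all pages.
    """
    if start_page < 0 or page_count < 0:
        raise ValueError("startPage and pageCount must not be negative")
    if start_page == 0 and page_count == 0:
        return ENDPOINT
    query = urlencode({"startPage": start_page, "pageCount": page_count})
    return f"{ENDPOINT}?{query}"
```

Calling build_pdf_text_url(1, 2) produces the URL used in this tutorial's request, while build_pdf_text_url() returns the bare endpoint.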

Make Request Using API

  • Create the following cURL command, which sends the PDF to the endpoint as binary data.
  • Add startPage and pageCount as query string parameters to the request URL.
  • Specify the Content-Type as application/pdf so the endpoint treats the binary request body as a PDF.
curl -X POST "https://api.dynamicpdf.com/v1.0/pdf-text?startPage=1&pageCount=2" -H "Content-Type: application/pdf" -H "Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" --data-binary "@c:/dynamic-pdf-api-samples/extract-text/fw4.pdf"
  • Execute the command; the following text is returned to the command line as a JSON document.
[
    {
        "pageNumber": 1,
        "text": "[DynamicPDF Evaluation] Form  W-4\n(Rev. December 2020)\nDepartment of the Treasury  \nInternal Revenue Service \nEmployee’s Withholding Certificate\n ▶  Complete Form W-4 so that your employer can withhold the correct federal  income tax from your pay. \n ▶  Give Form W-4 to your empl ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
    },
    {
        "pageNumber": 2,
        "text": "[DynamicPDF Evaluation] Form W-4 (2021) Page 2\nGeneral Instructions\nFuture Developments\nFor the latest information about developments related to \nForm W-4, such as legislation enacted after it was published, \ngo to www.irs.gov/FormW4 .\nPurpose of Form\nComplete Form W-4 so that yo ....[Text Truncated - Please purchase a license or contact support for an evaluation license.]"
    }
]
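The response is a JSON array with one object per page, each carrying a pageNumber and a text field. As a rough sketch of how a script might consume it, the snippet below maps page numbers to their text. The function name parse_pdf_text_response and the SAMPLE_RESPONSE value are made up for illustration; the sample is a shortened stand-in, not real endpoint output.

```python
import json
from typing import Dict


def parse_pdf_text_response(body: str) -> Dict[int, str]:
    """Map each page number to its extracted text.

    Hypothetical helper; assumes the body is a JSON array of objects
    with "pageNumber" and "text" keys, as in the response shown above.
    """
    pages = json.loads(body)
    return {page["pageNumber"]: page["text"] for page in pages}


# Shortened stand-in for a real response (illustrative data only).
SAMPLE_RESPONSE = json.dumps([
    {"pageNumber": 1, "text": "Form W-4\n(Rev. December 2020)"},
    {"pageNumber": 2, "text": "Form W-4 (2021) Page 2\nGeneral Instructions"},
])

pages_by_number = parse_pdf_text_response(SAMPLE_RESPONSE)
for number, text in pages_by_number.items():
    # Show the first line of each page as a quick sanity check.
    print(number, text.split("\n")[0])
```

Keying the result by pageNumber rather than by list position keeps lookups correct when the request starts at a page other than the first.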

Calling Endpoint Using Client Library

To simplify development, you can also use one of the DynamicPDF Cloud API client libraries. Use the client library of your choice to complete this tutorial section.

Complete Source

You can access the complete source for this project at one of the following GitHub projects.

Language | File Name | Location (package/namespace/etc.) | GitHub Project
Java | ExtractText.java | com.dynamicpdf.api.examples | https://github.com/dynamicpdf-api/java-client-examples
C# | Program.cs | ExtractText | https://github.com/dynamicpdf-api/dotnet-client-examples
Nodejs | ExtractText.js | nodejs-client-examples | https://github.com/dynamicpdf-api/nodejs-client-examples
PHP | ExtractText.php | php-client-examples | https://github.com/dynamicpdf-api/php-client-examples


In all four languages, the steps were similar. First, we created a new PdfResource instance, passing the path to the PDF to the constructor. Next, we created a new instance of the PdfText class, which abstracts the pdf-text endpoint. Finally, we called the Process method and printed the resulting JSON to the console.