How to load PDF files
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
This covers how to load PDF
documents into the Document format that we use downstream.
By default, one document will be created for each page in the PDF file. You can change this behavior by setting the splitPages
option to false
.
Setup
- npm
- Yarn
- pnpm
npm install pdf-parse
yarn add pdf-parse
pnpm add pdf-parse
Usage, one document per page
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
// Or, in web environments:
// import { WebPDFLoader } from "langchain/document_loaders/web/pdf";
// const blob = new Blob(); // e.g. from a file input
// const loader = new WebPDFLoader(blob);
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf");
const docs = await loader.load();
Usage, one document per file
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
splitPages: false,
});
const docs = await loader.load();
Usage, custom pdfjs
build
By default we use the pdfjs
build bundled with pdf-parse
, which is compatible with most environments, including Node.js and modern browsers. If you want to use a more recent version of pdfjs-dist
or if you want to use a custom build of pdfjs-dist
, you can do so by providing a custom pdfjs
function that returns a promise that resolves to the PDFJS
object.
In the following example we use the "legacy" (see pdfjs docs) build of pdfjs-dist
, which includes several polyfills not included in the default build.
- npm
- Yarn
- pnpm
npm install pdfjs-dist
yarn add pdfjs-dist
pnpm add pdfjs-dist
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
// you may need to add `.then(m => m.default)` to the end of the import
pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"),
});
Eliminating extra spaces
PDFs come in many varieties, which makes reading them a challenge. The loader parses individual text elements and joins them together with a space by default, but if you are seeing excessive spaces, this may not be the desired behavior. In that case, you can override the separator with an empty string like this:
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", {
parsedItemSeparator: "",
});
const docs = await loader.load();