Document Privacy & Redaction

How to properly redact a PDF

Redaction sounds simple: cover what you don't want seen. In physical documents, drawing a black marker over text destroys the underlying ink — the text is gone. In PDFs, the equivalent — drawing an opaque black box over text using annotation tools — typically doesn't remove anything. The original text remains in the PDF's data structure, fully selectable, searchable, and extractable. This has caused real data breaches: court filings where redacted names were copy-pasted from the PDF, medical records where 'redacted' fields were scraped by text extraction tools, legal documents where opposing counsel found hidden text under annotation layers. Proper redaction removes or destroys the underlying content, not just its visibility.

Why drawing a black box over text isn't redaction

PDF documents separate visual rendering from underlying content. A PDF annotation (a black rectangle drawn on the page) is a visual layer placed on top of the document — it changes what you see when viewing the PDF, but it doesn't modify the text content that the PDF stores. The text under the annotation remains in the PDF file in full, as machine-readable data.

PDFs store text as data, not as an image

Unlike a scanned photograph, a PDF created from a word processor or other text source contains the actual text as data. Each page's content stream holds character codes and positioning coordinates, and the font information in the file maps those codes back to readable characters. Annotations (shapes, drawings, highlights) are stored separately from the page content, as their own objects. Adding, removing, or hiding an annotation doesn't touch the underlying text data.
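As a rough illustration of that separation, the sketch below uses a hand-written, uncompressed fragment in PDF syntax. The object numbers, the sample name, and the helpers `show_text_operands` and `drop_annotations` are assumptions made for this example, and the regular expressions are far too naive for real files. The only point is that the page text and the black square live in different objects.

```python
import re

# Hand-written fragment (illustrative only). Object 4 is the page's content
# stream; object 5 is a black square annotation drawn over the same area.
PDF_FRAGMENT = b"""
3 0 obj
<< /Type /Page /Contents 4 0 R /Annots [5 0 R] >>
endobj
4 0 obj
<< /Length 52 >>
stream
BT /F1 12 Tf 72 700 Td (Witness: Jane Example) Tj ET
endstream
endobj
5 0 obj
<< /Type /Annot /Subtype /Square /Rect [70 695 220 715] /IC [0 0 0] >>
endobj
"""


def show_text_operands(pdf_bytes: bytes) -> list[str]:
    """Naively collect the literal strings passed to the Tj (show text) operator."""
    return [s.decode("latin-1") for s in re.findall(rb"\(([^()]*)\)\s*Tj", pdf_bytes)]


def drop_annotations(pdf_bytes: bytes) -> bytes:
    """Remove every object whose dictionary is marked /Type /Annot.

    The page's /Annots reference is left dangling; this is a toy, not a PDF editor.
    """
    pattern = rb"\d+ 0 obj\s*<<[^>]*/Type /Annot[^>]*>>\s*endobj\s*"
    return re.sub(pattern, b"", pdf_bytes)


print(show_text_operands(PDF_FRAGMENT))
print(show_text_operands(drop_annotations(PDF_FRAGMENT)))
```

Both calls print the same string, because the annotation never contained or altered the text.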

Text under an annotation is still selectable

Open a PDF that was 'redacted' with a black rectangle overlay in any PDF viewer, then click and drag across the rectangle. In most cases, the text cursor will still activate, and you can select the text underneath. Copy it and paste it into a text editor, and the original text appears in full. The annotation was covering text that was never removed from the file.

Text extraction tools see through visual layers entirely

PDF text extraction tools — including the ones built into search engines, e-discovery platforms, AI document analysis tools, and accessibility software — read the underlying text content of a PDF directly. They don't render the visual page; they parse the file's text layer. An opaque black annotation is irrelevant to text extraction — it's a visual element that these tools typically ignore completely.
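Real PDFs usually compress their content streams, so the text is often not visible in a plain byte search, yet extraction tools decompress the stream and read it anyway. The following is a minimal sketch of that idea using only Python's standard library. The helpers `make_stream_object` and `extract_text` are hypothetical, the sample number is made up, and the parsing assumes a direct `/Length` entry and simple literal strings, which real files often do not have.

```python
import re
import zlib


def make_stream_object(content: bytes) -> bytes:
    """Wrap page content in a FlateDecode (zlib-compressed) stream object."""
    data = zlib.compress(content)
    header = b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(data)
    return header + data + b"\nendstream\nendobj\n"


def extract_text(pdf_bytes: bytes) -> list[str]:
    """Decompress each stream and collect the strings shown with Tj."""
    found = []
    for m in re.finditer(rb"/Length (\d+)[^>]*>>\s*stream\r?\n", pdf_bytes):
        raw = pdf_bytes[m.end(): m.end() + int(m.group(1))]
        try:
            content = zlib.decompress(raw)
        except zlib.error:
            content = raw  # not Flate-compressed; scan the bytes as they are
        found += [s.decode("latin-1") for s in re.findall(rb"\(([^()]*)\)\s*Tj", content)]
    return found


# The page shows some text, then paints a filled black rectangle over it.
page = b"BT /F1 12 Tf 72 700 Td (SSN 000-12-3456) Tj ET\n0 0 0 rg 70 695 200 20 re f\n"
print(extract_text(make_stream_object(page)))
```

The painted rectangle has no effect on the result: the extractor reads text operators and never looks at what is drawn on top of them.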

Hiding text with color in a word processor has the same problem

Converting a Word document to PDF after hiding text with color (black highlighting over black text, for example) has the same failure mode: the text is visually hidden but still present in the document. When the Word file is converted to PDF, the hidden text carries over. Viewing the PDF, the text appears redacted; using Ctrl+A to select all and copying into a text editor reveals it.

What true redaction actually does

Proper redaction — the kind used by courts, government agencies, and compliance teams — doesn't cover content. It removes or destroys the content from the file.

The underlying text is removed from the PDF's data structure

A proper redaction tool identifies the text (or image content) at the specified location in the PDF's page content stream, removes those character references from the data, and replaces that region with a solid opaque area (typically black). After redaction, there is no underlying text to extract, select, or copy at that location — the data has been deleted from the file.
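The removal step described above can be sketched on a deliberately simplified content stream. This is a toy model, not how any particular redaction tool works: it assumes every piece of text is a single line of the form `BT /F1 12 Tf x y Td (text) Tj ET`, and it decides by the text's origin point only. Real tools have to track transformation matrices, fonts, glyph widths, and partially covered strings. The function name `redact` and the sample text are invented for the example.

```python
import re

# One text-showing line in the simplified stream format assumed here.
TEXT_OP = re.compile(rb"BT /F1 12 Tf (\d+) (\d+) Td \(([^()]*)\) Tj ET\n")


def redact(content: bytes, rect: tuple[int, int, int, int]) -> bytes:
    """Drop text whose origin lies inside rect (x0, y0, x1, y1), then paint it black."""
    x0, y0, x1, y1 = rect

    def keep_or_drop(m: re.Match) -> bytes:
        x, y = int(m.group(1)), int(m.group(2))
        inside = x0 <= x <= x1 and y0 <= y <= y1
        return b"" if inside else m.group(0)

    cleaned = TEXT_OP.sub(keep_or_drop, content)
    # "re" defines a rectangle (x, y, width, height); "f" fills it with black.
    box = b"0 0 0 rg %d %d %d %d re f\n" % (x0, y0, x1 - x0, y1 - y0)
    return cleaned + box


page = (
    b"BT /F1 12 Tf 72 720 Td (Case No. 12-345) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Informant: J. Example) Tj ET\n"
)
print(redact(page, (70, 695, 300, 715)).decode("latin-1"))
```

The order matters: the text operator is deleted first and the black box is added afterwards, so the box marks where content used to be rather than hiding content that still exists.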

The PDF is re-written, not annotated

True redaction involves re-writing the PDF file with the target content removed, not adding an annotation layer on top of the existing file. The output is a new PDF where the redacted content does not exist in the file structure. Viewing, selecting, copying, or extracting text from the redacted area returns nothing — or only whitespace — because there is no underlying data.

Metadata may also need to be cleaned

Even after redacting visible text, PDF metadata (document properties, comments, revision history, embedded XMP data) may contain sensitive information. A thorough redaction workflow also strips metadata that isn't needed for the final document. Some redaction tools handle this automatically; others require a separate step.
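To make the metadata step concrete, here is a small sketch that lists and blanks entries in a document information dictionary. It assumes the entries are stored as plain literal strings in an uncompressed object, which is not guaranteed in real files (values may be hex-encoded, UTF-16, or inside compressed object streams), and it does not touch XMP metadata at all. The helper names and the sample dictionary are hypothetical.

```python
import re

# Standard document information keys, matched only when the value is a
# simple literal string such as /Author (Jane Example).
INFO_ENTRY = re.compile(rb"/(Title|Author|Subject|Keywords|Creator|Producer)\s*\(([^()]*)\)")


def info_entries(pdf_bytes: bytes) -> dict[str, str]:
    """Return the literal-string metadata entries found in the raw bytes."""
    return {
        m.group(1).decode("ascii"): m.group(2).decode("latin-1")
        for m in INFO_ENTRY.finditer(pdf_bytes)
    }


def blank_info_entries(pdf_bytes: bytes) -> bytes:
    """Overwrite each value with spaces of equal length so byte offsets stay unchanged."""

    def blank(m: re.Match) -> bytes:
        whole = m.group(0)
        start = m.start(2) - m.start(0)
        end = m.end(2) - m.start(0)
        return whole[:start] + b" " * (end - start) + whole[end:]

    return INFO_ENTRY.sub(blank, pdf_bytes)


doc = b"1 0 obj\n<< /Title (Q3 filing) /Author (Jane Example) /Producer (ToyWriter 1.0) >>\nendobj\n"
print(info_entries(doc))
print(info_entries(blank_info_entries(doc)))
```

Overwriting in place, rather than deleting, is a design choice for this sketch: a PDF's cross-reference table records byte offsets, so changing the file length without rebuilding that table would corrupt the file.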

Scanned PDFs require image redaction

PDFs created from scanned physical documents contain images, not machine-readable text, unless optical character recognition (OCR) has added a hidden text layer on top of the scan. In these documents, the 'text' is actually pixel data in an image. Redacting scanned PDFs requires redacting at the image level: the image pixels in the specified region must be permanently overwritten, not merely covered. If the scan has an OCR text layer, that layer must be redacted as well. Some redaction tools apply both text and image redaction; verify which type your tool performs.
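Overwriting pixels, as opposed to covering them, can be shown with a tiny grayscale image held as a list of rows. This is only a sketch of the idea: a real tool would first decode the embedded image, overwrite the region, and then re-encode and replace the image in the file. The function name `redact_pixels` and the pixel values are made up for the example.

```python
def redact_pixels(
    image: list[list[int]],
    box: tuple[int, int, int, int],
    fill: int = 0,
) -> list[list[int]]:
    """Return a copy of a grayscale image with box (left, top, right, bottom) overwritten.

    right and bottom are exclusive. The original pixel values inside the box
    are not kept anywhere in the returned image.
    """
    left, top, right, bottom = box
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


scan = [
    [200, 180, 50, 60],
    [210, 40, 30, 220],
    [255, 255, 255, 255],
]
for row in redact_pixels(scan, (1, 0, 3, 2)):
    print(row)
```

After the call, nothing in the returned image can be used to reconstruct the overwritten values, which is the property a drawn-on rectangle lacks.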

Common redaction failure modes that expose data

Most real-world redaction failures fall into a small number of recognizable patterns. Each has been the cause of actual data exposure incidents.

Black rectangle annotation (the most common)

Using the annotation or drawing tools in Adobe Acrobat Reader, Preview (macOS), or similar viewers to draw a black box over text. These tools add visual annotations, not true redactions. The text remains in the file. This is the most frequently encountered redaction failure — it appears secure in normal viewing but the underlying text is trivially accessible.

Word document 'redaction' via font/highlight color

Setting text color to white or background color to black in Word, then converting to PDF. The hidden text is present in the Word file and survives the PDF conversion. This is the technique behind several well-publicized legal document redaction failures, including cases where court filings revealed names of informants or classified details.

Cropping or margin-based hiding

Setting a PDF page's visible area (its crop box) to exclude a region does not remove the content from the file; it only changes what's displayed by default. The full page, defined by the media box, is still stored. PDF viewers that honor the crop box won't show the hidden region, but the underlying page content, including any text or images outside the display area, remains in the file and can be accessed by changing the page view or by extracting text.
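The sketch below illustrates why cropping hides rather than removes. It reads the `/MediaBox` and `/CropBox` arrays from a hand-written page dictionary and reports whether a given point is on the stored page but outside the displayed area. The helper names and the sample dictionary are assumptions for this example, and the parsing handles only the simplest form of these entries.

```python
import re


def read_box(page_dict: bytes, name: bytes) -> tuple[float, ...]:
    """Read a box such as /CropBox [x0 y0 x1 y1] from a page dictionary."""
    m = re.search(rb"/" + name + rb"\s*\[([^\]]*)\]", page_dict)
    if m is None:
        raise ValueError(f"{name!r} not found in page dictionary")
    return tuple(float(v) for v in m.group(1).decode("ascii").split())


def contains(box: tuple[float, ...], x: float, y: float) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def hidden_by_crop(page_dict: bytes, x: float, y: float) -> bool:
    """True if (x, y) is on the stored page but outside the displayed crop box."""
    on_page = contains(read_box(page_dict, b"MediaBox"), x, y)
    displayed = contains(read_box(page_dict, b"CropBox"), x, y)
    return on_page and not displayed


# The bottom 100 points of a US Letter page are cropped out of view.
PAGE = b"<< /Type /Page /MediaBox [0 0 612 792] /CropBox [0 100 612 792] >>"
print(hidden_by_crop(PAGE, 72, 50))   # text placed in the cropped strip
print(hidden_by_crop(PAGE, 72, 700))  # text placed in the visible area
```

Anything for which `hidden_by_crop` returns `True` is still drawn by the page's content stream and will still show up in text extraction.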

Layered redaction without flattening

Some redaction workflows add the redaction as a separate annotation layer and then export without applying it or flattening the page. The redaction mark is present, but the content beneath it is still accessible. Flattening helps only if it actually eliminates the underlying data: converting the page to a flat image does, but merging the annotation into the page's content stream can leave the original text in place beneath the box. Confirm with the verification steps later in this guide rather than assuming a flattened file is clean.

Redacting the text but not embedded images or attachments

A PDF may contain embedded images that include the text being redacted — a scanned page, a photo of a document, a form field rendered as an image. Redacting the text layer doesn't touch these images. Similarly, PDFs can contain embedded file attachments; if those contain the same sensitive information, redacting the main document text doesn't affect the attachment.
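A quick inventory of images and attachments can flag files that need more than text redaction. The sketch below counts a few well-known dictionary markers in the raw bytes. It is a rough heuristic under strong assumptions: objects stored in compressed object streams will not be seen, and a zero count is therefore not proof of absence. The `MARKERS` table and the `inventory` function are illustrative names.

```python
import re

# Dictionary entries that commonly mark images and attachments.
MARKERS = {
    "image XObjects": rb"/Subtype\s*/Image\b",
    "embedded file streams": rb"/Type\s*/EmbeddedFile\b",
    "file specifications": rb"/Type\s*/Filespec\b",
}


def inventory(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each marker in the raw (uncompressed) bytes."""
    return {label: len(re.findall(pattern, pdf_bytes)) for label, pattern in MARKERS.items()}


sample = (
    b"<< /Type /XObject /Subtype /Image /Width 600 /Height 800 >>\n"
    b"<< /Type /Filespec /F (notes.docx) /EF << /F 9 0 R >> >>\n"
    b"<< /Type /EmbeddedFile /Length 10 >>\n"
)
print(inventory(sample))
```

Any non-zero count is a prompt to open the image or attachment and check it by eye for the same sensitive content.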

How to verify your redaction actually worked

After applying redaction, verifying that the sensitive content is truly gone — not just visually hidden — takes about two minutes and should always be done before sharing a redacted document.

Try to select and copy the redacted text

Open the redacted PDF in a PDF viewer. Click and drag to attempt to select text in the redacted area. If text highlights and can be selected, the redaction is visual-only — the underlying text is still present. A properly redacted area won't respond to text selection at all, or will return only an empty selection.

Use a text extraction tool

Run the redacted PDF through a text extraction tool and inspect the output for the supposedly redacted content. Several free utilities can extract all text from a PDF file. If the 'redacted' text appears in the extraction output, the underlying text was not removed. This test catches both the black-annotation failure mode and the color-hiding failure mode.
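Once the text has been extracted, the inspection itself can be automated. The sketch below assumes you already have the extraction output as a string and a list of terms that were supposed to be redacted; `find_leaks` and `normalize` are hypothetical helper names. Extraction tools often insert line breaks or extra spaces inside a term, so the comparison ignores case, whitespace, and punctuation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def find_leaks(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    leaks = []
    for term in sensitive_terms:
        needle = normalize(term)
        if needle and needle in haystack:
            leaks.append(term)
    return leaks


extracted = "Plaintiff:  Jane\nExample\nSSN: 000 - 12 - 3456\nAmount: [REDACTED]"
print(find_leaks(extracted, ["Jane Example", "000-12-3456", "Acme Holdings"]))
```

The aggressive normalization can produce occasional false positives, which is the safer direction for this check: a false alarm costs a second look, while a missed leak exposes data.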

Check the document's metadata

Open the PDF properties (File → Properties in most viewers) and inspect the document metadata: author, title, subject, creator, and any custom fields. Also check for comments or revision notes. These fields can contain information that shouldn't be in the distributed document. For official documents, strip metadata before distribution.
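Revision data is worth a separate check, because a PDF saved incrementally keeps its earlier state and appends the changes after it. One rough signal is the number of `%%EOF` markers in the file. The sketch below counts them and shows how the first revision could be cut out for inspection. Treat the count as a hint only: some legitimately structured files, such as linearized PDFs, can contain more than one marker. The function names and sample bytes are assumptions for this example.

```python
def revision_count(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each incremental save normally appends one."""
    return pdf_bytes.count(b"%%EOF")


def has_earlier_revisions(pdf_bytes: bytes) -> bool:
    """True if the file appears to contain more than one saved state."""
    return revision_count(pdf_bytes) > 1


def earliest_revision(pdf_bytes: bytes) -> bytes:
    """Return the bytes up to and including the first %%EOF marker."""
    end = pdf_bytes.find(b"%%EOF")
    return pdf_bytes if end == -1 else pdf_bytes[: end + len(b"%%EOF")]


original = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
updated = original + b"1 0 obj\n<< /Redacted true >>\nendobj\n%%EOF\n"
print(revision_count(original), revision_count(updated))
print(has_earlier_revisions(updated))
```

If a redacted file reports more than one revision, open the earliest one in a viewer before distributing the document; a save that rewrites the whole file, rather than appending to it, avoids the problem.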

Open the file in a different PDF reader

Some PDF readers suppress certain annotation layers or display documents differently. View the redacted document in at least one alternative reader (if you created it in Acrobat, also open it in Chrome's built-in viewer or a free reader). Discrepancies in how the document appears across readers can reveal hidden layers or annotation issues.

What information typically requires redaction

Knowing which content to redact is as important as knowing how. The categories below represent the most commonly required redactions across legal, medical, financial, and compliance contexts.

Personally identifiable information (PII)

Full names, social security numbers, dates of birth, home addresses, phone numbers, email addresses, passport numbers, and driver's license numbers. For documents being shared broadly — in litigation, public filings, or publication — PII of private individuals typically requires redaction. GDPR and similar privacy regulations require that personal data be protected, including in shared documents.
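Some of these identifiers have a regular shape and can be located with pattern matching before redaction begins. The sketch below is a starting point under clear limits: the patterns are simplified US-style formats chosen for this example, they will miss names and addresses entirely, and every hit still needs human review. `PII_PATTERNS` and `find_pii` are illustrative names, and the sample values are fictional.

```python
import re

# Simplified patterns; real formats vary by country and by document.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by pattern label."""
    return {label: pattern.findall(text) for label, pattern in PII_PATTERNS.items()}


sample = "Contact Jane at jane@example.com or 555-010-0199. SSN 000-12-3456."
for label, hits in find_pii(sample).items():
    print(label, hits)
```

A scan like this is best used to build the list of regions to redact, and again afterwards on the extracted text of the finished file.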

Legal identifiers and case-specific information

In litigation and legal proceedings: names of minor children, names of victims or witnesses in sensitive cases, home addresses of parties, certain financial account details, and in some jurisdictions sealed material. Courts typically specify in their rules what must be redacted from public filings, with sanctions for non-compliance.

Protected health information (PHI)

HIPAA's Safe Harbor de-identification method lists 18 categories of identifiers: names, geographic identifiers smaller than a state, dates (other than years) directly related to an individual, telephone numbers, fax numbers, email addresses, social security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers, URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number, characteristic, or code. Medical documents shared for research, litigation, or institutional review require these identifiers removed.

Financial account and payment information

Account numbers, routing numbers, credit/debit card numbers, and financial institution details. In legal discovery and regulatory filings, financial records are typically shared with account numbers redacted. PCI DSS requires stored card numbers to be rendered unreadable and displayed card numbers to be masked, so full card numbers should not appear in documents that will be stored or transmitted.
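Card numbers can be told apart from other long digit runs with the Luhn checksum, which payment card numbers are built to satisfy. The sketch below finds candidate digit groups of 13 to 19 digits and keeps those that pass the check. It is a screening aid under the assumption that digits are separated only by spaces or hyphens; a pass does not prove a number is a live card. The function names are illustrative, and 4111 1111 1111 1111 is a widely used test number.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Apply the Luhn checksum to a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit groups that look like card numbers and pass the Luhn check."""
    hits = []
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_valid(digits):
            hits.append(m.group())
    return hits


sample = "Card 4111 1111 1111 1111, ref 4111 1111 1111 1112"
print(find_card_numbers(sample))
```

The second number in the sample differs by one digit and fails the checksum, so only the first is reported.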

Confidential business information and trade secrets

Proprietary formulas, pricing structures, vendor contract terms, internal cost data, and strategic plans that are commercially sensitive. In litigation involving multiple parties (antitrust, patent, commercial disputes), documents are often produced with trade-secret information redacted under a court-approved protective order.

Why doesn't drawing a black box over PDF text actually redact it?

Because a PDF is not a flat image — it has a separate layer of text data. When you draw a black rectangle on a PDF page using annotation tools, you're adding a visual overlay, not modifying the underlying text content. The PDF file still contains the original text at the same location, fully accessible to selection, copying, and text extraction. True redaction requires removing the text from the PDF's data structure, which annotation tools don't do.

How can I tell if a PDF has been improperly redacted?

In any PDF viewer, try clicking and dragging your cursor over the 'redacted' area. If a text cursor appears and you can select and copy text, the redaction is visual-only. The underlying text is still present. You can also run the PDF through a text extraction tool — if the supposedly redacted text appears in the extraction output, it was never removed from the file. Finally, check the document properties and metadata for any fields that might contain sensitive information not visible in the normal view.

What's the difference between PDF redaction and PDF annotation?

PDF annotation adds a visual element on top of the document without changing its underlying content — a highlight, a drawing, a note, a black box. PDF redaction modifies the document's content structure to remove the underlying text or image data at the specified location. An annotation can be added and removed without changing the document content; a proper redaction permanently removes the content from the file. Annotation tools can create the visual appearance of redaction without performing actual redaction.

Does redacting a PDF affect the document's appearance otherwise?

Properly applied redaction replaces the redacted content with a solid black area (or another specified fill color) at the same location in the document. The page layout, other text, and unredacted images remain unchanged. The redacted area is visible as a black rectangle — it's apparent that something was redacted, though the content is not recoverable. The document's pagination, headings, and structure are preserved.

Do I need to redact metadata as well as document text?

For sensitive documents, yes. PDF metadata, which is stored in the document's properties and not visible in the main document view, can contain the author's name, company, creation date, comments, and custom fields. PDFs can also be saved incrementally, with each save appended to the existing file, so earlier versions of redacted content may be recoverable from that revision data. A complete redaction workflow includes removing or sanitizing metadata, particularly for documents being filed in court, published publicly, or shared with adversarial parties. Some redaction tools handle this automatically; others require a separate metadata-stripping step.

Can I verify that a PDF I received was properly redacted?

Yes. Open the PDF and try selecting text in the redacted regions — proper redaction returns no selectable text. Run the PDF through a text extraction tool and search the output for what should have been redacted. If you find the supposedly redacted content in either test, the redaction was not done correctly and the underlying text is accessible. This verification takes about two minutes and should be standard practice before accepting a redacted document as genuinely sanitized.