To retrieve the textual content from a PDF document using PdfSharp in C#, follow these steps:

  1. Install the PdfSharp NuGet package in your project.

  2. Import the PdfSharp namespaces (PdfReader lives in PdfSharp.Pdf.IO):

using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

  3. Load the PDF document:

PdfDocument document = PdfReader.Open("path/to/document.pdf", PdfDocumentOpenMode.ReadOnly);

  4. Read the content stream of a page. In PdfSharp 1.5x, ContentReader is a static class, so there is no instance to create and no render mode to set. ReadContent returns a CSequence of content objects, not a string:

CSequence content = ContentReader.ReadContent(document.Pages[0]);

  5. Walk the sequence and collect the CString operands of the text-showing operators (Tj and TJ).

  6. Use the collected text as needed.
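
A minimal sketch of the whole flow, assuming PdfSharp 1.5x, where `ContentReader.ReadContent` is a static method that returns a `CSequence`. The class name `PdfTextSketch` and the helper `ExtractText` are hypothetical names chosen for illustration, and the file path is the placeholder from the steps above. The sketch only handles the `Tj` and `TJ` operators and ignores the numeric kerning values inside `TJ` arrays, so word spacing may be lost.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

public static class PdfTextSketch
{
    // Hypothetical helper: recursively yields the string operands
    // of the text-showing operators found in a content object.
    public static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator op)
        {
            // Tj shows one string; TJ shows an array of strings and kerning numbers.
            // Other operators (paths, colors, fonts) are ignored.
            if (op.OpCode.Name == "Tj" || op.OpCode.Name == "TJ")
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence seq)
        {
            // CArray derives from CSequence, so TJ arrays land here too.
            foreach (CObject child in seq)
                foreach (string s in ExtractText(child))
                    yield return s;
        }
        else if (obj is CString str)
        {
            // Raw string as stored in the PDF; numbers and names are skipped.
            yield return str.Value;
        }
    }

    public static void Main()
    {
        PdfDocument document = PdfReader.Open("path/to/document.pdf", PdfDocumentOpenMode.ReadOnly);

        // Read the first page only; loop over document.Pages for the rest.
        CSequence content = ContentReader.ReadContent(document.Pages[0]);
        Console.WriteLine(string.Join("", ExtractText(content)));
    }
}
```

To cover a whole document, the same call can be repeated for each page in `document.Pages`.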

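Note: PdfSharp has no high-level text extraction API. A page's content stream is a sequence of drawing operators, and text appears only as the string operands of text-showing operators such as Tj and TJ. Those strings are encoded per font, so they may not map directly to readable Unicode for fonts with custom or composite encodings. Once collected, the strings may still need whitespace cleanup with regular expressions or string manipulation.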
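
As one possible cleanup step, the sketch below collapses runs of whitespace in the extracted text using the standard `System.Text.RegularExpressions.Regex` class. The class name `TextCleanup` and the method `NormalizeWhitespace` are hypothetical, and whether this is the right cleanup depends on what the PDF actually contains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: replaces every run of whitespace
    // (spaces, tabs, line breaks) with one space and trims both ends.
    public static string NormalizeWhitespace(string raw)
    {
        return Regex.Replace(raw, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // prints "Hello PDF world"
        Console.WriteLine(NormalizeWhitespace("Hello   \r\n  PDF\tworld "));
    }
}
```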