How can I use PdfSharp to retrieve the textual content from a PDF document in C#?

answered 2021-10-03 00:00:00 +0000

pufferfish

To retrieve the textual content from a PDF document using PdfSharp in C#, follow these steps:

Install the PdfSharp NuGet package in your project.
Import the PdfSharp namespaces (PdfReader lives in PdfSharp.Pdf.IO):

using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

Load the PDF document:

PdfDocument document = PdfReader.Open("path/to/document.pdf", PdfDocumentOpenMode.ReadOnly);

Read the content stream of the first page. ContentReader is a static class, so there is no reader object to create and no render mode to set:

CSequence content = ContentReader.ReadContent(document.Pages[0]);

Walk the returned CSequence and collect the text. The sequence holds every operator in the page's content stream, drawing operators as well as text operators; the text itself sits in the CString operands of the text-showing operators Tj and TJ.

Note: PdfSharp is primarily a library for creating and modifying PDFs and has no high-level text-extraction API. The strings recovered this way are in each font's own encoding, so documents that use subset or CID fonts can yield garbled output, and word spacing is often not stored as space characters. Post-process the result as needed.
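As a rough sketch of the whole procedure: ContentReader.ReadContent is a static method that returns a CSequence of parsed content-stream objects, and the text can be gathered by recursing through that sequence. The class name PdfTextSketch, the helper ExtractText, and the command-line argument are illustrative names, not part of PdfSharp. The sketch handles only the Tj and TJ operators and makes no attempt to decode font-specific encodings.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

// Hypothetical helper class; not part of PdfSharp.
public static class PdfTextSketch
{
    // Recursively yields the raw string operands of Tj/TJ operators.
    public static IEnumerable<string> ExtractText(CObject obj)
    {
        if (obj is COperator op)
        {
            // Only the two most common text-showing operators are handled.
            if (op.OpCode.Name == OpCodeName.Tj.ToString() ||
                op.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence seq)
        {
            // CArray (the TJ operand) derives from CSequence, so it lands here too.
            foreach (CObject element in seq)
                foreach (string s in ExtractText(element))
                    yield return s;
        }
        else if (obj is CString str)
        {
            // Bytes are returned as stored; no font-encoding lookup is done.
            yield return str.Value;
        }
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path to a PDF file.
        using (PdfDocument document = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly))
        {
            foreach (PdfPage page in document.Pages)
            {
                CSequence content = ContentReader.ReadContent(page);
                Console.WriteLine(string.Join("", ExtractText(content)));
            }
        }
    }
}
```

Numeric kerning adjustments inside TJ arrays are skipped, so words may run together in the output; inserting a space when an adjustment exceeds some threshold is a common heuristic, but the right threshold depends on the document.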
