How can I use PdfSharp to retrieve the textual content from a PDF document in C#?

asked 2021-07-11 11:00:00 +0000

nofretete

1 Answer


answered 2021-10-03 00:00:00 +0000

pufferfish

PdfSharp has no single call that returns a page's text as a string. You can still get at the text by parsing the page's content stream and collecting the strings it draws:

  1. Install the PdfSharp NuGet package in your project.

  2. Import the PdfSharp namespaces (PdfReader lives in PdfSharp.Pdf.IO):

    using PdfSharp.Pdf;
    using PdfSharp.Pdf.IO;
    using PdfSharp.Pdf.Content;
    using PdfSharp.Pdf.Content.Objects;

  3. Load the PDF document:

    PdfDocument document = PdfReader.Open("path/to/document.pdf", PdfDocumentOpenMode.ReadOnly);

  4. Parse the content stream of a page. ContentReader is a static class, so you call ReadContent on it directly; it returns a CSequence of content objects, not a string:

    CSequence content = ContentReader.ReadContent(document.Pages[0]);

  5. Walk the sequence recursively and collect the CString operands of the text-showing operators (Tj and TJ).

  6. Use the collected strings as needed.
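
The approach can be sketched roughly as follows. This is a minimal example, not an official PdfSharp recipe: it assumes PdfSharp 1.50-style APIs (the static ContentReader.ReadContent(PdfPage) returning a CSequence, and the COperator, CSequence and CString classes from PdfSharp.Pdf.Content.Objects). The class name PdfTextSketch and the helper ExtractText are hypothetical names chosen for illustration, and the file path is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

// Hypothetical helper class; the name is not part of PdfSharp.
public static class PdfTextSketch
{
    // Parse one page's content stream and yield the strings it shows.
    public static IEnumerable<string> ExtractText(PdfPage page)
    {
        CSequence content = ContentReader.ReadContent(page);
        return ExtractText(content);
    }

    // Recursively walk the content objects and yield string operands
    // of the text-showing operators Tj and TJ.
    private static IEnumerable<string> ExtractText(CObject obj)
    {
        switch (obj)
        {
            case COperator op when op.OpCode.Name == "Tj" || op.OpCode.Name == "TJ":
                foreach (CObject operand in op.Operands)
                    foreach (string s in ExtractText(operand))
                        yield return s;
                break;

            case CSequence seq:
                // Covers the top-level sequence and the arrays passed to TJ.
                foreach (CObject item in seq)
                    foreach (string s in ExtractText(item))
                        yield return s;
                break;

            case CString str:
                yield return str.Value;
                break;
        }
    }

    public static void Main()
    {
        // Placeholder path; replace with a real file.
        PdfDocument document = PdfReader.Open("path/to/document.pdf", PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
            Console.WriteLine(string.Join("", ExtractText(page)));
    }
}
```

The sketch joins the fragments of a page without separators, because a TJ array often splits a single word into several strings. Whether that produces readable output depends on how the PDF was generated.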

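Note: The content stream stores drawing instructions, not laid-out text. The strings come back in the order they appear in the stream, which may not match the reading order, and word or line breaks are often implied by positioning operators instead of being stored as spaces or newlines. For fonts with custom or multi-byte encodings, the strings may contain raw character codes that need to be mapped through the font before they are readable. PdfSharp is aimed mainly at creating and modifying PDFs, so if you need reliable plain text from arbitrary documents, a library with dedicated text extraction is likely a better fit.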
