One way to remove certain HTML tags from the page source obtained with Python Selenium is plain string manipulation. In the Python bindings the source is available as the page_source property of the driver object, not as a webdriver.pageSource() call, and it is already a string. You can therefore apply a regular expression to strip the unwanted tags and use the result directly, with no conversion step before or after.

Here is an example code snippet that removes all <script> tags from the page source:

import re
from selenium import webdriver

# Launching a browser and navigating to a webpage
browser = webdriver.Chrome()
browser.get('https://www.example.com')

# Getting the page source as a string
page_source = browser.page_source

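# Removing <script> elements (tags and their contents) using a regular expression;
# re.IGNORECASE also catches uppercase variants such as <SCRIPT>
page_source = re.sub(r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>', '', page_source, flags=re.IGNORECASE)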

# page_source already holds the full document, including the <html> element,
# so the modified string is used as-is rather than wrapped in another <html> tag
html = page_source

# Continuing with the program using the modified HTML
# ...

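Note that this regular expression does not cover every case, and you may need to adapt it: for example, it does not handle a closing tag written with whitespace, such as </script >. Also keep in mind that the change applies only to the string held in Python; the page loaded in the browser is untouched, and if you render or save the modified HTML, it may look or behave differently without the removed elements.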
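
If you need to drop more than one kind of tag, the same idea can be wrapped in a small helper. The following is only a sketch under stated assumptions: strip_tags, tag_names and the sample string are hypothetical names made up for illustration, and the pattern is a simplified one, not the expression used above. It assumes the elements do not nest inside themselves (true for script and style) and that no attribute value contains a ">" character.

```python
import re
from typing import Iterable


def strip_tags(html: str, tag_names: Iterable[str]) -> str:
    """Remove each listed element, including its contents, from an HTML string.

    Hypothetical helper for illustration; assumes the listed elements are not
    nested inside themselves and that attribute values contain no '>'.
    """
    for name in tag_names:
        pattern = re.compile(
            # opening tag with optional attributes, shortest possible body,
            # closing tag that tolerates trailing whitespace
            rf'<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}\s*>',
            flags=re.IGNORECASE | re.DOTALL,
        )
        html = pattern.sub('', html)
    return html


# Stand-in for browser.page_source so the sketch runs without a browser
sample = (
    '<html><head><style>p{color:red}</style></head>'
    '<body><p>Hi</p><SCRIPT>alert(1)</SCRIPT></body></html>'
)
cleaned = strip_tags(sample, ['script', 'style'])
print(cleaned)
```

With Selenium you would pass browser.page_source in place of sample. For elements that can nest, such as div, a regular expression like this is likely to cut in the wrong place, and an HTML parser is usually the safer choice.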