R offers several ways to scrape data out of a document, depending on whether the source is a static HTML/XML file, an interactive web page, or a plain text file:

Parsing HTML/XML: R has several packages for parsing HTML/XML files, such as XML and rvest. These packages can extract data from HTML/XML files by navigating the document tree, selecting elements by element type, class or ID, and extracting specific attributes or content.

Example:

library(rvest)
url <- "https://www.example.com"
html <- read_html(url)

# Extract all links from the document
links <- html %>% 
  html_nodes("a") %>% 
  html_attr("href")

# Extract all paragraphs from the document
paras <- html %>% 
  html_nodes("p") %>% 
  html_text()
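
Selecting by class or ID, as mentioned above, can be sketched on a small inline document. The HTML content, the "title" ID and the "summary" class are invented for illustration; read_html() receives the markup as a literal string, so nothing is fetched from the network.

```r
library(rvest)

# Hypothetical page content, inlined so the sketch runs offline
page <- read_html('<html><body>
  <h1 id="title">Quarterly report</h1>
  <p class="summary">Revenue grew.</p>
  <p>Unrelated paragraph.</p>
  <a href="/q1.csv">Q1</a>
  <a href="/q2.csv">Q2</a>
</body></html>')

# Select a single element by ID ("#" prefix in CSS selectors)
title <- page %>%
  html_node("#title") %>%
  html_text()

# Select elements by type and class ("." prefix in CSS selectors)
summary_text <- page %>%
  html_nodes("p.summary") %>%
  html_text()

# Select elements by type, then pull out one attribute
hrefs <- page %>%
  html_nodes("a") %>%
  html_attr("href")
```

Here html_node() returns the first match only, while html_nodes() returns every match, which is why the unrelated paragraph is excluded by the class selector rather than by position.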

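Web scraping: R also has packages for working with live web pages, such as httr and RSelenium. httr sends HTTP requests directly and returns the response, which is enough when the data is present in the page source. RSelenium drives a real browser, so it can simulate user interactions such as clicking links, filling out forms, and scrolling, which is needed when the content depends on JavaScript or a login.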

Example:

library(RSelenium)
driver <- rsDriver(browser="chrome")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.example.com")
el <- remote_driver$findElement(using = 'xpath', "//a[text()='Login']")
el$clickElement()

# Fill in and submit the login form
username <- remote_driver$findElement(using = 'id', "username")
password <- remote_driver$findElement(using = 'id', "password")
username$sendKeysToElement(list("my_username"))
password$sendKeysToElement(list("my_password"))
submit <- remote_driver$findElement(using = 'xpath', "//button[@type='submit']")
submit$clickElement()

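# Extract data from the logged-in page
data <- remote_driver$findElement(using = 'xpath', "//div[@class='data']")
text <- data$getElementText()[[1]]  # getElementText() returns a list

# Close the browser and stop the Selenium server when done
remote_driver$close()
driver[["server"]]$stop()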
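
When no browser interaction is needed, a plain HTTP request with httr is usually lighter than RSelenium. The following is a minimal sketch, not a definitive implementation: the helper name fetch_page, the "/search" path, the query parameters and the user-agent string are all assumptions made for illustration.

```r
library(httr)

# Build a request URL with query parameters (no network access needed here).
# The path and parameters are hypothetical.
search_url <- modify_url("https://www.example.com/search",
                         query = list(q = "rvest", page = 2))

# Hypothetical helper: fetch a page and return its body as one string
fetch_page <- function(url) {
  resp <- GET(url, user_agent("my-scraper (hypothetical)"), timeout(10))
  stop_for_status(resp)  # raise an R error on 4xx/5xx responses
  content(resp, as = "text", encoding = "UTF-8")
}
```

Calling fetch_page(search_url) would return the raw HTML as a character string, which can then be handed to rvest's read_html() for parsing.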

Text scraping: R also has functions for reading text files, such as readLines and scan. Once the file is read in, regular expression functions such as gsub, grep and regmatches can clean the text or extract specific patterns from it.

Example:

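text <- readLines("my_file.txt")
# Collapse each run of whitespace into one comma; trimws() avoids empty leading/trailing fields
data <- gsub("\\s+", ",", trimws(text))
# scan() parses numbers by default; pass what = character() if the fields are text
data <- scan(text = data, sep = ",")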
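
Extracting specific patterns, rather than only replacing separators, can be sketched with base R alone. The sample lines below are invented and stand in for the result of readLines(); the "order=" and "total=" field names are assumptions for illustration.

```r
# Hypothetical lines, standing in for readLines("my_file.txt")
lines <- c("2023-01-05 order=1001 total=19.99",
           "2023-01-06 order=1002 total=5.00",
           "no match here")

# Keep only the lines that contain an order id
hits <- grep("order=[0-9]+", lines, value = TRUE)

# Capture the number after "total=" on each matching line
totals <- as.numeric(sub(".*total=([0-9.]+).*", "\\1", hits))

# Extract every date-like token; lines without a match contribute nothing
dates <- unlist(regmatches(lines,
                           gregexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", lines)))
```

Filtering with grep() before sub() matters here: sub() returns the input unchanged when the pattern does not match, so unfiltered lines would turn into NA values in as.numeric().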