How to get more data...for free

In your work as a Data Scientist, there may be times when you are not provided with enough data to build a model or run an analysis. In such cases, you need to get more data from external sources, or to be more specific, from the Internet! That is called ‘crawling’.

In this post, I will show you what is required and how to crawl data from the web (with R). The post also comes with code from one of my previous projects; the link to the code on GitHub is published here.

Please note that this tutorial is made specifically for dynamic scrolling pages. This is an example of a dynamic scrolling page, where new elements are loaded on the fly as you scroll to the end of the page.
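
To make that concrete, here is a rough sketch (using the RSelenium client remDr that is set up later in this post) of how scrolling can be automated so the page keeps loading new elements. Note that the crawl later in this post actually sidesteps manual scrolling by requesting each page of results directly through a URL parameter.

# rough sketch: scroll to the bottom of the page a few times so that new
# elements get loaded on the fly
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2) # give the page a moment to load the next batch of elements
}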

1. Is web-crawling illegal?

Before doing any web-crawling/web-scraping, you need to carefully read the website’s terms of service and copyright policy. Here is why:

TL;DR: Web-crawling may not be illegal, and the chance that you will go to prison for it is small. However legal it might be, though, web-crawling still has a high chance of being unethical.

How do you know when your scraping is unethical?

It depends. I can think of many perfectly ethical cases of data gathering (and in fact have defended some forms of scraping, especially of facts, in a blog post - http://blog.everylodge.com/2012/02/how-everylodge-works-thoughts-on-copyright-terms-of-service/). But a lot of scraping is unethical. It is unethical to appropriate someone else’s creative work for profit. There are also a lot of bots out there scraping and “spinning” content, producing trash that clogs search engine results and adds no value to the internet.
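
A quick practical check you can add (it is not part of the original project and is no substitute for reading the terms of service) is to query the site’s robots.txt with the robotstxt package:

# check foody.vn's robots.txt for the café listing path crawled later in this
# post; requires the 'robotstxt' package, which is not in the package list below
library(robotstxt)
paths_allowed("https://www.foody.vn/ha-noi/cafe")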

2. What is required

2.1. HTML knowledge (or not)

Web-crawling is basically getting HTML elements/text out of a website. To get familiar with HTML, open a website and press F12; you will see an Elements tab, which is where the HTML code of the website lives. Below is an example of text in HTML; as you can see, all the contents of the website are stored there.
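
To make this concrete, here is a tiny illustration using the rvest package installed in section 2.2; the HTML snippet and the café name are made up for demonstration.

library(rvest)

# a made-up piece of HTML, similar to what you see in the Elements tab
snippet <- '<div class="main-info-title"><h1>Highlands Coffee</h1></div>'

# parse the snippet and pull the text out of the h1 element
read_html(snippet) %>%
  html_nodes(xpath = "//div[@class = 'main-info-title']//h1") %>%
  html_text()
#> [1] "Highlands Coffee"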

To learn HTML, you can refer to this great website: https://www.learn-html.org/. It provides an interactive tutorial as well as step-by-step explanations.

For those of you who do not have enough time to learn HTML but still want to do some web-crawling, try this great extension (SelectorGadget), which makes it easy to choose and extract elements from any website: https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb

2.2. R Packages

As this tutorial is done in R, not Python, remember to download R and RStudio; you can find the tutorial here if you haven’t downloaded them yet.

The packages used to crawl data from a dynamic scrolling page are RSelenium and rvest. Other packages such as tidyverse, stringr, leaflet, and ggmap are for supplementary use only.

# create a vector of packages that you will be using
packages <- c('tidyverse', 'rvest', 'RSelenium', 'stringr', 'leaflet', 'ggmap', 'magrittr', 'sp')

# create a vector of installed packages
installed_packages <- as.vector(installed.packages()[,c(1)])

# find packages that are not yet installed
missing_packages <- setdiff(packages, installed_packages)

# install any that are missing
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

2.3. Webdriver

When you open the official Selenium page, the first thing you read in the “What is Selenium?” section is “Selenium automates browsers”. But in order to use Selenium, you have to use a Selenium WebDriver, which comes in versions for Chrome, Firefox, and other browsers. You can install the WebDriver for your favorite browser by accessing:

Personally I prefer Firefox, since it’s generally faster than Chrome.

2.4. Java

You also need to install Java in order to use Selenium (and many other packages in the future). Please find the link here: https://www.java.com/en/download/

3. Start your crawling

Okay, so now let’s start your web-crawling project. Think of the project as a car: first you need to fuel it up, then run it, and finally shut it down.

3.1 Fuel up

3.1.1. Get a glimpse of your website-to-be-crawled

Here I am taking the Foody website as an example. My goal is to extract the information of all the Café/Dessert places in Hanoi. You can take a look at the page here.

Problem with this page:

3.1.2. Prepare your code

In R, as in any other programming language, you need to load packages before using them. To do this, use the following code:

# load packages
invisible(lapply(packages, library, character.only = TRUE))

Next, you need to initiate a Selenium session and create the variables and functions that will be used.
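
The session start itself is not shown in the project code, so here is a minimal sketch, assuming the Firefox WebDriver from section 2.3 is installed and that port 4545 is free (pick another port if it is taken):

# start a Selenium server and open a Firefox browser; remDr is the client
# object that all of the crawling code below relies on
rD <- rsDriver(browser = 'firefox', port = 4545L, verbose = FALSE)
remDr <- rD$client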

## function
### this function is used to switch window handler
myswitch <- function (remDr, windowId) {
  qpath <- sprintf('%s/session/%s/window', remDr$serverURL,
                   remDr$sessionInfo[['id']])
  remDr$queryRD(qpath, 'POST', qdata = list(handle = windowId))
}

### this function is used to get the text of a xpath in a page
get_html_text <- function(page_source, xpath, index=1) {
  # get text by xpath
  text <- (page_source %>% 
             html_nodes(xpath = xpath) %>% 
             html_text())[index]
  
  # if no text is extracted, return empty string, otherwise return that text
  if (identical(text, character(0))) {
    return('')
  } else {
    return(text)
  }
}

### this function is used to get the information of the restaurant
get_general_info <- function(page_source) {
  info <- tibble(
    title = get_html_text(page_source, "//div[@class = 'main-info-title']//h1"),
    category = get_html_text(page_source, "//div[@class = 'category-items']//a[@title]"),
    cuisine_type = get_html_text(page_source, "//div[@class = 'cuisines-list']//a/text()"),
    customer_type = get_html_text(page_source, "//div[@class = 'audiences']/text()"),
    avg_score = get_html_text(page_source, "//div[@class = 'microsite-point-avg ']"),
    location = get_html_text(page_source, "//div[@class = 'microsite-top-points']//span", 1),
    price = get_html_text(page_source, "//div[@class = 'microsite-top-points']//span", 2),
    quality = get_html_text(page_source, "//div[@class = 'microsite-top-points']//span", 3),
    space = get_html_text(page_source, "//div[@class = 'microsite-top-points']//span", 4),
    service = get_html_text(page_source, "//div[@class = 'microsite-top-points']//span", 5),
    comment_count = get_html_text(page_source, "//div[@class = 'microsite-review-count']"),
    address = get_html_text(page_source, "//div[@class = 'res-common-add']//span//a//span[@itemprop = 'streetAddress']"),
    district = get_html_text(page_source, "//span[@itemprop = 'addressLocality']"),
    price_range = get_html_text(page_source, "//span[@itemprop = 'priceRange']/span"),
    time = get_html_text(page_source, "//div[@class = 'micro-timesopen']//span[3]"),
    area = get_html_text(page_source, "//a[@itemprop = 'item']//span[@itemprop = 'name']")
  )
  
  return(info)
}

## variables
page_source <- list() # stores the source code of each listing page
multiple_branches <- list() # stores the indices of multi-branch restaurants
single_branches <- list() # stores the indices of single-branch restaurants
shop_name <- list() # stores the names of the restaurants
click_field <- list() # elements to click on the 'mother' page
currWindow <- list() # window handle of the 'mother' page
general_info <- list() # stores the information of multi-branch restaurants
windows <- list() # all window handles after clicking into a multi-branch restaurant
otherWindow <- list() # window handles other than the 'mother' page after clicking into a multi-branch restaurant
intermediate_handle <- list() # handle of the intermediate page
windows1 <- list() # all window handles after clicking into a branch
otherWindow1 <- list() # window handles other than the 'mother' and 'intermediate' pages after clicking into a branch
windows2 <- list() # all window handles after clicking into a single-branch restaurant
otherWindow2 <- list() # window handles other than the 'mother' page after clicking into a single-branch restaurant
general_info_2 <- list() # stores the information of single-branch restaurants
a <- 1 # just a counter
e <- 1 # just a counter
click_field_branches <- list() # elements to click on the intermediate page

The most confusing part may be the get_general_info() function; you may be wondering where all of those quoted strings come from. They are called XPaths, and you can collect them with the SelectorGadget extension introduced above.
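
For example, with a restaurant page open in the browser, the same XPath used for the title in get_general_info() can be tried on its own (a hypothetical usage sketch):

# parse the page currently open in the browser and pull out the restaurant
# title with the title XPath from get_general_info()
current_page <- read_html(remDr$getPageSource()[[1]])
get_html_text(current_page, "//div[@class = 'main-info-title']//h1")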

3.2. Get your car running

for (page_num in 1:84) {
  
  ## navigate to each pages
  remDr$navigate(paste0("https://www.foody.vn/ha-noi/cafe?ds=Restaurant&vt=row&st=1&c=2&page=", page_num, "&provinceId=218&categoryId=2&append=true"))
  
  ## get page's source_code (including all html code)
  page_source[[page_num]] <- remDr$getPageSource()
  
  ## get name of the restaurants
  read_html(page_source[[page_num]][[1]]) %>% 
    html_nodes(css = "h2") %>% 
    html_text() %>% 
    str_remove('^([[:space:]]+)') %>% # trim spaces at the beginning
    str_remove('([[:space:]]+)$') %>% # trim spaces at the end
    str_replace_all('[\n\r]','') -> shop_name[[page_num]] 
  
  ## get index of multiple branches restaurants
  which(str_detect(shop_name[[page_num]], 'Hệ thống')) -> multiple_branches[[page_num]]
  
  ## get index of single branches restaurants
  which(str_detect(shop_name[[page_num]], 'Hệ thống', negate = TRUE)) -> single_branches[[page_num]]
  
  ## pick the element to click to get to the page of that specific restaurant
  remDr$findElements(using = "xpath", "//h2/a") -> click_field[[page_num]] 
  
  ## get handle of the 'mother' page (which include many restaurants), use to get back to the mother page later
  currWindow[[page_num]] <- remDr$getCurrentWindowHandle()[[1]] 
  
  ## ---------------- for restaurants with multiple branches ------------------
  for(i in multiple_branches[[page_num]]) { 
    
    ## click link
    click_field[[page_num]][[i]]$clickElement()

    ## get handle
    windows[[i]] <- remDr$getWindowHandles() # get all windows handle
    otherWindow[[i]] <- windows[[i]][!windows[[i]] %in% currWindow[[page_num]]][[1]] # other handle apart from current 'mother' one
    
    ## switch to other window
    myswitch(remDr, otherWindow[[i]][[1]])  
    
    ## ------ click to a branch then get information about that branch -----
    
    ## get into a branch
    remDr$findElements(using = "xpath", "//h2/a") -> click_field_branches[[i]] 
    
    ## get the handle of the intermediate page (to get back later)
    intermediate_handle[[i]] <- remDr$getCurrentWindowHandle()[[1]] 
    
    ## loop to get information about each branch
    for(j in 1:length(click_field_branches[[i]])) {
      
      ## click link
      click_field_branches[[i]][[j]]$clickElement()

      ## get handle
      windows1[[j]] <- remDr$getWindowHandles() # get all windows handle
      otherWindow1[[j]] <- windows1[[j]][!windows1[[j]] %in% c(windows1[[j]][[1]], windows1[[j]][[2]])] # other handle apart from current 'mother' and 'intermediate' one
      
      ## switch to other window
      myswitch(remDr, otherWindow1[[j]][[1]])
      
      ## extract info of that branch
      page_source1 <- read_html((remDr$getPageSource())[[1]])
      
      general_info[[a]] <- get_general_info(page_source1)

      ## close the tab for that branch and switch to the intermediate tab
      remDr$closeWindow()
      remDr$switchToWindow(intermediate_handle[[i]][[1]])

      a <- a + 1
      
    }
    
    ## close intermediate tab and switch to mother page
    remDr$closeWindow()
    remDr$switchToWindow(currWindow[[page_num]])
  }
  
  ## ---------------- for restaurants with a single branch ------------------
  
  for(g in single_branches[[page_num]]) {
    
    ## click link
    click_field[[page_num]][[g]]$clickElement()

    ## get handle
    windows2[[g]] <- remDr$getWindowHandles() # get all windows handle
    otherWindow2[[g]] <- windows2[[g]][!windows2[[g]] %in% currWindow[[page_num]]][[1]] # other handle apart from current 'mother' one
    
    ## switch to other window
    myswitch(remDr, otherWindow2[[g]][[1]])
    
    ## extract info of that branch
    page_source2 <- read_html((remDr$getPageSource())[[1]])

    general_info_2[[e]] <- get_general_info(page_source2)
    
    e <- e + 1
    
    ## close tab for that restaurant and switch to mother page
    remDr$closeWindow()
    remDr$switchToWindow(currWindow[[page_num]])
  }
}

Now that the helper functions are prepared, it is time to start the crawling process, which is what the big loop above does. However, as stated above, the flows for multi-branch and single-branch restaurants are different. As a result, we need to determine which restaurants have multiple branches and which have a single branch, then use a separate for loop for each type.

The general idea is to get into the page of each specific restaurant, extract the information, and get out, then repeat that process again and again and again.

The detailed steps are as follows:

Finally, we store the information of the two types of restaurants in two separate lists called general_info and general_info_2.

3.3. Finalize

Note that our output is in the form of lists. As a result, we need to convert them into dataframes for further analysis.

do.call(rbind.data.frame, general_info) -> final_multiple_branches
do.call(rbind.data.frame, general_info_2) -> final_single_branch

Most of the time, the output we get from HTML code is not clean, so a further data-cleaning step is required. What to clean varies on a case-by-case basis; for this specific example, just take the code below as a reference:

loca_process <- function(location_x) {
  location_x %>%
    # strip surrounding quotation marks and leading/trailing whitespace
    mutate_all(function(x) str_replace_all(x, '^((\\s+)(\"*)|(\"*))(.+)((\\s+)(\"*)|(\\s+))$', '\\5')) %>%
    # remove a leading "- " left over from the HTML
    mutate_all(function(x) str_replace_all(x, '^-\\s+', '')) %>%
    # trim any remaining leading/trailing whitespace
    mutate_all(function(x) str_replace_all(x, '\\s+$|^\\s+', '')) %>%
    # combine street address and district into a single address column
    unite('address', address, district, sep = ', ') %>%
    # split the price range into lower and upper bounds
    separate(price_range, c('lower_price', 'upper_price'), sep = ' - ') %>%
    # split the opening hours into two time windows
    separate(time, c('time1', 'time2'), sep = ' \\| ') %>%
    separate(time1, c('open_1', 'close_1'), sep = ' - ') %>%
    separate(time2, c('open_2', 'close_2'), sep = ' _ ') %>%
    # drop thousand separators, the đ currency symbol, and stray text from prices
    mutate_at(vars('lower_price', 'upper_price'), function(x) str_replace_all(x, '\\.|đ|click_field_branches', '')) %>%
    # drop the last character of the price strings
    mutate_at(vars('lower_price', 'upper_price'), function(x) str_sub(x, end = -2)) %>%
    # convert the numeric columns from character to double
    mutate_at(vars('avg_score', 'location', 'quality', 'space', 'service', 'comment_count', 'lower_price', 'upper_price'), as.double)
}

loca_process(final_single_branch) -> final_single_branch
loca_process(final_multiple_branches) -> final_multiple_branches
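
As a last bit of housekeeping (the “shut it down” phase of the car analogy), you may also want to close the browser and stop the Selenium server started in section 3.1.2, and optionally stack the two cleaned dataframes into one table; final_all is just an illustrative name.

# close the browser and stop the Selenium server started with rsDriver
remDr$close()
rD$server$stop()

# optionally stack the two cleaned dataframes into a single table for analysis
final_all <- bind_rows(final_multiple_branches, final_single_branch)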

Now we have gone through the basic process of web-crawling. I hope you will try it on your own project!