Cruising Through Kijiji Part 1: Web Scraping using R
It’s no secret that used vehicle prices have risen significantly compared to what we have seen in the past. A variety of factors have been blamed, from the microchip shortage brought on by the COVID-19 pandemic to inflation.
I was recently in the market for a used vehicle, and in addition to browsing ads in the evenings, I wanted to survey the market and gather data programmatically to see what I could expect to pay in today’s market.
To do this, I used an excellent suite of packages in R: tidyverse, as well as rvest. Here’s how you can do the same.
Getting Started
This post outlines the process of scraping and cleaning Kijiji ads for cars and trucks. First we need to determine exactly which URLs we will be scraping. Through a process of trial and error, I determined that I would have the most success scraping the first 100 pages of vehicle ads every day. This is because, despite the website showing 230,699 vehicles at the time of writing, Kijiji only allows you to visit the first 100 pages of results. This limitation isn’t so bad, because it ensures we will always be looking at the freshest postings.
First we will load up the necessary packages and generate the list of 100 URLs to visit. Each page of ads follows the same structure, so we can generate all 100 URLs with a single call to paste0.
library(tidyverse)
library(rvest)

urls <- paste0("https://www.Kijiji.ca/b-cars-trucks/canada/new__used/page-",
               100:1,
               "/c174l0a49?ad=offering")
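Since paste0() is vectorized, this single call produces all 100 URLs, counting down from page 100 to page 1:

urls[1]
## [1] "https://www.Kijiji.ca/b-cars-trucks/canada/new__used/page-100/c174l0a49?ad=offering"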
We’re all set up. Now all we have to do is: 1) download each ad, 2) parse each ad into a usable format, and 3) do a little cleanup, because web scraping is always messy. Easy!
Downloading Ads
Before we dive right into sending HTTP requests, it is a good idea to make sure we both scrape respectfully and handle any errors that might arise from sending an HTTP request. The purrr library (helpfully loaded into our R session when we loaded tidyverse) has excellent functions for this purpose.
# Wraps read_html with three adverbs from purrr to ensure we always
# get a result from our web scraping.
# - First, we wrap read_html with slowly to provide a 2 second delay.
# - Second, we wrap that with insistently so that if the request times
#   out, we try again with an exponential backoff period.
# - If all else fails, we simply return NA.
rhtml <- possibly(insistently(slowly(read_html, rate = rate_delay(2)),
                              rate = rate_backoff(max_times = 5),
                              quiet = FALSE),
                  otherwise = NA)
Now we have a function that will make HTTP requests with a 2 second delay (so as not to overwhelm the server with requests), retry the connection if an error such as a timeout occurs (up to 5 times), and, if all else fails, return NA instead of stopping our scraping.
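As a quick sanity check, here’s the failure behaviour in miniature (the URL is deliberately unresolvable):

# read_html() errors on a host that can never resolve, insistently()
# retries up to 5 times with backoff, and possibly() finally returns NA.
result <- rhtml("http://invalid.invalid/")
is.na(result)
## [1] TRUE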
Next we have to take our vector of 100 URLs and get the HTML data for each ad on each page of ads. We can do this by first storing the HTML of one page of ads in a variable called data, and then using a combination of CSS selectors and regular expressions to collect each ad’s link, the date the ad was posted, and the ad’s own HTML data into a tibble (and that last step is where we use the rhtml function we defined above).
scrape_page <- function(.url) {
  # Scrapes the provided url and returns a tibble of ad links, dates,
  # and html data.
  #
  # Args:
  #   .url: A Kijiji url that contains advertisements for vehicles.
  #         Typically there are ~40 advertisements per page.
  #
  # Returns:
  #   tibble with the following columns:
  #     - link
  #     - date
  #     - html_data
  data <- .url |>
    read_html()

  ads <- data |>
    html_elements(".title") |>
    html_attr("href") |>
    discard(is.na) |>
    keep(\(x) str_detect(x, "v-cars|v-autos"))

  dates <- data |>
    html_elements(".date-posted") |>
    html_text()

  links <- paste0("https://www.Kijiji.ca", ads)

  tibble(link = links,
         date = dates,
         html_data = map(link, rhtml))
}
The benefit of this function is that once it has been called, we’re all done scraping! While we’re testing out our parsing function, we don’t have to scrape the ads again each time we make a change - we’re drawing a line between what each function should and shouldn’t do. This both lessens the load on the server and makes it faster for us to test our parsing function.
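One way to really take advantage of that split is to persist the scraped pages between sessions. Here’s a minimal sketch (the file name is just an example, and it assumes the scrape_page() results have been bound into a tibble called pages with failed NA requests filtered out, as we do in the final section). Note that xml2 documents are external pointers that don’t survive saveRDS() directly, so we round-trip them through character:

# Save: convert each ad's html document to a plain string first.
pages |>
  mutate(html_data = map_chr(html_data, as.character)) |>
  saveRDS("kijiji-pages.rds")

# Load (in a later session): re-parse each string back into html.
pages <- readRDS("kijiji-pages.rds") |>
  mutate(html_data = map(html_data, read_html))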
Parsing Each Ad
Now that we have each ad’s HTML data in memory, we can work on parsing each ad and selecting each piece of data we want using CSS selectors and regular expressions. This is fairly straightforward - it’s just tedious. Once each piece of the ad is parsed, we simply return a tibble containing all of the information.
Side note: I suspect that some (if not all) of the CSS selectors here have changed since I first wrote this function, so it may fail until they’re updated.
parse_ad <- function(.html_data, .link, .date) {
  # Parses an ad's html page into a human readable format.
  #
  # Args:
  #   .html_data: The html data as returned by rhtml().
  #   .link: The link to the Kijiji ad.
  #   .date: The date the ad was posted, according to the ad.
  #
  # Returns:
  #   tibble containing all the data from the ad.
  title <- .html_data |>
    html_elements(".title-2323565163") |>
    html_text() |>
    paste0(collapse = "")

  location <- .html_data |>
    html_elements(".address-3617944557") |>
    html_text()

  price <- .html_data |>
    html_elements(".currentPrice-2842943473 span") |>
    html_text() |>
    paste0(collapse = "")

  seller_type <- .html_data |>
    html_elements(".line-2791721720:nth-child(1)") |>
    html_text()

  description <- .html_data |>
    html_elements(".descriptionContainer-231909819 p") |>
    html_text()
  description <- if (is_empty(description)) {
    .html_data |>
      html_elements(".descriptionContainer-231909819 div") |>
      html_text() |>
      paste(collapse = " ")
  } else {
    description |>
      paste(collapse = " ")
  }

  # Contains all the data about the car such as make, model, condition, etc.
  meta_data <- .html_data |>
    html_elements(".itemAttribute-3080139557") |>
    html_text() |>
    paste(collapse = " ")

  condition <- meta_data |>
    str_extract("Condition(\\S+)") |>
    str_replace("Condition", "")

  year <- meta_data |>
    str_extract("Year(\\S+)") |>
    str_replace("Year", "")

  make <- meta_data |>
    str_extract("Make(\\S+)") |>
    str_replace("Make", "")

  # For Teslas, optionally match what follows the word "Model" as well,
  # so that models such as "Model 3" are captured in full.
  model <- if (is.na(make) || make != "Tesla") {
    meta_data |>
      str_extract("Model(\\S+)") |>
      str_replace("Model", "")
  } else {
    meta_data |>
      str_extract("Model(\\S+)( *.)?") |>
      str_replace("Model", "")
  }

  trim <- meta_data |>
    str_extract("Trim(\\S+)") |>
    str_replace("Trim", "")

  colour <- meta_data |>
    str_extract("Colour(\\S+)") |>
    str_replace("Colour", "")

  body_type <- meta_data |>
    str_extract("Body Type(\\S+)") |>
    str_replace("Body Type", "")

  number_of_doors <- meta_data |>
    str_extract("No. of Doors(\\S+)") |>
    str_replace("No. of Doors", "")

  number_of_seats <- meta_data |>
    str_extract("No. of Seats(\\S+)") |>
    str_replace("No. of Seats", "")

  drive_train <- meta_data |>
    str_extract("Drivetrain(\\S+)") |>
    str_replace("Drivetrain", "")

  transmission <- meta_data |>
    str_extract("Transmission(\\S+)") |>
    str_replace("Transmission", "")

  fuel_type <- meta_data |>
    str_extract("Fuel Type(\\S+)") |>
    str_replace("Fuel Type", "")

  km <- meta_data |>
    str_extract("Kilometers(\\S+)") |>
    str_replace("Kilometers", "")

  blue_tooth <- meta_data |>
    str_detect("Bluetooth")

  push_start <- meta_data |>
    str_detect("Push button start")

  parking_assistant <- meta_data |>
    str_detect("Parking assistant")

  tibble(link = .link,
         date_posted = .date,
         ad_title = title,
         location = location,
         description = description,
         seller_type = seller_type,
         price = price,
         make = make,
         model = model,
         condition = condition,
         year = year,
         trim = trim,
         colour = colour,
         body_type = body_type,
         number_of_doors = number_of_doors,
         number_of_seats = number_of_seats,
         drive_train = drive_train,
         transmission = transmission,
         fuel_type = fuel_type,
         km = km,
         blue_tooth = blue_tooth,
         push_start = push_start,
         parking_assistant = parking_assistant)
}
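A quick aside on why the extract-then-strip pattern above works: the attribute text appears to come through html_text() with each label fused to its value (e.g. "Year2018"), and paste(collapse = " ") joins the attributes with spaces, so the greedy \S+ captures a value up to the next space. In miniature, with a made-up meta_data string:

meta <- "ConditionUsed Year2018 MakeToyota Kilometers85,000"
meta |>
  str_extract("Year(\\S+)") |>  # "Year2018"
  str_replace("Year", "")       # strip the label, leaving the value
## [1] "2018"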
Cleaning Up
We’re almost there (and you could skip this step if you wish)! As with any web scraping project, you will inevitably end up with messy data. I’ve included a function to tidy up some of the most visible problems with the data, but there is undoubtedly more to do than what I’ve done here.
clean_ads <- function(.ads) {
  # Performs steps to clean up the messy ad data.
  #
  # Args:
  #   .ads: A tibble containing the ads as returned by parse_ad().
  #
  # Returns:
  #   tibble containing all the cleaned data from the Kijiji ads.
  postal_code_table <- tribble(
    ~postal_code, ~province,
    "A", "Newfoundland and Labrador",
    "B", "Nova Scotia",
    "C", "Prince Edward Island",
    "E", "New Brunswick",
    "G", "Quebec",
    "H", "Quebec",
    "J", "Quebec",
    "K", "Ontario",
    "L", "Ontario",
    "M", "Ontario",
    "N", "Ontario",
    "P", "Ontario",
    "R", "Manitoba",
    "S", "Saskatchewan",
    "T", "Alberta",
    "V", "British Columbia",
    "X", "Northwest Territories and Nunavut",
    "Y", "Yukon"
  )

  # We can perform almost all of the cleaning steps in one call to mutate().
  .ads |>
    filter(!duplicated(link)) |>
    mutate(date_posted = case_when(
             date_posted == "Yesterday" ~ Sys.Date() - 1,
             str_detect(date_posted, "<") ~ Sys.Date(),
             .default = as.Date(date_posted, "%d/%m/%Y")
           ),
           ad_title = trimws(ad_title),
           description = if_else(description == "", NA_character_, description),
           price = if_else(price == "Please Contact" | price == "Swap/Trade",
                           NA_character_,
                           price),
           price = parse_number(price),
           make = case_when(make == "Land" ~ "Land Rover",
                            make == "Aston" ~ "Aston Martin",
                            .default = make),
           model = case_when(model == "Range" ~ "Range Rover",
                             str_detect(str_to_lower(ad_title), "grand") ~
                               str_extract(str_to_title(ad_title), "Grand (\\S+)"),
                             .default = model),
           model = str_remove(model, "[[:punct:]].*"),
           year = as.integer(year),
           trim = str_replace(trim, ",", ""),
           body_type = str_replace(body_type, ",", ""),
           number_of_doors = as.integer(number_of_doors),
           number_of_seats = as.integer(number_of_seats),
           drive_train = if_else(drive_train == "4", "4x4", drive_train),
           km = parse_number(km),
           postal_code = str_extract(location,
                                     "[A-Z][0-9][A-Z]( )?[0-9][A-Z][0-9]"),
           key = str_sub(postal_code, 1, 1)) |>
    left_join(postal_code_table,
              by = join_by(key == postal_code)) |>
    select(link,
           date_posted,
           ad_title,
           province,
           postal_code,
           description:parking_assistant)
}
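The province lookup at the end relies on a handy property of Canadian postal codes: the first letter identifies the province or territory, which is exactly what postal_code_table maps. In miniature, with an invented location string:

location <- "St. Albert, AB T8N 5A5"
postal_code <- str_extract(location, "[A-Z][0-9][A-Z]( )?[0-9][A-Z][0-9]")
str_sub(postal_code, 1, 1)  # joins to "Alberta" via postal_code_table
## [1] "T"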
Putting It All Together
We have all the pieces we need to successfully scrape vehicle ads on Kijiji. The final step is to glue each function together so that at the end of it all, we end up with a tibble containing the data for each ad on all 100 pages.
# Iterate over the list of urls, applying scrape_page() to each one.
# purrr::map_df() is the same as purrr::map(), but returns the result as a
# tibble instead of a list - convenient in this case.
# This is where the actual scraping is performed.
pages <- map_df(urls, scrape_page) |>
  filter(!is.na(html_data))

ads <- pmap_df(list(pages$html_data, pages$link, pages$date),
               parse_ad) |>
  clean_ads()
After the above code is executed, we should end up with a tibble that looks something like this:
## # A tibble: 4,363 × 24
## link date_posted ad_title province postal_code description seller_type price
## <chr> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 http… 2023-03-03 2022 Ni… Ontario N6L 1J9 "Come visi… Dealer 30391
## 2 http… 2023-03-02 2022 Hy… British… V9L 6C7 "Our 2022 … Dealer 55596
## 3 http… 2023-03-01 2019 BM… Alberta T4A 2H7 "Home of t… Dealer 82998
## 4 http… 2023-02-12 2016 RA… Alberta T1A7H8 "This is a… Owner 62900
## 5 http… 2023-03-02 2020 Au… Quebec J5R 1S8 "**Moteur … Dealer 43995
## 6 http… 2023-03-05 2019 TE… Ontario M1R 2Y5 "TESLA INS… Dealer 42998
## 7 http… 2023-03-05 2022 Fo… Ontario L4R 4L1 "Take adva… Dealer 59563
## 8 http… 2023-03-05 2015 Ca… Alberta T8N 5A5 "Canadian … Dealer 49900
## 9 http… 2023-03-05 2015 RA… Saskatc… S7K 1R1 "WE FINANC… Dealer 21995
## 10 http… 2023-03-05 2023 Ch… Saskatc… S7K 0V1 "Saskatoon… Dealer 60407
## # ℹ 4,353 more rows
## # ℹ 16 more variables: make <chr>, model <chr>, condition <chr>, year <dbl>,
## # trim <chr>, colour <chr>, body_type <chr>, number_of_doors <dbl>,
## # number_of_seats <dbl>, drive_train <chr>, transmission <chr>,
## # fuel_type <chr>, km <dbl>, blue_tooth <lgl>, push_start <lgl>,
## # parking_assistant <lgl>
And we’re done! We’ve successfully used R to scrape Kijiji and programmatically download data on used vehicles in Canada. I’ve set up this script to run on my server every day, and in the next post we’ll take a look at what insights we can glean from that data.
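If you’d like to do the same, one way to persist each day’s results is to write them to a dated file so the daily runs accumulate (the file name here is just an example, not my exact setup):

# Appends today's date, e.g. "kijiji-ads-2023-03-05.csv".
write_csv(ads, paste0("kijiji-ads-", Sys.Date(), ".csv"))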
Thanks for reading!