Cruising Through Kijiji Part 1: Web Scraping using R
It’s no secret that used vehicle prices have risen significantly compared to what we have seen in the past. A variety of factors have been blamed, from the microchip shortage brought on by the COVID-19 pandemic to inflation.
I was recently in the market for a used vehicle, and in addition to browsing ads in the evenings, I wanted to survey the market and gather data programmatically to see what I could expect to pay in today’s market.
To do this, I used an excellent suite of packages in R: tidyverse, as well as rvest. Here’s how you can do the same.
Getting Started
This post outlines the process of scraping and cleaning Kijiji ads for cars and trucks. First we need to determine exactly which URLs we will be scraping. Through a process of trial and error, I determined that I would have the most success scraping the first 100 pages of vehicle ads every day. This is because, despite the website showing 230,699 vehicles at the time of writing, Kijiji only allows you to visit the first 100 pages of results. This limitation isn’t so bad, because it ensures we will always be looking at the freshest postings.
First we will load up the necessary packages and generate the list of 100 URLs to visit. Each page of ads follows the same structure, so we can generate all 100 URLs with a single call to paste0.
library(tidyverse)
library(rvest)

urls <- paste0("https://www.Kijiji.ca/b-cars-trucks/canada/new__used/page-",
               100:1,
               "/c174l0a49?ad=offering")
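Since paste0() is vectorized, this single call produces all 100 URLs, counting down from page 100 to page 1:

urls[1]
## [1] "https://www.Kijiji.ca/b-cars-trucks/canada/new__used/page-100/c174l0a49?ad=offering"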
We’re all set up. Now all we have to do is: 1) download each ad, 2) parse each ad into a usable format, and 3) do a little cleanup, because web scraping is always messy. Easy!
Downloading Ads
Before we dive right into sending HTTP requests, it is a good idea to make sure we both scrape respectfully and handle any errors that might arise from sending an HTTP request. The purrr library (helpfully loaded into our R session when we loaded tidyverse) has excellent functions for this purpose.
# Wraps read_html with three adverbs from purrr to ensure we always
# get a result from our web scraping.
# - First, we wrap read_html with slowly to provide a 2 second delay.
# - Second, we wrap that with insistently so that if the request times
#   out, we try again with an exponential backoff period.
# - If all else fails, we simply return NA.
rhtml <- possibly(insistently(slowly(read_html, rate = rate_delay(2)),
                              rate = rate_backoff(max_times = 5),
                              quiet = FALSE),
                  otherwise = NA)
Now we have a function that will make HTTP requests with a 2 second delay (so as not to overwhelm the server with requests), retry the connection if an error such as a timeout occurs (up to 5 times), and, if all else fails, return NA instead of stopping our scraping.
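As a quick sanity check, here’s the failure behaviour in miniature (the URL is deliberately unresolvable):

# read_html() errors on a host that can never resolve, insistently()
# retries up to 5 times with backoff, and possibly() finally returns NA.
result <- rhtml("http://invalid.invalid/")
is.na(result)
## [1] TRUE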
Next we have to take our vector of 100 URLs and get the HTML data for each ad on each page of ads. We can do this by first storing the HTML of one page of ads in a variable called data, and then using a combination of CSS selectors and regular expressions to collect each ad’s link, the date the ad was posted, and the ad’s own HTML data into a tibble (and that last step is where we use the rhtml function we defined above).
scrape_page <- function(.url) {
  # Scrapes the provided url and returns a tibble of ad links, dates,
  # and html data.
  #
  # Args:
  #   .url: A Kijiji url that contains advertisements for vehicles.
  #         Typically there are ~40 advertisements per page.
  #
  # Returns:
  #   tibble with the following columns:
  #     - link
  #     - date
  #     - html_data
  data <- .url |>
    read_html()

  ads <- data |>
    html_elements(".title") |>
    html_attr("href") |>
    discard(is.na) |>
    keep(\(x) str_detect(x, "v-cars|v-autos"))

  dates <- data |>
    html_elements(".date-posted") |>
    html_text()

  links <- paste0("https://www.Kijiji.ca", ads)

  tibble(link = links,
         date = dates,
         html_data = map(link, rhtml))
}
The benefit of this function is that once it has been called, we’re all done scraping! While we’re testing out our parsing function, we don’t have to scrape the ads again each time we make a change - we’re drawing a line between what each function should and shouldn’t do. This both lessens the load on the server and makes it faster for us to test our parsing function.
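One way to really take advantage of that split is to persist the scraped pages between sessions. Here’s a minimal sketch (the file name is just an example, and it assumes the scrape_page() results have been bound into a tibble called pages with failed NA requests filtered out, as we do in the final section). Note that xml2 documents are external pointers that don’t survive saveRDS() directly, so we round-trip them through character:

# Save: convert each ad's html document to a plain string first.
pages |>
  mutate(html_data = map_chr(html_data, as.character)) |>
  saveRDS("kijiji-pages.rds")

# Load (in a later session): re-parse each string back into html.
pages <- readRDS("kijiji-pages.rds") |>
  mutate(html_data = map(html_data, read_html))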
Parsing Each Ad
Now that we have each ad’s HTML data in memory, we can work on parsing each ad and selecting each piece of data we want using CSS selectors and regular expressions. This is fairly straightforward - it’s just tedious. Once each piece of the ad is parsed, we simply return a tibble containing all of the information.
Side note: I suspect that some (if not all) of the CSS selectors here have changed since I first wrote this function, so it may fail until they’re updated.
parse_ad <- function(.html_data, .link, .date) {
  # Parses an ad's html page into a human readable format.
  #
  # Args:
  #   .html_data: The html data as returned by rhtml().
  #   .link: The link to the Kijiji ad.
  #   .date: The date the ad was posted, according to the ad.
  #
  # Returns:
  #   tibble containing all the data from the ad.
  title <- .html_data |>
    html_elements(".title-2323565163") |>
    html_text() |>
    paste0(collapse = "")

  location <- .html_data |>
    html_elements(".address-3617944557") |>
    html_text()

  price <- .html_data |>
    html_elements(".currentPrice-2842943473 span") |>
    html_text() |>
    paste0(collapse = "")

  seller_type <- .html_data |>
    html_elements(".line-2791721720:nth-child(1)") |>
    html_text()

  description <- .html_data |>
    html_elements(".descriptionContainer-231909819 p") |>
    html_text()
  description <- if (is_empty(description)) {
    .html_data |>
      html_elements(".descriptionContainer-231909819 div") |>
      html_text() |>
      paste(collapse = " ")
  } else {
    description |>
      paste(collapse = " ")
  }

  # Contains all the data about the car such as make, model, condition, etc.
  meta_data <- .html_data |>
    html_elements(".itemAttribute-3080139557") |>
    html_text() |>
    paste(collapse = " ")

  condition <- meta_data |>
    str_extract("Condition(\\S+)") |>
    str_replace("Condition", "")

  year <- meta_data |>
    str_extract("Year(\\S+)") |>
    str_replace("Year", "")

  make <- meta_data |>
    str_extract("Make(\\S+)") |>
    str_replace("Make", "")

  # For Teslas, optionally match what follows the word "Model" as well,
  # so that models such as "Model 3" are captured in full.
  model <- if (is.na(make) || make != "Tesla") {
    meta_data |>
      str_extract("Model(\\S+)") |>
      str_replace("Model", "")
  } else {
    meta_data |>
      str_extract("Model(\\S+)( *.)?") |>
      str_replace("Model", "")
  }

  trim <- meta_data |>
    str_extract("Trim(\\S+)") |>
    str_replace("Trim", "")

  colour <- meta_data |>
    str_extract("Colour(\\S+)") |>
    str_replace("Colour", "")

  body_type <- meta_data |>
    str_extract("Body Type(\\S+)") |>
    str_replace("Body Type", "")

  number_of_doors <- meta_data |>
    str_extract("No. of Doors(\\S+)") |>
    str_replace("No. of Doors", "")

  number_of_seats <- meta_data |>
    str_extract("No. of Seats(\\S+)") |>
    str_replace("No. of Seats", "")

  drive_train <- meta_data |>
    str_extract("Drivetrain(\\S+)") |>
    str_replace("Drivetrain", "")

  transmission <- meta_data |>
    str_extract("Transmission(\\S+)") |>
    str_replace("Transmission", "")

  fuel_type <- meta_data |>
    str_extract("Fuel Type(\\S+)") |>
    str_replace("Fuel Type", "")

  km <- meta_data |>
    str_extract("Kilometers(\\S+)") |>
    str_replace("Kilometers", "")

  blue_tooth <- meta_data |>
    str_detect("Bluetooth")

  push_start <- meta_data |>
    str_detect("Push button start")

  parking_assistant <- meta_data |>
    str_detect("Parking assistant")

  tibble(link = .link,
         date_posted = .date,
         ad_title = title,
         location = location,
         description = description,
         seller_type = seller_type,
         price = price,
         make = make,
         model = model,
         condition = condition,
         year = year,
         trim = trim,
         colour = colour,
         body_type = body_type,
         number_of_doors = number_of_doors,
         number_of_seats = number_of_seats,
         drive_train = drive_train,
         transmission = transmission,
         fuel_type = fuel_type,
         km = km,
         blue_tooth = blue_tooth,
         push_start = push_start,
         parking_assistant = parking_assistant)
}
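A quick aside on why the extract-then-strip pattern above works: the attribute text appears to come through html_text() with each label fused to its value (e.g. "Year2018"), and paste(collapse = " ") joins the attributes with spaces, so the greedy \S+ captures a value up to the next space. In miniature, with a made-up meta_data string:

meta <- "ConditionUsed Year2018 MakeToyota Kilometers85,000"
meta |>
  str_extract("Year(\\S+)") |>  # "Year2018"
  str_replace("Year", "")       # strip the label, leaving the value
## [1] "2018"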
Cleaning Up
We’re almost there (and you could skip this step if you wish)! As with any web scraping project, you will inevitably end up with messy data. I’ve included a function to tidy up some of the most visible problems with the data, but there is undoubtedly more to do than what I’ve done here.
clean_ads <- function(.ads) {
  # Performs steps to clean up the messy ad data.
  #
  # Args:
  #   .ads: A tibble containing the ads as returned by parse_ad().
  #
  # Returns:
  #   tibble containing all the cleaned data from the Kijiji ads.
  postal_code_table <- tribble(
    ~postal_code, ~province,
    "A", "Newfoundland and Labrador",
    "B", "Nova Scotia",
    "C", "Prince Edward Island",
    "E", "New Brunswick",
    "G", "Quebec",
    "H", "Quebec",
    "J", "Quebec",
    "K", "Ontario",
    "L", "Ontario",
    "M", "Ontario",
    "N", "Ontario",
    "P", "Ontario",
    "R", "Manitoba",
    "S", "Saskatchewan",
    "T", "Alberta",
    "V", "British Columbia",
    "X", "Northwest Territories and Nunavut",
    "Y", "Yukon"
  )

  # We can perform almost all of the cleaning steps in one call to mutate().
  .ads |>
    filter(!duplicated(link)) |>
    mutate(date_posted = case_when(
             date_posted == "Yesterday" ~ Sys.Date() - 1,
             str_detect(date_posted, "<") ~ Sys.Date(),
             .default = as.Date(date_posted, "%d/%m/%Y")
           ),
           ad_title = trimws(ad_title),
           description = if_else(description == "", NA_character_, description),
           price = if_else(price == "Please Contact" | price == "Swap/Trade",
                           NA_character_,
                           price),
           price = parse_number(price),
           make = case_when(make == "Land" ~ "Land Rover",
                            make == "Aston" ~ "Aston Martin",
                            .default = make),
           model = case_when(model == "Range" ~ "Range Rover",
                             str_detect(str_to_lower(ad_title), "grand") ~
                               str_extract(str_to_title(ad_title), "Grand (\\S+)"),
                             .default = model),
           model = str_remove(model, "[[:punct:]].*"),
           year = as.integer(year),
           trim = str_replace(trim, ",", ""),
           body_type = str_replace(body_type, ",", ""),
           number_of_doors = as.integer(number_of_doors),
           number_of_seats = as.integer(number_of_seats),
           drive_train = if_else(drive_train == "4", "4x4", drive_train),
           km = parse_number(km),
           postal_code = str_extract(location,
                                     "[A-Z][0-9][A-Z]( )?[0-9][A-Z][0-9]"),
           key = str_sub(postal_code, 1, 1)) |>
    left_join(postal_code_table,
              by = join_by(key == postal_code)) |>
    select(link,
           date_posted,
           ad_title,
           province,
           postal_code,
           description:parking_assistant)
}
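The province lookup at the end relies on a handy property of Canadian postal codes: the first letter identifies the province or territory, which is exactly what postal_code_table maps. In miniature, with an invented location string:

location <- "St. Albert, AB T8N 5A5"
postal_code <- str_extract(location, "[A-Z][0-9][A-Z]( )?[0-9][A-Z][0-9]")
str_sub(postal_code, 1, 1)  # joins to "Alberta" via postal_code_table
## [1] "T"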
Putting It All Together
We have all the pieces we need to successfully scrape vehicle ads on Kijiji. The final step is to glue each function together so that at the end of it all, we end up with a tibble containing the data for each ad on all 100 pages.
# Iterate over the list of urls, applying scrape_page() to each one.
# purrr::map_df() is the same as purrr::map(), but returns the result as a
# tibble instead of a list - convenient in this case.
# This is where the actual scraping is performed.
pages <- map_df(urls, scrape_page) |>
  filter(!is.na(html_data))

ads <- pmap_df(list(pages$html_data, pages$link, pages$date),
               parse_ad) |>
  clean_ads()
After the above code is executed, we should end up with a tibble that looks something like this:
## # A tibble: 4,363 × 24
## link date_posted ad_title province postal_code description seller_type price
## <chr> <date> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 http… 2023-03-03 2022 Ni… Ontario N6L 1J9 "Come visi… Dealer 30391
## 2 http… 2023-03-02 2022 Hy… British… V9L 6C7 "Our 2022 … Dealer 55596
## 3 http… 2023-03-01 2019 BM… Alberta T4A 2H7 "Home of t… Dealer 82998
## 4 http… 2023-02-12 2016 RA… Alberta T1A7H8 "This is a… Owner 62900
## 5 http… 2023-03-02 2020 Au… Quebec J5R 1S8 "**Moteur … Dealer 43995
## 6 http… 2023-03-05 2019 TE… Ontario M1R 2Y5 "TESLA INS… Dealer 42998
## 7 http… 2023-03-05 2022 Fo… Ontario L4R 4L1 "Take adva… Dealer 59563
## 8 http… 2023-03-05 2015 Ca… Alberta T8N 5A5 "Canadian … Dealer 49900
## 9 http… 2023-03-05 2015 RA… Saskatc… S7K 1R1 "WE FINANC… Dealer 21995
## 10 http… 2023-03-05 2023 Ch… Saskatc… S7K 0V1 "Saskatoon… Dealer 60407
## # ℹ 4,353 more rows
## # ℹ 16 more variables: make <chr>, model <chr>, condition <chr>, year <dbl>,
## # trim <chr>, colour <chr>, body_type <chr>, number_of_doors <dbl>,
## # number_of_seats <dbl>, drive_train <chr>, transmission <chr>,
## # fuel_type <chr>, km <dbl>, blue_tooth <lgl>, push_start <lgl>,
## # parking_assistant <lgl>
And we’re done! We’ve successfully used R to scrape Kijiji and programmatically download data on used vehicles in Canada. I’ve set up this script to run on my server every day, and in the next post we’ll take a look at what insights we can glean from that data.
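If you’d like to do the same, one way to persist each day’s results is to write them to a dated file so the daily runs accumulate (the file name here is just an example, not my exact setup):

# Appends today's date, e.g. "kijiji-ads-2023-03-05.csv".
write_csv(ads, paste0("kijiji-ads-", Sys.Date(), ".csv"))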
Thanks for reading!