Web Scraping Using ImportXML in Google Spreadsheets

This tutorial shows you how to scrape web pages using Google Spreadsheets instead of dedicated web scraping tools. The method is best suited to smaller sets of URLs.

What is Web Scraping?

Web Scraping is a process of extracting specific information from web pages.

How to Scrape a Website?

You can use data extraction in Screaming Frog (SEO Spider Tool). However, if you don’t want to pay for a Screaming Frog license (the free version is limited to 500 URLs), or if you just don’t like that tool (that’s hard to believe), then you may be looking for something else.

There are some great alternatives such as Deepcrawl and Import.io, or you can scrape websites using a Python library. However, I’d like to show you how to use Google Spreadsheets to extract data from a list of URLs. It’s quite simple, and you need to understand just two things to do that:

  • XPath
  • ImportXML

XPath

XPath helps you navigate through an XML document and locate specific elements and attributes. It’s a simple way to select an element on an HTML page.

Many articles about XPath have been written, and I’ll refer you to W3Schools to learn a bit if you’re not familiar with it just yet.

Let’s imagine I want to list all blog posts published on Moz’s blog this year, along with the author names and the number of comments.

  • Title: //h1
  • Author’s name: //*[@id='wrap']/main[1]/article/header/div/div/div[1]/div[2]/a
  • The number of comments: //*[@id='comments']/div[1]/div/div[1]/h2/span
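Before pasting an XPath into a sheet, you can sanity-check it locally. Here is a quick sketch using Python’s standard library, whose ElementTree module supports a useful subset of XPath (the markup below is a simplified, made-up stand-in, not Moz’s actual HTML):

```python
# Sanity-checking XPath expressions with Python's standard library.
# ElementTree supports a subset of XPath (no local-name(), etc.).
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a blog post page.
html = """
<html>
  <body>
    <h1>The Technical SEO Renaissance</h1>
    <div id="wrap">
      <a class="author">Mike King</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(html)

# //h1 -> every <h1> anywhere in the document (ET needs the ".//" form)
title = root.find(".//h1").text

# //*[@id='wrap']/a -> an <a> inside the element with id="wrap"
author = root.find(".//*[@id='wrap']/a").text

print(title)   # The Technical SEO Renaissance
print(author)  # Mike King
```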

It’s always important to understand the concept to be able to use it in different situations, but in some cases, you can cheat a little bit and use Chrome Developer Tools instead to get XPath.

XPath in Chrome Developer Tools

[Screenshot: copying an element’s XPath via Chrome Developer Tools]

Copying an XPath via Chrome Developer Tools may not work every time. In my tests, this shortcut failed in about 25% of cases, so I recommend learning and understanding XPath properly.

ImportXML

The magic function that allows importing data from other documents is IMPORTXML.

Unfortunately, ImportXML doesn’t execute JavaScript, so you won’t be able to use this function if the content of the document is generated by JavaScript (jQuery, etc.).

How to use ImportXML

=IMPORTXML(URL,XPATH)

  • URL – The URL of the document from which you want to extract information.
  • XPATH – The path to the element you want to extract

To get the heading of Mike King’s post, you can use the formula below.

=IMPORTXML("https://moz.com/blog/the-technical-seo-renaissance","//h1")
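IMPORTXML itself runs on Google’s servers, but the idea can be sketched locally. Below is a rough Python equivalent (the function names are mine, not a real API); it assumes the fetched page parses as well-formed XML/XHTML, which real-world HTML often does not.

```python
# A rough local sketch of IMPORTXML(url, xpath) using only the
# standard library. Assumes the document is well-formed XML.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

def run_xpath(xml_text, xpath):
    """Return the text of every node matching the (limited) XPath."""
    root = ET.fromstring(xml_text)
    # ElementTree expects relative paths, so rewrite "//h1" as ".//h1"
    if xpath.startswith("//"):
        xpath = "." + xpath
    return [node.text for node in root.findall(xpath)]

def import_xml(url, xpath):
    """Fetch a URL and apply the XPath, like =IMPORTXML(url, xpath)."""
    with urlopen(url) as resp:
        return run_xpath(resp.read().decode("utf-8"), xpath)

print(run_xpath("<html><body><h1>Hello</h1></body></html>", "//h1"))
# ['Hello']
```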

How to get a list of URLs? Sitemap.xml!

As you can see, extracting information from one URL is easy, but for a single page you might as well copy and paste manually. The real benefit of this function shows when you want to extract information from a whole list of URLs.

If your goal is to import information from multiple pages on the same domain, the sitemap.xml file is here to help.

There are a bunch of tools and guides on how to get a list of all URLs from a sitemap.xml file. I’m sure you can find an easy one, but I’d like to use the IMPORTXML function again because it’s simple, right?

Knowing XPath, it took just ten seconds to write a formula to import all URLs:

=IMPORTXML("https://moz.com/blog-sitemap.xml", "//url/loc")

The formula above would work for many sitemap files, but this time, I got a nice error message:

Error: Imported content is empty.

I was quite sure the XPath was correct, so I did a little research. Long story short: a namespace is declared at the beginning of the sitemap, so you can’t request //url/loc directly.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

If you’re interested in a detailed explanation, read the post by Ben Ellwood, who wrote about this first. If not, just use the XPath expression below:

//*[local-name()='url']/*[local-name()='loc']
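The same namespace issue bites any XPath tool. As a sketch, here is how Python’s standard library handles it, by declaring the namespace explicitly instead of using local-name() (the sitemap snippet is a minimal made-up example):

```python
# Extracting //url/loc from a namespaced sitemap with ElementTree:
# register the sitemap namespace and use a prefixed path.
import xml.etree.ElementTree as ET

sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-1</loc></url>
  <url><loc>https://example.com/post-2</loc></url>
</urlset>
"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)

# "sm:url/sm:loc" is the namespaced equivalent of //url/loc here
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)  # ['https://example.com/post-1', 'https://example.com/post-2']
```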

Please note that I don’t scrape Moz’s sitemap in the spreadsheet to which I link at the end of this post. The spreadsheet should be considered as an example and not as a final solution.

Google Docs Caching

Google caches the results for about two hours, so to force fresh results you need to either add a fragment identifier (#) to the URL or simply wait a while.

  • Original URL: https://moz.com/blog-sitemap.xml
  • Modified URL: https://moz.com/blog-sitemap.xml#test1
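If you refresh often, you can generate a unique fragment each time instead of inventing one by hand. A minimal sketch (the function name is mine, not part of any API):

```python
# Append a timestamp fragment so Google Sheets sees a "new" URL;
# servers ignore the fragment, but the cache key changes.
import time

def bust_cache(url):
    return f"{url}#{int(time.time())}"

fresh = bust_cache("https://moz.com/blog-sitemap.xml")
```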

Conclusion

From the spreadsheet I just created, and with a little help from pivot tables, I can see which blog posts had the highest engagement (number of comments), find out which topics the SEO community is most eager to discuss, identify the most popular authors, and so on. What data you import, and why, is up to you.

All the functions in the spreadsheet must be adjusted to your needs before use because every site has a different structure. I just wanted to show you an example, because I believe every SEO magician should be familiar with this function.

IMPORTXML is a powerful function, and I’m only scratching the surface with my example. I’d love to hear about some cool ways and ideas on how to use this function.

UPDATE: If you want to know more, there is a great guide to ImportXML from Distilled.

Open the Sheet

I'm an SEO Strategist who enjoys technical aspects of SEO and web analytics because you can't optimize anything if you don't measure it.
  • Akshay

    I am able to fetch data from multiple URLs listed in a single column in Google Sheets using the importhtml/importxml formula. However, when I drag the column with the formula, it overwrites the existing data of the previous results. The error that I get is: Array result was not expanded because it would overwrite data in C4.

    Here is a live example to test:: https://docs.google.com/spr

    Sheet1 contains the formula and error I see. (cell – C3,C4,C5)

    Sheet2 is the desired output using importxml/importhtml formula.

    How can I obtain the desired result for 100+ URLs using a formula? Thank you in advance.

    • http://rozhon.com/ Robin Rozhon

      Hi Akshay,

      I have an idea how to do it. It’s not going to be the most elegant solution, but it should work. I drafted the solution in your sheet.

      1) “Solution – Step 1” takes all your data from Sheet 1 (the VLOOKUP formula in column B is important!)
      2) “Solution – Step 1” => Add a number (ID) to each row (column C). In your example, the IDs will be 1, 2, 3, 4, 5 because you have only 5 URLs.
      3) Because your IMPORTHTML formula returns a table with 13 rows (you need 13 empty rows), repeat the ID 13 times. (See column C in “Solution – Step 1”)
      4) Sort the sheet by column C, A->Z (“Solution – Step 2”)
      5) Now, all the imported data should be visible
      6) Delete column C

      Let me know if it works for you. If not, add all names and URLs to Sheet 1, I’ll do it for you.
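      The repeat-and-sort trick above can be illustrated outside the sheet. A minimal Python sketch with made-up URLs, assuming each import returns 13 rows:

```python
# Each source row gets an ID repeated once per result row; a stable
# sort by that ID leaves 13 slots under each original formula row.
ROWS_PER_IMPORT = 13
urls = ["url-1", "url-2", "url-3"]  # hypothetical stand-ins

rows = []
for i, url in enumerate(urls, start=1):
    rows.append((i, url))  # the row holding the import formula
    # 12 blank spacer rows sharing the same ID
    rows.extend((i, "") for _ in range(ROWS_PER_IMPORT - 1))

# stable sort by ID keeps each formula row first in its group of 13
rows.sort(key=lambda r: r[0])
```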

  • pipoulito

    Hi,
    Thanks a lot, but it seems it does not work anymore with the new Google Spreadsheets, does it?
    thanks

    • http://rozhon.com/ Robin Rozhon

      Thanks for the heads up. Indeed, some of the IMPORTXML functions in the example sheet didn’t work because Moz has made some changes to their source code.

      I updated the XPaths in the sheet but the functions now return “Error: Loading Data…”. I tried the identical functions in a new spreadsheet and it worked so the functions are correct.

      It seems like the sheet hit a usage limit. I know that Google used to have some limits but they announced that those limits were removed a couple of years ago. I’ll do some digging to find more info.