Set Up an Ads.txt Web Crawler

This post is a step-by-step walkthrough of how to start using Neal Richter's ads.txt web crawler, a Python script posted in the official IAB Tech Lab git repository. You might know him as the former CTO of Rubicon Project or the current CTO of Rakuten, but he was also a key contributor to the ads.txt working group.

Getting this script working ended up being a great way to learn more about Python and web scraping, but my main goal was to compile the data necessary to analyze publisher adoption of ads.txt since the project was released in June of this year. For more on ads.txt, or to read my analysis and download my data on the state of publisher adoption, head over to my State of Ads.txt post.

What the ads.txt web crawler does

The script takes two inputs: a text file of domains to crawl and a database file to write the parsed output to.  Once you specify a list of domains, the script appends ‘/ads.txt’ to each and writes the resulting URLs to a temporary CSV file.  It then loops through each record in that CSV file, formats it into a request, and uses Python’s Requests library to execute the call.
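
To make that flow concrete, here is a minimal sketch of the idea in Python (not Neal's actual code, and the file names are just placeholders):

import csv
import requests

# Turn a plain text file of domains into a temporary CSV of ads.txt URLs.
with open('target_domains.txt') as src, open('tmp_ads_txt_urls.csv', 'w', newline='') as tmp:
    writer = csv.writer(tmp)
    for line in src:
        domain = line.strip()
        if domain:
            writer.writerow(['http://' + domain + '/ads.txt'])

# Loop through each record and execute the call with the Requests library.
with open('tmp_ads_txt_urls.csv', newline='') as tmp:
    for row in csv.reader(tmp):
        url = row[0]
        response = requests.get(url, timeout=5)
        print(url, response.status_code)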

Next, the script does some basic error handling.  It will time out a request if the host doesn’t respond within a few seconds, and it will log an error if the page doesn’t look like an ads.txt file (such as a 404 page), if the request gets stuck in a redirect loop, or if other unexpected things happen.
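
Roughly speaking, those checks map onto the exception types that the Requests library raises. Here is one way you might write them yourself (a sketch, not the script's exact logic; the content-type check in particular is my own assumption about what "looks like an ads.txt file"):

import requests

def fetch_ads_txt(url):
    # Return (text, error) for a single ads.txt URL.
    try:
        # Give up if the host doesn't respond within a few seconds.
        r = requests.get(url, timeout=5, allow_redirects=True)
    except requests.exceptions.Timeout:
        return None, 'timed out'
    except requests.exceptions.TooManyRedirects:
        return None, 'too many redirects'
    except requests.exceptions.RequestException as exc:
        return None, 'request failed: %s' % exc
    # A 404 response or an HTML landing page doesn't look like an ads.txt file.
    if r.status_code != 200 or 'text/plain' not in r.headers.get('Content-Type', ''):
        return None, 'does not look like an ads.txt file'
    return r.text, None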

If the page looks like an ads.txt file, the script then parses the file for the expected values – domain, exchange, account ID, type, tag ID, and comment – and logs those to the specified database.
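
A line in an ads.txt file looks something like 'adexchange.com, 12345, DIRECT, abc123' with an optional trailing '# comment', so the parsing step mostly boils down to splitting on commas. A bare-bones version might look like this (again a sketch, with field names of my own choosing rather than the script's actual schema):

def parse_ads_txt_record(site_domain, line):
    # Split off an inline comment, if present.
    data, _, comment = line.partition('#')
    fields = [f.strip() for f in data.split(',')]
    if len(fields) < 3:
        return None  # blank line, comment-only line, or something else we can't use
    return {
        'site_domain': site_domain,                      # the publisher domain that was crawled
        'exchange_domain': fields[0],                    # the ad system's domain
        'seller_account_id': fields[1],                  # the publisher's account ID on that system
        'account_type': fields[2],                       # DIRECT or RESELLER
        'tag_id': fields[3] if len(fields) > 3 else '',  # certification authority ID, if present
        'comment': comment.strip(),
    }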

What the ads.txt web crawler doesn’t do

Neal is pretty clear that this script is intended more as an example than a full-fledged crawler.  For one, it runs pretty slowly because it processes one domain at a time rather than many in parallel.  It also runs on your laptop rather than a production server, which would add bandwidth and speed.  And it leaves something to be desired on error handling, both when crawling domains and when writing the output to the database.

I came closer to chucking my laptop out the window than I’d care to admit while trying to get around UnicodeErrors, newline characters, null bytes, and other annoying, technically detailed nuances that made the script choke on my domain files.  Finally, the database is also just sitting on your laptop, so it won’t scale forever, even though the files involved are typically small, even with tens of thousands of records.

All that said, I’m not a developer by trade and I was able to figure it out, even if it was a bit painful at times.  Hopefully this post will help others do the same.

How to get your ads.txt web crawler running

First things first: before you try to run this script, you need to have a few things already in place on your machine. If you don’t have these pieces yet it’ll be a bit of a chore, but it’s a great excuse to get them set up if you want to do more technical projects in the future.

Python

Python is a programming language just like Java or C++.  Many say that Python is one of the easiest languages to learn because the syntax is closer to plain English and therefore faster to write and simpler to read.  In my exceedingly brief and limited experience trying to learn how to code, I would agree.  The other nice thing about Python is that it comes with a huge set of libraries and packages – essentially shortcuts – that greatly simplify the coding process.  Don’t worry, you won’t be writing any Python to get the crawler running, but you need the code that forms all these shortcuts installed on your machine because Neal’s script relies on a bunch of them.

You can download Python here: https://www.python.org/downloads/

You can check to see if you already have Python by typing:

$python --version

Python HTTP Requests Library

The Requests library is a collection of the aforementioned shortcuts, written for Python, that allows your script to make HTTP requests.  It’s the piece that lets you tell your script to “go call domain.com/ads.txt”.

You can download the Python Requests library here: http://requests.readthedocs.io/en/master/, or, if you have pip installed, just type:

$pip install requests
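
Once it’s installed, you can sanity-check it from a Python prompt in a couple of lines (example.com is just a stand-in domain):

import requests

# Fetch one URL and print the HTTP status code to confirm Requests works.
r = requests.get('http://example.com/ads.txt', timeout=5)
print(r.status_code)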

SQLite3

SQLite3 is the most famous piece of software in the world you’ve never heard of before now.  It lets you create small databases on your machine, which you can then query with basic SQL to explore the data. It doesn’t require a server, doesn’t require any real configuration, and is completely free.  The project homepage says there are billions of deployments, and it appears to be used by every major tech company out there.  You need SQLite3 so you have a place to store the parsed ads.txt data.

This page will show you how to install SQLite3: https://www.tutorialspoint.com/sqlite/sqlite_installation.htm
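
If you want to confirm the install and see how little setup SQLite needs, Python’s built-in sqlite3 module can create a throwaway database in a few lines (the table and columns here are purely illustrative, not the crawler’s actual schema):

import sqlite3

# Create (or open) a local database file, add a table, insert a row, and query it back.
conn = sqlite3.connect('scratch.db')
conn.execute('CREATE TABLE IF NOT EXISTS demo (site_domain TEXT, exchange_domain TEXT)')
conn.execute("INSERT INTO demo VALUES ('example.com', 'adexchange.com')")
conn.commit()
print(conn.execute('SELECT * FROM demo').fetchall())
conn.close()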

SQLite Manager (Firefox Plugin)

Screenshot: ads.txt data stored in a SQLite3 database, viewed using the SQLite Manager extension.

This one is optional, but I found it invaluable for exploring my data.  It’s basically a user interface you can put on top of your SQLite database to quickly run queries, download results to CSV, and so forth. Mostly it made it easy to validate that records were being added while the script was running, and to remove some duplicate domains and records with funky errors.

You can download SQLite Manager here: https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/. For now, it is only available for Firefox.

Get crawling!

  1. Get the example code from IAB Tech Lab: https://github.com/InteractiveAdvertisingBureau/adstxtcrawler.  If you are familiar with git, you can simply run
    1. git clone [email protected]:InteractiveAdvertisingBureau/adstxtcrawler.git
    2. I worked from jhpacker’s pull request due to the improved error handling in his code, but ended up making some changes of my own as well: handling UnicodeErrors when writing to the database, fixing annoying newline character errors that caused the script to choke, and adding better error logging that shows which URL hit the error in the log.
    3. Work from my pull request here: https://github.com/InteractiveAdvertisingBureau/adstxtcrawler/pull/5
  2. Create your database
    1. You need a place to store your parsed ads.txt data, so create a table by typing
    2. $sqlite3 adstxt.db < adstxt_crawler.sql
  3. Create a list of domains you want to crawl
    1. In a text editor, create a list of domains you want to crawl, one per line. My recommendation is to list them as ‘domain.com’ without a ‘www.’.  It doesn’t matter which convention you pick, but stay consistent, because if you mix and match you will create duplicate records in your database, which are annoying (see the cleanup sketch after this list).
    2. Save your file in the same directory as your adstxt-crawler code
  4. Try running it!
    1. In your terminal, navigate to your adstxt-crawler directory and execute the below command, referencing the file you saved above as your target file
    2. ./adstxt_crawler.py -t target_domains.txt -d adstxt.db
  5. It failed!
    1. I got a permissions error
      1. I had the same thing happen, and it’s a common issue to encounter. In this case you simply have to give the web crawler file (adstxt_crawler.py) execute permission so you can run it directly.  Execute the command below and you should be able to resolve the issue.
      2. $chmod +x adstxt_crawler.py
    2. I got a UnicodeEncodeError
      1. This was perhaps the most annoying thing I dealt with while trying to get the crawler to run.  Essentially, the script can fail if it hits a special character, like an accent mark, anywhere in an ads.txt file.  A good developer would have figured out how to handle and resolve this issue properly, but because I’m not one of those, I just found a way to ignore those characters and skip the offending domains.  If you are having this issue as well, I recommend using my pull request’s version of adstxt_crawler.py instead.
    3. Look at the log file, which lives in the same folder as the script.  You’ll see a file named adstxt_crawler.log.  If you open it you’ll see a list of domains that failed and the reason each one failed.  You can at least use this information to separate domains that are fine but got skipped from domains that are actually causing an issue.
  6. It worked!
    1. Congrats!  Now explore your results using Terminal or the SQLite Manager plugin.
      1. In terminal, execute the below command:
        1. $echo "select * from adstxt;" | sqlite3 adstxt.db
      2. In SQLite Manager, execute the below command:
        1. SELECT * from adstxt;
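
On the point about keeping your domain list consistent (step 3 above), here is the kind of cleanup I mean: a small, hypothetical snippet that strips ‘www.’ prefixes and drops duplicates before you save the target file (file names are placeholders):

# Normalize a raw domain list so 'www.' variants and repeats don't
# create duplicate records in the database later.
seen = set()
with open('raw_domains.txt') as src, open('target_domains.txt', 'w') as dst:
    for line in src:
        domain = line.strip().lower()
        if domain.startswith('www.'):
            domain = domain[len('www.'):]
        if domain and domain not in seen:
            seen.add(domain)
            dst.write(domain + '\n')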

Best practices and final thoughts

I was able to crawl roughly 15,000 domains this way over a few hours, though I broke my domains up into files of about 1,000 apiece, which seemed to help isolate errors.  That said, I was able to crawl the entire 3,500+ domain list posted on the ads.txt git repository in one run, which took about 45 minutes.
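
If you want to split a big list the same way, a few lines of Python will chunk it for you (a sketch; the batch size and file names are arbitrary):

# Break one large domain file into batches of roughly 1,000 domains each,
# so a single bad batch doesn't stall the whole crawl.
BATCH_SIZE = 1000
with open('all_domains.txt') as f:
    domains = [line.strip() for line in f if line.strip()]

for i in range(0, len(domains), BATCH_SIZE):
    batch_number = i // BATCH_SIZE + 1
    with open('target_domains_%d.txt' % batch_number, 'w') as out:
        out.write('\n'.join(domains[i:i + BATCH_SIZE]) + '\n')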

If you’re interested in ads.txt beyond this, please read my analysis of publisher adoption here.  And for more on web scraping with Python, I’ve found Hartley Brody’s post here informative and easy to follow.

Best of luck getting your own crawler working!

 

Comments

  1. Even using your pull request’s script, I still hit the occasional domain that had null bytes (e.g. kizi.com/ads.txt, which had an ads.txt file, and yardbarker.com, which did not). I made one small change that seems to have worked for both cases.

    tmp_csv_file.write(r.text.replace('\x00', '').encode('ascii', 'ignore').decode('ascii'))

  2. Awesome – thanks Jeff! Those unicode / null byte errors are quite frustrating – anything to get around them so I can run longer crawler files is most welcome!

  3. Hi Ben, thanks for the detailed article.

    The script parses the target_domains file fine, but for some reason the results are not persisted to the SQLite db.

    Did you encounter the same issue before?

    Many thanks.

  4. Hi WR –

    I didn’t have that problem actually – did you verify your table was in fact created, and did you point the -d parameter in your command at the right database file?

  5. Thanks Ben for the nice article!

    @ W.R, I had the exact same issue working with Neal’s original version (everything seemed to work fine but the adstxt table stayed empty). So I used jhpacker’s one and it worked.
    Hope this helps.
    Chris.C

  6. Thanks for your reply Ben.
    I created the database using:
    $sqlite3 adstxt.db < adstxt_crawler.sql
    I also set the correct filename after the -d parameter.
    But after crawling and checking that everything is alright with the logs, the adstxt table is still empty.
    I will keep trying, and will hopefully post the solution in case some readers have the same issue.

  7. Hi,

    Thank you so much for posting this article – it was immensely helpful. Going through your instructions, I literally got my crawler up and running in about 20 minutes (I already had Python and SQL installed). The rest of the site is awesome – I frequently recommend it to others at my agency and to some clients as well. Thanks so much!

  8. Thanks for posting this guide, it was really helpful. As of now, the database output gives me a very large list with duplicates. For example, in the example given you get 33 entries for businessinsider.com, one for every line in the site’s ads.txt.

    How can I adjust the db so instead of getting each DIRECT or RESELLER line, I just get the domain that was found to have ads.txt?

  9. I think it might be easier to just write a different SQL query against your database; you could do something like 'select site_domain from adstxt group by site_domain', for instance. The data is trivial in size at this point, so I think it makes more sense to filter the full data rather than collect it differently. It’s also a much simpler solution.
