In my last article I introduced the idea of having robots.txt be a dynamic document, rather than a static text file. The reason for this was to allow it to respond in real-time to the particulars of the robot that was making the request.
Here are the five cases I laid out for robot behavior when it comes to processing robots.txt:
1. The spider doesn’t even bother to check for robots.txt at all.
2. The spider checks for robots.txt, and doesn’t crawl prohibited areas. (Good bot! Here, have a cookie.)
3. The spider reads robots.txt, scans for ‘allow’ stanzas [1] that apply to other spiders, and then masquerades as those spiders in order to access the protected areas.
4. The spider checks for robots.txt, but doesn’t comply with the restrictions.
5. The spider reads robots.txt and explicitly tries to scan the prohibited areas.
(I’ve reordered this slightly to make more sense.)
The first two cases are not problems: in the first, robots.txt isn’t even involved, and in the second it’s working properly, so we don’t have to be concerned with a misbehaving spider.
It’s the next three that we need to address.
At the time robots.txt is being processed, there is no way of telling which of these five cases will apply. For this reason, robots.txt merely checks the ID, as it were, of the spider making the request, and records it so that subsequent requests can be handled intelligently.
We can handle case number three by emitting stanzas that only apply to the robot making the request. That way, there are no other robots mentioned whose permissions it can record and later abuse.
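To make that concrete, here is a minimal sketch of a dynamic robots.txt handler. The Flask framework, the SPIDER_RULES table, and all of the rule values are my own illustrative assumptions, not the original script; the point is only that the response contains a single generic stanza covering the robot that asked, so it never sees another robot's permissions.

```python
# Minimal sketch, not the original script: Flask and the rule table
# below are assumptions chosen for illustration.
from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical per-bot rules; unlisted bots get the default stanza.
SPIDER_RULES = {
    "Googlebot": ["Disallow: /private/"],
    "BadBot": ["Disallow: /"],
}
DEFAULT_RULES = ["Disallow: /private/", "Disallow: /cgi-bin/"]

@app.route("/robots.txt")
def robots_txt():
    ua = request.headers.get("User-Agent", "")
    # Pick the first rule set whose name appears in the User-Agent string.
    rules = next(
        (r for name, r in SPIDER_RULES.items() if name in ua),
        DEFAULT_RULES,
    )
    # Emit one generic stanza containing only this robot's rules, so no
    # other robot's permissions are ever disclosed to it.
    body = "User-agent: *\n" + "\n".join(rules) + "\n"
    return Response(body, mimetype="text/plain")
```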
Now that robots.txt is actually a dynamic document, let’s put it to work and actually do something with it.
For starters, and for performance, I use a MySQL database to record bot activity and access rules. Let’s begin by having it record the particulars of the current request. Here is the first MySQL table I use for that:
mysql> explain client;
+---------+---------+------+-----+---------+----------------+
| Field   | Type    | Null | Key | Default | Extra          |
+---------+---------+------+-----+---------+----------------+
| xid     | int(11) | NO   | PRI | NULL    | auto_increment |
| client  | blob    | YES  | UNI | NULL    |                |
| comment | text    | YES  |     | NULL    |                |
+---------+---------+------+-----+---------+----------------+
The client table simply records each user agent that accesses the robots.txt file, and assigns it a unique identifying number. The first thing our dynamic robots.txt script does is look up the current client in this table, and either determine its unique number or create an entry for it if it doesn’t already exist.
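In code, that lookup-or-create step might look something like the sketch below. The MySQLdb driver and the function name are my assumptions; the table and column names come from the EXPLAIN output above.

```python
# Sketch of the lookup-or-create step (assumes the MySQLdb driver).
import MySQLdb

def client_id(db, user_agent):
    """Return the unique number for this user agent, creating it if new."""
    cur = db.cursor()
    cur.execute("SELECT xid FROM client WHERE client = %s", (user_agent,))
    row = cur.fetchone()
    if row:
        return row[0]
    # Not seen before: record it and return the auto_increment id.
    cur.execute("INSERT INTO client (client) VALUES (%s)", (user_agent,))
    db.commit()
    return cur.lastrowid
```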
Now we have a record that the client has accessed the robots.txt file, whether just now or in the past. Let’s do something with the information about the current request. Here’s a MySQL table for that:
mysql> explain access;
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| xid            | int(11)     | NO   | PRI | NULL    | auto_increment |
| client         | int(11)     | YES  | MUL | NULL    |                |
| ipa            | varchar(16) | YES  |     | NULL    |                |
| total_requests | int(11)     | YES  |     | 0       |                |
| last_request   | datetime    | YES  |     | NULL    |                |
+----------------+-------------+------+-----+---------+----------------+
The key fields here are ‘client’ and ‘ipa’ (IP address). Together they let us determine how often and how recently a particular client has accessed us from a particular IP address. This may be useful in the future if, for instance, we decide to block access for all clients from that particular IP address. We used the unique number for the client, rather than its full name, to save space and improve performance.
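The bookkeeping for this table could then be a simple update-or-insert, as in this sketch (again assuming MySQLdb; record_access and its arguments are illustrative names, not the original code):

```python
# Sketch: bump the counter for this (client, ipa) pair, or create the row.
def record_access(db, cid, ip):
    cur = db.cursor()
    cur.execute(
        "UPDATE access SET total_requests = total_requests + 1, "
        "last_request = NOW() WHERE client = %s AND ipa = %s",
        (cid, ip),
    )
    if cur.rowcount == 0:
        # First robots.txt request from this client at this address.
        cur.execute(
            "INSERT INTO access (client, ipa, total_requests, last_request) "
            "VALUES (%s, %s, 1, NOW())",
            (cid, ip),
        )
    db.commit()
```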
So we now have a robots.txt file that is actually a script and that:

- makes a record of every client that accesses it, and
- makes a record of where the request came from (the IP address) and how many times requests for robots.txt have been made from that address by that user agent.
This is useful stuff, but so far it’s just record keeping; we’re not yet taking any action based on the information gathered. To do that, still within the role of the robots.txt file, we need some rules pertaining to each client.
I feel another MySQL table coming on…
[1] The original RES didn’t support ‘allow’ stanzas, and not all RES-compliant bots recognize them. However, the basic issue is the same even for ‘disallow’ stanzas – a bot with evil intentions can conceivably change its access by pretending to be one of those for which you have explicit rules.
This article was first published on EnterpriseITPlanet.com.