I, the Web Robot, help you search the internet

You have probably heard of web robots. If you picture them as they appear in science-fiction movies, forget that image. A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document and then recursively retrieving all the documents it references. Normal Web browsers are not robots, because they are operated by a human and don’t automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus. This is not the case: a robot simply visits sites by requesting documents from them.
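To make the "retrieve a document, then retrieve everything it references" idea concrete, here is a minimal sketch in Python. It is not how any particular search engine works. The `crawl` function, the `fetch` callable and the three-page `PAGES` site are all made up for illustration, and the traversal uses a queue instead of literal recursion so it cannot loop forever on pages that link back to each other.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit start_url, then every document reachable from it.

    `fetch` is a hypothetical callable returning the HTML for a URL,
    or None when the document cannot be retrieved.
    """
    seen = {start_url}      # URLs already queued, to avoid revisiting
    queue = [start_url]
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A made-up three-page site standing in for the real Web.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}

order = crawl("http://example.com/", PAGES.get)
print(order)
```

A real robot would replace `PAGES.get` with an HTTP request and would add delays between requests so that it does not overload the servers it visits.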

Are web robots good or bad?

There are a few reasons people believe robots are bad for the Web:

  • Certain robot implementations can overload networks and servers, and have done so in the past. This happens especially with people who are just starting to write robots; these days there is enough information available on robots to prevent some of these mistakes.
  • Robots are operated by humans, who make mistakes in configuration, or simply don’t consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with harmful effects.
  • Web-wide indexing robots build a central database of documents, which doesn’t scale too well to millions of documents on millions of sites.

If professionally designed and operated, robots are good: they make search possible by bringing relevant web pages to the attention of readers looking for the kind of information those pages offer.

Nevertheless, there are situations in which an author does not want a web page to be indexed. In such cases robots are unwelcome, because they will index the page against the author’s will.

Can we prevent web robots from visiting a web page?

No, we can’t keep them away from our pages, but there is a weapon that can keep well-behaved robots from indexing them. This weapon is a file called robots.txt, which specifies an access policy for robots. The file must be accessible via HTTP on the local URL “/robots.txt”. It consists of records that name a robot (User-agent) and list the URL paths that robot should not visit (Disallow).

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval. The protocol is purely advisory, however. It relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy.
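As an illustration of how a cooperative robot might consult the policy before requesting a page, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` name and the example.com URLs are invented for this example. Nothing here forces a robot to run such a check, which is exactly why the protocol is only advisory.

```python
from urllib.robotparser import RobotFileParser

# A made-up policy: every robot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the policy lets `agent` retrieve `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/notes.html"))
```

A robot working against a live site would usually call `set_url()` and `read()` on the parser to download /robots.txt, instead of parsing a string as done here.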

Making a robots.txt file for your website is very simple once you know the syntax. There are also so-called robots.txt checkers, which parse your file and let you know if it is malformed.
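To give an idea of what such a checker does, here is a deliberately simplified sketch in Python. It is an assumption about how a checker could work, not the behaviour of any specific tool. The `check_robots_txt` function and the list of accepted field names are hypothetical, and a real checker would test many more rules.

```python
# Field names accepted by this sketch. User-agent and Disallow come from
# the original standard; the others are common extensions.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def check_robots_txt(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("disallow", "allow") and not seen_user_agent:
            problems.append((number, f"'{field}' before any User-agent line"))
    return problems


# A deliberately broken file: a rule before any User-agent line,
# a misspelled field, and a line without a colon.
sample = (
    "Disallow: /tmp/\n"
    "User-agent: *\n"
    "Disalow: /private/\n"
    "Disallow /cgi-bin/\n"
)
report = check_robots_txt(sample)
for number, message in report:
    print(f"line {number}: {message}")
```

The sketch reports the line number of each problem so that the file's author can find and fix it quickly.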

If you want to know what a robots meta tag is, take a quick look at this page. There you’ll also find a link to a list of web robots and crawlers.


One Comment

  1. apoorva
    Posted December 21, 2007 at 11:47 am | Permalink

    Please give me some ideas on how to build the following robot. I will use an ATmega16 microcontroller, with an IR sensor as the block detector. Please suggest an algorithm or help me build it.

    Task:
    An autonomous machine is required to detect total number of landmines present in the entire arena. Landmines are represented in the arena with black squares. The arena may consist of landmines of different size. The machine should be able to display total number of landmines, and also total number of black squares detected at the end of its run.

    Maze Specifications:
    The Maze will be a square of side 305 cm.
    The Maze will consist of 10 X 10 multiples of 25 cm X 25 cm unit square. The pathway between these squares will be 5 cm wide. The maze will NOT be walled at any of its sides.
    White line Grid will be present throughout the arena.
    The floor of the maze will be constructed from plywood and lines will be painted White (non glossy paint). Each mine will be painted black (non glossy paint). (The brand and color numbers of paints shall be declared).
    Mines will be either 1×1 or 2×2 square matrix and multiple squares which touch each other will be considered as a single entity, painted black in color. Note that you have to make separate counts of the two types of mines.
    No two mines will share a common side.
    The start of the maze shall be located at any of the four 25 cm X 25 cm square corners of the maze.
    The dimensions of the maze shall be accurate to within 2% or 5cm.
