mapping the web infome

The Infome Software

The system lets the user set up a web crawler, i.e. a process that uses the HTTP protocol to request data on the internet, and to recombine, transduce and map the collected data.

Crawling:

The interface has various settings for how the crawler should behave and what it should collect.

The crawler can start in one of three ways: on a specified web page; with a search for something on a search engine; or on a random page. The random page is chosen by taking a random word from the news, searching for that word on a search engine, and then randomly selecting one of the returned links.
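The random start can be sketched roughly as follows. This is only an illustration of the three steps described above, not the software's own code: `random_start_page`, `news_words` and `search` are hypothetical names, and the search engine is passed in as a plain function so that no particular search service is assumed.

```python
import random


def random_start_page(news_words, search, rng=random):
    """Pick a start page: random news word -> search -> random result link.

    news_words: list of words taken from the news (how they are gathered
                is left open here).
    search:     hypothetical function mapping a word to a list of links.
    """
    word = rng.choice(news_words)   # step 1: a random word from the news
    links = search(word)            # step 2: search for that word
    return rng.choice(links)        # step 3: pick one returned link at random


# Stand-in for a real search engine, used only for illustration.
def fake_search(word):
    return [f"http://example.com/{word}/1", f"http://example.com/{word}/2"]


start = random_start_page(["election", "weather"], fake_search)
```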

The crawler can 'move' over the web in two different ways, which can be combined endlessly. It can move in 'layers', i.e. following all links on all pages, or it can dance around in a predefined pattern. It can be set to visit a page only once, or to visit a page every time it encounters a link to it; the two settings create very different results in the data. The data set resulting from many revisits contains repetitions that reflect the structure of the sites, revealing their topology.
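A minimal sketch of the 'layers' movement, assuming the web is given as an in-memory dictionary of page to links rather than fetched over HTTP. The function name `crawl_layers` and the `visit_once` switch are illustrative; they stand in for the interface settings described above.

```python
from collections import deque


def crawl_layers(links, start, max_depth, visit_once=True):
    """Return (page, depth) pairs in the order a layered crawl requests them."""
    requests = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        requests.append((page, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the last layer
        for target in links.get(page, []):
            if visit_once:
                if target in seen:
                    continue  # skip pages that were already queued
                seen.add(target)
            queue.append((target, depth + 1))
    return requests


# A tiny hypothetical site: three pages that link to each other.
web = {"a": ["b", "c"], "b": ["c", "a"], "c": ["a"]}
once = crawl_layers(web, "a", max_depth=2)
revisit = crawl_layers(web, "a", max_depth=2, visit_once=False)
```

With revisits enabled, page "a" shows up again at depth 2 once for every link pointing back to it, which is the kind of repetition that exposes how the pages are wired together.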

The crawler can collect data found in the HTML page or in the header. The header contains meta information, sent with the document, that tells the client (the web browser or, in this case, the crawler) things about the page: for example how large the page is, when it was created, or, with a 404 response, that it is no longer there.
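As a rough illustration of reading such values, the sketch below takes a status code and a dictionary of response headers and picks out the numeric parts. It assumes the standard `Content-Length` and `Last-Modified` headers are the ones of interest (the latter being the closest standard header to a creation time); `header_values` is a hypothetical helper, not part of the software.

```python
from email.utils import parsedate_to_datetime


def header_values(status, headers):
    """Pick numeric values out of a response's status code and headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    values = {"status": status, "missing": status == 404}
    if "content-length" in lowered:
        values["size"] = int(lowered["content-length"])  # size in bytes
    if "last-modified" in lowered:
        # HTTP dates are parsed into seconds since the Unix epoch.
        stamp = parsedate_to_datetime(lowered["last-modified"])
        values["modified"] = stamp.timestamp()
    return values


sample = header_values(
    200,
    {"Content-Length": "5120", "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"},
)
gone = header_values(404, {})
```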

The crawler can set and retrieve cookies and leave traces in the log file of the server it 'visits'.
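One way to picture the cookie handling, using Python's standard `http.cookies` module: a cookie received in a `Set-Cookie` header is stored and later sent back in a `Cookie` header. The helper names and the plain dictionary used as a cookie store are assumptions made for this sketch.

```python
from http.cookies import SimpleCookie


def remember_cookies(jar, set_cookie_header):
    """Store the name/value pairs from one Set-Cookie header in jar."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    for name, morsel in cookie.items():
        jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the Cookie header value to send with the next request."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


jar = remember_cookies({}, "session=abc123; Path=/")
outgoing = cookie_header(jar)
```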

Manifestations of the crawlers:

The collected data can be presented as HTML or as GIF images and image maps.

In one type of visualization, each link requested by the crawler (and in some cases the links from those pages) is represented by a line. The first page visited is a pixel in the middle. The elements around it are the links from that page. The next layer/circle represents the links found on the pages in the previous circle. Thus each circle represents a distance, in clicks, from the starting page. If the crawling was done by moving in patterns, the circles become de-centered.

The elements/links can be colored in different ways:

1. Header information: Response headers with numerical values (see the link to the HTTP specifications on the home page for an explanation of response headers), for example the date/time when the crawler visited the link [image] and the size of the page [image map]. These values are mapped onto a gray scale.

2. IP address: The first three octets are used as the RGB value, so 200.2.10.34 would be 200 red, 2 green and 10 blue. [image map]
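The two coloring schemes and the circular layout can be sketched as below. The IP-to-RGB rule follows the description directly; the linear gray-scale scaling and the even spacing of elements around each circle are assumptions, since the text does not say how the values or positions are computed. All function names are illustrative.

```python
import math


def ip_to_rgb(ip):
    """Use the first three octets of an IPv4 address as (red, green, blue)."""
    octets = [int(part) for part in ip.split(".")]
    return tuple(octets[:3])


def to_gray(value, lowest, highest):
    """Map a numeric header value linearly onto a 0-255 gray level (assumed scaling)."""
    if highest == lowest:
        return 0
    return round(255 * (value - lowest) / (highest - lowest))


def ring_position(depth, index, count, spacing=20.0):
    """Place element number index of count on the circle for a given click depth."""
    angle = 2 * math.pi * index / count
    radius = depth * spacing  # depth 0 is the starting page in the middle
    return (radius * math.cos(angle), radius * math.sin(angle))


color = ip_to_rgb("200.2.10.34")
```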

In another type of visualization, collected data is represented as simple pixels placed one after the other. [image] This one displays the background and font colors from all the pages visited by a crawler.
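A small sketch of the pixel strip idea, assuming the collected colors are available as six-digit hex strings such as "#336699" and that the pixels wrap into rows of a fixed width. Both assumptions, and the function names, are illustrative only.

```python
def hex_to_rgb(color):
    """Convert a six-digit hex color such as '#336699' to an (r, g, b) tuple."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def pixel_rows(colors, width):
    """Lay colors out one pixel after the other, wrapping every width pixels."""
    pixels = [hex_to_rgb(color) for color in colors]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


rows = pixel_rows(["#ffffff", "#000000", "#336699"], width=2)
```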

The data can also be used to produce plain HTML. [HTML page] [HTML page] [HTML page]
