The Infome Software
The system allows the user to set up a web crawler, i.e. a process that
uses the HTTP protocol to request data on the internet, and to
recombine, transduce and map the collected data.
Crawling:
The interface has various settings for how the crawler should behave
and what it should collect.
The crawler can start on a specified web page; by searching for
something on a search engine; or by selecting a random page. The random
page is chosen by taking a random word from the news, searching for that
word on a search engine, and then randomly selecting one of the returned
links.
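
The random-start procedure described above can be sketched as follows. This is only an illustration, not the software's actual code: the function name, the list of news words and the `search` callable are hypothetical stand-ins for whatever news source and search engine are used.

```python
import random
from typing import Callable, Sequence


def pick_random_start(
    news_words: Sequence[str],
    search: Callable[[str], Sequence[str]],
    rng: random.Random,
) -> str:
    """Random news word -> search for it -> random link from the results."""
    # Step 1: take a random word from the (assumed) list of news words.
    word = rng.choice(news_words)
    # Step 2: search for that word; `search` is a hypothetical callable
    # that returns the result links for a query.
    links = search(word)
    if not links:
        raise LookupError(f"search for {word!r} returned no links")
    # Step 3: randomly select one of the returned links.
    return rng.choice(links)
```

Passing the random generator in explicitly makes a run repeatable when a fixed seed is used.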
The crawler can 'move' over the web in two different ways that can be
endlessly combined: either in 'layers', i.e. following all links on all
pages, or by dancing around in a predefined pattern. It can be set to
visit a page only once, or to visit a page every time it encounters
a link to it, which creates very different results in the data. The data
set resulting from many revisits contains repetitions that reflect the
structure of the sites, revealing their topology.
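
The difference between visiting a page once and revisiting it on every link can be illustrated with a minimal sketch of a crawl in 'layers'. It assumes the link structure is already available as an in-memory dictionary (a real crawler would fetch each page over HTTP); the function name and parameters are hypothetical.

```python
from typing import Dict, List


def crawl_in_layers(
    links: Dict[str, List[str]],
    start: str,
    max_depth: int,
    visit_once: bool = True,
) -> List[List[str]]:
    """Return the pages visited, grouped by layer (clicks from the start)."""
    layers = [[start]]
    seen = {start}
    for _ in range(max_depth):
        next_layer = []
        for page in layers[-1]:
            for target in links.get(page, []):
                if visit_once:
                    # Skip pages that were already visited.
                    if target in seen:
                        continue
                    seen.add(target)
                # With visit_once=False every encountered link is followed,
                # so repeated pages show up again in the data.
                next_layer.append(target)
        if not next_layer:
            break
        layers.append(next_layer)
    return layers


site = {"a": ["b", "c"], "b": ["c", "a"], "c": ["a"]}
print(crawl_in_layers(site, "a", 2, visit_once=True))
# [['a'], ['b', 'c']]
print(crawl_in_layers(site, "a", 2, visit_once=False))
# [['a'], ['b', 'c'], ['c', 'a', 'a']]
```

In the second run the repeated entries mirror how the pages link back to each other, which is the kind of repetition that exposes the structure of a site.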
The crawler can collect data found in the HTML page or in the header.
The header contains meta information sent with the document that tells
the client (the web browser or in this case the crawler) things about
the page, for example how large the page is, when it was created, or
whether it is no longer there (producing a 404 message).
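
As a rough sketch of the kind of header data meant here, the function below pulls a few numerical values out of a response. The choice of fields is an assumption: `Content-Length` stands in for the page size and `Last-Modified` for the creation time, and the function name is hypothetical. It works on a status code and a plain header mapping, so it is independent of the HTTP library used to make the request.

```python
from email.utils import parsedate_to_datetime
from typing import Dict, Mapping, Optional


def numeric_header_values(
    status: int, headers: Mapping[str, str]
) -> Dict[str, Optional[float]]:
    """Extract numerical values from an HTTP status code and headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    size = lowered.get("content-length", "").strip()
    size_bytes = float(size) if size.isdigit() else None

    modified_epoch = None
    modified = lowered.get("last-modified")
    if modified:
        try:
            # HTTP dates use the e-mail date format, e.g.
            # "Wed, 21 Oct 2015 07:28:00 GMT".
            modified_epoch = parsedate_to_datetime(modified).timestamp()
        except (TypeError, ValueError):
            modified_epoch = None  # unparsable date: leave the value out

    return {
        "status": float(status),  # e.g. 404 if the page is no longer there
        "size_bytes": size_bytes,
        "modified_epoch": modified_epoch,
    }
```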
The crawler can set and retrieve cookies and leave traces in the log
file of the server it 'visits'.
Manifestations of the crawlers:
The data that is collected can be presented as HTML or as gif images
and image maps.
In one type of visualization each link requested by the crawler (and
in some cases the links from those pages) is represented by a line. The
first page visited is a pixel in the middle. The elements around it
are the links from that page. The next layer/circle represents the
links found on the pages in the previous circle. Thus
each circle represents a distance, in clicks, from the starting
page. If the crawling was done by moving in patterns, a de-centering of
the circles occurs. The elements/links can be colored in different
ways:
1. Header information: response headers with numerical values (see the
link to the HTTP specifications on the home page for an explanation of
response headers), for example the date/time when the crawler visited
the link [image] and the size of the page [image map]. These values are
mapped onto a gray scale.
2. IP address: the first three octets are used as the RGB value, so
200.2.10.34 would be 200 red, 2 green and 10 blue. [image map]
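
The two coloring schemes can be sketched as below. The IP mapping follows the description directly; the gray-scale mapping is an assumption (a simple linear scaling between the smallest and largest value), since the text does not say how values are scaled. Both function names are hypothetical.

```python
from typing import Tuple


def ip_to_rgb(ip: str) -> Tuple[int, int, int]:
    """Use the first three octets of an IPv4 address as red, green, blue."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    red, green, blue = octets[:3]  # the fourth octet is ignored
    return (red, green, blue)


def to_gray(value: float, lowest: float, highest: float) -> int:
    """Map a numerical header value onto a 0-255 gray level.

    Assumption: linear scaling between `lowest` and `highest`;
    values outside that range are clamped.
    """
    if highest <= lowest:
        return 0
    clamped = min(max(value, lowest), highest)
    return round(255 * (clamped - lowest) / (highest - lowest))


print(ip_to_rgb("200.2.10.34"))  # (200, 2, 10)
```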
In another type of visualization, collected data is represented as
simple pixels placed one after the other. [image]
This one displays the background and font colors of all the
pages visited by a crawler.
The data can also be used to produce plain HTML. [HTML page]
[HTML page] [HTML page]