Micromicon Blog

The Weblog of Micromicon


Building a bot, phase I

August 28th, 2008 by admin

Bots are a common part of how search engines gather information. When users recommend a site they are sometimes asked to enter information such as a Title, URL, Description and possibly some Keywords. Some search engines, however, simply require a URL, and the essential information is then retrieved by an automated mechanism: a bot.

Bots are ideal for this kind of repetitive activity: they take a work queue (a given list of URLs) and produce a set of information that can be passed to the next stage in the submission process. Despite the name, bots are essentially just software programs that run either at a scheduled time or in a loop, constantly checking the work queue.
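As a rough illustration of the work queue idea, here is a minimal Python sketch. The names process_queue, run_bot and fake_fetch are made up for this example and are not the actual Eggnchips code; fake_fetch stands in for whatever really retrieves the page details.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


def process_queue(queue: Deque[str],
                  fetch_info: Callable[[str], Dict[str, str]]) -> List[Dict[str, str]]:
    """Drain the work queue once and return one record per URL."""
    results = []
    while queue:
        url = queue.popleft()
        info = fetch_info(url)      # assumed to return title, description, etc.
        info["url"] = url
        results.append(info)
    return results


def run_bot(queue: Deque[str],
            fetch_info: Callable[[str], Dict[str, str]],
            handle: Callable[[Dict[str, str]], None],
            poll_seconds: float = 5.0,
            max_cycles: Optional[int] = None) -> None:
    """Check the queue in a loop; max_cycles=None means loop until stopped."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for record in process_queue(queue, fetch_info):
            handle(record)          # pass each record to the next stage
        cycles += 1
        time.sleep(poll_seconds)


def fake_fetch(url: str) -> Dict[str, str]:
    # Hypothetical stand-in for the real retrieval step.
    return {"title": "Title for " + url}
```

Calling process_queue once corresponds to a scheduled run, while run_bot corresponds to the looping mode, with a pause between checks so that an empty queue does not keep the processor busy.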

In addition to the usual information, a bot can periodically extract additional words from the body text itself by stripping out the HTML tags, removing stop words and building a word frequency table. A word frequency table is simply a list of words with the number of times each word appears in the given text.
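The three steps (strip tags, drop stop words, count) could be sketched in Python roughly as follows. The stop word list here is a tiny illustrative one, and TextExtractor and word_frequencies are names invented for this example; the word splitting is deliberately crude.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside script or style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def word_frequencies(html):
    """Return a Counter mapping each non-stop word to its number of appearances."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    words = re.findall(r"[a-z]+", text)   # letters only, so "isn't" splits in two
    return Counter(w for w in words if w not in STOP_WORDS)
```

Calling word_frequencies(page).most_common(10) would then give the ten most frequent remaining words, which is one possible source of extra keywords.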

Recently I have been working on a way of automatically extracting information for a given URL by fetching the HTML from a submitted URL, parsing the relevant information and using it as input to the submission process. Information is often placed in meta tags in the HEAD section of the HTML, but I have found this process so far to be somewhat hit and miss, as some sites include the information and others do not. For those that do not, an additional way needs to be identified to provide a suitable link title and description. It may be the case that this will always require some user intervention, but further analysis should identify more options.
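To make the hit and miss nature concrete, here is a hedged Python sketch that pulls the title and the description and keywords meta tags out of a page and reports which ones were missing. HeadInfoParser and extract_head_info are hypothetical names for this example; the HTML is passed in as a string, so fetching the page is left out.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Pick up <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.info = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.info[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.info["title"] += data


def extract_head_info(html):
    """Return (info, missing), where missing lists the fields the page did not supply."""
    parser = HeadInfoParser()
    parser.feed(html)
    parser.close()
    parser.info["title"] = parser.info["title"].strip()
    missing = [key for key, value in parser.info.items() if not value]
    return parser.info, missing
```

Anything that ends up in the missing list is a candidate for a fallback or for asking the user to fill it in.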

The bot also needs to be aware of how long it has been since a web page last changed, to ensure that valuable information or updates are not missed and that processing power isn’t wasted on a link that does not change very much. One way to address this is to produce a hash for a page and compare it with a previously stored hash.
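A minimal sketch of the compare-with-stored-hash idea in Python is shown below. It uses SHA-256 from the standard hashlib module purely as a convenient choice for the example (hashlib.md5 is used the same way), and the in-memory dictionary stored_hashes is an assumption standing in for wherever the bot would really keep its hashes.

```python
import hashlib


def page_hash(content: bytes) -> str:
    """Return a hex digest of the page content."""
    return hashlib.sha256(content).hexdigest()


def has_changed(url: str, content: bytes, stored_hashes: dict) -> bool:
    """Compare the new hash with the stored one, then remember the new hash.

    A URL that has never been seen before counts as changed.
    """
    new_hash = page_hash(content)
    changed = stored_hashes.get(url) != new_hash
    stored_hashes[url] = new_hash
    return changed
```

Only pages for which has_changed returns True would then need to go through the more expensive parsing steps again.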

A hash, simply put, is the result of a consistent algorithm applied to information. MD5 is a well-known hash mechanism that I may investigate in a later article. For now, a simple example to demonstrate a hash is to calculate a check digit by taking the letter values of a phrase, adding them together and then finding the remainder on division by a number, say 27 (allowing for 26 letters and the space). In the example, the first two phrases give the same answer, whereas the third phrase, with one letter different, gives a very different check digit.

[Image: check digit example]
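The check digit calculation can be sketched in a few lines of Python. The letter values are an assumption on my part (space = 0, a = 1 up to z = 26), and the phrases are made up for this sketch rather than taken from the original image.

```python
def check_digit(phrase: str, modulus: int = 27) -> int:
    """Sum the letter values of a phrase and return the remainder modulo 27.

    Assumed values: space = 0, a = 1, ..., z = 26. Case is ignored and
    any other character is skipped.
    """
    total = 0
    for ch in phrase.lower():
        if ch == " ":
            value = 0
        elif "a" <= ch <= "z":
            value = ord(ch) - ord("a") + 1
        else:
            continue  # outside the assumed 27-symbol alphabet
        total += value
    return total % modulus


print(check_digit("hello world"))  # 16
print(check_digit("hello world"))  # 16, the same input always gives the same result
print(check_digit("hello worle"))  # 17, one letter changed
```

Because this is a plain sum, swapping two letters around gives the same check digit, which is one reason a proper hash mechanism such as MD5 is a better fit for spotting page changes.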

I have a working program up and running. It does not run in bot mode yet, but it can be initiated manually. I still need to add the loop code to continually check the work queue for submitted URLs.


Taking a new theme for a ride

August 28th, 2008 by admin

Over here at Eggnchips we do like to fiddle about with technology so we are trying out a new Theme, the Mimbo Theme from Darren Hoyt, let us know [...] Continue Reading…

Search and stop words

August 28th, 2008 by admin

Whilst planning how to build a word and phrase vocabulary for use in the EggnChips project, we need to make a decision regarding the use of stop words.
Stop words [...] Continue Reading…

Eggnchips: Improving the search results

August 28th, 2008 by admin

When using a search engine, or even a link directory, from a user's point of view it almost seems that a simplistic, systematic method is being employed.
This method appears to [...] Continue Reading…

Search and URL Submission Mechanics

August 28th, 2008 by admin

The following diagram shows the components of the first phase of the Eggnchips search engine project. The two key functional components are URL submission and Keyword Searching.

The green areas [...] Continue Reading…

Eggnchips, components of search

August 28th, 2008 by admin

Currently the Eggnchips search engine is one large application. However, whilst putting together the basic search engine, it has become apparent that the engine can be divided into a [...] Continue Reading…

Eggnchips, Stage One of Phase One

August 28th, 2008 by admin

 
The first stage of the search engine is up and running with the ability to perform some simple keyword searches. The engine shows the basic screen (the idea here [...] Continue Reading…

Dynamic Information

August 28th, 2008 by admin

Information needed for inclusion in the search engine:
Master Categories: #cats#
Sub-Categories: #subcats#
Total Links: #url-count#
Total Links Pending: #url-pending#
Link Submissions Today: #url-pending-today#

Building the basics of search

August 28th, 2008 by admin

The EggnChips development is getting underway. The first part of the project is to build a simple keyword search engine.
In this project a form will be presented the [...] Continue Reading…

Exploring search using keywords

August 28th, 2008 by admin

To understand more about the way search works, and to gain some insight into how it could work in the future, it may be useful to talk about what [...] Continue Reading…
