
Search Down Memory Lane

Was reading John Battelle’s excellent book The Search (which I’ll blog about when I finish it) and was suddenly sent hurtling down Memory Lane.  John writes about the distant antecedents of today’s search engines:

“Enter Gerard Salton, a Harvard- and Cornell- based mathematician often called the father of digital search.  Salton was fascinated by the problem of digital information retrieval, and in the late 1960s developed SMART – Salton’s Magic Automatic Retriever of Text – or what might be considered the first digital search engine.  Salton introduced many of the seminal concepts commonly used in search today, including concept identification based on statistical weighting, and relevance algorithms based on feedback from queries…”

In the early 1960s I was an American History and Literature major at Harvard looking for a better way to earn tuition than bar-tending, snow shoveling and vacuuming – all of which I was doing.  I learned how to program – that’s another story – and, during my junior and senior years and the summer in between, worked for the late Gerry Salton on the NSF-funded SMART project.  I was the project manager in my senior year and wrote lots and lots of code, mostly in FORTRAN and IBM 7090 Assembler.  I even wrote some of the published reports to the NSF – the last time I was listed as the author of a book before hackoff.com.

Salton planned the SMART project to compare various methods of information retrieval.  We weren’t searching the Web then, of course; we were searching abstracts of scientific papers.  We had a bunch of them in machine-readable format because paper tape had just come into use to control linotype machines and we could get the paper tape converted to magnetic tape which the 7090 could read.  Our searches weren’t a practical way to find information both because we had a very small library of machine-readable documents and because each search involved passing the documents on tape through the computer – we had no disk drives.  But our grant was for comparing algorithms looking forward to the day when it would actually be feasible to use some of them.

The “seminal concepts” that Salton invented (or at least popularized) included using precision and recall as measurements of the efficacy of an information retrieval algorithm.  Precision is the percentage you get when you divide the number of relevant documents retrieved by the total number of documents retrieved.  Recall is the percentage of all the documents relevant to the query which are actually found in the particular search.  For any particular search algorithm, these two measurements usually trade off against each other.  At one extreme, if a search always returns every document in the collection, it will have 100% recall but lousy precision.  At the other, if it returns just one relevant document, it will have perfect precision but poor recall (assuming there are many relevant documents).
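In modern terms the arithmetic is trivial; here’s a rough Python sketch (not the SMART code, which was all FORTRAN and assembler) of both measurements for a single query:

    # Rough sketch only: precision and recall for one query, given which
    # retrieved documents are actually relevant to it.
    def precision_recall(retrieved_ids, relevant_ids):
        retrieved = set(retrieved_ids)
        relevant = set(relevant_ids)
        hits = len(retrieved & relevant)          # relevant documents we actually found
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Returning the whole collection gives perfect recall but terrible precision:
    # precision_recall(range(1000), [3, 7, 42])  ->  (0.003, 1.0)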

So, if algorithm A has BOTH better precision AND better recall than algorithm B (and doesn’t use inordinately more computing time), then algorithm A is the better algorithm.  Next contender, please.  It was our job to build the platform for testing algorithms and then program the algorithms to be tested.

We were very surprised by the results we got.  Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax (which is very hard to do).  WRONG!  Turns out that syntax was a waste of time; all that matters is semantics – the actual words used in the query and the documents – not how they relate to each other in a sentence.  Sometimes it was (and still is) useful to search for phrases as if they were words.  But you get that just by observing word order or how close words are to each other – not by trying to parse sentences.

So, after a while, we not only ignored syntax but also stripped off all the “ed”s and “ing”s and “s”s which identify parts of speech so that we could simply correlate what we called “word stems”.  Ignoring syntax means that you do exactly the wrong thing when someone queries: “I want to know about all the indigenous people in Alaska except Aleuts”.  Without syntax, you judge documents with the word “Aleut” to be a closer match to the query than documents without.  That’s why even modern search engines implement “Exclude” as one of those advanced functions (which no one ever uses).  Semantics rule!
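Suffix-stripping of that sort is easy to sketch in modern Python – this toy version is mine, illustrative only, and much cruder than the stemming rules we actually used:

    # Strip a few common English suffixes so "searching", "searched" and
    # "searches" all collapse to the stem "search".  Toy rules, not SMART's.
    SUFFIXES = ("ing", "ed", "es", "s")

    def stem(word):
        word = word.lower()
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    # stem("searching") -> "search", stem("indexes") -> "index", stem("Aleuts") -> "aleut"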

We keypunched all our programs on 80 column cards, then bound the decks in rubber bands and submitted them for execution. They came back many hours later.  Three turn-arounds in a 24-hour nerd day was doing good.  Time on the 7090, which was carefully tended in an air-conditioned room the size of a basketball court,  went for about $1000/hour in the days when a grand was a lot of money.  So we did a lot of desk checking.

Probably any PC which is in use today has more computing power than ALL of the computers that were in use in the early 1960s put together.

Note for fellow computer nerds: The IBM 7090 had 32,768 thirty-six bit words of memory (bytes weren’t invented until the IBM 360).  Obviously, this memory was shared by both the executing program and the data it was operating on.  Sometimes, to free up memory, we wrote the operating system out to the punch tape (the one holding card images that the IBM 1401 – an acolyte to the 7090 – would later punch onto cards).  If the program crashed before it could reload the operating system, the operators had to reboot the 7090 from cards and the entire batch of jobs could be trashed.  In theory you could be punished for this, but we badly needed the memory.

Note for fellow math nerds: Don’t know how Google does this but, in our best semantic algorithms, we considered each document to be a point on the surface of an n-dimensional unit sphere.  The coordinates of the point were defined by a vector consisting of the frequencies with which each word stem (or phrase) appeared in the document.  The closer two documents were (and a query is just another document), the more likely they were to be related to each other.  What we actually computed was the cosine of the angle between the two vectors.  Fortunately, the 7090 had good floating point hardware.
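For anyone who wants to see the arithmetic, here’s a rough modern sketch in Python – the tokenizing and names are mine, not SMART’s, and real stemming and phrase handling are left out:

    import math
    from collections import Counter

    # Each document (and a query is just a short document) becomes a vector of
    # word frequencies; the score is the cosine of the angle between two vectors.
    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(text_a, text_b):
        a, b = vectorize(text_a), vectorize(text_b)
        dot = sum(a[w] * b[w] for w in a)                      # inner product
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # cosine("indigenous people of alaska", abstract_text) ranks abstracts against a query.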

TrackBack


Listed below are links to weblogs that reference Search Down Memory Lane:

» Gerard Salton and Early Search Engines from SEO Book.com
Post about early search algorithms and some of the surprises they found when they were trying to create them. [Read More]


» A History of Search Engines (1945-2006) from 正在想
Google is today's undisputed king of search – that is a fact.  But before Google there were many other search engines.  They too once brought people great convenience and had their day in the sun.  From a historical point of view, they are all Google's predecessors.  So when did search technology actually first appear?  Who is the father of modern search technology?  How did AltaVista fall from grace?  What does the name of Google's predecessor, BackRub, mean? ... [Read More]

Comments

CJ

Wow. What a cool post. You guys paved the way for an awful lot of things going on right now. I would have loved to be around when you were doing all that. I'm finishing a PhD in natural language generation and understanding, and so I've had to work hard in NLP and in IR too. I can really appreciate what you did, very cool.

Tom Evslin

Frode:

Thanks for the comment. I worked on SMART in the early 60s in Fortran II when not in 7090 assembler. Although Battelle is right that Salton was working on this in the late sixties, the start was earlier than that.

Frode Hernes

Hi Tom,
Nice blog about MAPI (and yes, I was there representing MaXware).

I am writing this because I started out working on a free-text indexing and search engine which (according to comments in the code) had originated from Sperry UK in the early sixties ('63, I think). It was written in Fortran II. So the late 60s was probably not when the earliest digital search engine appeared.
Frode.

ellen

I showed my husband your post, and all he seems to remember are the keypunch girls. I guess it was the highlight of his day while his nose was to the grindstone amid all that equipment.

I'll never forget when, on one of our dates, he had to drop something off at a "service bureau." Not being from a technical background, I was amazed by the building filled with all that equipment – and that was 1975.

How far we have come!

Tom Evslin

I don't know how much this was used. I graduated and went on to other things in 1965; and, as you point out, Salton went much further later.

We didn't use this formalization (as I remember) but we used some of the concepts in two ways: we built a dictionary of noise words to eliminate by finding those which did NOT help to discriminate between documents, and we made an attempt to add some phrases to our word stem dictionary because they did, essentially, lead to greater space between documents. We didn't really have enough computer horsepower then to pursue the latter very far.

Search Engines Web

Was this vectorization, in fact, the basis for WebCrawler's technology?

________________________

"In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other, or with incoming patterns (search requests), it appears the best indexing (property) space is one where each entity lies as far away from the others as possible; that is, retrieval performance correlates inversely with space density. This result is used to choose an optimum indexing vocabulary for a collection of documents."

Salton spent a lifetime working on the vector space model. It was introduced to the web in 1994 by Brian Pinkerton with his innovative, full-text retrieval WebCrawler.

http://fantomaster.com/fnsubs/nomodomo/fn-0014.txt
