About Tom Evslin

Video Profile of Tom Evslin

Follow Tom Evslin on Twitter


subscribe:

Add to Technorati Favorites!
Powered by TypePad
Member since 01/2005

technorati


« Survey Extended | Main | Vonage Out(r)age »

Search Down Memory Lane

Was reading John Battelle’s excellent book The Search (which I’ll blog about when I finish it) and was suddenly sent hurtling down Memory Lane.  John writes about the distant antecedents of today’s search engines:

“Enter Gerard Salton, a Harvard- and Cornell- based mathematician often called the father of digital search.  Salton was fascinated by the problem of digital information retrieval, and in the late 1960s developed SMART – Salton’s Magic Automatic Retriever of Text – or what might be considered the first digital search engine.  Salton introduced many of the seminal concepts commonly used in search today, including concept identification based on statistical weighting, and relevance algorithms based on feedback from queries…”

In the early 1960s I was an American History and Literature major at Harvard looking for a better way to earn tuition than bar-tending, snow shoveling and vacuuming – all of which I was doing.  I learned how to program – that’s another story – and, during my junior and senior years and the summer in between, worked for the late Gerry Salton on the NSF-funded SMART project.  I was the project manager in my senior year and wrote lots and lots of code, mostly in FORTAN and IBM 7090 Assembler.  Even wrote some of the published reports to the NSF – last time I was listed as the author of a book before hackoff.com.

Salton planned the SMART project to compare various methods of information retrieval.  We weren’t searching the Web then, of course; we were searching abstracts of scientific papers.  We had a bunch of them in machine-readable format because paper tape had just come into use to control linotype machines and we could get the paper tape converted to magnetic tape which the 7090 could read.  Our searches weren’t a practical way to find information both because we had a very small library of machine-readable documents and because each search involved passing the documents on tape through the computer – we had no disk drives.  But our grant was for comparing algorithms looking forward to the day when it would actually be feasible to use some of them.

The “seminal concepts” that Salton invented (or at least popularized) included using precision and recall as measurements of the efficacy of an information retrieval algorithm.  Precision is the percentage you get when you divide the number of relevant documents retrieved by the total number of documents retrieved.  Recall is the percentage of all documents relevant to the query which are actually found in the particular search.  For any particular search algorithm, these two measurements usually trade-off against each other.  At the extreme, if a search always returns all the documents in the collection, it will have 100% recall but lousy precision.  If it always returns one relevant document, it will have good precision but poor recall assuming there are many relevant documents.

So, if algorithm A has BOTH better precision AND better recall than algorithm B (and doesn’t use inordinately more computing time), then algorithm A is the better algorithm.  Next contender, please.  It was our job to build the platform for testing algorithms and then program the algorithms to be tested.

We were very surprised by the results we got.  Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax. (which is very hard to do).  WRONG!  Turns out that syntax was a waste of time; all that matters is semantics – the actual words used in the query and the documents – not how they relate to each other in a sentence.  Sometimes it was (and still is) useful to search for phrases as if they were words.  But you get that just by observing word order or how close words are to each other – not trying to parse sentences.

So, after a while, we not only ignored syntax but also stripped off all the “ed”s and “ing”s and “s”s which identify parts of speech so that we could simply correlate what we called “word stems”.  Ignoring syntax means that you do exactly the wrong thing when someone queries: “I want to know about all the indigenous people in Alaska except Aleuts”.  Without syntax, you judge documents with the word “Aleut” to be a closer match to the query than documents without.  That’s why even modern search engines implement “Exclude” as one of those advanced functions (which no one ever uses).  Semantics rule!

We keypunched all our programs on 80 column cards, then bound the decks in rubber bands and submitted them for execution. They came back many hours later.  Three turn-arounds in a 24-hour nerd day was doing good.  Time on the 7090, which was carefully tended in an air-conditioned room the size of a basketball court,  went for about $1000/hour in the days when a grand was a lot of money.  So we did a lot of desk checking.

Probably any PC which is in use today has more computing power than ALL of the computers that were in use in the early 1960s put together.

Note for fellow computer nerds: The IBM 7090 had 32,768 thirty-six bit words of memory (bytes weren’t invented until the IBM 360).  Obviously, this memory was shared by both the executing program and the data it was operating on.  Sometimes, to free up memory, we wrote the operating system out to the punch tape (the one with card images which would later be punched on the IBM 1401 which was an acolyte to the 7090).  If the program crashed before it could reload the operating system, the operators had to reboot the 7090 from cards and the entire batch of jobs could be trashed.  In theory you could be punished for this but we badly needed the memory.

Note for fellow math nerds: Don’t know how Google does this but, in our best semantic algorithms, we considered each document to be a point on the surface of an n-dimensional unit sphere.  The coordinates of the point were defined by a vector consisting of the frequencies with which each word stem (or phrase) appeared in the document.  The closer two documents were (and queries are just documents), the more likely they are to be related to each other.  What we actually computed was the cosine of the angle between the two vectors.  Fortunately, the 7090 had good floating point hardware.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451cce569e200d834281ac153ef

Listed below are links to weblogs that reference Search Down Memory Lane:

» Gerard Salton and Early Search Engines from SEO Book.com
Post about early search algorithms and some of the surprises they found when they were trying to create them. [Read More]

» Gerard Salton and Early Search Engine Algorithms from SEO Book.com
Post about early search algorithms and some of the surprises they found when they were trying to create them. [Read More]

» 搜索引擎发展史(1945-2006) from 正在想
Google是现今当之无愧的搜索之王,这是事实。但在Google之前,还有很多搜索引擎。它们也曾经给人们带来极大的便利,也曾经风光过。从历史观点看,它们都是Google的前辈。那么到底搜索技术最早出现于何时?谁才是现代搜索技术之父?AltaVista是怎样没落的?Google的前身BackRub的含义是什么? ... [Read More]

Comments

blog comments powered by Disqus

Now on Kindle!

hackoff.com: An historic murder mystery set in the Internet bubble and rubble

CEO Tom Evslin's insider account of the Internet bubble and its aftermath. "This novel is a surveillance video of the seeds of the current economic collapse."

The Interpreter's Tale

Hacker Dom Montain is in Barcelona in Evslin's Kindle-edition long short story. Why? and why are the pickpockets stealing mobile phones?

Need A Kindle?

Kindle: Amazon's Wireless Reading Device

Not quite as good as a real book IMHO but a lot lighter than a trip worth of books. Also better than a cell phone for mobile web access - and that's free!

Recent Reads - Click title to order from Amazon


Google

  • adlinks
  • adsense