Search Down Memory Lane
Was reading John Battelle’s excellent book The Search (which I’ll blog about when I finish it) and was suddenly sent hurtling down Memory Lane. John writes about the distant antecedents of today’s search engines:
“Enter Gerard Salton, a Harvard- and Cornell- based mathematician often called the father of digital search. Salton was fascinated by the problem of digital information retrieval, and in the late 1960s developed SMART – Salton’s Magic Automatic Retriever of Text – or what might be considered the first digital search engine. Salton introduced many of the seminal concepts commonly used in search today, including concept identification based on statistical weighting, and relevance algorithms based on feedback from queries…”
In the early 1960s I was an American History and Literature major at Harvard looking for a better way to earn tuition than bar-tending, snow shoveling and vacuuming – all of which I was doing. I learned how to program – that’s another story – and, during my junior and senior years and the summer in between, worked for the late Gerry Salton on the NSF-funded SMART project. I was the project manager in my senior year and wrote lots and lots of code, mostly in FORTAN and IBM 7090 Assembler. Even wrote some of the published reports to the NSF – last time I was listed as the author of a book before hackoff.com.
Salton planned the SMART project to compare various methods of information retrieval. We weren’t searching the Web then, of course; we were searching abstracts of scientific papers. We had a bunch of them in machine-readable format because paper tape had just come into use to control linotype machines and we could get the paper tape converted to magnetic tape which the 7090 could read. Our searches weren’t a practical way to find information both because we had a very small library of machine-readable documents and because each search involved passing the documents on tape through the computer – we had no disk drives. But our grant was for comparing algorithms looking forward to the day when it would actually be feasible to use some of them.
The “seminal concepts” that Salton invented (or at least popularized) included using precision and recall as measurements of the efficacy of an information retrieval algorithm. Precision is the percentage you get when you divide the number of relevant documents retrieved by the total number of documents retrieved. Recall is the percentage of all documents relevant to the query which are actually found in the particular search. For any particular search algorithm, these two measurements usually trade-off against each other. At the extreme, if a search always returns all the documents in the collection, it will have 100% recall but lousy precision. If it always returns one relevant document, it will have good precision but poor recall assuming there are many relevant documents.
So, if algorithm A has BOTH better precision AND better recall than algorithm B (and doesn’t use inordinately more computing time), then algorithm A is the better algorithm. Next contender, please. It was our job to build the platform for testing algorithms and then program the algorithms to be tested.
We were very surprised by the results we got. Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax. (which is very hard to do). WRONG! Turns out that syntax was a waste of time; all that matters is semantics – the actual words used in the query and the documents – not how they relate to each other in a sentence. Sometimes it was (and still is) useful to search for phrases as if they were words. But you get that just by observing word order or how close words are to each other – not trying to parse sentences.
So, after a while, we not only ignored syntax but also stripped off all the “ed”s and “ing”s and “s”s which identify parts of speech so that we could simply correlate what we called “word stems”. Ignoring syntax means that you do exactly the wrong thing when someone queries: “I want to know about all the indigenous people in Alaska except Aleuts”. Without syntax, you judge documents with the word “Aleut” to be a closer match to the query than documents without. That’s why even modern search engines implement “Exclude” as one of those advanced functions (which no one ever uses). Semantics rule!
We keypunched all our programs on 80 column cards, then bound the decks in rubber bands and submitted them for execution. They came back many hours later. Three turn-arounds in a 24-hour nerd day was doing good. Time on the 7090, which was carefully tended in an air-conditioned room the size of a basketball court, went for about $1000/hour in the days when a grand was a lot of money. So we did a lot of desk checking.
Probably any PC which is in use today has more computing power than ALL of the computers that were in use in the early 1960s put together.
Note for fellow computer nerds: The IBM 7090 had 32,768 thirty-six bit words of memory (bytes weren’t invented until the IBM 360). Obviously, this memory was shared by both the executing program and the data it was operating on. Sometimes, to free up memory, we wrote the operating system out to the punch tape (the one with card images which would later be punched on the IBM 1401 which was an acolyte to the 7090). If the program crashed before it could reload the operating system, the operators had to reboot the 7090 from cards and the entire batch of jobs could be trashed. In theory you could be punished for this but we badly needed the memory.
Note for fellow math nerds: Don’t know how Google does this but, in our best semantic algorithms, we considered each document to be a point on the surface of an n-dimensional unit sphere. The coordinates of the point were defined by a vector consisting of the frequencies with which each word stem (or phrase) appeared in the document. The closer two documents were (and queries are just documents), the more likely they are to be related to each other. What we actually computed was the cosine of the angle between the two vectors. Fortunately, the 7090 had good floating point hardware.





Wow. What a cool post. You guys paved the way for an awful lot of things going on right now. I would have loved to be around when you were doing all that. I'm finishing a PhD in natural language generation and understanding, and so I've had to work hard in NLP and in IR too. I can really appreciate what you did, very cool.
Posted by: CJ | August 10, 2008 at 07:35 PM
Frode:
Thanks for the comment. I worked on SMART in the early 60s in Fortan II when not in 7090 assembler. Although Battelle is right that Salton was working on this in the late sixties, the start was earlier than that.
Posted by: Tom Evslin | January 31, 2006 at 03:46 PM
Hi Tom,
Nice blog about MAPI (and yes, I was there representing MaXware).
I am writing this because I started out working on a free-text indexing and search engine wich (according to comments in the code) had originated from Sperry UK in the early sixties ('63 I think). It was written in Fortran II. So late 60 was probably not the earliest digital search engine.
Frode.
Posted by: Frode Hernes | January 31, 2006 at 02:54 PM
I showed my husband your post, and all he seems to remember are the keypunch girls. I guess it was the highlight of his day while his nose was to the grindstone amid all that equiptment.
I'll never forget when we met on one of our dates he had to drop something off to a "service bureau." Not being from a technical background, I was amazed by the building filled with all that equiptment and that was 1975.
How far we have come!
Posted by: ellen | January 20, 2006 at 11:22 AM
I don't know how much this was used. I graduated and went on to other things in 1965; and, as you point out, Salton went much further later.
We didn't use this formalization (as I remember) but we used some of the concept in two ways: we looked for a dictionary of noise words to eliminate by finding those which did NOT help to discriminate between documents and we made an attempt to add some phrases to our word stem dictionary because they did, essentially, lead to greater space between documents. We didn't really have enough computer horsepower then to pursue the latter very far.
Posted by: Tom Evslin | January 20, 2006 at 12:04 AM
Was this Vectorization, in fact the basis for Webcrawler's technology?
________________________
"In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other, or with incoming patterns (search requests), it appears the best indexing (property) space is one where each entity lies as far away from the others as possible; that is, retrieval performance correlates inversely with space density. This result is used to choose an optimum indexing vocabulary for a collection of documents."
Salton spent a lifetime working on the vector space model. It was introduced to the web in 1994 by Brian
Pinkerton with his innovative, full text retrieval WebCrawler.
http://fantomaster.com/fnsubs/nomodomo/fn-0014.txt
Posted by: Search Engines Web | January 19, 2006 at 10:52 PM