Search Down Memory Lane

Was reading John Battelle’s excellent book The Search (which I’ll blog about when I finish it) and was suddenly sent hurtling down Memory Lane. John writes about the distant antecedents of today’s search engines:

“Enter Gerard Salton, a Harvard- and Cornell- based mathematician often called the father of digital search. Salton was fascinated by the problem of digital information retrieval, and in the late 1960s developed SMART – Salton’s Magic Automatic Retriever of Text – or what might be considered the first digital search engine. Salton introduced many of the seminal concepts commonly used in search today, including concept identification based on statistical weighting, and relevance algorithms based on feedback from queries…”

In the early 1960s I was an American History and Literature major at Harvard looking for a better way to earn tuition than bar-tending, snow shoveling and vacuuming – all of which I was doing. I learned how to program – that’s another story – and, during my junior and senior years and the summer in between, worked for the late Gerry Salton on the NSF-funded SMART project. I was the project manager in my senior year and wrote lots and lots of code, mostly in FORTAN and IBM 7090 Assembler. Even wrote some of the published reports to the NSF – last time I was listed as the author of a book before hackoff.com.

Salton planned the SMART project to compare various methods of information retrieval. We weren’t searching the Web then, of course; we were searching abstracts of scientific papers. We had a bunch of them in machine-readable format because paper tape had just come into use to control linotype machines and we could get the paper tape converted to magnetic tape which the 7090 could read. Our searches weren’t a practical way to find information both because we had a very small library of machine-readable documents and because each search involved passing the documents on tape through the computer – we had no disk drives. But our grant was for comparing algorithms looking forward to the day when it would actually be feasible to use some of them.

The “seminal concepts” that Salton invented (or at least popularized) included using precision and recall as measurements of the efficacy of an information retrieval algorithm. Precision is the percentage you get when you divide the number of relevant documents retrieved by the total number of documents retrieved. Recall is the percentage of all documents relevant to the query which are actually found in the particular search. For any particular search algorithm, these two measurements usually trade-off against each other. At the extreme, if a search always returns all the documents in the collection, it will have 100% recall but lousy precision. If it always returns one relevant document, it will have good precision but poor recall assuming there are many relevant documents.

So, if algorithm A has BOTH better precision AND better recall than algorithm B (and doesn’t use inordinately more computing time), then algorithm A is the better algorithm. Next contender, please. It was our job to build the platform for testing algorithms and then program the algorithms to be tested.

We were very surprised by the results we got. Everybody assumed that the best results would be obtained by algorithms which made an attempt at understanding English syntax. (which is very hard to do). WRONG! Turns out that syntax was a waste of time; all that matters is semantics – the actual words used in the query and the documents – not how they relate to each other in a sentence. Sometimes it was (and still is) useful to search for phrases as if they were words. But you get that just by observing word order or how close words are to each other – not trying to parse sentences.

So, after a while, we not only ignored syntax but also stripped off all the “ed”s and “ing”s and “s”s which identify parts of speech so that we could simply correlate what we called “word stems”. Ignoring syntax means that you do exactly the wrong thing when someone queries: “I want to know about all the indigenous people in Alaska except Aleuts”. Without syntax, you judge documents with the word “Aleut” to be a closer match to the query than documents without. That’s why even modern search engines implement “Exclude” as one of those advanced functions (which no one ever uses). Semantics rule!

We keypunched all our programs on 80 column cards, then bound the decks in rubber bands and submitted them for execution. They came back many hours later. Three turn-arounds in a 24-hour nerd day was doing good. Time on the 7090, which was carefully tended in an air-conditioned room the size of a basketball court, went for about $1000/hour in the days when a grand was a lot of money. So we did a lot of desk checking.

Probably any PC which is in use today has more computing power than ALL of the computers that were in use in the early 1960s put together.

Note for fellow computer nerds: The IBM 7090 had 32,768 thirty-six bit words of memory (bytes weren’t invented until the IBM 360). Obviously, this memory was shared by both the executing program and the data it was operating on. Sometimes, to free up memory, we wrote the operating system out to the punch tape (the one with card images which would later be punched on the IBM 1401 which was an acolyte to the 7090). If the program crashed before it could reload the operating system, the operators had to reboot the 7090 from cards and the entire batch of jobs could be trashed. In theory you could be punished for this but we badly needed the memory.

Note for fellow math nerds: Don’t know how Google does this but, in our best semantic algorithms, we considered each document to be a point on the surface of an n-dimensional unit sphere. The coordinates of the point were defined by a vector consisting of the frequencies with which each word stem (or phrase) appeared in the document. The closer two documents were (and queries are just documents), the more likely they are to be related to each other. What we actually computed was the cosine of the angle between the two vectors. Fortunately, the 7090 had good floating point hardware.