
I was reading a beautiful article by Gary Griffiths, CEO of Trapit, a content discovery platform that is in a sense related to, and competing with, Pugmarks.me. The article is titled “The web is much bigger and smaller than you think”. It got me thinking, and it fits extremely well with the thesis we have at Pugmarks.me.

About 10 years back, the web “you dealt with” was actually much bigger than the web you deal with now. This may sound paradoxical, since the web has grown at least 1,000 times bigger in that time. 10 years back, everything you did was with the “full web”. You would use a search engine, or navigate through directories. But a lot has changed since then. These days, you deal with a very different web every day. This is a much smaller web – the web of relevance, your circle of interests. Most of what you read comes from this smaller web – on Twitter, on Facebook, or in the personalized news readers you use. Your web of relevance is growing – it’s probably just short of 1,000 sources for most of us, if you pool all the friends you have on Facebook, colleagues on LinkedIn, personalities on Twitter and the publishers you follow. This is the web you are in contact with every day, not the full web that’s behind Google. You do have the choice of going to the full web at any time. But the web you are in touch with is small. So, although the full web has grown, your web has stayed small.

Your small web

The ways you interact with the web have changed. The full web is accessed through large directories, or a search engine. Your web is more of a stream – a stream of life events that happen to people and things you care about. Streams are aggregated and presented to you. The same systems that powered your access to the full web will not work when you interact with “your web”. Your web is not just documents. Your web is organized around people, time, context and thoughts. They keep happening. How should you be interacting with “your web”?

Hey, let me stop here. What do I even mean by “interacting with your web”?

Let me ask you a few questions, and you should point me to existing systems that give you answers to these questions.

1) Tell me all the places visited by my friends this year.

2) Look at Sony’s Android product line, and then Samsung’s Android product line. What’s common in their feature evolution, and what’s different?

3) What are the top interests among people who work at Intel?

4) What industries and sectors get the most attention from the VCs of India?

5) What’s similar between what I read yesterday and what I am reading today?

6) Now that I am looking at this resume, what have I read in the recent past that may be interesting to this candidate?

Do you get the idea? The nature of these searches is very different. They all work on data that is very “close” to you. The objects of these questions are of interest to you. They all have their streams, and they are all emitting data into the wild. You want to gather interesting aspects of these streams, and that is not as straightforward as just searching the web for some document. The results often lie across documents, and across streams!

This is multi-stream retrieval. For those new to information retrieval – the science (and art) of creating search engines – let me explain this a bit more. Multi-stream retrieval is about humans querying, exploring and discovering information from streams of information. This is what we do at Pugmarks.me, and it is what the platform we’ve built at Insieve does.

When you build a search engine for the web, you build large-scale systems that are optimized to do a huge brute-force search over billions of documents for every query. In multi-stream retrieval, streams that are close to you are more important than the full web. You are willing to work on smaller data, but want to discover more from the world near you. Every query in multi-stream retrieval is sent out to the several hundred streams that matter to you. When the results come back, you are not just served documents – you also get various ways of exploring this data, and insights gathered from across these streams. For example, when you query for “dance”, the multi-stream retrieval engine won’t just give you the documents that matched, but will also tell you which people seem to be writing a lot about dance. And while they write about dance, what else do they write about?
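To make the “dance” example concrete, here is a minimal sketch in Python. It is only an illustration, not how Pugmarks.me or Insieve actually work: the stream names, the posts and the `query_streams` function are all made up, and plain word matching stands in for real retrieval.

```python
from collections import Counter

# Hypothetical toy data: each stream is the list of posts from one source.
STREAMS = {
    "alice": ["dance class tonight", "dance and music festival", "new camera review"],
    "bob": ["startup funding news", "music festival lineup"],
    "carol": ["dance workshop photos", "camera tips for dance photos"],
}


def tokens(text):
    """Split a post into lowercase words (a stand-in for real text analysis)."""
    return text.lower().split()


def query_streams(streams, term):
    """Send one query to every stream and gather results across them.

    Returns the matching posts, a count of matching posts per stream,
    and the other words used anywhere by the streams that matched.
    """
    term = term.lower()
    hits = []            # (stream name, matching post)
    writers = Counter()  # how many posts in each stream mention the term
    for name, posts in streams.items():
        for post in posts:
            if term in tokens(post):
                hits.append((name, post))
                writers[name] += 1

    # "What else do they write about?": look at every post of each
    # stream that matched, not only the posts that mention the term.
    also = Counter()
    for name in writers:
        for post in streams[name]:
            also.update(w for w in set(tokens(post)) if w != term)
    return hits, writers, also


hits, writers, also = query_streams(STREAMS, "dance")
print(writers.most_common())  # who writes most about the term
print(also.most_common(5))    # what else those writers talk about
```

The point of the sketch is the shape of the answer: one query comes back with documents, the people behind them, and their neighbouring interests, all drawn from a small set of streams rather than the full web.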

In multi-stream retrieval, you want to reason about streams. Pick everything Robert Scoble has been talking about or reading. Now compare it with everything Michael Arrington has been talking about or reading. What’s common, and what’s different between them?

Scoble with Arrington
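Comparing two streams like this can be sketched with simple set operations. Again, this is a toy illustration under my own assumptions: `compare_streams` and the two placeholder streams are hypothetical, the posts are invented rather than taken from anyone’s real feed, and a real system would compare topics or entities rather than raw words.

```python
def topics(posts):
    """Collect the set of lowercase words used across a stream's posts."""
    return {word for post in posts for word in post.lower().split()}


def compare_streams(a_posts, b_posts):
    """Return what two streams share and what is unique to each."""
    a, b = topics(a_posts), topics(b_posts)
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


# Placeholder streams, invented for this example.
stream_a = ["android phones", "startup funding"]
stream_b = ["startup news", "android tablets"]

result = compare_streams(stream_a, stream_b)
print(sorted(result["common"]))  # words both streams use
print(sorted(result["only_a"]))  # words only the first stream uses
print(sorted(result["only_b"]))  # words only the second stream uses
```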

You get the idea now. It’s your world. You deserve to know more about the things that are close to you! Welcome to the world of multi-stream retrieval.

Bharath