The Deep Web

March 24, 2009 · Posted by Donna Remillard · 1 Comment

 

So just when I thought I was finally getting the most out of my search with IE 8’s new search tools, I come across an article about how much I’m missing in the Deep Web, and find there’s an army of researchers and specialized search providers working to access the ocean of data I’m not getting.

The Deep Web (also called Deepnet, the invisible Web, dark Web or the hidden Web) refers to web content that is not part of the surface Web, which is the part indexed by standard search engines.

Last summer, Google’s search engine passed a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web.

Beyond those trillion pages lies the Deep Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

The challenges that the major search engines face in penetrating the Deep Web explain why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available, if only the search engines knew how to find them.

How it Works

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.
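To make the link-following idea concrete, here is a minimal sketch in Python. It is not how any real search engine is built: the toy web, the page names and the `crawl` function are all made up for illustration. It simply shows that a crawler which only follows hyperlinks never reaches a results page that exists only after someone types a query into a form.

```python
from collections import deque

# Hypothetical toy web: each page name maps to the links found on that page.
TOY_WEB = {
    "home": ["about", "search-form"],
    "about": ["home"],
    "search-form": [],           # a form page: no plain links to its results
    "results?q=rembrandt": [],   # only produced when a query is typed in
}

def crawl(start, web):
    """Breadth-first crawl: collect every page reachable by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

surface = crawl("home", TOY_WEB)   # pages the crawler can reach
deep = set(TOY_WEB) - surface      # pages it never sees
print(sorted(surface))
print(sorted(deep))
```

In this toy setup the crawler finds the form page itself but not the results behind it, which is the gap the rest of this post is about.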

Google, Yahoo and a number of others are investing in Deep Web search strategies to extract that data, but right now they’re only scratching the surface of Web content with these traditional spiders. 

To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art (say, museum catalogs or auction houses), and what kinds of queries those databases will accept.
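A rough sketch of that brokering step, assuming the engine already holds a small set of topic keywords for each database. The database names, the keyword sets and the `broker` function are hypothetical, and real systems would use far richer signals than keyword overlap.

```python
# Hypothetical catalog: database name -> topic keywords the engine knows about.
DATABASES = {
    "museum-catalog": {"rembrandt", "vermeer", "painting", "art"},
    "auction-house": {"rembrandt", "picasso", "auction", "art"},
    "flight-schedules": {"fare", "flight", "london", "new", "york"},
}

def broker(query, databases):
    """Rank databases by how many query terms overlap their keywords."""
    terms = set(query.lower().split())
    scores = {name: len(terms & keywords) for name, keywords in databases.items()}
    # Keep only databases with some overlap; best score first, ties by name.
    matches = [name for name, score in scores.items() if score > 0]
    return sorted(matches, key=lambda name: (-scores[name], name))

print(broker("Rembrandt", DATABASES))
print(broker("best fare New York London", DATABASES))
```

Here “Rembrandt” is routed to the two art databases and the fare question goes only to the flight database. Working out what kinds of queries each database will accept is a separate problem that this sketch does not attempt.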

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
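The guessing approach can be sketched like this. It is a simplified illustration, not Google’s actual method: `HIDDEN_DB`, `fake_form_search` and `probe` are invented stand-ins, and the “model” here is nothing more than a count of hits per guessed term.

```python
# Hypothetical database sitting behind a search form.
HIDDEN_DB = {
    "rembrandt": ["The Night Watch", "Self-Portrait"],
    "vermeer": ["The Milkmaid"],
}

def fake_form_search(term):
    """Stand-in for submitting a term to a web form and reading the results."""
    return HIDDEN_DB.get(term.lower(), [])

def probe(search, guesses):
    """Try each guessed term and record how many results it returned."""
    model = {}
    for term in guesses:
        results = search(term)
        if results:  # keep only the terms that produced a match
            model[term] = len(results)
    return model

model = probe(fake_form_search, ["Rembrandt", "Picasso", "Vermeer"])
print(model)
```

The terms that return matches give a crude picture of what the database holds. A real system would presumably go on to analyze the returned pages themselves rather than just counting them.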

A few of the Deep Web search engines currently available:

Turbo10

Complete Planet

Deep Peep

If you still can’t find what you’re looking for, here’s a directory of specialized libraries. Note that for each search you need to give it a hint about the subject you’re interested in.

http://websearch.about.com/od/invisibleweb/a/invisibleweb.htm

And if you just can’t get enough of search theory, reference material, research and algorithms, here are links to more Deep Web research:

http://deepwebresearch.blogspot.com/

http://www.llrx.com/features/deepweb2009.htm

1 response so far

  • Thursday, 12 Nov 2009 01:51 by Brenda Cane

    This is a good overall description of the deep web... How would you differentiate Turbo10 from other "meta crawlers" like Dogpile?
