What is the deep, invisible or hidden web?

Not all documentary information is directly retrievable through search engines. In 2001 two publications described the limitations of search engines: Bergman (2001), who coined the term 'deep web', and Sherman & Price (2001), who used the term 'invisible web'. According to Bergman's estimate, the 'deep web' is about 500 times as large as the 'surface web'. Some of the assumptions in these early studies were probably flawed, but whatever its size, the deep web still exists and contains a lot of quality content.

 

The estimated numbers of pages that three major search engines have indexed for two large databases illustrate the existence of the deep web:

Site       Google      Yahoo       Live
Worldcat   433,000     3,500,000   964
Pubmed     9,260,000   863,000     98,272

 

The main causes for the existence of the deep web

  • The information is contained in databases
  • Search engine limitations
  • Low ranking results
  • Cognitive factors

 

Information is contained in databases

The spiders or crawlers of search engines cannot deal with database forms: they cannot complete a form and press the search button to gain access to the information in a database. They can index the search form itself, but not the wealth of information behind it. Web pages generated from a database are so-called dynamic pages. Dynamic pages can be recognized by the structure of their URL, which contains a ? or clues such as cgi, cfm or php. The following URL is an example of a dynamic page, which search engines index only with some difficulty: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=9742976
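The URL clues mentioned above can be checked mechanically. The following Python sketch is a rough heuristic only: the function name `looks_dynamic` and the clue list are assumptions made for illustration, and a plain substring match will occasionally flag static pages as well.

```python
from urllib.parse import urlparse

# Clues taken from the examples in the text; the list is illustrative, not complete.
DYNAMIC_CLUES = ("cgi", "cfm", "php")


def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a query string (?) or a script-like clue in its path."""
    parts = urlparse(url)
    if parts.query:
        return True
    path = parts.path.lower()
    return any(clue in path for clue in DYNAMIC_CLUES)


if __name__ == "__main__":
    example = ("http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
               "?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=9742976")
    print(looks_dynamic(example))
    print(looks_dynamic("http://example.org/about/index.html"))
```

The check treats any non-empty query string as a sign of a dynamic page, which matches the ? clue in the text; the path test covers URLs such as /cgi-bin/search that carry no query string.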

 

Search engine limitations

  • Sites are too large to be indexed completely
  • Files are too large (the size limits change, but they still exist)
  • Information is contained in non-indexable file types (ZIP, TAR, etc.)
  • Information is contained in graphics, multimedia files or Flash
  • The site owner's robots.txt does not allow indexing
  • Information changes rapidly (stocks, news or blogs)
  • Information is on intranets, or requires passwords
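The robots.txt limitation in the list above can be illustrated with Python's standard `urllib.robotparser` module. This is a minimal sketch: the robots.txt content, the example.org URLs and the helper name `is_crawlable` are all hypothetical, and the rules are parsed from a string rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_crawlable(ROBOTS_TXT, "http://example.org/public/index.html"))
    print(is_crawlable(ROBOTS_TXT, "http://example.org/private/report.html"))
```

A well-behaved crawler skips every URL for which such a check fails, so pages under a disallowed path stay out of the index even though they are publicly reachable.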

 

Cognitive factors

(Re)searchers are human after all. They rarely look beyond the first 10 or perhaps 20 results, and would rather change their search query than page through the results.

Make sure that you adjust the preferences of your favourite search engine, for example to show more results per page. Another possibility is to use a different search engine. There is probably not a single search engine in the world that is the best for all your search queries.

 

Solution

To find the information contained in the deep web, it is more important to find the databases that hold the information than to look for the information directly. There are four ways to locate these databases:

  • Use the standard search engines to locate databases that may contain the required information.
  • Special directories.
  • Look for databases where they can be expected.
  • Special search engines.

 

Searching databases with standard search engines

Search for your research topic with additional terms that point to databases, such as: database, data, dataset, archive, bibliography, index, directory or statistics. For example: ["plane crash" | "aviation accidents" database].

 

Search for your term and restrict the results to URLs containing strings that typically indicate database queries, such as asp, bin, cgi, cfm, search, query, webquery or php. For example:

[mycology inurl:cfm] or [mycology inurl:asp]
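Both query strategies described above are mechanical enough to generate from a list of terms. The following Python sketch builds such query strings; the function names are hypothetical, the term lists are copied from the text, and the | and inurl: operators are assumed to be supported by the search engine in use.

```python
# Terms that point to databases, copied from the text (not exhaustive).
DATABASE_TERMS = ("database", "dataset", "archive", "bibliography")
# URL fragments that often indicate database queries, copied from the text.
URL_CLUES = ("asp", "bin", "cgi", "cfm", "search", "query", "php")


def database_queries(phrases, extra_terms=DATABASE_TERMS):
    """Join alternative topic phrases with | and append one database term per query."""
    topic = " | ".join(f'"{phrase}"' for phrase in phrases)
    return [f"{topic} {term}" for term in extra_terms]


def inurl_queries(term, clues=URL_CLUES):
    """Pair a search term with one inurl: restriction per query."""
    return [f"{term} inurl:{clue}" for clue in clues]


if __name__ == "__main__":
    for query in database_queries(["plane crash", "aviation accidents"]):
        print(query)
    for query in inurl_queries("mycology"):
        print(query)
```

Running several variants and comparing the results is usually more productive than relying on a single query, because different databases describe themselves with different terms.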

 

Once you have found suitable databases, it is important to understand how to query them in order to retrieve your information.

 

Special directories

 

Direct Search http://www.freepint.com/gary/direct.htm

Although Direct Search is no longer updated, it is still a valuable resource for finding important databases. The site was started by Gary Price, who still reports on recent developments in Web search and Web resources on his blogs ResourceShelf and DocuTicker.

 

Yahoo! Webdirectories http://dir.yahoo.com/

Most subject categories have a special subcategory for web directories, and in some cases also for databases or bibliographies.

 

A collection of special search engines http://www.leidenuniv.nl/ub/biv/specials.htm

A bit outdated (last additions from 2000) but still an impressive collection of special directories, bibliographies and databases in the social sciences and humanities.

 

Look for databases in locations where you can expect them

  • Statistics in the Netherlands are collected by CBS. On its homepage you will find the Statline database, which holds all the important statistics of the Netherlands.

 

Specialized search engines

 

Complete Planet http://www.completeplanet.com

Covers some 70,000 databases and Web directories.

 

IncyWincy http://www.incywincy.com/default

 

Turbo10 http://turbo10.com/

A meta search engine which can search in 800 databases at once. Some of these databases contain information from the hidden web.

 

Gosh me http://www.goshme.com/ (Still in Beta, perhaps defunct?)

Promising new search engine.

 

ScienceResearch http://www.scienceresearch.com/search/

This portal allows access to numerous scientific journals and public science databases. Depending on the source, full text documents may be available. In the event full text is not available, the results pull up an abstract of the article and a link to the source.

 

Additional Information

Anon. (2004). Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity. Retrieved 2005-05-23, from http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html

 

Bergman, M. K. (2001). The deep web: Surfacing hidden value. The Journal of Electronic Publishing 7(1). http://www.press.umich.edu/jep/07-01/bergman.html

 

Devine, J. and F. Egger-Sider. (2005). Beyond Google: The invisible Web. Retrieved 2005-05-23, from http://www.lagcc.cuny.edu/LIBRARY/invisibleweb/.

 

Sherman, C. and G. Price (2001). The invisible web: Uncovering information sources search engines can't see. Medford NJ, USA, Information Today.

 

 



WG 20071009

 

