The Deep, Dark Invisible Web

Searching the Deep Web

I've been poking around this topic for a few months now, bumping into it constantly as I research other interests like web services, bioinformatics and the Internet of Things. All of these topics and much of the associated content are located in the region of the web called the Deep Web, a repository of content and services that exists below the surface (static web pages) and is unreachable by traditional search engines like Google, Yahoo and Bing.

What is the Deep Web?

The deep web represents all of the data, information, content and services that are hidden from standard search engine crawlers and are therefore not easily found or accessed. The types of content that exist in this region include practically everything imaginable, from product reviews to videos to research papers. In general, this hidden information is more current and more content rich than what is generally available on the surface.

Sites can be grouped into several categories according to how they relate to the Deep Web. Most of the hidden ones have made conscious decisions to keep certain content hidden or invisible from the search engines. The reasons for this include everything from proprietary information on corporate intranets to subscription based systems that distribute content for a fee. The types of sites include the following:

  1. Visible: These are traditional websites (Web 1.0) that allow all of their content to be easily found, accessed and indexed by the search engines.
  2. Private: A site in this category basically has a single page with no accessible links on it until a user is signed in with the appropriate credentials. These are typically banks, brokers and other proprietary systems with confidential content.
  3. Opaque: A site that has a home page with no links or obvious content. Typically this represents nothing more than a marker indicating where subsurface content is stored.
  4. Proprietary: These sites are basically pay-per-view in that they require a subscription and payments to access the content they contain.
  5. Invisible: These are sites that contain no content that can be indexed by the search engines. Everything is encrypted and stored in formats that cannot be read by third party services.

Why is the content hidden?

As the image above indicates, the iceberg is the perfect metaphor to describe the deep web because most of its mass is below the surface. The deep web's informational mass is below the surface and not accessible by reading static web pages. Back in the days of Web 1.0 almost all web sites were built using static web pages where all of the content was embedded in the actual HTML. For search engines this made discovery and indexing pretty straightforward since the content was stored in a standard format (more or less) and could be easily retrieved and parsed.

As the paradigm shifted and content management systems became more data driven, the use of static pages declined dramatically, to the point where most current websites generate content (pages) upon request and very little is stored as a static page. This created substantial problems for both the websites and the search engines, since content was going unindexed and therefore did not show up in search results. Dozens of adjustments have been made over the years to accommodate this new way of generating web pages, and many of these adjustments have helped create the industry of search engine optimization (SEO). To a large extent the problems that plagued search engines during the transition to dynamic content generation have been solved, but some residual problems remain:

  • Pages in non-HTML format like PDFs, Flash, DOCs, XLSs and others,
  • Expanded use of scripts like PHP and JavaScript where the URLs contain coding characters like ?,
  • Dynamic page generation using technologies like ASP or ColdFusion.
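
To make the URL-based problems in this list concrete, here is a minimal Python sketch of how a crawler might flag a URL as "probably dynamic". The extension list and the `looks_dynamic` helper are my own assumptions for illustration, not how any real search engine decides what to index.

```python
from urllib.parse import urlsplit

# Assumed, incomplete list of extensions that suggest server-generated pages.
DYNAMIC_EXTENSIONS = (".php", ".asp", ".aspx", ".cfm", ".jsp")


def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a query string or a script-style extension."""
    parts = urlsplit(url)
    has_query = bool(parts.query)  # anything after the "?" character
    has_script_ext = parts.path.lower().endswith(DYNAMIC_EXTENSIONS)
    return has_query or has_script_ext


urls = [
    "http://example.com/about.html",
    "http://example.com/catalog.php?item=42&sort=price",
    "http://example.com/report.cfm",
]
dynamic_flags = {url: looks_dynamic(url) for url in urls}
```

A heuristic like this only looks at the shape of the URL, so a static page served from a ".php" path would be flagged as well.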

The web servers (Apache, IIS, etc.) as well as the content management systems (Drupal, Joomla, etc.) had to make several adjustments so that the search engines (Google, Yahoo, etc.) could deal with these new content types. So, things have improved since the initial content expansion and paradigm shift, but there are still massive amounts of content not being indexed by the search engines. In fact, the search engines are unable to dive any deeper without additional changes being implemented by the CMS and server companies. Some of the current major issues remaining include the following:

  • Searchable databases: These are database systems that are specific to a site and its underlying content. For example, Amazon has a huge database of reviews that it exposes only on its web pages and that is not (generally) accessible to the search engines, so finding those reviews forces you to use Amazon. This is true for almost every proprietary database on the Internet, with some rare exceptions.
  • Excluded pages: There are two ways that pages can be excluded from the search engine crawlers. The first is by the crawler itself and the second is by the site deciding not to expose certain pages or elements to the crawler. Without going through the details, the crawler decides not to index pages that are generated through queries because they tend to be user specific and lack broad appeal. On the other hand, sites have several options to disable search engine indexing, including the use of a robots.txt file or meta tags which tell the crawlers what content they may and may not access.
  • Hidden content: Much of the web's content remains hidden from general access and can only be reached through a combination of special client applications and security credentials. The search engine crawlers are generally unaware that this content even exists since it is not directly connected to the surface layer of the web. In many cases this content is stored in encrypted files and database systems that further reduce general accessibility.
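
The robots.txt mechanism mentioned under excluded pages is easy to try out. The sketch below uses Python's standard `urllib.robotparser` module against a made-up robots.txt; the rules, the "ExampleBot" user agent and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that hides query-driven search pages and a private area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL.
can_fetch_home = parser.can_fetch("ExampleBot", "http://example.com/index.html")
can_fetch_search = parser.can_fetch("ExampleBot", "http://example.com/search?q=deep+web")
```

Keep in mind that robots.txt is only a request to crawlers; it does not protect the content from anyone who ignores it.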

Some of the other reasons for hidden content include geo-tags, time-limited access and dynamic URLs. These are relatively minor reasons compared to the ones listed above, but in some cases they can still represent a large volume of information. Consider a GPS enabled cell phone that can effectively geo-tag all content it produces; this is a common practice for photograph identification on sites like Flickr and Picasa.

How much content is hidden?

There are widely differing opinions on this, with the low estimates indicating that about 75% of all Internet based content is hidden and the higher estimates approaching 99%. The reality is probably somewhere in between. Regardless of the exact amount, it is clear that most of the content on the Internet remains hidden from the search engines and is therefore not easily found or accessed by search engine users. Researchers and Internet gurus have known about this for several years and have developed alternative methods for tapping into this repository.

Again, regardless of the actual volume, most of the information contained in the deep web is more valuable and more current than information accessible on the surface. Back in 2000 it was estimated that the deep web contained about 7,500 terabytes and around 550 billion individual documents and files. A more recent estimate from UC Berkeley suggests the deep web holds about 91,000 terabytes, in contrast to the surface web, which contains only about 167 terabytes. To put this in some perspective the Library of Congress, in 1997, was estimated to have around 3,000 terabytes of content.

BrightPlanet estimates that the deep web contains 500 times more content than the indexed surface layer. Considering that Google indexes about 8 billion pages, that number is incredible: roughly 4,000,000,000,000 (4 trillion) pages of content! If you go to the BrightPlanet site it indicates that the deep web is actually thousands of times larger than the surface, so it's growing faster than I thought.
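
As a quick sanity check on these figures, the arithmetic can be worked through in a few lines of Python. The inputs are just the estimates quoted above (91,000 and 167 terabytes, a 500x multiplier and 8 billion indexed pages); nothing here is new data.

```python
# Size estimates quoted above, in terabytes.
deep_web_tb = 91_000
surface_web_tb = 167

# How many times larger the deep web is by volume (about 545x).
size_ratio = deep_web_tb / surface_web_tb

# Page-count estimate: 500 times the roughly 8 billion pages in the index.
indexed_pages = 8_000_000_000
multiplier = 500
deep_web_pages = indexed_pages * multiplier  # 4,000,000,000,000 (4 trillion)
```

The volume ratio of roughly 545 lines up reasonably well with the "500 times" figure, even though one is measured in terabytes and the other in pages.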

What types of content are hidden?

Everything you can think of, and then some. To a large extent the content that exists just beyond the reach of the search engines includes business intranets, academic research and content rich databases. The free web lives here as well; this includes technologies like Tor, I2P and Freenet that are designed to let content be transferred around without being traceable and to keep their users anonymous.

In most of the reviews and research I've read on this topic the tendency is to focus on the negative, things like terrorist manifestos and child pornography and other, even more nefarious content. While this is certainly true, it should be expected because the Internet is nothing more than a reflection of society: if it happens in the world, it's on the Internet. Besides, the amount of valuable content and intelligence that exists in the deep web makes it worth opening the door and looking in, and the bad stuff can be avoided without much difficulty.

Back to the good stuff.

The term hidden may be a misnomer here because much of the deep web can be found and accessed for free, but you have to know how to do it. For example, the vast majority of content that Amazon accumulates, like user comments, product reviews, product descriptions and community tags, can be accessed in a couple of different ways, which I'll describe in the next section. In general, search engines cannot access this content unless the site chooses to expose it to them, which in many cases they don't for business reasons or can't for legal reasons.

The idea is that around 95% of the deep web is contained in database tables and other discrete formats, and the vast majority of this can be accessed for free. The accessible content that tends to be hidden from view includes binary files like software applications, images, videos and other forms of multimedia content. It also includes massive databases like the types listed below:

  • Government databases, which at the federal level include curated databases for almost every agency and department within the government. These databases also exist at the state and in many cases local levels as well. An example of this can be seen at the Medical Device database form published by the FDA.
  • Millions of topic specific databases exist, like the plane crash database or the toxic chemicals database. This content is generated and maintained by a whole range of entities including industry and trade associations, all types of businesses, various organizations and even individuals.
  • Academic databases and publications are one of the main constituents of the deep web. Academics were the most prolific type of users on the net for its first 10 years, which means lots of curated content. Most of these publications have been vetted by the academic community and therefore have much higher integrity than similar surface materials. In general, this content is also considered the most current because it is closest to the producer. The content here is also the most discrete and therefore pure; as this content makes its way to the surface it tends to be diluted for mass consumption.

The content that is completely inaccessible to search engines is made up of proprietary databases that require some sort of membership, business intranets, and subscription services. This content is intentionally made unavailable for reasons already mentioned, like potential legal issues, competitive advantages and proprietary information, as well as others, most of which are obvious. An example is Hoovers, a company that sells business information that it considers proprietary and valuable. Another example is Facebook, which has to consider legal issues as well as maintaining a competitive advantage to attract advertising dollars.

How to find the hidden content?

For much of the content that exists in the deep web there are a limited number of methods for searching and finding it. In many cases there is only a single method of access exposed by the content producer, typically via some web based query form. Using Amazon as an example with their huge repository of activities, there are basically three methods to access that content:

  1. Via the Amazon website directly,
  2. Via Amazon partner sites, and
  3. Via the Amazon web services.

The first two methods are fairly obvious, but the web service based approach is gaining popularity and becoming a ubiquitous model for interaction with Amazon services. I've defined web services in other posts located here, but suffice it to say that they enable syndication of services and content. They are sort of like RSS and Atom feeds, but they use machine to machine communications that tend to be more robust because of programmatic interaction and control.
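
To give a feel for what machine to machine access looks like, here is a minimal Python sketch of the two halves of a web service call: building the request URL and parsing the structured response. The endpoint, parameter names and response layout are entirely hypothetical (this is not Amazon's actual API), and a canned JSON string stands in for the network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real service.
BASE_URL = "https://api.example.com/v1/reviews"


def build_request_url(product_id: str, page: int = 1) -> str:
    """Build the query URL a client program would send to the service."""
    query = urlencode({"product": product_id, "page": page, "format": "json"})
    return f"{BASE_URL}?{query}"


def parse_reviews(payload: str) -> list:
    """Pull the fields we care about out of a JSON response body."""
    data = json.loads(payload)
    return [
        {"author": review["author"], "rating": review["rating"]}
        for review in data.get("reviews", [])
    ]


# Canned response standing in for what the hypothetical service might return.
SAMPLE_RESPONSE = (
    '{"reviews": ['
    '{"author": "pat", "rating": 5, "text": "Great"}, '
    '{"author": "sam", "rating": 3, "text": "OK"}]}'
)

request_url = build_request_url("B000123")
reviews = parse_reviews(SAMPLE_RESPONSE)
```

The point is that the content comes back as structured data rather than as a rendered page, which is what makes it practical for a program to consume.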

There is another way to get to much of the deep web content: specialized search engines that were built to harvest and index the content in these lower layers of the Internet. There are literally hundreds of these search engines available to use, but many are limited in scope. The limitations are generally related to a certain type of content or a specific industry, so while they reduce the number of places to search there is still no single source to go to like Google is for the surface web. That said, there are several sites that claim to be a deep web search engine, and while they are broad in scope there are still substantial limitations with their content and algorithms. Here's a list of some of the popular deep web search engines:

This is just a starter list but should provide a good jumping off point for exploring. I've also included two attachments to this post for download, and one of them lists dozens of these deep web search sites. It is usually better to use a context specific search engine when trying to research something; for example, sites like Google Book search and Google Scholar will likely yield much better results when researching books or scientific publications than a more generic deep web search engine like Clusty.

Deep Web Companies

Various Context Specific Search Engines

Two trends are developing that should have a positive impact on our ability to leverage the content buried in the deep web: the use of web services as a method of syndicating both content and services, and the new application development paradigm of mashups. Together these trends are producing a meaningful set of tools that can be used to search out and find content that is currently only available to the elite users of the Internet. A recent publication posted at SpringerLink entitled "Mashups over the Deep Web" discusses this new application paradigm and how it can be used to successfully traverse the deep web.

Summary

The deep web represents the largest part of the Internet and it is to a large extent hidden from the traditional search engines like Google and Yahoo. The deep web represents a huge, almost undefinable repository of valuable content and services that are generally considered more current and of higher quality than anything found on the surface. Most of this content is free to be consumed once it is found, but therein lies the difficulty: finding content in the deep web can be arduous since the information is scattered over millions of sites and in generally inaccessible formats.

These huge volumes of hidden content have spawned a whole new series of search methodologies and companies dedicated to ferreting out all of the content buried in the deep web and making it available at the surface. Still, with so many sources of content that are by definition difficult to find and catalog, these engines generally do best when they are focused on a specific area or type of content. It is unlikely that a general search engine like Google or Yahoo can be applied to the vast landscape of deep web content; instead, multiple specialized engines are emerging as the best alternative.

One final trend that is helping to collect and disseminate this content in a sensible way is using web services for syndication and mashups for consumption. These are both well suited to the complexities of the deep web and the vertical nature of the searches. Mashups and web services may ultimately provide the best set of tools for aggregating search results from multiple sources into a cohesive presentation; after all, that's what they do.
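
Here is a rough Python sketch of that aggregation idea: query several specialized sources, merge the hits, drop duplicates and rank what is left. The two source functions return canned results and stand in for real web service calls; their names, URLs and the "score" field are all assumptions made up for this example.

```python
def search_papers(query: str) -> list:
    """Stand-in for a call to an academic database's web service."""
    return [
        {"title": "Crawling the hidden web", "url": "http://papers.example/1", "score": 0.9},
        {"title": "Deep web surveys", "url": "http://papers.example/2", "score": 0.6},
    ]


def search_gov(query: str) -> list:
    """Stand-in for a call to a government database's web service."""
    return [
        {"title": "Device registry", "url": "http://gov.example/a", "score": 0.7},
        {"title": "Deep web surveys", "url": "http://papers.example/2", "score": 0.8},
    ]


def federated_search(query: str, sources: dict) -> list:
    """Query every source, keep the best-scoring hit per URL, sort by score."""
    best = {}
    for name, source in sources.items():
        for hit in source(query):
            tagged = {**hit, "source": name}  # remember where the hit came from
            current = best.get(hit["url"])
            if current is None or tagged["score"] > current["score"]:
                best[hit["url"]] = tagged
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)


results = federated_search("deep web", {"papers": search_papers, "gov": search_gov})
```

Scores from different engines are rarely comparable in practice, so a real mashup would need some kind of normalization before ranking; this sketch simply assumes they share a scale.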

The traditional search engines aren't giving up and walking away. Google is actively pursuing the development of technologies like the Deep Web Crawl, which is obviously aimed at this problem. Yahoo is just as active and has been for a while; they were one of the early pioneers in the efforts to catalog the deep web, as this article published by Cnet in 2005 indicates! Last and probably least is Bing (only kidding, MrSofty fans): I couldn't find anything specific to Bing, but here is an article from Microsoft Research on the deep web. A final review of the traditional search engines and their efforts can be found here.

Additional Resource Links:


  1. 10 useful deep web search engines
  2. 100 Tools and Tips for Research on the Deep Web
  3. Dark Net and the Economics of Mutual Anonymity
  4. Slideshare presentation on the Deep Web
  5. Deep web research site
  6. Exploring the Deep Web
  7. Mining the Deep Web with Mashups
  8. New York Times article on exploring the Deep Web
  9. Search engines of the future
  10. Dark site of the Web
  11. The Dark Web explained
  12. Beyond the beyond: The Dark Web
  13. The Dark Web: 500x Larger than the World Wide Web
  14. The Deep Web and the future of finding things
  15. Deep Web Research Tools
  16. Internous: The Database of Databases
  17. Federated Search technology
  18. Video: Deep Web Video
  19. Video: Finding Research on the Deep Web