I’m sure by now you’ve heard about the Yahoo! breach, which is one of the largest breaches ever. For many people, breaches like this keep us up at night at first, and then the worry ends there. The story is usually a much longer one, though, because after these types of breaches occur, the sensitive information is bought and sold within the deep web. It is for this reason that any pentester, or anyone in IT security for that matter, needs to understand what the deep web is and how it works. One tactic that professional pentesters use is to check whether an organization’s credentials are already out on the web due to a previous breach. There have been many breaches, so it is definitely in the realm of possibility, and finding these weaknesses is what the organization would be paying you to do.
Whenever someone mentions the dark net or the deep web, most people’s eyes widen with mystified curiosity about the Internet’s innermost circles. And the media doesn’t help, because modern films and television dramas do little to paint an accurate picture of the deep web. Instead, entertainment media shows clandestine meetings between anonymous silhouettes wearing trench coats and fedoras.
All of this mystique, in my opinion, creates an unfair and inaccurate portrayal of what the deep web really is. It’s not just a place for the world’s most shady and secretive users to lurk around on exclusive “invite only” websites. Even though it may sound like I’m contradicting myself, to be completely honest, there are undoubtedly more than a few of those types of sketchy websites in what most people consider to be the deep web.
While there are plenty of seedy destinations, riff-raff, and unsavory users occupying some portion of the deep web, the vast majority of it isn’t necessarily sketchy or dangerous. But before we dig into what types of content lurk in the deep web, let’s first talk about its size.
Search Engines and the Global Internet as a Whole
What exactly is the Internet? Is it Facebook, Google, and your favorite streaming media sites? There are a variety of definitions, but for our discussion of the deep web, I want you to understand one key point. Technically, any device – be it a router, switch, firewall, smartphone, tablet, refrigerator, or automobile – becomes part of the global network when it gains an Internet connection.
Like it or not, there are a variety of paths between any two endpoints, meaning that in some contexts, the computer you’re using right now could technically be classified as a component of the Internet. However, there are a lot of security tools that separate devices, groups of devices, and entire countries into their own isolated networks. The fact remains that you can’t personally connect to the overwhelming majority of devices attached to the public Internet (such as machines on private corporate networks, your neighbor’s Wi-Fi router, etc.).
When most people talk about the Internet in common speech, they are merely talking about a couple of different protocols out of many hundreds (or perhaps thousands), namely HTTP and HTTPS. Plenty of other protocols are used every single day, such as File Transfer Protocol, Voice over IP, and Secure Shell, among many others. Even if we narrow our view to HTTP and HTTPS, wouldn’t that part of the Internet just consist of ordinary web pages? And doesn’t Google, not to mention other search engines, already index all of those pages so we have a means of sifting through all of the web servers to find the information we want?
Well, not exactly. If you go and run a Google query right now, it’ll spit back millions of results in a fraction of a second. And while that may look like it’s every imaginable web page on the Internet that’s related to your search terms, there’s a lot you don’t see. The fact remains that search engines can only index a small fraction of the content that is hosted on web servers.
Why Can’t Google Index the Entire Internet?
Despite what most people think, Google does not have supreme authority on the Internet. In fact, it’s completely optional whether or not you want Google to index your site. But how does Google index sites in the first place? The answer is critical to understanding a vast portion of the deep web. You see, Google uses mechanisms called crawlers to hunt around on the Internet and find web pages.
After a crawler has crawled a page and read the content, it feeds the data through a special proprietary Google algorithm that uses a lot of composite metrics to rank the web page, and perhaps place that page on the first page of the search engine results. But how do the crawlers know where to crawl? Well, they don’t call it the World Wide Web for nothing. The vast majority of web pages and domains are intermingled and linked to each other many times over.
If a crawler is crawling a page and notices outbound links, it follows those links and starts crawling the new pages. If those new pages contain links of their own, the crawler queues up that second set of pages as well, and so on, and so on. It’s a very recursive process, and the crawlers work by building a sort of link road-map of the Internet.
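The link-following process described above can be sketched in a few lines of Python. This is only an illustration, not how Google's crawlers are actually built: the page names and the LINKS graph are made up, and the "web" is an in-memory dictionary rather than real HTTP requests. It does show one consequence worth noticing: a page that nothing links to is never discovered.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links out to.
LINKS = {
    "home": ["blog", "about"],
    "blog": ["post-1", "home"],
    "about": [],
    "post-1": ["home"],
    "orphan": ["home"],  # links out, but no other page links to it
}

def crawl(seed, links):
    """Follow outbound links breadth-first; return pages in discovery order."""
    order = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:  # skip pages we have already queued
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

discovered = crawl("home", LINKS)
print(discovered)
print("orphan" in discovered)
```

Starting from "home", the crawl reaches every page that is linked from somewhere, while "orphan" stays invisible even though it is sitting on the same "web".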
But the crawlers don’t have an all-access pass to every page on the Internet. There are many roadblocks in their way that can prevent them from crawling a website. For instance, a web server hosted on a private network may not be reachable by the crawlers at all. Also, consider gated content. What about content sites that require a user to enter login credentials before accessing information? With few exceptions, crawlers are blocked by sites requiring login credentials.
Furthermore, a lot of data hosted on the Internet is used to create dynamically generated web pages. The crawlers simply can’t comb through the secure back-end databases to index the content. With that, we are starting to touch on the vast majority of content that is genuinely part of the deep web. But also consider that a website administrator can opt out of Google indexing (in fact, you can opt out of it with your Facebook page as well in the Facebook privacy settings).
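One common way an administrator opts out is a robots.txt file at the root of the site, which well-behaved crawlers read before fetching anything. The sketch below uses Python's standard urllib.robotparser module to show how a crawler would interpret such a file. The file contents, the example.com URLs, and the "OtherBot" crawler name are made up for illustration. Keep in mind that robots.txt is a request that polite crawlers honor, not an access control.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out entirely, and keep
# every other crawler out of the /private/ directory.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, url):
    """Return True if the parsed rules let this crawler fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("Googlebot", "https://example.com/index.html"))
print(allowed("OtherBot", "https://example.com/index.html"))
print(allowed("OtherBot", "https://example.com/private/report.html"))
```

With these rules, nothing on the site is crawlable by Googlebot, and the /private/ directory is off limits to every crawler that respects the file.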
Types of Content on the Deep Web
With so much of the Internet left untapped by the search engines, you are probably wondering what types of content actually exist in the deep web. Naturally, there are some truly disgusting websites hosted by people engaged in nefarious and despicable activities. For instance, there’s more than a fair share of stomach-wrenchingly nasty and disgusting porn sites.
Some people have used the deep web to operate sex trafficking rings, and even to distribute child pornography. In addition, there are quite a few sites that are engaged in the trafficking of drugs and illegal contraband, such as weapons. I don’t recommend that you go hunting around on the deep web for these types of websites out of idle curiosity.
Even the act of just visiting one of these sites would look incredibly suspicious should the authorities become aware of your browsing habits. In the modern digital-driven era, you never know who else may be watching your online activities – even with precautions like Tor, VPN tunnels, and anonymity services. Don’t believe me? Well, consider that the FBI has unmasked users and operators of Tor hidden services on multiple occasions in the past.
However, I don’t want to paint an inaccurate picture of the deep web. The fact remains that the aforementioned types of unsavory web destinations only account for a small fraction of all the data that could be classified as the deep web. Other types of data that reside on the deep web include, but are not limited to, the following:
Dynamically generated content – dynamic web pages that are based on back-end databases are becoming increasingly common, and that data isn’t typically indexed by search engines.
Limited content – it’s a common practice for some websites to limit who can reach them. A few examples include web servers hosted on a corporate intranet, websites that are gated with login credentials, websites that require user verification via a CAPTCHA security plugin, and other similar types of limitations.
Oddball file formats – not all of the web is written in text. There are a lot of strange file formats that can’t always be indexed by search engines, such as non-HTML based content, voice, podcasts, video, multimedia content, and others.
Private networks – a lot of the Internet is private, such as any home or business network residing behind a firewall.
Content protected by software and digital services – some content simply isn’t accessible unless you have access to the right software. For instance, some web pages and domains aren’t accessible unless you are using I2P or Tor. These types of websites are what most people commonly think of as being the entire deep web.
Disavowed content – some content on the Internet disavows incoming links, or simply isn’t linked to. Furthermore, the website administrator may use software tools to prevent crawlers from indexing the page, as well as break incoming links with redirects.
Archived data – some tools allow people to make historical catalogues of web pages to track changes over time. However, over time, some of the older versions of those web pages may become inaccessible.
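To make the "disavowed content" item in the list above a little more concrete, one widely used way to keep a page out of search results is a robots meta tag containing "noindex" in the page's HTML. The following is a minimal sketch, using only Python's standard html.parser module, of how a crawler might check for that tag before indexing a page. The class name, function name, and both sample pages are made up for illustration.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def is_indexable(html):
    """Return False if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives

# Hypothetical pages: one opts out of indexing, one does not.
HIDDEN_PAGE = ('<html><head><meta name="robots" content="noindex, nofollow">'
               '</head><body>Members only</body></html>')
PUBLIC_PAGE = "<html><head><title>Blog</title></head><body>Hello</body></html>"

print(is_indexable(HIDDEN_PAGE))
print(is_indexable(PUBLIC_PAGE))
```

A page flagged this way can still be visited by anyone who knows its address. It simply asks search engines to leave it out of their results.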
How to Access the Deep Web
I know that the deep web sounds like an interesting and curious place, but there’s not that much to it. The easiest way to access what you would think of as the deep web would be to download Tor. You see, Tor is an anonymity network that makes it (virtually) impossible to track users’ online browsing habits. There are a few exceptions, but for the most part, the Tor network is adept at hiding your identity.
Before we dig into the dirty details of accessing hidden services, I need to caution you: proceed at your own risk. There are a lot of dangerous and malicious websites on the deep web, and if you get a virus or end up on a government watch list for accessing the wrong content, that burden rests squarely on your shoulders.
That said, you want to go ahead and download Tor, and then access hidden services. But how do you know about a website if no one tells you it exists, and you can’t find it in a search engine? I know it seems like a bit of a “chicken and the egg” scenario, but there are a few places you can go to find directories of hidden services.
The following are a few sources for links to the deep web:
Because a lot of these websites aren’t indexed by search engines, they are instead spread by word of mouth. There’s no telling what websites are lurking out there on the deep web, and some may only have user bases of tens or a few hundred people. Some of them are invite only, though these three aforementioned places are a good place to start.
Just glancing at the list, you can begin to get an idea of the types of websites occupying the deep web. There are more than a few of them I have no desire to visit, such as those that deal with illegal arms trading.
Some of them seem to be pretty black market in nature, and could very well be scams, such as the fake ID and passport websites. Some of them are simply nefarious online marketplaces, like the infamous Silk Road, a sort of Alibaba or Amazon for contraband. However, as far as informational content is concerned, there are a lot of hidden blogs as well.
The amount of the Internet the average user can see is analogous to an iceberg, whereby you can only see the tiny bit poking above the water’s surface. But deep down below, there’s more than meets the eye. I know that some people are curious by nature, and that you may be one of them.
However, I’d strongly caution you against taking a stroll down the back alleys of the Internet. You don’t know what you’re going to find, and it might be pretty ugly. Unless you are specifically trying to access content that you already know about, I wouldn’t spend too much time on the deep web.