How The Internet Really Works: A Hands-On Crash Course from Ethernet to HTTP using Wireshark
Whether you’re a hacker, IT pro, coder, or just curious, it helps to know exactly how the Internet works: you may understand the idea of connections, but do you understand all the protocols and steps that it takes to create and troubleshoot a connection?
Ever wondered what exactly happens between typing “google.com” into the address bar and seeing the webpage appear on your screen? Do you know what would happen if two computers had slightly different subnet masks, or how ARP spoofing works, or what exactly the Kaminsky DNS attack was, or what happens when you plug a switch back into itself?
This was presented at CactusCon 2014, and the slides and Wireshark captures are available here: how-internet-works.zip (the slides are sparse; turn on notes to see what I said for each slide). If you don’t have PowerPoint, you can download LibreOffice (free) or see the SlideShare.
Also note that this is a semester’s worth of Networking 101 presented in about an hour. It’s enough to get you started Googling topics of interest, and hopefully gives you a gut feeling for all the different things happening during a typical connection, but some bits are omitted, so please do more research to get a complete understanding. Open Wireshark yourself and send out your own traffic; read books or tutorials; consider certification classes like Network+, Security+, or Cisco.
Finally, I’m happy to answer questions in the comments or on Twitter @willbradley.
Just in case you can’t see the notes attached to the slides, here are my full notes:
exactly how the Internet works
– protocols and steps required
– 7 layer OSI model
– TCP, UDP, ARP, and MAC definitions
– IP, NAT, BGP definition
– DNS definition
– HTTP definition
– create and troubleshoot a connection?
– type “heatsynclabs.org” into the address bar
– browser asks the OS’s network stack to open a TCP connection to “heatsynclabs.org” on port 80
– the OS uses its DNS resolver to ask your DNS server what IP address “heatsynclabs.org” is at
– your DNS server (let’s say 22.214.171.124) probably won’t know the answer directly, so it’ll use a recursive lookup to find the answer from the global DNS infrastructure:
– root servers (like l.root-servers.net) are authoritative for the root (.) and know where to find the TLD servers (for “org”)
– TLD (top-level domain) servers are authoritative for a TLD (l.gtld-servers.net, for example, serves “com”; “org” has its own set) and know where to find the domain’s servers (for “heatsynclabs.org”), also known as nameservers or the “NS records” which you might be familiar with setting for your own domains.
– nameservers (like ns1.heatsynclabs.org) are authoritative for a domain (heatsynclabs.org) and can give you information on all that domain’s records (like heatsynclabs.org or www.heatsynclabs.org)
– now that your DNS server has this information, it will cache it to save time later, and return the IP for “heatsynclabs.org” to your OS (126.96.36.199)
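The root, TLD, nameserver walk above can be sketched as a toy in-memory lookup. This is only an illustration, not how a real resolver is written: the server labels, the `ZONES` table, and the address 192.0.2.10 (a documentation-only IP) are all made up for the example.

```python
# Toy model of the recursive lookup described above. Every server label and
# the final address are illustrative assumptions, not real DNS data.
ZONES = {
    "root": {"referral": {"org": "org-tld-server"}},
    "org-tld-server": {"referral": {"heatsynclabs.org": "ns1.heatsynclabs.org"}},
    "ns1.heatsynclabs.org": {"answer": {"heatsynclabs.org": "192.0.2.10"}},
}

cache = {}


def resolve(name):
    """Walk root -> TLD -> authoritative nameserver, caching the answer."""
    if name in cache:
        return cache[name], ["cache"]
    server, path = "root", []
    while True:
        path.append(server)
        zone = ZONES[server]
        if name in zone.get("answer", {}):
            cache[name] = zone["answer"][name]
            return cache[name], path
        # Follow the referral whose suffix matches the name being resolved.
        for suffix, next_server in zone.get("referral", {}).items():
            if name == suffix or name.endswith("." + suffix):
                server = next_server
                break
        else:
            raise LookupError(f"no referral for {name} at {server}")


print(resolve("heatsynclabs.org"))  # first call walks all three servers
print(resolve("heatsynclabs.org"))  # second call is answered from the cache
```

The second call never leaves the cache, which is the "save time later" step in the notes.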
– now the OS knows what IP to connect to; but first it needs to know how.
– the OS checks its networking settings. If your computer’s IP is 10.0.1.100, and its subnet mask is 255.255.255.0, then anything starting with 10.0.1.__ will be in its subnet and anything else will be out of its subnet.
– IPs inside the subnet will be directly resolved with ARP and connected to via TCP.
– IPs outside the subnet will be forwarded through your Default Gateway (let’s say 10.0.1.1) first.
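The inside/outside decision is easy to try yourself with Python’s standard `ipaddress` module. A minimal sketch follows; the function name `next_hop` and the outside address 192.0.2.10 are my own placeholders.

```python
import ipaddress


def next_hop(my_ip, netmask, dest_ip, gateway):
    """Return the IP we need a MAC address for: the destination itself if it
    is inside our subnet, otherwise the default gateway."""
    network = ipaddress.ip_network(f"{my_ip}/{netmask}", strict=False)
    if ipaddress.ip_address(dest_ip) in network:
        return dest_ip
    return gateway


# A neighbour on the same subnet is contacted directly...
print(next_hop("10.0.1.100", "255.255.255.0", "10.0.1.42", "10.0.1.1"))
# ...while anything else goes to the default gateway first.
print(next_hop("10.0.1.100", "255.255.255.0", "192.0.2.10", "10.0.1.1"))
```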
– To connect to any local IP (default gateway or otherwise) your OS initiates an ARP request.
– On the local network, TCP and UDP packets travel inside Ethernet frames, which are addressed by MAC address rather than IP. ARP translates an IP into a MAC address.
– The request looks like “What MAC address has 10.0.1.1?” and is sent to the entire local network (a “broadcast” packet)
– The first computer to respond wins; the response looks like “10.0.1.1 is at 00:26:bb:6c:12:e0”
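If you want to compare against what Wireshark shows, here is a rough sketch of how the bytes of that “who has 10.0.1.1?” broadcast are laid out. It only builds the frame in memory and doesn’t send anything; the sender MAC 02:00:00:00:00:01 is a made-up, locally administered address.

```python
import socket
import struct


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP request (42 bytes, unpadded)."""
    broadcast = b"\xff" * 6  # destination MAC: everyone on the local network
    eth_header = broadcast + sender_mac + struct.pack("!H", 0x0806)  # EtherType ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # MAC address length
        4,            # IPv4 address length
        1,            # opcode 1 = request ("who has ...?")
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,  # target MAC is unknown, so it is left as zeros
        socket.inet_aton(target_ip),
    )
    return eth_header + arp


# Hypothetical sender MAC; the IPs are the ones used in the notes.
frame = build_arp_request(bytes.fromhex("020000000001"), "10.0.1.100", "10.0.1.1")
print(len(frame), frame.hex())
```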
– Finally, the OS sends a packet destined for heatsynclabs.org (188.8.131.52) to your default gateway 10.0.1.1 (00:26:bb:6c:12:e0). It uses some local source port over 1024, and a destination port of 80, using the TCP protocol.
– TCP is “stateful” or “connection-oriented” which means that instead of just sending packets to an address, it first makes sure that the packets will arrive properly (and automatically re-sends any lost packets)
– It accomplishes this with a “3-way handshake” — SYN, SYN/ACK, and ACK. SYN means “synchronize sequence numbers”, ACK is an acknowledgement that a packet was received. Finally, RST packets will reset a connection (usually because it’s already been closed on that side), and FIN indicates that there is no more data (done transmitting.)
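The sequence and acknowledgement numbers you’ll see in a capture follow a simple pattern, sketched below. This is a simulation of the bookkeeping only, not a TCP implementation, and the starting numbers 1000 and 5000 are arbitrary (real stacks pick random initial values).

```python
def three_way_handshake(client_isn, server_isn):
    """Return the three handshake packets as dicts, given each side's
    initial sequence number (ISN). Numbers wrap at 32 bits."""
    wrap = 2 ** 32
    syn = {"flags": "SYN", "seq": client_isn % wrap}
    syn_ack = {
        "flags": "SYN/ACK",
        "seq": server_isn % wrap,
        "ack": (syn["seq"] + 1) % wrap,  # "I got your SYN"
    }
    ack = {
        "flags": "ACK",
        "seq": syn_ack["ack"],
        "ack": (syn_ack["seq"] + 1) % wrap,  # "I got your SYN/ACK"
    }
    return [syn, syn_ack, ack]


for packet in three_way_handshake(1000, 5000):
    print(packet)
```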
– Once the connection is established, the data sent is the text “GET /” (the start of an HTTP request line), according to the HTTP standard
– The gateway probably uses NAT to let multiple internal computers connect to the internet with one internet-facing IP address. In order for this to work, it keeps track of your connection and sends it out on a unique source port. So while internally your computer might’ve sent the request from its IP on port 1024 to HeatSync on port 80, your NAT gateway will change that to its own public IP on port 3000 to HeatSync on port 80, and then change it back when it communicates back to you.
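The bookkeeping a NAT gateway does can be sketched as two small lookup tables. This is a simplified model (real NAT also tracks protocol, destination, and timeouts); the class name `Nat` and the public IP 203.0.113.5 are placeholders of mine.

```python
class Nat:
    """Minimal port-translating NAT table: one public IP, many private hosts."""

    def __init__(self, public_ip, first_port=3000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.out = {}   # (private_ip, private_port) -> public_port
        self.back = {}  # public_port -> (private_ip, private_port)

    def outbound(self, src_ip, src_port):
        """Rewrite an outgoing packet's source to the public IP and a unique port."""
        key = (src_ip, src_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def inbound(self, public_port):
        """Map a reply arriving on a public port back to the private host."""
        return self.back.get(public_port)


nat = Nat("203.0.113.5")
print(nat.outbound("10.0.1.100", 1024))  # first connection gets port 3000
print(nat.outbound("10.0.1.101", 1024))  # same private port, different host
print(nat.inbound(3000))                 # reply goes back to 10.0.1.100
```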
– Routers through the internet
– Routers use a protocol like BGP to figure out which route to take
– BGP basically consists of manually-configured peers (Router A and Router B are usually directly connected and manually configured to talk to each other)
– Peers pass along their entire routing table to others; the table looks like “Router 42 can get to network 10.0.0.0/8 via the router path: 525, 43, 124”
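Here is a very stripped-down sketch of the path-vector idea: keep the shortest path you hear about, ignore any path that already contains you, and put yourself on the front before passing it along. Real BGP has many more tie-breakers and policies, so treat this as an assumption-laden toy; the function names and the extra paths are invented.

```python
def best_paths(my_router, advertisements):
    """Pick one path per network from (network, path) pairs heard from peers.

    Simplification: shortest path wins. Paths that already contain
    my_router are dropped, which is what prevents routing loops.
    """
    table = {}
    for network, path in advertisements:
        if my_router in path:
            continue
        if network not in table or len(path) < len(table[network]):
            table[network] = path
    return table


def readvertise(my_router, table):
    """Prepend ourselves to each chosen path before telling our own peers."""
    return [(network, [my_router] + path) for network, path in table.items()]


heard = [
    ("10.0.0.0/8", [525, 43, 124]),  # the path from the notes
    ("10.0.0.0/8", [7, 124]),        # a shorter, made-up alternative
    ("10.0.0.0/8", [9, 42, 124]),    # contains router 42, so it is ignored
]
table = best_paths(42, heard)
print(table)
print(readvertise(42, table))
```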
– NAT on the far end (port forwarding, etc)
– HeatSync’s web server is listening on TCP port 80: a program like Apache has asked its OS to hand it any packets received on that port.
– The web server receives the data (“GET /”) and responds appropriately (with the contents of the “/” page, or in this case, a “301 Redirect” to “https://www.heatsynclabs.org” which starts the whole process again, except to “www.heatsynclabs.org” on TCP port 443, with encryption.)
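To make the request/redirect exchange concrete, here is a tiny sketch of what a server could do with those bytes. It is not how Apache works internally; `handle_request` is a hypothetical helper that only understands the request line.

```python
def handle_request(raw: bytes) -> bytes:
    """Answer any GET with a 301 redirect to the HTTPS site, as in the notes."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = request_line.split(" ")
    if method != "GET":
        return b"HTTP/1.1 405 Method Not Allowed\r\nContent-Length: 0\r\n\r\n"
    location = "https://www.heatsynclabs.org" + path
    return (
        "HTTP/1.1 301 Moved Permanently\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode("ascii")


request = b"GET / HTTP/1.1\r\nHost: heatsynclabs.org\r\n\r\n"
print(handle_request(request).decode("ascii"))
```

The browser reads the `Location` header and starts over with a new DNS lookup and a new TCP connection, this time to port 443.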
– HeatSync’s response goes back to your computer in much the same way as your connection was created, except in response to the existing connection so it can skip most of the lookups and just trace the path backwards.
– browser receives the data and interprets the HTML in the packets
– see the webpage appear on your screen!
– what would happen if two computers had slightly different subnet masks
– computer A is 10.0.1.120, subnet 255.255.255.128
– computer B is 10.0.1.160, subnet 255.255.255.0
– A to B would assume that B is on a different subnet, and send to the gateway — this probably wouldn’t work unless the gateway was configured to allow this.
– B to A would assume that A is on the same subnet, and would try to send directly to A. The packet may well arrive, but A considers B to be on a “different” subnet and will send its replies through the gateway, so the conversation only works if the gateway forwards them.
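You can check the mismatch above by computing the address range each computer believes is local. A quick sketch with the standard `ipaddress` module (the helper name `local_range` is mine):

```python
import ipaddress


def local_range(ip, mask):
    """Return (first, last) address of the subnet this host thinks it is on."""
    net = ipaddress.ip_network(f"{ip}/{mask}", strict=False)
    return str(net.network_address), str(net.broadcast_address)


def in_range(ip, first_last):
    first, last = (ipaddress.ip_address(x) for x in first_last)
    return first <= ipaddress.ip_address(ip) <= last


a_range = local_range("10.0.1.120", "255.255.255.128")
b_range = local_range("10.0.1.160", "255.255.255.0")
print("A thinks local is", a_range, "-> B is local?", in_range("10.0.1.160", a_range))
print("B thinks local is", b_range, "-> A is local?", in_range("10.0.1.120", b_range))
```

A’s range stops at 10.0.1.127, so B (at .160) falls outside it, while B’s range covers the whole of 10.0.1.0 through 10.0.1.255.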
– how ARP spoofing works
– the first response to an ARP request wins!
– flood the network with broadcast packets announcing that each IP on the network belongs to your computer’s MAC (which, by the way, can also be changed manually to any MAC you desire)
– keep track of everyone else’s genuine ARP responses, so you know where each IP’s traffic really belongs
– now all traffic will come to you!
– read/modify the traffic, and send it on to its original destination if desired.
– secure your physical and wireless networks!
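From the defender’s side, the tell-tale sign is an IP whose MAC address suddenly changes. Below is a small in-memory sketch of a cache that accepts whatever it hears (which is an assumption; real operating systems differ in when they accept replies) but records every change. The second MAC is a made-up address.

```python
class ArpCache:
    """Naive ARP cache that trusts every reply but logs changed mappings."""

    def __init__(self):
        self.table = {}   # ip -> mac
        self.alerts = []  # (ip, old_mac, new_mac) for every changed mapping

    def on_reply(self, ip, mac):
        old = self.table.get(ip)
        if old is not None and old != mac:
            self.alerts.append((ip, old, mac))
        self.table[ip] = mac


cache = ArpCache()
cache.on_reply("10.0.1.1", "00:26:bb:6c:12:e0")  # the real gateway from the notes
cache.on_reply("10.0.1.1", "02:00:00:00:00:66")  # someone else claims the same IP
print(cache.table["10.0.1.1"])
print(cache.alerts)
```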
– what exactly the Kaminsky and other DNS cache poisoning attacks are:
– DNS requests work similarly to ARP requests: they carry a 16-bit ID (which I’ll call a sequence number) as a basic form of protection, but they’ll still accept the first valid response.
– so, if you wanted to pretend to be Bank.com, you could try sending fake response packets for Bank.com (with your web server’s IP instead of the real one) to a recursive DNS server (like 184.108.40.206 or whatever your ISP provides.)
– but, this wouldn’t work too well because of the sequence number (0-65,535)
– so, we send it a request for a domain that we own (using a nameserver we control), and listen for the sequence number it uses.
– then, since sequence numbers used to be sequential, we could send the server a bunch of requests for Bank.com and also a bunch of answers using likely sequence numbers.
– once a valid response gets accepted by the DNS server, it will typically cache it for hours. Now everyone using that server will think that you are Bank.com! Scary.
– sequence numbers are now randomized, but there are still a number of weaknesses: for starters, there are only 65,536 possible sequence numbers, so it’s still relatively easy to guess randomly.
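To put a number on “relatively easy to guess”, here is the back-of-the-envelope arithmetic. It assumes each forged answer uses a different number and that each attempt is independent, which is a simplification; the function name and the example counts are mine.

```python
def guess_odds(forged_per_attempt, attempts=1, id_space=65536):
    """Chance that at least one forged answer carries the right number.

    forged_per_attempt: distinct numbers tried before the real answer arrives
    attempts: how many separate times the race is run
    id_space: how many possible numbers there are (16 bits -> 65,536)
    """
    p_single = min(forged_per_attempt / id_space, 1.0)
    return 1 - (1 - p_single) ** attempts


print(guess_odds(100))                 # one attempt with 100 guesses
print(guess_odds(100, attempts=1000))  # the same thing repeated 1000 times
```

A single attempt is a long shot, but the odds climb quickly when the attempt can be repeated, which is why a 16-bit number on its own is weak protection.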
– what happens when you plug a switch back into itself
– switches work by remembering which MAC addresses they see on each port, and then directing packets that come in to the appropriate port. (protip: the address memory is limited! hmm…)
– broadcast packets, however, are sent to all ports.
– if a switch is plugged back into itself (or other loops formed) then a packet being received on one port (from MAC address X to everyone) will come back in to another port… which might cause the table entries for MAC address X to get duplicated or disappear, but will also cause a broadcast packet to come back out that second port… and back into the first port. Basically, causing an infinite echo chamber of all broadcast packets ever, and potentially clearing the address table of the switch (if enough computers send a broadcast packet, causing their MAC addresses to be duplicated/cleared.)
– remember the address memory part? Well, if you flood a switch with MAC addresses to remember, it’ll do something interesting: it’ll start behaving like a hub, which sends all packets to all ports. This can cause problems (like flooding) or be advantageous to an attacker (all traffic on the switch is suddenly visible to everyone.)
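The learn-then-forward behaviour, and what happens when the address memory fills up, can be modelled in a few lines. This is a sketch under simplifying assumptions (no entry aging, no VLANs, a full table simply stops learning); the class and the short MAC strings are made up.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class Switch:
    """Toy learning switch with a limited MAC address table."""

    def __init__(self, num_ports, table_size):
        self.ports = range(num_ports)
        self.table_size = table_size
        self.mac_table = {}  # mac -> port

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the sender's port, then return the list of output ports."""
        if src_mac in self.mac_table or len(self.mac_table) < self.table_size:
            self.mac_table[src_mac] = in_port
        if dst_mac != BROADCAST and dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        # Broadcasts and unknown destinations go out every other port.
        return [p for p in self.ports if p != in_port]


sw = Switch(num_ports=4, table_size=2)
print(sw.receive(0, "aa", BROADCAST))  # broadcast: flooded to ports 1, 2, 3
print(sw.receive(1, "bb", "aa"))       # "aa" was learned on port 0
print(sw.receive(2, "cc", "bb"))       # table is full, so "cc" is not learned
print(sw.receive(0, "aa", "cc"))       # unknown destination: hub-like flooding
```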
– how ethernet / 802.3 works
– +/- 2.5 volts
– 8-position, 8-conductor cables
– orange and green pairs; up to +2.5 volts on one half of the pair, with the opposite voltage on the other half (a form of self-noise-cancellation).
– orange pair is typically on pins 1 & 2, and these are the transmit pins for a computer. green pair is on pins 3 & 6, and these are the receive pins. (Transmit/receive is opposite on switches/routers, or frequently auto-negotiated on modern equipment.)
– With 100BASE-TX hardware, the raw bits (4 bits wide clocked at 25 MHz) go through 4B5B binary encoding to generate a series of 0 and 1 symbols clocked at 125 MHz symbol rate.
– 4b5b table (hex digit, 4-bit data, 5-bit code):
0 0000 11110
1 0001 01001
2 0010 10100
3 0011 10101
4 0100 01010
5 0101 01011
6 0110 01110
7 0111 01111
8 1000 10010
9 1001 10011
A 1010 10110
B 1011 10111
C 1100 11010
D 1101 11011
E 1110 11100
F 1111 11101
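The table translates directly into a lookup. A minimal sketch of encoding bytes with it follows; I’m assuming the low nibble of each byte is sent first, and I’m leaving out the special control codes (idle, start, end) that real hardware also uses.

```python
# Data code groups copied from the 4b5b table above.
FOUR_B_FIVE_B = {
    0x0: "11110", 0x1: "01001", 0x2: "10100", 0x3: "10101",
    0x4: "01010", 0x5: "01011", 0x6: "01110", 0x7: "01111",
    0x8: "10010", 0x9: "10011", 0xA: "10110", 0xB: "10111",
    0xC: "11010", 0xD: "11011", 0xE: "11100", 0xF: "11101",
}


def encode_4b5b(data: bytes) -> str:
    """Turn each byte into two 5-bit code groups (10 bits per byte)."""
    out = []
    for byte in data:
        low, high = byte & 0x0F, byte >> 4
        out.append(FOUR_B_FIVE_B[low])   # assumption: low nibble goes first
        out.append(FOUR_B_FIVE_B[high])
    return "".join(out)


print(encode_4b5b(b"\x0f"))     # nibbles F then 0
print(len(encode_4b5b(b"hi")))  # 2 bytes become 20 bits
```

Every 4 data bits become 5 line bits, which is where the jump from 100 Mbit/s of data to a 125 MHz symbol rate comes from.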
– 100BASE-TX over copper uses NRZI encoding, and then MLT-3 for final encoding, resulting in a maximum “fundamental frequency” of 31.25 MHz.
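MLT-3 steps through the voltage levels 0, +, 0, - and only moves to the next level when the bit is a 1, so the fastest possible waveform (all 1s) needs four bits to complete one full cycle: 125 MHz / 4 = 31.25 MHz. Here is a sketch of just that stepping rule; it ignores the scrambling that real hardware applies, and the starting level is an arbitrary choice.

```python
def mlt3_encode(bits, start_index=0):
    """Map a string of '0'/'1' bits to MLT-3 levels (-1, 0, +1).

    A '1' moves to the next level in the cycle 0, +1, 0, -1;
    a '0' holds the current level.
    """
    cycle = [0, 1, 0, -1]
    idx = start_index
    levels = []
    for bit in bits:
        if bit == "1":
            idx = (idx + 1) % 4
        levels.append(cycle[idx])
    return levels


print(mlt3_encode("11111111"))  # fastest case: one full cycle every 4 bits
print(mlt3_encode("1011"))      # the 0 bit holds the previous level
print(125e6 / 4)                # fundamental frequency in Hz
```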