What is HTTP, actually? What happens when we access a web page

In this article we take a look at HTTP based on what people most often use it for: requesting a webpage.

HTTP stands for HyperText Transfer Protocol. A protocol is a set of rules that explain how to handle a situation. This webpage is served to your browser by a webserver, and thanks to the Hypertext Transfer Protocol both your browser and the webserver know what they have to do when you decide to visit a website. HTTPS is HTTP with added security; the S stands for Secure.

Hypertext is text connected by hyperlinks, the links we click to move from webpage to webpage. Webpages today are not purely text anymore; they also contain images, videos and sounds. For that reason we often talk about hypermedia instead of hypertext. This media is called "hyper" because it is interactive, as opposed to text on paper. In practice we usually call these documents or resources instead of hypermedia. In this article I will use an example with a document.

HTTP is used for communication over the internet. It is an application layer protocol, meaning that it is used between applications. In the Open Systems Interconnection (OSI) model, a model for computer-to-computer communication, application layer protocols sit at layer 7, the highest layer. Besides HTTP, many more protocols were involved in delivering this page to you!

The protocol involves a client and a server. In this example my browser is the client and whichever computer contains my website is the server. The client sends an HTTP request to the server. The server gives a response. The client may request a specific document, using a specific version of HTTP. That is what we call a GET request, and it is the simplest example.

This is what a request from my browser for my webpage looks like:

GET https://tacosteemers.com/articles.html

The client can also add more details to its request, called header fields. Examples are the user's login information or the fact that they prefer not to be tracked.

My browser has added many details to the request. Here are some of the request header fields:

Host: tacosteemers.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
If-Modified-Since: Sun, 02 Jan 2022 20:56:04 GMT

The Host request header is mandatory in HTTP/1.1 and tells the server which host we want to place a request with. This may seem redundant because we are already placing our request at this host when we GET https://tacosteemers.com/articles.html. However, it is not redundant. My website is served to the client from a server that serves up many websites. That server does not have the name tacosteemers.com. Instead it will be accessed by a name that may look like web887.dc3.example.com. The hostname in the URL is mainly used to look up an IP address to connect to, and by the time the GET request arrives at this server somewhere in a datacenter, it may have passed through many different computers and routers, such as load balancers and proxies. The server needs the Host request header to know which website we want.
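To make the role of the Host header concrete, here is a minimal sketch in Python. It assumes HTTP/1.1, where the first line of the request as sent over the connection holds only the method, the path and the protocol version, and the hostname travels in the Host header. The function names, the SITES table and the directory paths are hypothetical, made up for this illustration; a real webserver does considerably more.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request as sent over the connection."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # tells a shared server which website we mean
        "Connection: close",
        "",                      # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical table on a shared server: hostname -> directory with that site.
SITES = {
    "tacosteemers.com": "/var/www/tacosteemers",
    "example.org": "/var/www/example",
    "default": "/var/www/default",
}


def pick_site(raw_request: bytes, sites: dict) -> str:
    """Choose a website by looking at the Host header of a raw request."""
    header_section = raw_request.split(b"\r\n\r\n", 1)[0].decode("ascii")
    # Skip the request line, then look for the Host header (case-insensitive).
    for line in header_section.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return sites.get(value.strip().lower(), sites["default"])
    return sites["default"]


request = build_get_request("tacosteemers.com", "/articles.html")
print(request.decode("ascii"))
print(pick_site(request, SITES))
```

Without the Host line, the server in this sketch could only fall back to its default site, because the path /articles.html alone does not say which website is meant.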

The Accept request header lets the server know what kind of documents the client can accept. If-Modified-Since means that the client only wants to receive the document if it has been changed since the given time. If the client sends this it means that it already has a copy from that date and time on disk, and if the server doesn't have a newer version it will tell the client in its response. The server's response will not include the document in that situation.

The server responds with:

A status code
A list of response headers
The response body, which contains the actual document

The status code for this response is 200, which simply means "OK". If the document on the server was not newer than the time the browser indicated with If-Modified-Since, the server would have given status code 304 "Not Modified" and the response body would have been empty.
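The choice between 200 and 304 can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how any particular webserver implements it; the function name conditional_status is made up, and the two dates are the ones from the headers in this article.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from http import HTTPStatus


def conditional_status(last_modified, if_modified_since=None):
    """Pick the status code for a GET that may carry If-Modified-Since."""
    if if_modified_since is None:
        return HTTPStatus.OK  # no condition given: always send the document
    try:
        client_copy_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return HTTPStatus.OK  # unreadable date: act as if the header was absent
    if client_copy_time.tzinfo is None:
        # Assumption of this sketch: treat dates without a timezone as UTC.
        client_copy_time = client_copy_time.replace(tzinfo=timezone.utc)
    if last_modified <= client_copy_time:
        return HTTPStatus.NOT_MODIFIED  # the client's copy is still current
    return HTTPStatus.OK


document_changed = parsedate_to_datetime("Mon, 03 Jan 2022 06:33:28 GMT")
status = conditional_status(document_changed, "Sun, 02 Jan 2022 20:56:04 GMT")
print(status.value, status.phrase)
```

With these dates the document on the server is newer than the browser's copy, so the sketch answers 200. Had the browser sent the later date, the answer would have been 304.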

Some of the response headers are:

Content-Length: 30224
Content-Type: text/html
Last-Modified: Mon, 03 Jan 2022 06:33:28 GMT

The first two response headers tell the client how to interpret the contents of the response body. Last-Modified tells us that the document has indeed changed since we last accessed it. My hosting company has also added two custom response headers that tell us which webserver and load balancer this request and response have passed through. They probably do this to allow them to diagnose problems in their network.
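As a rough sketch of how a client can take such a response apart, the Python below splits a raw HTTP/1.1 response into the status code, the headers and the body. The sample response is invented for this illustration (the real page is 30224 bytes long), and parse_response is a hypothetical helper; real clients also handle compression, chunked transfer and many other details.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP/1.1 response into (status code, headers, body)."""
    # An empty line separates the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, _, rest = status_line.partition(" ")
    code, _, _reason = rest.partition(" ")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # Content-Length says how many bytes of body belong to this response.
    length = int(headers.get("content-length", len(body)))
    return int(code), headers, body[:length]


sample_body = b"<html>hello</html>"
sample = (
    b"HTTP/1.1 200 OK\r\n"
    + f"Content-Length: {len(sample_body)}\r\n".encode("ascii")
    + b"Content-Type: text/html\r\n"
    + b"\r\n"
    + sample_body
)

status, headers, body = parse_response(sample)
print(status, headers["content-type"], len(body))
```

Content-Type then tells the client what to do with those bytes; for text/html the browser parses them as a webpage.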

You may have noticed that we used the word GET, and wondered if there are any other words. We call these request methods. There are nine request methods.

GET
HEAD, a GET request without getting the body in the response
POST, where the client sends data to the server for further processing
PUT, where the client sends data that creates a resource at the given location or overwrites what already exists there, for example something that was POST-ed earlier
DELETE
CONNECT, a more complicated request method
OPTIONS, where the client asks the server what options there are for communicating with the server or a specific resource
TRACE, a method that is new to me; apparently it is used for troubleshooting, and the response shows what the request looked like to the server after travelling through all the intermediary systems
PATCH, for sending instructions on how to partially update a resource or document
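To give a feel for how a few of these methods differ, here is a toy sketch in Python of a server that keeps its documents in a dictionary. It is not a real webserver: the handle function and the store are invented for this illustration, only GET, HEAD, PUT and DELETE are covered, and every other method is answered with 405 "Method Not Allowed". One assumption to note is that PUT in this sketch also creates a document that does not exist yet, which the HTTP specification allows.

```python
from http import HTTPStatus


def handle(method: str, path: str, store: dict, body: bytes = b""):
    """Return (status, response body) for a toy in-memory document store."""
    if method in ("GET", "HEAD"):
        if path not in store:
            return HTTPStatus.NOT_FOUND, b""
        # HEAD gives the same status as GET but leaves the body out.
        response_body = store[path] if method == "GET" else b""
        return HTTPStatus.OK, response_body
    if method == "PUT":
        # Replace the document at this path, or create it if it is missing.
        status = HTTPStatus.OK if path in store else HTTPStatus.CREATED
        store[path] = body
        return status, b""
    if method == "DELETE":
        if store.pop(path, None) is None:
            return HTTPStatus.NOT_FOUND, b""
        return HTTPStatus.NO_CONTENT, b""
    return HTTPStatus.METHOD_NOT_ALLOWED, b""


store = {}
print(handle("PUT", "/articles.html", store, b"<p>hi</p>"))
print(handle("GET", "/articles.html", store))
print(handle("HEAD", "/articles.html", store))
print(handle("DELETE", "/articles.html", store))
```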

I haven't used PATCH but I imagine that PATCH is handy for when the client doesn't have the document or doesn't want to send it because it is too large, but the client does know what modifications need to be made to the document.
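PATCH itself does not prescribe what those instructions look like. As one possible illustration, assuming the document is JSON, the sketch below applies a JSON Merge Patch (described in RFC 7396): keys in the patch replace keys in the document, and a null value removes a key. The merge_patch function and the example document are made up for this sketch.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch to target and return the patched document."""
    if not isinstance(patch, dict):
        return patch  # a patch that is not an object replaces the target
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means: remove this key
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


document = {"title": "Articles", "tags": ["http"], "draft": True}
patch = {"title": "All articles", "draft": None}  # small compared to the document
print(merge_patch(document, patch))
```

The client only has to send the small patch; the server applies it to the full document it already has.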

The proposal for the current HTTP version, HTTP/2, dates from May 2015 (RFC 7540). The first eight request methods are described in the earlier HTTP/1.1 specification. The PATCH method is described in a separate specification (RFC 5789).