Blog - 0x00


Python requests Tips and Tricks

2022/02/14


Requests is one of the most popular HTTP modules for Python, and for good reason: it’s easy to use. Want to get information from a REST API or just a website? Make a GET request with requests.get("https://example.com") and you get back an object that holds the headers, status code and text of the response.

If you look at the requests documentation, one of the first things you’ll see is a set of examples showing how to use it, which is what anyone would reach for to try it out. The docs also have an “Advanced Usage” section, which a beginner might find a little intimidating, or might skip because the example code already works. Well, the advanced usage section contains important information that can improve the robustness and speed of your code, and I think it belongs on the landing page of the docs. So in this post I’ll share what I found, with examples.

Timeout #

One of the little things that doesn’t appear on the landing page of the docs is the timeout parameter. It is covered in the docs, at the end of the list of topics, and its use is recommended. I quote:

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely…

That sounds like a very important option, but unless you read that part of the docs you might never know it exists. I made a script that makes a request to an endpoint that won’t respond until 120000 ms (2 minutes) have passed:

import logging
import requests

def timeout_requests():
    logging.info("Starting a slow request")
    # No timeout: this call blocks until the server answers (2 minutes here)
    requests.get("https://httpstat.us/200?sleep=120000")

$ python pull.py
2022-02-14 10:15:59,785 [INFO    ] (main:50) Script starting
2022-02-14 10:15:59,785 [INFO    ] (timeout_requests:42) Starting a slow request
2022-02-14 10:18:00,675 [INFO    ] (main:52) Finished running script

Well, it worked as expected: we waited 2 minutes for a response and at no point did the connection time out. As the docs say, it’s important to add a timeout to our requests, otherwise the script could hang forever. So let’s add a 10-second timeout: if the server doesn’t respond within 10 seconds, requests raises an exception.

import logging
import requests

def timeout_requests():
    logging.info("Starting a slow request")
    # Give up if the server has not responded within 10 seconds
    requests.get("https://httpstat.us/200?sleep=120000", timeout=10)

$ python pull.py
2022-02-14 10:22:18,009 [INFO    ] (main:50) Script starting
2022-02-14 10:22:18,009 [INFO    ] (timeout_requests:42) Starting a slow request
Traceback (most recent call last):
  ...
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=10)

Using a timeout is especially important if you have a Python script that runs on a schedule, for example from cron.
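
The requests docs also describe passing timeout as a (connect, read) tuple, which limits the two phases of a request separately. Here is a minimal sketch of that idea; the helper name fetch and the values 3.05 and 10 are only illustrative assumptions, not something taken from the script above:

```python
import requests

# Illustrative values: up to 3.05 s to connect, then up to 10 s waiting on the server.
DEFAULT_TIMEOUT = (3.05, 10)


def fetch(url, timeout=DEFAULT_TIMEOUT):
    """Hypothetical helper: GET a URL and never wait without a limit."""
    # A tuple sets the connect and read timeouts separately;
    # a single number would apply the same value to both.
    return requests.get(url, timeout=timeout)
```

Wrapping the call like this means the timeout can’t be forgotten on an individual request, while a caller can still pass a different value when needed.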

Exceptions #

When an unhandled error occurs in Python, the interpreter raises an exception, prints a stack trace and exits. The stack trace might be useful to the developer, but for an end user it’s just garbage. A better way to deal with those errors is a try-except block: put the code that might raise an exception in the try block, and the code that handles the error in the except block. Because a program can fail for different reasons, and each may need different handling, a try statement can have multiple except blocks, provided each one handles a different exception. Python also ships with built-in exceptions.
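
As a small illustration of multiple except blocks using only built-in exceptions, here is a sketch with a hypothetical describe_ratio function (the name and messages are made up for this example):

```python
def describe_ratio(numerator, denominator):
    """Hypothetical helper: divide two values given as text."""
    try:
        ratio = int(numerator) / int(denominator)
    except ValueError:
        # int() could not parse one of the inputs
        return "error: both values must be whole numbers"
    except ZeroDivisionError:
        # Both inputs parsed, but the denominator was zero
        return "error: cannot divide by zero"
    else:
        # Runs only when the try block raised nothing
        return f"ratio is {ratio}"
```

Each failure gets its own message, and the user never sees a stack trace.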

Requests provides its own exceptions. For example, if the machine running the script has no internet connection, or the remote server is down and no response is received, requests raises an exception.

You can read about each of the exceptions provided by requests in more detail in the exceptions section of the docs. I want to focus on the important ones: RequestException, HTTPError and Timeout.

HTTPError #

When an HTTP request is made, the response contains a status code that indicates whether the request was successful. 2XX codes mean success, and 4XX and 5XX codes mean there was an error. Requests provides a method on the response object called .raise_for_status(), which raises an exception (requests.exceptions.HTTPError) if the status code of the HTTP response is 4XX or 5XX. This is useful when you don’t care about handling a specific HTTP status code and only care whether the request succeeded. Here is an example of how to use it:

def status_requests():
    try:
        req = requests.get("https://httpstat.us/404")
        req.raise_for_status()
    except requests.exceptions.HTTPError as err:
        logging.error(err)
    else:
        logging.info("Success!")

$ python pull.py
2022-02-14 13:41:04,418 [INFO    ] (main:71) Script starting
2022-02-14 13:41:05,059 [ERROR   ] (status_requests:65) 404 Client Error: Not Found for url: https://httpstat.us/404
2022-02-14 13:41:05,059 [INFO    ] (main:73) Finished running script
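
If you do want to treat one status code differently, the HTTPError raised by .raise_for_status() carries the response in its response attribute. A rough sketch, assuming a hypothetical describe_response helper and made-up messages:

```python
import requests


def describe_response(response):
    """Hypothetical helper: turn a response into a short status message."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        # The failed response is attached to the exception
        status = err.response.status_code
        if status == 404:
            return "not found"
        return f"failed with status {status}"
    return "ok"
```

You would call it with the object returned by requests.get(), for example describe_response(requests.get(url, timeout=10)).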

Timeout #

In the earlier Timeout section I explained how to make requests with a timeout, and that a request raises an exception when it times out. The timeout parameter is the number of seconds requests waits before raising that exception.

def timeout_requests():
    logging.info("Starting a slow request")
    try:
        requests.get("https://httpstat.us/200?sleep=120000", timeout=10)
    except requests.exceptions.Timeout as err:
        logging.error(err)

$ python pull.py
2022-02-14 15:48:42,047 [INFO    ] (main:73) Script starting
2022-02-14 15:48:42,047 [INFO    ] (timeout_requests:56) Starting a slow request
2022-02-14 15:48:53,567 [ERROR   ] (timeout_requests:60) HTTPSConnectionPool(host='httpstat.us', port=443): Read timed out. (read timeout=10)
2022-02-14 15:48:53,567 [INFO    ] (main:75) Finished running script

If the remote server is slow to respond and you really need to make that request, handling this exception lets you retry in a loop, or exit and inform your user that the server took too long to respond.
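
The retry loop could look something like the sketch below. The helper name get_with_retries and its attempts and wait parameters are assumptions made for this example, not part of requests:

```python
import time

import requests


def get_with_retries(url, attempts=3, timeout=10, wait=1.0):
    """Hypothetical helper: retry a GET that times out, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            if attempt == attempts:
                # Out of retries: let the caller decide how to inform the user
                raise
            # Pause briefly before trying again
            time.sleep(wait)
```

Only timeouts are retried here; any other requests exception is raised immediately.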

RequestException #

This exception is the base class for all requests exceptions. What does that mean? It means an except block for RequestException matches whatever exception requests raises, similar to how most of Python’s built-in exceptions derive from Exception, but for requests. It should always be the last of the requests except blocks: Python runs the first except block that matches, and RequestException matches all of them, so a more specific block placed after it would never run. Use it when you want to handle some requests errors specifically and let RequestException catch the rest, or when you want to handle all requests errors the same way.

def base_exception_requests():
    logging.info("Starting multiple requests that will raise diffrent exceptions")
    try:
        requests.get("https://httpstat.us/200?sleep=120000", timeout=0.1)
    except requests.exceptions.RequestException as err:
        logging.error(err)

    try:
        req = requests.get("https://httpstat.us/404")
        req.raise_for_status()
    except requests.exceptions.RequestException as err:
        logging.error(err)

    try:
        requests.get("https://example.invalidtld/")
    except requests.exceptions.RequestException as err:
        logging.error(err)

$ python pull.py
2022-02-14 16:19:07,173 [INFO    ] (main:89) Script starting
2022-02-14 16:19:07,173 [INFO    ] (base_exception_requests:63) Starting multiple requests that will raise diffrent exceptions
2022-02-14 16:19:07,291 [ERROR   ] (base_exception_requests:67) HTTPSConnectionPool(host='httpstat.us', port=443): Max retries exceeded with url: /200?sleep=120000 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fdd75581ed0>, 'Connection to httpstat.us timed out. (connect timeout=0.1)'))
2022-02-14 16:19:08,168 [ERROR   ] (base_exception_requests:72) 404 Client Error: Not Found for url: https://httpstat.us/404
2022-02-14 16:19:08,776 [ERROR   ] (base_exception_requests:76) HTTPSConnectionPool(host='example.invalidtld', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fdd75582d70>: Failed to establish a new connection: [Errno -2] Name or service not known'))
2022-02-14 16:19:08,777 [INFO    ] (main:91) Finished running script
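
To make the ordering rule concrete, here is a sketch that puts the specific exceptions first and RequestException last. The describe_failure function and its labels are hypothetical; it re-raises an exception it is given so the except blocks can be tried without any network access:

```python
import requests


def describe_failure(exc):
    """Hypothetical helper: show which except block a requests exception lands in."""
    try:
        raise exc
    except requests.exceptions.Timeout:
        # Matches both ConnectTimeout and ReadTimeout
        return "timeout"
    except requests.exceptions.HTTPError:
        return "http error"
    except requests.exceptions.RequestException:
        # Base class goes last and catches everything else from requests
        return "other requests error"
```

If the RequestException block were moved to the top, every call would return "other requests error".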

Sessions #

Usually a script that pulls information from different sources makes requests to several websites and REST APIs. In this example I’ll make 10 GET requests to a website using a for loop (I’m using the python-template I shared in my previous post, A simple Python template).

import requests

def multiple_requests():
    logging.info("Starting to make requests")
    for _ in range(10):
        requests.get("https://httpstat.us/200")

$ python pull.py
2022-02-14 09:30:43,335 [INFO    ] (main:44) Script starting
2022-02-14 09:30:43,335 [INFO    ] (multiple_requests:35) Starting to make requests
2022-02-14 09:30:51,849 [INFO    ] (main:46) Finished running script

You can see that making 10 requests to https://httpstat.us/200 took about 8 seconds, and the logs look great, right? Well, remember that Python provides a logging module in the standard library, so Python modules can use it and integrate their logs with your program or script. Let’s enable DEBUG log messages and see what we get.
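
The -d flag in the run that follows comes from the python-template mentioned above, which isn’t shown here. In a script without that template, something like this sketch might give similar output; the helper name and the log format string are assumptions chosen to resemble the log lines in this post:

```python
import logging


def enable_debug_logs():
    """Hypothetical helper: show DEBUG messages, including those from urllib3."""
    # Assumed format, picked to look like the log lines in this post
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s [%(levelname)-8s] (%(funcName)s:%(lineno)d) %(message)s",
    )
    # urllib3 is the library requests uses for its connections
    logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

Call enable_debug_logs() once at the start of the script, before making any requests.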

$ python pull.py -d
2022-02-14 09:37:24,576 [INFO    ] (main:44) Script starting
2022-02-14 09:37:24,576 [INFO    ] (multiple_requests:35) Starting to make requests
2022-02-14 09:37:24,577 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:25,389 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:25,393 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:26,209 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:26,215 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:27,027 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:27,033 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:27,847 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:27,853 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:28,666 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:28,672 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:29,485 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:29,487 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:30,265 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:30,273 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:31,124 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:31,129 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:31,943 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:31,948 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 09:37:32,764 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 09:37:32,767 [INFO    ] (main:46) Finished running script

Great! Requests, through the urllib3 library it is built on, uses the logging module and provides debug log information. The debug logs show something interesting: every time we make a GET request, a new HTTPS connection is created before the request is made. So the question is, should we be creating a new HTTPS connection for every GET request? Certainly not. We can create one connection and reuse it to increase speed. How? Requests provides the Session() object, which is described in the Advanced Usage section.

A Session() can be used in two ways. The first is to create one and manually call .close() on it. The second, recommended way is to use a Python context manager with the with _ as var: syntax. The benefit of the context manager is that if an exception occurs at any point, it still closes the session; otherwise, if exceptions aren’t handled properly, connections might be kept open. So let’s create a Session for the requests.

def multiple_requests():
    logging.info("Starting to make requests")
    with requests.Session() as req:
        for _ in range(10):
            req.get("https://httpstat.us/200")

$ python pull.py -d
2022-02-14 10:05:07,610 [INFO    ] (main:45) Script starting
2022-02-14 10:05:07,610 [INFO    ] (multiple_requests:35) Starting to make requests
2022-02-14 10:05:07,612 [DEBUG   ] (_new_conn:971) Starting new HTTPS connection (1): httpstat.us:443
2022-02-14 10:05:09,163 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:09,466 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:09,671 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:09,876 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:10,081 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:10,285 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:10,490 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:10,695 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:10,900 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:11,104 [DEBUG   ] (_make_request:452) https://httpstat.us:443 "GET /200 HTTP/1.1" 200 None
2022-02-14 10:05:11,107 [INFO    ] (main:47) Finished running script

Now the logs show only one connection being created and then reused for the GET requests. Look at the timestamps too: the previous version of the script, which created a new connection for every GET request, took about 8 seconds, and this time, by reusing the connection, it took about 3.5 seconds, less than half the time!
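
A Session has no default timeout of its own, so the timeout still has to be passed on each call. One possible way to combine both tips is a small subclass that fills it in. This is only a sketch: TimeoutSession is a hypothetical class, not something requests provides:

```python
import requests


class TimeoutSession(requests.Session):
    """Hypothetical Session subclass that applies a default timeout."""

    def __init__(self, timeout=10):
        super().__init__()
        self.timeout = timeout

    def request(self, method, url, **kwargs):
        # Use the default only when the caller did not pass a timeout
        kwargs.setdefault("timeout", self.timeout)
        return super().request(method, url, **kwargs)
```

It can be used like a regular Session, for example with TimeoutSession(timeout=10) as req: followed by req.get(...) calls, and a timeout passed to an individual call still takes precedence.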

Final words #

These tips will certainly help: in some cases by speeding up a script, in others by handling errors gracefully instead of showing garbage.

I do believe requests should time out after 90 seconds by default, unless of course you provide a different value. I don’t think most people making requests expect their script to hang indefinitely, or want it to wait hours for a response. But now we know that adding the timeout parameter is important.