21.6. `urllib.request` — Extensible library for opening URLs¶

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

The urllib.request module defines the following functions:

urllib.request.urlopen(url, data=None[, timeout], *, cafile=None, capath=None, cadefault=False)¶

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to the server, or None if no such data is needed. data may also be an iterable object and in that case Content-Length value must be specified in the headers. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.

urllib.request module uses HTTP/1.1 and includes Connection:close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().

The cadefault parameter specifies whether to fall back to loading a default certificate store defined by the underlying OpenSSL library if the cafile and capath parameters are omitted. This will only work on some non-Windows platforms.

Warning

If neither cafile nor capath is specified, and cadefault is False, an HTTPS request will not do any verification of the server’s certificate.

For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods.

For ftp, file, and data urls and requests explicity handled by legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an email.message_from_string() instance (see Quick Reference to HTTP Headers)
getcode() – return the HTTP status code of the response.

Raises URLError on errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.

Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

urllib.request.install_opener(opener)¶: Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.

urllib.request.build_opener([handler, ...])¶

Return an OpenerDirector instance, which chains the handlers in the order given. handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.

A BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.

urllib.request.pathname2url(path)¶: Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

urllib.request.url2pathname(path)¶: Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.

urllib.request.getproxies()¶: This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named <scheme>_proxy, in a case insensitive approach, for all operating systems first, and when it cannot find it, looks for proxy information from Mac OSX System Configuration for Mac OS X and Windows Systems Registry for Windows.

The following classes are provided:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)¶

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data must be a bytes object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format.

The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. It should be encoded to bytes before being used as the data parameter. The charset parameter in Content-Type header may be used to specify the encoding. If charset parameter is not sent with the Content-Type header, the server following the HTTP 1.1 recommendation may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to use charset parameter with encoding used in Content-Type header with the Request.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib‘s default user agent string is "Python-urllib/2.6" (on Python 2.6).

An example of using Content-Type header with data argument would be sending a dictionary like {"Content-Type":" application/x-www-form-urlencoded;charset=utf-8"}

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). Its value is stored in the method attribute and is used by get_method().

Changed in version 3.3: Request.method argument is added to the Request class.

class urllib.request.OpenerDirector¶: The OpenerDirector class opens URLs via BaseHandlers chained together. It manages the chaining of handlers, and recovery from errors.

class urllib.request.BaseHandler¶: This is the base class for all registered handlers — and handles only the simple mechanics of registration.

class urllib.request.HTTPDefaultErrorHandler¶: A class which defines a default handler for HTTP error responses; all responses are turned into HTTPError exceptions.

class urllib.request.HTTPRedirectHandler¶: A class to handle redirections.

class urllib.request.HTTPCookieProcessor(cookiejar=None)¶: A class to handle HTTP Cookies.

class urllib.request.ProxyHandler(proxies=None)¶

Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables <protocol>_proxy. If no proxy environment variables are set, then in a Windows environment proxy settings are obtained from the registry’s Internet Settings section, and in a Mac OS X environment proxy information is retrieved from the OS X System Configuration Framework.

To disable autodetected proxy pass an empty dictionary.

class urllib.request.HTTPPasswordMgr¶: Keep a database of (realm, uri) -> (user, password) mappings.

class urllib.request.HTTPPasswordMgrWithDefaultRealm¶: Keep a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.

class urllib.request.AbstractBasicAuthHandler(password_mgr=None)¶: This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPBasicAuthHandler(password_mgr=None)¶: Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported. HTTPBasicAuthHandler will raise a ValueError when presented with a wrong Authentication scheme.

class urllib.request.ProxyBasicAuthHandler(password_mgr=None)¶: Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.AbstractDigestAuthHandler(password_mgr=None)¶: This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPDigestAuthHandler(password_mgr=None)¶

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported. When both Digest Authentication Handler and Basic Authentication Handler are both added, Digest Authentication is always tried first. If the Digest Authentication returns a 40x response again, it is sent to Basic Authentication handler to Handle. This Handler method will raise a ValueError when presented with an authentication scheme other than Digest or Basic.

Changed in version 3.3: Raise ValueError on unsupported Authentication Scheme.

class urllib.request.ProxyDigestAuthHandler(password_mgr=None)¶: Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to section HTTPPasswordMgr Objects for information on the interface that must be supported.

class urllib.request.HTTPHandler¶: A class to handle opening of HTTP URLs.

class urllib.request.HTTPSHandler(debuglevel=0, context=None, check_hostname=None)¶

A class to handle opening of HTTPS URLs. context and check_hostname have the same meaning as in http.client.HTTPSConnection.

Changed in version 3.2: context and check_hostname were added.

class urllib.request.FileHandler¶: Open local files.

class urllib.request.FTPHandler¶: Open FTP URLs.

class urllib.request.CacheFTPHandler¶: Open FTP URLs, keeping a cache of open FTP connections to minimize delays.

class urllib.request.UnknownHandler¶: A catch-all class to handle unknown URLs.

class urllib.request.HTTPErrorProcessor¶: Process HTTP error responses.

21.6.1. Request Objects¶

The following methods describe Request‘s public interface, and so all may be overridden in subclasses. It also defines several public attributes that can be used by clients to inspect the parsed request.

Request.full_url¶: The original URL passed to the constructor.

Request.type¶: The URI scheme.

Request.host¶: The URI authority, typically a host, but may also contain a port separated by a colon.

Request.origin_req_host¶: The original host for the request, without port.

Request.selector¶: The URI path. If the Request uses a proxy, then selector will be the full url that is passed to the proxy.

Request.data¶: The entity body for the request, or None if not specified.

Request.unverifiable¶: boolean, indicates whether the request is unverifiable as defined by RFC 2965.

Request.method¶

The HTTP request method to use. This value is used by get_method() to override the computed HTTP request method that would otherwise be returned. This attribute is initialized with the value of the method argument passed to the constructor.

New in version 3.3.

Request.get_method()¶

Return a string indicating the HTTP request method. If Request.method is not None, return its value, otherwise return 'GET' if Request.data is None, or 'POST' if it’s not. This is only meaningful for HTTP requests.

Changed in version 3.3: get_method now looks at the value of Request.method.

Request.add_header(key, val)¶: Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header.

Request.add_unredirected_header(key, header)¶: Add a header that will not be added to a redirected request.

Request.has_header(header)¶: Return whether the instance has the named header (checks both regular and unredirected).

Request.get_full_url()¶: Return the URL given in the constructor.

Request.set_proxy(host, type)¶: Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.

Request.add_data(data)¶

Set the Request data to data. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be POST rather than GET. Deprecated in 3.3, use Request.data.