Python urlparse and urlsplit: Parse URLs Safely

Quick answer: urllib.parse.urlparse and urlsplit turn a URL into named components such as scheme, hostname, path, query, and fragment. urlsplit is generally the better choice for modern URL syntax, while urlparse retains the older params component for compatibility. Use urlencode to build query strings and validate untrusted urljoin results before making a request.

Python Pool infographic showing URL parsing components, urlsplit, query encoding, and URL joining — urllib.parse splits URLs into structured components; use urlsplit for modern URL syntax, urlencode for query data, and validate untrusted joins.

urlparse() splits a URL into structured parts such as scheme, network location, path, query string, and fragment. In modern Python, import it from urllib.parse.

The main references are Python’s urlparse documentation, the parse_qs documentation, and the urlunparse documentation.

Use urlparse() when you need to inspect or validate URL structure without manually splitting strings. It understands common URL syntax and returns a named result.

Parsing does not prove that a URL is safe or reachable. It only separates the text into components so your code can make better decisions.

This distinction matters for security-sensitive code. A parsed hostname can still be untrusted, and a parsed path can still contain content your application should reject.

Use parsing as the first step, then apply allowlists, scheme checks, path rules, or query-parameter rules that match your application.

Contents

Parse A URL

Import urlparse from urllib.parse and pass a URL string.

from urllib.parse import urlparse

url = "https://example.com/docs/page.html?lang=python#intro"
parts = urlparse(url)

print(parts.scheme)
print(parts.netloc)
print(parts.path)
print(parts.query)
print(parts.fragment)

The returned object is a ParseResult. You can access fields by name, which is clearer than relying on tuple positions.

The scheme is usually http or https. The network location contains the host and optional port.

If the input omits a scheme, urlparse() does not guess one for you. Your code must decide whether that is allowed.

Read Hostname And Port

ParseResult provides helper properties for hostname and port.

from urllib.parse import urlparse

url = "https://example.com:8443/api/items"
parts = urlparse(url)

print(parts.hostname)
print(parts.port)
print(parts.path)

hostname is normalized to the host name without the port. port is an integer when a port is present.

Use these properties instead of manually splitting netloc around a colon.

This is especially helpful for IPv6 hosts and URLs that include authentication text, where manual string splitting can break quickly.

Parse Query Parameters

urlparse() leaves the query string as text. Use parse_qs() or parse_qsl() to turn it into structured data.

from urllib.parse import parse_qs, urlparse

url = "https://example.com/search?q=python&page=2&q=code"
parts = urlparse(url)
params = parse_qs(parts.query)

print(params["q"])
print(params["page"])

parse_qs() returns a dictionary where each key maps to a list of values. That handles repeated query parameters correctly.

Use parse_qsl() when you need an ordered list of pairs instead.

Handle Relative URLs

A relative URL may not have a scheme or network location.

from urllib.parse import urlparse

for url in ["/docs/page.html", "docs/page.html", "https://example.com/docs"]:
    parts = urlparse(url)
    print(url, parts.scheme, parts.netloc, parts.path)

This matters when parsing links from HTML. Internal links often use relative paths.

Use urljoin() with a base URL when you need to resolve a relative URL into an absolute one.

Be careful when joining links from untrusted pages. A relative-looking link can still redirect your crawler or client to a different host after joining if it is actually absolute.

Rebuild A URL

You can replace a component and rebuild the URL with geturl().

from urllib.parse import urlparse

url = "https://example.com/docs?page=1"
parts = urlparse(url)
new_parts = parts._replace(path="/guide")

print(new_parts.geturl())

_replace() returns a new parse result with the changed field.

This is safer than string replacement because it targets the URL component directly.

urlparse Versus urlsplit

urlparse() returns six fields, including the rarely used params field. urlsplit() returns five fields and skips params.

For most modern web URLs, either can work. Use urlparse() when you want the traditional full parse result, and use urlsplit() when you only need scheme, network location, path, query, and fragment.

Keep one parser choice inside a helper so the rest of your code does not depend on different tuple shapes.

Validate The Parts You Need

URL validation depends on your application. A crawler, login redirect, and API client may all need different rules.

from urllib.parse import urlparse

def is_https_url(url):
    parts = urlparse(url)
    return parts.scheme == "https" and bool(parts.netloc)

print(is_https_url("https://example.com"))
print(is_https_url("/relative/path"))
print(is_https_url("ftp://example.com"))

This helper checks only for an absolute HTTPS URL. Add host allowlists, path checks, or query rules when your use case needs them.

The practical rule is to parse first, inspect named fields, and then validate the exact components your code depends on.

Avoid manual URL splitting for production logic. URL syntax has enough edge cases that the standard library parser is the safer starting point.

Also avoid using urlparse() alone as a security filter. Parse the URL, normalize the pieces you care about, and then compare those pieces against explicit rules.

When writing tests, include absolute URLs, relative paths, repeated query keys, fragments, ports, and unsupported schemes. That set catches most parsing assumptions early.

For application code, keep parsed URL handling in a small helper. That prevents each caller from inventing a different interpretation of the same URL shape.

That consistency prevents parsing bugs.

Read URL Components

The returned named tuple makes a URL’s components visible without manually splitting on punctuation. hostname is normalized for convenient comparison, while netloc preserves the authority text used by the parser.

from urllib.parse import urlsplit

value = urlsplit("https://example.com:8443/docs/page?q=python#intro")
print(value.scheme)
print(value.hostname)
print(value.port)
print(value.path)
print(value.query)
print(value.fragment)

Build Query Strings With urlencode

Query values may contain spaces, ampersands, or repeated keys. urlencode applies the correct escaping rules and doseq can encode multiple values for one key.

from urllib.parse import urlencode

query = urlencode({"q": "python urlparse", "page": 2})
print(query)
print(urlencode([("tag", "python"), ("tag", "seo")]))

Join A Base URL Carefully

urljoin follows URL syntax, so an input beginning with a slash replaces the base path and an input with a scheme can replace the host. Treat untrusted input as data and validate the result before fetching it.

from urllib.parse import urljoin, urlsplit

base = "https://example.com/docs/"
for candidate in ("guide.html", "/account", "https://other.example/"):
    result = urljoin(base, candidate)
    parts = urlsplit(result)
    allowed = parts.scheme == "https" and parts.hostname == "example.com"
    print(result, allowed)

Choose urlparse Or urlsplit

urlparse returns six components and includes params for compatibility with older URL conventions. urlsplit returns five components and is usually clearer for modern URLs. Use the result type consistently across the application.

from urllib.parse import urlparse, urlsplit

old_style = urlparse("https://example.com/a;b?x=1")
modern = urlsplit("https://example.com/a;b?x=1")
print(old_style.params)
print(modern.path)

Python’s official urllib.parse documentation covers urlparse, urlsplit, urljoin, quoting, and urlencode. Related references include hostname extraction, request JSON, and web crawling boundaries.

For related URL workflows, compare hostname extraction, request JSON, and web crawling boundaries when parsing or fetching links.

Frequently Asked Questions

How do I parse a URL in Python?

Import urlparse or urlsplit from urllib.parse and inspect the named result fields such as scheme, hostname, path, and query.

Should I use urlparse or urlsplit?

urlsplit generally matches modern URL syntax, while urlparse additionally exposes the older params component for compatibility.

How do I build a query string?

Use urllib.parse.urlencode with a mapping or sequence of pairs instead of concatenating and escaping values manually.

Is urljoin safe with untrusted input?

A user-controlled absolute URL can replace the base host, so validate the resulting scheme and hostname before requesting it.