Low-Level Utilities

Parsing Simple Repository Pages

pypi_simple.parse_repo_index_page(html: Union[str, bytes], from_encoding: Optional[str] = None) → pypi_simple.classes.IndexPage

New in version 0.7.0.

Parse an index/root page from a simple repository into an IndexPage. Note that the last_serial attribute will be None.

Parameters
  • html (str or bytes) – the HTML to parse

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

Return type

IndexPage

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
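The from_encoding hint is described above as usually being the charset parameter of the response's Content-Type header. As a minimal sketch of how such a hint could be obtained with only the standard library, the helper below (charset_from_content_type is a hypothetical name, not part of pypi_simple) pulls that parameter out of a header value:

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any.

    Hypothetical helper for illustration; it is not part of pypi_simple.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when the
    # header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))  # None
```

The returned string (or None) is the kind of value that would be passed as from_encoding when html is given as bytes.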

pypi_simple.parse_repo_index_response(r: requests.models.Response) → pypi_simple.classes.IndexPage

New in version 0.7.0.

Parse an index page from a requests.Response returned from a (non-streaming) request to a simple repository, and return an IndexPage.

Parameters

r (requests.Response) – the response object to parse

Return type

IndexPage

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version

pypi_simple.parse_repo_project_page(project: str, html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → pypi_simple.classes.ProjectPage

New in version 0.7.0.

Parse a project page from a simple repository into a ProjectPage. Note that the last_serial attribute will be None.

Parameters
  • project (str) – The name of the project whose page is being parsed

  • html (str or bytes) – the HTML to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the packages’ URLs (usually the URL of the page being parsed)

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

Return type

ProjectPage

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
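To illustrate what joining base_url to the front of a package URL means in practice, here is a small sketch built on urllib.parse.urljoin. The helper name resolve_href and the example.org URLs are assumptions made for the example; whether the library performs the join in exactly this way is not stated above.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_href(href: str, base_url: Optional[str] = None) -> str:
    """Join base_url to the front of href; return href unchanged without a base.

    Hypothetical helper for illustration; it is not part of pypi_simple.
    """
    return urljoin(base_url, href) if base_url is not None else href


page_url = "https://example.org/simple/foo/"

# A relative href is resolved against the page URL.
print(resolve_href("../../packages/foo-1.0.tar.gz", page_url))
# https://example.org/packages/foo-1.0.tar.gz

# An absolute href is left as it is.
print(resolve_href("https://files.example.org/foo-1.0.tar.gz", page_url))
# https://files.example.org/foo-1.0.tar.gz
```

Without a base_url, a relative href stays relative, which is why passing the URL of the page being parsed is the usual choice.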

pypi_simple.parse_repo_project_response(project: str, r: requests.models.Response) → pypi_simple.classes.ProjectPage

New in version 0.7.0.

Parse a project page from a requests.Response returned from a (non-streaming) request to a simple repository, and return a ProjectPage.

Parameters
  • project (str) – The name of the project whose page is being parsed

  • r (requests.Response) – the response object to parse

Return type

ProjectPage

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version

New in version 0.7.0.

Parse an HTML page from a simple repository and return a (metadata, links) pair.

The metadata element is a Dict[str, str]. Currently, the only key that may appear in it is "repository_version", which maps to the repository version reported by the HTML page in accordance with PEP 629. If the HTML page does not contain a repository version, this key is absent from the dict.

The links element is a list of Link objects giving the hyperlinks found in the HTML page.

Parameters
  • html (str or bytes) – the HTML to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the links’ URLs (usually the URL of the page being parsed)

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

Return type

Tuple[Dict[str, str], List[Link]]

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
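The metadata dict and the version check described above can be sketched with the standard library alone. PEP 629 declares the repository version in a meta tag named pypi:repository-version. Everything else in this sketch is an assumption made for illustration: the names MetaCollector, read_metadata and UnsupportedVersion (a stand-in for UnsupportedRepoVersionError) are hypothetical, and SUPPORTED_MAJOR is set to 1 only for the example. It is not the library's implementation.

```python
from html.parser import HTMLParser
from typing import Dict

# Assumption for this sketch: the highest major repository version understood.
SUPPORTED_MAJOR = 1


class UnsupportedVersion(Exception):
    """Stand-in for UnsupportedRepoVersionError (hypothetical)."""


class MetaCollector(HTMLParser):
    """Record the PEP 629 repository version declared by a page, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "pypi:repository-version" and attr_map.get("content"):
            self.metadata["repository_version"] = attr_map["content"]


def read_metadata(html: str) -> Dict[str, str]:
    """Return the metadata dict; raise if the major version is too new."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    version = collector.metadata.get("repository_version")
    if version is not None:
        # Only the major component decides whether the page is supported.
        major = int(version.split(".")[0])
        if major > SUPPORTED_MAJOR:
            raise UnsupportedVersion(f"repository version {version} is not supported")
    return collector.metadata


page = '<html><head><meta name="pypi:repository-version" content="1.0"></head></html>'
print(read_metadata(page))
```

A page without the meta tag yields an empty dict, matching the statement that the key is absent when no repository version is declared.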

New in version 0.7.0.

A hyperlink extracted from an HTML page

property text

The text inside the link tag, with leading & trailing whitespace removed and with any tags nested inside the link tags ignored

property url

The URL that the link points to, resolved relative to the URL of the source HTML page and relative to the page’s <base> href value, if any

property attrs

A dictionary of attributes set on the link tag (including the unmodified href attribute). Keys are converted to lowercase. Most attributes have str values, but some (referred to as “CDATA list attributes” by the HTML spec; e.g., "class") have values of type List[str] instead.
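The three properties above can be illustrated with a stdlib-only sketch. SketchLink, LinkCollector and extract_links are hypothetical names, and only "class" is treated as a list-valued attribute here; the library itself relies on Beautiful Soup and may differ in details. The sketch shows text with nested tags ignored and whitespace stripped, a URL resolved against both the page URL and a <base> href, and attribute keys in lowercase.

```python
from html.parser import HTMLParser
from typing import Dict, List, NamedTuple, Optional, Union
from urllib.parse import urljoin


class SketchLink(NamedTuple):
    """Illustrative stand-in for a link record (hypothetical)."""
    text: str
    url: str
    attrs: Dict[str, Union[str, List[str]]]


class LinkCollector(HTMLParser):
    def __init__(self, base_url: Optional[str] = None) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[SketchLink] = []
        self._href: Optional[str] = None
        self._attrs: Dict[str, Union[str, List[str]]] = {}
        self._text: List[str] = []

    def _resolve(self, href: str) -> str:
        return urljoin(self.base_url, href) if self.base_url else href

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attr_map = {name: value or "" for name, value in attrs}
        if tag == "base" and "href" in attr_map:
            # A <base> href is itself resolved against the page URL.
            self.base_url = self._resolve(attr_map["href"])
        elif tag == "a" and "href" in attr_map:
            self._href = attr_map["href"]
            self._attrs = dict(attr_map)
            if "class" in attr_map:
                # "class" is a whitespace-separated list attribute.
                self._attrs["class"] = attr_map["class"].split()
            self._text = []

    def handle_data(self, data):
        # Text of nested tags is kept; the nested tags themselves are ignored.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append(SketchLink(text, self._resolve(self._href), self._attrs))
            self._href = None


def extract_links(html: str, base_url: Optional[str] = None) -> List[SketchLink]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Note that attrs keeps the href exactly as written in the page, while url carries the resolved form.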

Streaming Parsers

New in version 0.7.0.

Parse an HTML page given as an iterable of bytes or str and yield each hyperlink encountered in the document as a Link object.

This function consumes the elements of htmlseq one at a time and yields the links found in each segment before moving on to the next one. It is intended to be faster than both parse_links() and parse_repo_links(), especially when the complete document is very large.

Warning

This function is experimental. It lacks full support for web encodings, encoding detection, and invalid HTML. It also leaves CDATA list attributes on links as strings instead of converting them to lists.

Parameters
  • htmlseq (Iterable[AnyStr]) – an iterable of either bytes or str that, when joined together, form an HTML document to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the links’ URLs (usually the URL of the page being parsed)

  • http_charset (Optional[str]) – the document’s encoding as declared by the transport layer, if any; e.g., as declared in the charset parameter of the Content-Type header of the HTTP response that returned the document

Return type

Iterator[Link]

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
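The idea of consuming an iterable of segments and yielding links as soon as each segment has been processed can be sketched as follows. This is not the library's parser: iter_links and _AnchorFeed are hypothetical names, the sketch yields plain (text, URL) pairs rather than Link objects, and it assumes the charset is known up front (UTF-8 by default). It relies on html.parser.HTMLParser buffering incomplete tags between feed() calls and on an incremental decoder to cope with multi-byte characters split across chunks.

```python
import codecs
from collections import deque
from html.parser import HTMLParser
from typing import Deque, Iterable, Iterator, List, Optional, Tuple
from urllib.parse import urljoin


class _AnchorFeed(HTMLParser):
    """Queue (text, href) pairs for <a> tags completed so far."""

    def __init__(self) -> None:
        super().__init__()
        self.done: Deque[Tuple[str, str]] = deque()
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.done.append(("".join(self._text).strip(), self._href))
            self._href = None


def iter_links(
    chunks: Iterable[bytes],
    base_url: Optional[str] = None,
    charset: str = "utf-8",
) -> Iterator[Tuple[str, str]]:
    """Yield (text, url) pairs chunk by chunk (illustrative sketch only)."""
    decoder = codecs.getincrementaldecoder(charset)(errors="replace")
    parser = _AnchorFeed()

    def drain() -> Iterator[Tuple[str, str]]:
        while parser.done:
            text, href = parser.done.popleft()
            yield text, (urljoin(base_url, href) if base_url else href)

    for chunk in chunks:
        # Bytes that end mid-character stay in the decoder until the next chunk.
        parser.feed(decoder.decode(chunk))
        yield from drain()
    parser.feed(decoder.decode(b"", final=True))
    parser.close()
    yield from drain()
```

Because links are yielded after each chunk, a caller can stop early without the rest of the document ever being read from the underlying iterable.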

New in version 0.7.0.

Parse an HTML page from a streaming requests.Response object and yield each hyperlink encountered in the document as a Link object.

See parse_links_stream() for more information.

Parameters
  • r (requests.Response) – the streaming response object to parse

  • chunk_size (int) – how many bytes to read from the response at a time

Return type

Iterator[Link]

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version

Deprecated Functions

pypi_simple.parse_simple_index(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → Iterator[Tuple[str, str]]

Parse a simple repository’s index page and return a generator of (project name, project URL) pairs

Deprecated since version 0.7.0: Use parse_repo_index_page() or parse_links_stream() instead

Parameters
  • html (str or bytes) – the HTML to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the URLs returned (usually the URL of the page being parsed)

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

Return type

Iterator[Tuple[str, str]]

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version

pypi_simple.parse_project_page(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None, project_hint: Optional[str] = None) → List[pypi_simple.classes.DistributionPackage]

Parse a project page from a simple repository and return a list of DistributionPackage objects

Deprecated since version 0.7.0: Use parse_repo_project_page() instead

Parameters
  • html (str or bytes) – the HTML to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the packages’ URLs (usually the URL of the page being parsed)

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

  • project_hint (Optional[str]) – The name of the project whose page is being parsed; used to disambiguate the parsing of certain filenames

Return type

List[DistributionPackage]

Raises

UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version

Parse an HTML page and return a generator of links, where each link is represented as a triple of link text, link URL, and a dict of link tag attributes (including the unmodified href attribute).

Link text has all leading & trailing whitespace removed.

Keys in the attributes dict are converted to lowercase.

Deprecated since version 0.7.0: Use parse_repo_links() instead

Parameters
  • html (str or bytes) – the HTML to parse

  • base_url (Optional[str]) – an optional URL to join to the front of the URLs returned (usually the URL of the page being parsed)

  • from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)

Return type

Iterator[Tuple[str, str, Dict[str, Union[str, List[str]]]]]

Parsing Filenames

pypi_simple.parse_filename(filename: str, project_hint: Optional[str] = None) → Union[Tuple[str, str, str], Tuple[None, None, None]]

Given the filename of a distribution package, return a triple of the project name, project version, and package type. The name and version are spelled the same as they appear in the filename; no normalization is performed.

The package type may be any of the following strings:

  • 'dumb'

  • 'egg'

  • 'msi'

  • 'rpm'

  • 'sdist'

  • 'wheel'

  • 'wininst'

If the filename cannot be parsed, (None, None, None) is returned.

Note that some filenames (e.g., 1-2-3.tar.gz) may be ambiguous as to which part is the project name and which is the version. To resolve the ambiguity, the expected value for the project name (modulo normalization) can be supplied as the project_hint argument to the function. If the filename can be parsed with the given string in the role of the project name, the results of that parse are returned; otherwise, the function falls back to breaking the project & version apart at an unspecified point.

Parameters
  • filename (str) – The package filename to parse

  • project_hint (Optional[str]) – Optionally, the expected value for the project name (usually the name of the project page on which the filename was found). The name does not need to be normalized.

Return type

Union[Tuple[str, str, str], Tuple[None, None, None]]
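The ambiguity of a name such as 1-2-3.tar.gz, and the way a project hint resolves it, can be sketched as follows. This is an illustration, not the library's algorithm: normalize and split_stem are hypothetical helpers, the input is the filename with its extension already removed, names are compared after PEP 503 normalization, and the fallback (splitting at the last hyphen) is an arbitrary choice made for this sketch, since the documented fallback point is unspecified.

```python
import re
from typing import Optional, Tuple


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def split_stem(stem: str, project_hint: Optional[str] = None) -> Tuple[str, str]:
    """Split 'name-version' into (name, version), preferring the hinted name.

    Hypothetical helper for illustration; it is not part of pypi_simple.
    """
    positions = [i for i, ch in enumerate(stem) if ch == "-"]
    if not positions:
        raise ValueError(f"no hyphen to split on in {stem!r}")
    if project_hint is not None:
        wanted = normalize(project_hint)
        for i in positions:
            # Accept the first split whose name part matches the hint.
            if normalize(stem[:i]) == wanted:
                return stem[:i], stem[i + 1:]
    # Fallback chosen for this sketch only: split at the last hyphen.
    i = positions[-1]
    return stem[:i], stem[i + 1:]


print(split_stem("1-2-3", project_hint="1"))    # ('1', '2-3')
print(split_stem("1-2-3", project_hint="1_2"))  # ('1-2', '3')
```

In both calls the returned name is spelled as it appears in the filename, even when the hint uses a different but equivalent spelling.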