Low-Level Utilities
Parsing Simple Repository Pages
-
pypi_simple.parse_repo_index_page(html: Union[str, bytes], from_encoding: Optional[str] = None) → pypi_simple.classes.IndexPage
New in version 0.7.0.
Parse an index/root page from a simple repository into an IndexPage. Note that the last_serial attribute will be None.
- Parameters
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
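The Raises clause above rejects a page only when its major version component is greater than the supported one. A minimal sketch of that comparison follows; it is not the library's implementation, and `UnsupportedVersionSketch`, `check_repo_version`, and the `SUPPORTED_VERSION` value are hypothetical names and values chosen for illustration.

```python
class UnsupportedVersionSketch(Exception):
    """Hypothetical stand-in for UnsupportedRepoVersionError."""


# Assumed value for illustration; the version the library supports may differ.
SUPPORTED_VERSION = "1.0"


def check_repo_version(declared: str, supported: str = SUPPORTED_VERSION) -> None:
    """Raise if ``declared`` has a greater major component than ``supported``."""
    declared_major = int(declared.split(".")[0])
    supported_major = int(supported.split(".")[0])
    if declared_major > supported_major:
        raise UnsupportedVersionSketch(
            f"repository version {declared} is newer than supported {supported}"
        )


check_repo_version("1.0")  # same major component: no exception
check_repo_version("1.7")  # only the minor component is greater: no exception

try:
    check_repo_version("2.0")  # greater major component
    rejected = False
except UnsupportedVersionSketch:
    rejected = True
```

Only the major component takes part in the comparison, so a page declaring a newer minor version is still parsed.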
-
pypi_simple.parse_repo_index_response(r: requests.models.Response) → pypi_simple.classes.IndexPage
New in version 0.7.0.
Parse an index page from a requests.Response returned from a (non-streaming) request to a simple repository, and return an IndexPage.
- Parameters
r (requests.Response) – the response object to parse
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
-
pypi_simple.parse_repo_project_page(project: str, html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → pypi_simple.classes.ProjectPage
New in version 0.7.0.
Parse a project page from a simple repository into a ProjectPage. Note that the last_serial attribute will be None.
- Parameters
project (str) – The name of the project whose page is being parsed
base_url (Optional[str]) – an optional URL to join to the front of the packages’ URLs (usually the URL of the page being parsed)
from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
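To illustrate what supplying base_url accomplishes, the sketch below resolves a few link targets against a page URL with the standard library's `urllib.parse.urljoin`. It assumes ordinary relative-URL resolution and is not confirmed to match the library's internals; the host names and filenames are made up.

```python
from urllib.parse import urljoin

# Hypothetical URL of the project page being parsed.
base_url = "https://pypi.example/simple/sample/"

# Link targets as they might appear in href attributes on that page.
hrefs = [
    "sample-1.0.tar.gz",                           # relative to the page
    "../../packages/sample-1.0-py3-none-any.whl",  # relative, climbing two levels
    "https://files.example/sample-1.1.tar.gz",     # already absolute
]

# Relative targets are resolved against base_url; absolute ones pass through.
resolved = [urljoin(base_url, href) for href in hrefs]

for url in resolved:
    print(url)
```

Without a base URL, relative targets would stay relative, which is why passing the URL of the page being parsed is the usual choice.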
-
pypi_simple.parse_repo_project_response(project: str, r: requests.models.Response) → pypi_simple.classes.ProjectPage
New in version 0.7.0.
Parse a project page from a requests.Response returned from a (non-streaming) request to a simple repository, and return a ProjectPage.
- Parameters
project (str) – The name of the project whose page is being parsed
r (requests.Response) – the response object to parse
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
-
pypi_simple.parse_repo_links(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → Tuple[Dict[str, str], List[pypi_simple.classes.Link]]
New in version 0.7.0.
Parse an HTML page from a simple repository and return a (metadata, links) pair.
The metadata element is a Dict[str, str]. Currently, the only key that may appear in it is "repository_version", which maps to the repository version reported by the HTML page in accordance with PEP 629. If the HTML page does not contain a repository version, this key is absent from the dict.
The links element is a list of Link objects giving the hyperlinks found in the HTML page.
- Parameters
base_url (Optional[str]) – an optional URL to join to the front of the links’ URLs (usually the URL of the page being parsed)
from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
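The shape of the (metadata, links) pair can be illustrated with a standard-library sketch. This is not the library's parser (which, per the parameter description, uses Beautiful Soup): `parse_page_sketch` and `_PageParser` are hypothetical names, links are reduced to plain (text, url) tuples rather than Link objects, and the `pypi:repository-version` meta tag name is taken from PEP 629.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collect the PEP 629 version and (text, url) pairs for <a> tags."""

    def __init__(self, base_url: Optional[str]) -> None:
        super().__init__()
        self.base_url = base_url
        self.metadata: Dict[str, str] = {}
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # href of the <a> being read, if any
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "pypi:repository-version":
            self.metadata["repository_version"] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href") is not None:
            self._href, self._text = attrs["href"], []

    def handle_data(self, data):
        # Text inside nested tags arrives here too, so the tags are ignored.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href) if self.base_url else self._href
            self.links.append(("".join(self._text).strip(), url))
            self._href = None


def parse_page_sketch(html: str, base_url: Optional[str] = None):
    parser = _PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.metadata, parser.links


PAGE = """<html><head>
<meta name="pypi:repository-version" content="1.0">
</head><body>
<a href="sample-1.0.tar.gz"> sample-1.0.tar.gz </a>
<a href="sample-1.1.tar.gz"><b>sample</b>-1.1.tar.gz</a>
</body></html>"""

metadata, links = parse_page_sketch(PAGE, "https://pypi.example/simple/sample/")
no_version_metadata, _ = parse_page_sketch('<a href="x-1.0.tar.gz">x</a>')
```

A page without the meta tag yields an empty metadata dict, mirroring the "key is absent" behaviour described above.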
-
class pypi_simple.Link(text: str, url: str, attrs: Dict[str, Union[str, List[str]]])
New in version 0.7.0.
A hyperlink extracted from an HTML page
-
property text
The text inside the link tag, with leading & trailing whitespace removed and with any tags nested inside the link tags ignored
-
property url
The URL that the link points to, resolved relative to the URL of the source HTML page and relative to the page’s <base> href value, if any
-
property attrs
The attributes of the link tag, as a Dict[str, Union[str, List[str]]]
Streaming Parsers
-
pypi_simple.parse_links_stream(htmlseq: Iterable[AnyStr], base_url: Optional[str] = None, http_charset: Optional[str] = None) → Iterator[pypi_simple.classes.Link]
New in version 0.7.0.
Parse an HTML page given as an iterable of bytes or str and yield each hyperlink encountered in the document as a Link object.
This function consumes the elements of htmlseq one at a time and yields the links found in each segment before moving on to the next one. It is intended to be faster than both parse_links() and parse_repo_links(), especially when the complete document is very large.
Warning
This function is rather experimental. It does not have full support for web encodings, encoding detection, or handling invalid HTML. It also leaves CDATA list attributes on links as strings instead of converting them to lists.
- Parameters
htmlseq (Iterable[AnyStr]) – an iterable of either bytes or str that, when joined together, form an HTML document to parse
base_url (Optional[str]) – an optional URL to join to the front of the links’ URLs (usually the URL of the page being parsed)
http_charset (Optional[str]) – the document’s encoding as declared by the transport layer, if any; e.g., as declared in the charset parameter of the Content-Type header of the HTTP response that returned the document
- Return type
Iterator[Link]
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
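The "consume one segment, yield its links, move on" behaviour can be sketched with the standard library's incremental `html.parser.HTMLParser`. This is an illustration of the technique only, not the library's code: `iter_links_sketch` and `_AnchorCollector` are hypothetical names, only str chunks are handled (no charset logic), and links are plain (text, url) tuples.

```python
from html.parser import HTMLParser
from typing import Iterable, Iterator, List, Optional, Tuple
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Queue a (text, url) pair each time an <a> element is completed."""

    def __init__(self, base_url: Optional[str]) -> None:
        super().__init__()
        self.base_url = base_url
        self.ready: List[Tuple[str, str]] = []  # completed, not yet yielded
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href) if self.base_url else self._href
            self.ready.append(("".join(self._text).strip(), url))
            self._href = None


def iter_links_sketch(
    chunks: Iterable[str], base_url: Optional[str] = None
) -> Iterator[Tuple[str, str]]:
    parser = _AnchorCollector(base_url)
    for chunk in chunks:
        parser.feed(chunk)  # markup cut off mid-tag stays buffered
        while parser.ready:
            yield parser.ready.pop(0)
    parser.close()
    while parser.ready:
        yield parser.ready.pop(0)


# Chunk boundaries deliberately fall inside tags.
CHUNKS = ['<a href="a-1.0.tar', '.gz">a-1.0</a><a hr', 'ef="b-2.0.tar.gz">b-2.0</a>']
consumed: List[str] = []


def recording_chunks() -> Iterator[str]:
    for chunk in CHUNKS:
        consumed.append(chunk)  # record how far the input has been read
        yield chunk


gen = iter_links_sketch(recording_chunks(), "https://pypi.example/simple/demo/")
first = next(gen)
chunks_read_for_first = len(consumed)
rest = list(gen)
```

In this run the first link is produced after only two of the three chunks have been read, which is the property that makes streaming attractive for very large index pages.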
-
pypi_simple.parse_links_stream_response(r: requests.models.Response, chunk_size: int = 65535) → Iterator[pypi_simple.classes.Link]
New in version 0.7.0.
Parse an HTML page from a streaming requests.Response object and yield each hyperlink encountered in the document as a Link object.
See parse_links_stream() for more information.
- Parameters
r (requests.Response) – the streaming response object to parse
chunk_size (int) – how many bytes to read from the response at a time
- Return type
Iterator[Link]
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
Deprecated Functions
-
pypi_simple.parse_simple_index(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → Iterator[Tuple[str, str]]
Parse a simple repository’s index page and return a generator of (project name, project URL) pairs
Deprecated since version 0.7.0: Use parse_repo_index_page() or parse_links_stream() instead
- Parameters
base_url (Optional[str]) – an optional URL to join to the front of the URLs returned (usually the URL of the page being parsed)
from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)
- Return type
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
-
pypi_simple.parse_project_page(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None, project_hint: Optional[str] = None) → List[pypi_simple.classes.DistributionPackage]
Parse a project page from a simple repository and return a list of DistributionPackage objects
Deprecated since version 0.7.0: Use parse_repo_project_page() instead
- Parameters
base_url (Optional[str]) – an optional URL to join to the front of the packages’ URLs (usually the URL of the page being parsed)
from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)
project_hint (Optional[str]) – The name of the project whose page is being parsed; used to disambiguate the parsing of certain filenames
- Return type
List[DistributionPackage]
- Raises
UnsupportedRepoVersionError – if the repository version has a greater major component than the supported repository version
-
pypi_simple.parse_links(html: Union[str, bytes], base_url: Optional[str] = None, from_encoding: Optional[str] = None) → Iterator[Tuple[str, str, Dict[str, Union[str, List[str]]]]]
Parse an HTML page and return a generator of links, where each link is represented as a triple of link text, link URL, and a dict of link tag attributes (including the unmodified href attribute).
Link text has all leading & trailing whitespace removed.
Keys in the attributes dict are converted to lowercase.
Deprecated since version 0.7.0: Use parse_repo_links() instead
- Parameters
base_url (Optional[str]) – an optional URL to join to the front of the URLs returned (usually the URL of the page being parsed)
from_encoding (Optional[str]) – an optional hint to Beautiful Soup as to the encoding of html when it is bytes (usually the charset parameter of the response’s Content-Type header)
- Return type
Parsing Filenames
-
pypi_simple.parse_filename(filename: str, project_hint: Optional[str] = None) → Union[Tuple[str, str, str], Tuple[None, None, None]]
Given the filename of a distribution package, returns a triple of the project name, project version, and package type. The name and version are spelled the same as they appear in the filename; no normalization is performed.
The package type may be any of the following strings:
'dumb'
'egg'
'msi'
'rpm'
'sdist'
'wheel'
'wininst'
If the filename cannot be parsed, (None, None, None) is returned.
Note that some filenames (e.g., 1-2-3.tar.gz) may be ambiguous as to which part is the project name and which is the version. In order to resolve the ambiguity, the expected value for the project name (modulo normalization) can be supplied as the project_hint argument to the function. If the filename can be parsed with the given string in the role of the project name, the results of that parse will be returned; otherwise, the function will fall back to breaking the project & version apart at an unspecified point.
- Parameters
- Return type
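The ambiguity described above, and how a project hint resolves it, can be sketched for .tar.gz sdists alone. This is not the library's algorithm: `split_sdist_sketch`, `candidate_splits`, and `normalize` are hypothetical helpers, the normalization follows PEP 503 as an assumption about what "modulo normalization" means, and the fallback split point is an arbitrary choice, since the documentation leaves it unspecified.

```python
import re
from typing import List, Optional, Tuple


def normalize(name: str) -> str:
    """Normalize a project name in the manner of PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def candidate_splits(stem: str) -> List[Tuple[str, str]]:
    """Every way to break ``stem`` into (name, version) at a hyphen."""
    parts = stem.split("-")
    return [
        ("-".join(parts[:i]), "-".join(parts[i:])) for i in range(1, len(parts))
    ]


def split_sdist_sketch(
    filename: str, project_hint: Optional[str] = None
) -> Optional[Tuple[str, str]]:
    if not filename.endswith(".tar.gz"):
        return None
    splits = candidate_splits(filename[: -len(".tar.gz")])
    if not splits:
        return None
    if project_hint is not None:
        for name, version in splits:
            # Compare normalized forms, but return the filename's spelling.
            if normalize(name) == normalize(project_hint):
                return name, version
    # Fallback chosen arbitrarily for this sketch: split at the last hyphen.
    return splits[-1]


options = candidate_splits("1-2-3")
hinted_short = split_sdist_sketch("1-2-3.tar.gz", project_hint="1")
hinted_long = split_sdist_sketch("1-2-3.tar.gz", project_hint="1.2")
unhinted = split_sdist_sketch("1-2-3.tar.gz")
```

Here `1-2-3.tar.gz` has two candidate readings, and the hint selects between them; the hint `1.2` matches the name `1-2` because both normalize to the same string, while the returned name keeps the spelling found in the filename.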