"""
The main purpose of this module is to expose LinkCollector.collect_links().
"""

import cgi
import itertools
import logging
import mimetypes
import os
from collections import OrderedDict

from pip._vendor import html5lib, requests
from pip._vendor.distlib.compat import unescape
from pip._vendor.requests.exceptions import HTTPError, RetryError, SSLError
from pip._vendor.six.moves.urllib import parse as urllib_parse
from pip._vendor.six.moves.urllib import request as urllib_request

from pip._internal.models.link import Link
from pip._internal.utils.filetypes import ARCHIVE_EXTENSIONS
from pip._internal.utils.misc import redact_auth_from_url
from pip._internal.utils.typing import MYPY_CHECK_RUNNING
from pip._internal.utils.urls import path_to_url, url_to_path
from pip._internal.vcs import is_url, vcs

if MYPY_CHECK_RUNNING:
    from typing import (
        Callable, Dict, Iterable, List, MutableMapping, Optional, Sequence,
        Tuple, Union,
    )
    import xml.etree.ElementTree

    from pip._vendor.requests import Response

    from pip._internal.models.search_scope import SearchScope
    from pip._internal.network.session import PipSession

    HTMLElement = xml.etree.ElementTree.Element
    ResponseHeaders = MutableMapping[str, str]


logger = logging.getLogger(__name__)


def _match_vcs_scheme(url):
    # type: (str) -> Optional[str]
    """Look for VCS schemes in the URL.

    Returns the matched VCS scheme, or None if there's no match.
    """
    for scheme in vcs.schemes:
        if url.lower().startswith(scheme) and url[len(scheme)] in '+:':
            return scheme
    return None


def _is_url_like_archive(url):
    # type: (str) -> bool
    """Return whether the URL looks like an archive.
    """
    filename = Link(url).filename
    for bad_ext in ARCHIVE_EXTENSIONS:
        if filename.endswith(bad_ext):
            return True
    return False


class _NotHTML(Exception):
    def __init__(self, content_type, request_desc):
        # type: (str, str) -> None
        super(_NotHTML, self).__init__(content_type, request_desc)
        self.content_type = content_type
        self.request_desc = request_desc


def _ensure_html_header(response):
    # type: (Response) -> None
    """Check the Content-Type header to ensure the response contains HTML.

    Raises `_NotHTML` if the content type is not text/html.
    """
    content_type = response.headers.get("Content-Type", "")
    if not content_type.lower().startswith("text/html"):
        raise _NotHTML(content_type, response.request.method)


class _NotHTTP(Exception):
    pass


def _ensure_html_response(url, session):
    # type: (str, PipSession) -> None
    """Send a HEAD request to the URL, and ensure the response contains HTML.

    Raises `_NotHTTP` if the URL is not available for a HEAD request, or
    `_NotHTML` if the content type is not text/html.
    """
    scheme, netloc, path, query, fragment = urllib_parse.urlsplit(url)
    if scheme not in {'http', 'https'}:
        raise _NotHTTP()

    resp = session.head(url, allow_redirects=True)
    resp.raise_for_status()

    _ensure_html_header(resp)


def _get_html_response(url, session):
    # type: (str, PipSession) -> Response
    """Access an HTML page with GET, and return the response.

    This consists of three parts:

    1. If the URL looks suspiciously like an archive, send a HEAD first to
       check the Content-Type is HTML, to avoid downloading a large file.
       Raise `_NotHTTP` if the content type cannot be determined, or
       `_NotHTML` if it is not HTML.
    2. Actually perform the request. Raise HTTP exceptions on network failures.
    3. Check the Content-Type header to make sure we got HTML, and raise
       `_NotHTML` otherwise.
    """
    if _is_url_like_archive(url):
        _ensure_html_response(url, session=session)

    logger.debug('Getting page %s', redact_auth_from_url(url))

    resp = session.get(
        url,
        headers={
            "Accept": "text/html",
            # Don't blindly serve cached index pages: after
            # "twine upload && pip install", the new release should be seen.
            "Cache-Control": "max-age=0",
        },
    )
    resp.raise_for_status()

    _ensure_html_header(resp)

    return resp


def _get_encoding_from_headers(headers):
    # type: (ResponseHeaders) -> Optional[str]
    """Determine if we have any encoding information in our headers.
    """
    if headers and "Content-Type" in headers:
        content_type, params = cgi.parse_header(headers["Content-Type"])
        if "charset" in params:
            return params['charset']
    return None


def _determine_base_url(document, page_url):
    # type: (HTMLElement, str) -> str
    """Determine the HTML document's base URL.

    This looks for a ``<base>`` tag in the HTML document. If present, its href
    attribute denotes the base URL of anchor tags in the document. If there is
    no such tag (or if it does not have a valid href attribute), the HTML
    file's URL is used as the base URL.

    :param document: An HTML document representation. The current
        implementation expects the result of ``html5lib.parse()``.
    :param page_url: The URL of the HTML document.
    """
    for base in document.findall(".//base"):
        href = base.get("href")
        if href is not None:
            return href
    return page_url


def _clean_link(url):
    # type: (str) -> str
    """Makes sure a link is fully encoded.  That is, if a ' ' shows up in
    the link, it will be rewritten to %20 (while not over-quoting
    % or other characters)."""
    # Split the URL into parts according to the general structure
    # `scheme://netloc/path;parameters?query#fragment`. An empty netloc
    # means the URL refers to a local filesystem path.
    result = urllib_parse.urlparse(url)
    if result.netloc == '':
        # On Windows the path part might contain a drive letter, which
        # should not be quoted; rely on urllib.request's round-trip to
        # do the right thing on each platform.
        path = urllib_request.pathname2url(
            urllib_request.url2pathname(result.path))
    else:
        # Unquote before quoting so nothing is double-quoted; keep '/'
        # and '@' unquoted (the latter for VCS revision strings).
        path = urllib_parse.quote(urllib_parse.unquote(result.path), safe='/@')
    return urllib_parse.urlunparse(result._replace(path=path))


def _create_link_from_element(
    anchor,    # type: HTMLElement
    page_url,  # type: str
    base_url,  # type: str
):
    # type: (...) -> Optional[Link]
    """
    Convert an anchor element in a simple repository page to a Link.
    """
    href = anchor.get("href")
    if not href:
        return None

    url = _clean_link(urllib_parse.urljoin(base_url, href))
    pyrequire = anchor.get('data-requires-python')
    pyrequire = unescape(pyrequire) if pyrequire else None

    yanked_reason = anchor.get('data-yanked')
    if yanked_reason:
        yanked_reason = unescape(yanked_reason)

    link = Link(
        url,
        comes_from=page_url,
        requires_python=pyrequire,
        yanked_reason=yanked_reason,
    )

    return link


def parse_links(page):
    # type: (HTMLPage) -> Iterable[Link]
    """
    Parse an HTML document, and yield its anchor elements as Link objects.
    """
    document = html5lib.parse(
        page.content,
        transport_encoding=page.encoding,
        namespaceHTMLElements=False,
    )

    url = page.url
    base_url = _determine_base_url(document, url)
    for anchor in document.findall(".//a"):
        link = _create_link_from_element(
            anchor,
            page_url=url,
            base_url=base_url,
        )
        if link is None:
            continue
        yield link


class HTMLPage(object):
    """Represents one page, along with its URL"""

    def __init__(self, content, encoding, url):
        # type: (bytes, Optional[str], str) -> None
        """
        :param encoding: the encoding to decode the given content.
        :param url: the URL from which the HTML was downloaded.
        """
        self.content = content
        self.encoding = encoding
        self.url = url

    def __str__(self):
        # type: () -> str
        return redact_auth_from_url(self.url)


def _handle_get_page_fail(
    link,       # type: Link
    reason,     # type: Union[str, Exception]
    meth=None,  # type: Optional[Callable[..., None]]
):
    # type: (...) -> None
    if meth is None:
        meth = logger.debug
    meth("Could not fetch URL %s: %s - skipping", link, reason)


def _make_html_page(response):
    # type: (Response) -> HTMLPage
    encoding = _get_encoding_from_headers(response.headers)
    return HTMLPage(response.content, encoding=encoding, url=response.url)


def _get_html_page(link, session=None):
    # type: (Link, Optional[PipSession]) -> Optional[HTMLPage]
    if session is None:
        raise TypeError(
            "_get_html_page() missing 1 required keyword argument: 'session'"
        )

    url = link.url.split('#', 1)[0]

    # Check for VCS schemes that do not support lookup as web pages.
    vcs_scheme = _match_vcs_scheme(url)
    if vcs_scheme:
        logger.debug('Cannot look at %s URL %s', vcs_scheme, link)
        return None

    # Tack index.html onto file:// URLs that point to directories.
    scheme, _, path, _, _, _ = urllib_parse.urlparse(url)
    if scheme == 'file' and os.path.isdir(urllib_request.url2pathname(path)):
        # Add a trailing slash if not present, so urljoin doesn't trim
        # the final segment.
        if not url.endswith('/'):
            url += '/'
        url = urllib_parse.urljoin(url, 'index.html')
        logger.debug(' file: URL is directory, getting %s', url)

    try:
        resp = _get_html_response(url, session=session)
    except _NotHTTP:
        logger.debug(
            'Skipping page %s because it looks like an archive, and cannot '
            'be checked by HEAD.', link,
        )
    except _NotHTML as exc:
        logger.debug(
            'Skipping page %s because the %s request got Content-Type: %s',
            link, exc.request_desc, exc.content_type,
        )
    except HTTPError as exc:
        _handle_get_page_fail(link, exc)
    except RetryError as exc:
        _handle_get_page_fail(link, exc)
    except SSLError as exc:
        reason = "There was a problem confirming the ssl certificate: "
        reason += str(exc)
        _handle_get_page_fail(link, reason, meth=logger.info)
    except requests.ConnectionError as exc:
        _handle_get_page_fail(link, "connection error: %s" % exc)
    except requests.Timeout:
        _handle_get_page_fail(link, "timed out")
    else:
        return _make_html_page(resp)
    return None


def group_locations(locations, expand_dir=False):
    # type: (Sequence[str], bool) -> Tuple[List[str], List[str]]
    """
    Divide a list of locations into two groups: "files" (archives) and "urls".

    :return: A pair of lists (files, urls).
    """
    files = []
    urls = []

    # Puts the url for the given file path into the appropriate list.
    def sort_path(path):
        # type: (str) -> None
        url = path_to_url(path)
        if mimetypes.guess_type(url, strict=False)[0] == 'text/html':
            urls.append(url)
        else:
            files.append(url)

    for url in locations:

        is_local_path = os.path.exists(url)
        is_file_url = url.startswith('file:')

        if is_local_path or is_file_url:
            if is_local_path:
                path = url
            else:
                path = url_to_path(url)
            if os.path.isdir(path):
                if expand_dir:
                    path = os.path.realpath(path)
                    for item in os.listdir(path):
                        sort_path(os.path.join(path, item))
                elif is_file_url:
                    urls.append(url)
                else:
                    logger.warning(
                        "Path '{0}' is ignored: "
                        "it is a directory.".format(path),
                    )
            elif os.path.isfile(path):
                sort_path(path)
            else:
                logger.warning(
                    "Url '%s' is ignored: it is neither a file "
                    "nor a directory.", url,
                )
        elif is_url(url):
            # Only add urls with a clear scheme.
            urls.append(url)
        else:
            logger.warning(
                "Url '%s' is ignored. It is either a non-existing "
                "path or lacks a specific scheme.", url,
            )

    return files, urls


class CollectedLinks(object):

    """
    Encapsulates all the Link objects collected by a call to
    LinkCollector.collect_links(), stored separately as--

    (1) links from the configured file locations,
    (2) links from the configured find_links, and
    (3) a dict mapping HTML page url to links from that page.
    """

    def __init__(
        self,
        files,       # type: List[Link]
        find_links,  # type: List[Link]
        pages,       # type: Dict[str, List[Link]]
    ):
        # type: (...) -> None
        """
        :param files: Links from file locations.
        :param find_links: Links from find_links.
        :param pages: A dict mapping HTML page url to links from that page.
        """
        self.files = files
        self.find_links = find_links
        self.pages = pages


class LinkCollector(object):

    """
    Responsible for collecting Link objects from all configured locations,
    making network requests as needed.

    The class's main method is its collect_links() method.
    """

    def __init__(
        self,
        session,       # type: PipSession
        search_scope,  # type: SearchScope
    ):
        # type: (...) -> None
        self.search_scope = search_scope
        self.session = session

    @property
    def find_links(self):
        # type: () -> List[str]
        return self.search_scope.find_links

    def _get_pages(self, locations):
        # type: (Iterable[Link]) -> Iterable[HTMLPage]
        """
        Yields (page, page_url) from the given locations, skipping
        locations that have errors.
        """
        for location in locations:
            page = _get_html_page(location, session=self.session)
            if page is None:
                continue

            yield page

    def collect_links(self, project_name):
        # type: (str) -> CollectedLinks
        """Find all available links for the given project name.

        :return: All the Link objects (unfiltered), as a CollectedLinks
          object.
        """
        search_scope = self.search_scope
        index_locations = search_scope.get_index_urls_locations(project_name)
        index_file_loc, index_url_loc = group_locations(index_locations)
        fl_file_loc, fl_url_loc = group_locations(
            self.find_links, expand_dir=True,
        )

        file_links = [
            Link(url) for url in itertools.chain(index_file_loc, fl_file_loc)
        ]

        # We trust every directly linked archive in find_links.
        find_link_links = [Link(url, '-f') for url in self.find_links]

        # We trust every url that the user has given us, whether it was
        # given via --index-url or --find-links, but drop any location
        # whose origin the session does not consider secure.
        url_locations = [
            link for link in itertools.chain(
                (Link(url) for url in index_url_loc),
                (Link(url) for url in fl_url_loc),
            )
            if self.session.is_secure_origin(link)
        ]

        # Remove duplicates while preserving order.
        url_locations = list(OrderedDict.fromkeys(url_locations))

        lines = [
            '{} location(s) to search for versions of {}:'.format(
                len(url_locations), project_name,
            ),
        ]
        for link in url_locations:
            lines.append('* {}'.format(link))
        logger.debug('\n'.join(lines))

        pages_links = {}  # type: Dict[str, List[Link]]
        for page in self._get_pages(url_locations):
            pages_links[page.url] = list(parse_links(page))

        return CollectedLinks(
            files=file_links,
            find_links=find_link_links,
            pages=pages_links,
        )
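

# ---------------------------------------------------------------------------
# Usage sketch (illustrative only; not part of the original module). It
# assumes the pip 19.3-era internal API, i.e. that ``SearchScope.create()``
# and a no-argument ``PipSession()`` exist; both are pip internals and may
# change between releases.
if __name__ == '__main__':
    from pip._internal.models.search_scope import SearchScope
    from pip._internal.network.session import PipSession

    search_scope = SearchScope.create(
        find_links=[],
        index_urls=['https://pypi.org/simple/'],
    )
    session = PipSession()
    collector = LinkCollector(session=session, search_scope=search_scope)

    # Fetch and parse every secure index page for the "pip" project, then
    # report how many candidate links each page yielded.
    collected = collector.collect_links('pip')
    for page_url, links in collected.pages.items():
        print('{}: {} links'.format(page_url, len(links)))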