B Re@s0ddlZddlZddlmZGdddeZdS)N) ProbingStatec@sneZdZdZdddZddZeddZd d Zed d Z d dZ e ddZ e ddZ e ddZdS) CharSetProbergffffff?NcCsd|_||_tt|_dS)N)_state lang_filterlogging getLogger__name__logger)selfrr /builddir/build/BUILDROOT/alt-python37-pip-20.2.4-6.el8.x86_64/opt/alt/python37/lib/python3.7/site-packages/pip/_vendor/chardet/charsetprober.py__init__'szCharSetProber.__init__cCs tj|_dS)N)r DETECTINGr)r r r r reset,szCharSetProber.resetcCsdS)Nr )r r r r charset_name/szCharSetProber.charset_namecCsdS)Nr )r bufr r r feed3szCharSetProber.feedcCs|jS)N)r)r r r r state6szCharSetProber.statecCsdS)Ngr )r r r r get_confidence:szCharSetProber.get_confidencecCstdd|}|S)Ns([-])+ )resub)rr r r filter_high_byte_only=sz#CharSetProber.filter_high_byte_onlycCs`t}td|}xH|D]@}||dd|dd}|sN|dkrNd}||qW|S)u9 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [€-ÿ] marker: everything else [^a-zA-Z€-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. s%[a-zA-Z]*[-]+[a-zA-Z]*[^a-zA-Z-]?Nr) bytearrayrfindallextendisalpha)rfilteredwordsword last_charr r r filter_international_wordsBs  z(CharSetProber.filter_international_wordscCst}d}d}x~tt|D]n}|||d}|dkr>d}n |dkrJd}|dkr|s||kr|s|||||d|d}qW|s|||d |S) a Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Frr>s