3 ]9Y@s0ddlZddlZddlmZGdddeZdS)N) ProbingStatec@sneZdZdZdddZddZeddZd d Zed d Z d dZ e ddZ e ddZ e ddZdS) CharSetProbergffffff?NcCsd|_||_tjt|_dS)N)_state lang_filterloggingZ getLogger__name__Zlogger)selfrr #/usr/lib/python3.6/charsetprober.py__init__'szCharSetProber.__init__cCs tj|_dS)N)rZ DETECTINGr)r r r r reset,szCharSetProber.resetcCsdS)Nr )r r r r charset_name/szCharSetProber.charset_namecCsdS)Nr )r bufr r r feed3szCharSetProber.feedcCs|jS)N)r)r r r r state6szCharSetProber.statecCsdS)Ngr )r r r r get_confidence:szCharSetProber.get_confidencecCstjdd|}|S)Ns([-])+ )resub)rr r r filter_high_byte_only=sz#CharSetProber.filter_high_byte_onlycCsbt}tjd|}xJ|D]B}|j|dd|dd}|j rP|dkrPd}|j|qW|S)u9 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [€-ÿ] marker: everything else [^a-zA-Z€-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. s%[a-zA-Z]*[-]+[a-zA-Z]*[^a-zA-Z-]?Nrrr) bytearrayrfindallextendisalpha)rfilteredZwordsZwordZ last_charr r r filter_international_wordsBs  z(CharSetProber.filter_international_wordscCst}d}d}xtt|D]r}|||d}|dkr>d}n |dkrJd}|dkr|j r||kr| r|j||||jd|d}qW|s|j||d |S) a Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Frr>s