3 ÝPföã@s0ddlZddlZddlmZGdd„deƒZdS)éNé)Ú ProbingStatec@sneZdZdZddd„Zdd„Zedd„ƒZd d „Zed d „ƒZ d d„Z e dd„ƒZ e dd„ƒZ e dd„ƒZdS)Ú CharSetProbergffffffî?NcCsd|_||_tjtƒ|_dS)N)Ú_stateÚ lang_filterÚloggingZ getLoggerÚ__name__Zlogger)Úselfr©r ú#/usr/lib/python3.6/charsetprober.pyÚ__init__'szCharSetProber.__init__cCs tj|_dS)N)rZ DETECTINGr)r r r r Úreset,szCharSetProber.resetcCsdS)Nr )r r r r Ú charset_name/szCharSetProber.charset_namecCsdS)Nr )r Úbufr r r Úfeed3szCharSetProber.feedcCs|jS)N)r)r r r r Ústate6szCharSetProber.statecCsdS)Ngr )r r r r Úget_confidence:szCharSetProber.get_confidencecCstjdd|ƒ}|S)Ns([-])+ó )ÚreÚsub)rr r r Úfilter_high_byte_only=sz#CharSetProber.filter_high_byte_onlycCsbtƒ}tjd|ƒ}xJ|D]B}|j|dd…ƒ|dd…}|jƒ rP|dkrPd}|j|ƒqW|S)u9 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [€-ÿ] marker: everything else [^a-zA-Z€-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. s%[a-zA-Z]*[€-ÿ]+[a-zA-Z]*[^a-zA-Z€-ÿ]?Nró€réÿÿÿÿr)Ú bytearrayrÚfindallÚextendÚisalpha)rÚfilteredZwordsZwordZ last_charr r r Úfilter_international_wordsBs  z(CharSetProber.filter_international_wordscCs¬tƒ}d}d}x‚tt|ƒƒD]r}|||d…}|dkr>d}n |dkrJd}|dkr|jƒ r||kr†| r†|j|||…ƒ|jdƒ|d}qW|s¨|j||d …ƒ|S) aÈ Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. Also retains English alphabet and high byte characters immediately before occurrences of >. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Frró>ós