bg,pddlZddlZddlmZmZddlmZmZejdZ GddZ dS)N)OptionalUnion)LanguageFilter ProbingStates%[a-zA-Z]*[-]+[a-zA-Z]*[^a-zA-Z-]?c`eZdZdZejfdeddfdZddZede e fdZ ede e fdZ d e eefdefd Zedefd Zdefd Zed e eefdefdZed e eefdefdZed e eefdefdZdS) CharSetProbergffffff? lang_filterreturnNctj|_d|_||_t jt|_dS)NT) r DETECTING_stateactiver logging getLogger__name__logger)selfr s L/opt/cloudlinux/venv/lib64/python3.11/site-packages/chardet/charsetprober.py__init__zCharSetProber.__init__,s1",  &'11 c(tj|_dSN)rr rrs rresetzCharSetProber.reset2s", rcdSrrs r charset_namezCharSetProber.charset_name5strctrNotImplementedErrorrs rlanguagezCharSetProber.language9s!!rbyte_strctrr )rr#s rfeedzCharSetProber.feed=s!!rc|jSr)rrs rstatezCharSetProber.state@s {rcdS)Ngrrs rget_confidencezCharSetProber.get_confidenceDssrbufc2tjdd|}|S)Ns([-])+ )resub)r*s rfilter_high_byte_onlyz#CharSetProber.filter_high_byte_onlyGsf&c22 rct}t|}|D]Z}||dd|dd}|s|dkrd}||[|S)u7 We define three types of bytes: alphabet: english alphabets [a-zA-Z] international: international characters [€-ÿ] marker: everything else [^a-zA-Z€-ÿ] The input buffer can be thought to contain a series of words delimited by markers. This function works to filter all words that contain at least one international character. All contiguous sequences of markers are replaced by a single space ascii character. This filter applies to all scripts which do not use English characters. Nr,) bytearrayINTERNATIONAL_WORDS_PATTERNfindallextendisalpha)r*filteredwordsword last_chars rfilter_international_wordsz(CharSetProber.filter_international_wordsLs;; ,33C88 ' 'D OOD"I & & & RSS I$$&& !9w+>+> OOI & & & &rcvt}d}d}t|d}t|D]U\}}|dkr|dz}d}|dkr<||kr4|s2|||||dd}V|s|||d |S) a[ Returns a copy of ``buf`` that retains only the sequences of English alphabet and high byte characters that are not between <> characters. This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``. Frc>r(?I\B$U5)#34$$$$\$$$rr ) rr-typingrrenumsrrcompiler4r rrrrUs: """"""""////////(bj8 kkkkkkkkkkr