e f*@s dZddlZddlZddlmZddlmZddlmZddl m Z ddl m Z ddl m Z ed Zeed BZed ZeeBZeed Zeed ZeedBed ZeeBZeedBZeeBZeedZddZGdddZGdddeZGdddeZGdddeZGdddeZ GdddeZ!Gdd d eZ"Gd!d"d"eZ#Gd#d$d$eZ$Gd%d&d&eZ%Gd'd(d(eZ&Gd)d*d*e&Z'Gd+d,d,eZ(Gd-d.d.eZ)Gd/d0d0eZ*Gd1d2d2eZ+Gd3d4d4eZ,Gd5d6d6eZ-Gd7d8d8eZ.Gd9d:d:eZ/Gd;d<d<eZ0Gd=d>d>eZ1Gd?d@d@eZ2GdAdBdBeZ3GdCdDdDeZ4GdEdFdFeZ5GdGdHdHeZ6GdIdJdJeZ7GdKdLdLe Z8GdMdNdNeZ9GdOdPdPeZ:GdQdRdReZ;GdSdTdTeZ<GdUdVdVe<Z=GdWdXdXeZ>GdYdZdZeZ?Gd[d\d\eZ@Gd]d^d^eZAGd_d`d`eZBGdadbdbeBZCGdcddddeBZDGdedfdfeZEGdgdhdheZFGdidjdjeZGGdkdldleHZIGdmdndneIZJGdodpdpeIZKGdqdrdreJZLeKd dsZMeKdtduZNeKdvdwZOejPdxjQdyjRejSZTejPdzjQdyjRejUd{d|jUd}d~jVZWejPdjXZYejPdzjQdyjRejUd{d|jUd}d~jVZZejPdzjQdyjRejUd{d|jUd}d~jVZ[ejPdzjQdyjRejUd{d|jUd}d~jVZ\ddZ]ddZ^ddZ_ddZ`ddZaddZbddZcddZdddZeddZfddZgddZhddZiddZjddZkddZlddZmddZnddZoddZpddZqddZrddZsddZtddZuddZvddZwddZxddZyddZzddZ{ddZ|ddZ}ddZ~ddZddZddZddZddZddZddZddZddZddZddZddZddZddZddZddZddZdS)alHeader value parser implementing various email-related RFC parsing rules. The parsing methods defined in this module implement various email related parsing rules. Principal among them is RFC 5322, which is the followon to RFC 2822 and primarily a clarification of the former. It also implements RFC 2047 encoded word decoding. RFC 5322 goes to considerable trouble to maintain backward compatibility with RFC 822 in the parse phase, while cleaning up the structure on the generation phase. This parser supports correct RFC 5322 generation by tagging white space as folding white space only when folding is allowed in the non-obsolete rule sets. Actually, the parser is even more generous when accepting input than RFC 5322 mandates, following the spirit of Postel's Law, which RFC 5322 encourages. Where possible deviations from the standard are annotated on the 'defects' attribute of tokens that deviate. The general structure of the parser follows RFC 5322, and uses its terminology where there is a direct correspondence. 
Where the implementation requires a somewhat different structure than that used by the formal grammar, new terms that mimic the closest existing terms are used. Thus, it really helps to have a copy of RFC 5322 handy when studying this code.

Input to the parser is a string that has already been unfolded according to RFC 5322 rules. According to the RFC this unfolding is the very first step, and this parser leaves the unfolding step to a higher level message parser, which will have already detected the line breaks that need unfolding while determining the beginning and end of each header.

The output of the parser is a TokenList object, which is a list subclass. A TokenList is a recursive data structure. The terminal nodes of the structure are Terminal objects, which are subclasses of str. These do not correspond directly to terminal objects in the formal grammar, but are instead more practical higher level combinations of true terminals.

All TokenList and Terminal objects have a 'value' attribute, which produces the semantically meaningful value of that part of the parse subtree. The value of all whitespace tokens (no matter how many sub-tokens they may contain) is a single space, as per the RFC rules. This includes 'CFWS', which is herein included in the general class of whitespace tokens. There is one exception to the rule that whitespace tokens are collapsed into single spaces in values: in the value of a 'bare-quoted-string' (a quoted-string with no leading or trailing whitespace), any whitespace that appeared between the quotation marks is preserved in the returned value. Note that in all Terminal strings quoted pairs are turned into their unquoted values.

All TokenList and Terminal objects also have a string value, which attempts to be a "canonical" representation of the RFC-compliant form of the substring that produced the parsed subtree, including minimal use of quoted pair quoting. Whitespace runs are not collapsed.
Comment tokens also have a 'content' attribute providing the string found between the parens (including any nested comments) with whitespace preserved.

All TokenList and Terminal objects have a 'defects' attribute, which is a possibly empty list of all the defects found while creating the token. Defects may appear on any token in the tree, and a composite list of all defects in the subtree is available through the 'all_defects' attribute of any node. (For Terminal nodes x.defects == x.all_defects.)

Each object in a parse tree is called a 'token', and each has a 'token_type' attribute that gives the name from the RFC 5322 grammar that it represents. Not all RFC 5322 nodes are produced, and there is one non-RFC 5322 node that may be produced: 'ptext'. A 'ptext' is a string of printable ascii characters. It is returned in place of lists of (ctext/quoted-pair) and (qtext/quoted-pair).

XXX: provide complete list of token types.
N) hexdigits) OrderedDict) itemgetter)_encoded_words)errors)utilsz (z ()<>@,:;.\"[].z."(z/?=z*'%%cCs*dt|jddjdddS)N"\z\\z\")strreplace)valuer?/opt/alt/python34/lib64/python3.4/email/_header_value_parser.py quote_string`src@s[eZdZddZddZddZddZd d Zd d d Zd S)_FoldedcCsC||_||_d|_d|_d|_g|_g|_dS)NrT)maxlenpolicylastlen stickyspace firstlinedonecurrent)selfrrrrr__init__is      z_Folded.__init__cCsC|jj|j|jj|jj|jjd|_dS)Nr)rextendrappendrlinesepclearr)rrrrnewliners z_Folded.newlinecCs|jr|jndS)N)rr!)rrrrfinalizexs z_Folded.finalizecCsdj|jS)N)joinr)rrrr__str__|sz_Folded.__str__cCs|jj|dS)N)rr)rstokenrrrrsz_Folded.appendNcCs|dkrt|}nt|}|jdk r7t|j}|j|||jkr|jj|j|j|7_|jj||j|7_d|_d|_dS|jr|j }|dk r|jt|7_|t|7}n|j |dS|r|d|jkr|j|}d|koM|knr||}|jj|jd||j|d|_|}n|j |jj|j|jj||||_d|_d|_dS|js|j n|jj|j|jj|d|_d|_dS|j||jkrp|jj||j|7_dS||jkr|j |jj|||_dSdS)NFTr) r lenrrrrrrhas_fwspop_leading_fws_foldr!)rtokenr&lZstickyspace_lenwsZmarginZtrimrrrappend_if_fitssf                     z_Folded.append_if_fits) __name__ __module__ __qualname__rr!r"r%rr/rrrrrgs     rcs-eZdZdZfddZddZfddZedd 
Zed d Z ed d Z ddZ ddZ ddZ eddZddZeddZddZddZddZd d!Zd"d#d$Zd"d%d&Zd"d'd(ZS)) TokenListNcs tj||g|_dS)N)superrdefects)rargskw) __class__rrrszTokenList.__init__cCsdjdd|DS)Nr#css|]}t|VqdS)N)r ).0xrrr sz$TokenList.__str__..)r$)rrrrr%szTokenList.__str__csdj|jjtjS)Nz{}({}))formatr8r0r4__repr__)r)r8rrr=szTokenList.__repr__cCsdjdd|DS)Nr#css!|]}|jr|jVqdS)N)r)r9r:rrrr;sz"TokenList.value..)r$)rrrrrszTokenList.valuecCstdd|D|jS)Ncss|]}|jVqdS)N) all_defects)r9r:rrrr;sz(TokenList.all_defects..)sumr5)rrrrr>szTokenList.all_defectsccs|j}g}x|D]}|jre|ret|dkrJ|dn ||V|jqen|j}|j||r||V|g}qqW|rt|dkr|dn ||VndS)Nr'r)r8startswith_fwsr(r pop_trailing_wsr)rklassthisr,Zend_wsrrrpartss   '   zTokenList.partscCs|djS)Nr)r@)rrrrr@ szTokenList.startswith_fwscCs.|djdkr |jdS|djS)Nrfws) token_typepopr*)rrrrr* s zTokenList.pop_leading_fwscCs.|djdkr |jdS|djS)Nr'cfwsrIrI)rFrGrA)rrrrrAs zTokenList.pop_trailing_wscCs"x|D]}|jrdSqWdS)NTF)r))rpartrrrr)s  zTokenList.has_fwscCs|djS)Nr)has_leading_comment)rrrrrKszTokenList.has_leading_commentcCs+g}x|D]}|j|jq W|S)N)rcomments)rrLr,rrrrL!s zTokenList.commentscCsE|jptd}t||}|j||jt|S)Nz+inf)Zmax_line_lengthfloatrr+r"r )rrrfoldedrrrfold(s   zTokenList.foldcCsg}|j}|r(|j|n|djdkrJ|jdnd}|jtjt|||j|dj|S)Nr'rEr#rIrI)r*rrFrG_ewencoder r$)rcharsetresr.Ztrailerrrras_encoded_word0s ( zTokenList.as_encoded_wordcCs=g}x'|D]}|j|j||q Wdj|S)Nr#)r cte_encoder$)rrRrrSrJrrrrU<s zTokenList.cte_encodec Cs;x4|jD])}t|}t|}yt|jdWn^tk rtdd|jDrtd}nd}|j||j}t|}YnX|j ||rq n|j }|dk rt|j d|_ |j |rq qn|j r|j|q n|j||jq WdS)Nzus-asciicss!|]}t|tjVqdS)N) isinstancerUndecodableBytesDefect)r9r:rrrr;Isz"TokenList._fold..z unknown-8bitzutf-8r)rDr r(rQUnicodeEncodeErroranyr>rUrr/r*rGrr)r+rr!)rrNrJtstrtlenrRr.rrrr+Bs0           zTokenList._foldr#cCs#tdj|jdddS)N indentr#)printr$_pp)rr]rrrpprintdszTokenList.pprintcCsdj|jddS)Nr\r]r#)r$r_)rr]rrrppstrgszTokenList.ppstrccsdj||jj|jVxH|D]@}t|dsN|dj|Vq$|j|dDdHq$W|jrdj|j}nd}dj||VdS)Nz{}{}/{}(r_z* !! 
invalid element in token list: {!r}z z Defects: {}r#z{}){})r<r8r0rFhasattrr_r5)rr]r,Zextrarrrr_js      z TokenList._pp)r0r1r2rFrr%r=propertyrr>rDr@r*rAr)rKrLrOrTrUr+r`rar_rr)r8rr3s(  +       "r3c@s4eZdZeddZeddZdS)WhiteSpaceTokenListcCsdS)N r)rrrrr~szWhiteSpaceTokenList.valuecCsdd|DS)NcSs(g|]}|jdkr|jqS)comment)rFcontent)r9r:rrr s z0WhiteSpaceTokenList.comments..r)rrrrrLszWhiteSpaceTokenList.commentsN)r0r1r2rcrrLrrrrrd|s rdc@s.eZdZdZddZddZdS)UnstructuredTokenList unstructuredc Cs)d}x|jD]}t|}d}yt|jdWntk rUtdd|jDrtd}nd}|dk r<tdj|j|d|gj |}t dd|jd|D}t|} t | } || |j kr<|j|d=|j | || |_wq<n|j |}d }YnX|j||r|rt |jd }qqn|s|r|j|qn|j} | dk rt| |_|j|rqqn|jr|j|qn|j ||jd}qWdS) NFzus-asciicss!|]}t|tjVqdS)N)rVrrW)r9r:rrrr;sz.UnstructuredTokenList._fold..z unknown-8bitzutf-8r#css|]}t|VqdS)N)r()r9r:rrrr;sTr')rDr rQrXrYr>get_unstructuredr$rrTr?r(rrrr/Z _fold_as_ewr*rr)rOr!) rrNlast_ewrJrZis_ewrRchunkZ oldlastlenschunklchunkr.rrrr+sT     /&               zUnstructuredTokenList._foldc Csg}d}x|D]}t|}y|jd|j|Wqtk r|dkr|j|j||t|}n9tdj||d|g}|j|jYqXqWdj|S)Nzus-asciir#) r rQrrXrUr(rkr$rT)rrRrrSrlrJsparttlrrrrUs     &z UnstructuredTokenList.cte_encodeN)r0r1r2rFr+rUrrrrris  4ric@s.eZdZdZddZddZdS)Phrasephrasec CsXd}xK|jD]@}t|}t|}d}yt|jdWntk rtdd|jDrd}nd}|dk r|j r|djdkr|j r|j d}nd }xFt |D]8\} } | jd krt | dd|| .z unknown-8bitzutf-8r'rHr#zbare-quoted-stringcss|]}t|VqdS)N)r()r9r:rrrr;sTz quoted-stringrIrI)rDr r(rQrXrYr>rKrFrLrG enumeraterirkr$rrTrrr?rr/r+)rrNrlrJrZr[Zhas_ewrR remainderir,rnrorprrrr+sL      !/       z Phrase._foldc Csg}d}d}x|D]}t|}y|jd|j|Wn&tk rqd}|dkr|jst|}n|j|j||n|jsm|d jdkr|jr|j d }nd}xFt |D]8\} } | jdkrt | dd|| rsc@seZdZdZdS)WordZwordN)r0r1r2rFrrrrrx0s rxc@s"eZdZdZddZdS)CFWSListrHcCs t|jS)N)boolrL)rrrrrK9szCFWSList.has_leading_commentN)r0r1r2rFrKrrrrry5s ryc@seZdZdZdS)AtomatomN)r0r1r2rFrrrrr{=s r{c@seZdZdZdS)Tokenr,N)r0r1r2rFrrrrr}Bs r}c@s:eZdZdZdZdZdZeddZdS) EncodedWordz encoded-wordNcCs3|jdk r|jStjt||jdS)N)cterPrQr rR)rrrrencodedNszEncodedWord.encoded) r0r1r2rFrrRlangrcrrrrrr~Gs 
r~c@sLeZdZdZeddZeddZeddZdS) QuotedStringz quoted-stringcCs+x$|D]}|jdkr|jSqWdS)Nzbare-quoted-string)rFr)rr:rrrrgZs zQuotedString.contentcCsYg}xC|D];}|jdkr8|jt|q |j|jq Wdj|S)Nzbare-quoted-stringr#)rFrr rr$)rrSr:rrr quoted_value`s  zQuotedString.quoted_valuecCs+x$|D]}|jdkr|jSqWdS)Nzbare-quoted-string)rFr)rr,rrrstripped_valuejs zQuotedString.stripped_valueN)r0r1r2rFrcrgrrrrrrrVs  rc@s4eZdZdZddZeddZdS)BareQuotedStringzbare-quoted-stringcCs tdjdd|DS)Nr#css|]}t|VqdS)N)r )r9r:rrrr;vsz+BareQuotedString.__str__..)rr$)rrrrr%uszBareQuotedString.__str__cCsdjdd|DS)Nr#css|]}t|VqdS)N)r )r9r:rrrr;zsz)BareQuotedString.value..)r$)rrrrrxszBareQuotedString.valueN)r0r1r2rFr%rcrrrrrrqs  rc@sReZdZdZddZddZeddZedd Zd S) Commentrfcs8djtdgfddDdgggS)Nr#rcsg|]}j|qSr)quote)r9r:)rrrrhs z#Comment.__str__..))r$r?)rr)rrr%s   zComment.__str__cCsG|jdkrt|St|jddjddjddS)Nrfr z\\rz\(rz\))rFr r)rrrrrrs   z Comment.quotecCsdjdd|DS)Nr#css|]}t|VqdS)N)r )r9r:rrrr;sz"Comment.content..)r$)rrrrrgszComment.contentcCs |jgS)N)rg)rrrrrLszComment.commentsN) r0r1r2rFr%rrcrgrLrrrrr}s   rc@sLeZdZdZeddZeddZeddZdS) AddressListz address-listcCsdd|DS)NcSs%g|]}|jdkr|qS)address)rF)r9r:rrrrhs z)AddressList.addresses..r)rrrr addressesszAddressList.addressescCstdd|DgS)Ncss'|]}|jdkr|jVqdS)rN)rF mailboxes)r9r:rrrr;sz(AddressList.mailboxes..)r?)rrrrrs zAddressList.mailboxescCstdd|DgS)Ncss'|]}|jdkr|jVqdS)rN)rF all_mailboxes)r9r:rrrr;sz,AddressList.all_mailboxes..)r?)rrrrrs zAddressList.all_mailboxesN)r0r1r2rFrcrrrrrrrrs rc@sLeZdZdZeddZeddZeddZdS) AddressrcCs"|djdkr|djSdS)Nrgroup)rF display_name)rrrrrszAddress.display_namecCs@|djdkr|dgS|djdkr5gS|djS)Nrmailboxzinvalid-mailbox)rFr)rrrrrs  zAddress.mailboxescCsG|djdkr|dgS|djdkr<|dgS|djS)Nrrzinvalid-mailbox)rFr)rrrrrs   zAddress.all_mailboxesN)r0r1r2rFrcrrrrrrrrs rc@s:eZdZdZeddZeddZdS) MailboxListz mailbox-listcCsdd|DS)NcSs%g|]}|jdkr|qS)r)rF)r9r:rrrrhs z)MailboxList.mailboxes..r)rrrrrszMailboxList.mailboxescCsdd|DS)NcSs%g|]}|jdkr|qS)rinvalid-mailbox)zmailboxr)rF)r9r:rrrrhs 
z-MailboxList.all_mailboxes..r)rrrrrszMailboxList.all_mailboxesN)r0r1r2rFrcrrrrrrrs rc@s:eZdZdZeddZeddZdS) GroupListz group-listcCs)| s|djdkrgS|djS)Nrz mailbox-list)rFr)rrrrrszGroupList.mailboxescCs)| s|djdkrgS|djS)Nrz mailbox-list)rFr)rrrrrszGroupList.all_mailboxesN)r0r1r2rFrcrrrrrrrs rc@sLeZdZdZeddZeddZeddZdS) GrouprcCs"|djdkrgS|djS)Nz group-list)rFr)rrrrrszGroup.mailboxescCs"|djdkrgS|djS)Nrz group-list)rFr)rrrrrszGroup.all_mailboxescCs |djS)Nr)r)rrrrrszGroup.display_nameN)r0r1r2rFrcrrrrrrrrs rc@speZdZdZeddZeddZeddZedd Zed d Z d S) NameAddrz name-addrcCs!t|dkrdS|djS)Nr'r)r(r)rrrrrszNameAddr.display_namecCs |djS)Nr'rI) local_part)rrrrrszNameAddr.local_partcCs |djS)Nr'rI)domain)rrrrrszNameAddr.domaincCs |djS)Nr'rI)route)rrrrr szNameAddr.routecCs |djS)Nr'rI) addr_spec)rrrrr szNameAddr.addr_specN) r0r1r2rFrcrrrrrrrrrrs rc@s^eZdZdZeddZeddZeddZedd Zd S) AngleAddrz angle-addrcCs+x$|D]}|jdkr|jSqWdS)Nz addr-spec)rFr)rr:rrrrs zAngleAddr.local_partcCs+x$|D]}|jdkr|jSqWdS)Nz addr-spec)rFr)rr:rrrrs zAngleAddr.domaincCs+x$|D]}|jdkr|jSqWdS)Nz obs-route)rFdomains)rr:rrrr"s zAngleAddr.routecCs/x(|D]}|jdkr|jSqWdSdS)Nz addr-specz<>)rFr)rr:rrrr(s  zAngleAddr.addr_specN) r0r1r2rFrcrrrrrrrrrs rc@s(eZdZdZeddZdS)ObsRoutez obs-routecCsdd|DS)NcSs(g|]}|jdkr|jqS)r)rFr)r9r:rrrrh7s z$ObsRoute.domains..r)rrrrr5szObsRoute.domainsN)r0r1r2rFrcrrrrrr1s rc@speZdZdZeddZeddZeddZedd Zed d Z d S) MailboxrcCs"|djdkr|djSdS)Nrz name-addr)rFr)rrrrr>szMailbox.display_namecCs |djS)Nr)r)rrrrrCszMailbox.local_partcCs |djS)Nr)r)rrrrrGszMailbox.domaincCs"|djdkr|djSdS)Nrz name-addr)rFr)rrrrrKsz Mailbox.routecCs |djS)Nr)r)rrrrrPszMailbox.addr_specN) r0r1r2rFrcrrrrrrrrrr:s rc@s:eZdZdZeddZeZZZZ dS)InvalidMailboxzinvalid-mailboxcCsdS)Nr)rrrrrYszInvalidMailbox.display_nameN) r0r1r2rFrcrrrrrrrrrrUs rcs.eZdZdZefddZS)DomainrcsdjtjjS)Nr#)r$r4rsplit)r)r8rrrdsz Domain.domain)r0r1r2rFrcrrr)r8rr`s rc@seZdZdZdS)DotAtomzdot-atomN)r0r1r2rFrrrrris rc@seZdZdZdS) DotAtomTextz dot-atom-textN)r0r1r2rFrrrrrns rc@s^eZdZdZeddZeddZeddZedd 
Zd S) AddrSpecz addr-speccCs |djS)Nr)r)rrrrrwszAddrSpec.local_partcCs!t|dkrdS|djS)Nr'rI)r(r)rrrrr{szAddrSpec.domaincCsJt|dkr|djS|djj|dj|djjS)Nrrr'r)r(rrstriplstrip)rrrrrs zAddrSpec.valuecCsht|j}t|t|tkr=t|j}n |j}|jdk rd|d|jS|S)N@)setrr( DOT_ATOM_ENDSrr)rZnamesetZlprrrrs zAddrSpec.addr_specN) r0r1r2rFrcrrrrrrrrrss rc@seZdZdZdS) ObsLocalPartzobs-local-partN)r0r1r2rFrrrrrs rcs@eZdZdZeddZefddZS) DisplayNamez display-namecCst|}|djdkr/|jdn8|ddjdkrgt|ddd|dszTerminal.all_defectsr#csIdj||jj|jtj|js3dndj|jgS)Nz {}{}/{}({}){}r#z {})r<r8r0rFr4r=r5)rr])r8rrr_s   z Terminal._ppc CsJt|}y|jd|SWn"tk rEtj||SYnXdS)Nzus-ascii)r rQrXrP)rrRrrrrrrUs    zTerminal.cte_encodecCsdS)Nr)rrrrrAszTerminal.pop_trailing_wscCsdS)Nr)rrrrr*szTerminal.pop_leading_fwscCsgS)Nr)rrrrrLszTerminal.commentscCsdS)NFr)rrrrrKszTerminal.has_leading_commentcCst||jfS)N)r rF)rrrr__getnewargs__szTerminal.__getnewargs__)r0r1r2rr=rcr>r_rUrAr*rLrKrrr)r8rrs     rc@s4eZdZeddZddZdZdS)WhiteSpaceTerminalcCsdS)Nrer)rrrrrszWhiteSpaceTerminal.valuecCsdS)NTr)rrrrr@ sz!WhiteSpaceTerminal.startswith_fwsTN)r0r1r2rcrr@r)rrrrrs  rc@s@eZdZeddZddZdZddZdS) ValueTerminalcCs|S)Nr)rrrrrszValueTerminal.valuecCsdS)NFr)rrrrr@szValueTerminal.startswith_fwsFcCstjt||S)N)rPrQr )rrRrrrrTszValueTerminal.as_encoded_wordN)r0r1r2rcrr@r)rTrrrrrs  rc@sFeZdZeddZeddZddZdZdS) EWWhiteSpaceTerminalcCsdS)Nr#r)rrrrr szEWWhiteSpaceTerminal.valuecCs|ddS)Nr)rrrrr$szEWWhiteSpaceTerminal.encodedcCsdS)Nr#r)rrrrr%(szEWWhiteSpaceTerminal.__str__TN)r0r1r2rcrrr%r)rrrrrs  rr,zlist-separatorrzroute-component-markerz([{}]+)r#z[^{}]+r z\\]z\]z[\x00-\x20\x7F]cCs]t|}|r.|jjtj|ntj|rY|jjtjdndS)z@If input token contains ASCII non-printables, register a defect.z*Non-ASCII characters found in header tokenN)_non_printable_finderr5rrZNonPrintableDefectrrrW)xtextZnon_printablesrrr_validate_xtextVs  rcCst|d^}}g}d}d}xtt|D]k}||dkrq|red}d}qqd}q:n|rd}n|||krPn|j||q:W|d}dj|dj||dg||fS)akScan printables/quoted-pairs until endchars and return unquoted ptext. 
This function turns a run of qcontent, ccontent-without-comments, or dtext-with-quoted-printables into a single string by unquoting any quoted printables. It returns the string, the remaining value, and a flag that is True iff there were any quoted printables decoded. r'Fr Tr#N) _wsp_splitterranger(rr$)rendcharsZfragmentrvZvcharsescapehad_qpposrrr_get_ptext_to_endchars`s$    rcCs?|j}t|dt|t|d}||fS)zFWS = 1*WSP This isn't the RFC definition. We're using fws to represent tokens where folding can be done, but when we are parsing the *un*folding has already been done so we don't need to watch out for CRLF. NrE)rrr()rZnewvaluerErrrget_fws~s )rc CsKt}|jds3tjdj|n|ddjdd^}}||ddkrtjdj|ndj|}t|dkr|dtkr|dtkr|jdd^}}|d|}nt|jdkr$|j j tj d n||_ dj|}y't jd|d\}}}} Wn-tk rtjd j|j YnX||_||_|j j| x|r@|dtkrt|\} }|j | qnt|d^} }t| d } t| |j | dj|}qW||fS) zE encoded-word = "=?" charset "?" encoding "?" encoded-text "?=" z=?z"expected encoded word but found {}rNz?=r'r#rzwhitespace inside encoded wordz!encoded word format invalid: '{}'vtext)r~ startswithrHeaderParseErrorr<rr$r(rr5rrrrPrrrRrrWSPrrrr) rZewrrvZremstrrrrRrr5r,charsrrrrget_encoded_wordsH "2  '       rc Cst}xq|r||dtkrGt|\}}|j|q n|jdr/yt|\}}Wntjk rYq/Xd}t|dkr|d j dkr|j jtj dd}qn|rt|dkr|d j d krt |dd|d This is not the RFC ctext, since we are handling nested comments in comment and unquoting quoted-pairs here. We allow anything except the '()' characters, but if we find any ASCII other than the RFC defined printable ASCII an NonPrintableDefect is added to the token's defects list. Since quoted pairs are converted to their unquoted values, what is returned is a 'ptext' token. In this case it is a WhiteSpaceTerminal, so it's value is ' '. z()r)rrr)rr_rrr get_qp_ctexts  rcCs;t|d\}}}t|d}t|||fS)aoqcontent = qtext / quoted-pair We allow anything except the DQUOTE character, but if we find any ASCII other than the RFC defined printable ASCII an NonPrintableDefect is added to the token's defects list. 
Any quoted pairs are converted to their unquoted values, so what is returned is a 'ptext' token. In this case it is a ValueTerminal. r r)rrr)rrrrrr get_qcontents  rcCsrt|}|s-tjdj|n|j}|t|d}t|d}t|||fS)zatext = We allow any non-ATOM_ENDS in atext, but add an InvalidATextDefect to the token's defects list if we find non-atext characters. zexpected atext but found '{}'Natext)_non_atom_end_matcherrrr<rr(rr)rmrrrr get_atext s   r c CsT|ddkr+tjdj|nt}|dd}x|r|ddkr|dtkrt|\}}n|dddkry/t|\}}|jjtj dWqtjk rt |\}}YqXnt |\}}|j|qGW|s@|jjtj d ||fS||ddfS) zbare-quoted-string = DQUOTE *([FWS] qcontent) [FWS] DQUOTE A quoted-string without the leading or trailing white space. Its value is the text between the quote marks, with whitespace preserved and quoted pairs decoded. rr zexpected '"' but found '{}'r'Nrz=?z!encoded word inside quoted stringz"end of header inside quoted string) rrr<rrrrr5rrr)rZbare_quoted_stringr,rrrget_bare_quoted_strings,   r cCs |r1|ddkr1tjdj|nt}|dd}x|r|ddkr|dtkrt|\}}n7|ddkrt|\}}nt|\}}|j|qMW|s|j jtj d||fS||ddfS)zcomment = "(" *([FWS] ccontent) [FWS] ")" ccontent = ctext / quoted-pair / comment We handle nested comments here, and quoted-pair in our qp-ctext routine. rrzexpected '(' but found '{}'r'Nrzend of header inside comment) rrr<rrr get_commentrrr5r)rrfr,rrrr ;s"   r cCstt}x^|ri|dtkri|dtkrGt|\}}nt|\}}|j|q W||fS)z,CFWS = (1*([FWS] comment) [FWS]) / FWS r)ry CFWS_LEADERrrr r)rrHr,rrrget_cfwsTs rcCst}|rA|dtkrAt|\}}|j|nt|\}}|j||r|dtkrt|\}}|j|n||fS)zquoted-string = [CFWS] [CFWS] 'bare-quoted-string' is an intermediate class defined by this parser and not by the RFC grammar. It is the quoted string without any attached CFWS. r)rr rrr )rZ quoted_stringr,rrrget_quoted_stringas  rc Cs%t}|rA|dtkrAt|\}}|j|n|rr|dtkrrtjdj|n|jdryt |\}}Wqtjk rt |\}}YqXnt |\}}|j||r|dtkrt|\}}|j|n||fS)zPatom = [CFWS] 1*atext [CFWS] An atom could be an rfc2047 encoded word. rzexpected atom but found '{}'z=?) 
r{r rr ATOM_ENDSrrr<rrr )rr|r,rrrget_atomss$  rcCst}| s |dtkr;tjdj|nxo|r|dtkrt|\}}|j||r>|ddkr>|jt|dd}q>q>W|dtkrtjdjd|n||fS)z( dot-text = 1*atext *("." 1*atext) rz8expected atom at a start of dot-atom-text but found '{}'r r'Nz4expected atom at end of dot-atom-text but found '{}'rI)rrrrr<r rr)rZ dot_atom_textr,rrrget_dot_atom_texts     rc Cst}|dtkr;t|\}}|j|n|jdryt|\}}Wqtjk rt|\}}YqXnt|\}}|j||r|dtkrt|\}}|j|n||fS)z dot-atom = [CFWS] dot-atom-text [CFWS] Any place we can have a dot atom, we could instead have an rfc2047 encoded word. rz=?) rr rrrrrrr)rZdot_atomr,rrr get_dot_atoms  rcCs|dtkr%t|\}}nd}|ddkrPt|\}}n=|dtkr{tjdj|nt|\}}|dk r|g|dd|dtkr>|ddkr|jt|jjtj d|dd}q`yt|\}}WnVtjk r-|dt kr&t |\}}|jjtj dnYnX|j|q`W||fS)a phrase = 1*word / obs-phrase obs-phrase = word *(word / "." / CFWS) This means a phrase can be a sequence of words, periods, and CFWS in any order as long as it starts with at least one word. If anything other than words is detected, an ObsoleteHeaderDefect is added to the token's defect list. We also accept a phrase that starts with CFWS followed by a dot; this is registered as an InvalidHeaderDefect, since it is not supported by even the obsolete grammar. zphrase does not start with wordrr zperiod in 'phrase'r'Nzcomment found without atom) rsrrrrr5r PHRASE_ENDSrObsoleteHeaderDefectr r)rrtr,rrr get_phrases.    rcCst}d}|dtkr4t|\}}n|sUtjdj|nyt|\}}Wnrtjk ryt|\}}WnDtjk r|ddkr|dtkrnt }YnXYnX|dk r|g|dd|djd kr>|jjtjdn|jrSd|_n||fS)z' obs-local-part = word *("." word) Frr r zinvalid repeated '.'Tr'Nzmisplaced-specialz/'\' character outside of quoted-string/ccontentrzmissing '.' between wordsrHz!Invalid leading '.' in local partrz"Invalid trailing '.' 
in local partzinvalid-obs-local-partrIrIrIr) rrr5rrrrrrFrrr r)rrZlast_non_ws_was_dotr,rrrr&sV )          rcCs]t|d\}}}t|d}|rI|jjtjdnt|||fS)a dtext = / obs-dtext obs-dtext = obs-NO-WS-CTL / quoted-pair We allow anything except the excluded characters, but if we find any ASCII other than the RFC defined printable ASCII an NonPrintableDefect is added to the token's defects list. Quoted pairs are converted to their unquoted values, so what is returned is a ptext token, in this case a ValueTerminal. If there were quoted-printables, an ObsoleteHeaderDefect is added to the returned token's defect list. z[]rz(quoted printable found in domain-literal)rrr5rrrr)rrrrrr get_dtextUs   rcCs:|r dS|jtjd|jtdddS)NFz"end of input inside domain-literalrzdomain-literal-endT)rrrr)rdomain_literalrrr_check_for_early_dl_endis   r cCst}|dtkr;t|\}}|j|n|sStjdn|ddkr~tjdj|n|dd}t||r||fS|jtdd|dt krt |\}}|j|nt |\}}|j|t||r'||fS|dt krYt |\}}|j|nt||rr||fS|ddkrtjd j|n|jtdd |dd}|r|dtkrt|\}}|j|n||fS) zB domain-literal = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS] rzexpected domain-literal[z6expected '[' at start of domain-literal but found '{}'r'Nzdomain-literal-startrz4expected ']' at end of domain-literal but found '{}'zdomain-literal-end) rr rrrrr<r rrrr)rrr,rrrget_domain_literalqsD       r"c Cst}d}|dtkr4t|\}}n|sUtjdj|n|ddkrt|\}}|dk r|g|dd" [CFWS] / obs-angle-addr obs-angle-addr = [CFWS] "<" obs-route addr-spec ">" [CFWS] rzangle-addr-endznull addr-spec in angle-addrz*obsolete route specification in angle-addrz.expected addr-spec or obs-route but found '{}'z"missing trailing '>' on angle-addr) rr rrrrr<rr5rr$r(r)rZ angle_addrr,rrrget_angle_addrsJ      r+cCsRt}t|\}}|j|dd|jdd|_||fS)z display-name = phrase Because this is simply a name-rule, we don't return a display-name token containing a phrase, but rather a display-name token with the content of the phrase. 
N)rrrr5)rrr,rrrget_display_name#s  r,cCsJt}d}|dtkrUt|\}}|sUtjdj|qUn|ddkr|dtkrtjdj|nt|\}}|stjdj|n|dk r|g|ddd.zinvalid-mailbox) rr-rrr$r<rYr>rFr)rrr,rrr get_mailboxPs    r.cCst}xv|r|d|kr|dtkr_|jt|dd|dd}q t|\}}|j|q W||fS)z Read everything up to one of the chars in endchars. This is outside the formal grammar. The InvalidMailbox TokenList that is returned acts like a Mailbox, but the data attributes are None. rzmisplaced-specialr'N)rrrrr)rrZinvalid_mailboxr,rrrget_invalid_mailboxes  r/c CsYt}xC|rN|ddkrNy#t|\}}|j|Wnftjk rd}|dtkr t|\}}| s|ddkr|j||jjtjdqt |d\}}|dk r|g|dd We allow any non-TOKEN_ENDS in ttext, but add defects to the token's defects list if we find non-ttext characters. We also register defects for *any* non-printables even though the RFC doesn't exclude all of them, because we follow the spirit of RFC 5322. zexpected ttext but found '{}'Nttext)_non_token_end_matcherrrr<rr(rr)rr r=rrr get_ttext s   r?cCst}|rA|dtkrAt|\}}|j|n|rr|dtkrrtjdj|nt|\}}|j||r|dtkrt|\}}|j|n||fS)ztoken = [CFWS] 1*ttext [CFWS] The RFC equivalent of ttext is any US-ASCII chars except space, ctls, or tspecials. We also exclude tabs even though the RFC doesn't. The RFC implies the CFWS but is not explicit about it in the BNF. rzexpected token but found '{}') r}r rr TOKEN_ENDSrrr<r?)rZmtokenr,rrr get_token s  rAcCsrt|}|s-tjdj|n|j}|t|d}t|d}t|||fS)aQattrtext = 1*(any non-ATTRIBUTE_ENDS character) We allow any non-ATTRIBUTE_ENDS in attrtext, but add defects to the token's defects list if we find non-attrtext characters. We also register defects for *any* non-printables even though the RFC doesn't exclude all of them, because we follow the spirit of RFC 5322. z expected attrtext but found {!r}Nr)_non_attribute_end_matcherrrr<rr(rr)rr rrrr get_attrtext s   rCcCst}|rA|dtkrAt|\}}|j|n|rr|dtkrrtjdj|nt|\}}|j||r|dtkrt|\}}|j|n||fS)aH [CFWS] 1*attrtext [CFWS] This version of the BNF makes the CFWS explicit, and as usual we use a value terminal for the actual run of characters. 
The RFC equivalent of attrtext is the token characters, with the subtraction of '*', "'", and '%'. We include tab in the excluded set just as we do for token. rzexpected token but found '{}') rr rrATTRIBUTE_ENDSrrr<rC)rrr,rrr get_attribute s  rEcCsrt|}|s-tjdj|n|j}|t|d}t|d}t|||fS)zattrtext = 1*(any non-ATTRIBUTE_ENDS character plus '%') This is a special parsing routine so that we get a value that includes % escapes as a single string (which we decode as a single string later). z)expected extended attrtext but found {!r}Nzextended-attrtext)#_non_extended_attribute_end_matcherrrr<rr(rr)rr rrrrget_extended_attrtext s   rGcCst}|rA|dtkrAt|\}}|j|n|rr|dtkrrtjdj|nt|\}}|j||r|dtkrt|\}}|j|n||fS)z [CFWS] 1*extended_attrtext [CFWS] This is like the non-extended version except we allow % characters, so that we can pick up an encoded value as a single string. rzexpected token but found '{}') rr rrEXTENDED_ATTRIBUTE_ENDSrrr<rG)rrr,rrrget_extended_attribute s  rIcCs<t}| s |ddkr;tjdj|n|jtdd|dd}| sy|dj rtjdj|nd}x8|r|djr||d7}|dd}qW|dd kr |d kr |jjtjd nt ||_ |jt|d ||fS) a6 '*' digits The formal BNF is more complicated because leading 0s are not allowed. We check for that and add a defect. We also assume no CFWS is allowed between the '*' and the digits, though the RFC is not crystal clear on that. The caller should already have dealt with leading CFWS. 
r*zExpected section but found {}zsection-markerr'Nz$Expected section number but found {}r#0z§ion numberhas an invalid leading 0r7) rrrr<rrr9r5ZInvalidHeaderErrorr:r)rrr7rrr get_section s$   rLcCst}|s!tjdnd}|dtkrLt|\}}n|smtjdj|n|ddkrt|\}}nt|\}}|dk r|g|dd s     rNc Cst}t|\}}|j|| s?|ddkrk|jjtjdj|||fS|ddkry,t|\}}d|_|j|Wntj k rYnX|stj dn|ddkr|jt dd|dd }d|_ qn|dd kr>tj d n|jt d d |dd }d }|r|dt krt |\}}|j|nd }|}|j r|r|dd krt|\}}|j}d}|jdkrP|r|ddkrd}qt|\}} | r| ddkrd}qn0yt|\}} WnYnX| sd}n|r|jjtjd|j|x7|D]/} | jdkrg| d d <| }PqqW|}qd }|jjtjdn|r0|ddkr0d }nt|\}}|j s[|jdkr| sr|ddkr|j||d k r| st||}n||fS|jjtjdn|s|jjtjd|j||d krX||fSnF|d k rkx!|D]} | jdkr%Pq%q%W| jdk|j| | j|_n|ddkrtj dj|n|jt dd|dd }|r2|ddkr2t|\}}|j||j|_| s|ddkr2tj dj|q2n|jt dd|dd }|d k rt} xN|r|dtkrt|\}}nt|\}}| j|qpW| }nt|\}}|j||d k r| st||}n||fS)aY attribute [section] ["*"] [CFWS] "=" value The CFWS is implied by the RFC but not made explicit in the BNF. This simplified form of the BNF from the RFC is made to conform with the RFC BNF through some extra checks. We do it this way because it makes both error recovery and working with the resulting parse tree easier. 
rr0z)Parameter contains name ({}) but no valuerJTzIncomplete parameterzextended-parameter-markerr'N=zParameter not followed by '='zparameter-separatorr F'z5Quoted string value for extended parameter is invalidzbare-quoted-stringzZParameter marked as extended but appears to have a quoted string value that is non-encodedzcApparent initial-extended-value but attribute was not marked as extended or was not initial sectionz(Missing required charset/lang delimiterszextended-attrtextrz=Expected RFC2231 char/lang encoding delimiter, but found {!r}zRFC2231 delimiterz;Expected RFC2231 char/lang encoding delimiter, but found {})rrErr5rrr<rLrrrrr rrrrrCrGrFrNAssertionErrorrrRrrrrr) rrr,rrvZappendtoZqstringZ inner_valueZ semi_validrtrMrrr get_parameterT s                                       rScCst}x|ry#t|\}}|j|Wntjk rF}zd}|dtkrxt|\}}n|s|j||S|ddkr|dk r|j|n|jjtjdn]t |\}}|r|g|ddt}d}|s2|jjtjd|Syt|\}}WnHtjk r|jjtjdj|t |||SYnX|j|| s|ddkr|jjtjd|rt ||n|S|j j j |_ |jtdd|dd }yt|\}}WnHtjk r|jjtjd j|t |||SYnX|j||j j j |_|s|S|dd kr|jjtjd j||` |`t |||S|jtd d |jt|dd |S)z maintype "/" subtype *( ";" parameter ) The maintype and substype are tokens. Theoretically they could be checked against the official IANA list + x-token, but we don't do that. Fz"Missing content type specificationz(Expected content maintype but found {!r}r/zInvalid content typezcontent-type-separatorr'Nz'Expected content subtype but found {!r}r0zrBrFrrrrrkrrr r r rrrrrrrrrrr r"r#r$r(r+r,r-r.r/r1r2r3r4r6r;r<r?rArCrErGrIrLrNrSrTrUrXrYrZrrrrDs         T J_     '# U3 $ 0 0 0 !  * 8           & ' /   ' $  ) .     9 %   > D          4  9
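
The token attributes described in the module docstring ('token_type', 'all_defects', the string value) can be tried out with a short sketch like the one below. It assumes the module is importable as `email._header_value_parser` (the source path embedded in this file) and that `get_address_list` returns a `(token, remaining_text)` pair, as it does in CPython's standard library. This is a private module, so names and behavior may differ between Python versions.

```python
# Sketch only: email._header_value_parser is a private CPython module,
# so its API is not guaranteed to be stable across Python versions.
from email import _header_value_parser as hvp

# The input must already be unfolded (no header line breaks).
address_list, rest = hvp.get_address_list("Fred <fred@example.com>")

# Every token names the grammar rule it represents.
print(address_list.token_type)

# all_defects aggregates the defects found anywhere in the subtree.
print(address_list.all_defects)

# Mailbox tokens expose the semantically meaningful parts of an address.
mailbox = address_list.mailboxes[0]
print(mailbox.display_name, mailbox.addr_spec)

# str() gives the canonical text form of the parsed subtree.
print(str(address_list))
```

For application code, the public `email.headerregistry` and `email.policy` interfaces are probably the safer entry points; the sketch above is only meant to make the parse tree visible.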