parsing - Using regular expressions in Python to parse LaTeX code
I am trying to write a Python script to tidy up my LaTeX code. I would like to find instances in which an environment is started but there are non-whitespace characters after the declaration and before the next newline. For example, I would like to match
\begin{theorem}[weierstrass approximation] \label{wapprox}
but not match
\begin{theorem}[weierstrass approximation]
\label{wapprox}
My goal is to insert (using re.sub) a newline character between the end of the declaration and the first non-whitespace character. Sloppily stated, I want to find something like
(\begin{env}) ({text}|[text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)
to do the replacement. I've tried
expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$', re.M)
but this isn't quite working. As the last group, it only matches the last pair of {,} or [,].
You can do something like this:
import re

s = r'''\begin{theorem}[weierstrass approximation] \label{wapprox}

but not match

\begin{theorem}[weierstrass approximation]
\label{wapprox}'''

p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(p.sub(r'\1\n', s))
pattern details:
(                       # capture group 1
    \\
    (?:begin|end)
    # trick to emulate an atomic group
    (?=(                # the subpattern is enclosed in a lookahead and a capture group (2)
        (?:{[^}]*}|\[[^]]*])*
    ))                  # the lookahead is naturally atomic
    \2                  # backreference to capture group 2
)
[^\S\n]*                # optional horizontal whitespace
(?=\S)                  # followed by a non-whitespace character
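The annotated pattern above can be compiled more or less as written by passing re.VERBOSE, which makes re ignore the layout whitespace and the # comments. A minimal sketch; the names ENV_DECL and tidy_environments are hypothetical, picked here only for illustration:

```python
import re

# Verbose form of the pattern; layout whitespace and # comments are ignored.
ENV_DECL = re.compile(r'''
    (                       # capture group 1: the whole declaration
        \\ (?:begin|end)
        (?=(                # lookahead + capture group 2 emulate an atomic group
            (?: {[^}]*} | \[[^]]*] )*
        ))
        \2                  # consume exactly what the lookahead captured
    )
    [^\S\n]*                # optional horizontal whitespace
    (?=\S)                  # followed by a non-whitespace character
''', re.VERBOSE)


def tidy_environments(source: str) -> str:
    # Move whatever follows a begin/end declaration onto its own line.
    return ENV_DECL.sub(r'\1\n', source)


print(tidy_environments(r'\begin{theorem}[weierstrass approximation] \label{wapprox}'))
```

A declaration that is already followed by a newline is left untouched, since (?=\S) cannot succeed there.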
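Explanation: if you write the pattern naively as
(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)
you can't exclude the cases that have a newline character before the next token. Consider the following scenario. The part of the pattern before the lookahead,
(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*
first matches the whole declaration, \begin{theorem}[weierstrass approximation], in:
\begin{theorem}[weierstrass approximation]
\label{wapprox}
But since (?=\S) fails (because the next character is a newline), the backtracking mechanism kicks in: the quantified group
(?:{[^}]*}|\[[^]]*])*
gives back its last repetition, so that capture group 1 now matches only
\begin{theorem}
and (?=\S) succeeds on the [ character. The substitution would then insert a newline between \begin{theorem} and [weierstrass approximation], which is not what you want.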
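The scenario above can be reproduced with a short script. This is only a sketch: the variable names naive and atomic are made up for this example, and the input is the case that should be left alone.

```python
import re

# Input that should NOT be changed: the declaration is already followed by a newline.
text = '\\begin{theorem}[weierstrass approximation]\n\\label{wapprox}'

# Naive pattern: the quantified group may backtrack and give up "[...]".
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')

# Lookahead + backreference: the declaration is matched as a whole or not at all.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(naive.search(text).group(1))   # \begin{theorem}  (declaration cut short)
print(atomic.search(text))           # None  (nothing to fix on this input)
```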
An atomic group is a non-capturing group that forbids backtracking into the subpattern enclosed in the group. The notation is (?>subpattern). Unfortunately the re module doesn't have this feature before Python 3.11, but you can emulate it with the trick (?=(subpattern))\1.
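To see the emulation in isolation, here is a small sketch on a toy pattern unrelated to LaTeX (the sample string and both patterns are assumptions made up for illustration). A plain \d+ gives back one digit so that the trailing \d can match; the lookahead-plus-backreference version keeps every digit, so the overall match fails:

```python
import re

sample = '12345'

# Ordinary greedy quantifier: \d+ backtracks by one digit so the final \d can match.
print(re.fullmatch(r'\d+\d', sample) is not None)   # True

# Emulated atomic group: (?=(\d+)) captures all the digits, \1 consumes them,
# and none can be given back, so the final \d has nothing left to match.
print(re.fullmatch(r'(?=(\d+))\1\d', sample))       # None
```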
Note that you can use the regex module (which has this feature) instead of re:
import regex
p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')
or, with possessive quantifiers:
p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')