parsing - Using regular expressions in Python to parse LaTeX code

I am trying to write a Python script to tidy up my LaTeX code. I would like to find instances in which an environment is started but there are non-whitespace characters after the declaration and before the next newline. For example, I would like to match

\begin{theorem}[weierstrass approximation] \label{wapprox}

but not match

\begin{theorem}[weierstrass approximation]
\label{wapprox}

My goal is to insert (using re.sub) a newline character between the end of the declaration and the first non-whitespace character. Sloppily stated, I want to find something like

(\begin{env}) ({text} | [text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)

to do the replacement. I've tried

expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$', re.M)

but this isn't quite working. As for the last group, it only captures the last pair of {...} or [...].
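The symptom described above, where only the last bracketed part is captured, is how repeated capture groups behave in re: each repetition overwrites the previous one. A minimal sketch using a simplified form of the attempted pattern (the variable names are illustrative and not part of the original post):

```python
import re

line = r'\begin{theorem}[weierstrass approximation] \label{wapprox}'

# Simplified form of the attempted pattern: one capture group, repeated with +.
attempt = re.compile(r'\\(begin|end)({[^}]+}|\[[^\]]+\])+')

m = attempt.match(line)
whole = m.group(0)      # everything the pattern consumed
last_arg = m.group(2)   # only the final repetition of the repeated group

print(whole)
print(last_arg)
```

To keep the whole run of arguments, the repetition has to sit inside the capture group rather than be applied to it.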

You can do something like this:

import re

s = r'''\begin{theorem}[weierstrass approximation] \label{wapprox}

but not match

\begin{theorem}[weierstrass approximation]
\label{wapprox}'''

# (?=(...))\2 emulates an atomic group; [^\S\n] matches horizontal whitespace only
p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(p.sub(r'\1\n', s))

Pattern details:

(   # capture group 1
    \\
    (?:begin|end)
    # trick to emulate an atomic group
    (?=(  # the subpattern is enclosed in a lookahead and in capture group 2
        (?:{[^}]*}|\[[^]]*])*
    ))  # a lookahead is naturally atomic
    \2  # backreference to capture group 2
)
[^\S\n]*  # optional horizontal whitespace
(?=\S)    # followed by a non-whitespace character
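The annotated pattern above can be compiled as is with the re.VERBOSE flag, which ignores whitespace and comments inside the pattern. The sketch below wraps it in a helper; the names DECLARATION and break_after_declaration are hypothetical and chosen only for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments can stay inline.
DECLARATION = re.compile(r'''
    (                           # capture group 1: the whole declaration
        \\(?:begin|end)
        (?=(                    # lookahead + capture group 2 (atomic-group emulation)
            (?:{[^}]*}|\[[^]]*])*
        ))
        \2                      # consume what the lookahead captured
    )
    [^\S\n]*                    # optional horizontal whitespace
    (?=\S)                      # next character is not whitespace
''', re.VERBOSE)


def break_after_declaration(text):
    """Hypothetical helper: put a newline after each matching declaration."""
    return DECLARATION.sub(r'\1\n', text)


print(break_after_declaration(r'\begin{proof}Let x be given. \end{proof} Next.'))
```

A declaration that is already followed by a newline is left untouched, because the final lookahead cannot succeed there.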

Explanation: if you write a pattern like (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you can't exclude the cases that have a newline character before the next token. Consider the following scenario:

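Take this input:

\begin{theorem}[weierstrass approximation]
\label{wapprox}

Group 1 of (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S) first matches the whole declaration, \begin{theorem}[weierstrass approximation].

But since (?=\S) fails (because the next character is a newline), the backtracking mechanism kicks in: the quantified group gives back its last repetition, so group 1 now matches only \begin{theorem},

and (?=\S) now succeeds, because the next character is [. The newline would then be inserted between \begin{theorem} and [weierstrass approximation], which is not what you want.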
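The backtracking scenario above can be reproduced directly. This is a small sketch comparing the plain pattern with the version that emulates the atomic group; the variable names are illustrative only.

```python
import re

# A declaration that is already followed by a newline: nothing should change.
text = '\\begin{theorem}[weierstrass approximation]\n\\label{wapprox}'

# Plain version: the repeated group is allowed to backtrack.
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')
# Version with the lookahead + backreference trick: no backtracking into the arguments.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

naive_result = naive.sub(r'\1\n', text)
atomic_result = atomic.sub(r'\1\n', text)

print(naive_result)   # newline wrongly inserted before the [ ... ] argument
print(atomic_result)  # identical to the input
```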

An atomic group is a non-capturing group that forbids backtracking into the subpattern it encloses. The notation is (?>subpattern). Unfortunately the re module doesn't have this feature (it was only added in Python 3.11), but you can emulate it with the trick (?=(subpattern))\1.
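To see the emulation in isolation, here is a minimal sketch on a toy subject string that is unrelated to LaTeX. The pattern a+ab only matches 'aaab' if a+ is allowed to give back one character; once a+ is wrapped in the lookahead and backreference trick, that is no longer possible.

```python
import re

subject = 'aaab'

# a+ first takes 'aaa', then gives back one 'a' so that 'ab' can match.
plain = re.compile(r'a+ab')
# a+ is captured inside the lookahead and consumed by \1; it cannot give anything back.
emulated = re.compile(r'(?=(a+))\1ab')

plain_match = plain.search(subject)
emulated_match = emulated.search(subject)

print(plain_match.group(0))
print(emulated_match)
```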

Note that you can use the regex module (which has this feature) instead of re:

import regex

p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')

or, with possessive quantifiers:

p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')
