java - Parse content-page using Regex?
I'm writing Java code that uses a regex to parse a contents page extracted from a PDF document.
In each string, the regex must match: a number (up to three digits), followed by a space (or many), followed by a word (or many; a word is a sequence of characters). And vice versa: word(s), then space(s), then digit(s) must also match in a string. Leading spaces have to be allowed for, and matching should be case insensitive.
The extracted contents page looks like this:
directors’ responsibilities 8
corporate governance 9
remuneration report 10
The numbering style is not consistent, and the number of spaces between the digits and the string varies, like:
01 contents
02 strategy and highlights
04 chairman’s statement
The regex I'm using matches any number of words, followed by any number of spaces and a number of no more than 3 digits:
(?i)([a-z\\s])*[0-9]{1,3}(?i)
It doesn't work quite right, and I can't tell what I'm doing wrong. I also wish there were a way to detect both numbering styles (page numbers to the left or to the right of the string) instead of repeating the regex and flipping the order.
Cheers
If you want to match phrases, you should include the punctuation you want to match in the regex. AFAIK there is no way in regex to say that a phrase comes "before or after", so you should flip one copy and append it with |. Something along the lines of:
[a-zA-Z'".,!\s]+\d{1,3}|\d{1,3}[a-zA-Z'".,!\s]+
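For what it's worth, here is a rough Java sketch of that alternation applied line by line. The class name ContentsLineParser, the parse method and the group names are made up for illustration. The curly apostrophe (\u2019) in the character class is my own addition, on the assumption that the extracted text uses it as in the sample lines. The title part is made lazy so that the spaces next to the page number do not end up in the captured title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post: parses one contents line
// in either layout ("title 8" or "08 title").
final class ContentsLineParser {
    // Title characters: letters, whitespace and some punctuation.
    // Assumption: \u2019 (curly apostrophe) is included because the sample lines use it.
    private static final String TITLE = "[a-zA-Z'\u2019\".,!\\s]+?";

    // Two alternatives joined with |. Java does not allow the same group name
    // twice in one pattern, so each alternative gets its own names.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(?:(?<titleFirst>" + TITLE + ")\\s+(?<pageLast>\\d{1,3})"
        + "|(?<pageFirst>\\d{1,3})\\s+(?<titleLast>" + TITLE + "))\\s*$");

    /** Returns {title, page}, or null when the line matches neither layout. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        if (m.group("titleFirst") != null) {
            return new String[] { m.group("titleFirst"), m.group("pageLast") };
        }
        return new String[] { m.group("titleLast"), m.group("pageFirst") };
    }

    public static void main(String[] args) {
        String[] samples = {
            "directors\u2019 responsibilities 8",
            "corporate governance 9",
            "01 contents",
            "  04   chairman\u2019s statement"
        };
        for (String s : samples) {
            String[] r = parse(s);
            System.out.println(r == null ? "no match: " + s : r[1] + " -> " + r[0]);
        }
    }
}
```

Checking which named group is non-null tells you which layout matched, so the calling code does not have to run two separate regexes.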
Also, you don't need two instances of (?i); the regex will apply case insensitivity until the end of the pattern, or until it encounters (?-i).
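A small sketch of how the inline flag behaves in java.util.regex, in case it helps. The class name InlineFlagDemo and the helper fullMatch are hypothetical, and the patterns are simplified versions of the one in the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows the scope of the inline (?i) flag.
final class InlineFlagDemo {
    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String line = "Corporate Governance 9";

        // A single leading (?i) covers everything that follows it.
        System.out.println(fullMatch("(?i)[a-z\\s]+[0-9]{1,3}", line)); // true

        // Without the flag, the uppercase letters are rejected.
        System.out.println(fullMatch("[a-z\\s]+[0-9]{1,3}", line)); // false

        // (?-i) switches case insensitivity off for the rest of the pattern.
        System.out.println(fullMatch("(?i)page(?-i) [a-z]+", "PAGE one")); // true
        System.out.println(fullMatch("(?i)page(?-i) [a-z]+", "PAGE ONE")); // false

        // The same effect as a leading (?i), using a compile flag instead.
        Pattern p = Pattern.compile("[a-z\\s]+[0-9]{1,3}", Pattern.CASE_INSENSITIVE);
        System.out.println(p.matcher(line).matches()); // true
    }
}
```

Passing Pattern.CASE_INSENSITIVE to Pattern.compile keeps the regex text itself free of flags, which some people find easier to read.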