java - Parse content-page using Regex? -

- January 15, 2014

I'm writing Java code that uses a regex to parse a contents page extracted from a PDF document.

The regex must match a number (up to three digits), followed by one or more spaces, followed by one or more words (a word being any sequence of characters). The reverse order (word(s), space(s), digit(s)) must also match. It should allow for leading spaces and be case insensitive.

The extracted contents page looks like this:

directors’ responsibilities 8

corporate governance 9

remuneration report 10

The numbering style is not consistent, and the number of spaces between the digits and the string can vary, like:

01 contents

02 strategy and highlights

04 chairman’s statement

The regex I'm using matches any number of words, followed by any number of spaces and a number of no more than 3 digits:

(?i)([a-z\\s])*[0-9]{1,3}(?i)

It doesn't work quite right, and I can't tell what I'm doing wrong. I also wish there were a way to detect both numbering styles (page numbers to the left or right of the string) instead of repeating the regex and flipping the order.

Cheers

If you want to match phrases, you should include the punctuation you want to match in the regex. AFAIK there is no way in a regex to say that a phrase comes "before or after", so you should flip one alternative and append it with |. Something along the lines of:

[a-zA-Z'".,!\s]+\d{1,3}|\d{1,3}[a-zA-Z'".,!\s]+

Also, you don't need two instances of (?i). The regex applies case insensitivity from the flag until the end of the pattern (or of the enclosing group), or until it encounters (?-i).
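For what it's worth, here is a minimal sketch of how the alternation could be wired up in Java so that one pattern handles both numbering styles. The class name TocLineParser, the Entry helper type and the named groups are made up for illustration. The title character class is an assumption too: it adds the typographic apostrophe (\u2019) because the sample lines contain it, and you may need to widen it for other punctuation.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TocLineParser {

    /** One parsed contents-page line (hypothetical helper type). */
    static final class Entry {
        final String title;
        final int page;

        Entry(String title, int page) {
            this.title = title;
            this.page = page;
        }
    }

    // Assumed set of title characters: letters, whitespace, a little punctuation,
    // plus the typographic apostrophe found in the sample lines.
    private static final String TITLE = "[a-z'\u2019\".,!\\s]+";

    // Two alternatives joined with |, each with its own named groups.
    private static final Pattern LINE = Pattern.compile(
            "\\s*(?:"
            + "(?<numLeft>\\d{1,3})\\s+(?<titleLeft>" + TITLE + ")"          // number first
            + "|"
            + "(?<titleRight>" + TITLE + "?)\\s+(?<numRight>\\d{1,3})\\s*"   // number last
            + ")",
            Pattern.CASE_INSENSITIVE); // one flag covers the whole pattern

    /** Returns the parsed entry, or empty if the line fits neither style. */
    static Optional<Entry> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) { // matches() requires the whole line to fit
            return Optional.empty();
        }
        boolean numberFirst = m.group("numLeft") != null;
        String number = numberFirst ? m.group("numLeft") : m.group("numRight");
        String title = (numberFirst ? m.group("titleLeft") : m.group("titleRight")).trim();
        if (title.isEmpty()) { // reject lines that are only whitespace and a number
            return Optional.empty();
        }
        return Optional.of(new Entry(title, Integer.parseInt(number)));
    }

    public static void main(String[] args) {
        String[] samples = {
            "directors\u2019 responsibilities 8",
            "  02 strategy and highlights",
            "remuneration report    10"
        };
        for (String s : samples) {
            parse(s).ifPresent(e -> System.out.println(e.page + " -> " + e.title));
        }
    }
}
```

Using matches() rather than find() means the whole line has to fit one of the two shapes, so a four-digit number or a stray digit inside the title makes the line fail instead of matching partially. Whether that strictness suits your PDF output is a judgment call.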
