8.5. Syntax Identifier

  • Identifiers specifies what to find

  • They are also called Character Classes

8.5.1. SetUp

>>> import re

8.5.2. Numeric

  • \d - digit

  • \D - anything but digit

>>> TEXT = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 1:37 pm'
>>> re.findall('[0-9]', TEXT)
['3', '7', '2', '0', '3', '5', '1', '3', '7']
>>> re.findall('\d', TEXT)
['3', '7', '2', '0', '3', '5', '1', '3', '7']
>>> re.findall('\D', TEXT)  
['M', 'a', 'r', 'k', ' ', 'W', 'a', 't', 'n', 'e', 'y', ' ', 'o', 'f',
 ' ', 'A', 'r', 'e', 's', ' ', ' ', 'l', 'a', 'n', 'd', 'e', 'd', ' ',
 'o', 'n', ' ', 'M', 'a', 'r', 's', ' ', 'o', 'n', ':', ' ', 'N', 'o',
 'v', ' ', 't', 'h', ',', ' ', ' ', 'a', 't', ' ', ':', ' ', 'p', 'm']

8.5.3. Whitespaces

  • \s - whitespace (space, tab, newline, non-breaking space)

  • \S - anything but whitespace

  • \n - newline

  • \r\n - windows newline

  • \r - carriage return

  • \b - backspace

  • \t - tab

  • \v - vertical space

  • \f - form feed

>>> TEXT = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 1:37 pm'
>>> re.findall('\s', TEXT)
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
>>> re.findall('\S', TEXT)  
['M', 'a', 'r', 'k', 'W', 'a', 't', 'n', 'e', 'y', 'o', 'f', 'A', 'r',
 'e', 's', '3', 'l', 'a', 'n', 'd', 'e', 'd', 'o', 'n', 'M', 'a', 'r',
 's', 'o', 'n', ':', 'N', 'o', 'v', '7', 't', 'h', ',', '2', '0', '3',
 '5', 'a', 't', '1', ':', '3', '7', 'p', 'm']
>>> re.findall('\n', TEXT)
[]
>>>
>>> re.findall('\r\n', TEXT)
[]
>>>
>>> re.findall('\r', TEXT)
[]

8.5.4. Anchors

  • Matches the empty string, but only at the beginning or end of a word

  • \b - word boundary

  • \B - anything but word boundary

Examples:

  • \babc\b - performs a "whole words only" search

  • \Babc\B - pattern is fully surrounded by word characters

>>> TEXT = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 1:37 pm'
>>> re.findall('[a-z][a-z]', TEXT)  
['ar', 'at', 'ne', 'of', 're', 'la', 'nd', 'ed', 'on', 'ar', 'on', 'ov', 'th', 'at', 'pm']
>>> re.findall(r'\b[a-z][a-z]\b', TEXT)
['of', 'on', 'on', 'at', 'pm']
>>> re.findall('\b[a-z][a-z]\b', TEXT)  # without raw-string
[]

8.5.5. String

  • \w - any unicode alphabet character (lower or upper, also with diacritics (i.e. ąćęłńóśżź...), numbers and underscores

  • \W - anything but any unicode alphabet character (i.e. whitespace, dots, comas, dashes)

  • lowercase letters including diacritics (i.e. ąćęłńóśżź...) and accents

  • uppercase letters including diacritics (i.e. ąćęłńóśżź...) and accents

  • digits

  • underscores _

Valid characters are the same as allowed in variable/modules names in Python:

>>> imie = 'Mark'
>>> IMIE = 'Mark'
>>> imię = 'Mark'
>>> imię1 = 'Mark'
>>> Imię_1 = 'Mark'
>>> TEXT = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 1:37 pm'
>>> re.findall('\w', TEXT)  
['M', 'a', 'r', 'k', 'W', 'a', 't', 'n', 'e', 'y', 'o', 'f', 'A', 'r',
 'e', 's', '3', 'l', 'a', 'n', 'd', 'e', 'd', 'o', 'n', 'M', 'a', 'r',
 's', 'o', 'n', 'N', 'o', 'v', '7', 't', 'h', '2', '0', '3', '5', 'a',
 't', '1', '3', '7', 'p', 'm']
>>> re.findall('\W', TEXT)
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ':', ' ', ' ', ',', ' ', ' ', ' ', ':', ' ']

Mind, that following code gives similar output to \w but it is not completely true. \w would extract also unicode characters while this [a-zA-Z0-9] will not.

>>> re.findall('[a-zA-Z0-9]', TEXT)  
['M', 'a', 'r', 'k', 'W', 'a', 't', 'n', 'e', 'y', 'o', 'f', 'A', 'r',
 'e', 's', '3', 'l', 'a', 'n', 'd', 'e', 'd', 'o', 'n', 'M', 'a', 'r',
 's', 'o', 'n', 'N', 'o', 'v', '7', 't', 'h', '2', '0', '3', '5', 'a',
 't', '1', '3', '7', 'p', 'm']

Example:

>>> text = 'cześć'
>>>
>>> re.findall('[a-z]', text)
['c', 'z', 'e']
>>>
>>> re.findall('\w', text)
['c', 'z', 'e', 'ś', 'ć']
>>>
>>> re.findall('\w', text, flags=re.ASCII)
['c', 'z', 'e']
>>>
>>> re.findall('\w', text, flags=re.UNICODE)
['c', 'z', 'e', 'ś', 'ć']

Flag re.UNICODE is set by default.

8.5.6. Use Case - 0x01

  • Phone

>>> phone = '+48 123 456 789'
>>> re.findall('\d', phone)
['4', '8', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>> phone = '+48 (12) 345 6789'
>>> re.findall('\d', phone)
['4', '8', '1', '2', '3', '4', '5', '6', '7', '8', '9']

8.5.7. Use Case - 0x02

  • Compare Phones

>>> PHONE1 = '+48 123 456 789'
>>> PHONE2 = '+48 (12) 345 6789'
>>>
>>> phone1 = re.findall('\d', PHONE1)
>>> phone2 = re.findall('\d', PHONE2)
>>>
>>> phone1 == phone2
True

8.5.8. Use Case - 0x03

  • EU VAT Tax ID

>>> number = '777-286-18-23'
>>> re.findall('\d', number)
['7', '7', '7', '2', '8', '6', '1', '8', '2', '3']
>>> number = '777-28-61-823'
>>> re.findall('\d', number)
['7', '7', '7', '2', '8', '6', '1', '8', '2', '3']
>>> number = '7772861823'
>>> re.findall('\d', number)
['7', '7', '7', '2', '8', '6', '1', '8', '2', '3']

8.5.9. Use Case - 0x04

  • Number and Spaces

>>> TEXT = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 1:37 pm'
>>> re.findall('[0-9]\s', TEXT)
['3 ', '5 ', '7 ']
>>> re.findall('\d\s', TEXT)
['3 ', '5 ', '7 ']