The email could in HTML or text format.If it's in HTML format then use libraries like bs4
, pyquery
etc.
If it's text then use regex to search the URL using the following regex
regex = ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Refer: http://www.ietf.org/rfc/rfc3986.txt
Use re module to search the string as
import reregex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"urls = re.findall( regex, text )print(urls)
Use pyquery module
from pyquery import pyQuery as pqq = pq( text )a_list = q( "a" )urls = [ a.attr[ 'href' ] for a in a_list ]print(urls)
EDIT:
Instead of using generic URL we can use specific URL, for example https?:\/\/www\.boomlings\.com\/database\/accounts\/activate\.php\?uid=.*&actcode=.*