
12.21.2011

Grep All Email Addresses from a Text File Using Regular Expressions

It took me a while to construct this regular expression, so I thought it might be useful to someone else.
grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" filename.txt
NOTE: The above command should be on one line.
(You can also use egrep instead of grep with the -E switch)
egrep -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" filename.txt
From grep man pages:
-E = Interpret PATTERN as an extended regular expression.
-o = Show only the part of a matching line that matches PATTERN.
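
To make the effect of -o concrete, here is a small sketch that runs the same pattern with and without it. The sample file and its contents are made up for the demo, and the variable names are only illustrative.

```shell
# Hypothetical sample input, created in a temp file just for this demo.
sample=$(mktemp)
printf '%s\n' \
  'Contact alice@example.com or bob@example.org today.' \
  'No address on this line.' > "$sample"

# Same pattern as in the post, single-quoted so the shell leaves it alone.
pattern='\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'

# Without -o, grep prints each whole line that contains a match.
whole_lines=$(grep -E "$pattern" "$sample")

# With -o, grep prints only the matched text, one match per line.
matches_only=$(grep -E -o "$pattern" "$sample")

printf 'without -o:\n%s\n' "$whole_lines"
printf 'with -o:\n%s\n' "$matches_only"

rm -f "$sample"
```

Without -o the first line is printed in full; with -o you get alice@example.com and bob@example.org on separate lines.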

More info:
GREP MAN PAGE: http://ss64.com/bash/grep.html
GREP REGEX TUTORIAL: http://www.regular-expressions.info/grep.html

35 comments:

  1. Thanks! I was almost there but not quite . .

  2. According to your regexp, some@....com will be a valid email address.
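
The point above is easy to reproduce. The sketch below feeds that exact string to the pattern from the post, then tries one possible tightening in which the domain must be a series of dot-separated labels. The stricter pattern is only an assumption about what "valid" should mean here, not a full address validator.

```shell
# Pattern from the post.
pattern='\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'

# The domain class includes ".", so a run of dots is accepted.
loose_match=$(printf '%s\n' 'some@....com' | grep -E -o "$pattern" || true)
printf 'post pattern matched: %s\n' "$loose_match"

# One possible tightening (an assumption, not the author's regex):
# each domain label needs at least one letter, digit, or hyphen,
# and the last label must be two or more letters.
strict='\b[a-zA-Z0-9._-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b'

# grep exits non-zero when nothing matches, hence the "|| true".
strict_match=$(printf '%s\n' 'some@....com' | grep -E -o "$strict" || true)
printf 'stricter pattern matched: "%s"\n' "$strict_match"

# The stricter pattern still accepts an ordinary address.
ok_match=$(printf '%s\n' 'some@mail.example.com' | grep -E -o "$strict" || true)
printf 'ordinary address: %s\n' "$ok_match"
```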

  3. Gentlemen, I appreciate you finding issues with my regex. Please post some solutions!

  4. grep -E -o "\b[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" filename.txt

  5. I love this command! thank you!

  6. Thanks. It works perfectly!

  7. Just what I was after - thanks!

  8. Can you help me make sense of \b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b? I don't understand it. Can you explain? Thanks.

    Replies
    1. This uses regular expressions. Here is how it breaks down:

      \b = Tells grep to match a word boundary
      [a-zA-Z0-9.-] = Tells grep to match any single character from a-z, the same letters capitalized, any digit from 0-9, a period, or a hyphen
      + = Tells grep to match the preceding item one or more times, which here means a run of upper case letters, lower case letters, digits, periods, or hyphens

      And so on... Here are some good resources:

      https://www.gnu.org/software/findutils/manual/html_node/find_html/egrep-regular-expression-syntax.html

      http://www.cs.columbia.edu/~tal/3261/fall07/handout/egrep_mini-tutorial.htm
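
The pieces described in this reply can be tried one at a time. This is a minimal sketch on made-up input strings, assuming a grep that supports \b with -E (GNU grep does).

```shell
text='mail joe.smith@example.com now'

# The bracket class plus "+" matches runs of letters, digits, dots, hyphens.
runs=$(printf '%s\n' "$text" | grep -E -o '[a-zA-Z0-9.-]+')
printf 'runs:\n%s\n' "$runs"
# mail / joe.smith / example.com / now, one per line

# Putting a literal "@" between two such runs narrows it to the address.
with_at=$(printf '%s\n' "$text" | grep -E -o '[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+')
printf 'with @: %s\n' "$with_at"

# \b requires a word boundary. "_" counts as a word character, so in
# "id_joe@..." there is no boundary just before "joe".
tricky='id_joe@example.com'
no_boundary=$(printf '%s\n' "$tricky" | grep -E -o '[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+' || true)
with_boundary=$(printf '%s\n' "$tricky" | grep -E -o '\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+' || true)
printf 'without \\b: "%s"\n' "$no_boundary"
printf 'with \\b:    "%s"\n' "$with_boundary"
```

Without \b the last test prints joe@example.com; with \b it prints nothing, because the match is not allowed to start in the middle of the word id_joe.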

  9. egrep -i '^[a-z0-9.-]+@[a-z0-9.-]+\.[a-z0-9.-]+$' filename.txt

    Replies
    1. @Khanbaba khan - That is an imperfect solution. It will match "joe@domain.com." (with a trailing period), which is not a valid email address.

    2. You're right that an address can't end with a period. This leads to a related issue, though: nearly anything@anything is a valid email address according to the full spec (RFC 822), including things like "{-a.-b@c=d$*!/?" (not to mention Unicode). If your application can afford to reject uncommon addresses, this isn't much of an issue; just force people to use a "real" address that ends in .blah and doesn't contain fancy symbols. But if you want to err on the side of caution, *@* is pretty much the only way to go. A separate regex or script can then be used for actual validation, for example checking the TLD against a whitelist (accepting only currently valid TLDs like com, net, gov, tv), though even that list changes yearly and numbers in the thousands.
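
The two-stage idea in this reply could look roughly like the sketch below: a permissive "something@something" pass first, then a separate filter on the TLD. The input lines and the four-entry whitelist are placeholders for illustration only; a real list would be far longer and would need regular updates.

```shell
# Hypothetical input, one token per line.
input=$(printf '%s\n' \
  'o.brien@example.com' \
  'admin@localhost' \
  'x+tag@example.tv' \
  'not-an-address')

# Stage 1: permissive extraction, anything@anything without
# whitespace or a second "@".
candidates=$(printf '%s\n' "$input" | grep -E -o '[^[:space:]@]+@[^[:space:]@]+')

# Stage 2: keep only candidates ending in a whitelisted TLD.
# The whitelist here is illustrative, not a complete list.
tld_filter='\.(com|net|gov|tv)$'
accepted=$(printf '%s\n' "$candidates" | grep -E -i "$tld_filter")

printf 'candidates:\n%s\n' "$candidates"
printf 'accepted:\n%s\n' "$accepted"
```

Stage 1 keeps the three lines that contain an @, and stage 2 then drops admin@localhost because it has no whitelisted TLD.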

  10. This comment has been removed by the author.

  11. Cool! But how do I put a comma "," between the email addresses? Like this: aaa@bbb.com,bbb@aaa.com,...

    Replies
    1. I came up with this in 5 seconds; there might be a cleaner way, though.

      for i in `grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" help`; do echo -n "$i,"; done

      Although you will have a comma at the end of the list.

    2. BTW, that should all be on one line.
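
One way to avoid the trailing comma mentioned in this thread is to let paste join the lines instead of looping. A minimal sketch, using a made-up sample file in place of the real input:

```shell
# Hypothetical sample input, created in a temp file just for this demo.
sample=$(mktemp)
printf '%s\n' \
  'aaa@bbb.com wrote to bbb@aaa.com' \
  'cc: ccc@ddd.org' > "$sample"

# grep -o prints one address per line; paste -s joins all lines into one,
# and -d, puts a comma between them (none is added at the end).
joined=$(grep -E -o '\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b' "$sample" | paste -s -d, -)

printf '%s\n' "$joined"
# aaa@bbb.com,bbb@aaa.com,ccc@ddd.org

rm -f "$sample"
```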

  12. Thank you. It was very useful.

  13. Oh, you just saved me an hour!

  14. Simpler with [:alnum:]. This allows "_" and "-" and checks that the domain and the top-level domain are each at least two characters long:
    egrep -o "[[:alnum:]_-]+@[[:alnum:]_-]{2,}\\.[[:alnum:]]{2,}"

  15. Sorry, correction:
    egrep -o "\b[[:alnum:]_-]+@[[:alnum:]_-]{2,}\.[[:alnum:]]{2,}\b"
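
For anyone weighing the [:alnum:] variant, this sketch runs it against three made-up addresses. It assumes GNU-style \b support. Because "." is not in either character class, dotted local parts and subdomains are only partly matched, which may or may not matter for your data.

```shell
alnum_pattern='\b[[:alnum:]_-]+@[[:alnum:]_-]{2,}\.[[:alnum:]]{2,}\b'

# Underscores in the local part are fine.
plain=$(printf '%s\n' 'first_last@example.com' | grep -E -o "$alnum_pattern" || true)

# A dot in the local part: the match can only start after the dot.
dotted=$(printf '%s\n' 'john.doe@example.com' | grep -E -o "$alnum_pattern" || true)

# A subdomain: the match stops after the second label.
subdomain=$(printf '%s\n' 'a@mail.example.com' | grep -E -o "$alnum_pattern" || true)

printf '%s\n' "$plain" "$dotted" "$subdomain"
# first_last@example.com
# doe@example.com
# a@mail.example
```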

  16. Thank you for this command :) It is quite useful for extracting emails from various files, not only txt but csv and similar...

  17. Plus (+) is valid in email addresses on most email systems: test+whichmailinglist@gmail.com will be delivered to test@gmail.com, but you can still see the +which... part even though it is ignored for delivery, so you know which company is spamming you.
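
To pick up plus-tagged addresses like the one above, "+" can be added to the local-part character class. A minimal sketch comparing the pattern from the post with that small change (the second pattern is a suggested tweak, not the author's original):

```shell
address='test+whichmailinglist@gmail.com'

# Pattern from the post: "+" is not in the class, so the match
# starts after it and the tag prefix is lost.
post_pattern='\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
without_plus=$(printf '%s\n' "$address" | grep -E -o "$post_pattern" || true)

# Suggested tweak: allow "+" (and "_") in the local part only.
plus_pattern='\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
with_plus=$(printf '%s\n' "$address" | grep -E -o "$plus_pattern" || true)

printf 'post pattern:  %s\n' "$without_plus"
printf 'with + added:  %s\n' "$with_plus"
# post pattern:  whichmailinglist@gmail.com
# with + added:  test+whichmailinglist@gmail.com
```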

  18. Late to the party by a good few years here, but this was very helpful! Thank you!
