Using Regular Expressions to Match Twitter Users and Hashtags

If you want to find Twitter usernames and hashtags in tweets and do something with them, such as turning them into links when you display them on your website, the most compact way to do so is with regular expressions. However, most of the articles I found on the web get the regexp wrong.

Usernames start with a “@”, while hashtags start with a “#”. Since usernames and hashtags contain only letters, numbers, or underscores, almost all of the examples on the web use a regexp like this:

@([A-Za-z0-9_]+)

There’s only one problem: if a tweet contains an email address, the pattern matches part of it. Run that regular expression on “Email me at spammy@mailinator.com” and you’ll match “mailinator” as a username even though it isn’t one. What you really need to do is make sure that nothing precedes the “@” or “#” except whitespace or the beginning of the string, which is what the leading (^|\s) or (\A|\s) group does in the examples below.
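To make the difference concrete, here is a minimal Python sketch comparing the two patterns on the sentence above. The names NAIVE and ANCHORED and the username example_user are made up for illustration.

```python
import re

# Naive pattern: matches "@" anywhere, including inside email addresses.
NAIVE = re.compile(r'@([A-Za-z0-9_]+)')

# Anchored pattern: the "@" must start the string or follow whitespace.
ANCHORED = re.compile(r'(?:\A|\s)@([A-Za-z0-9_]+)')

text = "Email me at spammy@mailinator.com or ping @example_user"

naive_hits = NAIVE.findall(text)
anchored_hits = ANCHORED.findall(text)

print(naive_hits)     # ['mailinator', 'example_user']
print(anchored_hits)  # ['example_user']
```

The only change is the non-capturing prefix group, which rejects any “@” that has a non-whitespace character in front of it.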

For completeness, here’s example code that adds links to both usernames and hashtags in several languages.

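JavaScript

<script type="text/javascript">
    // Adds linkify_tweet() to every string; returns the tweet with usernames and hashtags linked.
    String.prototype.linkify_tweet = function() {
        // Link @usernames that start the string or follow whitespace.
        var tweet = this.replace(/(^|\s)@(\w+)/g, '$1@<a href="http://www.twitter.com/$2">$2</a>');
        // Link #hashtags the same way.
        return tweet.replace(/(^|\s)#(\w+)/g, '$1#<a href="http://search.twitter.com/search?q=%23$2">$2</a>');
    };
</script>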

PHP

function linkify_tweet($tweet) {
    $tweet = preg_replace('/(^|\s)@(\w+)/',
        '\1@<a href="http://www.twitter.com/\2">\2</a>',
        $tweet);
    return preg_replace('/(^|\s)#(\w+)/',
        '\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>',
        $tweet);
}

Python

import re

def linkify_tweet(tweet):
    tweet = re.sub(r'(\A|\s)@(\w+)', r'\1@<a href="http://www.twitter.com/\2">\2</a>', tweet)
    return re.sub(r'(\A|\s)#(\w+)', r'\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>', tweet)

Perl

# The /g flag replaces every match, not just the first one.
$s =~ s{(\A|\s)\@(\w+)}{$1\@<a href="http://www.twitter.com/$2">$2</a>}g;
$s =~ s{(\A|\s)#(\w+)}{$1#<a href="http://search.twitter.com/search?q=%23$2">$2</a>}g;

(JavaScript approach taken from Simon Whatley)

12 Comments

  1. on April 6, 2009 at 6:44 pm

    What about other #punctuation/#symbols? (@sargent like this)

  2. on April 7, 2009 at 9:50 am

    I have no interest in such things!

  3. on April 7, 2009 at 9:14 pm

    Thanks, this is a very useful blog post. I am building hash tag support into my forum so that users can tag their posts with keyword information. However, I couldn’t figure out how to only detect hash characters at the start of a new line, or with a whitespace in front, and so pasted URLs were breaking!

  4. on April 8, 2009 at 8:29 am

    I’m glad you found this useful!

  5. on April 21, 2009 at 11:36 am

    Gee, I wonder why you’re wrangling this, Stephen … hehehehehehehe.

  6. Marcelino Dornas
    on May 6, 2009 at 4:16 pm

    C#.NET – Using Regular Expressions to Match Twitter Users

    string b = Regex.Replace(a, @"(\A|\s)@(\w+)", @"$1@<a href=""http://www.twitter.com/$2"">$2</a>");

  7. on November 16, 2009 at 4:15 am

    Note that this will destroy your HTML if you have something like this:

    <a href="http://www.example.com" title="Bla @test blubb">Don’t break!</a>

  8. on November 16, 2009 at 7:12 pm

    Ah, good point. Sadly there’s no good way around it using regexp. Parsing is probably the true and correct way to go.

  9. on January 13, 2010 at 5:47 pm

    Same method as above in ruby for those interested:

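    def linkify_tweet(tweet)
      tweet.gsub!(/(^|\s)#(\w+)/, '\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>')
      tweet.gsub!(/(^|\s)@(\w+)/, '\1@<a href="http://www.twitter.com/\2">\2</a>')
      tweet  # gsub! returns nil when nothing was replaced, so return the string explicitly
    end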

  10. Graphity
    on February 4, 2010 at 6:13 pm

    Question: Why don’t you use “/#([^ ]+)/” for hashtags, so you also capture non-Ascii tags?

  11. on February 9, 2010 at 7:22 pm

    Graphity: I tend to be cautious about using match-anything regexps, so deliberately limited it to letters, numbers, or an underscore. You certainly could use the match-anything regexp you posted if you wanted to be more liberal in what you match.

  12. Graphity
    on March 3, 2010 at 1:22 pm

    @Stephen: Ah, I understand. I just saw problems using localized hashtags, like, for me, German umlauts.
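Following up on comments 7 and 8: one way to avoid rewriting text inside attributes is to parse the markup and linkify only the text between tags. Below is a rough Python 3 sketch using the standard library’s html.parser. The names TweetLinkifier, linkify_text, and linkify_html are made up for this example, and it assumes the input is a small HTML fragment rather than a full page. As for comments 10 to 12, in Python 3 the \w class matches non-ASCII letters in str patterns by default, so hashtags with German umlauts are matched without loosening the regexp.

```python
import re
from html.parser import HTMLParser

# Same patterns as the article's Python version.
USER_RE = re.compile(r'(\A|\s)@(\w+)')
HASH_RE = re.compile(r'(\A|\s)#(\w+)')


def linkify_text(text):
    """Linkify usernames and hashtags in plain text (no markup)."""
    text = USER_RE.sub(r'\1@<a href="http://www.twitter.com/\2">\2</a>', text)
    return HASH_RE.sub(r'\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>', text)


class TweetLinkifier(HTMLParser):
    """Re-emit the markup, linkifying only text outside existing <a> elements."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        # Emit the raw tag text so attribute values are never touched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Text inside an existing link is left alone.
        self.out.append(data if self.anchor_depth else linkify_text(data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)


def linkify_html(html):
    parser = TweetLinkifier()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


sample = '<a href="http://www.example.com" title="Bla @test blubb">Don\'t break!</a> says @example_user'
print(linkify_html(sample))
```

This is only a sketch. Each run of text is linkified on its own, so a mention that directly follows a tag or an entity is treated as if it began the string, and end tags are re-emitted in normalized form rather than byte for byte.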

One Trackback

  1. By Twitter PHP Badge with Caching « Kien Tran on May 22, 2009 at 1:27 am

    [...] with email addresses inside of the tweet, and does not create an twitter link for them. Thanks to Live Grenades writer Stephen for the regular [...]
