Using Regular Expressions to Match Twitter Users and Hashtags

If you want to find Twitter usernames and hashtags in tweets and do something with them, like turn them into links when you’re displaying them on your website, the most compact way of doing so is through regular expressions. However, most of the articles I looked through on the web mess up the regexp.

Usernames start with a “@”, while hashtags start with a “#”. Since usernames and hashtags will only have letters, numbers, or underscores in them, almost all of the examples on the web use a regexp like this:


@([A-Za-z0-9_]+)

There’s only one problem: if a tweet contains an email address, the regexp will match part of it. Run that regular expression on “Email me at spammy@mailinator.com” and it will match “mailinator” as a username, which it isn’t. What you really need to do is make sure that nothing precedes the “@” or “#” except whitespace or the beginning of the string, for example by starting the pattern with (^|\s).
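To see the difference concretely, here is a small Python sketch. The variable names `naive` and `anchored` are mine, and the second sample tweet is made up for illustration.

```python
import re

# The pattern most examples use: matches "@name" anywhere, even inside an email.
naive = re.compile(r"@([A-Za-z0-9_]+)")

# The stricter pattern: "@" must follow whitespace or start the string.
anchored = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")

text = "Email me at spammy@mailinator.com"
print(naive.findall(text))     # ['mailinator']
print(anchored.findall(text))  # []

# Real mentions are still found.
print(anchored.findall("@alice thanks, cc @bob_42"))  # ['alice', 'bob_42']
```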

For completeness, here’s example code to add links to both usernames and hashtags in a bunch of different languages.

Javascript



PHP


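function linkify_tweet($tweet) {
    // Link @usernames. The profile URL format is an assumption; adjust as needed.
    $tweet = preg_replace('/(^|\s)@(\w+)/',
        '\1<a href="https://twitter.com/\2">@\2</a>',
        $tweet);
    // Link #hashtags. The hashtag URL format is likewise an assumption.
    return preg_replace('/(^|\s)#(\w+)/',
        '\1<a href="https://twitter.com/hashtag/\2">#\2</a>',
        $tweet);
}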

Python


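import re

def linkify_tweet(tweet):
    # The link URLs below are assumptions; adjust them as needed.
    tweet = re.sub(r'(\A|\s)@(\w+)', r'\1<a href="https://twitter.com/\2">@\2</a>', tweet)
    return re.sub(r'(\A|\s)#(\w+)', r'\1<a href="https://twitter.com/hashtag/\2">#\2</a>', tweet)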
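As a variation on the Python version, here is a minimal sketch that handles both symbols in a single pass. It uses a negative lookbehind, `(?<!\S)`, in place of the `(\A|\s)` group, so the leading whitespace is checked but not captured. The function name `linkify` and the two URL templates are my own assumptions, not anything Twitter prescribes.

```python
import re

# Hypothetical URL templates; change them to whatever your site should link to.
USER_URL = "https://twitter.com/{0}"
TAG_URL = "https://twitter.com/hashtag/{0}"

# (?<!\S) succeeds only at the start of the string or right after whitespace.
MENTION_OR_TAG = re.compile(r"(?<!\S)([@#])(\w+)")

def linkify(tweet):
    """Wrap @usernames and #hashtags in anchor tags in one pass."""
    def repl(match):
        sigil, name = match.group(1), match.group(2)
        url = (USER_URL if sigil == "@" else TAG_URL).format(name)
        return '<a href="{0}">{1}{2}</a>'.format(url, sigil, name)
    return MENTION_OR_TAG.sub(repl, tweet)

print(linkify("Email me at spammy@mailinator.com"))
print(linkify("@alice loves #regex"))
```

The email address comes back untouched, while “@alice” and “#regex” are both linked. Note that in Python 3, `\w` on a `str` pattern also matches non-ASCII letters such as umlauts; use `[A-Za-z0-9_]` instead if you want to stay strict.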

Perl


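# Link URLs are assumptions; adjust as needed.
# \@ keeps Perl from treating "@" as the start of an array; /g replaces every match.
$s =~ s{(\A|\s)\@(\w+)}{$1<a href="https://twitter.com/$2">\@$2</a>}g;
$s =~ s{(\A|\s)#(\w+)}{$1<a href="https://twitter.com/hashtag/$2">#$2</a>}g;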

(Javascript approach taken from Simon Whatley)

20 thoughts on “Using Regular Expressions to Match Twitter Users and Hashtags”

  1. Thanks, this is a very useful blog post. I am building hash tag support into my forum so that users can tag their posts with keyword information. However, I couldn’t figure out how to only detect hash characters at the start of a new line, or with a whitespace in front, and so pasted URLs were breaking!

  2. C#.NET – Using Regular Expressions to Match Twitter Users

    // The link URL is an assumption; adjust as needed.
    string b = Regex.Replace(a, @"(\A|\s)@(\w+)", @"$1<a href=""https://twitter.com/$2"">@$2</a>");

  3. Note that this will destroy your HTML if you have something like this:

    <a href="http://www.example.com" title="Bla @test blubb">Don’t break!</a>

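  4. Same method as above in Ruby for those interested:

    def linkify_tweet(tweet)
      # gsub rather than gsub!, which returns nil when nothing matches.
      # The link URLs are assumptions; adjust as needed.
      tweet.gsub(/(^|\s)#(\w+)/, '\1<a href="https://twitter.com/hashtag/\2">#\2</a>')
           .gsub(/(^|\s)@(\w+)/, '\1<a href="https://twitter.com/\2">@\2</a>')
    end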

  5. Question: Why don’t you use “/#([^ ]+)/” for hashtags, so you also capture non-ASCII tags?

  6. Graphity: I tend to be cautious about using match-anything regexps, so deliberately limited it to letters, numbers, or an underscore. You certainly could use the match-anything regexp you posted if you wanted to be more liberal in what you match.

  7. @Stephen: Ah, I understand. I just saw problems using localized hashtags, like, for me, German umlauts.

  8. Hello,
    I need to make a web page where I can show tweets from, say, two different categories. I found by searching that hashtags are a way to find tweets of different types, but I cannot find any help on how to use them in PHP, i.e. how to find hashtags using PHP. I will be thankful for any help.

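  9. In AS3, for those interested. Works great.

    public static function parseTweetUsersAndTagsToLinks( tweet:String ):String
    {
        // The link URLs are assumptions; adjust as needed.
        tweet = tweet.replace(/(^|\s)@(\w+)/g, "$1<a href='https://twitter.com/$2'>@$2</a>");
        return tweet.replace(/(^|\s)#(\w+)/g, "$1<a href='https://twitter.com/hashtag/$2'>#$2</a>");
    }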

  10. Any ideas on Java?
    I just stumbled upon your blog while looking for a solution to my problem, which is replacing links, usernames and hashtags in tweets. My solution is:
    testData.replaceAll("((?i)http:\\S*?\\s|(?i)http:\\S*?$|@\\S*?|@\\S*?$|\\#[:alnum:]*?|\\#[:alnum:]*?$)", "[replaced]")
    Problem is the hashtag gets removed, but not the groupName.
    Any ideas to solve that?

  11. If you do not want to allow anything before the “@” (in other words, the text has to start with “@”), then it should be this way:

    ^@([A-Za-z0-9_]+)
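
On the HTML problem raised in comment 3: one possible workaround is to split the markup on tags and run the regexp only on the text between them, so attribute values such as `title="Bla @test blubb"` are never touched. The Python sketch below is an assumption-laden illustration rather than a real HTML parser: the name `linkify_outside_tags` and the profile URL are mine, and it will misbehave if an attribute value contains a literal “>”.

```python
import re

# Capturing group so re.split keeps the tags in the result list.
TAG_SPLIT = re.compile(r"(<[^>]*>)")
MENTION = re.compile(r"(^|\s)@(\w+)")

# Hypothetical link template; adjust to taste.
MENTION_LINK = r'\1<a href="https://twitter.com/\2">@\2</a>'

def linkify_outside_tags(html):
    """Link @usernames in text nodes only, leaving tags and attributes alone."""
    parts = TAG_SPLIT.split(html)
    # With one capturing group, even indices hold text and odd indices hold tags.
    for i in range(0, len(parts), 2):
        parts[i] = MENTION.sub(MENTION_LINK, parts[i])
    return "".join(parts)

html = '<a href="http://www.example.com" title="Bla @test blubb">Don\'t break!</a> cc @bob'
print(linkify_outside_tags(html))
```

Here the existing anchor tag passes through unchanged and only the trailing “@bob” is linked. For anything beyond simple cases, a proper HTML parser would be the safer choice.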
