If you want to find Twitter usernames and hashtags in tweets and do something with them, like turn them into links when you’re displaying them on your website, the most compact way of doing so is through regular expressions. However, most of the articles I looked through on the web mess up the regexp.
Usernames start with a “@”, while hashtags start with a “#”. Since usernames and hashtags contain only letters, numbers, or underscores, almost all of the examples on the web use a regexp like this:
@([A-Za-z0-9_]+)
There’s only one problem: if you have an email address in a tweet, it’ll match on that. Run that regular expression on “Email me at spammy@mailinator.com” and you’ll match on “mailinator” as a username when it’s not. What you really need to do is make sure that there’s nothing in front of the “@” or “#” but whitespace or the beginning of the string.
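To make the difference concrete, here is a minimal Python sketch comparing the naive pattern with one that requires whitespace or the start of the string before the “@”. The names NAIVE and ANCHORED and the second sample tweet are made up for illustration; the email example is the one from above.

```python
import re

# The pattern most examples on the web use: matches "@" anywhere.
NAIVE = re.compile(r'@([A-Za-z0-9_]+)')

# Anchored variant: the "@" must follow the start of the string or whitespace.
ANCHORED = re.compile(r'(?:\A|\s)@([A-Za-z0-9_]+)')

email_tweet = "Email me at spammy@mailinator.com"
mention_tweet = "@alice thanks, cc @bob_42"  # hypothetical sample tweet

naive_on_email = NAIVE.findall(email_tweet)            # ['mailinator'] (false positive)
anchored_on_email = ANCHORED.findall(email_tweet)      # []
anchored_on_mention = ANCHORED.findall(mention_tweet)  # ['alice', 'bob_42']
```

The same anchoring works for hashtags by swapping the “@” for a “#”.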
For completeness, here’s example code to add links to both usernames and hashtags in a bunch of different languages.
JavaScript
<script type="text/javascript">
String.prototype.linkify_tweet = function() {
  var tweet = this.replace(/(^|\s)@(\w+)/g, '$1@<a href="http://www.twitter.com/$2">$2</a>');
  return tweet.replace(/(^|\s)#(\w+)/g, '$1#<a href="http://search.twitter.com/search?q=%23$2">$2</a>');
};
</script>
PHP
function linkify_tweet($tweet) {
  $tweet = preg_replace('/(^|\s)@(\w+)/',
                        '\1@<a href="http://www.twitter.com/\2">\2</a>',
                        $tweet);
  return preg_replace('/(^|\s)#(\w+)/',
                      '\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>',
                      $tweet);
}
Python
import re

def linkify_tweet(tweet):
    tweet = re.sub(r'(\A|\s)@(\w+)', r'\1@<a href="http://www.twitter.com/\2">\2</a>', tweet)
    return re.sub(r'(\A|\s)#(\w+)', r'\1#<a href="http://search.twitter.com/search?q=%23\2">\2</a>', tweet)
Perl
$s =~ s{(\A|\s)@(\w+)}{$1@<a href="http://www.twitter.com/$2">$2</a>}g;
$s =~ s{(\A|\s)#(\w+)}{$1#<a href="http://search.twitter.com/search?q=%23$2">$2</a>}g;
(JavaScript approach taken from Simon Whatley)
18 Comments
What about other #punctuation/#symbols? (@sargent like this)
I have no interest in such things!
Thanks, this is a very useful blog post. I am building hash tag support into my forum so that users can tag their posts with keyword information. However, I couldn’t figure out how to only detect hash characters at the start of a new line, or with a whitespace in front, and so pasted URLs were breaking!
I’m glad you found this useful!
Gee, I wonder why you’re wrangling this, Stephen … hehehehehehehe.
C#.NET – Using Regular Expressions to Match Twitter Users
string b = Regex.Replace(a, @"(\A|\s)@(\w+)", @"@$2");
Note that this will destroy your HTML if you have something like this:
<a href="http://www.example.com" title="Bla @test blubb">Don’t break!</a>
Ah, good point. Sadly there’s no good way around it using regexp. Parsing is probably the true and correct way to go.
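As a rough illustration of the parsing approach, here is a minimal Python 3 sketch built on the standard library's html.parser. It re-emits the markup as it goes and only applies the username regexp to text nodes that are not already inside a link. The names MentionLinker and linkify_html are made up for this example, and it makes simplifying assumptions: comments and doctypes are dropped, and each text chunk is matched on its own, so a mention directly after a tag or an entity counts as being at the start of the text.

```python
import re
from html.parser import HTMLParser

MENTION = re.compile(r'(\A|\s)@(\w+)')
MENTION_LINK = r'\1@<a href="http://www.twitter.com/\2">\2</a>'


class MentionLinker(HTMLParser):
    """Hypothetical helper: copies HTML through, linkifying @mentions in text only."""

    def __init__(self):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.anchor_depth = 0  # > 0 while inside an existing <a> element

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        # Emit the tag exactly as written, so attribute values are never rewritten.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Only plain text outside of links gets the regexp treatment.
        if self.anchor_depth == 0:
            data = MENTION.sub(MENTION_LINK, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)


def linkify_html(html):
    parser = MentionLinker()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Calling linkify_html on the example above followed by some plain text containing a mention should leave the title attribute alone and only link the mention in the plain text.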
Same method as above in Ruby for those interested:
def linkify_tweet(tweet)
  tweet.gsub!(/(^|\s)#(\w+)/, '\1#\2')
  tweet.gsub!(/(^|\s)@(\w+)/, '\1@\2')
  tweet  # gsub! returns nil when nothing was replaced, so return the string explicitly
end
Question: Why don’t you use “/#([^ ]+)/” for hashtags, so you also capture non-Ascii tags?
Graphity: I tend to be cautious about using match-anything regexps, so deliberately limited it to letters, numbers, or an underscore. You certainly could use the match-anything regexp you posted if you wanted to be more liberal in what you match.
@Stephen: Ah, I understand. I just saw problems using localized hashtags, like, for me, German umlauts.
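For what it's worth, here is a small Python 3 sketch of the trade-off discussed in this thread. It assumes Python 3, where \w in a str pattern matches Unicode word characters by default, so umlauts are captured without going all the way to a match-anything pattern. The sample strings and variable names are made up for illustration.

```python
import re

# \w is Unicode-aware for str patterns in Python 3.
HASHTAG = re.compile(r'(?:\A|\s)#(\w+)')

# Same pattern restricted to ASCII, roughly what [A-Za-z0-9_] gives you.
HASHTAG_ASCII = re.compile(r'(?:\A|\s)#(\w+)', re.ASCII)

# The match-anything variant suggested above.
HASHTAG_LOOSE = re.compile(r'#([^ ]+)')

german = "Grüße aus #München und #Köln_2010"

unicode_tags = HASHTAG.findall(german)      # ['München', 'Köln_2010']
ascii_tags = HASHTAG_ASCII.findall(german)  # ['M', 'K'] (cut off at the umlaut)
loose_tags = HASHTAG_LOOSE.findall("Love #regex, really")  # ['regex,'] (keeps the comma)
```

The loose pattern picks up trailing punctuation, which is the sort of thing the cautious version avoids.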
Hello,
I need to make a web page that shows tweets from, say, two different categories. From searching, I found that hashtags are a way to find tweets of different types, but I can’t find any help on how to use them in PHP, i.e. how to find hashtags using PHP. I would be thankful for any help.
In AS3 for those interested. Works great.
public static function parseTweetUsersAndTagsToLinks( tweet:String ):String
{
  tweet = tweet.replace(/(^|\s)@(\w+)/g, "$1@$2");
  return tweet.replace(/(^|\s)#(\w+)/g, "$1#$2");
}
Cool post.
Btw: as far as I know, an apostrophe ( ‘ ) can also be part of a hashtag.
any ideas on Java?
Just stumbled upon your blog looking for a solution to my problem – replace links, usernames and hashtags from tweets: my solution is
testData.replaceAll("((?i)http:\\S*?\\s|(?i)http:\\S*?$|@\\S*?|@\\S*?$|\\#[:alnum:]*?|\\#[:alnum:]*?$)", "[replaced]")
Problem is the hashtag gets removed, but not the groupName.
Any ideas to solve that?
Thank you very much – these regexps will help me enhance http://myretweetedtweets.appspot.com with auto-links for references and hashes!
If you do not want people to type anything before the “@”, in other words.. it has to start with “@”, then it should be this way:
^@([A-Za-z0-9_]+)