Scientists at Carnegie Mellon University, working with federal grant monies, have discovered that phishing e-mails are decidedly different from most other spam — so much so that the fraudulent messages can almost entirely be detected and filtered out.
CMU researchers state that their analysis catches 92.65 percent of phishing attempts. Only 0.12 percent of legitimate messages are miscategorized as fraudulent. This “false positive” percentage is tiny enough that the phishing filter could be added to traditional spam filters even by corporations that can’t allow any significant loss of important inbound mail.
These findings have a tremendous potential to reduce identity thefts that are initiated by e-mail. But neither CMU nor its government sponsors have issued any press releases about the study. You’re reading about it here first.
Summertime, and the Phishing is Easy
If you’re a frequent reader of my columns, you’ve probably heard a lot about phishing — bogus e-mails that appear to be from a bank or ISP. These messages lure users to a fake Web site that’s designed to collect usernames, passwords, credit-card numbers or other valuable information.
But many computer users are still falling for these scams. It’s difficult to get hard figures on how many billions of dollars are lost each year to phishing, but the number of attacks is soaring.
The latest Phishing Trends Report by the Anti-Phishing Working Group, a coalition of financial institutions and other businesses, says 11,976 new phishing Web sites were detected by the group in May 2006. That’s up from 3,326 such sites in the same month of 2005. Despite misconceptions that hackers in Russia are behind most attacks, 34 percent of phishing Web sites are based in the United States, with 15 percent in China and smaller numbers in other countries, APWG says.
Corporate spam filters are adequate to suppress some phishing e-mails, but not all. Now, the new Carnegie Mellon report shows effective ways to discern phishing messages that might otherwise slip through the net.
The study was conducted at CMU by Ph.D. candidate Ian Fette, associate professor Norman Sadeh, and faculty member Anthony Tomasic. It was funded by the U.S. Army Research Office and the National Science Foundation’s Cyber Trust Initiative, which is sponsoring a CMU research center called CyLab.
Tell-tale Warning Signs of Phishing Messages
Most spam messages don’t need to pretend that the Web sites they link to are respected brand names. People who wish to buy prescription drugs on the sly, for example, may not mind being directed to a site with an obscure name like Pills-Without-Prescriptions.com.
The essence of phishing, however, is that the Web site that’s linked to appears to be the legitimate home of a well-known company. It’s this central fact of deception, the CMU researchers say, that enables phishing e-mails to be detected. The study uses sophisticated statistical analysis to detect unusual e-mail traits, such as:
• Links to “fresh” domains. More than 12 percent of phishing e-mails contain a link to a domain name that was registered fewer than 60 days ago. Because fraudulent Web sites quickly disappear or are kicked off the Internet when discovered, the average phishing site stays online only 5 days, according to APWG.
• Links in dotted-decimal format. Many Web sites used for phishing are hosted on home PCs that have been infected by spyware and turned into “zombies.” These sites don’t have domain names assigned to them, so phishing e-mails must link to them using a raw IP address, such as 192.0.34.166. About 45 percent of phishing e-mails link to such a “dotted-decimal” address.
• Clickable domain name doesn’t match destination. It’s simple for the creator of an e-mail message to make the visible text of a link say “Citibank.com” or whatever. In reality, an end user who clicks the link is sent to some other domain that merely looks like Citibank’s. About 50 percent of phishing e-mails contain links in which the visible domain name and the destination don’t match.
• Atypical destination of “click here” links. To appear legitimate, several links in a phishing e-mail may point to actual privacy statements and customer-service forms at, for example, PayPal.com. The link that the phisher urges users to click, however, points to a different Web site entirely. About 18 percent of the time, phishing e-mails contain an atypical link such as this.
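Two of these traits, dotted-decimal links and mismatched link text, can be checked mechanically from a message’s HTML body. The Python sketch below is only an illustration of the idea, not the researchers’ code. The names LinkCollector, host_of, is_dotted_decimal, text_mismatch and link_flags are hypothetical, and the matching rule is my own assumption. It leaves out the fresh-domain test, which needs a registration-date lookup, and the “click here” test.

```python
import ipaddress
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> tag in an HTML body."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lowercase host of a URL, or "" if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_dotted_decimal(host):
    """True if the host is a raw IPv4 address such as 192.0.34.166."""
    try:
        ipaddress.IPv4Address(host)
        return True
    except ValueError:
        return False


def text_mismatch(href, text):
    """True if the visible text looks like a domain but the link goes elsewhere."""
    shown = text.lower().strip().rstrip(".")
    if " " in shown or "." not in shown:
        return False  # visible text is not a domain name, e.g. "click here"
    shown_host = host_of(shown if "//" in shown else "http://" + shown)
    real_host = host_of(href)
    same_site = real_host == shown_host or real_host.endswith("." + shown_host)
    return bool(shown_host) and not same_site


def link_flags(html_body):
    """Report which of the two link traits appear anywhere in the body."""
    parser = LinkCollector()
    parser.feed(html_body)
    return {
        "dotted_decimal": any(is_dotted_decimal(host_of(h)) for h, _ in parser.links),
        "text_mismatch": any(text_mismatch(h, t) for h, t in parser.links),
    }


sample = '<p><a href="http://192.0.34.166/login">Citibank.com</a></p>'
print(link_flags(sample))  # {'dotted_decimal': True, 'text_mismatch': True}
```

Flags like these are signals to weigh, not verdicts. Plenty of legitimate mail trips one of them, which is why the study combines its factors statistically rather than acting on any single one.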
In a telephone interview, researchers Fette and Tomasic acknowledged that their work was in its early stages. “We don’t actually have a decision tree that weights each of the factors,” said Fette. “We don’t have some program yet that people can download.”
The research also suffers from the fact that the dataset of tested messages is more than two years old. To determine whether destination domain names had been registered fewer than 60 days before the messages were sent, the researchers had to laboriously look up the registration dates. Running further experiments on live data would help to verify whether the algorithms that work on the tested dataset still work on today’s mail, the study’s authors say.
Don’t Try This on Your Own Mail, Please
Because no packaged software that implements the study’s findings is commercially available yet, you might be tempted to start simply deleting e-mails you receive, based solely on a few of the “tell-tale factors.” I strongly advise you against trying to invent your own rules in this way.
Many legitimate e-mail messages bear features that the study found to be suspicious. If you delete all messages that exhibit any of the four factors described above, for example, you’ll eliminate more than 2 percent of your legitimate inbound messages, according to figures in the study. No company can allow that much mail from customers and vendors to be lost.
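To put those percentages in concrete terms, here is a small worked calculation in Python. The daily volume of 10,000 legitimate messages is a made-up figure chosen only for illustration, and the function name lost_legitimate is hypothetical. The two rates are the ones quoted from the study: 0.12 percent for the researchers’ filter, and 2 percent as the lower bound for deleting on any of the four factors.

```python
def lost_legitimate(messages, false_positive_percent):
    """Number of legitimate messages wrongly flagged at a given false-positive rate."""
    return round(messages * false_positive_percent / 100)


DAILY_LEGITIMATE = 10_000  # hypothetical volume, not a figure from the study

print(lost_legitimate(DAILY_LEGITIMATE, 0.12))  # study's filter: 12 messages
print(lost_legitimate(DAILY_LEGITIMATE, 2))     # naive four-factor rule: at least 200
```

At that assumed volume, the homemade rule would discard at least 200 good messages a day, compared with about 12 for the study’s method.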
Instead, I urge you to wait for professional phishing-filter software to become available. The report’s authors explained to me that their algorithm, using 10 complex factors, establishes an n-dimensional space and computes a nonplanar boundary between phishing messages and legitimate e-mails. That’s not something you can reproduce with a few simple rules.
If you’re really impatient to eliminate phishing messages, your first line of defense is a brand-name spam filter, which will stop most unsolicited bulk e-mails. Then you can consider adding rules to look for “tell-tale signs” of phishing messages that slipped through. If you find anything suspicious using your own unsophisticated rules, write “[CAUTION]” into the Subject line rather than deleting what may be legitimate messages.
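As a rough sketch of the tag-don’t-delete approach, the Python snippet below prepends “[CAUTION]” to a message’s Subject line using the standard library’s email package. The function name tag_subject is hypothetical, and how a message gets judged suspicious in the first place is left to whatever rules you write.

```python
from email import message_from_string


def tag_subject(msg, tag="[CAUTION]"):
    """Prepend a warning tag to the Subject header, leaving the rest of the message intact."""
    subject = msg.get("Subject", "")
    if not subject.startswith(tag):
        del msg["Subject"]  # remove the old header; does nothing if it is absent
        msg["Subject"] = f"{tag} {subject}".strip()
    return msg


raw = "From: service@example.com\nSubject: Verify your account\n\nPlease click here.\n"
message = tag_subject(message_from_string(raw))
print(message["Subject"])  # [CAUTION] Verify your account
```

The message body and other headers pass through unchanged, so a legitimate e-mail that was flagged by mistake still reaches its reader.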
I asked why the university and its sponsors hadn’t publicized the report, which was completed in June. “This is still very early research,” Fette replied. The academics would like to find an executive of a large corporation who would authorize them to rerun their experiment on a live datastream. The researchers, they assure me, would protect the confidentiality of the messages that were scored in the test.
Conclusion
I hope one of my readers will take the researchers up on their challenge. The study’s authors can be reached at CMU’s Institute for Software Research International.
If you’d like more information, CMU has posted a short abstract of the researchers’ study. A 16-page report on the work is available as a PDF file.