More than any year before it, 2017 has seen big data analytics go mainstream in a wide range of industries. Although it’s been a driving force for nearly a decade in fields like engineering and medicine, big data now drives marketing campaigns, shapes customer relationships and guides many other business operations for an ever-growing list of organizations worldwide.
Of course, many companies are still struggling to realize the value of the data they already possess. One 2016 survey by the Harvard Business Review found that most industries are “nowhere close” to capturing the full potential of their data. This failure to capitalize on data is an increasingly serious problem in a competitive landscape where customers expect consistent, predictive, personalized interactions with the organizations that serve them.
Aside from analytics-related hurdles such as lack of organizational policy and planning, the data itself poses its own difficulties. “Big data is dirty data,” as the saying goes. It’s filled with information that’s out-of-date, incomplete or just plain missing. In order for your organization to be able to act on that dirty data, someone first has to clean it up — and figure out how to derive actionable insights from it.
Here are some pointers for approaching your organization’s data cleanup and aligning your approach to analytics around your overall goals.
When you run a quick scan on any large data set, you’ll likely notice groupings of unusual entries that jump out from the overall patterns. Sometimes these anomalies take the form of long lines or blocks of missing data points, while other variations may come from entries logged inconsistently with the rest of the data set, or even shifted out of alignment with the rows or columns where they belong.
In all these cases, groups of anomalies tell you something significant about the way the data was gathered and reported. They may indicate that a section of the data set is missing, or that some rows or columns need to be moved into a different alignment, or that a certain abbreviation means the same thing as another one and needs to be found-and-replaced. The more efficiently you can train your algorithms to recognize and fix these issues, the sooner you can move on to the actual analytics — and start doing something useful with that data.
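As a rough illustration of two of the fixes mentioned above, the following Python sketch finds blocks of consecutive missing values and maps equivalent abbreviations onto one code. The `CANONICAL` table, the function names and the sample values are all hypothetical; a real data set would need its own mapping and its own definition of "missing."

```python
from itertools import groupby

# Hypothetical table: abbreviations that mean the same thing map to one code.
CANONICAL = {
    "ca": "CA", "calif.": "CA", "california": "CA",
    "ny": "NY", "n.y.": "NY",
}

def normalize_code(value):
    """Return the canonical code for a known abbreviation, else the trimmed value."""
    if value is None:
        return None
    trimmed = value.strip()
    return CANONICAL.get(trimmed.lower(), trimmed)

def missing_runs(values, min_length=3):
    """Return (start_index, length) for each run of consecutive missing values."""
    runs = []
    index = 0
    for is_missing, group in groupby(values, key=lambda v: v is None or v == ""):
        length = len(list(group))
        if is_missing and length >= min_length:
            runs.append((index, length))
        index += length
    return runs

# Illustrative values only, not taken from any real data set.
states = ["CA", "Calif.", None, None, None, None, "n.y.", "", "NY"]
cleaned = [normalize_code(v) for v in states]
gaps = missing_runs(states)
```

A long run reported by `missing_runs` suggests that a whole section of the data set was never loaded, which is a different problem from an occasional blank field.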
A growing list of software and software-as-a-service (SaaS) companies offer expertise in pattern recognition and data cleanup. But a far more cost-effective solution is to learn to handle these tasks in-house and go on to apply that learning to other data sets.
Start by training your machine learning algorithms to group similar anomalies together. Use these similarities to look for correlations. And chain those correlations together into meaningful trends.
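The grouping step can be sketched in a few lines of Python. This is a simple rule-based stand-in rather than a trained machine learning model, and the `amount` and `source` fields, the anomaly labels and the sample records are assumptions made up for the example.

```python
from collections import Counter, defaultdict

def anomaly_kind(record):
    """Label a record with a coarse anomaly type, or None if it looks clean."""
    amount = record.get("amount")
    if amount is None:
        return "missing_amount"
    if not isinstance(amount, (int, float)):
        return "non_numeric_amount"
    if amount < 0:
        return "negative_amount"
    return None

def group_anomalies(records):
    """Group anomalous records by kind, then count the sources behind each kind."""
    by_kind = defaultdict(list)
    for record in records:
        kind = anomaly_kind(record)
        if kind is not None:
            by_kind[kind].append(record)
    return {kind: Counter(r["source"] for r in rows) for kind, rows in by_kind.items()}

# Illustrative records only.
records = [
    {"source": "store_a", "amount": 12.5},
    {"source": "store_b", "amount": None},
    {"source": "store_b", "amount": None},
    {"source": "store_c", "amount": "12,50"},
    {"source": "store_b", "amount": None},
    {"source": "store_a", "amount": -3.0},
]
summary = group_anomalies(records)
```

When one kind of anomaly clusters in a single source, as the missing amounts do here, that points to a systematic reporting problem worth investigating rather than random noise.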
As time-saving as anomaly detection algorithms can be, it’s equally important to use your human intuition to know when to use them and when to take a step back and ask if a given deviation might mean something unexpected. Just because your data looks dirty doesn’t always mean it is. Do a little digging of your own, and you may discover patterns that lead to original, actionable insights.
For example, one team of data analysts was combing through data from a luxury hotel chain, and discovered what appeared to be a large number of inaccurate entries: dozens of teenagers were reported as staying at high-end hotel properties in a wealthy country in the Middle East. But after some cross-checking, it became clear that these high-income guests were, in fact, exactly who they appeared to be. The analysts had stumbled upon a completely untapped customer demographic, a discovery that inspired a new marketing initiative for the hotel brand.
The moral, of course, is that outliers aren’t necessarily “dirty.” Another well-known example of this is Google’s vast repository of misspelled words and phrases typed into its suite of cloud software. Instead of discarding this mountain of seemingly worthless data, Google has held onto it for decades and has used it to create some of the most accurate spell-check algorithms on earth.
Before you discard any outliers, look for patterns in them and think about how those patterns might be useful in ways you haven’t considered before. You just might make a major breakthrough.
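One way to put this into practice is to set outliers aside instead of deleting them, then check whether they share an attribute. The sketch below uses the common interquartile-range rule as the outlier test; that choice, the `age` and `property` fields and the guest records are all assumptions for illustration and do not come from the hotel example above.

```python
from collections import Counter
from statistics import quantiles

def split_outliers(records, field, k=1.5):
    """Split records into (inliers, outliers) using the interquartile-range rule."""
    values = [r[field] for r in records]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    inliers = [r for r in records if low <= r[field] <= high]
    outliers = [r for r in records if not (low <= r[field] <= high)]
    return inliers, outliers

# Made-up guest records, purely illustrative.
ages = [34, 41, 38, 45, 52, 39, 47, 43, 36, 40, 44, 37, 42, 16, 17]
properties = ["north"] * 7 + ["south"] * 6 + ["east"] * 2
guests = [{"age": a, "property": p} for a, p in zip(ages, properties)]

inliers, outliers = split_outliers(guests, "age")

# Keep the outliers and look for something they have in common.
outlier_pattern = Counter(g["property"] for g in outliers)
```

If the flagged records concentrate in one property, region or channel, cross-check them against another source before concluding that they are errors.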
Anomaly detection and outlier identification are indispensable when dealing with second- and third-party data you’ve acquired from partners and vendors. But when it comes to your own first-party data, the most effective way to get a clean, actionable data set is to train your teams to enter data correctly and consistently in the first place.
One of the most impactful ways to enforce disciplined data entry is to standardize the fields and codes used in reporting. Breaking down departmental silos is a major aspect of becoming a data-driven business. Each silo often has its own data standards and formats. As your organization moves toward a more integrated data-sharing structure, emphasize the importance of using the same set of fields, codes and idioms for reporting the same types of information, no matter who’s reporting it.
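A shared standard is easier to enforce when it is checked at the point of entry. The following Python sketch maps departmental field names onto one canonical set and rejects codes outside an agreed list. The field aliases, the region codes and the `standardize` function are hypothetical examples, not an established schema.

```python
# Hypothetical departmental field names mapped to one canonical name.
FIELD_ALIASES = {
    "cust_id": "customer_id",
    "customerid": "customer_id",
    "customer_id": "customer_id",
    "rgn": "region",
    "region": "region",
}

# Hypothetical list of region codes the whole organization agrees to use.
REGION_CODES = {"EMEA", "AMER", "APAC"}

def standardize(record):
    """Rename fields to their canonical names and validate the region code."""
    clean = {}
    for name, value in record.items():
        key = name.strip().lower().replace(" ", "_")
        canonical = FIELD_ALIASES.get(key)
        if canonical is None:
            raise ValueError(f"unknown field: {name!r}")
        clean[canonical] = value
    region = str(clean.get("region", "")).strip().upper()
    if region not in REGION_CODES:
        raise ValueError(f"unknown region code: {region!r}")
    clean["region"] = region
    return clean

# Two departments reporting the same customer in different formats.
sales_row = standardize({"Cust ID": 42, "Rgn": " emea "})
marketing_row = standardize({"CustomerID": 42, "region": "EMEA"})
```

Both rows come out identical, so downstream reports can join them without a cleanup pass. Raising an error on unknown fields or codes is a deliberate choice here: it surfaces nonstandard entries immediately instead of letting them accumulate.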
As you can see, the only way to know for sure what your data is telling you is to treat each data set as unique and look for unexpected patterns. As helpful (and popular) as visuals and data dashboards are, they’re only as effective as the analytical capabilities of the people using them.
And despite the proliferation of companies providing data cleanup services, the fact remains that each organization needs to develop its own data-related goals, as well as its own roadmap for using big data to reach those targets.
Ilan Hertz is head of digital marketing at Sisense, the leader in simplifying business intelligence for complex data. He has close to a decade of experience in applying data-driven methodologies in senior marketing positions in the technology industry.
Datamation is the leading industry resource for B2B data professionals and technology buyers. Datamation's focus is on providing insight into the latest trends and innovation in AI, data security, big data, and more, along with in-depth product recommendations and comparisons. More than 1.7M users gain insight and guidance from Datamation every year.