Today’s data revolution isn’t just being driven by the growing abundance of data; it’s being fueled by fundamental technologies that change the way we gather, store, analyze, and transform data. Together, these drivers are enabling us to glean powerful insights from deep within data, and thereby unlock new knowledge, discover new connections, and make new predictions.
Throughout history, we have gathered data and used it to help advance society. But often, data was too scarce. Today, we are fortunate to be able to harness more data from the world around us — data infused with more meaning, gathered in more useful forms, and producing more deliberate results. Data has gone from a once-scarce resource to an increasingly abundant, vital, and renewable resource.
The plummeting cost of storage is enabling data-driven innovation. In 1980, a gigabyte of data storage was hard to come by, cost hundreds of thousands of dollars, and required a full-time person to manage. Today, a gigabyte of storage costs just pennies, is managed easily, and can be accessed anytime, anywhere. Since the 1980s, the price of storage has dropped by more than a factor of 10 million.
Data is only valuable when it is understandable; otherwise, it’s just a jumble of random observations. Making sense of the insights contained within data can only be achieved by combining human ingenuity with innovative software. Despite an increasingly autonomous world, it still takes personal curiosity, human skills, and intensive work to unlock answers from within data.
TRANSFORM & TRANSLATE
Powerful new software tools are equipping us with the ability to use data sets to make better decisions, based on facts and not “gut” or intuition. These new tools, including machine-learning systems and modeling and simulation technologies, are helping give data purpose by transforming it in ways that can help us extrapolate, visualize, refine, model, and predict.
The powers to gather, store, analyze, and transform data are converging to unlock new opportunities for better solutions.
Myths vs. Facts
There are a number of myths surrounding recent data innovations and the data economy. These include myths about:
- Personal information and data protection
- The economic impact of the data economy
- Data reliability
- 21st-century data innovations
- Global benefits of data innovation
- Governments’ role in data regulation
All data is personal data.
Some data may be personal information (e.g., data we generate on our mobile devices or that we create by using social networks). Most data, however, is not personal.
The vast amount of data being created every day includes information like satellite weather monitoring, jet engine performance, computer-generated stock market trades, and sensors unrelated to individuals. Even when data does pertain to an individual, it is often not accessed by another human and is likely de-identified, that is, stored and used without information that reveals the identity of the individual involved.
Developing countries aren’t yet ready to take advantage of data analysis.
The data revolution and the benefits it creates are a global phenomenon. Some of data’s most important benefits and biggest opportunities lie in the developing world where technology has often lagged. According to IDC, emerging markets in the digital universe will surpass mature markets by 2017 — growing from 36 percent to 62 percent of the expanding digital universe between 2012 and 2020. Likewise, a survey of NGOs in the developing world found that over 90 percent believed that data analytics would be the most important tool to deliver better insight for helping their end beneficiaries.
The opportunity that data innovation presents the world is virtually unparalleled. Innovative software tools already are revolutionizing our lives in amazing ways.
Understand the Language of Data
To make the most of data, it helps to understand the language. The following is a glossary of terms to aid digital discourse:
Once scarce, data is abundant today, thanks to a growing ability to gather meaningful forms of digital data in entirely new ways, combined with the plummeting cost of storing data and new ways to create value from it.
Anomaly detection is the identification of data items in a data set that do not match an expected pattern. Anomalies are also called outliers, exceptions, or contaminants in data and can often provide critical and useful information.
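As an illustration, one simple approach to anomaly detection is to flag values that sit unusually far from the average. The sketch below uses a z-score threshold on made-up sensor readings; the function name and sample data are invented for the example:

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` standard deviations (a simple z-score test)."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Seven ordinary readings and one outlier.
readings = [10, 11, 10, 12, 11, 10, 11, 95]
print(find_anomalies(readings))  # [95]
```

More sophisticated methods exist, but the idea is the same: define what "expected" looks like, then surface the items that do not match it.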
Bad data is data that is missing or incorrect. It can be as simple as an incorrect street address, but bad data costs Fortune 1000 companies billions of dollars every year.
Big data is an umbrella term that often refers to the process of applying computer analytics to massive quantities of data in order to discover new insights and improve decision-making. It often describes data sets that are so large in volume, so diverse in variety, and moving with such velocity that it is difficult to process using traditional data processing tools.
A brontobyte is an unofficial measurement term for an extraordinarily enormous amount of data. A brontobyte is generally considered to be the equivalent of 1,000 yottabytes and is represented by a 1 followed by 27 zeros.
BUSINESS INTELLIGENCE (BI)
Business intelligence refers to the set of technologies and applications that transform raw data into operational insights that can improve business performance and decision-making.
The cloud is a broad term that refers to any application, service, or data that is hosted remotely. In general, it is made possible by large groups of remote servers that are networked together to enable ubiquitous, on-demand network access to computing or storage resources.
Clustering analysis is the process of identifying data objects that are similar to each other and clustering them together in order to better understand the differences as well as the similarities between data.
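A minimal sketch of the idea, using a toy one-dimensional k-means loop (the function, data, and starting centers are all invented for the example): each point is assigned to its nearest center, and each center then moves to the mean of its cluster.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: assign points to the nearest center,
    then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

# Two obvious groups of readings, around 1 and around 8.
data = [1.0, 1.5, 0.5, 8.0, 8.5, 7.5]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # [1.0, 8.0]
```

Real clustering tools work in many dimensions and choose the number of clusters more carefully, but the principle of grouping by similarity is the same.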
Cognitive computing is the process of combining large amounts of information with machine learning techniques, pattern recognition technologies and sometimes natural language processing to mimic the way the human brain works. These systems are often able to learn and interact with people by combining information sources with context and insight.
Computer-generated data refers to data that is automatically generated by a computer without human intervention — like a computer log file, satellite telemetry data, or sensor data from an industrial machine.
Dark data consists of unstructured and untapped data that is being stored, has not been analyzed or processed, and is believed to be neglected or underutilized in some way.
Data is information in a raw and unorganized form that can be digitally manipulated to represent conditions, objects, or ideas. Common types of data include sales figures, marketing research results, readings from weather sensors, or a list of cities and their populations. We now generate an estimated 2.5 quintillion bytes of data each day.
Data aggregation is the act of gathering data from multiple sources for the purpose of providing a higher order analysis.
DATA AGGREGATION TOOLS
Data aggregation tools transform scattered data from multiple sources into a single new set of data.
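A small sketch of aggregation in practice, merging records from two hypothetical sources (an online store and an in-store system, both invented for the example) into one combined tally:

```python
from collections import defaultdict

# Hypothetical sales records pulled from two separate sources.
online = [("widgets", 120), ("gadgets", 80)]
in_store = [("widgets", 30), ("gizmos", 55)]

# Aggregate the scattered records into a single set of totals.
totals = defaultdict(int)
for product, units in online + in_store:
    totals[product] += units

print(dict(totals))  # {'widgets': 150, 'gadgets': 80, 'gizmos': 55}
```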
A data analyst is someone responsible for preparing, cleaning, and processing data.
Data analytics is the application of software as a way of transforming and modeling data in order to derive useful information, insights or meaning from data. It is often used to uncover hidden patterns or unknown correlations, and aid in decision-making.
DATA ARCHITECTURE AND DESIGN
Data architecture is generally performed in the planning phase of a new system to design and structure how data will be processed, stored, used, and accessed. By defining at the start how specific pieces of data relate to each other and move through the system, it is possible to design and control the flow of data to ensure it is protected throughout the system.
A data center is a physical facility that houses a large number of networked servers and data storage repositories typically used for remote storage and processing of large amounts of remotely accessible data. There are an estimated half a million data centers worldwide, many of which make up the cloud.
Data cleansing is the process of reviewing and revising raw data to find and delete duplicates, correct errors, add missing data, remove corrupt data, and provide more consistency.
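A minimal sketch of those steps on a handful of invented customer records: drop corrupt rows, fill a missing value, normalize formatting, and remove a duplicate.

```python
raw = [
    {"name": "Ada", "city": "london"},
    {"name": "Ada", "city": "london"},   # duplicate
    {"name": "Grace", "city": None},     # missing value
    {"name": "", "city": "New York"},    # corrupt record (no name)
]

cleaned, seen = [], set()
for record in raw:
    if not record["name"]:               # remove corrupt rows
        continue
    # Fill missing values and make formatting consistent.
    record = {"name": record["name"],
              "city": (record["city"] or "unknown").title()}
    key = tuple(record.items())
    if key not in seen:                  # delete duplicates
        seen.add(key)
        cleaned.append(record)

print(cleaned)
```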
Data mining is the process of using powerful computer algorithms to find patterns or knowledge from within large data sets.
Data quality is a metric used to define the value of data to the user. It refers to the reliability, efficiency, and worthiness of the data for decision making, planning, or operations.
Data science is a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering in order to extract meaningful insights that can solve complex problems.
A data scientist is someone who is able to combine human insights, mathematical know-how, and technological tools to make sense out of data, for example by developing and deploying computer algorithms.
Data security is the practice of protecting data from destruction, misuse or unauthorized access. Appropriate data security measures can help prevent data breaches, ensure data integrity, and protect privacy. It often involves a combined focus on people, processes, and technology.
A data set is a collection of related information, composed of separate elements, often in tabular form, that can be manipulated as a unit.
A data source is the primary location where data comes from, for example, from a database, spreadsheet, or a data stream.
Data virtualization is the process of retrieving and manipulating data from different sources without having to know the technical details of where the data is located or how it is formatted.
Data visualization involves creating visual representation of data in order to derive meaning or communicate information more effectively.
DATA-DIRECTED DECISION MAKING
Companies that use data-directed decision making gather, process, and analyze data to support crucial decisions. Research by Erik Brynjolfsson, an economist at the Sloan School of Management at the Massachusetts Institute of Technology, shows that companies that use data-directed decision making enjoy a 5 to 6 percent boost in productivity.
A database is a large, structured set of digital data organized so that the data within it can be rapidly searched, accessed, and updated.
De-identification of data is the process of stripping out information that links a person to a particular piece of information.
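One common technique, sketched below, is to replace direct identifiers with a one-way hash so records can still be linked to each other but not easily traced back to a person. (The function name, field names, and record are invented for the example; strictly speaking, hashing alone is pseudonymization, and full de-identification may require removing or generalizing more fields.)

```python
import hashlib

def de_identify(record, direct_identifiers=("name", "email")):
    """Replace direct identifiers with a short one-way hash."""
    out = {}
    for field, value in record.items():
        if field in direct_identifiers:
            out[field] = hashlib.sha256(value.encode()).hexdigest()[:12]
        else:
            out[field] = value
    return out

patient = {"name": "Jane Doe", "email": "jane@example.com", "age": 34}
print(de_identify(patient))  # name and email hashed, age kept
```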
Disruptive shifts are the big and fundamental changes in society and businesses, often enabled by transformative new technologies that set up a whole new context for how we work, live, play, and create value. Data innovation is often described as a technology that enables disruptive shifts.
An exabyte is an enormous unit of data storage—a 1 followed by 18 zeros. To put it in context, today we create one exabyte of new information on a daily basis.
Hadoop is an open source software framework that was built to enable the processing and storage of huge amounts of data across distributed file systems.
INTERNET OF THINGS
The Internet of Things describes a world where ordinary devices are made much smarter and connected to the Internet, extending the smart revolution from the palm of our hands to the world around us. Because everything that can be connected will be connected, some have more aptly described it as the Internet of Everything. By one estimate, we have only connected about 1 percent of the things in the world that can be connected. By 2020, an estimated 50 billion devices will be connected to the Internet.
A legacy system is any computer, application, or technology that is outdated or obsolete, but continues to be used because it performs a needed function adequately.
Machine learning is the use of algorithms to allow a computer to analyze data for the purpose of “learning” from experience the actions to take when a specific pattern or event occurs.
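As a minimal sketch of "learning from experience," the nearest-neighbor rule below classifies a new observation by the label of the most similar example seen before. The function, labels, and sensor readings are all invented for the example:

```python
def nearest_neighbor(examples, query):
    """Classify `query` with the label of the closest seen example
    (1-nearest-neighbor, using squared distance)."""
    best_label, best_dist = None, float("inf")
    for features, label in examples:
        dist = sum((a - b) ** 2 for a, b in zip(features, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical machine sensor readings labeled by past outcomes.
training = [((1.0, 1.0), "normal"),
            ((1.2, 0.9), "normal"),
            ((5.0, 5.2), "fault")]
print(nearest_neighbor(training, (4.8, 5.0)))  # fault
```

The more labeled examples the system accumulates, the better its answers tend to become, which is the essence of learning from data.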
Metadata is the data about data. It can include basic summary information about the data like the author of the data, the date it was created, the file-size, and date last modified.
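For example, every file on a computer carries metadata that the operating system maintains alongside the file's contents. The sketch below writes a throwaway file and reads its size and last-modified time back:

```python
import datetime
import os
import tempfile

# Create a throwaway file, then read its metadata from the file system.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    path = f.name

info = os.stat(path)
print("size in bytes:", info.st_size)        # 5
print("last modified:",
      datetime.datetime.fromtimestamp(info.st_mtime))
os.remove(path)
```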
An outlier is a piece of data that deviates significantly from the general average within a larger data set. It is numerically distant from the rest of the data, which indicates that something unusual is going on and generally warrants additional analysis. (See also Anomaly detection.)
Pattern recognition is the process of looking for and identifying patterns within data. It can be as simple as identifying a repeating set of sequences within a DNA sequence; it can mean finding a pattern in the way two data sets interact, to discover whether one event is connected to another; or, with the help of machine learning, it can involve more complex tasks like finding numerical characters in a picture.
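The simplest case mentioned above, spotting a repeating motif in a DNA sequence, can be sketched with a regular expression. The sequence and motif here are invented for the example:

```python
import re

# Look for a short motif repeated back-to-back in a DNA sequence.
sequence = "ATCGATCGATCGGGTACC"
match = re.search(r"(ATCG)\1+", sequence)  # 'ATCG' two or more times in a row
if match:
    print("repeat found:", match.group())  # ATCGATCGATCG
```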
A petabyte is an enormous measure of storage capacity that is represented by a 1 followed by 15 zeros, or a million gigabytes. A petabyte is roughly four times the amount of data contained in the Library of Congress.
Predictive analytics involves using software algorithms on one or more data sets to predict trends or future events. When data from the present can be compared to the past, it can often be used to help predict the future.
Predictive modeling is the process of developing a model that will most likely predict a trend, future behavior, or outcome — often by comparing events from today to events from the past.
Real-time data is data that is acted upon as it is created. It is often created, processed, stored, and analyzed within milliseconds. Real-time data can include everything from stock market prices to the speed of a wheel as used in a car’s anti-lock brake system.
A recommendation engine is a computer algorithm that makes recommendations, suggestions, or that can personalize something for you based upon a variety of data patterns often derived through machine learning techniques.
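A minimal sketch of one common pattern behind recommendations, co-occurrence ("customers who bought this also bought..."); the baskets and function name are invented for the example:

```python
from collections import Counter

# Hypothetical purchase histories: sets of items bought together.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
]

def recommend(item, baskets):
    """Rank products by how often they appear alongside `item`."""
    together = Counter()
    for basket in baskets:
        if item in basket:
            together.update(basket - {item})
    return [product for product, _ in together.most_common()]

print(recommend("bread", baskets))  # 'butter' ranks first
```

Production engines learn far subtler patterns, often via machine learning, but counting what goes together is the core intuition.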
Regression analysis is a statistical process for using data to estimate the relationship between two or more variables.
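For two variables, the most common form is an ordinary least-squares line fit. The sketch below implements the textbook formula directly; the function name and the advertising-versus-sales data are invented for the example:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Advertising spend vs. sales: a perfectly linear toy example.
slope, intercept = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```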
Risk analysis is the use of software data analytics tools to identify the likely risk of a project, action, or decision. New data tools can help identify possible risks up front, better model an array of scenarios to help reduce the risk facing organizations, and monitor systems to identify problems if things begin to head off course.
Root-cause analysis is a method of problem solving that is focused on looking at the relationship between cause and effect to identify the root cause of a fault or problem. The cause is a root cause if once it is removed from a sequence of events, it prevents the undesirable event from repeating.
Semi-structured data is not structured by a formal data model, like those used in databases, but provides other means of describing the data and hierarchies. Semi-structured data often uses tags or other data markers in what is sometimes known as a self-describing structure.
Small data is about harnessing even small amounts of data, like that contained in a customer survey, to achieve actionable results. It generally refers to data sizes small enough that a human could comprehend and analyze it.
Structured data is highly organized, generally into rows and columns, making it easy to search and manipulate.
A terabyte is a measure of data that is represented by a 1 followed by 12 zeros. Terabyte hard drives can now be commonly found in home and work computers, or accessed via the cloud. To put it in context, a terabyte can store about 300 hours of high-definition video.
Text analytics is the use of statistical, linguistic, and machine learning techniques on text-based data to derive meaning, extract concepts, or unlock insights. Text analytics is generally performed on natural language text like that contained in documents, transcripts, web postings, commentary, or forms. It can be useful for the summarization, discovery, or classification of content.
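A small sketch of the statistical end of that spectrum: tokenize free-form text, drop common stopwords, and count the remaining terms to surface what a passage is about. The stopword list, function name, and feedback text are invented for the example:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}

def top_terms(text, n=3):
    """Very simple text analytics: tokenize, drop stopwords, count."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

feedback = ("The battery life is great. Battery charges fast, "
            "and the screen is great.")
print(top_terms(feedback))  # 'battery' and 'great' lead the counts
```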
Transactional data is data that is derived from specific events like financial purchases, invoices, payments, and shipping data. It generally includes a timestamp and supports the daily operations of an organization.
Unstructured data has no pre-defined structure—for example, notes from a meeting. According to some estimates, unstructured information might account for 70 to 80 percent of all data in an organization.
Variety, one of the four Vs defining data innovation, represents the various kinds of data, often from different sources, that are combined and analyzed to produce insights. The variety of data types being processed in applications today can include textual databases, transaction data, streaming data, images, audio, and video.
Velocity, one of the four Vs defining data innovation, is the speed at which the data is created, stored, analyzed, and visualized. For example, large data warehouses may receive billions of rows of new information each day. Time-sensitive data must be used as it is streamed in order to maximize its value.
Veracity, one of the four Vs defining data innovation, is used to signify the accuracy, certainty, and precision of the data.
Volume, one of the four Vs defining data innovation, refers to the amount of data processed — ranging from megabytes to brontobytes.
A yottabyte is a very large measure of data storage that is represented by a 1 followed by 24 zeros. To put it in context, a yottabyte represents the amount of data stored on 250 trillion DVDs.