Data Treasure Hunters: Science Expanding to New Frontiers
Science and engineering are rapidly heading toward a major culture change—a change in how we think about data.
This change is already happening, and it will be dramatic and exciting! It will transform how most of us think about data and how we tackle science and engineering problems, and with it will come a flood of new discoveries and new technologies that were never before possible. What is this revolution? How did we get here? Where is it going, and how is signal processing involved?
The short answer is that we are entering an era of treasure hunting. But rather than digging through dirt like archaeologists searching for ancient artifacts, tomorrow's treasure hunters will dig through data.
We are experiencing an explosion in the amount of data available for scientists and engineers to do their jobs. The world around us is becoming increasingly “connected” as communication technologies have proliferated and the costs of digital data storage have plummeted. Seemingly mundane items and devices that never before had a bit or byte associated with them are now streaming a constant flow of data to data warehouses located in the cloud. Advances in medical devices are leading to miniaturized, non-intrusive medical sensors that will be integrated with communication technologies to report real-time glucose levels, monitor respiratory conditions, track immune responses, and allow for the analysis of a wide array of other data associated with the human condition. Meanwhile, scientific equipment is being aimed both out into space and deep into the Earth and across its ecosystems. Matching this explosion in data is a commensurate advance in computing: computing resources are now powerful enough to perform immense amounts of computation on all of this data.
This is the emergence of the new field of big data and data analytics. Data scientists are the new breed of treasure hunters. They will reach into their toolboxes of algorithms and dig into data looking for hidden correlations, hoping to find never-before-seen patterns that advance the frontier of knowledge and support the development of new products.
The frank truth, though, is that data science isn’t really new. Many technical fields were analyzing large amounts of data long before the term “big data” was ever coined. The signal processing community has been analyzing data since its inception. After all, what is the Fourier transform but a tool for finding periodic phenomena in data? Or take a quick survey of papers over the past twenty-five years (or more), and you will find signal processing involved in everything from analyzing geological data for oil exploration, to face recognition for domestic security, to searching genomic data for patterns that indicate the onset of cancer. Signal processing was fundamental to advances in multimedia processing and storage; and, much like data science, many of its disciplines are ultimately about finding correlations and dependencies in data in order to make effective decisions.
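To make that last point concrete, here is a minimal, illustrative sketch (not from the article; it assumes NumPy and uses made-up signal parameters) of the Fourier transform doing exactly this kind of pattern hunting, pulling a periodic component out of noisy data:

```python
import numpy as np

# Illustrative parameters only: a 5 Hz tone buried in noise,
# sampled at 100 Hz for 10 seconds.
fs = 100.0                          # sampling rate in Hz
t = np.arange(0, 10, 1 / fs)        # time axis
x = np.sin(2 * np.pi * 5 * t) + 0.8 * np.random.randn(t.size)

# The discrete Fourier transform exposes periodic structure hidden in the noise.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
peak_hz = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin

print(f"Dominant periodic component near {peak_hz:.1f} Hz")
```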
That is not to say that signal processing is the same as data science. Perhaps the most notable difference between the past era and the new era of data science and big data is the tearing down of boundaries around how data is produced and accessed. Big data is fundamentally heterogeneous, drawing on a vast collection of sources that report data of various modalities for analysis. Whereas the previous generation of scientific discovery involved scientists planning and conducting experiments to intentionally measure specific data for the purpose of discovery, big data often involves the opportunistic sharing of data from non-vetted sources, frequently provided in unstructured representations. Thus, the new era of big data and data analytics will likely lead to new engineered systems that use data from sources previously unknown to the engineer and application developer.
In short, the new era will have more data than you could ever dream of.
But, as we move forward in this new era, we should relish the opportunities it will provide while retaining an appropriate level of caution. The promise of analyzing large amounts of data to find a cure for cancer, of integrating infrastructure and vehicle sensor data to enable automated driving and more efficient transportation, or of mining the output of the broad collection of astronomical observatories to discover new stellar phenomena is certainly fantastic and truly important to society. We would not be able to make such advances or build such systems without the emergence of this new field of data science. However, we must be careful, as this explosion of data and data science could take on a life of its own. Whether you are a scientist, mathematician, engineer, or member of some other profession, you were likely raised with the “scientific method” drummed into you like a mantra. We have all grown up in an era of slow, methodical research and development. In fact, one way of looking at both the scientific method and the engineering design process is that they build in an implicit practice of quality control, one that borders on pessimism and overt caution.
Big data will often involve others unintentionally conducting experiments for the data scientist. The allure of hunting through more and more data to find patterns without vetting that data is dangerous. Data science will have some growing pains, especially as the vast amount of data being examined guarantees that data will be haphazardly analyzed and spurious correlations will be proclaimed as scientific truths. Data science will need quality control.
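To see why, consider a toy sketch (illustrative only; it assumes NumPy, and the dimensions are arbitrary) of how easily pure noise yields “patterns” when enough comparisons are made without any quality control:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 completely unrelated "measurements," each observed only 20 times --
# the kind of wide, short data set that opportunistic collection produces.
n_vars, n_obs = 200, 20
data = rng.standard_normal((n_vars, n_obs))

# Correlate every variable against every other and keep the strongest off-diagonal value.
corr = np.corrcoef(data)
np.fill_diagonal(corr, 0.0)
print(f"Strongest 'pattern' found in pure noise: r = {np.abs(corr).max():.2f}")
# With roughly 20,000 pairs tested, correlations above 0.7 routinely appear by chance alone.
```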
And this is where the signal processing community can advance big data and data science. Over the years, the signal processing community has carefully built up a sophisticated toolbox of algorithms designed to analyze data, along with a deep understanding of when and how to use those algorithms and how to make them work efficiently. Signal processors are a mixed breed: statisticians crossed with control theorists crossed with computer engineers who have, over the decades, folded performance assurance into their algorithms to ensure that video looks good after compression, that targets are accurately tracked, and that tumors are classified with low rates of false alarms and missed detections.
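As one small, hedged illustration of what such performance assurance can look like (a sketch assuming SciPy and an idealized Gaussian detection problem, not any particular system), a detector's threshold can be set directly from a target false-alarm probability rather than tuned by eye:

```python
from scipy import stats

# Illustrative values only: Gaussian noise statistic, and a design spec that
# false alarms occur no more than once per thousand decisions.
target_pfa = 1e-3
noise_sigma = 1.0

# Choose the threshold so that P(noise statistic > threshold) = target_pfa.
threshold = stats.norm.ppf(1 - target_pfa, scale=noise_sigma)

# The missed-detection rate then follows from the assumed signal level.
signal_level = 4.0
p_miss = stats.norm.cdf(threshold, loc=signal_level, scale=noise_sigma)
print(f"threshold = {threshold:.2f}, P(false alarm) = {target_pfa:g}, P(miss) = {p_miss:.3f}")
```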
Hence, as this new era of big data and data science unfolds, let us issue a challenge to scientists, engineers, and signal processors to establish new forms of collaboration. To the data scientists: reach out and ask a signal processor whether they know of any signal processing tools that might work on your data. To the signal processors: find the scientists and engineers who are producing the next wave of data, and offer your services. Now more than ever is the time for those engaged in signal processing to reach across the boundaries of technical fields and contribute their tools to the analysis of the vast amounts of data being generated everywhere. Signal processing has a fantastic record of success and, as we move to this new world of data treasure hunting, it can help ensure the success of data science: that the hidden correlations we find are truly golden treasures and not spurious pyrite counterfeits.