Data is virtually everywhere and, as a species, we have been gathering and utilizing it for our own benefit for centuries. This process of "data collection" has been the foundation of human development and survival, but we are now deep in the age of information where the large number of digital platforms and channels has made data collection more complex than ever before. In this guide, we will break down the purpose, use and application of data collection in the modern age and how organizations can benefit by staying up-to-date on this constantly evolving process.
A brief introduction
Technology has given us the ability to collect data from a seemingly endless supply. That process is not without its challenges. The sheer volume of information available makes collection more difficult. While you can, theoretically, collect every scrap of data, what would the purpose be? What are you using this data to achieve?
Those are the real and important questions when you embark on data collection. The truth is that you can collect a massive abundance of information, but not all of it would be useful for your purpose — most of it would be extraneous. There are many reasons for data collection, and each of these has its own specific use once the information is gathered.
You're likely familiar with data collection for the purposes of marketing. In this scenario, marketing professionals locate, collect and analyze key pieces of information about their customer base. This information might include past customer history, behavioral insights and overall market information. All of these pieces fit together to help the marketing team develop strategies to improve their performance.
As technology and theories on data collection expand, marketers develop more advanced tools to aid in data collection and analysis. This allows them to streamline campaigns and effectively maintain or improve their positioning in a competitive field.
While marketing often gets the highest recognition when data collection is discussed, it's not the only field that benefits from the collection of information. No technology could run without it. When you break down the basics of computing, no machine process is possible without data. As they say, "garbage in, garbage out." The ability of your program to work the way it's intended depends on the veracity of the data that you've input to begin with.
This ultimate guide will delve into modern data collection, focusing on the purpose of the collection process, which is the real key to pinpointing the type of data you need and where to find it. In addition, this guide will discuss methodologies in collection, as well as some uses with AI and the newest technologies.
Before you embark on the new frontier in data collection though, it helps to start with a primer on the history of data.
Data before the digital age
The word "data" usually refers to the information imparted through computers and machine learning. It's often equated with figures, statistics, or computer language and coding. The word itself goes back much further than our recent history. Etymology links the word "data" back to the mid-1600s, when it meant a given fact. It was most often used in reference to mathematical equations, which may help explain why it was so readily adopted within the computer science field.
Data, as we use it today, is often associated with a collection of statistics, demographics, or other pertinent bits of fact. It's all information.
We can gather this data using sophisticated, computerized collection methods. But we still had ways to collect information prior to the computer age. In fact, we can still use these rudimentary methods today. Sometimes the outcomes and process are better validated with a combination of both methods.
Data can be compiled through computer surveys and other online data fields. But it's just as present in conversations, lectures, text, email, written letters, and an almost endless variety of communications. Information can be culled from virtually all areas and used effectively to enhance any number of processes. Every creation in history starts with data.
The information you use to inform a project is integral to the success of that venture.
Collection methods that include a digital record are invaluable because there's little room for error in the data itself. Older methods, such as transcribing or reporting on information learned through conversation, could be susceptible to human errors that machine collection avoids. However, machine learning and AI do not yet offer the autonomy of human management. In many cases, the collection methods might be pristine, but they miss the nuances that human beings can gather through personal interaction. These aspects of data collection can be equally informative and useful.
We're not at the point where machines can replace humanity entirely. We may be able to automate collection to a great degree, but the analysis and use of the collected data are still ideally overseen by people.
The purpose of data collection
Before you embark on data collection, you need to know why you're collecting the information. This is an important variable. There is no success without planning and there is an infinite amount of data in the world. To try to collect every piece of data possible wouldn't be useful for any project. It would likely just confuse the process with the sheer breadth and depth of information.
The first place you need to start is with your purpose. What are you trying to achieve? Are you working on improving the features of an eLearning experience? Improving a game in development? Are you working to improve a translation data collection process in order to reach a larger segment of the population?
The possibilities are simply endless. Data collection could be used in every industry and each department of an enterprise. Before you start to work with data collection, you need to have a clear and concise goal for what that information will help you accomplish.
For example, you're in the midst of designing a new technology. You need data collection initiatives to help you program your solution. Any app or process needs data. The next step is to determine what type of data you need to achieve the ultimate goal of your design. This process may not be exact. Even with the greatest amount of research, you may find that you're missing pertinent information once you start the testing and troubleshooting phase. But having that purpose in mind will ultimately help you determine where to start your data collection process.
Making sure that the purpose of your project is front and center will help you determine which data is significant for your uses. It will also help you to eliminate the collection of data that won't be useful for the scope of the project. This aids in productivity and efficiency.
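One way to picture how a stated purpose narrows collection is a minimal sketch like the one below. It is only an illustration: the `CollectionPlan` class, the objective text, and every field name are hypothetical and are not taken from any real project.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionPlan:
    """A written objective plus the fields judged relevant to it (all hypothetical)."""
    objective: str
    needed_fields: set[str] = field(default_factory=set)

    def filter_record(self, record: dict) -> dict:
        # Keep only the fields the objective calls for; drop everything else.
        return {k: v for k, v in record.items() if k in self.needed_fields}


plan = CollectionPlan(
    objective="Improve speech recognition for Spanish-speaking users",
    needed_fields={"audio_clip_id", "language", "transcript"},
)

raw = {
    "audio_clip_id": "clip-001",
    "language": "es",
    "transcript": "hola",
    "shoe_size": 42,  # irrelevant to the objective, so it is dropped
}

kept = plan.filter_record(raw)
print(sorted(kept))
```

Writing the needed fields down next to the objective makes it easier to notice later, during testing, when a pertinent field is missing and the plan needs to be revised.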
Data collection will be necessary throughout the scope of your project, as well. In another example, you might be using data collection to develop a software product. Even after that product is in the marketplace, you'll need further data collection methods to test its usefulness and determine whether there are any possible weaknesses that need to be patched. The data collection for any project is ongoing for the length of time that the product or project is in use.
Different types of data
As discussed, data is everywhere. Today's definition most often paints it as information, which can be an endless supply of things. The type of data you're looking for will often determine the collection methods and where you are most likely to find and collect the best sampling.
There are two basic types of data:
- Quantitative. Quantitative data gets its name from the word quantity. This data is easily measurable and often comes in the form of a numeric representation. This can be statistics, numbers, or other pieces of information that can be easily measured.
- Qualitative. Qualitative data is more difficult to measure because it often needs human analysis. This is a description of quality and can include opinion or other factors that are not numerical or dependent on statistics. Qualitative data is important because it often bridges the machine function with the personalized needs of the end user for any process.
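The difference between the two types shows up as soon as you try to summarize them. The following is a small sketch with made-up survey responses; the ratings, comments, and the `human_codes` categories are all assumptions for illustration, and the categories are assumed to have been assigned by a human reviewer.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses: one numeric rating and one free-text comment each.
responses = [
    {"rating": 4, "comment": "Easy to use"},
    {"rating": 2, "comment": "Too slow on my phone"},
    {"rating": 5, "comment": "Easy to use"},
]

# Quantitative data can be measured and aggregated directly.
average_rating = mean(r["rating"] for r in responses)

# Qualitative data usually needs a person to assign categories first.
# These categories are assumed to come from a human reviewer, not from software.
human_codes = {
    "Easy to use": "usability-positive",
    "Too slow on my phone": "performance-negative",
}
theme_counts = Counter(human_codes[r["comment"]] for r in responses)

print(round(average_rating, 2), dict(theme_counts))
```

The numeric ratings can be averaged without any interpretation, while the comments only become countable after someone has decided what each one means.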
Data might be further segmented by where you've collected the information. First-party data is the term for information that you've collected yourself. This might include interviews, voice data collection, gesture data collection, survey collection, and first-party collection from website properties. This is also called primary data.
You may also collect second-party data or third-party data, which might include data purchased from vendors or larger public databases.
Data can also be defined as clean or naturalistic. Clean data is data without interruptions or noise. It's often data that's collected in a perfect environment or, as the name implies, data that's been enhanced or manipulated to remove the imperfections.
While clean data can be useful to come to a clearer understanding, it is also less valuable to a certain degree. Naturalistic data is thought to provide more in-depth information because it's collected from lifelike scenarios. Naturalistic data has not been edited or cleaned up in any way. With data, as with anything, manipulation can compromise the validity.
Data collection methodology
The methods you use for data collection will rely primarily on your purpose. As we've seen, data is everywhere, so it's important that you start with your objective.
The objective of your project should be well thought out and revised. This should be a written statement, and the purpose needs to be absolutely clear. For instance, if your objective is translation data collection to improve voice recognition in your chatbot, which languages do you need to include? Narrowing down the exact languages involved is obviously important. It's not a step you should miss.
Writing out your objective helps you to analyze exactly what data needs to be collected because you'll be able to easily see what your final goal is in the project.
Developing your objective statement will often help you clarify the goal. If the project is remotely complicated, a concise objective statement will help you develop the strategy for further research. There are particular aspects to any software, game, or project that you won't easily see from the initial idea. You start to find these specific issues once you reason through the realities of use, which tests the theoretical implications of the initial premise.
Once you have your objective statement written down and fairly specific, you'll move on to use cases. This phase of planning helps you to understand who will be using your software or offering. If you have an ideal market or specific region, that can help narrow down the type of data you'll need to collect in order to make the process work effectively.
During the use case phase, you need to reason through every possible aspect of your final project. Who will use it? What is their purpose? What is the ideal outcome?
These questions lead you to the ultimate answer of the type of data you'll need to collect in order to serve the project adequately. This methodology works no matter what your purpose for the data collection is. This might include positioning your new business in the marketplace, marketing endeavors, and building the best new technology to serve your company or your customer base.
In all of these scenarios, you'll be able to pinpoint the most useful data segment by crystallizing your ultimate goal for success.
Once you've reasoned through these initial planning stages, you need to prepare for the collection process. This includes finding the best places to collect data and gathering all the tools necessary for proper collection.
Data collection options are seemingly endless, and they just keep growing with new technologies. Your data collection does not need to be limited to survey answers or information collected through your web properties, though that information can also be useful.
Some newer options in data collection include:
- Visual Data. Visual data includes any information gained through video, images, or other visual means. This data might include people, animals, or the environment.
- Speech Data. There have been a lot of great advances with speech data. This includes better machine learning to detect dialect and specific patterns particular to geographic locations. Talk-to-text is notorious for hilarious typos, but with more concentrated data collection in your specific user base, your program can be relatively free from these types of glitches.
- Gesture Data Collection. Gesture data collection is a fairly new technology that is in the family of facial recognition. This is a promising technology but, in the case of facial recognition, it's not completely trusted by governments for its validity. There have been too many mistaken identity cases with facial recognition to use the technology in legal proceedings in some areas. Still, this technology is growing fast and can be used for some software and program development. For instance, facial recognition is often used on laptops and mobile devices. Gesture data collection is also widely used in gamification endeavors and medical research and innovation.
Jobs in data collection
Any data collection project will need a full team of experienced professionals. In some cases, it's tempting to take on all of the roles in the process, but it's often in your best interest to rely on a number of team members. First, there are more responsibilities and tasks present in any large scale data collection project than one person can manage.
Second, and more important, other professionals can help keep the process balanced. Often in data collection there are preconceived ideas of the final outcome. It's important to keep this out of the equation because the whole point of gathering data is to build and predict outcomes that are accurate, which often leads to some surprising discoveries. More team members often means a more scientific approach to the subject.
There are a number of disciplines involved in the field of data collection. Here are a few that you'll likely need on your team:
- Engineers. The engineer or developer creates the program and works with the database to collect the data that you need. They can write the programming and deal with any of the advanced architectural needs for the project. They can also troubleshoot any issues involved in the database or programming.
- Data Scientists. Your data scientists work with the data collected to analyze the results. Their ongoing work throughout the project can help you map changes to benefit the final offering and add to or limit collection options when necessary.
- Project Manager. The project manager will be the person tasked with maintaining communication throughout the team and making sure that the overall project runs smoothly. Ideally, the project manager is skilled at understanding each of the disciplines and fostering communication between team members who are not familiar with the terms or needs of other members of the project.
- Quality Assurance. Your quality assurance personnel test your device or product. Often this team works throughout the process because their input helps to find problems in coding and usability. QA will work the process at least until the product is ready for market but often after launch, as well.
You may also need other professionals to work on your team in various capacities. This might include respondents or subjects who you can use for development and market research. You might also use independent contractors to aid in various areas of your data collection. In some cases, you may work with outside vendors to collect data that you might not have access to otherwise.
Data collection and the role of AI
The title of this section is a slight misnomer. Artificial Intelligence does play an integral role in data collection today but that's only part of the story. AI is rapidly changing the way that data collection is done. Technology advances every year, so techniques will be changing rapidly and the role will continue to expand.
A few years ago, AI was not collecting data. Instead, data was collected to inform AI. For instance, programs such as Siri and other types of AI Assistants were built on collected data that allowed the program to serve the user. The function of the device was not to collect data from the user, but to offer that data back to the user at a set request. The technology has advanced significantly since then. Today's assistants do, in fact, collect data.
There are a number of ways that AI collects data to improve the user experience and ultimately the function of the device or program. Chatbots, for instance, serve the function of customer support with ready-made information that can answer the customer's simple questions. The same can be said for surveys and assistants. AI today also assists with scanned documents. Programs are able to use the information scanned in order to pull information forward, cross-check, and use the data in practice.
These artificial intelligence devices no longer work solely to impart data that was already programmed into their design. They also function to collect new information from the customer base or user. This user-generated data can then be used by the device and collected for further use in other areas of the business or industry.
What has become exceptionally exciting about the use of artificial intelligence in the field of data collection is that it has advanced to the point where machine learning is possible. The program itself learns from the input. When the user takes one action, it triggers the machine to reply without any human interaction necessary to run the process.
Machine learning and new technology
As the role of AI advances, machine learning becomes integral. The initial role of these programs was to be able to use data that was programmed into the database. For instance, a game had a finite choice of reactions based on the actions of the user. The user was only able to act in a set number of ways and the machine worked because it was programmed to respond exactly to those set scenarios.
Machine learning does more than leverage the already programmed data. It can collect data and respond based on new information. This is the direction that machine learning is currently taking. Theoretically, the machine learning process could eventually negate the need for human input or manipulation. Essentially, the machine reasons on its own.
We are not completely there yet and there will always be a need for humans to oversee the data collection and use methods to some degree. Currently, machine learning can be broken down into three types: supervised machine learning, unsupervised machine learning, and reinforcement machine learning.
- Supervised Machine Learning. This type of machine learning depends on the programmer to give the machine both the input and the expected output. You input the data and tell the machine how each piece of that data should be classified. Once training is complete, the machine has learned how that information is categorized and labeled, and it can apply the same categories to new input that resembles what it was trained on.
- Unsupervised Machine Learning. With unsupervised machine learning, the programmer inputs the data set but the machine creates the output. With this type of machine learning, the veracity and quality of the input is essential. Remember that the machine can only utilize information it's been given, so the quality of the output will be dependent on the quality of the data it has been given. The more data used as input, the more exact the computer's output can be. This programming works because the machine can automatically group the data so that it can essentially decipher the best action or reaction to a large number of user choices.
- Reinforcement Machine Learning. This type of machine learning works differently from both supervised and unsupervised learning. Rather than learning from a fixed set of labeled or unlabeled data, the machine learns by trial and error. It takes an action, receives feedback in the form of a reward or a penalty, and adjusts its behavior to earn more reward over time. The approach that combines a small amount of labeled data with a larger amount of unlabeled data is usually called semi-supervised learning.
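The contrast between the first two types above can be shown with a minimal sketch in plain Python. The numbers and labels are made up, and the two techniques used here (a nearest-centroid classifier and a one-dimensional two-group clustering) are simple stand-ins chosen for brevity, not the methods any particular product uses.

```python
from statistics import mean

# --- Supervised: every input comes with a label supplied by a person. ---
labeled = [(1.0, "short"), (1.5, "short"), (8.0, "long"), (9.0, "long")]


def train_centroids(examples):
    """Learn one average value (centroid) per label from labeled examples."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


centroids = train_centroids(labeled)
prediction = predict(centroids, 7.2)  # a value the model has not seen before

# --- Unsupervised: the same kind of input, but no labels are given. ---
unlabeled = [1.0, 1.5, 8.0, 9.0, 1.2, 8.5]


def two_means(values, rounds=10):
    """Split values into two groups by repeated re-averaging (1-D k-means, k=2).

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)  # initial group centers
    for _ in range(rounds):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)


clusters = two_means(unlabeled)
print(prediction, clusters)
```

In the supervised half, the names "short" and "long" come from the person who labeled the data. In the unsupervised half, the machine finds two groups on its own, but it has no names for them; a person still has to decide what each group means.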
Machine learning can use a number of different approaches but the overall design allows the machine to learn how to accomplish a task or master a skill that would previously have needed the human thought process. The machine actually becomes better at the task as more data is added to the database. Rather than relying only on the input data, the machine builds a more inclusive database as it keeps collecting information from use, as well as imported information.
This information is organized by a process called data annotation, also known as data labeling. This labeling system organizes the information so that it is stored and brought together with other information integral to the dataset. The dataset can be used to improve algorithms and form patterns for the machine to increase its knowledge base. Data labeling might be done by the programmer, so the data put into the machine is already labeled. This is the case with supervised machine learning and with some of the data in approaches that mix labeled and unlabeled input. The data in unsupervised machine learning is not labeled. The machine itself selects the category for these data sets in its process of devising the output.
In the machine learning process, data collection is essential. This includes the data that is originally provided by the creator of the machine, as well as any data collected by the machine through unsupervised and reinforcement learning.
Data validation and outcomes
Data validation is the process of testing the data that was collected. In a machine learning process, you won't be testing with all the data. A large portion of the data is used to train the program, and a smaller portion is held back to check how well it has learned.
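The idea that most of the data goes to training while the rest is set aside for testing can be sketched as a simple holdout split. This is a minimal illustration only: the `holdout_split` function, the 80/20 ratio, and the fixed seed are assumptions, not a recommendation for any specific project.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split off a held-out test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


records = list(range(100))  # stand-in for 100 collected examples
train, test = holdout_split(records)
print(len(train), len(test))
```

Shuffling before splitting matters when the data was collected in some order, such as by date or by region, because otherwise the test portion may not resemble the training portion.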
Any data sets need to be validated or checked for errors. Human errors can occur if there is manipulation. If your data consists of user-generated information, there may be a level of error simply because the users included information that was not true or not entirely accurate. When working with human-created data, which is a great deal of the information in the world, realize that there will be fundamental differences in perspective and communication involved. For instance, when data comes from survey information, people will not always answer questions honestly. In other cases they may answer honestly but be mistaken in their understanding or recollection. They often don't follow directions.
All of these issues can cause a lack of quality in the original data set.
When working with human input, such as through chatbots, survey information, and other user input, it's important to validate the questionnaire as well as the data.
When analyzing the collected data sets, you can pinpoint questions that are misleading or may leave too much room for interpretation to allow for a good quality of data collection. These content fields can be edited in order to improve the quality of the future data collection efforts to spur better machine learning and performance.
Data validation is necessary in order to arrive at the best possible or most accurate outcome. If the data has not been validated, the reports and analytics may yield worthless inferences, and programs based around the data sets can be wildly unsuccessful in practice. The machine will only be able to function as well as the data it has access to in the database.
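Validating both the answers and the questionnaire can be sketched as follows. Everything here is hypothetical: the question names, the rules in `RULES`, the sample responses, and the 50 percent threshold for flagging a question are assumptions chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical validation rules: each question maps to a check that returns
# True when an answer is acceptable.
RULES = {
    "age": lambda a: isinstance(a, int) and 0 < a < 120,
    "satisfaction": lambda a: a in {1, 2, 3, 4, 5},
    "country": lambda a: isinstance(a, str) and a.strip() != "",
}


def invalid_fields(response):
    """Return the questions whose answers are missing or fail their rule."""
    return [q for q, ok in RULES.items() if q not in response or not ok(response[q])]


responses = [
    {"age": 34, "satisfaction": 4, "country": "Thailand"},
    {"age": 250, "satisfaction": 5, "country": "Spain"},  # implausible age
    {"age": 41, "satisfaction": 9, "country": ""},        # off-scale rating, blank country
    {"age": -1, "satisfaction": 3, "country": "Peru"},    # implausible age
]

# Validate the data: keep only responses that pass every rule.
valid = [r for r in responses if not invalid_fields(r)]

# Validate the questionnaire: a question that fails often may be worded badly.
failure_counts = Counter(q for r in responses for q in invalid_fields(r))
suspect_questions = [q for q, n in failure_counts.items() if n / len(responses) >= 0.5]

print(len(valid), suspect_questions)
```

A question that appears in `suspect_questions` is a candidate for rewording before the next round of collection, since frequent invalid answers can point to a confusing prompt rather than careless respondents.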
Regulatory compliance and legal ramifications for data collection
Data often includes personal information, which can pose a problem for individual privacy. There are regulatory compliance issues that govern the collection of data in virtually every industry and for every use. It's important that these regulations are understood and followed during any process where you collect, store, or use data protected through these laws.
Virtually every country follows some guideline on data collection and use. These laws don't only discuss what can be collected, but how it is stored and how the private citizens or customers must be informed in the event that there is a data breach. Chances of data breach loom larger each year with new hacking and malicious attacks aimed at virtually every industry and business size.
Some prominent regulatory compliance guidelines:
- General Data Protection Regulation. The GDPR is the data protection regulation enacted by the European Union. Many businesses and collectors of data follow it because compliance is required if you collect personal data from individuals in the EU, but also because it is comprehensive and strengthens the protection of the data you store.
- HIPAA. These laws govern the use, collection, and storage of health records in the US. With technology adding more layers of sharing to personal information, the HIPAA regulations and requirements set out terms to aid in the sharing of information for better patient care. At the same time, they mandate protections and encryption directives for data collection and storage with regard to healthcare records.
- Individual State Laws. Individual states are enacting laws to govern data collection, which include New York's SHIELD Act and California's CCPA. For organizations located outside of these individual jurisdictions, the laws may still apply when the data collected belongs to residents of those states.
Other laws and regulations are in place to guard payment card data and countries outside of the US and EU have enacted their own comprehensive regulatory measures with regard to data.
Evaluating the process and using your data
Once you've determined the best method to compile data, validated your methods and the data, and enabled processes such as machine learning if it was applicable to your project, you can begin to test the quality of the outcome.
Quality assurance measures and go-live checks are integral to testing any program before launching to the public. You may also launch a beta testing phase, which allows users access to a program with the knowledge that it's in the beta stage. This gives you the ability to test the function in real use and work through any bugs, as well as to get feedback on the way the program functions and flaws in the design.
No matter the data collection goal, it's important to thoroughly evaluate the validity of the outcomes before using the design or data in practice.
For more information about data collection
If you're interested in data collection processes and services, feel free to get in touch with us here at Andovar. With years of experience in localization, we have a grounded understanding of the importance of research and the application of data in targeted markets. We're happy to answer your questions and help make sure that you have everything you need to know to realize your organization's vision.