research database definition

Research Tutorial

Introduction
Understanding Your Assignment
Developing Key Term from Your Topic
Creating Searches
Searching the Library Catalog
Finding Books on the Shelf
The Call Number System

What Are Databases?

Accessing Databases
Off Campus Access
Choosing a Database
The Mechanics of Searching
Finding Websites
Evaluating Websites
What Every Citation Needs
Citation Formats
Making the Databases Work for You
Resources for Citation Formats

In the context of libraries and doing research, the generally accepted definition of a database is "an electronic collection of searchable information on one or more topics."

Right. So we ask again, what are databases?

Image of gate with caption: The good stuff is behind here!

Think of databases as penned up corners of the Internet, sections that have been fenced off and locked, so that a regular Internet search (say, searching Google) won't find what's inside. And of course, it's behind this fence that most of the good information for doing academic research sits.

Why do they make this stuff hard to get to?

Unfortunately, like many things, it comes down to money. The companies who publish academic journals, magazines, and newspapers (all types of material you will find in these databases) need to make money. The people who digitize the content and create the search platforms need to make money. So they restrict access to the database by making them available through subscriptions.

So where does that leave you, the researcher?

With the library, of course!

One of the biggest roles that an academic library plays (and, not inconsequentially, where a large portion of the library's budget goes!) is in providing access to these databases. The library is your key through those locked gates. So rather than going to Google to try to find an academic journal article, come to the library's web page. We know the secret codes to get you in.

Image of key with caption: The library gets you in!

<< Previous: Databases: An Introduction
Next: Accessing Databases >>
Last Updated: Jun 28, 2024 11:13 AM
URL: https://libguides.hartford.edu/ResearchTutorial

A-Z Site Index | Contact Us | Directions | Hours | Get Help

Educational resources and simple solutions for your research journey

Academic Databases: A Guide for Researchers

Academic research databases are essential sources of information for researchers, academics, and scholars. These databases are academic search engines that help academics stay updated on the latest developments in their field. Scholarly databases support their own work with credible sources and contribute to the overall progress of knowledge and literature in different subjects. Let’s understand more about academic databases.

Table of Contents

What is an Academic Database?

An academic database is a comprehensive repository of information that includes various types of content, such as articles, images, market trend reports, and scientific papers. Its primary function is to help users find valuable journal articles by using relevant keywords or a specific topic name.

By doing so, researchers and students can access pertinent information for their essays or studies. Many times, these articles are accessible online, making academic databases a convenient and efficient tool for retrieving essential academic materials. [1]

Some popular academic research databases include Scopus for various research, JSTOR for humanities and social sciences, PubMed for medical research, and IEEE Xplore for engineering and technology.

Types of Academic Databases

Different kinds of academic databases are designed to meet specific academic needs or disciplines. Following are some key scholarly databases:

Open Access Databases: These scholarly databases offer unrestricted access to research articles and publications. Unlike some traditional journals that require payment for access, Open Access Databases allow anyone to read and download scholarly content without any cost. Examples include the Directory of Open Access Journals (DOAJ), Public Library of Science (PLOS), and BioMed Central (BMC). [2]
Full-Text Databases: Unlike databases that are limited in scope and only provide summaries, Full-Text Databases allow users to read and download the whole text of articles, papers, and other scholarly materials. Examples include JSTOR, IEEE Xplore, and ScienceDirect. [3]
Subject-specific databases: Subject-specific databases stay focused and are specifically designed to cater to scholars and researchers from a particular field of study. Examples of Subject-specific Databases include PubMed for medical research, PsycINFO for engineering and technology, and ERIC for Education. [4]
Multidisciplinary Databases: Unlike Subject-specific Databases, as the name suggests, Multidisciplinary Databases don’t limit themselves to a single discipline but cover a wide range of subjects. They go beyond specific fields and offer a diverse collection of scholarly content on various fields. Examples include ProQuest Central, Scopus, and Web of Science. [4]
Abstracting and Indexing databases: Abstracting and Indexing Databases limit themselves to summaries (abstracts) and details like authors and topics. They do not offer complete content like full-text articles. Examples include Web of Science (multidisciplinary), ProQuest (multidisciplinary), and INSPEC (physics, engineering). [4][5]
Citation Databases: The popularity of an academic work can be measured with Citation Databases. These academic databases show how often a scholarly article has been cited by other researchers, helping assess its impact and influence. Examples include Web of Science, Scopus, and Google Scholar. [6]

Benefits of Using Academic Databases

Using academic databases offers several benefits for researchers, students, and academics. [7]

Access to Reliable Information: Academic databases offer access to peer-reviewed articles, databases, and scholarly content, ensuring the reliability and credibility of the information.
Comprehensive and Up-to-date: These academic search engines include information on a wide range of subjects and have the latest and the most extensive collection of research articles, papers, and journals.
Efficient Research: Academic databases streamline the research process through targeted and specific searches using keywords, research topics, or authors. Compared with traditional methods of gathering information, academic research databases save time and money.
Citation and Referencing: It becomes relatively easy for scholars to find reliable academic sources and appropriately cite them, building on and adding to the credibility of their research work.
Interdisciplinary Exploration: Instead of a unidimensional approach to research, multidisciplinary databases allow the discovery of different perspectives and findings from various fields. This would be a challenge in the traditional or ‘offline’ way of research.
Research Collaboration: Research databases facilitate and encourage collaboration by providing a centralized platform for researchers to share their work and discover relevant studies.
Time and Cost Savings: Online academic databases eliminate the need for physical library visits, which in turn saves time and costs associated with obtaining printed research and academic materials.

R Discovery for Researchers

R Discovery, an AI-powered academic database, makes it easier for researchers to find relevant scholarly articles. Using advanced algorithms, it scans through its robust and growing academic database sourced from major databases like Microsoft Academic, PubMed, PubMed Central, and Crossref to recommend top articles in your field. The tool aims to save time and effort, providing access to over 250 million research articles, including more than 40 million open-access articles, with 14 million authors, 9 million topics, and 32,000 journals.

References:

Academic Databases – StudyHub
Open Access from Publishers and Databases – Library Technology Launchpad
Searching Online Databases – Tennessee State University
Types of Databases – Database Help – Murdoch University
Resources for Social Work – Liverpool John Moores University
Full-Text vs. Citation Databases – LIS-Educate.com
The Knowledge House For Research: Academic Databases Guide – Mind The Graph

R Discovery is a literature search and research reading platform that accelerates your research discovery journey by keeping you updated on the latest, most relevant scholarly content. With 250M+ research articles sourced from trusted aggregators like CrossRef, Unpaywall, PubMed, PubMed Central, Open Alex and top publishing houses like Springer Nature, JAMA, IOP, Taylor & Francis, NEJM, BMJ, Karger, SAGE, Emerald Publishing and more, R Discovery puts a world of research at your fingertips.

Try R Discovery Prime FREE for 1 week or upgrade at just US$72 a year to access premium features that let you listen to research on the go, read in your language, collaborate with peers, auto sync with reference managers, and much more. Choose a simpler, smarter way to find and read research – Download the app and start your free 7-day trial today !

What is Research Impact: Types and Tips for Academics

Research in Shorts: R Discovery’s New Feature Helps Academics Assess Relevant Papers in 2mins

University of Houston Libraries

What’s the difference between a research database and google, what’s the difference between a research database and google.

Brought to you by the University of Houston Libraries.

As part of your research, your instructor may sometimes require you to use articles or other resources from the library’s research databases . But what is a research database and why are they useful?

A research database is an organized, searchable collection of information that allows you to quickly search many resources simultaneously. Databases can be general, such as Academic Search Complete or ProQuest , or subject-specific, such as PsycInfo , which has resources related to psychology, or America, History and Life , which has resources related to history.

So what makes a research database different than other search engines, like Google? There are a few important distinctions to keep in mind when you’re using a research database instead of Google. First, the types of information you’re searching are usually different. Google searches for results across the internet, including websites, while research databases typically include scholarly journal articles, popular magazine articles and newspapers, books, and videos. The content of a research database is also reviewed and updated regularly.

Also, how you search is different. Google uses natural language searching, which allows you to search using complete sentences, such as “How many moons does Jupiter have?”. Google also searches the full text of resources, which usually means you get many results, but not all of them are relevant to your search query. Research databases use more precise, keyword searching, and most don’t automatically search the full text of a resource. Keywords are words or phrases that describe the topic you’re researching, and you’ll want to use them when searching databases to locate the most relevant resources on your topic.

Also, while Google offers some advanced searching options , most people don’t need to use them to find what they’re looking for. However, advanced search options in research databases, such as filtering by date, language, document format, and peer review status, can be effective in retrieving more relevant, precise results. Google also uses ads and tracks its users based on what they’re searching and clicking on, which the library doesn’t.

Both Google and research databases can be useful depending on your information need, and results from both need to be evaluated for accuracy and credibility. If you’re searching for scholarly research in mechanical engineering, a subject-specific engineering database would be a better place to search than Google. However, if you’re looking for websites of professional engineering organizations , Google is the better search option.

If you still have questions about research databases and how to use them, contact UH Libraries for help.

What is the Difference Between a Research Database and Google transcript

Finding Information

Types of Information Resources

Introduction

What are databases, further reading, learning objectives.

Find ebooks
Find Articles
Find Newspaper Articles
Find Images
Getting Copies of Articles and Books You Can't Find

This page is designed to help you:

Identify at least two differences between the type of information available through general search engines and academic databases
Understand what research databases are and why they are a valuable part of academic work

There are many types of databases that you can use for your research. The database you choose will depend on what type of information you want to find.

Research databases, such as JSTOR and Academic Search Premier, uncover the world of scholarly information. Most of the content in these databases is only available through the library. The complete list of databases is on the Databases A-Z list. The Library has purchased access to hundreds of databases on your behalf. There is no charge to use these resources.

Search Engines

Defining scope:	Indexes the web which provides way to find information on any topic
Scope of information:	Search engines, such as Google, make finding general information on pretty much any topic fairly easy. You may get millions of results for a search, with only the first 10 readily visible
Narrow your search:	Ability to focus a search on a type of consumer-oriented content, such as news, shopping, and images
Information strengths:	Information from organizations, including reports, white papers, and company information
Evaluating content:	Since anyone can share information online, you have to carefully check any information that you may want to use in your academic work.

Research Databases

Defining scope:	Highly organized information that allows you to find information with high relevance to search terms
Scope of information:	Collections of information that are organized by subject, theme, genre, language, and other factors
Narrow your search:	Robust tools allow you to narrow efficiently by dozens of categories
Information strengths:	Primary and secondary sources in an array of formats including journal articles, ebooks, historical documents, videos, music, images, data, and newspapers.
Evaluating content:	Verified, often peer-reviewed, high-quality content from carefully selected sources

Let us search for the same thing in Google and in a general academic database called Academic Search Premier.

Search for the impact of social media on teenagers

Results in Google

Screenshot of search in Google for "impact of social media on teenagers"

Notes about these results in Google:

81,500,000 results
Advertisements are the first two results
Highlighted article with images from a high school
Ability to quickly sort based on top Google categories: News, Images, Videos, Shopping

Search results in Academic Search Premier

Screenshot of search of Academic Search Premier database for "impact of social media on teenagers"

Notes about these results in the Academic Search Premier database:

3 are from academic journals
Able to quickly sort by scholarly qualification and publication date
Avdic, A., & Eklund, A. (2010). Searching reference databases: What students experience and what teachers believe that students experience. Journal of Librarianship and Information Science, 42 (4), 224–235. https://doi.org/10.1177/0961000610380119

This page was designed to help you:

<< Previous: Types of Information Resources
Next: Find ebooks >>
Last Updated: Aug 30, 2024 11:37 AM
URL: https://libguides.brown.edu/findinginformation

Library Intranet

What Is a Database?

A database is a systematic way of storing information so data can be accessed, analyzed, transformed, updated and moved with efficiency.

A database is simply a structured and systematic way of storing information to be accessed, analyzed, transformed, updated and moved (to other databases).

To begin understanding databases, consider an Excel notebook or Google sheet. Spreadsheets like these are a basic form of a table. Databases are almost exclusively organized in tables and those tables have rows and columns. So, think of a simple database as a collection of spreadsheets (or tables) joined together in a systematic way.

Database Definition

A database is a way for organizing information, so users can quickly navigate data, spot trends and perform other actions. Although databases may come in different formats, most are stored on computers for greater convenience.

Databases are stored on servers either on-premises at an organization’s office or off-premises at an organization’s data center (or even within their cloud infrastructure). Databases come in many formats in order to do different things with various types of data.

Related Reading From Built In Experts Python Databases 101: How to Choose a Database Library

Why Do We Use Databases?

Computerized databases were first introduced to the world in the 1960s and have since become the foundation for products, analysis, business processes and more. Many of the services you use online every day (banking, social media, shopping, email) are all built on top of databases.

Today, databases are used for many reasons.

Databases Hold Data Efficiently

We use databases because they are an extremely efficient way of holding vast amounts of data and information. Databases around the world store everything from your credit card transactions to every click you make within one of your social media accounts. Given there are nearly eight billion people on the planet, that’s a lot of data .

Databases Allow Smooth Transactions

Databases allow access to various services which, in turn, allow you to access your accounts and perform transactions all across the internet. For example, your bank’s login page will ping a database to figure out if you’ve entered the right password and username. Your favorite online shop pings your credit card’s database to pull down the funds needed for you to buy that item you’ve been eyeing.

Databases Update Information Quickly

Databases allow for easy information updates on a regular basis. Adding a video to your TikTok account, directly depositing your salary into your bank account or buying a plane ticket for your next vacation are all updates made to a database and displayed back to you almost instantaneously.

Databases Simplify Data Analysis

Databases make research and data analysis much easier because they are highly structured storage areas of data and information. This means businesses and organizations can easily analyze databases once they know how a database is structured. Common structures (e.g. table formats, cell structures like date or currency fields) and common database querying languages (e.g., SQL ) make database analysis easy and efficient.

What Is a Database Management System?

A database management system (DBMS) is a software package we use to create and manage databases. In other words, a DBMS makes it possible for users to actually interact with the database. In other words, the DBMS is the user interface (UI) that allows us to access, add, modify and delete content from the database. There are several types of database management systems, including relations, non-relational and hierarchical.

Evolution of Databases

Storing information is nothing new, but the rise of computers in the 1960s marked a shift toward more digital forms of databases. While working for GE, Charles Bachman created the Integrated Data Store, ushering in a new age of computerized databases. IBM soon followed suit with its Information Management System, a hierarchical database.

In the 1970s, IBM’s Edgar F. Codd released a paper touting the benefits of relational databases, leading to IBM and the University of California, Berkeley releasing their own models. Relational databases became popular in the following years, with more businesses developing models and using Structured Query Language (SQL). Even though object-oriented databases became an alternative in the 1980s, relational databases remained the gold standard.

The invention of the World Wide Web led to greater demand for databases in the 1990s. MySQL and NoSQL databases entered the scene, competing with the commercial databases developed by businesses. Object-oriented databases also began to replace relational databases in popularity.

During the 2000s and 2010s, organizations began to collect larger volumes of data, and many turned to the scalability offered by NoSQL databases. Distributed databases provided another way to organize this proliferating data , storing it away in multiple locations.

Types of Databases

There are many types of databases used today. Below are some of the more prominent ones.

1. Hierarchical Databases

Hierarchical databases were the earliest form of databases. You can think of these databases like a simplified family tree. There’s a singular parent object (like a table) that has child objects (or tables) under it. A parent can have one or many child objects but a child object only has one parent. The benefit of these databases are that they’re incredibly fast and efficient plus there’s a clear, threaded relationship from one object to another. The downside to hierarchical databases is that they’re very rigid and highly structured.

2. Relational Databases

Relational databases are perhaps the most popular type of database. Relational databases are set up to connect their objects (like tables) to each other with keys. For example, there might be one table with user information (name, username, date of birth, customer number) and another table with purchase information (customer number, item purchased, price paid). In this example, the key that creates a relationship between the tables is the customer number.

3. Non-Relational or NoSQL Databases

Non-relational databases were invented more recently than relational databases and hierarchical databases in response to the growing complexity of web applications. Non-relational databases are any database that doesn’t use a relational model. You might also see them referred to as NoSQL databases . Non-relational databases store data in different ways such as unstructured data, structured document format or as a graph. Relational databases are based on a rigid structure whereas non-relational databases are more flexible.

4. Cloud Databases

Cloud databases refer to information that’s accessible in a hybrid or cloud environment. All users need is an internet connection to reach their files and manipulate them like any other database. A convenience of cloud databases is that they don’t require extra hardware to create more storage space. Users can either build a cloud database themselves or pay for a service to get started.

5. Centralized Databases

Centralized databases are contained within a single computer or another physical system. Although users may access data through devices connected within a network, the database itself operates from one location. This approach may work best for larger companies or organizations that want to prioritize data security and efficiency.

6. Distributed Databases

Distributed databases run on more than one device. That can be as simple as operating several computers on the same site, or a network that connects to many devices. An advantage of this method is that if one computer goes down, the other computers and devices keep functioning.

7. Object-Oriented Databases

Object-oriented databases perceive data as objects and classes. Objects are specific data — like names and videos — while classes are groups of objects. Storing data as objects means users don’t have to distribute data across tables. This makes it easier to determine the relationships between variables and analyze the data.

8. Graph Databases

Graph databases highlight the relationships between various data points. While users may have to do extra work to determine trends in other types of databases, graph databases store relationships right next to the data itself. Users can then immediately see how various data points are connected to each other.

What Are the Components of a Database?

The components of a database vary slightly depending on whether the database is hierarchical, relational or non-relational. However, here’s a list of database components you might expect to be associated with any database.

The database schema is essentially the design of the database . A schema is developed at the early conceptual stages of building a database. It’s also a valuable source of ongoing information for those wanting to understand the database’s design.

Constraints and Rules

Databases use constraints to determine what types of tables can (and cannot) be stored and what types of data can live in the columns or rows of the database tables, for example. These constraints are important because they ensure data is structured, less corruptible by unsanctioned data structures and that the database is regulated so users know what to expect. These constraints are also the reason why databases are considered rigid.

Metadata is essentially the data about the data. Each database or object has metadata, which the database software reads in order to understand what’s in the database. You can think of metadata as the database schema design and constraints combined together so a machine knows what kind of database it is and what actions can (or can’t) be performed within the database.

Query Language

Each database can be queried. In this case, “queried” means people or services can access the database. That querying is done by way of a particular language or code snippet. The most common querying language is SQL (Structured Query Language) but there are also many other languages and even SQL variations like MySQL , Presto and Hive.

Each database is a collection of objects. There are a few different types of objects stored within databases such as tables, views, indexes, sequences and synonyms. The most well known of these are tables, like spreadsheets, that store data in rows and columns. You may also hear the term “object instance,” which is simply an instance or element of an object. For example, a table called “Transactions” in a database is an instance of the object-type table.

Database Advantages

The structured nature of databases offers a range of benefits for professional and casual users alike. Below are some of the more prominent advantages:

Improved data sharing and handling
Improved data storage capacity
Improved data integrity and data security
Reduced data inconsistency
Quick data access
Increased productivity
Improved data-driven decision making

Database Disadvantages

Although databases can be helpful for many, there are some limitations to consider before investing in a database:

High complexity
Required dedicated database management staff
Risk of database failure

Applications of Databases

When used correctly, databases can be a helpful tool for organizations in various industries looking to better arrange their information. Common use cases include:

Healthcare: storing massive amounts of patient data .
Logistics: monitoring and analyzing route information and delivery statuses.
Insurance: storing customer data like addresses, policy details and driver history.
Finance: handling account details, invoices, stock information and other assets.
E-commerce: compiling and arranging data on products and customer behavior.
Transportation: storing passengers’ names, scheduled flights and check-in status.
Manufacturing: keeping track of machinery status and production goals.
Marketing: collecting data on demographics, purchasing habits and website visits.
Education: tracking student grades, course schedules and more.
Human resources: organizing personnel info, benefits and tax information.

Future of Databases

As organizations handle increasing amounts of data, future databases must be able to keep up. Users will expect databases to be accessible across the globe and able to deal with limitless volumes of data. As a result, it’s likely that more companies will migrate their data to cloud environments. The percent of data stored in the cloud doubled between 2015 and 2022, and there’s reason to believe this percentage will only grow in the years to come.

With the increase in data has also come a spike in cybersecurity threats , so organizations can be expected to complement their cloud environments with reinforced security measures . Databases will become more easily accessible only for authorized personnel while companies adopt tools and best practices for keeping their data out of the wrong hands.

Frequently Asked Questions

What is the difference between a database and a spreadsheet.

Spreadsheets organize data into rows and columns, with each individual cell housing the actual data. Databases also employ rows and columns, but each cell contains a record of data gathered from an external table. As a result, databases provide more ways to arrange and structure information as opposed to spreadsheets.

What is the most commonly used database type?

The most commonly used database type is the relational database.

What is the definition of a database?

A database is highly organized information that is designed to be easily accessible and navigable for users. Most databases are stored on computers, making it possible to quickly analyze, transform and manipulate data in other ways.

Recent Big Data Articles

skip to Main Navigation
skip to Main Content
skip to Footer
Accessibility feedback
Data & Visualization and Research Support
Data Management

Defining Research Data

One definition of research data is: "the recorded factual material commonly accepted in the scientific community as necessary to validate research findings." ( OMB Circular 110 ).

Research data covers a broad range of types of information (see examples below), and digital data can be structured and stored in a variety of file formats.

Note that properly managing data (and records) does not necessarily equate to sharing or publishing that data.

Examples of Research Data

Some examples of research data:

Documents (text, Word), spreadsheets
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Audiotapes, videotapes
Photographs, films
Protein or genetic sequences
Test responses
Slides, artifacts, specimens, samples
Collection of digital objects acquired and generated during the process of research
Database contents (video, audio, text, images)
Models, algorithms, scripts
Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
Methodologies and workflows
Standard operating procedures and protocols

Exclusions from Sharing

In addition to the other records to manage (below), some kinds of data may not be sharable due to the nature of the records themselves, or to ethical and privacy concerns. As defined by the OMB , this refers to:

preliminary analyses,
drafts of scientific papers,
plans for future research,
peer reviews, or
communications with colleagues

Research data also do not include:

Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

Some types of data, particularly software, may require special license to share. In those cases, contact the Office of Technology Transfer to review considerations for software generated in your research.

Other Records to Manage

Although they might not be addressed in an NSF data management plan, the following research records may also be important to manage during and beyond the life of a project.

Correspondence (electronic mail and paper-based correspondence)
Project files
Grant applications
Ethics applications
Technical reports
Research reports
Signed consent forms

Adapted from Defining Research Data by the University of Oregon Libraries.

Opens in your default email client

URL copied to clipboard.

QR code for this page

Data Module #1: What is Research Data?

Defining research data.

Qualitative vs. Quantitative
Types of Research Data
Data and Statistics
Let's Review...

Data Module Quick Navigation

Data Modules Table of Contents

#1 - What is Research Data? #2 - Planning for Your Data Use #3 - Finding & Collecting Data #4 - Keeping Your Data Organized #5 - Intellectual Property & Ethics #6 - Storage, Backup, & Security #7 - Documentation

Library Resources

Module created by Aaron Albertson, Beth Hillemann, & Ron Joslin.

Many people think of data-driven research as something that primarily happens in the sciences. It is often thought of as involving a spreadsheet filled with numbers. Both of these beliefs are incorrect. Research data are collected and used in scholarship across all academic disciplines and, while it can consist of numbers in a spreadsheet, it also takes many different formats, including videos, images, artifacts, and diaries. Whether a psychologist collecting survey data to better understand human behavior, an artist using data to generate images and sounds, or an anthropologist using audio files to document observations about different cultures, scholarly research across all academic fields is increasingly data-driven.

In our Data Literacy Modules, we will demonstrate the ways in which research data are gathered and used across various academic disciplines by discussing it in a very broad sense. We define research data as: any information collected, stored, and processed to produce and validate original research results. Data might be used to prove or disprove a theory, bolster claims made in research, or to further the knowledge around a specific topic or problem.

Other Definitions of Research Data

There are many different definitions of research data available. Here are just a few examples of other definitions. We share these examples to illustrate there is not universal consensus on a definition, although many similarities are apparent.

U.S. Office of Management & Budget

“research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results”

University of Edinburgh

"...recorded factual material commonly accepted in the scientific community as necessary to validate research findings..."

National Endowment for the Humanities

"...materials generated or collected during the course of conducting research..."

Research Data Formats

Research data takes many different forms. Data may be intangible as in measured numerical values found in a spreadsheet or an object as in physical research materials such samples of rocks, plants, or insects. Here are some examples of the formats that data can take:

Next: Qualitative vs. Quantitative >>
Last Updated: Aug 28, 2024 11:05 AM
URL: https://libguides.macalester.edu/data1

Research 101: What are Databases?

Introduction
Comprehend Assignment
Source Types
The Information Cycle
What are Databases?
A-Z Databases This link opens in a new window
Navigating Databases
Finding Books
Video Tutorials
Scholarly vs Popular vs Trade This link opens in a new window
Using Website Resources
Saving Research

What are databases?

A database is a collection of organized and stored information designed for search and retrieval. Databases come in various forms and can be used for different applications.

Libraries typically subscribe to research databases. Research databases are electronic platforms that contain a collection of electronic information that is searchable and, in most cases, retrievable in full-text format. Specifically, they encompass articles from periodicals like academic journals, newspapers, magazines, and trade publications.

General vs. Specialized

In most cases, databases can be categorized in two ways: general or specialized.

General databases cover a wide range of academic disciplines by means of indexing many source types, including articles from academic/scholarly journals, newspapers, magazines, trade publications, and reports.

Specialized databases usually contain various types of information that deal with a specific field of study.

<< Previous: The Information Cycle
Next: A-Z Databases >>
Last Updated: Sep 5, 2024 2:02 PM
URL: https://libguides.columbiasouthern.edu/research101

Research Process: Research Databases

Selecting a Topic
Background Information
Narrowing the Topic
Library Terms
Generating Keywords
Boolean Operators
Search Engine Strategies
Google Searching
Basic Internet Terms
Research & The Web
Search Engines
Evaluating Books
Evaluating Articles
Evaluating Websites
Bibliographic Information
Off Campus Access
Periodical Locator

Research Databases

Research databases are organized collections of computerized information or data such as periodical articles, books, graphics and multimedia that can be searched to retrieve information. Databases can be general or subject oriented with bibliographic citations, abstracts, and or full text. The sources indexed may be written by scholars, professionals or generalists.

Research databases that are retrieved on the World Wide Web are generally non-fee based, lack in-depth indexing, and do not index proprietary resources. Subscription or commercial databases are more refined with various types of indexing features, searching capabilities, and help guides.

Prince George's Community College's Library provides commercial databases for its users as well as non-fee databases. These databases are available from the Library's Website. To review these databases, click on Research Databases .

Selecting Appropriate Online Databases

Your topic statement determines the type of database, kind of information, and the date of the sources that you will use. It is important to clarify whether your topic will require research from journals, magazines, newspapers, and books or just journals. To understand the differences between magazines, journals, and newspapers, see the Magazines, Journals, and Newspapers: What's the Difference section under Evaluating Sources.

Search Strategies

Before you begin to search the databases, it is important that you develop a well planned comprehensive search strategy. Determine what your keywords are and how you want them to link together. Always read the help screens and review any tutorials that have been developed for a particular database.

After you determine what your keywords are, consult any subject headings or guides to locate controlled vocabulary such as a thesaurus that may appear in the subject field. You will also want to decide what other fields may be valuable for your search.

Boolean searching is one of the basic and best search strategies that is used by most online databases.

For more help with search strategies see Search Strategies section.

<< Previous: Bibliographic Information
Next: Off Campus Access >>
Last Updated: Jun 26, 2024 2:47 PM
URL: https://pgcc.libguides.com/researchprocess

History & Society
Science & Tech
Biographies
Animals & Nature
Geography & Travel
Arts & Culture
Games & Quizzes
On This Day
One Good Fact
New Articles
Lifestyles & Social Issues
Philosophy & Religion
Politics, Law & Government
World History
Health & Medicine
Browse Biographies
Birds, Reptiles & Other Vertebrates
Bugs, Mollusks & Other Invertebrates
Environment
Fossils & Geologic Time
Entertainment & Pop Culture
Sports & Recreation
Visual Arts
Demystified
Image Galleries
Infographics
Top Questions
Britannica Kids
Saving Earth
Space Next 50
Student Center

Our editors will review what you’ve submitted and determine whether to revise the article.

Corporate Finance Institiute - Database
Engineering LibreTexts - Data and Databases
Saylor.org - A Brief History of Databases
database - Children's Encyclopedia (Ages 8-11)
database - Student Encyclopedia (Ages 11 and up)

database , any collection of data, or information , that is specially organized for rapid search and retrieval by a computer . Databases are structured to facilitate the storage, retrieval, modification, and deletion of data in conjunction with various data-processing operations. A database management system (DBMS) extracts information from the database in response to queries.

A brief treatment of databases follows. For full treatment, see computer science: Information systems and databases ; information processing .

computer chip. computer. Hand holding computer chip. Central processing unit (CPU). history and society, science and technology, microchip, microprocessor motherboard computer Circuit Board

A database is stored as a file or a set of files. The information in these files may be broken down into records , each of which consists of one or more fields. Fields are the basic units of data storage, and each field typically contains information pertaining to one aspect or attribute of the entity described by the database. Records are also organized into tables that include information about relationships between its various fields. Although database is applied loosely to any collection of information in computer files, a database in the strict sense provides cross-referencing capabilities. Using keywords and various sorting commands, users can rapidly search, rearrange, group, and select the fields in many records to retrieve or create reports on particular aggregates of data.

Database records and files must be organized to allow retrieval of the information. Queries are the main way users retrieve database information. The power of a DBMS comes from its ability to define new relationships from the basic ones given by the tables and to use them to get responses to queries. Typically, the user provides a string of characters, and the computer searches the database for a corresponding sequence and provides the source materials in which those characters appear; a user can request, for example, all records in which the contents of the field for a person’s last name is the word Smith .

The many users of a large database must be able to manipulate the information within it quickly at any given time. Moreover, large business and other organizations tend to build up many independent files containing related and even overlapping data, and their data-processing activities often require the linking of data from several files. Several different types of DBMS have been developed to support these requirements: flat, hierarchical, network, relational, and object-oriented.

Early systems were arranged sequentially (i.e., alphabetically, numerically, or chronologically); the development of direct-access storage devices made possible random access to data via indexes. In flat databases, records are organized according to a simple list of entities; many simple databases for personal computers are flat in structure. The records in hierarchical databases are organized in a treelike structure, with each level of records branching off into a set of smaller categories. Unlike hierarchical databases, which provide single links between sets of records at different levels, network databases create multiple linkages between sets by placing links, or pointers, to one set of records in another; the speed and versatility of network databases have led to their wide use within businesses and in e-commerce . Relational databases are used where associations between files or records cannot be expressed by links; a simple flat list becomes one row of a table, or “relation,” and multiple relations can be mathematically associated to yield desired information. Various iterations of SQL (Structured Query Language) are widely employed in DBMS for relational databases. Object-oriented databases store and manipulate more complex data structures, called “objects,” which are organized into hierarchical classes that may inherit properties from classes higher in the chain; this database structure is the most flexible and adaptable.

The information in many databases consists of natural-language texts of documents; number-oriented databases primarily contain information such as statistics, tables, financial data, and raw scientific and technical data. Small databases can be maintained on personal-computer systems and used by individuals at home. These and larger databases have become increasingly important in business life, in part because they are now commonly designed to be integrated with other office software, including spreadsheet programs.

Typical commercial database applications include airline reservations, production management functions, medical records in hospitals, and legal records of insurance companies. The largest databases are usually maintained by governmental agencies, business organizations, and universities. These databases may contain texts of such materials as abstracts, reports, legal statutes, wire services, newspapers and journals, encyclopaedias, and catalogs of various kinds. Reference databases contain bibliographies or indexes that serve as guides to the location of information in books, periodicals, and other published literature. Thousands of these publicly accessible databases now exist, covering topics ranging from law, medicine, and engineering to news and current events, games, classified advertisements, and instructional courses.

Increasingly, formerly separate databases are being combined electronically into larger collections known as data warehouses . Businesses and government agencies then employ “ data mining ” software to analyze multiple aspects of the data for various patterns. For example, a government agency might flag for human investigation a company or individual that purchased a suspicious quantity of certain equipment or materials, even though the purchases were spread around the country or through various subsidiaries.

BibGuru Blog

Be more productive in school

Citation Styles

What is an academic database?

When you do research in an academic setting, you’ll likely encounter academic databases like JSTOR, Scopus, PubMed, or ERIC. In this blog post, we define academic databases and explore what resources can be found in them.

Databases are online platforms that contain searchable resources such as journals, articles, ebooks, and data sets. Sources within databases are sometimes also published in print editions, but, if not, many of the resources that you find are printable in formats like .pdf.

Your school’s library contracts with vendors like EBSCO and Proquest to subscribe to databases that you can search to find sources . Databases are costly to maintain and, as a result, access to the sources in them is sometimes limited depending on your institution’s subscription.

Although academic databases are searchable, they differ from Google or Google scholar because the sources within them are properly indexed and vetted through peer-review. This means that academic databases allow for more precise keyword searches and that the sources they contain are credible.

What can you find in an academic database?

The most common types of resources that you’ll find in an academic database are journals and journal articles . Journals are periodicals that feature journal articles and that are published on a regular basis.

Also known as scholarly articles, journal articles are secondary, peer-reviewed sources that you can use as evidence in an academic paper. Additionally, databases may contain ebooks, newspaper or magazine articles, data sets, reports, and other source types.

Most common academic databases

Databases are often subject- or discipline-specific, but there are some that cover a range of topics, such as Academic Search Complete and Scopus. Some databases contain full-text sources, while others include abstracts for sources that you can acquire elsewhere.

Humanities databases

Databases like JSTOR, ProjectMUSE, and the MLA International Bibliography are the most common for research in humanities fields. If you’re a history or literature student, you will likely encounter these at some point in your studies.

Both JSTOR and ProjectMUSE offer full-text sources. The MLA International Bibliography indexes all publications in languages and literature. Although it doesn’t contain full-text sources, many libraries are able to provide links to full-text sources in other databases.

Social sciences databases

If you’re writing a paper on a social science topic, you might find sources in databases like Sociological Abstracts, PsycINFO, and Opposing Viewpoints.

Because social science fields are often interdisciplinary, you may also find relevant social science sources in other types of databases.

Business databases

Some of the most common business databases are Business Source Complete, ABI/INFORM, Mergent, IBISWorld, and Mintel. These offer different resources, depending on what kind of business research your are doing.

Some business databases are best for industry or company research, while others specialize in providing resources for consumer research. If you’re not sure what kind of business research you need to do for your assignment, you might consider contacting your school’s business librarian, if one is available.

STEM databases

STEM is a huge field, so you can expect to encounter a wide range of databases when you undertake research in the sciences, technology, engineering, or math. Some of the most common databases in these subjects include Web of Science, arXiv.org, and ScienceDirect.

Science databases will include a broader range of resources than those in other fields, including source types like data sets and code. If you’re not sure which science database would provide you with the best sources for your topic, ask a librarian or your instructor.

Health sciences and education databases

Like social science fields, both the health sciences and education are interdisciplinary. However, there are two top databases that you will certainly interact with if you’re doing research in either of these subjects: PubMed for health sciences and ERIC for education.

How to cite sources from an academic database

To create the most accurate citations for sources you find in a database, use BibGuru’s citation generator . You can create citations for more than 70 source types in all of the major citation styles.

The BibGuru browser extension for Chrome and Edge can also help you automatically generate citations for online database sources. Correct citations ultimately help you avoid plagiarism .

Frequently Asked Questions about what is an academic database

Databases are online platforms that contain searchable resources such as journals, articles, ebooks, and data sets. They are used to find sources for academic papers.

Some of the major academic databases include Academic Search Complete, Scopus, JSTOR, and PubMed.

The most common types of resources that you’ll find in an academic database are journals and journal articles . Additionally, databases may contain ebooks, newspaper or magazine articles, data sets, reports, and other source types.

You can find academic databases through your school library’s website. Most academic libraries organize their database list alphabetically or by subject.

Differences between websites and journal articles

Make your life easier with our productivity and writing resources.

For students and teachers.

Basic Research Process
Quick Links

What is a Database?

Why should i use databases, using databases, help using databases, frequently asked questions (databases).

Using "Libguides"
Using the Internet
Using the library's Search and Discover! box

Academic Search Premier: 8,500+ full-text periodicals, 7,300+ peer-reviewed journals

JSTOR: Archival database of books, primary sources, current issues of journals

Films On Demand: Streaming database of educational videos

Basic definition:

A collection of data arranged for ease and speed of search and retrieval.

In the library world, a database is a collection of articles, ebooks, videos, or other resources, which can be quickly searched by keyword, author, publication or other terms. Most materials in library databases are not available to the general public or standard Internet search engines. The library pays a substantial fee to make them available to Concordia library users.

Short Answer:

Databases typically make searching easier.

Longer Explanation:

Databases often are organized by topics or subjects or materials, have multiple filters which can limit or increase search results in various ways, and can provide access to resources which are otherwise not obtainable. As was mentioned in the definition, one of the primary functions of a database is speed of retrieval. This means databases are designed to provide quick access to materials which are actually useful to you, instead of just providing access to as many resources as possible.

Scholarly (Peer-Reviewed) Resources : You may see that an instructor requires scholarly or peer-reviewed resources. Databases are not the only way to get access to these types of resources, but like stated above, they're often the fastest or easiest.

Examples and Usage:

Databases work essentially the same as many internet search engines that you may be familiar with ( Google , for example). However, Google and most web search engines use your search terms and look for "anything out there" and then send you somewhere else to get your information. A database in a sense, will keep you in their system. These databases actually work similarly to how Amazon.com works. You can search for just about anything, add limitations to your search and terms, and find related searches, but all are within Amazon's setup. Amazon doesn't send you to Walmart or Target , they keep you in Amazon's "ecosystem." These databases aren't going to send you to some organization's website for the information you're looking for, like Google would, instead you stay in the Database's ecosystem. There are pros and cons to this, but in most cases that ecosystem is cut off from standard search engine searches, meaning only that database has access to it.

Databases usually focus their subject matter and curate their collections, or we have subscriptions to very specific collections in those databases. This means that there may be databases that work really well for some subject matter, but have almost nothing in other areas. For example, the Quick Links on this page represent some of our most popular databases, but each has a different focus and will provide you with very different materials.

Academic Search Premier: This is our standard database which holds many academic and popular resources and is our most used database. It covers a wide range of information with tons of publications with Full Text access. To learn how to use this database better, jump to the Help Using Databases section of this page and watch the videos "Finding Articles" and "Judging Articles" for an in-depth look at how to use this resource.

JSTOR : JSTOR is another very popular database, with a "how to" video in the Help Using Databases section. JSTOR is not just a database that offers many various resources in full text, but is also an Archive . Simply put, JSTOR has a lot of great information both new and old, and it'll always be there.

Films on Demand: Films on demand is a database that holds tons of educational videos on many different subjects. What's more, they provide citation materials and video embed options, for use in presentations or saving for later.

Nexis Uni : Formerly Nexis Lexis, this database collects newspaper articles from all over the world, if you need news articles from a specific publication, date-range or on a topic, this is the place to look.

Others : Of course, the databases listed above are just some of the more popular database options we have access to. Link library subscribes to many more databases, please check out the Database List Page to see all the others.

Finding Articles (Using Academic Search Premier)

Judging Articles (Using Academic Search Premier)

Judging articles (using academic search premier).

Using JSTOR

Can I Access Databases from Off Campus?

When i go to a database, it has a login option in the top right corner, what is this, an article in a database says "concordia subscribes to this title" but doesn't have an access link, how do i get it, i found an article i want, but i don't have full access, what do i do.

<< Previous: Basic Research Process
Next: Using "Libguides" >>
Last Updated: Aug 29, 2024 2:15 PM
URL: https://cune.libguides.com/research

Introduction to Library Research

Find Articles by Subject
Tips for Finding Articles
How to Read a Scholarly Article
Citation Help
Helpful Videos

What is a Library Database?

A library database is an electronic collection of information, organized to allow users to get that information by searching in various ways.

Examples of Database information

Articles from magazines, newspapers, peer-reviewed journals and more. More unusual information such as medical images, audio recitation of a poem, radio interview transcripts, and instruction video can be found in databases as well.

General reference information such as that found in an encyclopedia. Both very broad topic information is available as well as very specific.

Books. Online versions, eBooks, are the same as print versions with some enhancements at times, such as an online glossary.

Why not just use Google?

What’s the difference?

Information in a database has been tagged with all sorts of data, allowing you to search much more effectively and efficiently. You can search by author, title, keyword, topic, publication date, type of source (magazine, newspaper, etc.) and more.

Database information has been evaluated in some way, ranging from a very rigorous peer-review publishing process to an editor of a popular magazine making a decision to publish an article.

Databases are purchased, and most of the information is not available for free on the internet. The databases are continually updated as new information is produced.

Citation information. Databases include the information you need to properly cite your sources and create your bibliography. Information you retrieve using Google may or may not have this information.

My professor says I can’t use the Internet. Can I still use these databases?

Yes! The internet is only the delivery system for the databases. The information in the databases is not found on the free web.

<< Previous: Home
Next: Find Articles by Subject >>
Last Updated: Mar 1, 2024 12:41 PM
URL: https://libguides.regiscollege.edu/researchintro

English 1013 - Composition I: What is a database?

What is a database?
About Academic Writing
Explain a Concept
Opposing Arguments
Find Current Event Resources
Avoid Plagiarism

Library Databases

Your best bet when doing any type of research for course assignments is to use the library databases. When you search using Find It , you are searching one kind of database. However, there are many more to choose from, each focused on a certain kind of resource or focused around a certain subject. Library databases were built with you--the researcher--in mind, and include many tools to help you organize, cite, and save the scholarly resources you find.

What is a Database?

A database is an electronic collection of information that is organized so that it can easily be accessed, managed, and updated. Amazon.com is a familiar database as is the library's catalog, Find It . The library also subscribes to over 200 scholarly and research databases. You can browse them all here: http://libguides.atu.edu/az.php

Basic Database Searching

Peer Review in 3 Minutes

Multi-subject Databases

Multi-disciplinary database with full text journals, indexing /abstracts, monographs, reports, conference proceedings, and more.

Multidisciplinary database of periodical content.

Databases provided by the Arkansas State Library [ASL] Traveler project are funded by a grant from the U.S. Institute of Museum and Library Services (Grant LS-00-14-0004-14) and the Arkansas Department of Education.

Multidisciplinary digital library of academic journals, books, and primary sources.

Full text, multidisciplinary databases.

Online resource covering today’s social issues. Provides pro/con viewpoint essays, primary source documents, statistics, articles, and images.

Why Should I Use the Library Databases?

<< Previous: Home
Next: About Academic Writing >>
Last Updated: Sep 11, 2024 12:59 PM
URL: https://libguides.atu.edu/english1013

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Publications
Account settings

The PMC website is updating on October 15, 2024. Learn More or Try it out now .

Advanced Search
Journal List

PMC9650106.1 ; 2022 Jan 28
➤ PMC9650106.2; 2022 Nov 1

Research Software vs. Research Data I: Towards a Research Data definition in the Open Science context

Teresa gomez-diaz.

1 Laboraroire d'Informatique Gaspard-Monge, CNRS, Paris-Est, France

Tomas Recio

2 Universidad Antonio de Nebrija, Madrid, Spain

Associated Data

Underlying data.

Data underlying the arguments presented in this article can be found in the references, footnotes and Box 1 .

Version Changes

Revised. amendments from version 1.

This version considers the comments of the reviewers to better explain and illustrate some of the concepts presented in the article. In particular we have stressed the importance of the scientific production context for the RS and RD definitions. We have as well introduced new references related to the concepts of data and information, to further illustrate our view on the complexity of the data concept, and a new reference to complete the studied landscape for the proposed RD definition. As asked by the Referees, we have moved the translations of French and Spanish quotes to the main text. See our answers to the referee reports to complete the differences with the version 1 of this article.

Peer Review Summary

Review date	Reviewer name(s)	Version reviewed	Review status
	Joachim Schopfel		Approved
	Remedios Melero		Approved
	Tibor Koltay		Approved
	Joachim Schopfel		Approved with Reservations
	Remedios Melero		Approved with Reservations
	Tibor Koltay		Approved with Reservations

Background: Research Software is a concept that has been only recently clarified. In this paper we address the need for a similar enlightenment concerning the Research Data concept.

Methods: Our contribution begins by reviewing the Research Software definition, which includes the analysis of software as a legal concept, followed by the study of its production in the research environment and within the Open Science framework. Then we explore the challenges of a data definition and some of the Research Data definitions proposed in the literature.

Results: We propose a Research Data concept featuring three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and has yield a result published or disseminated in some article or scientific contribution of any kind.

Conclusions: The analysis of this definition and the context in which it is proposed provides some answers to the Borgman’s conundrum challenges, that is, which Research Data might be shared, by whom, with whom, under what conditions, why, and to what effects. They are completed with answers to the questions: how? and where?

1. Introduction

Each particle of the Universe, known or unknown by what is widely accepted as Science, is information. Different datasets can be associated to each particle to convey information, as, for example: where has this particle been discovered? By whom? At what time? Is this particle a constituent element of a rock, or a plant, or … ? Indeed, as living entities of the Earth planet, … we are all part of this Universe and every atom in our bodies came from a star that exploded … , therefore … we are all stardust … . 1

So long ago that we have never been able to give a precise date, information started to be fixed in cave paintings, figurines, and bone cravings, which have been found in caves like Altamira 2 or Lascaux 3 . That is, some human beings intentionally fixed information on a support. Much more recently, languages have been developed to deal with information, fixing and exchanging it in clay bricks, papyrus, monument walls, and paper books. Even more recently, information has been fixed in films, photographs, and has finally adopted digital formats.

Scientists study all kinds of subjects and objects: persons, animals, trees and plants and other living beings, philosophies and philosophers, artists and artworks, mathematical theories, music, languages, societies, cities, Earth and many other planets and exoplanets, clouds, weather and climate, stars and galaxies, as well as other animate or inanimate objects, molecules, particles, nanoparticles and viruses, nowadays including digital objects such as computer programs. Some of these items, like images, texts, and music etc. may have associated intellectual property rights; but others, like statistics or geographical data, may not. Yet, they may be affected by other legal contexts, such as, for example, the one given by the EU INSPIRE Directive 1 for spatial data, concerning any data with a direct or indirect reference to a specific location or geographical area.

Now, in our digital era, most of the above subjects under consideration are handled by humans using computers, through numerical data. Scientists present new theories and results built and produced with numerical simulations and through the analysis of numerical datasets. They are usually stored in databases, manipulated or produced in digital environments using existing software, either Free/Open Source Software (FLOSS) 4 or commercial, or by means of software developed by research teams to address specific problems 2 , 3 .

In this specific scientific context, the aims and developments of Open Science practices are particularly relevant. Indeed, as remarked by 4 : "We must all accept that science is data and that data are science … ". Therefore, in this article we take into consideration the following definition of Open Science, in which the open access to Research Data (RD) and to Research Software (RS) is part of the core pillars 5 :

Open Science is the political and legal framework where research outputs are shared and disseminated in order to be rendered visible, accessible and reusable.

A more transversal and global vision can be found in the UNESCO Recommendation on Open Science 5 , 6 . See also 7 for another relevant example of ongoing work on the Open Science concept. But in this paper, following the analysis and the conclusions of 5 , we focus here on this restricted framework as more suitable for our purposes.

Among the most important kinds of research outputs of any scientific work, we focus on the trio formed by articles , software and data. Actually, among all the possible duos, the couple RS and RD present more similarities, although a light list of differences between software and data have been mentioned in 8 and 9 . On the other hand, regarding other duos, we think that differences are much stronger. For instance, unlike the dissemination of published articles, usually at the hands of scientific editors, the dissemination of software and data that have been produced in the research process is mostly at the hands of their producers, the research team. The analogies between RS and RD have been already summarily highlighted in 10 , such as those concerning the release protocols of RD and RS, which raises the same questions, at the same time, in the production context. As a direct consequence, it seems suitable to propose a similar dissemination procedure for both kinds of research outputs 11 .

Indeed, let us remark that, as mentioned in 11 , 12 , both RS and RD dissemination might involve the use of licenses to set their sharing conditions, such a core issue. Information about RS licenses and licensing can be found at the Free Software Foundation (FSF) 6 , the Open Source Initiative (OSI) 7 , and the Software Package Data Exchange (SPDX) 8 . The SPDX licenses list also includes licenses that can be used for databases, like the Creative Commons licenses 9 or the Open Data Commons Licenses 10 , see for example 13 .

Other similarities regarding RS and RD are related to management plans: for example, Data Management Plans are nowadays required by research funders (see for example 14 , 15 ) and, in the same mood, Software Management Plans have been recently proposed, see 16 and the references therein.

Finally, concerning evaluation, as observed in 3 , similar evaluation protocols can be proposed for both RS and RD.

Leaving aside the common issues in RS and RD for licensing and management plans, that have been already studied in the above mentioned references, the RS and RD dissemination and evaluation analogies are more closely analyzed in the article 12 that follows the present work, including FAIR related issues 17 and 5Stars Open Data 11 . On the other hand, in the current article we focus on the conceptual analogies of RS and RD, and their consequences (see Section 5 ).

As we will argue in the next sections, a definition for RD can be proposed following the main features of the RS definition given in our recent work 3 , 18 . However, we consider that formulating such proposal still remains a challenging issue that we dare to address here. In fact, although one of the most widely accepted RD definitions is the one proposed by the OECD (2007) 19 , other works have shown the difficulties to fix such a definition 20 , 21 . Indeed, establishing this concept has important and not well settled consequences, for example, concerning the context of RD sharing, as highlighted by C. Borgman in 22 :

Data sharing is thus a conundrum. […] The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.

It is the intention of our present work to bring some answers to these questions.

The plan of this article is as follows. The next section introduces the concept of RS after a summary presentation of the key points involved in the notion of software as a legal object. Section 3 is devoted to discuss the different issues involved in the challenge towards a precise definition of data (in the more comprehensive sense of this concept). Section 4 describes partially the landscape of existing work addressing the RD definition, enumerating, again, some difficulties to settle such a concept.

There we propose our RD definition, based in three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and has yield a result published or disseminated in some article or scientific contribution of any kind. Comparisons with other RD definitions are examined.

The last and final section concludes with the proposition of some specific answers to Borgman’s conundrum challenges 22 . Let us remark that these conundrum challenges involve as well RD dissemination issues that are studied in detail in the article that follows this work 12 , which also includes the analysis of RD evaluation and FAIR issues.

The reader of the current work should be aware that its authors are not legal experts. Thus, in order to address our goals in this article, we have analyzed (French, Spanish, European and USA) legal documents and articles written by law experts 1 , 13 , 20 , 21 , 23 – 34 , but from the scientist’s point of view. Yet, a deeper understanding of legal issues may require the intervention of legal specialists.

Following the standard scientific protocol, the authors of this work (mathematicians) have, first, detected a problem – the need to provide a more suitable RD definition. Then, they have observed the involved landscape and studied the related literature; have focused on and structured different components of the problem; finally, they have proposed what they believe could be a solution for the challenge under consideration. As in any other research work, we, authors of the present work, believe that our proposal should be examined by the scientific community in order to evaluate its correctness, and to help improving it, if needed, advancing towards a better solution.

2. Research Software

In this section we bring together some of the existing definitions of software as a legal object (see references below). We also recall our definition of RS coming from 3 , 18 .

2.1. Software is a legal object

In what follows we refer to the documents 26 – 29 dealing with a definition of software as a legal object. Note that the terms computer program , software , logiciel (in French), programa de ordenador (in Spanish) are synonyms in this work. The terms source code (or código fuente in Spanish), compiled code (or code compilé , código compilado ) correspond to subsets of a computer program.

The first definition that we would like to consider comes from the Directive 2009/24/EC of the European Parliament 26 , that states:

For the purpose of this Directive, the term “computer program” shall include programs in any form, including those which are incorporated into hardware. This term also includes preparatory design work leading to the development of a computer program provided that the nature of the preparatory work is such that a computer program can result from it at a later stage.

Moreover, in the Spanish Boletín Oficial del Estado n. 97 (1996) 27 we can find 12 :

A los efectos de la presente Ley se entenderá por programa de ordenador toda secuencia de instrucciones o indicaciones destinadas a ser utilizadas, directa o indirectamente, en un sistema informático para realizar una función o una tarea o para obtener un resultado determinado, cualquiera que fuere su forma de expresión y fijación. […] comprenderá también su documentación preparatoria.

[For the purpose of this Law, a computer program shall be understood as any sequence of instructions or indications intended to be used, directly or indirectly, in a computer system to perform a function or a task or to obtain a certain result, whatever expression and fixation form it can take. […] it can also include its preparatory documentation.]

Likewise, in the French Journal officiel de la République française (1982) 29 we can read:

Logiciel : Ensemble des programmes, procédés et règles, et éventuellement de la documentation, relatifs au fonctionnement d’un ensemble de traitement de données (en anglais : software). [ Software : All programs, procedures and rules, and possibly documentation, related to the performance of some data processing (in English: software).].

And in the French Code de la propriété intellectuelle (current regulation) 28 , Article L112-2, we can find:

Les logiciels, y compris le matériel de conception préparatoire, sont considérés notamment comme œuvres de l’esprit au sens du présent code. [Software, including the preparatory material, is considered as works protected by the present code.]

We observe that, in the above mentioned documents, the concept of software or computer program, logiciel or programa de ordenador refers to the set of instructions, of any kind, that are to be used in a computer system (including hardware). It is a work protected by the author rights. It can include the source code, the compiled code, and, eventually, the associated documentation and the preparatory material. It can be related to some data processing or to other tasks to be implemented in a computer system.

In order to complete this legal vision of the software concept we refer to item (11) of 26 :

For the avoidance of doubt, it has to be made clear that only the expression of a computer program is protected and that ideas and principles which underlie any element of a program, including those which underlie its interfaces, are not protected by copyright under this Directive. In accordance with this principle of copyright, to the extent that logic, algorithms and programming languages comprise ideas and principles, those ideas and principles are not protected under this Directive. In accordance with the legislation and case-law of the Member States and the international copyright conventions, the expression of those ideas and principles is to be protected by copyright.

Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, as there is a difference between the mere idea for the plot of a novel and the final written work. Several persons could have the same idea for the plot, but its realization in a final document will deliver different novels by different writers, as the novel will reflect the personality of its author. Similarly, an algorithm remains on the side of ideas, and as such, it is not protected by copyright laws. On the other side, poetry, novels and software are protected under copyright laws. Moreover, a computer program can implement several algorithms, and the same algorithm can be implemented in several programs.

Finally, note the nature of software as a digital object underlying all the above considerations.

2.2. Software as a research output: definition of Research Software

Beyond the vision of software as a legal object, we bring here the concept of Research Software (RS) as a scientific production, as defined in 3 , 18 :

Research Software is a well identified set of code that has been written by a (again, well identified) research team. It is software that has been built and used to produce a result published or disseminated in some article or scientific contribution. Each research software encloses a set (of files) that contains the source code and the compiled code. It can also include other elements as the documentation, specifications, use cases, a test suite, examples of input data and corresponding output data, and even preparatory material.

Thus, Section 2.1 of 3 introduces several definitions regarding the notions of scientific and research software as found in the literature, as a way to support the above definition, while 18 provides complementary analysis on this concept. Note that this definition does not take into consideration if the RS status is “ongoing” or “finalized”, and does not regard if the RS has been disseminated, its quality or scope, its size, or if it is documented, maintained, used only by the development team for the production of an article, or it is currently used in several labs … 2 .

Different recent works on the RS concept can be found, for example, on 35 and the references therein, where the RDA FAIR for Research Software (FAIR4RS) working group 13 proposes a definition of RS full of subtleties and details, albeit, perhaps, of complex interpretation in practice.

We observe, following our proposed definition, that RS can be characterized through three main features:

• the goal of the RS development is to do research. As stated by D. Kelly: it is developed to answer a scientific question 36 ,
• it has been written by a research team,
• the RS is involved in the obtention of the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals) or by any other kind of recognized scientific means.

Note that documentation, licenses, examples, data, tests, Software Management Plans and other related information and materials can also be part of the set of files that constitutes a specific RS. Remark that the data we refer to in this list will qualify as RD (as defined in Section 4 ) if they have been produced by a research team, that can be the same team that has produced the RS, but not necessarily (notice that the role of the research team involved in the development of a RS has been thoroughly studied in Section 2.2 of 3 ). Indeed, Section 2.1 above shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of a RS, following the choice of the RS producer team. There can be other elements as for example tests, input and output files to illustrate how to use the RS, licenses, etc. To include these elements in the released RS correspond to best practices that facilitate RS reuse. In our view, the release of a RD (see Section 4 and 12 ) can follow similar practices, that is, to include a documentation, some use examples, a license, a data management plan … this is to be decided by the producer team.

The initial origin of this RS definition is to be found in 2 , that contains a detailed and complete study comparing articles and software produced in a typical (French) research lab. As remarked in received comments and Referee reports to this article, this RS definition (as well as the RD definition proposed in Section 4 ) is placed in what can be considered as a narrow context, emphasizing the role of the scientific production context. The relevance of such context is widely accepted by the scientific community in the case of articles: not every article published in a newspaper qualifies as a research article, that requires to be released in a scientific journal and subject to a referee procedure. Similarly, the importance of the production context has been already highlighted in the case of data, regarding those that qualify as cultural data 23 .

Besides, our definition does not include as RS neither commercial software nor existing Free/Open Source Software (FLOSS) or other software developed outside Academia, a restriction which does not exclude that RS (or research articles, data...) can be produced in other contexts like private laboratories, for example. Rather, this means that we are not considering here differences between private or public funding of research. As a matter of fact, a research team can use RS produced by other teams for their scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered in the making-of aspects which are pertinent for the proposed definition. Obviously, a RS that has been initially developed in a research lab can evolve to become commercial software or just evolve outside its initial academic context. The above definition concerns its early, academic life.

Moreover, a RS development team may not just use software produced by other teams, but also include external software as a component inside the ongoing computer program, a procedure that could be facilitated by the FLOSS licenses. We consider that this external component qualifies as RS if it complies with the three characteristics given in the above definition. Moreover, the producers of the final RS should clearly identify the included external components, and their licenses. They should also highlight the used or included RS components, by means of a correct citation form 3 , 8 , 11 , 37 – 39 .

Furthermore, a RS may involve other software components that can remain external , and that are not included in the RS development and release. It is then left to the users the task to recover and install them, and to assemble these external components in order to get a running environment. Another situation, as the one we have analyzed in 18 , deals with the RS developed within a given software environment which is not perhaps fully disseminated with the RS. For example, the GeoGebra code developed by T. Recio and collaborators 14 does not disseminate the whole GeoGebra software 15 , but only some parts that are relevant for their goals and that include their code.

See 2 , 3 , 18 for more discussions and references that have motivated the RS definition we have sketched in this section.

3. The challenges of a data definition

As stated in 40 :

“Data” is a difficult concept to define, as data may take many forms, both physical and digital.

For example, unlike software, data is, as a legal object, much more difficult to grasp. In fact, according to 33 , data is not a legal concept, as it does not fall into a specific legal regime. For example, data can be either mere information or une œuvre , a work with associated intellectual property, when it involves creative choices in its production that reflect the author’s personality 32 . The Knowledge Exchange report 21 provides guidelines that can be used to assess the legal status of research data, and mentions:

It is important to know the legal status of the data to be shared. […] not all data are protected by law, and not every use of protected research data requires the author’s consent. […] Whether data are in fact protected must be determined on a case-by-case basis.

In relation with this legal context of data sharing and reuse, a very complete framework is introduced in 23 :

Les problématiques liées à la réutilisation nécessitent une maîtrise parfaite du droit de la propriété intellectuelle, du droit à l’image, du droit des données personnelles, du respect à la vie privée et du secret de la statistique, du droit des affaires, du droit de la concurrence, du droit de la culture, du droit européen et des règles de l’économie publique. [The issues related to reuse require a perfect mastership of intellectual property rights, image rights, personal data rights, respect for private life and statistical confidentiality, business law, competition law, cultural law, European law and the rules of the public economy.]

Another list of legal issues related to data is provided by 33 , similar but not equal to the one in the previous quote. Yet, it is also necessary to consider other legal contexts concerning, for example, les données couvertes par le secret médical ou le secret industriel et commercial [Data covered by medical secret or by the industrial and commercial secret] 16 . Let us remark that the section Applicable Laws and Regulations of 15 provides a broad overview of regulatory aspects that need to be taken into consideration when developing disciplinary RD management protocols in the European context. But, as declared in the introduction, it is not our intention to go deeper into these legal aspects, that should be also regarded from the perspective of many different laws.

The underlying problem is that data can refer to many different subjects or objects. We need to simplify the context to help us setting a manageable concept of research data adapted to the scientific framework. For this purpose we present here two relevant data definitions found in the data scientific literature.

The OECD data definition in its Glossary of Statistical Terms 17 states that:

DATA Definition: Characteristics or information, usually numerical, that are collected through observation. Context: Data is the physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means (Economic Commission for Europe of the United Nations (UNECE)), “Terminology on Statistical Metadata”, Conference of European Statisticians Statistical Standards and Studies, No. 53, Geneva, 2000.

Also, as a relevant precedent, let us quote here the data definition of the Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest , as mentioned in 41 :

A data set is a collection of related data and information – generally numeric, word oriented, sound, and/or image – organized to permit search and retrieval or processing and reorganizing. Many data sets are resources from which specific data points, facts, or textual information is extracted for use in building a derivative data set or data product. A derivative data set, also called a value-added or transformative data set, is built from one or more preexisting data set(s) and frequently includes extractions from multiple data sets as well as original data (Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest, 1999, p. 15).

We can notice that both definitions combine the concepts of data and information, yielding, again, to a challenging situation. Thus, to better grasp the connection between both terms we have consulted several sources of different nature, see Box 1 . Note that we can find in Box 1 that information among the data synonyms in the Larousse dictionary, but data is not among the information synonyms. On the other hand, Wikipedia mentions that both terms can be used interchangeably, but that they have different meanings.

A promenade around the data and information concepts.

I.1 Diccionario de la lengua española of the Real Academia Española

• Definition of dato ( https://dle.rae.es/dato )
– Del latín datum ‘lo que se da’.
– 1. m. Información sobre algo concreto que permite su conocimiento exacto o sirve para deducir las consecuencias derivadas de un hecho. A este problema le faltan datos numéricos.
– 2. m. Documento, testimonio, fundamento.
– 3. m. Inform. Información dispuesta de manera adecuada para su tratamiento por una computadora.
• Definition of información ( https://dle.rae.es/informaci%C3%B3n )
– Del latín informatio, - o ¯ nis ‘concepto’, ‘explicación de una palabra’.
– 1. f. Acción y efecto de informar.
– 2. f. Oficina donde se informa sobre algo.
– 3. f. Averiguación jurídica y legal de un hecho o delito.
– 4. f. Pruebas que se hacen de la calidad y circunstancias necesarias en una persona para un empleo u honor. U. m. en pl.
– 5. f. Comunicación o adquisición de conocimientos que permiten ampliar o precisar los que se poseen sobre una materia determinada.
– 6. f. Conocimientos comunicados o adquiridos mediante una información.
– 7. f. Biol. Propiedad intrínseca de ciertos biopolímeros, como los ácidos nucleicos, originada por la secuencia de las unidades componentes.
– 8. f. desus. Educación, instrucción.

I.2 Diccionnaire Larousse de la langue française

• Definition of donnée ( https://www.larousse.fr/dictionnaires/francais/donn%c3%a9e/26436 )
– Ce qui est connu ou admis comme tel, sur lequel on peut fonder un raisonnement, qui sert de point de départ pour une recherche (ex. Les données actuelles de la biologie).
– Idée fondamentale qui sert de point de départ, élément essentiel sur lequel est construit un ouvrage (ex. Les données d’une comédie).
– Renseignement qui sert de point d’appui (ex. Manquer de données pour faire une analyse approfondie).
– Représentation conventionnelle d’une information en vue de son traitement informatique.
– Dans un problème de mathématiques, hypothèse figurant dans l’énoncé.
– Résultats d’observations ou d’expériences faites délibérément ou à l’occasion d’autres tâches et soumis aux méthodes statistiques.
• Definition of information ( https://www.larousse.fr/dictionnaires/francais/information/42993 )
– Action d’informer quelqu’un, un groupe, de le tenir au courant des événements : La presse est un moyen d’information.
– Indication, renseignement, précision que l’on donne ou que l’on obtient sur quelqu’un ou quelque chose: Manquer d’informations sur les causes d’un accident. (Abréviation familière : info.)
– Tout événement, tout fait, tout jugement porté à la connaissance d’un public plus ou moins large, sous forme d’images, de textes, de discours, de sons. (Abréviation familière : info.)
– Nouvelle communiquée par une agence de presse, un journal, la radio, la télévision. (Abréviation familière : info.)
– Cybernétique. Mesure de la diversité des choix dans un répertoire de messages possibles.
– Droit. Instruction préparatoire, diligentée par le juge d’instruction en vue de rechercher et de rassembler les preuves d’une infraction, de découvrir l’auteur, de constituer à charge et à décharge le dossier du procès pénal. (Elle est close par un non-lieu ou par un renvoi devant une juridiction répressive. En matière criminelle, l’instruction est à double degré [juge d’instruction, chambre d’accusation].)
– Informatique. Élément de connaissance susceptible d’être représenté à l’aide de conventions pour être conservé, traité ou communiqué.

I.3 Wikipedia

Extract from the Data page of Wikipedia ( https://en.wikipedia.org/wiki/Data ):

Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. […] Although the terms “data” and “information” are often used interchangeably, these terms have distinct meanings. […] data are sometimes said to be transformed into information when they are viewed in context or in post-analysis. However, […] data are simply units of information.

Moreover, in 42 and in the web page of ISKO 18 , when discussing in detail the concept of data, an etymological and linguistic vision is also the starting point, and among other sources also, it mentions Wikipedia. The conclusion in 42 (section 2.5):

Therefore, our conclusion of this Section is that Kaase’s (2001, 3251) definition seems the most fruitful one suggested thus far: Data are information on properties of units of analysis.

See also 43 – 45 where ours readers can find further reflections on the concepts of data, information, knowledge, understanding, evidence and wisdom.

Such reflections bring to us an eclectic panorama on the ingredients that could form a data definition and their relation with the concept of information, attesting the involved difficulties in such goal.

Focusing in the scientific context, we can illustrate this complexity in full terms referring to the French Code de l’environnement 30 . In its Article L-124-2 19 we can appreciate the subtleties of the definition of environmental data in the following description:

Est considérée comme information relative à l’environnement au sens du présent chapitre toute information disponible, quel qu’en soit le support, qui a pour objet : 1. L’état des éléments de l’environnement, notamment l’air, l’atmosphère, l’eau, le sol, les terres, les paysages, les sites naturels, les zones côtières ou marines et la diversité biologique, ainsi que les interactions entre ces éléments ; 2. Les décisions, les activités et les facteurs, notamment les substances, l’énergie, le bruit, les rayonnements, les déchets, les émissions, les déversements et autres rejets, susceptibles d’avoir des incidences sur l’état des éléments visés au point 1 ; 3. L’état de la santé humaine, la sécurité et les conditions de vie des personnes, les constructions et le patrimoine culturel, dans la mesure où ils sont ou peuvent être altérés par des éléments de l’environnement, des décisions, des activités ou des facteurs mentionnés ci-dessus ; 4. Les analyses des coûts et avantages ainsi que les hypothèses économiques utilisées dans le cadre des décisions et activités visées au point 2 ; 5. Les rapports établis par les autorités publiques ou pour leur compte sur l’application des dispositions législatives et réglementaires relatives à l’environnement. [For the purposes of this chapter, information relating to the environment is considered to be any information available, whatever the medium, the purpose of which is: 1. The state of the elements of the environment, namely the air, atmosphere, water, soil, land, landscapes, natural sites, coastal or marine areas and biological diversity, as well as the interactions between these elements; 2. Decisions, activities and factors, namely substances, energy, noise, radiation, waste, emissions, spills and other discharges, likely to have an impact on the state of the elements concerned in point 1; 3. The state of human health, safety and living conditions of people, buildings and cultural heritage, insofar as they are or may be altered by elements of the environment, decisions, activities or the factors mentioned above; 4. The analyses of costs and advantages as well as the economic assumptions used in the context of the decisions and activities referred to in point 2; 5. Reports drawn up by public authorities or on their behalf on the application of legislative and regulatory provisions related to the environment. ] .

To be compared with the much more easier to understand concept of geographical data as introduced by the Article L127-1 20 of the same Code de l’environnement 30 :

Donnée géographique, toute donnée faisant directement ou indirectement référence à un lieu spécifique ou une zone géographique ; [Geographic data, any data that refers directly or indirectly to a specific place or geographic area;]

Another example to show the complexity of the representation and manipulation of data and information that we would like to mention here corresponds to the linguistic research work developed at the Laboratoire d’informatique Gaspard-Monge, where one of the authors of the present work resides, see for example the doctoral thesis 46 , 47 .

An additional factor that adds complexity to the concept of scientific data has to do with the potential use(s) and sharing of these data. As remarked by the OECD Glossary of Statistical Terms 21 :

The context provides detailed background information about the definition, its relevance, and in the case of data element definitions, the appropriate use(s) of the element described.

The importance of the context is also noted in 22 :

… research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context.

This opens the door to a series of complex issues. For example, to the need for complementary, technical information or documentation associated to a given dataset in order to facilitate its reuse. See 48 (p.16) (and also 40 ) that highlights the difficulties raised by the concept of temperature related data, as explained by a CENS biologist:

There are hundreds of ways to measure temperature. “The temperature is 98” is low-value compared to, “the temperature of the surface, measured by the infrared thermopile, model number XYZ, is 98.” That means it is measuring a proxy for a temperature, rather than being in contact with a probe, and it is measuring from a distance. The accuracy is plus or minus.05 of a degree. I [also] want to know that it was taken outside versus inside a controlled environment, how long it had been in place, and the last time it was calibrated, which might tell me whether it has drifted.

Another instance to further illustrate the complexity of technical information associated to a data set in the STRENDA Guidelines that have been developed to assist authors to provide data describing their investigations of enzyme activities. 22

Other examples from the collection of complex issues associated to data use(s) and sharing conditions are:

• 23 (p.11) The concept of right of access , involving the meaning of public information, requiring three characteristics: the existence of a document, of administrative nature, and in the possession of the Public Administration.

… l’utilisation d’une information publique par toute personne qui le souhaite à d’autres fins que celles de la mission de service public pour les besoins de laquelle les documents ont été élaborés ou détenus. [… the use of public information by anyone who wishes it for other purposes than those of the original needs for which the documents were prepared or held by the public service mission.].

finds a strong formulation for scientific data in 49 :

The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research. The public-good interests in the full and open access to and use of scientific data need to be balanced against legitimate concerns for the protection of national security, individual privacy, and intellectual property.

For more information on ‘re-use’ see, for example, 20 , 25 , 32 , 48 .

• 23 (p.10) The evolution from the right of access to documents from the Public Administration to the right of reuse of public information.
1 public information derived from a document produced or hold by the Administration,
2 there are no other intellectual property rights owners,
3 data do not affect personal or private issues of people.
• 22 (p. 1060) The concept of data sharing in a scientific context:

For the purposes of this article, data sharing is the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection. Posting datasets on a public website or providing them to a journal as supplementary materials also qualifies as sharing.

• The importance of licenses to set the sharing and re-use conditions as highlighted in 5 , 11 , 13 , 50 .

Open data are data in an open format that can be freely used, re-used and shared by anyone for any purpose.

• 53 also provides a classification of scientific data in four types: observational, experimental, computational and reference data sets.
• The FAIR guiding principles 17 are studied in the article that follows this work 12 .
• The recent and relevant introduction of the term Big Data 24 , that refers to the exploitation of larger amounts of data. They can appear in medical research, meteorology, genomics, astronomy, demographic studies … and in real life, as we live all in a digital world where we generate large amounts of data every day by the use of phones and computers to do work, travelling, e-mail, business, shopping etc. 42 . Big data is associated mainly to four “V” characteristics: Volume, Variety, Velocity, Veracity, and others can be found for example in the mentioned Wikipedia page and in the references mentioned there. See also 54 .

Closing the conceptual loop developed in this section, let us remark, again, that legal aspects arise quite naturally in the above list of items. Among others, some aspects are related to the fact that the datasets are usually organized in databases, where data is arranged in a systematic or methodical way and is individually accessible by electronic or other means 13 , 20 , 21 , 24 , 28 . The intellectual property rights can apply to the content of a database, the disposition of its elements and to the tools that make it working (for example software). The sui generis database rights primarily protects the producer of the database and may prohibit, for instance, the extraction and/or reuse of all or a substantial part of its content 24 .

Finally, let us quote here this paragraph from the OpenAIRE project report 20 (p.19) that highlights the difficulties to set a research data definition in the context of legal studies:

From a legal point of view, one of the very basic questions of this study is which kind of potentially protected data we are dealing with in the context of e-infrastructures for publications and research data such as OpenAIREplus. The term “research data” in this context does not seem to be very helpful, since there is no common definition of what research data basically is. It seems rather that every author or research study in this context uses its own definition of the term. Therefore, the term “research data” will not be strictly defined, but will include any kind of data produced in the course of scientific research, such as databases of raw data, tables, graphics, pictures or whatever else.

We can remark, that although the preceding quote does not provide a strict definition of research data, it highlights the relevance of the production context, as we have already mentioned in Section 2.2 .

4. Data as a research output: towards a definition for Research Data

In the previous section we have exemplified the complexity of the concept of data through different approaches. In this section we focus on the research data concept, proposing here a RD definition, directly derived from the RS definition presented in Section 2.2 . To this aim we start by gathering some previous definitions that are particularly relevant for our proposal.

The first one is the White House document 34 , and in particular the Intangible property section where we can find the following definition.

Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.

Let us remark that, according to 34 this definition explicitly excludes:

(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and (B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.

The above RD definition has been extended in 55 , emphasizing, among other aspects, the scientific purpose of the recorded factual material and the link with the scientific community.

A second basic inspiration for our proposal is the Directive for Open Data 25 that states:

(Article 2 (27)) The volume of research data generated is growing exponentially and has potential for re-use beyond the scientific community. […] Research data includes statistics, results of experiments, measurements, observations resulting from fieldwork, survey results, interview recordings and images. It also includes meta-data, specifications and other digital objects. Research data is different from scientific articles reporting and commenting on findings resulting from their scientific research. […] (Article 2 (9)) ‘research data’ means documents in a digital form, other than scientific publications, which are collected or produced in the course of scientific research activities and are used as evidence in the research process, or are commonly accepted in the research community as necessary to validate research findings and results;

The third pillar that we consider essential to support our proposal is the OECD report 19 (p.13) where we can find one of the most largely accepted and adopted definitions of RD:

Research data are defined as factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated. This term does not cover the following: laboratory notebooks, pre-liminary analyses, and drafts of scientific papers, plans for future research, peer reviews, or personal communications with colleagues or physical objects (e.g. laboratory samples, strains of bacteria and test animals such as mice). Access to all of these products or outcomes of research is governed by different considerations than those dealt with here.

Finally, let us bring here the research data definition coming from the “Concordat on Open Research Data” 25 signed by the research councils of the UK Research and Innovation (UKRI) organisation 26 :

Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of its form (e.g. print, digital, or physical). These might be quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, modelling, interview or other methods, or information derived from existing evidence. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set), or derived from existing sources here the rights may be held by others.

Let us observe that this last definition highlights the important role of data as a tool to find an answer to a scientific question, coinciding with the first characteristic of our RS definition, and also agreeing with 40 (p. 508): … data from scientific sensors are a means and not an end for their own research.

A remarkable “positive” aspect of these four definitions is that they separate the data from the subject under study, and establish what is, or is not, RD. This is relevant, as the legal context of the subjects under study sets up the legal (and ethical ) context of the RD.

We must say that we do not agree completely with all the terms in these definitions. For example, regarding the exclusion of the laboratory notebooks as RD elements, as we think they can be used to generate input data for other studies (how a laboratory works, which is the information that appears in some notebooks depending on the scientific matter). We think that these information and data can be of interest for other researchers.

Some other “negative” aspects: the role of the data producers does not appear in the above definitions, although it is more or less implicit when they refer to the connection with the scientific community. Indeed, their role is very important as observed in 48 (p.6):

Data creators usually have the most intimate knowledge about a given dataset, gained while designing, collecting, processing, analyzing and interpreting the data. Many individuals may participate in data creation, hence knowledge may be distributed among multiple parties over time.

Certainly, as for each research output, the producer team is the guarantor of the data quality, in particular to ensure that the data are not outdated, erroneous, falsified, irrelevant, and unusable. Note that this is particularly relevant in the case of RD, as a consequence of the lack of a widely accepted RD publication procedures, compared to the existing ones for articles in scientific journals, where the responsibility of the quality of the publication is somehow shared by the authors, the journal editors, and the reviewers. This is also confirmed by 56 (p. 73):

The concept of data quality is determined by multiple factors. The first is trust. This factor is complex in itself. […] Giarlo (2013) also mentions trust in first place, stating that it depends on subjective judgments on authenticity, acceptability or applicability of the data. Trust is also influenced by the given subject discipline, the reputation of those responsible for the creation of the data, and the biases of the persons who are evaluating the data.

Even more, note that, as remarked in 23 the quality of the producer legal entity defines the cultural quality of the data in legal terms, yielding in this way the qualification of cultural data.

On the other hand, in some of the above definitions, the RD scientific purpose is focused in its role to validate research findings , although RD can be reused for many other finalities in the scientific context as, for instance, to generate new knowledge, i.e. as primary sources for new scientific findings. Let us observe that these are two of the four rationales for data sharing examined in 22 .

Bearing all these arguments in mind, we propose the following RD definition.

Research data is a well identified set of data that has been produced (collected, processed, analyzed, shared & disseminated) by a (again, well identified) research team. The data has been collected, processed and analyzed to produce a result published or disseminated in some article or scientific contribution. Each research data encloses a set (of files) that contains the dataset maybe organized as a database, and it can also include other elements as the documentation, specifications, use cases, and any other useful material as provenance information, instrument information, etc. It can include the research software that has been developed to manipulate the dataset (from short scripts to research software of larger size) or give the references to the software that is necessary to manipulate the data (developed or not in an academic context).

We can summarize the above definition in the following three main characteristics:

• the goal of the collection and analysis is to do research, that is, to answer a scientific question (which includes the validation of research findings),
• it has been produced by a research team,
• the RD is involved in the obtention of the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals) or by any other kind of recognized scientific means.

We provide here some further considerations concerning this proposal. First, it is clear that we have followed closely the RS definition in Section 2.2 , in order to formulate this RD counterpart, which involves the transaltion of some RS features of strict digital nature to RD. This does not mean that we do not consider non digital data as possible RD, but rather we assume that the information extracted from the physical samples has been already treated as digital information to be manipulated in a computer system, which simplifies the manipulation of physical data and its inclusion in the proposed RD definition.

Secondly, we emphasize that our RD definition also follows the consideration of a restricted research production context, as in the case of our RS definition. But this limited context to set the RD definition does not mean that e.g. public sector data can not be used in the research work. Rather it means that the external components that have not been directly collected/produced by the research team should be well identified, indicating their origin, where the data is available, which is the license that allows the reuse. It is also necessary to indicate if the data has been reused (processed) without modification, or if some adaptations have been necessary for the analysis. External data components can have any origin, not just public sector. As we have highlighted in Section 3 , the production context of the data may have a lot of importance, as data can be difficult to interpret once removed from their initial context 22 .

Third, note that, according to our definition, documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD. Moreover, as explained in Section 2.2 , a RS can also include data in the list of included materials that could also be qualified as RD. There are here a broad spectrum of possibilities, according to the size, the importance given by the research team and the chosen strategy in the dissemination stage. If the RD is considered of little size and less importance than the RS, it can be just included and disseminated as part of the software, and also the other way around, when the RS is considered less important than the RD, as for example when the software development effort is much less important than the time and effort invested in the data collection and analysis. It can also happen that both outputs are considered as of equal value, and can be disseminated separately. In this case it is important that both outputs are linked in order to allow other researchers to find easily the other output.

In a similar manner as for RS, RD can include other data components, and some can also qualify as research data. The RD producer team should explain how these components have been selected, mixed and analyzed, and highlight the reuse of other RD components by means of a correct citation form, see for example, 38 , 41 , 57 .

Moreover, software and data can have several versions and releases, and they can be manipulated alike and with similar tools (forges, etc …) 37 , 58 , 59 . One of the differences that we have detected between RS and RD is that while some research teams can decide to give access to early stages of the software development, what we observe in the consulted work is that RD is expected in its final form, ready for reuse, as mentioned in 22 :

If the rewards of the data deluge are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others.

This difference is a consequence of the distinct nature of the building process of both objects. In the FLOSS community, we find the release early, release often principle associated to the development of the Linux kernel 60 and to Agile developments. 27 This principle may not have the same sense in the building of a dataset for which a research team collects, processes and analyzes data with a very particular research purpose, maybe difficult to share with a large or external community in the early stages of the RD production.

Yet, in this work, we do not address some production issues like best software development practices or data curation, as they are out of the scope of the present article, and could be the object of future work. It is not that we do not give enough appreciation to these important issues, as they are part of the 3rd step of the proposed CDUR evaluation protocol for RS and RD, see sections 2.3 and 3.3 of 12 . For us, the research team decides when the research outputs have reached the right status for its dissemination. Neither we do enter in the different roles (see 22 ) that may appear in the RD team, taking care of actions involving: collection, cleaning, selection, documentation, analysis, curation, preservation, maintenance, or the role of Data Officer proposed in 15 .

5. Conclusion

While some authors highlight differences between software and data 8 , 9 , the present article leans toward profiting from the similarities shared by RS and RD. For example, taking into consideration the difference between the definition of software and the definition of RS has driven us to the proposition of a RD definition that is independent from the definition of data. Likewise, along the above sections we have emphasized other characteristics of RD that are grounded in the RS features. As a side effect of this approach, the fact that we can easily adopt issues from the RS definition formulation to RD, confirms and validates our proposed RS definition.

In the introduction we have mentioned Borgman’s conundrum challenges related to RD 22 :

The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.

In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of the RD production. We think that to provide the vision developed in Section 4 could be of help to deal with these questions, as a first step to tackle some problems in a well determined situation. Moreover, the view proposed in Section 4 is extended and completed with the dissemination and evaluation protocols of 12 . Our experience of many years confirms the need of these protocols for RS, and we think that they will be appropriated, useful and relevant for RD as well.

As a test for the soundness of the proposed RD definition we have used the conundrum queries as a benchmark, checking if our definition allows us to provide answers to the different questions, as well as to two extra ones that we consider equally relevant, namely how and where to share RD:

Which data might be shared? Following the arguments supporting our RD definition, we think that it is a decision of the research team: similarly to the stage in which the team decides to present some research work in the form of a document for its dissemination as a preprint, or a journal article, a conference paper, a book … the team should decide which data might be shared, in which form and when (following maybe funder or institutional Open Science requirements).

By whom? The research team that has collected, processed, analyzed the RD, and decided to share and disseminate it. That is the RD producer team, as stated in the second characteristic of our RD definition. On the other hand, data ownership issues have been discussed for example in 20 , 21 , 32 , 61 – 63 .

How? As observed in the precedent sections, the How? should follow some kind of dissemination procedure like the one proposed in 11 , 12 in order to identify correctly the RD set of files, to set a title and the list of persons in the producer team (that can be completed with their different roles), to determine the important versions and associated dates, to give a documentation, to verify the legal 21 , 33 (and ethical ) context of the RD and give the license to settle the sharing conditions 13 , etc. which can include the publication of a data paper and decisions about in which form and when the RD should be disseminated, maybe following grant funders or institutional Open Science requirements). In order to increase the return on public investments in scientific research, RD dissemination could respect principles and follow guidelines as described in 17 , 19 . Further analysis on RD dissemination issues can be found in 12 .

Where? There are different places to disseminate a RD, including the web pages of the producer team, of the funded project, or in a existing data repository. Let us remark that the Registry of Research Data Repository 28 is a global registry of RD repositories that covers repositories from different academic disciplines. It is funded by the German Research Foundation (DFG) 29 and it can help to find the right repository. Note that the Science Europe report 64 provides criteria for the selection of trustworthy repositories to deposit RD.

With whom? Each act of scholar communication has its own target public, and initially, the RD dissemination strategy can target the same public as the one that could be interested in the corresponding research article. But it can happen that the RD is of interdisciplinary value, possibly wider than the initial discipline associated to the scientific publication, and to assess what is the public involved in this larger context can be difficult. Indeed, as observed by 22 :

An investigator may be part of multiple, overlapping communities of interest, each of which may have different notions of what are data and different data practices. The boundaries of communities of interest are neither clear nor stable.

So, it can be complex to determine the community of interest for a particular RD, but this also happens for articles, see for example the studies on HIV/AIDS 65 making reference to automatic reasoning in elementary geometry in studies in its reference number 12, and it seems to us that this has never been an obstacle for sharing a publication. Thus 22 :

… the intended users may vary from researchers within a narrow specialty to the general public.

Under what conditions? As described previously, and in parallel with the case of RS, the sharing conditions are to be found in the license that goes with the RD, such as a Creative Commons license 30 or other licenses to settle the attribution, re-use, mining … conditions 13 . For example, in France, the law of 2016 for a Digital Republic Act sets in a Décret the list of licenses that can be used for RS or RD release 31 , 32 .

Why and to what effects? There maybe different reasons to release some RD, from the contribution to build more solid, and easy to validate science, to just comply with the recommendations or requirements of the funder of a project, of the institutions supporting the research team, or those of a scientific journal, including Open Science issues 5 . The works 22 , 49 give a thorough analysis on this subject. As documented there and already mentioned in Section 3 :

“The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research.”

As remarked in 5 and in the work analyzed there, the evaluation step is an important enabler in order to improve the adoption of Open Science best practices and to increase RD sharing and open access. To disseminate high quality RD outputs is a task that requires time, work and hands willing to verify the quality of the data, to write the associated documentation, etc. Incentives are needed to motivate the teams to accomplish these tasks. RD dissemination also asks for the establishment of best citation practices and evolution in the protocols of research evaluation. In particular, following the parallelism present all along this work, the CDUR protocol 3 proposed for RS evaluation can be also proposed for RD as developed in the article that extends the present work 12 .

Data availability

Acknowledgments.

With many thanks to the Referees, to the Departamento de Matemáticas, Estadística y Computación de la Universidad de Cantabria (Spain) for hospitality, and to Prof. T. Margoni for useful comments and references.

[version 2; peer review: 3 approved]

Funding Statement

This work is partially funded by the CNRS-International Emerging Action (IEA) PREOSI (2021-22).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1 “We Are Star Dust” - Symphony of Science, https://www.youtube.com/watch?v=8g4d-rnhuSg

2 Cave of Altamira and Paleolithic Cave Art of Northern Spain, https://whc.unesco.org/en/list/310/

3 Prehistoric Sites and Decorated Caves of the Vézère Valley, https://whc.unesco.org/en/list/85/

4 https://en.wikipedia.org/wiki/Free_and_open-source_software

5 https://en.unesco.org/science-sustainable-future/open-science/recommendation

6 https://www.fsf.org/licensing/

7 https://opensource.org/licenses

8 https://spdx.org/licenses/

9 https://creativecommons.org/licenses/?lang=en

10 https://opendatacommons.org/licenses/

11 https://5stardata.info/en/

12 Note that the authors of this article provide their own translations. Authors prefer to keep the original text for two reasons. First, because of the legal nature of the involved quotations. Second, for French or Spanish speaking readers to enjoy it, very much in line with the Helsinki Initiative on Multilingualism in Scholarly Communication (2019), see https://doi.org/10.6084/m9.figshare.7887059 . These translations have been helped by Google Translate, https://translate.google.com/ and Linguee, https://www.linguee.fr/ .

13 https://www.rd-alliance.org/groups/fair-research-software-fair4rs-wg

14 https://matek.hu/zoltan/issac-2021.php

15 https://swmath.org/software/4203

16 See, for example, https://www.senat.fr/dossier-legislatif/pjl16-504.html

17 https://stats.oecd.org/glossary/detail.asp?ID=532

18 https://www.isko.org/cyclo/data

19 https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000006832922/

20 https://www.legifrance.gouv.fr/codes/section_lc/LEGITEXT000006074220/LEGISCTA000022936254/

21 The entries of the glossary https://stats.oecd.org/glossary/ have several parts including Definition and Context as shown in the Data definition included in Section 3 . This quotation appears when placing the pointer over the Context part of the Data entry.

22 https://www.beilstein-institut.de/en/projects/strenda/guidelines/

23 https://en.wikipedia.org/wiki/Open_data

24 https://en.wikipedia.org/wiki/Big_data

25 https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-ConcordatonOpenResearchData.pdf

26 https://www.ukri.org/

27 https://en.wikipedia.org/wiki/Agile_software_development

28 https://www.re3data.org/

29 http://www.dfg.de/

30 https://creativecommons.org/

Reviewer response for version 2

Joachim schopfel.

1 GERiiCO Labor, University of Lille, Lille, France

The second version is fine with me. The authors replied to all comments; they fixed some issues, and they provided complementary arguments for other issues. I do not share all their viewpoints but that is science and not a problem. The paper is interesting and relevant.

Is the work clearly and accurately presented and does it cite the current literature?

If applicable, is the statistical analysis and its interpretation appropriate?

Not applicable

Are all the source data underlying the results available to ensure full reproducibility?

No source data required

Is the study design appropriate and is the work technically sound?

Are the conclusions drawn adequately supported by the results?

Are sufficient details of methods and analysis provided to allow replication by others?

Reviewer Expertise:

Information science

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Remedios Melero

1 Instituto de Agroquímica y Tecnología de Alimentos, CSIC, Valencia, Spain

I do not have any further comments.

Open science, open research data, scholarly publications, open access policies

Tibor Koltay

1 Institute of Learning Technologies, Eszterházy Károly University, Eger, Hungary

I am satisfied with the author’s reply, and found the other two reviews’ comments intriguing and useful for the authors. I have no further comments.

Reviewer response for version 1

The research data management is a central dimension of the development of scientific research and related infrastructures. Also, any original attempt to define research data is welcome and helpful for the understanding of this field. This conceptual paper will be a valuable contribution to the discussion on the research but. Yet, it should be improved, and a couple of more or less minor issues should be fixed.

First, all cited text should be systematically translated into English.
The main concepts (such as data, information, knowledge...) should be defined from the beginning on and not only later (section 3). The definitions should not be based on Wikipedia, Larrousse etc but on academic works in the field of information science (eg ISKO).
Open science is a fuzzy concept, an umbrella term or even a "boundary object" (as Samuel Moore put it). But it should be made clear that open science is more than "sharing and dissemination of research outputs" (as in the [5] citation).
The former comment is important because the approach of the paper is in some kind limited or reduced to the aspect of "research output". Generally, in the research process, research software and research data are not only output but also tools (software) and input (data). This needs clarification.
In the same context, the paper cites Wikipedia with " We must all accept that science is data and that data are science". I have two problems with this: nobody must accept anything in science, all is matter of discussion; and this sentence is either trivial or it makes no sense. My advice would be to avoid these kind of sentences.
Later on, the paper presents "analogies" between RS and RD. Analogy, even if it exists, does not mean "similarity", and I think that this comparison is somehow misleading because the underlying assumption is not entirely correct ("a definition for RD can be proposed following the main features of the RS definition"). Software and data are different objects, with different issues (IP protection, communities etc.); the analysis of RS may be helpful for a better understanding of RD but this does not mean that both are more or less similar or even "fungible".
In section 3, I would suggest that the paper tries to describe the relationship between RS and RD, perhaps with "use cases".
I admit that the authors are not legal experts but section 3 should be more explicit (and perhaps shorter and more restrictive) about the different laws and legal frameworks. Are you speaking about French laws? Or about the EU regulation?
Another, related issue is the data typology. The paper is about research data but section 3 mentions (and apparently does not differentiate) environmental data, cultural data and public sector information.
My suggestion would be to improve the structure of section 3 and to distinguish between concepts, typology, legal status and reuse/policy (subsections).
Section 4: I already mentioned it above - RD is not only output but also input, with different issues (third party rights etc). This requires clarification.
At the end of section 4, the paper states that "documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD". The meaning of this statement requires attention, as well as its implications. Does this mean that "RDM and other documents" are data? Or that they may be part of a RD deposit? But again (see above, comment 5), a statement that "all is data" is not helpful; it may make sense as a political catchword but not in an academic paper.
Last comment: I like very much Borgman's assessment of RD and her "conundrum challenges" but I have a somewhat different understanding of the meaning of this - for me, these "challenges" are questions that require attention and evaluation in a given situation, not for all RD in a general way. For me, they provide a kind of "reading grid" to analyse a specific data community, or a specific instrument or infrastructure or workflow; but they don't require or demand a comprehensive response as such provided by the paper.
Anyway, the paper is an interesting contribution to the academic research on RD, and I am looking forward to read a revised and improved version. Thank you!

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

LIGM, Gustave Eiffel University & CNRS, France

Many thanks to you, Joachim Schopfel, for your interesting comments that give us the opportunity to improve this work. A new version is in preparation, but we provide here some answers to your comments.

1. [translations into English]

Translations are included as footnotes, they will be moved to the main text.

2. [information science (eg ISKO).]

Many thanks for this reference, we are looking into it.

3. [Open science is a fuzzy concept…]

As indicated in the introduction: A more transversal and global vision can be found in the ongoing work for the UNESCO Recommendation on Open Science [Reference 6]. See also [Reference 7] . We will explain better this point.

4. [the paper is in some kind limited or reduced to the aspect of "research output". Generally, in the research process, research software and research data are not only output but also tools (software) and input (data). This needs clarification.]

In our view, each "research output" is a potential input for new research work. For example a RS can be a tool to manipulate data or an input for a new RS, this can be in the form of a component, or in the form of a new version done by the initial research team or another one. A RD can be used by other teams (as a tool) to understand some problem, it can be modified to produce a new RD, or it can be included as part of a larger data set, that can be as well a new RD. To better understand the production context is not, in our view, a limitation. But you are right, this point needs clarification.

5. [cites Wikipedia with " We must all accept that science is data and that data are science ".]

Please note that this cited phrase comes from [Reference 4], and as indicated to Referee T. Koltay, we have chosen to do this reference in a slightly different manner as done in the Borgman’s work, where we have found it.

6.1. [similarity/analogy]

When consulting Cambridge English Learner’s Dictionary dictionary we find:

analogy: a comparison that shows how two things are similar

6.2. [Software and data are different objects, with different issues (IP protection, communities etc.); the analysis of RS may be helpful for a better understanding of RD but this does not mean that both are more or less similar or even "fungible".]

It is one of the intentions of the present work to show the differences between data and software form the legal point of view. While software finds a somehow clear and simple presentation (Section 2.1), data is much more difficult to grasp, as studied in Section 3. But this is not an obstacle to present an unified vision of RS and RD as research outputs, as we can see in the RS and RD proposed definitions. The fact that we can propose a similar formulation for both definitions allows us to propose similar dissemination and evaluation protocols as you can find in the article that follows this work [Reference 13]. The fact that we can deal with RS and RD in a similar way does not mean that they are similar.

7. [describe the relationship between RS and RD, perhaps with "use cases".]

It seems to us that it is quite usual for the targeted research audience to use and/or produce RS and/or RD as part of their everyday research practices, and that this point does not require further explanation. Examples can be found easily in the literature, as for example in the bibliography included at the end of this work.

8. [I admit that the authors are not legal experts but section 3 should be more explicit (and perhaps shorter and more restrictive) about the different laws and legal frameworks. Are you speaking about French laws? Or about the EU regulation?]

As indicated in the introduction, we have consulted legal texts and legal experts’ work in order to understand and explain the legal context in which we place this work. We have consulted French, European and USA texts, and selected the parts that we have used to document the article. We consider that our role is restricted to this intention, due to the lack or further expertise in legal matters, which does not hide the efforts we have put in to understand and to explain some legal issues. But we are unable to give more information on the regulations that can be taken into consideration, as this is the role of legal experts in the light of a well defined setting.

9. [Another, related issue is the data typology. The paper is about research data but section 3 mentions (and apparently does not differentiate) environmental data, cultural data and public sector information.]

The goal of Section 3 is to show the difficulties existing to set a data definition from the legal point of view, which is a very different context as the one existing for software, as shown in Section 2.1. The case of cultural data is very interesting, as legally speaking [Reference 19] the quality of the producer legal entity defines the cultural quality of the data . Then we can establish the parallel with the quality of research for some data set, as the consequence of the research quality of the producer team. Data typology could be the object of future work.

10. [My suggestion would be to improve the structure of section 3 and to distinguish between concepts, typology, legal status and reuse/policy (subsections).]

We will consider this suggestion

11. [Section 4: I already mentioned it above - RD is not only output but also input, with different issues (third party rights etc). This requires clarification.]

As already explained, we study in here the production aspects, and other aspects are presented in [Reference 13]. But you are right, this needs better explanation.

12. [At the end of section 4, the paper states that "documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD". ]

Section 2.1 shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of a RS, following the choice of the RS producer team. There can be other elements as for example tests, input and output files to illustrate how to use the RS, licenses, etc. To include these elements in the released RS correspond to best practices that facilitate RS reuse. In our view, to release a RD can follow similar practices, that is, to include a documentation, some use examples, a license, a data management plan…this is to be decided by the producer team.

13. [Last comment: I like very much Borgman's assessment of RD and her "conundrum challenges" but I have a somewhat different understanding of the meaning of this - for me, these "challenges" are questions that require attention and evaluation in a given situation, not for all RD in a general way. For me, they provide a kind of "reading grid" to analyse a specific data community, or a specific instrument or infrastructure or workflow; but they don't require or demand a comprehensive response as such provided by the paper.]

In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of the RD production. We think that to provide such vision as the one exposed in Section 4 could be of help to deal with these questions, and, as you said, as a first step to tackle some problems in a well determined situation. Moreover, this view proposed in Section 4 is extended and completed with the dissemination and evaluation protocols proposed in [Reference 13]. Our experience of many years confirms the need of these protocols for RS, and we think that they will be appropriated, useful and relevant for RD as well.

Teresa Gomez-Diaz and Tomas Recio

The authors proposed a Research Data (RD) definition "based in three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and has yield a result published or disseminated in some article or scientific contribution of any kind." From my point of view this definition restricts RD to those that are published by a scientific team, however what about the citizen science, or data produced by non-scientist staff? What about any other data that do not deserve be published but help to further research?

Authors say: "the RS is involved in the obtention of the results presented in scientific articles" - This is not necessarily true. RS is not always involved in the obtention of results because it can be developed for any other purpose, again the authors make a very strict definition.

Authors say: "As a matter of fact, a research team can use RS produced by other teams for their scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered in the making-of aspects which are pertinent for the proposed definition." - This restricts the definition of Research Software (RS) a lot by excluding all FLOSS produced by non-academic members.

The authors have missed any mention to the Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information, in which RD are defined and included as part of the public sector. In fact, the authors have cited it but they have not commented/mentioned the fact that RD has a wider meaning and that according to this Directive are considered public sector information, and they need not necessarily be published in a scientific journal but shared.

Definitions given by dictionaries are not particularly relevant to the scientific context/environment. I think this part should be omitted, it only adds some definitions in the authors' own languages.

"For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse." - This is part of the FAIR principles which are not mentioned/linked to this comment. Obviously, a dataset without any information about how data have been produced/obtained, etc. are not valuable.

Authors write: "In here, the research outputs have reach a status in which the research team is happy enough for its dissemination." - This seems a very naïve assertion. Because the authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.

Conclusions again repeat the proposal of a RD definition. Concepts like linked data, FAR data, and open data have not been treated in the article. Their definition of RD is very strict and narrow, and they have not considered any semantic issues about data and the benefits and implications of being a 5star open data . Their definition is far from the 4th or 5th step of the stars.

In general, from my point of view, the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.

Many thanks to you, Remedios Melero, for these very interesting comments. We are preparing a new version of this article and we will include several of the proposed corrections. Meanwhile, we would like to provide in here some preliminary comments.

1. [this definition restricts RD to those that are published by a scientific team, however what about the citizen science, or data produced by non-scientist staff?]

[the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.]

It would be strange to consider any article published in a newspaper as a scientific publication.

On the other hand, scientists may read the newspapers and many other documents, including tweets, and may use these documents as input information for a research work. As already explained in our answer to Rob Hooft’s comment, yes, we have chosen a restricted definition for RD. It allows us to provide the answers to the Borgman’s conundrum challenges that are in the Conclusion section. As far as we know, we have not found in the consulted literature the proposition of such kind of answers in this complete view. Moreover, as the RD definition finds a similar formulation as the RS one, we can also translate RS dissemination and evaluation protocols to RD [Reference 13]. Once we understand well the restricted context, it can be studied its extension and then see which are the answers to Borgman’s conundrum challenges and the dissemination and evaluation protocols that can be proposed in the extended context.

The fact that we do not include e.g. public sector data as RD is different from the claim that these data cannot be used as input for a research work. As explained in section 3.2 of [Reference 13], these external data components should be correctly presented and referenced, and some can also fall in the category of RD.

2. [RS is not always involved in the obtention of results because it can be developed for any other purpose, again the authors make a very strict definition.]

[This restricts the definition of Research Software (RS) a lot by excluding all FLOSS produced by non-academic members.]

You are right, this point should be explained better. To obtain a research result may involve the use of software (FLOSS or not FLOSS), the development of software to support some work or service, and the development of RS by the research team as explained in [References 3, 14]. Note that RS can be also disseminated as FLOSS, which is the usual practice in the work of T. Recio and in the research lab of T. Gomez-Diaz. This is also similar for data and RD, that can be disseminated as open data, as well as for publications and research articles as seen in the previous point.

3. [Research data defined in the Directive (EU) 2019/1024 ]

This definition was included in the preparation versions of the present article, and it will be included again in the new version in preparation, following your advice.

4. [Definitions given by dictionaries]

In the difficulties to explain easily the concepts of data and information we have ended in the consultation of several dictionaries, including some in English. Some of the found definitions, mainly in Spanish and French have attracted our attention and we have decided to included them in Box 1. This box can be easily skipped by readers not interested in these definitions.

We prefer to leave the reading of the content of this box at the choice of readers.

5. [FAIR and "For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse."]

Please note that FAIR principles appear in the [Reference 55] dated 2016, while [Reference 36] that we have chosen to illustrate the need for complementary, technical information is dated 2012. Moreover, this is also related to the importance of context, that is explained in the OECD Glossary of Statistical Terms, with PDF and WORD download versions dated 2007 [ https://stats.oecd.org/glossary/download.asp ]. On the other hand, FAIR principles are considered in the second part of this work [Reference 13], as they are related to dissemination issues. We will also mention them in the second version of this first part.

6. ["In here, the research outputs have reached a status in which the research team is happy enough for its dissemination."]

[authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.]

You are right, this point should be better explained in the new version of the article. It is not that we do not care about these important issues, as they are part of the 3 rd step of the proposed CDUR evaluation protocol for RS and RD, see sections 2.3 and 3.3 of [Reference 13].

7. [Concepts like linked data, FAIR data, and open data have not been treated in the article. Their definition of RD is very strict and narrow, and they have not considered any semantic issues about data and the benefits and implications of being a 5star open data . Their definition is far from the 4th or 5th step of the stars.]

Please note that FAIR data and open data are treated in [Reference 13]. We will include in the second version the mention of the 5star open data, many thanks for this reference.

Teresa Gomez-Diaz, Tomas Recio

The content of the first two paragraphs of the paper (especially the first one) seems to be less appropriate, compared to the purpose of your paper. I would thus advise you to consider rewriting these paragraphs.

Your practice of providing the cited texts in the original language (French or Spanish) and providing the translations of these passages only in the footnotes is unusual and may be not appropriate for a readership that probably reads and writes only in English, or is not familiar with Spanish and/or French texts. As I see it, if you would want to make a favour to your readers, who prefer French or Spanish, the solution could be reverse this order, i.e. putting the original texts into the footnotes.

Other remarks

I think that it would be better if the following sentence would be changed as follows:

“Indeed, as remarked by Hanson et al ., we must all accept that science is data and that data are science… 4 ”

This regards not only the form of citing, but content, because this remark comes from Borgman’s Conundrum, cited in your paper a couple of times.

You describe three main characteristics of RS:

“the goal of the RS development is to do research. As stated by D. Kelly: it is developed to answer a scientific question 32 ,
it has been written by a research team,
the RS is involved in the obtention of the results presented in scientific articles (as the most important means for scientific exchange are still articles published in scientific journals).”

In general, these three claims are correct. However, the first one of them is a little awkward. I would thus change it to anything like “the goal of the RS development is to support research. As stated by Kelly, it is developed to answer a general, or a specific scientific question. Writing the software requires close involvement of someone with deep domain knowledge in the application area related to the question. 32 ”. Theses sentences however may prove redundant, because you provide a more complete definition:

“Research Software is a well identified set of code that has been written by a (again, well identified) research team…If take this, linger definition only, the expression “(again, well identified)” should be deleted.

You write that “Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, as there is a difference between the mere idea for the plot of a novel and the final written work.” This is a brilliant idea, although I believe that it should not be restricted to the legal point of view.

In my view, it seems to be dangerous to write about copyright issues without being legal experts. Personally, I have only basic knowledge of copyright laws, so I cannot judge the correctness of all your argument. Fortunately, what you describe is also related to different issues.

I do not see any further problems. Therefore, I will not enumerate passages that are correct and rather straightforward. My suggestion is however, that you carefully review you text in order to reach clarity of your argument.

Many thanks to you, Tibor Koltay, for these very interesting comments. We are preparing a new version of this article and we will include several of the proposed corrections. Meanwhile, we would like to provide in here some preliminary comments.

1. [first two paragraphs]

We have chosen to start in a ''light'' manner an article that can ask for some effort to be understood, this is our author’s choice. It is the reader’s choice to skip these two first paragraphs or to enjoy them, as this does not have any consequence for the understanding of the content of the article.

2. [translations to English]

We agree with you, the translations to English in the footnotes may hinder the fluent reading of this work, we will modify the presentation.

3. [Hanson et al. Reference]

You are right, we have found this reference in Borgman’s work, but we have consulted the original article and we have chosen to do this reference in a slightly different manner.

4. [RS definition characteristics]

We will modify the phrase to include your proposition as follows: “the goal of the RS development is to do or to support research’’. Please note that the composition of a research team involved in the

development of a RS has been thoroughly studied in section 2.2 of [Reference 3]. We will include this reference to clarify this point as you ask. Please, also note that long developments may involve many different contributions from developers with different status. As copyright issues enter into play, it is important that the RS developers and contributors are correctly listed.

5. [Algorithms and software]

Comparisons between algorithms and software can be done in several contexts, for example in mathematics, or in computer science, among others. We have highlighted the legal aspects as we detect regularly the confusion between these two concepts, and the [Reference 22] providers a pretty clear explanation.

6. [Copyright issues]

Please note that one of the authors has study copyright issues in order to write [Reference 2], work that has been validated by several experts, including legal experts. On the other hand, we are regularly in contact and follow the work of legal experts, in such a manner as to provide us with the necessary confidence to deal with copyright issues in the way we propose in this article. The remark included at the end of the Introduction gives the necessary warning to our readers on this point.

Database Search

What is Database Search?

Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more.

The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places. Use Database Search to identify and connect to the best databases for your topic.

In addition to digital content, you will find specialized search engines used in specific scholarly domains.

Related Services & Tools

Home » Research Data – Types Methods and Examples

Research Data – Types Methods and Examples

Table of Contents

Research Data

Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question.

It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in scientific inquiry and is often subject to rigorous analysis, interpretation, and dissemination to advance knowledge and inform decision-making.

Types of Research Data

There are generally four types of research data:

Quantitative Data

This type of data involves the collection and analysis of numerical data. It is often gathered through surveys, experiments, or other types of structured data collection methods. Quantitative data can be analyzed using statistical techniques to identify patterns or relationships in the data.

Qualitative Data

This type of data is non-numerical and often involves the collection and analysis of words, images, or sounds. It is often gathered through methods such as interviews, focus groups, or observation. Qualitative data can be analyzed using techniques such as content analysis, thematic analysis, or discourse analysis.

Primary Data

This type of data is collected by the researcher directly from the source. It can include data gathered through surveys, experiments, interviews, or observation. Primary data is often used to answer specific research questions or to test hypotheses.

Secondary Data

This type of data is collected by someone other than the researcher. It can include data from sources such as government reports, academic journals, or industry publications. Secondary data is often used to supplement or support primary data or to provide context for a research project.

Research Data Formates

There are several formats in which research data can be collected and stored. Some common formats include:

Text : This format includes any type of written data, such as interview transcripts, survey responses, or open-ended questionnaire answers.
Numeric : This format includes any data that can be expressed as numerical values, such as measurements or counts.
Audio : This format includes any recorded data in an audio form, such as interviews or focus group discussions.
Video : This format includes any recorded data in a video form, such as observations of behavior or experimental procedures.
Images : This format includes any visual data, such as photographs, drawings, or scans of documents.
Mixed media: This format includes any combination of the above formats, such as a survey response that includes both text and numeric data, or an observation study that includes both video and audio recordings.
Sensor Data: This format includes data collected from various sensors or devices, such as GPS, accelerometers, or heart rate monitors.
Social Media Data: This format includes data collected from social media platforms, such as tweets, posts, or comments.
Geographic Information System (GIS) Data: This format includes data with a spatial component, such as maps or satellite imagery.
Machine-Readable Data : This format includes data that can be read and processed by machines, such as data in XML or JSON format.
Metadata: This format includes data that describes other data, such as information about the source, format, or content of a dataset.

Data Collection Methods

Some common research data collection methods include:

Surveys : Surveys involve asking participants to answer a series of questions about a particular topic. Surveys can be conducted online, over the phone, or in person.
Interviews : Interviews involve asking participants a series of open-ended questions in order to gather detailed information about their experiences or perspectives. Interviews can be conducted in person, over the phone, or via video conferencing.
Focus groups: Focus groups involve bringing together a small group of participants to discuss a particular topic or issue in depth. The group is typically led by a moderator who asks questions and encourages discussion among the participants.
Observations : Observations involve watching and recording behaviors or events as they naturally occur. Observations can be conducted in person or through the use of video or audio recordings.
Experiments : Experiments involve manipulating one or more variables in order to measure the effect on an outcome of interest. Experiments can be conducted in a laboratory or in the field.
Case studies: Case studies involve conducting an in-depth analysis of a particular individual, group, or organization. Case studies typically involve gathering data from multiple sources, including interviews, observations, and document analysis.
Secondary data analysis: Secondary data analysis involves analyzing existing data that was collected for another purpose. Examples of secondary data sources include government records, academic research studies, and market research reports.

Analysis Methods

Some common research data analysis methods include:

Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data.
Inferential statistics: Inferential statistics involve using statistical techniques to draw conclusions about a population based on a sample of data. Inferential statistics are often used to test hypotheses and determine the statistical significance of relationships between variables.
Content analysis : Content analysis involves analyzing the content of text, audio, or video data to identify patterns, themes, or other meaningful features. Content analysis is often used in qualitative research to analyze open-ended survey responses, interviews, or other types of text data.
Discourse analysis: Discourse analysis involves analyzing the language used in text, audio, or video data to understand how meaning is constructed and communicated. Discourse analysis is often used in qualitative research to analyze interviews, focus group discussions, or other types of text data.
Grounded theory : Grounded theory involves developing a theory or model based on an analysis of qualitative data. Grounded theory is often used in exploratory research to generate new insights and hypotheses.
Network analysis: Network analysis involves analyzing the relationships between entities, such as individuals or organizations, in a network. Network analysis is often used in social network analysis to understand the structure and dynamics of social networks.
Structural equation modeling: Structural equation modeling involves using statistical techniques to test complex models that include multiple variables and relationships. Structural equation modeling is often used in social science research to test theories about the relationships between variables.

Purpose of Research Data

Research data serves several important purposes, including:

Supporting scientific discoveries : Research data provides the basis for scientific discoveries and innovations. Researchers use data to test hypotheses, develop new theories, and advance scientific knowledge in their field.
Validating research findings: Research data provides the evidence necessary to validate research findings. By analyzing and interpreting data, researchers can determine the statistical significance of relationships between variables and draw conclusions about the research question.
Informing policy decisions: Research data can be used to inform policy decisions by providing evidence about the effectiveness of different policies or interventions. Policymakers can use data to make informed decisions about how to allocate resources and address social or economic challenges.
Promoting transparency and accountability: Research data promotes transparency and accountability by allowing other researchers to verify and replicate research findings. Data sharing also promotes transparency by allowing others to examine the methods used to collect and analyze data.
Supporting education and training: Research data can be used to support education and training by providing examples of research methods, data analysis techniques, and research findings. Students and researchers can use data to learn new research skills and to develop their own research projects.

Applications of Research Data

Research data has numerous applications across various fields, including social sciences, natural sciences, engineering, and health sciences. The applications of research data can be broadly classified into the following categories:

Academic research: Research data is widely used in academic research to test hypotheses, develop new theories, and advance scientific knowledge. Researchers use data to explore complex relationships between variables, identify patterns, and make predictions.
Business and industry: Research data is used in business and industry to make informed decisions about product development, marketing, and customer engagement. Data analysis techniques such as market research, customer analytics, and financial analysis are widely used to gain insights and inform strategic decision-making.
Healthcare: Research data is used in healthcare to improve patient outcomes, develop new treatments, and identify health risks. Researchers use data to analyze health trends, track disease outbreaks, and develop evidence-based treatment protocols.
Education : Research data is used in education to improve teaching and learning outcomes. Data analysis techniques such as assessments, surveys, and evaluations are used to measure student progress, evaluate program effectiveness, and inform policy decisions.
Government and public policy: Research data is used in government and public policy to inform decision-making and policy development. Data analysis techniques such as demographic analysis, cost-benefit analysis, and impact evaluation are widely used to evaluate policy effectiveness, identify social or economic challenges, and develop evidence-based policy solutions.
Environmental management: Research data is used in environmental management to monitor environmental conditions, track changes, and identify emerging threats. Data analysis techniques such as spatial analysis, remote sensing, and modeling are used to map environmental features, monitor ecosystem health, and inform policy decisions.

Advantages of Research Data

Research data has numerous advantages, including:

Empirical evidence: Research data provides empirical evidence that can be used to support or refute theories, test hypotheses, and inform decision-making. This evidence-based approach helps to ensure that decisions are based on objective, measurable data rather than subjective opinions or assumptions.
Accuracy and reliability : Research data is typically collected using rigorous scientific methods and protocols, which helps to ensure its accuracy and reliability. Data can be validated and verified using statistical methods, which further enhances its credibility.
Replicability: Research data can be replicated and validated by other researchers, which helps to promote transparency and accountability in research. By making data available for others to analyze and interpret, researchers can ensure that their findings are robust and reliable.
Insights and discoveries : Research data can provide insights into complex relationships between variables, identify patterns and trends, and reveal new discoveries. These insights can lead to the development of new theories, treatments, and interventions that can improve outcomes in various fields.
Informed decision-making: Research data can inform decision-making in a range of fields, including healthcare, business, education, and public policy. Data analysis techniques can be used to identify trends, evaluate the effectiveness of interventions, and inform policy decisions.
Efficiency and cost-effectiveness: Research data can help to improve efficiency and cost-effectiveness by identifying areas where resources can be directed most effectively. By using data to identify the most promising approaches or interventions, researchers can optimize the use of resources and improve outcomes.

Limitations of Research Data

Research data has several limitations that researchers should be aware of, including:

Bias and subjectivity: Research data can be influenced by biases and subjectivity, which can affect the accuracy and reliability of the data. Researchers must take steps to minimize bias and subjectivity in data collection and analysis.
Incomplete data : Research data can be incomplete or missing, which can affect the validity of the findings. Researchers must ensure that data is complete and representative to ensure that their findings are reliable.
Limited scope: Research data may be limited in scope, which can limit the generalizability of the findings. Researchers must carefully consider the scope of their research and ensure that their findings are applicable to the broader population.
Data quality: Research data can be affected by issues such as measurement error, data entry errors, and missing data, which can affect the quality of the data. Researchers must ensure that data is collected and analyzed using rigorous methods to minimize these issues.
Ethical concerns: Research data can raise ethical concerns, particularly when it involves human subjects. Researchers must ensure that their research complies with ethical standards and protects the rights and privacy of human subjects.
Data security: Research data must be protected to prevent unauthorized access or use. Researchers must ensure that data is stored and transmitted securely to protect the confidentiality and integrity of the data.

About the author

Muhammad Hassan

Researcher, Academic Writer, Web developer

Secondary Data – Types, Methods and Examples

Qualitative Data – Types, Methods and Examples

Primary Data – Types, Methods and Examples

Quantitative Data – Types, Methods and Examples

Information in Research – Types and Examples

All Subjects

study guides for every class

That actually explain what's on your next test, research database, from class:, documentary production.

A research database is a structured collection of data and information that allows users to search, retrieve, and analyze content from various sources efficiently. These databases often include articles, books, multimedia, and archival materials relevant to specific subjects or fields, making them vital tools for gathering information and conducting in-depth analysis for various projects. They play a significant role in informing documentary production by providing factual content and sources during the research and development phases.

congrats on reading the definition of Research Database . now let's actually learn it.

5 Must Know Facts For Your Next Test

Research databases can be subject-specific or multidisciplinary, offering a wide range of materials to support diverse documentary topics.
Many academic institutions provide access to online research databases that include peer-reviewed articles, primary sources, and multimedia content.
Users can utilize advanced search features in research databases to filter results based on criteria such as publication date, author, or type of material.
Citing sources from a research database is essential for maintaining credibility and academic integrity in documentary production.
Familiarity with various research databases can enhance the efficiency and effectiveness of the research process during project development.

Review Questions

Utilizing a research database significantly streamlines the research process by providing structured access to a wealth of information. It allows filmmakers to quickly locate relevant articles, multimedia resources, and archival materials essential for their documentaries. The advanced search features enable targeted queries that save time and enhance the quality of content gathered, ensuring that the final product is well-researched and factually accurate.
Common features of research databases include advanced search functionalities, metadata tags for better content understanding, and tools for organizing and managing references. These elements contribute to effective project development by allowing filmmakers to filter their searches, easily access detailed information about each source, and compile references efficiently for citation purposes. Such features help ensure that all aspects of the documentary are well-supported by credible evidence.
Incorporating data from research databases greatly enhances the credibility of documentaries by grounding them in well-documented evidence and authoritative sources. This practice allows filmmakers to present balanced narratives supported by extensive research rather than anecdotal evidence or unverified claims. As audiences increasingly seek factual accuracy in media content, utilizing credible information from established databases not only boosts the documentary's reliability but also fosters trust with viewers, which is critical for impactful storytelling.

Related terms

Archival Research : The process of locating, accessing, and analyzing historical documents and records that provide context and depth to a documentary subject.

Metadata : Data that provides information about other data, helping users understand the context, content, and organization of information within a database.

Literature Review : A comprehensive survey of existing research and publications on a specific topic, often used to identify gaps in knowledge or establish a foundation for new research.

" Research Database " also found in:

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

Knowledge Base

Methodology

Research Methods | Definitions, Types, Examples

Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.

First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :

Qualitative vs. quantitative : Will your data take the form of words or numbers?
Primary vs. secondary : Will you collect original data yourself, or will you use data that has already been collected by someone else?
Descriptive vs. experimental : Will you take measurements of something as it is, or will you perform an experiment?

Second, decide how you will analyze the data .

For quantitative data, you can use statistical analysis methods to test relationships between variables.
For qualitative data, you can use methods such as thematic analysis to interpret patterns and meanings in the data.

Methods for collecting data, examples of data collection methods, methods for analyzing data, examples of data analysis methods, other interesting articles, frequently asked questions about research methods.

Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.

Qualitative vs. quantitative data

Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.

For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .

If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .


Qualitative		to broader populations. .
Quantitative		.

You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.

Primary vs. secondary research

Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).

If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.


Primary	.	methods.
Secondary

Descriptive vs. experimental data

In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .

In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .

To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.


Descriptive		. .
Experimental

Here's why students love Scribbr's proofreading services

Discover proofreading & editing

Research methods for collecting data
Research method	Primary or secondary?	Qualitative or quantitative?	When to use
	Primary	Quantitative	To test cause-and-effect relationships.
	Primary	Quantitative	To understand general characteristics of a population.
Interview/focus group	Primary	Qualitative	To gain more in-depth understanding of a topic.
Observation	Primary	Either	To understand how something occurs in its natural setting.
	Secondary	Either	To situate your research in an existing body of work, or to evaluate trends within a research topic.
	Either	Either	To gain an in-depth understanding of a specific group or context, or when you don’t have the resources for a large study.

Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.

Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.

Qualitative analysis methods

Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:

From open-ended surveys and interviews , literature reviews , case studies , ethnographies , and other sources that use text rather than numbers.
Using non-probability sampling methods .

Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .

Quantitative analysis methods

Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).

You can use quantitative analysis to interpret data that was collected either:

During an experiment .
Using probability sampling methods .

Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.

Research methods for analyzing data
Research method	Qualitative or quantitative?	When to use
	Quantitative	To analyze data collected in a statistically valid manner (e.g. from experiments, surveys, and observations).
Meta-analysis	Quantitative	To statistically analyze the results of a large collection of studies. Can only be applied to studies that collected data in a statistically valid manner.
	Qualitative	To analyze data collected from interviews, , or textual sources. To understand general themes in the data and how they are communicated.
	Either	To analyze large volumes of textual or visual data collected from surveys, literature reviews, or other sources. Can be quantitative (i.e. frequencies of words) or qualitative (i.e. meanings of words).

Prevent plagiarism. Run a free check.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

Chi square test of independence
Statistical power
Descriptive statistics
Degrees of freedom
Pearson correlation
Null hypothesis
Double-blind study
Case-control study
Research ethics
Data collection
Hypothesis testing
Structured interviews

Research bias

Hawthorne effect
Unconscious bias
Recall bias
Halo effect
Self-serving bias
Information bias

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

The research methods you use depend on the type of data you need to answer your research question .

If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.

Is this article helpful?

Other students also liked, writing strong research questions | criteria & examples.

What Is a Research Design | Types, Guide & Examples
Data Collection | Definition, Methods & Examples

Get unlimited documents corrected

✔ Free APA citation check included ✔ Unlimited document corrections ✔ Specialized in correcting academic texts

What is a Database?

November 2005
The Journal of World Intellectual Property 5(6):981 - 1011
5(6):981 - 1011

University of Nottingham

Discover the world's research

25+ million members
160+ million publication pages
2.3+ billion citations

Putranegara Riauwindu

Jens Schovsbo
Olga Kokoulina

Rafał Witek
Recruit researchers
Join for free
Login Email Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google Welcome back! Please log in. Email · Hint Tip: Most researchers use their institutional email address as their ResearchGate login Password Forgot password? Keep me logged in Log in or Continue with Google No account? Sign up

REVIEW article

When we talk about big data, what do we really mean toward a more precise definition of big data.

$\r\nXiaoyao Han$

1 Department of Governance and Innovation, Campus Fryslan, University of Groningen, Leeuwarden, Netherlands
2 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Groningen, Netherlands

Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.

1 Introduction

As the amount of data being generated continues to grow over the last two decades ( Petroc, 2023 ), the need to process and analyze it became increasingly urgent. This led to the development of new tools and techniques for handling Big Data ( Khalid and Yousaf, 2021 ), such as distributed computing frameworks like Hadoop and Apache Spark, as well as machine learning algorithms like neural networks and decision trees. As researchers and practitioners in various fields began to integrate Big Data into their work, a body of literature emerged discussing its implementation, characteristics and features, as well as its potential benefits and limitations.

The concept of Big Data has become increasingly important for scientific research. Its characteristics have been extensively discussed in the literature, and numerous researchers have summarized the definitions of Big Data and its closely related features ( Khan et al., 2014 ; Chebbi et al., 2015 ). Epistemological discussions are common ( Kitchin, 2014 ; Ekbia et al., 2015 ; Succi and Coveney, 2019 ) and the recent emergence of new quantitative-based approaches, utilizing large text corpora,offers new opportunities to understand Big Data from different perspectives. Hansmann and Niemeyer (2014) , for example, apply a text mining approach to extract and characterize the elements of the Big Data concept by analyzing 248 publications relevant to Big Data. They focus on the concept of the term Big Data and summarize four dimensions to describe it: the dimensions of data, IT infrastructure, applied methods, and an applications perspective. Van Altena et al. (2016) , analyze a large number of articles from the biomedical literature to give a better understanding of the term Big Data by extracting relevant topics through topic modeling. Akoka et al. (2017) review studies using a systematic mapping approach based on more than a decade of academic papers relevant to Big Data. Their bibliometric review of Big Data publications shows that the research community has made a significant contribution in the field of Big Data, which is further evidenced by the continued increase in the number of scientific publications dealing with the concept.

Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. Many authors endeavor to provide comprehensive definitions of Big Data based on their research, aiming to better capture its essence. However, a universally accepted definition is yet to emerge. Instead, a commonly accepted description portrays Big Data through various “V” characteristics (volume, variety, velocity, etc.). Nonetheless, this widely accepted description does not ensure a profound common ground for discussing Big Data in different contexts. As a result, Big Data is still being perceived as a broad and vague concept, making it difficult to grasp in interdisciplinary contexts. Over the past few decades, there has been little systematic research on the position and practical implications of the term Big Data in research environments. When different authors spend significant portions of their studies discussing similar definitions and characteristics of Big Data, they are actually referring to different concepts (technology, platforms, or datasets), which is often not explicitly stated. Exploring this ambiguity is crucial, as it forms the basis for research and discussion. Many reviews aim to enrich a comprehensive understanding of Big Data, rather than retrospectively observing its actual usage in different contexts. we believe this inspection on the current useage sitation of Big Data will help to clarify the ambiguities and facilitates clearer communication among researchers by providing a understanding framework of what Big Data entails in different contexts.

To address this gap, this paper presents a Systematic Literature (SLR) on secondary studies ( Kitchenham et al., 2009 , 2010 ) to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.

The rest of this paper is structured as follows: we first present our systematic literature review methodology. The results are then divided into four sections: an overview of the bibliographic information of the extracted articles, a summary of the most prominent technologies used, a discussion on the technologies that are considered to be Big Data, and a presentation of the perceived benefits and drawbacks of Big Data. Finally, before conclusion we discuss our findings in terms of the understanding of Big Data and its value, and suggest avenues for further research.

2 Methodology

This study employs a SLR methodology to investigate the use of Big Data in scientific research. Our review, which follows the guidelines for SLR defined by Kitchenham et al. (2009) and Kitchenham et al. (2010) , enables a structured and replicable procedure that increases the reliability of research findings. An SLR study is referred to as a tertiary study when it applies the SLR methodology on secondary studies, i.e. it surveys literature reviews instead of primary studies. We selected a tertiary approach for this study due to the extensive range of Big Data technologies and applications in research. Collecting primary data from a large number of papers across domains can be a laborious task. By analyzing secondary data sources instead, our approach offers a comprehensive and holistic overview of the landscape of Big Data in the scientific community. Adopting this approach allows us to better control the bibliographic selection and obtain a comprehensive overview of the field, thereby ensuring high-quality data analysis.

The research questions addressed by this paper are:

1. Which technologies are considered as Big Data in different fields?

2. What Big Data implies in a scientific study context?

3. What is the perceived impact of the adoption of Big Data in each research domain?

For the purpose of answering RQ1, we have referenced the Big Data technology taxonomy developed by Patgiri (2019) . We anticipated that the reviewed papers would mention a broad spectrum of technologies, techniques, applications, and heterogeneities, which could pose a challenge in synthesizing the results. Therefore, we relied on Patgiri's taxonomy to guide our full-text review and data extraction process to identify the Big Data technologies utilized in the various scientific domains. This taxonomy offers a comprehensive overview of the various technical approaches to Big Data. By utilizing this taxonomy, we aim to initiate a discourse on what Big Data truly represents in each research domain, and how it has been understood by the respective scientific community. This discussion will enable us to further deliberate on RQ2, which examines the perception of Big Data in specific research domains. To structure our investigation for both research questions (RQ1 and RQ2) we utilize the subject taxonomy of Web of Science (WoS) to classify scientific fields, and apply no limitations to the scope of included scientific fields. As a result, both questions encompass a wide range of Big Data usage.

RQ2 is rooted in the fact that the scope of Big Data technologies has yet to be strictly defined and reach a consensus. Through this question we aim to investigate which technologies or objects are perceived as constituting Big Data in specific research domains. By these means we attempt to shed light on how Big Data as a concept is implicitly understood in each domain by linking it to specific technologies. In addition to offering an overview of the technological landscape, RQ3 aims to delve deeper into the discussion of the impact of Big Data in each domain by examining the benefits and drawbacks of its adoption, as reported by the examined secondary studies. By doing so, we seek to gain a comprehensive understanding of the role of Big Data in scientific domains.

2.1 Study design

This study uses the framework of Population Intervention Comparison Outcome and Context (PICOC) ( Petticrew and Roberts, 2008 ) to guide the research questions and the systematic literature review process, with the aim of exploring the use of Big Data technologies across various scientific domains and understanding the perceived meaning and value of Big Data in the scientific community. The use of the PICOC standard allowed for a structured approach to conducting the study and ensured that all relevant aspects were considered in the data collection, refinement, extraction, and analysis processes ( Mengist et al., 2020 ). The Population of interest consists of studies that explore the state of the art of Big Data in different research domains. Intervention refers to systematic secondary studies, such as systematic literature reviews, mapping studies, and surveys. The Comparison in this study is between the perception of Big Data as a concept and the adoption of related technologies. The desired Outcome of this study is a conceptual mapping of how the term Big Data applied in research across domains. In terms of Context, the study focuses on research domains that use the term Big Data to explore the potential of its technology and tool in advancing research without any restriction on the type of domains.

Figure 1 summarizes the process followed by this SLR. Each of the steps in this process is discussed in more detail in the following.

Figure 1 . Process of the research methodology.

2.2 Data collection

Our aim for data collection was to explore as wide a range of studies in as many different domains as possible. As a result, we decided to retrieve candidate studies through queries on the online databases of Elsevier Scopus, Web of Science, and PubMed. Scopus is a database containing 77 million records from almost 5,000 international publishers ( Scopus Content Coverage Guide, 2023 ), Web of Science is one of the world's leading databases which covers all domains of science ( Cumbley and Church, 2013 ), and PubMed is a comprehensive medicine and biomedical database ( Falagas et al., 2007 ). Our search strategy for all three databases after a few stages of pilot searches, was to look for publications with the exact phrase “big data” and either technology, platform, application, or adoption in the title, along with review, survey, or “mapping study” in either title or keywords.

The specific search query for each database is shown in Figure 2 . As the databases use slightly different tags to identify search fields, the search query expression varied but had the same conceptual content. For example, Scopus used the tag TITLE for the title field, whereas WoS used TI and PubMed used square brackets behind the term. For the keywords field, Scopus automatically combined author keywords, index terms, trade names, and chemical names in the tag KEY. WoS provided two relevant fields for keywords, AK representing author keywords and KP representing keywords derived from the titles of article references based on a special algorithm. PubMed placed author keywords and other terms indicating major concepts in the field Other Term. We obtained 490 results from Scopus, 116 from WoS, and 18 from PubMed, resulting in a total of 624 articles before removing duplicates.

Figure 2 . Search queries and screening process.

2.3 Data selection

We established inclusion and exclusion criteria listed in Table 1 to ensure the relevance of the retrieved studies to be included in our analysis. In terms of inclusion criteria, we only considered articles that were published in English between 2017 and 2022 (in order to ensure that the studies reflect the current state of the art in each field), and that were secondary studies. Specifically, we focused on comprehensive reviews or systematic literature reviews that provided an overview of Big Data adoption in a particular research field, rather than articles that focused solely on the application of Big Data technologies to address individual research questions or on the improvement of Big Data technologies themselves. Moreover, we only included articles that were published in their entirety in journals, rather than conference proceedings or abstracts. In contrast, we excluded early access and not final stage articles to avoid potential bias or incomplete information. Publications in languages other than English were also excluded due to language barriers. Additionally, we excluded primary studies, as our aim was to focus on the adoption, application, and implications of Big Data technologies in specific scientific domains, rather than specific research questions.

Table 1 . Inclusion and exclusion for data selection.

By implementing these criteria, we were able to ensure that the articles selected for our analysis were recent, relevant, and provided a comprehensive view of the Big Data landscape. Moreover, by focusing on secondary studies, we were able to avoid redundant coverage of primary research studies and instead obtain a broader perspective on the field. Ultimately, this approach allowed us to generate a more comprehensive and robust understanding of the different Big Data technologies, applications, and techniques being used in various scientific domains. This step resulted in a corpus of 76 records after removing the duplicates.

2.4 Quality assessment

To ensure that the papers selected for full-text review meet the requirements of our study, we then conducted a Quality Assessment (QA) based on five standards:

1. Is Big Data adoption discussed at sufficient depth?

2. Is a specific methodology being used in the secondary study?

3. Are the bibliographic sources used included in the study design?

4. Is the number of included primary studies clear in the study?

5. Are the results well-organized?

Firstly, we examined whether the papers provide an overview of Big Data adoption in a particular research field, excluding other aspects such as the application of Big Data technologies that focus on individual research questions or the improvement of Big Data Technologies themselves. Secondly, we assessed whether the selected papers clearly define their research methodology or base it on a specific paradigm. Thirdly, we evaluated whether the papers clearly specify their bibliographic sources for the secondary studies. Fourthly, we documented whether the number of primary papers that are contained in the secondary studies is also clearly stated. Finally, we evaluated whether the results of the papers are well-organized around research questions.

We scored the papers based on these criteria in a binary fashion (fulfilled/not fulfilled), and to be considered for inclusion in our study, we set a threshold of at least three out of five standards that must be fulfilled. While QA is not meant in principle to be used for an additional filtering, we decided that given the observed wide disparities in the quality of the collected secondary studies, setting this threshold is justified. The resulting selected articles primarily use systematic literature reviews, although some pure review articles that are relevant to our study have also been included. This step leads to 33 studies left for further processing.

2.5 Snowballing

In addition to the papers collected, we employed the forward snowball sampling method ( Wohlin, 2014 ) to supplement our database. For each article under full-text review passing the QA step, we reviewed all the titles of its references and the articles citing it. We added records that met our inclusion and exclusion criteria and were relevant to our topics. As we are only interested in recent studies on the topic, we did not perform backward snowballing. Ultimately, we selected only one additional study for inclusion, resulting in 34 studies in total for data extraction. All authors actively contributed to each step of the research process, engaging in thorough discussions and collaborations. In every stage outlined above, one researcher took the lead and collaborated with the other two authors to ensure comprehensive coverage and rigorous analysis.

2.6 Data extraction

The data extraction form shown in Table 2 was designed to answer the research questions in this study. The extraction fields can be divided into three categories. The first part concerns demographics-related data to be extracted from each study in order to check for possible bias toward specific fields or publication venues. The second part is used to answer RQ1 and RQ2. For the Research Field aspect, we applied the subject classification from WoS to ensure consistency and heterogeneity of the level of the discipline described. This also enabled us to make cross-sectional comparisons and statistical summaries. Apart from the WoS categories, we added another field to specify the sub-level of the Application Domain in which the secondary study takes place. This helps us to better connect the technologies collected with their intended use cases. In terms of the technologies and applications that we collected, due to the wide range of information described from different perspectives and heterogeneity issues, we added one more column to clarify and support the documentation, namely Perspective. This column records the perspective from which the Big Data technologies are described. For example, some papers only summarized the use cases of machine learning methods as applications of Big Data. In this way, besides the concrete names of methods and models documented in the Technologies column (including also specific Big Data applications or platforms), the perspective will be marked as machine learning. This helps us to further analyze and categorize the discussed technologies. Reported shortcomings and benefits of the perceived impact of Big Data from the authors are extracted in the third part. This helps us to answer RQ3.

Table 2 . Fields of data extraction form.

2.7 Threats to validity

The present study faces several potential threats to its validity. Firstly, the inclusion and exclusion criteria employed could introduce bias into the sample. If the criteria are too narrow, relevant articles may be omitted, while if they are too broad, irrelevant articles may be included, resulting in a less representative sample. To mitigate this threat, we thoroughly reviewed and refined our criteria through piloting to ensure their optimal balance between inclusivity and exclusivity. Secondly, the quality assessment of the chosen articles could be influenced by researcher bias, as different researchers might interpret the quality criteria differently. To address this issue, we utilized a standardized quality assessment tool during the evaluation process to ensure consistent interpretation of the criteria. Moreover, limiting the search to a specific publication date range could result in publication date bias, leading to an outdated or incomplete synthesis of the literature. To avoid this, we focused on secondary studies and restricted the publication date to the latest 5 years, which enabled us to include publications from a wide range of dates. To enhance the efficiency of the quality assessment, we established the quality assessment process before conducting the full-text review. This approach ensured that all selected articles were relevant to the topic and that their quality was assessed rigorously. Other possible threats to the validity of this study include publication bias and selective reporting of results. To mitigate these threats, we conducted a comprehensive search of multiple databases and employed the snowballing method, while carefully screening all articles for possible selective reporting.

This section presents the results of our study and is divided into several subsections. Firstly, we provide an overview of the background information of the articles in our corpus, including the distribution of disciplines, publishers, and popular databases cited for their secondary studies. The presentation of the results is using the classification of the secondary studies into research domains extracted for RQ1 and RQ2. In Section 3.2, we discuss our findings on the Big Data technologies extracted from our SLR in order to answer RQ1. In Section 3.3, we discuss the concepts that are perceived as Big Data as reported by the secondary studies in our corpus. Finally, we summarize the impact of the adoption of Big Data technologies in the various domains in Section 3.4.

3.1 Overview of bibliographic results

The following section presents an overview of the created corpus used in the analysis, providing statistics on domain distribution, publishers, and bibliographic sources. Domain distribution helps to identify the research background of the analyzed articles, while publishers' distribution sheds light on those who are active in the field of Big Data. Additionally, the bibliographic sources used by secondary studies provide insights into the popularity of different databases and sources among researchers in the field. This information is crucial since it can affect the quality of the data and the generalizability of the findings and ensure transparency for analysis.

Figure 3 shows the distribution of WoS-defined disciplines of the articles extracted for the study. Health care emerged as the discipline that includes the most Big Data articles, with COVID-19 as a distinct area of interest that we promoted to its own domain. Transportation and Business & Economics were the next most popular areas of study. While several cross-domain studies did not specify the application domain of Big Data, they did focus on a specific aspect of Big Data applications such as storage and cybersecurity. Disaster response studies identified in the Public, Environmental & Occupational Health category are also notable. Other disciplines appear only once; these include Material Science that focused on Aerospace, Telecommunications (e.g. mobile Big Data), Construction & Building Technology, and Agriculture. It should be noted that this distribution does not necessarily represent the entire scientific community, and no conclusions regarding the interest of studying or using Big Data can be drawn from this.

Figure 3 . Publisher distribution of included articles.

Our analysis of the articles in our corpus, as summarized in Figure 4 , showed that MDPI and Elsevier published the most studies related to Big Data (with nine each), followed by Springer and IEEE (with three each). Eight more publishers have one publication each in our corpus and are omitted from the figure.

Figure 4 . Publishers' distribution of the included secondary studies.

In terms of databases chosen for secondary studies, Figure 5 shows that WoS and Scopus were the most popular sources, being in alignment with our choice of databases for our study, followed by IEEE, ScienceDirect, Google Scholar, Wiley Online Library, Sage, PubMed, and Association for Computing Machinery's (ACM) Digital Library.

Figure 5 . Popularity of databases used by the secondary studies.

Overall, a wide diversity of publishing venues and data sources can be attested for our corpus, providing evidence toward a lack of selection bias for our review.

3.2 Big Data technologies

To address RQ1, we refer to Patgiri (2019) , who presents a taxonomy of Big Data consisting of seven key categories: Semantics, Compute Infrastructure, Storage System, Big Data Management, Big Data Mining, Big Machine Learning, and Security & Privacy. This taxonomy covers various aspects of Big Data technologies, including implementation tools, system architectures, and operational processes. We categorize and describe our findings based on the six perspectives (excluding Semantics) proposed by Patgiri. Since the Semantics perspective primarily concerns the conceptual understanding of Big Data, we exclude it from our summary of technologies.

The data extraction results from our study reveal the use of various Big Data technologies already discussed in Patgiri (2019) , including Xplenty, Apache Cassandra, MongoDB, Hadoop, Datawrapper, Rapidminer, Tableau, KNIME, Storm, Cloudera Distributed Hadoop(CDH), Kafka, Spark, Mapreduce, Hive, Pig, Flume, Sqoop, Apache Tez, Flink, and Storm. These technologies are used to extract, process, analyze, and visualize large volumes of data across a range of scientific domains.

From the Compute Infrastructure perspective, the most commonly used technology is Hadoop, 1 which is used for distributed computing and data processing. Various disciplines have been found utilizing Hadoop, including health care [S12, S63, S87, S133], environmental science [S77], and transportation [S194]. Other technologies, such as Apache Cassandra 2 and MongoDB, 3 are used for distributed data storage and retrieval. These technologies enable data processing at scale and facilitate parallel processing, enabling users to analyze large volumes of data quickly and efficiently. In terms of storage systems, Apache Cassandra, Hadoop Distributed File System (HDFS), 4 HBase, 5 and MongoDB are popular choices for distributed data storage. These technologies are designed to handle large volumes of data and provide high scalability and fault tolerance. The reported application domains for these technologies include Internet of Things (IoT), healthcare, decision making and electric power data [S94]. [S133] notes that NoSQL database systems such as MongoDB, Cassandra, and HDFS can be used to handle exponential data growth to replace the traditional database management systems. Additionally, technologies such as Flume 6 and Sqoop 7 enable the efficient transfer of data between different storage systems [S74].

Regarding the Big Data Management perspective, frameworks such as Hadoop, MapReduce, Hive, 8 and Pig 9 are used for managing and processing large volumes of data. These technologies enable users to process data in a distributed environment, thereby increasing efficiency and scalability [S74, S133, S194]. Other technologies, such as Apache Tez 10 and Flink, 11 provide high-performance data processing and streaming capabilities [S74]. From the Big Data Mining & Machine Learning aspect, a range of machine learning models are used to analyze large volumes of data, including Artificial Neural Network (ANN), fuzzy logic, Decision Tree (DT), regression, Support vector machine (SVM), Self-Organizing Map (SOM), Fuzzy C-Means (FCM), K-means, Genetic algorithm (GA), and Convolution Neural Networks (CNN), DT, Random Forest, Rotation Forest, ANN, Bayesian, Boosted Regression Tree, Classification And Regression Trees (CART), Conditional Inference Tree, Maxent, K-means, Non-negative Matrix Fact, and others [S19, S27, S65, S96, S178, S203, S207]. These techniques enable users to extract insights from large volumes of data, identify patterns and trends, and make predictions based on data analysis. Finally, as for security and privacy, Intrusion Detection Systems (IDS) are used to detect anomalous behaviors in complex environments. Machine learning models such as unsupervised online deep neural networks and deep learning techniques are used to identify and analyze Controller Area Network (CAN-Bus) attacks. Log parsers such as Zookeeper, 12 Proxifier, 13 BGL, HPC and HDFS are used to secure data and ensure privacy in distributed environments [S88]. These technologies provide access control, authentication, and encryption capabilities to protect data against unauthorized access and misuse.

Although many Big Data technologies and their applications are presented in this section, a significant portion of our research found that there are additional technologies and applications that are discussed outside of Patgiti's taxonomy, yet are still labeled as Big Data. In order to distinguish between these two cases and further deepen the understanding of Big Data and related technologies in the scientific field, we discuss the latter case in the following research question.

3.3 Perceptions of Big Data scope

In this section, we aim to provide an overview of the landscape on how Big Data is actually used based on our analysis of the literature to answer RQ2. Initially, we attempted to use the WoS categories to structure the discussion around domains; however, due to the heterogeneity issue at different angles, the difference in technology use was found to be significant even within one subject. As a result, we summarize the technologies according to the Perspective field recorded in the data extraction form. Table 3 summarizes our findings.

Table 3 . Perceptions of Big Data scope.

3.3.1 Big Data as abstract concepts or collections for application domains

In this category, the studies do not exemplify the Big Data technologies in concrete circumstances, but rather as concepts or collections for application domains. For example, [S67] discusses using text mining methods to explore the application of Big Data disciplines. The research focuses on the broad application of Big Data but does not include specific technologies. Based on the presented text mining method, Big Data technologies are identified automatically as keywords without further classification, and no concrete technologies and their use cases are explored. Similarly, [S104] studies the implementation of cloud computing, Big Data, and blockchain technology adoption in Enterprise Resource Planning (ERP), using only publication keywords generated from literature review and covering a broad spectrum. Why certain technologies are being referred to as Big Data, and how they are specifically utilized is missing, however. In another case, the term “Big Data” and related expressions such as “Big Data Analytics” and “Big Data Applications” are used as abstract concepts throughout the paper without a clear definition of the scope of the technology or little mention of the technology itself. For instance, in an attempt to identify privacy issues in Big Data applications, [S181] uses phrases such as “Big Data analytics” to refer to the technology, without focusing on any specific technologies.

3.3.2 Big Data as large data sources

This category of studies focuses on newly established large data sources, such as databases, websites, and crowd-sourcing platforms, which are taken as the application target of Big Data. The term Big Data is not strictly focused on the technologies but rather the volume or velocity of the data, with no specific techniques linked to those data sources. The application of Big Data in this case mainly means the use of large volumes of data. For example, [S132] explores the utilization of Big Data, smart and digital technologies, and artificial intelligence in the field of psoriatic arthritis studies. The authors cite a series of examples of how Big Data, combined with techniques such as artificial neural networks, natural language processing, k-nearest neighbors, and Bayesian models, can help intercept patients with psoriatic arthritis early. In this case, Big Data mainly refers to the repositories, registries, and databases that are generated from surveys, medical insurance data, vital registration data, etc. A similar research study in healthcare is [S47]. The authors list a number of databases worldwide with concrete gastrointestinal and liver diseases sample sizes, and techniques that are briefly mentioned for analysis are R, python, statistics, and Natural Language Processing (NLP). Also in ecology, ecological datasets such as AmeriFlux are considered Big Data as indicated by [S16].

3.3.3 Machine learning methods and models

This category refers to Big Data as models, algorithms, and statistical methods for managing large data sets, closely related to Machine Learning, Artificial Intelligence, Data Mining, Internet of Things, etc. Big Data technology is equivalent to the definition of machine learning or artificial intelligence, sometimes referred to as deep learning. Specific machine learning models are mentioned, such as ANN, Neural Network (NN), SVM, K-means, decision trees, and also improved models that are based on those methods, to solve practical problems, such as prediction for a certain disease, optimization of transportation. In our research, the majority of the technologies and applications extracted fall into this category. Popular models and techniques summarized are decision trees, random forests and support vector logistic regression models, naive Bayes models, decision trees, and k-nearest neighbors (k-NN) classic linear statistic models, Bayesian networks, and CNNs. There are also some studies (e.g. [S203, S207]) that summarize the applications of Big Data from their research field by discussing a series of concrete improvements of existing models.

3.3.4 Big Data ecosystem

The Big Data ecosystem category represents a comprehensive structure of the different levels of technology solutions that exist. Various applications such as Hadoop, MapReduce, and NoSQL are commonly mentioned in this category. The authors of those studies show a strong familiarity with how these technologies fit into their respective disciplines, demonstrating a deep understanding of the subject matter. In one study [S194], for example, the authors explained the different layers in an architecture for Big Data relating to traffic, and how applications such as Hadoop, MapReduce, HDFS, Apache Spark, Apache Hive, Hut, and Apache Kafka work together in the system. The article provides valuable insights into the role that these applications play in the overall architecture. Another study [S216] focused on the use of MapReduce and HDFS functionality in Big Data Architecture.

3.3.5 Outside of the classification

Some articles in our study did not fit into the categories outlined in our classification, as it is challenging to extract useful information from them. For instance, some studies e.g., [S12,S63,S77,S172] presented a comparison of Big Data frameworks sourced from other literature without exploring how these frameworks are utilized in their respective fields after introducing their background discipline. While these frameworks can be categorized into the fourth category, we found it difficult to ascertain how Big Data is applied and understood in these disciplines. As a result, we did not include them in Table 3 . Articles that do not fit into this classification do not imply that the classification is invalid for the study, but rather that we cannot accurately estimate the authors' understanding of Big Data.

3.4 Added value of Big Data

In this section, we summarize the perceived impact of Big Data adoption to each domain, as stated in the secondary studies and in order to answer RQ3. The impact of Big Data is categorized into benefits and shortcomings, which will be introduced below. Table 4 provides a summary of the main points extracted from the texts.

Table 4 . Summary of benefits and shortcomings of Big Data.

3.4.1 Benefits

The advantages of Big Data in various fields have been extensively researched and documented. Big Data tools have the ability to store, process and analyze large volumes and varieties of data in real time, enabling researchers to extract valuable insights and improve their performance. In healthcare, the integration of different scientific fields such as informatics, clinical sciences, and analytics has been facilitated by the application of Big Data [S12]. Further reported benefits of Big Data in healthcare include, for example, cost reduction in medical treatments, elimination of illness-related risk factors, disease prediction, and improved preventative care and medication efficiency analysis [S12, S47, S132, S181]. Large sample sizes have enabled reliable capture of small variations in incidence or disease flare, and epidemiological/clinical Big Data has been instrumental in guiding public and global health policies [S47, S132]. In the field of transportation, Big Data technologies have been used to improve the effectiveness of traffic crash detection and prediction research, with the aim of preventing the occurrences of traffic crashes and secondary crashes [S178, S194]. In the ecosystem service research, Big Data and machine learning tools have been used to address data availability, uncertainty, and socio-ecological gaps [S16]. In addition, Big Data analytics platforms have been useful in revealing previously overlooked correlations, market trends, and valuable information from a large amount of Big Data [S66]. Machine learning has also been used in the smart grid area to sift through Big Data and extract useful information that can aid in demand and generation pattern recognition [S51]. Overall, the benefits of Big Data are diverse and far-reaching, and its application has the potential to revolutionize research and improve outcomes in various fields.

3.4.2 Shortcomings

One of the primary challenges with Big Data is the assumption that having vast amounts of data guarantees accurate results. As [S74] notes, Big Data can give a false sense of security because having a lot of data does not necessarily mean that the results are valid. In the field of Psoriatic Arthritis, for example, there are significant variations in socio-demographic characteristics, co-morbidities, and major complication rates between individual (single- or multi-center) and database-based studies [S132]. This inconsistency raises concerns when critically appraising rheumatological and dermatological research, as well as risk adjustment modeling.

Another challenge with Big Data is that unnecessary utilization of Big Data can lead to a waste of resources, as it ties up computer resources [S74]. With the exponential growth of data, the storage and processing demands for Big Data have increased significantly. Unnecessary utilization of Big Data can exhaust computer resources and make them less available for other important tasks. This can result in increased costs for organizations, as they need to invest in more powerful computer systems to handle the increased demand. Big Data also poses physical challenges to current IT architecture, servers, and software. As pointed out by [S51], IoT devices generate a huge amount of data, which cannot be handled through conventional analysis techniques. While many data storage technologies have been proposed to store and process growing data, more efficient technologies are required for data acquisition, processing, pre-processing, storage, and management of Big Data [S94].

Security and privacy issues also arise with Big Data. In the healthcare industry, data security and patient privacy issues are a significant concern for authorities and patients [S66]. In the research concerning COVID-19 [S207], for example, there are ethical issues surrounding privacy, the use of personal data to limit the pandemic spread, and the need for security to protect data from being overused by technology.

Finally, there are challenges in managing data consistency, scalability, and integration. For instance, challenges in using Big Data in environmental Sciences include data cleansing, lack of labeled datasets, mismatched data ingestion, high costs of platforms, and lack of data governance and socio-technical infrastructure [S135].The transportation industry faces difficulties in collecting data from diverse sources and addressing quality concerns. When using Big Data in transportation, data collected from different sources needs to be analyzed to extract meaningful insights. However, this data often contains noise and uncertainty that must be addressed before use [S194].

In conclusion, several challenges need to be addressed to ensure that the use of Big Data is effective. These challenges include the need for integrated and comprehensive systems, efficient data acquisition and management technologies, security and privacy concerns, data consistency, scalability, as well as integration issues.

4 Discussion

Going beyond our findings as reported in the previous section, in the following we identify three specific topics emerging from our analysis that need further consideration. These are how Big Data is being understood across scientific domains, how the adoption of the Big Data occurs across disciplines, and how the perception of the added value that Big Data technologies bring looks like.

4.1 Understanding of Big Data

The challenge in analyzing research using Big Data technologies stems from the lack of a clear definition of what a Big Data technology is. The authors of the secondary studies surveyed through this study elaborate on their understanding of Big Data and its applications from different perspectives. This results in a wide spectrum of technologies which are labeled as Big Data, while significant differences between them remain, even when applied in the same field. Additionally, terms closely related to Big Data technology, such as AI, ML, Big Data analytics, Big Data platform, IoT, and Deep Learning are often used interchangeably. While these terms may not require definition in each individual article surveyed, their meanings and scope may overlap in practice. Therefore, to gain a clear understanding of Big Data, it would be helpful to have a clear distinction and consistent use of each term with more precision to avoid confusion. In other words, we argue that when reviewing the literature, the primary task is to understand what authors mean when they use concepts such as Big Data and Big Data technology, rather than simply extracting relevant technical applications and their context.

The results show that the concept of Big Data technology can be very broad. At the same time, these tools and technologies may be specific to a particular discipline and may not have interdisciplinary significance. This ambiguity has led to the widespread use of Big Data terminology in some cases. Our findings also revealed that some articles provide a vague description of Big Data, making it difficult to extract useful information that cannot be categorized under any of the categories in Section 3.3. These articles focus on the theoretical knowledge of Big Data and do not delve into its practical applications in different domains. As a result, they provide a pure comparison of Big Data frameworks cited in other literature without exploring how these frameworks are used in their respective domains. This approach contributes to pervasive references to Big Data, however, without providing a clear understanding of its practical applications.

In conclusion, this study highlights the need for a clear and comprehensive definition of Big Data technology and its practical applications in different domains. By doing so, we can avoid the misuse of Big Data terminology and gain a better understanding of the tools and technologies that are truly related to Big Data.

4.2 Range of disciplines represented in Big Data applications

In this study, we conducted a rigorous systematic literature review to investigate the applications of Big Data and used the WoS subject classification to categorize the related research fields. However, we acknowledge that this approach may not capture the full range of disciplines, as indicated by our analysis of the discipline distribution. While disciplines such as astronomy and high-energy physics are typically associated with managing large volumes of data ( Jacobs, 2009 ), we found that they were not represented in the papers we extracted. The lack of such data-intensive disciplines led us to question why they were not included, since our primary objective was to provide an overview of Big Data applications across the entire scientific community. We took great care to ensure that our SLR was not biased toward any specific domain by carefully selecting keywords and defining our inclusion and exclusion criteria. Additionally, we used comprehensive databases and an auditable and repeatable methodology. The absence of data-intensive subjects in our review suggests that some disciplines may be utilizing Big Data technologies without explicitly using the term “Big Data” in their research papers. One hypothesis is that computer science and related fields may consider the scale of data used as the norm, hence not explicitly mentioning “Big Data.” To understand this phenomenon better, we suggest a further step to investigate the underlying assumptions, such as the choice of terminologies, used in these disciplines as part of future work.

4.3 Value of Big Data

The enthusiasm surrounding Big Data arises from the belief that vast amount of information with the development of technologies, algorithms and machine learning techniques, can provide innovative insights that traditional research methods cannot achieve ( Agrawal et al., 2011 ; Chen and Zhang, 2014 ; Khalid and Yousaf, 2021 ). However, these optimistic views are not undisputed. As Big Data knowledge infrastructures emerge, researchers are increasingly discussing the challenges and limitations they present.

Numerous papers extol the merits of Big Data, with the main claim being that it offers unparalleled opportunities for scientific breakthroughs, leading to transformative research, as discussed in Section 3.4. The most significant advantage of the Big Data approach is its ability to address problems on larger and finer temporal and spatial scales, as well as provide information on which data are reliable or uncertain, thereby mapping ignorance ( Hampton et al., 2013 ; Harford, 2014 ; Isaac et al., 2020 ). Big Data also highlights blind spots and uncertainties in research, revealing gaps in existing knowledge ( Hortal et al., 2015 ). In this study, we have explored the benefits of Big Data tools. They can store, process, and analyze large volumes and varieties of data in real-time, enabling researchers to extract valuable insights and improve their performance. Big Data analytics platforms have proven useful in revealing previously overlooked correlations, market trends, and valuable information from large datasets. Additionally, machine learning has aided in sifting through Big Data to extract useful information for demand and generation pattern recognition in the smart grid.

However, according to Ekbia et al. (2015) , Big Data presents both conceptual and practical dilemmas based on a broad range of literature. They argue that an epistemological shift in science occurs due to the use of Big Data, where predictive modeling and simulation gain more importance than causal explanations based on repeatable experiments testing hypotheses. The authors Rosenheim and Gratton (2017) reject what they perceive as the suggestion of the most fervent proponents of Big Data that knowledge of correlation alone can replace knowledge of causality. They point out that understanding cause-and-effect relationships is critical in fields such as agricultural entomology, where research-oriented recommendations enable farmers to implement management actions that lead to desired outcomes. The study in Harford (2014) points out that conducting a correlation-based analysis without a theoretical framework is inherently vulnerable. Without understanding the underlying factors influencing a correlation, it becomes impossible to anticipate and account for factors that could potentially compromise its validity.

Similarly to the above, in our research, we found critical questions regarding whether vast amounts of data guarantee accurate results. Brady (2019) argues that social scientists must grasp the meaning of concepts and predictions generated by convoluted algorithms, weigh the relative value of prediction vs. causal inference, and cope with ethical challenges as their methods. Another notable challenge posed by Big Data that we found is managing data consistency, scalability and heterogeneity in different fields. Additionally, large amounts of data also pose challenges to IT architecture, servers and software. This was also testified by Fan et al. (2014) . They point out that the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. Furthermore, we found that the misuses of Big Data arouse ethical issues. This concern has already been widely discussed in previous literature, see for example Cumbley and Church (2013) and Knoppers and Thorogood (2017) .

In conclusion, while Big Data has generated much enthusiasm as a powerful tool for scientific breakthroughs, it is not without its challenges and limitations. Despite these challenges, Big Data holds promise as a tool for innovative research, and future work should continue to address these concerns while exploring new opportunities for knowledge discovery.

5 Conclusion

The lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there is a need to first retrospectively observe what Big Data actually means and refers to in concrete studies. This will help clarify ambiguities and enhance understanding of the role of Big Data in science. It will facilitate clearer communication among researchers by providing a framework for a common understanding of what Big Data entails in different contexts, which is crucial for interdisciplinary collaboration. To address this gap, we conducted a systematic literature review (SLR) of secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.

Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. All these aspects combined represent the true meaning of Big Data. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.

Author contributions

XH: Conceptualization, Writing – original draft. OG: Conceptualization, Supervision, Writing – review & editing. VA: Conceptualization, Methodology, Supervision, Validation, Writing – review & editing.

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. ^ https://hadoop.apache.org/

2. ^ https://cassandra.apache.org/

3. ^ https://www.mongodb.com/

4. ^ https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

5. ^ https://hbase.apache.org/

6. ^ https://flume.apache.org/

7. ^ https://sqoop.apache.org/

8. ^ https://hive.apache.org/

9. ^ https://pig.apache.org/

10. ^ https://tez.apache.org/

11. ^ https://flink.apache.org/

12. ^ https://zookeeper.apache.org/

13. ^ https://www.proxifier.com/

Agrawal, D., Bernstein, P. A., Bertino, E., Davidson, S. B., Dayal, U., Franklin, M. J., et al. (2011). Challenges and opportunities with big data. Cyber Center Technical Reports, 2011–1. Available at: https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1000&context=cctech

Google Scholar

Akoka, J., Comyn-Wattiau, I., and Laoufi, N. (2017). Research on Big Data – a systematic mapping study. Comput. Stand. Interf . 54, 105–115. doi: 10.1016/j.csi.2017.01.004

Crossref Full Text | Google Scholar

Brady, H. E. (2019). The challenge of Big Data and data science. Ann. Rev. Polit. Sci . 22, 297–323. doi: 10.1146/annurev-polisci-090216-023229

Chebbi, I., Boulila, W., and Farah, I. R. (2015). “Big data: concepts, challenges and applications,” inComputational Collective Intelligence , eds. M. Núñez, N. T. Nguyen, D. Camacho, and B. Trawiński (Cham: Springer International Publishing), 638–647. doi: 10.1007/978-3-319-24306-1_62

Chen, C., and Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf. Sci . 275, 314–347. doi: 10.1016/j.ins.2014.01.015

PubMed Abstract | Crossref Full Text | Google Scholar

Cumbley, R., and Church, P. C. (2013). Is “Big Data" creepy? Comput. Law Secur. Rev . 29, 601–609. doi: 10.1016/j.clsr.2013.07.007

Ekbia, H. R., Mattioli, M., Kouper, I., Arave, G., Ghazinejad, A., Bowman, T. D., et al. (2015). Big data, bigger dilemmas: a critical review. J. Assoc. Inform. Sci. Technol . 66, 1523–1545. doi: 10.1002/asi.23294

Falagas, M. E., Pitsouni, E., Malietzis, G., and Pappas, G. (2007). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. FASEB J . 22, 338–342. doi: 10.1096/fj.07-9492LSF

Fan, J., Han, F., and Liu, H. (2014). Challenges of Big Data analysis. Natl. Sci. Rev . 1, 293–314. doi: 10.1093/nsr/nwt032

Hampton, S. E., Strasser, C., Tewksbury, J. J., Gram, W. K., Budden, A. E., Batcheller, A. L., et al. (2013). Big data and the future of ecology. Front. Ecol. Environ . 11, 156–162. doi: 10.1890/120103

Hansmann, T., and Niemeyer, P. (2014). “Big Data - characterizing an emerging research field using topic models,” in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) (New York, NY: ACM). doi: 10.1109/WI-IAT.2014.15

Harford, T. (2014, March 28). Big data: Are we making a big mistake? Financial Times. Retrieved from: https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0 (accessed August 25, 2024).

Hortal, J., De Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., and Ladle, R. J. (2015). Seven shortfalls that beset large-scale knowledge of biodiversity. Annu. Rev. Ecol. Evol. Syst . 46, 523–549. doi: 10.1146/annurev-ecolsys-112414-054400

Isaac, N. J. B., Jarzyna, M. A., Keil, P., Dambly, L. I., Boersch-Supan, P. H., Browning, E., et al. (2020). Data integration for large-scale models of species distributions. Trends Ecol. Evol . 35, 56–67. doi: 10.1016/j.tree.2019.08.006

Jacobs, A. (2009). The pathologies of big data. Commun. ACM 52, 36–44. doi: 10.1145/1536616.1536632

Khalid, M., and Yousaf, M. H. (2021). A comparative analysis of big data frameworks: an adoption perspective. Appl. Sci . 11:11033. doi: 10.3390/app112211033

Khan, M. A.-U.-D., Uddin, M. J., and Gupta, N. (2014). Seven V's of Big Data understanding Big Data to extract value . doi: 10.1109/ASEEZone1.2014.6820689

Kitchenham, B., Brereton, O. P., Budgen, D., Seed, P. T., Bailey, J. E., Linkman, S., et al. (2009). Systematic literature reviews in software engineering – a systematic literature review. Inf. Softw. Technol . 51, 7–15. doi: 10.1016/j.infsof.2008.09.009

Kitchenham, B., Pretorius, R., Budgen, D., Brereton, O. P., Seed, P. T., Niazi, M., et al. (2010). Systematic literature reviews in software engineering – a tertiary study. Inf. Softw. Technol . 52, 792–805. doi: 10.1016/j.infsof.2010.03.006

Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data Soc . 1:205395171452848. doi: 10.1177/2053951714528481

Knoppers, B. M., and Thorogood, A. (2017). Ethics and Big Data in health. Curr. Opin. Syst. Biol . 4, 53–57. doi: 10.1016/j.coisb.2017.07.001

Mengist, W., Soromessa, T., and Legese, G. (2020). Method for conducting systematic literature review and meta-analysis for environmental science research. MethodsX 7:100777. doi: 10.1016/j.mex.2019.100777

Patgiri, R. (2019). A taxonomy on Big Data: survey. arXiv [Preprint] . doi: 10.48550/arxiv.1808.08474

Petroc, T. (2023). Amount of data created, consumed, and stored 2010-2020, with forecasts to 2025. Statista. Available at: https://www.statista.com/statistics/871513/worldwide-data-created/

Petticrew, M., and Roberts, H. (2008). Systematic Reviews in the Social Sciences . Hoboken, NJ: John Wiley & Sons.

Rosenheim, J. A., and Gratton, C. (2017). Ecoinformatics (Big Data) for agricultural entomology: pitfalls, progress, and promise. Annu. Rev. Entomol . 62, 399–417. doi: 10.1146/annurev-ento-031616-035444

Scopus Content Coverage Guide (2023). In Scopus. Available at: https://assets.ctfassets.net/o78em1y1w4i4/EX1iy8VxBeQKf8aN2XzOp/c36f79db25484cb38a5972ad9a5472ec/Scopus_ContentCoverage_Guide_WEB.pdf

Succi, S., and Coveney, P. V. (2019). Big data: the end of the scientific method? Philos. Trans. A Math. Phys. Eng. Sci . 377:20180145. doi: 10.1098/rsta.2018.0145

Van Altena, A. J., Moerland, P. D., Zwinderman, A. H., and Olabarriaga, S. D. (2016). Understanding big data themes from scientific biomedical literature through topic modeling. J. Big Data 3. doi: 10.1186/s40537-016-0057-0

Wohlin, C. (2014). “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering. doi: 10.1145/2601248.2601268

Table A1 . Secondary studies included in our review.

Keywords: Big Data definition, systematic literature review, scientific research, Big Data review, Big Data epistemology

Citation: Han X, Gstrein OJ and Andrikopoulos V (2024) When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data. Front. Big Data 7:1441869. doi: 10.3389/fdata.2024.1441869

Received: 31 May 2024; Accepted: 12 August 2024; Published: 10 September 2024.

Reviewed by:

Copyright © 2024 Han, Gstrein and Andrikopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaoyao Han, x.han@rug.nl

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

IMAGES

What is an Analytical Database? Definition and FAQs
The 7 Best Databases for Research
Databases
What is a Database? Definition, Meaning, Types with Example (2022)
What is database, its types and examples?
PPT

VIDEO

Chapter01-Databases and Database Users-01 basic definitions
Research Meaning
What is a Database
Introduction of Research Databases
7-Relational Database Concepts
What is Database?

COMMENTS

What Are Databases?
Research Tutorial. What Are Databases? In the context of libraries and doing research, the generally accepted definition of a database is "an electronic collection of searchable information on one or more topics." Right.
Academic Databases: A Guide for Researchers
How AI-powered tool R Discovery offers access to over 150 million research articles, facilitating academic collaboration. And what are the essentials of academic database for researchers, including types like open access, full-text, subject-specific, multidisciplinary, abstracting, and citation databases.
Research Databases
Definition. Research databases are organized collections of digital data and resources that provide access to a wide range of scholarly articles, journals, books, and other academic materials. These databases are essential for conducting thorough research as they offer curated content, advanced search functionalities, and reliable sources that ...
What's the difference between a Research Database and Google?
A research database is an organized, searchable collection of information that allows you to quickly search many resources simultaneously. Databases can be general, such as Academic Search Complete or ProQuest , or subject-specific, such as PsycInfo , which has resources related to psychology, or America, History and Life , which has resources ...
Google vs. Research Databases: What's the Difference?
Research databases, such as JSTOR and Academic Search Premier, uncover the world of scholarly information. Most of the content in these databases is only available through the library. The complete list of databases is on the Databases A-Z list. The Library has purchased access to hundreds of databases on your behalf.
What Is a Database? (Definition, Types, Components)
Database Definition. A database is a way for organizing information, so users can quickly navigate data, spot trends and perform other actions. Although databases may come in different formats, most are stored on computers for greater convenience. Databases are stored on servers either on-premises at an organization's office or off-premises ...
Defining Research Data
Learn what research data are, how they are managed, and what types of data are excluded from sharing. Research data are the recorded facts that validate research findings and can be structured and stored in various formats.
Data Module #1: What is Research Data?
Research data comes in many different formats and is gathered using a wide variety of methodologies. In this module, we will provide you with a basic definition and understanding of what research data are. We'll also explore how data fits into the scholarly research process.
Research 101: What are Databases?
A database is a collection of organized and stored information designed for search and retrieval. Databases come in various forms and can be used for different applications. Libraries typically subscribe to research databases. Research databases are electronic platforms that contain a collection of electronic information that is searchable and ...
Research Databases
Research Databases. Research databases are organized collections of computerized information or data such as periodical articles, books, graphics and multimedia that can be searched to retrieve information. Databases can be general or subject oriented with bibliographic citations, abstracts, and or full text. The sources indexed may be written ...
Database
database, any collection of data, or information, that is specially organized for rapid search and retrieval by a computer. Databases are structured to facilitate the storage, retrieval, modification, and deletion of data in conjunction with various data-processing operations. A database management system (DBMS) extracts information from the ...
23 Research Databases for Professional and Academic Use
ERIC is a free database that the United States Department of Education sponsors to share resources for teachers and other academic professionals. It also has a thesaurus built into the database, which individuals can use while writing their research papers. 6. ScienceDirect.
What is an academic database?
Databases are online platforms that contain searchable resources such as journals, articles, ebooks, and data sets. Sources within databases are sometimes also published in print editions, but, if not, many of the resources that you find are printable in formats like .pdf. Your school's library contracts with vendors like EBSCO and Proquest ...
Data Collection
Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem. While methods and aims may differ between fields, the overall process of ...
Using Databases
Basic definition: A collection of data arranged for ease and speed of search and retrieval. In the library world, a database is a collection of articles, ebooks, videos, or other resources, which can be quickly searched by keyword, author, publication or other terms. Most materials in library databases are not available to the general public or ...
Why use a Library Database?
A library database is an electronic collection of information, organized to allow users to get that information by searching in various ways. Examples of Database information. Articles from magazines, newspapers, peer-reviewed journals and more. More unusual information such as medical images, audio recitation of a poem, radio interview ...
What is a database?
A database is an electronic collection of information that is organized so that it can easily be accessed, managed, and updated. Amazon.com is a familiar database as is the library's catalog, Find It. The library also subscribes to over 200 scholarly and research databases.
Research Software vs. Research Data I: Towards a Research Data
Background: Research Software is a concept that has been only recently clarified.In this paper we address the need for a similar enlightenment concerning the Research Data concept. Methods: Our contribution begins by reviewing the Research Software definition, which includes the analysis of software as a legal concept, followed by the study of its production in the research environment and ...
Database Search
What is Database Search? Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more. The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places.
Research Data
Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question. It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in ...
Guides: Research Data Management: What is data?
Research data is any information that has been collected, observed, generated or created to validate original research findings. Research data may be arranged or formatted in such a way as to make it suitable for communication, interpretation and processing. Data comes in many formats, both digital and physical. More information:
Research Database
Definition. A research database is a structured collection of data and information that allows users to search, retrieve, and analyze content from various sources efficiently. These databases often include articles, books, multimedia, and archival materials relevant to specific subjects or fields, making them vital tools for gathering ...
List of academic databases and search engines
This article contains a representative list of notable databases and search engines useful in an academic setting for finding and accessing articles in academic journals, institutional repositories, archives, or other collections of scientific and other articles. Databases and search engines differ substantially in terms of coverage and retrieval qualities. [1]
Research Methods
Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design. When planning your methods, there are two key decisions you will make. First, decide how you will collect data. Your methods depend on what type of data you need to answer your research question:
(PDF) What is a Database?
A database is an organized collection. of information, usually with one central topic. In a computer. database (as opposed to a paper database), the program that. you use to enter and manipulate ...
When we talk about Big Data, What do we really mean? Toward a more
1 Department of Governance and Innovation, Campus Fryslan, University of Groningen, Leeuwarden, Netherlands; 2 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Groningen, Netherlands; Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this "no consensus ...
The Engagement and Disengagement of Heterogeneous Stakeholders: A
Central to research within the relational view is how interactions among multiple stakeholders are affected by their interests, hierarchies, and relationships (Castelló et al., 2016; Dawkins, 2015), as well as value congruence and strategic complementarity (Bundy et al., 2018).For example, in cities, stakeholders with different interests—what we term heterogeneous stakeholders—such as ...
Entrepreneurial mindset: An integrated definition, a review of current
Despite an increasing interest in understanding the mindset of entrepreneurs, little consensus exists on what an entrepreneurial mindset (EM) is, how it is developed, or its precise outcomes. Given the fragmented nature of the multidisciplinary study of EM, we review prior work in an effort to enhance scholarly progress. To this end, we identify and review 61 publications on the topic and ...

Research Tutorial

Academic Databases: A Guide for Researchers

What is an Academic Database?

Types of Academic Databases

Benefits of Using Academic Databases

R Discovery for Researchers

Related Posts

What is Research Impact: Types and Tips for Academics

Research in Shorts: R Discovery’s New Feature Helps Academics Assess Relevant Papers in 2mins

University of Houston Libraries

Finding Information

Introduction

Search Engines

Research Databases

Results in Google

Search results in Academic Search Premier

What Is a Database?

Database Definition

Why Do We Use Databases?

Databases Hold Data Efficiently

Databases Allow Smooth Transactions

Databases Update Information Quickly

Databases Simplify Data Analysis

What Is a Database Management System?

Evolution of Databases

Types of Databases

1. Hierarchical Databases

2. Relational Databases

3. Non-Relational or NoSQL Databases

4. Cloud Databases

5. Centralized Databases

6. Distributed Databases

7. Object-Oriented Databases

8. Graph Databases

What Are the Components of a Database?

Constraints and Rules

Query Language

Database Advantages

Database Disadvantages

Applications of Databases

Future of Databases

Frequently Asked Questions

What is the most commonly used database type?

What is the definition of a database?

Recent Big Data Articles

Defining Research Data

Examples of Research Data

Exclusions from Sharing

Other Records to Manage

QR code for this page

Data Module #1: What is Research Data?

Data Module Quick Navigation

Library Resources

Other Definitions of Research Data

Research Data Formats

Research 101: What are Databases?

What are databases?

General vs. Specialized

Research Process: Research Databases

Research Databases

BibGuru Blog

What is an academic database?

What can you find in an academic database?

Most common academic databases

Humanities databases

Social sciences databases

Business databases

STEM databases

Health sciences and education databases

How to cite sources from an academic database

Frequently Asked Questions about what is an academic database

What is a Database?

Basic definition:

Short Answer:

Longer Explanation:

Examples and Usage:

Finding Articles (Using Academic Search Premier)

Judging Articles (Using Academic Search Premier)

Using JSTOR

Can I Access Databases from Off Campus?