Data Modules Table of Contents
#1 - What is Research Data?
#2 - Planning for Your Data Use
#3 - Finding & Collecting Data
#4 - Keeping Your Data Organized
#5 - Intellectual Property & Ethics
#6 - Storage, Backup, & Security
#7 - Documentation
Module created by Aaron Albertson, Beth Hillemann, & Ron Joslin.
Many people think of data-driven research as something that primarily happens in the sciences. It is often thought of as involving a spreadsheet filled with numbers. Both of these beliefs are incorrect. Research data are collected and used in scholarship across all academic disciplines and, while they can consist of numbers in a spreadsheet, they also take many different forms, including videos, images, artifacts, and diaries. Whether it is a psychologist collecting survey data to better understand human behavior, an artist using data to generate images and sounds, or an anthropologist using audio files to document observations about different cultures, scholarly research across all academic fields is increasingly data-driven.
In our Data Literacy Modules, we will demonstrate the ways in which research data are gathered and used across various academic disciplines by discussing it in a very broad sense. We define research data as: any information collected, stored, and processed to produce and validate original research results. Data might be used to prove or disprove a theory, bolster claims made in research, or to further the knowledge around a specific topic or problem.
There are many different definitions of research data available. Here are just a few examples of other definitions. We share these examples to illustrate there is not universal consensus on a definition, although many similarities are apparent.
“research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results”
"...recorded factual material commonly accepted in the scientific community as necessary to validate research findings..."
"...materials generated or collected during the course of conducting research..."
Research data takes many different forms. Data may be intangible, such as measured numerical values in a spreadsheet, or tangible, such as physical research materials like samples of rocks, plants, or insects. Here are some examples of the formats that data can take:
A database is a collection of organized and stored information designed for search and retrieval. Databases come in various forms and can be used for different applications.
Libraries typically subscribe to research databases. Research databases are electronic platforms that contain a collection of electronic information that is searchable and, in most cases, retrievable in full-text format. Typically, they encompass articles from periodicals such as academic journals, newspapers, magazines, and trade publications.
In most cases, databases can be categorized in two ways: general or specialized.
General databases cover a wide range of academic disciplines by means of indexing many source types, including articles from academic/scholarly journals, newspapers, magazines, trade publications, and reports.
Specialized databases usually contain various types of information that deal with a specific field of study.
Research databases are organized collections of computerized information or data such as periodical articles, books, graphics, and multimedia that can be searched to retrieve information. Databases can be general or subject-oriented, with bibliographic citations, abstracts, and/or full text. The sources indexed may be written by scholars, professionals, or generalists.
Research databases accessed freely on the World Wide Web are generally non-fee based, lack in-depth indexing, and do not index proprietary resources. Subscription or commercial databases are more refined, with various types of indexing features, searching capabilities, and help guides.
Prince George's Community College's Library provides commercial databases for its users as well as non-fee databases. These databases are available from the Library's Website. To review these databases, click on Research Databases.
Selecting Appropriate Online Databases
Your topic statement determines the type of database, kind of information, and the date of the sources that you will use. It is important to clarify whether your topic will require research from journals, magazines, newspapers, and books or just journals. To understand the differences between magazines, journals, and newspapers, see the Magazines, Journals, and Newspapers: What's the Difference section under Evaluating Sources.
Search Strategies
Before you begin to search the databases, it is important that you develop a well-planned, comprehensive search strategy. Determine what your keywords are and how you want them to link together. Always read the help screens and review any tutorials that have been developed for a particular database.
After you determine what your keywords are, consult any subject headings or guides to locate controlled vocabulary such as a thesaurus that may appear in the subject field. You will also want to decide what other fields may be valuable for your search.
Boolean searching is one of the most basic and effective search strategies, and it is supported by most online databases.
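To make the idea concrete, here is a minimal sketch of how Boolean operators combine keyword result sets. The tiny "index" below is hypothetical illustration data: each keyword maps to the set of record IDs that contain it, and AND/OR/NOT become set operations, which is essentially what a database does behind the scenes.

```python
# Minimal sketch of Boolean search: each keyword maps to the set of
# record IDs that contain it, and AND/OR/NOT become set operations.
# The tiny "index" below is hypothetical illustration data.
index = {
    "climate": {1, 2, 4},
    "policy":  {2, 3, 4},
    "ocean":   {1, 5},
}

# Every record ID known to the index
all_ids = set().union(*index.values())

# climate AND policy -> records containing both keywords
both = index["climate"] & index["policy"]        # {2, 4}

# climate OR ocean -> records containing either keyword
either = index["climate"] | index["ocean"]       # {1, 2, 4, 5}

# climate NOT policy -> records with "climate" but not "policy"
without = index["climate"] - index["policy"]     # {1}

# NOT policy on its own -> everything except records with "policy"
not_policy = all_ids - index["policy"]           # {1, 5}

print(both, either, without, not_policy)
```

Linking keywords with AND narrows a search, OR broadens it, and NOT excludes results, exactly as the set operations above suggest.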
For more help with search strategies see Search Strategies section.
Database: any collection of data or information that is specially organized for rapid search and retrieval by a computer. Databases are structured to facilitate the storage, retrieval, modification, and deletion of data in conjunction with various data-processing operations. A database management system (DBMS) extracts information from the database in response to queries.
A database is stored as a file or a set of files. The information in these files may be broken down into records, each of which consists of one or more fields. Fields are the basic units of data storage, and each field typically contains information pertaining to one aspect or attribute of the entity described by the database. Records are also organized into tables that include information about relationships between their various fields. Although the term database is applied loosely to any collection of information in computer files, a database in the strict sense provides cross-referencing capabilities. Using keywords and various sorting commands, users can rapidly search, rearrange, group, and select the fields in many records to retrieve or create reports on particular aggregates of data.
Database records and files must be organized to allow retrieval of the information. Queries are the main way users retrieve database information. The power of a DBMS comes from its ability to define new relationships from the basic ones given by the tables and to use them to get responses to queries. Typically, the user provides a string of characters, and the computer searches the database for a corresponding sequence and provides the source materials in which those characters appear; a user can request, for example, all records in which the contents of the field for a person's last name is the word Smith.
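The "last name is Smith" query described above can be sketched with an in-memory SQLite database. The table, column names, and sample rows here are purely illustrative, not taken from any real system:

```python
# Sketch of the "last name is Smith" query from the text, using an
# in-memory SQLite database. Table, columns, and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada", "Smith"), ("Grace", "Hopper"), ("John", "Smith")],
)

# Retrieve all records in which the last-name field is 'Smith'
rows = conn.execute(
    "SELECT first_name, last_name FROM people "
    "WHERE last_name = ? ORDER BY first_name",
    ("Smith",),
).fetchall()
print(rows)  # [('Ada', 'Smith'), ('John', 'Smith')]
```

Each row is a record, each column a field, and the WHERE clause is the query that selects the matching records.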
The many users of a large database must be able to manipulate the information within it quickly at any given time. Moreover, large business and other organizations tend to build up many independent files containing related and even overlapping data, and their data-processing activities often require the linking of data from several files. Several different types of DBMS have been developed to support these requirements: flat, hierarchical, network, relational, and object-oriented.
Early systems were arranged sequentially (i.e., alphabetically, numerically, or chronologically); the development of direct-access storage devices made possible random access to data via indexes. In flat databases, records are organized according to a simple list of entities; many simple databases for personal computers are flat in structure. The records in hierarchical databases are organized in a treelike structure, with each level of records branching off into a set of smaller categories. Unlike hierarchical databases, which provide single links between sets of records at different levels, network databases create multiple linkages between sets by placing links, or pointers, to one set of records in another; the speed and versatility of network databases have led to their wide use within businesses and in e-commerce . Relational databases are used where associations between files or records cannot be expressed by links; a simple flat list becomes one row of a table, or “relation,” and multiple relations can be mathematically associated to yield desired information. Various iterations of SQL (Structured Query Language) are widely employed in DBMS for relational databases. Object-oriented databases store and manipulate more complex data structures, called “objects,” which are organized into hierarchical classes that may inherit properties from classes higher in the chain; this database structure is the most flexible and adaptable.
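The idea that multiple relations "can be mathematically associated to yield desired information" is what a SQL join does. The sketch below uses SQLite again; the two-table schema and its contents are hypothetical examples, not a reference design:

```python
# Sketch of associating two relations with a join, in SQLite's SQL
# dialect. The schema and sample data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE articles (title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Borgman'), (2, 'Codd');
    INSERT INTO articles VALUES ('Data sharing', 1), ('Relational model', 2);
""")

# Join the two relations on the shared author_id key
rows = conn.execute("""
    SELECT authors.name, articles.title
    FROM articles JOIN authors ON articles.author_id = authors.id
    ORDER BY authors.name
""").fetchall()
print(rows)
```

Neither table stores an explicit link to the other; the association exists only in the shared key values, which is the defining trait of the relational model.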
The information in many databases consists of natural-language texts of documents; number-oriented databases primarily contain information such as statistics, tables, financial data, and raw scientific and technical data. Small databases can be maintained on personal-computer systems and used by individuals at home. These and larger databases have become increasingly important in business life, in part because they are now commonly designed to be integrated with other office software, including spreadsheet programs.
Typical commercial database applications include airline reservations, production management functions, medical records in hospitals, and legal records of insurance companies. The largest databases are usually maintained by governmental agencies, business organizations, and universities. These databases may contain texts of such materials as abstracts, reports, legal statutes, wire services, newspapers and journals, encyclopaedias, and catalogs of various kinds. Reference databases contain bibliographies or indexes that serve as guides to the location of information in books, periodicals, and other published literature. Thousands of these publicly accessible databases now exist, covering topics ranging from law, medicine, and engineering to news and current events, games, classified advertisements, and instructional courses.
Increasingly, formerly separate databases are being combined electronically into larger collections known as data warehouses . Businesses and government agencies then employ “ data mining ” software to analyze multiple aspects of the data for various patterns. For example, a government agency might flag for human investigation a company or individual that purchased a suspicious quantity of certain equipment or materials, even though the purchases were spread around the country or through various subsidiaries.
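The flagging example above boils down to aggregating purchases across subsidiaries and comparing the total against a threshold. A minimal sketch, with entirely made-up records, company names, and threshold:

```python
# Sketch of the data-mining example from the text: total a company's
# purchases across its subsidiaries and flag totals over a threshold.
# Records, names, and the threshold are hypothetical.
from collections import defaultdict

purchases = [
    # (parent company, subsidiary, quantity purchased)
    ("Acme Corp", "Acme East", 40),
    ("Acme Corp", "Acme West", 70),
    ("Beta LLC",  "Beta LLC",  30),
]
THRESHOLD = 100

totals = defaultdict(int)
for parent, subsidiary, quantity in purchases:
    totals[parent] += quantity   # aggregate across subsidiaries

flagged = sorted(name for name, total in totals.items() if total > THRESHOLD)
print(flagged)  # ['Acme Corp']
```

The point of the data warehouse is precisely that the two Acme purchases, recorded in different places, end up in one collection where this aggregation becomes possible.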
When you do research in an academic setting, you’ll likely encounter academic databases like JSTOR, Scopus, PubMed, or ERIC. In this blog post, we define academic databases and explore what resources can be found in them.
Databases are online platforms that contain searchable resources such as journals, articles, ebooks, and data sets. Sources within databases are sometimes also published in print editions, but, if not, many of the resources that you find are printable in formats like .pdf.
Your school’s library contracts with vendors like EBSCO and ProQuest to subscribe to databases that you can search to find sources. Databases are costly to maintain and, as a result, access to the sources in them is sometimes limited depending on your institution’s subscription.
Although academic databases are searchable, they differ from Google or Google Scholar because the sources within them are properly indexed and vetted through peer review. This means that academic databases allow for more precise keyword searches and that the sources they contain are credible.
The most common types of resources that you’ll find in an academic database are journals and journal articles. Journals are periodicals that feature journal articles and that are published on a regular basis.
Also known as scholarly articles, journal articles are secondary, peer-reviewed sources that you can use as evidence in an academic paper. Additionally, databases may contain ebooks, newspaper or magazine articles, data sets, reports, and other source types.
Databases are often subject- or discipline-specific, but there are some that cover a range of topics, such as Academic Search Complete and Scopus. Some databases contain full-text sources, while others include abstracts for sources that you can acquire elsewhere.
Databases like JSTOR, ProjectMUSE, and the MLA International Bibliography are the most common for research in humanities fields. If you’re a history or literature student, you will likely encounter these at some point in your studies.
Both JSTOR and ProjectMUSE offer full-text sources. The MLA International Bibliography indexes all publications in languages and literature. Although it doesn’t contain full-text sources, many libraries are able to provide links to full-text sources in other databases.
If you’re writing a paper on a social science topic, you might find sources in databases like Sociological Abstracts, PsycINFO, and Opposing Viewpoints.
Because social science fields are often interdisciplinary, you may also find relevant social science sources in other types of databases.
Some of the most common business databases are Business Source Complete, ABI/INFORM, Mergent, IBISWorld, and Mintel. These offer different resources, depending on what kind of business research you are doing.
Some business databases are best for industry or company research, while others specialize in providing resources for consumer research. If you’re not sure what kind of business research you need to do for your assignment, you might consider contacting your school’s business librarian, if one is available.
STEM is a huge field, so you can expect to encounter a wide range of databases when you undertake research in the sciences, technology, engineering, or math. Some of the most common databases in these subjects include Web of Science, arXiv.org, and ScienceDirect.
Science databases will include a broader range of resources than those in other fields, including source types like data sets and code. If you’re not sure which science database would provide you with the best sources for your topic, ask a librarian or your instructor.
Like social science fields, both the health sciences and education are interdisciplinary. However, there are two top databases that you will certainly interact with if you’re doing research in either of these subjects: PubMed for health sciences and ERIC for education.
Some of the major academic databases include Academic Search Complete, Scopus, JSTOR, and PubMed.
You can find academic databases through your school library’s website. Most academic libraries organize their database list alphabetically or by subject.
Academic Search Premier: 8,500+ full-text periodicals, 7,300+ peer-reviewed journals
JSTOR: Archival database of books, primary sources, current issues of journals
Films On Demand: Streaming database of educational videos
A collection of data arranged for ease and speed of search and retrieval.
In the library world, a database is a collection of articles, ebooks, videos, or other resources, which can be quickly searched by keyword, author, publication or other terms. Most materials in library databases are not available to the general public or standard Internet search engines. The library pays a substantial fee to make them available to Concordia library users.
Databases typically make searching easier.
Databases often are organized by topics or subjects or materials, have multiple filters which can limit or increase search results in various ways, and can provide access to resources which are otherwise not obtainable. As was mentioned in the definition, one of the primary functions of a database is speed of retrieval. This means databases are designed to provide quick access to materials which are actually useful to you, instead of just providing access to as many resources as possible.
Scholarly (Peer-Reviewed) Resources: You may see that an instructor requires scholarly or peer-reviewed resources. Databases are not the only way to get access to these types of resources, but, as stated above, they're often the fastest or easiest.
Databases work essentially the same way as many internet search engines you may be familiar with (Google, for example). However, Google and most web search engines take your search terms, look for "anything out there," and then send you somewhere else to get your information. A database, in a sense, keeps you in its system. These databases actually work similarly to how Amazon.com works: you can search for just about anything, add limitations to your search and terms, and find related searches, but all within Amazon's setup. Amazon doesn't send you to Walmart or Target; it keeps you in Amazon's "ecosystem." Likewise, a database isn't going to send you to some organization's website for the information you're looking for, like Google would; instead, you stay in the database's ecosystem. There are pros and cons to this, but in most cases that ecosystem is cut off from standard search engines, meaning only that database has access to it.
Databases usually focus their subject matter and curate their collections, or we have subscriptions to very specific collections in those databases. This means that there may be databases that work really well for some subject matter, but have almost nothing in other areas. For example, the Quick Links on this page represent some of our most popular databases, but each has a different focus and will provide you with very different materials.
Academic Search Premier: This is our standard database which holds many academic and popular resources and is our most used database. It covers a wide range of information with tons of publications with Full Text access. To learn how to use this database better, jump to the Help Using Databases section of this page and watch the videos "Finding Articles" and "Judging Articles" for an in-depth look at how to use this resource.
JSTOR: JSTOR is another very popular database, with a "how to" video in the Help Using Databases section. JSTOR is not just a database that offers many various resources in full text, but is also an Archive. Simply put, JSTOR has a lot of great information both new and old, and it'll always be there.
Films on Demand: Films on demand is a database that holds tons of educational videos on many different subjects. What's more, they provide citation materials and video embed options, for use in presentations or saving for later.
Nexis Uni: Formerly LexisNexis Academic, this database collects newspaper articles from all over the world. If you need news articles from a specific publication, date range, or topic, this is the place to look.
Others: Of course, the databases listed above are just some of the more popular database options we have access to. The library subscribes to many more databases; please check out the Database List Page to see all the others.
Judging Articles (Using Academic Search Premier)
When I go to a database, it has a login option in the top right corner. What is this?
An article in a database says "Concordia subscribes to this title" but doesn't have an access link. How do I get it?
I found an article I want, but I don't have full access. What do I do?
A library database is an electronic collection of information, organized to allow users to get that information by searching in various ways.
Examples of Database information
Articles from magazines, newspapers, peer-reviewed journals and more. More unusual information such as medical images, audio recitation of a poem, radio interview transcripts, and instruction video can be found in databases as well.
General reference information such as that found in an encyclopedia. Both very broad and very specific topic information is available.
Books. Online versions, eBooks, are the same as print versions with some enhancements at times, such as an online glossary.
What’s the difference?
Information in a database has been tagged with all sorts of data, allowing you to search much more effectively and efficiently. You can search by author, title, keyword, topic, publication date, type of source (magazine, newspaper, etc.) and more.
Database information has been evaluated in some way, ranging from a very rigorous peer-review publishing process to an editor of a popular magazine making a decision to publish an article.
Databases are purchased, and most of the information is not available for free on the internet. The databases are continually updated as new information is produced.
Citation information. Databases include the information you need to properly cite your sources and create your bibliography. Information you retrieve using Google may or may not have this information.
My professor says I can’t use the Internet. Can I still use these databases?
Yes! The internet is only the delivery system for the databases. The information in the databases is not found on the free web.
Your best bet when doing any type of research for course assignments is to use the library databases. When you search using Find It, you are searching one kind of database. However, there are many more to choose from, each focused on a certain kind of resource or focused around a certain subject. Library databases were built with you--the researcher--in mind, and include many tools to help you organize, cite, and save the scholarly resources you find.
A database is an electronic collection of information that is organized so that it can easily be accessed, managed, and updated. Amazon.com is a familiar database, as is the library's catalog, Find It. The library also subscribes to over 200 scholarly and research databases. You can browse them all here: http://libguides.atu.edu/az.php
Multi-disciplinary database with full text journals, indexing /abstracts, monographs, reports, conference proceedings, and more.
Multidisciplinary database of periodical content.
Databases provided by the Arkansas State Library [ASL] Traveler project are funded by a grant from the U.S. Institute of Museum and Library Services (Grant LS-00-14-0004-14) and the Arkansas Department of Education.
Multidisciplinary digital library of academic journals, books, and primary sources.
Full text, multidisciplinary databases.
Online resource covering today’s social issues. Provides pro/con viewpoint essays, primary source documents, statistics, articles, and images.
Teresa Gomez-Diaz
1 Laboratoire d'Informatique Gaspard-Monge, CNRS, Paris-Est, France
2 Universidad Antonio de Nebrija, Madrid, Spain
Underlying data.
Data underlying the arguments presented in this article can be found in the references, footnotes, and Box 1.
Revised: Amendments from Version 1
This version considers the comments of the reviewers to better explain and illustrate some of the concepts presented in the article. In particular we have stressed the importance of the scientific production context for the RS and RD definitions. We have as well introduced new references related to the concepts of data and information, to further illustrate our view on the complexity of the data concept, and a new reference to complete the studied landscape for the proposed RD definition. As asked by the Referees, we have moved the translations of French and Spanish quotes to the main text. See our answers to the referee reports to complete the differences with the version 1 of this article.
| Review date | Reviewer name(s) | Version reviewed | Review status |
|---|---|---|---|
| | Joachim Schopfel | | Approved |
| | Remedios Melero | | Approved |
| | Tibor Koltay | | Approved |
| | Joachim Schopfel | | Approved with Reservations |
| | Remedios Melero | | Approved with Reservations |
| | Tibor Koltay | | Approved with Reservations |
Background: Research Software is a concept that has been only recently clarified. In this paper we address the need for a similar enlightenment concerning the Research Data concept.
Methods: Our contribution begins by reviewing the Research Software definition, which includes the analysis of software as a legal concept, followed by the study of its production in the research environment and within the Open Science framework. Then we explore the challenges of a data definition and some of the Research Data definitions proposed in the literature.
Results: We propose a Research Data concept featuring three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and should have yielded a result published or disseminated in some article or scientific contribution of any kind.
Conclusions: The analysis of this definition and the context in which it is proposed provides some answers to Borgman’s conundrum challenges, that is, which Research Data might be shared, by whom, with whom, under what conditions, why, and to what effects. These are completed with answers to the questions: how? and where?
Each particle of the Universe, known or unknown by what is widely accepted as Science, is information. Different datasets can be associated with each particle to convey information, as, for example: where has this particle been discovered? By whom? At what time? Is this particle a constituent element of a rock, or a plant, or … ? Indeed, as living entities of planet Earth, … we are all part of this Universe and every atom in our bodies came from a star that exploded …, therefore … we are all stardust … . 1
So long ago that we have never been able to give a precise date, information started to be fixed in cave paintings, figurines, and bone carvings, which have been found in caves like Altamira 2 or Lascaux 3 . That is, some human beings intentionally fixed information on a support. Much more recently, languages have been developed to deal with information, fixing and exchanging it in clay bricks, papyrus, monument walls, and paper books. Even more recently, information has been fixed in films and photographs, and has finally adopted digital formats.
Scientists study all kinds of subjects and objects: persons, animals, trees and plants and other living beings, philosophies and philosophers, artists and artworks, mathematical theories, music, languages, societies, cities, Earth and many other planets and exoplanets, clouds, weather and climate, stars and galaxies, as well as other animate or inanimate objects, molecules, particles, nanoparticles and viruses, nowadays including digital objects such as computer programs. Some of these items, like images, texts, and music etc. may have associated intellectual property rights; but others, like statistics or geographical data, may not. Yet, they may be affected by other legal contexts, such as, for example, the one given by the EU INSPIRE Directive 1 for spatial data, concerning any data with a direct or indirect reference to a specific location or geographical area.
Now, in our digital era, most of the above subjects under consideration are handled by humans using computers, through numerical data. Scientists present new theories and results built and produced with numerical simulations and through the analysis of numerical datasets. These datasets are usually stored in databases, manipulated or produced in digital environments using existing software, either Free/Open Source Software (FLOSS) 4 or commercial, or by means of software developed by research teams to address specific problems 2 , 3 .
In this specific scientific context, the aims and developments of Open Science practices are particularly relevant. Indeed, as remarked by 4 : "We must all accept that science is data and that data are science … ". Therefore, in this article we take into consideration the following definition of Open Science, in which the open access to Research Data (RD) and to Research Software (RS) is part of the core pillars 5 :
Open Science is the political and legal framework where research outputs are shared and disseminated in order to be rendered visible, accessible and reusable.
A more transversal and global vision can be found in the UNESCO Recommendation on Open Science 5 , 6 . See also 7 for another relevant example of ongoing work on the Open Science concept. But in this paper, following the analysis and the conclusions of 5 , we focus here on this restricted framework as more suitable for our purposes.
Among the most important kinds of research outputs of any scientific work, we focus on the trio formed by articles, software, and data. Actually, among all the possible pairs, RS and RD present the most similarities, although a short list of differences between software and data has been mentioned in 8 and 9 . On the other hand, regarding the other pairs, we think that the differences are much stronger. For instance, unlike the dissemination of published articles, usually at the hands of scientific editors, the dissemination of software and data that have been produced in the research process is mostly at the hands of their producers, the research team. The analogies between RS and RD have already been summarily highlighted in 10 , such as those concerning the release protocols of RD and RS, which raise the same questions, at the same time, in the production context. As a direct consequence, it seems suitable to propose a similar dissemination procedure for both kinds of research outputs 11 .
Indeed, let us remark that, as mentioned in 11 , 12 , both RS and RD dissemination might involve the use of licenses to set their sharing conditions, such a core issue. Information about RS licenses and licensing can be found at the Free Software Foundation (FSF) 6 , the Open Source Initiative (OSI) 7 , and the Software Package Data Exchange (SPDX) 8 . The SPDX licenses list also includes licenses that can be used for databases, like the Creative Commons licenses 9 or the Open Data Commons Licenses 10 , see for example 13 .
Other similarities between RS and RD are related to management plans: for example, Data Management Plans are nowadays required by research funders (see for example 14 , 15 ) and, in the same spirit, Software Management Plans have been recently proposed; see 16 and the references therein.
Finally, concerning evaluation, as observed in 3 , similar evaluation protocols can be proposed for both RS and RD.
Leaving aside the issues common to RS and RD concerning licensing and management plans, which have already been studied in the above mentioned references, the analogies between RS and RD dissemination and evaluation are more closely analyzed in the article 12 that follows the present work, including FAIR related issues 17 and 5 Stars Open Data 11 . In the current article, on the other hand, we focus on the conceptual analogies between RS and RD, and on their consequences (see Section 5 ).
As we will argue in the next sections, a definition for RD can be proposed following the main features of the RS definition given in our recent work 3 , 18 . However, we consider that formulating such a proposal remains a challenging issue, which we dare to address here. In fact, although one of the most widely accepted RD definitions is the one proposed by the OECD (2007) 19 , other works have shown the difficulties of fixing such a definition 20 , 21 . Indeed, establishing this concept has important and not well settled consequences, for example concerning the context of RD sharing, as highlighted by C. Borgman in 22 :
Data sharing is thus a conundrum. […] The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.
It is the intention of our present work to bring some answers to these questions.
The plan of this article is as follows. The next section introduces the concept of RS after a summary presentation of the key points involved in the notion of software as a legal object. Section 3 is devoted to discussing the different issues involved in the challenge of reaching a precise definition of data (in the most comprehensive sense of this concept). Section 4 partially describes the landscape of existing work addressing the RD definition, enumerating, again, some difficulties in settling such a concept.
There we propose our RD definition, based on three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and should have yielded a result published or disseminated in some article or scientific contribution of any kind. Comparisons with other RD definitions are examined.
The final section concludes with the proposition of some specific answers to Borgman’s conundrum challenges 22 . Let us remark that these conundrum challenges also involve RD dissemination issues, which are studied in detail in the article that follows this work 12 , together with the analysis of RD evaluation and FAIR issues.
The reader of the current work should be aware that its authors are not legal experts. Thus, in order to address our goals in this article, we have analyzed (French, Spanish, European and USA) legal documents and articles written by law experts 1 , 13 , 20 , 21 , 23 – 34 , but from the scientist’s point of view. Yet, a deeper understanding of legal issues may require the intervention of legal specialists.
Following the standard scientific protocol, the authors of this work (mathematicians) have, first, detected a problem – the need to provide a more suitable RD definition. Then, they have observed the surrounding landscape and studied the related literature, focused on and structured the different components of the problem and, finally, proposed what they believe could be a solution for the challenge under consideration. As in any other research work, we, the authors, believe that our proposal should be examined by the scientific community in order to evaluate its correctness and to help improve it, if needed, advancing towards a better solution.
In this section we bring together some of the existing definitions of software as a legal object (see references below). We also recall our definition of RS coming from 3 , 18 .
In what follows we refer to the documents 26 – 29 dealing with a definition of software as a legal object. Note that the terms computer program , software , logiciel (in French), programa de ordenador (in Spanish) are synonyms in this work. The terms source code (or código fuente in Spanish), compiled code (or code compilé , código compilado ) correspond to subsets of a computer program.
The first definition that we would like to consider comes from the Directive 2009/24/EC of the European Parliament 26 , that states:
For the purpose of this Directive, the term “computer program” shall include programs in any form, including those which are incorporated into hardware. This term also includes preparatory design work leading to the development of a computer program provided that the nature of the preparatory work is such that a computer program can result from it at a later stage.
Moreover, in the Spanish Boletín Oficial del Estado n. 97 (1996) 27 we can find 12 :
A los efectos de la presente Ley se entenderá por programa de ordenador toda secuencia de instrucciones o indicaciones destinadas a ser utilizadas, directa o indirectamente, en un sistema informático para realizar una función o una tarea o para obtener un resultado determinado, cualquiera que fuere su forma de expresión y fijación. […] comprenderá también su documentación preparatoria.
[For the purpose of this Law, a computer program shall be understood as any sequence of instructions or indications intended to be used, directly or indirectly, in a computer system to perform a function or a task or to obtain a certain result, whatever expression and fixation form it can take. […] it can also include its preparatory documentation.]
Likewise, in the French Journal officiel de la République française (1982) 29 we can read:
Logiciel : Ensemble des programmes, procédés et règles, et éventuellement de la documentation, relatifs au fonctionnement d’un ensemble de traitement de données (en anglais : software). [ Software : All programs, procedures and rules, and possibly documentation, related to the performance of some data processing (in English: software).].
And in the French Code de la propriété intellectuelle (current regulation) 28 , Article L112-2, we can find:
Les logiciels, y compris le matériel de conception préparatoire, sont considérés notamment comme œuvres de l’esprit au sens du présent code. [Software, including the preparatory material, is considered as works protected by the present code.]
We observe that, in the above mentioned documents, the concept of software or computer program, logiciel or programa de ordenador, refers to the set of instructions, of any kind, that are to be used in a computer system (including hardware). It is a work protected by author rights. It can include the source code, the compiled code and, possibly, the associated documentation and preparatory material. It can be related to some data processing or to other tasks to be implemented in a computer system.
In order to complete this legal vision of the software concept we refer to item (11) of 26 :
For the avoidance of doubt, it has to be made clear that only the expression of a computer program is protected and that ideas and principles which underlie any element of a program, including those which underlie its interfaces, are not protected by copyright under this Directive. In accordance with this principle of copyright, to the extent that logic, algorithms and programming languages comprise ideas and principles, those ideas and principles are not protected under this Directive. In accordance with the legislation and case-law of the Member States and the international copyright conventions, the expression of those ideas and principles is to be protected by copyright.
Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, just as there is a difference between the mere idea for the plot of a novel and the final written work. Several persons could have the same idea for the plot, but its realization in a final document will deliver different novels by different writers, as the novel will reflect the personality of its author. Similarly, an algorithm remains on the side of ideas and, as such, is not protected by copyright laws. On the other hand, poetry, novels and software are protected under copyright laws. Moreover, a computer program can implement several algorithms, and the same algorithm can be implemented in several programs.
Finally, note the nature of software as a digital object underlying all the above considerations.
Beyond the vision of software as a legal object, we bring here the concept of Research Software (RS) as a scientific production, as defined in 3 , 18 :
Research Software is a well identified set of code that has been written by a (again, well identified) research team. It is software that has been built and used to produce a result published or disseminated in some article or scientific contribution. Each research software encloses a set (of files) that contains the source code and the compiled code. It can also include other elements as the documentation, specifications, use cases, a test suite, examples of input data and corresponding output data, and even preparatory material.
Thus, Section 2.1 of 3 introduces several definitions regarding the notions of scientific and research software as found in the literature, as a way to support the above definition, while 18 provides a complementary analysis of this concept. Note that this definition does not take into consideration whether the RS status is “ongoing” or “finalized”, and does not regard whether the RS has been disseminated, its quality or scope, its size, or whether it is documented, maintained, used only by the development team for the production of an article, or currently used in several labs … 2 .
Different recent works on the RS concept can be found, for example, in 35 and the references therein, where the RDA FAIR for Research Software (FAIR4RS) working group 13 proposes a definition of RS full of subtleties and details, albeit, perhaps, of complex interpretation in practice.
Following our proposed definition, we observe that RS can be characterized through three main features: it is a well identified set of code; it has been written by a well identified research team; and it has been built and used to produce a result published or disseminated in some article or scientific contribution.
Note that documentation, licenses, examples, data, tests, Software Management Plans and other related information and materials can also be part of the set of files that constitutes a specific RS. Remark that the data we refer to here will qualify as RD (as defined in Section 4 ) if they have been produced by a research team, which can be the same team that has produced the RS, but not necessarily (notice that the role of the research team involved in the development of a RS has been thoroughly studied in Section 2.2 of 3 ). Indeed, Section 2.1 above shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of a RS, following the choice of the RS producer team. There can be other elements, such as tests, input and output files illustrating how to use the RS, licenses, etc. Including these elements in the released RS corresponds to best practices that facilitate RS reuse. In our view, the release of a RD (see Section 4 and 12 ) can follow similar practices, that is, it can include documentation, some use examples, a license, a data management plan … this is to be decided by the producer team.
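As a purely illustrative aside, the three characteristics of the RS definition can be encoded as a simple predicate over a metadata record. The field names below are our own hypothetical choices for this sketch, not part of the definition given in 3 , 18 .

```python
# Illustrative encoding of the three RS characteristics as a predicate.
# The record fields are hypothetical, chosen only for this sketch.
def qualifies_as_research_software(record):
    """True if the record satisfies the three features of the RS definition:
    a well identified set of code, written by an identified research team,
    built and used to produce a published or disseminated result."""
    return (bool(record.get("code_identifier"))        # well identified code
            and bool(record.get("research_team"))      # identified research team
            and bool(record.get("published_result")))  # associated contribution

example = {
    "code_identifier": "my-solver v1.0",          # hypothetical release tag
    "research_team": ["A. Author", "B. Author"],
    "published_result": "doi:10.1234/example",    # hypothetical DOI
}
```

Such a predicate is only a caricature of the definition, of course; in practice each of the three features calls for human judgment rather than a metadata check.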
The initial origin of this RS definition is to be found in 2 , which contains a detailed and complete study comparing articles and software produced in a typical (French) research lab. As remarked in received comments and Referee reports on this article, this RS definition (as well as the RD definition proposed in Section 4 ) is placed in what can be considered a narrow context, emphasizing the role of the scientific production context. The relevance of such a context is widely accepted by the scientific community in the case of articles: not every article published in a newspaper qualifies as a research article, which must be released in a scientific journal and subjected to a refereeing procedure. Similarly, the importance of the production context has already been highlighted in the case of data, regarding those that qualify as cultural data 23 .
Besides, our definition includes as RS neither commercial software nor existing Free/Libre Open Source Software (FLOSS) or other software developed outside Academia, a restriction which does not exclude that RS (or research articles, data...) can be produced in other contexts, like private laboratories, for example. Rather, this means that we are not considering here differences between private and public funding of research. As a matter of fact, a research team can use RS produced by other teams for their scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered on the making-of aspects which are pertinent for the proposed definition. Obviously, a RS that has been initially developed in a research lab can evolve to become commercial software or simply evolve outside its initial academic context. The above definition concerns its early, academic life.
Moreover, a RS development team may not just use software produced by other teams, but also include external software as a component inside the ongoing computer program, a procedure that can be facilitated by FLOSS licenses. We consider that such an external component qualifies as RS if it complies with the three characteristics given in the above definition. The producers of the final RS should clearly identify the included external components and their licenses. They should also highlight the used or included RS components by means of a correct citation form 3 , 8 , 11 , 37 – 39 .
Furthermore, a RS may involve other software components that remain external and are not included in the RS development and release. Users are then left with the task of recovering and installing them, and of assembling these external components in order to get a running environment. Another situation, like the one we have analyzed in 18 , deals with RS developed within a given software environment which is perhaps not fully disseminated with the RS. For example, the GeoGebra code developed by T. Recio and collaborators 14 does not disseminate the whole GeoGebra software 15 , but only some parts that are relevant for their goals and that include their code.
See 2 , 3 , 18 for more discussions and references that have motivated the RS definition we have sketched in this section.
As stated in 40 :
“Data” is a difficult concept to define, as data may take many forms, both physical and digital.
For example, unlike software, data, as a legal object, is much more difficult to grasp. In fact, according to 33 , data is not a legal concept, as it does not fall under a specific legal regime. For example, data can be either mere information or une œuvre , a work with associated intellectual property, when it involves creative choices in its production that reflect the author’s personality 32 . The Knowledge Exchange report 21 provides guidelines that can be used to assess the legal status of research data, and mentions:
It is important to know the legal status of the data to be shared. […] not all data are protected by law, and not every use of protected research data requires the author’s consent. […] Whether data are in fact protected must be determined on a case-by-case basis.
In relation to this legal context of data sharing and reuse, a very complete framework is introduced in 23 :
Les problématiques liées à la réutilisation nécessitent une maîtrise parfaite du droit de la propriété intellectuelle, du droit à l’image, du droit des données personnelles, du respect à la vie privée et du secret de la statistique, du droit des affaires, du droit de la concurrence, du droit de la culture, du droit européen et des règles de l’économie publique. [The issues related to reuse require a perfect mastery of intellectual property rights, image rights, personal data rights, respect for private life and statistical confidentiality, business law, competition law, cultural law, European law and the rules of the public economy.]
Another list of legal issues related to data is provided by 33 , similar but not identical to the one in the previous quote. Yet, it is also necessary to consider other legal contexts concerning, for example, les données couvertes par le secret médical ou le secret industriel et commercial [data covered by medical confidentiality or by industrial and commercial secrecy] 16 . Let us remark that the section Applicable Laws and Regulations of 15 provides a broad overview of the regulatory aspects that need to be taken into consideration when developing disciplinary RD management protocols in the European context. But, as declared in the introduction, it is not our intention to go deeper into these legal aspects, which should also be regarded from the perspective of many different laws.
The underlying problem is that data can refer to many different subjects or objects. We need to simplify the context to help us set a manageable concept of research data adapted to the scientific framework. For this purpose, we present here two relevant data definitions found in the data science literature.
The OECD data definition in its Glossary of Statistical Terms 17 states that:
DATA Definition: Characteristics or information, usually numerical, that are collected through observation. Context: Data is the physical representation of information in a manner suitable for communication, interpretation, or processing by human beings or by automatic means (Economic Commission for Europe of the United Nations (UNECE)), “Terminology on Statistical Metadata”, Conference of European Statisticians Statistical Standards and Studies, No. 53, Geneva, 2000.
Also, as a relevant precedent, let us quote here the data definition of the Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest , as mentioned in 41 :
A data set is a collection of related data and information – generally numeric, word oriented, sound, and/or image – organized to permit search and retrieval or processing and reorganizing. Many data sets are resources from which specific data points, facts, or textual information is extracted for use in building a derivative data set or data product. A derivative data set, also called a value-added or transformative data set, is built from one or more preexisting data set(s) and frequently includes extractions from multiple data sets as well as original data (Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest, 1999, p. 15).
We can notice that both definitions combine the concepts of data and information, leading, again, to a challenging situation. Thus, to better grasp the connection between the two terms, we have consulted several sources of different nature; see Box 1 . Note that in Box 1 information appears among the synonyms of data in the Larousse dictionary, but data is not among the synonyms of information. On the other hand, Wikipedia mentions that both terms can be used interchangeably, but that they have different meanings.
I.1 Diccionario de la lengua española of the Real Academia Española
I.2 Dictionnaire Larousse de la langue française
I.3 Wikipedia
Extract from the Data page of Wikipedia ( https://en.wikipedia.org/wiki/Data ):
Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable. […] Although the terms “data” and “information” are often used interchangeably, these terms have distinct meanings. […] data are sometimes said to be transformed into information when they are viewed in context or in post-analysis. However, […] data are simply units of information.
Moreover, in 42 and on the web page of ISKO 18 , where the concept of data is discussed in detail, an etymological and linguistic vision is also the starting point and, among other sources, Wikipedia is mentioned. The conclusion of 42 (Section 2.5) reads:
Therefore, our conclusion of this Section is that Kaase’s (2001, 3251) definition seems the most fruitful one suggested thus far: Data are information on properties of units of analysis.
See also 43 – 45 , where our readers can find further reflections on the concepts of data, information, knowledge, understanding, evidence and wisdom.
Such reflections present an eclectic panorama of the ingredients that could form a data definition and of their relation with the concept of information, attesting to the difficulties involved in such a goal.
Focusing on the scientific context, we can illustrate this complexity in full terms by referring to the French Code de l’environnement 30 . In its Article L-124-2 19 , we can appreciate the subtleties of the definition of environmental data in the following description:
Est considérée comme information relative à l’environnement au sens du présent chapitre toute information disponible, quel qu’en soit le support, qui a pour objet : 1. L’état des éléments de l’environnement, notamment l’air, l’atmosphère, l’eau, le sol, les terres, les paysages, les sites naturels, les zones côtières ou marines et la diversité biologique, ainsi que les interactions entre ces éléments ; 2. Les décisions, les activités et les facteurs, notamment les substances, l’énergie, le bruit, les rayonnements, les déchets, les émissions, les déversements et autres rejets, susceptibles d’avoir des incidences sur l’état des éléments visés au point 1 ; 3. L’état de la santé humaine, la sécurité et les conditions de vie des personnes, les constructions et le patrimoine culturel, dans la mesure où ils sont ou peuvent être altérés par des éléments de l’environnement, des décisions, des activités ou des facteurs mentionnés ci-dessus ; 4. Les analyses des coûts et avantages ainsi que les hypothèses économiques utilisées dans le cadre des décisions et activités visées au point 2 ; 5. Les rapports établis par les autorités publiques ou pour leur compte sur l’application des dispositions législatives et réglementaires relatives à l’environnement. [For the purposes of this chapter, information relating to the environment is considered to be any available information, whatever its medium, concerning: 1. The state of the elements of the environment, in particular the air, atmosphere, water, soil, land, landscapes, natural sites, coastal or marine areas and biological diversity, as well as the interactions between these elements; 2. Decisions, activities and factors, in particular substances, energy, noise, radiation, waste, emissions, spills and other discharges, likely to have an impact on the state of the elements referred to in point 1; 3. The state of human health, safety and living conditions of people, buildings and cultural heritage, insofar as they are or may be altered by elements of the environment or by the decisions, activities or factors mentioned above; 4. The analyses of costs and benefits as well as the economic assumptions used in the context of the decisions and activities referred to in point 2; 5. Reports drawn up by public authorities or on their behalf on the application of legislative and regulatory provisions relating to the environment.]
This is to be compared with the much easier to understand concept of geographical data, as introduced by Article L127-1 20 of the same Code de l’environnement 30 :
Donnée géographique, toute donnée faisant directement ou indirectement référence à un lieu spécifique ou une zone géographique ; [Geographic data, any data that refers directly or indirectly to a specific place or geographic area;]
Another example that we would like to mention here, showing the complexity of the representation and manipulation of data and information, corresponds to the linguistic research work developed at the Laboratoire d’informatique Gaspard-Monge, where one of the authors of the present work is based; see for example the doctoral theses 46 , 47 .
An additional factor that adds complexity to the concept of scientific data has to do with the potential use(s) and sharing of these data. As remarked by the OECD Glossary of Statistical Terms 21 :
The context provides detailed background information about the definition, its relevance, and in the case of data element definitions, the appropriate use(s) of the element described.
The importance of the context is also noted in 22 :
… research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context.
This opens the door to a series of complex issues, for example the need for complementary technical information or documentation associated with a given dataset in order to facilitate its reuse. See 48 (p. 16) (and also 40 ), which highlights the difficulties raised by the concept of temperature related data, as explained by a CENS biologist:
There are hundreds of ways to measure temperature. “The temperature is 98” is low-value compared to, “the temperature of the surface, measured by the infrared thermopile, model number XYZ, is 98.” That means it is measuring a proxy for a temperature, rather than being in contact with a probe, and it is measuring from a distance. The accuracy is plus or minus .05 of a degree. I [also] want to know that it was taken outside versus inside a controlled environment, how long it had been in place, and the last time it was calibrated, which might tell me whether it has drifted.
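The context the biologist asks for can be kept alongside the measurement itself. The following record layout is a hypothetical sketch of our own (all field names and the calibration date are illustrative, not a standard metadata schema):

```python
# Hypothetical metadata record for a single temperature measurement,
# preserving the context described in the quote above. All field
# names are illustrative, not a standard schema.
measurement = {
    "value": 98.0,
    "unit": "degF",
    "quantity": "surface temperature",
    "instrument": "infrared thermopile, model XYZ",  # a proxy, not a contact probe
    "method": "remote (infrared), from a distance",
    "accuracy_deg": 0.05,
    "environment": "outdoor",                        # vs. a controlled environment
    "last_calibration": "2024-01-15",                # hypothetical date
}

def is_interpretable(record):
    """A bare value is low-value; require the context fields before reuse."""
    needed = ("unit", "instrument", "method", "accuracy_deg", "environment")
    return all(record.get(k) is not None for k in needed)
```

A record that carries only the bare value would fail such a check, mirroring the biologist's point that “the temperature is 98” is of little value once removed from its context.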
Another instance that further illustrates the complexity of the technical information associated with a data set is the STRENDA Guidelines, which have been developed to assist authors in providing data describing their investigations of enzyme activities. 22
Other examples from the collection of complex issues associated with data use(s) and sharing conditions concern data reuse, described as:
… l’utilisation d’une information publique par toute personne qui le souhaite à d’autres fins que celles de la mission de service public pour les besoins de laquelle les documents ont été élaborés ou détenus. [… the use of public information by anyone who wishes it for other purposes than those of the original needs for which the documents were prepared or held by the public service mission.].
This notion of reuse finds a strong formulation for scientific data in 49 :
The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research. The public-good interests in the full and open access to and use of scientific data need to be balanced against legitimate concerns for the protection of national security, individual privacy, and intellectual property.
For more information on ‘re-use’ see, for example, 20 , 25 , 32 , 48 .
For the purposes of this article, data sharing is the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection. Posting datasets on a public website or providing them to a journal as supplementary materials also qualifies as sharing.
Open data are data in an open format that can be freely used, re-used and shared by anyone for any purpose.
Closing the conceptual loop developed in this section, let us remark, again, that legal aspects arise quite naturally in the above list of items. Among others, some aspects are related to the fact that datasets are usually organized in databases, where data are arranged in a systematic or methodical way and are individually accessible by electronic or other means 13 , 20 , 21 , 24 , 28 . Intellectual property rights can apply to the content of a database, to the disposition of its elements and to the tools that make it work (for example, software). The sui generis database right primarily protects the producer of the database and may prohibit, for instance, the extraction and/or reuse of all or a substantial part of its content 24 .
Finally, let us quote here this paragraph from the OpenAIRE project report 20 (p. 19), which highlights the difficulties of setting a research data definition in the context of legal studies:
From a legal point of view, one of the very basic questions of this study is which kind of potentially protected data we are dealing with in the context of e-infrastructures for publications and research data such as OpenAIREplus. The term “research data” in this context does not seem to be very helpful, since there is no common definition of what research data basically is. It seems rather that every author or research study in this context uses its own definition of the term. Therefore, the term “research data” will not be strictly defined, but will include any kind of data produced in the course of scientific research, such as databases of raw data, tables, graphics, pictures or whatever else.
We can remark that, although the preceding quote does not provide a strict definition of research data, it highlights the relevance of the production context, as we have already mentioned in Section 2.2 .
In the previous section we have exemplified the complexity of the concept of data through different approaches. In this section we focus on the research data concept, proposing a RD definition directly derived from the RS definition presented in Section 2.2 . To this end, we start by gathering some previous definitions that are particularly relevant for our proposal.
The first one is the White House document 34 , in particular its Intangible property section, where we can find the following definition.
Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.
Let us remark that, according to 34 , this definition explicitly excludes:
(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and (B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.
The above RD definition has been extended in 55 , emphasizing, among other aspects, the scientific purpose of the recorded factual material and the link with the scientific community.
A second basic inspiration for our proposal is the Directive for Open Data 25 that states:
(Article 2 (27)) The volume of research data generated is growing exponentially and has potential for re-use beyond the scientific community. […] Research data includes statistics, results of experiments, measurements, observations resulting from fieldwork, survey results, interview recordings and images. It also includes meta-data, specifications and other digital objects. Research data is different from scientific articles reporting and commenting on findings resulting from their scientific research. […] (Article 2 (9)) ‘research data’ means documents in a digital form, other than scientific publications, which are collected or produced in the course of scientific research activities and are used as evidence in the research process, or are commonly accepted in the research community as necessary to validate research findings and results;
The third pillar that we consider essential to support our proposal is the OECD report 19 (p. 13), where we find one of the most widely accepted and adopted definitions of RD:
Research data are defined as factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated. This term does not cover the following: laboratory notebooks, preliminary analyses, and drafts of scientific papers, plans for future research, peer reviews, or personal communications with colleagues or physical objects (e.g. laboratory samples, strains of bacteria and test animals such as mice). Access to all of these products or outcomes of research is governed by different considerations than those dealt with here.
Finally, let us recall the research data definition from the “Concordat on Open Research Data” 25, signed by the research councils of the UK Research and Innovation (UKRI) organisation 26:
Research data are the evidence that underpins the answer to the research question, and can be used to validate findings regardless of its form (e.g. print, digital, or physical). These might be quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, modelling, interview or other methods, or information derived from existing evidence. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set), or derived from existing sources where the rights may be held by others.
Let us observe that this last definition highlights the important role of data as a tool to find an answer to a scientific question, coinciding with the first characteristic of our RS definition, and also agreeing with 40 (p. 508): … data from scientific sensors are a means and not an end for their own research.
A remarkable “positive” aspect of these four definitions is that they separate the data from the subject under study, and establish what is, or is not, RD. This is relevant, as the legal context of the subjects under study determines the legal (and ethical) context of the RD.
We must say that we do not completely agree with all the terms in these definitions. For example, we disagree with the exclusion of laboratory notebooks as RD elements, as we think they can be used to generate input data for other studies (for instance, on how a laboratory works, information that appears in some notebooks depending on the scientific matter). We think that this information and data can be of interest to other researchers.
Some other “negative” aspects: the role of the data producers does not appear in the above definitions, although it is more or less implicit when they refer to the connection with the scientific community. Indeed, their role is very important, as observed in 48 (p. 6):
Data creators usually have the most intimate knowledge about a given dataset, gained while designing, collecting, processing, analyzing and interpreting the data. Many individuals may participate in data creation, hence knowledge may be distributed among multiple parties over time.
Certainly, as for any research output, the producer team is the guarantor of the data quality, in particular ensuring that the data are not outdated, erroneous, falsified, irrelevant, or unusable. Note that this is particularly relevant in the case of RD, as a consequence of the lack of widely accepted RD publication procedures, compared to the existing ones for articles in scientific journals, where the responsibility for the quality of the publication is shared by the authors, the journal editors, and the reviewers. This is also confirmed by 56 (p. 73):
The concept of data quality is determined by multiple factors. The first is trust. This factor is complex in itself. […] Giarlo (2013) also mentions trust in first place, stating that it depends on subjective judgments on authenticity, acceptability or applicability of the data. Trust is also influenced by the given subject discipline, the reputation of those responsible for the creation of the data, and the biases of the persons who are evaluating the data.
Moreover, note that, as remarked in 23, the quality of the producer legal entity defines the cultural quality of the data in legal terms, thus yielding the qualification of cultural data.
On the other hand, in some of the above definitions, the scientific purpose of RD is focused on its role in validating research findings, although RD can be reused for many other purposes in the scientific context, for instance to generate new knowledge, i.e. as primary sources for new scientific findings. Let us observe that these are two of the four rationales for data sharing examined in 22.
Bearing all these arguments in mind, we propose the following RD definition.
Research data is a well identified set of data that has been produced (collected, processed, analyzed, shared & disseminated) by a (again, well identified) research team. The data have been collected, processed and analyzed to produce a result published or disseminated in some article or scientific contribution. Each research data item encloses a set (of files) that contains the dataset, possibly organized as a database; it can also include other elements such as documentation, specifications, use cases, and any other useful material such as provenance or instrument information. It can include the research software that has been developed to manipulate the dataset (from short scripts to research software of larger size), or references to the software necessary to manipulate the data (whether or not developed in an academic context).
We can summarize the above definition in the following three main characteristics: the data are produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question; they are produced by a well identified research team; and they have yielded a result published or disseminated in some article or scientific contribution.
We provide here some further considerations concerning this proposal. First, it is clear that we have followed closely the RS definition in Section 2.2 in order to formulate this RD counterpart, which involves the translation of some RS features of a strictly digital nature to RD. This does not mean that we do not consider non-digital data as possible RD; rather, we assume that the information extracted from physical samples has already been converted into digital information to be manipulated in a computer system, which simplifies the handling of physical data and its inclusion in the proposed RD definition.
Secondly, we emphasize that our RD definition also assumes a restricted research production context, as in the case of our RS definition. But this limited context does not mean that, for example, public sector data cannot be used in the research work. Rather, it means that external components that have not been directly collected or produced by the research team should be well identified, indicating their origin, where the data are available, and the license that allows their reuse. It is also necessary to indicate whether the data have been reused (processed) without modification, or whether some adaptations were necessary for the analysis. External data components can have any origin, not just the public sector. As we have highlighted in Section 3, the production context of the data may be very important, as data can be difficult to interpret once removed from their initial context 22.
Third, note that, according to our definition, documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD. Moreover, as explained in Section 2.2, an RS can also include, among its materials, data that could itself qualify as RD. There is a broad spectrum of possibilities here, depending on the size of the outputs, the importance given to them by the research team, and the strategy chosen at the dissemination stage. If the RD is small and considered less important than the RS, it can simply be included and disseminated as part of the software; conversely, when the RS is considered less important than the RD, for example when the software development effort is much smaller than the time and effort invested in data collection and analysis, the RS can be disseminated as part of the RD. It can also happen that both outputs are considered of equal value and are disseminated separately. In this case it is important that both outputs are linked, so that other researchers can easily find one from the other.
In a similar manner as for RS, RD can include other data components, some of which can also qualify as research data. The RD producer team should explain how these components have been selected, combined and analyzed, and highlight the reuse of other RD components by means of a correct citation form; see, for example, 38, 41, 57.
Moreover, software and data can have several versions and releases, and they can be manipulated alike and with similar tools (forges, etc.) 37, 58, 59. One of the differences that we have detected between RS and RD is that, while some research teams may decide to give access to early stages of software development, the consulted literature suggests that RD is expected in its final form, ready for reuse, as mentioned in 22:
If the rewards of the data deluge are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others.
This difference is a consequence of the distinct nature of the building process of both objects. In the FLOSS community, we find the release early, release often principle associated with the development of the Linux kernel 60 and with Agile development. 27 This principle may not apply in the same way to the building of a dataset, for which a research team collects, processes and analyzes data with a very particular research purpose, one that may be difficult to share with a large or external community in the early stages of RD production.
Yet, in this work, we do not address some production issues, such as best software development practices or data curation, as they are outside the scope of the present article and could be the object of future work. This does not mean that we underappreciate these important issues: they are part of the third step of the proposed CDUR evaluation protocol for RS and RD, see sections 2.3 and 3.3 of 12. In our view, the research team decides when the research outputs have reached the right status for dissemination. Nor do we enter into the different roles (see 22) that may appear in the RD team, covering actions such as collection, cleaning, selection, documentation, analysis, curation, preservation and maintenance, or the role of Data Officer proposed in 15.
While some authors highlight differences between software and data 8, 9, the present article leans toward profiting from the similarities shared by RS and RD. For example, considering the difference between the definition of software and the definition of RS has led us to propose an RD definition that is independent of the definition of data. Likewise, throughout the above sections we have emphasized other characteristics of RD that are grounded in RS features. As a side effect of this approach, the fact that we can easily adapt elements of the RS definition to RD confirms and validates our proposed RS definition.
In the introduction we have mentioned Borgman’s conundrum challenges related to RD 22:
The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.
In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of RD production. We think that the vision developed in Section 4 can help to deal with these questions, as a first step to tackle some problems in a well determined situation. Moreover, the view proposed in Section 4 is extended and completed with the dissemination and evaluation protocols of 12. Our experience of many years confirms the need for these protocols for RS, and we think that they will be appropriate, useful and relevant for RD as well.
As a test for the soundness of the proposed RD definition, we have used the conundrum queries as a benchmark, checking whether our definition allows us to answer the different questions, as well as two extra ones that we consider equally relevant, namely how and where to share RD:
Which data might be shared? Following the arguments supporting our RD definition, we think that it is a decision of the research team: just as the team decides at some stage to present its research work in the form of a document for dissemination as a preprint, a journal article, a conference paper, a book, etc., the team should decide which data might be shared, in which form, and when (perhaps following funder or institutional Open Science requirements).
By whom? The research team that has collected, processed and analyzed the RD, and decided to share and disseminate it; that is, the RD producer team, as stated in the second characteristic of our RD definition. Data ownership issues have been discussed, for example, in 20, 21, 32, 61–63.
How? As observed in the preceding sections, the How? should follow a dissemination procedure such as the one proposed in 11, 12, in order to: correctly identify the RD set of files; set a title and the list of persons in the producer team (possibly completed with their different roles); determine the important versions and associated dates; provide documentation; verify the legal 21, 33 (and ethical) context of the RD; and choose the license that settles the sharing conditions 13. This can include the publication of a data paper, and decisions about the form in which, and the moment when, the RD should be disseminated, perhaps following grant funders' or institutional Open Science requirements. In order to increase the return on public investments in scientific research, RD dissemination could respect principles and follow guidelines as described in 17, 19. Further analysis of RD dissemination issues can be found in 12.
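As an illustration only, the checklist above could be captured in a small machine-readable record accompanying the released files. The field names below are hypothetical, invented for this sketch, and do not follow any particular metadata standard:

```python
# A minimal, hypothetical metadata record for an RD release, covering the
# items listed above: identification, producer team and roles, versions and
# dates, documentation, and the license that settles the sharing conditions.
# All field names and values are illustrative placeholders.
rd_record = {
    "title": "Example survey dataset",
    "producer_team": [
        {"name": "A. Researcher", "role": "collection"},
        {"name": "B. Researcher", "role": "analysis"},
    ],
    "versions": [{"version": "1.0", "date": "2021-06-01"}],
    "files": ["data.csv", "codebook.pdf"],
    "documentation": "README.md",
    "license": "CC-BY-4.0",
    "related_publication": "doi:10.0000/example",  # placeholder identifier
}

def check_record(record):
    """Check that the main dissemination-checklist items are present."""
    required = {"title", "producer_team", "versions", "files",
                "documentation", "license"}
    return required.issubset(record)

print(check_record(rd_record))  # prints True
```

Such a record is only one possible way to make the producer team, versions and sharing conditions explicit; data-paper templates or repository deposit forms serve the same purpose.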
Where? There are different places to disseminate RD, including the web pages of the producer team or of the funded project, or an existing data repository. Let us remark that the Registry of Research Data Repositories 28 is a global registry of RD repositories covering different academic disciplines. It is funded by the German Research Foundation (DFG) 29 and can help to find the right repository. Note that the Science Europe report 64 provides criteria for the selection of trustworthy repositories in which to deposit RD.
With whom? Each act of scholarly communication has its own target public, and initially the RD dissemination strategy can target the same public as the corresponding research article. But it can happen that the RD is of interdisciplinary value, possibly wider than the initial discipline associated with the scientific publication, and it can be difficult to assess the public involved in this larger context. Indeed, as observed in 22:
An investigator may be part of multiple, overlapping communities of interest, each of which may have different notions of what are data and different data practices. The boundaries of communities of interest are neither clear nor stable.
So, it can be complex to determine the community of interest for a particular RD, but the same happens for articles: see, for example, the studies on HIV/AIDS 65 that refer (in their reference number 12) to automatic reasoning in elementary geometry; it seems to us that this has never been an obstacle to sharing a publication. Thus 22:
… the intended users may vary from researchers within a narrow specialty to the general public.
Under what conditions? As described previously, and in parallel with the case of RS, the sharing conditions are to be found in the license that accompanies the RD, such as a Creative Commons license 30 or other licenses that settle the attribution, re-use, mining, etc. conditions 13. For example, in France, the 2016 Law for a Digital Republic sets out, in a décret, the list of licenses that can be used for RS or RD release 31, 32.
Why and to what effects? There may be different reasons to release RD, from contributing to more solid and more easily validated science, to simply complying with the recommendations or requirements of a project's funder, of the institutions supporting the research team, or of a scientific journal, including Open Science issues 5. The works 22, 49 give a thorough analysis of this subject. As documented there, and as already mentioned in Section 3:
“The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research.”
As remarked in 5 and in the work analyzed there, the evaluation step is an important enabler for improving the adoption of Open Science best practices and increasing RD sharing and open access. Disseminating high quality RD outputs is a task that requires time, work, and hands willing to verify the quality of the data, write the associated documentation, etc. Incentives are needed to motivate teams to accomplish these tasks. RD dissemination also calls for the establishment of best citation practices and for evolution in research evaluation protocols. In particular, following the parallelism present throughout this work, the CDUR protocol 3 proposed for RS evaluation can also be proposed for RD, as developed in the article that extends the present work 12.
Acknowledgments.
With many thanks to the Referees, to the Departamento de Matemáticas, Estadística y Computación de la Universidad de Cantabria (Spain) for hospitality, and to Prof. T. Margoni for useful comments and references.
[version 2; peer review: 3 approved]
This work is partially funded by the CNRS-International Emerging Action (IEA) PREOSI (2021-22).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
1 “We Are Star Dust” - Symphony of Science, https://www.youtube.com/watch?v=8g4d-rnhuSg
2 Cave of Altamira and Paleolithic Cave Art of Northern Spain, https://whc.unesco.org/en/list/310/
3 Prehistoric Sites and Decorated Caves of the Vézère Valley, https://whc.unesco.org/en/list/85/
4 https://en.wikipedia.org/wiki/Free_and_open-source_software
5 https://en.unesco.org/science-sustainable-future/open-science/recommendation
6 https://www.fsf.org/licensing/
7 https://opensource.org/licenses
8 https://spdx.org/licenses/
9 https://creativecommons.org/licenses/?lang=en
10 https://opendatacommons.org/licenses/
11 https://5stardata.info/en/
12 Note that the authors of this article provide their own translations. Authors prefer to keep the original text for two reasons. First, because of the legal nature of the involved quotations. Second, for French or Spanish speaking readers to enjoy it, very much in line with the Helsinki Initiative on Multilingualism in Scholarly Communication (2019), see https://doi.org/10.6084/m9.figshare.7887059 . These translations have been helped by Google Translate, https://translate.google.com/ and Linguee, https://www.linguee.fr/ .
13 https://www.rd-alliance.org/groups/fair-research-software-fair4rs-wg
14 https://matek.hu/zoltan/issac-2021.php
15 https://swmath.org/software/4203
16 See, for example, https://www.senat.fr/dossier-legislatif/pjl16-504.html
17 https://stats.oecd.org/glossary/detail.asp?ID=532
18 https://www.isko.org/cyclo/data
19 https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000006832922/
20 https://www.legifrance.gouv.fr/codes/section_lc/LEGITEXT000006074220/LEGISCTA000022936254/
21 The entries of the glossary https://stats.oecd.org/glossary/ have several parts including Definition and Context as shown in the Data definition included in Section 3 . This quotation appears when placing the pointer over the Context part of the Data entry.
22 https://www.beilstein-institut.de/en/projects/strenda/guidelines/
23 https://en.wikipedia.org/wiki/Open_data
24 https://en.wikipedia.org/wiki/Big_data
25 https://www.ukri.org/wp-content/uploads/2020/10/UKRI-020920-ConcordatonOpenResearchData.pdf
26 https://www.ukri.org/
27 https://en.wikipedia.org/wiki/Agile_software_development
28 https://www.re3data.org/
29 http://www.dfg.de/
30 https://creativecommons.org/
Joachim Schopfel.
1 GERiiCO Labor, University of Lille, Lille, France
The second version is fine with me. The authors replied to all comments; they fixed some issues, and they provided complementary arguments for other issues. I do not share all their viewpoints but that is science and not a problem. The paper is interesting and relevant.
Is the work clearly and accurately presented and does it cite the current literature?
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
No source data required
Is the study design appropriate and is the work technically sound?
Are the conclusions drawn adequately supported by the results?
Are sufficient details of methods and analysis provided to allow replication by others?
Reviewer Expertise:
Information science
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
1 Instituto de Agroquímica y Tecnología de Alimentos, CSIC, Valencia, Spain
I do not have any further comments.
Open science, open research data, scholarly publications, open access policies
1 Institute of Learning Technologies, Eszterházy Károly University, Eger, Hungary
I am satisfied with the author’s reply, and found the other two reviews’ comments intriguing and useful for the authors. I have no further comments.
Research data management is a central dimension of the development of scientific research and related infrastructures. Also, any original attempt to define research data is welcome and helpful for the understanding of this field. This conceptual paper will be a valuable contribution to the discussion on research data. Yet, it should be improved, and a couple of more or less minor issues should be fixed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
LIGM, Gustave Eiffel University & CNRS, France
Many thanks to you, Joachim Schopfel, for your interesting comments that give us the opportunity to improve this work. A new version is in preparation, but we provide here some answers to your comments.
1. [translations into English]
Translations are included as footnotes; they will be moved to the main text.
2. [information science (eg ISKO).]
Many thanks for this reference, we are looking into it.
3. [Open science is a fuzzy concept…]
As indicated in the introduction: A more transversal and global vision can be found in the ongoing work for the UNESCO Recommendation on Open Science [Reference 6]. See also [Reference 7]. We will explain this point better.
4. [the paper is in some kind limited or reduced to the aspect of "research output". Generally, in the research process, research software and research data are not only output but also tools (software) and input (data). This needs clarification.]
In our view, each "research output" is a potential input for new research work. For example, an RS can be a tool to manipulate data or an input for a new RS, whether as a component or as a new version produced by the initial research team or another one. An RD can be used by other teams (as a tool) to understand some problem, it can be modified to produce a new RD, or it can be included as part of a larger data set, which can itself be a new RD. Better understanding the production context is not, in our view, a limitation. But you are right, this point needs clarification.
5. [cites Wikipedia with " We must all accept that science is data and that data are science ".]
Please note that this cited phrase comes from [Reference 4], and, as indicated to Referee T. Koltay, we have chosen to make this reference in a slightly different manner than in Borgman’s work, where we found it.
6.1. [similarity/analogy]
When consulting the Cambridge English Learner’s Dictionary we find:
analogy: a comparison that shows how two things are similar
6.2. [Software and data are different objects, with different issues (IP protection, communities etc.); the analysis of RS may be helpful for a better understanding of RD but this does not mean that both are more or less similar or even "fungible".]
It is one of the intentions of the present work to show the differences between data and software from the legal point of view. While software has a somewhat clear and simple presentation (Section 2.1), data is much more difficult to grasp, as studied in Section 3. But this is not an obstacle to presenting a unified vision of RS and RD as research outputs, as can be seen in the proposed RS and RD definitions. The fact that we can propose a similar formulation for both definitions allows us to propose similar dissemination and evaluation protocols, as you can find in the article that follows this work [Reference 13]. The fact that we can deal with RS and RD in a similar way does not mean that they are similar.
7. [describe the relationship between RS and RD, perhaps with "use cases".]
It seems to us that it is quite usual for the targeted research audience to use and/or produce RS and/or RD as part of their everyday research practices, and that this point does not require further explanation. Examples can easily be found in the literature, for instance in the bibliography included at the end of this work.
8. [I admit that the authors are not legal experts but section 3 should be more explicit (and perhaps shorter and more restrictive) about the different laws and legal frameworks. Are you speaking about French laws? Or about the EU regulation?]
As indicated in the introduction, we have consulted legal texts and legal experts’ work in order to understand and explain the legal context in which we place this work. We have consulted French, European and USA texts, and selected the parts that we have used to document the article. We consider that our role is restricted to this intention, due to the lack of further expertise in legal matters, which does not diminish the efforts we have put into understanding and explaining some legal issues. But we are unable to give more information on the regulations that could be taken into consideration, as this is the role of legal experts in the light of a well defined setting.
9. [Another, related issue is the data typology. The paper is about research data but section 3 mentions (and apparently does not differentiate) environmental data, cultural data and public sector information.]
The goal of Section 3 is to show the difficulties that exist in setting a data definition from the legal point of view, which is a very different context from the one existing for software, as shown in Section 2.1. The case of cultural data is very interesting, as, legally speaking [Reference 19], the quality of the producer legal entity defines the cultural quality of the data. We can then establish a parallel: the research quality of some data set is a consequence of the research quality of the producer team. Data typology could be the object of future work.
10. [My suggestion would be to improve the structure of section 3 and to distinguish between concepts, typology, legal status and reuse/policy (subsections).]
We will consider this suggestion.
11. [Section 4: I already mentioned it above - RD is not only output but also input, with different issues (third party rights etc). This requires clarification.]
As already explained, we study here the production aspects; other aspects are presented in [Reference 13]. But you are right, this needs better explanation.
12. [At the end of section 4, the paper states that "documentation, licenses, Data Management Plans and other documents can also be part of the set of files that constitutes the RD". ]
Section 2.1 shows that the preparatory design work and documentation are part of the software, and these are documents that can be included in the released version of an RS, at the choice of the RS producer team. There can be other elements, for example tests, input and output files illustrating how to use the RS, licenses, etc. Including these elements in the released RS corresponds to best practices that facilitate RS reuse. In our view, releasing an RD can follow similar practices, that is, including documentation, some use examples, a license, a data management plan… this is to be decided by the producer team.
13. [Last comment: I like very much Borgman's assessment of RD and her "conundrum challenges" but I have a somewhat different understanding of the meaning of this - for me, these "challenges" are questions that require attention and evaluation in a given situation, not for all RD in a general way. For me, they provide a kind of "reading grid" to analyse a specific data community, or a specific instrument or infrastructure or workflow; but they don't require or demand a comprehensive response as such provided by the paper.]
In our experience, Borgman's conundrum challenges correspond to questions that appear regularly at different stages of RD production. We think that a vision such as the one presented in Section 4 could help to deal with these questions and, as you said, serve as a first step to tackle some problems in a well determined situation. Moreover, the view proposed in Section 4 is extended and completed with the dissemination and evaluation protocols proposed in [Reference 13]. Our experience of many years confirms the need for these protocols for RS, and we think that they will be appropriate, useful and relevant for RD as well.
Teresa Gomez-Diaz and Tomas Recio
The authors proposed a Research Data (RD) definition "based in three characteristics: the data should be produced (collected, processed, analyzed, shared & disseminated) to answer a scientific question, by a scientific team, and has yield a result published or disseminated in some article or scientific contribution of any kind." From my point of view this definition restricts RD to those that are published by a scientific team; however, what about citizen science, or data produced by non-scientist staff? What about any other data that do not deserve to be published but help to further research?
The authors say: "the RS is involved in the obtention of the results presented in scientific articles" - This is not necessarily true. RS is not always involved in obtaining results, because it can be developed for any other purpose; again, the authors make a very strict definition.
The authors say: "As a matter of fact, a research team can use RS produced by other teams for their scientific work, as well as FLOSS or other software developed outside the scientific community, but the present work is centered in the making-of aspects which are pertinent for the proposed definition." - This greatly restricts the definition of Research Software (RS), by excluding all FLOSS produced by non-academic members.
The authors have missed the opportunity to discuss Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information, in which RD are defined and included as part of the public sector. In fact, the authors cite it, but they do not comment on the fact that RD has a wider meaning, that according to this Directive such data are considered public sector information, and that they need not necessarily be published in a scientific journal but shared.
Definitions given by dictionaries are not particularly relevant to the scientific context/environment. I think this part should be omitted; it only adds some definitions in the authors' own languages.
"For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse." - This is part of the FAIR principles, which are not mentioned or linked to this comment. Obviously, a dataset without any information about how the data have been produced or obtained is not valuable.
The authors write: "In here, the research outputs have reach a status in which the research team is happy enough for its dissemination." - This seems a very naïve assertion. Because the authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.
The Conclusions again repeat the proposal of an RD definition. Concepts like linked data, FAIR data, and open data have not been treated in the article. The authors' definition of RD is very strict and narrow, and they have not considered any semantic issues about data, or the benefits and implications of being 5-star open data. Their definition is far from the 4th or 5th step of the stars.
In general, from my point of view, the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.
Many thanks to you, Remedios Melero, for these very interesting comments. We are preparing a new version of this article and we will include several of the proposed corrections. Meanwhile, we would like to provide here some preliminary comments.
1. [this definition restricts RD to those that are published by a scientific team, however what about the citizen science, or data produced by non-scientist staff?]
[the article does not add any new ideas about RD definition and restricts it to data produced by scientific teams.]
It would be strange to consider any article published in a newspaper as a scientific publication.
On the other hand, scientists may read newspapers and many other documents, including tweets, and may use these documents as input information for a research work. As already explained in our answer to Rob Hooft’s comment, yes, we have chosen a restricted definition of RD. It allows us to provide the answers to Borgman’s conundrum challenges that appear in the Conclusion section; as far as we know, no such answers, in this complete a view, have been proposed in the consulted literature. Moreover, as the RD definition has a formulation similar to the RS one, we can also translate RS dissemination and evaluation protocols to RD [Reference 13]. Once the restricted context is well understood, its extension can be studied, to see which answers to Borgman’s conundrum challenges, and which dissemination and evaluation protocols, can be proposed in the extended context.
The fact that we do not include, e.g., public sector data as RD is different from claiming that these data cannot be used as input for a research work. As explained in Section 3.2 of [Reference 13], such external data components should be correctly presented and referenced, and some can also fall into the category of RD.
2. [RS is not always involved in the obtention of results because it can be developed for any other purpose, again the authors make a very strict definition.]
[This restricts the definition of Research Software (RS) a lot by excluding all FLOSS produced by non-academic members.]
You are right, this point should be explained better. Obtaining a research result may involve the use of software (FLOSS or not), the development of software to support some work or service, and the development of RS by the research team, as explained in [References 3, 14]. Note that RS can also be disseminated as FLOSS, which is the usual practice in the work of T. Recio and in the research lab of T. Gomez-Diaz. The same holds for data and RD, which can be disseminated as open data, and for publications and research articles, as seen in the previous point.
3. [Research data defined in the Directive (EU) 2019/1024 ]
This definition was included in preparatory versions of the present article, and it will be included again in the new version in preparation, following your advice.
4. [Definitions given by dictionaries]
Given the difficulty of explaining easily the concepts of data and information, we ended up consulting several dictionaries, including some in English. Some of the definitions we found, mainly in Spanish and French, attracted our attention and we decided to include them in Box 1. This box can easily be skipped by readers not interested in these definitions.
We prefer to leave the reading of the content of this box to the choice of readers.
5. [FAIR and "For example, to the need for complementary, technical information associated to a given dataset in order to facilitate its reuse."]
Please note that the FAIR principles appear in [Reference 55], dated 2016, while [Reference 36], which we chose to illustrate the need for complementary, technical information, is dated 2012. Moreover, this is also related to the importance of context, which is explained in the OECD Glossary of Statistical Terms, with PDF and Word download versions dated 2007 [ https://stats.oecd.org/glossary/download.asp ]. On the other hand, the FAIR principles are considered in the second part of this work [Reference 13], as they are related to dissemination issues. We will also mention them in the second version of this first part.
6. ["In here, the research outputs have reached a status in which the research team is happy enough for its dissemination."]
[authors "do not consider production issues like best software development practices or data curation", it seems they do not care about these important issues.]
You are right, this point should be better explained in the new version of the article. It is not that we do not care about these important issues; they are part of the 3rd step of the proposed CDUR evaluation protocol for RS and RD, see Sections 2.3 and 3.3 of [Reference 13].
7. [Concepts like linked data, FAIR data, and open data have not been treated in the article. Their definition of RD is very strict and narrow, and they have not considered any semantic issues about data and the benefits and implications of being a 5star open data . Their definition is far from the 4th or 5th step of the stars.]
Please note that FAIR data and open data are treated in [Reference 13]. We will mention 5-star open data in the second version; many thanks for this reference.
Teresa Gomez-Diaz, Tomas Recio
The content of the first two paragraphs of the paper (especially the first one) seems less appropriate to the purpose of your paper. I would thus advise you to consider rewriting these paragraphs.
Your practice of providing the cited texts in the original language (French or Spanish), with translations of these passages only in the footnotes, is unusual and may not be appropriate for a readership that probably reads and writes only in English, or is not familiar with Spanish and/or French texts. As I see it, if you want to do a favour to those of your readers who prefer French or Spanish, the solution could be to reverse this order, i.e. to put the original texts into the footnotes.
Other remarks
I think that it would be better if the following sentence would be changed as follows:
This concerns not only the form of citing but also the content, because this remark comes from Borgman’s Conundrum, cited in your paper a couple of times.
You describe three main characteristics of RS:
In general, these three claims are correct. However, the first one is a little awkward. I would thus change it to something like “the goal of the RS development is to support research. As stated by Kelly, it is developed to answer a general, or a specific, scientific question. Writing the software requires close involvement of someone with deep domain knowledge in the application area related to the question [32].” These sentences may, however, prove redundant, because you provide a more complete definition:
You write that “Indeed, there is a difference between the concepts of algorithm and software from the legal point of view, as there is a difference between the mere idea for the plot of a novel and the final written work.” This is a brilliant idea, although I believe that it should not be restricted to the legal point of view.
In my view, it seems dangerous to write about copyright issues without being legal experts. Personally, I have only basic knowledge of copyright law, so I cannot judge the correctness of all your arguments. Fortunately, what you describe is also related to different issues.
I do not see any further problems. Therefore, I will not enumerate passages that are correct and rather straightforward. My suggestion is, however, that you carefully review your text in order to reach clarity of argument.
Many thanks to you, Tibor Koltay, for these very interesting comments. We are preparing a new version of this article and we will include several of the proposed corrections. Meanwhile, we would like to provide here some preliminary comments.
1. [first two paragraphs]
We have chosen to start in a ''light'' manner an article that may require some effort to understand; this is our choice as authors. It is the reader’s choice to skip these first two paragraphs or to enjoy them, as this has no consequence for the understanding of the content of the article.
2. [translations to English]
We agree with you: the translations to English in the footnotes may hinder fluent reading of this work. We will modify the presentation.
3. [Hanson et al. Reference]
You are right: we found this reference in Borgman’s work, but we have consulted the original article and have chosen to make this reference in a slightly different manner.
4. [RS definition characteristics]
We will modify the phrase to include your proposition as follows: “the goal of the RS development is to do or to support research’’. Please note that the composition of a research team involved in the development of an RS has been thoroughly studied in Section 2.2 of [Reference 3]. We will include this reference to clarify this point, as you ask. Please also note that long developments may involve many different contributions from developers with different statuses. As copyright issues come into play, it is important that the RS developers and contributors are correctly listed.
5. [Algorithms and software]
Comparisons between algorithms and software can be made in several contexts, for example in mathematics or in computer science, among others. We have highlighted the legal aspects because we regularly detect confusion between these two concepts, and [Reference 22] provides a pretty clear explanation.
6. [Copyright issues]
Please note that one of the authors has studied copyright issues in order to write [Reference 2], a work that has been validated by several experts, including legal experts. On the other hand, we are regularly in contact with, and follow the work of, legal experts, which provides us with the necessary confidence to deal with copyright issues in the way we propose in this article. The remark included at the end of the Introduction gives the necessary warning to our readers on this point.
Database Search
Harvard Library licenses hundreds of online databases, giving you access to academic and news articles, books, journals, primary sources, streaming media, and much more.
The contents of these databases are only partially included in HOLLIS. To make sure you're really seeing everything, you need to search in multiple places. Use Database Search to identify and connect to the best databases for your topic.
In addition to digital content, you will find specialized search engines used in specific scholarly domains.
Research Data – Types, Methods and Examples
Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question.
It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in scientific inquiry and is often subject to rigorous analysis, interpretation, and dissemination to advance knowledge and inform decision-making.
There are generally four types of research data:
This type of data involves the collection and analysis of numerical data. It is often gathered through surveys, experiments, or other types of structured data collection methods. Quantitative data can be analyzed using statistical techniques to identify patterns or relationships in the data.
This type of data is non-numerical and often involves the collection and analysis of words, images, or sounds. It is often gathered through methods such as interviews, focus groups, or observation. Qualitative data can be analyzed using techniques such as content analysis, thematic analysis, or discourse analysis.
This type of data is collected by the researcher directly from the source. It can include data gathered through surveys, experiments, interviews, or observation. Primary data is often used to answer specific research questions or to test hypotheses.
This type of data is collected by someone other than the researcher. It can include data from sources such as government reports, academic journals, or industry publications. Secondary data is often used to supplement or support primary data or to provide context for a research project.
There are several formats in which research data can be collected and stored. Some common formats include:
Some common research data collection methods include:
Some common research data analysis methods include:
Research data serves several important purposes, including:
Research data has numerous applications across various fields, including social sciences, natural sciences, engineering, and health sciences. The applications of research data can be broadly classified into the following categories:
Research data has numerous advantages, including:
Research data has several limitations that researchers should be aware of, including:
Research Database (from the class: Documentary Production)
A research database is a structured collection of data and information that allows users to search, retrieve, and analyze content from various sources efficiently. These databases often include articles, books, multimedia, and archival materials relevant to specific subjects or fields, making them vital tools for gathering information and conducting in-depth analysis for various projects. They play a significant role in informing documentary production by providing factual content and sources during the research and development phases.
Archival Research : The process of locating, accessing, and analyzing historical documents and records that provide context and depth to a documentary subject.
Metadata : Data that provides information about other data, helping users understand the context, content, and organization of information within a database.
Literature Review : A comprehensive survey of existing research and publications on a specific topic, often used to identify gaps in knowledge or establish a foundation for new research.
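To make the metadata concept above concrete, here is a minimal sketch in Python; the record fields and values are entirely hypothetical, loosely echoing common cataloging conventions rather than any specific standard:

```python
# A minimal, hypothetical metadata record describing a dataset.
# None of these values refer to a real dataset.
metadata = {
    "title": "Interview transcripts, community radio project",
    "creator": "Example Research Team",
    "date_created": "2023-05-01",
    "format": "text/plain",
    "language": "en",
    "description": "Transcribed audio interviews collected during fieldwork.",
}

def describe(record):
    """Return a one-line, human-readable summary of a metadata record."""
    return f'{record["title"]} ({record["format"]}, created {record["date_created"]})'

print(describe(metadata))
```

The point is that the record describes the data (what it is, who made it, when, in what format) without containing the data itself, which is what lets a database search, retrieve, and organize its holdings.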
Methodology
Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.
First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :
Second, decide how you will analyze the data .
Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.
Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.
For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .
If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .
You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.
Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).
If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.
In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .
In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .
To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.
Research method | Primary or secondary? | Qualitative or quantitative? | When to use
---|---|---|---
Experiment | Primary | Quantitative | To test cause-and-effect relationships.
Survey | Primary | Quantitative | To understand general characteristics of a population.
Interview/focus group | Primary | Qualitative | To gain more in-depth understanding of a topic.
Observation | Primary | Either | To understand how something occurs in its natural setting.
Literature review | Secondary | Either | To situate your research in an existing body of work, or to evaluate trends within a research topic.
Case study | Either | Either | To gain an in-depth understanding of a specific group or context, or when you don’t have the resources for a large study.
Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.
Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.
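As an illustration of the quantitative reading described above, counting the frequencies of survey responses takes only a few lines of Python; the responses here are invented for the example:

```python
from collections import Counter

# Hypothetical answers to a single closed survey question.
responses = ["agree", "disagree", "agree", "neutral", "agree", "disagree"]

# Quantitative reading: how often does each answer occur?
frequencies = Counter(responses)
print(frequencies.most_common())  # [('agree', 3), ('disagree', 2), ('neutral', 1)]
```

A qualitative reading of the same material would instead examine what respondents meant, which no frequency table can capture on its own.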
Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:
Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .
Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).
You can use quantitative analysis to interpret data that was collected either:
Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.
Research method | Qualitative or quantitative? | When to use
---|---|---
Statistical analysis | Quantitative | To analyze data collected in a statistically valid manner (e.g. from experiments, surveys, and observations).
Meta-analysis | Quantitative | To statistically analyze the results of a large collection of studies. Can only be applied to studies that collected data in a statistically valid manner.
Thematic analysis | Qualitative | To analyze data collected from interviews, focus groups, or textual sources. To understand general themes in the data and how they are communicated.
Content analysis | Either | To analyze large volumes of textual or visual data collected from surveys, literature reviews, or other sources. Can be quantitative (i.e. frequencies of words) or qualitative (i.e. meanings of words).
If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.
Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.
Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.
In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .
A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.
In statistics, sampling allows you to test a hypothesis about the characteristics of a population.
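The idea of drawing a sample from a larger population can be sketched with Python's standard library; the population here is just a list of invented student IDs, echoing the example above:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical population: 1,000 student IDs.
population = list(range(1, 1001))

# Draw a simple random sample of 100 students, without replacement.
sample = random.sample(population, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100 -> no student sampled twice
```

This is simple random sampling only; stratified or cluster designs would partition the population first, but the subset-of-a-population idea is the same.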
The research methods you use depend on the type of data you need to answer your research question .
Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.
Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).
In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .
In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.
When we talk about Big Data, what do we really mean? Toward a more precise definition of Big Data
Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. 
We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.
As the amount of data being generated continues to grow over the last two decades ( Petroc, 2023 ), the need to process and analyze it became increasingly urgent. This led to the development of new tools and techniques for handling Big Data ( Khalid and Yousaf, 2021 ), such as distributed computing frameworks like Hadoop and Apache Spark, as well as machine learning algorithms like neural networks and decision trees. As researchers and practitioners in various fields began to integrate Big Data into their work, a body of literature emerged discussing its implementation, characteristics and features, as well as its potential benefits and limitations.
The concept of Big Data has become increasingly important for scientific research. Its characteristics have been extensively discussed in the literature, and numerous researchers have summarized the definitions of Big Data and its closely related features ( Khan et al., 2014 ; Chebbi et al., 2015 ). Epistemological discussions are common ( Kitchin, 2014 ; Ekbia et al., 2015 ; Succi and Coveney, 2019 ), and the recent emergence of new quantitative approaches utilizing large text corpora offers new opportunities to understand Big Data from different perspectives. Hansmann and Niemeyer (2014) , for example, apply a text mining approach to extract and characterize the elements of the Big Data concept by analyzing 248 publications relevant to Big Data. They focus on the concept of the term Big Data and summarize four dimensions to describe it: the dimensions of data, IT infrastructure, applied methods, and an applications perspective. Van Altena et al. (2016) analyze a large number of articles from the biomedical literature to give a better understanding of the term Big Data by extracting relevant topics through topic modeling. Akoka et al. (2017) review studies using a systematic mapping approach based on more than a decade of academic papers relevant to Big Data. Their bibliometric review of Big Data publications shows that the research community has made a significant contribution in the field of Big Data, which is further evidenced by the continued increase in the number of scientific publications dealing with the concept.
Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. Many authors endeavor to provide comprehensive definitions of Big Data based on their research, aiming to better capture its essence. However, a universally accepted definition is yet to emerge. Instead, a commonly accepted description portrays Big Data through various “V” characteristics (volume, variety, velocity, etc.). Nonetheless, this widely accepted description does not ensure a profound common ground for discussing Big Data in different contexts. As a result, Big Data is still perceived as a broad and vague concept, making it difficult to grasp in interdisciplinary contexts. Over the past few decades, there has been little systematic research on the position and practical implications of the term Big Data in research environments. When different authors spend significant portions of their studies discussing similar definitions and characteristics of Big Data, they are actually referring to different concepts (technology, platforms, or datasets), which is often not explicitly stated. Exploring this ambiguity is crucial, as it forms the basis for research and discussion. Many reviews aim to enrich a comprehensive understanding of Big Data, rather than retrospectively observing its actual usage in different contexts. We believe this inspection of the current usage situation of Big Data will help to clarify the ambiguities and facilitate clearer communication among researchers by providing a framework for understanding what Big Data entails in different contexts.
To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies ( Kitchenham et al., 2009 , 2010 ) to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.
The rest of this paper is structured as follows: we first present our systematic literature review methodology. The results are then divided into four sections: an overview of the bibliographic information of the extracted articles, a summary of the most prominent technologies used, a discussion on the technologies that are considered to be Big Data, and a presentation of the perceived benefits and drawbacks of Big Data. Finally, before conclusion we discuss our findings in terms of the understanding of Big Data and its value, and suggest avenues for further research.
This study employs a SLR methodology to investigate the use of Big Data in scientific research. Our review, which follows the guidelines for SLR defined by Kitchenham et al. (2009) and Kitchenham et al. (2010) , enables a structured and replicable procedure that increases the reliability of research findings. An SLR study is referred to as a tertiary study when it applies the SLR methodology on secondary studies, i.e. it surveys literature reviews instead of primary studies. We selected a tertiary approach for this study due to the extensive range of Big Data technologies and applications in research. Collecting primary data from a large number of papers across domains can be a laborious task. By analyzing secondary data sources instead, our approach offers a comprehensive and holistic overview of the landscape of Big Data in the scientific community. Adopting this approach allows us to better control the bibliographic selection and obtain a comprehensive overview of the field, thereby ensuring high-quality data analysis.
The research questions addressed by this paper are:
1. Which technologies are considered Big Data in different fields?
2. What does Big Data imply in the context of a scientific study?
3. What is the perceived impact of the adoption of Big Data in each research domain?
For the purpose of answering RQ1, we have referenced the Big Data technology taxonomy developed by Patgiri (2019). We anticipated that the reviewed papers would mention a broad, heterogeneous spectrum of technologies, techniques, and applications, which could pose a challenge in synthesizing the results. Therefore, we relied on Patgiri's taxonomy to guide our full-text review and data extraction process to identify the Big Data technologies utilized in the various scientific domains. This taxonomy offers a comprehensive overview of the various technical approaches to Big Data. By utilizing this taxonomy, we aim to initiate a discourse on what Big Data truly represents in each research domain, and how it has been understood by the respective scientific community. This discussion will enable us to further deliberate on RQ2, which examines the perception of Big Data in specific research domains. To structure our investigation for both research questions (RQ1 and RQ2), we utilize the subject taxonomy of Web of Science (WoS) to classify scientific fields, and apply no limitations to the scope of included scientific fields. As a result, both questions encompass a wide range of Big Data usage.
RQ2 is rooted in the fact that the scope of Big Data technologies has yet to be strictly defined or to reach consensus. Through this question we aim to investigate which technologies or objects are perceived as constituting Big Data in specific research domains. In this way we attempt to shed light on how Big Data as a concept is implicitly understood in each domain by linking it to specific technologies. In addition to offering an overview of the technological landscape, RQ3 aims to delve deeper into the impact of Big Data in each domain by examining the benefits and drawbacks of its adoption, as reported by the examined secondary studies. By doing so, we seek to gain a comprehensive understanding of the role of Big Data in scientific domains.
This study uses the framework of Population Intervention Comparison Outcome and Context (PICOC) (Petticrew and Roberts, 2008) to guide the research questions and the systematic literature review process, with the aim of exploring the use of Big Data technologies across various scientific domains and understanding the perceived meaning and value of Big Data in the scientific community. The use of the PICOC standard allowed for a structured approach to conducting the study and ensured that all relevant aspects were considered in the data collection, refinement, extraction, and analysis processes (Mengist et al., 2020). The Population of interest consists of studies that explore the state of the art of Big Data in different research domains. Intervention refers to systematic secondary studies, such as systematic literature reviews, mapping studies, and surveys. The Comparison in this study is between the perception of Big Data as a concept and the adoption of related technologies. The desired Outcome of this study is a conceptual mapping of how the term Big Data is applied in research across domains. In terms of Context, the study focuses on research domains that use the term Big Data to explore the potential of its technologies and tools in advancing research, without any restriction on the type of domain.
Figure 1 summarizes the process followed by this SLR. Each of the steps in this process is discussed in more detail in the following.
Figure 1 . Process of the research methodology.
Our aim for data collection was to explore as wide a range of studies in as many different domains as possible. As a result, we decided to retrieve candidate studies through queries on the online databases of Elsevier Scopus, Web of Science, and PubMed. Scopus is a database containing 77 million records from almost 5,000 international publishers ( Scopus Content Coverage Guide, 2023 ), Web of Science is one of the world's leading databases which covers all domains of science ( Cumbley and Church, 2013 ), and PubMed is a comprehensive medicine and biomedical database ( Falagas et al., 2007 ). Our search strategy for all three databases, arrived at after a few stages of pilot searches, was to look for publications with the exact phrase “big data” and either technology, platform, application, or adoption in the title, along with review, survey, or “mapping study” in either the title or the keywords.
The specific search query for each database is shown in Figure 2 . As the databases use slightly different tags to identify search fields, the search query expressions varied but had the same conceptual content. For example, Scopus used the tag TITLE for the title field, whereas WoS used TI and PubMed used the field name in square brackets after the term. For the keywords field, Scopus automatically combined author keywords, index terms, trade names, and chemical names in the tag KEY. WoS provided two relevant fields for keywords, AK representing author keywords and KP representing keywords derived from the titles of article references based on a special algorithm. PubMed placed author keywords and other terms indicating major concepts in the field Other Term. We obtained 490 results from Scopus, 116 from WoS, and 18 from PubMed, resulting in a total of 624 articles before removing duplicates.
Figure 2 . Search queries and screening process.
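To make the conceptual structure of the search concrete, the following is a rough, illustrative reconstruction of the query strings from the description above; the exact database-specific syntax is the one shown in Figure 2, and the field tags and grouping here are assumptions for illustration only.

```python
# Illustrative reconstruction of the conceptual search query described in the
# text. The authoritative, database-specific strings are given in Figure 2;
# the exact grouping and field-tag syntax here are assumptions.

title_terms = '"big data" AND (technology OR platform OR application OR adoption)'
type_terms = 'review OR survey OR "mapping study"'

# Scopus-style syntax (TITLE for the title field, KEY for combined keywords)
scopus_query = (
    f"TITLE({title_terms}) AND "
    f"(TITLE({type_terms}) OR KEY({type_terms}))"
)

# Web of Science-style syntax (TI for title, AK/KP for the two keyword fields)
wos_query = (
    f"TI=({title_terms}) AND "
    f"(TI=({type_terms}) OR AK=({type_terms}) OR KP=({type_terms}))"
)

print(scopus_query)
print(wos_query)
```

The same conceptual query is thus expressed three times, once per database dialect, which is why the hit counts (490, 116, and 18) can be meaningfully pooled before deduplication.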
We established the inclusion and exclusion criteria listed in Table 1 to ensure the relevance of the retrieved studies to be included in our analysis. In terms of inclusion criteria, we only considered articles that were published in English between 2017 and 2022 (in order to ensure that the studies reflect the current state of the art in each field), and that were secondary studies. Specifically, we focused on comprehensive reviews or systematic literature reviews that provided an overview of Big Data adoption in a particular research field, rather than articles that focused solely on the application of Big Data technologies to address individual research questions or on the improvement of Big Data technologies themselves. Moreover, we only included articles that were published in their entirety in journals, rather than conference proceedings or abstracts. In addition, we excluded early-access articles and articles not in their final published stage, to avoid potential bias or incomplete information. Publications in languages other than English were also excluded due to language barriers. Finally, we excluded primary studies, as our aim was to focus on the adoption, application, and implications of Big Data technologies in specific scientific domains, rather than on specific research questions.
Table 1 . Inclusion and exclusion for data selection.
By implementing these criteria, we were able to ensure that the articles selected for our analysis were recent, relevant, and provided a comprehensive view of the Big Data landscape. Moreover, by focusing on secondary studies, we were able to avoid redundant coverage of primary research studies and instead obtain a broader perspective on the field. Ultimately, this approach allowed us to generate a more comprehensive and robust understanding of the different Big Data technologies, applications, and techniques being used in various scientific domains. This step resulted in a corpus of 76 records after removing the duplicates.
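The screening step above amounts to a conjunction of simple predicates over each record's metadata. A minimal sketch, assuming a hypothetical record format (the field names are illustrative, not taken from our tooling):

```python
# Minimal sketch of applying the inclusion/exclusion criteria from Table 1.
# The record dictionary format and field names are hypothetical.

def passes_screening(record: dict) -> bool:
    """Return True if a record meets all inclusion criteria described above."""
    return (
        record["language"] == "English"
        and 2017 <= record["year"] <= 2022            # current state of the art
        and record["study_type"] == "secondary"        # reviews/SLRs only
        and record["venue_type"] == "journal"          # no proceedings/abstracts
        and record["stage"] == "final"                 # no early-access articles
    )

records = [
    {"language": "English", "year": 2020, "study_type": "secondary",
     "venue_type": "journal", "stage": "final"},
    {"language": "English", "year": 2015, "study_type": "secondary",
     "venue_type": "journal", "stage": "final"},       # too old: excluded
]
selected = [r for r in records if passes_screening(r)]
print(len(selected))  # → 1
```

In practice this filtering, followed by deduplication across the three databases, is what reduced the 624 retrieved records to the corpus of 76.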
To ensure that the papers selected for full-text review meet the requirements of our study, we then conducted a Quality Assessment (QA) based on five standards:
1. Is Big Data adoption discussed at sufficient depth?
2. Is a specific methodology being used in the secondary study?
3. Are the bibliographic sources used included in the study design?
4. Is the number of included primary studies clear in the study?
5. Are the results well-organized?
Firstly, we examined whether the papers provide an overview of Big Data adoption in a particular research field, excluding other aspects such as the application of Big Data technologies that focus on individual research questions or the improvement of Big Data Technologies themselves. Secondly, we assessed whether the selected papers clearly define their research methodology or base it on a specific paradigm. Thirdly, we evaluated whether the papers clearly specify their bibliographic sources for the secondary studies. Fourthly, we documented whether the number of primary papers that are contained in the secondary studies is also clearly stated. Finally, we evaluated whether the results of the papers are well-organized around research questions.
We scored the papers on these criteria in a binary fashion (fulfilled/not fulfilled) and, to be considered for inclusion in our study, set a threshold of at least three of the five standards. While QA is not in principle meant to serve as an additional filter, we decided that, given the observed wide disparities in the quality of the collected secondary studies, setting this threshold was justified. The resulting selected articles primarily use systematic literature reviews, although some pure review articles that are relevant to our study have also been included. This step left 33 studies for further processing.
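The binary scoring and threshold can be sketched as follows (the criterion keys and the per-paper scores are hypothetical; only the five standards and the ≥3 threshold come from the text):

```python
# Sketch of the binary quality assessment with the >= 3-of-5 threshold.
# Criterion names and the example scores are hypothetical.

QA_CRITERIA = [
    "adoption_depth",        # 1. Big Data adoption discussed at sufficient depth
    "explicit_methodology",  # 2. a specific methodology is used
    "sources_in_design",     # 3. bibliographic sources included in study design
    "primary_count_stated",  # 4. number of included primary studies is clear
    "results_organized",     # 5. results are well-organized
]

def qa_score(scores: dict) -> int:
    """Count how many of the five QA standards a paper fulfils."""
    return sum(1 for c in QA_CRITERIA if scores.get(c, False))

def passes_qa(scores: dict, threshold: int = 3) -> bool:
    return qa_score(scores) >= threshold

paper = {"adoption_depth": True, "explicit_methodology": True,
         "sources_in_design": False, "primary_count_stated": True,
         "results_organized": False}
print(qa_score(paper), passes_qa(paper))  # → 3 True
```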
In addition to the papers collected, we employed the forward snowball sampling method ( Wohlin, 2014 ) to supplement our database. For each article under full-text review passing the QA step, we reviewed all the titles of its references and the articles citing it. We added records that met our inclusion and exclusion criteria and were relevant to our topics. As we are only interested in recent studies on the topic, we did not perform backward snowballing. Ultimately, we selected only one additional study for inclusion, resulting in 34 studies in total for data extraction. All authors actively contributed to each step of the research process, engaging in thorough discussions and collaborations. In every stage outlined above, one researcher took the lead and collaborated with the other two authors to ensure comprehensive coverage and rigorous analysis.
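The forward snowballing step is essentially a one-level expansion over the citation graph, with the same screening applied to each candidate. A sketch under assumed data (the citation map and relevance check are hypothetical):

```python
# One-level forward-snowballing sketch (Wohlin, 2014): for each included
# paper, screen the papers citing it. Citation data here is hypothetical.

citing = {  # paper id -> ids of papers that cite it
    "S1": ["C1", "C2"],
    "S2": ["C2", "C3"],
}

def forward_snowball(included, citing, is_relevant):
    """Collect citing papers that pass screening and are not already included."""
    candidates = set()
    for paper in included:
        candidates.update(citing.get(paper, []))
    return sorted(c for c in candidates - set(included) if is_relevant(c))

# Suppose screening rejects C3 (e.g., it is a primary study)
new = forward_snowball(["S1", "S2"], citing, lambda c: c != "C3")
print(new)  # → ['C1', 'C2']
```

Backward snowballing would traverse reference lists instead; we omitted it because references necessarily predate the citing article and we were only interested in recent studies.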
The data extraction form shown in Table 2 was designed to answer the research questions of this study. The extraction fields can be divided into three categories. The first part concerns demographics-related data to be extracted from each study in order to check for possible bias toward specific fields or publication venues. The second part is used to answer RQ1 and RQ2. For the Research Field aspect, we applied the subject classification from WoS to ensure a consistent level of granularity across the disciplines described. This also enabled us to make cross-sectional comparisons and statistical summaries. Apart from the WoS categories, we added another field to specify the sub-level Application Domain in which the secondary study takes place. This helps us to better connect the technologies collected with their intended use cases. Because the technologies and applications we collected are described from a wide range of perspectives and suffer from heterogeneity issues, we added one more column, Perspective, to clarify and support the documentation. This column records the perspective from which the Big Data technologies are described. For example, some papers only summarized the use cases of machine learning methods as applications of Big Data; in such cases, besides the concrete names of methods and models documented in the Technologies column (including specific Big Data applications or platforms), the perspective is marked as machine learning. This helps us to further analyze and categorize the discussed technologies. The third part captures the benefits and shortcomings of Big Data, as reported by the authors of the secondary studies. This helps us to answer RQ3.
Table 2 . Fields of data extraction form.
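The three-part structure of the form can be sketched as a simple record type. The field names below are paraphrased from the prose; Table 2 gives the authoritative list.

```python
from dataclasses import dataclass, field

# Partial sketch of the data extraction form described in the text.
# Field names are paraphrased from the prose, not copied from Table 2.

@dataclass
class ExtractionRecord:
    # Part 1: demographics, to check for bias toward fields or venues
    title: str
    year: int
    venue: str
    # Part 2: fields answering RQ1 and RQ2
    research_field: str        # WoS subject category
    application_domain: str    # sub-level domain of the secondary study
    perspective: str           # angle from which technologies are described
    technologies: list = field(default_factory=list)
    # Part 3: fields answering RQ3
    benefits: list = field(default_factory=list)
    shortcomings: list = field(default_factory=list)

rec = ExtractionRecord("Example review", 2021, "Some Journal",
                       "Health Care", "psoriatic arthritis",
                       "machine learning", ["ANN", "NLP"])
print(rec.perspective)  # → machine learning
```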
The present study faces several potential threats to its validity. Firstly, the inclusion and exclusion criteria employed could introduce bias into the sample. If the criteria are too narrow, relevant articles may be omitted, while if they are too broad, irrelevant articles may be included, resulting in a less representative sample. To mitigate this threat, we thoroughly reviewed and refined our criteria through piloting to ensure an optimal balance between inclusivity and exclusivity. Secondly, the quality assessment of the chosen articles could be influenced by researcher bias, as different researchers might interpret the quality criteria differently. To address this issue, we utilized a standardized quality assessment tool during the evaluation process to ensure consistent interpretation of the criteria. Moreover, limiting the search to a specific publication date range could result in publication date bias, leading to an outdated or incomplete synthesis of the literature. To mitigate this, we focused on secondary studies and restricted the publication window to the most recent 5 years; since the included secondary studies themselves synthesize earlier primary work, the review still covers publications from a wide range of dates. To enhance the efficiency of the quality assessment, we established the quality assessment process before conducting the full-text review. This approach ensured that all selected articles were relevant to the topic and that their quality was assessed rigorously. Other possible threats to the validity of this study include publication bias and selective reporting of results. To mitigate these threats, we conducted a comprehensive search of multiple databases and employed the snowballing method, while carefully screening all articles for possible selective reporting.
This section presents the results of our study and is divided into several subsections. Firstly, we provide an overview of the background information of the articles in our corpus, including the distribution of disciplines, publishers, and popular databases cited for their secondary studies. The presentation of the results uses the classification of the secondary studies into research domains extracted for RQ1 and RQ2. In Section 3.2, we discuss our findings on the Big Data technologies extracted from our SLR in order to answer RQ1. In Section 3.3, we discuss the concepts that are perceived as Big Data as reported by the secondary studies in our corpus. Finally, we summarize the impact of the adoption of Big Data technologies in the various domains in Section 3.4.
The following section presents an overview of the created corpus used in the analysis, providing statistics on domain distribution, publishers, and bibliographic sources. Domain distribution helps to identify the research background of the analyzed articles, while publisher distribution sheds light on those who are active in the field of Big Data. Additionally, the bibliographic sources used by secondary studies provide insights into the popularity of different databases and sources among researchers in the field. This information is crucial since it can affect the quality of the data and the generalizability of the findings, and it ensures transparency in the analysis.
Figure 3 shows the distribution of WoS-defined disciplines of the articles extracted for the study. Health care emerged as the discipline that includes the most Big Data articles, with COVID-19 as a distinct area of interest that we promoted to its own domain. Transportation and Business & Economics were the next most popular areas of study. While several cross-domain studies did not specify the application domain of Big Data, they did focus on a specific aspect of Big Data applications such as storage and cybersecurity. Disaster response studies identified in the Public, Environmental & Occupational Health category are also notable. Other disciplines appear only once; these include Materials Science (focused on aerospace), Telecommunications (e.g., mobile Big Data), Construction & Building Technology, and Agriculture. It should be noted that this distribution does not necessarily represent the entire scientific community, and no conclusions regarding the level of interest in studying or using Big Data can be drawn from it.
Figure 3. Discipline distribution of the included articles.
Our analysis of the articles in our corpus, as summarized in Figure 4 , showed that MDPI and Elsevier published the most studies related to Big Data (with nine each), followed by Springer and IEEE (with three each). Eight more publishers have one publication each in our corpus and are omitted from the figure.
Figure 4 . Publishers' distribution of the included secondary studies.
In terms of databases chosen for secondary studies, Figure 5 shows that WoS and Scopus were the most popular sources, being in alignment with our choice of databases for our study, followed by IEEE, ScienceDirect, Google Scholar, Wiley Online Library, Sage, PubMed, and Association for Computing Machinery's (ACM) Digital Library.
Figure 5 . Popularity of databases used by the secondary studies.
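Tallies like those underlying Figures 4 and 5 are straightforward frequency counts over the extracted records. A sketch with hypothetical data (not the actual corpus counts):

```python
from collections import Counter

# Sketch of tallying a publisher (or database) distribution as in Figures 4-5,
# using hypothetical records rather than the actual 34-study corpus.

publishers = ["MDPI", "Elsevier", "MDPI", "Springer", "IEEE", "Elsevier"]
counts = Counter(publishers)
for publisher, n in counts.most_common():
    print(publisher, n)
```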
Overall, our corpus exhibits a wide diversity of publishing venues and data sources, providing evidence against a selection bias in our review.
To address RQ1, we refer to Patgiri (2019) , who presents a taxonomy of Big Data consisting of seven key categories: Semantics, Compute Infrastructure, Storage System, Big Data Management, Big Data Mining, Big Machine Learning, and Security & Privacy. This taxonomy covers various aspects of Big Data technologies, including implementation tools, system architectures, and operational processes. We categorize and describe our findings based on the six perspectives (excluding Semantics) proposed by Patgiri. Since the Semantics perspective primarily concerns the conceptual understanding of Big Data, we exclude it from our summary of technologies.
The data extraction results from our study reveal the use of various Big Data technologies already discussed in Patgiri (2019), including Xplenty, Apache Cassandra, MongoDB, Hadoop, Datawrapper, RapidMiner, Tableau, KNIME, Storm, Cloudera Distributed Hadoop (CDH), Kafka, Spark, MapReduce, Hive, Pig, Flume, Sqoop, Apache Tez, and Flink. These technologies are used to extract, process, analyze, and visualize large volumes of data across a range of scientific domains.
From the Compute Infrastructure perspective, the most commonly used technology is Hadoop, which is used for distributed computing and data processing. Various disciplines have been found utilizing Hadoop, including health care [S12, S63, S87, S133], environmental science [S77], and transportation [S194]. Other technologies, such as Apache Cassandra and MongoDB, are used for distributed data storage and retrieval. These technologies enable data processing at scale and facilitate parallel processing, enabling users to analyze large volumes of data quickly and efficiently. In terms of storage systems, Apache Cassandra, the Hadoop Distributed File System (HDFS), HBase, and MongoDB are popular choices for distributed data storage. These technologies are designed to handle large volumes of data and provide high scalability and fault tolerance. The reported application domains for these technologies include the Internet of Things (IoT), healthcare, decision making, and electric power data [S94]. [S133] notes that NoSQL database systems such as MongoDB, Cassandra, and HDFS can be used to handle exponential data growth and to replace traditional database management systems. Additionally, technologies such as Flume and Sqoop enable the efficient transfer of data between different storage systems [S74].
Regarding the Big Data Management perspective, frameworks such as Hadoop, MapReduce, Hive, and Pig are used for managing and processing large volumes of data. These technologies enable users to process data in a distributed environment, thereby increasing efficiency and scalability [S74, S133, S194]. Other technologies, such as Apache Tez and Flink, provide high-performance data processing and streaming capabilities [S74]. From the Big Data Mining & Machine Learning aspect, a range of machine learning models are used to analyze large volumes of data, including Artificial Neural Networks (ANN), fuzzy logic, Decision Trees (DT), regression, Support Vector Machines (SVM), Self-Organizing Maps (SOM), Fuzzy C-Means (FCM), K-means, Genetic Algorithms (GA), Convolutional Neural Networks (CNN), Random Forest, Rotation Forest, Bayesian methods, Boosted Regression Trees, Classification And Regression Trees (CART), Conditional Inference Trees, Maxent, Non-negative Matrix Factorization, and others [S19, S27, S65, S96, S178, S203, S207]. These techniques enable users to extract insights from large volumes of data, identify patterns and trends, and make predictions based on data analysis. Finally, as for security and privacy, Intrusion Detection Systems (IDS) are used to detect anomalous behaviors in complex environments. Machine learning models such as unsupervised online deep neural networks and deep learning techniques are used to identify and analyze Controller Area Network (CAN-Bus) attacks. Log parsers such as Zookeeper, Proxifier, BGL, HPC, and HDFS are used to secure data and ensure privacy in distributed environments [S88]. These technologies provide access control, authentication, and encryption capabilities to protect data against unauthorized access and misuse.
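For readers unfamiliar with the MapReduce model recurring above, the following toy word count (not code from any surveyed study) illustrates its three stages: map each input to key-value pairs, shuffle pairs by key, and reduce each group to a result.

```python
from itertools import groupby

# Toy illustration of the MapReduce programming model: word count in
# explicit map / shuffle / reduce stages. Purely pedagogical; frameworks
# like Hadoop distribute these stages across a cluster.

documents = ["big data tools", "big data platforms", "data mining"]

# Map: emit (word, 1) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reduce: sum the counts per key
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # → {'big': 2, 'data': 3, 'mining': 1, 'platforms': 1, 'tools': 1}
```

Because map and reduce operate independently per record and per key, the model parallelizes naturally, which is what enables the scalability the surveyed studies report.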
Although many Big Data technologies and their applications are presented in this section, a significant portion of the reviewed studies discuss additional technologies and applications that fall outside Patgiri's taxonomy yet are still labeled as Big Data. In order to distinguish between these two cases and further deepen the understanding of Big Data and related technologies in the scientific field, we discuss the latter case under the following research question.
In this section, we aim to provide an overview of how Big Data is actually used, based on our analysis of the literature, in order to answer RQ2. Initially, we attempted to use the WoS categories to structure the discussion around domains; however, because of heterogeneity at several levels, differences in technology use proved significant even within a single subject. As a result, we summarize the technologies according to the Perspective field recorded in the data extraction form. Table 3 summarizes our findings.
Table 3 . Perceptions of Big Data scope.
In this category, the studies do not exemplify the Big Data technologies in concrete circumstances, but rather treat them as concepts or collections for application domains. For example, [S67] discusses using text mining methods to explore the application of Big Data disciplines. The research focuses on the broad application of Big Data but does not include specific technologies. Based on the presented text mining method, Big Data technologies are identified automatically as keywords without further classification, and no concrete technologies and their use cases are explored. Similarly, [S104] studies the implementation of cloud computing, Big Data, and blockchain technology adoption in Enterprise Resource Planning (ERP), using only publication keywords generated from a literature review and covering a broad spectrum. Why certain technologies are being referred to as Big Data, and how they are specifically utilized, is missing, however. In other cases, the term “Big Data” and related expressions such as “Big Data Analytics” and “Big Data Applications” are used as abstract concepts throughout the paper without a clear definition of the technology's scope and with little mention of the technology itself. For instance, in an attempt to identify privacy issues in Big Data applications, [S181] uses phrases such as “Big Data analytics” to refer to the technology, without focusing on any specific technologies.
This category of studies focuses on newly established large data sources, such as databases, websites, and crowd-sourcing platforms, which are taken as the application target of Big Data. The term Big Data is not strictly focused on the technologies but rather on the volume or velocity of the data, with no specific techniques linked to those data sources. The application of Big Data in this case mainly means the use of large volumes of data. For example, [S132] explores the utilization of Big Data, smart and digital technologies, and artificial intelligence in the field of psoriatic arthritis studies. The authors cite a series of examples of how Big Data, combined with techniques such as artificial neural networks, natural language processing, k-nearest neighbors, and Bayesian models, can help intercept patients with psoriatic arthritis early. In this case, Big Data mainly refers to the repositories, registries, and databases that are generated from surveys, medical insurance data, vital registration data, etc. A similar research study in healthcare is [S47]. The authors list a number of databases worldwide with concrete gastrointestinal and liver disease sample sizes, and the techniques briefly mentioned for analysis are R, Python, statistics, and Natural Language Processing (NLP). Also in ecology, ecological datasets such as AmeriFlux are considered Big Data, as indicated by [S16].
This category refers to Big Data as models, algorithms, and statistical methods for managing large data sets, closely related to Machine Learning, Artificial Intelligence, Data Mining, the Internet of Things, etc. Big Data technology is equated with machine learning or artificial intelligence, sometimes referred to as deep learning. Specific machine learning models are mentioned, such as ANN, Neural Networks (NN), SVM, K-means, and decision trees, as well as improved models based on those methods, used to solve practical problems such as predicting a certain disease or optimizing transportation. In our research, the majority of the technologies and applications extracted fall into this category. Popular models and techniques summarized include decision trees, random forests, support vector machines, logistic regression models, naive Bayes models, k-nearest neighbors (k-NN), classic linear statistical models, Bayesian networks, and CNNs. There are also some studies (e.g., [S203, S207]) that summarize the applications of Big Data in their research field by discussing a series of concrete improvements to existing models.
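As a concrete anchor for one of the models in this list, the following is a minimal from-scratch sketch of k-nearest neighbors on hypothetical 2-D points; the surveyed studies themselves typically rely on library implementations rather than hand-written ones.

```python
import math
from collections import Counter

# Minimal from-scratch sketch of k-nearest neighbors (k-NN), one of the
# models listed above. Training data and points are hypothetical.

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # → A
print(knn_predict(train, (5.5, 5.5)))  # → B
```

The contrast with the previous categories is visible here: the "Big Data" label in this category attaches to the modeling method itself, not to any particular data source or infrastructure.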
The Big Data ecosystem category represents a comprehensive structure of the different levels of technology solutions that exist. Various applications such as Hadoop, MapReduce, and NoSQL are commonly mentioned in this category. The authors of those studies show a strong familiarity with how these technologies fit into their respective disciplines, demonstrating a deep understanding of the subject matter. In one study [S194], for example, the authors explained the different layers in an architecture for Big Data relating to traffic, and how applications such as Hadoop, MapReduce, HDFS, Apache Spark, Apache Hive, Hut, and Apache Kafka work together in the system. The article provides valuable insights into the role that these applications play in the overall architecture. Another study [S216] focused on the use of MapReduce and HDFS functionality in Big Data Architecture.
Some articles in our study did not fit into the categories outlined in our classification, as it was challenging to extract useful information from them. For instance, some studies (e.g., [S12, S63, S77, S172]) presented a comparison of Big Data frameworks sourced from other literature without exploring how these frameworks are utilized in their respective fields after introducing their background discipline. While these frameworks could be placed in the fourth category, we found it difficult to ascertain how Big Data is applied and understood in these disciplines. As a result, we did not include them in Table 3 . That some articles do not fit the classification does not imply that the classification is invalid, but rather that we cannot accurately estimate the authors' understanding of Big Data.
In this section, we summarize the perceived impact of Big Data adoption in each domain, as stated in the secondary studies, in order to answer RQ3. The impact of Big Data is categorized into benefits and shortcomings, which are introduced below. Table 4 provides a summary of the main points extracted from the texts.
Table 4 . Summary of benefits and shortcomings of Big Data.
The advantages of Big Data in various fields have been extensively researched and documented. Big Data tools have the ability to store, process, and analyze large volumes and varieties of data in real time, enabling researchers to extract valuable insights and improve their performance. In healthcare, the integration of different scientific fields such as informatics, clinical sciences, and analytics has been facilitated by the application of Big Data [S12]. Further reported benefits of Big Data in healthcare include, for example, cost reduction in medical treatments, elimination of illness-related risk factors, disease prediction, and improved preventative care and medication efficiency analysis [S12, S47, S132, S181]. Large sample sizes have enabled reliable capture of small variations in incidence or disease flare, and epidemiological/clinical Big Data has been instrumental in guiding public and global health policies [S47, S132]. In the field of transportation, Big Data technologies have been used to improve the effectiveness of traffic crash detection and prediction research, with the aim of preventing the occurrences of traffic crashes and secondary crashes [S178, S194]. In ecosystem service research, Big Data and machine learning tools have been used to address data availability, uncertainty, and socio-ecological gaps [S16]. In addition, Big Data analytics platforms have been useful in revealing previously overlooked correlations, market trends, and valuable information from large amounts of data [S66]. Machine learning has also been used in the smart grid area to sift through Big Data and extract useful information that can aid in demand and generation pattern recognition [S51]. Overall, the benefits of Big Data are diverse and far-reaching, and its application has the potential to revolutionize research and improve outcomes in various fields.
One of the primary challenges with Big Data is the assumption that having vast amounts of data guarantees accurate results. As [S74] notes, Big Data can give a false sense of security because having a lot of data does not necessarily mean that the results are valid. In the field of Psoriatic Arthritis, for example, there are significant variations in socio-demographic characteristics, co-morbidities, and major complication rates between individual (single- or multi-center) and database-based studies [S132]. This inconsistency raises concerns when critically appraising rheumatological and dermatological research, as well as risk adjustment modeling.
Another challenge with Big Data is that unnecessary utilization of Big Data can lead to a waste of resources, as it ties up computer resources [S74]. With the exponential growth of data, the storage and processing demands for Big Data have increased significantly. Unnecessary utilization of Big Data can exhaust computer resources and make them less available for other important tasks. This can result in increased costs for organizations, as they need to invest in more powerful computer systems to handle the increased demand. Big Data also poses physical challenges to current IT architecture, servers, and software. As pointed out by [S51], IoT devices generate a huge amount of data, which cannot be handled through conventional analysis techniques. While many data storage technologies have been proposed to store and process growing data, more efficient technologies are required for data acquisition, processing, pre-processing, storage, and management of Big Data [S94].
Security and privacy issues also arise with Big Data. In the healthcare industry, data security and patient privacy are significant concerns for authorities and patients alike [S66]. In research concerning COVID-19 [S207], for example, ethical issues surround privacy, the use of personal data to limit the spread of the pandemic, and the security measures needed to protect data from misuse.
Finally, there are challenges in managing data consistency, scalability, and integration. Challenges in using Big Data in the environmental sciences, for instance, include data cleansing, a lack of labeled datasets, mismatched data ingestion, high platform costs, and a lack of data governance and socio-technical infrastructure [S135]. The transportation industry faces difficulties in collecting data from diverse sources and addressing quality concerns. When Big Data is used in transportation, data collected from different sources must be analyzed to extract meaningful insights; however, this data often contains noise and uncertainty that must be addressed before use [S194].
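The kind of noise screening that must precede analysis of sensor-derived transportation data can be illustrated with a minimal sketch. This example is ours, not drawn from any surveyed study: it flags readings whose robust z-score (based on the median absolute deviation, which is less sensitive to the outliers themselves than the standard deviation) is extreme.

```python
import numpy as np

def flag_outliers(values: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Return a boolean mask marking readings with extreme robust z-scores."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))          # median absolute deviation
    robust_z = 0.6745 * (values - median) / mad       # 0.6745 scales MAD to sigma
    return np.abs(robust_z) > threshold

# Hypothetical speed readings (km/h) from one road sensor, with a glitch:
speeds = np.array([62.0, 58.5, 60.2, 61.1, 250.0, 59.8, 57.9])
print(flag_outliers(speeds))  # only the 250.0 reading is flagged
```

Real pipelines combine several such screens (range checks, cross-sensor consistency, imputation), but the principle is the same: quantify how far each reading deviates from a robust baseline before trusting it.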
In conclusion, several challenges need to be addressed to ensure that the use of Big Data is effective. These include the need for integrated and comprehensive systems, efficient data acquisition and management technologies, security and privacy safeguards, and solutions to issues of data consistency, scalability, and integration.
Going beyond the findings reported in the previous section, in the following we identify three specific topics emerging from our analysis that need further consideration: how Big Data is understood across scientific domains, how the adoption of Big Data occurs across disciplines, and how the added value that Big Data technologies bring is perceived.
The challenge in analyzing research that uses Big Data technologies stems from the lack of a clear definition of what a Big Data technology is. The authors of the secondary studies surveyed in this study elaborate on their understanding of Big Data and its applications from different perspectives. This results in a wide spectrum of technologies being labeled as Big Data, while significant differences between them remain, even when they are applied in the same field. Additionally, terms closely related to Big Data technology, such as AI, ML, Big Data analytics, Big Data platform, IoT, and Deep Learning, are often used interchangeably. While these terms may not require definition in each individual article surveyed, their meanings and scope may overlap in practice. A clear distinction and consistent, precise use of each term would therefore help avoid confusion. In other words, we argue that when reviewing the literature, the primary task is to understand what authors mean when they use concepts such as Big Data and Big Data technology, rather than simply extracting relevant technical applications and their context.
The results show that the concept of Big Data technology can be very broad. At the same time, these tools and technologies may be specific to a particular discipline and lack interdisciplinary significance. This ambiguity has led, in some cases, to indiscriminate use of Big Data terminology. Our findings also revealed that some articles describe Big Data so vaguely that no useful information can be extracted from them and they cannot be placed under any of the categories in Section 3.3. These articles focus on theoretical knowledge of Big Data and do not delve into its practical applications in different domains. As a result, they merely compare Big Data frameworks cited in other literature without exploring how these frameworks are used in their respective domains. This approach contributes to pervasive references to Big Data without providing a clear understanding of its practical applications.
In conclusion, this study highlights the need for a clear and comprehensive definition of Big Data technology and its practical applications in different domains. By doing so, we can avoid the misuse of Big Data terminology and gain a better understanding of the tools and technologies that are truly related to Big Data.
In this study, we conducted a rigorous systematic literature review to investigate the applications of Big Data and used the WoS subject classification to categorize the related research fields. However, we acknowledge that this approach may not capture the full range of disciplines, as indicated by our analysis of the discipline distribution. While disciplines such as astronomy and high-energy physics are typically associated with managing large volumes of data (Jacobs, 2009), we found that they were not represented in the papers we extracted. The lack of such data-intensive disciplines led us to question why they were not included, since our primary objective was to provide an overview of Big Data applications across the entire scientific community. We took great care to ensure that our SLR was not biased toward any specific domain by carefully selecting keywords and defining our inclusion and exclusion criteria. Additionally, we used comprehensive databases and an auditable and repeatable methodology. The absence of data-intensive subjects in our review suggests that some disciplines may be utilizing Big Data technologies without explicitly using the term "Big Data" in their research papers. One hypothesis is that computer science and related fields may consider the scale of data used as the norm, hence not explicitly mentioning "Big Data." To understand this phenomenon better, we suggest a further step to investigate the underlying assumptions, such as the choice of terminologies, used in these disciplines as part of future work.
The enthusiasm surrounding Big Data arises from the belief that vast amounts of information, combined with developments in technologies, algorithms, and machine learning techniques, can provide innovative insights that traditional research methods cannot achieve (Agrawal et al., 2011; Chen and Zhang, 2014; Khalid and Yousaf, 2021). However, these optimistic views are not undisputed. As Big Data knowledge infrastructures emerge, researchers are increasingly discussing the challenges and limitations they present.
Numerous papers extol the merits of Big Data, the main claim being that it offers unparalleled opportunities for scientific breakthroughs, leading to transformative research, as discussed in Section 3.4. The most significant advantage of the Big Data approach is its ability to address problems on larger and finer temporal and spatial scales, as well as to indicate which data are reliable or uncertain, thereby mapping ignorance (Hampton et al., 2013; Harford, 2014; Isaac et al., 2020). Big Data also highlights blind spots and uncertainties in research, revealing gaps in existing knowledge (Hortal et al., 2015). In this study, we have explored the benefits of Big Data tools. They can store, process, and analyze large volumes and varieties of data in real time, enabling researchers to extract valuable insights. Big Data analytics platforms have proven useful in revealing previously overlooked correlations, market trends, and valuable information in large datasets. Additionally, machine learning has aided in sifting through Big Data to extract useful information for demand and generation pattern recognition in the smart grid.
However, according to Ekbia et al. (2015), who draw on a broad range of literature, Big Data presents both conceptual and practical dilemmas. They argue that the use of Big Data produces an epistemological shift in science, in which predictive modeling and simulation gain more importance than causal explanations based on repeatable experiments testing hypotheses. Rosenheim and Gratton (2017) reject what they perceive as the suggestion of the most fervent proponents of Big Data that knowledge of correlation alone can replace knowledge of causality. They point out that understanding cause-and-effect relationships is critical in fields such as agricultural entomology, where research-oriented recommendations enable farmers to implement management actions that lead to desired outcomes. Harford (2014) points out that conducting a correlation-based analysis without a theoretical framework is inherently vulnerable: without understanding the underlying factors influencing a correlation, it becomes impossible to anticipate and account for factors that could compromise its validity.
Similarly, in our research we found critical questions regarding whether vast amounts of data guarantee accurate results. Brady (2019) argues that social scientists must grasp the meaning of concepts and predictions generated by convoluted algorithms, weigh the relative value of prediction versus causal inference, and cope with the ethical challenges their methods raise. Another notable challenge posed by Big Data is managing data consistency, scalability, and heterogeneity across fields. Additionally, large amounts of data pose challenges to IT architecture, servers, and software. This is corroborated by Fan et al. (2014), who point out that the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. Furthermore, we found that misuse of Big Data raises ethical issues, a concern already widely discussed in previous literature; see, for example, Cumbley and Church (2013) and Knoppers and Thorogood (2017).
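The spurious-correlation phenomenon described by Fan et al. (2014) can be made concrete with a short simulation (our illustrative sketch, with arbitrary sizes, not taken from their paper): when the number of features far exceeds the number of samples, some features correlate strongly with a response purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5000                        # 50 samples, 5,000 independent features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)             # response independent of every feature

# Sample correlation of each feature with y (standardize, then inner product)
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corrs = Xc.T @ yc / n

# Despite zero true association, the best-looking feature appears strongly
# correlated with y, illustrating why high dimensionality invites false leads.
print(f"max |correlation| with an unrelated response: {np.abs(corrs).max():.2f}")
```

Under these settings the maximum absolute correlation typically lands well above 0.4, even though every feature was generated independently of the response, which is precisely why correlation alone, without a theoretical framework, is a fragile basis for inference.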
In conclusion, while Big Data has generated much enthusiasm as a powerful tool for scientific breakthroughs, it is not without its challenges and limitations. Despite these challenges, Big Data holds promise as a tool for innovative research, and future work should continue to address these concerns while exploring new opportunities for knowledge discovery.
The lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there is a need to first retrospectively observe what Big Data actually means and refers to in concrete studies. This will help clarify ambiguities and enhance understanding of the role of Big Data in science. It will facilitate clearer communication among researchers by providing a framework for a common understanding of what Big Data entails in different contexts, which is crucial for interdisciplinary collaboration. To address this gap, we conducted a systematic literature review (SLR) of secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term.
Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. All these aspects combined represent the true meaning of Big Data. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.
XH: Conceptualization, Writing – original draft. OG: Conceptualization, Supervision, Writing – review & editing. VA: Conceptualization, Methodology, Supervision, Validation, Writing – review & editing.
The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
1. https://hadoop.apache.org/
2. https://cassandra.apache.org/
3. https://www.mongodb.com/
4. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
5. https://hbase.apache.org/
6. https://flume.apache.org/
7. https://sqoop.apache.org/
8. https://hive.apache.org/
9. https://pig.apache.org/
10. https://tez.apache.org/
11. https://flink.apache.org/
12. https://zookeeper.apache.org/
13. https://www.proxifier.com/
Agrawal, D., Bernstein, P. A., Bertino, E., Davidson, S. B., Dayal, U., Franklin, M. J., et al. (2011). Challenges and opportunities with big data. Cyber Center Technical Reports, 2011-1. Available at: https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1000&context=cctech
Akoka, J., Comyn-Wattiau, I., and Laoufi, N. (2017). Research on Big Data – a systematic mapping study. Comput. Stand. Interf. 54, 105–115. doi: 10.1016/j.csi.2017.01.004
Brady, H. E. (2019). The challenge of Big Data and data science. Ann. Rev. Polit. Sci. 22, 297–323. doi: 10.1146/annurev-polisci-090216-023229
Chebbi, I., Boulila, W., and Farah, I. R. (2015). "Big data: concepts, challenges and applications," in Computational Collective Intelligence, eds. M. Núñez, N. T. Nguyen, D. Camacho, and B. Trawiński (Cham: Springer International Publishing), 638–647. doi: 10.1007/978-3-319-24306-1_62
Chen, C., and Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf. Sci. 275, 314–347. doi: 10.1016/j.ins.2014.01.015
Cumbley, R., and Church, P. C. (2013). Is "Big Data" creepy? Comput. Law Secur. Rev. 29, 601–609. doi: 10.1016/j.clsr.2013.07.007
Ekbia, H. R., Mattioli, M., Kouper, I., Arave, G., Ghazinejad, A., Bowman, T. D., et al. (2015). Big data, bigger dilemmas: a critical review. J. Assoc. Inform. Sci. Technol. 66, 1523–1545. doi: 10.1002/asi.23294
Falagas, M. E., Pitsouni, E., Malietzis, G., and Pappas, G. (2007). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. FASEB J. 22, 338–342. doi: 10.1096/fj.07-9492LSF
Fan, J., Han, F., and Liu, H. (2014). Challenges of Big Data analysis. Natl. Sci. Rev. 1, 293–314. doi: 10.1093/nsr/nwt032
Hampton, S. E., Strasser, C., Tewksbury, J. J., Gram, W. K., Budden, A. E., Batcheller, A. L., et al. (2013). Big data and the future of ecology. Front. Ecol. Environ. 11, 156–162. doi: 10.1890/120103
Hansmann, T., and Niemeyer, P. (2014). "Big Data - characterizing an emerging research field using topic models," in 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) (New York, NY: ACM). doi: 10.1109/WI-IAT.2014.15
Harford, T. (2014, March 28). Big data: Are we making a big mistake? Financial Times. Available at: https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0 (accessed August 25, 2024).
Hortal, J., De Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., and Ladle, R. J. (2015). Seven shortfalls that beset large-scale knowledge of biodiversity. Annu. Rev. Ecol. Evol. Syst. 46, 523–549. doi: 10.1146/annurev-ecolsys-112414-054400
Isaac, N. J. B., Jarzyna, M. A., Keil, P., Dambly, L. I., Boersch-Supan, P. H., Browning, E., et al. (2020). Data integration for large-scale models of species distributions. Trends Ecol. Evol. 35, 56–67. doi: 10.1016/j.tree.2019.08.006
Jacobs, A. (2009). The pathologies of big data. Commun. ACM 52, 36–44. doi: 10.1145/1536616.1536632
Khalid, M., and Yousaf, M. H. (2021). A comparative analysis of big data frameworks: an adoption perspective. Appl. Sci. 11:11033. doi: 10.3390/app112211033
Khan, M. A.-U.-D., Uddin, M. J., and Gupta, N. (2014). "Seven V's of Big Data: understanding Big Data to extract value," in Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education (IEEE). doi: 10.1109/ASEEZone1.2014.6820689
Kitchenham, B., Brereton, O. P., Budgen, D., Seed, P. T., Bailey, J. E., Linkman, S., et al. (2009). Systematic literature reviews in software engineering – a systematic literature review. Inf. Softw. Technol. 51, 7–15. doi: 10.1016/j.infsof.2008.09.009
Kitchenham, B., Pretorius, R., Budgen, D., Brereton, O. P., Seed, P. T., Niazi, M., et al. (2010). Systematic literature reviews in software engineering – a tertiary study. Inf. Softw. Technol. 52, 792–805. doi: 10.1016/j.infsof.2010.03.006
Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data Soc. 1:205395171452848. doi: 10.1177/2053951714528481
Knoppers, B. M., and Thorogood, A. (2017). Ethics and Big Data in health. Curr. Opin. Syst. Biol. 4, 53–57. doi: 10.1016/j.coisb.2017.07.001
Mengist, W., Soromessa, T., and Legese, G. (2020). Method for conducting systematic literature review and meta-analysis for environmental science research. MethodsX 7:100777. doi: 10.1016/j.mex.2019.100777
Patgiri, R. (2019). A taxonomy on Big Data: survey. arXiv [Preprint]. doi: 10.48550/arxiv.1808.08474
Petroc, T. (2023). Amount of data created, consumed, and stored 2010-2020, with forecasts to 2025. Statista. Available at: https://www.statista.com/statistics/871513/worldwide-data-created/
Petticrew, M., and Roberts, H. (2008). Systematic Reviews in the Social Sciences. Hoboken, NJ: John Wiley & Sons.
Rosenheim, J. A., and Gratton, C. (2017). Ecoinformatics (Big Data) for agricultural entomology: pitfalls, progress, and promise. Annu. Rev. Entomol. 62, 399–417. doi: 10.1146/annurev-ento-031616-035444
Scopus Content Coverage Guide (2023). In Scopus. Available at: https://assets.ctfassets.net/o78em1y1w4i4/EX1iy8VxBeQKf8aN2XzOp/c36f79db25484cb38a5972ad9a5472ec/Scopus_ContentCoverage_Guide_WEB.pdf
Succi, S., and Coveney, P. V. (2019). Big data: the end of the scientific method? Philos. Trans. A Math. Phys. Eng. Sci. 377:20180145. doi: 10.1098/rsta.2018.0145
Van Altena, A. J., Moerland, P. D., Zwinderman, A. H., and Olabarriaga, S. D. (2016). Understanding big data themes from scientific biomedical literature through topic modeling. J. Big Data 3. doi: 10.1186/s40537-016-0057-0
Wohlin, C. (2014). "Guidelines for snowballing in systematic literature studies and a replication in software engineering," in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering. doi: 10.1145/2601248.2601268
Table A1. Secondary studies included in our review.
Keywords: Big Data definition, systematic literature review, scientific research, Big Data review, Big Data epistemology
Citation: Han X, Gstrein OJ and Andrikopoulos V (2024) When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data. Front. Big Data 7:1441869. doi: 10.3389/fdata.2024.1441869
Received: 31 May 2024; Accepted: 12 August 2024; Published: 10 September 2024.
Copyright © 2024 Han, Gstrein and Andrikopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xiaoyao Han, x.han@rug.nl
1 Department of Governance and Innovation, Campus Fryslan, University of Groningen, Leeuwarden, Netherlands; 2 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Groningen, Netherlands