Open-source tools for behavioral video analysis: Setup, methods, and best practices

  Kevin Luxem
  Jennifer J Sun
  Sean P Bradley
  Keerthi Krishnan
  Jan Zimmermann
  Talmo D Pereira
  Mark Laubach

  • Cellular Neuroscience, Leibniz Institute for Neurobiology, Germany
  • Department of Computing and Mathematical Sciences, California Institute of Technology, United States
  • Rodent Behavioral Core, National Institute of Mental Health, National Institutes of Health, United States
  • Department of Biochemistry and Cellular & Molecular Biology, University of Tennessee, United States
  • Department of Biological Sciences, Carnegie Mellon University, United States
  • Department of Neuroscience, University of Minnesota, United States
  • The Salk Institute of Biological Studies, United States
  • Department of Neuroscience, American University, United States
Quantitative tools for video analysis


Recently developed methods for video analysis, especially models for pose estimation and behavior classification, are transforming behavioral quantification to be more precise, scalable, and reproducible in fields such as neuroscience and ethology. These tools overcome long-standing limitations of manual scoring of video frames and traditional ‘center of mass’ tracking algorithms to enable video analysis at scale. The expansion of open-source tools for video acquisition and analysis has led to new experimental approaches to understand behavior. Here, we review currently available open-source tools for video analysis and discuss how to set up these methods for labs new to video recording. We also discuss best practices for developing and using video analysis methods, including community-wide standards and critical needs for the open sharing of datasets and code, more widespread comparisons of video analysis methods, and better documentation for these methods especially for new users. We encourage broader adoption and continued development of these tools, which have tremendous potential for accelerating scientific progress in understanding the brain and behavior.

Traditional approaches to analyzing video data have involved researchers watching video playback and noting the times and locations of specific events of interest. These analyses are very time-consuming, require expert knowledge of the target species and experimental design, and are prone to user bias ( Anderson and Perona, 2014 ). Video recordings are often made for many different animals and behavioral test sessions, but only reviewed for a subset of experiments. Complete sets of videos are rarely made accessible in published studies and the analysis methods are often vaguely described. Scoring criteria vary across researchers and labs, and even over time for a single researcher. Collectively, these issues present major challenges for research reproducibility. The difficulty and cost of manual video analysis have also led to the dominance of easy-to-use measures (lever pressing, beam breaks) in the neuroscience literature, which has limited our understanding of brain-behavior relationships ( Krakauer et al., 2017 ).

For example, ‘reward seeking’ has been a popular topic in recent years and is typically measured using beam breaks between response and reward ports located inside an operant arena (e.g., Cowen et al., 2012 ; Feierstein et al., 2006 ; Lardeux et al., 2009 ; van Duuren et al., 2009 ). By relying only on the discrete times when animals make a choice and receive a reward, it is not possible to describe how the animal moves during a choice or how it collects a reward. Animals may not move in the same way to a reward port when they expect a larger or smaller reward (e.g., Davidson et al., 1980 ). This could lead to, for example, a neural recording study labeling a cell as ‘reward encoding’ when it actually reflects differences in movement.

Commercial products (e.g., EthoVision by Noldus, Any-Maze by Stoelting) and open-source projects (e.g., JAABA: Kabra et al., 2013 ; SCORHE: Salem et al., 2015 ; OptiMouse: Ben-Shaul, 2017 ; ezTrack: Pennington et al., 2019 ) are available for semi-automated annotation and tracking of behaviors. These methods track animals based on differences between the animals and the background color or luminance. This can be challenging to do in naturalistic settings or for species or strains that do not have a uniform color (e.g., Long-Evans rats). These ‘center of mass’ tracking methods provide estimates of the overall position of an animal in its environment and can be used to measure the direction and velocity of its movements. More sophisticated versions of these products may also detect the head and tail of common laboratory species such as rodents or zebrafish and draw inferences from the shape and location of the animal to classify a small subset of an animal’s behavioral repertoire. However, these simpler tracking methods cannot account for movements of discrete sets of body parts (e.g., head scanning in rodents, which is associated with a classic measure of reward-guided decisions called ‘vicarious trial-and-error’ behavior: see Redish, 2016 , for review).

More advanced analyses could be used to quantify movements across many pixels simultaneously in video recordings. For example, Stringer et al., 2019 , used dimensionality reduction methods to study the spontaneous coding of visual- and movement-related information in the mouse visual cortex in relation to facial movements. Musall et al., 2019 , used video recordings of motion data from several parts of the face of mice as they performed a decision-making task and related the measures from the video recordings to cortical imaging data. While these analyses would go beyond what is possible to achieve with a simple tracking method, the multivariate methods developed by Stringer and Musall are not themselves capable of categorizing movements, measuring transitions between different types of movements, or quantifying the dynamics of movement sequences. For these measures, a different approach is needed.

Methods for capturing the pose of an animal (the location and configuration of its body) have emerged in recent years (e.g., DeepLabCut: Mathis et al., 2018a ; SLEAP: Pereira et al., 2022 ). These methods can provide a description of an animal’s movement and posture during a behavioral task. They can be used to understand the dynamics of naturalistic movements and behaviors, as illustrated in Figure 1 . Pose estimation methods provide information on the position and orientation of multiple parts of an animal, with recent methods being able to measure pose information for groups of animals ( Chen et al., 2020 ; Lauer et al., 2021 ; Pereira et al., 2022 ; Walter and Couzin, 2021 ). Some recent methods now even allow for pose estimation to be run in real experimental time ( Kane et al., 2020 ; Lopes et al., 2015 ; Pereira et al., 2022 ; Schweihoff et al., 2021 ).


Setup for video recording.

( A ) Cameras are mounted above and to the side of a behavioral arena. The cameras record sequences of images of an animal performing a behavioral task. The recordings are stored on a computer and analyzed with methods for pose estimation and behavior classification. ( B ) The animal’s pose trajectory captures the relevant kinematics of the animal’s behavior and is used as input to behavior quantification algorithms. Quantification can be done using either unsupervised methods (learning to recognize behavioral states) or supervised methods (learning to classify behaviors based on human-annotated labels). In this example, transitions among three example behaviors (rearing, walking, and grooming) are depicted on the lower left and classification of video frames into the three main behaviors is depicted on the lower right.

Methods for pose estimation emerged in computer vision research in the 1970s ( Marr et al., 1978 ; Nevatia and Binford, 1973 ). The methods became widely available for the analysis of pose in human behavior following improvements in computer vision ( Behnke, 2003 ), deep learning ( Szegedy et al., 2013 ), and computing using graphical processing units (GPUs) ( Oh and Jung, 2004 ). However, these methods were often not robust or required large amounts of training data, which were at the time not easily available for animal studies. As a result, a number of open-source tools emerged for pose estimation in animals (e.g., DeepLabCut: Mathis et al., 2018a ; LEAP: Pereira et al., 2019 ; DeepPoseKit: Graving et al., 2019a ). These tools are especially notable in that they were developed by researchers to address specific scientific questions and are not available from commercial sources. They are an outstanding example of the ‘open-source creative process’ ( White et al., 2019 ).

One of these methods, DeepLabCut, has been shown to outperform the commercial software package EthoVision XT14 and a hardware-based measurement system from TSE Systems, based on IR beam breaks ( Sturman et al., 2020 ). When tested across a set of common behavioral assays used in neuroscience (open field test, elevated plus maze, forced swim test), data from the pose estimation method were evaluated using a neural network classifier. This approach performed as well as classifications by human experts, required data from fewer animals to detect differences due to experimental treatments, and in some cases (head dips in an elevated plus maze) detected effects of treatment (a drug) that were not detected by EthoVision.

In the case of reward seeking behavior, human annotation of videos could resolve the animal’s position and when and for how long specific behaviors occurred. These measurements could be made by annotating frames in the video recordings, using tools such as the VIA annotator ( Dutta and Zisserman, 2019 ), and commercial (e.g., EthoVision) or open-source (e.g., ezTrack) methods for whole-animal tracking. These measurements would not be able to account for coordinated movements of multiple body parts or for the dynamics of transitions between different behaviors that together comprise reward seeking behavior. Such measures are easily obtained using methods for pose estimation. These methods learn to track multiple body parts (for a rodent, the tip of the snout, the ears, the base of the tail) and the positions of these body parts can be compared for different kinds of trials (small or large reward) using standard statistical models or machine learning methods. These analyses, together, allow for movements to be categorized (e.g., direct and indirect approach toward a reward port) and for transitions between different types of movements to be quantified (e.g., from turning to walking). It would even be possible to detect unique states associated with deliberation (e.g., head scanning between available choice options). All these measures could then be compared as a function of an experimental manipulation (drug or lesion) or used to assist in the analysis of simultaneously collected electrophysiological or imaging data. None of these measures are possible using conventional methods for annotating video frames or tracking the overall position of the animal in a behavioral arena.

Pose estimation methods have been crucial for several recent publications on topics as diverse as tracking fluid consumption to understand the neural coding of reward prediction errors ( Ottenheimer et al., 2020 ), accounting for the effects of wind on the behavior of Drosophila ( Okubo et al., 2020 ), understanding the contributions of tactile afferents and nociceptors to the perception of touch in freely moving mice ( Schorscher-Petcu et al., 2021 ), understanding interactions between tactile processing by the rodent whisker system and its ability to guide locomotion ( Warren et al., 2021 ), and measuring the relationship between eye movements and neural activity in freely behaving rodents ( Keshavarzi et al., 2022 ). While a number of studies are emerging that take advantage of methods for pose estimation, the methods have not yet been widely adopted across the research community, perhaps in part due to the technical nature of collecting high-quality video recordings as well as setting up and using methods for pose estimation. These methods depend on access to computing systems with GPUs and the ability to set up and use the required computer software, which is usually available as computer code written in Python or MATLAB. A researcher who wants to get started with these approaches will therefore face a number of questions about how to set up video methods in a laboratory setting. New users may also need to learn some of the jargon associated with video analysis methods, and some of these terms are defined in Table 1 . The primary goals of this document are twofold: to provide information for researchers interested in setting up methods for video analysis in a research lab and to propose best practices for the use and development of video analysis methods.

Frequently used terms for video analysis.

pose: The configuration (position and/or orientation) of an animal, object, or body parts in an image or video recording
keypoints/landmarks: Distinct identifiable morphological features (e.g., the tip of the snout or the base of the tail in a rodent) that can be localized in 2D or 3D from images, typically via pose estimation
part grouping: A process for assigning keypoints to individual animals
multi-object tracking: In multi-animal pose tracking, the task of determining which detected poses belong to which individual animal across time
re-identification: A process for identifying all images containing the same individual animal based primarily on their distinct appearance
kinematics: Information about the angles and velocities of a set of keypoints
supervised learning: Machine learning methods that use experimenter-provided labels (e.g., ground truth poses, or ‘running’ vs ‘grooming’) to train a predictive model
unsupervised learning: Machine learning methods that only use unlabeled data to find patterns based on its intrinsic structure (e.g., clustering behavioral motifs based on the statistics of their dynamics)
transfer learning: Machine learning methods that use models trained on one dataset to analyze other datasets (e.g., models of grooming in mice applied to rats)
self-supervised learning: Machine learning methods that use only unlabeled data for training by learning to solve artificially constructed tasks (e.g., comparing two variants of the same image with noise added against other images; predicting the future; or filling in blanks)
embedding: A representation of high-dimensional data in a lower-dimensional space
lifting: A process through which 2D pose data are converted to 3D representations
behavioral segmentation: A process for detecting occurrences of behaviors (i.e., starting and ending frames) from video or pose sequences

In a typical setup for video recording, cameras are placed above, and in some cases to the side or below, the behavioral arena ( Figure 1 ). The cameras send data to a computer and can be integrated with inputs from behavioral devices through custom-written programs built on popular libraries such as OpenCV ( Bradski, 2000 ), open-source data collection systems such as Bonsai ( Lopes et al., 2015 ), or software included with many common commercial video capture boards (e.g., loopbio Motif). Video files can then be analyzed using a variety of open-source tools.

A common approach is to use methods for pose estimation, which track the position and orientation of the animal. This is done by denoting a set of ‘keypoints’ or ‘landmarks’ (body parts) in terms of pixel locations on frames in the video recordings. Packages for pose estimation provide graphical user interfaces for defining keypoints and the keypoints are then analyzed with video analysis methods. In the example shown in Figure 1 , keypoints are the colored dots on the tip of the snout, the ears, forelimbs and paws, midpoint of back, hindlimbs and paws, and base, middle, and end of tail. Once body parts have been defined, computer algorithms are used to track the skeleton formed by the points and to track the position and orientation of the skeleton over frames in the video file. Many open-source tools use machine learning methods for these intensive computational processes, which require GPUs to run in reasonable time. To run these analyses, many labs use dedicated computers, institutional computing clusters, or cloud computing services such as Google Colab. The outputs of pose estimation can be analyzed to account for movement variability associated with different behaviors, to relate position and orientation to simultaneously collected brain activity (electrophysiology, optical imaging), or to describe and predict states and dynamical transitions of behaviors.
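
As a rough illustration of how pose estimation outputs can be turned into movement measures, the sketch below computes frame-to-frame speed and a body-axis heading from two keypoint trajectories. The coordinates, the keypoint names, and the helper functions `speed` and `heading_deg` are hypothetical and not taken from any particular package; real tools export keypoints in their own file formats.

```python
import math

def speed(track, fps):
    """Frame-to-frame speed (pixels per second) of one keypoint.

    track: list of (x, y) pixel coordinates, one entry per video frame.
    """
    return [
        math.hypot(x1 - x0, y1 - y0) * fps
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]

def heading_deg(tail_base, snout):
    """Angle of the tail-to-snout axis in degrees, relative to the +x image axis."""
    return math.degrees(math.atan2(snout[1] - tail_base[1], snout[0] - tail_base[0]))

# Hypothetical pose output for three frames (pixel coordinates).
snout = [(10.0, 10.0), (13.0, 14.0), (16.0, 18.0)]
tail_base = [(0.0, 10.0), (3.0, 14.0), (6.0, 18.0)]

snout_speed = speed(snout, fps=30)
headings = [heading_deg(t, s) for t, s in zip(tail_base, snout)]
print(snout_speed, headings)
```

Converting pixel units to physical units would additionally require a spatial calibration of the camera.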

Data acquisition

The first step in setting up for video recording is to purchase a camera with an appropriate lens. Researchers should determine if they need precisely timed video frames, for example, for integration with electrical or optical recordings. Inexpensive USB webcams with frame rates of at least 30 fps are suitable for many neuroscience experiments. However, it is important to make sure that each camera is connected to a dedicated USB channel in the computer used for video recording. Webcams can be a challenge to integrate with systems used for behavioral control and electrophysiology or imaging because they lack a means of precisely synchronizing video frames to other related data. As such, the timing of specific behaviors must be based on the animal’s location or an observable event in the video field (e.g., onset of an LED indicating reward availability).
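
One way to recover event times from a webcam recording, as described above, is to threshold the brightness of the image region containing the LED. The sketch below assumes the mean brightness of that region has already been extracted for each frame; the values, the threshold, and the function name `led_onsets` are made up for illustration.

```python
def led_onsets(brightness, threshold):
    """Return frame indices where brightness crosses the threshold from below."""
    onsets = []
    for i in range(1, len(brightness)):
        if brightness[i - 1] < threshold <= brightness[i]:
            onsets.append(i)
    return onsets

# Hypothetical mean brightness (0-255) of the LED region on ten frames.
roi_brightness = [12, 11, 13, 200, 205, 198, 12, 10, 210, 207]
fps = 30  # assumed constant frame rate

onset_frames = led_onsets(roi_brightness, threshold=100)
onset_times = [frame / fps for frame in onset_frames]  # seconds from video start
print(onset_frames, onset_times)
```

The temporal resolution of this approach is limited to one frame period (about 33 ms at 30 fps), and it assumes that no frames were dropped during recording.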

For more precise recordings, specialized cameras used in computer vision applications are needed (e.g., FLIR, Basler). Combined data and power over Ethernet (GigE PoE) is commonly used because it allows long cable runs and delivers DC power over the same cable. Alternatively, USB3 cameras can be used, but have a maximum data cable length of 5 m, although active extender cables are available. Most machine vision cameras (GigE PoE or USB3) have general-purpose input/output (GPIO) capabilities that allow for time synchronization of multiple cameras with other laboratory equipment (e.g., electrical or optical recording system). A single camera running at high resolution or frame rate can quickly saturate a standard 1 Gbit Ethernet link. It is therefore important to consider the computer used to collect video data, ensuring that it has a fast processor with multiple cores and perhaps also a GPU, which can aid in handling video compression during data collection and can be used for off-line analysis using pose estimation methods.
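
To see how quickly a single camera can approach the capacity of a 1 Gbit link, the uncompressed data rate can be estimated from the frame size and frame rate. The camera settings below (1280×1024 pixels, 8-bit monochrome) are hypothetical examples, and the calculation ignores protocol overhead, so the usable bandwidth is somewhat lower than the nominal figure.

```python
def raw_rate_mbit_per_s(width, height, bytes_per_pixel, fps):
    """Uncompressed video data rate in megabits per second."""
    return width * height * bytes_per_pixel * 8 * fps / 1e6

NOMINAL_GIGE_MBIT = 1000  # nominal capacity of a 1 Gbit Ethernet link

# Hypothetical 8-bit monochrome camera at two frame rates.
rate_30fps = raw_rate_mbit_per_s(1280, 1024, 1, 30)
rate_100fps = raw_rate_mbit_per_s(1280, 1024, 1, 100)

for label, rate in (("30 fps", rate_30fps), ("100 fps", rate_100fps)):
    status = "exceeds" if rate > NOMINAL_GIGE_MBIT else "fits within"
    print(f"{label}: {rate:.1f} Mbit/s, {status} a 1 Gbit link")
```

Under these assumptions the 30 fps stream uses roughly a third of the link, while the 100 fps stream exceeds it.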

After choosing a camera, researchers must determine how to save and archive data from their recordings. By default, recorded videos from cameras may be in several formats, such as MP4 (MPEG-4 AVC/H.264 encoding), MOV (MPEG-4 encoding), and AVI (DivX codec, higher quality but larger file size). These formats are generally universal and can be read by a variety of tools. Generally, video data files tend to be large (1 hr of RGB video at 30 Hz with resolution 1000×1000 can be 2–20 GB depending on compression) so data storage solutions for large-scale experiments are crucial. File compression should be evaluated before a system is deployed, as the computer used for video recordings must have sufficient memory (RAM) to remain stable over long recording sessions. In addition to considerations of file formats and codecs, it is important to plan for data storage. Many labs maintain internal lab servers for their video data. Cloud storage is another option to enable sharing. For sharing data publicly, there are a variety of hosting services available, such as the Open Science Framework, Figshare, and Dryad (see section on ‘Best practices for experimenters and developers’ below for further comments on data archives and sharing).
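
The file sizes quoted above can be put in context by comparing them with the uncompressed size of the same recording. The sketch below assumes 8 bits per color channel (3 bytes per RGB pixel), which is an assumption rather than something stated in the text, and derives the compression ratios implied by the 2–20 GB range.

```python
def raw_video_gb(width, height, channels, fps, seconds):
    """Uncompressed video size in gigabytes, assuming one byte per channel."""
    return width * height * channels * fps * seconds / 1e9

# One hour of 1000x1000 RGB video at 30 Hz, as in the example above.
raw_gb = raw_video_gb(1000, 1000, 3, 30, 3600)

# Compression ratios implied by compressed sizes of 2 GB and 20 GB.
implied_ratios = [raw_gb / compressed_gb for compressed_gb in (2, 20)]
print(raw_gb, implied_ratios)
```

Under this assumption, one hour of raw video is about 324 GB, so the quoted file sizes correspond to compression ratios of roughly 16:1 to 162:1.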

Once cameras and lenses are acquired and data formats and storage resolved, the next question is where to position the cameras relative to the experimental preparation. Occlusions due to obstacles, cables, or conspecifics will have effects on the usability of some video analysis methods. A bottom-up view (from below the animal) works best in an open-field, while a top-down approach can be useful for studies in operant chambers and home cages. Bottom-up views capture behavioral information from the position of the animal’s feet ( Hsu and Yttri, 2021a ; Luxem et al., 2022a ). When multiple cameras are used, to reduce the effect of occlusion for downstream video analysis, cameras should be positioned such that at least one camera can visualize each keypoint at all times.
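
Whether a given camera arrangement satisfies the coverage criterion above can be checked on a pilot recording. The sketch below assumes that, for each camera and frame, the set of visible keypoints is already known (for instance from manual inspection or from thresholded pose estimation confidences); the data layout, camera names, and the function `coverage_gaps` are hypothetical.

```python
def coverage_gaps(visible_by_camera):
    """List (frame, keypoint) pairs that no camera could see.

    visible_by_camera: {camera_name: [set of visible keypoints, one set per frame]}
    """
    cameras = list(visible_by_camera.values())
    n_frames = len(cameras[0])
    all_keypoints = set().union(*(kps for cam in cameras for kps in cam))
    gaps = []
    for frame in range(n_frames):
        seen = set().union(*(cam[frame] for cam in cameras))
        for keypoint in sorted(all_keypoints - seen):
            gaps.append((frame, keypoint))
    return gaps

# Hypothetical visibility for two cameras over three frames.
visibility = {
    "top": [{"snout", "tail_base"}, {"snout"}, {"snout", "tail_base"}],
    "side": [{"snout"}, {"snout"}, {"tail_base"}],
}
gaps = coverage_gaps(visibility)
print(gaps)
```

Frames that appear in the output are candidates for repositioning a camera or adding another view.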

It is also necessary to think about lighting for the experimental setup. If all or some of the study is to be performed while house lights are off, then infrared (IR) lighting and IR-compatible cameras may be needed. One should consider if diffuse lighting will work or if modifications to eliminate reflective surfaces (especially metals) are necessary. These can lead to artifacts in video recordings from devices like IR LEDs and other sources of illumination and complicate the training and interpretations of measures obtained with analyses such as pose estimation. For example, it is possible to reduce reflections from surfaces and objects that are in direct line with IR LEDs. For top-down recordings, cage floors can be made from colored materials to provide contrast such as Delrin or pre-anodized aluminum (an option for long-term use) and the metal pans typically used below operant chambers to collect animal waste can be painted with flat black paint. Addressing these issues before beginning an experiment can greatly improve the quality of video recordings.

Finally, for some applications, it is necessary to invest time in calibrating the video system. Calibration is often overlooked and not easily accessible in many current software packages. The intrinsic parameters of a camera include the focal length of the lens and if the lens has obvious distortions (i.e., fisheye lens). Extrinsic parameters also affect the quality of video recordings and are largely due to the camera’s position in the scene. It is fairly easy to calibrate a single camera using a checkerboard or ArUco board. To do so, one sweeps a precalibrated board manually around the field of view of a camera and uses the extracted images to estimate the camera’s intrinsic parameters (focal length and distortions). This approach can scale easily to cameras with overlapping fields of view but becomes difficult if larger camera networks do not share extrinsic parameters or need to be repeatedly recalibrated (e.g., if one of the cameras is moved between experiments). If the environment has enough structure in it, structure from motion can estimate the intrinsic and extrinsic parameters by treating the multiple cameras as an exhaustive sweep of the environment. This process can be fully scripted and automatically performed on a daily basis leading to substantially increased reliability and precision in multi-camera system performance. Several references on these topics include Bala et al., 2020 ; Rameau et al., 2022 ; Schönberger et al., 2016 ; Schonberger and Frahm, 2016 .
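
The distinction between intrinsic and extrinsic parameters can be made concrete with a pinhole camera model. The sketch below is a simplified illustration rather than a calibration procedure: it projects a 3D point into pixel coordinates using extrinsics (rotation and translation) followed by intrinsics (focal lengths, principal point, and a single radial distortion coefficient). All parameter values are hypothetical.

```python
def project(point_world, rotation, translation, fx, fy, cx, cy, k1=0.0):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation (3x3 nested lists) and translation (length 3) are extrinsic
    parameters; fx, fy, cx, cy, and the radial term k1 are intrinsic.
    """
    # Extrinsics: move the point from world coordinates into the camera frame.
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    # Perspective division gives normalized image coordinates.
    xn, yn = xc / zc, yc / zc
    # One-term radial distortion (k1 = 0 means an ideal lens).
    distortion = 1.0 + k1 * (xn * xn + yn * yn)
    # Intrinsics: scale by focal length and shift to the principal point.
    return fx * xn * distortion + cx, fy * yn * distortion + cy

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera_offset = [0.0, 0.0, 2.0]  # hypothetical: point plane is 2 units from the camera

u, v = project([0.1, -0.2, 0.0], identity, camera_offset,
               fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(u, v)
```

Calibration amounts to estimating these parameters from images of a known pattern, which is what checkerboard and ArUco procedures do.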

Hardware and software for data analysis

Once video recordings are acquired, the researcher may proceed to setting up their computing environment for pose estimation and tracking. Modern markerless motion capture software tools like DeepLabCut ( Mathis et al., 2018a ) and SLEAP ( Pereira et al., 2022 ) rely on deep learning to automate this process. The most compute-intensive step of these methods involves a ‘training’ stage in which a deep neural network is optimized to learn to predict poses from user-provided examples. Training is typically accelerated with a GPU, a hardware component traditionally used for computer graphics, but which has been co-opted for deep learning due to its massively parallel processing architecture. Having a GPU can speed up training by 10- to 100-fold, resulting in model training times of as little as minutes with lightweight network architectures ( Pereira et al., 2022 ). For most researchers, the most practical option is to purchase a consumer-grade workstation GPU which can be installed in conventional desktop computers to afford local access to this hardware from the pose tracking software. In this case, any recent NVIDIA GPU with greater than 6 GB of memory will suffice for practical use of pose estimation tools. This type of computer hardware has, in recent years, been significantly impacted by supply chain shortages, driving prices up to >$1000, which makes this a less accessible option for many labs just starting off in video analysis. For this situation, most tools provide the means for using Google Colab, which provides limited access to GPUs on the cloud. This is an excellent way to set up analysis workflows while getting familiar with deep learning-based video analysis but may not be practical for sustained usage (e.g., processing hundreds of videos). Another common scenario is that institutions with a high-performance computing center will typically have GPUs available as a local shared resource. Other than GPUs, most other computer requirements are modest (modern CPU, 8–16 GB of RAM, minimal disk space).

Researchers will need to set up their software environment to be able to install and use pose tracking tools. Most commonly available open-source methods for pose estimation were developed using the Python language. It is highly recommended to make use of ‘environment managers’ such as Anaconda (‘conda’), which enable the creation of isolated installations of Python for each video analysis method of interest. This allows each method to be installed with all of its dependencies without affecting other Python libraries on the system. Alternatives include Docker, which runs software in isolated containers. Such isolation facilitates the installation of GPU-related dependencies, which may be technically challenging for novice users.

2D pose estimation and tracking

Pose tracking methods ( Figure 2 , part 1) enable researchers to extract positional information about the body parts of animals from video recordings. Tools for pose tracking (see Table 2 ) decompose the problem of pose tracking into the sub-tasks outlined below. A note on nomenclature: pose estimation is the term typically reserved to mean single-animal keypoint localization within a single image; multi-animal pose estimation refers to keypoint localization and part grouping of multiple animals within a single image; and multi-animal pose tracking refers to combined keypoint localization, part grouping, and identification across video frames.


Pipeline for video analysis.

Video recordings are analyzed either with keypoints from 2D or 3D pose estimation or directly by computing video features. These video or trajectory features are then used by downstream algorithms to relate the keypoints to behavioral constructs, such as predicting human-defined behavior labels (supervised learning) or discovering behavior motifs (unsupervised learning). Each of the analysis steps outlined in the figure is described in more detail below.

Methods for 2D pose estimation.

DeepLabCut uses a popular architecture for deep learning, called ResNet. DeepLabCut models are pre-trained on a massive dataset for object recognition called ImageNet. Through a process called transfer learning, the DeepLabCut model learns the position of keypoints using as few as 200 labeled frames. This makes the model very robust and flexible in terms of what body parts (or objects) users want to label, as the model provides a strong backbone of image filters within its ResNet architecture. To detect the keypoint position, DeepLabCut replaces the classification layer of the ResNet with deconvolutional layers to produce spatial probability densities from which the model learns to assign high probabilities to regions with the user-labeled keypoints. DeepLabCut can provide very accurate pose estimations but can require extensive time for training.
SLEAP is based on an earlier method called LEAP, which performed pose estimation on single animals. SLEAP uses simpler CNN architectures with repeated convolutional and pooling layers. This makes the model more lightweight compared to DeepLabCut’s ResNet architecture and, hence, the model is faster to train with comparable accuracy. Similar to DeepLabCut, the model uses a stack of upsampling or deconvolutional layers to estimate confidence maps during training and inference. Unlike DeepLabCut, SLEAP does not solely rely on transfer learning from general-purpose network models (though this functionality is also provided for flexible experimentation). Instead, it uses customizable neural network architectures that can be tuned to the needs of the dataset. SLEAP can produce highly accurate pose estimates starting at about 100 labeled frames for training and is quick to train on a GPU (<1 hour).
DeepPoseKit uses a type of CNN architecture, called stacked DenseNet, an efficient variant of the stacked hourglass, and uses multiple down- and upsampling steps with densely connected hourglass networks to produce confidence maps on the input image. The model uses only about 5% of the number of parameters used by DeepLabCut, providing speed improvements over DeepLabCut and LEAP.
B-KinD discovers keypoints without human supervision. B-KinD has the potential to transform how pose estimation is done, as keypoint annotation is one of the most time-consuming aspects of pose estimation analysis. However, there are challenges for the approach when occlusions occur in the video recordings, e.g., recordings of animals tethered to brain recording systems.

Keypoint localization involves recovering the spatial coordinates of each distinct keypoint. This is normally done by estimating body part confidence maps, that is, image-based representations that encode the probability of the body part being located at each pixel. Recovering the coordinates of each body part is reduced to the task of finding the pixel with highest probability. A key consideration of this task is that the larger the image, the larger the confidence maps. Computer memory requirements can potentially exceed the capacity of most consumer-grade GPUs. This can be compensated by reducing the resolution of the confidence maps, though this comes at the cost of potentially reduced accuracy. Subpixel refinement methods are typically employed to compensate for this, but ultimately confidence map resolution is one of the most impactful choices for achieving reliable keypoint localization.
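
The steps described above (confidence map, peak finding, subpixel refinement) can be sketched on a synthetic example. Here the confidence map is a hand-made Gaussian bump rather than the output of a trained network, and the refinement is a simple confidence-weighted centroid around the peak; actual tools may use different refinement schemes. The function names and parameter values are illustrative assumptions.

```python
import math

def gaussian_map(height, width, center, sigma):
    """Synthetic confidence map: a Gaussian bump centered on the true keypoint."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def argmax_2d(conf):
    """Integer (x, y) location of the highest-confidence pixel."""
    coords = ((x, y) for y in range(len(conf)) for x in range(len(conf[0])))
    return max(coords, key=lambda xy: conf[xy[1]][xy[0]])

def refine(conf, x, y, radius=2):
    """Subpixel estimate: confidence-weighted centroid in a window around the peak."""
    total = sum_x = sum_y = 0.0
    for yy in range(max(0, y - radius), min(len(conf), y + radius + 1)):
        for xx in range(max(0, x - radius), min(len(conf[0]), x + radius + 1)):
            weight = conf[yy][xx]
            total += weight
            sum_x += weight * xx
            sum_y += weight * yy
    return sum_x / total, sum_y / total

true_x, true_y = 6.3, 4.6  # hypothetical true keypoint location (pixels)
conf = gaussian_map(12, 12, (true_x, true_y), sigma=1.5)

peak_x, peak_y = argmax_2d(conf)
refined_x, refined_y = refine(conf, peak_x, peak_y)
print((peak_x, peak_y), (refined_x, refined_y))
```

In this example the integer peak lands on the nearest pixel, and the refined estimate moves closer to the true location. The small window biases the centroid toward the peak, which is one reason the resolution of the confidence map still matters.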

For single-animal videos, there will be at most one instance of each keypoint type present in the image, so keypoint localization is the only step strictly required. For multi-animal videos, however, there may be multiple instances of each keypoint type, for example, multiple ‘heads’. Part grouping refers to the task of determining the set of keypoint detections that belong to the same animal within an image. This is often approached in either a bottom-up or top-down fashion (here the terms describe the model, not the camera view). In bottom-up models, all parts are detected, the association between them estimated (e.g., by using part affinity fields: Cao et al., 2017 ), and then grouped. In top-down models, the animals are detected, cropped out of the image, and then keypoints are located in the same fashion as in the single-animal case. These approaches have specific trade-offs. Bottom-up models tend to be more memory-intensive but also more robust to transient occlusions and work well with animals with relatively large bodies (e.g., rodents). By contrast, top-down models tend to run in less time since only subsets of the image are processed, and they work best with smaller body types that have fewer complex occlusions (e.g., flies). A notable consideration is that all single-animal pose estimation models can be used in the multi-animal setting if the animals can be detected and cropped as a preprocessing step ( Graving et al., 2019a ; Pereira et al., 2019 ). While both methods will work on most types of data, drastic improvements in performance and accuracy can be obtained by selecting the appropriate one; most pose estimation tools allow users to choose between the two approaches.
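
To make the part grouping task concrete, the sketch below assigns each detected keypoint to the nearest of several per-animal anchor points (for example, body centroids from a detector). This nearest-anchor rule is a deliberately naive stand-in chosen for illustration; it is not the part affinity field method and not the grouping algorithm of any specific tool. The data and the function `group_parts` are hypothetical.

```python
import math

def group_parts(anchors, detections):
    """Assign each detected keypoint to the animal with the nearest anchor.

    anchors: {animal_id: (x, y)}, one reference point per animal.
    detections: {part_name: [(x, y), ...]}, ungrouped detections per part type.
    """
    animals = {animal: {} for animal in anchors}
    for part, points in detections.items():
        for point in points:
            nearest = min(anchors, key=lambda a: math.dist(anchors[a], point))
            # Naive: a later detection overwrites an earlier one for the same part.
            animals[nearest][part] = point
    return animals

# Hypothetical detections from one frame containing two animals.
anchors = {"animal_a": (10.0, 10.0), "animal_b": (50.0, 50.0)}
detections = {
    "snout": [(14.0, 12.0), (55.0, 48.0)],
    "tail_base": [(47.0, 53.0), (5.0, 9.0)],
}
grouped = group_parts(anchors, detections)
print(grouped)
```

A rule this simple fails when animals are close together or overlapping, which is the situation that learned association methods are designed to handle.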

Once animals are detected and their keypoints located within a frame, the remaining task in multi-animal pose tracking is identification: repeatedly detecting the same animal across frame sequences. This can be approached as a multi-object tracking (MOT) problem, where animals are matched across frames based on a model or assumption about motion; or a re-identification (ReID) problem, where distinctive appearance features are used to unambiguously identify an animal. Both MOT and ReID (and hybrids) are available as standalone functionality in open-source tools, as well as part of multi-animal pose tracking packages. While MOT-based approaches can function on videos of animals with nearly indistinguishable appearances, they are prone to the error propagation issue inherent in methods with temporal dependencies: switching an animal’s identity even once will mean it is wrong for all subsequent frames. This presents a potentially intractable problem for long-term continuous recordings which may be impossible to manually proofread. ReID-like methods circumvent this problem by detecting distinguishing visual features, though this may not be compatible with existing datasets or all experimental paradigms.
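The MOT-style matching step can be illustrated with a small sketch. Here the "motion model" is simply the assumption that each animal moves as little as possible between consecutive frames, and the best assignment is found by trying every permutation. The function name `match_identities` and the use of centroids are hypothetical choices for illustration. Brute-force search is only practical for a handful of animals; larger groups are usually handled with an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations
from math import dist

def match_identities(prev_centroids, curr_centroids):
    """Assign each detection in the current frame to a track from the previous frame.

    prev_centroids: dict mapping track id -> (x, y) in the previous frame.
    curr_centroids: list of (x, y) detections in the current frame.
    Returns a dict mapping track id -> index into curr_centroids.
    Assumes the same number of animals is visible in both frames.
    """
    track_ids = list(prev_centroids)
    if len(track_ids) != len(curr_centroids):
        raise ValueError("this sketch requires equal numbers of tracks and detections")

    best_cost, best_perm = float("inf"), None
    # Try every one-to-one assignment and keep the one with the
    # smallest total displacement between frames.
    for perm in permutations(range(len(curr_centroids))):
        cost = sum(dist(prev_centroids[t], curr_centroids[j])
                   for t, j in zip(track_ids, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return dict(zip(track_ids, best_perm))
```

Because each frame is matched only to the previous one, a single wrong assignment carries forward to all later frames, which is the error propagation issue described above.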

The single most significant experimental consideration that will affect the identification problem is whether animals can be visually distinguished. A common experimental manipulation aimed at ameliorating this issue is to introduce visual markers to aid in unique identification of animals. This includes techniques such as grouping animals with different fur colors, painting them with non-toxic dyes ( Ohayon et al., 2013 ), or attaching barcode labels to a highly visible area of their body ( Crall et al., 2015 ). Though an essential part of the pose tracking workflow, identification remains a challenging problem in computer vision and its difficulty should not be underestimated when designing studies involving large numbers of interacting animals. We refer interested readers to previous reviews on multi-animal tracking ( Panadeiro et al., 2021 ; Pereira et al., 2020 ) for more comprehensive overviews of these topics.

Tools that are based on deep learning work by training deep neural networks (models) to reproduce human annotations of behavior. Methods that strictly depend on learning from human examples are referred to as fully supervised. In the case of animal pose tracking, these supervisory examples (labels) are provided in the form of images and the coordinates of the keypoints of each animal that can be found in them. Most pose tracking software tools fall within this designation and provide graphical interfaces to facilitate labeling. The usability of these interfaces is a crucial consideration as most of the time spent in setting up a pose tracking system will be devoted to manual labeling. The more examples and the greater their diversity, the better pose tracking models will perform. Previous work has shown that hundreds to thousands of labeled examples may be required to achieve satisfactory results, with a single example taking as much as 2 min to manually label (Mathis et al., 2018a; Pereira et al., 2022; Pereira et al., 2019). To mitigate this, we strongly recommend adopting a human-in-the-loop labeling workflow. This is a practice in which the user trains a model with few labels, generates (potentially noisy) predictions, and imports those predictions into the labeling interface for manual refinement before retraining the model. This can drastically reduce the amount of time taken to generate the thousands of labeled images necessary for reliable pose estimation models.

The rule of thumb is that ‘if you can see it, you can track it’, but this aphorism strongly depends on the examples provided to train the model. Important factors to consider in the labeling stage include labeling consistency and sample diversity. Labeling consistency involves minimizing the variability of keypoint placement within and across annotators, which helps to ensure that models can learn generalizable rules for keypoint localization. This can be accomplished by formalizing a protocol for labeling, especially for ambiguous cases such as when an animal’s body part is occluded. For example, one convention may be to consistently place a ‘paw’ keypoint at the apex of the visible portion of the body rather than guessing where it may be located beneath an occluding object. Similarly, the principle of consistency should inform which body parts are selected as tracked keypoints. Body parts that are not easily located by the human eye will suffer from labeling inconsistency, which may cause inferior overall performance as models struggle to find reliable solutions to detecting them. Sample diversity, on the other hand, refers to the notion that not all labeled examples have equal value when training neural networks. For example, labeling 1000 consecutive frames will ensure that the model is able to track data that looks similar to that segment of time, but the model will have limited capacity to generalize to data collected in a different session. As a best practice, labels should be sampled from the widest possible set of experimental conditions, time points, and imaging conditions that will be expected to be present in the final dataset.
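One simple way to act on the sample diversity recommendation is to spread the frames chosen for labeling across every recording and across time within each recording. The sketch below is an illustration under stated assumptions: the function name `sample_frames_for_labeling` and its inputs are hypothetical, and it only balances over videos and time, not over lighting or other imaging conditions.

```python
import random

def sample_frames_for_labeling(video_lengths, frames_per_video, seed=0):
    """Pick frames spread across all videos instead of one consecutive block.

    video_lengths: dict mapping video name -> number of frames.
    frames_per_video: number of frames to draw from each video.
    Returns a list of (video name, frame index) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selected = []
    for video, n_frames in sorted(video_lengths.items()):
        k = min(frames_per_video, n_frames)
        # Split the video into k equal time bins and draw one random
        # frame from each bin, so early and late frames are both covered.
        for b in range(k):
            start = b * n_frames // k
            stop = (b + 1) * n_frames // k
            selected.append((video, rng.randrange(start, stop)))
    return selected
```

For example, `sample_frames_for_labeling({"session1": 1000, "session2": 50}, 5)` returns five frames from each session, one from each fifth of the recording.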

Improving the capability of models to generalize to new data with fewer (or zero) labels is a currently active area of research. Techniques such as transfer learning and self-supervised learning aim to reduce the labeling burden by training models on related datasets or tasks. For example, B-KinD (Sun et al., 2021a) is able to discover semantically meaningful keypoints in behavioral videos using self-supervision without requiring human annotations. These approaches work by training models to solve problems similar to pose estimation and/or on data similar to those used for pose estimation, with the intuition that some of that knowledge can be reused, so that fewer (or no) labeled examples are required to achieve the same performance as fully supervised equivalents. Future work in this domain is on track to produce reusable models for commonly encountered experimental species and conditions. We highly encourage practitioners to adopt open data and model sharing to facilitate these efforts where possible.

3D pose estimation

Several methods have emerged in recent years for 3D tracking based on pose data. For some applications, it is of interest to track animals in complete 3D space. This affords a more detailed representation of the kinematics by resolving ambiguities inherent in 2D projections – an especially desirable property when studying behaviors that involve significant out-of-plane movement, such as in large arenas or non-terrestrial behaviors.

It is important to note that 3D motion capture comes at a significant increase in technical complexity. As discussed above (see ‘Data acquisition’), camera synchronization and calibration are paramount for applications using 3D tracking as the result of this step will inform downstream algorithms as to the relative spatial configuration of the individual cameras. This step may be sensitive to small camera movements that occur during normal operation of behavioral monitoring systems, potentially requiring frequent recalibration. The number and positioning of cameras are also major determinants of 3D motion capture performance, both of which may depend on the specific behavior of interest, arena size and bandwidth, and computing capabilities on the acquisition computer. In some cases, it may be easiest to use mirrors instead of multiple cameras to allow for recording behavior from multiple perspectives.

Given a calibrated camera system, several approaches have emerged that can enable 3D pose estimation in animals. The simplest approaches rely on using 2D poses detected in each camera view, such as those produced by SLEAP or DeepLabCut as described above, and then triangulating them into 3D. 2D poses can be detected by training 2D pose models on each camera view independently, or by training a single model on all views, with varying results depending on how different the perspectives are. Once 2D poses can be obtained, methods such as Anipose ( Karashchuk et al., 2021 ), OpenMonkeyStudio ( Bala et al., 2020 ), and DeepFly3D ( Günel et al., 2019 ) are able to leverage camera calibration information to project poses into 3D for triangulation. This involves optimizing for the best 3D location of each keypoint that still maps back to the detected 2D location in each view. This can be further refined with temporal or spatial constraints, such as known limb lengths. Using this approach, more cameras will usually result in better triangulation, but will suffer (potentially catastrophically) when the initial 2D poses are incorrect. Since many viewpoints will have inherent ambiguities when not all body parts are visible, the 2D pose estimation error issue can be a major impediment to implementing 3D pose systems using the triangulation-based approach.
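The triangulation step can be sketched with the standard linear (direct linear transform) formulation. This is a minimal illustration, not the implementation used by Anipose, OpenMonkeyStudio, or DeepFly3D. It assumes each camera is described by a 3x4 projection matrix obtained from calibration, and the function names `triangulate_point` and `reproject` are hypothetical.

```python
import numpy as np

def triangulate_point(projection_matrices, points_2d):
    """Linear triangulation of one keypoint seen in two or more calibrated views.

    projection_matrices: list of 3x4 camera projection matrices.
    points_2d: list of (u, v) detections, one per camera, in the same order.
    Returns the estimated 3D point as an array of shape (3,).
    """
    rows = []
    for P, (u, v) in zip(projection_matrices, points_2d):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def reproject(P, point_3d):
    """Project a 3D point back into a camera to get its (u, v) location."""
    x = np.asarray(P, dtype=float) @ np.append(point_3d, 1.0)
    return x[:2] / x[2]
```

Comparing `reproject(P, X)` with the original 2D detection in each view gives a reprojection error. A large error in one view is a useful signal that the 2D pose in that view was incorrect.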

Alternative approaches attempt to circumvent triangulation entirely. LiftPose3D (Gosztolai et al., 2021) describes a method for predicting 3D poses from single 2D poses, a process known as lifting. While this eliminates the need for multiple cameras, it requires a dataset of known 3D poses from which the 2D-3D correspondences can be obtained, and the multi-camera system used to collect that dataset must be similar to the target 2D system. DANNCE (Dunn et al., 2021), on the other hand, achieves full 3D pose estimation by extending the standard 2D confidence map regression approach to 3D using volumetric convolutions. In their approach, images from each camera view are projected onto a common volume based on the calibration, before being fed into a 3D convolutional neural network that outputs a single volumetric part confidence map. This approach has the major advantage that it is not susceptible to 2D pose estimation errors since it solves for the 3D pose in a single step while also being able to reason about information present in distinct views. The trade-offs with this approach are that it requires significantly more computational power due to the 3D convolutions, and that it requires 2D ground truth annotations on multiple views for a given frame.

Overall, practitioners should be mindful of the caveats of implementing 3D pose estimation and should consider whether its advantages are truly necessary given the added complexity. We note that at the time of writing, none of the above methods can natively support the multi-animal case in 3D, other than by treating the animals as individuals after preprocessing with a 2D multi-animal method for pose estimation. This limitation is due to issues with part grouping and identification as outlined above and would seem to be a future area of growth for animal pose estimation.

Behavior quantification

After using pose estimation to quantify the movements of animal body parts, there are a number of analyses that can be used to understand how movements differ by experimental conditions (Figure 2, parts 2–4). A simple option is to use statistical methods such as ANOVA to assess effects on discrete experimental variables such as the time spent in a given location or the velocity of movement between locations. These measures can also be obtained with data from simpler tracking methods, such as the commercially available EthoVision, TopScan, and ANY-maze programs. The primary benefits of the open-source pose estimation methods described in this paper over these commercially available programs are the richness of the data obtained from pose estimation (see Figure 1) and the flexibility to customize which behavioral features are tracked (see Figure 2).

If researchers want to go beyond kinematic readouts and investigate the behavior an animal is executing in more detail, then methods for segmenting behavior from the pose tracking data can be used. Behavioral segmentation methods are available to discern discrete episodes of individual events and/or map video or trajectory data to continuous lower-dimensional behavioral representations. Discrete episodes have a defined start and end in which the animal is performing a particular behavior, while continuous representations represent behavior more smoothly over time. Depending on the experimental conditions, discrete episodes can last from milliseconds up to minutes or longer. Segmentation can be done per animal, for example, detecting locomotion, or globally per frame, which is especially of interest for social behavior applications. In a global setting, researchers might be interested in finding behavioral episodes that are directed between animals, such as attacking or mounting behaviors.

If one wants to understand sequences of behaviors, there are many methods available to embed pose data into lower-dimensional representations. Such structures can be discovered through unsupervised methods. Some methods provide generic embeddings and do not explicitly model the dynamics of the behaving animal. Two examples of this approach are B-SOiD ( Hsu and Yttri, 2021a ), which analyses pose data with unsupervised machine learning, and MotionMapper ( Berman et al., 2014 ), a method that does not use pose estimation methods. These models embed data points based on feature dynamics (e.g., distance, speed) into a lower-dimensional space. Within this space it is possible to apply clustering algorithms for the segmentation of behavioral episodes. Generally, dense regions in this space (regions with many data points grouped together) are considered to be conserved behaviors. Other methods are aimed at explicitly capturing structure from the dynamics ( Batty et al., 2019 ; Bregler, 1997 ; Costa et al., 2019 ; Luxem et al., 2022a ; Shi et al., 2021 ; Sun et al., 2021c ). These models learn a continuous embedding that can be used to identify lower-dimensional trajectory dynamics that can be correlated to neuronal activity and segmented in relation to significant behavioral events.

Behavioral segmentation and other methods for quantification require a similar computing environment to that used for pose estimation. The input to those methods is generally the output of a pose estimation method (i.e., keypoint coordinates) or time series from a dimensionality reduction method such as principal component analysis that accounts for the keypoints or the raw video. It is crucial that pose estimation is accurate, as the segmentation capabilities of the subsequent methods are bounded by pose tracking quality. Highly noisy keypoints will drown out biological signals and make the segmentation results hard to interpret, especially for unsupervised methods. Furthermore, identity switches between virtual markers can be catastrophic for multi-animal tracking and segmentation. A summary of methods for behavioral segmentation is provided in Table 3.
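To make the inputs to segmentation methods concrete, the sketch below turns keypoint coordinates into simple kinematic features (keypoint speeds and inter-keypoint distances) and projects them onto principal components. The function names, the array layout `(n_frames, n_keypoints, 2)`, and the choice of features are assumptions for illustration. Individual packages compute their own feature sets.

```python
import numpy as np

def kinematic_features(keypoints, fps):
    """Compute per-frame features from 2D keypoints.

    keypoints: array of shape (n_frames, n_keypoints, 2), in pixels.
    fps: acquisition rate in frames per second.
    Returns an array of shape (n_frames, n_features) holding the speed of
    each keypoint (pixels/s) followed by all pairwise keypoint distances.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    # Frame-to-frame displacement; the first frame is assigned zero speed.
    step = np.diff(keypoints, axis=0, prepend=keypoints[:1])
    speed = np.linalg.norm(step, axis=-1) * fps
    # Distances between every pair of keypoints within each frame.
    i, j = np.triu_indices(keypoints.shape[1], k=1)
    dists = np.linalg.norm(keypoints[:, i] - keypoints[:, j], axis=-1)
    return np.concatenate([speed, dists], axis=1)

def pca_project(features, n_components=2):
    """Project mean-centered features onto their top principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Because speeds are computed from frame-to-frame differences, jitter in the keypoints is amplified in these features, which is one way noisy pose estimates degrade downstream segmentation.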

Table 3. Methods for behavioral segmentation using pose data.

SimBA is a supervised learning pipeline for importing pose estimation data and a graphical interface for interacting with a popular machine learning algorithm called Random Forest. SimBA was developed for studies in social behavior and aggression and has been shown to be able to discriminate between attack, pursuit, and threat behaviors in studies using rats and mice.
MARS is another supervised learning pipeline developed for studies of social interaction behaviors in rodents, such as attacking, mounting, and sniffing, and uses the XGBoost gradient boosting classifier.
B-SOiD uses unsupervised methods to learn and discover the spatiotemporal features in pose data of ongoing behaviors, such as grooming and other naturalistic movements in rodents, flies, or humans. B-SOiD uses UMAP embedding to account for dynamic features within video frames that are grouped using an algorithm for cluster analysis, HDBSCAN. Clustered spatiotemporal features are then used to train a classifier (Random Forest) to detect behavioral classes in data sets that were not used to train the model and with millisecond precision.
VAME uses self-supervised deep learning models to infer the full range of behavioral dynamics based on the animal movements from pose data. The variational autoencoder framework is used to learn a generative model. An encoder network learns a representation from the original data space into a latent space. A decoder network learns to decode samples from this space back into the original data space. The encoder and decoder are parameterized with recurrent neural networks. Once trained, the learned latent space is parameterized by a Hidden Markov Model to obtain behavioral motifs.
TREBA relates measures from pose estimation to other quantitative or qualitative data associated with each frame in a video recording. Similar to VAME, a neural network is trained to learn to predict movement trajectories in an unsupervised manner. TREBA can then incorporate behavioral attributes, such as movement speed, distance traveled, and heuristic labels for behavior (e.g., sniffing, mounting, attacking) into representations of the pose estimation data learned by its neural networks, thereby bringing aspects of supervised learning. This is achieved using a technique called task programming.

Before selecting any approach to segment animal behavior, it is important to first define the desired outcome. If the goal is to identify episodes of well-defined behaviors like rearing or walking, then the most straightforward approach is to use a supervised method. Moreover, a supervised learning approach is generally a good starting point, and the outputs of these models can be layered on top of unsupervised models to give them immediate interpretability. One tradeoff, however, is the extensive training datasets that are often required to ensure good supervised segmentation. Such methods can be established quite easily using standard machine learning libraries available for Python, R, and MATLAB, if one already has experience in building these methods. Alternatively, open-source packages such as SimBA (Nilsson et al., 2020a) or MARS (Segalin et al., 2021a) can be used, which is especially beneficial for those who are relatively new to the topic of machine learning. However, if the researcher wants to understand more about the spatiotemporal structure of the behaving animal, they either need to label many different behaviors within the video or turn to unsupervised methods. Unsupervised methods offer the advantage of identifying clusters in the video or keypoint time series and quantifying behavior in each frame. The recently developed A-SOiD, an active-learning algorithm, iteratively combines these supervised and unsupervised approaches to reduce the amount of training data required and enable the discovery of additional behavior and structure (Schweihoff et al., 2022).

Interpreting the lower-dimensional structures in a 2D/3D projection plot can be difficult and it is advised to visualize examples from this projection space. Generative methods like VAME offer the possibility to sample cluster categories from this embedding space to qualitatively check if similar patterns are learned. Another task unsupervised methods are capable of is fingerprinting. Here, the embedding space is used as a signature to discern general changes in phenotypes ( Wiltschko et al., 2020 ). An alternative to using an explicitly supervised or unsupervised approach is to combine these approaches (semi-supervised), as implemented in a package called TREBA ( Sun et al., 2021c ). TREBA uses generative modeling in addition to incorporating behavioral attributes, such as movement speed, distance traveled, or heuristic labels for behavior (e.g., sniffing, mounting, attacking) into learned behavioral representations. It has been used in a number of different experimental contexts, most notably for understanding social interactions between animals.

Finally, as behavior is highly hierarchically structured, multiple spatio-temporal scales of description may be desired, for example, to account for bouts of locomotion and transitions from running to escaping behavior (Berman, 2018). It is possible to create a network representation and identify ‘cliques’ or ‘communities’ on the resulting graph (Luxem et al., 2022a; Markowitz et al., 2018). These descriptions represent human-identifiable behavioral categories within highly interconnected sub-second segments of behavior. These representations can provide insights into the connection between different behavioral states and the transitions between states and their biological meaning.
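A network representation of this kind can be built from a sequence of per-frame behavioral labels by estimating how often one behavior follows another. The sketch below is a minimal illustration: the function name `transition_probabilities` and the decision to ignore self-transitions are assumptions, and the community detection step itself is not shown.

```python
from collections import Counter, defaultdict

def transition_probabilities(motif_sequence):
    """Estimate P(next motif | current motif) from per-frame motif labels.

    Consecutive repeats are collapsed into bouts first, so only changes
    from one motif to a different motif are counted.
    Returns a nested dict: probs[current][next] = probability.
    """
    # Collapse runs of identical labels into single bouts.
    bouts = [m for i, m in enumerate(motif_sequence)
             if i == 0 or m != motif_sequence[i - 1]]
    # Count how often each motif is followed by each other motif.
    counts = defaultdict(Counter)
    for current, following in zip(bouts, bouts[1:]):
        counts[current][following] += 1
    # Normalize counts into probabilities for each starting motif.
    return {current: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
            for current, nxts in counts.items()}
```

The resulting probabilities can serve as edge weights of a directed graph whose nodes are behavioral motifs, on which groups of strongly connected motifs can then be sought.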

Having described how to set up and use video recording methods and analysis methods for pose estimation, we would like to close by discussing some best practices in the use and development of methods for video analysis, including recommendations for the open sharing of video data and analysis code.

Best practices for experimenters

For those using video analysis methods in a laboratory setting, several key practices should be followed. It is most crucial to develop a means of storing files in a manner in which they can be accessed in the lab, through cloud computing resources, and in data archives. These issues are discussed above in the ‘Hardware and software for data analysis’ section of this paper. Documentation of hardware is also a key best practice. All methods sections of manuscripts that use methods for video analysis should include details on the camera and lens that were used, the locations of and distances from the cameras relative to the behavioral arena, the acquisition rate and image resolution, environmental lighting (e.g., IR grids placed above the behavioral arena), and properties of the arena (size, material, color, etc.).

Beyond within-lab data management and reporting details on hardware used in research manuscripts, more widespread sharing of video data is very much needed and is a core aspect of best practices for experimenters. In accordance with the demands of funders such as the NIH for data sharing, the open sharing of raw and processed videos and pose tracking data is crucial for research reproducibility and also for training new users on video methods. Several groups have created repositories to address this need ( Computational Behavior , OpenBehavior ). With widespread use, these repositories will help new users learn the required methods for data analysis, enable new analyses of existing datasets that could lead to new findings without having to do new experiments, and would enable comparisons of existing and newly developed methods for pose estimation and behavioral quantification. The latter benefit of data sharing could lead to insight into a major open question about methods for animal pose estimation: how choices about the parameters of any method for pose estimation or subsequent analysis impact analysis time, accuracy, and generalizability. Without these resources, it has not been possible to make confident statements about how existing methods compare across a wide range of datasets involving multiple types of research animals and in different experimental contexts. Guidance for how to implement data sharing can be found in several existing efforts of the machine learning community ( Gebru et al., 2021 ; Hutchinson et al., 2021 ; Stoyanovich and Howe, 2019 ). A more widespread use of these frameworks for sharing data can improve the transparency and accessibility of research data for video analysis.

Best practices for developers

We recommend that three topics receive more attention from developers of methods for video analysis. First, there is a need for a common file format for storing results from pose estimation. Second, there is a need for methods to compare pose estimation packages and assess the impact of the parameters of each package on performance in terms of accuracy and user time. Third, there is a need for better code documentation and analysis reproducibility. Each of these issues is discussed below. In addition to these topics, we would like to encourage developers to design interfaces that make their tools more accessible to novice users. This will allow the tools to become more widely used and studied, and will avoid limiting their use to researchers with advanced technical skills such as programming.

First, it is important to point out that there is no common and efficient data format available for tools that enable pose estimation in animal research. Such a format would allow users to compare methods without having to recode their video data. The FAIR data principles ( Wilkinson et al., 2016 ) are particularly apt for developing a common data format for video due to the large heterogeneity of data sources, intermediate analysis outputs, and end goals of the study. These principles call for data to be Findable (available in searchable repositories and with persistent and citable identifiers [DOIs]), Accessible (easily retrieved using the Internet), Interoperable (having a common set of terms to describe video data across datasets), and Reusable (containing information about the experimental conditions and outputs of any analysis or model to allow another group to readily make use of the data). A common file format for saving raw and processed video recordings and data from pose estimation models is needed to address these issues.

Second, there has also been a general lack of direct comparisons of different methods and parameter exploration within a given method on a standard set of videos. The choice of deep learning method and specific hyperparameters can affect the structural biases embedded in video data, thereby affecting the effectiveness of a given method (Sculley et al., 2015). Yet, it seems that many users stick to default parameters available in popular packages. For example, in pose estimation, certain properties of neural network architectures such as their maximum receptive field size can dramatically impact performance across species owing to the variability in morphological features (Pereira et al., 2022). In addition to the intrinsic properties of particular species (e.g., Hayden et al., 2022), the analysis type will also dictate the importance of particular parameters for task performance. For example, algorithms that achieve temporal smoothness in pose tracking are crucial for studies of fine motor control (Wu et al., 2020), but perhaps not as essential as preventing identity swaps for studies of social behavior (Pereira et al., 2022; Segalin et al., 2021a). Another important issue is that most methods do not report well-calibrated measures of the confidence of model fits or predictions. This is important as it has become clear that machine learning tools tend to be overconfident in their predictions (Abdar et al., 2021). Establishing standardized, interoperable data formats and datasets that include estimates of the fitted models and their predictions will enable comprehensive comparisons of existing and new methods for pose estimation and behavioral quantification.

For evaluating specific methods on lab-specific data, appropriate metrics and baseline methods for the research questions should be chosen. There may be cases where comparable baseline methods do not exist, for example, if a lab develops a new method for quantifying behavior for a specific organism or task on a lab-specific dataset and there are no existing studies for that task. However, if related methods exist, it would be beneficial to compare the performance of the new method against existing methods to study its advantages and disadvantages. For more general claims (e.g., state-of-the-art pose estimator across organisms), evaluations on existing datasets and comparisons with baselines are important (see Table 4) to demonstrate the generality of the method and improvements over existing methods. A consensus on a standard set of data in the community for evaluation and an expansion to include more widely used behavioral tasks and assays would facilitate general model development and comparison. We show existing datasets in the community for method development in Table 4 and encourage the community to continue to open-source data and expand this list of available datasets to accelerate model development.
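As one example of a metric for comparing pose estimators on held-out labeled frames, the sketch below computes the mean Euclidean keypoint error and the percentage of correct keypoints (PCK) within a distance threshold. The function names are hypothetical, and the use of an absolute pixel threshold is a simplifying assumption, since thresholds are often defined relative to body size instead.

```python
import numpy as np

def keypoint_errors(predicted, ground_truth):
    """Euclidean distance between predicted and labeled keypoints.

    Both inputs have shape (n_frames, n_keypoints, 2). Missing labels or
    predictions can be marked with NaN and are skipped in the summary.
    """
    diff = np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def summarize_errors(errors, threshold):
    """Return (mean error, PCK), where PCK is the fraction of valid
    keypoints whose error is at most `threshold` pixels."""
    valid = ~np.isnan(errors)
    mean_error = float(errors[valid].mean())
    pck = float((errors[valid] <= threshold).mean())
    return mean_error, pck
```

Reporting both numbers on the same held-out frames for a new method and a baseline gives a direct comparison; the mean is sensitive to rare large errors, while PCK reflects how often predictions are close enough to be usable.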

Table 4. Datasets for model development.

Task | Description | Subjects
2D/3D Pose Estimation | Videos from 4 camera views with poses from motion capture | Human (single-agent)
2D Pose Estimation | Images from uncontrolled settings with annotated poses | Human (multi-agent)
2D Pose Estimation & Tracking | Videos from crowded scenes with annotated poses | Human (multi-agent)
2D Pose Estimation | Images of diverse animal species with annotated poses | Diverse species (single & multi-agent)
2D Pose Estimation | Videos from 2 camera views with annotated poses | Mouse (multi-agent)
2D/3D Pose Estimation & Tracking | Videos from 2 camera views with annotated poses | Zebrafish (multi-agent)
2D/3D Pose Estimation | Images with annotated poses from a 62 camera setup | Monkey (single-agent)
2D/3D Pose Estimation & Tracking | Videos from 12 camera views with poses from motion capture | Rat (multi-agent)
2D/3D Pose Estimation & Tracking | Videos from moving phone camera in challenging outdoor settings | Human (multi-agent)
2D/3D Pose Estimation | Videos from 14 camera views with poses from motion capture | Human (single-agent)
2D/3D Pose Estimation | Videos from 12 camera views with poses from motion capture | Rat (single-agent)
Video-level Action Classification | Videos from uncontrolled settings that cover 700 human actions | Human (single & multi-agent, may interact with other organisms/objects)
Video-level Action Classification (also has 3D poses) | Videos from 80 views and depth with 60 human actions | Human (single & multi-agent)
Frame-level Action Classification | Videos from uncontrolled settings with 65 action classes | Human (single & multi-agent)
Frame-level Behavior Classification | Videos from 2 views, with 13 annotated social behaviors | Mouse (multi-agent)
Frame-level Behavior Classification (also has 2D poses) | Videos & trajectory, with 10 annotated social behaviors | Fly (multi-agent)
Frame-level Behavior Classification (also has 2D poses) | Videos & trajectory, with 10 annotated social behaviors | Mouse (multi-agent)
Frame-level Behavior Classification (also has 2D poses) | Top-down views, 7 annotated keypoints, hundreds of videos | Mouse (multi-agent)

Third, reproducibility of results is crucial for acceptance of new methods for video analysis within the research community and for research transparency. Guidance for documenting the details of models and algorithms can be obtained from the Machine Learning Reproducibility Checklist, which is applicable to any computational model. Importantly, the checklist calls for including the range of hyperparameters considered for experiments, the mean and variance of results from multiple runs, and an explanation of how samples were allocated for train/validation/test. Further guidance for sharing code is available in this GitHub resource: Publishing Research Code. It provides tips on open-sourcing research code, including specifications of code dependencies, training and evaluation code, and including pre-trained models as part of any code repository. Beyond these resources, we note that there is also a broader definition of reproducibility in that experiments should be robustly reproducible: experimental results should ideally not vary significantly under minor perturbations. For example, even if there are minor variations to lighting or arena size from the original experiments, the video analysis results should not change significantly. How to build a framework that ensures robust reproducibility is currently an open question, but the existing frameworks should facilitate producing the same results under the same experimental conditions.

Model interpretability is another important consideration, depending on the purpose of the video analysis experiment. Many machine learning models are ‘black box’ models and are not easily interpretable; post hoc explanations of such models may not always be reliable (Rudin, 2019). One way to generate human-interpretable models is through program synthesis (Balog et al., 2017) and neurosymbolic learning (Sun et al., 2022; Zhan et al., 2021). These methods learn compositions of symbolic primitives, which are closer in form to human-constructed models than neural networks. Interpretable models can facilitate reproducibility and trustworthiness in model predictions for scientific applications. Efforts at deploying these approaches for video analysis and behavioral quantification are very much needed.
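The checklist items mentioned above (hyperparameter ranges, mean and variance across runs, and how samples were allocated to train/validation/test) can be recorded with very little code. The following is a minimal sketch in Python using only the standard library. The function names, the 70/15/15 split, and the `evaluate` callable are assumptions made for illustration; they are not part of the checklist or of any tool reviewed here.

```python
import random
import statistics


def split_indices(n_samples, seed, frac_train=0.7, frac_val=0.15):
    """Shuffle sample indices with a fixed seed and cut them into three sets.

    The 70/15/15 default is an illustrative assumption, not a recommendation.
    """
    rng = random.Random(seed)  # local generator, so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * frac_train)
    n_val = round(n_samples * frac_val)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


def summarize_runs(scores):
    """Mean and sample variance of one metric across repeated runs."""
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.variance(scores),  # needs at least two runs
        "n_runs": len(scores),
    }


def build_report(n_samples, seeds, hyperparameter_ranges, evaluate):
    """Gather the reportable items into a single dictionary.

    `evaluate(split, seed)` is a hypothetical user-supplied function that
    trains a model on split["train"] and returns one test-set score.
    """
    scores = []
    split_sizes = {}
    for seed in seeds:
        split = split_indices(n_samples, seed)
        split_sizes[seed] = {name: len(idx) for name, idx in split.items()}
        scores.append(evaluate(split, seed))
    return {
        "hyperparameter_ranges": hyperparameter_ranges,
        "split_sizes_by_seed": split_sizes,
        "test_metric": summarize_runs(scores),
    }


if __name__ == "__main__":
    def placeholder_evaluate(split, seed):
        # Stand-in for real training; the returned number is meaningless.
        return random.Random(seed).uniform(0.8, 0.9)

    report = build_report(
        n_samples=200,
        seeds=[0, 1, 2, 3, 4],
        hyperparameter_ranges={"learning_rate": [1e-4, 1e-2]},
        evaluate=placeholder_evaluate,
    )
    print(report["test_metric"])
```

Saving the seeds together with the report lets a reader regenerate exactly the same splits, which is one simple way to make the sample allocation explicit.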

We hope that our review of the current state of open-source tools for behavioral video analysis will be helpful to the community. We described how to set up video methods in a lab, provided an overview of currently available methods, and provided guidance for best practices in using and developing the methods. As newer tools emerge and more research groups become proficient at using available methods, there is a clear potential for the tools to help advance our understanding of the neural basis of behavior.

National Science Foundation (1948181), National Institutes of Health (DA046375), Natural Sciences and Engineering Research Council of Canada (PGSD3-532647-2019), National Institutes of Health (MH002952), National Institutes of Health (MH124042), National Institutes of Health (MH128177), National Science Foundation (2024581).

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.


This paper emerged from a working group on methods for video analysis organized by the OpenBehavior project in the summer and fall of 2021. Ann Kennedy, Greg Corder, and Sam Golden were major contributors to the working group and their ideas impacted this manuscript. We would like to thank Ann Kennedy, Samantha White, and Jensen Palmer for helpful comments on the manuscript. NSERC Award #PGSD3-532647-2019 to JJS; NIH MH002952 for SPB; NIH MH124042 for KK; NIH MH128177 and NSF 2024581 to JZ; NSF 1948181 and NIH DA046375 to ML.

© 2023, Luxem, Sun et al.

This article is distributed under the terms of the Creative Commons Attribution License

  • pose estimation
  • open source
  • reproducibility

In conclusion, previous research has focused on using video as a tool for data collection and analysis, but there are emerging opportunities for video to play an increased role in mixed methods research as a tool for the presentation of findings. By leveraging storytelling techniques used in documentary film, while staying true to the analytical methods of the research design, researchers can use video to effectively communicate implications of their work to an audience beyond academics and use video storytelling to disseminate findings to the public.

Using motion pictures to support ethnographic research began in the late nineteenth century when both fields were early in their development (Henley, 2010; “Using Film in Ethnographic Field Research, - The University of Manchester,” n.d.). While technologies have changed dramatically since the 1890s, researchers are still employing visual media to support social science research. Photographic imagery and video footage can be integral aspects of data collection, analysis, and reporting research studies. As digital cameras have improved in quality, size, and affordability, digital video has become an increasingly useful tool for researchers to gather data, aid in analysis, and present results.

Storytelling, however, has been around much longer than either video or ethnographic research. Using narrative devices to convey a message visually was a staple in the theater of early civilizations and remains an effective tool for engaging an audience today. Within the medium of video, storytelling techniques are an essential part of a documentary filmmaker’s craft. Storytelling can also be a means for researchers to document and present their findings. In addition, multimedia outputs allow for interactions beyond traditional, static text (R. Goldman, 2007; Tobin & Hsueh, 2007). Digital video as a vehicle to share research findings builds on the affordances of film, ethnography, and storytelling to create new avenues for communicating research (Heath, Hindmarsh, & Luff, 2010).

In this study, we look at the current literature regarding the use of video in research and explore how digital video affordances can be applied in the collection and analysis of quantitative and qualitative human subject data. We also investigate how video storytelling can be used for presenting research results. This creates a frame for how data collection and analysis can be crafted to maximize the potential use of video data to create an audiovisual narrative as part of the final deliverables from a study. As researchers we ask the question: have we leveraged the use of video to communicate our work to its fullest potential? By understanding the role of video storytelling, we consider additional ways that video can be used to not only collect and analyze data, but also to present research findings to a broader audience through engaging video storytelling. The intent of this study is to develop a frame that improves our understanding of the theoretical foundations and practical applications of using video in data collection, analysis, and the presentation of research findings.

Literature review

The review of relevant literature includes important aspects for situating this exploration of video research methods: the history, affordances and constraints of using video in research, the use of video in mixed methods design, and the traditions of research as storytelling. Although this overview provides an extensive foundation for understanding video research methods, this is not intended to serve as a meta-analysis of all publications related to video and research methods. Examples of prior work provide a conceptual and operational context for the role of video in mixed methods research and present theoretical and practical insights for engaging in similar studies. Within this context, we examine ethical and logistical/procedural concerns that arise in the design and application of video research methods, as well as the affordances and constraints of integrating video. In the following sections, the frame provided by the literature is used to view practical examples of research using video.

The history of using video in research is founded first in photography, next in film, and more recently in digital video. All three tools provide the ability to create instant artifacts of a moment or period of time. These artifacts become data that can be analyzed at a later date, perhaps in a different place and by a different audience, giving researchers the chance to intricately and repeatedly examine the archive of information contained within. These records “enable access to the fine details of conduct and interaction that are unavailable to more traditional social science methods” (Heath et al., 2010, p. 2).

In social science research, video has been used for a range of purposes and accompanies research observation in many situations. For example, in classroom research, video is used to record a teacher in practice and then used as a guide and prompt to interview the teacher as they reflect upon their practice (e.g. Tobin & Hsueh, 2007). Video captures events from a situated perspective, providing a record that “resists, at least in the first instance, reduction to categories or codes, and thus preserves the original record for repeated scrutiny” (Heath et al., 2010, p. 6). These audio-visual recordings give the social science researcher the chance to reflect on their subjectivities throughout analysis and to use the video as a microscope that “allow(s) actions to be observed in a detail not even accessible to the actors themselves” (Knoblauch & Tuma, 2011, p. 417).

Examining the affordances and constraints of video in research gives a researcher the opportunity to weigh the value of including video within a study. An affordance of video, when used in research, is that it allows the researcher to see an event through the camera lens either actively or passively and later share what they have seen, or more specifically, the way they saw it (Chalfen, 2011). Cameras can be used to capture an event in three different modes: Responsive, Interactive, and Constructive. Responsive mode is reactive. In this mode, the researcher captures and shows the viewer what is going on in front of the lens but does not directly interfere with the participants or events. Interactive mode puts the filmmaker into the storyline as a participant and allows the viewer to observe the interactions between the researcher and participant. One example of video captured in Interactive mode is an interview. In Constructive mode, the researcher reprocesses the recorded events to create an explicitly interpretive final product through the process of editing the video (MacDougall, 2011). All of these modes, in some way, frame or constrain what is captured and consequently shared with the audience.

Due to the complexity of the classroom-research setting, everything that happens during a study cannot be captured using video, observation, or any other medium. Video footage, like observation, is necessarily selective and has been stripped of the full context of the events, but it does provide a more stable tool for reflection than the ever-changing memories of the researcher and participants (Roth, 2007). Decisions regarding inclusion and exclusion are made by the researcher throughout the entire research process from the initial framing of the footage to the final edit of the video. Members of the research team should acknowledge how personal bias impacts these decisions and make their choices clear in the research protocol to ensure inclusivity (Miller & Zhou, 2007).

One affordance of video research is that analysis of footage can actually disrupt the initial assumptions of a study. Analysis of video can be standardized or even mechanized by seeking out predetermined codes, but it can also disclose the subjective by revealing the meaning behind actions and not just the actions themselves (S. Goldman & McDermott, 2007; Knoblauch & Tuma, 2011). However, when using subjective analysis the researcher needs to keep in mind that the footage only reveals parts of an event. Ideally, a research team has a member who acts as both a researcher and a filmmaker. That team member can provide an important link between the full context of the event and the narrower viewpoint revealed through the captured footage during the analysis phase.

Although many participants are initially camera-shy, they often find enjoyment from participating in a study that includes video (Tobin & Hsueh, 2007). Video research provides an opportunity for participants to observe themselves and even share their experience with others through viewing and sharing the videos. With increased accessibility of video content online and the ease of sharing videos digitally, it is vital from an ethical and moral perspective that participants understand the study release forms and how their image and words might continue to be used and disseminated for years after the study is completed.

Including video in a research study creates both affordances and constraints regarding the dissemination of results. Finding a journal for a video-based study can be difficult. Traditional journals rely heavily on static text and graphics, but newly-created media journals include rich and engaging data such as video and interactive, web-based visualizations (Heath et al., 2010). In addition, videos can provide opportunities for research results to reach a broader audience outside of the traditional research audience through online channels such as YouTube and Vimeo.

Use of mixed methods with video data collection and analysis can complement the design-based, iterative nature of research that includes human participants. Design-based video research allows for both qualitative and quantitative collection and analysis of data throughout the project, as various events are encapsulated for specific examination as well as analyzed comparatively for changes over time. Design research, in general, provides the structure for implementing work in practice and iterative refinement of design towards achieving research goals (Collins, Joseph, & Bielaczyc, 2004). Using an integrated mixed method design that cycles through qualitative and quantitative analyses as the project progresses gives researchers the opportunity to observe trends and patterns in qualitative data and quantitative frequencies as each round of analysis informs additional insights (Gliner et al., 2009). This integrated use also provides a structure for evaluating project fidelity on an ongoing basis through a range of data points and findings from analyses that are consistent across the project. The ability to revise procedures for data collection, systematic analysis, and presenting work does not change the data being collected, but gives researchers the opportunity to optimize procedural aspects throughout the process.

Research as storytelling refers to the narrative traditions that underpin the use of video methods to analyze in a chronological context and present findings in a story-like timeline. These traditions are evident in ethnographic research methods that journal lived experiences through a period of time and in portraiture methods that use both aesthetic and scientific language to construct a portrait (Barone & Eisner, 2012; Heider, 2009; Lawrence-Lightfoot, 2005; Lenette, Cox & Brough, 2013).

In existing research, there is also attention given to the use of film and video documentaries as sources of data (e.g. Chattoo & Das, 2014; Warmington, van Gorp & Grosvenor, 2011); however, our discussion here focuses on using media to capture information and communicate resulting narratives for research purposes. In our work, we promote a perspective on emergent storytelling that develops from data collection and analysis, allowing the research to drive the narrative, and situating it in the context from where data was collected. We rely on theories and practices of research and storytelling that leverage the affordances of participant observation and interview for the construction of narratives (Bailey & Tilley, 2002; de Carteret, 2008; de Jager, Fogarty & Tewson, 2017; Gallagher, 2011; Hancox, 2017; LeBaron, Jarzabkowski, Pratt & Fetzer, 2017; Lewis, 2011; Meadows, 2003).

The type of storytelling used with research is distinctly different from methods used with documentaries, primarily with the distinction that, while documentary filmmakers can edit their film to a predetermined narrative, research storytelling requires that the data be analyzed and reported within a different set of ethical standards (Dahlstrom, 2014; Koehler, 2012; Nichols, 2010). Although documentary and research storytelling use a similar audiovisual medium, creating a story for research purposes is ethically-bounded by expectations in social science communities for being trustworthy in reporting and analyzing data, especially related to human subjects. Given that researchers using video may not know what footage will be useful for future storytelling, they may need to design their data collection methods to allow for an abundance of video data, which can impact analysis timelines as well. We believe it important to note these differences in the construction of related types of stories to make overt the essential need for research to consider not only analysis but also creation of the reporting narrative when designing and implementing data collection methods.

This study uses existing literature as a frame for understanding and implementing video research methods, then employs this frame as perspective on our own work, illuminating issues related to the use of video in research. In particular, we focus on using video research storytelling techniques to design, implement, and communicate the findings of a research study, providing examples from Dr. Erica Walker’s professional experience as a documentary filmmaker as well as evidence from current and former academic studies. The intent is to improve understanding of the theoretical foundations and practical applications for video research methods and better define how those apply to the construction of story-based video output of research findings.

The study began with a systematic analysis of theories and practices, using interpretive analytic methods, with thematic coding of evidence for conceptual and operational aspects of designing and implementing video research methods. From this information, a frame was constructed that includes foundational aspects of using digital video in research as well as the practical aspects of using video to create narratives with the intent of presenting research findings. We used this frame to interpret aspects of our own video research, identifying evidence that exemplifies aspects of the frame we used.

A primary goal for the analysis of existing literature was to find evidence that illustrates how, when, and why video research methods are useful for publishing and disseminating transferable knowledge from research. This emphasis on communicating results in both theoretical and practical ways highlighted areas within the analysis for potential contextual similarities between our work and other projects. A central reason for interpreting findings and connecting them with evidence was the need to provide examples that could serve as potentially transferable findings for others using video with their research. Given the need for a fertile environment (Zhao & Frank, 2003) and attention to contextual differences to avoid lethal mutations (Brown & Campione, 1996), we understand that these examples may not work for every situation, but the intent is to provide clear evidence of how video research methods can leverage storytelling to report research findings in a way that is consumable by a broader audience.

In the following section, we present findings from the review of research and practice, along with evidence from our work with video research, connecting the conceptual and operational frame to examples and teasing out aspects from existing literature.

Results and findings

To survey the current literature on the use of video in research, we developed a Mind Map that categorizes its convergent and divergent themes (see Fig. 1). Although this is far from a complete meta-analysis on video research (notably absent is a comprehensive discussion of ethical concerns regarding video research), the Mind Map focuses on four main properties with regard to video: video as a tool for storytelling/research, properties of the camera/video itself, how video impacts the person/researcher, and methods by which the researcher/viewer consumes video.

Figure 1. Mind Map of current literature regarding the use of video in mixed methods research. Link to the fully interactive Mind Map: http://clemsongc.com/ebwalker/mindmap/

Video, when used as a tool for research, can document and share ethnographic, epistemic, and storytelling data with participants and the research team (R. Goldman, 2007; Heath et al., 2010; Miller & Zhou, 2007; Tobin & Hsueh, 2007). Much of the research in this area focuses on the properties (both positive and negative) inherent in the camera itself such as how video footage can increase the ability to see and experience the world, but can also act as a selective lens that separates an event from its natural context (S. Goldman & McDermott, 2007; Jewitt, n.d.; Knoblauch & Tuma, 2011; MacDougall, 2011; Miller & Zhou, 2007; Roth, 2007; Sossi, 2013).

Some research speaks to the role of the video-researcher within the context of the study, likening a video researcher to a participant-observer in ethnographic research (Derry, 2007; Roth, 2007; Sossi, 2013). The final category of research within the Mind Map focuses on the process of converting the video from an observation to records to artifact to dataset to pattern (Barron, 2007; R. Goldman, 2007; Knoblauch & Tuma, 2011; Newbury, 2011). Through this process of conversion, the video footage itself becomes an integral part of both the data and findings.

The focus throughout current literature was on video as data and the role it plays in collection and analysis during a study, but little has been written about how video could be used as a vehicle to present findings of a study. Current literature also did not address whether video-data could be used as a tool to communicate the findings of the research to a broader audience.

In a recent two-year study, the research team led by Dr. Erica Walker collected several types of video footage with the embedded intent to use video both as data and for telling the story of the study and findings once concluded (Walker, 2016). The study focused on a multidisciplinary team that converted a higher education Engineering course from lecture-based to game-based learning using the Cognitive Apprenticeship educational framework. The research questions examined the impact that the intervention had on student learning of domain content and twenty-first century skills. Utilizing video as both a data source and a delivery method was built into the methodology from the beginning. Therefore, interviews were conducted with the researchers and instructors before, during, and after the study to document consistency and changes in thoughts and observations as the study progressed. At the conclusion of the study, student participants reflected on their experience directly through individual video interviews. In addition, every class was documented using two static cameras, placed at different angles and framing, and a mobile camera unit to capture closeup shots of student-instructor, student-student, and student-content interactions. This resulted in more than six hundred minutes of interview footage and over five thousand minutes of classroom footage collected for the study.

Video data can be analyzed through quantitative methods (frequencies and word maps) as well as qualitative methods (emergent coding and commonalities versus outliers). Ideally, both methods are used in tandem so that preliminary results can continue to inform the overall analysis as it progresses. In order to capitalize on both methods, each interview was transcribed. The researchers leveraged digital and analog methods of coding such as digital word-search alongside hand coding the printed transcripts. Transcriptions contained timecode notations throughout, so coded segments could quickly be located in the footage and added to a timeline creating preliminary edits.
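The combination described above, word frequencies plus timecoded transcripts that point back into the footage, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `[HH:MM:SS]` line format, the function names, and the sample transcript are all hypothetical and do not come from the study, which used NVivo, Adobe Prelude, and hand coding.

```python
import re
from collections import Counter

# Assumed transcript format: one segment per line, "[HH:MM:SS] spoken text".
LINE_PATTERN = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")

# Invented placeholder text, not data from any study.
SAMPLE_TRANSCRIPT = """
[00:00:05] The game made the course feel different.
[00:01:10] I learned more from the game than from lectures.
[00:02:30] Teamwork was the hardest part.
"""


def parse_transcript(text):
    """Return a list of (start_seconds, spoken_text) for each timecoded line."""
    segments = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines without a timecode are skipped
            hours, minutes, seconds, spoken = match.groups()
            start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            segments.append((start, spoken))
    return segments


def word_frequencies(segments, stopwords=frozenset()):
    """Count lowercase words across all segments, ignoring any stopwords."""
    counts = Counter()
    for _, spoken in segments:
        words = re.findall(r"[a-z']+", spoken.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


def find_segments(segments, keyword):
    """Return segments whose text contains the keyword (substring match)."""
    keyword = keyword.lower()
    return [(start, spoken) for start, spoken in segments
            if keyword in spoken.lower()]


def format_timecode(total_seconds):
    """Convert seconds back to HH:MM:SS for locating a clip in the footage."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    segments = parse_transcript(SAMPLE_TRANSCRIPT)
    print(word_frequencies(segments, stopwords={"the", "from", "i"}).most_common(3))
    for start, spoken in find_segments(segments, "game"):
        print(format_timecode(start), spoken)
```

Because the keyword search is a plain substring match, it would also catch words such as "games"; a real coding scheme would likely need something more careful, and automated counts are a complement to, not a replacement for, emergent hand coding.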

There are many software workflows that allow researchers to code, notate timecode for analysis, and pre-edit footage. In the study, Opportunities for Innovation: Game-based Learning in an Engineering Senior Design Course, NVivo qualitative analysis software was used together with paper-based analog coding. In a current study, also based on a higher education curriculum intervention, we are digitally coding and pre-trimming the footage in Adobe Prelude in addition to analog coding on the printed transcripts. Both workflows offer advantages. NVivo has built-in tools to create frequency maps and export graphs and charts relevant to qualitative analysis whereas Adobe Prelude adds coding notes directly into the footage metadata and connects directly with Adobe Premiere video editing software, which streamlines the editing process.

From our experience with both workflows, Prelude works better for a research team that has multiple team members with more video experience because it aligns with video industry workflows, implements tools that filmmakers already use, and Adobe Team Projects allows for co-editing and coding from multiple off-site locations. On the other hand, NVivo works better for research teams where members have more separate roles. NVivo is a common qualitative-analysis software so team members more familiar with traditional qualitative research can focus on coding and those more familiar with video editing can edit based on those codes allowing each team member to work within more familiar software workflows.

In both of these studies, assessments regarding storytelling occurred in conjunction with data processing and analysis. As findings were revealed, appropriate clips were grouped into timelines and edited to produce a library of short, topic-driven videos posted online (see Fig. 2). A collection of story-based, topic-driven videos can provide other practitioners and researchers a first-hand account of how a study was designed and conducted, what worked well, recommendations of what to do differently, participant perspectives, study findings, and suggestions for further research. In fact, the videos cover many of the same topics traditionally found in publications, but in a collection of short videos accessible to a broad audience online.

Figure 2. The YouTube channel created for Opportunities for Innovation: Game-based Learning in an Engineering Senior Design Course containing twenty-four short topical videos. Direct link: https://goo.gl/p8CBGG

By sharing the results of the study publicly online, conversations between practitioners and researchers can develop on a public stage. Research videos are easy to share across social media channels which can broaden the academic audience and potentially open doors for future research collaborations. As more journals move to accept multi-media studies, publicly posted videos provide additional ways to expose both academics and the general public to important study results and create easy access to related resources.

Video research as storytelling: The intersection and divergence of documentary filmmaking and video research

“Film and writing are such different modes of communication, filmmaking is not just a way of communicating the same kinds of knowledge that can be conveyed by an anthropological text. It is a way of creating different knowledge” (MacDougall, 2011).

When presenting research, choosing either mode of communication comes with affordances and constraints for the researcher, the participants, and the potential audience.

Many elements of documentary filmmaking, but not all, are relevant and appropriate when applied to gathering data and presenting results in video research. Documentary filmmakers have a specific angle on a story that they want to share with a broad audience. In many cases, they hope to incite action in viewers as a response to the story that unfolds on screen. In order to further their message, documentarians carefully consider the camera shots and interview clips that will convey the story clearly in a similar way to filmmakers in narrative genres. Decisions regarding what to capture and how to use the footage happen throughout the entire filmmaking process: prior to shooting footage (pre-production), while capturing footage (production), and during the editing phase (post-production).

Video researchers can employ many of the same technical skills from documentary filmmaking including interview techniques such as pre-written questions; camera skills such as framing, exposure, and lighting; and editing techniques that help draw a viewer through the storyline (Erickson, 2007; Tobin & Hsueh, 2007). In both documentary filmmaking and in video research, informed decisions are made about what footage to capture and how to employ editing techniques to produce a compelling final video.

Where video research diverges from documentary filmmaking is in how the researcher thinks about, captures, and processes the footage. Video researchers collect video as data in a more exploratory way whereas documentary filmmakers often look to capture preconceived video that will enable them to tell a specific story. For a documentary filmmaker, certain shots and interview responses are immediately discarded as they do not fit the intended narrative. For video researchers, all the video that is captured throughout a study is data and potentially part of the final research narrative. It is during the editing process (post-production) where the distinction between data and narrative becomes clear.

During post-production, video researchers are looking for clips that clearly reflect the emergent storylines seen in the collective data pool rather than the footage necessary to tell a predetermined story. Emergent storylines can be identified in several ways. Researchers look for divergent statements (where an interview subject makes a unique observation different from other interviewees), convergent statements (where many different interviewees respond similarly), and unexpected statements (where something different from what was expected is revealed) (Knoblauch & Tuma, 2011).

When used thoughtfully, video research provides many sources of rich data. Examples include reflections of the experience, in the direct words of participants, that contain insights provided by body language and tone, an immersive glimpse into the research world as it unfolds, and the potential to capture footage throughout the entire research process rather than just during prescribed times. Video research becomes especially powerful when combined with qualitative and quantitative data from other sources because it can help reveal the context surrounding insights discovered during analysis.

We are not suggesting that video researchers should become documentary filmmakers, but researchers can learn from the stylistic approaches employed in documentary filmmaking. Video researchers implementing these tools can leverage the strengths of short-format video as a storytelling device to share findings with a more diverse audience, increase audience understanding and consumption of findings, and encourage a broader conversation around the research findings.

Implications for future work

As the development of digital media technologies continues to progress, we can expect new functionalities far exceeding current tools. These advancements will continue to expand opportunities for creating and sharing stories through video. By considering the role of video from the first stages of designing a study, researchers can employ methods that capitalize on these emerging technologies. Although augmented reality and virtual reality are still rapidly advancing, researchers can look for ways that these technologies could change data analysis and reporting of research findings. Another emergent area is the use of machine learning and artificial intelligence to rapidly process video footage based on automated thematic coding. Continued advancements in this area could enable researchers to quickly quantify data points in large quantities of footage.

In addition to exploring new functionalities, researchers can still use current tools more effectively for capturing data, supporting analysis, and reporting findings. Mobile devices provide ready access to collect periodic video reflections from study participants and even create research vlogs (video blogs) to document and share ongoing studies as they progress. In addition, participant-created videos are rich artifacts for evaluating technical and conceptual knowledge as well as affective responses. Most importantly, as a community, researchers, designers, and documentarians can continue to take strengths from each field to further the reach of important research findings into the public sphere.

In conclusion, current research is focused on using video as a tool for data collection and analysis, but there are new, emerging opportunities for video to play an increased and diversified role in mixed methods research, especially as a tool for the presentation and consumption of findings. By leveraging the storytelling techniques used in documentary filmmaking, while staying true to the analytical methods of research design, researchers can use video to effectively communicate implications of their work to an audience beyond academia and leverage video storytelling to disseminate findings to the public.

Bailey, P. H., & Tilley, S. (2002). Storytelling and the interpretation of meaning in qualitative research. J Adv Nurs, 38(6), 574–583. http://doi.org/10.1046/j.1365-2648.2000.02224.x

Barone, T., & Eisner, E. W. (2012). Arts based research (pp. 1–183). https://doi.org/10.4135/9781452230627

Barron B (2007) Video as a tool to advance understanding of learning and development in peer, family, and other informal learning contexts. Video Research in the Learning Sciences:159–187

Brown AL, Campione JC (1996) Psychological theory and the design of innovative learning environments: on procedures, principles and systems. In: Schauble L, Glaser R (eds) Innovations in learning: new environments for education. Lawrence Erlbaum Associates, Hillsdale, NJ, pp 234–265

Google Scholar  

Chalfen, R. (2011). Looking Two Ways: Mapping the Social Scientific Study of Visual Culture. In E. Margolis & L. Pauwels (Eds.), The Sage handbook of visual research methods . books.google.com

Chattoo, C. B., & Das, A. (2014). Assessing the Social Impact of Issues-Focused Documentaries: Research Methods and Future Considerations Center for Media & Social Impact, 24. Retrieved from https://www.namac.org/wpcontent/uploads/2015/01/assessing_impact_social_issue_documentaries_cmsi.pdf

Collins, A., Joseph, D., & Bielaczyc, K. (2004). Design research: theoretical and methodological issues. Journal of the Learning Sciences, 13(1), 15–42. https://doi.org/ https://doi.org/10.1207/s15327809jls1301_2

Dahlstrom, M. F. (2014). Using narratives and storytelling to communicate science with nonexpert audiences. Proc Natl Acad Sci, 111(Supplement_4), 13614–13620. http://doi.org/10.1073/pnas.1320645111

de Carteret, P. (2008). Storytelling as research praxis, and conversations that enabled it to emerge. Int J Qual Stud Educ, 21(3), 235–249. http://doi.org/10.1080/09518390801998296

de Jager A, Fogarty A, Tewson A (2017) Digital storytelling in research: a systematic review. Qual Rep 22(10):2548–2582

Derry SJ (2007) Video research in classroom and teacher learning (Standardize that!). Video Research in the Learning Sciences:305–320

Erickson F (2007) Ways of seeing video: toward a phenomenology of viewing minimally edited footage. Video Research in the Learning Sciences:145–155

Gallagher, K. M. (2011). In search of a theoretical basis for storytelling in education research: story as method. International Journal of Research and Method in Education, 34(1), 49–61. http://doi.org/10.1080/1743727X.2011.552308

Gliner, J. A., Morgan, G. A., & Leech, N. L. (2009). Research Methods in Applied Settings: An Integrated Approach to Design and Analysis, Second Edition . Taylor & Francis

Goldman R (2007) Video representations and the perspectivity framework: epistemology, ethnography, evaluation, and ethics. Video Research in the Learning Sciences 37:3–37

Goldman S, McDermott R (2007) Staying the course with video analysis Video Research in the Learning Sciences:101–113

Hancox, D. (2017). From subject to collaborator: transmedia storytelling and social research. Convergence, 23(1), 49–60. http://doi.org/10.1177/1354856516675252

Heath, C., Hindmarsh, J., & Luff, P.(2010). Video in Qualitative Research. SAGE Publications. Retrieved from https://market.android.com/details?id=book-MtmViguNi4UC

Heider KG (2009) Ethnographic film: revised edition. University of Texas Press

Henley P (2010) The Adventure of the Real: Jean Rouch and the Craft of Ethnographic Cinema. University of Chicago Press

Jewitt, C. (n.d). An introduction to using video for research - NCRM EPrints Repository. National Centre for Research Methods. Institute for Education, London. Retrieved from http://eprints.ncrm.ac.uk/2259/4/NCRM_workingpaper_0312.pdf

Knoblauch H, Tuma R (2011) Videography: An interpretative approach to video-recorded micro-social interaction. The SAGE Handbook of Visual Research Methods :414–430

Koehler D (2012) Documentary and ethnography: exploring ethical fieldwork models. Elon Journal Undergraduate Research in Communications 3(1):53–59 Retrieved from https://www.elon.edu/docs/e-web/academics/communications/research/vol3no1/EJSpring12_Full.pdf#page=53i

Lawrence-Lightfoot, S. (2005). Reflections on portraiture: a dialogue between art and science. Qualitative Inquiry: QI, 11(1), 3–15. https://doi.org/10.1177/1077800404270955

LeBaron, C., Jarzabkowski, P., Pratt, M. G., & Fetzer, G. (2017). An introduction to video methods in organizational research. Organ Res Methods, 21(2), 109442811774564. http://doi.org/10.1177/1094428117745649

Lenette, C., Cox, L., & Brough, M. (2013). Digital storytelling as a social work tool: learning from ethnographic research with women from refugee backgrounds. Br J Soc Work, 45(3), 988–1005. https://doi.org/10.1093/bjsw/bct184

Lewis, P. J. (2011). Storytelling as research/research as storytelling. Qual Inq, 17(6), 505–510. http://doi.org/10.1177/1077800411409883

(2011) Anthropological filmmaking: An empirical art. In: The sage handbook of visual research methods. MacDougall, D, pp 99–113

Meadows D (2003) Digital storytelling: research-based practice in new media. Visual Com(2):189–193

Miller K, Zhou X (2007) Learning from classroom video: what makes it compelling and what makes it hard. Video Research in the Learning Sciences:321–334

Newbury, D. (2011). Making arguments with images: Visual scholarship and academic publishing. In Eric Margolis & (Ed.), The SAGE Handbook of Visual Research Methods . na

Nichols B (2010) Why are ethical issues central to documentary filmmaking? Introduction to Documentary , Second Edition . In: 42–66

Roth W-M (2007) Epistemic mediation: video data as filters for the objectification of teaching by teachers. In: Goldman R, Pea R, Barron B, Derry SJ (eds) Video research in the learning sciences. Lawrence Erlbaum Ass Mahwah, NJ, pp 367–382

Sossi, D. (2013). Digital Icarus? Academic Knowledge Construction and Multimodal Curriculum Development, 339

Tobin J, Hsueh Y (2007) The poetics and pleasures of video ethnography of education. Video Research in the Learning Sciences:77–92

Using Film in Ethnographic Field Research - Methods@Manchester - The University of Manchester. (n.d.). Retrieved March 12, 2018, from https://www.methods.manchester.ac.uk/themes/ethnographic-methods/ethnographic-field-research/

Walker, E. B. (2016). Opportunities for Innovation: Game-based Learning in an Engineering Senior Design Course (PhD). Clemson University. Retrieved from http://tigerprints.clemson.edu/all_dissertations/1805/

Warmington, P., van Gorp, A., & Grosvenor, I. (2011). Education in motion: uses of documentary film in educational research. Paedagog Hist, 47(4), 457–472. https://doi.org/10.1080/00309230.2011.588239

Zhao, Y., & Frank, K. A. (2003). Factors affecting technology uses in schools: an ecological perspective. Am Educ Res J , 40(4), 807–840. https://doi.org/10.3102/00028312040004807

  • Mixed methods
  • Storytelling
  • Video research

