VersatileGait: A Large-Scale Synthetic Gait Dataset with Fine-Grained Attributes and Complicated Scenarios

Submitted by neurta on Sun, 01/10/2021 - 16:25

arXiv:2101.01394v1 [cs.CV] 5 Jan 2021

1. Introduction

As an important and challenging problem in computer vision, gait recognition [9, 11, 15, 16, 31, 32, 38, 42] aims to identify individuals by their walking patterns, and has a wide range of applications such as visual surveillance [7], security checks [10], and video retrieval [34]. Compared with other biometric recognition approaches, it has the following advantages: 1) it requires no explicit cooperation from the subject; 2) it works at long distances; 3) it is robust to changes in accessories.

Figure 1. The comparison of real and synthetic dataset generation. (a) Real dataset collection: real persons, cameras at a 0° pitch angle, horizontal view angles 0°~180°, sequences with coarse-grained labels. (b) Synthetic data generation: human models, pitch angles 0°~60°, horizontal view angles 0°~180°, sequences with fine-grained labels. VersatileGait differs in the fine-grained attributes and complicated scenarios, which are highlighted in yellow.

In principle, a gait pattern in the wild is usually represented as a sequence of human silhouettes (i.e., binary masks without textures), which vary greatly with respect to many complicated intrinsic and extrinsic factors. Specifically, intrinsic factors usually refer to individual-specific attributes (e.g., genders, walking styles, and complex accessories), which affect the silhouettes intrinsically due to the differences in muscle action and body appearance. Extrinsic factors often correspond to camera settings (e.g., horizontal views and pitch angles), which may lead to silhouette distortion. Therefore, effective gait recognition in the wild requires datasets that cover a wide variety of fine-grained attributes and complicated scenarios.

However, existing datasets are confronted with the following three limitations: 1) simple annotations with only ID labels (without individual attributes); 2) ideal scenarios with a single camera pitch angle, as shown in Fig. 1(a); 3) small scale of data in terms of either the number of individual IDs or the number of silhouette sequences per ID. For example, the CASIA-B dataset is composed of 124 individual IDs and 13632 silhouette sequences with clothing and bag conditions. In the OU-MVLP dataset, each subject only walks twice, without any change of clothing or bag. As a result, these datasets are inadequate for practical gait recognition. There is an urgent need for a large-scale gait dataset with fine-grained attributes and complicated scenarios.

In practice, constructing a real dataset for gait recognition is difficult due to the following factors: 1) data collection is very expensive and time-consuming, e.g., collecting OU-MVLP took over 11 months at a huge cost [39]; 2) individual IDs raise privacy issues; 3) imperfect data processing damages silhouette quality, e.g., irregular cavities and missing body parts. Thus, we turn to building the dataset by computer simulation, which is a cheap and convenient method.

We present a large-scale synthetic dataset called VersatileGait based on virtual data simulation with a game engine, as shown in Fig. 1(b). Specifically, to follow a high-quality realistic standard, our data simulation process is composed of the following steps: 3D model generation, walking animation collection, animation retargeting, complicated scene simulation, and silhouette capture. Such a synthetic dataset is distinguished by: 1) fine-grained annotations thanks to parameterized human models; 2) complicated scenarios for different demands, in the form of customized scenes that closely match real applications; 3) high quality, low costs, and no privacy issues; 4) a small domain gap. Because silhouettes are colorless and textureless, the silhouette sequences generated by virtual data simulation are close to the real ones. This makes synthetic data better suited to gait recognition than to tasks that rely on RGB datasets.
Based on VersatileGait, we improve the performance of existing models by a considerable margin and propose several practical applications. First, VersatileGait can be used to pretrain a deep learning model, which is then fine-tuned on task-specific real datasets. In this case, the SOTA method [15] improves by 1.1% in rank-1 accuracy on CASIA-B. Besides, we can set up a new attribute classification task together with the primary gait recognition task, resulting in a multi-task learning problem. The results of attribute prediction can be used to accelerate gait retrieval. Meanwhile, we evaluate the performance of gait recognition in complicated scenarios (e.g., multi-pitch angles). Last but not least, we provide some potential applications, i.e., multi-person gait recognition and individual-related attributes for disentanglement learning. Therefore, VersatileGait is more suitable for real scenarios, and also advances the gait recognition literature from the research perspective. We will also release our data generation code and datasets to the research community.

On the whole, the main contributions of this work are summarized as follows:
• We propose to develop a large-scale synthetic gait dataset called VersatileGait, which covers fine-grained attributes and complicated scenarios. To our knowledge, VersatileGait is the first synthetic gait dataset for research use and practical applications.
• We introduce new applications for potential research directions and also evaluate the performance of different gait recognition methods in complicated scenarios.
• We conduct extensive experiments to demonstrate the effectiveness of using VersatileGait, which improves the rank-1 accuracy of mainstream methods by a considerable margin.

2. Related Work

2.1. Gait Recognition

Gait recognition is a type of biometric authentication that identifies people by their walking patterns.
Most previous works can be divided into two categories.

Model-Based Methods. This paradigm [3, 6, 8, 19, 27] concentrates on fitting an articulated human body to images and uses 2D body joint features. These methods are robust to covariates such as bags and clothes. However, they rely heavily on high-resolution images and accurate pose estimation results.

Appearance-Based Methods. These methods [9, 15, 17, 24, 25, 26, 29, 48] use the silhouette sequence, gait energy image (GEI) [20], or gait entropy image (GEnI) [4] as inputs and accomplish gait recognition in an end-to-end fashion. These methods are popular due to their flexibility, effectiveness, and conciseness.

2.2. Real Datasets in Gait Recognition

Real-world gait data is usually collected in a tidy laboratory, followed by data processing, e.g., background subtraction. We introduce two mainstream datasets and list the statistics of most of the gait datasets in Table 1.

CASIA-B [36]. This dataset is the most frequently used one. It contains 124 subjects, 3 walking conditions (NM, BG, CL), and 11 horizontal views. Each subject has 6 normal (NM) sequences, 2 walking-with-bag (BG) sequences, and 2 wearing-clothing (CL) sequences.

OU-MVLP [39]. It contains 10307 subjects, and each subject has 14 horizontal views and 2 sequences per view. OU-MVLP is the largest public gait dataset.

Table 1. The comparison of reviewed databases in gait recognition. The VP (viewpoints) of cameras include the horizontal and vertical angles. VersatileGait is distinguished in ID, SQ (sequences), VP (viewpoints), and BG/CL (bag, clothing), and contains rich features such as AT (attributes), PA (multi-pitch angles), and MP (multi-person gait). ID and SQ are given as multiples of the minimum numbers of IDs and sequences, which are 20 and 240. X marks a feature that is present; - marks one that is absent.

Dataset          ID     SQ      VP   BG/CL  AT  PA  MP
CASIA-A [43]     1×     1×      3    -      -   -   -
CMU MoBo [18]    1×     3×      6    -      -   -   -
USF [35]         6×     8×      2    X      -   -   -
CASIA-B [36]     6×     57×     11   X      -   -   -
WOSG [13]        8×     3×      8    -      -   -   -
CASIA-C [12]     8×     6×      1    X      -   -   -
TUM-GAID [21]    15×    14×     1    X      -   -   -
OU-LP [23]       200×   33×     2    -      -   -   -
OU-MVLP [39]     515×   1114×   14   -      -   -   -
VersatileGait    550×   4300×   33   X      X   X   X

Previous datasets are collected under fairly simple scenarios, and image quality problems exist due to data processing, as shown in Fig. 2.

Figure 2. Common image quality problems in real datasets. (a) Irregular cavity. (b) Body missing. (c) Clothing issue.

2.3. Synthetic Dataset

Synthetic datasets have been widely used in various computer vision tasks for three main advantages:

Explore More Applications. 3D models are utilized to generate synthetic data for accurate human depth estimation and human part segmentation [41]. A virtual dataset [14] is collected to tackle multi-person tracking and pose estimation under occlusion.

Mitigate the Lack of Data. A segmentation dataset [22] is proposed to mitigate the lack of occlusion-aware annotation. Likewise, a synthetic logo dataset [37] is presented to enrich the real, manually labeled dataset.

Enhance the Performance on Real Datasets. Works such as [28, 46, 47, 49, 50] employ supervised learning and domain adaptation to achieve superior results. With the help of [44], better performance in crowd counting is achieved. Additionally, pretraining with synthetic data [40] improves the performance of models.

3. The VersatileGait Dataset

The goal of VersatileGait is to introduce a large-scale gait dataset with fine-grained attributes and complicated scenarios to satisfy various research demands and practical applications. This section presents the key challenges of synthetic data collection and details of our data generation pipeline.
Moreover, we summarize the properties of our dataset for further studies.

3.1. Key Challenges

Generating high-quality synthetic data with fine-grained attributes and complicated scenarios raises several challenges:
1) How to represent a walking individual? Generally, a walking individual is represented as appearance (a human with accessories) and action (walking style). We bind a 3D human model to walking animations to simulate the combination of appearance and action.
2) How to select the attributes of individuals? Since silhouettes mainly represent the contours of pedestrians, we select several contour-sensitive attributes such as genders and walking styles.
3) How to simulate the complicated scenario? We define the complicated scenario in terms of the viewpoints (horizontal and vertical) of cameras and the number of pedestrians in the scene. Then, we simulate it with a game engine.
4) How to guarantee the quality of silhouettes? Data processing, such as background subtraction and segmentation, is inevitable when collecting real silhouettes, and it causes poor image quality. To tackle this problem, we use a dark background and white 3D human models without texture to synthesize silhouettes directly, without any processing.

We describe our data generation pipeline as follows.

3.2. Dataset Generation

Our pipeline for generating synthetic data consists of four steps, as illustrated in Fig. 3.

1) 3D Model Generation. To get more realistic human models, we use MakeHuman [5] for model generation. It is a frequently used open-source tool for modeling parameterized 3D characters, which are highly correlated with the attribute parameters (e.g., genders, ages, heights) and strictly constrained by human morphology to make the human body more realistic. With randomly generated MakeHuman parameters, we create a number of models in a unified pose for easy manipulation.
Besides, to reflect attribute distinctions in the silhouettes, we add some artificial constraints, e.g., people of different genders have different preferences for accessories. Finally, we generate 150 realistic 3D human models with a balanced attribute distribution to represent the appearance of walking individuals.

Figure 3. The data generation pipeline of VersatileGait. Stages: 1. 3D Model Generation; 2. Animation Collection; 3. Animation Retargeting & Silhouettes; 4. Scene Simulation. Labels shown in the figure: Female, Male, Muscle, Age, Standard Walk, Step Length, Arm Swing.

2) Walking Animation Collection. Mixamo [1] is an online platform containing numerous built-in skeletal animations, which are commonly used to animate 3D characters. We collect human walking animations from the platform to simulate human walking styles, such as standard walking and brutal walking. Besides, to increase the diversity of our dataset, we adjust the stride length and arm angle of the animations to represent different walking habits. Altogether, we collect 100 human walking animations, which are also treated as a special attribute related to the temporal information of an individual's action.

3) Animation Retargeting. Animation retargeting aims at binding animations to a target human model, which is essential for a realistic and smooth gait sequence. We use the Mecanim animation system in Unity3D [2] to establish a connection between the models' structure and the default humanoid bones. Then, we manually correct the binding errors for the sake of fluent and natural movements, resulting in 11,000 high-quality walking individuals.

4) Scene Simulation & Silhouettes Capture. To satisfy the practical demands of complicated scenarios and get high-quality silhouettes, we design scenarios in Unity3D, which is widely used in game development. Specifically, we use dark skyboxes as the background and adopt 6 orthogonal parallel lights for projection. There are up to 3 models in the scene at the same time.
These textureless human models walk between specified starting and ending points under 33 cameras, which cover 11 horizontal views and 3 vertical views (i.e., pitch angles). These cameras directly capture binary silhouettes without any extra processing to ensure the high quality of VersatileGait, and the annotations are automatically generated through the pipeline. As a result, we generate 72 million frames with a resolution of 280 × 200, grouped into one million synthetic gait sequences of the aforementioned 11,000 subjects.

3.3. Properties of VersatileGait

As a synthetic dataset, VersatileGait has five remarkable properties compared to existing datasets.
1) Small Domain Gap. Synthetic datasets inevitably face the issue of domain gap. As shown in Fig. 4, a severe domain gap exists in person ReID datasets due to color and texture. By contrast, since silhouettes contain no color or texture information, only the contours of individuals, synthetic silhouette datasets introduce a smaller domain gap.
2) Fine-Grained Annotations. VersatileGait contains fine-grained annotations such as genders and walking styles, which can be used for diverse applications.
3) High Quality. The synthetic data generation pipeline produces silhouettes directly, without data processing such as background subtraction or segmentation, which would otherwise cause severe image quality damage.
4) Complicated Scenarios. As far as we know, VersatileGait is the first public dataset that contains multi-pitch angles and the multi-person gait scenario, which are very common in practical gait recognition but absent from existing real datasets.
5) Large Amount. We use the proposed pipeline to synthesize a dataset consisting of 11000 subjects with more than one million sequences, which is the largest gait dataset containing the change of viewpoints.
Besides, we release our data generation toolkit, which users can use to synthesize customized data for their own research demands.

Figure 4. The comparison of synthetic and real datasets in gait recognition and other fields. (a) The comparison of synthetic data (RandPerson images) [45] and real data (Market-1501 images) [51] in person ReID. There is a significant difference between them, due to the texture, hue, etc. (b) The comparison of synthetic data (VersatileGait sequence) and real data (CASIA-B sequence, OU-MVLP sequence) [39, 36] in gait recognition. Synthetic data are more realistic and high-quality thanks to their textureless and colorless properties. This demonstrates the advantages of using synthetic data in the field of gait recognition.

4. Dataset Effectiveness Studies

Before applying VersatileGait to various applications, we conduct several experiments to validate the effectiveness of our dataset in the following steps: 1) we explore the effects of different dataset sizes; 2) we design several settings to evaluate the effects of different mix ratios of real and synthetic data; 3) we analyze the effects of two-stage training strategies.

The Effects of Dataset Size. Larger datasets provide more diverse knowledge. However, since gait recognition is a retrieval task, its inference time increases dramatically as the size of the dataset increases. Therefore, we conduct several experiments to quantitatively explore the performance gain and inference time cost with respect to the size of the dataset. Specifically, we pretrain a GaitSet [9] model on VersatileGait with sizes from 100 to 10000 and evaluate the model on CASIA-B [36] without fine-tuning, keeping the division between the training set and the test set unchanged. As shown in Fig. 5, as the size of the dataset becomes larger, inference time increases dramatically while the performance gain becomes smaller.
Thus, based on this finding, researchers can select a proper dataset size according to their research demands and hardware platform.

Figure 5. Averaged rank-1 accuracies and inference time cost under different amounts of the training dataset. Note that the scale of gallery subjects also increases as the training dataset gets larger.

Figure 6. Results for mix training with VersatileGait and a real dataset. The performance is tested on the corresponding real test dataset. The horizontal axis is in a log scale.

The Effects of the Mix Training. Mix training [30] with real data and synthetic data is an efficient way to improve the performance of deep learning models. To explore the effects of mix training, we design the experiment settings as follows: we keep the training dataset size fixed and use five different mix ratios of real and synthetic data. Specifically, the real data ratio takes the values 2.5%, 5%, 10%, 50%, and 100%, and the remainder is filled with VersatileGait data. Based on these settings, we train GaitSet [9] models to show the performance gains from VersatileGait. More detailed experiment settings are given in the supplementary material. Fig. 6 indicates two phenomena: 1) VersatileGait considerably compensates for the performance decay when real data is severely scarce; 2) using only 50% real data filled up with synthetic data, we achieve performance comparable to the model trained with all real data.

The Effects of Two-Stage Training. Generally, the performance of a model improves considerably if it is pretrained on a diverse dataset. To evaluate the diversity of VersatileGait, we adopt two two-stage training strategies to advance the performance of existing powerful methods [9, 15]: 1) VersatileGait for pretraining; 2) mix the whole real train dataset with VersatileGait for pretraining.
Both strategies are followed by fine-tuning and tested on the corresponding real test dataset. As shown in Table 2, the performance of mainstream methods increases by a considerable margin, so VersatileGait is a practical dataset for pretraining. Further, we find that VersatileGait helps real-world gait recognition from three aspects.

Table 2. Averaged rank-1 accuracies on CASIA-B under different conditions, excluding identical-view cases. The conditions include origin, pretraining the model with VersatileGait (pre+finetune), and mixing the whole real training set with VersatileGait for training (mix+finetune). Gallery: NM#1-4, 0°-180°.

Method    Probe    Condition      0°    18°   36°   54°   72°   90°   108°  126°  144°  162°  180°  Mean
GaitSet   NM#5-6   origin         90.8  97.9  99.4  96.9  93.6  91.7  95.0  97.8  98.9  96.8  85.8  95.0
GaitSet   NM#5-6   pre+finetune   92.8  97.9  99.4  98.6  95.2  93.1  96.2  98.8  98.7  97.9  89.5  96.2
GaitSet   NM#5-6   mix+finetune   93.7  98.6  99.0  98.4  95.1  92.9  96.7  98.7  99.0  98.2  90.0  96.4
GaitSet   BG#1-2   origin         83.8  91.2  91.8  88.8  83.3  81.0  84.1  90.0  92.2  94.4  79.0  87.2
GaitSet   BG#1-2   pre+finetune   87.1  91.4  92.8  90.5  86.0  80.0  84.7  90.4  92.5  92.0  80.9  88.0
GaitSet   BG#1-2   mix+finetune   87.7  91.9  93.3  90.6  84.2  79.0  84.9  92.0  95.4  92.2  81.6  88.5
GaitSet   CL#1-2   origin         61.4  75.4  80.7  77.3  72.1  70.1  71.5  73.5  73.5  68.4  50.0  70.4
GaitSet   CL#1-2   pre+finetune   64.0  72.7  76.7  75.0  68.4  68.9  71.1  73.6  76.3  80.4  60.0  70.6
GaitSet   CL#1-2   mix+finetune   62.1  74.3  78.0  77.2  69.7  67.7  70.4  73.4  74.3  76.9  59.2  70.6
GaitPart  NM#5-6   origin         94.1  98.6  99.3  98.5  94.0  92.3  95.9  98.4  99.2  97.8  90.4  96.2
GaitPart  NM#5-6   pre+finetune   95.5  98.4  99.6  98.6  96.0  92.4  96.1  98.5  99.3  98.1  92.7  96.8
GaitPart  NM#5-6   mix+finetune   95.8  98.7  99.8  98.7  96.3  92.6  96.3  98.6  99.6  98.5  92.9  97.1
GaitPart  BG#1-2   origin         89.1  94.8  96.7  95.1  88.3  94.9  89.0  93.5  96.1  93.8  85.8  91.5
GaitPart  BG#1-2   pre+finetune   91.2  95.4  95.0  94.0  87.6  83.9  89.1  94.3  95.7  94.2  86.9  91.6
GaitPart  BG#1-2   mix+finetune   91.5  95.8  95.1  94.0  87.4  84.3  89.3  94.0  96.2  94.3  87.1  91.8
GaitPart  CL#1-2   origin         70.7  85.5  86.9  83.3  77.1  72.5  76.9  88.2  83.8  80.2  66.5  78.7
GaitPart  CL#1-2   pre+finetune   73.5  85.1  85.3  81.4  78.3  70.6  77.0  81.6  84.2  81.5  68.5  78.9
GaitPart  CL#1-2   mix+finetune   73.3  85.0  85.6  81.5  78.1  71.1  77.1  81.4  84.4  81.2  68.4  78.9

5. Diverse Applications of VersatileGait

Based on VersatileGait, we explore practical applications in two aspects. First, we utilize fine-grained attributes to conduct attribute-guided gait recognition in Section 5.1, which makes the gait recognition system more accurate and the retrieval (the inference stage of the gait model) faster. Second, we focus on the multi-pitch angle scenario in Section 5.2, where we explore the importance of this scenario and the relation between the pitch angles.

5.1. Attribute-Based Applications

Attribute Selection Criterion. The silhouettes contain no texture or color and are only related to the contour of the subjects. Therefore, only attributes that affect the contour can be used in gait recognition. Following this criterion, genders and walking styles are used for further exploration. We utilize these fine-grained attributes to improve gait recognition from two perspectives: accuracy and speed.

More Accurate Gait Recognition. The fine-grained attribute labels can play the role of auxiliary supervision for gait recognition. To simplify the experiment, we use the network of GaitSet [9] with the multi-scale partition module removed as our baseline method. Then we extend the baseline method with simple attribute classifiers, resulting in a multi-task learning framework, which is shown in Fig. 7. For convenience, we let $A$ be the set of attributes and let $L_a$ represent the cross-entropy loss of attribute $a$. Then, our optimization objective can be formulated as Eq. (1):

$$L_{total} = L_{triplet} + \sum_{a \in A} \lambda_a L_a \tag{1}$$

where $\lambda_a$ is the weight on the loss of attribute $a$.

Figure 7. The framework of attribute-guided gait recognition, which follows the multi-task learning [33] architecture. The two attribute classifiers are composed of only two fully connected layers and a ReLU activation. Labels shown in the figure: Feature extractor; Pooling; FC0, FC1, FC2; Attribute Classification (Walking style, Gender) with CE Loss; Label Identification (Person ID) with CE Loss and Triplet Loss; example outputs "A walking", "Young woman", "She is Jane".

The gender and walking style can be discriminated well by our method, which achieves accuracies of 88.8% and 99.63% for attribute prediction, respectively. We report the detailed results of gait recognition in Table 3, which shows that the performance of our baseline model is improved by a large margin under the multi-task learning framework. This shows that the selected attributes significantly benefit the gait recognition task by introducing the supervision of fine-grained labels.

Table 3. Rank-1 accuracy of the baseline method and the attribute-guided method tested on CASIA-B. Gallery: NM#1-4, 0°-180°.

Probe    Method             0°    18°   36°   54°   72°   90°   108°  126°  144°  162°  180°  Mean
NM#5-6   Baseline           46.7  46.8  54.4  59.1  59.8  57.7  55.0  55.2  58.5  56.5  45.8  54.1
NM#5-6   Attribute-guided   80.6  79.2  87.0  90.2  88.4  84.8  81.6  83.9  88.4  89.8  80.6  84.9
BG#01    Baseline           46.2  45.7  52.7  56.2  57.1  52.4  50.6  51.2  55.2  53.7  44.7  51.4
BG#01    Attribute-guided   78.2  77.7  85.4  89.1  87.0  81.8  79.2  81.8  86.4  88.0  79.2  83.1
CL#01    Baseline           31.3  31.3  38.0  42.1  43.4  39.5  39.1  38.9  40.3  39.8  31.5  37.7
CL#01    Attribute-guided   57.3  57.0  66.3  73.1  73.7  69.9  68.3  68.4  70.6  68.7  59.0  66.6

Faster Retrieval. Gait recognition needs to calculate the similarity between the query and the numerous gallery instances in the inference phase. This incurs a heavy computation cost, which slows down inference. To solve this problem, we propose to reduce the search space using the results of attribute prediction. As stated above, attribute prediction achieves high accuracies for gender and walking style. Therefore, we use these two attributes as the criterion to reduce the search space. For simplicity, we use a single attribute for reduction each time.
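The single-attribute filtering used here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `reduce_search_space` and `rank1_match`, the array layout, the Euclidean distance, and the threshold value 0.9 are all assumptions made for the example. The only parts taken from the text are the rules themselves: an attribute counts as reliable when its confidence score is higher than a threshold, a reliable query is matched only against gallery instances with the same reliable attribute value, and an unreliable query falls back to the full gallery.

```python
import numpy as np


def reduce_search_space(query_attr, query_conf, gallery_attr, gallery_conf,
                        threshold=0.9):
    """Return the gallery indices to search for one query.

    query_attr / gallery_attr: predicted attribute labels (e.g. gender ids).
    query_conf / gallery_conf: confidence scores of those predictions.
    threshold: hypothetical reliability cut-off, chosen only for illustration.
    """
    gallery_attr = np.asarray(gallery_attr)
    gallery_conf = np.asarray(gallery_conf)
    all_indices = np.arange(len(gallery_attr))

    # Unreliable query attribute: fall back to the whole gallery.
    if query_conf <= threshold:
        return all_indices

    # Reliable query: keep gallery items whose attribute is also reliable
    # and has the same value as the query's attribute.
    keep = (gallery_conf > threshold) & (gallery_attr == query_attr)
    return all_indices[keep]


def rank1_match(query_feat, gallery_feats, candidate_indices):
    """Index of the nearest candidate gallery feature (Euclidean distance).

    Returns -1 when no candidate is left (an assumption of this sketch).
    """
    candidate_indices = np.asarray(candidate_indices)
    if candidate_indices.size == 0:
        return -1
    candidates = np.asarray(gallery_feats)[candidate_indices]
    distances = np.linalg.norm(candidates - np.asarray(query_feat), axis=1)
    return int(candidate_indices[np.argmin(distances)])
```

In this sketch, a gallery instance whose attribute prediction is unreliable is excluded whenever the query is reliable, so a true match can be filtered out. That is one plausible source of the accuracy drop that accompanies the speed-up.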
We treat the selected attribute of an instance as reliable if its confidence score is higher than the threshold. If the attribute of the query instance is reliable, we search only the gallery instances for which the same attribute is also reliable and takes the same value as in the query instance, which reduces the search space. For example, if a query instance's walking style (brutal walking) is reliable, we choose all the gallery instances that have the same reliable walking style (brutal walking) for matching. Otherwise, the search space includes all the gallery instances. The relationship between the search space scale and the rank-1 accuracy is shown in Fig. 8. With this strategy, we can speed up gait recognition by more than two times while the accuracies decrease by no more than 10% and 4% when guided by walking styles and genders, respectively. This shows that: 1) using attributes as the search restriction, we can reduce the scale of gallery subjects with a tolerable drop in accuracy; 2) there is no definite relationship between accuracy and the scale of the reduced search space. Sometimes, the accuracy can even improve.

Figure 8. Rank-1 accuracy on VersatileGait when using different attributes to accelerate the retrieval. Horizontal axis: reducing ratio of the search space (0% to 80%). Vertical axes: rank-1 accuracy with gender and rank-1 accuracy with walking style (70.0% to 90.0%). Curves: Gender, Walking style.

5.2. Multi-Pitch Angle Gait Recognition

Figure 9. Multi-pitch angle scenario. Three sequences from cameras at different pitch angles (60°, 30°, and 0°), where pedestrians are distorted significantly.

Multi-pitch angles of cameras can cause severe distortion in real gait recognition scenarios but are not considered in previous gait datasets.
To analyze the impact of this kind of distortion, we evaluate the performance of GaitSet on VersatileGait for the cross-pitch angle recognition problem.

The Necessity of Multi-Pitch Angle Data. Existing datasets ignore multi-pitch angle scenarios, which may cause severe geometric distortion of silhouettes. To establish the necessity of multi-pitch angle data, we conduct a pair of comparative experiments using existing methods. As shown in Fig. 10, the averaged cross-pitch angle performance of GaitSet [9] drops significantly if multi-pitch angle data is not provided for training. Therefore, the collection of multi-pitch angle data is essential to gait recognition in the wild.

Figure 10. The results of GaitSet tested on multi-pitch angle data. Horizontal axis: Probe 0°, Probe 30°, Probe 60°. Vertical axis: rank-1 accuracy (0% to 100%). Series: all pitch angles; only 0° pitch angle. Bar values as they appear in the extracted text: 86, 88, 84, 40, 34, 10.

Figure 11. The rank-1 accuracies of GaitSet for cross-pitch angle recognition on VersatileGait. Left: model trained without 60° data; Right: model trained without 30° data.

The Effects of Different Pitch Angles. The results above show that models are not robust enough to the multi-pitch angle scenario. However, different pitch angles may play different roles in gait recognition. Therefore, we conduct two experiments to explore the effect of lacking data from different pitch angles: 1) train the model with the data of pitch angles 0° and 30°, excluding 60°; 2) train the model with the data of pitch angles 0° and 60°, excluding 30°. We test the rank-1 accuracies of GaitSet for cross-pitch angle and cross-view recognition, and then average over the horizontal view dimension for visualization. As shown in Fig. 11, if the training set lacks higher pitch angle data, the performance of the model degrades dramatically. By contrast, if the training set lacks relatively low pitch angle data, the model maintains a relatively stable performance.
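The cross-pitch evaluation described above can be sketched as a rank-1 table indexed by probe pitch and gallery pitch. This is a simplified illustration under stated assumptions, not the authors' protocol: the names `rank1_accuracy` and `cross_pitch_rank1`, the dictionary layout of the inputs, and the Euclidean nearest-neighbour matching are all hypothetical, and the sketch ignores the horizontal view dimension that the paper averages over.

```python
import numpy as np


def rank1_accuracy(probe_feat, probe_ids, gallery_feat, gallery_ids):
    """Fraction of probes whose nearest gallery feature has the same ID."""
    diff = probe_feat[:, None, :] - gallery_feat[None, :, :]
    dist = np.linalg.norm(diff, axis=2)            # shape (n_probe, n_gallery)
    nearest_ids = gallery_ids[np.argmin(dist, axis=1)]
    return float(np.mean(nearest_ids == probe_ids))


def cross_pitch_rank1(probe, gallery, pitches=(0, 30, 60)):
    """Rank-1 table indexed by [probe pitch, gallery pitch].

    probe / gallery: dicts holding 'feat' (n, d), 'ids' (n,) and
    'pitch' (n,) arrays; this layout is an assumption of the sketch.
    Cells with no probe or no gallery samples are left as NaN.
    """
    table = np.full((len(pitches), len(pitches)), np.nan)
    for i, p_pitch in enumerate(pitches):
        for j, g_pitch in enumerate(pitches):
            p_mask = probe["pitch"] == p_pitch
            g_mask = gallery["pitch"] == g_pitch
            if p_mask.any() and g_mask.any():
                table[i, j] = rank1_accuracy(
                    probe["feat"][p_mask], probe["ids"][p_mask],
                    gallery["feat"][g_mask], gallery["ids"][g_mask])
    return table
```

Row `i`, column `j` of the returned table is the rank-1 accuracy for probes at pitch `pitches[i]` matched against a gallery at pitch `pitches[j]`. Extending the masks to (pitch, horizontal view) pairs would give the full 33 × 33 viewpoint comparison.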
The Analysis of Cross-Viewpoint Gait Recognition. In practical scenarios, there are cameras with various viewpoints. Therefore, we take both the 11 horizontal views and the 3 pitch angles into consideration. Concretely, we test the performance of GaitSet on VersatileGait for all possible viewpoint pairs (33 probe viewpoints vs. 33 gallery viewpoints). As shown in Fig. 12, the recognition accuracy decreases significantly on probes at the 60° pitch angle even when the training data is provided. Besides, the cross-horizontal-view problem is more severe when the cross-pitch angle factor is included, especially for views around 0° and 180°.

Figure 12. The averaged rank-1 accuracies of GaitSet on VersatileGait for the 33 kinds of probes from different viewpoints.

Pitch angle   0°    18°   36°   54°   72°   90°   108°  126°  144°  162°  180°
60°           78.0  78.9  83.0  86.4  86.4  84.1  83.2  82.5  84.0  80.3  71.4
30°           84.0  86.0  91.3  93.3  92.3  89.1  87.8  89.0  90.3  90.3  84.6
0°            82.9  86.0  90.7  92.6  91.4  87.5  86.0  87.6  89.5  89.9  85.7

6. Potential Applications and Discussion

Multi-Person Gait Recognition. Existing public datasets mainly focus on the single-pedestrian scenario. However, multi-person walking cases are common in the real world, as shown in Fig. 13. To cover this scenario, we simulate scenes with up to three people at the same time. We will release this multi-person gait dataset for further exploration. This application scenario may point to new directions for future research.

Figure 13. Multi-person gait. Multi-person sequences captured by cameras at 60°, 30°, and 0° pitch angles. The silhouettes are severely occluded, which brings new challenges to gait recognition.

Disentangled Representation Learning. We can combine the same subject with different variables (e.g., accessories and walking styles). With the properties of our dataset, there are two perspectives of disentanglement.
1) Conducting disentanglement learning following the previous methods [26, 48] that decompose a subject into the gait-related feature and accessories-related feature. We can control the change of carriers and generate numerous training pairs; 2) A subject can be further disentangled into the individual-related feature and walking-style related feature. Based on this disentanglement paradigm, we can research the characteristics of the more intrinsic feature. More details can be seen in the supplementary material. 7. Conclusion In this paper, we have proposed a high-quality largescale synthetic gait dataset named VersatileGait rendered by a game engine, which contains much more fine-grained attributes and complicated scenarios than those of existing gait datasets. VersatileGait is composed of around one million silhouette sequences of 11,000 subjects. Based on VersatileGait, we have conducted a variety of learning effectiveness studies to improve the mainstream methods by a considerable margin. Besides, we enrich various applications including attribute guided gait recognition with multitask learning and gait retrieval acceleration by fast attribute filtering. Moreover, we have evaluated gait recognition per8 formance in the new scenario of multi-pitch angles. Extensive experiments have shown the great potential of VersatileGait in both the research community and the industry. References [1] Mixamo. 4 [2] Unity3d., 2020. 4 [3] G. Ariyanto and M. S. Nixon. Model-based 3d gait biometrics. In Int. Joint Conf. Biom., pages 1–7, 2011. 2 [4] K. Bashir, T. Xiang, and S. Gong. Gait recognition using gait entropy image. In Int. Conf. Image Crime Detection Prevention, pages 1–6, 2009. 2 [5] Manuel Bastioni, Simone Re, and Shakti Misra. Ideas and methods for modeling 3d human figures: the principal algorithms used by makehuman and their implementation in a new approach to parametric modeling. In Bangalore Annual Comput. Conf., pages 1–6, 2008. 
[6] Robert Bodor, Andrew Drenner, Duc Fehr, Osama Masoud, and Nikolaos Papanikolopoulos. View-independent human motion classification using image-based reconstruction. Image Vis. Comput., 27(8):1194–1206, 2009.
[7] Imed Bouchrika. A survey of using biometrics for smart visual surveillance: Gait recognition. In Surveillance in Action, pages 3–23. Springer, 2018.
[8] N. V. Boulgouris and Z. X. Chi. Gait recognition based on human body components. In IEEE Int. Conf. Image Process., volume 1, pages I-353–I-356, 2007.
[9] Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. GaitSet: Regarding gait as a set for cross-view gait recognition. In AAAI, 2019.
[10] P. Chattopadhyay, S. Sural, and J. Mukherjee. Frontal gait recognition from incomplete sequences using rgb-d camera. IEEE Trans. Inf. Forensics Secur., 9(11):1843–1856, 2014.
[11] Patrick Connor and Arun Ross. Biometric recognition by gait: A survey of modalities and features. Comput. Vis. Image Underst., 167:1–27, 2018.
[12] Daoliang Tan, Kaiqi Huang, Shiqi Yu, and Tieniu Tan. Efficient night gait recognition based on template matching. In Int. Conf. Pattern Recog., volume 3, pages 1000–1003, 2006.
[13] Brian DeCann, Arun Ross, and Jeremy Dawson. Investigating gait recognition in the short-wave infrared (swir) spectrum: dataset and challenges. In Biom. and Surveillance Technol. for Human and Activity Identification X, volume 8712, page 87120J. International Society for Optics and Photonics, 2013.
[14] Matteo Fabbri, Fabio Lanzi, Simone Calderara, Andrea Palazzi, Roberto Vezzani, and Rita Cucchiara. Learning to detect and track visible and occluded body joints in a virtual world. In Eur. Conf. Comput. Vis., 2018.
[15] Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. Gaitpart: Temporal part-based model for gait recognition. In IEEE Conf. Comput. Vis. Pattern Recog., June 2020.
[16] Davrondzhon Gafurov.
A survey of biometric gait recognition: Approaches, security and challenges. In Annual Norwegian Comput. Sci. Conf., pages 19–21. Annual Norwegian Computer Science Conference Norway, 2007.
[17] M. Goffredo, I. Bouchrika, J. N. Carter, and M. S. Nixon. Self-calibrating view-invariant gait biometrics. Trans. Syst. Man Cybern. Syst., 40(4):997–1008, 2010.
[18] Ralph Gross and Jianbo Shi. The cmu motion of body (mobo) database. Technical Report CMU-RI-TR-01-18, Carnegie Mellon University, Pittsburgh, PA, June 2001.
[19] Guoying Zhao, Guoyi Liu, Hua Li, and M. Pietikainen. 3d gait recognition using multiple cameras. In Int. Conf. Automatic Face Gesture, pages 529–534, 2006.
[20] J. Han and B. Bhanu. Individual recognition using gait energy image. IEEE Trans. Pattern Anal. Mach. Intell., 28(2):316–322, 2006.
[21] Martin Hofmann, J. Geiger, S. Bachmann, B. Schuller, and G. Rigoll. The tum gait from audio, image and depth (gaid) database: Multimodal recognition of subjects and traits. J. Vis. Commun. Image Represent., 25:195–206, 2014.
[22] Yuan-Ting Hu, Hong-Shuo Chen, Kexin Hui, Jia-Bin Huang, and Alexander G. Schwing. Sail-vos: Semantic amodal instance level video object segmentation - a synthetic dataset and baselines. In IEEE Conf. Comput. Vis. Pattern Recog., June 2019.
[23] H. Iwama, M. Okumura, Y. Makihara, and Y. Yagi. The ouisir gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Trans. Inf. Forensics Secur., 7(5):1511–1521, Oct. 2012.
[24] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang. Recognizing gaits across views through correlated motion co-clustering. IEEE Trans. Image Process., 23(2):696–709, 2014.
[25] W. Kusakunniran, Q. Wu, J. Zhang, Y. Ma, and H. Li. A new view-invariant feature for cross-view gait recognition. IEEE Trans. Inf. Forensics Secur., 8(10):1642–1653, 2013.
[26] Xiang Li, Yasushi Makihara, Chi Xu, Yasushi Yagi, and Mingwu Ren.
Gait recognition via semi-supervised disentangled representation learning to identity and covariate features. In IEEE Conf. Comput. Vis. Pattern Recog., June 2020.
[27] Rijun Liao, Shiqi Yu, Weizhi An, and Yongzhen Huang. A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognition, 98:107069, 2020.
[28] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2502–2511, 2019.
[29] Yasushi Makihara, Ryusuke Sagawa, Yasuhiro Mukaigawa, Tomio Echigo, and Yasushi Yagi. Gait recognition using a view transformation model in the frequency domain. In Eur. Conf. Comput. Vis., ECCV’06, pages 151–163, Berlin, Heidelberg, 2006. Springer-Verlag.
[30] Farzan Erlik Nowruzi, Prince Kapoor, D. Kolhatkar, Fahed Al Hassanat, R. Laganiere, and J. Rebut. How much real data do we actually need: Analyzing object detection performance using synthetic and real data. ArXiv, abs/1907.07061, 2019.
[31] M Pushparani and D Sasikala. A survey of gait recognition approaches using pca and ica. Global J. Comput. Sci. Technol., 2012.
[32] I. Rida, N. Almaadeed, and S. Almaadeed. Robust gait recognition: a comprehensive survey. IET Biom., 8(1):14–28, 2019.
[33] Sebastian Ruder. An overview of multi-task learning in deep neural networks. ArXiv, abs/1706.05098, 2017.
[34] Sina Samangooei and Mark S Nixon. Performing content-based retrieval of humans using gait biometrics. Multimed. Tools. Appl., 49(1):195–212, 2010.
[35] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, and K. W. Bowyer. The humanid gait challenge problem: data sets, performance, and analysis. IEEE Trans. Pattern Anal. Mach. Intell., 27(2):162–177, 2005.
[36] Shiqi Yu, Daoliang Tan, and Tieniu Tan. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In Int. Conf.
Pattern Recog., volume 4, pages 441–444, 2006.
[37] Hang Su, Xiatian Zhu, and S. Gong. Deep learning logo detection with data expansion by synthesising context. Winter Conf. Applications Comput. Vis., pages 530–539, 2017.
[38] K Sugandhi, Farha Fatina Wahid, and G Raju. Feature extraction methods for human gait recognition–a survey. In Int. Conf. Advances Comput. Data Sci., pages 377–385. Springer, 2016.
[39] Noriko Takemura, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Trans. Comput. Vis. Appl., 10(1):4, 2018.
[40] J. Tremblay, Aayush Prakash, David Acuna, M. Brophy, V. Jampani, C. Anil, T. To, Eric Cameracci, Shaad Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 1082–10828, 2018.
[41] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In IEEE Conf. Comput. Vis. Pattern Recog., July 2017.
[42] Changsheng Wan, Li Wang, and Vir V Phoha. A survey on gait recognition. ACM Comput. Surv., 51(5):1–35, 2018.
[43] Liang Wang, Huazhong Ning, Weiming Hu, and Tieniu Tan. Gait recognition based on procrustes shape analysis. In IEEE Int. Conf. Image Process., volume 3, pages III–III. IEEE, 2002.
[44] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. Learning from synthetic data for crowd counting in the wild. In IEEE Conf. Comput. Vis. Pattern Recog., June 2019.
[45] Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing Real-World Source Training Data: Random 3D Characters for Generalizable Person Re-Identification. In ACM Int. Conf. Multimedia, 2020.
[46] S. Xiang, Y. Fu, G. You, and T. Liu. Unsupervised domain adaptation through synthesis for person re-identification. In Int. Conf.
Multimedia and Expo, pages 1–6, 2020.
[47] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Int. Conf. Comput. Vis., pages 2039–2049, 2017.
[48] Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Jian Wan, Nanxin Wang, and Xiaoming Liu. Gait recognition via disentangled representation learning. In IEEE Conf. Comput. Vis. Pattern Recog., Long Beach, CA, June 2019.
[49] S. Zhao, B. Li, Xiangyu Yue, Y. Gu, P. Xu, Runbo Hu, Hua Chai, and K. Keutzer. Multi-source domain adaptation for semantic segmentation. In Adv. Neural Inform. Process. Syst., 2019.
[50] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Eur. Conf. Comput. Vis., 2020.
[51] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Int. Conf. Comput. Vis., pages 1116–1124, 2015.
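
Illustrative sketch of the cross-viewpoint protocol. The evaluation behind Fig. 12 (33 probe viewpoints vs. 33 gallery viewpoints, then one averaged rank-1 accuracy per probe viewpoint) might be organized roughly as follows. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `cross_view_rank1`, the integer view code (pitch index × 11 + horizontal index), the Euclidean nearest-neighbour matching, and the toy features are all hypothetical. The text also does not say whether identical-view pairs are excluded from the average, so the sketch averages over all gallery viewpoints.

```python
import numpy as np

def cross_view_rank1(probe_feat, probe_id, probe_view,
                     gallery_feat, gallery_id, gallery_view, num_views):
    """Rank-1 accuracy for every (probe view, gallery view) pair.

    Entry [i, j] is the fraction of probes with view code i whose nearest
    gallery sample (Euclidean distance) among view code j has the same ID.
    Pairs with no samples stay NaN.
    """
    acc = np.full((num_views, num_views), np.nan)
    for i in range(num_views):
        p_mask = probe_view == i
        for j in range(num_views):
            g_mask = gallery_view == j
            if not p_mask.any() or not g_mask.any():
                continue
            p, g = probe_feat[p_mask], gallery_feat[g_mask]
            # Pairwise distances, shape (num_probe, num_gallery).
            dist = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)
            nearest_id = gallery_id[g_mask][dist.argmin(axis=1)]
            acc[i, j] = (nearest_id == probe_id[p_mask]).mean()
    return acc

# Toy data (hypothetical): 4 well-separated subjects, 33 viewpoints each.
rng = np.random.default_rng(0)
num_subjects, num_views, dim = 4, 33, 8
centers = np.eye(num_subjects, dim) * 10.0
ids = np.repeat(np.arange(num_subjects), num_views)
view_codes = np.tile(np.arange(num_views), num_subjects)
gallery = centers[ids] + rng.normal(size=(ids.size, dim)) * 0.01
probe = centers[ids] + rng.normal(size=(ids.size, dim)) * 0.01

acc = cross_view_rank1(probe, ids, view_codes,
                       gallery, ids, view_codes, num_views)
# Average over gallery viewpoints, then arrange as 3 pitch angles x 11 views.
per_probe = np.nanmean(acc, axis=1).reshape(3, 11)
```

With real embeddings, `per_probe` would have the same 3 × 11 layout as Fig. 12, one row per pitch angle and one column per horizontal view.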