International Journal of Geo-Information

Article

Elastic Spatial Query Processing in OpenStack Cloud Computing Environment for Time-Constraint Data Analysis

Wei Huang †, Wen Zhang †, Dongying Zhang and Lingkui Meng *

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; [email protected] (W.H.); [email protected] (W.Z.); [email protected] (D.Z.)
* Correspondence: [email protected]; Tel.: +86-159-2706-6565
† These authors contributed equally to this work.

Academic Editors: Ozgun Akcay and Wolfgang Kainz
Received: 12 January 2017; Accepted: 13 March 2017; Published: 15 March 2017

Abstract: Geospatial big data analysis (GBDA) is extremely significant for time-constraint applications such as disaster response. However, time-constraint analysis is not yet a trivial task in the cloud computing environment. Spatial query processing (SQP) is typically computation-intensive and indispensable for GBDA, and the spatial range query, join query, and nearest neighbor query algorithms are not scalable without using MapReduce-like frameworks. Parallel SQP algorithms (PSQPAs) are also trapped in processing skew, which is a known issue in Geoscience. To satisfy time-constrained GBDA, we propose an elastic SQP approach in this paper. First, Spark is used to implement PSQPAs. Second, Kubernetes-managed CoreOS clusters provide self-healing Docker containers for running Spark clusters in the cloud. Spark-based PSQPAs are submitted to the Docker containers where Spark master instances reside. Finally, the horizontal pod auto-scaler (HPA) scales Docker containers out and in to supply on-demand computing resources. Combined with an auto-scaling group of virtual instances, the HPA helps to find each of the five nearest neighbors for 46,139,532 query objects from 834,158 spatial data objects in less than 300 s. The experiments conducted on an OpenStack cloud demonstrate that auto-scaling containers can satisfy time-constraint GBDA in clouds.

Keywords: elasticity; spatial query processing; Spark; container; Kubernetes; OpenStack

1. Introduction

With the volume of spatial datasets collected from ubiquitous sensors exceeding the capacity of current infrastructures, a new direction called Geospatial big data (GBD) has drawn great attention from academia and industry in recent years [1,2]. Analyzing spatial datasets can be valuable for many societal applications such as transport planning and management, disaster response, and climate change research [3]. However, efficient processing of such datasets is still a challenging task, especially when obtaining timely results is a precondition for emergency responses [4,5]. Regardless of the data source, parallel spatial query processing (SQP) is indispensable for GBD analysis and a must for most spatial databases [6,7]. Since a Graphics Processing Unit (GPU)-based parallel approach requires significant effort in redesigning the relevant algorithms, state-of-the-art research directs parallel SQP (PSQP) toward handling big spatial data in the cloud computing environment [8,9]. For example, Zhong et al. [9] implemented several MapReduce-based spatial query operators for parallel SQP algorithms (PSQPAs). Although MapReduce-based PSQPAs perform well with enhanced scalability, their efficiency is tied to the underlying Hadoop system. Moreover, MapReduce-based algorithms used in Hadoop suffer dense disk Input/Output (I/O) and network communication costs and are thus inappropriate for timely data analysis [10].
To achieve higher performance, You et al. [8] designed a prototype system, SpatialSpark, based on Spark for large-scale spatial join queries in cloud computing. Spark is an advanced computing model (CM) similar to but different from Hadoop MapReduce. By applying transformations to in-memory datasets with the resilient distributed dataset (RDD) abstraction [11], Spark has become the leading in-memory big data analysis paradigm in recent years [12]. The extensible nature of RDDs cultivates useful frameworks like GeoSpark and LocationSpark to efficiently process big spatial data [13,14]. The computing model has shown a potential for high performance GBD computing in a Hadoop cluster composed of 207 nodes on the Amazon EC2 cloud, as reported in [12]. However, geometry objects are multidimensional, and the geometry computation on big spatial datasets would aggressively deplete limited computing resources. Moreover, proprietary cloud computing environments pose a 'vendor lock-in' limitation for private cloud users. Further, even with sophisticated spatial indexes and spatial declustering of the huge datasets, the processing skew caused by the size or density of the geometry objects may introduce long latency [15].

Cloud computing emerged as a paradigm with the promise of provisioning theoretically infinite computing resources for cloud applications. Elasticity, one of the main characteristics of cloud computing, represents the capability to scale the number of resources allocated to applications up and down on demand [16]. Both vertical and horizontal elasticity improve application performance and reduce costs [17]. Vertical elasticity means that the computing resources of a single virtual server, such as memory and virtual Central Processing Units (VCPUs), can be scaled up and down on demand, while horizontal elasticity means the capability to scale virtual resources in and out on demand. The 'pay-as-you-go' paradigm encourages users to reduce costs by offloading the investment in infrastructures. However, it is difficult for cloud users to identify the right amount of virtual resources to use, and cloud service providers often cannot satisfy the Service Level Agreement (SLA) contracts of cloud users [18]. Diversified auto-scaling techniques for elastic applications in clouds have been proposed [19]. However, little geoscience research leverages these auto-scaling technologies in the cloud computing environment [20]. The purpose of this paper is to investigate auto-scaling strategies to support time-constraint spatial query processing in clouds. To the best of our knowledge, only one existing work [20] is closely related to ours. Its authors suggested using auto-scaling groups in OpenStack clouds with Heat for defining the rules for adding or removing virtual machines (VMs) on demand. We focus instead on horizontally auto-scaling containers, which would be more valuable for cloud applications. Compared with hypervisor-based virtualization, container-based virtualization has been considered an alternative for providing isolated environments for applications [21]. Application containers can be regarded as lightweight VMs that share the kernel of the underlying operating system.
Applications running inside traditional hypervisor-based VMs depend on the scheduling capacity of the guest Operating System (OS), which introduces an extra level of abstraction [22]. Horizontal auto-scaling of VMs often takes several minutes to build virtual computing clusters, which may be unsuitable for time-constraint computational tasks. In fact, the efficiency of PSQPAs is not only determined by the volume of data but is also highly relevant to internal parameters and the underlying CMs. To reduce the time costs of PSQPAs, we first introduce the implementation of Spark-based PSQPAs and identify the parameters that impact their efficiency in Section 2. After introducing Docker for container management on a single computing node, we detail Kubernetes-based container management and scheduling for auto-scaling containers across cloud computing nodes. Then, some auto-scaling strategies and a spatial query processing case are given in Section 3. The experiments and results are described in Section 4, followed by a discussion in Section 5. Finally, we conclude and mention our future works in Section 6.

2. Parallel SQPAs Using Spark CM

Let D and Q be a data object collection and a query object collection, respectively. Spatial range query (SRQ) represents searching data objects of D related to query objects of Q within a certain distance. The nearest neighbor query (NNQ) searches the data objects closest to a given query object. The query 'find the nearest hotels close to a given space' is an example of NNQ, where the hotels are the data objects and the space is the query object. Spatial join query (SJQ) combines pairs of objects from the two datasets based on their spatial relationships [23]. Before identifying the parameters impacting the efficiency of Spark-based PSQPAs, we briefly introduce their implementation and execution.

2.1. PSQPAs Using Spark CM

Figure 1 shows the execution of a Spark-based SRQ algorithm (SSRQA). First, spatial data objects are partitioned and stored in distributed nodes. Then, the query objects are broadcast to the nodes where the partitions reside. In these nodes, each query object is used to match the data objects by performing a spatial range predicate computation. We employ GeoSpark, a software package that provides basic geometrical operations, for processing large scale spatial data in this work.
BitTorrent-enabled broadcasting is the main approach for the Spark master to share immutable objects with Spark workers [10]. Sharing data whose volume does not exceed the Java Virtual Machine (JVM) heap size of the Spark master instance is efficient. Implementation of a simple SSRQA includes only one mapping transformation.

Figure 1. The execution flow of Spark-based spatial range query (SRQ) algorithms. The partitioned spatial data objects are spatial resilient distributed datasets (RDDs). Spark executors on Spark worker nodes would do spatial range predicate computation after the mapping transformation tasks are received from the Spark Master instance.

A Spark-based NNQ algorithm (SNNQA) executes as follows. If we suppose that the query objects are small and that the data objects are huge, the query objects can be broadcast to the computing nodes where the partitioned data objects reside. In these nodes, Spark workers execute spatial predicate computations using local spatial data objects. Suppose instead that the query objects are too large to be shared, whereas the data objects are relatively small. We can then spatially partition the data objects into grids and, if necessary, build an index for them [24]. The grids can then be broadcast to the computing nodes where the query objects reside. If both the data objects and the query objects are too large to be shared, it is still possible to divide the query objects into chunks and submit the same algorithms with the different chunks as their input. Implementation of a simple SNNQA includes only one mapping transformation, as in the sketch below. Since the execution flow of a SNNQA is fairly similar to that of a SSRQA, we omit the schematic diagram here in order to save space.
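To make the broadcast-based execution concrete, the following is a minimal sketch of an SNNQA in plain Spark with JTS geometries (the geometry library GeoSpark builds on). It assumes the first case above (small query set, large data set); the file path, the CSV layout, and the per-partition top-k merging are illustrative assumptions of ours, not the actual GeoSpark-based implementation used in this paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory, Point}

// Minimal sketch of the broadcast-based SNNQA: the (small) query set is
// broadcast to every partition of the (large) data-object RDD, and each
// executor computes distances against its local partition only.
object SnnqaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snnqa-sketch"))
    val gf = new GeometryFactory()

    // Data objects: (id, point), partitioned across the cluster.
    // Layout "id,x,y" is an assumption for this sketch.
    val dataObjects = sc.textFile("hdfs:///data/objects.csv").map { line =>
      val f = line.split(",")
      (f(0), gf.createPoint(new Coordinate(f(1).toDouble, f(2).toDouble)))
    }

    // Query objects are assumed small enough to fit in each executor's
    // JVM heap, so they can be broadcast.
    val queries: Array[(String, Point)] = Array(
      ("q1", gf.createPoint(new Coordinate(-73.99, 40.73))))
    val bQueries = sc.broadcast(queries)
    val k = 5

    // Each partition ranks its local data objects per query; the
    // per-partition candidates are then reduced to a global top-k.
    val localTopK = dataObjects.mapPartitions { it =>
      val part = it.toArray
      bQueries.value.iterator.map { case (qid, q) =>
        (qid, part.map { case (id, p) => (id, q.distance(p)) }
                  .sortBy(_._2).take(k))
      }
    }
    val globalTopK = localTopK.reduceByKey { (a, b) =>
      (a ++ b).sortBy(_._2).take(k)
    }
    globalTopK.collect().foreach(println)
    sc.stop()
  }
}
```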
Figure 2 shows a Spark-based SJQ algorithm (SSJQA) that executes as follows. After loading the datasets in computing nodes, the Spark engine partitions them into grids according to their approximate spatial location, such as the minimum bounding rectilinear rectangle (MBR). Next, the Spark engine joins the datasets according to their grid IDs [25]. For those spatial objects that have the same grid ID, the spatial overlap predicate computation is evaluated in the same computing nodes. If two objects overlap, they are kept in the final results. Finally, the results are grouped by rectangles, with the duplicate objects removed. The implementation of a SSJQA may include several join, mapping, and filtering transformations, as sketched below.

Figure 2. The execution flow of Spark-based spatial join query (SJQ) algorithms. Data and query objects are partitioned and stored in grids according to their spatial location. Spark executors on Spark worker nodes would do spatial overlap predicate computation after the joining transformation tasks are received from the Spark Master instance.
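The following is a minimal sketch of this grid-partitioned join, again in plain Spark with JTS. The fixed uniform grid and the gridIdsFor helper are illustrative assumptions; GeoSpark's actual partitioner and join operators differ. Objects whose MBRs cross cell borders are replicated into every cell they touch, which is why the final distinct() step is needed.

```scala
import org.apache.spark.rdd.RDD
import com.vividsolutions.jts.geom.{Envelope, Geometry}

// Minimal sketch of the grid-partitioned spatial join (SSJQA).
object SsjqaSketch {
  // Hypothetical helper: IDs of all uniform grid cells overlapped by an MBR.
  def gridIdsFor(mbr: Envelope, cols: Int, rows: Int,
                 world: Envelope): Seq[Int] = {
    val cw = world.getWidth / cols
    val ch = world.getHeight / rows
    val c0 = math.max(0, ((mbr.getMinX - world.getMinX) / cw).toInt)
    val c1 = math.min(cols - 1, ((mbr.getMaxX - world.getMinX) / cw).toInt)
    val r0 = math.max(0, ((mbr.getMinY - world.getMinY) / ch).toInt)
    val r1 = math.min(rows - 1, ((mbr.getMaxY - world.getMinY) / ch).toInt)
    for (r <- r0 to r1; c <- c0 to c1) yield r * cols + c
  }

  def join(left: RDD[(String, Geometry)], right: RDD[(String, Geometry)],
           cols: Int, rows: Int, world: Envelope): RDD[(String, String)] = {
    // Key both datasets by the grid cells their MBRs overlap.
    val l = left.flatMap { case (id, g) =>
      gridIdsFor(g.getEnvelopeInternal, cols, rows, world).map(gid => (gid, (id, g))) }
    val r = right.flatMap { case (id, g) =>
      gridIdsFor(g.getEnvelopeInternal, cols, rows, world).map(gid => (gid, (id, g))) }
    // Join on grid ID, evaluate the exact overlap predicate per cell,
    // and drop the duplicate pairs produced by border replication.
    l.join(r)
     .filter { case (_, ((_, g1), (_, g2))) => g1.intersects(g2) }
     .map    { case (_, ((id1, _), (id2, _))) => (id1, id2) }
     .distinct()
  }
}
```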
2.2. Identifying Factors Impacting the Efficiency of Spark-Based PSQPAs

Since the practical method to efficiently query against big spatial data is to employ the divide and conquer strategy [9,26], most MapReduce-based PSQPAs use certain types of space filling curves, such as the Hilbert space-filling curve, to map MBRs to grids based on the spatial correlation for optimizing efficiency [27,28]. We simply treat the number of grids p as one of the internal parameters of Spark-based PSQPAs. We focus on SNNQAs exclusively, since a SSRQA can be seen as a simple case of SNNQAs whose spatial range predicates can be wrapped into a single mapping transformation like that of SNNQAs. SSJQAs are more complex than SNNQAs. The basic units composing this complexity are multiway joins. Although joins are unavoidable and time-consuming, the projection operation mapping spatially correlated datasets into the same grids is commonly used in the filter and refinement stages [28]. From the viewpoint of the Spark CM, the number of grids determines the number of tasks that should be executed, which can directly impact the efficiency of the PSQPAs. The projection operation also determines the cost of networking communication. Since building cost models for networking communication is complex, we identify the number of grids as the internal parameter of PSQPAs. Our purpose is to evaluate the relationship between the number of grids and the execution time of the PSQPAs, which may be useful for self-adaptive systems. In this paper, we exclusively use the Spark standalone mode because it is easy to configure a Spark cluster without using other frameworks in a cloud computing environment. Moreover, the standalone mode demonstrates good reliability, as demonstrated in our earlier work. The parameters of the Spark standalone mode include driver-cores, driver-memory, executor-memory, total-executor-cores, and executor-cores. The first two parameters are application-related and defined by users to set the resource requirement for the Spark master instance. The last three parameters are used to set the resource requirement for each of the Spark worker instances. Since the value of total-executor-cores is an expectation of users who handle large amounts of spatial data, it is considered another impactor.
Further, Spark workers running on containers compete with each other for the available resources. Therefore, we do not consider executor-memory and executor-cores as parameters. The other possible parameter of SNNQAs would be the k parameter, which specifies the desired number of nearest neighbors for each query object. The sketch below illustrates how these submission parameters map onto Spark configuration properties.
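As an illustration only, a standalone-mode submission with the parameters discussed above can be expressed through Spark configuration properties; the specific values are arbitrary examples, and in practice driver resources are usually passed on the spark-submit command line rather than set inside the application.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Example mapping of the standalone-mode submission parameters onto Spark
// configuration properties (values are illustrative only).
val conf = new SparkConf()
  .setMaster("spark://spark-master:7077") // standalone Spark master URL (assumed host name)
  .setAppName("snnqa")
  .set("spark.driver.cores", "1")     // driver-cores
  .set("spark.driver.memory", "2g")   // driver-memory (normally set via spark-submit)
  .set("spark.executor.memory", "2g") // executor-memory
  .set("spark.executor.cores", "1")   // executor-cores
  .set("spark.cores.max", "6")        // total-executor-cores across the cluster
val sc = new SparkContext(conf)
```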
3. Horizontal Auto-Scaling Containers for Elastic Spatial Query Processing

3.1. Docker with Kubernetes Orchestration for Clustering Containers

Our testing system is based on Docker, which is an open source project for automating the deployment of containerized applications in a Linux environment [29]. Docker is a framework that manages the lifecycle of application containers. An application and its dependencies are encompassed into an image. A Docker image can be built from a basic image and other existing images. The basic image is a minimum operating system such as the Community Enterprise Operating System (CentOS), RancherOS, or CoreOS. Applications and a basic image are combined into a single layer-wise image. Layer-by-layer storage facilitates sharing and updating application components at minimum cost. From Docker's point of view, a container is the basic management unit in which an application runs. Docker containers share the basic operating system with other containers rather than use another copy. Figure 3 shows how Docker clients would instruct a Docker host to launch containers for running applications with specified images. The Docker image registry is the dedicated gallery for storing the images. We use a private image registry to store the Spark-master and Spark-worker images in our tests.

Figure 3. Docker framework for lifecycle management of containers in a host. Docker images are stored in Docker image registries. Each host has a Docker engine that deploys and controls these containers as requested by Docker clients.

Docker facilitates application executions in a single host. To fabricate computing clusters using containers, higher level tools such as Docker Swarm and Kubernetes are necessary [30,31]. Docker Swarm is a Docker-native container orchestration system for cluster computing. Since it is simple and flexible in deployment, it has been widely used in OpenStack clouds for data analysis [32]. However, containers on different hosts cannot interact with each other without using the currently unstable Docker overlay network. Moreover, at the time of this writing, many advanced functionalities such as self-healing and auto-scaling are not supported by Docker Swarm. Kubernetes is another open source project. It was released in June 2014 and originates from the requirement for managing billions of containers in the Google infrastructure. Kubernetes, which is more complex than Docker Swarm, builds on the cluster management experience Google has accumulated over a decade. The complexities cultivate numerous benefits for containerized applications such as high availability, self-healing, and auto-scaling containers with fine-grained cluster monitoring. Figure 4 shows how Kubernetes adopts a master-slave architecture to manage containers in minions. Kubelet and Kube-proxy, running on each minion, are the components for disseminating changing workloads to pods and routing service-oriented access to pods, respectively. Here, each pod is a logical group of containers that run the same application. The Application Programming Interface (API) server is the main services gallery and an access point for Kubectl (a command line interface for running commands against Kubernetes clusters) to query and define cluster workloads. The Kube-scheduler and Kube-controller-manager work with the API server to schedule workloads in the form of pods and to ensure that the specified pod replicas are running on the minions, respectively. For reasons of simplicity, they are not shown in Figure 4.
Figure 4. Kubernetes components for orchestrating containers distributed in minion nodes.

3.2. Horizontal Auto-Scaling Containers for Elastic Spatial Query Processing

To explore elastic spatial query processing in a cloud computing environment, we assume that there are C cores, I maximum instances, and M gigabytes of memory leased for a tenant. Figure 5 shows the computing resources, which are used as follows. The left part consists of a stable cluster, in which the Kubernetes Master manages containers on a fixed number of minion nodes. A horizontal pod auto-scaler (HPA) is a plugin for Kubernetes that automatically increases and/or decreases the number of pods across the minions. To build a robust Spark cluster on top of Kubernetes, users can request Kubernetes to create a replication controller that ensures one, and only one, pod runs the Spark-master image. Kubernetes allows users to explicitly define resource quotas for pods. Therefore, a pod that has sufficient memory and cores would be exclusively used by the Spark-master instance. Then, the HPA would create an auto-scaling group of containers for the Spark-worker instances based on some auto-scaling strategies. Finally, the Spark-workers connect to the Spark-master instance by using a Kubernetes service, fabricating an elastic Spark cluster.
Kubernetes services are logical groups of pods; they are used for routing the incoming traffic to the background pods. Three parameters are mainly used to define the auto-scaling strategies. The minReplicas and maxReplicas parameters declare the minimum and maximum number of pods allowed, respectively. Kubernetes compares the arithmetic mean of the pods' CPU utilization with the target value, as defined by the target percentage parameter, and, if necessary, adjusts the number of pods, as sketched below.
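A minimal rendering of that scaling rule, assuming the documented HPA behavior of scaling proportionally to the ratio of observed mean CPU utilization to the target; the function and parameter names here are ours, not Kubernetes API identifiers.

```scala
// Sketch of the horizontal pod auto-scaler's control loop: the desired
// replica count is the current count scaled by the ratio of the observed
// mean CPU utilization to the target, clamped to [minReplicas, maxReplicas].
object HpaSketch {
  def desiredReplicas(current: Int, meanCpuUtilization: Double,
                      targetUtilization: Double,
                      minReplicas: Int, maxReplicas: Int): Int = {
    val raw = math.ceil(current * meanCpuUtilization / targetUtilization).toInt
    math.max(minReplicas, math.min(maxReplicas, raw))
  }

  def main(args: Array[String]): Unit = {
    // Example: 4 Spark-worker pods at 90% mean CPU against a 50% target
    // scale out to ceil(4 * 90 / 50) = 8 pods.
    println(desiredReplicas(current = 4, meanCpuUtilization = 90,
                            targetUtilization = 50,
                            minReplicas = 2, maxReplicas = 10)) // 8
  }
}
```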
The right part includes an auto-scaling group, in which virtual machines (VMs) scale in and scale out according to the policies defined in the Heat templates. The minimum and maximum size properties of an auto-scaling group are used to define the threshold of instances allowed for cloud users. To horizontally scale Docker containers in an auto-scaling group, the preliminary requirement is that the VMs can automatically join the Kubernetes cluster. Fortunately, the cloud-config files of OpenStack can be used in this context [33]. The loading of the kube-proxy and kubelet services on the minion nodes can be done automatically on the boot of the VMs.

This proposed empirical strategy divides the computing resources into two parts and provides a way to distribute the containers over all the nodes. By separating the Spark-master and Spark-workers into containers and leveraging the HPA to distribute the workload, we can build self-healing Spark clusters to process huge volumes of spatial datasets. The experiments and results are presented in Section 4.

Figure 5. Horizontal auto-scaling containers in OpenStack Clouds for elastic spatial query processing of big spatial data. The left part is a Kubernetes cluster, and the right part is an auto-scaling group, in which kube-proxy and kubelet daemons would run on relevant virtual machines.

4. Experiments and Results

4.1. Spatial Datasets for a Spatial Query Processing Case

In this paper, we use spatial datasets that help us better understand patterns of human mobility in urban areas. In order to perform this analysis, which is fundamental for city planners, trajectory data are collected from inexpensive GPS-equipped devices, such as taxicabs. Taxi trip data is widely used for mining traffic flow patterns, as described in [34,35]. For example, the NYC Open Data Plan has built a portal for publishing its digital public data for city-wide aggregation, as required by local law. In addition, the New York City Taxi and Limousine Commission has published, since 2009, taxi trip data sets to the portal. In 2015, a total of 11,534,883 rows of green taxi trip data were recorded. The records include fields capturing pick-up and drop-off dates, times, locations, distances, payment types, and driver-reported passenger counts. Since the pick-up and drop-off locations are just pairs of geographical coordinates, additional spatial data sets that include the land use type (LUT) are required. We collected a data set named NYC MapPluto Tax Lot from the NYC Department of City Planning. The fields in the data set include LandUse, XCoord, and YCoord, which describe the land use categories, such as Family Building, Commercial and Office Buildings, and Industrial and Manufacturing, and the approximate location of the lots. The Tax Lot dataset contains 834,158 records. A taxi customer's pick-up location near a Family Building and drop-off location near a Commercial and Office Building at rush hour indicates that the person's taxi request may be work related. We explore these data sets with the previously mentioned SNNQAs.
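A minimal sketch of how the two datasets could be turned into point RDDs for the SNNQA. The file paths and column positions are assumptions about the CSV exports (the green-taxi pick-up longitude/latitude are assumed to sit at indices 5 and 6, and the tax-lot LandUse, XCoord, YCoord at indices 0-2); they are not a specification of the actual files.

```scala
import org.apache.spark.SparkContext
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}

// Sketch of preparing the two datasets for the SNNQA; column layouts
// below are assumptions about the CSV exports, not a specification.
def loadDatasets(sc: SparkContext) = {
  val gf = new GeometryFactory()

  // 46,139,532 query objects: pick-up (and drop-off) points of taxi trips.
  val pickups = sc.textFile("hdfs:///nyc/green_tripdata_2015.csv")
    .map(_.split(","))
    .filter(f => f.length > 6 && f(5).nonEmpty && f(6).nonEmpty)
    .map(f => gf.createPoint(new Coordinate(f(5).toDouble, f(6).toDouble)))

  // 834,158 data objects: tax lots with their land-use category.
  val taxLots = sc.textFile("hdfs:///nyc/mappluto_taxlots.csv")
    .map(_.split(","))
    .map(f => (f(0) /* LandUse */,
               gf.createPoint(new Coordinate(f(1).toDouble, f(2).toDouble))))

  (pickups, taxLots)
}
```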
Readers can find the datasets at the following links: https://data.cityofnewyork.us/Transportation/2015-Green-Taxi-Trip-Data/n4kn-dy2y and http://www1.nyc.gov/site/planning/data-maps/open-data.page (last access date: 2017/2/8).

4.2. Cloud Computing Environment

There are six machines used for setting up our OpenStack cloud. The Liberty version of OpenStack, released to the public in October 2015, was chosen. As shown in Table 1, the controller node is a commodity PC equipped with 1 processor, 4 cores, 4 GB memory, and 500 GB disks, on which the Keystone identity service, the Telemetry service, the Neutron networking service, and the Glance service run. The Neutron Linux bridge agents running on the computing nodes are connected with each other to fabricate networking for virtual instances. We use four Dell PowerEdge M610 servers to build the computing node cluster, with one of them having two physical CPUs, 24 cores, 92 GB of main memory, and a 160 GB disk, and three of them having two physical CPUs, 24 cores, 48 GB of main memory, and a 500 GB disk. The block storage node is an HP Z220 workstation configured with two physical CPUs, 8 cores, 32 GB of main memory, and 3 TB disks. The Cinder services running on the HP workstation use the Network File System (NFS) to provide virtual instances with storage. To simulate a complex network infrastructure, we let the computing nodes traverse two different subnets. 1000 Mbps networking devices are used to bridge the two subnets. All nodes in the cloud computing environment adopt the Ubuntu 14.04.4 LTS 64-bit operating system. All required packages that support the OpenStack cloud are installed with their native package managers. More specifically, RabbitMQ 3.5.4 is used for coordinating operations and status information among services. MariaDB 5.5.52 is used for services to store data. The libvirt KVM and Linux Bridge are used to provide platform virtualization and networking for virtual instances. We exclusively use the vxlan mechanism to provide networking for virtual instances. For convenience, we abbreviate the auto-scaling VM instances and the auto-scaling Docker containers as ASVI and ASVC, respectively.
Table 1. Specification of the OpenStack cloud computing environment.

| Node | Cloud Actors | Specification | Services |
|---|---|---|---|
| 192.168.203.135 | Controller | 4 cores, 4 GB memory and 500 GB disks | Nova-cert, Nova-consoleauth, Nova-scheduler, Nova-conductor, Cinder-scheduler, Neutron-metadata-agent, Neutron-linuxbridge-agent, Neutron-l3-agent, Neutron-dhcp-agent, Heat-engine |
| 192.168.203.16 | Compute | 24 cores, 92 GB memory and 160 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.97 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.109 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.111 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.203.16 | Block Storage | 8 cores, 32 GB memory and 3 TB disks | Cinder-volume |

4.3. Results

4.3.1. Comparison of SNNQAs Using ASVI and ASVC

To test the efficiency of SQP using the ASVI and the ASVC, we construct two clusters using two OpenStack Heat templates. As shown in Table 2, the only difference between the clusters is the image and software used. The two clusters are comparable in this context, since the computing nodes on which instances are deployed can be exactly specified by using the OS::Nova::Server availability-zone property in Heat templates. In the Ubuntu cluster, the node having 4 VCPUs and 8 GB of memory hosts the Spark master instance, while the other four instances host the Spark worker instances. In the CoreOS cluster, the nodes holding each type of Spark component are determined at runtime. More specifically, the software packages and their versions are listed in Table 3.

Table 2. Two comparable clusters in the OpenStack cloud.

| Virtual Cluster | Specification | Images and Software |
|---|---|---|
| Ubuntu_cluster | 1 instance with 4 VCPUs and 8 GB RAM, and 4 instances each with 2 VCPUs and 4 GB RAM | Ubuntu 14.04.5 trusty image with Spark 1.5.2, Java(TM) SE Runtime Environment (build 1.7.0_45-b18), Scala 2.10.4 |
| CoreOS_cluster | 1 instance with 4 VCPUs and 8 GB RAM, and 4 instances each with 2 VCPUs and 4 GB RAM | CoreOS 1185.3.0 image with Kubernetes 1.3.4, Heapster 1.1.0 and Docker 1.11.2 |

Table 3. The summary table of software used in the following tests.

| Software | Version |
|---|---|
| Docker | 1.11.2 |
| CoreOS | 1185.3.0 |
| Spark | 1.5.2 |
| Scala | 2.10.4 |
| JDK | 1.7.0_45-b18 |
| Heapster | 1.1.0 |
| Kubernetes | 1.3.4 |

Assuming the clusters' launching time is negligible, we first test the efficiency of the SNNQAs on the two clusters. Since it is necessary to specify the Ubuntu image in advance, the number of Spark workers running on each virtual instance is the same when using the ASVI. As shown in Figure 6, the efficiency of the SNNQAs using the two auto-scaling strategies on the OpenStack cloud is very different. When p = 3000 and k = 5, there are 3000 grids in which to store the spatial data objects, and each query object queries the five nearest spatial data objects in the grids. The average execution time is recorded when each SNNQA submission is executed 10 times with the total-executor-cores parameter set to six. We observe that increasing the number of Spark workers does not guarantee better efficiency of the algorithms. For example, when there are 12 Spark workers, the average execution time of the SNNQA using the ASVI strategy is 235.84 s.
Doubling the number of Spark workers to 24 raises the average execution time of the SNNQA using the ASVI strategy (SNNQA-ASVI) to nearly 400 s, and when the number of Spark workers is increased to 32, the average execution time of the SNNQA-ASVI falls to 281.18 s. Then, when we increase the number of Spark workers to 36, the average time cost of the SNNQA-ASVI rises to 420 s. The high variability in SNNQA-ASVI efficiency is possibly due to two factors: (1) the performance disturbance of the virtual instances, and (2) the uncertainty in the number of Spark workers, which introduces variations in the SNNQA using the ASVI. We notice that Spark workers are quite often lost in the tests. Data-intensive computational tasks executed on each Spark worker require too many memory resources, which may disrupt the connection between the Spark workers and the Spark master instance. Further, there are no self-healing mechanisms for Spark workers in this context.
When fewer than 28 Docker containers are given to the algorithm, the execution time is always less than 270 s. As the results in Figure 6 show, when given 24 Docker containers, the average execution time is 269.91 s. When the number of Docker containers is limited to 20, the average execution time falls to 250.31 s. The time cost of the SNNQAs, under the two strategies, grows rapidly when the number of Spark workers is more than 32. These results indicate that the SNNQAs using the ASVC could be robust if the Docker container number is smaller than 32.

Figure 6. Comparison of Spark-based K Nearest Neighbor (KNN) algorithms using auto-scaling virtual machine (VM) instances (ASVI) and auto-scaling Docker containers (ASVC) strategies.

To investigate the effect of indexing on the SNNQAs, we conduct the second experiment as follows. When using the ASVI and 20 Spark workers, the Spark-based KNN algorithm using the Sort-Tile-Recursive tree (STRtree) index (SKNN-STRtree) completes in about 297.92 s, as shown in Figure 7. The Spark-based KNN algorithms without the STRtree index complete in about 284.89 s, as shown in Figure 6. These results indicate that using the index does not reduce the execution time of the SNNQAs using the ASVI strategies. The reason may be that the number of data objects in the grids is not large; therefore, indexing may not be necessary in this context. The grid number determines the average number of objects that are processed by each Spark worker. When the grid number is small, the number of objects is large; each Spark worker consumes more memory and spends more time executing the spatial predicate calculation. When the grid number is large, the number of objects is small. When using the ASVC, the Spark-based KNN algorithms using the STRtree index complete in less than 300 s if working with 20 to 28 containers. These results indicate that the SNNQAs with the STRtree index using the ASVC could be robust if the container number is smaller than 28, and when giving more than 28 containers to the Spark workers, the costs of the SKNN-STRtree obviously increase.
These results indicate that the CoreOS cluster may be too small for Kubernetes to auto-scale many containers. We also notice that the efficiency of the SKNN-STRtree varies when using the ASVI strategy. These similar results indicate that, regardless of whether or not indexing is used, the efficiency of the SNNQAs using the ASVI strategy is highly variable in clouds. The number of Spark worker daemons running on virtual instances varies, and the Spark daemons residing on each virtual instance are prone to be out of service for many reasons. Data-intensive tasks consume much of the memory of a Spark worker instance, which may lead to an 'out of memory' error. Additionally, networking in the cloud is typically unstable, and the networking between the Spark workers and the Spark master can sometimes be broken. We observe that the Spark worker daemons frequently fail during the tests. Therefore, we focus our remaining research in this paper exclusively on the ASVC strategies.
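For reference, the per-grid indexing used by the SKNN-STRtree variant can be sketched as follows with the JTS STRtree. The fixed search radius and the envelope-probe-then-rank strategy are simplifying assumptions of ours; a production version would widen the probe until at least k candidates are found.

```scala
import com.vividsolutions.jts.geom.{Envelope, Point}
import com.vividsolutions.jts.index.strtree.STRtree
import scala.collection.JavaConverters._

// Sketch of the per-grid STRtree variant (SKNN-STRtree): each partition
// bulk-loads its local data objects into an STRtree, then answers a kNN
// query by probing an expanded envelope around the query point and ranking
// the candidates by exact distance. The radius is an assumed constant.
def kNearestWithIndex(local: Array[(String, Point)], q: Point,
                      k: Int, radius: Double): Array[(String, Double)] = {
  val tree = new STRtree()
  local.foreach { case (id, p) => tree.insert(p.getEnvelopeInternal, (id, p)) }

  val env = new Envelope(q.getCoordinate)
  env.expandBy(radius)
  tree.query(env).asScala
    .map(_.asInstanceOf[(String, Point)])
    .map { case (id, p) => (id, q.distance(p)) }
    .toArray.sortBy(_._2).take(k)
}
```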
Figure 7. Comparison of Spark-based K Nearest Neighbor Sort-Tile-Recursive tree (KNN-STRtree) algorithms using ASVI and ASVC strategies.

4.3.2. SNNQAs Using ASVC

To understand how the parameter p impacts the efficiency of the SNNQAs, we conduct the third experiment as follows. Each submission of the SNNQA using the ASVC is executed 10 times with the total-executor-cores parameter set to six, and the average execution time is recorded. As shown in Figure 8, when fewer than 32 Docker containers are given to the Spark workers, the SNNQAs complete in less than 300 s. When setting p = 3000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 250 s. When setting p = 2500, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 270 s. When setting p = 2000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 271 s. We found that reducing the value of the parameter p would increase the execution time of the SNNQAs in this context. For example, when setting p = 1500, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 330 s. When setting p = 1000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 437 s. These results testify that Spark is more suitable for handling small parallel tasks. As discussed earlier, the parameter p controls the number of parallel tasks at each stage of the Spark transformation chains. When the parameter p is small, the number of spatial objects processed by each Spark executor would be large, and the latency of the algorithms would thereafter increase. We do not explicitly state which configuration is best in this paper, but the curves of the average execution time of the SNNQAs using the ASVC exhibit great consistency when the Docker container number is fewer than 32. Our testing reveals that Spark workers running on the same host communicate with each other at low cost; therefore, the performance of the SNNQAs is reliable. We observe that when more than 32 Docker containers are given to the Spark workers, the execution cost of the SNNQAs increases, mainly because Kubernetes cannot sustain the required number of containers in the small CoreOS cluster.
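The role of p can be made concrete with a one-liner: keying the data objects by grid ID and partitioning them into p partitions yields p tasks per stage, so a small p leaves each executor with large partitions and long-running tasks. The helper below is only an illustration of that relationship.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// The grid number p doubles as the partition count: Spark schedules one
// task per partition at each stage, so p directly sets the parallelism.
def partitionByGrid[V: ClassTag](gridKeyed: RDD[(Int, V)], p: Int): RDD[(Int, V)] =
  gridKeyed.partitionBy(new HashPartitioner(p))

// e.g. partitionByGrid(dataObjectsByGridId, 3000) mirrors the p = 3000 runs.
```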
Figure 8. The impact of parameter p on the efficiency of Spark-based nearest neighbor query algorithms (SNNQAs) using ASVC.

k is an important parameter for the SNNQAs and, according to our tests, demonstrates a pattern similar to our parameter p observation. In the interest of saving space, those results are not depicted in this paper. We also want to know the efficiency of the SNNQAs using the ASVC in large clusters. Therefore, we conduct a fourth experiment as follows. A CoreOS cluster with 19 virtual instances is built, in which one instance has four VCPUs and 8 GB of RAM and the other instances each have two VCPUs and 4 GB of RAM. The average execution time is recorded for each of the 10 submissions. As shown in Table 4, when setting the total-executor-cores parameter to six, the SNNQAs with different parameters complete in less than 300 s. The average execution time of the SNNQAs ranges from 259.53 s to 288.50 s. When providing 90 Docker containers to Spark workers, the average execution time of the SNNQAs is about 259.53 s. The results verify our earlier discussion: in a small Kubernetes cluster, a large number of Docker containers may pose higher competition to the underlying OS, which is the primary reason that the efficiency of the SNNQAs declines. Although Kubernetes enables auto-scaling hundreds and even thousands of containers in a few seconds, we suggest not using a large number of Docker containers in small clusters in order to avoid overloading.

Table 4. The impact of container number on the efficiency of SNNQAs using ASVC and setting p = 3000 and k = 5.

Container Number    Execution Time
32                  271.89 s
36                  285.16 s
40                  274.20 s
50                  288.50 s
60                  276.39 s
70                  279.95 s
80                  272.63 s
90                  259.53 s
100                 271.92 s
Since the CoreOS cluster has a total of 40 cores, the value of the parameter 'total-executor-cores' can be larger than six in the above tests. To understand the impact of the Docker container number on the efficiency of the SNNQAs when using more cores, we conduct our fifth experiment as follows. Each submission of an SNNQA using ASVC is executed 10 times with p = 3000 and k = 5, and the average execution time is recorded. As shown in Figure 9, when the value of the 'total-executor-cores' is from 6 to 18, the average execution time is about 300 s. Even when the Docker container number is set to 100, the time cost never exceeds 300 s. However, the performance varies when there are 24 total executor cores. For example, when giving 50 Docker containers to Spark workers, the average execution time is 343.92 s. When increasing the number of Docker containers given to the Spark workers to 70, the average execution time rises to more than 400 s. Increasing the number of Docker containers further, to 90, reduces the average execution time to approximately 300 s. These results indicate that the container number and the total executor cores used by Spark workers are the determining factors for the performance of the SNNQAs. When giving 40 Docker containers to Spark workers and setting the total-executor-cores to 12, the average execution time of the SNNQAs is 203.33 s. Using the same number of Docker containers for Spark workers but increasing the number of total executor cores to 18 increases the SNNQAs' average execution time to 208.93 s. These results indicate that if we choose a relatively small number of Docker containers with moderate total executor cores for Spark workers, the SNNQAs' performance reliability and efficiency increase. It is important to note that the time cost of the SNNQAs is nearly 300 s when giving six, 12, and 18 cores to Spark workers. We observe that the overall performance of the SNNQAs using 12 cores is more robust than that using six or 18 cores. The total-executor-cores parameter is an upper threshold that represents the maximum number of cores that Spark worker containers can consume. As more cores are given to Spark workers, the number of chances for each Spark worker to execute tasks increases, but a larger value of the parameter also introduces more scheduling tasks for the Kubernetes scheduler. This situation is clear when we use a larger number of Docker containers, since Kubernetes takes more time to assure that the specified number of Docker containers is ready. The larger the number of total executor cores used by Spark workers, the greater the competition between the containers that run Spark worker daemons. Kubernetes takes more time to reschedule crashed Spark workers, which in this situation causes the Spark master instance to reschedule failed tasks. The execution of the SNNQAs took about 625 s when giving 30 cores to Spark workers and setting the container number to 50, and execution of the SNNQAs always failed when giving 30 cores to Spark workers with more than 60 containers.
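For reference, the total-executor-cores cap is supplied when a job is submitted to the Spark master running in its container. The invocation below is illustrative only; the master URL, driver class, jar name, and argument order are placeholders, not the paper's actual artifacts.

    # Illustrative submission; only --total-executor-cores and the p/k values
    # mirror the experiment, everything else is a placeholder.
    spark-submit \
      --master spark://spark-master:7077 \
      --total-executor-cores 12 \
      --class example.snnqa.Main \
      snnqa-assembly.jar 3000 5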
Figure 9. The impact of total executor cores and container number on the efficiency of SNNQAs using ASVC.

4.3.3. SNNQAs Using Different ASVC Strategies

To investigate the efficiency of the SNNQAs using different auto-scaling strategies, we conduct the sixth experiment as follows. A Heapster plugin is used to collect CPU workload metrics from the kubelet daemons. Each submission of the algorithm is executed 10 times and the average execution time is recorded. As shown in Table 5, when the MinReplicas, MaxReplica, and TargetPercentage are set to 1, 40, and 50% respectively, the average execution time of the SNNQAs is 303.49 s. When the values of the MinReplicas, MaxReplica, and TargetPercentage parameters are 10, 40, and 50% respectively, the average execution time of the SNNQAs is 247.68 s. These results indicate that the minimum number of replicas of Spark workers guaranteed by Kubernetes has a certain impact on the efficiency of the SNNQAs, but the impact is not clear-cut, as observed in our testing. For example, when we set the values of the MinReplicas, MaxReplica, and TargetPercentage to 1, 40, and 80% respectively, the average execution time of the SNNQAs is 272.56 s. When the values of the MinReplicas, MaxReplica, and TargetPercentage parameters are 10, 40, and 80% respectively, the average execution time is 249.89 s. We observe that the number of containers never grows to 40 when we set the value of the MinReplicas parameter to 10, 20, or 30, because of two factors. The first is that the Kubernetes master fails to calculate the averaged CPU workload within the very short period of each submission. The second is that the Heapster plugin may fail to collect the metrics for the containers. Our goal, in the near future, is to find alternatives that provide real-time monitoring for Kubernetes. Although there are still some flaws in the auto-scaling strategies, we observe that when setting the value of the MinReplicas to 1, an HPA works well in all the tests.

Table 5. The impact of different auto-scaling strategies on the efficiency of SNNQAs using ASVC.

Total Executor Cores    MinReplicas    MaxReplica    TargetPercentage    Execution Time
6                       1              40             50%                303.49 s
6                       10             40             50%                247.68 s
6                       20             40             50%                253.91 s
6                       30             40             50%                242.62 s
6                       1              40             80%                272.56 s
6                       10             40             80%                249.89 s
6                       20             40             80%                252.53 s
6                       30             40             80%                262.80 s
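The three strategy parameters map onto the fields of a Kubernetes HorizontalPodAutoscaler. The manifest below is a minimal sketch assuming an autoscaling/v1 HPA targeting a Spark worker replication controller; the resource names are ours, not the paper's actual template.

    # Hypothetical HPA for the Spark worker pods; minReplicas, maxReplicas,
    # and targetCPUUtilizationPercentage correspond to the MinReplicas,
    # MaxReplica, and TargetPercentage parameters discussed above.
    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: spark-worker-hpa
    spec:
      scaleTargetRef:
        apiVersion: v1
        kind: ReplicationController
        name: spark-worker
      minReplicas: 1
      maxReplicas: 40
      targetCPUUtilizationPercentage: 50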
4.3.4. SNNQAs Using ASVC in an Auto-Scaling Group

To investigate the efficiency of the SNNQAs using an HPA combined with an auto-scaling group, we conduct the seventh experiment as follows. First, a Heat template is used to define the minimum, desired, and maximum numbers of CoreOS instances, which are set to 5, 5, and 10, respectively. Each CoreOS instance has two VCPUs and 4 GB of RAM, and the auto-scaling group is forced to scale out when the CPU utilization rate stays above 80% for about 1 min. Second, a Kubernetes template is used to define the minimum and maximum numbers of replicas of Spark workers, which are set to 1 and 200, respectively, and the TargetPercentage parameter is set to 80%. Finally, the SNNQAs are submitted to the auto-scaling group with the parameters total-executor-cores = 18, p = 3000, and k = 5. Each submission of the SNNQAs is executed five times and the average execution time is recorded. As shown in Figure 10, the execution time of the SNNQAs using the HPA alone increases monotonically with the number of query objects when the auto-scaling group is not used. Finding the five nearest neighbors from 834,158 spatial data objects for each of the 34,604,649 query objects took approximately 660.64 s, whereas the HPA combined with the auto-scaling group delivers far better performance: all of the runs complete in less than 300 s. The results show that the HPA combined with the auto-scaling group is beneficial for reliable spatial big data analysis. It is suggested that future research be directed toward automatically configuring these parameters.
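The scale-out rule above can be sketched as a Heat template fragment. The resource types below are the era-typical OS::Heat::AutoScalingGroup, OS::Heat::ScalingPolicy, and OS::Ceilometer::Alarm; the image and flavor names are placeholders, and the authors' actual template may differ.

    # Hypothetical Heat fragment: a CoreOS group that grows by one instance
    # when average CPU utilization exceeds 80% for roughly one minute.
    resources:
      coreos_group:
        type: OS::Heat::AutoScalingGroup
        properties:
          min_size: 5
          desired_capacity: 5
          max_size: 10
          resource:
            type: OS::Nova::Server
            properties:
              image: coreos-image      # placeholder
              flavor: m1.medium        # placeholder: 2 VCPUs, 4 GB RAM
      scaleout_policy:
        type: OS::Heat::ScalingPolicy
        properties:
          adjustment_type: change_in_capacity
          scaling_adjustment: 1
          cooldown: 60
          auto_scaling_group_id: { get_resource: coreos_group }
      cpu_alarm_high:
        type: OS::Ceilometer::Alarm
        properties:
          meter_name: cpu_util
          statistic: avg
          period: 60
          evaluation_periods: 1
          threshold: 80
          comparison_operator: gt
          alarm_actions:
            - { get_attr: [scaleout_policy, alarm_url] }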
Figure 10. Comparison of the efficiency of SNNQAs using the horizontal pod auto-scaler (HPA) and the HPA combined with an auto-scaling group (ASG) in an OpenStack cloud.

5. Discussion

Elasticity is one of the main characteristics of cloud computing, and it means that computing resources can be increased or decreased dynamically and on demand according to the current workload of applications [17]. Current solutions direct the use of auto-scaling groups to spawn VMs, where a dedicated VM runs the Spark master daemons and the other VMs run the Spark worker daemons. Since the performance variability of the VM running the Spark master daemons can impact the scheduling tasks of the Spark engine, it may be difficult to build useful regression models for estimating the execution time of Spark-based PSQPAs. Moreover, the Spark master and worker daemons running on VMs can go out of work without an additional recovery mechanism. Additionally, the capacity of the dedicated VM would be over-provisioned or under-provisioned, let alone the computing resources being used ineffectively. As shown in Figure 6, when using the ASVI strategies, the efficiency of the SNNQAs is highly variable. Since each Spark worker daemon consumes a fixed amount of memory on the Ubuntu VMs, we cannot specify a large value for the number of Spark workers that run on the same VM. During our testing, we observed that Spark worker daemons are frequently lost. This may be correlated with the Transmission Control Protocol (TCP)-oriented Netty protocol used for receiving commands from the Spark master and transmitting data between Spark workers. If the connection is lost, then the Spark workers would be out of service.
In addition, the limited memory space of Spark executors cannot satisfy the computation requirement when handling too much data or too many query objects. The relationship between the number of Spark worker daemons running on VMs and the execution time of the SNNQAs is indefinite. To satisfy time-constrained data analysis, this uncertainty should be removed. The performance variability of the VMs and the lack of a self-healing mechanism for Spark workers running on VMs preclude us from using an ASVI strategy, while the overall performance of the SNNQAs using ASVC is more stable than that using ASVI strategies. The time costs of the SNNQAs stay below 280 s if we provide fewer than 28 Docker containers in this context. We notice that the time cost grows rapidly when using more than 32 containers for Spark workers with an ASVC strategy. The reason may be highly correlated with the kernel burden of switching processes, as the competition between Kubernetes pods would be high. Since Kubernetes adopts an implicit optimization policy that always tries to achieve the specified status according to the currently available resources, the scheduler would take more time to reassign the failed containers to proper nodes. Compared with VMs, Linux containers demonstrate equal or better performance under various workloads [22]. As shown in Figure 8, the SNNQAs using ASVC strategies complete in less than 300 s when fewer than 32 containers are used for Spark workers. Besides self-healing containers, Kubernetes facilitates auto-scaling containers to all available computing nodes through the use of an HPA. We also explored the relationship between the execution time of the SNNQAs and the number of containers in large clusters. Figure 9 shows that a small number of containers with a moderate number of total executor cores used by Spark results in robust executions of the SNNQAs. The relationship is definite when the total executor cores do not exceed the number of virtual cores of all available VMs. Lastly, we checked the feasibility of an HPA combined with an auto-scaling group for the elastic provisioning of containers in a scale-out manner. The results, as shown in Figure 10, indicate that an HPA combined with an auto-scaling group is useful for predictable spatial query processing in a cloud computing environment. Although tremendous efficiency can be obtained when an HPA is used, we observe that the Heapster plugin sometimes fails to collect CPU metrics, which may impact the auto-scaling container strategies. Another shortcoming is that Spark components running in Docker containers have no simple way to share data and software packages. We use an NFS volume for sharing data among Kubernetes pods, which may impact the total execution time of the SNNQAs. This is an area for future research.
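As an illustration of the NFS-based sharing noted above, a pod can mount a common NFS export with a standard Kubernetes volume. The server address, export path, and image name below are placeholders rather than the paper's configuration.

    # Hypothetical pod fragment: every Spark worker pod mounts the same NFS
    # export, making shared data visible to all workers.
    apiVersion: v1
    kind: Pod
    metadata:
      name: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-worker:latest   # placeholder image
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          nfs:
            server: 10.0.0.10          # placeholder NFS server
            path: /exports/spatial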
6. Conclusions

In this paper, we proposed an elastic spatial query processing approach in an OpenStack cloud computing environment that uses Kubernetes-managed Docker containers to auto-scale an Apache Spark computing engine. By leveraging the self-healing and auto-scaling mechanisms of Kubernetes and utilizing the in-memory computing paradigm of Spark, we can build an elastic cluster computing environment for spatial query processing in the cloud. To satisfy time-constrained data analysis, we introduced the implementation of Spark-based spatial query processing algorithms (SQPAs), and we then identified three factors that impact the execution time of SQPAs. The first factor is the internal parameters of the SQPAs, which include the number of grids p used for grouping spatially correlated datasets. Since the Spark engine is suitable for tackling numerous but relatively small tasks, we suggested using a large value of the parameter p while considering the volume of the spatial datasets and the underlying computing resources. The second factor is the parameters of the Spark computing model, such as total-executor-cores. For example, an inappropriate number of total executor cores used by the Spark engine would bring about intense competition between containers. We suggested using a small number of containers with a moderate number of total executor cores for reliable query processing of big spatial data. The last factor is the parameters used for defining auto-scaling strategies. For example, the MinReplicas, MaxReplica, and TargetPercentage parameters would impact the execution time of SQPAs. The experiments and tests conducted on an OpenStack cloud computing platform indicate that time-constrained SQPAs are achievable. Considering these parameters, our future work will investigate how to build cost models for Spark-based spatial query algorithms that use the ASVC strategies and how to automate the settings of these parameters in OpenStack clouds for processing spatial big data in the big data era.

Acknowledgments: We gratefully acknowledge funding support for this research provided by the National High-Resolution Earth Observation System Projects (civil part): Hydrological monitoring system by high spatial resolution remote sensing image (first phase) (08-Y30B07-9001-13/15).

Author Contributions: Wei Huang designed the experiment and procedure, Wen Zhang analyzed the results, and they both wrote the manuscript.
The work was supervised by Lingkui Meng, who contributed to all stages of the work. Dongying Zhang made major contributions to the interpretations and modification.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Lee, J.G.; Kang, M. Geospatial big data: Challenges and opportunities. Big Data Res. 2015, 2, 74–81.
2. Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53.
3. Li, S.; Dragicevic, S.; Castro, F.A.; Sester, M.; Winter, S.; Coltekin, A.; Pettit, C.; Jiang, B.; Haworth, J.; Stein, A. Geospatial big data handling theory and methods: A review and research challenges. ISPRS J. Photogramm. Remote Sens. 2016, 115, 119–133.
4. Li, Z.; Yang, C.; Liu, K.; Hu, F.; Jin, B. Automatic scaling hadoop in the cloud for efficient process of big geospatial data. ISPRS Int. J. Geo-Inf. 2016, 5, 173.
5. Yang, C.; Goodchild, M.; Huang, Q.; Nebert, D.; Raskin, R.; Xu, Y.; Bambacus, M.; Fay, D. Spatial cloud computing: How can the geospatial sciences use and help shape cloud computing? Int. J. Digit. Earth 2011, 4, 305–329.
6. Aji, A.; Wang, F. High performance spatial query processing for large scale scientific data. In Proceedings of the SIGMOD/PODS 2012 PhD Symposium, Scottsdale, AZ, USA, May 2012; ACM: New York, NY, USA, 2012; pp. 9–14.
7. Orenstein, J.A. Spatial query processing in an object-oriented database system. In ACM SIGMOD Record; ACM: New York, NY, USA, 1986; Volume 15, pp. 326–336.
8. You, S.; Zhang, J.; Gruenwald, L. Large-scale spatial join query processing in cloud. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW), Seoul, Korea, 13–17 April 2015; pp. 34–41.
9. Zhong, Y.; Han, J.; Zhang, T.; Li, Z.; Fang, J.; Chen, G. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, 21–25 May 2012; pp. 2085–2094.
10. Huang, W.; Meng, L.; Zhang, D.; Zhang, W. In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 3–19.
11. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, San Jose, CA, USA, April 2012; USENIX Association: Berkeley, CA, USA, 2012; p. 2.
12. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J. Apache Spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65.
13. Yu, J.; Wu, J.; Sarwat, M. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems; ACM: New York, NY, USA, 2015; p. 70.
14. Tang, M.; Yu, Y.; Malluhi, Q.M.; Ouzzani, M.; Aref, W.G. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow. 2016, 9, 1565–1568.
15. Ray, S.; Simion, B.; Brown, A.D.; Johnson, R. Skew-resistant parallel in-memory spatial join. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management; ACM: New York, NY, USA, 2014; p. 6.
16. Herbst, N.R.; Kounev, S.; Reussner, R. Elasticity in cloud computing: What it is, and what it is not. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), San Jose, CA, USA, 26–28 June 2013; pp. 23–27.
17. Galante, G.; De Bona, L.C.E.; Mury, A.R.; Schulze, B.; da Rosa Righi, R. An analysis of public clouds elasticity in the execution of scientific applications: A survey. J. Grid Comput. 2016, 14, 193–216.
18. Leitner, P.; Cito, J. Patterns in the chaos: A study of performance variation and predictability in public IaaS clouds. ACM Trans. Internet Technol. (TOIT) 2016, 16, 15.
19. Lorido-Botran, T.; Miguel-Alonso, J.; Lozano, J.A. A review of auto-scaling techniques for elastic applications in cloud environments. J. Grid Comput. 2014, 12, 559–592.
20. Kang, S.; Lee, K. Auto-scaling of geo-based image processing in an OpenStack cloud computing environment. Remote Sens. 2016, 8, 662.
21. Soltesz, S.; Pötzl, H.; Fiuczynski, M.E.; Bavier, A.; Peterson, L. Container-based operating system virtualization: A scalable, high-performance alternative to hypervisors. In ACM SIGOPS Operating Systems Review; ACM: New York, NY, USA, 2007; pp. 275–287.
22. Felter, W.; Ferreira, A.; Rajamony, R.; Rubio, J. An updated performance comparison of virtual machines and Linux containers. In Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, PA, USA, 29–31 March 2015; pp. 171–172.
23. Brinkhoff, T.; Kriegel, H.-P.; Seeger, B. Efficient Processing of Spatial Joins Using R-trees; ACM: New York, NY, USA, 1993; Volume 22.
24. Akdogan, A.; Demiryurek, U.; Banaei-Kashani, F.; Shahabi, C. Voronoi-based geospatial query processing with MapReduce. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, IN, USA, 30 November–3 December 2010; pp. 9–16.
25. Brinkhoff, T.; Kriegel, H.-P.; Schneider, R.; Seeger, B. Multi-step Processing of Spatial Joins; ACM: New York, NY, USA, 1994; Volume 23.
26. Lee, K.; Ganti, R.K.; Srivatsa, M.; Liu, L. Efficient spatial query processing for big data. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems; ACM: New York, NY, USA, 2014; pp. 469–472.
27. Chen, H.-L.; Chang, Y.-I. Neighbor-finding based on space-filling curves. Inf. Syst. 2005, 30, 205–226.
28. Gupta, H.; Chawda, B.; Negi, S.; Faruquie, T.A.; Subramaniam, L.V.; Mohania, M. Processing multi-way spatial joins on map-reduce. In Proceedings of the 16th International Conference on Extending Database Technology; ACM: New York, NY, USA, 2013; pp. 113–124.
29. Mouat, A. Using Docker: Developing and Deploying Software with Containers; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
30. Peinl, R.; Holzschuher, F.; Pfitzer, F. Docker cluster management for the cloud: Survey results and own solution. J. Grid Comput. 2016, 14, 265–282.
31. Burns, B.; Grant, B.; Oppenheimer, D.; Brewer, E.; Wilkes, J. Borg, Omega, and Kubernetes. Commun. ACM 2016, 59, 50–57.
32. Jansen, C.; Witt, M.; Krefting, D. Employing Docker Swarm on OpenStack for biomedical analysis. In Proceedings of the International Conference on Computational Science and Its Applications; Springer: Berlin, Germany, 2016; pp. 303–318.
33. Jackson, K.; Bunch, C.; Sigler, E. OpenStack Cloud Computing Cookbook; Packt Publishing Ltd.: Birmingham, UK, 2015.
34. Liu, Y.; Kang, C.; Gao, S.; Xiao, Y.; Tian, Y. Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 2012, 14, 463–483.
35. Liu, X.; Gong, L.; Gong, Y.; Liu, Y. Revealing travel patterns and city structure with taxi trip data. J. Transp. Geogr. 2015, 43, 78–90.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).