
International Journal of Geo-Information
Article
Elastic Spatial Query Processing in OpenStack
Cloud Computing Environment for Time-Constraint
Data Analysis
Wei Huang † , Wen Zhang † , Dongying Zhang and Lingkui Meng *
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China;
[email protected] (W.H.); [email protected] (W.Z.); [email protected] (D.Z.)
* Correspondence: [email protected]; Tel.: +86-159-2706-6565
† These authors contributed equally to this work.
Academic Editors: Ozgun Akcay and Wolfgang Kainz
Received: 12 January 2017; Accepted: 13 March 2017; Published: 15 March 2017
Abstract: Geospatial big data analysis (GBDA) is extremely significant for time-constraint applications
such as disaster response. However, time-constraint analysis is not yet a trivial task in the cloud
computing environment. Spatial query processing (SQP) is typically computation-intensive and
indispensable for GBDA, and the spatial range query, join query, and nearest neighbor query
algorithms are not scalable without MapReduce-like frameworks. Parallel SQP algorithms
(PSQPAs) are trapped in skew processing, which is a known issue in Geoscience. To satisfy
time-constrained GBDA, we propose an elastic SQP approach in this paper. First, Spark is used
to implement PSQPAs. Second, Kubernetes-managed CoreOS clusters provide self-healing Docker
containers for running Spark clusters in the cloud. Spark-based PSQPAs are submitted to Docker
containers, where Spark master instances reside. Finally, the horizontal pod auto-scaler (HPA)
scales Docker containers out and in to supply on-demand computing resources. Combined with
an auto-scaling group of virtual instances, the HPA helps to find each of the five nearest neighbors
for 46,139,532 query objects from 834,158 spatial data objects in less than 300 s. The experiments
conducted on an OpenStack cloud demonstrate that auto-scaling containers can satisfy
time-constraint GBDA in clouds.
Keywords: elasticity; spatial query processing; Spark; container; Kubernetes; OpenStack
1. Introduction
With the volume of spatial datasets collected from ubiquitous sensors exceeding the capacity of
current infrastructures, a new direction called Geospatial big data (GBD) has drawn great attention
from academia and industry in recent years [1,2]. Analyzing spatial datasets can be valuable for many
societal applications such as transport planning and management, disaster response, and climate
change research [3]. However, efficiently processing such datasets remains a challenging task, especially
when obtaining timely results is a prerequisite for emergency response [4,5].
Regardless of the data source, parallel spatial query processing (SQP) is indispensable for
GBD analysis and a must for most spatial databases [6,7]. Since a Graphics Processing Unit (GPU)-based
parallel approach requires significant effort in redesigning the relevant algorithms, state-of-the-art
research directs parallel SQP (PSQP) toward handling big spatial data in the cloud computing
environment [8,9]. For example, Zhong et al. [9] implemented several MapReduce-based spatial query
operators for parallel SQP algorithms (PSQPAs). Although MapReduce-based PSQPAs perform well
with enhanced scalability, their efficiency depends on the underlying Hadoop system. Moreover,
MapReduce-based algorithms used in Hadoop suffer from dense disk Input/Output (I/O) and network
communication costs and are thus inappropriate for timely data analysis [10]. To achieve higher
performance, You et al. [8] designed a prototype system, SpatialSpark, based on Spark for large-scale
spatial join queries in cloud computing. Spark is an advanced computing model (CM) similar to but
different from Hadoop MapReduce. By applying transformations to in-memory datasets through the
resilient distributed dataset (RDD) abstraction [11], Spark has become the leading in-memory big data
analysis paradigm in recent years [12]. The extensible nature of RDDs cultivates useful frameworks like
GeoSpark and LocationSpark to efficiently process big spatial data [13,14]. The computing model has
shown its potential for high-performance GBD computing in a 207-node Hadoop cluster on the
Amazon EC2 cloud, as reported in [12]. However, geometry objects are multidimensional,
and the geometry computation on big spatial datasets would aggressively deplete limited computing
resources. Moreover, the proprietary cloud computing environment poses a ‘vendor lock-in’ limitation
for private cloud users. Further, even with sophisticated spatial indexes and spatial declustering of
the huge datasets, the skew processing caused by the size or density of the geometry objects may
introduce long latency [15].
Cloud computing emerged as a paradigm with the promise of provisioning theoretically infinite
computing resources for cloud applications. Elasticity, one of the main characteristics of cloud
computing, represents the capability to scale up and down the number of allocated resources for
applications on demand [16]. Both vertical and horizontal elasticity provide improvements in
application performance and cost reduction [17]. Vertical elasticity means that the computing
resources of a single virtual server, such as memory and virtual Central Processing Units (VCPUs),
can be scaled up and down on demand, while horizontal elasticity means the capability to scale in
and out the number of virtual resources on demand. The ‘pay-as-you-go’ paradigm encourages users
to reduce costs by offloading the investment in infrastructure. However, it is difficult for cloud users to
identify the right amount of virtual resources to use, and cloud service providers often cannot satisfy
the Service Level Agreement (SLA) contracts for cloud users [18]. Diversified auto-scaling techniques
for elastic applications in clouds have been proposed [19]. However, little geoscience research
leverages these auto-scaling technologies in the cloud computing environment [20].
The purpose of this paper is to investigate auto-scaling strategies to support time-constraint spatial
query processing in clouds. To the best of our knowledge, only a single prior work is similar to
ours [20]. They suggested using auto-scaling groups in OpenStack clouds with Heat for defining the
rules for adding or removing virtual machines (VMs) on demand. We focus on horizontally auto-scaling
containers, which would be more valuable for cloud applications. Compared with hypervisor-based
virtualization, container-based virtualization has been considered an alternative for providing isolated
environments for applications [21]. Application containers can be regarded as lightweight VMs that
share the kernel of the underlying operating system. Applications running inside traditional
hypervisor-based VMs depend on the scheduling capacity of the guest operating system (OS), which
introduces an extra level of abstraction [22]. Horizontally auto-scaling VMs often takes several minutes
to build virtual computing clusters, which may be unsuitable for time-constraint computational tasks.
In fact, the efficiency of PSQPAs is not only determined by the volume of data but is also closely
related to internal parameters and the underlying CMs. To reduce the time costs of PSQPAs, we first
introduce the implementation of Spark-based PSQPAs and identify the parameters that impact their
efficiency in Section 2. After introducing Docker for container management on a single
computing node, we detail Kubernetes-based container management and scheduling for auto-scaling
containers across cloud computing nodes. Then, some auto-scaling strategies and a spatial query
processing case are given in Section 3. The experiments and results are described in Section 4, followed
by a discussion in Section 5. Finally, we conclude and mention future work in Section 6.
2. Parallel SQPAs Using Spark CM
Let D and Q be a data object collection and a query object collection respectively. Spatial range
query (SRQ) represents searching data objects of D related to query objects of Q within a certain
distance. The nearest neighbor query (NNQ) searches the data objects closest to a given query object.
The query ‘find the nearest hotels close to a given space’ is an example of a NNQ, where the hotels are
the data objects and the space is the query object. Spatial join query (SJQ) combines pairs of objects
from two datasets based on their spatial relationships [23]. Before identifying the parameters impacting
the efficiency of Spark-based PSQPAs, we briefly introduce their implementation and execution.
2.1. PSQPAs Using Spark CM
Figure 1 shows the execution of a Spark-based SRQ algorithm (SSRQA). First, spatial data objects
are partitioned and stored in distributed nodes. Then, the query objects are broadcast to the nodes
where the partitions reside. In these nodes, each query object is used to match the data objects by
performing a spatial range predicate computation. We employ GeoSpark for processing large scale
spatial data in this work, which is a software package that provides basic geometrical operations.
BitTorrent-enabled broadcasting is the main approach for the Spark master to share immutable
objects with Spark workers [10]. Sharing data whose volume does not exceed the Java Virtual
Machine (JVM) heap size of the Spark master instance is efficient. Implementation of a simple SSRQA
includes only one mapping transformation.
Figure 1. The execution flow of Spark-based spatial range query (SRQ) algorithms. The partitioned
spatial data objects are spatial resilient distributed datasets (RDDs). Spark executors on Spark worker
nodes would do spatial range predicate computation after the mapping transformation tasks are
received from the Spark Master instance.
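As an illustration, the following minimal Scala sketch reproduces this broadcast-and-filter pattern with plain Spark and JTS geometries (the geometry library underlying GeoSpark) rather than GeoSpark's own API; the input path and the query window are hypothetical placeholders, not the configuration used in our tests.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.vividsolutions.jts.geom.{Coordinate, Envelope}

object SSRQASketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SSRQA-sketch"))

    // Spatial data objects, partitioned across the cluster as an RDD.
    val dataObjects = sc.textFile("hdfs:///data/objects.csv").map { line =>
      val Array(x, y) = line.split(",").map(_.toDouble)
      new Coordinate(x, y)
    }

    // The query windows are small, so the master broadcasts them once.
    val queryWindows = Seq(new Envelope(-74.05, -73.75, 40.60, 40.90))
    val bcWindows = sc.broadcast(queryWindows)

    // The single mapping transformation: each worker evaluates the spatial
    // range predicate against its local partition of data objects.
    val matched = dataObjects.filter(c => bcWindows.value.exists(_.contains(c)))

    println(s"matched data objects: ${matched.count()}")
    sc.stop()
  }
}
```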
A Spark-based NNQ algorithm (SNNQA) executes as follows. If we suppose that the query objects
are small and that the data objects are huge, the query objects can be broadcast to the computing nodes
where the partitioned data objects reside. In these nodes, Spark workers execute spatial predicate
computations using local spatial data objects. If, instead, the query objects are too large to be shared
whereas the data objects are relatively small, we can spatially partition the data objects into grids and,
if necessary, build an index for them [24]. The grids can then be broadcast to the computing nodes
where the query objects reside. If both the data objects and the query objects are too large to be shared,
it is still possible to divide the query objects into chunks and submit the same algorithms with the
different chunks as their input. Implementation of a simple SNNQA includes only one mapping
transformation. Since the execution flow of a SNNQA is fairly similar to that of a SSRQA, we omit the
schematic diagram here in order to save space.
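A hedged sketch of the first case, in which the query objects are broadcast, follows: each partition ranks its local data objects per query, and a reduce step merges the per-partition candidates into the global k nearest. The path, the query points, and the brute-force distance ranking are illustrative assumptions, not the tuned implementation evaluated later.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SNNQASketch {
  case class Point(x: Double, y: Double)

  // Squared Euclidean distance is enough for ranking neighbors.
  def dist2(a: Point, b: Point): Double = {
    val dx = a.x - b.x; val dy = a.y - b.y
    dx * dx + dy * dy
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SNNQA-sketch"))
    val k = 5

    val dataObjects = sc.textFile("hdfs:///data/taxlots.csv").map { line =>
      val Array(x, y) = line.split(",").map(_.toDouble)
      Point(x, y)
    }

    // Small set of (queryId, point) pairs, broadcast to every worker.
    val queries = sc.broadcast(
      Array(Point(0.5, 0.5), Point(2.0, 3.0)).zipWithIndex.map(_.swap))

    val knn = dataObjects
      .mapPartitions { it =>
        val local = it.toArray
        // Local k nearest candidates per query, tagged with distances.
        queries.value.iterator.map { case (qid, q) =>
          (qid, local.map(p => (dist2(q, p), p)).sortBy(_._1).take(k))
        }
      }
      // Merge partition-local candidates into the global top k per query.
      .reduceByKey((a, b) => (a ++ b).sortBy(_._1).take(k))

    knn.collect().foreach { case (qid, nn) =>
      println(s"query $qid -> ${nn.map(_._2).mkString(", ")}")
    }
    sc.stop()
  }
}
```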
Figure 2 shows a Spark-based SJQ algorithm (SSJQA) that executes as follows. After loading
datasets in computing nodes, a Spark engine will partition them into grids according to their
approximate spatial location, such as the minimum bounding rectilinear rectangle (MBR). Next,
the Spark engine joins the datasets according to their grid IDs [25]. For those spatial objects that have
the same grid ID, the spatial overlap predicate computation is evaluated in the same computing nodes.
If two types of objects overlap, they are kept in the final results. Finally, the results are grouped
by rectangles with the duplicate objects removed. The implementation of a SSJQA may include
several joins, mapping, and filtering transformations.
Figure 2. The execution flow of Spark-based spatial join query (SJQ) algorithms. Data and query
objects are partitioned and stored in grids according to their spatial location. Spark executors on
Spark worker nodes would do spatial overlap predicate computation after the joining transformation
tasks are received from the Spark Master instance.
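The grid-keyed join can be sketched as follows; the cell size, the grid-ID encoding, and the envelope inputs are hypothetical choices made only to keep the example self-contained, and coordinates are assumed non-negative.

```scala
import org.apache.spark.rdd.RDD
import com.vividsolutions.jts.geom.Envelope

object SSJQASketch {
  val cellSize = 0.01 // grid resolution (assumed)

  // All grid IDs touched by an object's MBR; an MBR spanning several cells
  // is emitted once per cell, which is why duplicates must be removed at
  // the end. Assumes non-negative coordinates.
  def gridIds(e: Envelope): Seq[Long] =
    for {
      gx <- (e.getMinX / cellSize).toInt to (e.getMaxX / cellSize).toInt
      gy <- (e.getMinY / cellSize).toInt to (e.getMaxY / cellSize).toInt
    } yield gx.toLong * 1000000L + gy

  def gridJoin(data: RDD[(String, Envelope)],
               query: RDD[(String, Envelope)]): RDD[(String, String)] = {
    val d = data.flatMap { case (id, e) => gridIds(e).map(g => (g, (id, e))) }
    val q = query.flatMap { case (id, e) => gridIds(e).map(g => (g, (id, e))) }
    d.join(q) // filter step: only pairs sharing a grid ID survive
      .filter { case (_, ((_, e1), (_, e2))) => e1.intersects(e2) } // refinement
      .map { case (_, ((id1, _), (id2, _))) => (id1, id2) }
      .distinct() // drop duplicates produced by multi-cell MBRs
  }
}
```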
2.2. Identifying Factors Impacting the Efficiency of Spark-Based PSQPAs
Since the practical method to efficiently query big spatial data is to employ the divide and
conquer strategy [9,26], most MapReduce-based PSQPAs use certain types of space-filling curves,
such as the Hilbert space-filling curve, to map MBRs to grids based on their spatial correlation for
optimizing efficiency [27,28]. We simply treat the number of grids p as one of the internal parameters
of Spark-based PSQPAs. We focus on SNNQAs exclusively, since a SSRQA can be seen as a simple
case of a SNNQA: the spatial range predicate of a SSRQA can be wrapped into a single mapping
transformation, like that of a SNNQA. SSJQAs are more complex than SNNQAs. The basic units
composing this complexity are multiway joins. Although joins are unavoidable and time-consuming,
the projection operation mapping spatially correlated datasets into the same grids is commonly used
in the filter and refinement stages [28]. From the viewpoint of the Spark CM, the number of grids
determines the number of tasks that should be executed, which can directly impact the efficiency of
the PSQPAs. The projection operation only determines the cost of networking communication.
Since building cost models for networking communication is complex, we identify the number of
grids as the internal parameter of PSQPAs. Our purpose is to evaluate the relationship between the
number of grids and the execution time of the PSQPAs, which may be useful for self-adaptive systems.
In this paper, we exclusively use the Spark standalone model because it is easy to configure a Spark
cluster without using other frameworks in a cloud computing environment. Moreover, the standalone
model demonstrates good reliability, as demonstrated in our earlier work. The parameters of the
Spark standalone model include driver-core, driver-memory, executor-memory, total-executor-cores,
and executor-cores. The first two parameters are application-related and defined by users to set the
resource requirement for the Spark master instances. The last three parameters are used to set the
resource requirement for each of the Spark worker instances. Since the value of total-executor-cores
is an expectation of users who handle large amounts of spatial data, it is considered another impactor.
Further, Spark workers running on containers compete with each other for available resources.
Therefore, we do not consider executor-memory and executor-cores as parameters. The other possible
parameter of SNNQAs would be the k parameter, which specifies the desired number of nearest
neighbors for the query objects.
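For concreteness, the sketch below shows how these standalone-mode parameters map onto Spark configuration keys when an algorithm is submitted from Scala code; the master URL and the values are illustrative rather than the settings of every test (total-executor-cores corresponds to spark.cores.max).

```scala
import org.apache.spark.SparkConf

// Illustrative resource settings for a Spark standalone submission.
val conf = new SparkConf()
  .setMaster("spark://spark-master:7077") // standalone master (assumed URL)
  .setAppName("SNNQA")
  .set("spark.driver.cores", "1")     // driver-core
  .set("spark.driver.memory", "2g")   // driver-memory
  .set("spark.executor.memory", "2g") // executor-memory
  .set("spark.cores.max", "6")        // total-executor-cores
  .set("spark.executor.cores", "2")   // executor-cores
```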
3. Horizontal Auto-Scaling Containers for Elastic Spatial Query Processing
3.1. Docker with Kubernetes Orchestration for Clustering Containers
Our testing system is based on Docker, which is an open source project for automating the
deployment of containerized applications in a Linux environment [29]. Docker is a framework that
manages the lifecycle of application containers. An application and its dependencies are encompassed
into an image. A Docker image can be built from a basic image and other existing images. The basic
image is a minimal operating system such as a community enterprise operating system (CentOS),
RancherOS, or CoreOS. Applications and a basic image are combined into a single layer-wise image.
Layer-by-layer storage facilitates sharing and updating the application components at minimum cost.
From Docker’s point of view, a container is the basic management unit in which an application runs.
Docker containers share the basic operating system with other containers rather than use another copy.
Figure 3 shows how Docker clients would instruct a Docker host to launch containers for running
applications with specified images. The Docker image registry is the dedicated gallery for storing the
images. We use a private image registry to store Spark-master and Spark-worker images in our tests.
Figure 3. Docker framework for lifecycle management of containers in a Docker host. Docker images
are stored in Docker image registries. Each host has a Docker engine that deploys and controls these
containers as requested by Docker clients.
Docker facilitates application executions in a single host. To fabricate computing clusters
using containers, higher level tools such as Docker Swarm and Kubernetes are necessary [30,31].
Docker Swarm is a Docker-native container orchestration system for cluster computing. Since it is
simple and flexible in deployment, it has been widely used in OpenStack clouds for data analysis [32].
However, containers on different hosts cannot interact with each other without using the currently
unstable Docker overlay network. Moreover, at the time of this writing, many advanced functionalities
such as self-healing and auto-scaling are not supported by Docker Swarm. Kubernetes is another
open source project. It was released in June 2014 and originates from the requirement of managing
billions of containers in the infrastructure. Kubernetes, which is more complex than Docker Swarm,
has been used by Google for decades. The complexities cultivate numerous benefits for containerized
applications such as high availability, self-healing, and auto-scaling containers with fine-grained
cluster monitoring. Figure 4 shows how Kubernetes adopts a master-slave architecture to manage
containers in minions. Kubelet and Kube-proxy, running on each minion, are the components for
disseminating changing workloads to pods and routing service-oriented access to pods, respectively.
Here, each pod is a logical group of containers that run the same application. The Application
Programming Interface (API) server is the main services gallery and an access point for Kubectl
(a command line interface for running commands against Kubernetes clusters) to query and define
cluster workloads. The Kube-scheduler and Kube-controller-manager work with the API server for
scheduling workloads in the form of pods and ensuring that the specified pod replicas are running in
minions, respectively. For reasons of simplicity, they are not shown in Figure 4.
Figure 4. Kubernetes components for orchestrating containers distributed in minion nodes.
3.2. Horizontal Auto-Scaling Containers for Elastic Spatial Query Processing
To explore elastic spatial query processing in a cloud computing environment, we assume that
there are C cores, I maximum instances, and M gigabytes of memory leased for a tenant. Figure 5
shows the computing resources, which are used as follows. The left part consists of a stable cluster,
in which the Kubernetes Master manages containers on a fixed number of minion nodes. A horizontal
pod auto-scaler (HPA) is a plugin for Kubernetes that automatically increases and/or decreases the
number of pods across the minions. To build a robust Spark cluster on top of Kubernetes, users can
request Kubernetes to create a replication controller that ensures one, and only one, pod runs the
Spark-master image. Kubernetes allows users to explicitly define resource quotas for pods. Therefore,
a pod that has sufficient memory and cores would be exclusively used by the Spark-master instance.
Then, the HPA creates an auto-scaling group of containers for Spark-worker instances based on some
auto-scaling strategies. Finally, Spark-workers connect to the Spark-master instances by using a
Kubernetes service, fabricating an elastic Spark cluster. Kubernetes services are logical groups of pods
and are used for routing incoming traffic to the pods behind them. Three parameters are mainly used
to define the auto-scaling strategies. The minReplicas and maxReplicas parameters declare the
minimum and maximum number of pods allowed, respectively. Kubernetes compares the arithmetic
mean of the pods’ CPU utilization with the target value, as defined by the target percentage parameter,
and if necessary adjusts the number of pods. The right part includes an auto-scaling group, in which
virtual machines (VMs) scale in and scale out according to the policies defined in the Heat templates.
The minimum and maximum size properties of an auto-scaling group are used to define the threshold
of instances allowed for cloud users. To horizontally scale Docker containers in an auto-scaling group,
the preliminary requirement is that the VMs can automatically join the Kubernetes cluster. Fortunately,
the cloud-config files of OpenStack can be used in this context [33]; the kube-proxy and kubelet
services on the minion nodes can then be loaded automatically when the VMs boot.
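As a sketch of the proportional policy the HPA applies under these three parameters (assuming the documented CPU-utilization rule), the desired replica count grows with the ratio of the observed mean utilization to the target and is clamped to the declared bounds:

```scala
// Hedged sketch of the HPA scaling rule; parameter names mirror the
// minReplicas, maxReplicas, and target percentage settings described above.
def desiredReplicas(current: Int, meanCpuPercent: Double,
                    targetPercent: Double, minReplicas: Int,
                    maxReplicas: Int): Int = {
  val raw = math.ceil(current * meanCpuPercent / targetPercent).toInt
  math.max(minReplicas, math.min(maxReplicas, raw))
}

// Example: 4 Spark-worker pods averaging 90% CPU against a 60% target
// scale out to ceil(4 * 90 / 60) = 6 pods.
val next = desiredReplicas(4, 90.0, 60.0, 2, 10) // next == 6
```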
This proposed empirical strategy divides the computing resources into two parts and provides
a way to distribute the containers over all the nodes. By separating the Spark-master and Spark-workers
into containers and leveraging the HPA to distribute the workload, we can build self-healing Spark
clusters to process huge volumes of spatial datasets. The experiments and results are presented in
Section 4.
Figure 5. Horizontal auto-scaling containers in OpenStack clouds for elastic spatial query processing
of big spatial data. The left part is a Kubernetes cluster, and the right part is an auto-scaling group,
in which kube-proxy and kubelet daemons would run on relevant virtual machines.
4. Experiments and Results
4.1. Spatial Datasets for a Spatial Query Processing Case
In this paper, we introduce a spatial query processing case on spatial datasets to better understand
patterns of human mobility in urban areas. In order to perform this analysis, which is fundamental
for city planners, trajectory data are collected from inexpensive GPS-equipped devices, such as
taxicabs. Taxi trip data is widely used for mining traffic flow patterns, as described in [34,35].
For example, the NYC Open Data Plan has built a portal for publishing its digital public data for
city-wide aggregation, as required by local law. In addition, the New York City Taxi and Limousine
Commission has published, since 2009, taxi trip data sets to the portal. In 2015, a total of 11,534,883
rows of green taxi trip data were recorded. The records include fields capturing pick-up and drop-off
dates, times, locations, distances, payment types, and driver-reported passenger counts. Since the
pick-up and drop-off locations are just pairs of geographical coordinates, additional spatial data sets
that include land use type (LUT) are required. We collected a data set named NYC MapPluto Tax Lot
from the NYC Department of City Planning. The fields in the data set include LandUse, XCoord,
and YCoord, which describe the land use categories, such as Family Building, Commercial and Office
Buildings, and Industrial and Manufacturing, and the approximate location of the lots. The Tax Lot
dataset contains 834,158 records. A taxi customer’s pick-up location near a Family Building and
drop-off location near a Commercial and Office Building at rush hour indicates that the person’s
taxi request may be work related. We explore the data sets with the previously mentioned SNNQAs.
Readers can find the datasets by using the following links: https://data.cityofnewyork.us/
Transportation/2015-Green-Taxi-Trip-Data/n4kn-dy2y and http://www1.nyc.gov/site/planning/
data-maps/open-data.page (last access date: 2017/2/8).
4.2. Cloud Computing Environment
There are six machines used for setting up our OpenStack cloud. The Liberty version of OpenStack,
released to the public in October 2015, was chosen. As shown in Table 1, the controller node is
a commodity PC equipped with 1 processor, 4 cores, 4 GB memory, and 500 GB disks, on which
the Keystone identity service, Telemetry service, Neutron networking service, and Glance service
run. The Neutron Linux bridge agents running on computing nodes are connected with each other
to fabricate networking for virtual instances. We use four Dell PowerEdge M610 servers to build
the computing node clusters, with one of them having two physical CPUs, 24 cores, 92 GB of main
memory, and a 160 GB disk, and three of them having two physical CPUs, 24 cores, 48 GB of main
memory, and a 160 GB disk, and three of them having two physical CPUs, 24 cores, 48 GB of main
memory, and a 500 GB disk. The block storage node is a HP Z220 workstation configured with
two physical CPUs, 8 cores, 32 GB of main memory, and 3 TB disks. The Cinder services running
on the HP workstation is using Network File System (NFS) to provide virtual instance with storage.
To simulate complex network infrastructure, we let the computing nodes traverse two different subnets.
1000 Mbps networking devices are used to bridge the two subnets. All nodes in the cloud computing
environment adapt a Ubuntu 14.04.4 LTS 64 bits operating system. All required packages that support
the OpenStack cloud are installed with their native package. More specifically, RabbitMQ 3.5.4 is
used for coordinating operations and status information among services. MariaDB 5.5.52 is used for
services to store data. The libvirt KVM and Linux Bridge are used to provide platform virtualization
and networking for virtual instances. We exclusively use the vxlan mechanism to provide networking
with virtual instances. For some convenience, we abbreviate the auto-scaling VM instances and the
auto-scaling Docker containers as ASVI and ASVC, respectively.
Table 1. Specification of the OpenStack cloud computing environment.

| Node | Cloud Actor | Specification | Services |
| --- | --- | --- | --- |
| 192.168.203.135 | Controller | 4 cores, 4 GB memory and 500 GB disks | Nova-cert, Nova-consoleauth, Nova-scheduler, Nova-conductor, Cinder-scheduler, Neutron-metadata-agent, Neutron-linuxbridge-agent, Neutron-l3-agent, Neutron-dhcp-agent, Heat-engine |
| 192.168.203.16 | Compute | 24 cores, 92 GB memory and 160 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.97 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.109 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.200.111 | Compute | 24 cores, 48 GB memory and 500 GB disks | Nova-compute, Neutron-linuxbridge-agent |
| 192.168.203.16 | Block Storage | 8 cores, 32 GB memory and 3 TB disks | Cinder-volume |
4.3. Results
4.3.1. Comparison of SNNQAs using ASVI and ASVC
To test the efficiency of SQP using the ASVI and the ASVC, we construct two clusters using two OpenStack
Heat templates. As shown in Table 2, the only difference between the clusters is the image and software
used. The two clusters are comparable in this context, since the computing nodes on which instances
are deployed can be exactly specified by using the availability zone property of OS::Nova::Server
in the Heat templates. In the Ubuntu cluster, the node having 4 VCPUs and 8 GB of memory hosts the
Spark master instance, while the other four instances host the Spark worker instances. In the CoreOS
cluster, the nodes holding each type of Spark component are determined at runtime. More specifically,
the software packages and their versions are listed in Table 3.
Table 2. Two comparable clusters in the OpenStack cloud.

| Virtual Cluster | Specification | Images and Software |
| --- | --- | --- |
| Ubuntu_cluster | 1 instance with 4 VCPUs and 8 GB RAM, and 4 instances each with 2 VCPUs and 4 GB RAM | Ubuntu 14.04.5 trusty image with Spark 1.5.2, Java(TM) SE Runtime Environment (build 1.7.0_45-b18), Scala 2.10.4 |
| CoreOS_cluster | 1 instance with 4 VCPUs and 8 GB RAM, and 4 instances each with 2 VCPUs and 4 GB RAM | CoreOS 1185.3.0 image with Kubernetes 1.3.4, Heapster 1.1.0 and Docker 1.11.2 |
Assuming the clusters’ launching time is negligible, we first test the efficiency of the SNNQAs
under the two clusters. Since it is necessary to specify the Ubuntu image in advance, the number
of Spark workers running on each virtual instance is the same when using the ASVI. As shown in
Figure 6, the efficiency of SNNQAs using the two auto-scaling strategies on the OpenStack cloud is
very different. When p = 3000 and k = 5, there are 3000 grids on which to store spatial data objects,
and each query object queries the five nearest spatial data objects in the grids. The average execution
time is recorded when each SNNQA submission is executed 10 times with the total-executor-cores
parameter set to six. We observe that increasing the number of Spark workers does not guarantee
better efficiency of the algorithms. For example, when there are 12 Spark workers, the average
execution time of the SNNQA using the ASVI strategy is 235.84 s. Doubling the number of Spark
workers to 24 raises the average execution time of the SNNQA using the ASVI strategy (SNNQA-ASVI)
to nearly 400 s, and when the number of Spark workers is increased to 32, the average execution time
of the SNNQA-ASVI falls to 281.18 s. Then, when we increase the number of Spark workers to 36,
the average time cost of the SNNQA-ASVI rises to 420 s. The high SNNQA-ASVI efficiency variability
is possibly due to two factors: (1) the performance disturbance of the virtual instances and (2) the
uncertainty of the number of Spark workers, which introduces SNNQA variations using the ASVI.
We notice that Spark workers are quite often lost in the tests. Data-intensive computational tasks
executed on each Spark worker require too many memory resources, which may disrupt the connection
between the Spark workers and the Spark master instance. Further, there are no self-healing
mechanisms for Spark workers in this context. When given fewer than 28 Docker containers to the
algorithm, the execution time is always less than 270 s. As the results in Figure 6 show, when given
24 Docker containers, the average execution time is 269.91 s. When the number of Docker containers is
limited to 20, the average execution time falls to 250.31 s. The time cost of the SNNQAs, using the two
strategies, grows rapidly when the number of Spark workers is more than 32. These results indicate
that the SNNQAs using the ASVC could be robust if the Docker container number is smaller than 32.

Table 3. The summary table of software used in the following tests.

| Software | Version |
| --- | --- |
| Docker | 1.11.2 |
| CoreOS | 1185.3.0 |
| Spark | 1.5.2 |
| Scala | 2.10.4 |
| JDK | 1.7.0_45-b18 |
| Heapster | 1.1.0 |
| Kubernetes | 1.3.4 |
Figure 6. Comparison of Spark-based K Nearest Neighbor (KNN) algorithms using auto-scaling virtual
machine (VM) instances (ASVI) and auto-scaling Docker containers (ASVC) strategies.
To investigate the effectiveness of the index for the SNNQAs, we conduct the second experiment
as follows. When using the ASVI and 20 Spark workers, the Spark-based KNN algorithms using the
Sort-Tile-Recursive tree (STRtree) index (SKNN-STRtree) complete in about 297.92 s, as shown in
Figure 7. The Spark-based KNN algorithms without the STRtree index complete in about 284.89 s,
as shown in Figure 6. These results indicate that using the index does not reduce the execution time of
the SNNQAs using the ASVI strategies. The reason may be that the number of data objects in the
grids is not large; therefore, indexing may not be necessary in this context. The grid number determines
the average number of objects that are processed by each Spark worker. When the grid number is small,
the number of objects is large. Each Spark worker consumes more memory and increases the amount
of time spent executing the spatial predicate calculation. When the grid number is large, the number
of objects is small. When using the ASVC, the Spark-based K Nearest Neighbor (KNN) algorithms
using the STRtree index complete in less than 300 s if working with 20 to 28 containers. These results
indicate that the SNNQAs with the STRtree index using the ASVC could be robust if the container
number is smaller than 28; when giving more than 28 containers to the Spark workers, the costs of
the SKNN-STRtree obviously increase. These results indicate that the CoreOS cluster may be too small
for Kubernetes to auto-scale many containers. We also notice that the efficiency of the SKNN-STRtree
varies when using the ASVI strategy. These similar results indicate that, regardless of whether or
not indexing is used, the efficiency of the SNNQAs using the ASVI strategy is highly variable in
clouds. The number of Spark worker daemons running on virtual instances varies. The Spark daemons
residing on each virtual instance are prone to be out of service for many reasons. Data-intensive tasks
consume lots of the memory of a Spark worker instance, which may lead to an ‘out of memory’ error.
Additionally, networking in the cloud is typically unstable, and the networking between the Spark
workers and the Spark master can sometimes be broken. We observe that the Spark worker daemons
frequently fail during the tests. Therefore, we focus our remaining research, in this paper, exclusively
on the ASVC strategies.
Figure 7. Comparison of Spark-based K Nearest Neighbor Sort-Tile-Recursive tree (KNN-STRtree)
algorithms using ASVI and ASVC strategies.
4.3.2. SNNQAs Using ASVC
To understand how the parameter p impacts the efficiency of the SNNQAs, we conduct the third experiment as follows. Each submission of the SNNQA using ASVC is executed 10 times with the total-executor-cores parameter set to six, and the average execution time is recorded. As shown in Figure 8, when there are fewer than 32 Docker containers for Spark workers, the SNNQAs complete in less than 300 s. When setting p = 3000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 250 s. When setting p = 2500, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 270 s. When setting p = 2000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 271 s. We found that reducing the value of parameter p increases the execution time of the SNNQAs in this context. For example, when setting p = 1500, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 330 s. When setting p = 1000, k = 5, and using 20 Docker containers, the average minimum execution time of the SNNQAs is about 437 s. These results suggest that Spark is better suited to handling many small parallel tasks. As discussed earlier, the parameter p controls the number of parallel tasks at each stage of the Spark transformation chains. When the parameter p is small, the number of spatial objects processed by each Spark executor is large, and the latency of the algorithms thereafter increases. We do not explicitly state which configuration is best in this paper, but the curves of the average execution time of the SNNQAs using the ASVC exhibit great consistency when the Docker container number is fewer than 32. Our testing reveals that Spark workers running on the same host communicate with each other at low cost; therefore, the performance of the SNNQAs is reliable. We observe that when more than 32 Docker containers are given to Spark workers, the execution cost of the SNNQAs increases, mainly because Kubernetes cannot sustain the required number of containers in the small CoreOS clusters.
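To make the role of p concrete, the following minimal PySpark sketch (our illustration, not the authors' code; the grid_id keying function, cell size, and input path are hypothetical) shows how a grid count p becomes the number of partitions, and hence the number of parallel tasks, per transformation stage:

    from pyspark import SparkContext

    sc = SparkContext(appName="snnqa-partitioning-sketch")
    p = 3000  # number of grid groups; smaller p means more objects per task

    def grid_id(x, y, cell=0.01):
        # Hypothetical spatial key: bucket a coordinate pair into a grid cell.
        return (int(x / cell), int(y / cell))

    points = (sc.textFile("hdfs:///data/query_objects.csv")  # hypothetical path
                .map(lambda line: tuple(float(v) for v in line.split(",")[:2]))
                .keyBy(lambda pt: grid_id(*pt))
                .partitionBy(p))  # p partitions -> p parallel tasks per stage

    # With p = 1000 instead of 3000, each of the fewer partitions holds roughly
    # three times as many spatial objects, so per-task latency grows, consistent
    # with the 437 s versus 250 s slowdown reported above.
    print(points.getNumPartitions())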
Figure 8. The impact of parameter p on the efficiency of Spark-based nearest neighbor query algorithms (SNNQAs) using ASVC.
k is an important parameter for the SNNQAs and, according to our tests, demonstrates a similar pattern to our parameter p observation. In the interest of saving space, the results are not depicted in this paper. We prefer to know the efficiency of the SNNQAs using the ASVC in large clusters. Therefore, we conduct a fourth experiment as follows. A CoreOS cluster with 19 virtual instances is built, in which one instance has four VCPUs and 8 GB of RAM, and the other instances each have two VCPUs and 4 GB of RAM. The average execution time is recorded for each of the 10 submissions. As shown in Table 4, when setting the total-executor-cores parameter to six, the SNNQAs with different parameters complete in less than 300 s. The average execution time of the SNNQAs ranges from 259.53 s to 288.50 s. When providing 90 Docker containers for Spark workers, the average execution time of the SNNQAs is about 259.53 s. The results verify our earlier discussion. In a small Kubernetes cluster, a large number of Docker containers may pose higher competition to the underlying OS, which is the primary reason that the efficiency of the SNNQAs declines. Although Kubernetes enables auto-scaling hundreds and even thousands of containers in a few seconds, we suggest not using a large number of Docker containers in small clusters in order to avoid overloading.

Table 4. The impact of container number on the efficiency of SNNQAs using ASVC and setting p = 3000 and k = 5.

Container Number    Execution Time
32                  271.89 s
36                  285.16 s
40                  274.20 s
50                  288.50 s
60                  276.39 s
70                  279.95 s
80                  272.63 s
90                  259.53 s
100                 271.92 s
Since the CoreOS cluster has a total of 40 cores, the value of the parameter ‘total-executor-cores’ can be larger than six in the above tests. To understand the impact of the Docker container number on the efficiency of the SNNQAs using more cores, we conduct our fifth experiment as follows. Each submission of an SNNQA using ASVC is executed 10 times with p = 3000 and k = 5, and the average execution time is recorded. As shown in Figure 9, when the value of the ‘total-executor-cores’ is from 6 to 18, the average execution time is about 300 s. Even when the Docker container number is set to 100, the time cost never exceeds 300 s. However, the performance varies when there are 24 total-executor-cores. For example, when giving 50 Docker containers to Spark workers, the average execution time is 343.92 s. When increasing the number of Docker containers given to the Spark workers to 70, the average execution time rises to more than 400 s. Increasing this number of Docker containers further, to 90, reduces the average execution time down to approximately 300 s. These results indicate that the container number and the total executor cores used by Spark workers are the determining factors for the performance of the SNNQAs. When giving 40 Docker containers for Spark workers and setting the total-executor-cores to 12, the average execution time of the SNNQAs is 203.33 s. Using the same number of Docker containers for Spark workers but increasing the number of total-executor-cores to 18 increases the SNNQAs’ average execution time to 208.93 s. These results indicate that if we choose a relatively small number of Docker containers with moderate total executor cores for Spark workers, the SNNQAs’ performance reliability and efficiency increase. It is important to note that the time cost of the SNNQAs is nearly 300 s when giving six, 12, and 18 cores to Spark workers. We observe that the overall performance of the SNNQAs using 12 cores is more robust than that using six and 18 cores. The total-executor-cores parameter is an upper threshold that represents the maximum number of cores that Spark worker containers can consume. As more cores are given to Spark workers, the number of chances for each Spark worker to execute tasks increases, but a larger value of the parameter also introduces more scheduling tasks for the Kubernetes scheduler. This situation is clear when we use a larger number of Docker containers, since Kubernetes would take more time to assure that the specified number of Docker containers is ready. The larger the number of total executor cores used by Spark workers, the greater the amount of competition that occurs between containers that run Spark worker daemons. Kubernetes takes more time to reschedule crashed Spark workers, which in this situation causes the Spark master instance to reschedule failed tasks. The execution of the SNNQAs took about 625 s when giving 30 cores to Spark workers and setting the container number to 50, and execution of the SNNQAs always failed when giving 30 cores to Spark workers with more than 60 containers.
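For reference, --total-executor-cores corresponds to the spark.cores.max property in Spark's standalone mode, which caps the total number of cores an application may claim across all workers. A minimal sketch of one such submission configuration (the master URL and application name are hypothetical):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://spark-master:7077")  # hypothetical master service
            .setAppName("snnqa-fifth-experiment")
            .set("spark.cores.max", "12"))  # i.e., total-executor-cores = 12

    sc = SparkContext(conf=conf)
    # With 40 containers and a 12-core cap, runs averaged 203.33 s; raising the
    # cap to 30 cores overloaded Kubernetes scheduling and runs began to fail.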
Figure 9. The impact of total executor cores and container number on the efficiency of SNNQAs using ASVC.
4.3.3. SNNQAs Using Different ASVC Strategies
To investigate the efficiency of the SNNQAs using different auto-scaling strategies, we conduct the sixth experiment as follows. A Heapster plugin is used to collect CPU workload metrics from kubelet daemons. Each submission of the algorithm is executed 10 times and the average execution
time is recorded. As shown in Table 5, when the MinReplicas, MaxReplica, and TargetPercentage are set to 1, 40, and 50%, respectively, the average execution time of the SNNQAs is 303.49 s. When the values of the MinReplicas, MaxReplica, and TargetPercentage parameters are 10, 40, and 50%, respectively, the average execution time of the SNNQAs is 247.68 s. These results indicate that the minimum number of replicas of Spark workers guaranteed by Kubernetes has a certain impact on the efficiency of the SNNQAs, but the impact is not clear, as observed in our testing. For example, when we set the values of the MinReplicas, MaxReplica, and TargetPercentage to 1, 40, and 80%, respectively, the average execution time of the SNNQAs is 272.56 s. When the values of the MinReplicas, MaxReplica, and TargetPercentage parameters are 10, 40, and 80%, respectively, the average execution time is 249.89 s. We observe that the number of containers never grows to 40 when we set the value of the MinReplicas parameter to 10, 20, and 30, because of two factors. The first reason is that the Kubernetes master fails to calculate the averaged CPU workload in the very short period of each submission. The second reason is that the Heapster plugin may fail to collect the metrics for the containers. Our goal, in the near future, is to find alternatives that provide real-time monitoring for Kubernetes. Although there are still some flaws in the auto-scaling strategies, we observe that when setting the value of the MinReplicas to 1, an HPA works well in all the tests.
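As an illustration of the three knobs varied in Table 5, the sketch below creates an equivalent HPA with the official Kubernetes Python client (our example, not the paper's code; the target controller name is hypothetical):

    from kubernetes import client, config

    config.load_kube_config()

    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="spark-worker-hpa"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="v1",
                kind="ReplicationController",
                name="spark-worker-controller",  # hypothetical target name
            ),
            min_replicas=1,                        # MinReplicas
            max_replicas=40,                       # MaxReplica
            target_cpu_utilization_percentage=50,  # TargetPercentage
        ),
    )
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa)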
Table 5. The impact of different auto-scaling strategies on the efficiency of SNNQAs using ASVC.

Total Executor Cores    MinReplicas    MaxReplica    TargetPercentage    Execution Time
6                       1              40            50%                 303.49 s
6                       10             40            50%                 247.68 s
6                       20             40            50%                 253.91 s
6                       30             40            50%                 242.62 s
6                       1              40            80%                 272.56 s
6                       10             40            80%                 249.89 s
6                       20             40            80%                 252.53 s
6                       30             40            80%                 262.80 s
4.3.4. SNNQAs Using ASVC in an Auto-Scaling Group
To investigate the efficiency of the SNNQAs using an HPA combined with an auto-scaling group,
we conduct the seventh experiment as follows. First, a Heat template is used to depict the minimum,
the desired, and the maximum number of CoreOS instances, which should be 5, 5, and 10, respectively.
Each CoreOS instance has two VCPUs and 4 GB of RAM. Next, the auto-scaling group is configured to scale out when the CPU utilization rate stays above 80% for about 1 min. Second, a Kubernetes template
is used to depict the values of the minimum and maximum replicas of Spark workers, which should
be 1 and 200, respectively. The TargetPercentage parameter is set to 80%. Finally, the SNNQAs are
submitted to the auto-scaling group using the parameters: total-executor-cores = 18, p = 3000, and k = 5.
Each submission of the SNNQAs is executed five times and the average execution time is recorded.
As shown in Figure 10, increasing the number of query objects monotonically increases the execution time of the SNNQAs using the HPA alone, when the auto-scaling group is not used. Finding the five nearest neighbors from 834,158 spatial data objects for each of the 34,604,649 query objects took approximately 660.64 s, whereas the HPA combined with the auto-scaling group yields much better performance: all of the runs complete in less than 300 s. The results show that
the HPA combined with the auto-scaling group is beneficial for reliable spatial big data analysis. It is
suggested that future research be directed toward automatically configuring these parameters.
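A hedged sketch of the Heat configuration described above, written as a HOT template in Python dict form (resource names, image, and flavor are hypothetical; the CPU-alarm wiring that fires the policy is omitted):

    hot_template = {
        "heat_template_version": "2015-04-30",
        "resources": {
            "coreos_group": {
                "type": "OS::Heat::AutoScalingGroup",
                "properties": {
                    "min_size": 5,          # minimum CoreOS instances
                    "desired_capacity": 5,  # desired CoreOS instances
                    "max_size": 10,         # maximum CoreOS instances
                    "resource": {
                        "type": "OS::Nova::Server",
                        "properties": {
                            "image": "coreos",      # hypothetical image name
                            "flavor": "m1.medium",  # 2 VCPU / 4 GB RAM class
                        },
                    },
                },
            },
            "scaleout_policy": {
                "type": "OS::Heat::ScalingPolicy",
                "properties": {
                    "auto_scaling_group_id": {"get_resource": "coreos_group"},
                    "adjustment_type": "change_in_capacity",
                    "scaling_adjustment": 1,
                    "cooldown": 60,  # CPU above 80% lasting ~1 min triggers it
                },
            },
        },
    }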
Figure 10. Comparison of the efficiency of SNNQAs using the horizontal pod auto-scaler (HPA) and HPA combined with auto-scaling group (ASG) in OpenStack cloud.
5. Discussion
Elasticity is one of the main characteristics of cloud computing, which means that computing resources can be increased or decreased in a dynamic and on-demand way according to the current workload of applications [17]. Current solutions direct the use of auto-scaling groups to spawn VMs, among which a dedicated VM runs the Spark master daemons and the other VMs run the Spark worker daemons. Since the performance variability of VMs running Spark master daemons can impact the scheduling tasks of the Spark engine, it may be less likely to build useful regression models for estimating the execution time of Spark-based PSQPAs. Moreover, the Spark master and worker daemons running on VMs can go out of work without an additional recovery mechanism. Additionally, the capacity of the dedicated VM would be over-provisioned or under-provisioned, and the computing resources would be used ineffectively. As shown in Figure 6, when using the ASVI strategies, the efficiency of the SNNQAs is highly variable. Since each Spark worker daemon consumes a fixed amount of memory on the Ubuntu VMs, we cannot specify a large value for the number of Spark workers that run on the same VM. During our testing, we observed that Spark worker daemons are frequently lost. This may be correlated with the Transmission Control Protocol (TCP)-oriented Netty protocol used for receiving commands from the Spark master and transmitting data between Spark workers. If the connection is lost, then Spark workers would be out of service. In addition, the limited memory space of Spark executors cannot satisfy the computation requirement when handling too much data or too many query objects.
The relationship between the number of Spark worker daemons running on VMs and the execution time of the SNNQAs is indefinite. To satisfy time-constrained data analysis, the uncertainty should be removed. The performance variability of the VMs and the lack of a self-healing mechanism for Spark workers running on VMs preclude us from using an ASVI strategy, while the overall performance of SNNQAs using ASVC is more stable than that using ASVI strategies. The SNNQAs could complete in less than 280 s if we provide fewer than 28 Docker containers in this context. We notice that the time cost grows rapidly when using more than 32 containers for Spark workers and an ASVC strategy. The reason may be highly correlated with the kernel burden for switching processes, and the competition between Kubernetes pods would be high. Since Kubernetes adopts an implicit optimization policy that always tries to achieve the specified status according to the currently available resources, the scheduler would take more time to reassign the failed containers to proper nodes. Compared with VMs, Linux containers demonstrate equal or better performance under various workloads [22]. As shown in Figure 8, SNNQAs using ASVC strategies complete in less than 300 s when fewer than 32 containers for Spark workers are used. Other than self-healing containers, Kubernetes facilitates auto-scaling containers to all available computing nodes through the use of an HPA. We also explore the relationship between the execution time of SNNQAs and the number of containers in large clusters. Figure 9 shows that a small number of containers with a moderate number of total executor cores used by Spark results in robust executions of the SNNQAs.
The relationship is definite when the total executor cores do not exceed the number of virtual cores of
all available VMs. Lastly, we check the feasibility of an HPA combined with an auto-scaling group for
the elastic provision of containers in a scale-out manner. The results, as shown in Figure 10, indicate
that an HPA combined with an auto-scaling group is useful for predictable SQP in a cloud computing
environment. Although tremendous efficiency can be obtained when an HPA is used, we observe
that the Heapster plugin sometimes fails to collect CPU metrics, which may impact auto-scaling
container strategies. Another shortcoming is that Spark components running in Docker containers have no simple way to share data and software packages. We use an NFS volume for sharing data among Kubernetes Pods, which may impact the total execution time of SNNQAs. This is an area for future research.
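For completeness, the NFS-based sharing mentioned above can be declared as follows with the Kubernetes Python client (a sketch under assumed server address and paths, not the exact configuration used in our tests):

    from kubernetes import client

    nfs_volume = client.V1Volume(
        name="shared-spatial-data",
        nfs=client.V1NFSVolumeSource(server="10.0.0.5",        # hypothetical NFS server
                                     path="/export/spatial"),  # hypothetical export
    )
    mount = client.V1VolumeMount(name="shared-spatial-data", mount_path="/data")
    # Every Spark pod that mounts this volume reads the same spatial datasets,
    # at the price of NFS I/O that can lengthen the SNNQAs' total execution time.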
6. Conclusions
In this paper, we proposed an elastic spatial query processing approach in an OpenStack
cloud computing environment by using Kubernetes-managed Docker containers for auto-scaling
an Apache Spark computing engine. By leveraging the self-healing and auto-scaling mechanism
of Kubernetes and utilizing the in-memory computing paradigm of Spark, we can build an elastic
cluster computing environment for spatial query processing in a cloud computing environment.
To satisfy time-constrained data analysis, we introduced the implementation of Spark-based spatial query processing algorithms (SQPAs) in this paper, and then we identified three factors that impact the execution time of SQPAs. The first factor is the internal parameters of the SQPAs, which include the number of grids p used for grouping spatially correlated datasets. Since the Spark engine is suitable for tackling numerous but relatively small tasks, we suggested using a large value of the parameter p while considering the volume of spatial datasets and the underlying computing resources. The second factor is the parameters of the Spark computing model such as
total-executor-cores. For example, an inappropriate number of total executor cores used by the Spark
engine would bring about intense competition between containers. We suggested using a small number
of containers with a moderate number of total executor cores for reliable spatial query processing of
big spatial data. The last factor is the parameters used for defining auto-scaling strategies. For example,
MinReplicas, MaxReplica, and TargetPercentage parameters would impact the execution time of
SQPAs. The experiments and tests conducted on an OpenStack cloud computing platform indicate that time-constrained SQPAs are feasible. Considering these parameters, our future work will investigate
how to build cost models for Spark-based spatial query algorithms that use the ASVC strategies and
how to automate the settings of these parameters in OpenStack clouds for processing spatial big data
in the big data era.
Acknowledgments: We gratefully acknowledge funding support for this research provided by the National
High-Resolution Earth Observation System Projects (civil part): Hydrological monitoring system by high spatial
resolution remote sensing image (first phase) (08-Y30B07-9001-13/15).
Author Contributions: Wei Huang designed the experiment and procedure, Wen Zhang analyzed the results,
and they both wrote the manuscript. The work was supervised by Lingkui Meng, who contributed to all stages of
the work. Dongying Zhang made major contributions to the interpretations and modification.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lee, J.G.; Kang, M. Geospatial big data: Challenges and opportunities. Big Data Res. 2015, 2, 74–81. [CrossRef]
2. Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53. [CrossRef]
3. Li, S.; Dragicevic, S.; Castro, F.A.; Sester, M.; Winter, S.; Coltekin, A.; Pettit, C.; Jiang, B.; Haworth, J.; Stein, A. Geospatial big data handling theory and methods: A review and research challenges. ISPRS J. Photogramm. Remote Sens. 2016, 115, 119–133. [CrossRef]
4. Li, Z.; Yang, C.; Liu, K.; Hu, F.; Jin, B. Automatic scaling hadoop in the cloud for efficient process of big geospatial data. ISPRS Int. J. Geo-Inf. 2016, 5, 173. [CrossRef]
5. Yang, C.; Goodchild, M.; Huang, Q.; Nebert, D.; Raskin, R.; Xu, Y.; Bambacus, M.; Fay, D. Spatial cloud computing: How can the geospatial sciences use and help shape cloud computing? Int. J. Digit. Earth 2011, 4, 305–329. [CrossRef]
6. Aji, A.; Wang, F. High performance spatial query processing for large scale scientific data. In Proceedings of the SIGMOD/PODS 2012 PhD Symposium, Scottsdale, AZ, USA, May 2012; ACM: New York, NY, USA, 2012; pp. 9–14.
7. Orenstein, J.A. Spatial query processing in an object-oriented database system. In ACM SIGMOD Record; ACM: New York, NY, USA, 1986; Volume 15, pp. 326–336.
8. You, S.; Zhang, J.; Gruenwald, L. Large-scale spatial join query processing in cloud. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW), Seoul, Korea, 13–17 April 2015; pp. 34–41.
9. Zhong, Y.; Han, J.; Zhang, T.; Li, Z.; Fang, J.; Chen, G. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), Shanghai, China, 21–25 May 2012; pp. 2085–2094.
10. Huang, W.; Meng, L.; Zhang, D.; Zhang, W. In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 3–19. [CrossRef]
11. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, San Jose, CA, USA, April 2012; USENIX Association: Berkeley, CA, USA, 2012; p. 2.
12. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J. Apache Spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [CrossRef]
13. Yu, J.; Wu, J.; Sarwat, M. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems; ACM: New York, NY, USA, 2015; p. 70.
14. Tang, M.; Yu, Y.; Malluhi, Q.M.; Ouzzani, M.; Aref, W.G. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow. 2016, 9, 1565–1568. [CrossRef]
15. Ray, S.; Simion, B.; Brown, A.D.; Johnson, R. Skew-resistant parallel in-memory spatial join. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management; ACM: New York, NY, USA, 2014; p. 6.
16. Herbst, N.R.; Kounev, S.; Reussner, R. Elasticity in cloud computing: What it is, and what it is not. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), San Jose, CA, USA, 26–28 June 2013; pp. 23–27.
17. Galante, G.; De Bona, L.C.E.; Mury, A.R.; Schulze, B.; da Rosa Righi, R. An analysis of public clouds elasticity in the execution of scientific applications: A survey. J. Grid Comput. 2016, 14, 193–216. [CrossRef]
18. Leitner, P.; Cito, J. Patterns in the chaos—A study of performance variation and predictability in public IaaS clouds. ACM Trans. Internet Tech. (TOIT) 2016, 16, 15. [CrossRef]
19. Lorido-Botran, T.; Miguel-Alonso, J.; Lozano, J.A. A review of auto-scaling techniques for elastic applications in cloud environments. J. Grid Comput. 2014, 12, 559–592. [CrossRef]
20. Kang, S.; Lee, K. Auto-scaling of geo-based image processing in an OpenStack cloud computing environment. Remote Sens. 2016, 8, 662. [CrossRef]
21. Soltesz, S.; Pötzl, H.; Fiuczynski, M.E.; Bavier, A.; Peterson, L. Container-based operating system virtualization: A scalable, high-performance alternative to hypervisors. In ACM SIGOPS Operating Systems Review; ACM: New York, NY, USA, 2007; pp. 275–287.
22. Felter, W.; Ferreira, A.; Rajamony, R.; Rubio, J. An updated performance comparison of virtual machines and Linux containers. In Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Philadelphia, PA, USA, 29–31 March 2015; pp. 171–172.
23. Brinkhoff, T.; Kriegel, H.-P.; Seeger, B. Efficient Processing of Spatial Joins Using R-trees; ACM: New York, NY, USA, 1993; Volume 22.
24. Akdogan, A.; Demiryurek, U.; Banaei-Kashani, F.; Shahabi, C. Voronoi-based geospatial query processing with MapReduce. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, IN, USA, 30 November–3 December 2010; pp. 9–16.
25. Brinkhoff, T.; Kriegel, H.-P.; Schneider, R.; Seeger, B. Multi-step Processing of Spatial Joins; ACM: New York, NY, USA, 1994; Volume 23.
26. Lee, K.; Ganti, R.K.; Srivatsa, M.; Liu, L. Efficient spatial query processing for big data. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems; ACM: New York, NY, USA, 2014; pp. 469–472.
27. Chen, H.-L.; Chang, Y.-I. Neighbor-finding based on space-filling curves. Inf. Syst. 2005, 30, 205–226. [CrossRef]
28. Gupta, H.; Chawda, B.; Negi, S.; Faruquie, T.A.; Subramaniam, L.V.; Mohania, M. Processing multi-way spatial joins on map-reduce. In Proceedings of the 16th International Conference on Extending Database Technology; ACM: New York, NY, USA, 2013; pp. 113–124.
29. Mouat, A. Using Docker: Developing and Deploying Software with Containers; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
30. Peinl, R.; Holzschuher, F.; Pfitzer, F. Docker cluster management for the cloud—Survey results and own solution. J. Grid Comput. 2016, 14, 265–282. [CrossRef]
31. Burns, B.; Grant, B.; Oppenheimer, D.; Brewer, E.; Wilkes, J. Borg, Omega, and Kubernetes. Commun. ACM 2016, 59, 50–57. [CrossRef]
32. Jansen, C.; Witt, M.; Krefting, D. Employing Docker swarm on OpenStack for biomedical analysis. In Proceedings of the International Conference on Computational Science and Its Applications; Springer: Berlin, Germany, 2016; pp. 303–318.
33. Jackson, K.; Bunch, C.; Sigler, E. OpenStack Cloud Computing Cookbook; Packt Publishing Ltd.: Birmingham, UK, 2015.
34. Liu, Y.; Kang, C.; Gao, S.; Xiao, Y.; Tian, Y. Understanding intra-urban trip patterns from taxi trajectory data. J. Geogr. Syst. 2012, 14, 463–483. [CrossRef]
35. Liu, X.; Gong, L.; Gong, Y.; Liu, Y. Revealing travel patterns and city structure with taxi trip data. J. Transp. Geogr. 2015, 43, 78–90. [CrossRef]
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).