A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters

https://doi.org/10.1016/S0167-9473(02)00302-XGet rights and content

Abstract

We propose a new graph-based strategy for the detection of spatial clusters of arbitrary geometric form in a map of geo-referenced populations and cases. Our test statistic is based on the likelihood ratio test previously formulated by Kulldorff and Nagarwalla for circular clusters. A new technique of adaptive simulated annealing is developed, focused on the problem of finding the local maxima of a certain likelihood function over the space of the connected subgraphs of the graph associated to the regions of interest. Given a map with n regions, on average this algorithm finds a quasi-optimal solution after analyzing snlog(n) subgraphs, where s depends on the cases density uniformity in the map. The algorithm is applied to a study of homicide clusters detection in a Brazilian large metropolitan area.

Introduction

Since the 1980s, there has been an increasing interest in the identification of spatially localized adverse health risk conditions. The reasons for the existence of such clustering are various. They can be due to environmental causes concentrated on small regions such as a localized pollution source (Biggeri et al., 1996; Katsouyanni et al., 1991; Xu et al., 1989). Another possible reason is the population differences on their genetic constituency or social habits such as diet (Barbujani and Sokal, 1990; Walsh and DeChello, 2001). Other possibilities include differences on regional medical services such as ascertainment of new cases or disease treatment protocols (Karjalainen, 1990; Goodwin et al., 1998) or a viral agent generating clustering patterns (Kinlen, 1995).

A number of methods have been proposed to test for the presence of spatial clusters of elevated risk and to identify their locations. Thorough and recent reviews can be found in Lawson et al. (1999) where the many different methods are compared. The methods assume that we have at our disposal a map of regions, each one with a defined risk population and a certain number of observed cases. The cases correspond to the individuals in each population, who have a special designation, such as an infected individual or a crime victim. The most useful cluster detection methods are those based on moving windows and related approaches. These methods superimpose circular regions over the study region and then evaluate the significance of the number of cases that fall within each circle. The methods differ according to the circles’ definition. Openshaw et al. (1987) superimpose a regular grid and circles of varying radii are drawn at the grid nodes. Those circles attaining a certain significance level are flagged and shown in the map. Besag and Newell (1991) invert the circle definition: rather than fixing the circles radii and looking for the significant ones, they propose to fix the number of cases within a circle and then search for those with such a small risk population that make them highly significant.

The two methods described above have been criticized for the multiple testing problem since neither of them control for the simultaneous inference problem on the set of all circles considered. That is, considering the multiple locations and circles sizes simultaneously tested, error type I of these multiple tests are not controlled leading to subjective decision making. The authors suggest considering as significant circles those with p-values as small as 0.005 or 0.0005 but there is no theoretical justification for a specific choice. This is a major problem since the choice of the cut-off p-value size determines the sensitivity and specificity of the test.

Kulldorff and Nagarwalla (1995) and Kulldorff (1997) proposed a methodology based on scan statistics that overcomes this difficult problem. It is a maximum likelihood ratio test with a parametric space including all circles tested. The test finds the maximum likelihood sweeping over all zones circumscribed by circles of varying radii centered at each of the regions of the map. The likelihood is not differentiable with respect to the parameters and hence we need to use a Monte Carlo testing procedure to evaluate its significance. We describe this spatial scan statistical test in more detail in Section 2.

A major problem with scan statistic methods is the fixed shape of the clusters to be detected, typically circular clusters although other specific shapes can be defined such as rectangular (Kulldorff 1999a, Kulldorff 1999b). The reason for such a restriction is the computationally infeasible number of possible areas to be tested. However, in real situations, we frequently find spatial clusters with quite different shapes from circular ones. A single pollution source could increase the risk of respiratory disease around it but the wind patterns destroy a possibly symmetric cluster (Biggeri et al., 1996). Increased risks along rivers, transport ways or powerlines create clusters with shapes highly different from circular ones (Feychting and Ahlbom, 1993; Verkasalo, 1993). Usually, other environmental or social causes do not have a circular symmetry and hence do not induce clusters with such a shape.

Applying the usual circular cluster detection methods to find clusters in such a situation has one possible consequence. The cluster is not well localized, generally being either much larger than the real cluster present in the map or leaving out areas with higher risks but whose incorporation in the circular window tested would also bring many other areas with lower risk.

In this paper, we introduce a graph-based algorithm that overcomes this limitation of Kulldorff and Nagarwalla (1995) and Kulldorff (1997) by using a simulated annealing (SA) approach. Differently from these authors, our strategy is not restricted to the detection of clusters with fixed shape, such as rectangular or circular shape, but it looks for connected clusters with arbitrary shape. We implemented our algorithm in a conveniently fast C++ code, which can be freely obtained from the authors. Experiments with our method show that it can identify clusters and test their statistical significance for real-life problems in a short amount of time using modest computer resources.

The structure of this paper is as follows. We first describe the spatial scan statistic in Section 2. We then introduce some notation and present our algorithm in Section 3. In Section 4, we analyze the computational performance of the algorithm and discuss its asymptotic behavior for simple situations. In Section 5, we present the results of experiments to verify the computational complexity of the algorithm. Section 6 applies the method to the problem of finding a spatial cluster of homicides in Belo Horizonte, a large city in Brazil, contrasting the results with those obtained by the circular spatial scan statistics. Section 7 contains a concluding discussion.

Section snippets

The spatial scan statistic

The geographical map is reduced to the centroids of its component areas and, to each one of them, we have the respective associated cases and risk population. The circular spatial scan statistic imposes a circular window on each centroid in turn and the radius of this window is changed continuously to take any value between zero and some prescribed upper limit. Each circular window defines a region composed by one or more areas, which is a potential cluster of increased risk. Let Z be the set

Detecting clusters with arbitrary shape

For the standard spatial scan statistic, a potential cluster is formed by the areas whose centroids lie within a circular region centered in a certain centroid. The circle radius and the centroid are not specified a priori. A natural extension is to allow the potential clusters to be any subset of adjacent areas. For simplicity, we represent the regions by polygons, and the common frontier between two given regions is either a single point or a non-trivial segment (see Fig. 1A). Two regions are

On the convergence of the algorithm

It is a well-known fact that combinatorial algorithms that use SA converge to the optimal solution in exponential time in the worst case, but usually find quasi-optimal solutions in much less time (see Aarts and Korst, 1989; Winkler, 1995).

The process of Section 3 of finding the cluster (if it exists) can be described as a stochastic process as follows. Suppose that to each vertex vi, i=1,…,k of the map graph is assigned a binary variable bi, such that bi=1 if vi is in the set of vertices of

Experimental results

We will present briefly some experimental results on the performance of the algorithm described in Section 3. Due to lack of space, a detailed description of the results of our simulations may be found at the web address www.est.ufmg.br/~duczmal.

For the purpose of simplicity and uniformity in our exposition, we establish a standard map corresponding to the graph realization depicted in Fig. 4A. It consists of 625 vertices disposed in a 25×25 square. Each internal vertex has 6 adjacent vertices,

An application

We use our method to study the geographical distribution of homicides in the year 1995 in Belo Horizonte, a 2 million inhabitants city in Southeast Brazil. The homicide incidence map is shown in Fig. 6, for the 240 areas of the city. There were a total of 273 cases among a population of 2,189,630 inhabitants. The total annual mortality rate was 12.47 per 100,000 persons.

Fig. 7A and Table 2 show the results using the standard spatial scan statistic. The most likely cluster is shadowed. The nine

Concluding remarks

The choice of the regions in real maps deserves some attention. We would like to choose regions that are small enough to circumscribe a relatively homogeneous area, in such a way that we can consider the population and the cases inside each region as roughly similar. If this condition cannot be fulfilled, it may not be possible to consider the attributes of the graph nodes as adequate descriptions of the regions. In this case, it would be necessary to further refine the regions in the original

Acknowledgements

The authors wish to express their appreciation to the editor and the referees for their valuable comments and suggestions, which improved the original manuscript. This research was partially supported by the Fundaçào de Amparo à Pesquisa de Minas Gerais (FAPEMIG) Grant CEX-336/99, and the Ford Foundation.

References (22)

  • E. Aarts et al.

    Simulated Annealing and Boltzmann Machines

    (1989)
  • Barbujani, G., Sokal, R., 1990. Zones of sharp genetic change in Europe are also linguistic boundaries. Proceedings of...
  • J. Besag et al.

    The detection of clusters in rare diseases

    J. Roy. Statist. Soc. Ser. A

    (1991)
  • A. Biggeri et al.

    Air pollution and lung cancer in Triestespatial analysis of risk as a function of distance from sources

    Environ. Health Perspect.

    (1996)
  • M. Dwass

    Modified randomization tests for nonparametric hypotheses

    Ann. Math. Statist.

    (1957)
  • M. Feychting et al.

    Magnetic fields and cancer in children residing near Swedish high voltage power lines

    Amer. J. Epidemiol.

    (1993)
  • J.S. Goodwin et al.

    Geographic variations in breast cancer mortalitydo higher rates imply elevated incidence or poorer survival?

    Amer. J. Public Health

    (1998)
  • S. Karjalainen

    Geographical variation in cancer patient survival in Finlandchance, confounding, or effect of treatment?

    J. Epidemiol. Community Health

    (1990)
  • K. Katsouyanni et al.

    A case-control study of air pollution and tobacco smoking in lung cancer among women in Athens

    Prev. Med.

    (1991)
  • Kemp, R., 1984. Fundamentals of the average case analysis of particular algorithms. Teubner Series in Computer Science,...
  • L.J. Kinlen

    Epidemiological evidence for an infective basis in childhood leukaemia

    British J. Cancer

    (1995)
  • Cited by (0)

    View full text