Classification of risk micro-areas using data mining
Andreia Malucelli (I); Altair von Stein Junior (II); Laudelino Bastos (III); Deborah Carvalho (IV); Marcia Regina Cubas (V); Emerson Cabrera Paraíso (I)
(I) Programa de Pós-Graduação em Informática. Pontifícia Universidade Católica do Paraná (PUC-PR). Curitiba, PR, Brasil
(II) Secretaria Estadual de Saúde do Paraná. Curitiba, PR, Brasil
(III) Programa de Pós-Graduação em Computação Aplicada. Universidade Tecnológica Federal do Paraná. Curitiba, PR, Brasil
(IV) Instituto Paranaense de Desenvolvimento Econômico. Curitiba, PR, Brasil
(V) Programa de Pós-Graduação em Tecnologia em Saúde. PUC-PR. Curitiba, PR, Brasil
OBJECTIVE: To identify, with the assistance of computational techniques, rules concerning the conditions of the physical environment for the classification of risk micro-areas.
METHODS: Exploratory research carried out in Curitiba, Southern Brazil, in 2007. It was divided into three phases: the identification of attributes to classify a micro-area; the construction of a database; and the process of discovering knowledge in a database through the use of data mining. The set of attributes included the conditions of infrastructure; hydrography; soil; recreation area; community characteristics; and existence of vectors. The database was constructed with data obtained in interviews by community health workers using questionnaires with closed-ended questions, developed with the essential attributes selected by specialists.
RESULTS: There were 49 attributes identified, 41 of which were essential and eight irrelevant. There were 68 rules obtained in the data mining, which were analyzed through the perspectives of performance and quality and divided into two sets: the inconsistent rules and the rules that confirm the knowledge of experts. The comparison between the groups showed that the rules that confirm the knowledge, despite having lower computational performance, were considered more interesting.
CONCLUSIONS: The data mining provided a set of useful and understandable rules capable of characterizing risk areas based on the characteristics of the physical environment. The use of the proposed rules allows a faster and less subjective area classification, maintaining a standard between the health teams and overcoming the influence of individual perception by each team member.
Descriptors: Databases as Topic. Databases, Factual. Knowledge Bases. Artificial intelligence. Environmental Indicators. Environmental Risks. Risk Map.
Decentralization, as a principle for the construction of the National Healthcare System, includes strategies for changing the model of care, among them, approaching health from a territorial perspective. In this sense, territory is understood not only as geographical space, but as a territorial process, a social space in which people with personal characteristics associate with other people in social movements of transforming their territory.6
To begin their work, health teams carry out a process of gathering and analyzing data on the conditions of the community in their territory of activity, called territorialization. This process consists of the systematic collection of demographic, socioeconomic, political-cultural, epidemiological and health data, used to construct basic or thematic maps. Besides initiating or strengthening the ties between the health team and the community, this process delimits small, asymmetrically shaped spaces called micro-areas.8,10
A micro-area is defined as a subdivision of small extent within the territory of the Basic Health Unit. Its inhabitants have a homogeneous quality of life that can determine health risks.6
The risk of a micro-area can be classified into different levels depending on the characteristics that expose the residents to risks or that lead to health problems.
The recognition of risk micro-areas is fundamental for establishing the priorities of health teams, as well as for planning actions suited to the actual problems of the community.6 To do this, primary data sources, such as interviews with key informants who live in the region, are used together with secondary data sources drawn from the databases of the city departments' information systems or from other governmental or non-governmental organizations.9
A technique recommended for the collection of data that identifies risk micro-areas is rapid assessment, which proposes stages of information gathering; preparing questionnaires; understanding of the territory; the formulation of hypotheses for the sub-division of the territory into micro-areas; and the identification of key informants from the community to validate the collected information.6
The final result of the process of delimiting the micro-areas is the product of a subjective analysis of the combined data. Currently, the risk micro-areas are delineated by health teams supported by community health agents (CHA), who know the local problems because they experience them as residents of the region.
In this regard, the use of strategies to analyze the health situation in areas with similar living conditions can help in the identification and prioritization of health problems. Likewise, it can contribute to the adoption of intersectoral intervention strategies, capable of modifying the living conditions and contributing to actions related to health care.7
Within this context and because of the importance of the information resulting from analyzing the data of a territory, this is an area in which the field of computing can provide support through techniques and tools for data management, including the process known as Knowledge Discovery in Databases (KDD).
KDD is a process that seeks to identify patterns, associations, models or relevant information that remain hidden in databases, repositories and other forms of data storage. It allows for the identification of significant, new, potentially useful and understandable patterns and involves various scientific disciplines such as machine learning, databases, statistics, pattern recognition and visualization, among others.2
Currently, KDD is applied in diverse fields such as administration, marketing analysis and medicine.3 Nonetheless, for the identified patterns to generate new knowledge capable of supporting decisions, they must be interesting, useful and understandable to the managers who may use them.
KDD consists of the following steps: pre-processing, data mining and post-processing. The pre-processing step is considered very important and has the objective of preparing the databases for pattern extraction. The data-mining step follows; it is considered the central step in the discovery of knowledge and involves the choice and application of the tool and the algorithm to be used. Rule induction and genetic algorithms are among the algorithms that can be used in this step.4 Post-processing occurs last, when the results obtained are analyzed and interpreted. In this step the patterns found are evaluated to verify whether they satisfy the criteria to be considered an important element for supporting decision-making.
Considering that analysis to classify risk micro-areas is a subjective process of data manipulation and that the field of informatics has techniques that can make this manipulation objective, the aim of the present study was to identify, through the use of computational techniques, rules about the conditions of the physical environment that are capable of contributing to the classification of risk micro-areas.
An exploratory study was carried out in three stages in the city of Curitiba, Southern Brazil, in 2007.
Stage 1 - identification of attributes for the classification of micro-areas. The initial list was obtained from a review of the literature. The group of attributes was submitted for validation to eight specialists in the field of collective health, five nurses and three doctors, who classified each attribute as irrelevant, important or essential. To be selected, specialists had to have worked as public health professionals for at least two years, hold an academic affiliation and have at least a Master's degree.
Stage 2 - construction of the database. From the attributes that the specialists considered essential, a questionnaire for data collection was designed to be answered by the CHA of the municipal health network. The only CHA excluded were those on vacation, on leave or absent from their activities during the period of data collection. The data were organized in an electronic spreadsheet, creating a database with 531 entries about the physical environment of the micro-areas, representing a sample of 46.2% of the total micro-areas in Curitiba.
Stage 3 - applying the KDD process. This stage began with pre-processing, which consisted of data cleaning, selection and transformation. For the data-mining step the Waikato Environment for Knowledge Analysis (WEKA) tool was used.a Because this was a classification problem, the J48 algorithm was used, which presents the results in the form of a decision tree that can be transformed into a set of rules in the format: "IF...THEN...".
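The conversion of a decision tree into "IF...THEN..." rules can be illustrated with a minimal Python sketch. It does not reproduce WEKA or J48; it only walks a small hand-built tree and emits one rule per root-to-leaf path. The attribute names, values and risk labels in the toy tree are hypothetical and are not rules from this study.

```python
from typing import Dict, List, Tuple, Union

# A leaf is a risk label (str); an internal node is (attribute, {value: subtree}).
Tree = Union[str, Tuple[str, Dict[str, "Tree"]]]


def tree_to_rules(tree: Tree, path: Tuple[Tuple[str, str], ...] = ()) -> List[str]:
    """Return one IF...THEN rule for every root-to-leaf path of the tree."""
    if isinstance(tree, str):
        # Leaf reached: the conditions collected along the path form the antecedent.
        conditions = " AND ".join(f"{attr} = {value}" for attr, value in path) or "TRUE"
        return [f"IF {conditions} THEN risk = {tree}"]
    attribute, branches = tree
    rules: List[str] = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules


# Hypothetical toy tree, for illustration only.
toy_tree: Tree = (
    "regular_electricity",
    {
        "yes": "low",
        "partially": "medium",
        "no": ("vectors_present", {"yes": "high", "no": "medium"}),
    },
)

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each rule has as many conditions as the depth of its leaf, which is why deeper trees produce longer and less readable rules.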
The evaluation during the post-processing stage considered both the computational performance and the quality of the set of rules. To evaluate computational performance, two measures were considered, coverage and success rate, understood as follows:
coverage: indicates the number of examples covered by a rule. High coverage with a high success rate can indicate a common sense rule.
success rate: the percentage of correctly classified cases in relation to coverage, indicating the credibility of the rule. It was calculated using the following expression:
\[ \text{success rate} = \frac{\text{coverage} - \text{error}}{\text{coverage}} \times 100 \]
In this expression, the error is the number of covered cases incorrectly classified by the algorithm.
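The two performance measures can be sketched in Python as follows. The sketch assumes that the success rate equals (coverage - error) / coverage x 100, which is our reading of the definitions above, and it uses a hypothetical rule and a hypothetical five-example data set rather than data from the study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RuleStats:
    coverage: int  # examples that satisfy every condition of the rule
    errors: int    # covered examples whose class differs from the rule's conclusion

    @property
    def success_rate(self) -> float:
        """Percentage of covered examples that the rule classifies correctly."""
        if self.coverage == 0:
            raise ValueError("rule covers no examples; success rate is undefined")
        return 100.0 * (self.coverage - self.errors) / self.coverage


def evaluate(conditions: Dict[str, str], conclusion: str,
             examples: List[Dict[str, str]]) -> RuleStats:
    """Count coverage and errors of one IF...THEN rule over a list of examples."""
    covered = [e for e in examples
               if all(e.get(attr) == value for attr, value in conditions.items())]
    errors = sum(1 for e in covered if e["risk"] != conclusion)
    return RuleStats(coverage=len(covered), errors=errors)


# Hypothetical examples, for illustration only.
examples = [
    {"regular_electricity": "no", "vectors_present": "yes", "risk": "high"},
    {"regular_electricity": "no", "vectors_present": "yes", "risk": "high"},
    {"regular_electricity": "no", "vectors_present": "yes", "risk": "medium"},
    {"regular_electricity": "no", "vectors_present": "yes", "risk": "high"},
    {"regular_electricity": "yes", "vectors_present": "no", "risk": "low"},
]

stats = evaluate({"regular_electricity": "no", "vectors_present": "yes"}, "high", examples)
print(stats.coverage, stats.errors, stats.success_rate)
```

In this toy case the rule covers four examples and misclassifies one of them, so its success rate is 75%.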
To evaluate quality, the rules were analyzed in terms of how understandable and interesting they were to specialists who had not been involved in stage one. To evaluate understandability, the size of the rule, that is, the number of conditions per rule, was considered.
To evaluate how interesting the rules were, they were analyzed by three specialists in the field of collective health, selected according to the following criteria: public health workers (involved in service provision) for more than two years, with at least a professional specialty in collective health or family health. The specialists attributed one of three possible scores to each rule: irrelevant (incompatible with reality); confirms their knowledge (confirms what they already know); and interesting (shows patterns that agree with reality, but were unknown until then). The estimate of how interesting a rule was derived from the scores given by the specialists, such that the larger the mean, the more interesting the rule.
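As a rough illustration of this estimate, the sketch below averages the scores of the three specialists for each rule. The numeric coding (1 = irrelevant, 2 = confirms knowledge, 3 = interesting) is an assumption made for the example, and the rule identifiers and ratings are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Assumed numeric coding of the three possible scores (not stated in this form in the text).
SCORE = {"irrelevant": 1, "confirms knowledge": 2, "interesting": 3}


def interest_estimate(ratings: List[str]) -> float:
    """Mean of the specialists' scores for one rule; larger means more interesting."""
    return mean([SCORE[r] for r in ratings])


# Hypothetical ratings by three specialists for two rules.
ratings_by_rule: Dict[str, List[str]] = {
    "rule_1": ["irrelevant", "irrelevant", "confirms knowledge"],
    "rule_2": ["confirms knowledge", "interesting", "confirms knowledge"],
}

# Rules ordered from most to least interesting.
ranked = sorted(ratings_by_rule,
                key=lambda name: interest_estimate(ratings_by_rule[name]),
                reverse=True)
print(ranked)
```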
The study was approved by the Research Ethics Committee of the Pontifícia Universidade Católica do Paraná and by the Research Ethics Committee of the Secretaria Municipal de Saúde de Curitiba.
From a list of 49 attributes (Table 1), the specialists who took part in the attribute identification stage classified 41 attributes as essential and eight as irrelevant. The set of attributes involves the conditions of infrastructure, hydrography, soil, recreational areas, community characteristics and vectors. The following attributes were considered irrelevant: supermarket, grocery, bar, town squares, irregular land, climate, occasionally moist land and hospital.
The database was composed of the information collected by the questionnaire, accounting for 46.2% of the total 1,148 CHA in Curitiba. In some questionnaires, several values were assigned to one attribute, making the correct classification of a micro-area more difficult.
To improve the performance of the data-mining process, the values of some attributes had to be transformed. For example, the attribute "distribution of electricity" could be assigned both "regular" and "clandestine"; therefore, the attribute became named "regular distribution of electricity", with the options "yes", "no" and "partially".
This transformation increased the success rate of the classifier from 87.5% to 88.7%; decreased the number of rules generated from 130 to 79; decreased the number of rules not covered by any example in the database from 57 to 10; and produced rules that were easier to understand, because the attributes in the antecedent of each rule took more objective values.
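The recoding of a multi-valued attribute into a single yes/no/partially value can be sketched as below. The mapping is an assumption based on the electricity example: only "regular" marked becomes "yes", "regular" marked together with another value becomes "partially", and anything else becomes "no". The function name and the handling of unanswered items are hypothetical.

```python
from typing import Iterable, Optional


def recode_regular(values: Iterable[str], regular: str = "regular") -> Optional[str]:
    """Collapse the values marked for one attribute into 'yes', 'no' or 'partially'."""
    marked = set(values)
    if not marked:
        return None  # unanswered item; left for the data-cleaning step
    if marked == {regular}:
        return "yes"
    if regular in marked:
        return "partially"  # regular supply coexists with another kind
    return "no"


# Hypothetical questionnaire answers for "distribution of electricity".
answers = [["regular"], ["regular", "clandestine"], ["clandestine"], []]
recoded = [recode_regular(a) for a in answers]
print(recoded)
```

After recoding, every micro-area has at most one value per attribute, which is what a decision tree learner expects for a nominal attribute.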
The 68 rules obtained were expressed in the following format:
IF <condition> (antecedent) → THEN <conclusion> (consequent)
The statistical results for coverage, success rate and number of conditions per rule are presented in Table 2. The results show the performance and quality measures used to evaluate the set of 68 rules.
Considering the median of the scores given to each rule, the rules were divided into two sets. Set A was composed of the 37 inconsistent rules and had a median value of one. Set B was composed of the 31 rules that confirmed the specialists' knowledge and had a median value of two (Table 3). Table 4 presents a comparison between Set A (rules considered inconsistent) and Set B (rules that confirmed the knowledge of specialists). Although the maximum number of conditions observed in a rule was 12, 66.7% of the rules in Set A were close to the mean of 5.89 conditions (SD=2.4). On average, the number of conditions per rule was slightly greater in Set A than in Set B.
For the total set of rules, the average number of conditions per rule (5.74; SD=2.11) remained close to the ideal for a practical rule: 73.5% of the rules had between four and seven conditions. The success rate for the set of rules was 91.6% (SD=14.00), which is a satisfactory result for the set obtained.
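Summary figures of this kind (mean, standard deviation and share of rules with four to seven conditions) can be computed as in the sketch below. The list of rule sizes is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(conditions_per_rule: List[int], low: int = 4, high: int = 7) -> Tuple[float, float, float]:
    """Return (mean, sample SD, percentage of rules with low..high conditions)."""
    in_range = sum(1 for n in conditions_per_rule if low <= n <= high)
    share = 100.0 * in_range / len(conditions_per_rule)
    return mean(conditions_per_rule), stdev(conditions_per_rule), share


# Hypothetical number of conditions for eight rules.
sizes = [3, 4, 5, 6, 7, 8, 5, 6]
avg, sd, share = summarize(sizes)
print(avg, round(sd, 2), share)
```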
The rules obtained were also utilized to identify the attributes that best differentiated the micro-areas into low, medium or high risk. In this way, the attributes positioned among the first five associations and with a greater frequency were considered as the most important (Table 5).
Of the attributes considered essential by specialists during the identification stage, six did not appear in any rule, namely: policing; cat; dog; pigeon; sufficient space in childcare; and public lighting. Therefore, these can be given less weight in the classification of micro-areas.
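The attribute analysis described above can be sketched as two small functions: one counts how often each attribute appears among the first five conditions of a rule, and the other lists essential attributes that appear in no rule. The rules below are hypothetical; only "policing" and "public_lighting" echo attribute names mentioned in the text.

```python
from collections import Counter
from typing import List, Tuple

Rule = List[Tuple[str, str]]  # ordered (attribute, value) conditions of one rule


def rank_attributes(rules: List[Rule], top_positions: int = 5) -> List[Tuple[str, int]]:
    """Frequency of each attribute among the first `top_positions` conditions."""
    counts: Counter = Counter()
    for conditions in rules:
        for attribute, _value in conditions[:top_positions]:
            counts[attribute] += 1
    return counts.most_common()


def unused_attributes(rules: List[Rule], essential: List[str]) -> List[str]:
    """Essential attributes that never appear as a condition in any rule."""
    used = {attribute for conditions in rules for attribute, _ in conditions}
    return sorted(set(essential) - used)


# Hypothetical rules, for illustration only.
toy_rules: List[Rule] = [
    [("regular_electricity", "no"), ("vectors_present", "yes")],
    [("regular_electricity", "yes"), ("paved_streets", "yes")],
    [("regular_electricity", "partially")],
]

ranking = rank_attributes(toy_rules)
missing = unused_attributes(toy_rules, ["regular_electricity", "policing", "public_lighting"])
print(ranking[0], missing)
```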
The identification of homogeneous areas of risk helps in the prioritization of collective actions focused on disease prevention, directed to territorial spaces where inequities are greater, which results in a greater impact on the risk associations.1
Thus, some of the attributes identified as essential, including social resources and recreation areas, can be altered by public policies. Other attributes, such as the existence of vectors, are modified through intersectoral actions and with community participation. Therefore, an information system that monitors the association of these attributes can assist in the planning of actions at the local, regional and central levels. It even allows for the identification of some conditions to be improved by incentivizing the community.
The fact that the number of conditions per rule in Set A is only slightly larger than in Set B suggests that the complexity of the rules did not interfere with the specialists' evaluation. This weakens the hypothesis that the specialists had difficulty interpreting the rules and therefore judged them inconsistent.
Nonetheless, the average coverage for Set B indicates a greater probability of its rules being considered common sense. Although Set B confirms the knowledge of specialists, its success rate did not surpass that of Set A.
The average coverage for Set B was greater than that of Set A. This relationship suggests that the rules contained in Set B tend to represent common sense, which was confirmed by the opinions of specialists when they described the rules as confirming their knowledge.
Thus, the rules of Set B (Table 3), despite having lower computational performance, are, according to the evaluation by specialists, the ones that better classify a micro-area in relation to the risk contained in the physical environment.
This unexpected divergence between specialist opinion and the statistical measures indicates that Set A may contain some interesting rules. Nonetheless, in evaluating the rules, the specialists may have been resistant to accepting new patterns or to understanding new models that contradicted previous knowledge.
Detailing and collectively discussing the diversity that arises from different perceptions of the territory helps to close the gap between the problems identified and the possible solutions, which should be collectively prioritized.5
Data mining provided a set of useful and understandable rules able to characterize micro-areas and classify them by degree of risk, based on the characteristics of the physical environment. Nonetheless, the physical environment is not the only factor for classifying a micro-area, since an effective classification should also include information about the epidemiology of the region, the organization of the community and administrative aspects.
The use of the proposed rules allows a micro-area to be classified in a faster, less subjective way that maintains a standard between the health teams, overcoming the influence of each individual perception.
The influence of subjectivity can be understood by the fact that different participants in the evaluation process have their own set of personal values, constructed from their experience and interaction with different cultural, economic and social contexts. This greatly influences the importance given to certain attributes to the detriment of others.11
The classification of risk micro-areas is an important management and service tool because it involves the distribution of resources and services for the population of a given territory. Performing the classification in a way that combines the inherent subjectivity of the process with more objective analytical methods allows for the optimization of actions and resources.
1. Chiesa AM, Westphal MF, Kashiwagi NM. Geoprocessamento e a promoção da saúde: desigualdades sociais e ambientais em São Paulo. Rev Saude Publica. 2002;36(5):559-67. DOI:10.1590/S0034-89102002000600004
2. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine. 1996;17(3):37-54.
3. Han J, Kamber M. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann; 2001.
4. Rezende SO, Pugliesi JB, Melanda EA, de Paula MF. Mineração de dados. In: Rezende SO, editor. Sistemas inteligentes: fundamentos e aplicações. Barueri: Manole; 2005. p.307-35.
5. Ribeiro PT. Direito à saúde: integridade, diversidade e territorialidade. Cienc Saude Coletiva. 2007;12(6):1525-32. DOI:10.1590/S1413-81232007000600014
6. Silva AMR, Oliveira MSM, Nunes EFPA, Torres ZF. A unidade básica de saúde e seu território. In: Andrade SM, Soares DA, Cordoni Junior L, organizadores. Bases da saúde coletiva. Londrina: UEL; 2001. p.145-60.
7. Silva LMV, Paim JS, Costa MCN. Desigualdades na mortalidade, espaço e estratos sociais. Rev Saude Publica. 1999;33(2):187-97. DOI:10.1590/S0034-89101999000200011
8. Souza CMN, Moraes LRS, Bernardes RS. Doenças relacionadas à precariedade dos sistemas de drenagem de águas pluviais: proposta de classificação ambiental e modelos causais. Cad Saude Coletiva (Rio J). 2005;13(1):157-68.
9. Takeda S. A organização de serviços de atenção primária à saúde. In: Duncan BB, Schmidt MI, Giugliani ERJ, organizadores. Medicina ambulatorial: condutas clínicas em atenção primária baseadas em evidências. 3.ed. Porto Alegre: Artmed; 2004. p.76-87.
10. Teixeira CF. Promoção e vigilância da saúde no contexto da regionalização da assistência à saúde no SUS. Cad Saude Publica. 2002;18(Supl):153-62. DOI:10.1590/S0102-311X2002000700015
11. Uchimura KY, Bosi MLM. Qualidade e subjetividade na avaliação de programas e serviços em saúde. Cad Saude Publica. 2002;18(6):1561-9. DOI:10.1590/S0102-311X2002000600009
R. Imaculada Conceição, 1155. Prado Velho
80215-901 Curitiba, PR, Brasil
Article based on the master's dissertation by Von Stein Júnior A, presented to the Programa de Pós-Graduação em Tecnologia em Saúde da Pontifícia Universidade Católica do Paraná, in 2008.
a The University of Waikato. WEKA Version 3.5. [computer program]. [cited 2007 Mar 02]. Available from: http://www.cs.waikato.ac.nz/ml/weka/