Abstract
Patients with post-COVID-19 syndrome benefit from health promotion programs. Their rapid identification is important for the cost-effective use of these programs. Traditional identification techniques perform poorly, especially in pandemics. A descriptive observational study was carried out using 105,008 prior authorizations paid by a private health care provider, applying an unsupervised natural language processing method based on topic modeling to identify patients suspected of being infected by COVID-19. A total of 6 models were generated: 4 using the BERTopic algorithm and 2 using Word2Vec. The BERTopic model automatically creates disease groups. In the Word2Vec model, manual analysis of the first 100 cases of each topic was necessary to define the topics related to COVID-19. The BERTopic model with more than 1,000 authorizations per topic without word treatment selected more severe patients - average cost per paid prior authorization of BRL 10,206 and total expenditure of BRL 20.3 million (5.4%) in 1,987 prior authorizations (1.9%). It had 70% accuracy compared to human analysis and 20% of cases with potential interest, all subject to analysis for inclusion in a health promotion program. It missed a substantial number of cases when compared to the traditional structured query language (SQL) search and identified other groups of diseases - orthopedic, mental and cancer. The BERTopic model served as an exploratory method to be used in case labeling and subsequent application in supervised models. The automatic identification of other diseases raises ethical questions about the treatment of health information by machine learning.
Keywords:
COVID-19; Natural Language Processing; Health Care; Selection Criteria; Proprietary Health Facilities
Introduction
The COVID-19 pandemic [1] reinforced the historical concern of researchers regarding the threat of new viruses and the mutation of existing ones. It put pressure on already overburdened health care services [2] through severe forms of the disease (approximately 25% of vulnerable patients or patients with comorbidities) and a high mortality rate (5.6% in the first wave [3]). Additionally, structural changes in health care services, a greater impact on low- and middle-income countries [4], ethical conflicts in the prioritization of care [5] and financial challenges accentuated the impact of the pandemic. These challenges were aggravated by the emergence of long COVID-19 or post-COVID syndrome [6,7], which affects 10% to 30% of patients [8]. New pandemics are expected to emerge in the future [9], and early identification of patients will be important for the correct and cost-effective adoption of care.
The treatment of information is a challenge, due to its increasing volume [10] or due to the peculiarities of the different areas of knowledge. In health care, data are incomplete, heterogeneous, multidimensional, unstructured and inaccurate [11,12]. To address these challenges, knowledge discovery in databases (KDD) was proposed for the mining (data mining) of large volumes of data (big data) [13,14].
Machine learning (ML) techniques enable the algorithm to learn patterns that are unidentifiable by classification or prediction techniques [15]. This learning can be supervised - with labels that classify the object of study - or unsupervised - with no classification. In the unsupervised case, exploratory techniques are used for the creation of labels and the subsequent application of supervised techniques [15]. The labeling of medical data is difficult and depends on specialized work, which has been a limiting factor in studies of the pandemic [16]. Thus, unsupervised exploratory techniques are an important step in the application of ML to large volumes of data for knowledge discovery.
Text data mining refers to the discovery of patterns as proposed by Fayyad et al. [10], while natural language processing (NLP) is seen as a branch of artificial intelligence that deals with human language [17] or makes this language understandable to computers [18], thereby enabling different approaches, including the grouping of texts by topics (“topic modeling”). Topics are groups of similar objects, which makes topic modeling a particular case of clustering.
Health care providers process data necessary for regulatory [19] and health care cohesion. Among these processes, prior authorization verifies the eligibility of patients and the coherence between the disease and the treatment. It is requested before care is provided. This process is indirectly regulated by the Brazilian National Supplementary Health Agency (ANS), which guarantees service deadlines [20].
Prior authorization analysis provides an opportunity for early patient selection. However, due to medical confidentiality, there is no information on the International Classification of Diseases, 10th revision (ICD-10). Also, the requested care procedures do not allow the correct correlation with the disease to be treated and the complementary information of the prior authorization is not structured. Therefore, there is an opportunity for innovative solutions in the identification of patients in health care providers in Brazil. This is an important economic sector that covers approximately 25% of the Brazilian population with expenditures equivalent to 5.7% of gross domestic product (GDP) [21].
There are few studies using NLP in health care in Brazil. Duval & Silva [22] built a pharmacosurveillance system using Twitter to detect adverse events caused by drugs, taking as a model the drug doxycycline for the treatment of malaria. Moreira & Namen [23] proposed a hybrid model through which NLP created patient clusters using unstructured data. These clusters were incorporated into structured data, improving the accuracy of the diagnosis of patients with suspected dementia [23]. Diniz et al. [24] created a mobile phone system to identify patients with suicidal ideation that allowed the individual quantification of moment-to-moment risk (“digital phenotyping”), enabling the action of health care professionals.
No studies using supplementary health care data were found, probably due to the difficulty of access to data in this health care sector, limited by barriers of professional and commercial secrecy. This study fills this gap and contributes to the application of ML methods in free software through a real case study.
The objective of this article is to describe an unsupervised NLP method to identify patients with suspected COVID-19 infection through the analysis of a real database of prior authorizations issued by a private health care provider in the auto-management mode of the State of São Paulo, Brazil.
Methods
Study design and population
This is a descriptive observational study, based on secondary data from prior authorizations of a private health care provider in the State of São Paulo, in the auto-management mode (operator). Prior authorizations are requested by health care providers or beneficiaries before consultations, examinations, hospitalizations and other elective procedures. Emergency care authorization is automatically released in compliance with the rules of the legislation. For hospitalizations, only one authorization is issued covering the entire period of hospitalization of the patient. The payment of care to the provider only occurs upon submission of the prior authorization.
The database studied is anonymized; however, each prior authorization is issued to a specific beneficiary and there is a one-to-one relationship between prior authorization and beneficiary. The proposed method selects authorizations that contain information about suspected COVID-19 infection, and therefore the selected authorizations are considered to represent a patient with suspected COVID-19 infection.
The health care provider had, in the period, 29,336 beneficiaries exposed, of which 14,663 (50%) were female and 28,820 (98.2%) resided in the State of São Paulo. The mean age of the group was 45 years.
Database and variables studied
Each authorization contains a blank text field, “clinicalindication”, in which the reason or justification for the prior authorization request is indicated. Filling in this field is not mandatory. The provider may only attach documents justifying the request for the procedure. In this case, it is common to fill in the field with text “attached” or not to fill it in. The “clinicalindication” variable is the variable of interest in this study.
Prior authorizations issued between September 1, 2019 and June 30, 2022 were selected (n = 742,901). Those missing the justification (missing values) in the “clinicalindication” field (n = 558,530, 75%) were excluded. Therefore, 184,371 (25%) prior authorizations were included in this study, of which 105,008 contain payment information. Each prior authorization contains at least one health care event identified in the event structure and event description variables, corresponding respectively to the code of the requested event and its description. Authorizations are classified according to: type (“treatmenttype”), regime (“treatmentregime”) and objective of care (“treatmentobjective”). Filling in the ICD-10 field is not mandatory. Authorizations have an expiration date (“expirationdate”) and can be canceled, reissued or revalidated according to the provider’s administrative criteria. Box 1 contains the variables present in the database and used in this study.
Natural language processing
Two NLP models were applied - BERTopic (https://maartengr.github.io/BERTopic/index.html) and Word2Vec - described briefly below.
BERTopic model
The BERTopic model is an unsupervised algorithm for vector-based topic modeling. Topic modeling is a mining method whose objective is to discover hidden patterns considering the context and classify the respective texts into similar groups, called topics [25,26].
Initially, each document, in this case each prior authorization, is converted to its vector representation (word embedding) using the Bidirectional Encoder Representations from Transformers (BERT) model. The dimensionality of this representation is reduced using the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) technique, and the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm is applied to create topics of documents that are semantically similar [27]. For the description of each topic, we used the term frequency - inverse document frequency (TF-IDF) [28,29,30,31] method. Documents not classified by the model are grouped into a specific topic containing outliers. In this work, the methods were applied through a free Python library called BERTopic [28].
Two values were used for the minimum number of authorizations in each topic created: 500 or more (BERTopic +500) and 1,000 or more (BERTopic +1,000), set through the min_topic_size parameter of the model. Since it is an automatic model, the total number of topics created depends on this parameter. The language parameter was set to multilingual for modeling the text in Portuguese and the vectorization model - embedding_model - to all-MiniLM-L6-v2, which is the model's default.
To identify the topics belonging to COVID-19, the get_topic_info() method of the model itself was used, which generates the automatic description of the topic.
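The configuration described above might be expressed roughly as follows. This is only a sketch under stated assumptions: the helper names `build_bertopic_params` and `fit_topics` are hypothetical, while `language`, `min_topic_size`, `embedding_model`, `fit_transform` and `get_topic_info` belong to the BERTopic library. The import is deferred so that the parameter helper can be inspected without the library installed. As far as the library documentation indicates, an explicitly passed `embedding_model` takes precedence over the model that `language` would otherwise select.

```python
def build_bertopic_params(min_topic_size):
    """Return the constructor arguments reported in the text (hypothetical helper)."""
    return {
        "language": "multilingual",
        "min_topic_size": min_topic_size,
        "embedding_model": "all-MiniLM-L6-v2",
    }


def fit_topics(docs, min_topic_size):
    """Fit BERTopic on a list of strings; return the topic ids and the topic table."""
    from bertopic import BERTopic  # deferred import: requires the bertopic package

    model = BERTopic(**build_bertopic_params(min_topic_size))
    topics, _probabilities = model.fit_transform(docs)
    # get_topic_info() returns one row per topic with its size and automatic name.
    return topics, model.get_topic_info()


# The two settings compared in the study.
PARAMS_500 = build_bertopic_params(500)
PARAMS_1000 = build_bertopic_params(1000)
```

Calling `fit_topics(texts, 1000)` on the list of clinical indication texts would then correspond to the BERTopic +1,000 setting; the outlier topic is reported by the library with the id -1.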
Word2Vec model
Word2Vec is an NLP model that uses neural networks to learn the representation of words (word embedding) in a high-dimensional vector space, capable of capturing the semantic and syntactic context of words in a given text corpus. For the comparative analysis, we used the continuous Bag-of-Words [32,33] model of the Word2Vec algorithm. The texts of the “clinicalindication” variable were separated into words (tokens) using the NLTK library (Natural Language Toolkit - https://www.nltk.org/), on which we applied the Word2Vec algorithm from the Gensim library (https://pypi.org/project/gensim/), using a vector size equal to 300. The word vectors of each text were then averaged into a single vector per authorization, and these vectors were grouped into 20 clusters using the K-Means algorithm. These clusters were considered the topics of this model. This method does not automatically assign names to topics. To identify clusters with suspected cases of COVID-19 infection, each of the 20 clusters was manually analyzed by the main researcher. To this end, the first 100 authorizations, sorted in descending order of expenditure, were selected in each cluster. Each text present in the clinical indication variable was analyzed and the respective cluster was classified, or not, in the COVID-19 group.
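The averaging step mentioned above can be illustrated with a small sketch. It assumes that one vector per authorization is obtained as the mean of the vectors of its tokens; the function `mean_vector` and the two-dimensional toy table `TOY_EMBEDDINGS` are hypothetical stand-ins for the 300-dimensional vectors that Gensim would learn from the corpus.

```python
def mean_vector(tokens, embeddings, size):
    """Average the vectors of the tokens that exist in `embeddings`.

    Tokens without a vector are skipped; a text with no known token
    is mapped to the zero vector.
    """
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return [0.0] * size
    return [sum(vec[i] for vec in found) / len(found) for i in range(size)]


# Toy 2-dimensional vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "covid": [1.0, 0.0],
    "dispneia": [0.8, 0.2],
    "fratura": [0.0, 1.0],
}

doc_a = mean_vector(["covid", "dispneia"], TOY_EMBEDDINGS, 2)
doc_b = mean_vector(["fratura", "desconhecida"], TOY_EMBEDDINGS, 2)
doc_c = mean_vector(["desconhecida"], TOY_EMBEDDINGS, 2)
```

The resulting document vectors would then be passed to a clustering routine such as K-Means with 20 clusters, as the text describes.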
Each of the two models was applied to the descriptions - treated or not - contained in the prior authorization of the variable “clinicalindication”. The treatment of the variable is recommended to improve the performance of the Word2Vec model.
The treatment of the clinical indication variable occurred as follows: conversion of all words into lowercase, removal of stopwords in Portuguese, exclusion of most common words in health and exclusion of special characters. No accents or other features of Portuguese were replaced. The words COVID-19 and SARS-CoV-2 were turned into covid. The ICD-10-related words present in the clinical indication variable were also standardized.
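The treatment steps listed above could be sketched as follows. The word lists are short illustrative subsets, since the actual stopword and health-term lists used in the study are not given, and the function name `treat_clinical_indication` is hypothetical. Accented letters are kept, in line with the statement that Portuguese accents were not replaced.

```python
import re

# Illustrative subsets only; the study's actual lists are not reported.
STOPWORDS_PT = {"de", "a", "o", "e", "com", "para", "em", "por"}
COMMON_HEALTH_WORDS = {"paciente", "solicito"}

# Variants of COVID-19 and SARS-CoV-2, matched after lowercasing.
COVID_PATTERN = re.compile(r"covid\s*-?\s*19|sars\s*-?\s*cov\s*-?\s*2")


def treat_clinical_indication(text):
    """Lowercase, normalize virus names, drop special characters and filtered words."""
    text = text.lower()
    text = COVID_PATTERN.sub("covid", text)
    # \w is Unicode-aware for str patterns, so accented letters survive.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [
        token
        for token in text.split()
        if token not in STOPWORDS_PT and token not in COMMON_HEALTH_WORDS
    ]
    return " ".join(tokens)


example_1 = treat_clinical_indication("Paciente com COVID-19, dispnéia!")
example_2 = treat_clinical_indication("SARS-CoV-2 positivo")
```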
Evaluation of the quality of the classification generated by the models
This resulted in six models: BERTopic +500, BERTopic +1,000 and Word2Vec, each with and without text treatment of the clinical indication variable (treated and untreated).
To assess the quality of the classification, the main author analyzed the BERTopic +1,000 model because it presented the highest average cost per authorization. The authorizations classified by this model as suspected or COVID-19-related events were sorted in descending order of cost and the first 100 were selected. The clinical indication text of each of these authorizations was manually analyzed and classified into classes of interest for the study. This manual classification was compared to the automatic classification generated by this model.
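A minimal sketch of this quality check, assuming each authorization flagged by the model is a record with a cost and a manually assigned label. The field names `cost` and `manual_label`, the label values and the function `top_n_agreement` are hypothetical.

```python
def top_n_agreement(records, n):
    """Share of the n most expensive model-flagged records that the
    manual review also labeled as COVID-19."""
    ranked = sorted(records, key=lambda r: r["cost"], reverse=True)[:n]
    if not ranked:
        return 0.0
    confirmed = sum(1 for r in ranked if r["manual_label"] == "covid")
    return confirmed / len(ranked)


# Toy records, invented for illustration; all were flagged by the model.
flagged = [
    {"cost": 900.0, "manual_label": "covid"},
    {"cost": 100.0, "manual_label": "unrelated"},
    {"cost": 500.0, "manual_label": "respiratory_possible"},
    {"cost": 700.0, "manual_label": "covid"},
    {"cost": 300.0, "manual_label": "covid"},
]

agreement = top_n_agreement(flagged, 4)
```

With the study's figures, the same ratio applied to the first 100 authorizations would give the share of cases that the manual review confirmed.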
For comparison with traditional structured query language (SQL) research methods, all prior authorizations containing the words covid, sars, coronavirus and coronavírus in uppercase or lowercase letters were selected and compared with the models generated using the authorization number as a binding index and identifying whether they were part of the groups identified as suspected COVID-19 infection.
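The keyword search and its comparison with a model's selection can be sketched as set operations keyed by authorization number. The study ran the search in SQL; the Python regular expression below is only a stand-in, and the function names and toy texts are hypothetical.

```python
import re

# Case-insensitive search for the four words used in the traditional method.
KEYWORDS = re.compile(r"covid|sars|coronavirus|coronavírus", re.IGNORECASE)


def keyword_matches(authorizations):
    """authorizations: dict mapping authorization number -> clinical indication."""
    return {number for number, text in authorizations.items() if KEYWORDS.search(text)}


def compare(keyword_ids, model_ids):
    """Split authorization numbers by which method selected them."""
    return {
        "both": keyword_ids & model_ids,
        "lost_by_model": keyword_ids - model_ids,
        "extra_in_model": model_ids - keyword_ids,
    }


# Toy data, invented for illustration.
texts = {
    101: "Suspeita de COVID-19",
    102: "Síndrome gripal há 10 dias, desconforto respiratório",
    103: "CORONAVÍRUS confirmado",
    104: "Fratura de fêmur",
}
model_selection = {101, 102}

result = compare(keyword_matches(texts), model_selection)
```

Here `lost_by_model` corresponds to cases found by the traditional search but not classified by the model, and `extra_in_model` to cases the model selected without any of the search words.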
Prior authorization cost
Prior authorization cost corresponds to the health care expenditures of each prior authorization. The payment basis contains the expenses paid to service providers net of disallowance. Costs were obtained using the prior authorization number as the connecting key.
The total amount paid corresponds to the sum of all expenses in the period from September 2019 to July 2022 found in the payment basis for each prior authorization. The number of authorizations paid corresponds to the count of authorizations with an amount spent per authorization greater than BRL 0.00.
The average cost per paid authorization corresponds to the ratio between authorization expenditure and the number of paid authorizations. In this study, the most severe cases were those with the highest average cost per prior authorization. Expenditures are presented in reais and without inflation adjustment.
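The three cost indicators defined above can be computed as in the sketch below, which assumes the payment data is a list of (authorization number, amount paid) rows. The function name `cost_summary` and the toy rows are hypothetical.

```python
def cost_summary(payments):
    """payments: iterable of (authorization_number, amount_paid) rows."""
    totals = {}
    for number, amount in payments:
        # Sum all payments recorded for the same prior authorization.
        totals[number] = totals.get(number, 0.0) + amount
    # A paid authorization is one whose summed expenditure is above zero.
    paid = {number: value for number, value in totals.items() if value > 0}
    total_paid = sum(paid.values())
    n_paid = len(paid)
    average_cost = total_paid / n_paid if n_paid else 0.0
    return {"total_paid": total_paid, "n_paid": n_paid, "average_cost": average_cost}


# Toy rows, invented for illustration.
rows = [(1, 100.0), (1, 50.0), (2, 0.0), (3, 250.0)]
summary = cost_summary(rows)
```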
Access to data was granted through a confidentiality and scientific cooperation agreement with the provider and approved by the Research Ethics Committee of Ribeirão Preto School of Medicine, São Paulo University (HCFMUSP/RP; protocol n. 55685722.9.0000.5440).
Results
A total of 742,901 authorizations were issued in the 34 months analyzed, of which 184,371 (24.9%) had the clinical indication field filled in with at least one number or word and were included in this study. Of these, 105,008 were paid authorizations (14.1% of all authorizations issued). The total expense in the period was BRL 374,089,836. This expenditure is right-skewed (R(105,008) = 0.438, p < 0.001; skewness 41.3) (Figure 1).
Figure 1. Cumulative percentage of expenditure (up to 50%) and cumulative percentage of prior authorizations (%) of the supplementary health provider. São Paulo, Brazil, September/2019 to June/2022.
The most frequent health care events in the analyzed authorizations were: emergency room consultation (6.1% of the analyzed authorizations contain this event), individual psychotherapy session (5.7%) and RT-PCR screening for COVID-19 (5%). A total of 96.2% of the prior authorizations have no description of ICD-10 and only 587 (0.3%) have ICD-10 B34.2 - “Coronavirus infection, unspecified”.
The clinical indication variable had 64,917 (35.2%) authorizations with only one word or number and 77.6% of authorizations had up to 5 words. After treating the variable, the most common words were “covid” appearing 6,561 times, “pronto” (3,821) and “socorro” (3,692) [emergency room]. The longest sentence was 104 words.
As for treatment type, 90.7% were clinical treatments, 7.8% surgical and 0.3% obstetric. Regarding the health care regime, 81% were outpatient care, 16.9% hospital care, and 1% home care. Inpatient clinical care corresponded to 15,741 authorizations - 8.5% of the total (Table 1).
Regarding the objective of care, 75.1% were for diagnosis and 6.5% reparative treatment - 18.3% of the prior authorizations had no objective of care filled in. In the outpatient regimen, the diagnostic objective was more frequent (80.6%). In the hospitalization regimen, there is an important group of reparative care (34.5%) (Table 2).
Table 2. Number of prior authorizations analyzed by treatment objective according to care regimen of the supplementary health care provider authorizations. São Paulo, Brazil, September/2019 to June/2022.
In the topics classified as COVID-19, the untreated BERTopic models presented higher average costs per paid authorization - BRL 10,205 in the one with more than 1,000 authorizations and BRL 10,138 in the one with more than 500 authorizations per topic. They correspond respectively to 1.9% (1,987) and 2.3% (2,443) of the authorizations paid and expenses of BRL 20.3 million (5.4% of total expenditure) and BRL 24.8 million (6.6%) respectively. The two models showed a significant number of paid authorizations considered discrepant - 58.8% (61,723) in the BERTopic +1,000 model and 48.3% (50,716) in the BERTopic +500 model (Table 3).
With the treatment of the “clinicalindication” variable, there was an increase in the number of authorizations of suspected cases of COVID infection in the BERTopic model with more than 500 authorizations (to 3.3% of the total authorizations paid) and a decrease in the model with more than 1,000 authorizations (1.7%) followed by a significant reduction in the total expenditure - BRL 5.2 million and BRL 14 million, respectively, when compared to the same models without word treatment, resulting in a decrease in the average costs per authorization in the two models. There was a decrease in the number of prior authorizations considered discrepant - although still high (36.3% in the BERTopic +1,000 model and 45.2% in the BERTopic +500 model) (Table 3).
The treatment of the “clinicalindication” variable substantially modified the indicators of the Word2Vec model. For cases classified as COVID-19, without treatment, this model presented lower numbers for paid authorizations (n = 1,005, 0.5%), total expenditure (BRL 4,909,189, 1.3%) and average cost per authorization (BRL 4,885) than those for the model with word treatment: 5,989 - 5.7%, BRL 30.1 million - 8%, and average cost of BRL 5,021, respectively (Table 3).
The comparison between the six models showed that the BERTopic +1,000 model without treatment has a lower number of authorizations classified as suspected COVID-19 with high total expenditure, while the Word2Vec model with treatment has a higher number of authorizations classified as suspected COVID-19 with higher total expenditure (BRL 30 million), but a lower average cost (Table 3).
The evaluation of the classification quality of the BERTopic +1,000 model shows that, of the first 100 cases analyzed manually, 70 are related to suspicion of or infection by COVID clearly indicated in the text of the clinical indication variable. These patients had expenditure of BRL 11.5 million - 56.5% of the total expenditure identified in this model (Box 2).
Another 20 patients have signs, symptoms or respiratory diseases that may or may not be related to COVID. The expenditure in this group was BRL 2.5 million. Another 8 cases are newborns with respiratory distress, none connected to the disease except one extremely premature newborn born to a mother with COVID. The other 2 cases present respiratory signs and symptoms unrelated to the disease (Box 2). Box 3 shows the first 15 authorizations of this quality assessment with the original description of the prior authorization, the respective manual classification and expenditure per authorization. The analysis of the first 100 cases is shown in Box 2.
The traditional method using SQL and selection of prior authorizations containing the words covid, sars, coronavirus and coronavírus resulted in 3,703 authorizations paid with a total expenditure of BRL 23,611,018 - average cost of BRL 6,376.
By comparing the traditional method with the generated NLP models, there are selected prior authorizations not classified by the models, cases of interest that were lost. These authorizations spread across the different topics of the models but concentrated in the topic with outliers, where it is not possible to make the classification.
In the BERTopic models, the greatest loss of cases occurred in the untreated model with more than 1,000 authorizations - 2,377 (64.2%) authorizations were not classified by the model, with a total expenditure of BRL 8.7 million and an average cost of BRL 3,673. The BERTopic model with more than 500 authorizations without treatment was slightly better - 1,622 (43.8%) unclassified authorizations, expenditure of BRL 5.1 million and average cost per authorization of BRL 3,214. These lost cases have an average cost per authorization almost 3 times lower than those classified by the models. The treatment of the words caused these models to stop classifying the less severe cases; the average costs per authorization of the lost cases were BRL 9,323 and BRL 7,217 in the BERTopic +1,000 and BERTopic +500 models, respectively.
On the other hand, the models classified authorizations not selected in the traditional method. The 362 authorizations in excess in the BERTopic +500 untreated model that do not contain the words of the traditional search have an average cost of BRL 17,196 - an expense of BRL 6.2 million. In the BERTopic +1,000 untreated model, prior authorizations with the same characteristic (661 authorizations) have an average cost of BRL 8,165 and a total expense of BRL 5.4 million. The Word2Vec model with the best performance in this regard - 2,703 authorizations with expense of BRL 11,369,283 and average cost per authorization of BRL 4,206 - is the treated model (Table 4).
The BERTopic models generated other topics of interest - related to cancer (1,500 prior authorizations and BRL 6,662,411 spent), orthopedic diseases (4,531 prior authorizations and BRL 13,675,723 spent) and mental illnesses (3,603 prior authorizations and BRL 818,893 spent). These topics vary depending on the method employed - the BERTopic +1,000 models, treated or untreated, performed worse, generating few additional topics. The topics formed by each model are shown in Boxes 4, 5, 6 and 7.
Discussion
The BERTopic model without word treatment selected more severe patients while the Word2Vec model with word treatment selected less severe patients. As early as 1998, Hernández & Stolfo [34] discussed the difficulty of working with real-world data. This challenge is greater with the use of unstructured data. The 100 cases manually analyzed show differences in how the virus is named, amplified by the peculiarities of the Portuguese language - accents, for example. Another challenge is the limited amount of information - most authorizations were filled out with sentences of up to 5 words. Still, the BERTopic model was able to select cases with the description “flu-like symptoms for 10 days. Respiratory distress. With tachydyspnea” as suspected virus infection. There is no explicit mention of COVID-19 and, in the original Portuguese text, the word for respiratory carries an accent while the word for tachydyspnea does not, an example of the problem of unstructured data.
This difficulty should explain why there are few studies using NLP applied to early detection of the disease. In a review of the use of artificial intelligence tools applied in the response to the pandemic, Syrowatka et al. [35] indicated only 1 NLP-based study for early diagnosis or patient screening. Most studies (65 of 78) used chest image processing techniques. The authors indicate that most studies analyzed are still in the research phase and few are used for decision-making [35]. A specific review on the use of NLP in the pandemic showed the use of topic modeling applied in the search for literature related to COVID-19 and non-adherence to social distancing with use [36].
In a study comparing different topic modeling methods in social media, Egger et al. 3737. Egger R, Yu J. A topic modeling comparison Between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front Sociol 2022; 7:886498. showed that the BERTopic model better separated the topics and its analysis tools enable a better understanding of the interrelations between the topics. Such tools are visual and the authors state that the topics require human interpretation 3737. Egger R, Yu J. A topic modeling comparison Between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front Sociol 2022; 7:886498..
As for human participation, a holistic and multidisciplinary view is needed, based on the human interpretation of the topics (modeling dimension) and the well-being of the patient (health dimension) considering financial aspects (economic dimension).
As an example of the challenge of this holistic view, the models studied have opposite behaviors: one selects severe cases and the other selects less severe cases. The implementation of a health promotion program in the context of post-COVID-19 syndrome involves much more than the simple interpretation of the topics generated by an automatic model. It is a multidisciplinary enterprise that also comprises the design of the program, the identification and correct allocation of patients, their monitoring, and the evaluation of outcomes and financial results.
Post-COVID-19 syndrome patients require a wide range of special care, from reestablishment of previous health conditions to rehabilitation 38. In this context, it is important to note that automatically generated models and the interpretation of their topics, although interesting, are part of a process that is highly dependent on people. Although human resources in the health care field are specialized and expensive, human participation is essential, not only in interpreting the topics generated but also in designing the entire program in line with this interpretation. It is worth using an NLP model in the early identification of diseases as long as a multidisciplinary team conducts the task of providing patients with quality, accessible and sustainable health care.
Specifically considering the informational dimension, an unsupervised model, especially when there is no word treatment, has some advantages. It is not influenced by the researcher. Another advantage is that it supports supervised models by serving as an exploratory technique 39. The necessary human interpretation fits naturally into a patient discovery flow with the following steps: (1) unsupervised exploratory analysis, the object of this study; (2) human interpretation and labeling based on the program design; (3) classification of cases; (4) application of labels in a supervised model with discovery of new patients. A supervised model has better performance and direct measures of quality assessment for classification, but the lack of labels on unstructured information makes it very difficult to apply.
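Steps (2) to (4) of this flow can be illustrated with a minimal sketch. It assumes that a specialist has already labeled a handful of authorization texts after the exploratory step. The texts, labels and function names are hypothetical, and the classifier is a plain multinomial Naive Bayes with Laplace smoothing written only for illustration; it is not the method used in this study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real data would need more care."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    vocabulary = set()
    for text, label in labeled_texts:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labels produced by a specialist in step (2)
labeled = [
    ("flu-like symptoms respiratory distress", "suspected_covid"),
    ("fever cough respiratory distress", "suspected_covid"),
    ("knee fracture surgery", "other"),
    ("malignant neoplasm breast", "other"),
]
model = train_naive_bayes(labeled)

# Step (4): apply the supervised model to a new, unlabeled authorization
new_label = predict(model, "maintained respiratory distress")
print(new_label)
```

In practice the labeled set would be far larger and the classifier would be evaluated on held-out cases, which is what gives a supervised model its direct quality measures.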
In this study, we used two indirect quality assessment methods. The first was human analysis and classification of the authorization requests selected by the BERTopic +1,000 model, chosen because of their possibly greater severity, simulating the step of classification of cases by a specialist. This practical exercise shows the dependence on human interpretation. While most cases (90%) would be of interest for careful evaluation, for example through contact with the patient, others were clearly misclassified (e.g., “respiratory distress”). However, even these can be of interest: one of the cases is a premature newborn of a mother infected by COVID-19, whose analysis may lead to a specific program for pregnant women in this pandemic period.
The second indirect quality assessment method used structured query language (SQL) and indicated that the BERTopic models lose a significant group of suspected patients. These cases were less severe. The loss was not resolved by changing the number of documents per topic (outliers increased) nor by word treatment (the groups became less identifiable). These unclassified cases reinforce that the method needs semantic context, which depends on the quality of the information in the authorization request. Only 25% of prior authorizations have some information and, of these, most have few words, making contextual analysis by the method difficult. There is a long-standing discussion about data quality and its solution in the process of knowledge discovery in databases (KDD) 10. The use of real databases, such as the one used here, has great potential and can even support real-world evidence, provided that the limitations imposed by quality are corrected 40,41.
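This comparison with the structured search can be sketched as a set difference between the cases returned by a keyword query and those selected by the topic model. The table, column names, keyword patterns and list of topic-selected identifiers below are hypothetical stand-ins for the provider's database, and SQLite is used only to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prior_authorization (id INTEGER PRIMARY KEY, description TEXT)"
)
rows = [
    (1, "suspected covid-19, respiratory distress"),
    (2, "flu-like symptoms for 10 days"),
    (3, "covid"),
    (4, "knee fracture"),
    (5, None),  # authorization without any description
]
conn.executemany("INSERT INTO prior_authorization VALUES (?, ?)", rows)

# Traditional search: explicit keywords chosen a priori by the analyst
sql_ids = {
    row[0]
    for row in conn.execute(
        "SELECT id FROM prior_authorization "
        "WHERE LOWER(description) LIKE ? OR LOWER(description) LIKE ?",
        ("%covid%", "%coronav%"),
    )
}

# Hypothetical output of the topic model for the COVID-19 topic
topic_ids = {1, 2}

missed_by_topic_model = sql_ids - topic_ids      # found by SQL only
found_only_by_topic_model = topic_ids - sql_ids  # no explicit keyword

print(sorted(missed_by_topic_model))
print(sorted(found_only_by_topic_model))
```

In this toy data the one-word description is found only by the keyword query, while the description without any explicit mention of the virus is found only by the topic model, which is the kind of complementarity the comparison is meant to expose.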
The Word2Vec model performed better with word treatment when compared to traditional methods, in part because the treatment involved standardizing the different spellings of COVID-19. Although advantageous, this exposes the difficulty of maintaining such a model, and it is necessary to consider whether a traditional SQL search would be preferable. However, traditional methods for extracting data from texts are subject to human error, require specialized knowledge for the a priori choice of words 42, and may not fully take advantage of real-world information. Traditional database analysis options for identifying patients with certain diseases in providers are limited: ICD-10 codes are not reported and paid procedures do not allow identification of the treated disease (e.g., lung computed tomography is paid in the same way for cancers, infections and checkups). What remains is access to a wide range of unstructured information, for which new methods, even if they need adjustments, can be more effective.
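A word treatment of this kind can be sketched with a regular expression that maps spelling variants to a single token. The variants covered and the canonical token `covid19` are assumptions for illustration; the actual list used in a project would come from inspecting the data, which is precisely the maintenance burden mentioned above.

```python
import re

# Hypothetical spelling variants: "covid", "covid19", "covid-19", "COVID 19", "covid_19"
COVID_PATTERN = re.compile(r"\bcovid(?:[\s\-_]*19)?\b", flags=re.IGNORECASE)


def standardize_covid(text):
    """Replace spelling variants of COVID-19 with a single token."""
    return COVID_PATTERN.sub("covid19", text)


variants = ["Covid19", "covid-19", "COVID 19", "covid"]
print({standardize_covid(v) for v in variants})
print(standardize_covid("suspected covid pneumonia"))
```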
In this real setting, with low-quality information and a high volume of prior authorizations with missing values or filled in with only one word, the study demonstrated the viability of an unsupervised model for the analysis of prior authorizations from health care providers, without any previous treatment and using software that is free, easy to use and easy to implement. This type of model is especially useful in the Portuguese language, in which coronavirus and coronavírus are different words for the computer but have identical meanings. It also handles phrases such as “HR: 65BPM RR: 26BPM BP:100/57MMGH SAT: 95% on RA. maintained respiratory distress” because it “understands” that respiratory distress may be related to COVID-19.
Unexpectedly, the model generated other groups of interest. Notably, these included a group of cancer patients, in which the topic formed practically describes the diagnosis attributed to patients (“neoplasm, malignant, breast”), and groups of patients with orthopedic problems and mental disorders. These are patients who can certainly benefit from health promotion programs.
On the other hand, an unsupervised model selected prior authorizations belonging to cancer patients. This raises serious concerns about the ethical and responsible handling of information. This work highlights the problems that these models can cause in the ethical field 43, especially when the focus is on the technical application of NLP and the human dimension is disregarded. There is a need for broad human participation in different stages of the creation of a health promotion program for patients with post-COVID-19 syndrome. This does not make the method less important; it only reinforces the need for human control.
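The accent problem can be made concrete with a short sketch. It shows that the two spellings are different strings and what a manual word treatment would have to do to merge them; the function name is hypothetical and this accent folding is shown only as an illustration, not as a step of the model studied.

```python
import unicodedata


def strip_accents(text):
    """Remove combining diacritical marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


same_before = "coronavirus" == "coronavírus"
same_after = "coronavirus" == strip_accents("coronavírus")
print(same_before, same_after)
```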
To the best of our knowledge, this is the first study employing this technique using supplementary health care data in Brazil.
Study limitations
The model has limited generalizability due to factors such as: (i) the use of a proprietary database; (ii) difficulty in accessing information due to ethical and legal confidentiality; and (iii) the use of a model trained on a non-medical English corpus. We also observed a considerable number of authorizations with semantically poor descriptions, which impaired the classification. The quality assessment of the model depended on manual analysis by the main researcher, which may introduce a bias that is mitigated by disclosing the information and its classification.
Additional studies
The model should be enhanced by a supervised method with the inclusion of labels created by specialists. It can also be enriched with other machine learning methods, such as the analysis of the images attached to the authorizations. It is necessary to discuss the ethical aspects of applying automated models, especially when they classify people into disease groups. It is also necessary to assess the impact of treatment regimens and objectives (e.g., outpatient and diagnostic) on the behavior of the models, and to conduct further studies on the interrelation of different dimensions of knowledge and the respective professionals in the provision of integrative, collaborative and sustainable care.
Conclusion
The BERTopic model without word treatment selected more severe patients with suspected COVID-19 infection than the Word2Vec model with word treatment. On the other hand, with word treatment, the latter model was able to select a larger group of suspected cases. The decision on the best model depends on the complementary human analysis and on the health promotion program designed.
Compared to traditional methods, the BERTopic models failed to classify some suspected cases, mostly of lower severity, which may nonetheless be relevant in an integrated health care model. This reinforces the exploratory character of the method, its intermediate use for the application of a supervised model and the need to compare results with traditional research methods.
On the other hand, the model also generated topics of interest for future studies, with special attention to suspected cases of cancer patients.
The findings demonstrate the importance of human participation: analysis of the generated topics for correct classification, generating information for a supervised model; choice of the best model according to the perspective of health care management (more severe versus less severe patients); design of a health promotion program aligned with this choice; and attention to the ethical aspects of the use of machine learning tools in health care.
References
- 1. Adil MT, Rahman R, Whitelaw D, Jain V, Al-Taan O, Rashid F, et al. SARS-CoV-2 and the pandemic of COVID-19. Postgrad Med J 2021; 97:110-6.
- 2. Noronha KVMS, Guedes GR, Turra CM, Andrade MV, Botega L, Nogueira D, et al. The COVID-19 pandemic in Brazil: analysis of supply and demand of hospital and ICU beds and mechanical ventilators under different scenarios. Cad Saúde Pública 2020; 36:e00115320.
- 3. Li J, Huang DQ, Zou B, Yang H, Hui WZ, Rui F, et al. Epidemiology of COVID-19: a systematic review and meta-analysis of clinical characteristics, risk factors, and outcomes. J Med Virol 2021; 93:1449-58.
- 4. Victora CG, Hartwig FP, Vidaletti LP, Martorell R, Osmond C, Richter LM, et al. Effects of early-life poverty on health and human capital in children and adolescents: analyses of national surveys and birth cohort studies in LMICs. Lancet 2022; 399:1741-52.
- 5. Mannelli C. Whose life to save? Scarce resources allocation in the COVID-19 outbreak. J Med Ethics 2020; 46:364-66.
- 6. Crook H, Raza S, Nowell J, Young M, Edison P. Long covid-mechanisms, risk factors, and management. BMJ 2021; 374:n1648.
- 7. Hope AA, Evering TH. Postacute sequelae of severe acute respiratory syndrome coronavirus 2 infection. Infect Dis Clin North Am 2022; 36:379-95.
- 8. Pavli A, Theodoridou M, Maltezou HC. Post-COVID syndrome: incidence, clinical spectrum, and challenges for primary healthcare professionals. Arch Med Res 2021; 52:575-81.
- 9. Khan A, Khan M, Ullah S, Wei D-Q. Hantavirus: the next pandemic we are waiting for? Interdiscip Sci 2021; 13:147-52.
- 10. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine 1996; 17:37-54.
- 11. Dinov ID. Volume and value of big healthcare data. J Med Stat Inform 2016; 4:3.
- 12. Esfandiari N, Babavalian MR, Moghadam A-ME, Tabar VK. Knowledge discovery in medicine: current issue and future trend. Expert Systems with Applications 2014; 41:4434-63.
- 13. Cios KJ, Kurgan LA. Trends in data mining and knowledge discovery. In: Pal NR, Jain L, editors. Advanced techniques in knowledge discovery and data mining. London: Springer London; 2005. p. 1-26.
- 14. Idri A, Benhar H, Fernández-Alemán JL, Kadi I. A systematic map of medical data preprocessing in knowledge discovery. Comput Methods Programs Biomed 2018; 162:69-85.
- 15. Alloghani M, Al-Jumeily D, Mustafina J, Hussain A, Aljaaf AJ. A systematic review on supervised and unsupervised machine learning algorithms for data science. In: Berry MW, Mohamed A, Yap BW, editors. Supervised and unsupervised learning for data science. Cham: Springer International Publishing; 2020. p. 3-21.
- 16. Dogan O, Tiwari S, Jabbar MA, Guggari S. A systematic review on AI/ML approaches against COVID-19 outbreak. Complex Intell Systems 2021; 7:2655-78.
- 17. Lauriola I, Lavelli A, Aiolli F. An introduction to deep learning in natural language processing: models, techniques, and tools. Neurocomputing 2022; 470:443-56.
- 18. Junaid T, Sumathi D, Sasikumar AN, Suthir S, Manikandan J, Khilar R, et al. A comparative analysis of transformer based models for figurative language classification. Comput Electr Eng 2022; 101:108051.
- 19. Agência Nacional de Saúde Suplementar. TISS - padrão para troca de informação de saúde suplementar. https://www.gov.br/ans/pt-br/assuntos/prestadores/padrao-para-troca-de-informacao-de-saude-suplementar-2013-tiss (accessed on 20/Dec/2021).
- 20. Agência Nacional de Saúde Suplementar. Resolução Normativa nº 259, de 17 de junho de 2011. Dispõe sobre a garantia de atendimento dos beneficiários de plano privado de assistência à saúde e altera a Instrução Normativa - IN nº 23, de 1º de dezembro de 2009, da Diretoria de Normas e Habilitação dos Produtos DIPRO. Diário Oficial da União 2011; 20 jun.
- 21. Instituto Brasileiro de Geografia e Estatística. Conta-satélite de saúde: Brasil - 2010-2019. https://biblioteca.ibge.gov.br/visualizacao/livros/liv101928_informativo.pdf (accessed on 07/Jul/2022).
- 22. Duval FV, Silva FAB. Mining in Twitter for adverse events from malaria drugs: the case of doxycycline. Cad Saúde Pública 2019; 35:e00033417.
- 23. Moreira LB, Namen AA. A hybrid data mining model for diagnosis of patients with clinical suspicion of dementia. Comput Methods Programs Biomed 2018; 165:139-49.
- 24. Diniz EJS, Fontenele JE, Oliveira AC, Bastos VH, Teixeira S, Rabêlo RL, et al. Boamente: a natural language processing-based digital phenotyping tool for smart monitoring of suicidal ideation. Healthcare (Basel) 2022; 10:698.
- 25. Liu L, Tang L, Dong W, Yao S, Zhou W. An overview of topic modeling and its current applications in bioinformatics. Springerplus 2016; 5:1608.
- 26. Alghamdi R, Alfalqi K. A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications 2015; 6:147-56.
- 27. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018; 9 feb. https://arxiv.org/abs/1802.03426
- 28. Grootendorst M. BERTopic: neural topic modeling with a class-based TF-IDF procedure. https://maartengr.github.io/BERTopic/algorithm/algorithm.html (accessed on 14/Dec/2022).
- 29. Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using siamese BERT-networks. arXiv 2019; 27 aug. https://arxiv.org/abs/1908.10084
- 30. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 2022; 24 may. https://arxiv.org/pdf/1810.04805.pdf
- 31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need: 31st Conference on Neural Information Processing Systems (NIPS 2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 17/Oct/2023).
- 32. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv 2013; 16 jan. https://arxiv.org/abs/1301.3781
- 33. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf (accessed on 17/Oct/2023).
- 34. Hernández MA, Stolfo SJ. Real-world data is dirty: data cleansing and the merge/purge problem. Data Min Knowl Discov 1998; 2:9-37.
- 35. Syrowatka A, Kuznetsova M, Alsubai A, Beckman AL, Bain PA, Craig KJT, et al. Leveraging artificial intelligence for pandemic preparedness and response: a scoping review to identify key use cases. NPJ Digit Med 2021; 4:96.
- 36. Chen Q, Leaman R, Allot A, Luo L, Wei C-H, Yan S, et al. Artificial intelligence in action: addressing the COVID-19 pandemic with natural language processing. Annu Rev Biomed Data Sci 2021; 4:313-39.
- 37. Egger R, Yu J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts. Front Sociol 2022; 7:886498.
- 38. Leavell HR. The basic unity of private practice and public health. Am J Public Health Nations Health 1953; 43:1501-6.
- 39. Nadif M, Role F. Unsupervised and self-supervised deep learning approaches for biomedical text mining. Brief Bioinform 2021; 22:1592-603.
- 40. Liu F, Demosthenes P. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Med Res Methodol 2022; 22:287.
- 41. Raoof S, Kurzrock R. For insights into the real world, consider real-world data. Sci Transl Med 2022; 14:eabn6911.
- 42. Chen T, Dredze M, Weiner JP, Hernandez L, Kimura J, Kharrazi H. Extraction of geriatric syndromes from electronic health record clinical notes: assessment of statistical natural language processing methods. JMIR Med Inform 2019; 7:e13039.
- 43. Schwalbe N, Wahl B. Artificial intelligence and the future of global health. Lancet 2020; 395:1579-86.
Publication Dates
- Publication in this collection: 04 Dec 2023
- Date of issue: 2023
History
- Received: 19 Jan 2023
- Reviewed: 26 June 2023
- Accepted: 04 July 2023