The GGGL METHOD: A Theoretical Concept for Investigating and Assessing the Reliability of Artificial Intelligence Autonomously Generated Diagnostics & Treatment Plans


Elena Goranova1, Alexander Gungov2, Sergio Gianesini3,4, Alexander Lazarov5 

ABSTRACT: Artificial intelligence (AI) use has been growing steadily across professional fields, including medicine. Although it currently performs only in narrowly specialized modes and operates in a manner different from human thinking, AI already demonstrates a powerful analytic-synthetic capacity that in some respects surpasses human capabilities, especially when unstructured real-time Big Data streams are explored. However, crucial uncertainty remains about the accuracy and precision of diagnostics and treatment plans generated autonomously by AI, and about the potentially major issues this raises for the patient population exposed to the phenomenon. These issues need special attention in the very demanding framework of medicine, because any AI-generated bias may affect human health and well-being while fostering the commercial interests of technology companies.

This concept reviews reliable prospective and retrospective methods for investigating outcomes generated autonomously by AI. The prospective mode consists of large-scale triple-blind studies modeled on the validated double-blind trial protocols used to introduce new pharmaceutical products. The retrospective proposal, in turn, describes a test that any clinician or clinical team may deploy independently to check AI's autonomous informational production.

Importantly, both modes of investigation are universal for any type of medical AI, whether based on Computer Vision, Natural Language Processing, intermodal AI performance, etc. Moreover, neither testing procedure violates any ethical code, because both pose no risk to patients: in both proposed experimental setups, patients receive traditional, entirely human-provided health care. Finally, the standardization and reproducibility of the assessment methodology make large-scale meta-analysis on the topic possible.

Keywords: Artificial Intelligence, autonomous, diagnostics, treatment, accuracy, precision, investigation, trial, test.  

INTRODUCTION  

Artificial Intelligence (AI) has been progressively penetrating the workspace of many different fields, including medicine (1), (2). While on one side the phenomenon can deliver great benefits for humankind's healthcare, on the other, significant attention must be dedicated to the proper assessment of AI reliability, considering how aggressively technology companies may push outcome claims without properly conducted, unbiased investigations. This is particularly true for applications addressing self-management by patients, where direct healthcare-professional control can be only partial. AI can be beneficial or harmful depending on how it is used and, above all, on how its reliability is assessed. The essential queries in this process concern two major issues. First, are the informational insights AI produces autonomously accurate and precise? Second, will patients successfully manage self-diagnostics and treatment (3-11)? We recognize that the first currently takes priority, because a universally validated methodology for assessing such potentially precious but also confounding tools is lacking (12).

Moreover, as with medical devices, AI needs "post-market" surveillance, which, to our knowledge, is currently lacking in the scientific reports (13). The text presented herein reviews the current scenario and describes a universal method to check AI performance independently. An alternative, general option of conducting large-scale international triple-blind prospective trials is reported as well.

Neither assessment method conflicts with any Code of Ethics: for the whole period of AI experimentation, patients receive traditional, high-level, purely human-performed healthcare assistance, together with data privacy protection.

AIM   

Our main goal is to demonstrate to medical experts that it is possible to assess the accuracy and precision of AI without putting patients at risk, and to prove that AI's deployment and utilization are helpful and necessary even in cases when its activity falls outside a human-controlled framework.

TESTING METHOD   

We suggest a prospective and a retrospective approach to investigating whether informational insights generated autonomously by AI are correct with reference to medical standards and ethical codes.

The prospective method is a triple-blind experiment that corresponds to the currently approved and widely applied fourth stage of double-blind research protocols governing trials for the adoption of new pharmaceutical molecules. We develop this approach step by step as follows:

  1. The Narrow AI designers present their automation together with a detailed description of the diseases whose analytics it targets. They must also disclose which medical consultants supported the project's realization; these experts should not be engaged with the same AI in the subsequent clinical study.
  2. The trial must be conducted by a company experienced in researching medical drugs and equipment, and financially independent of the AI developers.
  3. A highly qualified Key Group of globally respected experts in the specific medical specialty area draws up the target patient profile along with a list of all mandatory examinations necessary for differential diagnostics and treatment decision-making.  
  4. Physicians with the corresponding specialization are hired to find patients who suit the trial purposes and to enroll them in the study. For higher objectivity, it is preferable that they not participate in other testing activities; however, their exclusion is not mandatory.
  5. Another group of experienced, specialized doctors must be involved to diagnose the selected patients and provide follow-up medical care.
  6. Once a targeted patient who fits the trial design profile is available along with a clinician to diagnose and treat him/her, the study enters the triple-blind stage.  
  7. The first blind component is the Referent Group of patients: they are unaware of participating in a study defining the precision of AI. They meet a doctor who examines them and orders all mandatory tests (such as blood, urine, and imaging tests) and consultations, which, according to the trial agenda, are uniform for everyone enrolled.
  8. The second blind component is the physicians who examine and treat the patients – they are also not informed that their opinions will be compared to AI-produced outcomes.  
  9. Once each patient's additional examination results are ready, they are presented to the specialist who ordered them. He/she meets the patient again, makes a final diagnostic decision, and prescribes how to proceed with the treatment; the patient follows these instructions. In other words, the patient receives entirely human-based medical assistance in the traditional manner, avoiding any exceptional risks from attending the trial.
  10. The same patient investigation data is uploaded to the AI operating system for processing and reaching a conclusion.
  11. A purely objective computer program compares the AI-generated diagnosis and treatment plan to the doctor's judgment. The reason for this is twofold: first, to check whether the AI outcome is coherent with the human opinion; second, to calculate the similarity share precisely and to highlight divergences once they arise (a minimal sketch of such a program follows this list).
  12. All cases of considerable variability in diagnostics are reported for judgment to the Key Group, and herein lies the third blind component. This group of experts serves as a jury that assesses what is appropriate and what is wrong between opinions “1” and “2”, at all times unaware which one is machine-produced and which is a clinician's conclusion.

  13. Finally, the AI accuracy is measured statistically to serve as evidence for evaluating its trustworthiness (see the sketch below).
  14. Meanwhile, the precision of human decision-making may be tested as well, to assess whether physicians also make mistakes; in cases of opinion variability between clinicians and machines, it would then be evident which aspects of human or AI medical expertise are more reliable. A trial outcome showing AI accuracy superior to human experts' performance might indicate that the explored Narrow AI solution should be recommended as an investigative instrument for doctors in addition to the current standard human-performed procedures, so as to earn a synergetic effect.
  15. Significantly, this stage of the trial must include an observation of whether similar inputs of patients' symptoms, signs, and examination data lead to consistent AI diagnostic and treatment plan production. Such supervision appears necessary because some publications about ChatGPT (14) hold that, due to constantly performing Deep Learning from its experience, it may generate different answers when asked a similar question at different moments.
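Below is a minimal sketch, in Python, of the comparison program from steps 11 and 13. The case structure, field names, and exact-match similarity rule are illustrative assumptions, not part of the protocol; a real trial would compare coded diagnoses (e.g., ICD-10) under a clinically validated equivalence measure and a pre-registered statistical plan.

```python
# Illustrative sketch of the comparison program (steps 11 and 13).
# Record fields and the similarity rule are assumptions for this example.
import math
import random
from dataclasses import dataclass

@dataclass
class CaseResult:
    case_id: str
    clinician_diagnosis: str
    ai_diagnosis: str
    clinician_plan: str
    ai_plan: str

def agree(a: str, b: str) -> bool:
    # Toy rule: exact match after normalization; a real trial would use
    # coded diagnoses and a clinically validated equivalence measure.
    return a.strip().lower() == b.strip().lower()

def wilson_ci(successes: int, n: int, z: float = 1.96):
    # 95% Wilson score interval for the agreement share.
    if n == 0:
        return 0.0, 0.0
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

def compare(cases):
    discordant = []
    hits = 0
    for c in cases:
        if agree(c.clinician_diagnosis, c.ai_diagnosis) and agree(c.clinician_plan, c.ai_plan):
            hits += 1
        else:
            pair = [c.clinician_diagnosis, c.ai_diagnosis]
            random.shuffle(pair)  # jury sees only "1" vs "2"; source hidden
            discordant.append((c.case_id, pair[0], pair[1]))
    lo, hi = wilson_ci(hits, len(cases))
    print(f"Agreement share: {hits}/{len(cases)} = {hits / len(cases):.1%} "
          f"(95% CI {lo:.1%}-{hi:.1%})")
    return discordant
```

Discordant cases returned by compare() would be forwarded to the Key Group with the two opinions presented in random order, preserving the third blind component.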

Regarding the prospective method's material, we consider that the number of patients involved in such a trial should be similar to common practice for innovative medical equipment experimentation, as well as for double-blind testing of new pharmaceutical products.

The retrospective method we suggest can be as large-scale as the prospective one, but it also allows much smaller examinations, giving any clinician or clinical team the opportunity to conduct independent "private" AI testing and ascertain its accuracy and precision. They only need to know the patient profile that the AI design targets for autonomous investigation, diagnostics, and treatment. Physicians can then open their archives, find all former patients (e.g., over a past three- or five-year period) whose conditions match the maladies the AI targets, and immediately provide the innovative automation with these patients' data. The machine will quickly reach autonomously generated conclusions, which can undergo a direct comparison with the real diagnostics and treatment procedures/results documented in the patients' files. Thus, doctors can easily decide whether to trust the specific AI or not (a minimal harness for such a test is sketched below).
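A minimal harness for such a "private" retrospective test might look as follows. The archive record fields and the ai_predict interface are hypothetical placeholders; each clinic would substitute its own archive format and the vendor's actual API.

```python
# Hypothetical retrospective test harness; record fields and the
# ai_predict callable are placeholders, not a real vendor API.
from typing import Callable

def retrospective_test(archive: list,
                       matches_profile: Callable[[dict], bool],
                       ai_predict: Callable[[dict], str]) -> None:
    tested = hits = 0
    for record in archive:
        if not matches_profile(record):      # keep only patients the AI targets
            continue
        ai_conclusion = ai_predict(record["examination_data"])
        documented = record["documented_diagnosis"]  # already in the file
        tested += 1
        if ai_conclusion == documented:
            hits += 1
        else:
            print(f"Discordance, case {record['patient_id']}: "
                  f"AI={ai_conclusion!r} vs file={documented!r}")
    if tested:
        print(f"Concordance: {hits}/{tested} = {hits / tested:.1%}")
```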

Evidently, the number of patients engaged in the retrospective test may be large (conducted in many clinics around the world, or regionally), but it may also be limited compared to the prospective approach, depending on the specific medical units involved, the disease type, and the physicians' experience. For example, a few years ago, E. Goranova retrospectively tested a software solution focused on predicting abdominal aneurysm rupture risks, involving 82 former patients of the Vascular Surgery Department of the National Cardio Hospital of Sofia, and the test disclosed valuable information (15).

EXPECTED RESULTS 

Our Prospective Method of Investigating Artificial Intelligence Autonomously Generated Diagnostics and Treatment Plans Accuracy and Precision is just a theoretical suggestion, because we have neither the capacity nor the resources we believe are necessary to conduct such extensive research. Therefore, we address our proposal to any Medical Certification Authority, which could require this type of study so that its results can be considered before the adoption of AI is allowed.

Nevertheless, we report Goranova's results regarding the retrospective mode. First, we would like to point out that the typical aortic diameter in the abdomen is about 3 cm. Science has not yet explained why the aortic wall loses the capacity to hold its normal identity; therefore, we can never know when an aneurysm will occur, and once it appears, it varies in shape and dimensions. In some sections, the diameter grows up to 8-9 cm, although rarely to a bigger size. The blood flow inside is very dynamic, so in cases of rupture, patients experience massive blood leakage; death often follows unless an urgent reparative surgical operation is conducted. During the pre-digital era, it was accepted that for aneurysm diameters up to 5.5 cm the surgical intervention could be postponed for a long period, with these patients monitored every three months to examine whether the aneurysm grows in volume. Once the diameter exceeds 5.5 cm, the operation is recognized as urgent.
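Written out as a sketch, this pre-digital triage rule amounts to a single threshold on the maximal diameter; the 5.5 cm cut-off and the three-month interval are the figures reported above.

```python
# The pre-digital triage rule, as reported above; units are centimeters.
def predigital_triage(max_diameter_cm: float) -> str:
    if max_diameter_cm > 5.5:
        return "urgent surgical repair"
    # At or below 5.5 cm: postpone surgery, monitor growth every 3 months.
    return "surveillance: re-examine in three months"

print(predigital_triage(4.8))  # -> surveillance: re-examine in three months
print(predigital_triage(6.1))  # -> urgent surgical repair
```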

The innovative digital analytic method for investigating aneurysm rupture risk is based on CT imaging, to which it applies Finite Element Testing: a virtual model of the aneurysm is split into numerous elements, and their behavior is predicted over a time horizon, taking into account the aneurysm wall characteristics and other patient data. Goranova's retrospective test disclosed that the digital system's predictions almost overlapped with the clinicians' decisions, except in a few circumstances. There were 6 cases of aneurysms with diameters above 5.5 cm where the computer system concluded a low level of rupture risk, while surgeons had conducted a successful intervention. At the same time, there were two aneurysms with diameters below 5.5 cm where surgeons decided to postpone surgery while the computer-made assessment claimed that these patients were at great risk. In fact, unexpectedly, these two patients passed away due to aneurysm rupture (the figures are tabulated below).
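A worked tabulation of these figures follows, under the assumption (implied by "almost overlapped") that all cases other than the eight listed were concordant.

```python
# Tabulation of the reported retrospective figures; the concordant count
# is inferred from the text, which lists only eight discordant cases.
total = 82
ai_low_risk_but_operated = 6    # diameter > 5.5 cm; surgeons repaired successfully
ai_high_risk_but_postponed = 2  # diameter < 5.5 cm; both later ruptured fatally
discordant = ai_low_risk_but_operated + ai_high_risk_but_postponed
concordant = total - discordant
print(f"Concordance: {concordant}/{total} = {concordant / total:.1%}")  # 74/82 = 90.2%
```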

DISCUSSION  

Goranova concludes that introducing digital-based autonomous investigation for predicting abdominal aneurysm rupture risk is very helpful for aneurysms below 5.5 cm in diameter, because the achieved informational insight proved more precise than the clinicians' approach to foreseeing the aneurysms' development. However, she warns that the software may produce incorrect forecasts for large aneurysms, because all bio-data (apart from the diameter measurement) are dynamic and may undergo substantial changes at any moment, whereas the specific software with which she experimented treats them as static, fixed at the level obtained during the patient's CT investigation. For instance, the greater the aneurysm's expansion, the thinner its wall becomes: it may withstand the blood pressure present during the CT scan, yet the same aneurysm may become very dangerous if a high blood pressure crisis arises, because the stronger the pressure on a thin wall, the higher the probability of its puncture (a simplified stress calculation follows).
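To make this reasoning concrete, here is a deliberately simplified calculation using the thin-walled cylinder form of Laplace's law (wall stress = pressure × radius / thickness). It is not the tested software's finite-element model; it only illustrates why treating pressure as fixed at the value observed during the CT scan can understate risk. The radius, wall thickness, and pressures are assumed example values.

```python
# Simplified wall-stress illustration via Laplace's law (NOT the finite-
# element model of the tested software). All input values are examples.
MMHG_TO_PA = 133.322  # pressure unit conversion

def hoop_stress_pa(pressure_mmhg: float, radius_m: float, thickness_m: float) -> float:
    # Thin-walled cylinder: circumferential stress = P * r / t
    return pressure_mmhg * MMHG_TO_PA * radius_m / thickness_m

# The same dilated, thinned aneurysm wall at two blood pressures:
during_scan = hoop_stress_pa(120, radius_m=0.035, thickness_m=0.0012)
during_crisis = hoop_stress_pa(200, radius_m=0.035, thickness_m=0.0012)
print(f"Wall stress during the CT scan:       {during_scan / 1e3:.0f} kPa")
print(f"Wall stress in a hypertensive crisis: {during_crisis / 1e3:.0f} kPa "
      f"(+{during_crisis / during_scan - 1:.0%})")
```

The same wall that holds at the scan-time pressure sees roughly two-thirds more stress in the crisis scenario, which is precisely the dynamic the static assessment misses.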

Significantly, because any AI activity strongly depends on the quality of the initial data supply, one should be certain that former patients' data is precise. For example, when dealing with Computer Vision, a clinician should ensure that the resolution of past CT images corresponds to the new technology's requirements (a hypothetical pre-upload check is sketched below). As AI implementation requires a constant Big Data upload, experts in the various medical specialties must today drive the process by arranging precise data source channels. In this respect, the v-Registry Global Real World Evidence Project currently offers a strong example of a validated, standardized, yet also customizable option for proper machine learning, moreover in a multi-lingual scenario; its appropriate management of privacy control deserves particular notice.
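A pre-upload quality gate in this spirit might look like the following sketch; the metadata fields and the 512 × 512 / 1.0 mm thresholds are illustrative assumptions, to be replaced by the requirements the specific AI system actually states.

```python
# Hypothetical pre-upload check: exclude archived CT studies whose
# resolution falls below the AI's stated minimum. Thresholds are examples.
def meets_requirements(meta: dict,
                       min_rows: int = 512,
                       min_cols: int = 512,
                       max_pixel_spacing_mm: float = 1.0) -> bool:
    return (meta["rows"] >= min_rows
            and meta["cols"] >= min_cols
            and meta["pixel_spacing_mm"] <= max_pixel_spacing_mm)

archived_ct = {"rows": 512, "cols": 512, "pixel_spacing_mm": 0.78}
if not meets_requirements(archived_ct):
    print("Exclude this study: resolution below the AI's requirements.")
```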

CONCLUSION  

Physicians must be open-minded to any potential AI solution related to their area of competence and must study it critically, based on their experience. If a solution has undergone a trial in advance of its implementation, doctors will be provided with evidence of the precision and accuracy of its autonomously generated diagnostics and treatment plans. In addition, the suggested retrospective mode of independent investigation by clinicians or clinical teams is very helpful, because it allows a direct comparison between their opinions and the AI-generated conclusions. Most importantly:

  1. Any AI production strongly depends on the quality of the data supply, so this is a crucial issue in any examination of its trustworthiness. If you upload garbage, any AI processing will result in pure rubbish!
  2. As explained in this article, both suggested methods of investigating AI autonomous activity pose no threat to patients and establish AI reliability with high confidence.
  3. In our view, AI will never reach the level of human critical thinking. Therefore, AI cannot substitute for a human doctor, except in critical patient situations where a physician's medical assistance cannot be reached. However, human critical thinking will soon run in a new environment where AI operates in various domains 24/7.
  4. Clinician teams must never directly apply AI conclusions when these contradict their own opinions. If such a situation occurs, one should always consult a more experienced colleague.