The BC Cancer Registry (BCCR) collects data on cancers diagnosed in the province, reaching back to the 1970s. The registry is a crucial tool for B.C.'s health care system: it enables planners and policy makers to track new cancer diagnoses, monitor how major cancers are trending, and assess how new programs and treatments are improving patients' survival.
But until very recently, B.C. faced a two-year backlog entering cases of reportable cancer into the database. The backlog reflects problems faced by cancer registries not only in B.C., but across Canada and internationally.
"That's just not good enough for tracking how well new screening programs and treatments are reducing the burden of cancer," says Dr. Jonathan Simkin, scientific director of the BC Cancer Registry. "We need to know how many patients are seen and diagnosed every year so we can prepare enough healthcare providers, programs and services to support those patients going through treatment."
That's when BC Cancer and PHSA asked Dr. Raymond Ng and UBC's Data Science Institute to clear a clog in the data pipeline and allow timely access to data for cancer research and health planning.
"If we can cut that lag time down, we can help prepare the health system to evaluate these programs a bit faster for people," says Dr. Raymond Ng, computer scientist with the UBC Faculty of Science. "That's a goal of mine — I want my work to have an impact."
The majority of the information loaded into the BCCR database comes from more than 500,000 electronic pathology reports generated each year by hospitals and laboratories in B.C. The Registry uses a system called eMaRC Plus (Electronic Mapping, Reporting and Coding), a standard text-mining system, to extract relevant information from text-based pathology records.
But eMaRC isn't based on state-of-the-art Natural Language Processing (NLP) techniques like large language models. While it detected all reportable cancers diagnosed through pathology in B.C., it also incorrectly labelled a high volume of reports that did not mention a reportable cancer. Although eMaRC helps speed up the review process to support planning and program evaluation, the growing number of wrongly labelled reports led to a backlog in the system, and more work for the BCCR team.
"We need more timely data on the number of patients and diagnoses every year so that we can plan a more responsive healthcare system," notes Dr. Simkin.
To ensure the validity of the database, highly skilled tumour registrars review all reports to weed out non-cancer reports incorrectly labelled by eMaRC. Even if human registrars take only a minute to read each pathology report, it adds up to a lot of time that could be better spent resolving tough questions around classifying tumours and specialized cancers.
"The manual review of documents to classify cancer is very labour intensive," says Cathy MacKay, who oversees registry data quality, reporting and evaluation for PHSA. "Training a tumour registrar takes three months for each grouping of cancer. And we have approximately 20 types of cancer groups to code."
"If we could develop an algorithm that could check each report in one minute, every night you could clear up the backlog of that day," says Dr. Ng. "It's saving money, saving time and overcoming a labour shortage problem."
Most NLP systems are based on a "language model": a very large neural network trained, in a process known as "deep learning," on vast amounts of unlabelled text, such as the entire web. The resulting model knows a lot about word meanings and syntax, which allows it to comprehend the nuances of language. In the last four years, deep learning has transformed NLP.
Deep learning, though, isn't intuitive — the process is so opaque and incomprehensible that it's been dubbed a "black box." Dr. Ng's team from the Data Science Institute chose to take a more transparent, user-friendly approach to building the BCCR's NLP pipeline.
"If you want to convince the clinicians in the health care system to use AI you can't say, 'I don't know why, it just works.' That's not an acceptable answer," says Dr. Ng. "From the very beginning we wanted to build an explainable model that does the job, and we wanted the expert tumour registrars to understand why the algorithm will say a cancer is reportable or not."
Dr. Ng adopted a state-of-the-art approach to NLP training called "query and answer" using a flowchart of the questions a human tumour registrar would ask when classifying cancer in a pathology report. True or false answers inform the next question in a logical progression.
"It's like the game Mastermind or Wordle — you try to guess what the word is by asking questions based on previous answers," says Dr. Ng. "We're essentially using NLP to play the game 'Is there invasive carcinoma?' Depending on the question, we ask a second question to be sure, and a third to draw a conclusion. We are just encoding human logic using NLP."
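The question-and-answer flow Dr. Ng describes can be pictured as a transparent chain of yes/no decisions, where each answer determines the next question. The minimal sketch below uses simple keyword matching as a stand-in for the NLP model that answers each question in the real pipeline; the question wording, keywords, and function names are illustrative assumptions, not the actual BCCR system.

```python
def answer(keywords, report_text):
    """Stand-in for an NLP question-answering step: returns True if any
    of the question's keywords appear in the report. The real pipeline
    would answer each question with a trained language model instead."""
    text = report_text.lower()
    return any(kw in text for kw in keywords)

def is_reportable(report_text):
    """Walk a flowchart of questions; each true/false answer picks the
    next question, encoding a tumour registrar's logic step by step."""
    # Q1 (illustrative): does the report mention invasive carcinoma?
    if answer(["invasive carcinoma"], report_text):
        # Q2 (illustrative): is the finding explicitly ruled out?
        if answer(["no evidence of", "negative for"], report_text):
            return False  # mentioned only to negate it
        return True
    # Q3 (illustrative): does it mention another reportable finding?
    return answer(["malignant", "carcinoma in situ"], report_text)

print(is_reportable("Findings consistent with invasive carcinoma."))  # True
print(is_reportable("Benign tissue with no atypia."))                 # False
```

Because each branch corresponds to a question a registrar would actually ask, a flagged report can be explained by listing the questions and answers that led to the decision, which is the transparency the team was after.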
Working with the cancer registry team to learn how tumour registrars code pathology reports, the DSI team co-designed the questions so the process of training the model was transparent and understandable.
"It's not as much of a black box as a lot of other data science tools because Raymond and his team incorporated coding and tumour registry knowledge into the pipeline," MacKay says. "It's in-house and custom made."
The model was trained on two years' worth of BCCR data that had already been validated by human registrars, and errors were analyzed to determine where the model went wrong. It took relatively little time to deploy the system because BCCR experts had confidence in the model: they understood how its decision-making process works.
After the initial design in early 2021, the new system was tested in a pilot project running from September to November 2022.