In which I look at data issues generated via spreadsheet use (specifically how Excel changes data when are imported into Excel) in genomics research. I reference the European Spreadsheet Risks Interest Group (I’m a member!), ASOP 23 on data quality, and give some unsolicited advice to the genomics research community. There will be a part two on other spreadsheet issues, and other actuarial standards. (Professionalism credit! For free!)
Episode Links
Retraction Watch guest post: Genomics has a spreadsheet problem by Mandhri Abeysooriya and Mark Ziemann
Excerpt:
Gene-name errors were discovered by Barry R. Zeeberg and his team at the National Institutes of Health in 2004. Their study first showed that when genomic data is handled in Excel, the program automatically corrects gene names to dates. What’s more, Riken clone identifiers – unique name tags given to pieces of DNA – can be misinterpreted as numbers with decimal points.
In 2016, we conducted a more extensive search, showing that in more than 3,500 articles published between 2005 and 2015, 20% of Excel gene lists contained errors in their supplementary files.
….
The widespread adoption of these new names is a lengthy process: Our latest study, of more than 11,000 articles published between 2014 and 2020, found that 31% of supplementary files that included gene lists in Excel contained errors. This percentage is higher than in our previous 2016 study.
My comment:
The “helpful” way Excel automatically changes text it imports into numerals or dates can screw up my mortality tables I import, where the “1-4” (intended to mean ages 1 years to 4 years) gets transformed to January 4 of the current year.
Ed Cruz (from EuSpRIG) comment:
I believe the primary issue lies in Data Validation. Here’s a proven data management approach that has yielded success for clients in the banking and securities industry:
Rather than directly managing raw data within an Excel spreadsheet, create custom database application to handle data effectively:
1. Data Validation and Exception Reporting: The application initiates by uploading raw Genomics data into a database. Here, data validation rules are applied before the upload, ensuring data integrity, and promptly generating Exception reports in the event of any issues.
2. Excel Template Integration: Afterward, application can query data and publish into a Genomics Excel template, enabling researchers to work with clean data and significantly enhancing accuracy and efficiency.
This method not only addresses the data validation concern but also optimizes the entire Genomics data management process.
Cruz links to this 2-minute video:
Please pay particular attention to:
1. 0:40, Validation rules are applied and generates an Excel Exceptions report.
2. 1:10, Excel Template is populated with clean and processed data.
Video in which I mention Excel importation issue:
European Spreadsheet Risks Interest Group
Main website:
Horror stories page: https://eusprig.org/research-info/horror-stories/
Example stories:
Identifier:POB2001
Title:Data not controlled, 16000 UK Covid-19 test results lost for a week
Source:https://www.bbc.co.uk/news/technology-54423988
Release Date:08 October 2020
Risk:Lives put at risk because the contact-tracing process had been delayed
Discrepancy:16,000 test cases in a week
Excel: Why using Microsoft’s tool caused Covid-19 results to be lost
“The badly thought-out use of Microsoft’s Excel software was the reason nearly 16,000 coronavirus cases went unreported in England. [The labs] filed their [result logs] results in the form of text-based lists – known as CSV files – without issue. PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system. The problem is that [Public Health England] PHE’s own developers picked an old file format to do this – known as XLS. As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of. And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off. To handle the problem, PHE is now breaking down the test result data into smaller batches to create a larger number of Excel templates.”[POB] I can correct the headline to:
Why a lack of basic data controls caused Covid-19 results to be lost
Of course they should not have chosen a file format with a size limit to process results. Nonetheless, whatever technology they used, anywhere data is exchanged between systems there must be checks and controls that reconcile the output of a transformation stage to its input, such as record counts and hash totals. Even if the upload process would only accept XLS files, batch total controls could have been imposed there too.
Identifier:POB2003
Title:Scientists rename human genes to stop Microsoft Excel from misreading them as dates
Source:https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
Release Date:06/08/2020
Risk:Inconvenience
Discrepancy:none
The bioinformatics community decided it was easier to change gene symbols than changing peoples’ habits.’Módos, whose job involves analyzing freshly sequenced genetic data, says Excel errors happen all the time, simply because the software is often the first thing to hand when scientists process numerical data. “It’s a widespread tool and if you are a bit computationally illiterate you will use it,” he says.’
My Slides for EuSpRIG 2010 meeting: I’ve Got 99 Problems but a Spreadsheet ain’t One
A lot of the links in that presentation are now dead. Alas.
Bonus MPC presentation: Think Outside the Parallelopiped - 2008
Podcasts?! Wha?
[and I brought my iPod with me, I think, to demonstrate the business podcasts I listened to then]
Actuarial Standards of Practice
Actuarial Standards of Practice: https://www.actuarialstandardsboard.org/standards-of-practice/
1.1 PURPOSE
The purpose of this actuarial standard of practice (ASOP) is to provide guidance to the actuary when performing actuarial services involving data.
….
SECTION 3. ANALYSIS OF ISSUES AND RECOMMENDED PRACTICES
3.1 OVERVIEW
Appropriate data that are accurate and complete may not be available. The actuary should use available data that, in the actuary’s professional judgment, allow the actuary to perform the desired analysis. However, if significant data limitations are known to the actuary, the actuary should disclose those limitations and their implications in accordance with section 4.1(b). The following sections discuss such considerations in more detail.
etc.
Spreadsheet Shenanigans - Part 1