Wednesday, 16 April 2025

More on Bad MOT data

Triumph Stag pronounced but not spelt as Triummphhh Stagggg


A friend asked me to have a look at the large UK annual car testing ( MOT ) dataset for Triumph Stags. These are the variations of Make "Triumph" appearing when associated with Model "Stag" ...


That would be about a 0.1 % error rate. Further review and analysis show that bad data permeates the MOT data set, even in the most basic Make and Model fields. It would appear that these fields were entered with little, if any, validation or correction. Examples of data problems are :
  • Spaces before and after names
  • Both added and missing spaces between words
  • Zero 0 used as O, 1 used as I, and ! used as I
  • Lowercase used when the standard appears to be all UPPER CASE
  • Obvious misspelling of well known Makes
  • Model added into the Make field
  • Make added into the Model field
  • Engine size, LHD/RHD and body type included in the Make and Model fields
  • Noise characters like - / * ( ) . , inconsistently included in fields
  • Over 10 million entries with Model = UNKNOWN, but also Models listed as :

"UNKN0WN",1

"UNKNOW",5

"UNKNOWN",10414473

"UNKNOWN ",2

"UNKOWN",10

Whilst it would be easy to be a data correction pedant, the opportunities to correct the obvious systemic errors appear to have been overlooked. Extracting the Make strings and counting the occurrences of each shows that 168,799 different Makes are listed. However, 20,119 of the Make names are used just once and 3,129 are used twice, suggesting about a 12% error rate in this area.
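Several of the systemic errors listed above (stray spaces, lower case, noise characters, 0 typed for O) could be corrected mechanically. The following is only a minimal sketch in Python of how that might look, not how the MOT system works. The function name `normalise` and the tiny `KNOWN_WORDS` vocabulary are made up for illustration; a real clean-up would need a properly curated list of Makes and Models.

```python
import re

# Noise characters named in the list above; each is replaced by a space.
NOISE = re.compile(r"[-/*().,]")
# Lookalike characters: zero for O, one and "!" for I.
LOOKALIKES = str.maketrans({"0": "O", "1": "I", "!": "I"})
# Hypothetical vocabulary of trusted words; a real list would be far longer.
KNOWN_WORDS = frozenset({"UNKNOWN", "TRIUMPH", "STAG"})


def normalise(value, known_words=KNOWN_WORDS):
    """Return an upper-case, single-spaced matching key for a Make or Model."""
    text = NOISE.sub(" ", value.upper())
    tokens = []
    for token in text.split():  # split() trims and collapses runs of spaces
        fixed = token.translate(LOOKALIKES)
        # Accept the lookalike fix only if it produces a trusted word, so
        # genuine names containing digits (e.g. "A1", "911") are left alone.
        tokens.append(fixed if fixed in known_words else token)
    return " ".join(tokens)


print(normalise("UNKN0WN"))             # UNKNOWN
print(normalise("  tr1umph   stag "))   # TRIUMPH STAG
```

This only produces a key for matching and counting. It would not catch genuine misspellings such as "UNKOWN" or "UNKNOW", which need some form of fuzzy matching or manual review.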

There will be some unique and one-off cars and bikes made by folks in garages, but it would seem that some of the worst offenders for badly recorded Make names are "RANGE ROVER" and "DIRECT BIKES". Both these companies have the "Just add the Model inconsistently into the Make name" issue.

These images are from the list of just the Make field, with the count of times each is used ....
 
 

Once the Make data is combined with the even more inconsistent Model data, the list of "Make,Model" combinations is 239,896 entries long, of which 147,964 ( 61% ) are used just once or twice. As this data set directly reflects vehicle registration data that is used for tax collection and law enforcement, more effort should have been made to keep it both consistent and accurate.
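For anyone wanting to repeat this kind of count, here is a rough sketch in Python. It assumes the data has already been loaded as a list of (Make, Model) pairs; the function name `rare_share` and the sample rows are invented for illustration and are not taken from the MOT data.

```python
from collections import Counter


def rare_share(pairs, max_uses=2):
    """Fraction of distinct (Make, Model) pairs used at most `max_uses` times."""
    counts = Counter(pairs)
    if not counts:
        return 0.0
    rare = sum(1 for uses in counts.values() if uses <= max_uses)
    return rare / len(counts)


# Made-up sample rows, purely to show the shape of the input.
rows = (
    [("TRIUMPH", "STAG")] * 3
    + [("TRIUMPH", "STAGG")]
    + [("TRIUMPHH", "STAG")] * 2
    + [("DMC", "DELOREAN")] * 4
)
print(rare_share(rows))  # 0.5

# The same ratios using the totals quoted in this article.
print(f"{147_964 / 239_896:.1%}")  # 61.7% of Make,Model pairs used once or twice
print(f"{20_119 / 168_799:.1%}")   # 11.9% of Make names used just once
```

A name that appears only once or twice is not proof of an error, since genuine one-off vehicles exist, so these ratios are best read as an upper bound on the suspect entries.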

Going back to Registration year, the Make,Model list also shows over 60 vehicles registered between the years 1850 and 1899. In between the genuinely historic vehicles there are many from manufacturers that did not exist at the time, including Suzuki and BMW. Some examples are highlighted below. These types of errors are likely to arise from a failure to validate between 18xx and 19xx dates at Registration input time.
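A check of this kind is simple to sketch in Python. The founding years below (Suzuki 1909, BMW 1916) are included for illustration only, and the `FOUNDED` table and function names are hypothetical; a real validator would need a curated reference list of manufacturers.

```python
# Founding years for illustration only; a real check needs a curated table.
FOUNDED = {"SUZUKI": 1909, "BMW": 1916}


def implausible_registration(make, year, founded=FOUNDED):
    """True when a vehicle is registered before its manufacturer existed."""
    start = founded.get(make.strip().upper())
    return start is not None and year < start


def century_typo_candidate(make, year, founded=FOUNDED):
    """Suggest year + 100 for manual review when an 18xx date is implausible."""
    if 1800 <= year <= 1899 and implausible_registration(make, year, founded):
        return year + 100
    return None


print(implausible_registration("Suzuki", 1895))  # True
print(century_typo_candidate("BMW", 1885))       # 1985
```

The suggested year is only a candidate for a human to confirm, as nothing in the data proves that the intended date was exactly a century later.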

The impact of this poor data consistency becomes apparent when researching small volume historic cars. For example, DMC DeLorean produced only one model, for a few years in the '80s, but has the following name variations .... showing Make,Model,Qty, [ registered in these years list .... ] 

DMC DeLorean ( there was only one type of these )

Strangely, this article about DeLoreans showed up in the Independent just a few days later ....

Forty years after its cinematic debut in Back to the Future, the DeLorean DMC-12, famed for its gull-wing doors and brushed stainless steel exterior, remains a rare sight on UK roads. New figures reveal just 303 of these iconic vehicles are still registered in the country, a testament to their enduring appeal and collector status.

Originally manufactured in Dunmurry, Northern Ireland, in 1981, around 9,000 DeLorean DMC-12s were produced. Their unique design and subsequent Hollywood fame have transformed them into highly sought-after automotive treasures. The limited number remaining on UK roads underscores their rarity and the dedication of their owners to preserving a piece of cinematic and automotive history. .... Some 303 DeLoreans are taxed for use on public roads in the UK, according to Driver and Vehicle Licensing Agency figures obtained by online auction platform Collecting Cars.

Conclusion

A single Make and Model field value pair cannot be relied on to find all the cars of a particular type.
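One way to work around this when searching is to match on similarity rather than on an exact string. The sketch below uses Python's standard `difflib` module and the UNKNOWN variants quoted earlier in this article; the function name `similar_names` and the 0.8 threshold are my own assumptions and would need tuning against real data.

```python
from difflib import SequenceMatcher


def similar_names(target, candidates, threshold=0.8):
    """Return candidates whose similarity ratio to target meets the threshold."""
    target_key = " ".join(target.upper().split())
    matches = []
    for name in candidates:
        key = " ".join(name.upper().split())  # trim and collapse spaces
        if SequenceMatcher(None, target_key, key).ratio() >= threshold:
            matches.append(name)
    return matches


variants = ["UNKN0WN", "UNKNOW", "UNKNOWN ", "UNKOWN", "TRIUMPH"]
print(similar_names("UNKNOWN", variants))
# ['UNKN0WN', 'UNKNOW', 'UNKNOWN ', 'UNKOWN']
```

A loose threshold will also pull in genuinely different names, so the results are a shortlist to check by eye rather than a final answer.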

Cars that were on the road before 2005 appear in the listings but may not have any MOT tests listed, which would attribute this inconsistency to the original registration process. However, the importation of the older registration data into a new digital platform could have been an opportunity to clean up the dataset. Among current cars, the proliferation of "build to order" and promotional variations is likely to further complicate the identification of cars using the Model name.

Other examples

You have to love the lads that registered their creations with the DVLA with the Make name as "2BLOKESINASHED" and "FASTEST SHED".

How many ways can you spell Armstrong Siddeley ?  Highlighted ones are used that number of times.

        

 

And the correct spelling is Armstrong Siddeley because we had one turn up to a show.




This is a classic of unvalidated input. From Fail to Pass in about 13 days, but the mileage also went from 85,668 to 865,610. That is 779,942 miles, so the car would have to have travelled at about 2,500 miles each hour for 312 hours !
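Taking the two odometer readings and the 13 day ( 312 hour ) gap quoted above at face value, the implied non-stop speed can be checked with a few lines of Python. The exact result depends on the precise interval between the two tests, which is only given approximately; with these rounded inputs it comes out at roughly 2,500 mph. The function name and the 70 mph limit are my own assumptions for illustration.

```python
def implied_speed_mph(miles_before, miles_after, days_between):
    """Average speed needed to cover the odometer difference, driving non-stop."""
    hours = days_between * 24
    return (miles_after - miles_before) / hours


speed = implied_speed_mph(85_668, 865_610, 13)
print(f"{865_610 - 85_668:,} miles")  # 779,942 miles
print(f"{speed:,.0f} mph")            # 2,500 mph

# Hypothetical sanity limit: flag anything that needs more than 70 mph, 24 hours a day.
print(speed > 70)  # True
```

Even a very generous limit like this would have caught the reading at input time.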





** Article Version Date : 30 May 2025 **
