ECC in NAND flash memory


All modern flash storage devices face data integrity problems caused by the imperfect quality of NAND chips. These problems are known as “bit errors”. When bit errors appear within the area where a file is stored, the file becomes corrupted and may be unreadable. Bit errors are easy to spot in multimedia files, as in the picture below.



There are multiple sources of bit errors in NAND memory (cell-to-cell interference, charge leakage, read disturb, etc.) and several ways to reduce them (Error Correction Codes, power manipulations, built-in Read-Retry algorithms). An Error Correction Code is the most efficient way to protect data and repair it when errors occur, so all flash controllers have built-in ECC encoders/decoders. During data recording (programming), the controller generates a special checksum and stores it inside each Page. When the user requests data (a Page read operation), the controller checks the integrity of every page and performs correction if an error is detected.
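The write/read cycle described above can be illustrated with a toy code. The sketch below uses Hamming(7,4), which corrects a single flipped bit per 7-bit codeword. It is not the BCH code that real controllers use, and the function names are invented for this illustration only.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (parity is the 'checksum')."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (corrected data bits, 1-based error position or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome points at the bad bit
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position
```

Encoding corresponds to the programming step and decoding to the page read: if one stored bit flips, the decoder locates and repairs it. BCH codes follow the same principle but correct many bits per codeword.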

BCH codes are among the most commonly used ECC codes in flash storage devices. The BCH algorithm is adjustable and has a set of parameters. These parameters are pre-programmed in the controller’s firmware and differ from model to model. When the controller gets damaged, the information about the parameters is lost, but the ECC checksums still remain inside every page of the NAND chip. Visual NAND Reconstructor (VNR) has a software BCH decoder that performs error correction using these remaining ECC checksums. Error correction via the ECC algorithm can be applied to the physical image once it is extracted (or, if the chip has Bad Columns, after they are removed with the BCR element). Sometimes the data has to be un-XOR'd before the ECC is applied.

 

ECC structure. Payload and parity areas 

A Page of flash memory usually consists of a Data area, a Spare area and an ECC area. In almost every case these areas repeat multiple times, depending on the page size and page structure. For the ECC, only two areas exist: Payload and Parity. The Parity is the ECC area itself, and the Payload may contain the Data area alone or the Data area together with the Spare area, depending on the controller model. Each Payload together with its Parity forms a Codeword. One Page may contain multiple codewords (1, 4, 8, 16, ...).


The number of codewords depends on the Page size and on the number of Payload+Parity areas.
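As a rough illustration of the Payload/Parity split, the sketch below slices one page into codewords. It assumes that each Parity area directly follows its Payload and that both are contiguous, which is not true for every controller. The function name and the layout format are made up for this example.

```python
def split_codewords(page: bytes, layout):
    """Split a page into (payload, parity) pairs.

    layout: list of (payload_len, parity_len) tuples in page order.
    Returns the list of codewords and whatever bytes remain after them.
    """
    codewords = []
    pos = 0
    for payload_len, parity_len in layout:
        payload = page[pos:pos + payload_len]
        pos += payload_len
        parity = page[pos:pos + parity_len]
        pos += parity_len
        codewords.append((payload, parity))
    return codewords, page[pos:]
```

For a page with four identical codewords, the layout would simply be the same tuple repeated four times; any bytes left over at the end of the page are returned separately.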

VNR - ECC autodetection

VNR has a built-in mechanism for ECC autodetection (for controllers and codewords that are in the database). To launch ECC autodetection, simply press the "Find codewords" button on the toolbar (in the parameter tab in older software versions).



Select the first option if the controller brand is unknown. If the controller vendor is known, the search can be narrowed by selecting it.
The analysis starts when the "Find" button is pressed. As a result, the probabilities of the codes will appear.



If autodetection doesn't show any matching ECC, check the 'ECC and Bad Columns' and 'ECC before/after XOR' chapters of this article below.

The codes are sorted automatically by probability. Here are a few things to pay attention to:
  1. Code name (consists of, respectively, Controller model, Page size, ECC size, Number of Codewords)
  2. Result (probability). Bold values mean that all codewords match; otherwise only the first codeword was tested.

Normally the most important criterion for choosing an ECC is that the value is bold and higher than 80%.
The controller model is less important, because some controllers may use the same ECC codes.
The page size doesn't have to be exact, but the number and size of codewords should match (e.g. the top three bold values in the picture above belong to the same codes).
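The selection rule above (bold value, higher than 80%) can be written down as a small filter. This is only a sketch of the reasoning, not VNR's internal logic; the `Candidate` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                  # hypothetical code name from the result list
    probability: float         # result in percent
    all_codewords_match: bool  # True when the value is shown in bold


def pick_candidates(results, threshold=80.0):
    """Keep bold results above the threshold, best first."""
    good = [c for c in results
            if c.all_codewords_match and c.probability > threshold]
    return sorted(good, key=lambda c: c.probability, reverse=True)
```

A non-bold result is excluded even when its probability is high, because only the first codeword was tested in that case.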

To check whether a code works, open the ECC map in the parameter tab, scroll down a bit and make sure that the majority of pages are green.

Do not pay attention to the first blocks in the ECC map.

ECC and Bad Columns

Many TLC NAND chips have Bad Columns, and it's important to remove them using the BCR element before ECC autodetection (and to use the BCR as the source of the physical image for further analysis). During bad column removal, special attention should be paid to the area where the ECC ends. Some controllers leave many "1" bits at the end of the ECC code, which looks like a Bad Column but actually contains a few bits of the ECC code.

To get more information about Bad Columns, check this article: NAND Bad Columns analysis and removal.


Example:


In the picture above there is just 1 bit of ECC, and if it is removed as a Bad Column it may affect the error correction efficiency. It's easy to make a mistake in such situations, when all Bad Columns are 2 bytes wide and there are a lot of them.
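To make the risk concrete, here is a minimal sketch of what removing Bad Columns does to a page: the listed byte ranges are cut out and everything after them shifts left. The function name and the `(offset, length)` format are assumptions for this example and do not describe how the BCR element is implemented.

```python
def remove_columns(page: bytes, columns):
    """Return the page with the given (offset, length) byte ranges removed."""
    out = bytearray()
    pos = 0
    for offset, length in sorted(columns):
        out += page[pos:offset]   # keep everything up to the column
        pos = offset + length     # skip the column itself
    out += page[pos:]
    return bytes(out)
```

If a range that actually holds the last bits of the Parity is listed as a Bad Column, those bits are lost and all following data shifts, so the ECC for that codeword can no longer be verified correctly.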

ECC before/after XOR

Most controllers generate the ECC (Parity) from scrambled (XOR'd) data, while some do it on the raw data. So sometimes it's necessary to check whether the ECC works before or after the XOR element. Furthermore, using the ECC element after XOR may be required when the parity itself is XOR'd.
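The reason the order matters can be shown with a small sketch. Here `zlib.crc32` is only a stand-in for real BCH parity, and the key, the function names and the "controller" behaviour are assumptions chosen for illustration: the parity is computed over the scrambled data, so it only matches before the data is un-XOR'd.

```python
import zlib
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the data."""
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


def checksum(data: bytes) -> int:
    """Stand-in for the parity calculation (NOT a BCH code)."""
    return zlib.crc32(data)


def program_page(raw: bytes, key: bytes):
    """Hypothetical controller: scramble first, then compute parity."""
    scrambled = xor_with_key(raw, key)
    return scrambled, checksum(scrambled)


def ecc_matches(data: bytes, parity: int) -> bool:
    return checksum(data) == parity
```

With such a controller, `ecc_matches` succeeds on the stored (scrambled) bytes and fails on the un-XOR'd bytes, so the ECC element has to be placed before the XOR element. A controller that computes parity on raw data behaves the opposite way.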


Physical image correction

Physical image correction is a mandatory step in Logical image reconstruction. VNR has a few options for correcting dumps:
  1. Turn ON ECC (in the parameter tab; applies ECC on the fly without modifying the physical dump. Very useful for SA correction when creating a markers table)
  2. Correct dump (one-pass correction and removal of bit errors in dump files)
  3. ReRead dump (multi-pass correction and re-reading of uncorrected pages using Read Retry algorithms)
All these options are located on the BCH toolbar when the ECC element is selected.





When using the ReRead dump option, the memory chip must be connected to the reader.



There are a few options here:
  1. Start address
  2. Maximum number of passes
  3. Read Retry (the Read Retry command is assigned automatically when the chip is supported)
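Conceptually, a multi-pass re-read can be sketched as the loop below. This is an assumption about the general idea, not VNR's actual implementation: `read_page` and `try_correct` are hypothetical callables standing in for the reader hardware and the ECC decoder, and the start address is simplified to a page index.

```python
def reread_dump(dump, bad_pages, read_page, try_correct,
                max_passes=3, start_page=0):
    """Re-read uncorrectable pages until they decode or passes run out.

    dump:        dict mapping page index -> corrected page data
    bad_pages:   page indexes that failed one-pass correction
    read_page:   hypothetical callable, page index -> freshly read raw page
    try_correct: hypothetical callable, raw page -> corrected page or None
    Returns the list of pages that are still uncorrected.
    """
    remaining = sorted(p for p in bad_pages if p >= start_page)
    for _ in range(max_passes):
        if not remaining:
            break
        still_bad = []
        for page in remaining:
            fixed = try_correct(read_page(page))
            if fixed is None:
                still_bad.append(page)   # try again on the next pass
            else:
                dump[page] = fixed
        remaining = still_bad
    return remaining
```

Each pass only touches pages that are still uncorrected, which is why the chip has to stay connected to the reader for the whole procedure.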

Unsupported ECC - how to create a new code

We regularly update our database with new ECC codes, but there are too many combinations of memory chips and controllers, and code configurations are frequently updated by vendors. Because of that, VNR has a 'Codeword analysis' tool that can brute-force the formula for any BCH ECC codeword. The codeword can later be populated across the whole page according to the page structure.

The 'Codeword analysis' function is located on the toolbar of the ECC element.



A set of parameters has to be specified in order to start the ECC brute-force process.



The 'Bytes/Bits' switch converts the Payload and Parity area ranges into bits when the ECC needs to be adjusted more precisely.

Codeword analysis options are:
  1. Page size - Page size of the chip/dump
  2. Payload - Payload area (floating boundaries can be set if the position of the ECC area is not very clear)
  3. Parity - Parity area (floating boundaries can be set if the position of the ECC area is not very clear)
  4. Polynomial:
    1. By polynomials - brute force against known polynomials of BCH ECC
    2. By degrees - brute force against all existing polynomials of BCH ECC (a long process that can take 1 - 24 hours)
  5. Operations:
    1. No operation - there is no transformation of the ECC
    2. Inversion
    3. Bit rotation
    4. Both Inversion and rotation - the ECC has Bit rotation and is Inverted
  6. Address
    1. Dump - VNR tests each combination on an automatically chosen block
    2. Fixed - VNR tests each combination at a specific offset
Codeword analysis tests all combinations in order to find the proper ECC code.
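The 'Operations' options can be pictured as simple transformations applied to the Parity bytes before decoding. The sketch below is an illustration only: inversion is taken as flipping every bit, and 'Bit rotation' is assumed here to mean reversing the bit order inside each byte, which may differ from what VNR actually does.

```python
def invert(parity: bytes) -> bytes:
    """Flip every bit of the parity."""
    return bytes(b ^ 0xFF for b in parity)


def reverse_bits(parity: bytes) -> bytes:
    """Reverse the bit order in each byte (assumed meaning of 'Bit rotation')."""
    return bytes(int(f"{b:08b}"[::-1], 2) for b in parity)


# The four operation variants that a brute-force search would try.
OPERATIONS = {
    "no operation": lambda p: p,
    "inversion": invert,
    "bit rotation": reverse_bits,
    "both": lambda p: invert(reverse_bits(p)),
}


def parity_variants(parity: bytes):
    """Return every transformed version of the parity, keyed by operation."""
    return {name: op(parity) for name, op in OPERATIONS.items()}
```

A brute-force search combines each of these variants with each candidate polynomial, which is one reason the 'By degrees' mode takes so long.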

Example:

Page size = 8640
Page structure:  DA(1024)+SA(8)+ECC(42)+7x[DA(1024)+SA(4)+ECC(42)]+EMPTY(76)
First codeword = Payload(1032) + Parity(42)
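The arithmetic behind this page structure can be checked with a short sketch. It assumes, as in the example, that the Payload is the Data area plus the Spare area and that the Parity follows it directly; the variable and function names are invented for this illustration.

```python
PAGE_SIZE = 8640

# (data, spare, ecc) sizes per codeword, taken from the page structure above
AREAS = [(1024, 8, 42)] + [(1024, 4, 42)] * 7


def codeword_offsets(areas):
    """Return per-codeword offsets and the total number of bytes used."""
    offsets = []
    pos = 0
    for data, spare, ecc in areas:
        payload_len = data + spare
        offsets.append({
            "payload_start": pos,
            "payload_len": payload_len,
            "parity_start": pos + payload_len,
            "parity_len": ecc,
        })
        pos += payload_len + ecc
    return offsets, pos


offsets, used = codeword_offsets(AREAS)
empty = PAGE_SIZE - used   # unused tail of the page
```

Here `used` comes out as 8564 bytes and `empty` as 76, matching EMPTY(76) in the structure. The first codeword has a 1032-byte Payload and the remaining seven have 1028 bytes each.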



If the weights are the same for every setting, most likely none of the found parameters worked. In this case you may need to find a block full of data through the bitmap viewer and copy its address to the 'fixed address' field.

When matching ECC settings for the first codeword are found, their weight is normally above 20. To add the codeword, just select the matching settings and press the 'Add' button.



When the first codeword has been added, all the others have to be added according to the page structure. The main function of the codeword analyzer is to find the parameters of the code, not the whole ECC structure. This chip has a Page size of 8640, so it is easy to see that 7 codewords remain to be added.



To add these 7 missing codewords automatically based on the previous one, click the button highlighted below.



If the Spare Area size differs from codeword to codeword, the following codewords have to be adjusted according to the page structure. To edit a codeword, select it and click the Edit button.





The second and subsequent codewords have a Payload area that is 4 bytes smaller. Once it has been adjusted, click 'Ok' to add this codeword.
All other codewords have the same structure, so they can easily be populated from the current one.





When all codewords are added, it's time to check the ECC map.



A green map for the majority of pages means the code works properly.

If most of the ECC map elements are red, the ECC is incorrect and the code structure has to be checked again.

Practice

Case  #1
  1. Download the video tutorial
  2. Download the case and find ECC code
  3. Check the influence of ECC on data

