0.2 V 8T SRAM with PVT-Aware Bitline Sensing and Column-based Data Randomization

Anh Tuan Do, Member, IEEE, Zhao Chuan Lee, Bo Wang, Student Member, IEEE, Ik-Joon Chang, Member, IEEE, Xin Liu, Member, IEEE and Tony T. Kim, Senior Member, IEEE

Abstract—In sub/near-threshold operation, SRAMs suffer from considerable bit-line swing degradation when the data pattern of a column is skewed to ‘1’ or ‘0’. The worst case occurs when the cell being read stores different data from the rest of the cells on the same column. In this work, we overcome this challenge by using a column-based randomization engine (CBRE). The CBRE circuit randomizes the data stored in the SRAM, which brings the distribution of “1” and “0” in each column close to 50% and significantly increases the bit-line swing. To further improve the bit-line swing, we employ bit-line boost biasing and dynamic bit-line keeper schemes. Based on these techniques, we fabricated a 256 rows × 128 columns (32 Kb) 8T SRAM array in 65 nm CMOS technology. In silicon measurements, the SRAM array operates successfully at 200 mV and room temperature, where the energy consumption and access time are 1 pJ and 2.5 µs, respectively.

Index Terms—8T SRAM, bitline sensing, data randomization, sub-threshold circuit, near-threshold circuit, PVT-aware design

I. INTRODUCTION

With the explosive market growth of portable electronics such as tablets and smartphones, low power consumption has become of utmost importance to system designers [1]. Researchers have presented many low-power techniques in the literature [2, 3]. However, these techniques may not provide sufficient power savings for applications such as wearable devices and wireless sensor nodes, which require ultra-low power consumption because they are powered by extremely small batteries. Recently, the authors of [2] found that the energy-optimal point of digital circuits tends to appear near the threshold voltage of the transistors. This motivates sub- or near-threshold operation as a possible solution to meet the ultra-low power demand. However, in sub-100 nm technologies, ultra-low voltage operation poses many design challenges. Above all, it is extremely difficult to obtain reliable ultra-low voltage operation in SRAM. In sub- or near-threshold operation, transistor current is extremely sensitive to process, voltage and temperature (PVT) variations. As a result, the read and write stabilities of the conventional 6T SRAM become very poor at the worst PVT corners [4]. In the 6T SRAM, read and write stabilities have contradicting design requirements [5], making it difficult to improve both at the same time. These problems can be mitigated by using a higher supply voltage (VDD) for the SRAM than for the logic circuits. However, level shifters are then necessary to interface between the logic circuits and the SRAM. Furthermore, additional voltage regulators must be added to generate different VDDs for the logic core and the SRAM. These extra circuits incur significant power penalties, harming the energy efficiency of the whole system. Hence, researchers have made various efforts to aggressively scale down the supply voltage of SRAM.
For instance, alternative SRAM cells suitable for sub- or near-threshold operation have been developed. Some representative SRAM cells are shown in Fig. 1, where the read and write ports are separated by adding more transistors. Here, the data nodes are fully decoupled from read access, delivering good read stability in ultra-low voltage operation. The write stabilities of these SRAM cells are enhanced by techniques such as collapsed cell VDD [6], boosted word-line voltage schemes [7], or negative bit-lines [8]. Among these decoupled SRAM cells, the 8T cell of Fig. 1(b) is the most popular owing to its reasonable area overhead compared to the 6T SRAM [9-15].

A small bit-line swing is another factor that impedes robust ultra-low voltage operation of SRAM [16, 17]. As we scale down the supply voltage (VDD), the ratio of on-current to off-current becomes smaller. This implies that in ultra-low voltage operation, leakage noise from unaccessed SRAM cells may deteriorate the bit-line swing [13]. To address this challenge in the 8T SRAM, the authors of [18] employed sense amplifier redundancy. The redundancy reduces the offset of the sense amplifier, thereby improving sensing stability. However, in the local bit-lines of 8T SRAM arrays, single-ended sensing based on domino-style circuits is more widely used, and the above technique may not be applicable there.

In this work, we propose techniques to enhance the sensing stability of 8T SRAM. We note that leakage noise is data dependent, with the worst case occurring when the cell being read stores different data from the rest of the cells on the same column. We therefore randomize the data stored in the SRAM, which we call data randomization. This makes the distribution of “1” and “0” in each column close to 50% and
all RBLs are pre-charged to \( V_{DD} \) by using pre-charging PMOS devices (Fig. 2). When the read cycle is activated, the pre-charging PMOSs are turned off, leaving the RBLs floating. Each RBL is then discharged to ground if the accessed cell stores “0” and stays at \( V_{DD} \) otherwise.

As illustrated in Fig. 2, there are four currents associated with the read bitlines: \( I_{cell_0} \), \( I_{cell_1} \), \( I_{leak_0} \), and \( I_{leak_1} \), where \( I_{cell_0} \) and \( I_{cell_1} \) are the read currents sunk by the accessed cell when it stores a “0” and a “1”, respectively. Note that when a cell stores “1” (i.e. \( Q = \text{“1”} \)), its complementary bit (i.e. \( QB = \text{“0”} \)) is applied to the gate of the pull-down NMOS in the read port. Therefore, \( I_{cell_1} \) is ideally zero and much smaller than \( I_{cell_0} \). Similarly, \( I_{leak_1} \) and \( I_{leak_0} \) are the leakage currents flowing through the other, unaccessed cells which store “1” and “0”, respectively. Ideally, both \( I_{leak_0} \) and \( I_{leak_1} \) are zero and hence the sensing circuit can easily differentiate between \( I_{cell_0} \) and \( I_{cell_1} \). In reality, however, leakage current is not negligible and becomes more significant under low \( V_{DD} \), high temperature and a large number of cells per RBL. Hence, the leakage currents of unselected cells may discharge the RBL. For instance, if the selected cell stores a “1” (i.e. \( Q = \text{“1”} \)), \( QB = \text{“0”} \) is applied to the read port. Ideally, the RBL is not discharged under this condition. In reality, however, \( I_{cell_1} + I_{leak} \) considerably discharges the RBL, where \( I_{leak} \) is the total leakage contributed by the unselected cells on the same column. To assure correct sensing despite this problem, \( I_{cell_0} + I_{leak} \) should be significantly larger than \( I_{cell_1} + I_{leak} \), so that the sensing circuit can easily detect their difference.

However, as we scale down \( V_{DD} \) to save power, the on-current \( (I_{on}) \) decreases more quickly than the off-current \( (I_{off}) \), as shown in Fig. 3(a). As a result, the \( I_{on}/I_{off} \) ratio, which must be very high for reliable sensing, quickly diminishes in the ultra-low voltage region. For example, at a 1 V supply voltage, \( I_{on} \) is 110× larger than \( I_{off} \), while the two are almost comparable at 0.35 V, as shown in Fig. 3(b). At such low voltages, it is therefore difficult to satisfy the above condition.

We need to note that \( I_{leak} \) depends on the data stored in the SRAM cells, which makes the situation worse. In Fig. 2, \( I_{leak_1} \) flows through two stacked off-transistors while \( I_{leak_0} \) has a single off-transistor in its leakage path. Since leakage through a stacked path is exponentially reduced, \( I_{leak_1} \) is much smaller than \( I_{leak_0} \). Fig. 4 shows two extreme cases of this data dependency. In the column of Fig. 4(a), \( QB \) of the accessed cell is “1” while the \( QB \)s of all unaccessed cells are “0”. Under this circumstance, every unaccessed cell contributes \( I_{leak_1} \), and we express \( I_{leak} \) as \( (n-1) \times I_{leak_1} \), where \( n \) is the number of SRAM cells per column. On the other hand, in the column of Fig. 4(b), the \( QB \)s of all unaccessed cells are “1”; every unaccessed cell contributes \( I_{leak_0} \), and \( I_{leak} \) is described as \( (n-1) \times I_{leak_0} \). To obtain correct sensing regardless of the SRAM data pattern, the following inequality should be satisfied:

\[
I_{cell_0} + (n-1) \times I_{leak_1} > I_{cell_1} + (n-1) \times I_{leak_0} \tag{1}
\]

Under low \( V_{DD} \), high temperature and a large number of cells per RBL, it is difficult to satisfy equation (1), because these conditions make the \( I_{on}/I_{off} \) ratio small, as mentioned above. Fig. 5 illustrates this with the discharging waveforms of the RBL at 0.3 V, 80 °C, and 256 SRAM cells per bitline. We perform the simulations for the best and the worst data patterns. With the best data pattern, data sensing is easy, while with the worst one, equation (1) is not satisfied and hence it is difficult to assure correct sensing. Therefore, compensation techniques are needed to eliminate the effect of excessive leakage noise along the RBL.
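The data dependency described above can be illustrated with a small numerical sketch. It sums the cell current of the accessed cell and the leakage of the unaccessed cells on one RBL, then compares reading “0” against reading “1”. The current values (in arbitrary units) and the names `rbl_pull_down`, `margin`, and `CUR` are hypothetical and chosen only for illustration; they are not taken from the simulations in this paper.

```python
# Hypothetical per-cell currents in arbitrary units (illustrative assumptions only).
CUR = {"cell0": 100.0, "cell1": 0.1, "leak0": 1.0, "leak1": 0.05}
N = 256  # cells per RBL


def rbl_pull_down(accessed_bit, zeros, ones, cur=CUR):
    """Total current discharging the RBL: accessed cell plus unaccessed-cell leakage."""
    i_cell = cur["cell0"] if accessed_bit == 0 else cur["cell1"]
    return i_cell + zeros * cur["leak0"] + ones * cur["leak1"]


def margin(zeros_when_read0, zeros_when_read1, n=N):
    """Read-'0' current minus read-'1' current for the given unaccessed-cell patterns."""
    read0 = rbl_pull_down(0, zeros_when_read0, n - 1 - zeros_when_read0)
    read1 = rbl_pull_down(1, zeros_when_read1, n - 1 - zeros_when_read1)
    return read0 - read1


# Worst pattern: read '0' while all others store '1', read '1' while all others store '0'.
worst = margin(0, N - 1)
# Balanced pattern: about half of the unaccessed cells store '0' in both cases.
balanced = margin(127, 127)
print(f"worst-case margin: {worst:.2f}, balanced margin: {balanced:.2f}")
```

With these assumed numbers the worst-case margin is negative (reading “1” discharges the RBL faster than reading “0”), while for a balanced column the leakage terms cancel and the margin reduces to the difference between the two cell currents.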

B. Existing BL leakage compensation schemes

Several works have been reported to tackle the issues associated with large RBL leakage [20-23]. Their common approach is to measure the leakage during standby and inject a similar current into the active bitlines during the read cycle. For example, in [23], diode-connected PMOS transistors are used to pre-charge the BLs during standby (P2 and P3 in Fig. 6(a)). The voltages at the drains of these transistors represent the level of leakage current sunk by the entire column of cells on the BLs. At the end of the standby period, these voltages are sampled and stored on two capacitors and subsequently applied to the pull-up PMOS P6 and P8 to inject small currents into the BLs during the read operation. If P2/P3 and P5/P6 are matched, then the bitline leakage currents can be compensated. However, this technique cannot be used when the supply voltage is near or below the threshold voltage, as the voltage drop across the PMOS cannot be reliably sensed. In addition, the peripheral circuitry required for each bitline incurs a significant area overhead for the SRAM.

In [22], a pair of PMOS transistors is used for each column as a static load to provide pull-up current against the bitline leakage (Fig. 6(b)).

Fig. 6 Existing BL leakage compensation schemes. (a) using diode-connected PMOS and capacitors to store the BL leakage level [20], (b) a pair of PMOS is used for each column as static load to provide pull-up current against the bitline leakage [21]. (c) using replica BL [22].
is not suitable for ultra-low voltage operation. Furthermore, it is only applicable to differential sensing, which is not the case for the 8T SRAM cell design. Kim et al. propose a marginal RBL leakage compensation technique by means of a replica bitline circuit [24], shown in Fig. 6(c). Initially, all bit-line pull-up control signals (i.e. PCHG_{[0:3]}) are set to “1” to turn off all pull-up devices of the dummy RBL. They are then sequentially switched to “0” to increase the strength of the pull-up current until it is just enough for the output of the SA to switch from “0” to “1”. Since the dummy bitline is hardwired to the worst-case scenario (i.e. all “0”), this indicates the optimum level to pre-charge the RBLs in the main array. As a result, each RBL is tuned to settle just above the trip point of the sense amplifier so that when logic “0” is read, only a small swing is needed to change the sense amplifier output.

However, this approach may not account for the process variations that affect the trip point of each sense amplifier. Furthermore, the worst case scenario used in the replica RBL leads to more currents to be sourced from the pull-up devices of the main array, which can potentially hurt the pull-down path if the accessed cell stores a “0” and the rest of the column stores “1”.

III. COLUMN-BASED DATA RANDOMIZATION

A. Column-based data randomization concept

Our principal idea is to make the percentage of “1” and “0” in each column close to 50% so that bitline leakage is reduced and becomes independent of the input patterns. As a consequence, bitline sensing margin is improved. It also allows a given SRAM structure to widen its operation range to more extreme conditions where bitline leakage is comparable to the active current.

This concept is realized by shuffling data with specific patterns, such as those found in image processing applications, so that we can eliminate the worst-case scenarios on the read bitlines. In video encoders, pixels are processed in blocks (e.g. 16×16 in MPEG-2) which are stored in on-chip SRAM. Since most image frames have smooth backgrounds or large objects, nearby pixels tend to be similar (i.e. their MSBs are the same). For example, in Fig. 7, the maximum intensity difference among all pixels in the green square is only 21 LSB (Fig. 7(b)). Because of that, MSB columns are most likely to have skewed data, i.e. the majority of cells on the same column are ‘1’ or ‘0’. In extreme cases, an entire column may store “0” or “1”, except for only a few cells. A similar conclusion applies to other applications where the data in consecutive words change slowly. If such a data pattern is written into a conventional 8T SRAM, reading it at low voltage is challenging. On the other hand, if we pseudo-randomly toggle consecutive data before writing, what is actually stored in the memory will be approximately 50% ones and 50% zeros. Furthermore, if the randomization process is simple enough, the output data can be restored to its original value without much power overhead.

There are several ways to toggle input data before writing into the main memory. A linear feedback shift register (LFSR) is one of them. The LFSR is a popular hardware-based pseudorandom pattern generator [25]. It has also been used in Flash memory [26] to improve endurance by redistributing the data pattern along the WLs. The rule for shuffling data in Flash memory differs from row to row, and therefore a long random sequence is required from the LFSR. In SRAM, we need to randomize data only along the column, not the row. Furthermore, the same rule can be used in different columns. Therefore, our implementation is different from that of Flash memory. In SRAM, if skewed data flow through a randomization engine and bits are randomly flipped, the output sequence of “1” and “0” will be equally distributed. As a result, if one looks at the data pattern along a bitline, the worst-case scenarios are transformed into the average case, where the numbers of “0”s and “1”s are almost the same. Fig. 8 compares the percentage of “1” in a simulated RBL during 20000 successive write cycles. Fig. 8(a) shows that when 99% of the input data are “1”, the percentage of “1” along the BL also approaches the same number in a conventional SRAM. Similar conclusions can be drawn when the percentage of “1” in the input data is 50% and 1%, respectively. However, if these data pass through a randomization engine, the density of “1” only marginally fluctuates around 50%, as shown in Fig. 8(b), regardless of the original data distribution. Thus, the temporal and spatial BL leakage variations are significantly reduced.
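The effect of random flipping on the density of “1” can be reproduced qualitatively with a short sketch. It is not the simulation behind Fig. 8: Python's `random.Random` is used here as a stand-in for a hardware pseudorandom generator, and the function names, seeds, and the choice of 20000 writes per column are assumptions made for illustration.

```python
import random


def skewed_column(p_one, n=20000, seed=0):
    """Generate n input bits where each bit is '1' with probability p_one."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(n)]


def randomize(bits, seed=1):
    """Flip each bit with a pseudorandom flip bit (XOR); same seed reverses it."""
    rng = random.Random(seed)
    return [b ^ rng.getrandbits(1) for b in bits]


def ones_density(bits):
    """Fraction of '1' in the bit sequence."""
    return sum(bits) / len(bits)


for p in (0.99, 0.50, 0.01):
    col = skewed_column(p)
    print(f"input p(1)={p:.2f}  stored raw={ones_density(col):.3f}  "
          f"stored randomized={ones_density(randomize(col)):.3f}")
```

Because the flip sequence is regenerated from the same seed, applying `randomize` a second time restores the original bits, which mirrors the requirement that the data can be reversed on read.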

Other than minimizing the bitline leakage variations, data randomization also offers a more uniform bitline delay. For the 8T SRAM structure, the best delay in reading zero occurs when all unselected cells store “0”, while the worst delay occurs when they store “1”. As illustrated in Fig. 9, the worst-case delay can be up to 50% longer than the best case. Note that the worst-case delay defines the memory’s speed, regardless of how fast the best case is. The bitline for reading “1” is supposed to stay at V_{DD}, so the concept of delay does not apply to it. With randomization, the bitline delay varies by less than 5% around the median value, as also shown in Fig. 9.

Even though the proposed technique randomizes the column effectively, there will be specific input patterns that generate skewed data after passing through the randomization engine. However, it is very unlikely if the input pattern is skewed or has long sequences of continuous “0” or “1” (for example: “00001110000”). In practice, we can avoid this situation by applying the proposed data randomization only to columns of MSBs where it is more likely to have long segments of the same data (i.e. nearby pixels of the image have the same MSBs). The LSBs can be left unchanged as the correlations of LSBs between nearby pixels are statistically low. Another approach is to include column-based registers to enable/disable the flipping logic which can be programmed in real time. In short, the pseudo-randomization scheme is more effective if input data have long sequences of “0” or “1” and designers must be aware of this point to avoid extreme case scenarios after randomization.

B. Circuit realization

In this implementation, XNOR gates are used to realize our column-based randomization engine (CBRE). The row address controls the CBRE, whose output is fed to the column-based flipping logic of the write circuits. Each input bit is flipped or kept unchanged depending on the output of the CBRE. Since we are only concerned with the data pattern along each column, the row address can be used to decide whether a bit should be flipped. This implementation is sufficient to validate the improvement offered by the proposed idea. Furthermore, it offers a simple way to reverse the data during read operations. Gate-level schematics of the CBRE and the flipping logic are shown in Fig. 10. The XNOR gates ensure that if a long chain of consecutive “1” or “0” is written to a column, it is transformed into a sequence of alternating “1” and “0”. As the Flip signal controls the MUX, ~Di is selected if Flip is high; otherwise, Di is transferred to the write driver. Fig. 11 illustrates how this scheme works on a sequence of six write operations to the same column. Assume the input bit sequence is “111100”, with the six different addresses shown on the left. The address bits are XORed to generate Flip. As can be seen, only the 1st, 4th and 5th bits are flipped while the rest are kept the same. As a result, the actual sequence written to the column is “011010”. For video encoders or similar applications, this configuration can be applied to the MSBs. In this implementation, it is used in all columns for the purpose of characterization. Note that this CBRE is only one way to implement the data-shuffling concept. It is the preferred implementation because it also makes it very easy to restore the original data during the read operation.
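A behavioural sketch of address-controlled flipping is given below. The exact gate network of Fig. 10 is not reproduced here; the sketch assumes, purely for illustration, that Flip is the XOR (parity) of all row-address bits, so it will not necessarily match the address sequence used in Fig. 11. The function names are hypothetical.

```python
def flip_signal(row_addr: int) -> int:
    """Assumed Flip signal: XOR (parity) of the row-address bits."""
    return bin(row_addr).count("1") & 1


def write_bit(d: int, row_addr: int) -> int:
    """MUX behaviour on write: store ~Di when Flip is high, Di otherwise."""
    return d ^ flip_signal(row_addr)


def read_bit(stored: int, row_addr: int) -> int:
    """Reverse the flip on read using the same row address."""
    return stored ^ flip_signal(row_addr)


ROWS = 256
column_in = [1] * ROWS  # fully skewed input: every row of the column is '1'
stored = [write_bit(d, r) for r, d in enumerate(column_in)]
recovered = [read_bit(s, r) for r, s in enumerate(stored)]

print("fraction of '1' actually stored:", sum(stored) / ROWS)
print("data recovered correctly:", recovered == column_in)
```

Under this assumption, a column written with all “1” over 256 consecutive row addresses stores exactly half “1” and half “0”, and the read path recovers the original data because the flip depends only on the row address.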

Since the CBRE is shared between all columns, the area overhead of the design only includes a MUX and an inverter per column. As a result, the design is less than 3% larger than the original. Its write power overhead is less than 1%. Note that if the number of rows per column is increased, the area and power overhead becomes even smaller.

![Fig. 7 Region of Interest (RoI) in the Lena image.](image)

![Fig. 8 Functional block diagram of a 16x16 pixels SRAM bank.](image)

Fig. 12 verifies the effectiveness of the toggling method using the CBRE in Fig. 10. We assume that the green square (the Region of Interest, i.e. RoI) in Fig. 7 moves from the top left corner of the Lena image to the bottom right corner, from left to right and top down. The RoIs are non-overlapping. Each RoI contains 16×16 pixels, each with an 8-bit binary value. Each RoI is then written to an SRAM bank of 256×8 cells. The first column of the bank stores the MSB (i.e. bit 7) while the last column stores bit 0. As the RoI moves, the data of the first 289 squares (i.e. 17 squares from left to right and 17 squares from top to bottom) are recorded, and the percentage of “1” in each column of the SRAM bank is shown in Fig. 12(a). We only show bits 7, 6, 3, 1 and 0 for the sake of clarity. It can be seen that before toggling, the original Lena image results in extreme cases of 100% or 0% of “1” in the MSB column. A similar observation can be made for the column containing bit 6. Only in the columns containing bit 0 and bit 1 does the percentage of “1” vary around 50%. This agrees with our hypothesis in the previous section.
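The per-column statistic used in this experiment can be sketched as follows. The block below is synthetic rather than taken from the Lena image: it is a hypothetical 16×16 RoI whose intensities are confined to a 21-level window (160 to 180). The flip rule is again assumed to be the parity of the row address, applied to all eight columns, and all names are illustrative.

```python
def column_ones_percent(pixels, bits=8):
    """Percentage of '1' in each bit column (MSB first) of a 256x8 bank."""
    n = len(pixels)
    return [100.0 * sum((p >> b) & 1 for p in pixels) / n
            for b in range(bits - 1, -1, -1)]


def parity(x: int) -> int:
    """XOR of all bits of x (assumed flip rule, for illustration only)."""
    return bin(x).count("1") & 1


# Synthetic smooth RoI: 256 pixels with values between 160 and 180.
block = [160 + (i * 5) % 21 for i in range(256)]

# Flip the whole 8-bit word when the row-address parity is 1.
flipped = [p ^ (0xFF * parity(row)) for row, p in enumerate(block)]

percent_before = column_ones_percent(block)
percent_after = column_ones_percent(flipped)
print("bit 7..0 before:", [round(x, 1) for x in percent_before])
print("bit 7..0 after: ", [round(x, 1) for x in percent_after])
```

For this synthetic block the three most significant columns are fully skewed before flipping (bit 7 and bit 5 all “1”, bit 6 all “0”) and exactly balanced afterwards, which is the same qualitative behaviour reported for the MSB columns.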

When the proposed toggling method using CBRE is applied to the same image, same writing procedures, we can see in Fig.

![Fig. 9 Bitline discharge delay of the conventional 8T and randomized 8T.](image)

![Fig. 10 Gate-level flipping control and column-based flipping logic.](image)

![Fig. 11 Operating principle of the flipping control circuit.](image)
Similarly, we randomly chose 10 grayscale images from Google image search, performed the same procedure and observed the maximum and minimum percentage of “1” for each image. The results are shown in Fig. 12(b). All images have a minimum percentage of “1” equal to zero, which means that at least one column of one bank contains all zeros when storing an RoI from these images. Most of them have a maximum percentage of “1” equal to 100%, i.e. the whole column is “1”. On the other hand, the proposed method offers a good balance of “1” and “0” in every column of every bank, since the maximum and minimum percentages of “1” are 62% and 38.6%, respectively.

Note that the randomization scheme does not reduce the BL leakage significantly, instead makes it almost the same for both reading “1” and “0”. As the BL leakage becomes a common component, it is easier to mitigate the impact of the BL leakage on sensing by providing a common BL leakage compensating current. However, when the BL leakage (even with 50% data pattern) is much larger than the actual cell current, it is difficult to generate sensible BL swing at ultra-low voltage condition, especially with process variation. Thus a BL length equal to or less than 256-row is recommended. Furthermore, we assume that nearby input image pixels are written continuously in the same column so that we can take advantage of the similarity of the data pattern along the column. This makes the proposed SRAM particularly useful in video/image processing applications.

IV. PVT-AWARE DYNAMIC BL KEEPER

A. BL-boosting current at ultra-low voltage operation

Reliable RBL sensing requires both a large RBL swing and a wide sensing window. In [19], we proposed a PVT-aware bias voltage generator that helps to boost both the RBL swing and the BL sensing window, especially in the ultra-low voltage region. The proposed BL reading scheme and its timing diagram are shown in Fig. 13. Unlike the conventional 8T design, during standby the read bitline (RBL) of the proposed design is discharged to ground instead of being charged to VDD. This reduces the BL leakage current in idle mode. During a read, the Read signal turns on P2 to charge the RBLs up. If the accessed cell stores a “1”, QB is low and thus Iread1 is insignificant. As a result, the RBL rises to a high voltage level (VBL1). If it stores a “0”, the corresponding VBL0 will be only slightly higher than ground, due to the contention between Iboost and Iread0 (Fig. 13(b)). The stacked PMOS configuration shown in Fig. 13(a) is biased by a Bias Voltage Generator [19], which automatically tunes the bias of P1 to maximize the voltage gap between VBL1 and VBL0. A simplified block diagram of the Bias Voltage Generator is shown in Fig. 13(c). It consists of two dummy columns which represent the worst cases for reading “0” and “1”, respectively. Initially, the down-counter (CT0) is at its maximum value, thus Vref0 is at VDD and VRBL0 is at ground. As CT0 counts, Vref0 steps down and Iboost increases. This process continues until VBL0-buffered turns high. This defines the minimum allowable value of Vbias0, and at the same time RBL0-buffer blocks the clock to CT0 and locks the output bits (A0-A2). Similarly, the bottom loop tracks the maximum allowable value of Vbias1 by using the up-counter. The locked values of A0-A2 and B0-B2 are fed to a second set of reference ladders to generate the final Vbias, whose value is theoretically equal to (Vref0 + Vref1) / 2. The area overhead of this circuit is less than 5% (including the dummy column). The counters (and thus Vbias) can be reset and tuned again if the operating conditions change abruptly. In this example, we only use 3 bits for demonstration.
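The two tuning loops can be summarized with a behavioural sketch. Several details below are assumptions made for illustration and are not stated in the text: the reference ladder is modelled as linear, the up-counter is assumed to start from zero, and the two dummy columns are replaced by hypothetical threshold functions (`rbl0_flips_high`, `rbl1_stays_high`) with arbitrary trip voltages.

```python
VDD = 0.2   # supply voltage in volts
STEPS = 8   # 3-bit counters give 8 ladder codes


def ladder(code, vdd=VDD, steps=STEPS):
    """Assumed linear reference ladder: code 0 -> 0 V, code 7 -> VDD."""
    return vdd * code / (steps - 1)


def tune_bias(rbl0_flips_high, rbl1_stays_high):
    """Return (Vref0, Vref1, Vbias) after both counter loops have locked."""
    # Down-counter loop: start at VDD and step down until the dummy
    # read-'0' column is pulled high by the boost current.
    code0 = STEPS - 1
    while code0 > 0 and not rbl0_flips_high(ladder(code0)):
        code0 -= 1
    # Up-counter loop: start at 0 V and step up until the dummy
    # read-'1' column can no longer be held high.
    code1 = 0
    while code1 < STEPS - 1 and rbl1_stays_high(ladder(code1)):
        code1 += 1
    vref0, vref1 = ladder(code0), ladder(code1)
    return vref0, vref1, (vref0 + vref1) / 2


# Hypothetical dummy-column behaviour (trip voltages are illustrative only).
vref0, vref1, vbias = tune_bias(
    rbl0_flips_high=lambda v: v < 0.05,
    rbl1_stays_high=lambda v: v < 0.15,
)
print(f"Vref0={vref0 * 1e3:.1f} mV  Vref1={vref1 * 1e3:.1f} mV  Vbias={vbias * 1e3:.1f} mV")
```

With these assumed trip points the loops lock at ladder codes 1 and 6, and the final bias is the midpoint of the two locked references, as in the (Vref0 + Vref1) / 2 expression above.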

Fig. 15 Schematic of the proposed dynamic BL keeper control circuit.

Fig. 16 Transient simulation waveforms of the dynamic bitline keeper control.

Fig. 14. Impact of the proposed boosting and randomization techniques on the BL swing when compared to the conventional BL structure.

B. Dynamic BL keeper for wide operating voltage range

The above read scheme generates an optimum biasing voltage for the bitline keeper to boost the bitline swing at ultra-low voltage. However, it draws a direct current from VDD to ground through the read ports of the accessed SRAM cells during the read operation. It is slower and consumes more power than the conventional design at high voltages. Therefore, it only benefits the SRAM energy efficiency when the required operating voltage is extremely low. In this section, a PVT-aware leakage sensor is proposed to dynamically switch the RBL keeper on/off so that the SRAM achieves both high speed and low power in the high supply voltage domain while operating reliably at extremely low voltage.
Table I: Design Metric Comparison with Various Ultra-Low Voltage SRAMs

<table>
<thead>
<tr>
<th>Metric</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>This work</th>
</tr>
</thead>
<tbody>
<tr>
<td>Technology</td>
<td>65 nm</td>
<td>65 nm</td>
<td>65 nm</td>
<td>65 nm</td>
<td>65 nm</td>
</tr>
<tr>
<td>Density</td>
<td>128 kb</td>
<td>32 kb</td>
<td>4 kb</td>
<td>16 kb</td>
<td>32 Kb</td>
</tr>
<tr>
<td>Transistor count</td>
<td>8T</td>
<td>9T</td>
<td>8T</td>
<td>9T</td>
<td>8T</td>
</tr>
<tr>
<td>Cell size</td>
<td>N.A</td>
<td>1.24 x 2.31 µm²</td>
<td>1.4 µm²</td>
<td>2.86x0.72 µm²</td>
<td>0.52x2.6 µm²</td>
</tr>
<tr>
<td>V<sub>DDmin</sub></td>
<td>0.37 V</td>
<td>0.28 V</td>
<td>0.25 V</td>
<td>0.26 V</td>
<td>0.2 V</td>
</tr>
<tr>
<td>Access time</td>
<td>N.A</td>
<td>4.55 µs (0.3 V)</td>
<td>500 ns (0.25 V)</td>
<td>0.85 µs (0.26 V)</td>
<td>2.5 µs (0.2 V)</td>
</tr>
<tr>
<td>Leakage current</td>
<td>N.A</td>
<td>0.05 µA (0.4 V)</td>
<td>300 µA (0.25 V)</td>
<td>1.4 µA</td>
<td>1.1 µA (0.2 V)</td>
</tr>
<tr>
<td>Min. Energy (E<sub>min</sub>)</td>
<td>21.2 pJ</td>
<td>0.57 pJ</td>
<td>N.A</td>
<td>2.07 pJ</td>
<td>1 pJ</td>
</tr>
<tr>
<td>Normalized E<sub>min</sub></td>
<td>162 aJ/b</td>
<td>278 aJ/b</td>
<td>N.A</td>
<td>126 aJ/b</td>
<td>31 aJ/b</td>
</tr>
</tbody>
</table>
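The normalized energy row can be checked with a one-line calculation, assuming that the normalization is the minimum energy divided by the array capacity in bits and that 1 kb equals 1024 bits. Both are assumptions, and the function name is hypothetical. The sketch covers the 128 kb, 16 kb, and 32 Kb designs; other columns may use a different normalization basis.

```python
def normalized_energy_aj_per_bit(e_min_pj: float, density_kb: float) -> float:
    """Minimum energy per stored bit in aJ/b, assuming 1 kb = 1024 bits."""
    bits = density_kb * 1024
    return e_min_pj * 1e6 / bits  # 1 pJ = 1e6 aJ


for label, e_min_pj, density_kb in [
    ("128 kb 8T", 21.2, 128),
    ("16 kb 9T", 2.07, 16),
    ("this work (32 Kb 8T)", 1.0, 32),
]:
    value = normalized_energy_aj_per_bit(e_min_pj, density_kb)
    print(f"{label}: {value:.1f} aJ/b")
```

Under these assumptions the three results round to the 162 aJ/b, 126 aJ/b, and 31 aJ/b entries of Table I.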

Fig. 17: Normalized read energy per bit of the proposed dynamic BL keeper scheme when compared with the conventional design, room temperature, TT corner.

Fig. 18: Simulated leakage current of the on-chip leakage sensor, room temperature, TT corner.

Fig. 19: Output voltage, normalized to different VDD, of the on-chip leakage sensor at room temperature, TT corner.

The schematic of the proposed dynamic BL keeper control is shown in Fig. 15. It includes a static leakage sensor and five standard logic gates. The leakage sensor replicates a real RBL configuration, where P<sub>0</sub> serves as an RBL keeper and 128 pairs of N<sub>1</sub>-N<sub>2</sub> act as a replica RBL in the off state. The gate of N<sub>1</sub> is grounded while that of N<sub>2</sub> is connected to V<sub>DD</sub> to model the worst-case leakage configuration. Since all N<sub>1</sub> are turned off, the DC current flowing through this leakage sensor is small.

At a reasonably high voltage (above 0.4 V), I<sub>on</sub> supplied by P<sub>0</sub> is strong enough to hold node A close to V<sub>DD</sub>, indicating that leakage current along the RBL is insignificant. As a result, boost-disable (BD) signal is high and PreCharge (PREC) signal is active. During standby, PREC is low to precharge RBL to V<sub>DD</sub> by turning on P<sub>1</sub>. During Read, PREC goes high to turn off P<sub>1</sub>, leaving RBLs floating and then conditionally discharged by the read current of the accessed SRAM cell similar to that of the conventional design.

Once the supply voltage is reduced to a very low level (i.e. below 0.35 V) or the operating temperature is high, the BL leakage current dominates I<sub>on</sub>. Therefore, the voltage at node A (V<sub>A</sub>) is discharged to a low level. Once V<sub>A</sub> crosses the threshold voltage of the buffer, it signifies that the BL leakage is too high. As a result, BD goes low and the boost-enable (BE) signal goes high to enable the RBL boosting scheme described in Section IV.A. PREC is kept at V<sub>DD</sub> to disable P<sub>1</sub> while the pre-discharge signal (PRED) follows RD to control P<sub>3</sub> in the same manner shown in Fig. 9. Note that around the borderline, both schemes perform similarly and an exact criterion for switching between the two schemes is not required. Fig. 16 illustrates the operation of the sensing circuit when V<sub>DD</sub> changes from 0.4 V to 0.2 V. At 0.4 V, BD is high while BE is low and thus the PREC signal is activated. At the same time, PRED is always high to turn off P<sub>3</sub>. When V<sub>DD</sub> is reduced to 0.2 V, BE goes high and BD goes low so that the PREC signal is deactivated and PRED controls the read operation of the memory. Fig. 17 summarizes the normalized read energy per bit of the proposed dynamic BL keeper. It can be seen that the BL boosting design only outperforms the conventional design when V<sub>DD</sub> is below 0.4 V. The proposed dynamic BL keeper scheme closely tracks both the BL boosting and the conventional design and seamlessly switches between the two to achieve the lowest possible energy consumption over a wide range of operating conditions.

It is important to ensure that the power and area overhead of the sensing circuit is low. Fig. 18 shows the simulated current of the leakage sensor at various supply voltages. Since all N<sub>1</sub> are hardwired to the off state, its current consumption is essentially equal to the RBL leakage of one column, which is less than 0.3% of the total leakage of the whole memory. For example, its current consumption at 1 V and 0.34 V is 209 nA and 32 nA, respectively. Its area is only 0.5% of the main array area.

Although the output of the sensor buffer is either $V_{DD}$ or zero, the voltage at node A can take any analog value. Because node A is static, this may cause a significant DC current to be dissipated by the first inverter of the output buffer. Fortunately, at supply voltages above 0.45 V, $V_A$ is very close to $V_{DD}$. It only starts to drop around 0.4 V and settles near $V_{DD}/2$ when the supply voltage is around 0.3 V. Since this supply range is well within the subthreshold region, the direct current through the buffer is insignificant, as indicated by the total leakage current of the sensor in Fig. 18.

Because an inverter-based buffer is used to amplify the analog voltage at node A to full CMOS level (Fig. 15), in the very rare event that $V_A$ equals the threshold voltage of the buffer, its output will be neither $V_{DD}$ nor zero but an intermediate voltage. To minimize the likelihood of this event, the gain of the inverter must be maximized around its threshold voltage. Fig. 19 presents the static normalized output voltage of the buffer at different supply voltages at room temperature. 1000 supply voltage values from 0.2 V to 1 V were used to closely examine the behaviour of the sensor. The graph in Fig. 19 has three distinct regions: (i) at high voltages, the output equals $V_{DD}$; (ii) at very low voltages, the output is zero; (iii) in between, there is a very sharp transition between 1 and zero, which spans 9 mV, from 331 mV to 340 mV on the x-axis. We have run more comprehensive simulations at different temperatures, and the results gave similar curves. To make an intermediate output even less likely, more buffer stages can be added, so that $V_A$ coinciding with the threshold of the multi-stage buffer becomes extremely rare. In our simulation, by using four inverters, the span of the sharp transition is reduced from 9 mV to less than 2 mV.

Fig. 20 1000-iteration Monte-Carlo simulation of the BL waveforms and the inverter-based BL buffers during a read operation at 0.2 V. (a) Temperature = 27$^\circ$C; boost bias = 25 mV. (b) Temperature = 80$^\circ$C; boost bias = 10 mV. The boost bias voltage is generated by the on-chip generator circuit, which tracks $V_{DD}$ and temperature changes.

![Fig. 20 1000 iteration Monte-Carlo simulation of the BL waveforms and the inverter-based BL buffers during a read operation at 0.2 V.](image)

Fig. 21 Structure of the proposed design

![Fig. 21 Structure of the proposed design](image)

Fig. 22 1000-iteration Monte-Carlo simulation of the write operation at 0.2 V, room temperature. (a) Write WL waveform. (b) Q and QB of a cell when BL = 200 mV; BLB = 0 V. (c) Q and QB of a cell when BL = 200 mV; BLB = 30 mV with no error. (d) Q and QB of a cell when BL = 200 mV; BLB = 40 mV.

![Fig. 22 1000 iteration Monte-Carlo simulation of the write operation at 0.2 V](image)

Fig. 20 shows the 1000-iteration Monte-Carlo simulation of the BL waveforms and the inverter-based BL buffers during a read operation at 0.2 V. It can be seen that, even with process variations, there is still enough input margin for the inverter to resolve the correct BL level. To minimize the impact of process variations, larger transistor sizes are used for the inverters, and they are sized so that the trip point is around $V_{DD}/2$ at the target operating voltage. Note that at higher supply voltages a much larger BL voltage margin is available, so the drift in the inverter trip point is not as critical as in ultra-low-voltage operation.
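
As a rough illustration of the kind of margin check that Fig. 20 represents, the following Python sketch samples inverter trip points and BL levels and counts the iterations in which the trip point no longer separates the two BL levels. The Gaussian variation model, the function name `count_read_failures`, and every numeric value (nominal BL levels and sigmas) are assumptions chosen for illustration; they are not taken from the simulation setup or the test chip.

```python
import random


def count_read_failures(vdd=0.2, n=1000, trip_sigma=0.010,
                        bl_high=0.17, bl_low=0.03, bl_sigma=0.005, seed=0):
    """Count Monte-Carlo iterations in which the buffer would misread the BL.

    All default values are hypothetical placeholders (volts), not measured data.
    A read is counted as correct only if the sampled inverter trip point
    lies strictly between the sampled low and high BL levels.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    failures = 0
    for _ in range(n):
        trip = rng.gauss(vdd / 2, trip_sigma)  # trip point centred at VDD/2
        high = rng.gauss(bl_high, bl_sigma)    # BL level when reading one state
        low = rng.gauss(bl_low, bl_sigma)      # BL level when reading the other
        if not (low < trip < high):
            failures += 1
    return failures


if __name__ == "__main__":
    print("failures with wide BL margin:", count_read_failures())
    print("failures with no BL margin:",
          count_read_failures(bl_high=0.10, bl_low=0.10))
```

In this toy model, widening the gap between the two BL levels or reducing `trip_sigma` (the role played by larger inverter transistors) both lower the failure count.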

V. TEST CHIP AND MEASUREMENT RESULTS

The proposed design has been fabricated in a 65 nm CMOS process. Its block diagram is illustrated in Fig. 21; it consists of a row decoder and an array of 256 rows × 128 columns of 8T SRAM cells. The CBRE is shared within the chip and controls the flipping logic of all columns. In this work, to test the effectiveness of the proposed bitline scheme and the randomization, all columns are read or written at the same time and all flipping logic is activated. In practice, column-based flipping logic can be employed to enable randomization on some columns individually while leaving it disabled on the others. This supports our earlier hypothesis that the MSBs should be randomized while the LSBs should stay untouched.

For the SRAM to be written successfully at ultra-low voltage, we avoid the half-select disturbance issue by eliminating the bit-interleaving architecture. To test our proposed read schemes, all columns are written at the same time. A hierarchical WL architecture can also be used if only a fraction of the columns is to be accessed. This approach is simpler and more stable than other advanced write-assist schemes and is thus preferable for demonstrating our improved read operation. Fig. 22 shows the Monte-Carlo simulations of the write operation at 0.2 V and room temperature. It can be seen from Fig. 22(b)-(c) that when $BL = 200$ mV and $BLB = 0$ V or 30 mV, no write failure is observed. However, when $BLB$ is pulled down only to 40 mV (Fig. 22(d)), write failures start to occur. This indicates that the SRAM cell has a write margin of about 30 mV without any write-assist techniques at 0.2 V. Nevertheless, bit-interleaving must be avoided to prevent data destruction due to the half-select issue. It is worth mentioning that the combination of BL boosting and randomization makes our design much more robust in the ultra-low-voltage regime than the conventional design [9]. Note that the proposed boosting technique will not even work if $I_{leak} < I_{leak-min} < I_{leak-max}$. In other words, the proposed boosting technique enhances a sensing margin that is already correct (but small). Therefore, a positive sensing margin should be achieved before employing the proposed boosting technique; in this work, this is achieved by the proposed randomization technique.
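
The write-margin estimate above can be read directly off the BLB sweep of Fig. 22. The following minimal Python sketch uses only the three BLB levels reported in the text; the function name `write_margin_mv` and the pass/fail encoding are illustrative assumptions, and the true margin lies somewhere between the last passing and the first failing level.

```python
def write_margin_mv(sweep):
    """Return the highest BLB level (in mV) that still writes correctly.

    `sweep` is a list of (blb_mv, write_ok) pairs. Levels are scanned in
    ascending order and the scan stops at the first failing level.
    Returns None if even the lowest level fails.
    """
    margin = None
    for blb_mv, write_ok in sorted(sweep):
        if not write_ok:
            break
        margin = blb_mv
    return margin


# BLB levels and outcomes as described for Fig. 22(b)-(d), with BL = 200 mV.
FIG22_SWEEP = [(0, True), (30, True), (40, False)]

if __name__ == "__main__":
    print(write_margin_mv(FIG22_SWEEP))  # 30
```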

A micrograph of the test chip is shown in Fig. 23. Various test patterns were generated using an Agilent logic analyzer to validate the advantage of the proposed design over different data distributions. The measured waveforms in Fig. 24 confirm that the SRAM based on the proposed data randomization scheme works properly at 0.2 V, while the conventional design fails to read “1”. At 0.2 V, our fabricated SRAM has a power consumption and access time of 0.7 µW and 2.5 µs, respectively. Fig. 25 compares the minimum operating voltages ($V_{DDmin}$) of the two designs at two temperatures (27°C and 80°C). At room temperature, the $V_{DDmin}$ of the randomized SRAM is 50 mV lower than that of the conventional one. As temperature rises, the $V_{DDmin}$ of both designs tends to increase since bit-line leakage noise increases exponentially. At 80°C, the randomized SRAM allows 90 mV of further scaling compared to the conventional one, which is larger than the 50 mV obtained at 27°C. This shows that our proposed scheme is more effective at highly leaky PVT corners, where bit-line leakage noise is more problematic.

As the supply voltage scales down, the proposed design offers much lower power consumption, as shown in Fig. 26. At extremely low supply voltages (i.e., $V_{DD} < 0.4$ V), its leakage power becomes comparable to its dynamic power. Coupled with the fact that the read access time increases exponentially in this region (Fig. 27), the total energy per access actually increases when the supply voltage drops below 0.4 V. As a result, the SRAM reaches its minimum energy point of 1 pJ/access at a supply voltage of 0.4 V, as illustrated in Fig. 27(b).

We compare our measurement results with those of recently published low-voltage SRAMs in Table I. As mentioned above, in our test chip we connect a large number of SRAM cells to a single RBL, which worsens bit-line leakage noise. Nonetheless, our design shows the lowest $V_{DDmin}$, demonstrating the effectiveness of our proposed scheme. In the comparison of the minimum energy per bit, expressed as normalized $E_{min}$ in Table I, our proposed design shows considerably lower energy dissipation than the others. Our design does not pre-charge the RBL, which significantly reduces active leakage current. Considering that in ultra-low-voltage operation active leakage energy is almost comparable to switching energy, this approach considerably improves total energy dissipation.
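
The trade-off behind the minimum energy point discussed in this section can be sketched with a simple model in which the energy per access is the switching energy plus the leakage power integrated over the access time, $E = C_{eff} V_{DD}^2 + V_{DD} I_{leak} t_{access}$. The Python sketch below is illustrative only: the exponential access-time model, the constant leakage current, and every parameter value are assumptions that are not fitted to the measurements in Fig. 26 or Fig. 27, so the location of its minimum should not be compared with the measured 0.4 V.

```python
import math

# All values below are hypothetical placeholders, not measured data.
C_EFF = 5e-12   # effective switched capacitance per access [F]
I_LEAK = 1e-6   # array leakage current, assumed independent of VDD [A]
T_REF = 2.5e-6  # access time at V_REF [s]
V_REF = 0.2     # supply voltage at which T_REF applies [V]
SLOPE = 0.04    # exponential slope of access time versus VDD [V]


def access_time(vdd):
    """Access time that shrinks exponentially as VDD rises (subthreshold-like)."""
    return T_REF * math.exp(-(vdd - V_REF) / SLOPE)


def energy_per_access(vdd):
    """Switching energy plus leakage power integrated over one access."""
    e_dyn = C_EFF * vdd ** 2
    e_leak = vdd * I_LEAK * access_time(vdd)
    return e_dyn + e_leak


def minimum_energy_point(v_start=0.2, v_stop=1.0, steps=81):
    """Sweep VDD on a uniform grid and return (vdd, energy) at the minimum."""
    vdds = [v_start + (v_stop - v_start) * i / (steps - 1) for i in range(steps)]
    best = min(vdds, key=energy_per_access)
    return best, energy_per_access(best)


if __name__ == "__main__":
    v_min, e_min = minimum_energy_point()
    print(f"model minimum: {e_min * 1e12:.3f} pJ at {v_min:.2f} V")
```

Below the minimum, the leakage term grows because the access time lengthens exponentially; above it, the quadratic switching term dominates. This is the same qualitative shape as the measured energy curve.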

VI. CONCLUSION

A 32Kb 8T SRAM array with column-based data randomization for ultra-low voltage applications is presented. The densities of “0” and “1” along the bit-line are well balanced, which makes the bit-line leakage virtually independent of the input pattern. Coupled with BL boosting and dynamic BL keeper control, the proposed SRAM provides robust ultra-low voltage operation. We fabricated a test chip of the proposed SRAM in 65 nm CMOS. In our measurements, the minimum operating energy of the SRAM is 1 pJ/read, which is obtained at 0.4 V.

REFERENCES


