This work presents a novel strategy for efficiently extracting macromodels from multiphysics microelectromechanical system (MEMS) devices with weak coupling. A packaged thermal wind sensor is used, for the first time, to verify the feasibility of the strategy. Based on its characteristics, the packaged thermal wind sensor can be divided into three distinct energy fields: thermal, fluid, and electric. The sensor is first built as a finite element model, which represents a pure thermal system under no-wind conditions. Then, the thermal system, which is linear and time-invariant (LTI), is identified as a state space model from its step response, obtained by transient simulation of the finite element model. The other energy fields can be regarded as negative inputs to the state space model. Subsequently, the macromodel is constructed by describing the relationship among the thermal, fluid, and electric fields in Verilog-A. Remarkably, the macromodel based on this strategy takes geometric factors and material parameters into account. The tests show that the error of the far group thermistors between system-level simulation and experiment is less than 5% over the conventional wind speed range in constant power (CP) mode, and the wind direction results also show a significant degree of overlap. In brief, the macromodel extraction strategy enables direct connections to be built between the finite element model of MEMS devices and integrated circuits.
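The identification step described above (a state space model obtained from a simulated step response) can be illustrated with a minimal sketch. It assumes a single dominant thermal time constant, so the model is first order; the abstract does not state the model order, and the function names and numerical values below are hypothetical, with synthetic data standing in for the finite element transient result.

```python
import numpy as np

def identify_first_order(t, y, u_step):
    """Fit y(t) = K*u_step*(1 - exp(-t/tau)) and return state space (A, B, C, D).

    Assumption: one dominant time constant and a response that has settled
    by the last sample. This is an illustrative choice, not the paper's method.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y_final = y[-1]                       # settled value of the step response
    gain = y_final / u_step               # steady-state gain K
    residual = 1.0 - y / y_final          # decays as exp(-t/tau)
    mask = (residual > 0.05) & (residual < 0.95)   # keep the well-conditioned part
    slope, _ = np.polyfit(t[mask], np.log(residual[mask]), 1)
    tau = -1.0 / slope
    A = np.array([[-1.0 / tau]])
    B = np.array([[gain / tau]])
    C = np.array([[1.0]])
    D = np.array([[0.0]])
    return A, B, C, D

def simulate_step(A, B, C, D, t, u_step):
    """Exact step response of the scalar model x' = A x + B u, y = C x + D u."""
    a, b, c, d = A[0, 0], B[0, 0], C[0, 0], D[0, 0]
    x = (b / a) * (np.exp(a * t) - 1.0) * u_step
    return c * x + d * u_step

# Synthetic "transient simulation" data (made-up values, not from the paper):
TAU_TRUE, GAIN_TRUE, POWER = 0.8, 25.0, 0.1   # s, K/W, W
t = np.linspace(0.0, 10 * TAU_TRUE, 2001)
y = GAIN_TRUE * POWER * (1.0 - np.exp(-t / TAU_TRUE))

A, B, C, D = identify_first_order(t, y, POWER)
y_model = simulate_step(A, B, C, D, t, POWER)
```

A higher-order thermal network would need a multi-exponential fit or a subspace identification method; the first-order case is used here only because it keeps the mapping from step response to (A, B, C, D) easy to follow.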
Hyperscale data center applications are driving the need for high-bandwidth, ultra-low-power serial-IO solutions with high throughput per chip edge over extremely short-reach (XSR) multi-chip-module (MCM) connections. This paper demonstrates an MCM chiplet with >1.6 Tb/s throughput over a range of CEI-112G-XSR standard-compliant channels using a fully integrated and adaptive PAM-4/NRZ transceiver supporting data rates from 9.8 to 113 Gb/s. Key features of the transceiver include an architecture designed to lower the TX signal launch amplitude and the RX data path gain to relieve power, area, and bandwidth constraints. It achieves the required equalization while minimizing residual ISI and quantization errors for improved performance. It supports RX-driven dynamic adaptation of TX and RX equalization settings for seamless bring-up over hundreds of lanes in an MCM application, as well as for tracking supply and temperature drifts. It also features programmable TX-FFE roaming taps to boost performance in case of severe reflections and includes precision clock phase generation and correction.
The low efficiency of iterative optimization of Microelectromechanical Systems (MEMS) devices is due to their complete separation from interface circuits. This problem is particularly evident for board-level circuits. To solve it, this paper proposes a system-level co-simulation strategy that combines a macromodel of the MEMS device, described in Verilog-A, with board-level circuits, exploring an efficient and accurate MEMS-assisted optimization design method. To better match the actual interface circuit, the macromodel of the sensor is divided into two sub-macromodels corresponding to the control circuit and the processing circuit. The sub-macromodels are coupled through temperature information. Subsequently, the influence of packaging is investigated using the co-simulation approach. The simulation results show that the power consumption is proportional to the heat conductivity of the packaging, and the sensitivity is inversely proportional to it. Therefore, based on the system-level co-simulation prediction, a hollow glass with low heat conductivity is used to encapsulate the sensor chip. The comparison shows good agreement among the results of co-simulation, experiment, and finite element analysis. Specifically, the heating power of the latest sensor ranges from 70 mW to 150 mW, and the sensitivity has been improved to 16.3 mV/($\text{m}\cdot \text{s}^{-1}$). This robust system-level co-simulation method has therefore shown its potential in MEMS design and optimization. [2023-0068]
A quarter-rate PAM-4 FFE employing an INCC 1UIPG is implemented in 65 nm CMOS. The proposed INCC 1UIPG reduces the average transition time by ~20%, cuts clocking power consumption by ~1.5X, and lowers jitter amplification by about 2 to 5 dB compared with previous works. Along with the bandwidth- and power-efficient partially segmented tailless 1-stage front-end architecture, the proposed FFE achieves a 128 Gbps PAM-4 data rate in a 0.014 mm² area.
This letter presents a 4-level Pulse Amplitude Modulation (PAM-4) Feed Forward Equaliser (FFE) with a novel Internal Node Charge Controlled 1-Unit Interval Pulse Generator (INCC 1UIPG). A partially segmented architecture and a tailless 1-stage front end are chosen to reduce the overall load capacitance for better bandwidth and power performance. The proposed INCC 1UIPG adopts a stacking-reduced structure and precisely controls the internal nodes, demonstrating advantages in speed, power, and jitter and showing better potential for working at an ultra-high baud rate. The wider bandwidth and faster transition edges allow the equaliser to work at 128 Gbps with an area of 0.014 mm² in 65 nm CMOS.
The ever-increasing bandwidth demand in high-performance computing and other applications is continuously pushing up the data rate of wireline communication systems, with some protocols already requiring data rates in excess of 50 Gbaud, which poses serious challenges to transceiver design. State-of-the-art transmitters (TXs) adopt a hybrid architecture, called the segmented FFE architecture [4-6] in this letter, which combines the advantages of its analogue [1, 2] and Digital-to-Analogue Converter (DAC)-based [3] counterparts: high resolution and low complexity together with flexible and efficient Finite Impulse Response (FIR) tuning. To further ease bandwidth pressure, high-speed TXs tend to reduce the number of full-rate nodes. By combining the 4:1 MUX into the pre-driver, the authors in ref. [2] reduce this number to 2, with the internal full-rate nodes peaked by inductors. However, this technique is not suitable for DAC-based and segmented TXs because of area considerations. Another technical route further merges the 4:1 pre-driver into the driver [1, 4, 6], thereby eliminating all the internal full-rate nodes; this is called a 1-stage front end in this letter. In a 1-stage front end, the total capacitance of the output stage becomes even more critical, since it determines the achievable bandwidth and the overall power dissipation. In the extreme, some designs employ a tailless CML driver to obtain the smallest size for a specified output swing [5, 6]. Given that this letter targets an aggressive 128 Gbps data rate in 65 nm CMOS technology, a segmented architecture and a 1-stage tailless front end are chosen, with the FIR taps designed to be partially adjustable to ease the bandwidth pressure even further.
Contrary to the trend of the front end, a high-performance full-rate 1UIPG, widely used in quarter-rate architectures, tends to adopt a multi-stage structure [5, 6] to improve speed: it is difficult to optimise both edges of the pulse in a single stage, which usually corresponds to 3-stacked devices [1, 3]. The authors in ref. [4] proposed a pre-charged structure that generates the 1UI pulse in a single-stage circuit. However, this technique is not suitable for a tailless CML driver, in which any pre-charge level is translated to the output immediately. The authors in ref. [6] adopt a 2-stage structure to avoid 3-stacked paths. Unfortunately, the 2-stacked devices on the critical path and the undriven internal nodes ultimately limit the achievable speed. To address these drawbacks, the proposed 2-stage INCC 1UIPG optimises the two edges of the pulse separately, reduces device stacking on critical paths, and reasonably controls the internal nodes, showing the best potential for working at ultra-high speed.
Figure 1 shows the overall architecture of the proposed equaliser (half circuits). The data path is divided into MSB and LSB blocks to generate the PAM-4 output, where the MSB block is composed of two identical LSB blocks for good linearity. Each block is further divided into three groups of slices, X1, X2, and X6, forming a 3-bit DAC. X1 and X2 can be configured as a main tap or a post tap as required, while X6 is fixed as a main tap. The Finite Impulse Response (FIR) timing is generated at 1/8 rate with the C8 clock. Subsequently, the X1 and X2 slices can select data with different timing under the control of FFE_DAC<1:0> to be configured as different taps, with X6 experiencing a matching delay. The selected 4-bit parallel data become time-interleaved 1UI pulses in the proposed INCC 1UIPG and are finally combined in the 4:1 tailless CML driver. When the X1 and X2 slices are assigned as a post tap, the output current of their drivers can be further adjusted continuously through the bias of their cascode transistors.
Figure 1: Feed Forward Equaliser (FFE) architecture (half circuits).
The proposed equaliser adopts a partially segmented quarter-rate architecture and a 1-stage tailless front end to reduce the overall load capacitance and achieve the aggressive target of 128 Gbps. A 3-bit DAC provides coarse tuning, with the fine tuning implemented in the analogue domain, forming a segmented architecture. Since the X1 and X2 slices can be allocated as a main tap when ‘strong’ equalisation is not required, the equaliser is more bandwidth- and power-efficient than its analogue counterpart, in which the main tap driver itself must be sized to deliver the specified output swing and any equalisation tap driver introduces additional loading. At the same time, the DAC can be simple, with low circuit complexity and small parasitic capacitance. (A ‘pure’ DAC-based TX needs many more bits, with complex calibration, for resolution and linearity considerations.)
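The coarse/fine tap allocation described above can be sketched behaviourally. This is only an illustration under stated assumptions: bit 0 of FFE_DAC<1:0> is taken to steer X1 and bit 1 to steer X2, the post tap is subtracted from the main tap (de-emphasis), the PAM-4 levels use a plain binary mapping, and `post_scale` is a hypothetical stand-in for the continuous cascode-bias adjustment. None of these details are specified in the letter.

```python
SLICE_WEIGHTS = {"X1": 1, "X2": 2, "X6": 6}   # relative slice strengths

def tap_weights(ffe_dac, post_scale=1.0):
    """Return (main, post) tap weights normalised to the total slice strength.

    ffe_dac    : 2-bit code; bit 0 -> X1, bit 1 -> X2 (assumed mapping),
                 a set bit assigns the slice to the post tap.
    post_scale : 0..1 factor modelling the analogue fine tuning (hypothetical).
    """
    main_units = SLICE_WEIGHTS["X6"]          # X6 is always a main tap
    post_units = 0
    for bit, name in enumerate(("X1", "X2")):
        if (ffe_dac >> bit) & 1:
            post_units += SLICE_WEIGHTS[name]
        else:
            main_units += SLICE_WEIGHTS[name]
    total = sum(SLICE_WEIGHTS.values())
    return main_units / total, post_scale * post_units / total

def pam4_level(msb, lsb):
    """PAM-4 level in {-3, -1, +1, +3}; the MSB block counts as two LSB blocks."""
    return 2 * (2 * msb - 1) + (2 * lsb - 1)

def ffe_output(symbols, ffe_dac, post_scale=1.0):
    """Two-tap FIR: y[n] = main * d[n] - post * d[n-1] (de-emphasis assumed)."""
    main, post = tap_weights(ffe_dac, post_scale)
    out, prev = [], 0.0
    for s in symbols:
        out.append(main * s - post * prev)
        prev = s
    return out
```

For example, `tap_weights(0b00)` puts all nine units in the main tap (no equalisation), while `tap_weights(0b11)` gives a 6/9 main tap and a 3/9 post tap, which `post_scale` can then reduce continuously.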
Moreover, the front end is designed to be partially adjustable: the largest X6 slices are fixed as the main tap, which allows their cascode transistors to be removed and further reduces the driver size for the same output swing. This greatly reduces the load capacitance, at the cost of tuning flexibility.
Figure 2 shows three 1UIPGs with different structures and their timing diagrams at 64 Gbaud, with the critical paths in stage 1 marked in red. The 1UIPG in Figure 2a adopts a single-stage structure, which uses the falling edge of CKQ and the rising edge of CKI to select the low level of the data, where M3 is used to control the internal charge of N2. This structure achieves a 112 Gbps PAM-4 data rate in [1] and 224 Gbps in [3], both in 10 nm CMOS. Although the charge of internal node N2 is reasonably controlled, there is a 3-stacked charging path (M1-M2-M4), which leads to a slow rising edge at the output and makes it difficult to reach full swing at a high baud rate.
Figure 2: Comparison of 3 types of 1UIPGs under 64 Gbaud.
The authors in ref. [6] adopt a two-stage architecture to avoid 3-stacked paths, as shown in Figure 2b. Using the rising edge of CKQ and the falling edge of CKI to select the high level of the data, this structure achieves a PAM-4 data rate of 200 Gbps in 28 nm CMOS. In the first stage, when the data is high and the rising edge of CKQ arrives, OUT1, which is originally high, is pulled down. In the second stage, M6 pre-charges N2 when CKIB is pulled down, and the falling edge of OUT1 controls M6 and M7 to charge OUT2, thus producing its rising edge. Subsequently, the rising edge of CKIB controls M8 to discharge OUT2, thus producing its falling edge. The pre-charged 2-stacked path allows OUT2 to have a faster rising edge. However, the speed of this structure is still limited for the following reasons.
Firstly, the falling edge of OUT1, which is used to produce the final 1UI pulse, is generated by a 2-stacked path in which N1 must be discharged first when M2 and M3 try to pull down OUT1. More importantly, when CKIB changes from low to high, OUT1 remains low for a period, so M8 needs to discharge not only OUT2 but also N2 at the same time, which leads to a slow falling edge of the final 1UI pulse.
The proposed INCC 1UIPG is shown in Figure 2c. Unlike the structure in Figure 2b, this 2-stage structure uses the rising edge of OUT1 and the falling edge of CKQB to generate the final 1UI pulse. In the first stage, the falling edge of CKI controls the single transistor M1 to produce the rising edge of OUT1 when the data is high. Since M2 has already been turned off, the N1 node no longer affects this charging process. Considering that the falling edge of OUT1 is non-critical and N1 can be pre-discharged by M3, the related transistors can use a smaller size, which further expands the bandwidth of OUT1. Meanwhile, CKQ generates CKQB through an inverter, matching the delay of the CKI path to ensure an accurate 1UI pulse width under PVT variations. In the second stage, M6 pre-charges N2 when OUT1 is low, and the rising edge of OUT2 is finally generated by the falling edge of CKQB. It is important to note that M8 and M9 discharge OUT2 and N2 simultaneously at the rising edge of OUT1, accelerating the falling edge of the final 1UI pulse. In this two-stage structure, the bandwidth of the intermediate node OUT1 is further optimised and all the internal nodes (N1 and N2) are reasonably controlled, resulting in a higher-performance 1UI pulse.
Figure 3 shows a use case of the three aforementioned 1UIPGs. Use cases #1, #2, and #3 are obtained by using structures (a), (b), and (c) in Figure 2 as the 1UIPG in Figure 3, respectively. Note that the three use cases have the same input clock and data buffers and employ the same size of 4:1 multiplexer for a fair comparison (marked in red in Figure 3).
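The role of the 1UIPG and the 4:1 multiplexer in a quarter-rate path can be illustrated with a purely logical sketch. It ignores all circuit-level behaviour discussed above (stacking, internal node charge, edge rates) and assumes ideal 50% duty quarter-rate clocks spaced 1 UI apart, so that gating one phase with the inverse of the next yields a 1UI window; the function names and the choice of clock polarities are hypothetical and do not correspond to a specific structure in Figure 2.

```python
def quarter_rate_clock(phase_ui, n_ui, spu=4):
    """Ideal 50% duty clock with a 4 UI period, delayed by phase_ui UIs.

    spu is the number of time samples per UI (illustrative resolution).
    """
    return [(((s // spu) - phase_ui) % 4) < 2 for s in range(n_ui * spu)]

def one_ui_pulses(lane_bits, lane, n_ui, spu=4):
    """Gate the quarter-rate data of one lane into 1UI-wide pulses."""
    cki = quarter_rate_clock(lane, n_ui, spu)        # in-phase clock of this lane
    ckq = quarter_rate_clock(lane + 1, n_ui, spu)    # clock delayed by 1 UI
    pulses = []
    for s in range(n_ui * spu):
        window = cki[s] and not ckq[s]               # high for exactly 1 UI in 4
        bit = lane_bits[(s // spu) // 4]             # each lane bit is held for 4 UI
        pulses.append(window and bool(bit))
    return pulses

def mux_4to1(lanes, n_ui, spu=4):
    """Combine the four time-interleaved pulse trains into the full-rate stream."""
    per_lane = [one_ui_pulses(bits, k, n_ui, spu) for k, bits in enumerate(lanes)]
    return [any(p[s] for p in per_lane) for s in range(n_ui * spu)]

# Usage: de-interleave a serial pattern into 4 lanes, then re-serialise it.
serial = [1, 0, 1, 1, 0, 0, 1, 0]
lanes = [serial[k::4] for k in range(4)]
full_rate = mux_4to1(lanes, n_ui=len(serial))
recovered = [int(full_rate[n * 4]) for n in range(len(serial))]
```

Because the four windows never overlap, the multiplexer output reproduces the serial pattern; in the real circuit, the quality of each window's edges is what the three 1UIPG structures trade off.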
From the analysis and simulation results, we can explain the following properties of the proposed INCC 1UIPG.
Figure 3: A use case of the three aforementioned 1UIPGs.
Figure 4 shows simulation results of the 1UI pulses over PVT variations for the three use cases at 64 Gbaud. As shown in Figure 4a, use case #1 has the largest rise time due to the 3-stacked charging path. Figure 4b illustrates the limited fall time of use case #2 due to the uncontrolled internal nodes. Figure 4c compares the average transition time of the 1UI pulses. Use case #3 shows the best performance with the help of 2-stacked dynamic logic and reasonable INCC. Compared with use cases #1 and #2, its average transition time is reduced by 22% and 17% at the TT corner, respectively.
Figure 4: Simulation results of 1UI pulses over PVT variations of the 3 use cases. (a) Rise time, (b) fall time, and (c) average time.
A faster slew rate of the 1UI pulse speeds up the charging and discharging of the 4:1 multiplexer output, extends the bandwidth, and therefore reduces its deterministic jitter (DJ). Moreover, the sharper slope at the transition points of the pulse generator and multiplexer outputs reduces the conversion of their intrinsic voltage noise into jitter. Figure 5 shows the 4:1 multiplexer output DJ of the three use cases at 64 Gbaud and 80 Gbaud, respectively. Use case #3 shows minimal output DJ, demonstrating its potential to work at higher baud rates.
Figure 5: Simulation results of 4:1 multiplexer output of the 3 use cases.
Reducing device stacking on a critical path also allows smaller device sizes, reduces the total loading of the clock and data paths, and therefore reduces the power consumption of their buffers. Minimising the clock loading is attractive because it reduces the design effort of the clocking network, which must take speed, jitter, and power consumption fully into consideration.
Specifically, the critical edges of the proposed INCC 1UIPG (the rising edge of OUT1 and the falling edge of CKQB, see Figure 2) are both generated by a stacking-free transistor (M1 and M5). M2 cuts off the pull-down path and shields the N1 node when M1 charges OUT1, so M1 can be small, just as in an inverter. The falling edge of OUT1 is non-critical, so M2 and M3 can be sized even smaller. By contrast, the critical edge in use case #2, the falling edge of OUT1, is generated by the stacked devices M2 and M3, and N1 cannot be discharged in advance, so the related transistors cannot be small (M3 is twice the size of M2 in use case #2, increasing clock loading by about 1.5X). Similarly, M2 is twice the size of M4 and M1 is three times the size of M4 in use case #1. Figure 6 shows the power breakdown of the three use cases. Since the fan-out factor of the buffers cannot be large for speed and jitter reasons (we use FO2 for the 16 GHz clock in 65 nm CMOS), the heavier loading of the data and clock paths leads to more buffer stages, greater total power dissipation, and more clock jitter. Considering the large number of slices in an actual TX (~6X the use case is needed for a 1.2 Vppd output swing), these power savings are very attractive.
Figure 6: Power breakdown of the 3 use cases.
The stacked devices also underperform in terms of jitter amplification due to the poor slope. We designed a simulation to verify this. As shown in Figure 7, a small jitter impulse (1 ps in this simulation) is injected into one of the quarter-rate clocks (C0 in this simulation). By recording the transient response of the pulse generator and multiplexer outputs when transmitting repeating clock patterns in the three use cases (we removed the clock and data buffers in this simulation; an ideal clock source with a fixed slope is used as a substitute to eliminate the impact of the multi-stage buffers), we can obtain their jitter impulse response (JIR).
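The post-processing of such a simulation, normalising a recorded JIR to the injected jitter and applying a discrete-time Fourier transform to obtain a jitter transfer function, can be sketched as follows. The sketch assumes the JIR is given as the timing deviation of successive output edges in seconds; the function name, the frequency grid, and the sample values are hypothetical and are not taken from the letter's simulations.

```python
import numpy as np

def jitter_transfer_function(jir, injected_jitter, n_freq=256):
    """Return (omega, JTF in dB) from a jitter impulse response.

    jir             : timing deviation of successive output edges, in seconds
    injected_jitter : size of the injected jitter impulse, in seconds
    omega           : normalised angular frequency, 0..pi rad per edge
    """
    h = np.asarray(jir, dtype=float) / injected_jitter    # normalised JIR
    omega = np.linspace(0.0, np.pi, n_freq)
    n = np.arange(len(h))
    # DTFT evaluated on the grid: H(w) = sum_n h[n] * exp(-j*w*n)
    H = np.exp(-1j * np.outer(omega, n)) @ h
    magnitude = np.maximum(np.abs(H), 1e-300)             # avoid log10(0)
    return omega, 20.0 * np.log10(magnitude)

# Made-up example: a 1 ps injection that appears as 1 ps on the first edge
# and 0.5 ps on the next one.
omega, jtf_db = jitter_transfer_function([1.0e-12, 0.5e-12], 1.0e-12)
```

Values above 0 dB indicate jitter amplification at that frequency; with the made-up response above, the gain is about 3.5 dB at DC and about -6 dB at the highest frequency on the grid.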
After normalising the JIRs to the input injection, we obtained the corresponding jitter transfer function (JTF) of the three use cases by a discrete-time Fourier transform.
Figure 7: Jitter amplification simulation method.
Figure 8 shows the simulated JIR and JTF at 64 Gbaud. Use case #3 exhibits a milder JIR and about 5 dB and 2 dB lower jitter amplification than use cases #1 and #2, respectively.
Figure 8: Simulated jitter impulse response (JIR) and jitter transfer function (JTF) of the 3 use cases.
The FFE prototype chip is fabricated in 65 nm CMOS technology with a core area of 0.014 mm², as shown in Figure 9a. Figure 9b shows the post-layout simulation results of the proposed INCC 1UIPG working at 64 Gbaud. The 1UI pulse eye, with a 10.83 ps rise time and an 11.33 ps fall time, is shown in Figure 9c. The pulse is full-swing and fast enough to drive the subsequent tailless CML transistors. The power breakdown of the FFE is shown in Figure 9d; it covers only the high-speed data path of the TX prototype, since the design of a high-performance clocking network is not discussed in this letter and its power consumption is therefore not included. The FFE slices (FFE selectors + D4 buffers + INCC 1UIPGs, as shown in Figure 2) consume about half of the power of the data path. The driver stage consumes about 45.6 mW to provide an output swing of ~1 Vppd.
Figure 9: Layout details and post-layout simulation results of the Feed Forward Equaliser (FFE) prototype chip.
The channel responses with 2.7 dB, 5.7 dB, and 10.3 dB insertion loss at the Nyquist frequency (32 GHz) are shown in Figure 9e. Figure 9f shows the 128 Gbps PRBS15 eye after a 2.7 dB channel loss. Figures 9g to 9j compare the 128 Gbps PRBS15 PAM-4 eyes with and without the TX FFE under 5.7 dB and 10.3 dB channel loss, respectively. By adjusting the coefficients of the segmented equaliser appropriately, the eye can be opened up to 0.49 UI with approximately 95 mVppd height per sub-eye for a 10.3 dB loss.
Table 1 summarises the performance of the proposed FFE and compares it with the high-speed data paths of reported quarter-rate PAM-4 TXs.
A quarter-rate PAM-4 FFE employing an INCC 1UIPG is implemented in 65 nm CMOS. The proposed INCC 1UIPG reduces the average transition time by ~20%, cuts clocking power consumption by ~1.5X, and lowers jitter amplification by about 2 to 5 dB compared with previous works. Along with the bandwidth- and power-efficient partially segmented tailless 1-stage front-end architecture, the proposed FFE achieves a 128 Gbps PAM-4 data rate in a 0.014 mm² area.
Jiawei Wang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing. Hao Xu: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review and editing. Ziqiang Wang: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review and editing. Haikun Jia: Methodology, Resources, Software, Validation, Writing – review and editing. Hanjun Jiang: Methodology, Resources, Software, Writing – review and editing. Chun Zhang: Methodology, Resources, Software, Writing – review and editing. Zhihua Wang: Funding acquisition, Methodology, Project administration, Resources, Supervision.
This work is supported by the Shenzhen Science and Technology Program (No. JCYJ20180306170609470) and the Key Research and Development Plan of Shandong Province (No. 2022CXGC010109).
The authors declare that they have no conflicts of interest.
The data that support the findings of this study are available from the corresponding author upon reasonable request.