摘要:Stream-processingEngineApplicationsthatprocessreal-timedatastreamsarepushingthelimitsoftraditionaldataprocessingtechnologies.Theseapplicationsarecharacterizedbytheneedforsub-secondresponsetimes——whethertheyinvolveautomatingtrades,monitoringnetworksforintrusions,or
Stream-processing Engine
Applications that process real-time datastreams are pushing the limits of traditional data processing technologies. These applications are characterized by the need for sub-second response times——whether they involve automating trades, monitoring networks for intrusions, or tracking credit card transactions for fraud. Applications that depend on the traditional store-and-query model cannot handle the volume and velocity of streaming data, whose value might exist only in the moment.
A stream-processing engine (SPE) is data management software that enables the execution of queries and computations—— and ultimately, actions——on streaming data in real time. Previously, queries and computations could only be executed with stored data using standard database management systems. An SPE accepts SQL-like, stream-oriented, continuous queries and executes them over live event streams, outputting results in real time.
An SPE achieves real-time operation by integrating several mechanisms. First, it supports inbound processing, in which incoming event streams immediately start to flow through the continuous queries as they enter the system. The queries transform the events as they move, continuously producing results, all in main memory. Read or write operations to storage are optional and can be executed asynchronously in many cases.
Inbound processing overcomes a limitation of the traditional outbound processing model conventional database management systems employ, in which data must be inserted into the database and indexed before any processing can take place. By removing storage from the critical path of processing, an SPE achieves significant performance gains compared with traditional processing approaches.
Second, an SPE adopts a single-process model, in which all time-critical operations (including event processing, storage and execution of custom application logic) are run as part of one multi-threaded process. This integrated approach eliminates high-overhead process switches present in solutions that use multiple software systems to provide the same capabilities.
Third, an SPE provides a flexible, in-process storage model and standards-based access to external databases. In-memory hash tables are used for very fast insert and look-up operations. Embedded databases are used to ensure persistence of data and can be accessed and manipulated using SQL-style declarative queries. External, remote-process databases are accessible through standard Open Database Connectivity calls and are convenient to use when supporting legacy databases or facilitating database sharing with external applications.
An SPE has built-in filtering, aggregating and correlating, and merging operators that manipulate windows of events. Standard SQL is defined over finite-sized tables, and an execution engine thereby knows when it is finished with all its operations. In contrast, streams potentially never end, and an SPE must be instructed when to finish processing and output an answer.
The windowing construct serves this purpose by defining the scope of an operator. In a trading application, a one-hour window can be used to express a stream-oriented query that calculates an hourly volume-weighted average price. Windows are user-configurable and can be defined over time, number of events or breakpoints in other attributes of an event.
Stream-oriented operators provide resiliency to imperfections in datastreams, caused by out-of-order or delayed data arrivals, both of which occur frequently in real-world scenarios. Resiliency is achieved by making operators time-sensitive: Optionally, an operator can be told to wait a longer period of time for out-of-order messages, or timeout and stop waiting for late messages that might never arrive.
Finally, an SPE supports distributed operation for improved scalability and availability. Incremental scalability is achieved by letting processing be partitioned and distributed across multiple machines transparently, without necessitating any changes in the application. High availability is crucial to preserve the integrity of applications and to avoid disruptions in real-time processing.
流處理引擎
處理實時數據流的應用程序正在將傳統的數據處理技術推到極限。這些應用程序以亞秒的響應時間為特征的——不管它們是涉及到貿易自動化,為防入侵而監視網絡,還是為防詐騙而跟蹤信用卡交易。那些依靠傳統的存儲-查詢模型的應用程序已不能滿足流數據的量與速度方向的要求,而流數據的價值可能只存在于瞬間之間。
流處理引擎(SPE)是一種數據管理軟件,能實時地對流數據實現查詢與計算、以及最終(應采取的)動作。過去,只能對利用標準數據庫管理系統存儲的數據執行查詢和計算,而SPE接收類似SQL、面向流的連續查詢,并執行正在發生的事件流,實時地輸出結果。
SPE是通過將幾種機理整合在一起實現實時操作的。首先,支持入處理,即輸入的事件流一進入系統就馬上開始流經連續的查詢。在它們流動時,查詢變換事件,連續地給出結果,所有這一切都是在內存中進行的。對磁盤存儲的讀或寫操作是可選的,在很多情況下是被異步處理的。
入處理克服了常規數據庫管理系統使用的傳統出處理的局限,在出處理中,數據必須插入數據庫,并在開始任何處理之前建立索引。通過將磁盤存儲排除在處理的關鍵路徑之外,與傳統的處理方法相比,SPE獲得了明顯的性能提高。
第二,SPE采用了單處理模型,其中所有與時間密切相關的操作(包括事件處理、定制的應用邏輯的存儲和執行)是作為一個多線索進程的一部分運行的。這種整合的方法消除了進程轉換的高開銷,在使用多個軟件系統來提供同樣功能的解決方案中就存在著這種進程轉換。
第三,SPE提供了一個靈活的進程間存儲模型和基于標準的對外部數據庫的訪問。內存中散列表用于極快的插入和查找操作。嵌入的數據庫用于確保數據的一致性,以及能利用SQL風格的描述性查詢進行的訪問和操縱。外部的、遠程進程數據庫通過標準的“開放數據庫互連”調用進行訪問,當要支持過時的數據庫時,這種數據庫用起來很方便,能方便地實現數據庫與外部應用程序的共享。
SPE擁有內在的過濾、聚合和相關、以及合并操作符,它們操縱事件的窗口。標準SQL定義在有限大小的表格之上,從而執行引擎知道何時完成了所有的操作。相反,流存在著永不結束的潛在可能,在結束處理和輸出答案時SPE必須要有指令。
通過定義操作符的范圍,窗口構建為此目的服務。在傳統的應用程序中,一小時的窗口可以用來表達計算以小時為量加權的面向流的查詢。窗口是用戶可以配置的,可以定義在時間、事件數量或者一個事件中其他屬性的斷開點上。
面向流的操作符對數據流中因次序破壞或數據達到的延誤造成的破壞提供了彈性,而這兩種情況在現實世界中是經常發生的。彈性是通過使操作符對時間敏感而獲得的。操作符可以有選擇地被告知,對失序的信息等待更長一些時間,或者規定的時間用完不再等待可能永遠不會到來的過時信息。
最后,SPE支持改進可擴性和可用性的分布式操作。增強可擴性是通過讓處理分割并透明地分布到多個機器上實現的,不必修改應用程序。高可用性對保留應用程序的完整性是至關重要的,可避免實時處理的中斷。
軟考備考資料免費領取
去領取