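Concept
Anansi is a research project that uses network-connected computers to explore the resources of the World Wide Web. In principle, we wanted to evaluate a distributed web crawler on the basis of accuracy and performance, and after consideration BOINC was our final choice as the platform. In such a system, qualities including accuracy, stability, adaptability, and performance will be measured.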
Operation
Form
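In Anansi, clients return only the URIs that have been crawled, each paired with the HTTP status code that indicates its availability. Only publicly reachable URIs whose scheme is http are crawled. No e-mail addresses, text content, user names, or passwords are collected. It is a non-CPU-intensive project that tries to keep the CPU load on the client low. Associated information, such as robots exclusion rules and some page content, is collected and used by the BOINC volunteers' clients during crawling, but none of it is returned to the Anansi server.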
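The reporting rule above can be illustrated with a minimal Python sketch. It is not Anansi's actual client code: the function names `is_crawlable`, `make_report`, and `build_reports`, and the dictionary layout of a report, are assumptions made for illustration. The sketch checks only the URI scheme; whether a URI is publicly reachable can only be determined by actually requesting it, which is left out here.

```python
from urllib.parse import urlsplit


def is_crawlable(uri: str) -> bool:
    """Accept only URIs whose scheme is http and that name a host.

    Hypothetical helper: https, ftp, mailto and other schemes are rejected.
    """
    parts = urlsplit(uri)
    return parts.scheme == "http" and bool(parts.netloc)


def make_report(uri: str, status: int) -> dict:
    """Build the record sent back to the server: URI and status code only.

    The field names are assumptions; no page content is included.
    """
    return {"uri": uri, "status": status}


def build_reports(crawled):
    """Turn (uri, status) pairs into reports, dropping non-http URIs."""
    return [make_report(uri, status)
            for uri, status in crawled
            if is_crawlable(uri)]
```

For example, passing a list that mixes `http://`, `https://`, and `mailto:` entries to `build_reports` would keep only the `http://` entries, each reduced to its URI and status code.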
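The data (URIs) collected by Anansi will be used by a MapReduce engine that calculates a priority for each URI. The priority is based on the in-degree, the out-degree, and a timestamp created by the Anansi server. The Anansi server takes this priority into account when planning revisits, which keeps the system working continuously.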
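As a rough illustration of the priority calculation, the Python sketch below counts in-degree and out-degree in a map step and a reduce step and then combines them with a timestamp. Several things here are assumptions rather than documented behaviour: that the server has a list of (source, target) link pairs available, that the timestamp records the last visit in seconds, and above all the scoring formula in `priority`, whose weights are invented for the example. The source states only which three inputs are used, not how they are combined.

```python
from collections import defaultdict


def map_edges(edges):
    """Map step: for each link, emit (uri, (in_count, out_count))."""
    for source, target in edges:
        yield source, (0, 1)  # the source gains one outgoing link
        yield target, (1, 0)  # the target gains one incoming link


def reduce_degrees(pairs):
    """Reduce step: sum the emitted counts per URI."""
    totals = defaultdict(lambda: [0, 0])
    for uri, (in_count, out_count) in pairs:
        totals[uri][0] += in_count
        totals[uri][1] += out_count
    return {uri: tuple(counts) for uri, counts in totals.items()}


def priority(in_degree, out_degree, last_seen, now):
    """Hypothetical score: the weights are illustrative assumptions."""
    age_days = max(now - last_seen, 0) / 86400
    return in_degree + 0.5 * out_degree + age_days


def revisit_order(edges, last_seen, now):
    """Return URIs sorted by descending priority (ties broken by URI)."""
    degrees = reduce_degrees(map_edges(edges))
    scored = {uri: priority(i, o, last_seen.get(uri, now), now)
              for uri, (i, o) in degrees.items()}
    return sorted(scored, key=lambda uri: (-scored[uri], uri))
```

Under this assumed formula, a URI that many pages link to, or one that has not been visited for a long time, moves toward the front of the revisit list.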
Fable
Anansi the spider is a trickster from West Africa. He always wants more than he has. His greediness gets him into trouble.
Once Anansi tried to trick a fisherman. The fisherman was smart. He tricked Anansi. The greedy spider ended up with no fish at all.