Comments on "Schaffhausen: Datenschutz ist den Usern kein Geld wert" (Schaffhausen: data protection isn't worth any money to users) — blog by Beat Hochheuser, 3 comments

IQ Option (http://chiefbinaryoptions.com/de/iq-option-erfahrung), 2015-08-20 16:57:
Good article, with a very useful comment too :) Many thanks to the authors!

Beat Hochheuser, 2013-08-06 17:04:
I had been wondering about that too, especially since Acoon had apparently managed for over 14 years more or less without donations.

The hardware can't have been all that expensive (see the details below). The expensive part was probably the massive traffic, which had presumably grown considerably again because of the PRISM debate.

The site is offline now, but it used to host an Acoon blog that said the following:

In this article I want to give you an overview of what Acoon currently is, where the limits are, and how it is done.

Current State
As of now the Acoon search index contains about 350 million web pages. These contain a total of about 50-60 billion words, and there are 400-500 million different words on these pages.
A full search index takes up about 400 GB of space, but it takes about 15-20 TB of incoming data to build this index.
Acoon runs on only two servers. One is the web and mail server and also holds the search index, so all queries run on this server. The other server is responsible for crawling and building the search index.
The crawler is a leased server with a quad-core 3.2 GHz CPU, 32 GB RAM, four 2 TB hard drives and a 1 Gbit/s Internet connection. The crawler can handle incoming data at about 400 Mbit/s, which means a bit over 1,000 pages/s. It takes this machine about a week to build a new search index.
The web server is actually sitting right here in my home office; it has a quad-core 3.5 GHz CPU, 32 GB RAM and two 500 GB SSDs. It is connected to the Internet via a relatively cheap 128 Mbit/s downstream / 10 Mbit/s upstream line.

Current Limits
Currently the web server can handle up to about 5-10 queries/second. Having the index on SSDs instead of regular hard disks is an absolute necessity; both the higher transfer rate and the higher number of random reads per second that an SSD provides are needed.
The software has an internal limit of about 536 million web pages (2^29 - 128, to be exact). More than that would overflow some internal data structures.
While these data structures would be easy to fix, it wouldn't really help the situation. Queries are still handled single-threaded, and that makes them simply take too long in some cases. In practice I have found that 300-350 million pages is the limit before queries become unacceptably slow.
Another limit is the crawler/indexer, especially early in the crawl while it is still collecting new links. Having to check about 50,000 newly found links per second and add them to the URL database slows the crawl down during its early stages.
The parser takes up a lot of CPU time too; its limit on this server is about 1,500 pages/second.

How It Is Done
The entire software is written in Delphi, a Pascal dialect. It is a surprisingly small piece of software, with fewer than 15,000 lines of source code.
But the software is highly optimized and tailor-made for its task. There is no big database solution powering this; it is all done explicitly in the software. This is the ONLY way to achieve this much with this little hardware.
At the beginning of a crawl, the URL database is seeded with about one million URLs as starting points. These are crawled and parsed for text and links, the links are added to the database, and the crawl continues until enough pages have been crawled. The resulting data is not yet searchable; that requires an indexing step, which takes about 8 hours for 350 million pages. It then takes another 10-12 hours to transfer the search index to the web server.
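To make the pipeline in the quoted post easier to picture, here is a minimal sketch of a crawl -> parse -> index -> query cycle, written in Python for brevity. It is not Acoon's code (the real system is roughly 15,000 lines of closed-source Delphi with tailor-made data structures); every name and limit below is made up for illustration, and a real crawler would also need robots.txt handling, politeness delays and on-disk storage. The in-memory seen set merely stands in for the URL database against which, per the post, about 50,000 newly found links per second have to be checked.

import re
import urllib.request
from collections import defaultdict, deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="(http[^"]+)"')   # crude link extraction, good enough for a sketch
WORD_RE = re.compile(r"[a-z0-9]+")

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch pages, extract their words and links, and
    check every newly found link against the seen set before queueing it."""
    seen = set(seed_urls)            # stands in for Acoon's URL database
    queue = deque(seed_urls)
    pages = {}                       # url -> list of words on that page
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue
        pages[url] = WORD_RE.findall(html.lower())
        for link in LINK_RE.findall(html):
            link = urljoin(url, link)
            if link not in seen:     # the per-link check the post names as a bottleneck
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Separate indexing step: turn the crawled pages into an inverted index
    (word -> set of URLs), which is the structure that queries actually hit."""
    index = defaultdict(set)
    for url, words in pages.items():
        for word in words:
            index[word].add(url)
    return index

def search(index, query):
    """AND query: intersect the posting sets of all query words."""
    postings = [index.get(word, set()) for word in WORD_RE.findall(query.lower())]
    return set.intersection(*postings) if postings else set()

if __name__ == "__main__":
    pages = crawl(["http://example.com/"], max_pages=10)
    print(search(build_index(pages), "example domain"))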
Anonymous, 2013-08-06 16:40:
Where he's right, he's right. But €1,000 a day to run it? Well...
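Whether €1,000 a day is plausible cannot be settled from the quoted figures, but the technical numbers in the post above can at least be cross-checked against one another. A few back-of-envelope calculations, using nothing beyond the link speeds and sizes quoted in the comment (rough arithmetic only, nothing about actual running costs):

# Rough sanity checks of the numbers quoted from the Acoon blog post above.
pages_cap = 2**29 - 128
print(f"hard page limit: {pages_cap:,}")            # ~536 million, as stated

crawl_rate_bps = 400e6        # ~400 Mbit/s incoming on the crawler
pages_per_s = 1_000           # "a bit over 1,000 pages/s"
print(f"implied average page size: {crawl_rate_bps / 8 / pages_per_s / 1e3:.0f} kB")

week_s = 7 * 24 * 3600
print(f"one week at 400 Mbit/s: {crawl_rate_bps / 8 * week_s / 1e12:.0f} TB")
# ~30 TB, vs. the quoted 15-20 TB per index build, so the crawler's link is
# not saturated for the whole week.

index_bytes = 400e9           # ~400 GB search index
downstream_bps = 128e6        # web server's 128 Mbit/s downstream
print(f"index transfer at 128 Mbit/s: {index_bytes / (downstream_bps / 8) / 3600:.1f} h")
# ~7 h of pure wire time, in the same ballpark as the quoted 10-12 hours.

Taken together, the quoted figures are internally consistent: roughly 50 kB of incoming data per page, a crawl that does not saturate the 1 Gbit/s link over the week, and an index transfer whose minimum duration is close to the stated 10-12 hours.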