estela OSS release

estela, an elastic web scraping cluster

By Breno Colom

4 min read | June 24, 2022

estela is an elastic web scraping cluster running on Kubernetes. It provides mechanisms to deploy, run and scale web scraping spiders via a REST API and a web interface.

estela’s design builds upon years of experience in the trenches of web scraping projects of varying complexity and scale. The team behind the initial design and development of estela includes web scraping veterans who have been involved in the industry from the start.

Among other things, we’ve seen major changes in the industry:

  • The web scraping industry has grown rapidly over the past ten years. Last year, The Economist listed web scraping first among the new data sources driving technological shifts in finance, and recent forecasts expect the global alternative data market to reach a value of USD 143 billion by 2030.

  • We’ve moved from the early years, when the legality of web scraping was a topic of controversy, to having Fortune 100 companies, government organizations, and even traditional banks engage in web data extraction. A few months ago, a U.S. appeals court reaffirmed that scraping public websites is legal. If the U.S. Supreme Court upholds this ruling, it will set an important precedent and put an end to a longstanding discussion.

Even so, this good news does not mean that web scraping has become easier. On the contrary:

  • Today, web scraping teams and content protection developers are engaged in an ongoing arms race. The stakes are higher than ever: last year, a coordinated rollout of updated websites and anti-bot countermeasures by many retailers at the start of Black Friday caused major data outages for most organizations during the most critical time of the year for scraping in the e-commerce vertical.

  • The pandemic-related tech boom (along with associated trends such as the e-commerce uptake in emerging regions such as Latin America) accelerated the need for web data insights, which has meant welcoming many newcomers to the industry, from consumers to providers. And yet the tech downturn that followed is forcing organizations to reevaluate their alternative data acquisition strategies, with a renewed focus on cutting costs, reducing redundancy, and easing management.

Whether you’ve outsourced web data extraction or managed it in house, the technological and contextual challenges of this moment call for a better way forward, one that does not require constantly reinventing the wheel or depending on proprietary third-party clouds.

Until now, there has been no comprehensive web scraping orchestration solution built on a modern, container-based stack. Today we’re proud to announce the release of estela as open-source software under the MIT license.

estela provides the following advantages over existing alternatives:

  • Easy management of scraping workloads. estela currently supports spiders written with Scrapy.

  • Built-in scalability and elasticity. estela builds on top of Kubernetes to provide containerized application management. This allows your scraping cluster to accommodate both the linear growth of your scraping infrastructure and unexpected peaks.

  • Asynchronous, fault-tolerant design. Whether you run a handful of parallel scraping jobs or thousands, estela will make sure no data is lost.

  • Security and privacy. While more mature proprietary platforms exist, we do not expect them to be released as open-source software anytime soon, if at all. estela, on the other hand, as an open-source project, can be deployed within your own cloud or datacenter today to start driving your scraping platform, scaling automatically as needed.

estela consists of three modules:

  • REST API: built with the Django REST framework toolkit, exposing several endpoints to manage projects, spiders, and jobs. It uses Celery for task processing and takes care of deploying your Scrapy projects, among other things.

  • Queueing: estela needs a high-throughput, low-latency platform that controls real-time data feeds in a producer-consumer architecture. In this module, you will find a Kafka consumer used to collect and transport the information from the spider jobs into a database.

  • Web: A web interface implemented with React and TypeScript that lets you manage projects and spiders.
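To make the queueing module’s role concrete, here is a stdlib-only sketch of the producer-consumer pattern it follows. In estela the broker is Kafka; `queue.Queue` merely stands in here so the example runs without external services.

```python
# Stdlib-only sketch of the producer-consumer flow in estela's queueing
# module: spider jobs produce items, a consumer persists them.
import queue
import threading


def run_spider_job(feed: queue.Queue, items: list) -> None:
    # Producer: a spider job pushes each scraped item onto the feed.
    for item in items:
        feed.put(item)
    feed.put(None)  # sentinel marking the end of the job


def collect(feed: queue.Queue, database: list) -> None:
    # Consumer: drain the feed and persist items (a list stands in for the DB).
    while True:
        item = feed.get()
        if item is None:
            break
        database.append(item)


feed = queue.Queue()
database = []
consumer = threading.Thread(target=collect, args=(feed, database))
consumer.start()
run_spider_job(feed, [{"title": "first item"}, {"title": "second item"}])
consumer.join()
```

Decoupling producers from the consumer this way is what lets jobs finish and hand off their data even when the database is momentarily slow or busy, which is the fault-tolerance property described above.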

Some of the next features planned for release include:

  • Accounting (per job, project and organization)
  • Data retention configuration
  • Support for collections (a key/value store that multiple jobs can write to)
  • Reporting and alerting
  • Web interface improvements
  • Support for columnar storage formats (Apache Arrow / Parquet)
  • Support for additional scraping frameworks and languages

We have poured a lot of effort and experience into this first public release, and we have high hopes for the project. We hope the rest of the web scraping community finds estela as useful as it has proven to be for us.

Finally, we welcome any feedback and contributions to the project, so be sure to check out the repo and start testing it out. Happy scraping!