Estela’s Year-One Transformation

By Mateo Gonzales

6 min read | August 23, 2023


We released Estela as open-source one year ago. Estela has seen many changes during this time, including significant enhancements, new features, and numerous bug fixes to improve the user experience and empower web scraping developers. Let’s dive into the details:

Estela (Core)

New Features

Estela Requests Support: Beta support for the Requests library (ec08db0) has recently been added to Estela and will continue to improve. Estela was designed to be flexible and extensible, and this is the first step toward a broader repertoire of supported scraping languages and frameworks. Check the blog post to learn more about it and how it was done.
Live stats visualization: Redis is now used to store the stats of jobs in RUNNING status (5957952), allowing users to visualize the stats and resource consumption of their scraping jobs in real time.
Estela Activity Menu: A new Activity Menu has been designed and implemented in Estela (5d4c8dc), allowing users to see the history of actions performed in each project.
Estela Notifications: Along with the record of activities on each project, users are now also notified when an action occurs in a project they are part of (2da9074).
Project Stats: The project dashboard has seen a major overhaul, with new charts and several statistics views that let users easily review the stats obtained from their scraping jobs (536eaca, 2fdcd3e, fcd6e1e).
Estela can be deployed on Google Cloud Platform (GCP): We have added support for running Estela on Google Kubernetes Engine, so Estela can now be deployed on GCP (ed30f44). This was made possible thanks to our partners at Emptor.

Improvements

User Interface Updates: The UI has been updated to support Requests in addition to Scrapy (ec08db0) and to provide a smoother user experience (cc68fd7).
Data download via Estela’s frontend: Estela was first designed to empower web scraping professionals, so some features, such as data downloads, were exclusive to estela-cli, which downloads data more efficiently. Data downloads are now also supported via Estela’s frontend (25f5fc5), although large downloads are still better handled with estela-cli.
External Components: New externalRegistration (8ba4280) and externalScripts (503fb36) components have been added to make Estela’s frontend more extensible.
Security: Pod privileges have been limited to strengthen security (04bf729).
Inherit Environment Variables: It is now possible to set environment variables for projects and spiders, and their child jobs and cronjobs will inherit these variables (21d199a).
Code Refactoring (43d1962, b1c9f57).
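
To illustrate the inheritance of environment variables described above, here is a minimal sketch. The function name, the variable names, and the precedence order (more specific levels override more general ones) are assumptions made for illustration, not Estela's actual implementation.

```python
def resolve_env_vars(project_vars, spider_vars, job_vars=None):
    """Merge environment variables from the most general to the most specific level."""
    merged = {}
    # Later updates win, so spider values override project values and
    # job values override both. This precedence is an assumption.
    for level in (project_vars, spider_vars, job_vars or {}):
        merged.update(level)
    return merged


# Hypothetical variables set at each level.
project_vars = {"PROXY_URL": "http://proxy.example:8080", "LOG_LEVEL": "INFO"}
spider_vars = {"LOG_LEVEL": "DEBUG"}
job_vars = {"MAX_PAGES": "50"}

job_env = resolve_env_vars(project_vars, spider_vars, job_vars)
```

With these inputs, the job sees the project's PROXY_URL, the spider's LOG_LEVEL, and its own MAX_PAGES.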

Bug Fixes

General Fixes: Corrected pagination for requests, logs, and spider selection (a19d0e7, 8db3020), fixed the control flow when managing project members (c6bcf35), fixed issues with the CrontabSchedule create and update operations (e41f62f), and others.
Item parsing: Some items with nested dictionaries and arrays were not correctly handled on the Estela frontend. The parsing of items has been improved to be more robust and correctly parse any item (7ee30f3).
Miscellaneous Fixes: Various other issues have been addressed.
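
The item parsing fix concerns items that contain nested dictionaries and arrays. The post does not show the frontend code, so the following is only a sketch, written in Python rather than the frontend's own language, of one common way to walk such an item recursively. The function name and the dotted-path convention are assumptions.

```python
def flatten_item(value, prefix=""):
    """Flatten nested dicts and lists into a single dict keyed by path."""
    rows = {}
    if isinstance(value, dict) and value:
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            rows.update(flatten_item(child, path))
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            rows.update(flatten_item(child, f"{prefix}[{index}]"))
    else:
        # Leaf value, or an empty container kept as-is so no field is lost.
        rows[prefix] = value
    return rows


item = {
    "title": "Book",
    "price": {"amount": 10, "currency": "USD"},
    "tags": ["a", {"k": "v"}],
}
flat = flatten_item(item)
```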

Documentation and Miscellaneous

Documentation Updates: Enhancements to the documentation, including format fixes and dependency updates (fd6a41a, 229d319, c2d58be).

Estela Queue Adapter

At the start of the year, the queueing logic in Estela was tightly coupled to Kafka as its queuing system. This logic has been separated from the core Estela repository (01cc30e) and moved into a new repository, estela-queue-adapter, bringing Estela a step closer to supporting other queuing systems such as RabbitMQ and Amazon Kinesis.
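
The general idea behind this kind of decoupling can be sketched as an adapter interface with interchangeable backends. The class and function names below are hypothetical and do not reflect the actual estela-queue-adapter API. An in-memory backend stands in for a real broker so the example runs on its own.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class QueueProducer(ABC):
    """Hypothetical interface that the core code would depend on."""

    @abstractmethod
    def send(self, topic, message):
        """Publish one message to the given topic."""


class InMemoryProducer(QueueProducer):
    """Stand-in backend so the sketch runs without a broker."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, message):
        self.topics[topic].append(message)


# Registry of available backends. A Kafka or RabbitMQ class implementing
# QueueProducer would be registered here under its own name.
PRODUCERS = {"memory": InMemoryProducer}


def get_producer(platform):
    """Return a producer for the configured platform name."""
    try:
        return PRODUCERS[platform]()
    except KeyError:
        raise ValueError(f"Unsupported queue platform: {platform}") from None


producer = get_producer("memory")
producer.send("job_items", {"job_id": "1.2.3", "payload": {"title": "Example"}})
```

Because callers only use `send`, switching the queuing system becomes a configuration change rather than a change to the calling code.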

Estela Scrapy Entrypoint

Enhancements

Cache Stats in Redis: Spider stats are now cached in Redis while jobs run (25421fb) to improve responsiveness.
Job Status Update: The INCOMPLETE status has been removed, and all finished jobs are now set to COMPLETED, providing consistency in job status tracking (f166d1f).
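
As a rough illustration of caching stats while a job runs, the sketch below writes a job's stats into a hash and reads them back for display. The key format and helper names are assumptions, and `FakeRedis` is an in-memory stand-in that mimics a small subset of redis-py's hash commands (`hset` with `mapping`, `hgetall`, `delete`) so the example runs without a Redis server.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client (hash commands only)."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, mapping):
        # Redis stores hash values as strings, so values are stringified here too.
        stored = self._hashes.setdefault(name, {})
        stored.update({field: str(value) for field, value in mapping.items()})

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def delete(self, name):
        self._hashes.pop(name, None)


def stats_key(job_id):
    # Hypothetical key format, chosen for this example only.
    return f"scrapy_stats_{job_id}"


def publish_stats(store, job_id, stats):
    """Called periodically while the job is RUNNING."""
    store.hset(stats_key(job_id), mapping=stats)


def read_live_stats(store, job_id):
    """Called by the API or UI to show stats in near real time."""
    return store.hgetall(stats_key(job_id))


store = FakeRedis()
publish_stats(store, "1.2.3", {"item_scraped_count": 10, "response_received_count": 12})
publish_stats(store, "1.2.3", {"item_scraped_count": 25})
live = read_live_stats(store, "1.2.3")

# Once the job finishes, the cached entry can be dropped.
store.delete(stats_key("1.2.3"))
```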

Bug Fixes

Default JSON Serializer: A crucial fix has been applied to the default JSON serializer, ensuring accurate and reliable data serialization (7adea54).
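
The post does not describe the fix itself. For readers unfamiliar with the concept, a default JSON serializer is the fallback that `json.dumps` calls for values it cannot encode natively. The sketch below shows one possible fallback; the type handling is an assumption, not the entrypoint's actual code.

```python
import json
from datetime import date, datetime, timedelta
from decimal import Decimal


def default_serializer(obj):
    """Fallback for values that json.dumps cannot encode on its own."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, timedelta):
        return obj.total_seconds()
    if isinstance(obj, Decimal):
        return str(obj)
    if isinstance(obj, (set, frozenset)):
        return sorted(obj, key=str)
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    # Last resort: keep the item serializable instead of failing the job.
    return str(obj)


item = {
    "scraped_at": datetime(2023, 8, 23, 12, 0),
    "price": Decimal("9.99"),
    "tags": {"b", "a"},
}
encoded = json.dumps(item, default=default_serializer)
```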

Decoupling and Configuration

Decouple Kafka from Estela: In alignment with our commitment to flexibility and modularity, Kafka has been decoupled from Estela, and a queuing adapter has been set up. This change allows easier integration with different messaging systems and enhances maintainability (4142cd8).

Estela CLI

New Features

Create Requests Projects: Developers can now create Scrapy or Requests projects directly via estela-cli (b6090f0).

Improvements

Chunk Data Downloading: Support has been added for chunk data downloading, resolving a limitation in handling large data downloads (30af197).
CSV Data Retrieval: The CSV data retrieval process has been updated to only write fields present in the first item, improving consistency (29e226a).
Output Data in TSV Format: An option has been added to output data in TSV (Tab-Separated Values) format, providing more flexibility in handling data (881750b).
Specifying Data Type: Users can now specify the data type for data retrieval: items, requests, logs, and stats (b5a8041).
GitHub Action for PyPI Deployment: A new GitHub Action has been added to streamline deployment to PyPI, enhancing the distribution workflow (d74d861).
Improved Project Uploads: We have improved the stability of project uploads for slower internet connections (2d01596).
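
The chunked download, CSV, and TSV items above can be illustrated together. This is a minimal sketch under stated assumptions: `iter_chunks` stands in for paginated API calls, the function names are hypothetical, and the header is taken from the first item as the CSV note describes. It is not estela-cli's actual code.

```python
import csv
import io


def iter_chunks(items, chunk_size):
    """Stand-in for fetching data from the API one chunk at a time."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def write_delimited(chunks, out, delimiter=","):
    """Stream chunks to `out`, using the first item's fields as columns."""
    writer = None
    for chunk in chunks:
        for item in chunk:
            if writer is None:
                writer = csv.DictWriter(
                    out,
                    fieldnames=list(item.keys()),
                    delimiter=delimiter,
                    extrasaction="ignore",  # drop fields missing from the first item
                    restval="",  # fill fields missing from later items
                    lineterminator="\n",
                )
                writer.writeheader()
            writer.writerow(item)


items = [
    {"title": "A", "price": 1},
    {"title": "B", "price": 2, "extra": "x"},
    {"title": "C"},
]
buffer = io.StringIO()
write_delimited(iter_chunks(items, chunk_size=2), buffer, delimiter="\t")
```

Writing row by row as each chunk arrives keeps memory use bounded by the chunk size rather than by the size of the whole dataset.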

Bug Fixes

Dockerfile Generation with Python 3.x0: Fixed an issue where estela-cli was not generating Dockerfiles correctly with Python versions 3.x0 (09db075).
Fixed Recursive Zip File Generations: Recursive zip files were sometimes generated, causing an infinite loop while deploying projects (09db075).
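
A recursive archive typically appears when the zip file is created inside the directory being zipped and is then picked up as one of its own inputs. The sketch below shows one way to guard against that. The function name and archive name are assumptions, and this is not necessarily how estela-cli resolved the issue.

```python
import os
import tempfile
import zipfile


def zip_project(project_dir, archive_name="project.zip"):
    """Zip a project directory, skipping the archive being written."""
    archive_path = os.path.abspath(os.path.join(project_dir, archive_name))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(project_dir):
            for file_name in files:
                file_path = os.path.abspath(os.path.join(root, file_name))
                # Without this check the archive would try to include itself.
                if file_path == archive_path:
                    continue
                archive.write(file_path, os.path.relpath(file_path, project_dir))
    return archive_path


with tempfile.TemporaryDirectory() as project_dir:
    with open(os.path.join(project_dir, "spider.py"), "w") as handle:
        handle.write("print('hello')\n")
    archive_path = zip_project(project_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()
```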

Conclusion

Over the past year, we have introduced powerful new features to Estela, streamlined existing functionality, and resolved numerous bugs. We invite you to explore these updates and share your feedback.

We want to extend a special thank you to everyone who contributed to this release. Your hard work and dedication make Estela a thriving and innovative platform.

If you are interested in these changes and how you can use them in your own projects, contact Bitmaker today and incorporate the new features into your data extraction.

To get started with Estela, head over to https://estela.bitmaker.la/docs/estela/installation/installation.html. Stay tuned for more exciting updates!

At Bitmaker, we want to share our ideas and how we are contributing to the world of web scraping. We invite you to read our first technical article about the Estela project: Introducing requests support in Estela.
