We are looking for talented engineers to help build and maintain the distributed data collection system at the heart of our business.
We are a data-driven company that collects and processes more than 500 GB of raw data daily. We leverage big data technologies, such as serverless computing and Spark on AWS EMR, to crunch these volumes of data and make them queryable.
In this role, you will ensure that our data collection engine, which consists of distributed web crawlers, remains state of the art and ahead of our competition. You will ensure that we can scrape any webshop, no matter what ban detection is in place, and that proper monitoring tools are in place. We currently scrape 60 sites; your goal is to at least triple that number without sacrificing completeness or quality.
Your responsibilities will include: