Online webcomic scraper database

Asked by Jeroen Rommers

Since the site layouts or URLs of webcomics change regularly, comic scrapers need constant updating. The actual engine (Dosage itself) doesn't change as much and should, in my opinion, be separated from the data/content.

What I'm suggesting is creating an online 'service' that provides, on request, all the information necessary for scraping a certain webcomic. Depending on the implementation, it could also be usable by various other webcomic downloaders or even user-created scripts.

For example, a repository system like apt:
- Users can subscribe to one or more URLs; Dosage retrieves the comic list.
- Users can also define local comic scrapers (these always overrule the online ones; see the sketch below).
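
A rough sketch of how that precedence could work (the loader helpers are made-up names, nothing that exists in Dosage today):

    def merge_scraper_definitions(remote, local):
        """Combine remote and local scraper definitions.

        Both arguments map comic names to scraper definitions;
        entries in `local` always win over `remote`.
        """
        merged = dict(remote)
        merged.update(local)
        return merged

    # Hypothetical usage; the loader functions are placeholders, not
    # part of the current Dosage code base.
    # scrapers = merge_scraper_definitions(load_remote_scrapers(),
    #                                      load_local_scrapers())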

Such a system would not only 'ensure' up-to-date scrapers for every user, but would also enable administrators to easily remove defunct or prohibited content (e.g. after a removal request from web hosts or authors).
By creating such a centralized system it would also be possible to add some sort of availability check for known comics (e.g. via RSS or a simple GET request) and to keep track of recently updated sites (to prevent unnecessary network activity and thus stress on hosting sites).
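
For instance, the availability check could be as simple as a conditional GET that only refetches a page when it has changed; this is just an illustration, not anything Dosage does today:

    import urllib.request
    import urllib.error

    def has_updated(url, last_etag=None, last_modified=None):
        """Check a comic page with a conditional GET.

        Returns (changed, etag, modified).  A 304 response means the
        page is unchanged since the previous check, so a full crawl can
        be skipped and the hosting site is spared the extra traffic.
        """
        request = urllib.request.Request(url)
        if last_etag:
            request.add_header('If-None-Match', last_etag)
        if last_modified:
            request.add_header('If-Modified-Since', last_modified)
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return False, last_etag, last_modified
            raise
        return (True,
                response.headers.get('ETag'),
                response.headers.get('Last-Modified'))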

Of course, a community interface would be a very nice addition, because it would enable users to submit their own scrapers (or update existing ones).

Question information

Language: English
Status: Answered
For: Dosage
Assignee: No assignee
Gault Drakkor (gault-drakkor) said (#1):

The scrapers are already separated to a certain degree. It's relatively easy to create or modify a scraper for a comic; typically it takes me about 5 minutes per comic for non-custom ones.

Personally, I feel that the source repository system is good enough. It does almost everything you want. The only question is how easy it is for users to submit changes.

- It's online: administrators can remove scrapers if desired.
- Users can define local scrapers by overwriting the shipped ones; there is no easy way yet to say that a folder of local scrapers overrules the remote ones. That would be a possible, easily defined feature request.
- The repository is the central source.

So unless Dosage is used _a lot_ more, it's not worth it.

Perhaps argue for a separate, more transparent path for submitting scrapers to the repository?

Tristan Seligmann (mithrandi) said (#2):

Part of the goal of the recent (well, not that recent anymore, I guess...) major refactoring of how scrapers are implemented is to ease the way for having scrapers defined as something other than straight Python source code. For example, most scrapers just assign regexes to the various class attributes, without overriding any of the methods, and so these could trivially be stored in some kind of external file format that could be updated from a central repository automatically. Additionally, the use of the Twisted plugin system allows the code that defines scrapers to be shipped in a separate package in your Linux distribution, thus paving the way for updating the library of scrapers independently from the core package.
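
As a rough illustration only (the attribute names below are made up, not Dosage's actual scraper API), such a regex-only scraper could live in an external file and be rebuilt into a class at load time:

    import re

    # Hypothetical contents of an external (e.g. JSON) scraper
    # definition; the field names are illustrative only.
    definition = {
        "name": "ExampleComic",
        "startUrl": "http://example.com/comic/",
        "imageSearch": r'<img src="(/strips/[^"]+)"',
        "prevSearch": r'<a href="([^"]+)">Previous</a>',
    }

    def make_scraper_class(definition):
        """Turn a purely declarative definition into a scraper class."""
        attrs = {
            "name": definition["name"],
            "startUrl": definition["startUrl"],
            "imageSearch": re.compile(definition["imageSearch"]),
            "prevSearch": re.compile(definition["prevSearch"]),
        }
        return type(definition["name"], (object,), attrs)

    ExampleComic = make_scraper_class(definition)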

Having a central community-run site for managing the simple scrapers is definitely something I'd like to do at some point; integrating new / fixed scrapers from the community isn't that much work, but it's a pretty tedious and repetitive process that really ought to be mostly automated, and not have to wait for me (or another core developer) to get around to going through contributions. The steps of verifying that the scraper can actually fetch the comic strip images, and walk backwards to find previous strips, can be mostly automated. Some human input would be required (for example, to verify that the images it is fetching are actually the correct images), but I think this can be handled in a community-driven way as well. For example, users would be able to flag scraper definitions as working or not working, and these flags could be weighted by a "reputation" score (something along the lines of Stack Overflow).
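
A back-of-the-envelope sketch of how such reputation-weighted flags could be scored (the weighting scheme and threshold are purely illustrative):

    def weighted_status(flags, reputations, threshold=0.0):
        """Decide whether a scraper currently counts as working.

        `flags` maps user name -> True (working) / False (not working);
        `reputations` maps user name -> a non-negative reputation score.
        Each flag is weighted by the voter's reputation, a bit like
        Stack Overflow karma.
        """
        score = 0.0
        for user, works in flags.items():
            weight = reputations.get(user, 1.0)
            score += weight if works else -weight
        return score > threshold

    # One high-reputation "not working" flag outweighs two
    # low-reputation "working" flags.
    flags = {"alice": True, "bob": True, "carol": False}
    reputations = {"alice": 1.0, "bob": 1.0, "carol": 5.0}
    print(weighted_status(flags, reputations))  # False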

That, combined with functionality in the client to download updated scraper definitions from the community database, would probably remove about 90% of the core developer involvement that is currently required, which would be a good thing, since none of us have been putting very much time towards Dosage lately. Of course, getting there from here will still require quite a bit more work, but it is definitely on my mental roadmap; I suppose I should sit down and write up some more of my plans, so that everybody knows what I have in mind. (For example, I'd like to work with comic authors a lot more closely to avoid trampling on their toes / ad revenue / etc. where possible.)
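
The client side of that could stay fairly small; a sketch, assuming a hypothetical index URL for the community database (no such service exists yet):

    import json
    import os
    import urllib.request

    # Hypothetical endpoint; no such community service exists yet.
    INDEX_URL = "https://example.org/dosage-scrapers/index.json"

    def update_scraper_definitions(cache_dir):
        """Fetch the latest scraper definitions and store them locally.

        Locally defined scrapers would still be loaded afterwards and
        take precedence over anything downloaded here.
        """
        with urllib.request.urlopen(INDEX_URL) as response:
            definitions = json.load(response)
        os.makedirs(cache_dir, exist_ok=True)
        for definition in definitions:
            path = os.path.join(cache_dir, definition["name"] + ".json")
            with open(path, "w") as outfile:
                json.dump(definition, outfile, indent=2)
        return len(definitions)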
