Deprecate/integrate FW #1277
@owcz is very insistent that FW translators are easier to maintain/write. That certainly used to be the case -- my first 10 translators were all FW -- but I'm genuinely curious why they still think that's the case. In my view, we could just put up a regular translator skeleton for scrapes (along the lines of what Philipp already has for other types) that emulates FW with regular JS code. Using a good template also has the advantage of increasing readability/maintainability. I find the translators that follow Philipp's code blocks particularly easy to work with. Philipp will chime in himself, I'm sure, but I know he's never been a fan of FW.
I agree with the principle of FW: removing boilerplate and making the translator process more approachable to those who may be put off by curly braces. I wouldn't say I'm insistent about it—I certainly see its limitations—but in my estimation, it would be easier to train (and assist/review) contributors to focus on hunting XPaths for FW translators than it would be to familiarize each with JS and the Zotero libraries (especially without more attention to their documentation). In particular, Wikipedia will soon rely on Zotero's translators for handling automatic reference parsing in its new wikicode-less user interface. Accordingly, while the major academic sources are handled quite well, there will be much stress over all kinds of popular sites with non-compliant metadata complicating an otherwise large leap in productivity. If FW didn't exist, I certainly would have wanted someone to create it.
Sebastian described my views on FW quite accurately. To be fair, I also have to say that I didn't spend much time really learning FW (e.g. I only saw the after-hooks a few days ago). So, personally, I will not create any new FW translators, and moreover I mostly also skip reviewing or updating them. Also, to keep the framework up to date we have to change it in each translator, which also seems not ideal. In general I prefer good documentation over any mystery functions (and conventions by name). Maybe we should start to update the documentation for translator coding or write a new one (and I would prefer to do that on GitHub directly). My common code blocks for translator coding could maybe be expanded into a general scraping routine with XPath imitating what FW does (BTW, we could also use the wiki here in the official repo, if you want to activate that). The current PRs by @owcz share another thing I am a little sceptical about: they mostly deal with blogs or magazines. For example, this blog post http://www.digitalspy.com/gaming/xbox-one/news/a661615/rare-replay-review-roundup-one-of-the-best-collections-in-gaming-history/ already seems to give quite good data with the Embedded Metadata (EM) translator. Is EM now working and activated with Citoid? (I remember there were some problems...) Moreover, it looks like the data extraction could be improved by supporting JSON-LD and/or schema.org. I would rather minimize the number of special translators for blogs and try to improve the general translators like EM to support the majority of blogs better.
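The JSON-LD idea can be sketched roughly like this (this is not existing EM code; the extraction function, field names, and sample article values below are all illustrative, and a real implementation would walk the parsed DOM rather than running a regex over raw HTML):

```javascript
// Illustrative sketch: pull basic schema.org fields out of a JSON-LD block.
// The HTML string and its values are made up for the example.
var html = '<script type="application/ld+json">'
	+ JSON.stringify({
		'@context': 'https://schema.org',
		'@type': 'NewsArticle',
		headline: 'Rare Replay review round-up',
		author: { '@type': 'Person', name: 'Jane Doe' },
		datePublished: '2015-08-03'
	})
	+ '</script>';

// Hypothetical helper: find the first JSON-LD script block and parse it
function extractJsonLd(html) {
	var m = html.match(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/);
	return m ? JSON.parse(m[1]) : null;
}

var data = extractJsonLd(html);
console.log(data.headline);
console.log(data.author.name);
```

A general translator like EM could map fields such as `headline`, `author.name`, and `datePublished` onto Zotero item fields, covering many blogs without site-specific translators.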
Citoid+EM is blocked, I believe, by zotero/translation-server/issues/31 (T140539 at WM), but I imagine the goal is to get it up eventually. None of my recent submissions fit fully under EM. Some mostly work with EM apart from one or two key fields (usually author or date) and thus need separate translators. (I went over their XPaths with Sebastian some time ago and he found those not standard enough to merit integrating into EM.) Digital Spy, for instance, is mostly fine with EM, apart from where it butchers its
Interesting question from @fcheslack: any reason we're still preferring XPaths in scrapers to CSS selectors (via
No reason I'm aware of. This should work already, right? So we could just rewrite a simple translator to see how it looks?
Yup.
How can we then write item.abstractNote = ZU.xpathText(doc, '//div[@class="abstract"]'); (We have AFAIK a lot of lines looking like this.)
(It's also possible we could have some magic that automatically grabbed
(And the
Yes, that is also what I came up with. But often these elements are not present on every page. Because
True. So then magic
Magic extraction would also be helpful if we ever do things that require For a helper function, we might be able to add a function into translator scope with
Actually, now that I write that out,
Let me know if you need anything from me. I'm sure there are some major missing conceptual elements in the framework, the code is not well documented or commented, etc., so I am glad to help if I can.
@egh -- I think the main thinking is that Zotero helper functions have gotten much better (in particular ZU.xpath and ZU.xpathText), that site metadata across the internet has gotten so much better that it makes sense to call the Embedded Metadata translator almost everywhere as a basis for translators, and that, as Dan points out, FW currently causes massive code duplication (the same translator with and without FW is about 25 vs. 5 kb).
@adam3smith I wasn't really clear - I totally understand the point of phasing it out, and that it doesn't fit in that well with the Zotero ecosystem. But if you do decide otherwise, let me know if there's anything I can do to help update or otherwise integrate. It makes me happy to know that I was able to help get you involved in translators! Thanks.
Is there an estimate on that
I don't see ZU.xpath and ZU.xpathText getting deprecated, so you can just use those. Follow https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators for best practices and easy review. I find XPaths very intuitive by now, so adding the utility function using CSS selectors isn't high priority for me; not sure what Dan's thinking is on this.
I can handle writing without FW but I also teach others, so it's in my interest to simplify the writing process. One of FW's benefits was in multiples: FW.MultiScraper({
itemType : 'multiple',
detect : FW.Url().match(/\/news\//), // news page
choices : {
titles : FW.Xpath('//div[@class="title"]/h2/a').text().trim(),
urls : FW.Xpath('//div[@class="title"]/h2/a').key("href").trim()
}
}); which is much easier to visually parse than the common code block: function getSearchResults(doc, checkOnly) {
var items = {};
var found = false;
//TODO: adjust the xpath
var rows = ZU.xpath(doc, '//a[contains(@href, "/article/")]');
for (var i=0; i<rows.length; i++) {
//TODO: check and maybe adjust
var href = rows[i].href;
//TODO: check and maybe adjust
var title = ZU.trimInternal(rows[i].textContent);
if (!href || !title) continue;
if (checkOnly) return true;
found = true;
items[href] = title;
}
return found ? items : false;
}
function doWeb(doc, url) {
if (detectWeb(doc, url) == "multiple") {
Zotero.selectItems(getSearchResults(doc, false), function (items) {
if (!items) {
return true;
}
var articles = [];
for (var i in items) {
articles.push(i);
}
ZU.processDocuments(articles, scrape);
});
} else {
scrape(doc, url);
}
} especially given that the XPath/selector (to point to the multi's URLs) is almost always the only part of the multi that changes, would it be sensible/possible to break these two loops into two default subroutines/functions as well?
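One way those two boilerplate loops could be factored out is to parameterize the row selector, which is the only piece that usually varies. The following is a sketch, not an existing Zotero API: the function name, the mock document, and the stubbed `ZU.trimInternal` are all stand-ins just to make the idea runnable.

```javascript
// Stub of Zotero.Utilities.trimInternal, just enough to run the sketch
var ZU = { trimInternal: function (s) { return s.replace(/\s+/g, ' ').trim(); } };

// Hypothetical generic multiple-handler: only rowSelector changes per site
function getSearchResultsFor(doc, checkOnly, rowSelector) {
	var items = {};
	var found = false;
	var rows = doc.querySelectorAll(rowSelector);
	for (var i = 0; i < rows.length; i++) {
		var href = rows[i].href;
		var title = ZU.trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

// Stand-in "document" exposing just querySelectorAll (returns an array here)
var fakeDoc = {
	querySelectorAll: function (sel) {
		if (sel === 'a.article') {
			return [
				{ href: '/article/1', textContent: '  First   story ' },
				{ href: '/article/2', textContent: 'Second story' }
			];
		}
		return [];
	}
};

console.log(getSearchResultsFor(fakeDoc, true, 'a.article'));   // true
console.log(getSearchResultsFor(fakeDoc, false, 'a.article'));  // href-to-title map
console.log(getSearchResultsFor(fakeDoc, false, 'div.none'));   // false
```

A site translator would then only need to supply its own selector, keeping the detect/select/scrape plumbing in shared code.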
I can work on this. I think CSS selectors and DOM methods are much easier and more familiar to people, and they can be generated using the developer tools built into the browser. I don't think new translators should use XPaths (except for XML documents). Any new functions will need to be included as polyfills for now, since they won't exist in 4.0. One catch is that, as polyfills, they won't have I've just spent a few minutes playing around with this with various translators, and it seems most things can be done with two helper functions: function attr(doc, selector, attr, index) {
if (index > 0) {
var elem = doc.querySelectorAll(selector).item(index);
return elem ? elem.getAttribute(attr) : null;
}
var elem = doc.querySelector(selector);
return elem ? elem.getAttribute(attr) : null;
}
function text(doc, selector, index) {
if (index > 0) {
var elem = doc.querySelectorAll(selector).item(index);
return elem ? elem.textContent : null;
}
var elem = doc.querySelector(selector);
return elem ? elem.textContent : null;
} (It's trivial to replace current uses of If you want to use these as a polyfill, here's some minified code:
If you do try these, let me know if you run into any trouble or if there are things we can optimize further.
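A quick way to see how the two helpers behave, including the null return for missing nodes, is to run them against a minimal stand-in document. The stand-in below is purely illustrative; in Zotero the helpers would receive a real doc or element.

```javascript
// The attr()/text() helpers quoted above, unchanged
function attr(doc, selector, attr, index) {
	if (index > 0) {
		var elem = doc.querySelectorAll(selector).item(index);
		return elem ? elem.getAttribute(attr) : null;
	}
	var elem = doc.querySelector(selector);
	return elem ? elem.getAttribute(attr) : null;
}

function text(doc, selector, index) {
	if (index > 0) {
		var elem = doc.querySelectorAll(selector).item(index);
		return elem ? elem.textContent : null;
	}
	var elem = doc.querySelector(selector);
	return elem ? elem.textContent : null;
}

// Helper to build fake elements (illustrative, not a real DOM)
function makeElem(textContent, attrs) {
	return {
		textContent: textContent,
		getAttribute: function (name) {
			return Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : null;
		}
	};
}

var nodes = {
	'h1.title': [makeElem('Rare Replay review', { id: 'headline' })],
	'div.byline a': [
		makeElem('First Author', { href: '/a/1' }),
		makeElem('Second Author', { href: '/a/2' })
	]
};

// Stand-in document implementing just querySelector/querySelectorAll
var fakeDoc = {
	querySelector: function (sel) { return (nodes[sel] || [null])[0]; },
	querySelectorAll: function (sel) {
		var list = nodes[sel] || [];
		return { item: function (i) { return list[i] || null; } };
	}
};

console.log(text(fakeDoc, 'h1.title'));                // "Rare Replay review"
console.log(attr(fakeDoc, 'div.byline a', 'href', 1)); // "/a/2"
console.log(text(fakeDoc, 'p.missing'));               // null
```

The null fallback is what lets optional fields be assigned directly without wrapping each lookup in existence checks.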
We can simplify the two functions into one and reduce the code duplication a little further. Moreover, I would suggest using a less generic name for the function, e.g. /*
 * Query the CSS path `selector` in the document `doc`,
 * possibly only at the element at index `index`, and return either
 * i) a specific attribute `attr`,
 * ii) the text under this node, or
 * iii) `null` if there is no matching node/attribute
 */
function returnCssPathValue(doc, selector, attr, index) {
var elem = index > 0 ? doc.querySelectorAll(selector).item(index) : doc.querySelector(selector);
if (attr) {
return elem ? elem.getAttribute(attr) : null;
} else {
return elem ? elem.textContent : null;
}
} My first notes here are: 1) What happens when we call
I strongly disagree on the name — the whole point of this is to be short and easy to type and read, since it can appear many times in a translator. As for combining the two, I don't see much point to that. The goal is to make translators as easy to write as possible. An extra tiny function in the translator architecture (where this will end up) doesn't matter at all, and it seems easier to me to just specify whether you want an attribute or text than to mix concerns and worry about an additional positional argument (which you would also always have to set as
I'm not sure what you mean by that. You can't call
Not really problematic — you have to remember that the selectors are still from the root (matching only elements below the given element), but most of the time that won't matter. But I think the functions above should work fine with an element instead of a doc (though we should rename the parameter to reflect that). We can leave in support for an optional first element in the built-in versions even after
I agree with keeping the name short, but I also agree with combining the functions (such that
The function Shouldn't you test on the existence of Also, if you don't want to combine the two functions, you can still write them more condensed without the code duplication, e.g. function text(doc, selector, index) {
var elem = index ? doc.querySelectorAll(selector).item(index) : doc.querySelector(selector);
return elem ? elem.textContent : null;
} The
Update: Sorry, my fault, I didn't see that you use
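One subtlety of the condensed form: `index ? ... : ...` treats an explicit index of 0 the same as an omitted index, which is harmless because `querySelectorAll(selector).item(0)` and `querySelector(selector)` return the same node. A quick check with a stand-in document (illustrative only, not a real DOM):

```javascript
// The condensed text() from above, unchanged
function text(doc, selector, index) {
	var elem = index ? doc.querySelectorAll(selector).item(index) : doc.querySelector(selector);
	return elem ? elem.textContent : null;
}

// Stand-in document: querySelector returns the first match,
// querySelectorAll returns a NodeList-like object with item()
var list = [{ textContent: 'first' }, { textContent: 'second' }];
var fakeDoc = {
	querySelector: function () { return list[0]; },
	querySelectorAll: function () { return { item: function (i) { return list[i] || null; } }; }
};

console.log(text(fakeDoc, 'li'));    // "first"
console.log(text(fakeDoc, 'li', 0)); // "first" (falls back to querySelector)
console.log(text(fakeDoc, 'li', 1)); // "second"
console.log(text(fakeDoc, 'li', 5)); // null (no such element)
```

So the `index > 0` and bare `index` variants behave identically for every valid input.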
I assume that was a mistake. There's a
You didn't cross this out, but you don't need to include the
Sure. I wrote these quickly as examples, but condensed versions would be fine: function attr(docOrElem, selector, attr, index) {
var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
return elem ? elem.getAttribute(attr) : null;
}
function text(docOrElem, selector, index) {
var elem = index ? docOrElem.querySelectorAll(selector).item(index) : docOrElem.querySelector(selector);
return elem ? elem.textContent : null;
} Minified:
I think I generated the first version of that page, and it's quite possible that my script misunderstood
Ok, I fixed the documentation:
@dstillman With CSS selectors one cannot restrict on the text value of the nodes. We sometimes have to do this to scrape information out of unstructured websites. For example, there can be a table where the first column contains the label and the second the value. Then we need to find the correct row according to the label and then go to its value. See e.g. translators/Patents - USPTO.js Line 93 in e04633b
Well, we're not deprecating use of XPath or the XPath functions, just FW, since The main reason to discourage XPaths would just be that they're another, more complicated, lesser-known technology than CSS selectors and ES6, which would use something like this: Array.from(document.querySelectorAll('th, td'))
.filter(el => el.textContent.startsWith('Inventors'))[0]
.parentNode.querySelector('td:last-child') I personally find that much more comfortable than XPaths. The main downside is that it would require a try/catch to avoid errors for optional values. Maybe there are a few more common patterns that we could encapsulate this sort of logic for (e.g., a
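Such an encapsulation might look like the following sketch. The helper name, its signature, and the mock data are hypothetical, not an existing Zotero API; it simply wraps the Array.from/filter pattern shown above so optional lookups return null instead of throwing.

```javascript
// Hypothetical helper: first element matching `selector` whose text starts
// with `prefix`, or null if there is no such element
function elemStartingWith(docOrElem, selector, prefix) {
	var match = Array.from(docOrElem.querySelectorAll(selector))
		.filter(function (el) { return el.textContent.startsWith(prefix); })[0];
	return match || null;
}

// Mock table cells standing in for the DOM (illustrative values)
var cells = [
	{ textContent: 'Assignee:' },
	{ textContent: 'Inventors: J. Smith' }
];
var fakeDoc = { querySelectorAll: function () { return cells; } };

console.log(elemStartingWith(fakeDoc, 'th, td', 'Inventors').textContent);
// "Inventors: J. Smith"
console.log(elemStartingWith(fakeDoc, 'th, td', 'Examiner')); // null
```

Returning null rather than throwing keeps the try/catch out of translator code for optional fields, matching how text()/attr() behave.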
The current document is automatically used (but can still be provided as the first argument to avoid accidental bugs during the transition). Closes #1323 Addresses zotero/translators#1277
This adds attr()/text() to the translator sandbox for zotero/translators#1277.
In the next client/connector versions,
I've rewritten all of my translators that had been pending for the last year because they used Framework. Should be good for review now, if you have time to take a look.
@owcz has been writing lots of new translators using FW. That's great (thanks!), but before we go too much further with that, I wanted to get @adam3smith's and @zuphilip's thoughts on FW. I know @simonster was never really a fan of it, both because it's limited in what it can do and because he thought that for people who knew basic JS it might actually be harder than writing a translator without it, but you guys have worked with translators more recently, so we'd be interested in your perspective. If we do think there are benefits, though, I think we should consider finding a way to achieve those in the main architecture rather than having this large chunk of code duplicated into lots of different translators.