User:SchlurcherBot

SchlurcherBot

Function overview: Convert links from http:// to https://

Programming language: C#

Source code available: Main C# script: commons:User:SchlurcherBot/LinkChecker

Namespaces: This bot edits only in namespaces 0 (Main) and 6 (File)

Function details: The link checking algorithm is as follows:

  1. The bot extracts all http-links from the parsed HTML of a page
    • It searches for all href attributes and extracts their links
    • It does not search the wikitext, and thus does not rely on any regular expressions
    • This also avoids problems with templates that modify links (such as archiving templates)
    • Links that are contained in other links are filtered out to minimize search-and-replace errors
  2. The bot checks whether the identified http-links also occur in the wikitext; links that do not are skipped
  3. The bot checks whether both the http-link and the corresponding https-link are accessible
    • This step also uses a blacklist of domains that were previously identified as inaccessible
  4. If both links redirect to the same page, the http-link is replaced by the https-link (the link is not changed to the redirect target; the original link path is kept)
  5. If both links are accessible and return a success code (2xx), the bot checks whether their content is identical
    1. If the content is identical and the link points directly to the host, the http-link is replaced by the https-link
    2. If the content is identical but the link does not point directly to the host, the bot also compares the content with that of the host link; the http-link is replaced by the https-link only if the two differ
      • This step was added because some hosts return the same content for all their pages (like most domain sellers, some news sites, or pages undergoing maintenance)
    3. If the content is not identical, the bot checks whether it is at least 99.9% identical (calculated via the Levenshtein distance)
      • This step was added because most homepages use dynamic IDs for certain elements, for example for ad containers to circumvent ad blockers
    4. If the content is at least 99.9% identical, the same host check as before is performed
    5. If any of the checked links fails (for example with code 404), the link is left unchanged
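
The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the bot's actual C# code: the names `HrefCollector`, `candidate_links`, `Fetched` and `decide` are hypothetical, fetching the pages and the domain blacklist are left out, and two details are assumptions because the description does not specify them: the similarity is normalized by the length of the longer text, and the comparison with the host page is an exact string comparison.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects http:// targets of href attributes from rendered HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value and value.startswith("http://"):
                self.links.append(value)


def candidate_links(html, wikitext):
    """Steps 1-2: http-links found in the HTML that also occur in the wikitext."""
    collector = HrefCollector()
    collector.feed(html)
    links = set(collector.links)
    # Drop links contained in other links to avoid partial replacements.
    links = {a for a in links if not any(a != b and a in b for b in links)}
    return sorted(link for link in links if link in wikitext)


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer text."""
    if a == b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


@dataclass
class Fetched:
    """Hypothetical result of requesting one URL."""
    status: int
    body: str = ""
    redirect_target: str = ""  # empty if the request was not redirected


def decide(http, https, host, is_host_link, threshold=0.999):
    """Steps 4-5: return True if the http-link should become an https-link."""
    # Step 4: both links redirect to the same page.
    if http.redirect_target and http.redirect_target == https.redirect_target:
        return True
    # Step 5.5: any failure (non-2xx) means the link is left unchanged.
    if not (200 <= http.status < 300 and 200 <= https.status < 300):
        return False
    # Steps 5.1 and 5.3: identical or at least 99.9% identical content.
    if similarity(http.body, https.body) < threshold:
        return False
    # Steps 5.2 and 5.4: a non-host link must differ from the host page.
    return is_host_link or https.body != host.body
```

The host check at the end guards against sites that serve one generic page for every path: in that case the http and https bodies match each other but also match the host page, so the link is left alone.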

Source for pages: The bot works on the list of pages identified through the external links SQL dump. The list was shuffled so that consecutive edits are not clustered in a specific area.

Further comments: The bot respects the API:Etiquette: it sends a user-agent header and respects the maxlag parameter.
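
As an illustration of these two points, a request to the MediaWiki API could be prepared as in the following Python sketch. It is not taken from the bot's C# source: the function name `build_api_request` and the user-agent string are hypothetical, and the value `maxlag=5` is only an assumption about a reasonable setting. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_api_request(api_url, params, user_agent, maxlag=5):
    """Build a POST request that carries a User-Agent header and maxlag."""
    # maxlag asks the server to refuse the request while replication lag
    # exceeds the given number of seconds.
    query = dict(params, format="json", maxlag=str(maxlag))
    return Request(
        api_url,
        data=urlencode(query).encode("utf-8"),
        headers={"User-Agent": user_agent},
    )


request = build_api_request(
    "https://commons.wikimedia.org/w/api.php",
    {"action": "query", "meta": "siteinfo"},
    "ExampleLinkBot/0.1 (hypothetical contact address)",
)
```

When the server is lagged, the API answers with a `maxlag` error instead of performing the action, and a well-behaved client waits before retrying.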

Status: (CentralAuth)

Approved as a global bot (per this request) and thus flagged as a bot on all projects that did not opt out (per this list).

Project | Request | Pages | Edit description used | Status
commonswiki | Approved | 31'145'089 | Fix http to https | Running…
dewiki | Approved | 1'888'381 | Bot: http → https | Running…
enwiki | Approved | 8'570'327 | Bot: http → https | Running…
eswiki | Approved | 2'191'542 | Bot: http → https | Running…
frwiki | Approved | 2'970'187 | Bot: http → https | Running…
itwiki | Approved | 2'359'233 | Bot: http → https | Running…
jawiki | Allows global bots | 994'375 | Bot: http → https | Running…
plwiki | Approved | 1'527'763 | Bot: http → https | Running…
ptwiki | Approved | 1'214'889 | Bot: http → https | Running…
ruwiki | Allows global bots | 1'797'992 | Bot: http → https | Running…
zhwiki | Allows global bots | 1'105'051 | Bot: http → https | Running…
dewikinews | Pending | 17'280 | Bot: http → https | On hold
dewikiquote | Pending | 5'673 | Bot: http → https | On hold
dewikisource | Pending | 97'284 | Bot: http → https | On hold
dewikiversity | Approved | 9'301 | Bot: http → https | Working Waiting
dewikivoyage | Pending | 19'094 | Bot: http → https | On hold
dewiktionary | Pending | 145'334 | Bot: http → https | On hold