Downloading and parsing TPB metadata to estimate Bitcoin usage and revenue for uploaders
Background
SDr> gwern, here's a shiny new angle for your cryptocurrencies knowledge file: do crypto donations to warez distributors, like work?
gwern> SDr: how would that work? they put addresses in the READMEs of their torrents or something?
SDr> gwern, specifically eg. scraping Pirate Bay's NFO files for wallet addresses & cross-referencing it with blockchain, is there volume for it, such that distributors are incentivized to provide clean cracks / keygens, as opposed to bundling blackmail-ware with it?
TODO: compare against PayPal, Flattr, Gratipay?
Watashi, kininarimasu! ("I'm curious!")
Data
Download
Sources:
https://archive.org/details/Backup_of_The_Pirate_Bay_32xxxxx-to-79xxxxx
https://github.com/andronikov/tpb2csv
It is more efficient not to download comments, so the comment fetching in download.py is commented out:
diff --git a/download.py b/download.py
index 82837e2..9fed0aa 100644
--- a/download.py
+++ b/download.py
@@ -6,12 +6,12 @@
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
-#
+#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
-#
+#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
@@ -21,7 +21,7 @@ import HTMLParser
import torrent_page
import filelist
-import comments
+# import comments
import requests
import datetime
@@ -39,10 +39,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
elif (tp_status_code == 404):
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
else:
print "ERROR: HTTP " + str(tp_status_code)
error_file.write(datetime.datetime.utcnow().strftime("[%FT%H:%M:%SZ]") + ' ' + str(torrent_id) + ": ERROR: HTTP " + str(tp_status_code) + '\n')
@@ -58,10 +58,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
else:
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
time_log.flush()
break # Success! Break out of the while loop
@@ -72,10 +72,10 @@ def main():
tp_status_code = torrent_page.get_torrent_page(torrent_id, protocol)
if (tp_status_code == 200):
filelist.get_filelist(torrent_id, protocol)
- comments.get_comments(torrent_id, protocol)
+ # comments.get_comments(torrent_id, protocol)
else:
print "Skipping filelist..."
- print "Skipping comments..."
+ # print "Skipping comments..."
time_log.write(datetime.datetime.utcnow().strftime("%FT%H:%M:%SZ") + ' ' + str(torrent_id) + " " + str(tp_status_code) + '\n')
time_log.flush()
break # Success! Break out of the while loop
@@ -102,7 +102,7 @@ if (len(sys.argv) == 2+offset):
torrent_id = sys.argv[1+offset]
print torrent_id
main()
-
+
elif (len(sys.argv) == 3+offset):
if (int(sys.argv[1+offset]) > (int(sys.argv[2+offset])+1)):
for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])-1, -1):
@@ -112,6 +112,6 @@ elif (len(sys.argv) == 3+offset):
for torrent_id in range(int(sys.argv[1+offset]),int(sys.argv[2+offset])+1):
print torrent_id
main()
-
+
elif (len(sys.argv) > 3 and not https):
print "ERROR: Too many arguments"
Scrape started 2014-02-25, 8:51PM EST:
sudo apt-get install python-requests python-beautifulsoup
git clone git@github.com:andronikov/tpb2csv.git
cd tpb2csv
sed -i -e 's/thepiratebay\.sx/thepiratebay.se/' *.py
# End:
# http://thepiratebay.se/recent
# most recent: http://thepiratebay.se/torrent/9666564/Blonde_Avenger_008_%28BlitzWeasel_-_1995%29_%28Talon-Novus-HD%29_%5BNVS-D%5D
# ID: 9666564
python download.py 1 9666564
8,xxx,xxx
21:02:44 <@gwern> https://github.com/tpb-archive?tab=repositories oh, I think I understand: they post only each set of million torrents
21:03:15 <@gwern> before 3m apparently is unavailable, and the latest TPB torrent when I checked a little bit ago was #9,666,564 as in, the 9x series isn’t ready yet
21:03:32 <@gwern> so, it goes 3m, 4m, 5m, 6m, 7m, 8m
21:04:01 <@gwern> so I can download those, and use tpb2csv to pick up the 666k most recent ones
21:04:25 <@quanticle> gwern: If I might ask, why are you archiving TPB torrents?
21:04:45 <@gwern> quanticle: SDr asked how many torrent authors provide bitcoin donation addresses, and how much they’d received
09:06 PM 0Mb$ python download.py 9000000 9666564
git clone https://github.com/tpb-archive/3xxxxxx.git
git clone https://github.com/tpb-archive/4xxxxxx.git
git clone https://github.com/tpb-archive/5xxxxxx.git
git clone https://github.com/tpb-archive/6xxxxxx.git
git clone https://github.com/tpb-archive/7xxxxxx.git
git clone https://github.com/tpb-archive/8xxxxxx.git
http://thepiratebay.se/recent
http://thepiratebay.se/torrent/9669888/Aimi_Nagano_-_Girls_Do_Not_Sit_On_The_Bus
$ cd home/gwern/tpb2csv/ && python download.py 9000977 9666564
cd ~/tpb2csv/ && while true; do (LATEST=$(find ./data/9xxxxxx/ -type d | cut --delimiter='/' --fields=6 | sort --numeric-sort | tail --lines=1); python download.py $LATEST 9670000; sleep 5s); done
problems with tpb2csv:
- hardwired to .sx, needs .se
- fragile, easily bombs out after a few connection failures
- doesn’t check for already-downloaded?
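The last two problems could be worked around with a small wrapper. What follows is only a sketch under stated assumptions, not how tpb2csv works: `fetch_with_retry`, its parameters, and the one-directory-per-ID layout under `out_dir` are all hypothetical, and `fetch` stands in for whatever function actually downloads one torrent page and raises `OSError` (or a subclass) when the connection fails.

```python
import os
import time


def fetch_with_retry(fetch, torrent_id, out_dir, retries=5, base_delay=1.0,
                     sleep=time.sleep):
    """Download one torrent ID, skipping it if already present.

    Assumptions (not taken from tpb2csv): each finished download leaves a
    directory out_dir/<torrent_id>, and fetch(torrent_id) raises OSError
    on a connection failure.
    """
    target = os.path.join(out_dir, str(torrent_id))
    if os.path.isdir(target):
        return "skipped"            # already downloaded, nothing to do
    for attempt in range(retries):
        try:
            fetch(torrent_id)
            return "ok"
        except OSError:
            # Exponential backoff: 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return "failed"                 # give up on this ID, but do not crash
```

Returning a status instead of raising means one bad ID does not stop a long run; the caller can log the "failed" IDs and retry them later.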
desired features:
- default final target from http://thepiratebay.se/recent
- optionally disable comments, filelist, metadata, description
but could also just figure out all missing IDs, with countmissing.hs:
-- Usage: runghc countmissing.hs LATEST_ID
-- Reads the already-downloaded IDs from ids.txt and writes every ID in
-- [9000000..LATEST_ID] that is not among them to missing.txt.
import Data.Set (fromList, difference, toList)
import System.Environment (getArgs)

main :: IO ()
main = do args <- getArgs
          let latest = read (head args) :: Int
          ids <- readFile "ids.txt"
          let numbers = Data.Set.fromList ((Prelude.map read $ lines ids) :: [Int])
          let allIDs = Data.Set.fromList [9000000..latest]
          let missing = Data.Set.difference allIDs numbers
          writeFile "missing.txt" $ unlines $ Prelude.map show $ Data.Set.toList missing
and finally, GNU parallel to download each torrent ID separately without mucking around with ranges:
cd ~/tpb/tpb2csv/ && rm ids.txt missing.txt randomized.txt
find ./data/9xxxxxx/ -type d | cut --delimiter='/' --fields=6 | sort --unique | tail --lines=+2 > ids.txt
LATEST_TORRENT="$(elinks -dump 'http://thepiratebay.se/recent' | grep -F 'http://thepiratebay.se/torrent/1' | cut -d '/' -f 5 | head -1)"
runghc countmissing.hs $LATEST_TORRENT
# sort --random-sort missing.txt > randomized.txt
# cat randomized.txt | parallel --max-chars=40 --ungroup --jobs 7 -- python -OO download.py
cat missing.txt | tac | parallel --max-chars=40 --ungroup --jobs 1 -- nice python -OO download.py
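For comparison, the missing-ID step (ids.txt in, missing.txt out) is just a set difference and can be sketched in Python as well. The function names here are hypothetical; the 9000000 starting point and the file names are the ones used above.

```python
def missing_ids(downloaded, first, latest):
    """Return, in ascending order, every ID in [first, latest] not in downloaded."""
    return sorted(set(range(first, latest + 1)) - set(downloaded))


def write_missing(latest, ids_path="ids.txt", out_path="missing.txt",
                  first=9000000):
    """File-based wrapper: one ID per line in, one ID per line out."""
    with open(ids_path) as handle:
        downloaded = [int(line) for line in handle if line.strip()]
    with open(out_path, "w") as handle:
        for torrent_id in missing_ids(downloaded, first, latest):
            handle.write("%d\n" % torrent_id)
```

Usage would be `write_missing(9666564)` in place of the `runghc countmissing.hs $LATEST_TORRENT` line.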
gwern> there are an amazing number of 404s on tpb. I wonder why
gwern> why would they ever delete a page?
gwern> short of CP
X> They delete spam, viruses, fake uploads, uploads with wrong names. There is tons of that on TPB actually (and usually uploaded in bulk, that's why the 404s usually form "blocks"). Not that much CP.
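The claim that 404s come in "blocks" is easy to check once there is a list of IDs that returned 404 (how that list gets collected is not covered here; it is an assumed input). A minimal sketch, with a hypothetical helper name, that collapses such a list into contiguous runs:

```python
def contiguous_runs(ids):
    """Collapse a collection of integer IDs into sorted (start, end) runs."""
    runs = []
    for torrent_id in sorted(set(ids)):
        if runs and torrent_id == runs[-1][1] + 1:
            runs[-1][1] = torrent_id      # extend the current run by one
        else:
            runs.append([torrent_id, torrent_id])   # start a new run
    return [(start, end) for start, end in runs]
```

If bulk deletions dominate, most of the 404 IDs should end up in a small number of long runs rather than in many runs of length one.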
Processing
$ find . -type f -name "description.txt" -exec grep --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# UniqueID/String : 198031419884096486800071953706812228345 (0x94FB76D26E78C1B7A2EA96FDDB10C2F9)
# http://easyimghost.com/ImageHosting/8102_7346420e085511e2ba4022000a1e89327.jpg.html
# http://easyimghost.com/ImageHosting/8104_83901b420fe611e29797123138133f0a7.jpg.html
# http://sharepic.biz/show-image.php?id=b264f152debff549632be27bfe965f86
# http://image.bayimg.com/08f6b9b52398b07c3c86e1dc1f3b3d36594e67b8.jpg
# http://image.bayimg.com/0b397e88fdefe06fa99acddf86190f4cd2ef3922.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://image.bayimg.com/34eae2b9725eb15e7a58fd6bf6e2fedb2c5af0a7.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# d17a78495f673990bb6d3ea096e5830bcc7dc4dd
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# B0A544F55125A26C21622D4118739305B9088448
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://image.bayimg.com/6cd5e44b139c1506dfa8f8d59003187454700627.jpg
# Want to help us out? BitCoin: 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# http://rss.thepiratebay.se/user/01997b95ece549c293b824450ea84389
# http://image.bayimg.com/0070020142c02ecbadd55e73f0d60a7169367885.jpg
# http://photosex.biz/v.php?id=a7f013597a02d5ec0fffe529da47e7eb
# http://image.bayimg.com/b6a1d70ca01e50d91f30ef683756442d4d52a357.jpg
$ find . -type f -name "description.txt" -exec grep --only-matching --extended-regexp '[13][a-zA-Z0-9]{26,33}' {} \;
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 1980314198840964868000719537068122
# 346420e085511e2ba4022000a1e89327
# 3901b420fe611e29797123138133f0a7
# 152debff549632be27bfe965f86
# 398b07c3c86e1dc1f3b3d36594e67b8
# 397e88fdefe06fa99acddf86190f4cd2ef
# 1Fw4sRdZYFwULfeE7oQ91GLYKy8j5EcuFW
# 34eae2b9725eb15e7a58fd6bf6e2fedb2c
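Most of these hits are fragments of hex hashes and image IDs, not addresses. One way to filter them, sketched here as an assumption rather than as what was actually done, is to keep only candidates that pass the Base58Check checksum used by legacy Bitcoin addresses (version byte 0x00 for "1..." addresses, 0x05 for "3..." addresses). The function names are hypothetical, and the check is slightly loose: it does not verify that the number of leading "1" characters matches the number of leading zero bytes.

```python
import hashlib
import re
from collections import Counter

# Same loose pattern as the grep above; it also matches hashes and image IDs.
CANDIDATE = re.compile(r"[13][a-zA-Z0-9]{26,33}")
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def is_valid_address(candidate):
    """True if candidate decodes to 25 bytes with a correct Base58Check checksum."""
    number = 0
    for char in candidate:
        index = B58_ALPHABET.find(char)
        if index < 0:              # '0', 'O', 'I' and 'l' are not Base58 digits
            return False
        number = number * 58 + index
    try:
        raw = number.to_bytes(25, "big")   # 1 version + 20 hash + 4 checksum
    except OverflowError:
        return False
    payload, checksum = raw[:21], raw[21:]
    if payload[0] not in (0x00, 0x05):     # legacy P2PKH / P2SH versions only
        return False
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def count_addresses(texts):
    """Count checksum-valid addresses over an iterable of description texts."""
    counts = Counter()
    for text in texts:
        for candidate in CANDIDATE.findall(text):
            if is_valid_address(candidate):
                counts[candidate] += 1
    return counts
```

Feeding the contents of every description.txt to `count_addresses` would give a per-address tally of how many torrents advertise it, which is the list one would then look up on the blockchain.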
maybe use R to query the blockchain? http://beautifuldata.net/2015/01/querying-the-bitcoin-blockchain-with-r/
nice find tpb/ -type f -print0 | sort --zero-terminated --key=6,5 --field-separator="/" | tar --no-recursion --null --files-from - -c | nice xz -9e --check=sha256 --stdout > ~/tpb.tar.xz; alert
I still miss anything newer than ID 8599995 (06-2013). There is this torrent …9xxx, 10xxx and 11xxx are missing (the last torrents uploaded to TPB were 11xxx) …There were the github CSV repos (that I linked now on the page) by some other guy, but github took them down (at least the 4xxxxxx one)… I am not sure why, but they did, and they did it today, more exactly about 10 minutes ago while I was cloning it to my disk …Do you have them? I found your website when I googled it, it seems you did some experiments on it. …If you do, could you upload it somewhere? (Ideally some torrent) If you have newer than 8xxxxxx (as you seem you had), that would be even perfecter
I’m not sure I have as much as you think I have: the TPB downloader broke 2014-09-18, and I stopped scraping. (I was getting tired of having to babysit it, it was using up a lot of bandwidth and disk space, which made my backups much slower, and I wanted to focus on scraping the blackmarkets.) Also, note that I hacked the downloader to only download the metadata I wanted for my Bitcoin analysis; I did not intend or want to mirror all of TPB since I assumed the original archivers were doing that and that the TPB itself had better procedures than my scraping. So while I don’t remember editing any of the archives I pulled off Github, the more recent files (which are probably what you really want) will not be complete.
Here’s a summary of what I have:
$ ls
03xxxxxx/ 04xxxxxx/ 05xxxxxx/ 06xxxxxx/ 07xxxxxx/ 08xxxxxx/ 09xxxxxx/ 10xxxxxx/ 11xxxxxx/ tpb2csv/
$ duh */.git/
481M 03xxxxxx/.git/
684M 04xxxxxx/.git/
718M 05xxxxxx/.git/
944M 06xxxxxx/.git/
806M 07xxxxxx/.git/
495M 08xxxxxx/.git/
156K tpb2csv/.git/
4.1G total
$ du -ch *
5.6G 03xxxxxx
7.7G 04xxxxxx
8.2G 05xxxxxx
11G 06xxxxxx
9.9G 07xxxxxx
6.4G 08xxxxxx
7.3G 09xxxxxx
3.8G 10xxxxxx
1.8G 11xxxxxx
208K tpb2csv
62G total
$ find ~/tpb/ -type f | sort | xz -9e --check=sha256 --stdout > ~/tpb.txt.xz
# https://www.dropbox.com/s/te1zimevmmzi1qg/tpb.txt.xz