Taupe: a tool to extract URLs from your personal Twitter archive
Project description
Taupe
A simple program to extract the URLs of your tweets, retweets, replies, quote tweets, and "likes" from a personal Twitter archive.
Table of contents
- Introduction
- Installation
- Usage
- Known issues and limitations
- Getting help
- Contributing
- License
- Acknowledgments
Introduction
When you download your personal Twitter archive, you receive a ZIP file. The contents are not necessarily in a format convenient for doing something with them. For example, you may want to send the URLs to the Wayback Machine at the Internet Archive or do something else with the URLs. For tasks like that, you need to extract URLs from your Twitter archive. That's the purpose of Taupe.
Taupe (a loose acronym of Twitter archive URL parser) takes a Twitter archive ZIP file, extracts the URLs corresponding to your tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can easily use with other software tools.
Installation
There are multiple ways of installing Taupe. Please choose the alternative that suits you.
Alternative 1: installing Taupe using pipx
You can use pipx to install Taupe. Pipx will install it into a separate Python environment that isolates the dependencies needed by Taupe from other Python programs on your system, and yet the resulting taupe
command wil be executable from any shell – like any normal program on your computer. If you do not already have pipx
on your system, it can be installed in a variety of easy ways and it is best to consult Pipx's installation guide for instructions. Once you have pipx on your system, you can install Taupe with the following command:
pipx install taupe
Pipx can also let you run Taupe directly using pipx run taupe
, although in that case, you must always prefix every Taupe command with pipx run
. Consult the documentation for pipx run
for more information.
Alternative 2: installing Taupe using pip
You should be able to install taupe
with pip
for Python 3. To install taupe
from the Python package repository (PyPI), run the following command:
python3 -m pip install taupe
As an alternative to getting it from PyPI, you can use pip
to install taupe
directly from GitHub:
python3 -m pip install git+https://github.com/mhucka/taupe.git
If you already installed Taupe once before, and want to update to the latest version, add --upgrade
to the end of either command line above.
Alternative 3: installing Taupe from sources
If you prefer to install Taupe directly from the source code, you can do that too. To get a copy of the files, you can clone the GitHub repository:
git clone https://github.com/mhucka/taupe
Alternatively, you can download the software source files as a ZIP archive directly from your browser using this link: https://github.com/mhucka/taupe/archive/refs/heads/main.zip
Next, after getting a copy of the files, run setup.py
inside the code directory:
cd taupe
python3 setup.py install
Usage
If the installation process described above is successful, you should end up with a program named taupe
in a location where software is normally installed on your computer. Running taupe
should be as simple as running any other command-line program. For example, the following command should print a helpful message to your terminal:
taupe --help
If not given the option --help
or --version
, this program expects to be given a personal Twitter archive file, either on the command line (as an argument) or on standard input (from a pipe or file redirection). Here's an example (and note this path is fake – substitute a real path on your computer when you do this!):
taupe /path/to/twitter-archive.zip
The URLs produced by taupe
will be, by default, as they appear in the archive. If you want to normalize the URLs into the canonical form https://twitter.com/twitter/status/TWEETID
, use the option --canonical-urls
(-c
for short):
taupe -c /path/to/twitter-archive.zip
The structure of the output
The output produced by taupe
differs depending on whether you are extracting tweets or "likes".
Tweets
When using --extract tweets
(the default), taupe
produces a table with four columns. Each row of the table corresponds to a type of event in the Twitter timeline: a tweet, a retweet, a reply to another tweet, or a quote tweet. The values in the columns provide details about the event. The following is a summary of the structure:
Column 1 | Column 2 | Column 3 | Column 4 |
---|---|---|---|
tweet timestamp in ISO format | The URL of the tweet | The type; one of tweet , reply , retweet , or quote |
(For type reply or quote .) The URL of the original or source tweet |
The last column only has a value for replies and quote-tweets; in those cases, the URL in the column refers to the tweet being replied to or the tweet being quoted. The fourth column does not have a value for retweets even though it would be desirable, because the Twitter archive – strangely – does not provide the URLs of retweeted tweets.
Here is an example of the output:
2022-09-21T22:36:29+00:00,https://twitter.com/mhucka/status/1572716422857658368,quote,https://twitter.com/poppy_northcutt/status/1572714310077673472
2022-10-10T22:04:20+00:00,https://twitter.com/mhucka/status/1579593701965582336,reply,https://twitter.com/arfon/status/1579572453726355456
2022-10-14T04:17:01+00:00,https://twitter.com/mhucka/status/1580774654217625600,tweet
2022-10-25T14:49:06+00:00,https://twitter.com/mhucka/status/1584919989307715586,retweet
...
Likes
When using the option --extract likes
, the output will only contain one column: the URLs of the "liked" tweets. taupe
cannot provide more detail because the Twitter archive format does not contain date/time information for "likes".
Here is an example of the output when using --extract likes
in combination with --canonical-urls
:
https://twitter.com/twitter/status/1588146224376463365
https://twitter.com/twitter/status/1588349144803905536
https://twitter.com/twitter/status/1590475356976578560
...
Other options recognized by taupe
Running taupe
with the option --help
will make it print help text and exit without doing anything else.
The option --output
controls where taupe
writes the output. If the value given to --output
is -
(a single dash), the output is written to the terminal (stdout). Otherwise, the value must be a file.
If given the --version
option, this program will print its version and other information, and exit without doing anything else.
If given the --debug
argument, taupe
will output a detailed trace of what it is doing. The debug trace will be sent to the given destination, which can be -
to indicate console output, or a file path to send the debug output to a file.
Summary of command-line options
The following table summarizes all the command line options available.
Short | Long form opt | Meaning | Default | |
---|---|---|---|---|
-c |
--canonical-urls |
Normalize Twitter URLs | Leave URLs unnormalized | |
-h |
--help |
Print help info and exit | ||
-e E |
--extract E |
Extract tweets or likes ? |
tweets |
|
-o O |
--output O |
Write output to file O | Write to the terminal | ✦ |
-V |
--version |
Print program version info and exit | ||
-@ OUT |
--debug OUT |
Debugging mode; write trace to OUT | Normal mode | ⚐ |
✦ To write to the console, you can also use the character -
as the value of O; otherwise, O must be the name of a file where the output should be written.
⚐ To write to the console, use the character -
as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.
Known issues and limitations
This program assumes that the Twitter archive ZIP file is in the format which Twitter produced in mid-November 2022. Twitter probably used a different format in the past, and may change the format again in the future, so taupe
may or may not work on Twitter archives obtained in different historical periods.
The Twitter archive format for "likes" contains only the tweet identifier and the text of the tweet; consequently, taupe
cannot provide date/time information for this case.
This program does all its work in memory, which means that taupe
's ability to process a given archive depends on its size and how much RAM the computer has. It has only been tested with modest-sized archives. It is unknown how it will behave with exceptionally large archives.
Getting help
If you find a problem or have a request or suggestion, please submit it in the GitHub issue tracker for this repository.
Contributing
I would be happy to receive your help and participation if you are interested. Everyone is asked to read and respect the code of conduct when participating in this project. Please feel free to report issues or do a pull request to fix bugs or add new features.
License
This software is Copyright (C) 2022, by Michael Hucka and the California Institute of Technology (Pasadena, California, USA). This software is freely distributed under a 3-clause BSD type license. Please see the LICENSE file for more information.
Acknowledgments
This work is a personal project developed by the author, using computing facilities and other resources of the California Institute of Technology Library.
The vector artwork of a bird, used as the icon for this repository, was created by Noe Araujo from the Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license. I manually changed the color to be a shade of taupe.
Taupe uses multiple other open-source packages, without which it would have taken much longer to write the software. I want to acknowledge this debt. In alphabetical order, the packages are:
- Aenum – Python package for advanced enumerations
- CommonPy – a collection of commonly-useful Python functions
- Plac – a command line argument parser
- Rich – library for writing styled text to the terminal
- Sidetrack – simple debug logging/tracing package
- Twine – utilities for publishing Python packages on PyPI
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.