Ethereum Blockchain Parser
This is a project to parse the Ethereum blockchain from a local geth node. Blockchains are perfect data sets because they contain every transaction ever made on the network. This is valuable data if you want to analyze the network, but Ethereum stores its blockchain in RLP-encoded binary blobs within a series of LevelDB files, and these are surprisingly difficult to access, even given the available tools. This project takes the approach of querying a local node via JSON-RPC, which returns unencoded transaction data, and then moving that data into a MongoDB database.
Usage
Streaming data
To stream blockchain data for real-time analysis, make sure you have both geth and mongo running and start the process with:
python3 stream.py
Note that this will automatically backfill your mongo database with blocks that it is missing.
Backfilling your Mongo database
To get data from the blockchain as it exists now and then stop parsing, run the following scripts, which are located in the Scripts directory. Note that at the time of writing, the Ethereum blockchain has about 1.5 million blocks, so this will likely take several hours.
- Funnel the data from geth to MongoDB:
python3 preprocess.py
- Create a series of snapshots of the blockchain through time and, for each snapshot, calculate key metrics, dumping the data into a CSV file:
python3 extract.py
Prerequisites:
Before using this tool to analyze your copy of the blockchain, you need the following things:
Geth
Geth is the Go implementation of a full Ethereum node. We will need to run it with the --rpc flag in order to request data. (WARNING: if you run this on a geth client containing an account that holds ether, make sure you put a firewall on port 8545, or whichever port you run the geth RPC on.)
A geth instance downloads the blockchain and processes it, saving the blocks as LevelDB files in the specified data directory (~/.ethereum/chaindata by default). The geth instance can be queried via RPC with the eth_getBlockByNumber(block, true) endpoint to get a given block, with true indicating that we want the transaction data included. This returns data of the form:
{
number: 1000000,
timestamp: 1465003569,
...
transactions: [
{
blockHash: "0x2052ce710a08094b81b5047ea9df5119773ce4b263a23d86659fa7293251055e",
blockNumber: 1284937,
from: "0x1f57f826caf594f7a837d9fc092456870a289365",
gas: 22050,
gasPrice: 20000000000,
hash: "0x654ac26084ee6e40767e8735f38274ef5f594454a4d34cfdd70c93aa95be0c64",
input: "0x",
nonce: 6610,
to: "0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98",
transactionIndex: 27,
value: 201544820000000000
}
]
}
Since I am only interested in number, timestamp, and transactions for this application, I have omitted the rest of the data, but there is plenty of additional information in the block, including a few Merkle trees that maintain hashes of state, transactions, and receipts.
Using the from and to addresses in the transactions array, I can map the flow of ether through the network as time progresses. Note that value, gas, and gasPrice are denominated in Wei, where 1 Ether = 10^18 Wei. This tool converts those numbers to Ether automatically.
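As a concrete illustration, here is a minimal sketch of such a query using the requests package (the helper function is mine, not part of this project; it assumes geth's RPC is listening on the default localhost:8545):
import requests

def get_block(number, url="http://localhost:8545"):
    # Ask geth for a block, with full transaction objects included.
    payload = {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        "params": [hex(number), True],  # block number as a hex quantity; True includes transactions
        "id": 1,
    }
    return requests.post(url, json=payload).json()["result"]

block = get_block(1000000)
# Note: the raw RPC response encodes quantities as hex strings (e.g. "0xf4240").
print(block["timestamp"], len(block["transactions"]))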
MongoDB
We will use mongo to essentially copy each block served by geth, preserving its structure; data outside the scope of this analysis is omitted. Note that this project also requires pymongo.
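For reference, connecting to that collection with pymongo might look like the following sketch (the database name is an assumption; only the collection name transactions appears later in this document):
from pymongo import MongoClient

client = MongoClient()  # connects to localhost:27017 by default
transactions = client["blockchain"]["transactions"]  # database name is an assumption
print(transactions.estimated_document_count())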
graph-tool
graph-tool is a Python library written in C++ that constructs graphs quickly and has a flexible feature set for mapping properties to edges and vertices. Depending on your system, it may be tricky to install, so be sure to follow the project's instructions carefully. I recommend installing it with a package manager if you can, because building from source is a pain.
python3
This was written for Python 3.4 with the packages contractmap, tqdm, and requests. Some things will probably break if you try to run this analysis in Python 2.
Workflow
The following outlines the procedure used to turn the data from bytes on the blockchain to data in a CSV file.
1. Process the blockchain
Preprocessing is done with the Crawler class, which can be found in the Preprocessing/Crawler directory. Before instantiating a Crawler object, you need to have geth and mongo processes running. Starting a Crawler() instance will request the blockchain from geth, process it, and copy it over to a Mongo collection named transactions. Once copied over, you can close the Crawler() instance.
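In code, that workflow is roughly the following sketch (the exact import path is an assumption; check the Preprocessing/Crawler directory for the module name):
# Assumes geth (started with --rpc) and mongod are already running.
from Preprocessing.Crawler.crawler import Crawler  # import path is an assumption

crawler = Crawler()  # requests blocks from geth and copies them into the "transactions" collection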
2. Take a snapshot of the blockchain
A snapshot of the network (i.e. all of the transactions occurring between two timestamps, or numbered blocks in the blockchain) can be taken with a TxnGraph() instance. This class can be found in the Analysis directory. Create an instance with:
snapshot = TxnGraph(a, b)
where a is the starting block (int) and b is the ending block (int). This will load a directed graph of all Ethereum addresses that made transactions between the two specified blocks. It will also weight vertices by the total amount of Ether they held at the time the ending block was mined, and edges by the amount of Ether sent in the transaction.
To move on to the next snapshot (i.e. forward in time):
snapshot.extend(c)
where c is the number of blocks by which to advance.
At each snapshot, the instance will automatically pickle the snapshot and save its state to a local file (disable this on instantiation with save=False).
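Putting those pieces together, a snapshot session might look like this sketch (the import path is an assumption):
from Analysis.TxnGraph import TxnGraph  # import path is an assumption

snapshot = TxnGraph(1, 100000)             # graph of all transactions in blocks 1..100000
snapshot.extend(100000)                    # advance the window by another 100,000 blocks
scratch = TxnGraph(1, 100000, save=False)  # skip the automatic pickling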
Drawing an image:
Once a TxnGraph is created, it builds a graph out of all of the data in the blocks between a and b. An image can be drawn by calling TxnGraph.draw(), and specific dimensions can be passed using TxnGraph.draw(w=A, h=B), where A and B are ints corresponding to numbers of pixels. By default, images are saved to the Analysis/data/snapshots directory.
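For example, to render the snapshot above at an arbitrary 1920x1080 resolution:
snapshot.draw(w=1920, h=1080)  # image saved under Analysis/data/snapshots/ by default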
Saving/Loading State (using pickle)
The TxnGraph instance state can be (and automatically is) pickled with TxnGraph.save(), with the filename parameterized by the start/end blocks. By default, this saves to the Analysis/data/pickles directory. If another instance was pickled with a different set of start/end blocks, it can be reloaded with TxnGraph.load(a, b).
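A sketch of the round trip (calling load on an existing instance is my assumption; the text above only shows the signature TxnGraph.load(a, b)):
snapshot.save()               # also happens automatically at each snapshot
snapshot.load(90000, 100000)  # reload a graph pickled earlier with those start/end blocks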
3: (Optional) Add a lookup table for smart contract transactions
An important consideration when analyzing the Ethereum network is smart contract addresses. Much ether flows to and from contracts, which you may want to distinguish from simple peer-to-peer transactions. This can be done by loading a ContractMap instance. It is recommended that you pass the most recent block in the blockchain for last_block, as this will find all contracts that have been transacted with up to that point in history:
# If a mongo_client is passed, the ContractMap will scan geth via RPC
# for new contract addresses starting at "last_block".
cmap = ContractMap(mongo_client, last_block=90000, filepath="./contracts.p")
cmap.save()
# If None is passed for a mongo_client, the ContractMap will automatically
# load the map of addresses from the pickle file specified in "filepath",
# ./contracts.p by default.
cmap = ContractMap()
This will create a hash table of all contract addresses using a defaultdict and save it to a pickle file.
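Such a table could then be used when walking transactions, along these lines (the attribute name addresses is an assumption, not a documented API):
cmap = ContractMap()  # load the pickled address table
for txn in block["transactions"]:  # 'block' as returned by eth_getBlockByNumber above
    if txn["to"] in cmap.addresses:  # "addresses" attribute is an assumption
        print("contract call:", txn["hash"])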
4: Aggregate data and analyze
Once a snapshot has been created, initialize an instance of ParsedBlocks with a TxnGraph instance. This will automatically aggregate the data and save it to a local CSV file, which can then be analyzed.
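A minimal sketch, assuming ParsedBlocks is importable from the Analysis directory (the import path is an assumption):
from Analysis.ParsedBlocks import ParsedBlocks  # import path is an assumption

parsed = ParsedBlocks(snapshot)  # aggregates the snapshot and writes a CSV automatically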