SEC EDGAR Parser based on Python 3
py-sec-xbrl
Introduction
This tool parses XBRL files from the SEC. Its focus is on parsing the XBRL XML files so that the data is easier to access. The idea is to give you a tool for writing the code you want, rather than one that implements a complete but rigid workflow.
It is not intended to be a scraper for SEC EDGAR: how you want to scrape varies a lot from one use case to another, and scraping is relatively easy anyway (though it can be added later if you want).
The repository was originally forked from https://github.com/tooksoi/ScraXBRL, but I soon found that we had very different approaches and objectives. Soon afterwards the code in the two repositories became completely different, and nothing is taken from ScraXBRL.
How to install
Current version: v0.1
Dependencies: listed in the requirements.txt file; currently only the lxml library
How to use
- get some XBRL XML files (see below if you don't have one yet)
- see test-parse.py and modify the path to the XML file; it's really easy
Output data structure
The current version will give you 2 dictionaries, one for the data and another for the context.
Data part:
{"id":{ # the id of the data object
"tag":"...", # name of the tag, e.g. Revenue, CostOfSales ...
"value":"...", # value of the tag, the current version returns all values as strings
"prefix":"...", # namespace in the XML, e.g. us-gaap, dei, ...
"contextRef":"...", # the reference ID to the context part
"...":"..." # here you can have other attributes specific to the data object
}
}
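As a rough illustration of working with the data dictionary, the sketch below filters entries by tag and prefix and converts the string values to numbers. The sample dictionary, its ids and values, and the helper names `find_by_tag` and `to_number` are made up for this example; only the keys follow the structure described above.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical sample shaped like the data dictionary described above.
data = {
    "f1": {"tag": "Revenues", "value": "1250000",
           "prefix": "us-gaap", "contextRef": "c1"},
    "f2": {"tag": "CostOfSales", "value": "730000",
           "prefix": "us-gaap", "contextRef": "c1"},
    "f3": {"tag": "EntityRegistrantName", "value": "EXAMPLE CORP",
           "prefix": "dei", "contextRef": "c1"},
}


def find_by_tag(data, tag, prefix=None):
    """Return {id: entry} for entries with the given tag (and prefix, if set)."""
    return {
        key: entry
        for key, entry in data.items()
        if entry["tag"] == tag and (prefix is None or entry["prefix"] == prefix)
    }


def to_number(value):
    """Convert a string value to Decimal, or return None if it is not numeric."""
    try:
        return Decimal(value)
    except InvalidOperation:
        return None


revenues = find_by_tag(data, "Revenues", prefix="us-gaap")
numeric = {key: to_number(entry["value"]) for key, entry in data.items()}
```

Because all values come back as strings, the conversion is left to the caller here; `Decimal` is used instead of `float` to avoid rounding surprises with monetary amounts.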
Context part:
In the current version, 2 types of context are considered:
- unit of measure (USD, CAD, ...)
- context: time + segment (time instant or start & end date; which part of the business or other stakeholders)
unit
{"id":{ # the id of the context object
"type":"unit", # the unit type
"unit":"...", # value of the unit, e.g. can be USD
"...":"..." # other attributes are possible
}
}
context
{"id":{ # the id of the context object
"type":"context", # the context type
"instant":"2020-05-08", # it will be either a date or start+end
"startDate":"...","endDate":"...",
"segment":[ # list of one or more stakeholders
"explicitMember":"...", # the "who"
"dimension":"..." # can be very diverse
],
"...":"..." # other attributes are possible
}
}
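To show how the two dictionaries fit together, here is a minimal sketch that looks up each data object's context through its contextRef and attaches the reporting period. It assumes that units and contexts live in the same context dictionary (told apart by "type"), and that a context carries either "instant" or "startDate" plus "endDate". The sample values and the helper names `period_of` and `with_period` are hypothetical.

```python
# Hypothetical samples shaped like the two dictionaries described above.
data = {
    "f1": {"tag": "Assets", "value": "5000",
           "prefix": "us-gaap", "contextRef": "c1"},
    "f2": {"tag": "Revenues", "value": "1200",
           "prefix": "us-gaap", "contextRef": "c2"},
}
context = {
    "usd": {"type": "unit", "unit": "USD"},
    "c1": {"type": "context", "instant": "2020-03-31"},
    "c2": {"type": "context", "startDate": "2020-01-01", "endDate": "2020-03-31"},
}


def period_of(ctx):
    """Describe a context's period: an instant, or a start/end range."""
    if ctx.get("instant"):
        return ctx["instant"]
    return f'{ctx.get("startDate")}/{ctx.get("endDate")}'


def with_period(data, context):
    """Yield (tag, value, period) for every data entry whose context is known."""
    for entry in data.values():
        ctx = context.get(entry["contextRef"])
        # Skip missing references and entries that are units, not contexts.
        if ctx is not None and ctx.get("type") == "context":
            yield entry["tag"], entry["value"], period_of(ctx)


rows = list(with_period(data, context))
for tag, value, period in rows:
    print(tag, value, period)
```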
How to get data
- The entry point is here: https://www.sec.gov/edgar/searchedgar/accessing-edgar-data.htm
- As is mentioned on the page, you can get daily filing lists here: https://www.sec.gov/Archives/edgar/daily-index/
- Inside the folders you can see several types of index files. They normally contain the same information, organized in different ways; I personally prefer the "master" file
- The "Master" file is a text file, and every line is like this:
1001115|GEOSPACE TECHNOLOGIES CORP|10-Q|20200508|edgar/data/1001115/0001564590-20-023322.txt
- This text file is in fact not the XBRL file, but we can get its folder path: https://www.sec.gov/Archives/edgar/data/1001115
- And in the inner folder, you can see the XBRL file, which is 0001564590-20-023322-xbrl.zip
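The steps above can be sketched in a few lines of string handling. This is only an illustration: the field names in `IndexEntry` and the helper functions are my own naming, not part of py-sec-xbrl, and the sketch assumes every line has exactly five pipe-separated fields like the example line. It does not download anything.

```python
from typing import NamedTuple

ARCHIVES = "https://www.sec.gov/Archives/"


class IndexEntry(NamedTuple):
    # Field names are chosen for this example only.
    cik: str
    company: str
    form_type: str
    date_filed: str
    filename: str


def parse_master_line(line):
    """Split one pipe-delimited line of the master file into its five fields."""
    parts = line.strip().split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return IndexEntry(*parts)


def company_folder_url(entry):
    """Folder path for the filer, e.g. .../edgar/data/1001115"""
    return f"{ARCHIVES}edgar/data/{entry.cik}"


def xbrl_zip_name(entry):
    """Name of the XBRL archive: the .txt file name with '-xbrl.zip' instead."""
    base = entry.filename.rsplit("/", 1)[-1]  # e.g. 0001564590-20-023322.txt
    if base.endswith(".txt"):
        base = base[: -len(".txt")]
    return base + "-xbrl.zip"


line = ("1001115|GEOSPACE TECHNOLOGIES CORP|10-Q|20200508|"
        "edgar/data/1001115/0001564590-20-023322.txt")
entry = parse_master_line(line)
print(company_folder_url(entry))
print(xbrl_zip_name(entry))
```

Filtering on `entry.form_type` (for example keeping only 10-Q and 10-K lines) before building any URLs keeps the list of files to fetch small.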
Development roadmap
2 priorities for the moment:
- Basic SEC XBRL parsing capabilities
- Make the scripts ready as a library that can be installed